Chunk + Document Hybrid Retrieval with Long-Context Embeddings (Together.ai)¶
This notebook shows how to do advanced RAG with long-context together.ai embedding models. We index each document by running the embedding model over its entire text, and we also embed every chunk. We then define a custom retriever that combines both node-level and document-level similarity.
Visit https://together.ai and sign up to get an API key.
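If you prefer, you can also export the key as an environment variable up front (a minimal sketch; the cells later in this notebook pass api_key to TogetherEmbedding explicitly instead):

In [ ]:

# Optional: configure the Together API key via the environment. The cells
# below pass api_key directly, so this is just an alternative.
import os

os.environ["TOGETHER_API_KEY"] = "your-api-key"  # placeholder value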
Setup and Download Data¶
We load in our documents. For speed we load only 10 pages, but of course if you want to stress-test the model you should load all of the data.
In [ ]:
%pip install llama-index-embeddings-together
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file
In [ ]:
domain = "docs.llamaindex.ai"
docs_url = "https://docs.llamaindex.ai/en/latest/"
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}
domain = "docs.llamaindex.ai"
docs_url = "https://docs.llamaindex.ai/en/latest/"
!wget -e robots=off --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows --domains {domain} --no-parent {docs_url}
In [ ]:
from llama_index.readers.file import UnstructuredReader
from pathlib import Path
from llama_index.llms.openai import OpenAI
from llama_index.core import Document
In [ ]:
reader = UnstructuredReader()

# all_files_gen = Path("./docs.llamaindex.ai/").rglob("*")
# all_files = [f.resolve() for f in all_files_gen]

# all_html_files = [f for f in all_files if f.suffix.lower() == ".html"]

# curate a subset
all_html_files = [
    "docs.llamaindex.ai/en/latest/index.html",
    "docs.llamaindex.ai/en/latest/contributing/contributing.html",
    "docs.llamaindex.ai/en/latest/understanding/understanding.html",
    "docs.llamaindex.ai/en/latest/understanding/using_llms/using_llms.html",
    "docs.llamaindex.ai/en/latest/understanding/using_llms/privacy.html",
    "docs.llamaindex.ai/en/latest/understanding/loading/llamahub.html",
    "docs.llamaindex.ai/en/latest/optimizing/production_rag.html",
    "docs.llamaindex.ai/en/latest/module_guides/models/llms.html",
]

# TODO: set to a higher value if you want more docs
doc_limit = 10

docs = []
for idx, f in enumerate(all_html_files):
    if idx > doc_limit:
        break
    print(f"Idx {idx}/{len(all_html_files)}")
    loaded_docs = reader.load_data(file=f, split_documents=True)

    # Hardcoded index. Everything before this is the ToC for all pages.
    # Adjust start_idx as needed.
    start_idx = 64
    loaded_doc = Document(
        id_=str(f),
        text="\n\n".join([d.get_content() for d in loaded_docs[start_idx:]]),
        metadata={"path": str(f)},
    )
    print(str(f))
    docs.append(loaded_doc)
[nltk_data] Downloading package punkt to /Users/jerryliu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jerryliu/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-date!
Idx 0/8
docs.llamaindex.ai/en/latest/index.html
Idx 1/8
docs.llamaindex.ai/en/latest/contributing/contributing.html
Idx 2/8
docs.llamaindex.ai/en/latest/understanding/understanding.html
Idx 3/8
docs.llamaindex.ai/en/latest/understanding/using_llms/using_llms.html
Idx 4/8
docs.llamaindex.ai/en/latest/understanding/using_llms/privacy.html
Idx 5/8
docs.llamaindex.ai/en/latest/understanding/loading/llamahub.html
Idx 6/8
docs.llamaindex.ai/en/latest/optimizing/production_rag.html
Idx 7/8
docs.llamaindex.ai/en/latest/module_guides/models/llms.html
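As a quick sanity check (a small sketch), you can print the size of each loaded document; most of these pages are likely much longer than a single 512-token chunk, which is exactly why a long-context (32k) embedding model can embed them whole.

In [ ]:

# Sketch: inspect how long each document is. Long pages are why we embed the
# full document text with a 32k-context model, not just the chunks.
for doc in docs:
    print(f"{doc.metadata['path']}: {len(doc.get_content())} chars")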
Build Hybrid Retrieval with Chunk Embeddings + Parent Embeddings¶
Define a custom retriever that does the following:
- First retrieve relevant chunks by embedding similarity
- For each chunk, look up the source document embedding
- Weight the node and document similarities by a factor alpha (see the sketch after this list)
This is essentially vector retrieval with a reranking step that reweights the node similarities.
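As a minimal sketch (the full retriever implementation comes later in this notebook), the combined score is a convex blend of the two similarities:

In [ ]:

# Minimal sketch of the score blend used by the hybrid retriever defined below.
# alpha=1.0 -> pure chunk (node) similarity; alpha=0.0 -> pure document similarity.
def combined_score(node_sim: float, doc_sim: float, alpha: float = 0.5) -> float:
    return alpha * node_sim + (1 - alpha) * doc_sim


print(combined_score(0.9, 0.7))  # ~0.8 with the default alpha=0.5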
In [ ]:
# You can set the API key in the embeddings or in the environment
# import os
# os.environ["TOGETHER_API_KEY"] = "your-api-key"

from llama_index.embeddings.together import TogetherEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

api_key = "<api_key>"

embed_model = TogetherEmbedding(
    model_name="togethercomputer/m2-bert-80M-32k-retrieval", api_key=api_key
)
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
In [ ]:
from llama_index.core.storage.docstore import SimpleDocumentStore
for doc in docs:
embedding = embed_model.get_text_embedding(doc.get_content())
doc.embedding = embedding
docstore = SimpleDocumentStore()
docstore.add_documents(docs)
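As a quick sanity check (a small sketch), every document should now carry a full-text embedding; the exact dimensionality depends on the embedding model:

In [ ]:

# Each document should now have a full-text embedding attached by the loop above.
print(len(docs), "docs embedded")
print("embedding dim:", len(docs[0].embedding))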
Build the Vector Index¶
Let's build the vector index of chunks. Each chunk will also reference its source document via its index_id (which can then be used to look up the source document in the docstore).
In [ ]:
from llama_index.core.schema import IndexNode
from llama_index.core import (
    load_index_from_storage,
    StorageContext,
    VectorStoreIndex,
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import SummaryIndex
from llama_index.core.retrievers import RecursiveRetriever
import os
from tqdm.notebook import tqdm
import pickle


def build_index(docs, out_path: str = "storage/chunk_index"):
    nodes = []

    splitter = SentenceSplitter(chunk_size=512, chunk_overlap=70)
    for idx, doc in enumerate(tqdm(docs)):
        # print('Splitting: ' + str(idx))

        cur_nodes = splitter.get_nodes_from_documents([doc])
        for cur_node in cur_nodes:
            # ID will be base + parent
            file_path = doc.metadata["path"]
            new_node = IndexNode(
                text=cur_node.text or "None",
                index_id=str(file_path),
                metadata=doc.metadata
                # obj=doc
            )
            nodes.append(new_node)
    print("num nodes: " + str(len(nodes)))

    # save index to disk if it doesn't exist yet
    if not os.path.exists(out_path):
        index = VectorStoreIndex(nodes, embed_model=embed_model)
        index.set_index_id("simple_index")
        index.storage_context.persist(f"./{out_path}")
    else:
        # rebuild storage context
        storage_context = StorageContext.from_defaults(
            persist_dir=f"./{out_path}"
        )
        # load index
        index = load_index_from_storage(
            storage_context, index_id="simple_index", embed_model=embed_model
        )

    return index
In [ ]:
index = build_index(docs)
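To make the chunk-to-parent linkage concrete, here is a small illustrative sketch (the query string is arbitrary): retrieve a chunk, then use its index_id to fetch the full parent document from the docstore.

In [ ]:

# Illustrative sketch: a retrieved chunk's index_id is its parent document's
# path, which keys into the docstore built earlier.
sample_nodes = index.as_retriever(similarity_top_k=1).retrieve(
    "What is LlamaIndex?"  # arbitrary example query
)
parent_id = sample_nodes[0].node.index_id
parent_doc = docstore.get_document(parent_id)
print(parent_id, "->", len(parent_doc.get_content()), "chars in parent doc")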
Define the Hybrid Retriever¶
We define a hybrid retriever that first fetches chunks by vector similarity, and then reweights each chunk by its similarity to the parent document (via the alpha parameter).
In [ ]:
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.indices.query.embedding_utils import get_top_k_embeddings
from llama_index.core import QueryBundle
from llama_index.core.schema import NodeWithScore
from typing import List, Any, Optional


class HybridRetriever(BaseRetriever):
    """Hybrid retriever."""

    def __init__(
        self,
        vector_index,
        docstore,
        similarity_top_k: int = 2,
        out_top_k: Optional[int] = None,
        alpha: float = 0.5,
        **kwargs: Any,
    ) -> None:
        """Init params."""
        super().__init__(**kwargs)
        self._vector_index = vector_index
        self._embed_model = vector_index._embed_model
        self._retriever = vector_index.as_retriever(
            similarity_top_k=similarity_top_k
        )
        self._out_top_k = out_top_k or similarity_top_k
        self._docstore = docstore
        self._alpha = alpha

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve nodes given query."""
        # first retrieve chunks
        nodes = self._retriever.retrieve(query_bundle.query_str)

        # get documents, and embedding similarity between query and documents

        ## get document embeddings
        docs = [self._docstore.get_document(n.node.index_id) for n in nodes]
        doc_embeddings = [d.embedding for d in docs]
        query_embedding = self._embed_model.get_query_embedding(
            query_bundle.query_str
        )

        ## compute doc similarities
        doc_similarities, doc_idxs = get_top_k_embeddings(
            query_embedding, doc_embeddings
        )

        ## compute final similarities, combining doc similarity and original node similarity
        result_tups = []
        for doc_idx, doc_similarity in zip(doc_idxs, doc_similarities):
            node = nodes[doc_idx]
            # weight alpha * node similarity + (1 - alpha) * doc similarity
            full_similarity = (self._alpha * node.score) + (
                (1 - self._alpha) * doc_similarity
            )
            print(
                f"Doc {doc_idx} (node score, doc similarity, full similarity): {(node.score, doc_similarity, full_similarity)}"
            )
            result_tups.append((full_similarity, node))

        result_tups = sorted(result_tups, key=lambda x: x[0], reverse=True)
        # update scores
        for full_score, node in result_tups:
            node.score = full_score

        # return the top k by combined score (fix: use self._out_top_k, since a
        # bare out_top_k is not defined in this scope)
        return [n for _, n in result_tups][: self._out_top_k]
In [ ]:
top_k = 10
out_top_k = 3
hybrid_retriever = HybridRetriever(
index, docstore, similarity_top_k=top_k, out_top_k=3, alpha=0.5
)
base_retriever = index.as_retriever(similarity_top_k=out_top_k)
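As an optional experiment (a sketch; the query string is arbitrary), you can sweep alpha to see how the document-level signal shifts the ranking: alpha=1.0 reduces to plain chunk retrieval, while alpha=0.0 ranks chunks purely by parent-document similarity.

In [ ]:

# Optional sketch: sweep alpha to see how document-level similarity reshuffles
# the top results. Each retrieve call also prints per-doc similarity breakdowns.
for alpha in [0.0, 0.5, 1.0]:
    sweep_retriever = HybridRetriever(
        index, docstore, similarity_top_k=top_k, out_top_k=out_top_k, alpha=alpha
    )
    results = sweep_retriever.retrieve("Tell me about the LLM interface")
    print(f"alpha={alpha}:", [n.node.metadata["path"] for n in results])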
In [ ]:
def show_nodes(nodes, out_len: int = 200):
for idx, n in enumerate(nodes):
print(f"\n\n >>>>>>>>>>>> ID {n.id_}: {n.metadata['path']}")
print(n.get_content()[:out_len])
In [ ]:
query_str = "Tell me more about the LLM interface and where they're used"
query_str = "Tell me more about the LLM interface and where they're used"
In [ ]:
nodes = hybrid_retriever.retrieve(query_str)
Doc 0 (node score, doc similarity, full similarity): (0.8951729860296237, 0.888711859390314, 0.8919424227099688)
Doc 3 (node score, doc similarity, full similarity): (0.7606735418349336, 0.888711859390314, 0.8246927006126239)
Doc 1 (node score, doc similarity, full similarity): (0.8008658562229534, 0.888711859390314, 0.8447888578066337)
Doc 4 (node score, doc similarity, full similarity): (0.7083936595542725, 0.888711859390314, 0.7985527594722932)
Doc 2 (node score, doc similarity, full similarity): (0.7627518988051541, 0.7151744680533735, 0.7389631834292638)
Doc 5 (node score, doc similarity, full similarity): (0.6576277615091234, 0.6506473659825045, 0.654137563745814)
Doc 7 (node score, doc similarity, full similarity): (0.6141130778320664, 0.6159139530209246, 0.6150135154264955)
Doc 6 (node score, doc similarity, full similarity): (0.6225339833394525, 0.24827341793941335, 0.43540370063943296)
Doc 8 (node score, doc similarity, full similarity): (0.5672766061523489, 0.24827341793941335, 0.4077750120458811)
Doc 9 (node score, doc similarity, full similarity): (0.5671131641337652, 0.24827341793941335, 0.4076932910365893)
In [ ]:
show_nodes(nodes)
>>>>>>>>>>>> ID 2c7b42d3-520c-4510-ba34-d2f2dfd5d8f5: docs.llamaindex.ai/en/latest/module_guides/models/llms.html Contributing: Anyone is welcome to contribute new LLMs to the documentation. Simply copy an existing notebook, setup and test your LLM, and open a PR with your results. If you have ways to improve th >>>>>>>>>>>> ID 72cc9101-5b36-4821-bd50-e707dac8dca1: docs.llamaindex.ai/en/latest/module_guides/models/llms.html Using LLMs Concept Picking the proper Large Language Model (LLM) is one of the first steps you need to consider when building any LLM application over your data. LLMs are a core component of Llam >>>>>>>>>>>> ID 7c2be7c7-44aa-4f11-b670-e402e5ac35a5: docs.llamaindex.ai/en/latest/module_guides/models/llms.html If you change the LLM, you may need to update this tokenizer to ensure accurate token counts, chunking, and prompting. The single requirement for a tokenizer is that it is a callable function, that t
In [ ]:
base_nodes = base_retriever.retrieve(query_str)
In [ ]:
show_nodes(base_nodes)
>>>>>>>>>>>> ID 2c7b42d3-520c-4510-ba34-d2f2dfd5d8f5: docs.llamaindex.ai/en/latest/module_guides/models/llms.html Contributing: Anyone is welcome to contribute new LLMs to the documentation. Simply copy an existing notebook, setup and test your LLM, and open a PR with your results. If you have ways to improve th >>>>>>>>>>>> ID 72cc9101-5b36-4821-bd50-e707dac8dca1: docs.llamaindex.ai/en/latest/module_guides/models/llms.html Using LLMs Concept Picking the proper Large Language Model (LLM) is one of the first steps you need to consider when building any LLM application over your data. LLMs are a core component of Llam >>>>>>>>>>>> ID 252fc99b-2817-4913-bcbf-4dd8ef509b8c: docs.llamaindex.ai/en/latest/index.html These could be APIs, PDFs, SQL, and (much) more. Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume. Engines provide natural language a
Run Some Queries¶
Note that in the results above, all three of the hybrid retriever's chunks come from the llms.html page, while the base retriever's third chunk drifts to index.html. In this section, we plug both retrievers into query engines and compare the responses they synthesize.
In [ ]:
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine(hybrid_retriever)
base_query_engine = index.as_query_engine(similarity_top_k=out_top_k)
In [ ]:
response = query_engine.query(query_str)
print(str(response))
Doc 0 (node score, doc similarity, full similarity): (0.8951729860296237, 0.888711859390314, 0.8919424227099688)
Doc 3 (node score, doc similarity, full similarity): (0.7606735418349336, 0.888711859390314, 0.8246927006126239)
Doc 1 (node score, doc similarity, full similarity): (0.8008658562229534, 0.888711859390314, 0.8447888578066337)
Doc 4 (node score, doc similarity, full similarity): (0.7083936595542725, 0.888711859390314, 0.7985527594722932)
Doc 2 (node score, doc similarity, full similarity): (0.7627518988051541, 0.7151744680533735, 0.7389631834292638)
Doc 5 (node score, doc similarity, full similarity): (0.6576277615091234, 0.6506473659825045, 0.654137563745814)
Doc 7 (node score, doc similarity, full similarity): (0.6141130778320664, 0.6159139530209246, 0.6150135154264955)
Doc 6 (node score, doc similarity, full similarity): (0.6225339833394525, 0.24827341793941335, 0.43540370063943296)
Doc 8 (node score, doc similarity, full similarity): (0.5672766061523489, 0.24827341793941335, 0.4077750120458811)
Doc 9 (node score, doc similarity, full similarity): (0.5671131641337652, 0.24827341793941335, 0.4076932910365893)

The LLM interface is a unified interface provided by LlamaIndex for defining Large Language Models (LLMs) from different sources such as OpenAI, Hugging Face, or LangChain. This interface eliminates the need to write the boilerplate code for defining the LLM interface yourself. The LLM interface supports text completion and chat endpoints, as well as streaming and non-streaming endpoints. It also supports both synchronous and asynchronous endpoints.

LLMs are a core component of LlamaIndex and can be used as standalone modules or plugged into other core LlamaIndex modules such as indices, retrievers, and query engines. They are primarily used during the response synthesis step, which occurs after retrieval. Depending on the type of index being used, LLMs may also be used during index construction, insertion, and query traversal.

To use LLMs, you can import the necessary modules and instantiate the LLM object. You can then use the LLM object to generate responses or complete text prompts. LlamaIndex provides examples and code snippets to help you get started with using LLMs.

It's important to note that tokenization plays a crucial role in LLMs. LlamaIndex uses a global tokenizer by default, but if you change the LLM, you may need to update the tokenizer to ensure accurate token counts, chunking, and prompting. LlamaIndex provides instructions on how to set a global tokenizer using libraries like tiktoken or Hugging Face's AutoTokenizer.

Overall, LLMs are powerful tools for building LlamaIndex applications and can be customized within the LlamaIndex abstractions. While LLMs from paid APIs like OpenAI and Anthropic are generally considered more reliable, local open-source models are gaining popularity due to their customizability and transparency. LlamaIndex offers integrations with various LLMs and provides documentation on their compatibility and performance. Contributions to improve the setup and performance of existing LLMs or to add new LLMs are welcome.
In [ ]:
base_response = base_query_engine.query(query_str)
print(str(base_response))
The LLM interface is a unified interface provided by LlamaIndex for defining Large Language Model (LLM) modules. It allows users to easily integrate LLMs from different providers such as OpenAI, Hugging Face, or LangChain into their applications without having to write the boilerplate code for defining the LLM interface themselves.

LLMs are a core component of LlamaIndex and can be used as standalone modules or plugged into other core LlamaIndex modules such as indices, retrievers, and query engines. They are primarily used during the response synthesis step, which occurs after retrieval. Depending on the type of index being used, LLMs may also be used during index construction, insertion, and query traversal.

The LLM interface supports various functionalities, including text completion and chat endpoints. It also provides support for streaming and non-streaming endpoints, as well as synchronous and asynchronous endpoints.

To use LLMs, you can import the necessary modules and make use of the provided functions. For example, you can use the OpenAI module to interact with the gpt-3.5-turbo LLM by calling the `OpenAI()` function. You can then use the `complete()` function to generate completions based on a given prompt.

It's important to note that LlamaIndex uses a global tokenizer called cl100k from tiktoken by default for all token counting. If you change the LLM being used, you may need to update the tokenizer to ensure accurate token counts, chunking, and prompting.

Overall, LLMs and the LLM interface provided by LlamaIndex are essential for building LLM applications and integrating them into the LlamaIndex ecosystem.
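As a final check (a sketch reusing the show_nodes helper defined above), you can inspect which chunks each engine actually passed to the LLM via the source_nodes attribute on the response objects:

In [ ]:

# Sketch: compare the source chunks each query engine used for synthesis.
print("Hybrid engine sources:")
show_nodes(response.source_nodes, out_len=100)
print("\nBase engine sources:")
show_nodes(base_response.source_nodes, out_len=100)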