How to retrieve using multiple vectors per document
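It is often useful to store multiple vectors per document, and many use cases benefit from it. For example, we can embed several chunks of a document and associate those embeddings with the parent document, so that a retriever hit on any chunk returns the larger document.
LangChain implements a base MultiVectorRetriever, which simplifies this process. Most of the complexity lies in how to create the multiple vectors per document. This notebook covers some common ways to create those vectors and use the MultiVectorRetriever.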
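Methods for creating multiple vectors per document include:
- Smaller chunks: split a document into smaller chunks and embed those (this is ParentDocumentRetriever).
- Summary: create a summary for each document and embed it along with (or instead of) the document.
- Hypothetical questions: create hypothetical questions that each document would be appropriate to answer, and embed those along with (or instead of) the document.
Note that this also enables another way of adding embeddings: manually. This is useful because you can explicitly add questions or queries that should lead to a document being retrieved, which gives you more control.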
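As a rough illustration of the manual approach, the sketch below wraps hand-written queries as small documents that carry the parent's identifier in their metadata. The `Doc` dataclass, the `link_queries_to_parent` helper, and the example queries are hypothetical stand-ins introduced only for this sketch. With a real retriever you would build LangChain `Document` objects instead.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def link_queries_to_parent(queries, parent_id, id_key="doc_id"):
    """Wrap hand-written queries as documents tagged with the parent's id."""
    return [Doc(page_content=q, metadata={id_key: parent_id}) for q in queries]


# One identifier per parent document, as in the walkthrough that follows.
parent_id = str(uuid.uuid4())
manual_docs = link_queries_to_parent(
    [
        "Who is retiring from the Supreme Court?",
        "Which justice did the President thank?",
    ],
    parent_id,
)
```

Documents built this way would be added to the vector store in the same way as the child chunks later in this guide, with the parent stored in the document store under the same identifier.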
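Below we walk through an example. First we instantiate some documents. We will index them in an in-memory Chroma vector store using OpenAI embeddings, but any LangChain vector store or embeddings model will work.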
%pip install --upgrade --quiet langchain-chroma langchain langchain-openai > /dev/null
from langchain.storage import InMemoryByteStore
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)
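Smaller chunks
It is often useful to retrieve larger chunks of information but embed smaller ones. This lets the embeddings capture the semantic meaning as closely as possible while still passing as much context as possible downstream. Note that this is what the ParentDocumentRetriever does. Here we show what happens under the hood.
We distinguish between the vector store, which indexes embeddings of the (sub) documents, and the document store, which holds the "parent" documents and associates them with an identifier.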
import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
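Next, we generate the "sub" documents by splitting the original documents. Note that we store the document identifier in the metadata of the corresponding Document object.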
# The splitter to use to create smaller chunks
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
Finally, we index the documents in our vector store and document store:
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
The vector store alone will retrieve small chunks:
retriever.vectorstore.similarity_search("justice breyer")[0]
Document(page_content='Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.', metadata={'doc_id': '064eca46-a4c4-4789-8e3b-583f9597e54f', 'source': 'state_of_the_union.txt'})
Whereas the retriever will return the larger parent document:
len(retriever.invoke("justice breyer")[0].page_content)
9875
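The step from small chunk hits to parent documents can be sketched in plain Python. This is a simplified model, not LangChain's actual code: it assumes the retriever reads the identifier from each hit's metadata, removes duplicates while preserving hit order, and then looks the parents up in the document store. The `parents_for_hits` helper and the dictionary-based store are hypothetical.

```python
def parents_for_hits(hits, docstore, id_key="doc_id"):
    """Map child hits (given as metadata dicts) to their parent documents.

    Parent ids are collected in hit order and deduplicated, so a parent
    matched by several children is returned only once.
    """
    ids = []
    for metadata in hits:
        parent_id = metadata.get(id_key)
        if parent_id is not None and parent_id not in ids:
            ids.append(parent_id)
    # Skip ids that are missing from the document store.
    return [docstore[i] for i in ids if i in docstore]


# Toy document store: identifier -> parent text.
demo_docstore = {"a": "full text of parent A", "b": "full text of parent B"}
# Three hits carry a parent id (two point at "b"); one hit has none.
demo_hits = [{"doc_id": "b"}, {"doc_id": "a"}, {"doc_id": "b"}, {"source": "no id"}]

demo_parents = parents_for_hits(demo_hits, demo_docstore)
# demo_parents == ["full text of parent B", "full text of parent A"]
```

Under this model the number of returned parents can be smaller than the number of chunk hits, because several chunks may belong to the same parent.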
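By default, the retriever performs a similarity search on the vector store. LangChain vector stores also support searching via maximal marginal relevance (MMR). This can be controlled through the retriever's search_type parameter: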
from langchain.retrievers.multi_vector import SearchType
retriever.search_type = SearchType.mmr
len(retriever.invoke("justice breyer")[0].page_content)
9875
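For intuition about what maximal marginal relevance changes, here is a minimal pure-Python sketch of the selection rule. It is not the vector store's implementation. The `cosine` and `mmr` helpers, the `lambda_mult` weighting, and the toy vectors are assumptions made for illustration. Each step picks the candidate that best balances similarity to the query against similarity to what has already been selected.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k vectors chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0.0 if nothing is).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


demo_query = [1.0, 0.0]
# Vectors 0 and 1 are near-duplicates; vector 2 points in a different direction.
demo_vecs = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]

diverse = mmr(demo_query, demo_vecs, k=2, lambda_mult=0.5)  # [0, 2]
relevance_only = mmr(demo_query, demo_vecs, k=2, lambda_mult=1.0)  # [0, 1]
```

With `lambda_mult=1.0` the rule reduces to plain similarity ranking and picks both near-duplicates. With `0.5` the second pick is the less similar but less redundant vector.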
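Associating summaries with a document for retrieval
A summary may distill what a chunk is about more accurately than the raw text, which can lead to better retrieval. Here we show how to create summaries and then embed them.
We construct a simple chain that receives an input Document object and generates a summary using an LLM.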
pip install -qU langchain-openai
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o-mini")
import uuid
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | llm
    | StrOutputParser()
)
Note that we can batch the chain across documents:
summaries = chain.batch(docs, {"max_concurrency": 5})
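To make the effect of a concurrency cap concrete, here is a small standard-library sketch of bounded-concurrency batching. It is an analogy under stated assumptions, not a description of how `chain.batch` is implemented. The `run_batch` helper and the `fake_summarize` function (a stand-in for an LLM call) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(func, inputs, max_concurrency=5):
    """Apply func to every input using at most max_concurrency worker threads.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        return list(executor.map(func, inputs))


def fake_summarize(text):
    # Hypothetical stand-in for an LLM call: keep only the first five words.
    return " ".join(text.split()[:5])


demo_texts = ["one two three four five six seven", "alpha beta", ""]
demo_summaries = run_batch(fake_summarize, demo_texts, max_concurrency=2)
# demo_summaries == ["one two three four five", "alpha beta", ""]
```

Because results keep the input order, the i-th summary can be paired with the i-th document identifier, which is what the indexing step relies on.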
We can then initialize a MultiVectorRetriever as before, indexing the summaries in our vector store and retaining the original documents in our document store:
# The vectorstore to use to index the summaries
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# # We can also add the original chunks to the vectorstore if we so want
# for i, doc in enumerate(docs):
#     doc.metadata[id_key] = doc_ids[i]
# retriever.vectorstore.add_documents(docs)
Querying the vector store will return summaries:
sub_docs = retriever.vectorstore.similarity_search("justice breyer")
sub_docs[0]
Document(page_content="President Biden recently nominated Judge Ketanji Brown Jackson to serve on the United States Supreme Court, emphasizing her qualifications and broad support. The President also outlined a plan to secure the border, fix the immigration system, protect women's rights, support LGBTQ+ Americans, and advance mental health services. He highlighted the importance of bipartisan unity in passing legislation, such as the Violence Against Women Act. The President also addressed supporting veterans, particularly those impacted by exposure to burn pits, and announced plans to expand benefits for veterans with respiratory cancers. Additionally, he proposed a plan to end cancer as we know it through the Cancer Moonshot initiative. President Biden expressed optimism about the future of America and emphasized the strength of the American people in overcoming challenges.", metadata={'doc_id': '84015b1b-980e-400a-94d8-cf95d7e079bd'})
Whereas the retriever will return the larger source document:
retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)
9194
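Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions that could be asked of a particular document. These questions may be semantically close to the relevant queries in a RAG application, so they can be embedded and associated with the documents to improve retrieval.
Below, we use the with_structured_output method to structure the LLM output into a list of strings.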
from typing import List
from pydantic import BaseModel, Field
class HypotheticalQuestions(BaseModel):
    """Generate hypothetical questions."""

    questions: List[str] = Field(..., description="List of questions")
chain = (
    {"doc": lambda x: x.page_content}
    # Only asking for 3 hypothetical questions, but this could be adjusted
    | ChatPromptTemplate.from_template(
        "Generate a list of exactly 3 hypothetical questions that the below document could be used to answer:\n\n{doc}"
    )
    | ChatOpenAI(max_retries=0, model="gpt-4o").with_structured_output(
        HypotheticalQuestions
    )
    | (lambda x: x.questions)
)
Invoking the chain on a single document shows that it outputs a list of questions:
chain.invoke(docs[0])
["What impact did the IBM 1401 have on the author's early programming experiences?",
"How did the transition from using the IBM 1401 to microcomputers influence the author's programming journey?",
"What role did Lisp play in shaping the author's understanding and approach to AI?"]
We can then batch the chain over all documents and assemble our vector store and document store as before:
# Batch chain over documents to generate hypothetical questions
hypothetical_questions = chain.batch(docs, {"max_concurrency": 5})
# The vectorstore to use to index the hypothetical questions
vectorstore = Chroma(
    collection_name="hypo-questions", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"
# The retriever (empty to start)
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]
# Generate Document objects from hypothetical questions
question_docs = []
for i, question_list in enumerate(hypothetical_questions):
    question_docs.extend(
        [Document(page_content=s, metadata={id_key: doc_ids[i]}) for s in question_list]
    )
retriever.vectorstore.add_documents(question_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
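Note that querying the underlying vector store will retrieve hypothetical questions that are semantically similar to the input query: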
sub_docs = retriever.vectorstore.similarity_search("justice breyer")
sub_docs
[Document(page_content='What might be the potential benefits of nominating Circuit Court of Appeals Judge Ketanji Brown Jackson to the United States Supreme Court?', metadata={'doc_id': '43292b74-d1b8-4200-8a8b-ea0cb57fbcdb'}),
Document(page_content='How might the Bipartisan Infrastructure Law impact the economic competition between the U.S. and China?', metadata={'doc_id': '66174780-d00c-4166-9791-f0069846e734'}),
Document(page_content='What factors led to the creation of Y Combinator?', metadata={'doc_id': '72003c4e-4cc9-4f09-a787-0b541a65b38c'}),
Document(page_content='How did the ability to publish essays online change the landscape for writers and thinkers?', metadata={'doc_id': 'e8d2c648-f245-4bcc-b8d3-14e64a164b64'})]
Invoking the retriever will return the corresponding document:
retrieved_docs = retriever.invoke("justice breyer")
len(retrieved_docs[0].page_content)
9194