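Recursive Retriever + Document Agents¶
This guide shows how to combine recursive retrieval with "document agents" for advanced decision making over heterogeneous documents.
Two motivating factors led us to look for a better retrieval solution:
- Decoupling retrieval embeddings from chunk-based synthesis. Fetching documents by their summaries often returns more relevant context for a query than retrieving raw chunks directly. Recursive retrieval allows exactly this.
- Within a document, users may need to dynamically perform tasks beyond fact-based question answering. We introduce the concept of "document agents": agents that have access to both vector search and summary tools for a given document.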
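Setup and Download Data¶
In this section, we define the imports and then download Wikipedia articles about different cities. Each article is stored separately.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.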
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-agent-openai
In [ ]:
!pip install llama-index
In [ ]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import SummaryIndex
from llama_index.core.schema import IndexNode
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.llms.openai import OpenAI
In [ ]:
wiki_titles = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
In [ ]:
from pathlib import Path

import requests

for title in wiki_titles:
    response = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "titles": title,
            "prop": "extracts",
            # 'exintro': True,
            "explaintext": True,
        },
    ).json()
    page = next(iter(response["query"]["pages"].values()))
    wiki_text = page["extract"]

    data_path = Path("data")
    if not data_path.exists():
        Path.mkdir(data_path)

    with open(data_path / f"{title}.txt", "w") as fp:
        fp.write(wiki_text)
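The download loop depends on the shape of the API response: pages sit under `query.pages`, keyed by a page ID that is not known in advance, which is why the code takes the first value rather than indexing by key. The sketch below illustrates that step offline. The payload is hand-written for illustration (the page ID and text are placeholders, not real API output), and `extract_page_text` is a hypothetical helper name.

```python
# Hypothetical payload mimicking the shape of an "extracts" response.
# The page ID ("12345") and the extract text are placeholders.
sample_response = {
    "query": {
        "pages": {
            "12345": {
                "pageid": 12345,
                "title": "Toronto",
                "extract": "Placeholder article text.",
            }
        }
    }
}


def extract_page_text(response: dict) -> str:
    """Return the plain-text extract of the first page in the response."""
    pages = response["query"]["pages"]
    # Pages are keyed by page ID, so take the first value
    # instead of looking the page up by key.
    page = next(iter(pages.values()))
    return page["extract"]


wiki_text = extract_page_text(sample_response)
print(wiki_text)
```

Because each request asks for a single title, the response is expected to hold one page, so taking the first value is sufficient here.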
In [ ]:
# Load all wiki documents
city_docs = {}
for wiki_title in wiki_titles:
    city_docs[wiki_title] = SimpleDirectoryReader(
        input_files=[f"data/{wiki_title}.txt"]
    ).load_data()
Define LLM + Service Context + Callback Manager
In [ ]:
import os
os.environ["OPENAI_API_KEY"] = "sk-..."
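If you would rather not paste a key into the notebook, one option is to export it in your shell beforehand and only check for it here. The following is a minimal sketch of that idea. `require_env` is a hypothetical helper name, and the demo uses a throwaway variable (`DEMO_API_KEY`) so that it runs anywhere.

```python
import os


def require_env(var_name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(var_name, "")
    if not value:
        raise RuntimeError(f"{var_name} is not set")
    return value


# Demo with a throwaway variable so the sketch runs anywhere.
# In the notebook you would call require_env("OPENAI_API_KEY") instead.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```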
In [ ]:
from llama_index.core import Settings
Settings.llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
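Build Document Agent for Each Document¶
In this section, we define a "document agent" for each document.
First, we define both a vector index (for semantic search) and a summary index (for summarization) for each document. The two query engines are then converted into tools that are passed to an OpenAI function-calling agent.
This document agent can dynamically choose to perform semantic search or summarization within a given document.
We create a separate document agent for each city.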
In [ ]:
from llama_index.agent.openai import OpenAIAgent

# Build the agents dictionary
agents = {}

for wiki_title in wiki_titles:
    # Build the vector index
    vector_index = VectorStoreIndex.from_documents(
        city_docs[wiki_title],
    )
    # Build the summary index
    summary_index = SummaryIndex.from_documents(
        city_docs[wiki_title],
    )
    # Define the query engines
    vector_query_engine = vector_index.as_query_engine()
    list_query_engine = summary_index.as_query_engine()

    # Define the tools
    query_engine_tools = [
        QueryEngineTool(
            query_engine=vector_query_engine,
            metadata=ToolMetadata(
                name="vector_tool",
                description=(
                    f"Useful for retrieving specific context from {wiki_title}"
                ),
            ),
        ),
        QueryEngineTool(
            query_engine=list_query_engine,
            metadata=ToolMetadata(
                name="summary_tool",
                description=(
                    "Useful for summarization questions related to"
                    f" {wiki_title}"
                ),
            ),
        ),
    ]

    # Build the agent
    function_llm = OpenAI(model="gpt-3.5-turbo-0613")
    agent = OpenAIAgent.from_tools(
        query_engine_tools,
        llm=function_llm,
        verbose=True,
    )

    agents[wiki_title] = agent
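To make the per-document tool choice concrete without calling an LLM, here is a toy stand-in. In the guide, the function-calling model picks a tool based on the tool descriptions; the keyword rule below is only an illustrative substitute for that decision. `ToyTool`, `ToyDocumentAgent`, and `build_toy_agent` are hypothetical names, not LlamaIndex classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ToyTool:
    name: str
    description: str
    run: Callable[[str], str]


class ToyDocumentAgent:
    """Picks one tool per query with a keyword rule (a stand-in for LLM function calling)."""

    SUMMARY_WORDS = ("summary", "summarize", "overview")

    def __init__(self, tools: Dict[str, ToyTool]):
        self.tools = tools

    def choose_tool(self, query: str) -> ToyTool:
        # Assumption: summary-style questions mention one of SUMMARY_WORDS.
        wants_summary = any(word in query.lower() for word in self.SUMMARY_WORDS)
        return self.tools["summary_tool" if wants_summary else "vector_tool"]

    def chat(self, query: str) -> str:
        tool = self.choose_tool(query)
        return f"[{tool.name}] {tool.run(query)}"


def build_toy_agent(city: str) -> ToyDocumentAgent:
    """Create an agent with one search-style tool and one summary-style tool."""
    tools = {
        "vector_tool": ToyTool(
            name="vector_tool",
            description=f"Useful for retrieving specific context from {city}",
            run=lambda q: f"relevant chunks of {city} for: {q}",
        ),
        "summary_tool": ToyTool(
            name="summary_tool",
            description=f"Useful for summarization questions related to {city}",
            run=lambda q: f"summary of the whole {city} article",
        ),
    }
    return ToyDocumentAgent(tools)


toy_agents = {city: build_toy_agent(city) for city in ["Boston", "Chicago"]}
print(toy_agents["Boston"].chat("Tell me about the sports teams in Boston"))
print(toy_agents["Chicago"].chat("Give me a summary of Chicago"))
```

The point of the sketch is the structure, one agent per document with two tools each, rather than the selection rule itself.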
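Build Composable Retriever¶
Now we define a set of summary nodes, where each node links to the corresponding Wikipedia city article. We then define a composable retriever + query engine on top of these nodes to route a query to a specific node, which in turn routes it to the relevant document agent.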
In [ ]:
# Define the top-level nodes
objects = []
for wiki_title in wiki_titles:
    # Define the index node that links to the agent for this city
    wiki_summary = (
        f"This content contains Wikipedia articles about {wiki_title}. Use"
        " this index if you need to look up specific facts about"
        f" {wiki_title}.\nDo not use this index if you want to analyze"
        " multiple cities."
    )
    node = IndexNode(
        text=wiki_summary, index_id=wiki_title, obj=agents[wiki_title]
    )
    objects.append(node)
In [ ]:
# Define the top-level retriever
vector_index = VectorStoreIndex(
    objects=objects,
)
query_engine = vector_index.as_query_engine(similarity_top_k=1, verbose=True)
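The routing idea can be sketched in plain Python: match the query against each node's summary text, pick the best node, then hand the query to whatever that node links to. This sketch swaps embedding similarity for simple word overlap so it runs without any model. `ToyIndexNode`, `overlap_score`, and `recursive_query` are hypothetical names for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyIndexNode:
    text: str                   # summary text that is matched against the query
    index_id: str
    obj: Callable[[str], str]   # what the node links to (an agent in the guide)


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query: str, text: str) -> float:
    """Jaccard word overlap, used here in place of embedding similarity."""
    q, t = tokenize(query), tokenize(text)
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def recursive_query(query: str, nodes: List[ToyIndexNode]) -> str:
    # Step 1: retrieve the single best-matching summary node (top-1).
    best = max(nodes, key=lambda n: overlap_score(query, n.text))
    # Step 2: follow the link and let the linked object answer the query.
    return best.obj(query)


cities = ["Toronto", "Seattle", "Chicago", "Boston", "Houston"]
nodes = [
    ToyIndexNode(
        text=f"This content contains Wikipedia articles about {city}.",
        index_id=city,
        # Bind city as a default argument so each node keeps its own city.
        obj=lambda q, city=city: f"{city} agent handles: {q}",
    )
    for city in cities
]

print(recursive_query("Tell me about the sports teams in Boston", nodes))
# prints "Boston agent handles: Tell me about the sports teams in Boston"
```

Taking only the best node mirrors `similarity_top_k=1` in the query engine above: each query is sent to exactly one document agent.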
Running Example Queries¶
In [ ]:
# Should use the Boston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Boston")
Retrieval entering Boston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Boston
Added user message to memory: Tell me about the sports teams in Boston
In [ ]:
print(response)
Boston is home to several professional sports teams across different leagues, including a successful baseball team in Major League Baseball, a highly successful American football team in the National Football League, one of the most successful basketball teams in the NBA, a professional ice hockey team in the National Hockey League, and a professional soccer team in Major League Soccer. These teams have a rich history, passionate fan bases, and have achieved great success both locally and nationally.
In [ ]:
# Should use the Houston agent -> vector tool
response = query_engine.query("Tell me about the sports teams in Houston")
Retrieval entering Houston: OpenAIAgent
Retrieving from object OpenAIAgent with query Tell me about the sports teams in Houston
Added user message to memory: Tell me about the sports teams in Houston
In [ ]:
print(response)
Houston is home to several professional sports teams across different leagues, including the Houston Texans in the NFL, the Houston Rockets in the NBA, the Houston Astros in MLB, the Houston Dynamo in MLS, and the Houston Dash in NWSL. These teams compete in football, basketball, baseball, soccer, and women's soccer respectively, and have achieved various levels of success in their respective leagues. Additionally, the city also has minor league baseball, hockey, and other sports teams that cater to sports enthusiasts.
In [ ]:
# Should use the Chicago agent -> summary tool
response = query_engine.query(
    "Give me a summary on all the positive aspects of Chicago"
)
Retrieval entering Chicago: OpenAIAgent
Retrieving from object OpenAIAgent with query Give me a summary on all the positive aspects of Chicago
Added user message to memory: Give me a summary on all the positive aspects of Chicago
=== Calling Function ===
Calling function: summary_tool with args: {
  "input": "positive aspects of Chicago"
}
Got output: Chicago is recognized for its robust economy, acting as a key hub for finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It stands out in the derivatives market and is a top-ranking city in terms of gross domestic product. Chicago is a favored destination for tourists, known for its rich art scene covering visual arts, literature, film, theater, comedy, food, dance, and music. The city hosts prestigious educational institutions and professional sports teams across different leagues.
========================
In [ ]:
print(response)
Chicago is known for its strong economy with a focus on finance, culture, commerce, industry, education, technology, telecommunications, and transportation. It is a major player in the derivatives market and boasts a high gross domestic product. The city is a popular tourist destination with a vibrant art scene that includes visual arts, literature, film, theater, comedy, food, dance, and music. Additionally, Chicago is home to prestigious educational institutions and professional sports teams across various leagues.