Recursive Retriever + Query Engine Demo
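In this demo, we walk through a use case showing how to apply our "RecursiveRetriever" module to hierarchical data.
The idea behind recursive retrieval is that we explore not only the directly relevant nodes, but also the relationships those nodes have to other retrievers/query engines, and execute them. For instance, a node may represent a concise summary of a structured table and link to a SQL/Pandas query engine over that table. If that node is retrieved, we also want to query the underlying query engine for the answer.
This is especially useful for documents with hierarchical relationships. In this example, we walk through a Wikipedia article about billionaires (in PDF form), which contains both text and a variety of embedded structured tables. We first create a Pandas query engine over each table, and also represent each table by an IndexNode (which stores a link to the query engine); this node is stored alongside the other nodes in a vector store.
At query time, if an IndexNode is fetched, the underlying query engine/retriever is queried.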
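The control flow described above can be illustrated with a small standard-library sketch. It is not the library's implementation: the `Node` class, the `recursive_retrieve` function, and the toy retriever and query engine below are all hypothetical stand-ins, chosen only to show how a retrieved link node is followed to its underlying engine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    """A text chunk; if index_id is set, the node is a link to another component."""
    text: str
    index_id: Optional[str] = None


# Hypothetical stand-ins: a "retriever" maps a query to nodes,
# a "query engine" maps a query to an answer string.
Retriever = Callable[[str], List[Node]]
QueryEngine = Callable[[str], str]


def recursive_retrieve(
    query: str,
    root_id: str,
    retrievers: Dict[str, Retriever],
    query_engines: Dict[str, QueryEngine],
) -> List[Node]:
    """Retrieve from the root retriever, following any links that come back."""
    results: List[Node] = []
    for node in retrievers[root_id](query):
        if node.index_id is None:
            # Ordinary node: return it unchanged.
            results.append(node)
        elif node.index_id in query_engines:
            # Link to a query engine: run it and wrap the answer in a node.
            answer = query_engines[node.index_id](query)
            results.append(Node(text=f"Query: {query}\nResponse: {answer}"))
        elif node.index_id in retrievers:
            # Link to another retriever: recurse into it.
            results.extend(
                recursive_retrieve(query, node.index_id, retrievers, query_engines)
            )
        else:
            raise KeyError(f"No retriever or query engine for id {node.index_id!r}")
    return results


def top_level(query: str) -> List[Node]:
    # Toy retriever: always returns a single table-summary link node.
    return [Node(text="Summary of the yearly billionaire counts", index_id="table")]


def table_engine(query: str) -> str:
    # Toy query engine: looks up a year mentioned in the query.
    counts = {"2008": 1125, "2009": 793}
    for year, count in counts.items():
        if year in query:
            return str(count)
    return "unknown"


nodes = recursive_retrieve(
    "How many billionaires were there in 2009?",
    root_id="vector",
    retrievers={"vector": top_level},
    query_engines={"table": table_engine},
)
print(nodes[0].text)
```

In this sketch the link node itself never reaches the caller; what comes back is a node holding the query engine's answer, which a response synthesizer could then use as context.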
Setup
We use camelot to extract text-based tables from PDFs.
%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-experimental
import camelot

# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from typing import List
Default Settings
import os
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
file_path = "billionaires_page.pdf"
# initialize the PDF reader
reader = PyMuPDFReader()
docs = reader.load(file_path)
# use camelot to parse tables
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        table_df = table_list[0].df
        # promote the first row to column headers, then drop that row
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs
table_dfs = get_tables(file_path, pages=[3, 25])
# shows the list of top billionaires in 2023
table_dfs[0]
 | No. | Name | Net worth\n(USD) | Age | Nationality | Primary source(s) of wealth |
---|---|---|---|---|---|---|
0 | 1 | Bernard Arnault &\nfamily | $211 billion | 74 | France | LVMH |
1 | 2 | Elon Musk | $180 billion | 51 | United\nStates | Tesla, SpaceX, X Corp. |
2 | 3 | Jeff Bezos | $114 billion | 59 | United\nStates | Amazon |
3 | 4 | Larry Ellison | $107 billion | 78 | United\nStates | Oracle Corporation |
4 | 5 | Warren Buffett | $106 billion | 92 | United\nStates | Berkshire Hathaway |
5 | 6 | Bill Gates | $104 billion | 67 | United\nStates | Microsoft |
6 | 7 | Michael Bloomberg | $94.5 billion | 81 | United\nStates | Bloomberg L.P. |
7 | 8 | Carlos Slim & family | $93 billion | 83 | Mexico | Telmex, América Móvil, Grupo\nCarso |
8 | 9 | Mukesh Ambani | $83.4 billion | 65 | India | Reliance Industries |
9 | 10 | Steve Ballmer | $80.7 billion | 67 | United\nStates | Microsoft |
# shows the number of billionaires and their combined net worth by year
table_dfs[1]
 | Year | Number of billionaires | Group's combined net worth |
---|---|---|---|
0 | 2023[2] | 2,640 | $12.2 trillion |
1 | 2022[6] | 2,668 | $12.7 trillion |
2 | 2021[11] | 2,755 | $13.1 trillion |
3 | 2020 | 2,095 | $8.0 trillion |
4 | 2019 | 2,153 | $8.7 trillion |
5 | 2018 | 2,208 | $9.1 trillion |
6 | 2017 | 2,043 | $7.7 trillion |
7 | 2016 | 1,810 | $6.5 trillion |
8 | 2015[18] | 1,826 | $7.1 trillion |
9 | 2014[67] | 1,645 | $6.4 trillion |
10 | 2013[68] | 1,426 | $5.4 trillion |
11 | 2012 | 1,226 | $4.6 trillion |
12 | 2011 | 1,210 | $4.5 trillion |
13 | 2010 | 1,011 | $3.6 trillion |
14 | 2009 | 793 | $2.4 trillion |
15 | 2008 | 1,125 | $4.4 trillion |
16 | 2007 | 946 | $3.5 trillion |
17 | 2006 | 793 | $2.6 trillion |
18 | 2005 | 691 | $2.2 trillion |
19 | 2004 | 587 | $1.9 trillion |
20 | 2003 | 476 | $1.4 trillion |
21 | 2002 | 497 | $1.5 trillion |
22 | 2001 | 538 | $1.8 trillion |
23 | 2000 | 470 | $898 billion |
24 | Sources: Forbes.[18][67][66][68] |
Create Pandas Query Engines
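We create a Pandas query engine over each structured table.
These engines can be executed on their own to answer queries about each table.
WARNING: This tool gives the LLM access to the eval function.
Arbitrary code execution is possible on the machine running this tool.
While some filtering is applied to the code, this tool is not recommended for production use without heavy sandboxing or virtual machines.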
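To make the risk concrete, here is a minimal sketch of what filtering an expression before passing it to eval can look like. It is an illustration under stated assumptions, not the filter that PandasQueryEngine applies: `ALLOWED_NODES`, `check_expression`, and `guarded_eval` are hypothetical names, and an allowlist like this is not a substitute for a real sandbox.

```python
import ast

# Hypothetical allowlist for illustration: arithmetic and comparisons on named values.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
    ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.Div, ast.USub,
    ast.Eq, ast.NotEq, ast.Lt, ast.LtE, ast.Gt, ast.GtE, ast.And, ast.Or,
)


def check_expression(expr: str, allowed_names: set) -> None:
    """Raise ValueError if the expression uses syntax or names outside the allowlist."""
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"Disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id not in allowed_names:
            raise ValueError(f"Disallowed name: {node.id}")


def guarded_eval(expr: str, variables: dict):
    """Check the expression first, then evaluate it with no builtins available."""
    check_expression(expr, set(variables))
    return eval(expr, {"__builtins__": {}}, dict(variables))


values = {"count_2008": 1125, "count_2009": 793}

# A benign expression of the kind an LLM might generate.
print(guarded_eval("count_2009 - count_2008", values))

# A malicious expression is rejected before eval ever sees it.
try:
    guarded_eval("__import__('os').getcwd()", values)
except ValueError as err:
    print(f"rejected: {err}")
```

Function calls and attribute access are rejected here, which also rules out most useful pandas expressions. That trade-off is why filtering alone tends to be either too strict or too permissive, and why isolation at the process or machine level is the safer control.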
# define query engines over these tables
llm = OpenAI(model="gpt-4")
df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]
response = df_query_engines[0].query(
"What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))
$180 billion
response = df_query_engines[1].query(
"How many billionaires were there in 2009?"
)
print(str(response))
793
Build Vector Index
Build a vector index over the chunked documents as well as the additional IndexNode objects linked to the tables.
from llama_index.core import Settings
doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)
# define index nodes
summaries = [
    (
        "This node provides information about the world's richest billionaires"
        " in 2023"
    ),
    (
        "This node provides information on the number of billionaires and"
        " their combined net worth from 2000 to 2023."
    ),
]

df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}
# construct the top-level vector index + retriever
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
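Because the table summaries are indexed next to the ordinary text chunks, a top-1 similarity search decides whether a query lands on a table link or on plain text. The sketch below illustrates that selection step with a toy bag-of-words embedding in place of a real embedding model. The `embed`, `cosine`, and `top_k` helpers are hypothetical, and the first entry in `texts` is a made-up placeholder chunk.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (a stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, texts: List[str], k: int = 1) -> List[Tuple[float, str]]:
    """Return the k texts most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), text) for text in texts]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


texts = [
    # Placeholder for an ordinary document chunk.
    "The list excludes royal families and dictators",
    # The two table summaries used for the index nodes.
    "This node provides information about the world's richest billionaires in 2023",
    "This node provides information on the number of billionaires and"
    " their combined net worth from 2000 to 2023.",
]

best = top_k("number of billionaires each year", texts, k=1)
print(best[0][1])
```

With `k=1`, only the single best match is passed on, so the wording of each summary matters: it has to be close enough to the questions the table can answer for the link node to win over the surrounding text chunks.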
Use RecursiveRetriever in our RetrieverQueryEngine
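We define a RecursiveRetriever object to recursively retrieve/query nodes. We then put it in our RetrieverQueryEngine along with a ResponseSynthesizer to synthesize a response.
We pass in mappings from id to retriever and from id to query engine. We then pass in a root id identifying the retriever we query first.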
# baseline vector index (that doesn't include the extra df nodes).
# used for benchmarking
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer
recursive_retriever = RecursiveRetriever(
"vector",
retriever_dict={"vector": vector_retriever},
query_engine_dict=df_id_query_engine_mapping,
verbose=True,
)
response_synthesizer = get_response_synthesizer(response_mode="compact")
query_engine = RetrieverQueryEngine.from_args(
recursive_retriever, response_synthesizer=response_synthesizer
)
response = query_engine.query(
"What's the net worth of the second richest billionaire in 2023?"
)
Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
Retrieved node with id, entering: pandas0
Retrieving with query id pandas0: What's the net worth of the second richest billionaire in 2023?
Got response: $180 billion
response.source_nodes[0].node.get_content()
"Query: What's the net worth of the second richest billionaire in 2023?\nResponse: $180\xa0billion"
str(response)
'$180 billion.'
response = query_engine.query("How many billionaires were there in 2009?")
Retrieving with query id None: How many billionaires were there in 2009?
Retrieved node with id, entering: pandas1
Retrieving with query id pandas1: How many billionaires were there in 2009?
Got response: 793
str(response)
'793'
response = vector_query_engine0.query(
"How many billionaires were there in 2009?"
)
print(response.source_nodes[0].node.get_content())
print(str(response))
Based on the context information, it is not possible to determine the exact number of billionaires in 2009. The provided information only mentions the number of billionaires in 2013 and 2014.
response.source_nodes[0].node.get_content()
response = query_engine.query(
"Which billionaires are excluded from this list?"
)
print(str(response))
Royal families and dictators whose wealth is contingent on a position are excluded from this list.