Recursive Retriever + Query Engine Demo

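In this demo, we walk through a use case that shows how to use our RecursiveRetriever module over hierarchical data.

The idea behind recursive retrieval is that we explore not only the most directly relevant nodes, but also the links from those nodes to other retrievers/query engines, and execute them. For instance, a node may hold a concise summary of a structured table and link to a SQL/Pandas query engine over that table. If that node is retrieved, we also want to query the underlying query engine for the answer.

This is especially useful for documents with hierarchical relationships. In this example, we walk through a Wikipedia article about billionaires (in PDF form), which contains both text and a variety of embedded structured tables. We first create a Pandas query engine over each table, and we also represent each table by an IndexNode (which stores a link to the query engine); this node is stored along with the other nodes in a vector store.

At query time, if an IndexNode is fetched, the underlying query engine/retriever is queried.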
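To make the idea concrete before the real setup, here is a minimal, library-free sketch of recursive retrieval. Everything in it is an assumption made for illustration: `ToyNode`, the word-overlap retriever, and `table_engine` are stand-ins, not LlamaIndex classes. It only shows the control flow: retrieve a node, and if the node carries a link, run the linked engine instead of returning the node's text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ToyNode:
    text: str
    # Hypothetical link field: id of an engine to run if this node is retrieved.
    index_id: Optional[str] = None


def toy_retrieve(query: str, nodes: List[ToyNode]) -> ToyNode:
    """Return the node that shares the most lowercase words with the query."""
    query_words = set(query.lower().split())
    return max(
        nodes, key=lambda n: len(query_words & set(n.text.lower().split()))
    )


def recursive_query(
    query: str,
    nodes: List[ToyNode],
    engines: Dict[str, Callable[[str], str]],
) -> str:
    """Retrieve one node; if it links to an engine, query that engine."""
    node = toy_retrieve(query, nodes)
    if node.index_id is not None:
        return engines[node.index_id](query)
    return node.text


counts = {"2009": "793", "2023": "2,640"}


def table_engine(query: str) -> str:
    """Stand-in for a table query engine: look up the year named in the query."""
    for year, count in counts.items():
        if year in query:
            return count
    return "unknown"


nodes = [
    ToyNode("Forbes publishes an annual ranking of billionaires."),
    ToyNode("Summary: number of billionaires per year.", index_id="table0"),
]
engines = {"table0": table_engine}

answer = recursive_query(
    "number of billionaires per year in 2009", nodes, engines
)
print(answer)
```

The summary node wins the retrieval step, so the answer comes from the linked engine rather than from the summary text itself.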

Notes about Setup

We use camelot to extract text-based tables from PDFs.

In [ ]:

%pip install llama-index-embeddings-openai
%pip install llama-index-readers-file pymupdf
%pip install llama-index-llms-openai
%pip install llama-index-experimental

In [ ]:

import camelot

# https://en.wikipedia.org/wiki/The_World%27s_Billionaires
from llama_index.core import VectorStoreIndex
from llama_index.experimental.query_engine import PandasQueryEngine
from llama_index.core.schema import IndexNode
from llama_index.llms.openai import OpenAI
from llama_index.readers.file import PyMuPDFReader
from typing import List

Default Settings

In [ ]:

import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
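If you would rather not paste a key into the notebook, one option is to set the variable in your shell and only check for it here. The sketch below is a small helper for that; `require_env` is a name chosen for this example and is not part of LlamaIndex or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value
```

Calling `require_env("OPENAI_API_KEY")` at the top of the notebook fails early with a readable message instead of failing later inside the first API call.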

In [ ]:

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

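Load in Document (and Tables)

We use our PyMuPDFReader to read in the main text of the document.

We also use camelot to extract some structured tables from the document.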

In [ ]:

file_path = "billionaires_page.pdf"

In [ ]:

# initialize PDF reader
reader = PyMuPDFReader()

In [ ]:

docs = reader.load(file_path)

In [ ]:

# use camelot to parse tables
def get_tables(path: str, pages: List[int]):
    table_dfs = []
    for page in pages:
        table_list = camelot.read_pdf(path, pages=str(page))
        # keep only the first table found on the page
        table_df = table_list[0].df
        # promote the first row to column headers, then drop that row
        table_df = (
            table_df.rename(columns=table_df.iloc[0])
            .drop(table_df.index[0])
            .reset_index(drop=True)
        )
        table_dfs.append(table_df)
    return table_dfs
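The rename/drop/reset_index chain inside get_tables promotes the first extracted row to column headers. The sketch below isolates that step on a small hand-built DataFrame. It assumes the raw table has integer column labels with the header text sitting in row 0; `promote_first_row_to_header` is a name chosen for this example.

```python
import pandas as pd


def promote_first_row_to_header(raw: pd.DataFrame) -> pd.DataFrame:
    """Use row 0 as the column names, drop it, and renumber the remaining rows."""
    return (
        raw.rename(columns=raw.iloc[0])
        .drop(raw.index[0])
        .reset_index(drop=True)
    )


# Stand-in for an extracted table: integer column labels, header text in row 0.
raw = pd.DataFrame(
    [
        ["Year", "Number of billionaires"],
        ["2023", "2,640"],
        ["2022", "2,668"],
    ]
)

clean = promote_first_row_to_header(raw)
print(clean)
```

Passing the first row as the `columns` mapping works because that row is a Series indexed by the old column labels, so each old label maps to its header string.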

In [ ]:

table_dfs = get_tables(file_path, pages=[3, 25])

In [ ]:

# shows list of top billionaires in 2023
table_dfs[0]

Out[ ]:

	No.	Name	Net worth\n(USD)	Age	Nationality	Primary source(s) of wealth
0	1	Bernard Arnault &\nfamily	$211 billion	74	France	LVMH
1	2	Elon Musk	$180 billion	51	United\nStates	Tesla, SpaceX, X Corp.
2	3	Jeff Bezos	$114 billion	59	United\nStates	Amazon
3	4	Larry Ellison	$107 billion	78	United\nStates	Oracle Corporation
4	5	Warren Buffett	$106 billion	92	United\nStates	Berkshire Hathaway
5	6	Bill Gates	$104 billion	67	United\nStates	Microsoft
6	7	Michael Bloomberg	$94.5 billion	81	United\nStates	Bloomberg L.P.
7	8	Carlos Slim & family	$93 billion	83	Mexico	Telmex, América Móvil, Grupo\nCarso
8	9	Mukesh Ambani	$83.4 billion	65	India	Reliance Industries
9	10	Steve Ballmer	$80.7 billion	67	United\nStates	Microsoft

In [ ]:

# shows the number of billionaires and their combined net worth by year
table_dfs[1]

Out[ ]:

	Year	Number of billionaires	Group's combined net worth
0	2023[2]	2,640	$12.2 trillion
1	2022[6]	2,668	$12.7 trillion
2	2021[11]	2,755	$13.1 trillion
3	2020	2,095	$8.0 trillion
4	2019	2,153	$8.7 trillion
5	2018	2,208	$9.1 trillion
6	2017	2,043	$7.7 trillion
7	2016	1,810	$6.5 trillion
8	2015[18]	1,826	$7.1 trillion
9	2014[67]	1,645	$6.4 trillion
10	2013[68]	1,426	$5.4 trillion
11	2012	1,226	$4.6 trillion
12	2011	1,210	$4.5 trillion
13	2010	1,011	$3.6 trillion
14	2009	793	$2.4 trillion
15	2008	1,125	$4.4 trillion
16	2007	946	$3.5 trillion
17	2006	793	$2.6 trillion
18	2005	691	$2.2 trillion
19	2004	587	$1.9 trillion
20	2003	476	$1.4 trillion
21	2002	497	$1.5 trillion
22	2001	538	$1.8 trillion
23	2000	470	$898 billion
24	Sources: Forbes.[18][67][66][68]
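Both tables come back as strings, and the output above shows the usual extraction noise: footnote markers such as `[2]`, thousands separators, and line breaks inside cells. The notebook passes these raw strings straight to the query engines. If you want numeric columns instead, a normalization pass along the following lines is one option. This is a sketch using only the standard library; the helper names are hypothetical and not part of camelot or LlamaIndex.

```python
import re


def strip_citations(cell: str) -> str:
    """Remove bracketed footnote markers such as "[2]" and collapse whitespace."""
    without_refs = re.sub(r"\[\d+\]", "", cell)
    return " ".join(without_refs.split())


def parse_count(cell: str) -> int:
    """Convert a count such as "2,640" to an integer."""
    return int(strip_citations(cell).replace(",", ""))


_UNITS = {"billion": 1e9, "trillion": 1e12}


def parse_usd(cell: str) -> float:
    """Convert strings such as "$8.0 trillion" or "$898 billion" to dollars."""
    match = re.fullmatch(
        r"\$([\d.]+)\s+(billion|trillion)", strip_citations(cell)
    )
    if match is None:
        raise ValueError(f"Unrecognized amount: {cell!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


print(strip_citations("2023[2]"))
print(parse_count("2,640"))
print(parse_usd("$898 billion"))
```

The last row of the second table ("Sources: Forbes...") is a footer rather than data, so it would need to be dropped before applying `parse_count` or `parse_usd` to a whole column.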

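Create Pandas Query Engines

We create a Pandas query engine over each structured table.

These can be executed on their own to answer queries about each table.

WARNING: This tool gives the LLM access to the eval function. Arbitrary code execution is possible on the machine running this tool. Although the generated code is filtered to some degree, this tool is not recommended for production use without heavy sandboxing or virtual machines.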
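To give a feel for what filtering generated code can mean, the sketch below parses a candidate expression with the standard `ast` module and rejects it if it uses names outside an allow-list or touches underscore-prefixed attributes. This is an illustration only: `is_expression_allowed` and its rules are assumptions made for this example, not the filter that PandasQueryEngine applies, and a check like this does not replace sandboxing.

```python
import ast
from typing import Set


def is_expression_allowed(source: str, allowed_names: Set[str]) -> bool:
    """Return True if `source` is a single expression that only uses allowed
    names and never touches underscore-prefixed attributes."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        # Statements such as `import os` are not expressions.
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id not in allowed_names:
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("_"):
            return False
    return True


print(is_expression_allowed("df['Name'].iloc[1]", {"df"}))
print(is_expression_allowed("__import__('os').system('ls')", {"df"}))
```

An allow-list check of this kind narrows the obvious escape routes, but determined inputs can still find others, which is why the warning recommends isolation at the machine level.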

In [ ]:

# define query engines over these tables
llm = OpenAI(model="gpt-4")

df_query_engines = [
    PandasQueryEngine(table_df, llm=llm) for table_df in table_dfs
]

In [ ]:

response = df_query_engines[0].query(
    "What's the net worth of the second richest billionaire in 2023?"
)
print(str(response))

$180 billion

In [ ]:

response = df_query_engines[1].query(
    "How many billionaires were there in 2009?"
)
print(str(response))

Build Vector Index

Build a vector index over the chunked document as well as the additional IndexNode objects linked to the tables.

In [ ]:

from llama_index.core import Settings

doc_nodes = Settings.node_parser.get_nodes_from_documents(docs)

In [ ]:

# define index nodes
summaries = [
    (
        "This node provides information about the world's richest billionaires"
        " in 2023"
    ),
    (
        "This node provides information on the number of billionaires and"
        " their combined net worth from 2000 to 2023."
    ),
]

# one IndexNode per table; index_id is the link to the matching query engine
df_nodes = [
    IndexNode(text=summary, index_id=f"pandas{idx}")
    for idx, summary in enumerate(summaries)
]

df_id_query_engine_mapping = {
    f"pandas{idx}": df_query_engine
    for idx, df_query_engine in enumerate(df_query_engines)
}

In [ ]:

# construct top-level vector index + retriever
vector_index = VectorStoreIndex(doc_nodes + df_nodes)
vector_retriever = vector_index.as_retriever(similarity_top_k=1)
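Setting `similarity_top_k=1` asks the retriever for only the single closest node, so a query lands either on a text chunk or on one of the table summary nodes. The sketch below shows top-k selection in plain Python, assuming cosine similarity over embedding vectors. The three-dimensional vectors and node names are made up for illustration; real embeddings come from the embedding model and have far more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[str]:
    """Return the names of the k vectors most similar to the query vector."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine(query_vec, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Made-up embeddings for one text chunk and two table summary nodes.
vectors = {
    "doc_chunk": [0.9, 0.1, 0.0],
    "pandas0": [0.1, 0.9, 0.1],
    "pandas1": [0.0, 0.2, 0.9],
}

best = top_k([0.0, 0.3, 0.8], vectors, k=1)
print(best)
```

With k=1 only one candidate survives, so the wording of each table summary matters: it is the only thing the query is compared against for that table.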

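Use `RecursiveRetriever` in our `RetrieverQueryEngine`

We define a RecursiveRetriever object to recursively retrieve/query nodes. We then put it in our RetrieverQueryEngine, along with a ResponseSynthesizer, to synthesize a response.

We pass in mappings from id to retriever and from id to query engine. We then pass in a root id representing the retriever we query first.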
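Because everything is wired together by string ids, a typo in an id is an easy mistake to make. The sketch below is a small pre-flight check over plain sets of ids. `find_wiring_problems` is a hypothetical helper written for this example, not a LlamaIndex function, and how the library itself reacts to a mismatch is not covered here.

```python
from typing import Iterable, List, Set


def find_wiring_problems(
    root_id: str,
    retriever_ids: Set[str],
    query_engine_ids: Set[str],
    linked_ids: Iterable[str],
) -> List[str]:
    """Return readable problems; an empty list means the ids line up."""
    problems = []
    if root_id not in retriever_ids:
        problems.append(f"root id {root_id!r} is not a registered retriever")
    for shared in sorted(retriever_ids & query_engine_ids):
        problems.append(
            f"id {shared!r} is registered as both retriever and query engine"
        )
    known = retriever_ids | query_engine_ids
    for link in linked_ids:
        if link not in known:
            problems.append(f"node links to unknown id {link!r}")
    return problems


problems = find_wiring_problems(
    root_id="vector",
    retriever_ids={"vector"},
    query_engine_ids={"pandas0", "pandas1"},
    linked_ids=["pandas0", "pandas1"],
)
print(problems)
```

In this notebook the linked ids would be the `index_id` values of the IndexNode objects, and the two id sets would be the keys of the retriever and query engine dictionaries.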

In [ ]:

# baseline vector index (that doesn't include the extra df nodes).
# used to benchmark
vector_index0 = VectorStoreIndex(doc_nodes)
vector_query_engine0 = vector_index0.as_query_engine()

In [ ]:

from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import get_response_synthesizer

recursive_retriever = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever},
    query_engine_dict=df_id_query_engine_mapping,
    verbose=True,
)

response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever, response_synthesizer=response_synthesizer
)

In [ ]:

response = query_engine.query(
    "What's the net worth of the second richest billionaire in 2023?"
)

Retrieving with query id None: What's the net worth of the second richest billionaire in 2023?
Retrieved node with id, entering: pandas0
Retrieving with query id pandas0: What's the net worth of the second richest billionaire in 2023?
Got response: $180 billion

In [ ]:

response.source_nodes[0].node.get_content()

Out[ ]:

"Query: What's the net worth of the second richest billionaire in 2023?\nResponse: $180\xa0billion"

In [ ]:

str(response)

Out[ ]:

'$180 billion.'

In [ ]:

response = query_engine.query("How many billionaires were there in 2009?")

Retrieving with query id None: How many billionaires were there in 2009?
Retrieved node with id, entering: pandas1
Retrieving with query id pandas1: How many billionaires were there in 2009?
Got response: 793

In [ ]:

str(response)

Out[ ]:

'793'

In [ ]:

response = vector_query_engine0.query(
    "How many billionaires were there in 2009?"
)

In [ ]:

print(response.source_nodes[0].node.get_content())

In [ ]:

print(str(response))

Based on the context information, it is not possible to determine the exact number of billionaires in 2009. The provided information only mentions the number of billionaires in 2013 and 2014.

In [ ]:

response.source_nodes[0].node.get_content()

In [ ]:

response = query_engine.query(
    "Which billionaires are excluded from this list?"
)

In [ ]:

print(str(response))

Royal families and dictators whose wealth is contingent on a position are excluded from this list.
