Recursive Retriever + Node References + Braintrust¶
This guide shows how to use recursive retrieval to traverse node relationships and fetch nodes based on "references".
Node references are a powerful concept. When you first perform retrieval, you may want to retrieve the reference rather than the raw text. You can have multiple references point to the same node.
In this guide we explore some different usages of node references:
- Chunk references: different chunk sizes referring to a bigger chunk
- Metadata references: summaries + generated questions referring to a bigger chunk
We evaluate how well our recursive retrieval + node reference methods work using Braintrust. Braintrust is the enterprise-grade stack for building AI products. From evaluations, to prompt playground, to data management, we take uncertainty and tedium out of incorporating AI into your business.
You can see an example evaluation dashboard here:
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-readers-file
In [ ]:
%load_ext autoreload
%autoreload 2

# NOTE: Replace YOUR_OPENAI_API_KEY with your OpenAI API key and
# YOUR_BRAINTRUST_API_KEY with your Braintrust API key. Do not put them in quotes.
# Sign up for Braintrust at https://braintrustdata.com/ and get your API key at
# https://www.braintrustdata.com/app/braintrustdata.com/settings/api-keys
%env OPENAI_API_KEY=
%env BRAINTRUST_API_KEY=

# This is needed to avoid a warning message from Chroma.
%env TOKENIZERS_PARALLELISM=true
In [ ]:
%pip install -U llama_hub llama_index braintrust autoevals pypdf pillow transformers torch torchvision
Load Data + Setup¶
In this section we download the Llama 2 paper and create an initial set of nodes (chunk size 1024).
In [ ]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
In [ ]:
from pathlib import Path
from llama_index.readers.file import PDFReader
from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
import json
In [ ]:
loader = PDFReader()
docs0 = loader.load_data(file=Path("./data/llama2.pdf"))
In [ ]:
from llama_index.core import Document

doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
In [ ]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode
In [ ]:
node_parser = SentenceSplitter(chunk_size=1024)
In [ ]:
base_nodes = node_parser.get_nodes_from_documents(docs)

# set node ids to be a constant
for idx, node in enumerate(base_nodes):
    node.id_ = f"node-{idx}"
In [ ]:
from llama_index.core.embeddings import resolve_embed_model

embed_model = resolve_embed_model("local:BAAI/bge-small-en")
llm = OpenAI(model="gpt-3.5-turbo")
Baseline Retriever¶
Define a baseline retriever that simply fetches the top-k raw text nodes by embedding similarity.
In [ ]:
base_index = VectorStoreIndex(base_nodes, embed_model=embed_model)
base_retriever = base_index.as_retriever(similarity_top_k=2)
In [ ]:
retrievals = base_retriever.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
In [ ]:
for n in retrievals:
    display_source_node(n, source_length=1500)
In [ ]:
query_engine_base = RetrieverQueryEngine.from_args(base_retriever, llm=llm)
In [ ]:
response = query_engine_base.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
Chunk References: Smaller Child Chunks Referring to Bigger Parent Chunks¶
In this usage example, we show how to build a graph of smaller child chunks pointing to bigger parent chunks.
At query time, we retrieve the smaller child chunks, but we follow references to the bigger parent chunks. This gives us more context for synthesis.
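The reference-following step can be illustrated with a minimal, self-contained sketch. This is plain Python rather than LlamaIndex, and the toy keyword-overlap score stands in for embedding similarity; the `parents`/`children` names are invented for illustration:

```python
# Toy illustration of chunk references: small child chunks are what gets
# scored, but each hit is resolved to its larger parent chunk, deduplicated.

# Parent chunks keyed by id (stand-ins for the 1024-token base nodes).
parents = {
    "node-0": "LARGE PARENT CHUNK 0 ...",
    "node-1": "LARGE PARENT CHUNK 1 ...",
}

# Child chunks as (child_text, parent_id); several children share one parent.
children = [
    ("small chunk a", "node-0"),
    ("small chunk b", "node-0"),
    ("small chunk c", "node-1"),
]


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank children by naive keyword overlap, then resolve to parents."""

    def score(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))

    ranked = sorted(children, key=lambda c: score(c[0]), reverse=True)
    seen, results = set(), []
    for _, parent_id in ranked[:top_k]:
        if parent_id not in seen:  # multiple children may hit the same parent
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results


# Two top hits point at the same parent, so only one parent is returned.
print(retrieve("small chunk a"))  # → ['LARGE PARENT CHUNK 0 ...']
```

The deduplication step is why chunk references can return fewer (but larger) nodes than `similarity_top_k` suggests.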
In [ ]:
sub_chunk_sizes = [128, 256, 512]
# chunk_overlap must be smaller than the smallest chunk size
# (the SentenceSplitter default overlap of 200 would exceed chunk_size=128)
sub_node_parsers = [
    SentenceSplitter(chunk_size=c, chunk_overlap=20) for c in sub_chunk_sizes
]

all_nodes = []
for base_node in base_nodes:
    for n in sub_node_parsers:
        sub_nodes = n.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original node to the node list
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)
In [ ]:
all_nodes_dict = {n.node_id: n for n in all_nodes}
In [ ]:
vector_index_chunk = VectorStoreIndex(all_nodes, embed_model=embed_model)
In [ ]:
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)
In [ ]:
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
In [ ]:
nodes = retriever_chunk.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)
In [ ]:
query_engine_chunk = RetrieverQueryEngine.from_args(retriever_chunk, llm=llm)
In [ ]:
response = query_engine_chunk.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
Metadata References: Summaries + Generated Questions Referring to a Bigger Chunk¶
In this usage example, we show how to define additional context that references the source node.
This additional context includes summaries as well as generated questions.
At query time, we retrieve the smaller reference nodes, but we follow references to the bigger chunks. This gives us more context for synthesis.
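Before running the real extractors below, here is a toy sketch of the idea in plain Python. The dict keys mirror the extractor fields used later, but the text values and the `toy_metadata_dicts`/`references` names are made up for illustration:

```python
# Toy illustration of metadata references: each base chunk spawns two extra
# searchable entries (a summary and generated questions) that both point
# back to the id of the same source chunk.

base_chunks = {"node-0": "full text of base chunk 0 ..."}

# Stand-in for extractor output: one dict per base chunk.
toy_metadata_dicts = [
    {
        "section_summary": "a short summary of chunk 0",
        "questions_this_excerpt_can_answer": "1. What is X? 2. How does Y work?",
    }
]

# Build (searchable_text, source_id) reference entries.
references = []
for chunk_id, d in zip(base_chunks, toy_metadata_dicts):
    references.append((d["section_summary"], chunk_id))
    references.append((d["questions_this_excerpt_can_answer"], chunk_id))

# A hit on either the summary or the questions resolves to the full chunk.
for _, source_id in references:
    print(base_chunks[source_id])
```

Summaries and questions are often closer in phrasing to user queries than the raw chunk text, which is why indexing them can improve retrieval of the underlying chunks.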
In [ ]:
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import IndexNode
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)
In [ ]:
extractors = [
    SummaryExtractor(summaries=["self"], show_progress=True),
    QuestionsAnsweredExtractor(questions=5, show_progress=True),
]
In [ ]:
# run the metadata extractors across the base nodes and merge the resulting
# dictionaries, so each entry holds both the summary and the generated
# questions for its base node
metadata_dicts = [{} for _ in base_nodes]
for extractor in extractors:
    for d, extracted in zip(metadata_dicts, extractor.extract(base_nodes)):
        d.update(extracted)
In [ ]:
# cache metadata dicts
def save_metadata_dicts(path):
    with open(path, "w") as fp:
        for m in metadata_dicts:
            fp.write(json.dumps(m) + "\n")


def load_metadata_dicts(path):
    with open(path, "r") as fp:
        metadata_dicts = [json.loads(l) for l in fp.readlines()]
    return metadata_dicts
In [ ]:
save_metadata_dicts("data/llama2_metadata_dicts.jsonl")
In [ ]:
metadata_dicts = load_metadata_dicts("data/llama2_metadata_dicts.jsonl")
In [ ]:
# all nodes consist of the source nodes plus metadata reference nodes
import copy

all_nodes = copy.deepcopy(base_nodes)
for idx, d in enumerate(metadata_dicts):
    inode_q = IndexNode(
        text=d["questions_this_excerpt_can_answer"],
        index_id=base_nodes[idx].node_id,
    )
    inode_s = IndexNode(
        text=d["section_summary"], index_id=base_nodes[idx].node_id
    )
    all_nodes.extend([inode_q, inode_s])
In [ ]:
all_nodes_dict = {n.node_id: n for n in all_nodes}
In [ ]:
## load the nodes into a vector index
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo")
# use the same local embedding model as the other indexes
vector_index_metadata = VectorStoreIndex(all_nodes, embed_model=embed_model)
In [ ]:
vector_retriever_metadata = vector_index_metadata.as_retriever(
    similarity_top_k=2
)
In [ ]:
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=True,
)
In [ ]:
nodes = retriever_metadata.retrieve(
    "Can you tell me about the key concepts for safety finetuning"
)
for node in nodes:
    display_source_node(node, source_length=2000)
In [ ]:
query_engine_metadata = RetrieverQueryEngine.from_args(
    retriever_metadata, llm=llm
)
In [ ]:
response = query_engine_metadata.query(
    "Can you tell me about the key concepts for safety finetuning"
)
print(str(response))
Evaluation¶
We evaluate how well our recursive retrieval + node reference methods work using Braintrust. Braintrust is the enterprise-grade stack for building AI products. From evaluations, to prompt playground, to data management, we take uncertainty and tedium out of incorporating AI into your business.
We evaluate both chunk references and metadata references, using embedding similarity lookup to retrieve the reference nodes. We compare both methods against a baseline retriever that fetches the raw nodes directly. In terms of metrics, we evaluate using both hit rate and MRR.
You can see an example evaluation dashboard here:
Dataset Generation¶
We first generate a dataset of questions from the set of text chunks.
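Both metrics are simple to state: hit rate is 1 if any retrieved id matches an expected id, and MRR (mean reciprocal rank) averages the reciprocal rank of the first match per query. A small standalone worked example, mirroring the scorer functions defined below:

```python
# Worked example of the two retrieval metrics used in this evaluation.

def hit_rate(retrieved: list[str], expected: list[str]) -> int:
    """1 if any expected id appears anywhere in the retrieved list, else 0."""
    return 1 if any(doc_id in expected for doc_id in retrieved) else 0


def mrr(retrieved: list[str], expected: list[str]) -> float:
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1 / rank
    return 0.0


retrieved = ["node-3", "node-7", "node-1"]
expected = ["node-7"]
print(hit_rate(retrieved, expected))  # → 1
print(mrr(retrieved, expected))       # → 0.5 (first hit at rank 2)
```

Hit rate only asks whether a relevant node showed up at all, while MRR also rewards ranking it higher, so the two together distinguish "found it" from "found it early".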
In [ ]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
import nest_asyncio

nest_asyncio.apply()
In [ ]:
eval_dataset = generate_question_context_pairs(base_nodes)
In [ ]:
eval_dataset.save_json("data/llama2_eval_dataset.json")
In [ ]:
# optional
eval_dataset = EmbeddingQAFinetuneDataset.from_json(
    "data/llama2_eval_dataset.json"
)
In [ ]:
import pandas as pd

# set the vector retriever similarity top k to a higher value
top_k = 10


def display_results(names, results_arr):
    """Display results from evaluate."""
    hit_rates = []
    mrrs = []
    for name, eval_results in zip(names, results_arr):
        metric_dicts = []
        for eval_result in eval_results:
            metric_dict = eval_result.metric_vals_dict
            metric_dicts.append(metric_dict)
        results_df = pd.DataFrame(metric_dicts)

        hit_rate = results_df["hit_rate"].mean()
        mrr = results_df["mrr"].mean()
        hit_rates.append(hit_rate)
        mrrs.append(mrr)

    final_df = pd.DataFrame(
        {"retrievers": names, "hit_rate": hit_rates, "mrr": mrrs}
    )
    display(final_df)
Let's define our scoring functions and our dataset `data` variable.
In [ ]:
queries = eval_dataset.queries
relevant_docs = eval_dataset.relevant_docs
data = [
    ({"input": queries[query], "expected": relevant_docs[query]})
    for query in queries.keys()
]


def hitRateScorer(input, expected, output=None):
    is_hit = any([id in expected for id in output])
    return 1 if is_hit else 0


def mrrScorer(input, expected, output=None):
    for i, id in enumerate(output):
        if id in expected:
            return 1 / (i + 1)
    return 0
In [ ]:
import braintrust

# evaluate the chunk retriever
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=10)
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=False,
)


def runChunkRetriever(input, hooks):
    retrieved_nodes = retriever_chunk.retrieve(input)
    retrieved_ids = [node.node.node_id for node in retrieved_nodes]
    return retrieved_ids


chunkEval = await braintrust.Eval(
    name="llamaindex-recurisve-retrievers",
    data=data,
    task=runChunkRetriever,
    scores=[hitRateScorer, mrrScorer],
)
In [ ]:
# evaluate the metadata retriever
vector_retriever_metadata = vector_index_metadata.as_retriever(
    similarity_top_k=10
)
retriever_metadata = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_metadata},
    node_dict=all_nodes_dict,
    verbose=False,
)


def runMetaDataRetriever(input, hooks):
    retrieved_nodes = retriever_metadata.retrieve(input)
    retrieved_ids = [node.node.node_id for node in retrieved_nodes]
    return retrieved_ids


metadataEval = await braintrust.Eval(
    name="llamaindex-recurisve-retrievers",
    data=data,
    task=runMetaDataRetriever,
    scores=[hitRateScorer, mrrScorer],
)
In [ ]:
# evaluate the base retriever
base_retriever = base_index.as_retriever(similarity_top_k=10)


def runBaseRetriever(input, hooks):
    retrieved_nodes = base_retriever.retrieve(input)
    retrieved_ids = [node.node.node_id for node in retrieved_nodes]
    return retrieved_ids


baseEval = await braintrust.Eval(
    name="llamaindex-recurisve-retrievers",
    data=data,
    task=runBaseRetriever,
    scores=[hitRateScorer, mrrScorer],
)