In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-postprocessor-cohere-rerank
%pip install llama-index-readers-file pymupdf
In [ ]:
%load_ext autoreload
%autoreload 2
Setup¶
Here we define the necessary imports.
If you're opening this Notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
!pip install llama-index
In [ ]:
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event loop behind the scenes.
#          This results in nested event loops when we start an event loop to make async queries.
#          This is normally not allowed; we use nest_asyncio to allow it for convenience.
import nest_asyncio

nest_asyncio.apply()
In [ ]:
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().handlers = []
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.core import SummaryIndex
from llama_index.core.response.notebook_utils import display_response
from llama_index.llms.openai import OpenAI
Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. NumExpr defaulting to 8 threads.
Load Data¶
In this section, we first load in the Llama 2 paper as a single document. We then chunk it multiple times, according to different chunk sizes, and build a separate vector index for each chunk size.
In [ ]:
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
--2023-09-28 12:56:38--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’

data/llama2.pdf     100%[===================>]  13.03M   521KB/s    in 42s

2023-09-28 12:57:20 (320 KB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]
In [ ]:
from pathlib import Path

from llama_index.core import Document
from llama_index.readers.file import PyMuPDFReader
In [ ]:
loader = PyMuPDFReader()
docs0 = loader.load(file_path=Path("./data/llama2.pdf"))
doc_text = "\n\n".join([d.get_content() for d in docs0])
docs = [Document(text=doc_text)]
Here we try out different chunk sizes: 128, 256, 512, and 1024.
In [ ]:
# initialize modules
from llama_index.core.node_parser import SentenceSplitter

llm = OpenAI(model="gpt-4")

chunk_sizes = [128, 256, 512, 1024]
nodes_list = []
vector_indices = []
for chunk_size in chunk_sizes:
    print(f"Chunk Size: {chunk_size}")
    splitter = SentenceSplitter(chunk_size=chunk_size)
    nodes = splitter.get_nodes_from_documents(docs)

    # add chunk size to nodes to track later
    for node in nodes:
        node.metadata["chunk_size"] = chunk_size
        node.excluded_embed_metadata_keys = ["chunk_size"]
        node.excluded_llm_metadata_keys = ["chunk_size"]
    nodes_list.append(nodes)

    # build vector index
    vector_index = VectorStoreIndex(nodes)
    vector_indices.append(vector_index)
Chunk Size: 128
Chunk Size: 256
Chunk Size: 512
Chunk Size: 1024
Define Ensemble Retriever¶
We set up an "ensemble" retriever using our recursive retrieval abstraction. It works as follows:
- Define a separate IndexNode for each chunk size, corresponding to that chunk size's vector retriever (a retriever for chunk size 128, a retriever for chunk size 256, and so on).
- Put all of the IndexNodes into a single SummaryIndex; when the corresponding retriever is called, all nodes are returned.
- Define a recursive retriever whose root is the summary index retriever. It first fetches all nodes from the summary index retriever, then recursively calls the vector retriever for each chunk size.
- Rerank the final set of results.

The net effect is that every vector retriever is invoked when a query is run.
In [ ]:
# try ensemble retrieval
from llama_index.core.tools import RetrieverTool
from llama_index.core.schema import IndexNode

# retriever_tools = []
retriever_dict = {}
retriever_nodes = []
for chunk_size, vector_index in zip(chunk_sizes, vector_indices):
    node_id = f"chunk_{chunk_size}"
    node = IndexNode(
        text=(
            "Retrieves relevant context from the Llama 2 paper (chunk size"
            f" {chunk_size})"
        ),
        index_id=node_id,
    )
    retriever_nodes.append(node)
    retriever_dict[node_id] = vector_index.as_retriever()
Define the recursive retriever.
In [ ]:
from llama_index.core.selectors import PydanticMultiSelector
from llama_index.core.retrievers import RouterRetriever
from llama_index.core.retrievers import RecursiveRetriever
from llama_index.core import SummaryIndex

# the derived retriever will just retrieve all nodes
summary_index = SummaryIndex(retriever_nodes)

retriever = RecursiveRetriever(
    root_id="root",
    retriever_dict={"root": summary_index.as_retriever(), **retriever_dict},
)
Let's test the retriever on a sample query.
In [ ]:
nodes = await retriever.aretrieve(
    "Tell me about the main aspects of safety fine-tuning"
)
In [ ]:
print(f"Number of nodes: {len(nodes)}")
for node in nodes:
    print(node.node.metadata["chunk_size"])
    print(node.node.get_text())
Define a reranker to process the final retrieved set of nodes.
In [ ]:
# define reranker
from llama_index.core.postprocessor import (
    LLMRerank,
    SentenceTransformerRerank,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank

# reranker = LLMRerank()
# reranker = SentenceTransformerRerank(top_n=10)
reranker = CohereRerank(top_n=10)
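Note: CohereRerank expects a Cohere API key, supplied either via its api_key argument or the COHERE_API_KEY environment variable. A minimal sketch of setting it (the value is a placeholder, not a real key):

import os

os.environ["COHERE_API_KEY"] = "<your-cohere-api-key>"  # placeholder; set before constructing CohereRerank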
Define the retriever query engine, which integrates the recursive retriever and the reranker.
In [ ]:
# define RetrieverQueryEngine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])
In [ ]:
response = query_engine.query(
    "Tell me about the main aspects of safety fine-tuning"
)
In [ ]:
display_response(
    response, show_source=True, source_length=500, show_source_metadata=True
)
Analyze Relative Importance of each Chunk¶
One interesting property of ensemble-based retrieval is that, through reranking, we can actually use the order of the chunks in the final retrieved set to gauge the importance of each chunk size. For instance, if certain chunk sizes are consistently ranked near the top, they are probably more relevant to the query.
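Concretely, the helper in the next cell computes a mean reciprocal rank (MRR) per chunk size: for each chunk size $c$, it records the reciprocal of the rank at which a node with that chunk size first appears in the reranked list (0 if none appears):

$$\mathrm{MRR}(c) = \frac{1}{\min\{\, i : \texttt{chunk\_size}(n_i) = c \,\}}$$

where $n_1, n_2, \ldots$ is the reranked node list. For example, a chunk size whose first node sits at rank 1 scores $1.0$; at rank 2, $0.5$; at rank 4, $0.25$.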
In [ ]:
# compute the mean reciprocal rank for each chunk size, based on its position in the combined ranking
from collections import defaultdict
import pandas as pd


def mrr_all(metadata_values, metadata_key, source_nodes):
    # source_nodes is a ranked list
    # go through each value and find its position in source_nodes
    value_to_mrr_dict = {}
    for metadata_value in metadata_values:
        mrr = 0
        for idx, source_node in enumerate(source_nodes):
            if source_node.node.metadata[metadata_key] == metadata_value:
                mrr = 1 / (idx + 1)
                break
        # store the reciprocal rank in the dict
        value_to_mrr_dict[metadata_value] = mrr

    df = pd.DataFrame(value_to_mrr_dict, index=["MRR"])
    df.style.set_caption("Mean Reciprocal Rank")
    return df
In [ ]:
# Compute the Mean Reciprocal Rank for each chunk size (higher is better)
# we can see that chunk size of 256 has the highest-ranked results.
print("Mean Reciprocal Rank for each Chunk Size")
mrr_all(chunk_sizes, "chunk_size", response.source_nodes)
Mean Reciprocal Rank for each Chunk Size
Out[ ]:
|  | 128 | 256 | 512 | 1024 |
| --- | --- | --- | --- | --- |
| MRR | 0.333333 | 1.0 | 0.5 | 0.25 |
Evaluation¶
We more rigorously evaluate how well the ensemble retriever works compared to a "baseline" retriever.
We define/load an evaluation benchmark dataset and then run different evaluations over it.
WARNING: this can be expensive, especially with GPT-4. Use caution and tune the sample size to fit your budget.
In [ ]:
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset
from llama_index.llms.openai import OpenAI

import nest_asyncio

nest_asyncio.apply()
In [ ]:
# NOTE: run this if the dataset isn't already saved
eval_llm = OpenAI(model="gpt-4")
# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    nodes_list[-1],
    llm=eval_llm,
    show_progress=True,
    num_questions_per_chunk=2,
)
In [ ]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes(num=60)
In [ ]:
eval_dataset.save_json("data/llama2_eval_qr_dataset.json")
In [ ]:
# optional
eval_dataset = QueryResponseDataset.from_json(
    "data/llama2_eval_qr_dataset.json"
)
Compare Results¶
In [ ]:
import asyncio
import nest_asyncio

nest_asyncio.apply()
In [ ]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=eval_llm)
# SemanticSimilarityEvaluator is embedding-based and takes no llm argument
evaluator_s = SemanticSimilarityEvaluator()
evaluator_r = RelevancyEvaluator(llm=eval_llm)
evaluator_f = FaithfulnessEvaluator(llm=eval_llm)
pairwise_evaluator = PairwiseComparisonEvaluator(llm=eval_llm)
In [ ]:
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

max_samples = 60

eval_qs = eval_dataset.questions
qr_pairs = eval_dataset.qr_pairs
ref_response_strs = [r for (_, r) in qr_pairs]

# resetup the base query engine and the ensemble query engine
# base query engine
base_query_engine = vector_indices[-1].as_query_engine(similarity_top_k=2)
# ensemble query engine
reranker = CohereRerank(top_n=4)
query_engine = RetrieverQueryEngine(retriever, node_postprocessors=[reranker])
In [ ]:
base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)
In [ ]:
pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)
In [ ]:
import numpy as np

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]
In [ ]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    # "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=1, show_progress=True)
In [ ]:
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
In [ ]:
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)
In [ ]:
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Ensemble Retriever", "Base Retriever"],
    ["correctness", "faithfulness", "semantic_similarity"],
)
display(results_df)
|  | names | correctness | faithfulness | semantic_similarity |
| --- | --- | --- | --- | --- |
| 0 | Ensemble Retriever | 4.375000 | 0.983333 | 0.964546 |
| 1 | Base Retriever | 4.066667 | 0.983333 | 0.956692 |
In [ ]:
batch_runner = BatchEvalRunner(
    {"pairwise": pairwise_evaluator}, workers=3, show_progress=True
)

pairwise_eval_results = await batch_runner.aevaluate_response_strs(
    queries=eval_qs[:max_samples],
    response_strs=pred_response_strs[:max_samples],
    reference=base_pred_response_strs[:max_samples],
)
In [ ]:
results_df = get_results_df(
    [pairwise_eval_results],
    ["Pairwise Comparison"],
    ["pairwise"],
)
display(results_df)
Out[ ]:
|  | names | pairwise |
| --- | --- | --- |
| 0 | Pairwise Comparison | 0.5 |