Cohere int8 and Binary Embeddings Retrieval Evaluation¶
Cohere Embed is the first embedding model that natively supports float, int8, binary and ubinary embeddings. Refer to their main blog post for more details on Cohere int8 and binary embeddings.
This notebook helps you evaluate these different embedding types and pick one for your RAG pipeline. It uses our RetrieverEvaluator to evaluate the quality of the embeddings using the Retriever module of LlamaIndex.
Observed Metrics:
- Hit Rate
- MRR (Mean Reciprocal Rank)
For any given question, these metrics compare the quality of the retrieved results against the ground-truth context. The evaluation dataset is created using our synthetic dataset generation module. We will use GPT-4 for dataset generation to avoid bias.
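To make these metrics concrete, the following minimal sketch (added here for illustration; the toy results and variable names are not from the notebook) computes hit rate and MRR by hand for three fake retrieval results:

# Toy illustration of hit rate and MRR over three fake queries.
# Each entry: (retrieved node ids in rank order, expected/relevant node ids)
results = [
    (["node_2", "node_0"], ["node_0"]),  # relevant doc found at rank 2
    (["node_5", "node_7"], ["node_1"]),  # relevant doc not retrieved
    (["node_3", "node_9"], ["node_3"]),  # relevant doc found at rank 1
]

hits, reciprocal_ranks = [], []
for retrieved_ids, expected_ids in results:
    # Hit rate: 1 if any expected id appears anywhere in the retrieved list
    hits.append(float(any(eid in retrieved_ids for eid in expected_ids)))
    # Reciprocal rank: 1 / rank of the first relevant result (0 if none found)
    rr = 0.0
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in expected_ids:
            rr = 1.0 / rank
            break
    reciprocal_ranks.append(rr)

print("hit_rate:", sum(hits) / len(hits))  # (1 + 0 + 1) / 3 ≈ 0.67
print("mrr:", sum(reciprocal_ranks) / len(reciprocal_ranks))  # (0.5 + 0 + 1) / 3 = 0.5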
Note: The results shown at the end of the notebook are very specific to the dataset and the various other parameters considered. We recommend using this notebook as a reference to experiment on your own dataset and evaluate the use of different embedding types in your RAG pipeline.
Installation¶
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-cohere
# Set API keys
import os
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI KEY"
os.environ["COHERE_API_KEY"] = "YOUR COHEREAI API KEY"
Setup¶
Here we load the data (PG's essay) and parse it into nodes. We then index this data using our simple vector index and get retrievers for each of the following embedding types (a quick sanity-check sketch follows the list):

- float
- int8
- binary
- ubinary
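As a quick sanity check before building the full indexes, the short sketch below (not part of the original notebook; it assumes the COHERE_API_KEY environment variable is set as above) embeds a single string with the int8 embedding type and inspects the returned vector:

import os

from llama_index.embeddings.cohere import CohereEmbedding

# Minimal sketch: embed one string with the "int8" embedding type.
embed_model = CohereEmbedding(
    cohere_api_key=os.environ["COHERE_API_KEY"],
    model_name="embed-english-v3.0",
    input_type="search_document",
    embedding_type="int8",
)

vector = embed_model.get_text_embedding("Hello, world!")
print(len(vector), vector[:5])  # embedding dimension and the first few values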
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.evaluation import generate_question_context_pairs
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.cohere import CohereEmbedding
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-03-27 20:26:33--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.03s

2024-03-27 20:26:34 (2.18 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Load Data¶
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# By default, the node ids are set to random uuids. To ensure the same ids per run, we manually set them.
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
Create retrievers for different embedding types¶
# llm for question generation
# Use any llm other than cohereAI to avoid bias.
llm = OpenAI(model="gpt-4")


# Function to return the embedding model
def cohere_embedding(
    model_name: str, input_type: str, embedding_type: str
) -> CohereEmbedding:
    return CohereEmbedding(
        cohere_api_key=os.environ["COHERE_API_KEY"],
        model_name=model_name,
        input_type=input_type,
        embedding_type=embedding_type,
    )


# Function to return a retriever for the given embedding type
def retriver(nodes, embedding_type="float", model_name="embed-english-v3.0"):
    vector_index = VectorStoreIndex(
        nodes,
        embed_model=cohere_embedding(
            model_name, "search_document", embedding_type
        ),
    )
    retriever = vector_index.as_retriever(
        similarity_top_k=2,
        embed_model=cohere_embedding(
            model_name, "search_query", embedding_type
        ),
    )
    return retriever
# Build retriever for float embedding type
retriver_float = retriver(nodes)

# Build retriever for int8 embedding type
retriver_int8 = retriver(nodes, "int8")

# Build retriever for binary embedding type
retriver_binary = retriver(nodes, "binary")

# Build retriever for ubinary embedding type
retriver_ubinary = retriver(nodes, "ubinary")
Try out Retrieval¶
We'll try retrieval on a sample query with the float retriever.
retrieved_nodes = retriver_float.retrieve("What did the author do growing up?")
from llama_index.core.response.notebook_utils import display_source_node
for node in retrieved_nodes:
display_source_node(node, source_length=1000)
Node ID: node_2
Similarity: 0.3641554823852197
Text: I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.
Computers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.
Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledg...
Node ID: node_0
Similarity: 0.36283154406791923
Text: What I Worked On
February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.
The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.
The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in ...
Evaluation dataset - Synthetic dataset generation of (query, context) pairs¶
Here we build a simple evaluation dataset over the existing text corpus.
We use our generate_question_context_pairs to generate a set of (question, context) pairs over a given unstructured text corpus. This uses the LLM to auto-generate questions from each context chunk.
We get back an EmbeddingQAFinetuneDataset object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.
from llama_index.core.evaluation import (
generate_question_context_pairs,
EmbeddingQAFinetuneDataset,
)
qa_dataset = generate_question_context_pairs(
nodes, llm=llm, num_questions_per_chunk=2
)
100%|██████████| 59/59 [04:10<00:00, 4.24s/it]
queries = qa_dataset.queries.values()
print(list(queries)[0])
"Describe the author's initial experiences with programming on the IBM 1401. What were some of the challenges he faced and how did these experiences shape his understanding of programming?"
# [optional] save
qa_dataset.save_json("pg_eval_dataset.json")
# [optional] load
qa_dataset = EmbeddingQAFinetuneDataset.from_json("pg_eval_dataset.json")
Retrieval Evaluation using RetrieverEvaluator¶
We're now ready to run our retrieval evaluations. We'll run our RetrieverEvaluator over the evaluation dataset that we generated.
Define RetrieverEvaluator for different embedding types¶
from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["mrr", "hit_rate"]

# Retriever evaluator for float embedding type
retriever_evaluator_float = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriver_float
)

# Retriever evaluator for int8 embedding type
retriever_evaluator_int8 = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriver_int8
)

# Retriever evaluator for binary embedding type
retriever_evaluator_binary = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriver_binary
)

# Retriever evaluator for ubinary embedding type
retriever_evaluator_ubinary = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriver_ubinary
)
# try it out on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator_float.evaluate(sample_query, sample_expected)
print(eval_result)
Query: "Describe the author's initial experiences with programming on the IBM 1401. What were some of the challenges he faced and how did these experiences shape his understanding of programming?" Metrics: {'mrr': 0.5, 'hit_rate': 1.0}
# Evaluation on the entire dataset

# float embedding type
eval_results_float = await retriever_evaluator_float.aevaluate_dataset(
    qa_dataset
)

# int8 embedding type
eval_results_int8 = await retriever_evaluator_int8.aevaluate_dataset(
    qa_dataset
)

# binary embedding type
eval_results_binary = await retriever_evaluator_binary.aevaluate_dataset(
    qa_dataset
)

# ubinary embedding type
eval_results_ubinary = await retriever_evaluator_ubinary.aevaluate_dataset(
    qa_dataset
)
Define display_results to show the results of each retriever in a dataframe.¶
import pandas as pd


def display_results(name, eval_results):
    """Display results from the evaluate function."""
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    columns = {"Embedding Type": [name], "hit_rate": [hit_rate], "mrr": [mrr]}

    metric_df = pd.DataFrame(columns)

    return metric_df
Evaluation Results¶
# metrics for float embedding type
metrics_float = display_results("float", eval_results_float)

# metrics for int8 embedding type
metrics_int8 = display_results("int8", eval_results_int8)

# metrics for binary embedding type
metrics_binary = display_results("binary", eval_results_binary)

# metrics for ubinary embedding type
metrics_ubinary = display_results("ubinary", eval_results_ubinary)
combined_metrics = pd.concat(
[metrics_float, metrics_int8, metrics_binary, metrics_ubinary]
)
combined_metrics.set_index(["Embedding Type"], append=True, inplace=True)
combined_metrics
|   | Embedding Type | hit_rate | mrr |
|---|----------------|----------|-----|
| 0 | float          | 0.805085 | 0.665254 |
|   | int8           | 0.813559 | 0.673729 |
|   | binary         | 0.491525 | 0.394068 |
|   | ubinary        | 0.449153 | 0.377119 |
Note: The results shown above are very specific to the dataset and the various other parameters considered. We recommend using this notebook as a reference to experiment on your own dataset and evaluate the use of different embedding types in your RAG pipeline.