Automated Metadata Extraction for Better Retrieval + Synthesis¶
In this tutorial, we show you how to perform automated metadata extraction for better retrieval results. We use two extractors: a QuestionsAnsweredExtractor, which generates question/answer pairs from a piece of text, and a SummaryExtractor, which extracts summaries not only of the current text but also of adjacent texts.
We show that this enables "chunk dreaming": each individual chunk can carry more "holistic" detail, leading to higher-quality answers once results are retrieved.
Our data source is Eugene Yan's popular article on LLM Patterns: https://eugeneyan.com/writing/llm-patterns/
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
In [ ]:
%pip install llama-index-llms-openai
%pip install llama-index-readers-web
In [ ]:
!pip install llama-index
In [ ]:
import nest_asyncio
nest_asyncio.apply()
import os
import openai
In [ ]:
# OPTIONAL: setup W&B callback handler for tracing
from llama_index.core import set_global_handler
set_global_handler("wandb", run_args={"project": "llamaindex"})
In [ ]:
os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]
Define Metadata Extractors¶
Here we define the metadata extractors. We define two variants:
- extractors_1 only contains the QuestionsAnsweredExtractor
- extractors_2 contains both the QuestionsAnsweredExtractor and the SummaryExtractor
In [ ]:
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
In [ ]:
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
We also show how to instantiate the SummaryExtractor and QuestionsAnsweredExtractor.
In [ ]:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
)

node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)

extractors_1 = [
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]

extractors_2 = [
    SummaryExtractor(summaries=["prev", "self", "next"], llm=llm),
    QuestionsAnsweredExtractor(
        questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
    ),
]
Load in Data, Run Extractors¶
We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.
We then run our extractors.
In [ ]:
from llama_index.core import SimpleDirectoryReader
In [ ]:
# load in blog
from llama_index.readers.web import SimpleWebPageReader
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
In [ ]:
print(docs[0].get_content())
In [ ]:
orig_nodes = node_parser.get_nodes_from_documents(docs)
In [ ]:
# take 8 nodes (indices 20-28) for testing
nodes = orig_nodes[20:28]
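As a quick sanity check, we can confirm that the slice falls within the parsed document and that we selected exactly 8 chunks (a minimal sketch assuming the orig_nodes and nodes variables defined above).
In [ ]:
# sanity check: the article should yield well over 28 chunks, and the slice should contain 8 of them
print(f"total chunks: {len(orig_nodes)}, selected for testing: {len(nodes)}")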
In [ ]:
print(nodes[3].get_content(metadata_mode="all"))
is to measure the distance that words would have to move to convert one sequence to another. However, there are several pitfalls to using these conventional benchmarks and metrics. First, there’s **poor correlation between these metrics and human judgments.** BLEU, ROUGE, and others have had [negative correlation with how humans evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have [low correlation with tasks that require creativity and diversity](https://arxiv.org/abs/2303.16634). Second, these metrics often have **poor adaptability to a wider variety of tasks**. Adopting a metric proposed for one task to another is not always prudent. For example, exact match metrics such as BLEU and ROUGE are a poor fit for tasks like abstractive summarization or dialogue. Since they’re based on n-gram overlap between output and reference, they don’t make sense for a dialogue task where a wide variety
Run metadata extractors¶
In [ ]:
from llama_index.core.ingestion import IngestionPipeline
# process nodes with metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_1])
nodes_1 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)
Parsing documents into nodes: 0%| | 0/8 [00:00<?, ?it/s]
Extracting questions: 0%| | 0/8 [00:00<?, ?it/s]
In [ ]:
print(nodes_1[3].get_content(metadata_mode="all"))
[Excerpt from document] questions_this_excerpt_can_answer: 1. What is the correlation between conventional metrics like BLEU and ROUGE and human judgments in evaluating fluency and adequacy in natural language processing tasks? 2. How do conventional metrics like BLEU and ROUGE perform in tasks that require creativity and diversity? 3. Why are exact match metrics like BLEU and ROUGE not suitable for tasks like abstractive summarization or dialogue in natural language processing? Excerpt: ----- is to measure the distance that words would have to move to convert one sequence to another. However, there are several pitfalls to using these conventional benchmarks and metrics. First, there’s **poor correlation between these metrics and human judgments.** BLEU, ROUGE, and others have had [negative correlation with how humans evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have [low correlation with tasks that require creativity and diversity](https://arxiv.org/abs/2303.16634). Second, these metrics often have **poor adaptability to a wider variety of tasks**. Adopting a metric proposed for one task to another is not always prudent. For example, exact match metrics such as BLEU and ROUGE are a poor fit for tasks like abstractive summarization or dialogue. Since they’re based on n-gram overlap between output and reference, they don’t make sense for a dialogue task where a wide variety -----
In [ ]:
# 2nd pass: run summaries, and then the metadata extractor
# process nodes with metadata extractors
pipeline = IngestionPipeline(transformations=[node_parser, *extractors_2])
nodes_2 = pipeline.run(nodes=nodes, in_place=False, show_progress=True)
Parsing documents into nodes: 0%| | 0/8 [00:00<?, ?it/s]
Extracting summaries: 0%| | 0/8 [00:00<?, ?it/s]
Extracting questions: 0%| | 0/8 [00:00<?, ?it/s]
Visualize some sample data¶
In [ ]:
print(nodes_2[3].get_content(metadata_mode="all"))
[Excerpt from document] prev_section_summary: The section discusses the comparison between BERTScore and MoverScore, two metrics used to evaluate the quality of text generation models. MoverScore is described as a metric that measures the effort required to transform one text sequence into another by mapping semantically related words. The section also highlights the limitations of conventional benchmarks and metrics, such as poor correlation with human judgments and low correlation with tasks requiring creativity. next_section_summary: The section discusses the limitations of current evaluation metrics in natural language processing tasks. It highlights three main issues: lack of creativity and diversity in metrics, poor adaptability to different tasks, and poor reproducibility. The section mentions specific metrics like BLEU and ROUGE, and also references studies that have reported high variance in metric scores. section_summary: The section discusses the limitations of conventional benchmarks and metrics used to measure the distance between word sequences. It highlights two main issues: the poor correlation between these metrics and human judgments, and their limited adaptability to different tasks. The section mentions specific metrics like BLEU and ROUGE, which have been found to have low correlation with human evaluations of fluency, adequacy, creativity, and diversity. It also points out that metrics based on n-gram overlap, such as BLEU and ROUGE, are not suitable for tasks like abstractive summarization or dialogue. questions_this_excerpt_can_answer: 1. What are the limitations of conventional benchmarks and metrics in measuring the distance between word sequences? 2. How do metrics like BLEU and ROUGE correlate with human judgments in terms of fluency, adequacy, creativity, and diversity? 3. Why are metrics based on n-gram overlap, such as BLEU and ROUGE, not suitable for tasks like abstractive summarization or dialogue? Excerpt: ----- is to measure the distance that words would have to move to convert one sequence to another. However, there are several pitfalls to using these conventional benchmarks and metrics. First, there’s **poor correlation between these metrics and human judgments.** BLEU, ROUGE, and others have had [negative correlation with how humans evaluate fluency](https://arxiv.org/abs/2008.12009). They also showed moderate to less correlation with human adequacy scores. In particular, BLEU and ROUGE have [low correlation with tasks that require creativity and diversity](https://arxiv.org/abs/2303.16634). Second, these metrics often have **poor adaptability to a wider variety of tasks**. Adopting a metric proposed for one task to another is not always prudent. For example, exact match metrics such as BLEU and ROUGE are a poor fit for tasks like abstractive summarization or dialogue. Since they’re based on n-gram overlap between output and reference, they don’t make sense for a dialogue task where a wide variety -----
In [ ]:
print(nodes_2[1].get_content(metadata_mode="all"))
[Excerpt from document] prev_section_summary: The section discusses the F_{BERT} formula used in BERTScore and highlights the advantages of BERTScore over simpler metrics like BLEU and ROUGE. It also introduces MoverScore, another metric that uses contextualized embeddings but allows for many-to-one matching. The key topics are BERTScore, MoverScore, and the differences between them. next_section_summary: The section discusses the comparison between BERTScore and MoverScore, two metrics used to evaluate the quality of text generation models. MoverScore is described as a metric that measures the effort required to transform one text sequence into another by mapping semantically related words. The section also highlights the limitations of conventional benchmarks and metrics, such as poor correlation with human judgments and low correlation with tasks requiring creativity. section_summary: The key topics of this section are BERTScore and MoverScore, which are methods used to compute the similarity between generated output and reference in tasks like image captioning and machine translation. BERTScore uses one-to-one matching of tokens, while MoverScore allows for many-to-one matching. MoverScore solves an optimization problem to measure the distance that words would have to move to convert one sequence to another. questions_this_excerpt_can_answer: 1. What is the main difference between BERTScore and MoverScore? 2. How does MoverScore allow for many-to-one matching of tokens? 3. What problem does MoverScore solve to measure the distance between two sequences? Excerpt: ----- to have better correlation for tasks such as image captioning and machine translation. **[MoverScore](https://arxiv.org/abs/1909.02622)** also uses contextualized embeddings to compute the distance between tokens in the generated output and reference. But unlike BERTScore, which is based on one-to-one matching (or “hard alignment”) of tokens, MoverScore allows for many-to-one matching (or “soft alignment”). ![BERTScore \(left\) vs. MoverScore \(right\)](/assets/mover-score.jpg) BERTScore (left) vs. MoverScore (right; [source](https://arxiv.org/abs/1909.02622)) MoverScore enables the mapping of semantically related words in one sequence to their counterparts in another sequence. It does this by solving a constrained optimization problem that finds the minimum effort to transform one text into another. The idea is to measure the distance that words would have to move to convert one sequence to another. However, there -----
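To see at a glance which metadata each extractor variant attached, we can also compare the metadata keys on a matching chunk (a minimal check assuming the nodes_1 and nodes_2 variables from above).
In [ ]:
# compare the metadata keys produced by the two extractor variants on the same chunk
print("extractors_1:", sorted(nodes_1[3].metadata.keys()))
print("extractors_2:", sorted(nodes_2[3].metadata.keys()))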
Setup RAG Query Engines, Compare Results!¶
We set up 3 indexes/query engines on top of the three node variants.
In [ ]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import (
    display_source_node,
    display_response,
)
In [ ]:
# try out different query engines
# index0 = VectorStoreIndex(orig_nodes)
# index1 = VectorStoreIndex(nodes_1 + orig_nodes[8:])
# index2 = VectorStoreIndex(nodes_2 + orig_nodes[8:])
index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])
index2 = VectorStoreIndex(orig_nodes[:20] + nodes_2 + orig_nodes[28:])
In [ ]:
query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
query_engine2 = index2.as_query_engine(similarity_top_k=1)
Try out some questions¶
For this question, we can see that the naive response (response0) only mentions BLEU and ROUGE and lacks context about other metrics. response2, on the other hand, has context about all of the metrics.
In [ ]:
# query_str = "In the original RAG paper, can you describe the two main approaches for generation and compare them?"
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)
response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)
In [ ]:
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)
In [ ]:
print(response0.source_nodes[0].node.get_content())
require creativity and diversity](https://arxiv.org/abs/2303.16634). Second, these metrics often have **poor adaptability to a wider variety of tasks**. Adopting a metric proposed for one task to another is not always prudent. For example, exact match metrics such as BLEU and ROUGE are a poor fit for tasks like abstractive summarization or dialogue. Since they’re based on n-gram overlap between output and reference, they don’t make sense for a dialogue task where a wide variety of responses are possible. An output can have zero n-gram overlap with the reference but yet be a good response. Third, these metrics have **poor reproducibility**. Even for the same metric, [high variance is reported across different studies](https://arxiv.org/abs/2008.12009), possibly due to variations in human judgment collection or metric parameter settings. Another study of [ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000 studies found that scores were hard
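For contrast with the enriched indexes, we can also check what metadata (if any) the baseline's retrieved node carries; a quick inspection assuming response0 from the query above.
In [ ]:
# the baseline index was built from raw chunks, so no extracted questions/summaries are expected here
print(response0.source_nodes[0].node.metadata)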
In [ ]:
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)
In [ ]:
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)
For the next question, we ask about BERTScore/MoverScore. The responses are similar, but response2 gives slightly more detail than response0 since it has more MoverScore-related information attached as metadata.
In [ ]:
# query_str = "What are the reproducibility issues with the ROUGE metric? Give some details related to benchmarks and describe other ROUGE issues."
query_str = (
    "Can you give a high-level overview of BERTScore/MoverScore + formulas if"
    " available?"
)
response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)
response2 = query_engine2.query(query_str)
In [ ]:
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)
In [ ]:
display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)
In [ ]:
display_response(
    response2, source_length=1000, show_source=True, show_source_metadata=True
)
In [ ]:
response1.source_nodes[0].node.metadata
Out[ ]:
{'questions_this_excerpt_can_answer': '1. What is the advantage of using BERTScore over simpler metrics like BLEU and ROUGE?\n2. How does MoverScore differ from BERTScore in terms of token matching?\n3. What tasks have shown better correlation with BERTScore, such as image captioning and machine translation?'}
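We can run the same inspection for response2; assuming an enriched node was retrieved, its metadata should also include the prev/self/next section summaries in addition to the generated questions.
In [ ]:
# inspect the metadata attached to the node retrieved for response2
response2.source_nodes[0].node.metadata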