Metadata Replacement + Node Sentence Window¶

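In this notebook, we use the SentenceWindowNodeParser to parse documents into one sentence per node. Each node also carries a "window" holding the sentences on either side of the node's sentence.

Then, during retrieval, before the retrieved sentences are passed to the LLM, the MetadataReplacementPostProcessor replaces each single sentence with the window containing its surrounding sentences.

This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.

By default, the sentence window is 5 sentences on either side of the original sentence. Note that the parser configured in this notebook sets window_size=3, i.e. 3 sentences on either side.

In this case, chunk size settings are not used; the window settings are followed instead.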
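
To make the windowing idea concrete, here is a minimal pure-Python sketch. It is not the SentenceWindowNodeParser implementation: the regex-based `split_sentences` helper and the `build_window_nodes` function are hypothetical names introduced for illustration, and nodes are modeled as plain dicts. Only the metadata keys `window` and `original_text` come from the notebook.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_window_nodes(text, window_size=3):
    """Return one dict per sentence, storing its surrounding window in metadata."""
    sentences = split_sentences(text)
    nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window at the document boundaries
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "window": " ".join(sentences[start:end]),
                    "original_text": sentence,
                },
            }
        )
    return nodes


sample = "A one. B two. C three. D four. E five."
nodes = build_window_nodes(sample, window_size=1)
for node in nodes:
    print(repr(node["text"]), "->", repr(node["metadata"]["window"]))
```

With `window_size=1`, the middle sentence "C three." gets the window "B two. C three. D four.", while the first sentence only has a right-hand neighbor.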

In [ ]:

%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai

In [ ]:

%load_ext autoreload
%autoreload 2

Setup¶

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

!pip install llama-index

In [ ]:

import os
import openai

In [ ]:

os.environ["OPENAI_API_KEY"] = "sk-..."

In [ ]:

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

# create the sentence window node parser with default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# the base node parser is a sentence splitter
text_splitter = SentenceSplitter()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter

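Load Data, Build the Index¶

In this section, we load data and build the vector index.

Load Data¶

Here, we build an index using chapter 3 of the recent IPCC climate report.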

In [ ]:

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (6) Could not resolve host: www..ch

In [ ]:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

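Extract Nodes¶

We extract the set of nodes that will be stored in the VectorStoreIndex. This includes both the nodes produced by the sentence window parser and the "base" nodes extracted with the standard parser.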

In [ ]:

nodes = node_parser.get_nodes_from_documents(documents)

In [ ]:

base_nodes = text_splitter.get_nodes_from_documents(documents)

Build the Indexes¶

We build both the sentence index and the "base" index (with default chunk sizes).

In [ ]:

from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

In [ ]:

base_index = VectorStoreIndex(base_nodes)

Querying¶

With MetadataReplacementPostProcessor¶

Here, we use the MetadataReplacementPostProcessor to replace the sentence in each node with its surrounding context.
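
The replacement step itself can be pictured with a small pure-Python sketch. This is an illustration under stated assumptions, not the library's code: nodes are modeled as plain dicts, `replace_with_metadata` is a hypothetical helper, and falling back to the node's own text when the key is missing is a choice made for this sketch.

```python
def replace_with_metadata(nodes, target_metadata_key="window"):
    """Return copies of nodes whose text is swapped for metadata[target_metadata_key].

    Nodes that lack the key keep their original text (a choice made for this sketch).
    """
    replaced = []
    for node in nodes:
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({"text": new_text, "metadata": dict(node["metadata"])})
    return replaced


# Hypothetical retrieved nodes: one with a window, one without
retrieved = [
    {
        "text": "AMOC will very likely decline.",
        "metadata": {
            "window": "Records remain short. AMOC will very likely decline. Sea ice is a key driver.",
            "original_text": "AMOC will very likely decline.",
        },
    },
    {"text": "A node without a window.", "metadata": {}},
]

result = replace_with_metadata(retrieved)
for node in result:
    print(node["text"])
```

The point of the design is that the short sentence is what gets embedded and matched, while the longer window is what the LLM finally reads.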

In [ ]:

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
window_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(window_response)

There is low confidence in the quantification of Atlantic Meridional Overturning Circulation (AMOC) changes in the 20th century due to low agreement in quantitative reconstructed and simulated trends. Additionally, direct observational records since the mid-2000s remain too short to determine the relative contributions of internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is very likely that AMOC will decline for all SSP scenarios over the 21st century, but it will not involve an abrupt collapse before 2100.

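We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.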

In [ ]:

window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: Nevertheless, projected future annual cumulative upwelling wind 
changes at most locations and seasons remain within ±10–20% of 
present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; 
Fox-Kemper et al., 2021).
 Continuous observation of the Atlantic meridional overturning 
circulation (AMOC) has improved the understanding of its variability 
(Frajka-Williams et  al., 2019), but there is low confidence in the 
quantification of AMOC changes in the 20th century because of low 
agreement in quantitative reconstructed and simulated trends (WGI 
AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). 
 Direct observational records since the mid-2000s remain too short to 
determine the relative contributions of internal variability, natural 
forcing and anthropogenic forcing to AMOC change (high confidence) 
(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 
2021).  Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).
 3.2.2.4 Sea Ice Changes
Sea ice is a key driver of polar marine life, hosting unique ecosystems 
and affecting diverse marine organisms and food webs through its 
impact on light penetration and supplies of nutrients and organic 
matter (Arrigo, 2014).  Since the late 1970s, Arctic sea ice area has 
decreased for all months, with an estimated decrease of 2 million km2 
(or 25%) for summer sea ice (averaged for August, September and 
October) in 2010–2019 as compared with 1979–1988 (WGI AR6 
Section 9.3.1.1; Fox-Kemper et al., 2021). 
------------------
Original Sentence: Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).

Contrast with normal VectorStoreIndex¶

For comparison, we run the same query against the base VectorStoreIndex:

In [ ]:

query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(vector_response)

The concerns surrounding the AMOC are not provided in the given context information.

Well, that didn't work. Let's bump up the top k! This will be slower and use more tokens than the sentence window index.
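
A rough back-of-the-envelope comparison shows why the larger top k costs more tokens. The numbers here are assumptions for illustration only: a 1024-token chunk size for the base index, every chunk treated as full, and an average of 30 tokens per sentence (a made-up figure, not measured on this document). The function names are hypothetical.

```python
def base_context_tokens(top_k, chunk_size=1024):
    """Upper bound for the base index: assumes every retrieved chunk is full."""
    return top_k * chunk_size


def window_context_tokens(top_k, window_size=3, tokens_per_sentence=30):
    """Each retrieved sentence expands to at most 2 * window_size + 1 sentences."""
    return top_k * (2 * window_size + 1) * tokens_per_sentence


base_tokens = base_context_tokens(top_k=5)
window_tokens = window_context_tokens(top_k=2)
print(base_tokens)    # 5 * 1024 = 5120
print(window_tokens)  # 2 * 7 * 30 = 420
```

Under these assumptions the base index with top k = 5 sends roughly an order of magnitude more context to the LLM than the sentence window index with top k = 2.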

In [ ]:

query_engine = base_index.as_query_engine(similarity_top_k=5)
vector_response = query_engine.query(
    "What are the concerns surrounding the AMOC?"
)
print(vector_response)

There are concerns surrounding the AMOC (Atlantic Meridional Overturning Circulation). The context information mentions that the AMOC will decline over the 21st century, with high confidence but low confidence for quantitative projections.

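Analysis¶

So the SentenceWindowNodeParser + MetadataReplacementPostProcessor combination is the clear winner here. But why?

Embeddings at the sentence level seem to capture more fine-grained details, like the word AMOC.

We can also compare the retrieved chunks for each index!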

In [ ]:

for source_node in window_response.source_nodes:
    print(source_node.node.metadata["original_text"])
    print("--------")

Over the 21st century, AMOC will very likely decline for all SSP 
scenarios but will not involve an abrupt collapse before 2100 (WGI 
AR6 Sections 4.3.2, 9.2.3.1; Fox-Kemper et al., 2021; Lee et al., 2021).

--------
Direct observational records since the mid-2000s remain too short to 
determine the relative contributions of internal variability, natural 
forcing and anthropogenic forcing to AMOC change (high confidence) 
(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 
2021). 
--------

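Here, we can see that the sentence window index easily retrieved two nodes that talk about AMOC. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!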
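
The split between "what is embedded" and "what is read" can be sketched with a toy retriever. Everything in this block is an assumption made for illustration: the word-count `embed` function stands in for a real embedding model, and `retrieve_windows` and the sample nodes are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_windows(query, nodes, top_k=2):
    """Rank nodes by similarity to their single sentence, but return their windows."""
    q = embed(query)
    ranked = sorted(
        nodes, key=lambda n: cosine(q, embed(n["sentence"])), reverse=True
    )
    return [n["window"] for n in ranked[:top_k]]


nodes = [
    {
        "sentence": "AMOC will very likely decline.",
        "window": "Records are short. AMOC will very likely decline. Sea ice matters.",
    },
    {
        "sentence": "Sea ice is a key driver of polar marine life.",
        "window": "AMOC will very likely decline. Sea ice is a key driver of polar marine life.",
    },
    {
        "sentence": "Ocean surface pH has declined globally.",
        "window": "Ocean surface pH has declined globally.",
    },
]

top = retrieve_windows("What are the concerns surrounding the AMOC?", nodes, top_k=1)
print(top[0])
```

Only the first node's sentence shares a word with the query, so it ranks first, yet the text returned for the LLM is the wider window around it.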

Now, let's try to dissect why the naive vector index failed.

In [ ]:

for node in vector_response.source_nodes:
    print("AMOC mentioned?", "AMOC" in node.node.text)
    print("--------")

AMOC mentioned? False
--------
AMOC mentioned? False
--------
AMOC mentioned? True
--------
AMOC mentioned? False
--------
AMOC mentioned? False
--------

So the source node at index [2] mentions AMOC, but what did this text actually look like?

In [ ]:

print(vector_response.source_nodes[2].node.text)

2021; Gulev et al. 
2021)The AMOC will decline over the 21st century 
(high confidence, but low confidence for 
quantitative projections).4.3.2.3, 9.2.3 (Fox-Kemper 
et al. 2021; Lee et al. 
2021)
Sea ice
Arctic sea ice 
changes‘Current Arctic sea ice coverage levels are the 
lowest since at least 1850 for both annual mean 
and late-summer values (high confidence).’2.3.2.1, 9.3.1 (Fox-Kemper 
et al. 2021; Gulev et al. 
2021)‘The Arctic will become practically ice-free in 
September by the end of the 21st century under 
SSP2-4.5, SSP3-7.0 and SSP5-8.5[…](high 
confidence).’4.3.2.1, 9.3.1 (Fox-Kemper 
et al. 2021; Lee et al. 
2021)
Antarctic sea ice 
changesThere is no global significant trend in 
Antarctic sea ice area from 1979 to 2020 (high 
confidence).2.3.2.1, 9.3.2 (Fox-Kemper 
et al. 2021; Gulev et al. 
2021)There is low confidence in model simulations of 
future Antarctic sea ice.9.3.2 (Fox-Kemper et al. 
2021)
Ocean chemistry
Changes in salinityThe ‘large-scale, near-surface salinity contrasts 
have intensified since at least 1950 […] 
(virtually certain).’2.3.3.2, 9.2.2.2 
(Fox-Kemper et al. 2021; 
Gulev et al. 2021)‘Fresh ocean regions will continue to get fresher 
and salty ocean regions will continue to get 
saltier in the 21st century (medium confidence).’9.2.2.2 (Fox-Kemper et al. 
2021)
Ocean acidificationOcean surface pH has declined globally over the 
past four decades (virtually certain).2.3.3.5, 5.3.2.2 (Canadell 
et al. 2021; Gulev et al. 
2021)Ocean surface pH will continue to decrease 
‘through the 21st century, except for the 
lower-emission scenarios SSP1-1.9 and SSP1-2.6 
[…] (high confidence).’4.3.2.5, 4.5.2.2, 5.3.4.1 
(Lee et al. 2021; Canadell 
et al. 2021)
Ocean 
deoxygenationDeoxygenation has occurred in most open 
ocean regions since the mid-20th century (high 
confidence).2.3.3.6, 5.3.3.2 (Canadell 
et al. 2021; Gulev et al. 
2021)Subsurface oxygen content ‘is projected to 
transition to historically unprecedented condition 
with decline over the 21st century (medium 
confidence).’5.3.3.2 (Canadell et al. 
2021)
Changes in nutrient 
concentrationsNot assessed in WGI Not assessed in WGI

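So AMOC is discussed, but sadly it sits in the middle of the retrieved context. With LLMs, it is often observed that text in the middle of the retrieved context is ignored or less useful. A recent paper, "Lost in the Middle", discusses this.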
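
As a small illustration, the position of the matching node within the retrieved list can be computed directly. The helper `keyword_positions` is a hypothetical name, and the stand-in texts below only mirror the True/False pattern printed earlier; they are not the actual retrieved chunks.

```python
def keyword_positions(node_texts, keyword):
    """Return (index, relative_position) for each retrieved text containing keyword.

    relative_position is 0.0 for the first node and 1.0 for the last.
    """
    last = max(len(node_texts) - 1, 1)
    return [(i, i / last) for i, text in enumerate(node_texts) if keyword in text]


# Stand-in texts (assumption): only the third of five mentions AMOC
retrieved_texts = [
    "sea level",
    "ocean heat",
    "The AMOC will decline",
    "sea ice",
    "salinity",
]
positions = keyword_positions(retrieved_texts, "AMOC")
print(positions)
```

On the real response, the same helper could be called with `[n.node.text for n in vector_response.source_nodes]`; a relative position near 0.5 means the relevant text is buried mid-context.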

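[Optional] Evaluation¶

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.

We define/load an eval benchmark dataset and then run different evaluations over it.

WARNING: This can be expensive, especially with GPT-4. Use caution and tune the sample sizes to fit your budget.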

In [ ]:

from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset

from llama_index.llms.openai import OpenAI
import nest_asyncio
import random

nest_asyncio.apply()

In [ ]:

len(base_nodes)

Out[ ]:

In [ ]:

num_nodes_eval = 30
# there are 428 nodes total. Take the first 200 to generate questions
# (the back half of the document is all references)
sample_eval_nodes = random.sample(base_nodes[:200], num_nodes_eval)
# NOTE: run this if the dataset isn't already saved
# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    llm=OpenAI(model="gpt-4"),
    show_progress=True,
    num_questions_per_chunk=2,
)

In [ ]:

eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()

In [ ]:

eval_dataset.save_json("data/ipcc_eval_qr_dataset.json")

In [ ]:

# optional: load a previously saved dataset instead of regenerating it
eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")

Compare Results¶

In [ ]:

import asyncio
import nest_asyncio

nest_asyncio.apply()

In [ ]:

from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)

from collections import defaultdict
import pandas as pd

# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))  # correctness
evaluator_s = SemanticSimilarityEvaluator()  # semantic similarity
evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))  # relevancy
evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))  # faithfulness
# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))

In [ ]:

from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

max_samples = 30

eval_qs = eval_dataset.questions
ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]

# re-create the base query engine and the sentence window query engine
# base query engine
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
# sentence window query engine
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

In [ ]:

import numpy as np

base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)
pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]

In [ ]:

evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

Run the evaluations over correctness, faithfulness, relevancy, and semantic similarity.

In [ ]:

eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

In [ ]:

base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

In [ ]:

results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)

	names	correctness	relevancy	faithfulness	semantic_similarity
0	Sentence Window Retriever	4.366667	0.933333	0.933333	0.959583
1	Base Retriever	4.216667	0.900000	0.933333	0.958664
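
To read the table at a glance, the per-metric gap between the two retrievers can be computed with plain Python. The scores below are copied from the table above; the `scores` and `deltas` names are introduced here for illustration. With at most 30 questions evaluated, differences this small are best treated as indicative rather than conclusive.

```python
# Scores copied from the results table above
scores = {
    "Sentence Window Retriever": {
        "correctness": 4.366667,
        "relevancy": 0.933333,
        "faithfulness": 0.933333,
        "semantic_similarity": 0.959583,
    },
    "Base Retriever": {
        "correctness": 4.216667,
        "relevancy": 0.900000,
        "faithfulness": 0.933333,
        "semantic_similarity": 0.958664,
    },
}

window_scores = scores["Sentence Window Retriever"]
base_scores = scores["Base Retriever"]

# Positive values favor the sentence window retriever
deltas = {
    metric: round(window_scores[metric] - base_scores[metric], 6)
    for metric in window_scores
}

for metric, delta in deltas.items():
    print(f"{metric}: {delta:+.6f}")
```

The sentence window retriever comes out ahead on correctness and relevancy, ties on faithfulness, and is only marginally ahead on semantic similarity.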