Extracting Metadata for Better Document Indexing and Understanding¶
In many cases, especially with long documents, a chunk of text may lack the context necessary to disambiguate it from other, similar-looking chunks. One way to address this is to manually label every chunk in the dataset or knowledge base; however, for a large or continually updated set of documents, this can be laborious and time-consuming.
To combat this, we use LLMs to extract certain contextual information relevant to the document, so that both retrieval and the language model can better disambiguate similar-looking passages.
We do this through our new "metadata extractor" modules.
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-llms-openai
%pip install llama-index-extractors-entity
!pip install llama-index
import nest_asyncio
nest_asyncio.apply()
import os
import openai
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY_HERE"
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
We create a node parser that extracts the document title and hypothetical questions relevant to each document chunk.
We also show how to instantiate the SummaryExtractor and KeywordExtractor, as well as how to create your own custom extractor based on the BaseExtractor base class.
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    BaseExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)


class CustomExtractor(BaseExtractor):
    def extract(self, nodes):
        metadata_list = [
            {
                "custom": (
                    node.metadata["document_title"]
                    + "\n"
                    + node.metadata["excerpt_keywords"]
                )
            }
            for node in nodes
        ]
        return metadata_list


extractors = [
    TitleExtractor(nodes=5, llm=llm),
    QuestionsAnsweredExtractor(questions=3, llm=llm),
    # EntityExtractor(prediction_threshold=0.5),
    # SummaryExtractor(summaries=["prev", "self"], llm=llm),
    # KeywordExtractor(keywords=10, llm=llm),
    # CustomExtractor()
]

transformations = [text_splitter] + extractors
from llama_index.core import SimpleDirectoryReader
We first load the 10-K annual SEC filings for Uber and Lyft, for the years 2019 and 2020 respectively.
!mkdir -p data
!wget -O "data/10k-132.pdf" "https://www.dropbox.com/scl/fi/6dlqdk6e2k1mjhi8dee5j/uber.pdf?rlkey=2jyoe49bg2vwdlz30l76czq6g&dl=1"
!wget -O "data/10k-vFinal.pdf" "https://www.dropbox.com/scl/fi/qn7g3vrk5mqb18ko4e5in/lyft.pdf?rlkey=j6jxtjwo8zbstdo4wz3ns8zoj&dl=1"
# Note the uninformative document file name, which may be a common case in production!
uber_docs = SimpleDirectoryReader(input_files=["data/10k-132.pdf"]).load_data()
uber_front_pages = uber_docs[0:3]
uber_content = uber_docs[63:69]
uber_docs = uber_front_pages + uber_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
uber_nodes = pipeline.run(documents=uber_docs)
uber_nodes[1].metadata
{'page_label': '2', 'file_name': '10k-132.pdf', 'document_title': 'Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc.', 'questions_this_excerpt_can_answer': '1. How many countries does Uber operate in?\n2. What is the total gross bookings of Uber in 2019?\n3. How many trips did Uber facilitate in 2019?'}
# Note the uninformative document file name, which may be a common case in production!
lyft_docs = SimpleDirectoryReader(
    input_files=["data/10k-vFinal.pdf"]
).load_data()
lyft_front_pages = lyft_docs[0:3]
lyft_content = lyft_docs[68:73]
lyft_docs = lyft_front_pages + lyft_content
from llama_index.core.ingestion import IngestionPipeline
pipeline = IngestionPipeline(transformations=transformations)
lyft_nodes = pipeline.run(documents=lyft_docs)
lyft_nodes[2].metadata
{'page_label': '2', 'file_name': '10k-vFinal.pdf', 'document_title': 'Lyft, Inc. Annual Report on Form 10-K for the Fiscal Year Ended December 31, 2020', 'questions_this_excerpt_can_answer': "1. Has Lyft, Inc. filed a report on and attestation to its management's assessment of the effectiveness of its internal control over financial reporting under Section 404(b) of the Sarbanes-Oxley Act?\n2. Is Lyft, Inc. considered a shell company according to Rule 12b-2 of the Exchange Act?\n3. What was the aggregate market value of Lyft, Inc.'s common stock held by non-affiliates on June 30, 2020?"}
Since we are asking fairly sophisticated questions, we use a subquestion query engine for all of the question-answering pipelines below, and prompt it to pay more attention to the relevance of the retrieved sources.
from llama_index.core.question_gen import LLMQuestionGenerator
from llama_index.core.question_gen.prompts import (
    DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)

question_gen = LLMQuestionGenerator.from_defaults(
    llm=llm,
    prompt_template_str="""
        Follow the example, but instead of giving a question, always prefix the question
        with: 'By first identifying and quoting the most relevant sources, '.
        """
    + DEFAULT_SUB_QUESTION_PROMPT_TMPL,
)
Querying an Index With No Extra Metadata¶
As a baseline, we first query an index whose nodes carry no extracted metadata. Below we copy the nodes and strip their metadata down to the page_label and file_name fields that were present before extraction.
from copy import deepcopy
nodes_no_metadata = deepcopy(uber_nodes) + deepcopy(lyft_nodes)
for node in nodes_no_metadata:
node.metadata = {
k: node.metadata[k]
for k in node.metadata
if k in ["page_label", "file_name"]
}
print(
"LLM sees:\n",
(nodes_no_metadata)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
from llama_index.core import VectorStoreIndex
from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
index_no_metadata = VectorStoreIndex(
nodes=nodes_no_metadata,
)
engine_no_metadata = index_no_metadata.as_query_engine(
similarity_top_k=10, llm=OpenAI(model="gpt-4")
)
final_engine_no_metadata = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine_no_metadata,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies",
),
)
],
question_gen=question_gen,
use_async=True,
)
response_no_metadata = final_engine_no_metadata.query(
    """
    What was the cost due to research and development v.s. sales and marketing for Uber and Lyft in 2019, in millions of USD?
    Give your answer as a JSON.
    """
)
print(response_no_metadata.response)
# Correct answer:
# {"Uber": {"research_and_development": 4836, "sales_and_marketing": 4626},
#  "Lyft": {"research_and_development": 1505.6, "sales_and_marketing": 814}}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
{
  "Uber": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
Result: As we can see, the QnA agent does not seem to know where to look for the right documents. As a result, it completely mixes up Lyft's and Uber's financials.
Querying an Index With Extracted Metadata¶
print(
"LLM sees:\n",
(uber_nodes + lyft_nodes)[9].get_content(metadata_mode=MetadataMode.LLM),
)
LLM sees: [Excerpt from document] page_label: 65 file_name: 10k-132.pdf document_title: Exploring the Diverse Landscape of 2019: A Comprehensive Annual Report on Uber Technologies, Inc. Excerpt: ----- See the section titled “Reconciliations of Non-GAAP Financial Measures” for our definition and a reconciliation of net income (loss) attributable to Uber Technologies, Inc. to Adjusted EBITDA. Year Ended December 31, 2017 to 2018 2018 to 2019 (In millions, exce pt percenta ges) 2017 2018 2019 % Chan ge % Chan ge Adjusted EBITDA ................................ $ (2,642) $ (1,847) $ (2,725) 30% (48)% -----
index = VectorStoreIndex(
nodes=uber_nodes + lyft_nodes,
)
engine = index.as_query_engine(similarity_top_k=10, llm=OpenAI(model="gpt-4"))
final_engine = SubQuestionQueryEngine.from_defaults(
query_engine_tools=[
QueryEngineTool(
query_engine=engine,
metadata=ToolMetadata(
name="sec_filing_documents",
description="financial information on companies.",
),
)
],
question_gen=question_gen,
use_async=True,
)
response = final_engine.query(
    """
    What was the cost due to research and development v.s. sales and marketing for Uber and Lyft in 2019, in millions of USD?
    Give your answer as a JSON.
    """
)
print(response.response)
# Correct answer:
# {"Uber": {"research_and_development": 4836, "sales_and_marketing": 4626},
#  "Lyft": {"research_and_development": 1505.6, "sales_and_marketing": 814}}
Generated 4 sub questions.
[sec_filing_documents] Q: What was the cost due to research and development for Uber in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Uber in 2019
[sec_filing_documents] Q: What was the cost due to research and development for Lyft in 2019
[sec_filing_documents] Q: What was the cost due to sales and marketing for Lyft in 2019
[sec_filing_documents] A: The cost due to sales and marketing for Uber in 2019 was $4,626 million.
[sec_filing_documents] A: The cost due to research and development for Uber in 2019 was $4,836 million.
[sec_filing_documents] A: The cost due to sales and marketing for Lyft in 2019 was $814,122 in thousands.
[sec_filing_documents] A: The cost of research and development for Lyft in 2019 was $1,505,640 in thousands.
{
  "Uber": {
    "Research and Development": 4836,
    "Sales and Marketing": 4626
  },
  "Lyft": {
    "Research and Development": 1505.64,
    "Sales and Marketing": 814.122
  }
}
Result: As we can see, the LLM answers these questions correctly.
Challenges Identified in the Problem Domain¶
In this example, we observed that the search quality as provided by vector embeddings was rather poor. This was likely due to highly dense financial documents that were probably not representative of the model's training set.
In order to improve search quality, other methods of neural search that employ more keyword-based approaches may help, such as ColBERTv2/PLAID. In particular, this would help in matching on particular keywords to identify high-relevance chunks.
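As a loose illustration of why this helps (a toy lexical scorer, not ColBERTv2/PLAID), the sketch below shows how exact keyword matches, e.g. on company names, can separate chunks that look alike to a dense embedding model. The function and sample chunks are hypothetical.

```python
# Toy sketch only: score a chunk by the fraction of query terms it
# contains verbatim. Real keyword-aware neural retrievers (ColBERTv2,
# PLAID) are far more sophisticated, but the intuition is similar:
# exact lexical overlap pins down entity-specific chunks.

def keyword_score(query: str, chunk: str) -> float:
    """Fraction of query terms that appear verbatim in the chunk."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & chunk_terms) / len(query_terms)


# Two chunks that a dense model could easily confuse: same template,
# different company and figures.
chunks = [
    "Uber research and development expenses were 4,836 million in 2019",
    "Lyft research and development expenses were 1,505.6 million in 2019",
]

query = "lyft research and development 2019"
best = max(chunks, key=lambda c: keyword_score(query, c))
```

The exact match on the term "lyft" breaks the tie between two otherwise near-identical chunks, which is precisely the failure mode we saw in the no-metadata run above.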
Other valid steps may include utilizing models that are fine-tuned on financial datasets, such as BloombergGPT.
Finally, we can enrich the metadata further by providing more contextual information about the surroundings of each chunk.
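As a hypothetical sketch of that last idea, one could attach truncated excerpts of each chunk's neighbors to its metadata before indexing. The function name and metadata keys below are illustrative, not part of the LlamaIndex API.

```python
# Hypothetical sketch: give each chunk a glimpse of its neighbors by
# storing truncated prev/next excerpts in its metadata. The keys
# "prev_excerpt"/"next_excerpt" are made up for this illustration.

def enrich_with_neighbors(chunks: list, window: int = 100) -> list:
    """Attach truncated excerpts of adjacent chunks as metadata."""
    enriched = []
    for i, text in enumerate(chunks):
        metadata = {}
        if i > 0:
            # Tail of the previous chunk.
            metadata["prev_excerpt"] = chunks[i - 1][-window:]
        if i < len(chunks) - 1:
            # Head of the next chunk.
            metadata["next_excerpt"] = chunks[i + 1][:window]
        enriched.append({"text": text, "metadata": metadata})
    return enriched


nodes = enrich_with_neighbors(
    ["Part 1: overview.", "Part 2: financials.", "Part 3: risks."]
)
```

In LlamaIndex, the same idea could plausibly be packaged as a custom extractor subclassing BaseExtractor, analogous to the CustomExtractor shown earlier.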
Improvements to This Example¶
Generally, this example can be improved further with more rigorous evaluation of both the accuracy of the metadata extraction and the precision and recall of the QnA pipeline. Further, incorporating a larger set of documents, as well as full-length documents, would likely surface more confounding passages that are difficult to disambiguate, stress-testing the system we have built and suggesting further improvements.
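As a minimal sketch of what such an evaluation could look like, retrieval quality can be scored as precision and recall of retrieved chunk IDs against a hand-labeled relevant set. The IDs below are made up for illustration.

```python
# Minimal retrieval-evaluation sketch: compare the chunk IDs a retriever
# returns against a hand-labeled set of relevant IDs. The IDs here are
# fabricated for illustration only.

def precision_recall(retrieved: list, relevant: set) -> tuple:
    """Precision and recall of retrieved IDs vs. a labeled relevant set."""
    hits = sum(1 for r in retrieved if r in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# E.g. the retriever returned three chunks, but only one of the two
# chunks labeled relevant for this question was among them.
p, r = precision_recall(
    retrieved=["uber-p65", "lyft-p70", "uber-p2"],
    relevant={"uber-p65", "uber-p66"},
)
```

Aggregating such scores over a labeled question set would make it possible to quantify, rather than eyeball, the lift that metadata extraction provides.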