Entity Metadata Extraction

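In this demo, we use the new EntityExtractor to extract entities from each node and store them in the node's metadata. The default model is tomaarsen/span-marker-mbert-base-multinerd, which is downloaded from HuggingFace and run locally.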

For more information on metadata extraction in LlamaIndex, see our documentation.

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

%pip install llama-index-llms-openai
%pip install llama-index-extractors-entity

In [ ]:

!pip install llama-index

In [ ]:

# Needed to run the entity extractor
# !pip install span_marker

import os

os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

Set Up the Extractor and Parser

In [ ]:

from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import SentenceSplitter

entity_extractor = EntityExtractor(
    prediction_threshold=0.5,
    label_entities=False,  # include the entity label in the metadata (can be erroneous)
    device="cpu",  # set to "cuda" if you have a GPU
)

node_parser = SentenceSplitter()

transformations = [node_parser, entity_extractor]


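To make the two main options concrete, the sketch below mimics in plain Python what `prediction_threshold` and `label_entities` control. It is not the EntityExtractor implementation: the `predictions` list is made-up data shaped like typical NER model output (a span, a label, and a confidence score), and `collect_entities` is a hypothetical helper introduced only for illustration.

```python
from collections import defaultdict

# Made-up predictions shaped like typical NER output (an assumption, not real model output).
predictions = [
    {"span": "Fox-Kemper", "label": "PER", "score": 0.97},
    {"span": "Atlantic", "label": "LOC", "score": 0.81},
    {"span": "warming", "label": "PER", "score": 0.32},
]


def collect_entities(predictions, prediction_threshold=0.5, label_entities=False):
    """Hypothetical helper: keep confident spans, optionally grouped by label."""
    kept = [p for p in predictions if p["score"] >= prediction_threshold]
    if not label_entities:
        # One flat set of entity strings under a single key.
        return {"entities": {p["span"] for p in kept}}
    # One set of entity strings per predicted label.
    grouped = defaultdict(set)
    for p in kept:
        grouped[p["label"]].add(p["span"])
    return dict(grouped)


print(collect_entities(predictions))
print(collect_entities(predictions, label_entities=True))
```

Raising the threshold trades recall for precision: fewer spans survive, but the ones that do are less likely to be spurious. Grouping by label adds detail but also exposes any labeling mistakes the model makes, which is why the notebook leaves it off.
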
Load the Data

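Here, we will download Chapter 3 of the IPCC Sixth Assessment Report (AR6), Working Group II contribution, which covers oceans and coastal ecosystems (172 pages).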

In [ ]:

!curl https://www.ipcc.ch/report/ar6/wg2/downloads/report/IPCC_AR6_WGII_Chapter03.pdf --output IPCC_AR6_WGII_Chapter03.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 20.7M  100 20.7M    0     0  22.1M      0 --:--:-- --:--:-- --:--:-- 22.1M

Next, load the documents.

In [ ]:

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["./IPCC_AR6_WGII_Chapter03.pdf"]
).load_data()

Extract the Metadata

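This is a fairly long document. Since we are running on CPU rather than on a GPU, we will only process a subset of the documents for now. Feel free to run it on all of the documents yourself, though!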

In [ ]:

from llama_index.core.ingestion import IngestionPipeline

import random

random.seed(42)
# comment out to run on all documents
# 100 documents take about 5 minutes on CPU
documents = random.sample(documents, 100)

pipeline = IngestionPipeline(transformations=transformations)

nodes = pipeline.run(documents=documents)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
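
The warning above names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in a cell before the pipeline (ideally before `tokenizers` is first used in the process); choosing `"false"` is an assumption that favors silencing the fork warning over tokenizer throughput.

```python
import os

# Set this early, before the tokenizers library is first used in the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```
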

Examine the Outputs

In [ ]:

samples = random.sample(nodes, 5)
for node in samples:
    print(node.metadata)

{'page_label': '387', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}
{'page_label': '410', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Parmesan', 'Boyd', 'Riebesell', 'Gattuso'}}
{'page_label': '391', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Gulev', 'Fox-Kemper'}}
{'page_label': '430', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf', 'entities': {'Kessouri', 'van der Sleen', 'Brodeur', 'Siedlecki', 'Fiechter', 'Ramajo', 'Carozza'}}
{'page_label': '388', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}
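
Five samples give only a glimpse. A small summary, sketched below with the standard library, shows how many nodes received entities and which names recur. The `metadata_samples` list copies three of the dictionaries printed above so that the block runs on its own; in the notebook you would presumably build it with `[node.metadata for node in nodes]` instead.

```python
from collections import Counter

# Three of the metadata dicts printed above, copied so the sketch is self-contained.
metadata_samples = [
    {"page_label": "387", "file_name": "IPCC_AR6_WGII_Chapter03.pdf"},
    {
        "page_label": "410",
        "file_name": "IPCC_AR6_WGII_Chapter03.pdf",
        "entities": {"Parmesan", "Boyd", "Riebesell", "Gattuso"},
    },
    {
        "page_label": "391",
        "file_name": "IPCC_AR6_WGII_Chapter03.pdf",
        "entities": {"Gulev", "Fox-Kemper"},
    },
]

# In the output above, nodes without detected entities have no "entities" key.
with_entities = [m for m in metadata_samples if m.get("entities")]

# Count how often each entity string appears across nodes.
entity_counts = Counter(e for m in with_entities for e in m["entities"])

print(f"{len(with_entities)} of {len(metadata_samples)} nodes have entities")
print(entity_counts.most_common(3))
```
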

Try a Query!

In [ ]:

from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.2)

index = VectorStoreIndex(nodes=nodes)

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("What is said by Fox-Kemper?")
print(response)

According to the provided context information, Fox-Kemper is mentioned in relation to the observed and projected trends of ocean warming and marine heatwaves. It is stated that Fox-Kemper et al. (2021) reported that ocean warming has increased on average by 0.88°C from 1850-1900 to 2011-2020. Additionally, it is mentioned that Fox-Kemper et al. (2021) projected that ocean warming will continue throughout the 21st century, with the rate of global ocean warming becoming scenario-dependent from the mid-21st century. Fox-Kemper is also cited as a source for the information on the increasing frequency, intensity, and duration of marine heatwaves over the 20th and early 21st centuries, as well as the projected increase in frequency of marine heatwaves in the future.

Contrast Without Metadata

Here, we rebuild the index, but without the entity metadata.

In [ ]:

for node in nodes:
    node.metadata.pop("entities", None)

print(nodes[0].metadata)

{'page_label': '542', 'file_name': 'IPCC_AR6_WGII_Chapter03.pdf'}

In [ ]:

index = VectorStoreIndex(nodes=nodes)

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("What is said by Fox-Kemper?")
print(response)

According to the provided context information, Fox-Kemper is mentioned in relation to the decline of the AMOC (Atlantic Meridional Overturning Circulation) over the 21st century. The statement mentions that there is high confidence in the decline of the AMOC, but low confidence for quantitative projections.

As we can see, the metadata-enriched index retrieves more relevant information for this query.
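
One plausible intuition for this result: as far as we understand, LlamaIndex by default includes a node's metadata alongside its text when the node is embedded, so an entity name in the metadata can pull the node closer to a query that mentions it. The toy below illustrates the direction of that effect with a bag-of-words cosine similarity instead of a real embedding model. The node text, the `key: value` rendering of the metadata, and the helper names are all illustrative assumptions, not the library's implementation.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Illustrative node text and metadata (not quoted from the report).
text = "Ocean warming is projected to continue (Fox-Kemper et al., 2021)."
metadata = {"entities": "Fox-Kemper"}

# Assumed rendering: metadata as "key: value" lines placed before the text.
metadata_str = "\n".join(f"{key}: {value}" for key, value in metadata.items())
with_metadata = metadata_str + "\n\n" + text

query = bag_of_words("What is said by Fox-Kemper?")
score_text_only = cosine(query, bag_of_words(text))
score_with_metadata = cosine(query, bag_of_words(with_metadata))

print("text only:    ", round(score_text_only, 3))
print("with metadata:", round(score_with_metadata, 3))
```

A real embedding model behaves differently from word counts, so treat this only as intuition for why repeating the entity name in the metadata can raise a node's similarity to the query.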