Pydantic Extractor¶
Here we test out the capabilities of our PydanticProgramExtractor - being able to extract an entire Pydantic object using an LLM (either a standard text-completion LLM or a function-calling LLM).
The advantage of this over using "individual" metadata extractors is that we can extract multiple entities in a single LLM call.
Setup¶
In [ ]:
%pip install llama-index-readers-web
%pip install llama-index-program-openai
In [ ]:
import nest_asyncio
nest_asyncio.apply()
import os
import openai
In [ ]:
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
openai.api_key = os.getenv("OPENAI_API_KEY")
Setup the Pydantic Model¶
Here we define a basic structured schema that we wish to extract. It contains:
- entities: unique entities in a text chunk
- summary: a concise summary of the text chunk
- contains_number: whether the chunk contains numbers
This is obviously a toy schema. We'd encourage you to be creative about the types of metadata you'd like to extract!
In [ ]:
from pydantic import BaseModel, Field
from typing import List
In [ ]:
class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A concise summary of this text chunk."
    )
    contains_number: bool = Field(
        ...,
        description=(
            "Whether the text chunk contains any numbers (ints, floats, etc.)."
        ),
    )
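Since the schema is just a plain Pydantic model, it can be sanity-checked without invoking any LLM. A minimal sketch (the field values below are made up for illustration):

```python
from typing import List

from pydantic import BaseModel, Field


# Same toy schema as defined above.
class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(..., description="Unique entities in this text chunk.")
    summary: str = Field(..., description="A concise summary of this text chunk.")
    contains_number: bool = Field(
        ..., description="Whether the text chunk contains any numbers (ints, floats, etc.)."
    )


# A successfully extracted object validates against the schema like this:
sample = NodeMetadata(
    entities=["LlamaIndex", "OpenAI"],
    summary="Discusses metadata extraction with Pydantic programs.",
    contains_number=False,
)
print(sample.entities)
```

Pydantic raises a ValidationError if the LLM output is missing a field or has the wrong type, which is what makes this approach more robust than free-form text parsing.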
Setup the Extractor¶
Here we set up the metadata extractor. Note that we provide the prompt template for clarity into what's going on.
In [ ]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor

EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
    # extract_template_str=EXTRACT_TEMPLATE_STR
)

program_extractor = PydanticProgramExtractor(
    program=openai_program, input_key="input", show_progress=True
)
Load in Data¶
We load in Eugene's essay (https://eugeneyan.com/writing/llm-patterns/) using our LlamaHub SimpleWebPageReader.
In [ ]:
# load in blog
from llama_index.readers.web import SimpleWebPageReader
from llama_index.core.node_parser import SentenceSplitter

reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])
In [ ]:
from llama_index.core.ingestion import IngestionPipeline
node_parser = SentenceSplitter(chunk_size=1024)
pipeline = IngestionPipeline(transformations=[node_parser, program_extractor])
orig_nodes = pipeline.run(documents=docs)
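Under the hood, the IngestionPipeline simply runs each transformation in order, feeding one stage's output nodes into the next. A pure-Python sketch of that control flow (the toy `split` and `annotate` functions below only stand in for SentenceSplitter and the Pydantic extractor; this is not the actual llama-index implementation):

```python
from typing import Callable, List


def run_pipeline(documents: List[str], transformations: List[Callable]) -> list:
    """Apply each transformation in order, chaining outputs to inputs."""
    nodes = documents
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


def split(docs):
    # Stand-in for SentenceSplitter: naive sentence chunking.
    return [chunk for d in docs for chunk in d.split(". ")]


def annotate(nodes):
    # Stand-in for the metadata extractor: attach toy metadata per chunk.
    return [
        {"text": n, "contains_number": any(c.isdigit() for c in n)} for n in nodes
    ]


result = run_pipeline(["One fish. 2 fish"], [split, annotate])
print(result)
```

Because each transformation consumes and produces the same node-list shape, extractors like `program_extractor` can be dropped into any position after the splitter.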
In [ ]:
orig_nodes
In [ ]:
sample_entry = program_extractor.extract(orig_nodes[0:1])[0]
Extracting Pydantic object: 0%| | 0/1 [00:00<?, ?it/s]
In [ ]:
display(sample_entry)
{'entities': ['eugeneyan', 'HackerNews', 'Karpathy'], 'summary': 'This section discusses practical patterns for integrating large language models (LLMs) into systems & products. It introduces seven key patterns and provides information on evaluations and benchmarks in the field of language modeling.', 'contains_number': True}
In [ ]:
new_nodes = program_extractor.process_nodes(orig_nodes)
Extracting Pydantic object: 0%| | 0/29 [00:00<?, ?it/s]
In [ ]:
display(new_nodes[5:7])