
Metadata Extraction Usage Pattern

You can use LLMs to automate metadata extraction with our Metadata Extractor modules.

Our metadata extractor modules include the following "feature extractors":

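  • SummaryExtractor - automatically extracts a summary over a set of nodes
  • QuestionsAnsweredExtractor - extracts a set of questions that each node can answer
  • TitleExtractor - extracts a title over the context of each node
  • EntityExtractor - extracts the entities (i.e. names of places, people, things) mentioned in the content of each node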

Then you can chain the Metadata Extractors with our node parser:

from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=3)

# assume documents are defined -> extract nodes
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[text_splitter, title_extractor, qa_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)
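
To make the chaining idea concrete without calling an LLM, here is a minimal, dependency-free sketch. It is not LlamaIndex code: `ToyNode`, `split_by_words`, `add_title`, `add_word_count`, `run_pipeline`, and the metadata keys `toy_title` and `word_count` are all hypothetical names invented for illustration. The only assumption it mirrors is the pattern above, where each transformation receives a list of nodes and returns a list of nodes, with extractors writing their results into each node's metadata.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: a text chunk plus a metadata dict."""

    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A transformation maps a list of nodes to a new (or updated) list of nodes.
Transformation = Callable[[List[ToyNode]], List[ToyNode]]


def split_by_words(chunk_size: int) -> Transformation:
    """Build a toy splitter that cuts each node into chunks of `chunk_size` words."""

    def transform(nodes: List[ToyNode]) -> List[ToyNode]:
        chunks: List[ToyNode] = []
        for node in nodes:
            words = node.text.split()
            for start in range(0, len(words), chunk_size):
                chunk_text = " ".join(words[start : start + chunk_size])
                # Copy the parent's metadata so chunks do not share one dict.
                chunks.append(ToyNode(chunk_text, dict(node.metadata)))
        return chunks

    return transform


def add_title(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy 'title extractor': the first three words of the first node become a shared title."""
    if nodes:
        title = " ".join(nodes[0].text.split()[:3])
        for node in nodes:
            node.metadata["toy_title"] = title
    return nodes


def add_word_count(nodes: List[ToyNode]) -> List[ToyNode]:
    """Toy per-node extractor: records each chunk's word count as metadata."""
    for node in nodes:
        node.metadata["word_count"] = str(len(node.text.split()))
    return nodes


def run_pipeline(
    nodes: List[ToyNode], transformations: List[Transformation]
) -> List[ToyNode]:
    """Apply the transformations in order, feeding each one the previous output."""
    for transformation in transformations:
        nodes = transformation(nodes)
    return nodes


toy_nodes = run_pipeline(
    [ToyNode("Metadata extraction attaches extra context to every chunk of a document")],
    [split_by_words(chunk_size=4), add_title, add_word_count],
)

for node in toy_nodes:
    print(node.text, node.metadata)
```

Order matters in this sketch just as it does in the pipeline above: the splitter runs first so that the extractors see the final chunks rather than the whole document.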

or insert into an index:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter, title_extractor, qa_extractor]
)

Resources