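How to handle long text when doing extraction
When working with files such as PDFs, you are likely to encounter text that exceeds your language model's context window. To process this text, consider these strategies:
- Change LLM: Choose a different LLM that supports a larger context window.
- Brute force: Chunk the document and extract content from each chunk.
- RAG: Chunk the document, index the chunks, and extract content only from the subset of chunks that look "relevant".
Keep in mind that these strategies have different trade-offs, and the best strategy likely depends on the application you are designing!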
This guide demonstrates how to implement the brute force and RAG strategies.
Setup
First, we'll install the dependencies needed for this guide:
%pip install -qU langchain-community lxml faiss-cpu langchain-openai
Note: you may need to restart the kernel to use updated packages.
Now we need some example data! Let's download an article about cars and load it as a LangChain Document.
import re
import requests
from langchain_community.document_loaders import BSHTMLLoader
# Download the content
response = requests.get("https://en.wikipedia.org/wiki/Car")
# Write it to a file
with open("car.html", "w", encoding="utf-8") as f:
    f.write(response.text)
# Load it with an HTML parser
loader = BSHTMLLoader("car.html")
document = loader.load()[0]
# Clean up code
# Replace consecutive new lines with a single new line
document.page_content = re.sub("\n\n+", "\n", document.page_content)
print(len(document.page_content))
80427
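Define the schema
Following the extraction tutorial, we will use Pydantic to define the schema of the information we wish to extract. In this case, we will extract a list of "key developments" (e.g., important historical events) that include a year and a description.
Note that we also include an evidence key and instruct the model to provide verbatim the relevant sentences from the article. This lets us compare the extraction results to (the model's reconstruction of) the text from the original document.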
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from pydantic import BaseModel, Field
class KeyDevelopment(BaseModel):
    """Information about a development in the history of cars."""

    year: int = Field(
        ..., description="The year when there was an important historic development."
    )
    description: str = Field(
        ..., description="What happened in this year? What was the development?"
    )
    evidence: str = Field(
        ...,
        description="Repeat in verbatim the sentence(s) from which the year and description information were extracted",
    )


class ExtractionData(BaseModel):
    """Extracted information about key developments in the history of cars."""

    key_developments: List[KeyDevelopment]


# Define a custom prompt to provide instructions and any additional context.
# 1) You can add examples into the prompt template to improve extraction quality
# 2) Introduce additional parameters to take context into account (e.g., include metadata
# about the document from which the text was extracted.)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are an expert at identifying key historic development in text. "
            "Only extract important historic developments. Extract nothing if no important information can be found in the text.",
        ),
        ("human", "{text}"),
    ]
)
Create an extractor
Let's select an LLM. Because we are using tool calling, we need a model that supports tool calling. See this table for available LLMs.
pip install -qU langchain-openai
import getpass
import os
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-0125-preview", temperature=0)
extractor = prompt | llm.with_structured_output(
    schema=ExtractionData,
    include_raw=False,
)
Brute force approach
Split the document into chunks such that each chunk fits into the context window of the LLM.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(
    # Controls the size of each chunk
    chunk_size=2000,
    # Controls overlap between chunks
    chunk_overlap=20,
)
texts = text_splitter.split_text(document.page_content)
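To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sliding-window sketch in plain Python. It is an illustration only, not TokenTextSplitter's actual code: the function name split_with_overlap is hypothetical, and whitespace-separated words stand in for real model tokens.

```python
from typing import List, Sequence


def split_with_overlap(
    tokens: Sequence[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Split a token sequence into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[List[str]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window reached the end of the input
        start += step
    return chunks


# Hypothetical toy input: words stand in for model tokens.
toy_tokens = "a b c d e f g h i j".split()
toy_chunks = split_with_overlap(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so a larger overlap means more chunks and more repeated text at the boundaries.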
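Use batch functionality to run the extraction in parallel across each chunk!
You can often use .batch() to parallelize the extractions! .batch uses a thread pool under the hood to help you parallelize workloads.
If your model is exposed via an API, this will likely speed up your extraction flow!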
# Limit just to the first 3 chunks
# so the code can be re-run quickly
first_few = texts[:3]
extractions = extractor.batch(
    [{"text": text} for text in first_few],
    {"max_concurrency": 5},  # limit the concurrency by passing max concurrency!
)
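As a rough illustration of what running work on a bounded thread pool looks like, the sketch below uses Python's standard concurrent.futures module. It is not LangChain's implementation: run_in_threads and fake_extract are hypothetical names, and the sleep call merely simulates the latency of an API-backed model.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_in_threads(
    fn: Callable[[T], R], inputs: List[T], max_concurrency: int
) -> List[R]:
    """Apply fn to every input using at most max_concurrency worker threads.

    Results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fn, inputs))


def fake_extract(payload: dict) -> int:
    """Hypothetical stand-in for a slow, network-bound extraction call."""
    time.sleep(0.05)  # simulate API latency
    return len(payload["text"])


fake_inputs = [{"text": "x" * n} for n in range(1, 7)]
fake_results = run_in_threads(fake_extract, fake_inputs, max_concurrency=3)
print(fake_results)
```

Threads help here because the work is dominated by waiting on the network; capping the number of workers is one way to stay within a provider's rate limits.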
Merge results
After extracting data from across the chunks, we'll want to merge the extractions together.
key_developments = []
for extraction in extractions:
    key_developments.extend(extraction.key_developments)
key_developments[:10]
[KeyDevelopment(year=1769, description='Nicolas-Joseph Cugnot built the first full-scale, self-propelled mechanical vehicle, a steam-powered tricycle.', evidence='Nicolas-Joseph Cugnot is widely credited with building the first full-scale, self-propelled mechanical vehicle in about 1769; he created a steam-powered tricycle.'),
KeyDevelopment(year=1807, description="Nicéphore Niépce and his brother Claude created what was probably the world's first internal combustion engine.", evidence="In 1807, Nicéphore Niépce and his brother Claude created what was probably the world's first internal combustion engine (which they called a Pyréolophore), but installed it in a boat on the river Saone in France."),
KeyDevelopment(year=1886, description='Carl Benz patented the Benz Patent-Motorwagen, marking the birth of the modern car.', evidence='In November 1881, French inventor Gustave Trouvé demonstrated a three-wheeled car powered by electricity at the International Exposition of Electricity. Although several other German engineers (including Gottlieb Daimler, Wilhelm Maybach, and Siegfried Marcus) were working on cars at about the same time, the year 1886 is regarded as the birth year of the modern car—a practical, marketable automobile for everyday use—when the German Carl Benz patented his Benz Patent-Motorwagen; he is generally acknowledged as the inventor of the car.'),
KeyDevelopment(year=1886, description='Carl Benz began promotion of his vehicle, marking the introduction of the first commercially available automobile.', evidence='Benz began promotion of the vehicle on 3 July 1886.'),
KeyDevelopment(year=1888, description="Bertha Benz undertook the first road trip by car to prove the road-worthiness of her husband's invention.", evidence="In August 1888, Bertha Benz, the wife and business partner of Carl Benz, undertook the first road trip by car, to prove the road-worthiness of her husband's invention."),
KeyDevelopment(year=1896, description='Benz designed and patented the first internal-combustion flat engine, called boxermotor.', evidence='In 1896, Benz designed and patented the first internal-combustion flat engine, called boxermotor.'),
KeyDevelopment(year=1897, description='The first motor car in central Europe and one of the first factory-made cars in the world, the Präsident automobil, was produced by Nesselsdorfer Wagenbau.', evidence='The first motor car in central Europe and one of the first factory-made cars in the world, was produced by Czech company Nesselsdorfer Wagenbau (later renamed to Tatra) in 1897, the Präsident automobil.'),
KeyDevelopment(year=1901, description='Ransom Olds started large-scale, production-line manufacturing of affordable cars at his Oldsmobile factory in Lansing, Michigan.', evidence='Large-scale, production-line manufacturing of affordable cars was started by Ransom Olds in 1901 at his Oldsmobile factory in Lansing, Michigan.'),
KeyDevelopment(year=1913, description="Henry Ford introduced the world's first moving assembly line for cars at the Highland Park Ford Plant.", evidence="This concept was greatly expanded by Henry Ford, beginning in 1913 with the world's first moving assembly line for cars at the Highland Park Ford Plant.")]
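Because adjacent chunks overlap, the same development can be extracted from two chunks. The following is a minimal de-duplication sketch, under the assumption that two items are duplicates when they share a year and the same evidence text after whitespace and case normalization. The Development dataclass is a hypothetical stand-in for KeyDevelopment so the example runs without Pydantic, and the descriptions in the sample are illustrative paraphrases.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Development:
    """Hypothetical stand-in with the same fields as KeyDevelopment."""

    year: int
    description: str
    evidence: str


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe_developments(items: Iterable[Development]) -> List[Development]:
    """Keep the first item for each (year, normalized evidence) pair, in order."""
    seen = set()
    unique = []
    for item in items:
        key = (item.year, _normalize(item.evidence))
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


flat_engine = (
    "In 1896, Benz designed and patented the first "
    "internal-combustion flat engine, called boxermotor."
)
sample = [
    Development(1896, "Benz patented the first flat engine.", flat_engine),
    Development(
        1888,
        "Bertha Benz undertook the first road trip by car.",
        "In August 1888, Bertha Benz, the wife and business partner of Carl Benz, "
        "undertook the first road trip by car, to prove the road-worthiness of "
        "her husband's invention.",
    ),
    # Same evidence as the first item, differing only in whitespace.
    Development(1896, "Benz designed the boxermotor.", "  " + flat_engine + "\n"),
]
merged = sorted(dedupe_developments(sample), key=lambda d: d.year)
for development in merged:
    print(development.year, development.description)
```

Keying on the evidence rather than the description is a design choice: descriptions are paraphrased by the model and vary between runs, while verbatim evidence should be stable. Items with the same year but different evidence are kept as distinct entries.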
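RAG based approach
Another simple idea is to chunk up the text, but instead of extracting information from every chunk, focus only on the most relevant chunks.
It can be difficult to identify which chunks are relevant.
For example, in the car article we're using here, most of the article contains key development information. So by using RAG, we'll likely be throwing out a lot of relevant information.
We suggest experimenting with your use case to determine whether this approach works.
To implement the RAG based approach:
- Chunk up your document(s) and index them (e.g., in a vectorstore);
- Prepend the extractor chain with a retrieval step using the vectorstore.
Here's a simple example that relies on the FAISS vectorstore.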
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
texts = text_splitter.split_text(document.page_content)
vectorstore = FAISS.from_texts(texts, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 1}
)  # Only extract from first document
In this case, the RAG extractor looks only at the top document.
rag_extractor = {
    "text": retriever | (lambda docs: docs[0].page_content)  # fetch content of top doc
} | extractor
results = rag_extractor.invoke("Key developments associated with cars")
for key_development in results.key_developments:
    print(key_development)
year=2006 description='Car-sharing services in the US experienced double-digit growth in revenue and membership.' evidence='in the US, some car-sharing services have experienced double-digit growth in revenue and membership growth between 2006 and 2007.'
year=2020 description='56 million cars were manufactured worldwide, with China producing the most.' evidence='In 2020, there were 56 million cars manufactured worldwide, down from 67 million the previous year. The automotive industry in China produces by far the most (20 million in 2020).'
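Looking only at the top document discards everything else the retriever found. One possible variation, sketched here under stated assumptions, is to retrieve several chunks and join them up to a size budget before extraction. The helper join_top_docs and the Doc dataclass are hypothetical; Doc only mimics the page_content attribute of a LangChain Document, and the character budget is a crude stand-in for a token budget.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in exposing page_content like a LangChain Document."""

    page_content: str


def join_top_docs(
    docs: Sequence[Doc], max_chars: int, separator: str = "\n\n"
) -> str:
    """Concatenate documents in ranked order, stopping before exceeding max_chars."""
    parts: List[str] = []
    used = 0
    for doc in docs:
        # The separator only costs characters once there is a previous part.
        extra = len(doc.page_content) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(doc.page_content)
        used += extra
    return separator.join(parts)


ranked = [Doc("first chunk"), Doc("second chunk"), Doc("third chunk")]
joined = join_top_docs(ranked, max_chars=30)
print(repr(joined))
```

With a larger k in search_kwargs, the lambda in the chain could in principle be replaced by something like lambda docs: join_top_docs(docs, max_chars=8000), where 8000 is an arbitrary illustrative budget that would need tuning to the model's context window.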
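Common issues
Different methods have their own pros and cons related to cost, speed, and accuracy.
Watch out for these issues:
- Chunking content means that the LLM can fail to extract information if the information is spread across multiple chunks.
- Large chunk overlap may cause the same information to be extracted twice, so be prepared to de-duplicate!
- LLMs can make up data. If you are looking for a single fact across a large text and using a brute force approach, you may end up getting more made up data.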
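Since the schema asks the model to repeat its evidence verbatim, one inexpensive guard against made up data is to check whether each evidence string actually occurs in the source text. The sketch below is a minimal version of that check, ignoring whitespace differences; the function names are hypothetical, and a failed check only flags an item for review, since a model may legitimately stitch together sentences that are not adjacent in the source.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def evidence_is_verbatim(evidence: str, source_text: str) -> bool:
    """Return True if evidence occurs in source_text, ignoring whitespace differences."""
    needle = normalize_whitespace(evidence)
    return bool(needle) and needle in normalize_whitespace(source_text)


# A short excerpt standing in for document.page_content.
source_excerpt = (
    "In 1896, Benz designed and patented the first\n"
    "internal-combustion flat engine, called boxermotor."
)
quoted = (
    "In 1896, Benz designed and patented the first "
    "internal-combustion flat engine, called boxermotor."
)
altered = "In 1897, Benz designed and patented the first flat engine."

print(evidence_is_verbatim(quoted, source_excerpt))
print(evidence_is_verbatim(altered, source_excerpt))
```

In the guide's setting, the same check could be run over key_developments against document.page_content, keeping or flagging items depending on the result.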