Long Document Content Extraction

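GPT models (gpt-4 in this notebook) can help us extract key figures, dates or other important content from documents that are too big to fit into the context window. One way to solve this is to split the document into chunks, process each chunk separately, and then combine the results into a single list of answers.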

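In this notebook we'll run through this approach:
  • Load a long PDF and pull out the text
  • Create a prompt for extracting key pieces of information
  • Chunk up our document and process each chunk to pull out any answers
  • Combine them at the end
  • Extend this simple approach to three more difficult questions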

Approach

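  • Setup: Take a PDF, a Formula 1 financial regulation document on Power Units, and extract the text from it for entity extraction. We'll use this to try to extract answers that are buried in the content.
  • Simple Entity Extraction: Extract key pieces of information from chunks of the document by:
    • Creating a template prompt containing our questions and an example of the format we expect
    • Creating a function that takes a chunk of text as input, combines it with the prompt and gets a response
    • Running a script to chunk the text, extract answers and output them for parsing
  • Complex Entity Extraction: Ask some more difficult questions that require tougher reasoning to work out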

Setup

!pip install textract
!pip install tiktoken

import textract
import os
import openai
import tiktoken

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

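# Extract the raw text from the PDF using textract
text = textract.process('data/fia_f1_power_unit_financial_regulations_issue_1_-_2022-08-16.pdf', method='pdfminer').decode('utf-8')
# Collapse double spaces, then turn each newline into spaces (via a temporary "; " marker)
clean_text = text.replace("  ", " ").replace("\n", "; ").replace(';',' ')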

Simple Entity Extraction

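Here we prompt the model with a template that lists our questions and an example of the expected answer format, substitute each chunk of the document into it in turn, and collect the responses.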

# Example prompt
document = '<document>'
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR\n2. What is the value of External Manufacturing Costs in USD\n3. What is the Capital Expenditure Limit in USD\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR
2. What is the value of External Manufacturing Costs in USD
3. What is the Capital Expenditure Limit in USD

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.
# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Yield successive chunks of roughly n tokens from text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n to 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for a full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence was found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j
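To see how the sentence-boundary search behaves without calling tiktoken, the sketch below runs the same chunking logic against a toy tokenizer. `CharTokenizer` is a hypothetical stand-in (one token per character) and the sample sentence is made up; `create_chunks` is repeated only so the block runs on its own.

```python
class CharTokenizer:
    """Hypothetical stand-in for a tiktoken encoding: one token per character."""

    def encode(self, text):
        return [ord(c) for c in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


def create_chunks(text, n, tokenizer):
    """Same logic as the notebook's create_chunks, copied so this sketch is standalone."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Start at 1.5 * n tokens and walk back towards 0.5 * n looking for "." or "\n"
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # No sentence end found in the window: fall back to exactly n tokens
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


toy = CharTokenizer()
sample = "The cap is fixed. Costs are reported yearly. Breaches are penalised."
toy_chunks = [toy.decode(c) for c in create_chunks(sample, 20, toy)]
for c in toy_chunks:
    print(repr(c))
```

With a target of 20 tokens, each chunk ends on a full stop and the chunks concatenate back to the original text, so nothing is dropped or duplicated at the boundaries.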

def extract_chunk(document,template_prompt):
    prompt = template_prompt.replace('<document>',document)

    messages = [
        {"role": "system", "content": "You help extract information from documents."},
        {"role": "user", "content": prompt}
    ]

    response = client.chat.completions.create(
        model='gpt-4',
        messages=messages,
        temperature=0,
        max_tokens=1500,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )
    return "1." + response.choices[0].message.content
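The function above fills the `<document>` placeholder and then prepends "1." to the response, because the prompt itself ends with an open "1." for the model to continue. A minimal offline sketch of that mechanic follows; `fake_completion` and `extract_chunk_offline` are hypothetical names, the template is shortened, and the canned reply stands in for a real API response.

```python
TEMPLATE = (
    "Use the following format:\n"
    "0. Who is the author\n"
    "1. What is the Capital Expenditure Limit in USD\n\n"
    'Document: """<document>"""\n\n'
    "0. Who is the author: Tom Anderson (Page 1)\n"
    "1."
)


def fake_completion(prompt):
    """Hypothetical stand-in for the chat completion call; returns a canned continuation."""
    # The prompt leaves answer 1 open, so the reply is expected to continue from there
    assert prompt.endswith("1.")
    return " What is the Capital Expenditure Limit in USD: Not specified"


def extract_chunk_offline(document, template_prompt, complete=fake_completion):
    # Substitute the chunk into the template, exactly as extract_chunk does
    prompt = template_prompt.replace("<document>", document)
    # Re-attach the "1." the prompt ended with so every result line is numbered
    return "1." + complete(prompt)


offline_result = extract_chunk_offline("Some chunk of text.", TEMPLATE)
print(offline_result)
```

This assumes the model continues directly from the trailing "1."; if a model repeats the number in its reply, the prefix would be duplicated and the result would need cleaning before parsing.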

# Initialise tokenizer
tokenizer = tiktoken.get_encoding("cl100k_base")

results = []

chunks = create_chunks(clean_text,1000,tokenizer)
text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))
    # print(chunk)
    print(results[-1])


groups = [r.split('\n') for r in results]

# Zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. What is the amount of the "Power Unit Cost Cap" in USD, GBP and EUR: USD 95,000,000 (Page 2); GBP 76,459,000 (Page 2); EUR 90,210,000 (Page 2)',
'2. What is the value of External Manufacturing Costs in USD: US Dollars 20,000,000 in respect of each of the Full Year Reporting Periods ending on 31 December 2023, 31 December 2024 and 31 December 2025, adjusted for Indexation (Page 10)',
'3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)']
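Since the results are plain strings, a small parser makes them easier to use downstream. The sketch below is one possible approach, not part of the original notebook: `parse_result_line` is a hypothetical helper that assumes each line looks like "<number>. <question>: <answer>" and that the question itself contains no colon.

```python
import re

# "<number>. <question>: <answer>", splitting at the first colon
LINE_PATTERN = re.compile(r"^(\d+)\.\s*(.+?):\s*(.+)$")
# Every "(Page N)" reference inside the answer
PAGE_PATTERN = re.compile(r"\(Page (\d+)\)")


def parse_result_line(line):
    """Split one result line into a dict, or return None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    number, question, answer = match.groups()
    return {
        "number": int(number),
        "question": question,
        "answer": answer,
        "pages": [int(p) for p in PAGE_PATTERN.findall(answer)],
    }


example = "3. What is the Capital Expenditure Limit in USD: US Dollars 30,000,000 (Page 32)"
parsed = parse_result_line(example)
print(parsed)
```

Lines that do not follow the expected shape return None, which makes malformed model output easy to spot and skip.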

Complex Entity Extraction

# Example prompt
template_prompt=f'''Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output \"Not specified\".
When you extract a key piece of information, include the closest page number.
Use the following format:\n0. Who is the author\n1. How is a Minor Overspend Breach calculated\n2. How is a Major Overspend Breach calculated\n3. Which years do these financial regulations apply to\n\nDocument: \"\"\"<document>\"\"\"\n\n0. Who is the author: Tom Anderson (Page 1)\n1.'''
print(template_prompt)

Extract key pieces of information from this regulation document.
If a particular piece of information is not present, output "Not specified".
When you extract a key piece of information, include the closest page number.
Use the following format:
0. Who is the author
1. How is a Minor Overspend Breach calculated
2. How is a Major Overspend Breach calculated
3. Which years do these financial regulations apply to

Document: """<document>"""

0. Who is the author: Tom Anderson (Page 1)
1.
results = []

for chunk in text_chunks:
    results.append(extract_chunk(chunk,template_prompt))

groups = [r.split('\n') for r in results]

# Zip the groups together
zipped = list(zip(*groups))
zipped = [x for y in zipped for x in y if "Not specified" not in x and "__" not in x]
zipped

['1. How is a Minor Overspend Breach calculated: A Minor Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by less than 5% (Page 24)',
'2. How is a Major Overspend Breach calculated: A Material Overspend Breach arises when a Power Unit Manufacturer submits its Full Year Reporting Documentation and Relevant Costs reported therein exceed the Power Unit Cost Cap by 5% or more (Page 25)',
'3. Which years do these financial regulations apply to: 2026 onwards (Page 1)',
'3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 2)',
'3. Which years do these financial regulations apply to: 2022-2025 (Page 6)',
'3. Which years do these financial regulations apply to: 2023, 2024, 2025, 2026 and subsequent Full Year Reporting Periods (Page 10)',
'3. Which years do these financial regulations apply to: 2022 (Page 14)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)',
'3. Which years do these financial regulations apply to: 2022 (Page 19)',
'3. Which years do these financial regulations apply to: 2022 (Page 21)',
'3. Which years do these financial regulations apply to: 2026 onwards (Page 26)',
'3. Which years do these financial regulations apply to: 2026 (Page 2)',
'3. Which years do these financial regulations apply to: 2022 (Page 30)',
'3. Which years do these financial regulations apply to: 2022 (Page 32)',
'3. Which years do these financial regulations apply to: 2023, 2024 and 2025 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 37)',
'3. Which years do these financial regulations apply to: 2026 onwards (Page 40)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2026 to 2030 seasons (Page 46)',
'3. Which years do these financial regulations apply to: 2022 (Page 47)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 56)',
'3. Which years do these financial regulations apply to: 2022 (Page 1)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)',
'3. Which years do these financial regulations apply to: 2022 (Page 16)']
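One thing to watch in the merge step: `zip` stops at the shortest group, so if any chunk's response has fewer lines than the others, later answers from every chunk are silently dropped. The sketch below illustrates this with made-up result lines and shows `itertools.zip_longest` as one possible alternative; the `keep` filter mirrors the one used above.

```python
from itertools import zip_longest

# Made-up responses: the second chunk returned only one line
groups = [
    ["1. Q1: Not specified", "2. Q2: Not specified", "3. Q3: 2026 onwards (Page 1)"],
    ["1. Q1: under 5% (Page 24)"],
]


def keep(line):
    """Same filter as the notebook: drop unanswered lines and placeholder lines."""
    return "Not specified" not in line and "__" not in line


# zip truncates to the shortest group, losing answers 2 and 3 of the first chunk
kept_truncated = [x for row in zip(*groups) for x in row if keep(x)]

# zip_longest pads short groups, so no answered line is lost
kept_padded = [
    x
    for row in zip_longest(*groups, fillvalue="Not specified")
    for x in row
    if keep(x)
]

print(kept_truncated)
print(kept_padded)
```

The padding value is chosen so the existing filter removes it again.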

Consolidation

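We've been able to extract the first two answers safely, while the third was confounded by the date that appears on every page, though the correct answer is in there as well.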
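When one question returns many conflicting answers, grouping identical answers with the pages that cite them makes review quicker. The sketch below is one way to do that; `group_answers` is a hypothetical helper and the input is a small subset of the output above. Frequency is not a proxy for correctness here: in the full output "2022" is the most common answer, yet it comes from the date printed on each page.

```python
import re
from collections import defaultdict

zipped_sample = [
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 1)",
    "3. Which years do these financial regulations apply to: 2022 (Page 14)",
    "3. Which years do these financial regulations apply to: 2022 (Page 16)",
    "3. Which years do these financial regulations apply to: 2026 onwards (Page 26)",
]

# Trailing "(Page N)" reference at the end of an answer
PAGE_SUFFIX = re.compile(r"\s*\(Page (\d+)\)\s*$")


def group_answers(lines):
    """Map (question, answer text) to the list of pages that gave that answer."""
    grouped = defaultdict(list)
    for line in lines:
        question, _, answer = line.partition(": ")
        page = PAGE_SUFFIX.search(answer)
        text = PAGE_SUFFIX.sub("", answer)
        grouped[(question, text)].append(int(page.group(1)) if page else None)
    return dict(grouped)


grouped_answers = group_answers(zipped_sample)
for (question, answer), pages in grouped_answers.items():
    print(f"{answer!r} cited on pages {pages}")
```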

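To tune this further you can consider experimenting with:
  • A more descriptive or specific prompt
  • If you have sufficient training data, fine-tuning a model to produce the desired outputs reliably
  • The way you chunk your data: we have gone for 1000 tokens with no overlap, but more intelligent chunking that breaks the content into sections, cuts by tokens or similar may get better results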
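As an illustration of the last point, overlapping windows are one simple variation: each chunk repeats the tail of the previous one, so an answer that straddles a boundary appears whole in at least one chunk. This is a minimal sketch on a plain token list; `chunks_with_overlap` and its parameters are hypothetical and not part of the notebook.

```python
def chunks_with_overlap(tokens, size, overlap):
    """Yield windows of `size` tokens, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + size]
        # Stop once a window has reached the end, to avoid a redundant tail chunk
        if start + size >= len(tokens):
            break


tokens = list(range(10))
windows = list(chunks_with_overlap(tokens, size=4, overlap=1))
print(windows)
```

The trade-off is cost: with overlap, the same tokens are sent to the model more than once, and duplicate answers from neighbouring chunks need to be merged afterwards.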

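However, with minimal tuning we have now answered 6 questions of varying difficulty using the contents of a long document, and we have a reusable approach that can be applied to any long document requiring entity extraction. We look forward to seeing what you can do with this!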