多模态结构化输出:GPT-4o与其他GPT-4变种¶
在这个笔记本中,我们使用MultiModalLLMCompletionProgram
类来执行带有图像的结构化数据提取。我们将对GPT-4具有视觉能力的模型进行比较。
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
图像数据集:纸牌¶
这个数据集包含了一系列纸牌的图像,每张纸牌包括花色和数字。这个数据集可以用于图像分类、目标检测和图像识别等任务。
对于这个数据提取任务,我们将使用多模态LLMs来从所谓的PaperCards中提取信息。这些是包含研究论文摘要的可视化内容。可以通过执行下面的命令从我们的dropbox账户下载数据集。
下载图片¶
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
将PaperCards加载为ImageDocuments¶
## 导入json模块
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document
# 上下文图片
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
# 让我们看一个
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
构建我们的MultiModalLLMCompletionProgram(多模态结构化输出)¶
期望的结构化输出¶
在这里,我们将定义我们的数据类(即Pydantic BaseModel),它将保存我们从给定图像或PaperCard中提取的数据。
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional
# 期望的输出结构
class PaperCard(BaseModel):
"""用于存储PaperCard文本属性的数据类。"""
title: str = Field(description="论文标题。")
year: str = Field(description="论文发表年份。")
authors: str = Field(description="论文作者。")
arxiv_id: str = Field(description="Arxiv论文ID。")
main_contribution: str = Field(
description="论文的主要贡献。"
)
insights: str = Field(
description="论文的主要见解或动机。"
)
main_results: List[str] = Field(
description="论文的主要结果。"
)
tech_bits: Optional[str] = Field(
description="描述图像技术部分显示的内容。"
)
接下来,我们定义我们的MultiModalLLMCompletionProgram
。在这里,我们实际上将定义三个单独的程序,分别针对具有视觉能力的GPT-4模型,即:GPT-4o、GPT-4v和GPT-4Turbo。
paper_card_extraction_prompt = """
使用附带的PaperCard图像来提取其中的数据,并存储到提供的数据类中。
"""
gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)
gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)
gpt_4turbo = OpenAIMultiModal(
model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)
multimodal_llms = {
"gpt_4o": gpt_4o,
"gpt_4v": gpt_4v,
"gpt_4turbo": gpt_4turbo,
}
programs = {
mdl_name: MultiModalLLMCompletionProgram.from_defaults(
output_cls=PaperCard,
prompt_template_str=paper_card_extraction_prompt,
multi_modal_llm=mdl,
)
for mdl_name, mdl in multimodal_llms.items()
}
让我们进行一次测试运行¶
# 请确保您正在使用llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
papercard
PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')
运行数据提取任务¶
现在我们已经测试过我们的程序,准备将程序应用于PaperCards上的数据提取任务!
import time
import tqdm
results = {}
for mdl_name, program in programs.items():
print(f"Model: {mdl_name}")
results[mdl_name] = {
"papercards": [],
"failures": [],
"execution_times": [],
"image_paths": [],
}
total_time = 0
for img in tqdm.tqdm(image_documents):
results[mdl_name]["image_paths"].append(img.image_path)
start_time = time.time()
try:
structured_output = program(image_documents=[img])
end_time = time.time() - start_time
results[mdl_name]["papercards"].append(structured_output)
results[mdl_name]["execution_times"].append(end_time)
results[mdl_name]["failures"].append(None)
except Exception as e:
results[mdl_name]["papercards"].append(None)
results[mdl_name]["execution_times"].append(None)
results[mdl_name]["failures"].append(e)
print()
Model: gpt_4o
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]
Model: gpt_4v
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]
Model: gpt_4turbo
100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]
定量分析¶
在这里,我们将对各种程序进行快速的定量分析。具体来说,我们将比较总故障次数、成功数据提取作业的总执行时间以及平均执行时间。
import numpy as np
import pandas as pd
指标 = {
"gpt_4o": {},
"gpt_4v": {},
"gpt_4turbo": {},
}
# 错误计数
for 模型名称, 模型结果 in 结果.items():
指标[模型名称]["错误计数"] = sum(
el is not None for el in 模型结果["失败"]
)
指标[模型名称]["总执行时间"] = sum(
el for el in 模型结果["执行时间"] if el is not None
)
指标[模型名称]["平均执行时间"] = 指标[模型名称][
"总执行时间"
] / (len(图片文档) - 指标[模型名称]["错误计数"])
指标[模型名称]["中位执行时间"] = np.percentile(
[el for el in 模型结果["执行时间"] if el is not None], q=0.5
)
pd.DataFrame(metrics)
gpt_4o | gpt_4v | gpt_4turbo | |
---|---|---|---|
error_count | 0.000000 | 14.000000 | 1.000000 |
total_execution_time | 541.128802 | 586.500559 | 762.130032 |
average_execution_time | 15.460823 | 27.928598 | 22.415589 |
median_execution_time | 5.377015 | 11.879649 | 7.177287 |
GPT-4o确实更快!¶
- GPT-4o在总执行时间(成功程序的执行时间,失败的提取不在此计算范围内)以及平均和中位数执行时间方面显然更快
- GPT-4o不仅更快,而且能够为所有PaperCards提取结果。相比之下,GPT-4v失败了14次,GPT-4turbo失败了1次。
定性分析¶
在这最后一节中,我们将对提取结果进行定性分析。最终,我们将得到一个人工评估的“标记”数据集,用于数据提取任务。接下来提供的工具将允许您对三个程序(或模型)对PaperCard数据提取的结果进行手动评估。您作为标记者的任务是将程序的结果从0到5进行排名,其中5表示完美的数据提取。
from IPython.display import clear_output
def display_results_and_papercard(ix: int):
# 图像
image_path = results["gpt_4o"]["image_paths"][ix]
# 输出
gpt_4o_output = results["gpt_4o"]["papercards"][ix]
gpt_4v_output = results["gpt_4v"]["papercards"][ix]
gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]
image = Image.open(image_path).convert("RGB")
plt.figure(figsize=(10, 10))
plt.axis("off")
plt.imshow(image)
plt.show()
print("GPT-4o\n")
if gpt_4o_output is not None:
print(json.dumps(gpt_4o_output.dict(), indent=4))
else:
print("提取数据失败")
print()
print("============================================\n")
print("GPT-4v\n")
if gpt_4v_output is not None:
print(json.dumps(gpt_4v_output.dict(), indent=4))
else:
print("提取数据失败")
print()
print("============================================\n")
print("GPT-4turbo\n")
if gpt_4turbo_output is not None:
print(json.dumps(gpt_4turbo_output.dict(), indent=4))
else:
print("提取数据失败")
print()
print("============================================\n")
GRADES = {
"gpt_4o": [0] * len(image_documents),
"gpt_4v": [0] * len(image_documents),
"gpt_4turbo": [0] * len(image_documents),
}
def manual_evaluation_single(img_ix: int):
"""更新GRADES字典以进行单个PaperCard数据提取任务的评估。"""
display_results_and_papercard(img_ix)
gpt_4o_grade = input(
"为GPT-4o提供0到5的评分,5为最高分。"
)
gpt_4v_grade = input(
"为GPT-4v提供0到5的评分,5为最高分。"
)
gpt_4turbo_grade = input(
"为GPT-4turbo提供0到5的评分,5为最高分。"
)
GRADES["gpt_4o"][img_ix] = gpt_4o_grade
GRADES["gpt_4v"][img_ix] = gpt_4v_grade
GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade
def manual_evaluations(img_ix: Optional[int] = None):
"""用于手动评分gpt-4变体在PaperCard数据提取任务上的交互式程序。"""
if img_ix is None:
# 标记所有结果
for ix in range(len(image_documents)):
print(f"您正在标记第{ix + 1}个,共{len(image_documents)}个")
print()
manual_evaluation_single(ix)
clear_output(wait=True)
else:
manual_evaluation_single(img_ix)
manual_evaluations()
You are marking 35 out of 35
GPT-4o { "title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models", "year": "2023", "authors": "Kim, Seungone et al.", "arxiv_id": "arxiv:2310.08441", "main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.", "insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.", "main_results": [ "Prometheus matches or outperforms GPT-4.", "Prometheus can function as a reward model.", "Reference answers are crucial for fine-grained evaluation." ], "tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations" } ============================================ GPT-4v { "title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models", "year": "2023", "authors": "Kim, George, et al.", "arxiv_id": "arXiv:2310.08941", "main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.", "insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.", "main_results": [ "The main results section would list the key findings or results of the paper as described in the image." ], "tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image." } ============================================ GPT-4turbo { "title": "Prometheus: Evaluating Capability In Language Models", "year": "2023", "authors": "Kim, George, et al.", "arxiv_id": "arXiv:2310.05941", "main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.", "insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.", "main_results": [ "Prometheus LM outperforms GPT-4 on targeted feedback tasks.", "Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.", "Feedback quality was better as reported by human judges." ], "tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models." } ============================================
Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3 Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5 Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
gpt_4o 3.585714 gpt_4v 1.300000 gpt_4turbo 2.128571 dtype: float64
观察表¶
在下表中,我们列出了我们希望从PaperCard中提取的每个组件的一般观察结果。 GPT-4v和GPT-4Turbo的表现相似,但稍微偏向于GPT-4Turbo。一般来说,GPT-4o在这个数据提取任务中表现比其他模型要好得多。最后,所有模型似乎都在描述PaperCard的Tech Bits部分时遇到了困难,有时所有模型都会生成摘要而不是精确提取;然而,GPT-4o做得比其他模型少。
提取的组件 | GPT-4o | GPT-4v & GPT-4Turbo |
---|---|---|
标题、年份、作者 | 非常好,可能达到100% | 可能达到80%,在少数示例中产生幻觉 |
Arxiv ID | 良好,大约95%准确 | 70%准确 |
主要贡献 | 良好(约80%),但无法提取列出的多个贡献 | 不太好,60%准确,有些幻觉 |
洞察 | 不太好(约65%),更多地进行总结而不是提取 | 更多地进行总结而不是提取 |
主要结果 | 非常擅长提取主要结果的摘要陈述 | 在这里产生了很多幻觉 |
技术要点 | 无法生成关于图表的详细描述 | 无法生成关于图表的详细描述 |
概要¶
- GPT-4o比GPT-4v和GPT-4turbo更快,且失败次数更少(0次!)
- GPT-4o比GPT-4v和GPT-4turbo产生更好的数据提取结果
- GPT-4o在从PaperCard中提取事实(标题、作者、年份以及主要结果部分的主题声明)方面表现非常出色
- GPT-4v和GPT-4turbo经常会产生主要结果的幻觉,有时甚至会出现作者的错误
- 使用更好的提示,特别是从Insights部分提取数据,以及描述Tech Bits,可能会改善GPT-4o的结果