多模态结构化输出：GPT-4o与其他GPT-4变种¶

在这个笔记本中，我们使用MultiModalLLMCompletionProgram类来执行带有图像的结构化数据提取。我们将对GPT-4具有视觉能力的模型进行比较。

In [ ]:

Copied!





%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q
%pip install llama-index-llms-openai -q
%pip install llama-index-multi-modal-llms-openai -q
%pip install llama-index-readers-file -q
%pip install -U llama-index-core -q

In [ ]:

Copied!

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd
from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

图像数据集：纸牌¶

这个数据集包含了一系列纸牌的图像，每张纸牌包括花色和数字。这个数据集可以用于图像分类、目标检测和图像识别等任务。

对于这个数据提取任务，我们将使用多模态LLMs来从所谓的PaperCards中提取信息。这些是包含研究论文摘要的可视化内容。可以通过执行下面的命令从我们的dropbox账户下载数据集。

下载图片¶

In [ ]:

Copied!





!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip
!mkdir data
!wget "https://www.dropbox.com/scl/fo/jlxavjjzddcv6owvr9e6y/AJoNd0T2pUSeynOTtM_f60c?rlkey=4mvwc1r6lowmy7zqpnm1ikd24&st=1cs1gs9c&dl=1" -O data/paper_cards.zip
!unzip data/paper_cards.zip -d data
!rm data/paper_cards.zip

将PaperCards加载为ImageDocuments¶

In [ ]:

Copied!





## 导入json模块
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# 上下文图片
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()
## 导入json模块
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# 上下文图片
image_path = "./data"
image_documents = SimpleDirectoryReader(image_path).load_data()

In [ ]:

Copied!





# 让我们看一个
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()
# 让我们看一个
img_doc = image_documents[0]
image = Image.open(img_doc.image_path).convert("RGB")
plt.figure(figsize=(8, 8))
plt.axis("off")
plt.imshow(image)
plt.show()

No description has been provided for this image

构建我们的MultiModalLLMCompletionProgram（多模态结构化输出）¶

期望的结构化输出¶

在这里，我们将定义我们的数据类（即Pydantic BaseModel），它将保存我们从给定图像或PaperCard中提取的数据。

In [ ]:

Copied!





from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional


# 期望的输出结构
class PaperCard(BaseModel):
    """用于存储PaperCard文本属性的数据类。"""

    title: str = Field(description="论文标题。")
    year: str = Field(description="论文发表年份。")
    authors: str = Field(description="论文作者。")
    arxiv_id: str = Field(description="Arxiv论文ID。")
    main_contribution: str = Field(
        description="论文的主要贡献。"
    )
    insights: str = Field(
        description="论文的主要见解或动机。"
    )
    main_results: List[str] = Field(
        description="论文的主要结果。"
    )
    tech_bits: Optional[str] = Field(
        description="描述图像技术部分显示的内容。"
    )
from llama_index.core.program import MultiModalLLMCompletionProgram
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.bridge.pydantic import BaseModel, Field
from typing import List, Optional


# 期望的输出结构
class PaperCard(BaseModel):
    """用于存储PaperCard文本属性的数据类。"""

    title: str = Field(description="论文标题。")
    year: str = Field(description="论文发表年份。")
    authors: str = Field(description="论文作者。")
    arxiv_id: str = Field(description="Arxiv论文ID。")
    main_contribution: str = Field(
        description="论文的主要贡献。"
    )
    insights: str = Field(
        description="论文的主要见解或动机。"
    )
    main_results: List[str] = Field(
        description="论文的主要结果。"
    )
    tech_bits: Optional[str] = Field(
        description="描述图像技术部分显示的内容。"
    )

接下来，我们定义我们的MultiModalLLMCompletionProgram。在这里，我们实际上将定义三个单独的程序，分别针对具有视觉能力的GPT-4模型，即：GPT-4o、GPT-4v和GPT-4Turbo。

In [ ]:

Copied!





paper_card_extraction_prompt = """
使用附带的PaperCard图像来提取其中的数据，并存储到提供的数据类中。
"""

gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)

gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)

multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}

programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}
paper_card_extraction_prompt = """
使用附带的PaperCard图像来提取其中的数据，并存储到提供的数据类中。
"""

gpt_4o = OpenAIMultiModal(model="gpt-4o", max_new_tokens=4096)

gpt_4v = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=4096)

gpt_4turbo = OpenAIMultiModal(
    model="gpt-4-turbo-2024-04-09", max_new_tokens=4096
)

multimodal_llms = {
    "gpt_4o": gpt_4o,
    "gpt_4v": gpt_4v,
    "gpt_4turbo": gpt_4turbo,
}

programs = {
    mdl_name: MultiModalLLMCompletionProgram.from_defaults(
        output_cls=PaperCard,
        prompt_template_str=paper_card_extraction_prompt,
        multi_modal_llm=mdl,
    )
    for mdl_name, mdl in multimodal_llms.items()
}

让我们进行一次测试运行¶

In [ ]:

Copied!

# 请确保您正在使用llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])
# 请确保您正在使用llama-index-core v0.10.37
papercard = programs["gpt_4o"](image_documents=[image_documents[0]])

In [ ]:

Copied!

papercard
papercard

Out[ ]:

PaperCard(title='CRITIC: LLMs Can Self-Correct With Tool-Interactive Critiquing', year='2023', authors='Gao, Zhibin et al.', arxiv_id='arXiv:2305.11738', main_contribution='A framework for verifying and then correcting hallucinations by large language models (LLMs) with external tools (e.g., text-to-text APIs).', insights='LLMs can hallucinate and produce false information. By using external tools, these hallucinations can be identified and corrected.', main_results=['CRITIC leads to marked improvements over baselines on QA, math, and toxicity reduction tasks.', 'Feedback from external tools is crucial for an LLM to self-correct.', 'CRITIC significantly outperforms baselines on QA, math, and toxicity reduction tasks.'], tech_bits='The technical bits section describes the CRITIC prompt, which includes an initial output, critique, and revision steps. It also highlights the tools used for critiquing, such as a calculator for math tasks and a toxicity classifier for toxicity reduction tasks.')

运行数据提取任务¶

现在我们已经测试过我们的程序，准备将程序应用于PaperCards上的数据提取任务！

In [ ]:

Copied!

import time
import tqdm
import time
import tqdm

In [ ]:

Copied!





results = {}

for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()
results = {}

for mdl_name, program in programs.items():
    print(f"Model: {mdl_name}")
    results[mdl_name] = {
        "papercards": [],
        "failures": [],
        "execution_times": [],
        "image_paths": [],
    }
    total_time = 0
    for img in tqdm.tqdm(image_documents):
        results[mdl_name]["image_paths"].append(img.image_path)
        start_time = time.time()
        try:
            structured_output = program(image_documents=[img])
            end_time = time.time() - start_time
            results[mdl_name]["papercards"].append(structured_output)
            results[mdl_name]["execution_times"].append(end_time)
            results[mdl_name]["failures"].append(None)
        except Exception as e:
            results[mdl_name]["papercards"].append(None)
            results[mdl_name]["execution_times"].append(None)
            results[mdl_name]["failures"].append(e)
    print()

Model: gpt_4o

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [09:01<00:00, 15.46s/it]

Model: gpt_4v

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [17:29<00:00, 29.99s/it]

Model: gpt_4turbo

100%|█████████████████████████████████████████████████████████████████████████████████████| 35/35 [14:50<00:00, 25.44s/it]

定量分析¶

在这里，我们将对各种程序进行快速的定量分析。具体来说，我们将比较总故障次数、成功数据提取作业的总执行时间以及平均执行时间。

In [ ]:

Copied!

import numpy as np
import pandas as pd
import numpy as np
import pandas as pd

In [ ]:

Copied!





指标 = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}

# 错误计数
for 模型名称, 模型结果 in 结果.items():
    指标[模型名称]["错误计数"] = sum(
        el is not None for el in 模型结果["失败"]
    )
    指标[模型名称]["总执行时间"] = sum(
        el for el in 模型结果["执行时间"] if el is not None
    )
    指标[模型名称]["平均执行时间"] = 指标[模型名称][
        "总执行时间"
    ] / (len(图片文档) - 指标[模型名称]["错误计数"])
    指标[模型名称]["中位执行时间"] = np.percentile(
        [el for el in 模型结果["执行时间"] if el is not None], q=0.5
    )
指标 = {
    "gpt_4o": {},
    "gpt_4v": {},
    "gpt_4turbo": {},
}

# 错误计数
for 模型名称, 模型结果 in 结果.items():
    指标[模型名称]["错误计数"] = sum(
        el is not None for el in 模型结果["失败"]
    )
    指标[模型名称]["总执行时间"] = sum(
        el for el in 模型结果["执行时间"] if el is not None
    )
    指标[模型名称]["平均执行时间"] = 指标[模型名称][
        "总执行时间"
    ] / (len(图片文档) - 指标[模型名称]["错误计数"])
    指标[模型名称]["中位执行时间"] = np.percentile(
        [el for el in 模型结果["执行时间"] if el is not None], q=0.5
    )

In [ ]:

Copied!

pd.DataFrame(metrics)
pd.DataFrame(metrics)

Out[ ]:

	gpt_4o	gpt_4v	gpt_4turbo
error_count	0.000000	14.000000	1.000000
total_execution_time	541.128802	586.500559	762.130032
average_execution_time	15.460823	27.928598	22.415589
median_execution_time	5.377015	11.879649	7.177287

GPT-4o确实更快！¶

GPT-4o在总执行时间（成功程序的执行时间，失败的提取不在此计算范围内）以及平均和中位数执行时间方面显然更快
GPT-4o不仅更快，而且能够为所有PaperCards提取结果。相比之下，GPT-4v失败了14次，GPT-4turbo失败了1次。

定性分析¶

在这最后一节中，我们将对提取结果进行定性分析。最终，我们将得到一个人工评估的“标记”数据集，用于数据提取任务。接下来提供的工具将允许您对三个程序（或模型）对PaperCard数据提取的结果进行手动评估。您作为标记者的任务是将程序的结果从0到5进行排名，其中5表示完美的数据提取。

In [ ]:

Copied!

from IPython.display import clear_output
from IPython.display import clear_output

In [ ]:

Copied!





def display_results_and_papercard(ix: int):
    # 图像
    image_path = results["gpt_4o"]["image_paths"][ix]

    # 输出
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]

    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()

    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")

    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")

    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")
def display_results_and_papercard(ix: int):
    # 图像
    image_path = results["gpt_4o"]["image_paths"][ix]

    # 输出
    gpt_4o_output = results["gpt_4o"]["papercards"][ix]
    gpt_4v_output = results["gpt_4v"]["papercards"][ix]
    gpt_4turbo_output = results["gpt_4turbo"]["papercards"][ix]

    image = Image.open(image_path).convert("RGB")
    plt.figure(figsize=(10, 10))
    plt.axis("off")
    plt.imshow(image)
    plt.show()

    print("GPT-4o\n")
    if gpt_4o_output is not None:
        print(json.dumps(gpt_4o_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")

    print("GPT-4v\n")
    if gpt_4v_output is not None:
        print(json.dumps(gpt_4v_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")

    print("GPT-4turbo\n")
    if gpt_4turbo_output is not None:
        print(json.dumps(gpt_4turbo_output.dict(), indent=4))
    else:
        print("提取数据失败")
    print()
    print("============================================\n")

In [ ]:

Copied!





GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}


def manual_evaluation_single(img_ix: int):
    """更新GRADES字典以进行单个PaperCard数据提取任务的评估。"""
    display_results_and_papercard(img_ix)

    gpt_4o_grade = input(
        "为GPT-4o提供0到5的评分，5为最高分。"
    )
    gpt_4v_grade = input(
        "为GPT-4v提供0到5的评分，5为最高分。"
    )
    gpt_4turbo_grade = input(
        "为GPT-4turbo提供0到5的评分，5为最高分。"
    )

    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade


def manual_evaluations(img_ix: Optional[int] = None):
    """用于手动评分gpt-4变体在PaperCard数据提取任务上的交互式程序。"""
    if img_ix is None:
        # 标记所有结果
        for ix in range(len(image_documents)):
            print(f"您正在标记第{ix + 1}个，共{len(image_documents)}个")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)
GRADES = {
    "gpt_4o": [0] * len(image_documents),
    "gpt_4v": [0] * len(image_documents),
    "gpt_4turbo": [0] * len(image_documents),
}


def manual_evaluation_single(img_ix: int):
    """更新GRADES字典以进行单个PaperCard数据提取任务的评估。"""
    display_results_and_papercard(img_ix)

    gpt_4o_grade = input(
        "为GPT-4o提供0到5的评分，5为最高分。"
    )
    gpt_4v_grade = input(
        "为GPT-4v提供0到5的评分，5为最高分。"
    )
    gpt_4turbo_grade = input(
        "为GPT-4turbo提供0到5的评分，5为最高分。"
    )

    GRADES["gpt_4o"][img_ix] = gpt_4o_grade
    GRADES["gpt_4v"][img_ix] = gpt_4v_grade
    GRADES["gpt_4turbo"][img_ix] = gpt_4turbo_grade


def manual_evaluations(img_ix: Optional[int] = None):
    """用于手动评分gpt-4变体在PaperCard数据提取任务上的交互式程序。"""
    if img_ix is None:
        # 标记所有结果
        for ix in range(len(image_documents)):
            print(f"您正在标记第{ix + 1}个，共{len(image_documents)}个")
            print()
            manual_evaluation_single(ix)
            clear_output(wait=True)
    else:
        manual_evaluation_single(img_ix)

In [ ]:

Copied!

manual_evaluations()
manual_evaluations()

You are marking 35 out of 35

GPT-4o

{
    "title": "Prometheus: Inducing Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, Seungone et al.",
    "arxiv_id": "arxiv:2310.08441",
    "main_contribution": "An open-source LLM (LLMav2) evaluation specializing in fine-grained evaluations using human-like rubrics.",
    "insights": "While large LLMs like GPT-4 have shown impressive performance, they still lack fine-grained evaluation capabilities. Prometheus aims to address this by providing a dataset and evaluation framework that can assess models on a more detailed level.",
    "main_results": [
        "Prometheus matches or outperforms GPT-4.",
        "Prometheus can function as a reward model.",
        "Reference answers are crucial for fine-grained evaluation."
    ],
    "tech_bits": "Score Rubric, Feedback Collection, Generated Instructions, Generated Responses, Generated Rubrics, Evaluations, Answers & Explanations"
}

============================================

GPT-4v

{
    "title": "PROMETHEUS: Fine-Grained Evaluation Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.08941",
    "main_contribution": "PROMETHEUS presents a novel source-level LLM evaluation suite using a custom feedback collection interface.",
    "insights": "The insights section would contain a summary of the main insight or motivation for the paper as described in the image.",
    "main_results": [
        "The main results section would list the key findings or results of the paper as described in the image."
    ],
    "tech_bits": "The tech bits section would describe what's being displayed in the technical bits section of the image."
}

============================================

GPT-4turbo

{
    "title": "Prometheus: Evaluating Capability In Language Models",
    "year": "2023",
    "authors": "Kim, George, et al.",
    "arxiv_id": "arXiv:2310.05941",
    "main_contribution": "Prometheus uses a custom feedback collection system designed for fine-tuning language models.",
    "insights": "The main insight is that fine-tuning language models on specific tasks can improve their overall performance, especially when using a custom feedback collection system.",
    "main_results": [
        "Prometheus LM outperforms GPT-4 on targeted feedback tasks.",
        "Prometheus LM's custom feedback function was 2% more effective than Prometheus 3.",
        "Feedback quality was better as reported by human judges."
    ],
    "tech_bits": "The technical bits section includes a Rubric Score, Seed, Fine-Grained Annotations, and Models. It also shows a feedback collection process with a visual representation of the feedback loop involving seed, generated annotations, and models."
}

============================================

Provide a rating from 0 to 5, with 5 being the highest for GPT-4o. 3
Provide a rating from 0 to 5, with 5 being the highest for GPT-4v. 1.5
Provide a rating from 0 to 5, with 5 being the highest for GPT-4turbo. 1.5

In [ ]:

Copied!

grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()
grades_df = pd.DataFrame(GRADES, dtype=float)
grades_df.mean()

Out[ ]:

gpt_4o        3.585714
gpt_4v        1.300000
gpt_4turbo    2.128571
dtype: float64

观察表¶

在下表中，我们列出了我们希望从PaperCard中提取的每个组件的一般观察结果。 GPT-4v和GPT-4Turbo的表现相似，但稍微偏向于GPT-4Turbo。一般来说，GPT-4o在这个数据提取任务中表现比其他模型要好得多。最后，所有模型似乎都在描述PaperCard的Tech Bits部分时遇到了困难，有时所有模型都会生成摘要而不是精确提取；然而，GPT-4o做得比其他模型少。

提取的组件	GPT-4o	GPT-4v & GPT-4Turbo
标题、年份、作者	非常好，可能达到100%	可能达到80%，在少数示例中产生幻觉
Arxiv ID	良好，大约95%准确	70%准确
主要贡献	良好（约80%），但无法提取列出的多个贡献	不太好，60%准确，有些幻觉
洞察	不太好（约65%），更多地进行总结而不是提取	更多地进行总结而不是提取
主要结果	非常擅长提取主要结果的摘要陈述	在这里产生了很多幻觉
技术要点	无法生成关于图表的详细描述	无法生成关于图表的详细描述

概要¶

GPT-4o比GPT-4v和GPT-4turbo更快，且失败次数更少（0次！）
GPT-4o比GPT-4v和GPT-4turbo产生更好的数据提取结果
GPT-4o在从PaperCard中提取事实（标题、作者、年份以及主要结果部分的主题声明）方面表现非常出色
GPT-4v和GPT-4turbo经常会产生主要结果的幻觉，有时甚至会出现作者的错误
使用更好的提示，特别是从Insights部分提取数据，以及描述Tech Bits，可能会改善GPT-4o的结果