Multi-Modal LLM using OpenAI GPT-4V model for image reasoning

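In this notebook, we show how to use the OpenAI GPT4V MultiModal LLM class/abstraction for image understanding and reasoning.

We also show several functions we now support for the OpenAI GPT4V LLM:

complete (both sync and async): for a single prompt and a list of images
chat (both sync and async): for multiple chat messages
stream complete (both sync and async): for streaming output of complete
stream chat (both sync and async): for streaming output of chat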

In [ ]:

%pip install llama-index-multi-modal-llms-openai

In [ ]:

!pip install openai matplotlib

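Use GPT4V to understand images from URLs

In this example, we use OpenAI's GPT-4V model to understand an image loaded from a URL: we ask the model to describe the image and show how well it understands it.

We follow these steps:

Load the image from a URL
Use the GPT-4V model to generate a description of the image
Display the generated description

Let's get started!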

In [ ]:

import os

OPENAI_API_KEY = "sk-"  # Your OpenAI API key here
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
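
Hardcoding a key in a notebook makes it easy to leak by accident. As a minimal sketch, the key can instead be read from the environment and requested interactively only when it is missing. The names `ensure_openai_key`, `env` and `ask` are hypothetical helpers for illustration and are not part of LlamaIndex or the OpenAI SDK.

```python
import os
from getpass import getpass


def ensure_openai_key(env=os.environ, ask=getpass):
    """Return the OpenAI API key, prompting only if it is not already set.

    `ensure_openai_key`, `env` and `ask` are hypothetical names used for
    illustration; they are not part of LlamaIndex or the OpenAI SDK.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # Ask without echoing the key to the screen.
        key = ask("OpenAI API key: ").strip()
        if not key:
            raise ValueError("An OpenAI API key is required")
        env["OPENAI_API_KEY"] = key
    return key


# Usage in the notebook (prompts only when the variable is unset):
# OPENAI_API_KEY = ensure_openai_key()
```

The `env` and `ask` parameters exist only so the helper can be exercised without touching the real environment or waiting for keyboard input.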

Initialize `OpenAIMultiModal` and load images from URLs

In [ ]:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls


image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    # "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    # "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

image_documents = load_image_urls(image_urls)

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_KEY, max_new_tokens=300
)
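
Since the image list is edited by hand (entries are commented in and out), a quick sanity check on each URL before loading can catch typos early. The following is a rough sketch using only the standard library. The helper name `looks_like_image_url` and the extension list are assumptions made for illustration, and the check only inspects the URL text, not the remote file.

```python
from urllib.parse import urlparse

# Assumed list of extensions for illustration; adjust to the formats you use.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def looks_like_image_url(url):
    """Return True if `url` is an http(s) URL whose path ends in an image extension."""
    parsed = urlparse(url)
    return (
        parsed.scheme in ("http", "https")
        and bool(parsed.netloc)
        and parsed.path.lower().endswith(IMAGE_EXTENSIONS)
    )


candidate_urls = [
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "not-a-url",
]

# Keep only the entries that pass the textual check.
valid_urls = [url for url in candidate_urls if looks_like_image_url(url)]
```

The filtered list could then be passed to `load_image_urls` in place of the raw list.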

In [ ]:

from PIL import Image
import requests
from io import BytesIO
import matplotlib.pyplot as plt

img_response = requests.get(image_urls[0])
print(image_urls[0])
img = Image.open(BytesIO(img_response.content))
plt.imshow(img)

https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg

Out[ ]:

<matplotlib.image.AxesImage at 0x17ef8c7d0>

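Complete a prompt with a bunch of images

Here a single prompt and the list of image documents are passed to `complete`, which returns the model's full response in one call.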

In [ ]:

complete_response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

In [ ]:

print(complete_response)

The image shows the Colosseum in Rome illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's multiple arches are vividly lit, contrasting with the dark blue sky in the background. Some construction or restoration work appears to be in progress at the base of the structure, and a few people can be seen walking near the site.

Stream complete a prompt with a bunch of images

In [ ]:

stream_complete_response = openai_mm_llm.stream_complete(
    prompt="give me more context for this image",
    image_documents=image_documents,
)

In [ ]:

for r in stream_complete_response:
    print(r.delta, end="")

This image shows the Colosseum, also known as the Flavian Amphitheatre, which is an iconic symbol of Imperial Rome and is located in the center of Rome, Italy. It is one of the world's most famous landmarks and is considered one of the greatest works of Roman architecture and engineering.

The Colosseum is illuminated at night with the colors of the Italian flag: green, white, and red. This lighting could be for a special occasion or event, such as a national holiday, a cultural celebration, or in solidarity with a cause. The use of lighting to display the national colors is a way to highlight the structure's significance to Italy and its people.

The Colosseum was built in the first century AD under the emperors of the Flavian dynasty and was used for gladiatorial contests and public spectacles such as mock sea battles, animal hunts, executions, re-enactments of famous battles, and dramas based on Classical mythology. It could hold between 50,000 and 80,000 spectators and was used for entertainment in the Roman Empire for over 400 years.

Today, the Colosseum is a major tourist attraction, drawing millions of visitors each year. It also serves as a powerful reminder of the Roman Empire's history and its lasting influence on the world.
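
The streaming loop above prints each piece as it arrives but does not keep the text. As a small sketch, the pieces can be accumulated while printing so the full response is available afterwards. `FakeChunk` and `collect_stream` are hypothetical names; the sketch assumes only that each streamed item exposes a `delta` attribute, as in the loop above, and uses stand-in chunks so it runs without an API call.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeChunk:
    """Stand-in for a streamed response chunk; only `delta` is modeled."""

    delta: Optional[str]


def collect_stream(chunks: Iterable) -> str:
    """Print each delta as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        piece = chunk.delta or ""  # treat a missing delta as empty text
        print(piece, end="")
        parts.append(piece)
    return "".join(parts)


full_text = collect_stream(
    [FakeChunk("The "), FakeChunk("Colosseum"), FakeChunk(None)]
)
```

With a real stream, the same helper would be called as `collect_stream(stream_complete_response)`.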

Chat through a list of chat messages

This example builds a list of chat messages, the first of which carries the image, and passes the list to `chat`.

In [ ]:

from llama_index.multi_modal_llms.openai.utils import (
    generate_openai_multi_modal_chat_message,
)

# The first user message carries both the prompt and the image documents.
chat_msg_1 = generate_openai_multi_modal_chat_message(
    prompt="Describe the images as an alternative text",
    role="user",
    image_documents=image_documents,
)

# A deliberately wrong assistant turn, to see whether the model corrects it.
chat_msg_2 = generate_openai_multi_modal_chat_message(
    prompt="The image is a graph showing the surge in US mortgage rates. It is a visual representation of data, with a title at the top and labels for the x and y-axes. Unfortunately, without seeing the image, I cannot provide specific details about the data or the exact design of the graph.",
    role="assistant",
)

chat_msg_3 = generate_openai_multi_modal_chat_message(
    prompt="can I know more?",
    role="user",
)

chat_messages = [chat_msg_1, chat_msg_2, chat_msg_3]
chat_response = openai_mm_llm.chat(
    # prompt="Describe the images as an alternative text",
    messages=chat_messages,
)

In [ ]:

for msg in chat_messages:
    print(msg.role, msg.content)

MessageRole.USER [{'type': 'text', 'text': 'Describe the images as an alternative text'}, {'type': 'image_url', 'image_url': 'https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg'}]
MessageRole.ASSISTANT The image is a graph showing the surge in US mortgage rates. It is a visual representation of data, with a title at the top and labels for the x and y-axes. Unfortunately, without seeing the image, I cannot provide specific details about the data or the exact design of the graph.
MessageRole.USER can I know more?
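
The printed output shows that a user message with images carries a list of parts (one `text` part followed by one `image_url` part per image), while text-only messages carry a plain string. The sketch below reconstructs that shape in plain Python purely for illustration. `build_user_content` is a hypothetical function, not the library's implementation, and the exact payload may differ between library versions.

```python
def build_user_content(prompt, image_urls):
    """Build a content list shaped like the printed user message above.

    Hypothetical helper for illustration only; it mirrors the printed
    structure and is not how the library is implemented.
    """
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": url})
    return content


example = build_user_content(
    "Describe the images as an alternative text",
    ["https://example.com/colosseum.jpg"],  # placeholder URL
)
```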

In [ ]:

print(chat_response)

assistant: I apologize for the confusion earlier. The image actually shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a twilight setting, with the sky transitioning from blue to black. The lighting accentuates the arches and the texture of the stone, creating a dramatic and colorful display. There are some people and a street visible in the foreground, with construction barriers indicating some ongoing work or preservation efforts.

Stream chat through a list of chat messages

In [ ]:

stream_chat_response = openai_mm_llm.stream_chat(
    # prompt="Describe the images as an alternative text",
    messages=chat_messages,
)

In [ ]:

for r in stream_chat_response:
    print(r.delta, end="")

I apologize for the confusion earlier. The image actually shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a twilight setting, with the sky transitioning from blue to black. The lighting accentuates the arches and the texture of the stone, creating a dramatic and patriotic display. There are a few people visible at the base of the Colosseum, and some construction barriers suggest maintenance or archaeological work may be taking place.

Async complete

In [ ]:

response_acomplete = await openai_mm_llm.acomplete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

In [ ]:

print(response_acomplete)

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's iconic arches are vividly lit, and the structure stands out against the dark blue evening sky. A few people can be seen near the base of the Colosseum, and there is some construction fencing visible in the foreground.
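
The top-level `await` used in these cells works in a notebook but not in a plain Python script, where the coroutine has to be driven by an event loop. A minimal sketch of the script form follows, also showing two requests run concurrently with `asyncio.gather`. `fake_acomplete` is a hypothetical stand-in so the example runs without an API key; in practice it would be replaced by a call such as `openai_mm_llm.acomplete(...)`.

```python
import asyncio


async def fake_acomplete(prompt):
    """Hypothetical stand-in for an awaitable call such as `acomplete`."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"response to: {prompt}"


async def main():
    # Run two requests concurrently and wait for both results.
    return await asyncio.gather(
        fake_acomplete("Describe the images as an alternative text"),
        fake_acomplete("give me more context for this image"),
    )


# In a script, asyncio.run starts the event loop; inside a notebook,
# use `await main()` instead because a loop is already running.
results = asyncio.run(main())
```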

Async stream complete

In [ ]:

response_astream_complete = await openai_mm_llm.astream_complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)

In [ ]:

async for delta in response_astream_complete:
    print(delta.delta, end="")

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater's iconic arches are vividly lit, and the structure stands out against the dark blue evening sky. Some construction or restoration work appears to be in progress at the base of the Colosseum, indicated by scaffolding and barriers. A few individuals can be seen near the structure, giving a sense of scale to the massive edifice.
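
An async stream is consumed with `async for`, which is only valid inside a coroutine (or at the top level of a notebook). The sketch below wraps the loop in a coroutine that gathers the deltas into one string. `fake_astream` and `collect_astream` are hypothetical names, and the stand-in generator assumes only that each item has a `delta` attribute, as in the loop above.

```python
import asyncio
from types import SimpleNamespace


async def fake_astream(pieces):
    """Hypothetical stand-in for an async response stream."""
    for piece in pieces:
        await asyncio.sleep(0)  # simulate waiting for the next chunk
        yield SimpleNamespace(delta=piece)


async def collect_astream(stream):
    """Consume an async stream and return the concatenated deltas."""
    parts = []
    async for chunk in stream:
        parts.append(chunk.delta or "")
    return "".join(parts)


streamed_text = asyncio.run(
    collect_astream(fake_astream(["The ", "Colosseum ", "at night"]))
)
```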

Async chat

In [ ]:

achat_response = await openai_mm_llm.achat(
    messages=chat_messages,
)

In [ ]:

print(achat_response)

assistant: I apologize for the confusion in my previous response. Let me provide you with an accurate description of the image you've provided.

The image shows the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is captured in a moment of twilight, with the sky transitioning from blue to black, highlighting the structure's iconic arches and the illuminated colors. There are some people and a street visible in the foreground, with construction barriers indicating some ongoing work or preservation efforts. The Colosseum's grandeur and historical significance are emphasized by the lighting and the dusk setting.

Async stream chat

In [ ]:

astream_chat_response = await openai_mm_llm.astream_chat(
    messages=chat_messages,
)

In [ ]:

async for delta in astream_chat_response:
    print(delta.delta, end="")

I apologize for the confusion in my previous response. The image actually depicts the Colosseum in Rome, Italy, illuminated at night with the colors of the Italian flag: green, white, and red. The ancient amphitheater is shown with its iconic arched openings, and the lighting accentuates its grandeur against the evening sky. There are a few people and some construction barriers visible at the base, indicating ongoing preservation efforts or public works.

Complete with two images

In [ ]:

image_urls = [
    "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    # "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    # "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

image_documents_1 = load_image_urls(image_urls)

response_multi = openai_mm_llm.complete(
    prompt="is there any relationship between those images?",
    image_documents=image_documents_1,
)
print(response_multi)

No, there is no direct relationship between these two images. The first image is an infographic showing the surge in U.S. mortgage rates and its comparison with existing home sales, indicating economic data. The second image is of a person holding a trophy, which seems to be related to a sports achievement or recognition. The content of the two images pertains to entirely different subjects—one is focused on economic information, while the other is related to an individual's achievement in a likely sporting context.

Use GPT4V to understand images from local files

In [ ]:

from llama_index.core import SimpleDirectoryReader

# put your local directory here
image_documents = SimpleDirectoryReader("./images_wiki").load_data()

response = openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=image_documents,
)
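
Before sending a whole directory to the model, it can help to check which image files it actually contains. A rough standard-library sketch follows. `list_image_files` and the suffix set are assumptions for illustration and are independent of how `SimpleDirectoryReader` selects files.

```python
from pathlib import Path

# Assumed set of suffixes for illustration; extend as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_image_files(directory):
    """Return the image files directly inside `directory`, sorted by path.

    Returns an empty list if the directory does not exist.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        path
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Usage: print the file names found in the directory used above.
for path in list_image_files("./images_wiki"):
    print(path.name)
```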

In [ ]:

from PIL import Image
import matplotlib.pyplot as plt

img = Image.open("./images_wiki/3.jpg")
plt.imshow(img)

Out[ ]:

<matplotlib.image.AxesImage at 0x297eec110>

In [ ]:

print(response)

You are looking at a close-up image of a glass Coca-Cola bottle. The label on the bottle features the iconic Coca-Cola logo with additional text underneath it commemorating the 2002 FIFA World Cup hosted by Korea/Japan. The label also indicates that the bottle contains 250 ml of the product. In the background with a shallow depth of field, you can see the blurred image of another Coca-Cola bottle, emphasizing the focus on the one in the foreground. The overall lighting and detail provide a clear view of the bottle and its labeling.
