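Multi-Modal LLM using Replicate LLaVa, Fuyu 8B, and MiniGPT4 models for image reasoning¶

In this notebook, we show how to use the MultiModal LLM class for image understanding/reasoning. We currently support the following models:

LLaVa-13B
Fuyu-8B
MiniGPT-4

In the second part, we show how to use stream complete and async complete with Replicate.

NOTE: At the moment, the Replicate multi-modal LLMs support only one image document at a time.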

%pip install llama-index-multi-modal-llms-replicate

%pip install replicate

Load and initialize Replicate¶

import os

REPLICATE_API_TOKEN = ""  # Your Replicate API token here
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
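Leaving the token as an empty string only fails later, at the first model call. A minimal sketch of an early check, assuming you export `REPLICATE_API_TOKEN` in your shell instead of pasting it into the notebook; the helper name `require_replicate_token` is hypothetical and is not part of llama-index or the `replicate` client:

```python
import os


def require_replicate_token(env=None):
    """Return the Replicate API token from the environment or raise early."""
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it before running the notebook."
        )
    return token
```

Calling `require_replicate_token()` at the top of the notebook turns a missing token into an immediate, readable error.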

In this section, we download images from the internet with Python and load them from local files.

from PIL import Image
import requests
from io import BytesIO

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core.schema import ImageDocument


# Create the output directory if it does not exist yet
os.makedirs("test_images", exist_ok=True)

# For now, the fuyu-8b model on Replicate mostly works well with JPG image URLs rather than local files
image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

# Save the images locally
for idx, image_url in enumerate(image_urls):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    img.save(f"test_images/{idx}.png")

# Option 1: load images directly from the URLs
# image_documents = load_image_urls(image_urls)
# Option 2: load images from local files
image_documents = [
    ImageDocument(image_path=f"test_images/{idx}.png")
    for idx in range(len(image_urls))
]
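The cell above re-encodes every download as PNG. As an alternative sketch (not what the notebook does), the helper below writes the bytes unchanged and keeps the extension found in the URL path. The name `save_images` and the injected `fetch` callable are hypothetical; passing the downloader in as an argument is only there so the helper can be exercised without network access.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def save_images(urls, fetch, directory="test_images"):
    """Save each URL's bytes to directory/<index><ext> and return the paths.

    `fetch` is any callable mapping a URL to bytes, for example
    `lambda url: requests.get(url, timeout=30).content`.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for idx, url in enumerate(urls):
        # Take the extension from the URL path only, ignoring any query string.
        suffix = PurePosixPath(urlparse(url).path).suffix or ".bin"
        path = out_dir / f"{idx}{suffix}"
        path.write_bytes(fetch(url))
        paths.append(path)
    return paths
```

The returned paths can then be passed to `ImageDocument(image_path=...)` in the same way as the PNG files.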

Visualize images¶

import matplotlib.pyplot as plt
from llama_index.core.response.notebook_utils import display_image_uris

image_paths = [str(img_doc.image_path) for img_doc in image_documents]
display_image_uris(image_paths)


Provide various prompts to test different multi-modal LLMs¶

from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import (
    REPLICATE_MULTI_MODAL_LLM_MODELS,
)

prompts = [
    "what is shown in this image?",
    "how many people are shown in the image?",
    "is there anything unusual in the image?",
]

Generate image reasoning from different LLMs with different prompts for different images¶

res = []
for prompt_idx, prompt in enumerate(prompts):
    for image_idx, image_doc in enumerate(image_documents):
        for llm_idx, llm_model in enumerate(REPLICATE_MULTI_MODAL_LLM_MODELS):
            try:
                ## Initialize the multi-modal LLM
                multi_modal_llm = ReplicateMultiModal(
                    model=REPLICATE_MULTI_MODAL_LLM_MODELS[llm_model],
                    max_new_tokens=100,
                    temperature=0.1,
                    num_input_files=1,
                    top_p=0.9,
                    num_beams=1,
                    repetition_penalty=1,
                )

                mm_resp = multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[image_doc],
                )
            except Exception as e:
                # Report the failing combination and move on to the next one
                print(
                    f"Error with LLM model inference with prompt {prompt}, image {image_idx}, and MM model {llm_model}"
                )
                print("Inference failed due to: ", e)
                continue
            res.append(
                {
                    "model": llm_model,
                    "prompt": prompt,
                    "response": mm_resp,
                    "image": str(image_doc.image_path),
                }
            )
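The three nested loops visit every (prompt, image, model) combination and skip the ones that fail. A minimal sketch of the same control flow with `itertools.product`, using a stand-in `infer` callable so it runs without a Replicate token; `run_grid` and `fake_infer` are hypothetical names, not part of llama-index:

```python
from itertools import product


def run_grid(prompts, images, models, infer):
    """Call infer(model, prompt, image) for every combination.

    Returns (results, errors); a failing combination is recorded and skipped.
    """
    results, errors = [], []
    # product() varies the last argument fastest, so models form the inner loop.
    for prompt, image, model in product(prompts, images, models):
        try:
            response = infer(model, prompt, image)
        except Exception as exc:
            errors.append(
                {"model": model, "prompt": prompt, "image": image, "error": str(exc)}
            )
            continue
        results.append(
            {"model": model, "prompt": prompt, "response": response, "image": image}
        )
    return results, errors


def fake_infer(model, prompt, image):
    """Stand-in for a real model call (hypothetical; no network access)."""
    if model == "broken-model":
        raise ValueError("model unavailable")
    return f"{model} saw {image}"


results, errors = run_grid(
    ["q1"], ["0.png", "1.png"], ["model-a", "broken-model"], fake_infer
)
print(len(results), len(errors))  # prints "2 2"
```

Collecting the failures in a list, rather than only printing them, makes it easier to see afterwards which model and image pairs need a retry.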

Display sampled responses from multi-modal LLMs¶

from IPython.display import display
import pandas as pd

pd.options.display.max_colwidth = None
df = pd.DataFrame(res)
display(df[:5])

	model	prompt	response	image
0	llava-13b	what is shown in this image?	The image shows a man holding a gold trophy, possibly a soccer trophy, while wearing a suit and tie.	test_images/0.png
1	fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	test_images/0.png
2	minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	test_images/0.png
3	llava-13b	what is shown in this image?	The image shows a large, illuminated building, which is the Colosseum in Rome, Italy. The building is lit up at night, and the lights create a beautiful and dramatic effect.	test_images/1.png
4	fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	test_images/1.png

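Human labeling of the correctness and relevance of the multi-modal LLM reasoning results¶

Note that human labeling of relevance and correctness may carry some bias/subjectivity.

We label the correctness and relevance scores on a scale of [1, 5]:

5: answers the question perfectly
4: answers the question to some extent
3: partly answers the question
2: answers the question with a wrong answer
1: no answer or hallucination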

Model	Prompt/Question	Model Reasoning Results	Correctness and relevance [1,5]	image
llava-13b	what is shown in this image?	The image shows a man holding a trophy, which appears to be a gold soccer ball. He is dressed in a suit and tie, and he is smiling as he holds the trophy.	4	test_images/0.png
fuyu-8b	what is shown in this image?	The image shows a man wearing a suit and holding a golden ball trophy.	4	test_images/0.png
minigpt-4	what is shown in this image?	The image shows a man in a black suit and tie holding a golden trophy.	4	test_images/0.png
llava-13b	what is shown in this image?	The image shows a large, illuminated building, likely a historical monument or a famous landmark, at night. The building is lit up with colored lights, creating a visually striking scene.	4	test_images/1.png
fuyu-8b	what is shown in this image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater.	4	test_images/1.png
minigpt-4	what is shown in this image?	The image shows the Colosseum, a large amphitheater in Rome, Italy. It is one of the most famous landmarks in the city and is a popular tourist attraction.	5	test_images/1.png
llava-13b	what is shown in this image?	The image shows two animated minion characters from the movie "Despicable Me" standing next to each other.	5	test_images/2.png
fuyu-8b	what is shown in this image?	The image features two minion figures, one wearing overalls and the other wearing goggles. They are standing next to each other, possibly engaged in conversation or posing for a photo.	5	test_images/2.png
minigpt-4	what is shown in this image?	The image shows two minions, one with a yellow overalls and the other with a blue overalls, both wearing sunglasses and standing next to each other.	4	test_images/2.png
llava-13b	how many people are shown in the image?	There is only one person shown in the image, a man holding a trophy.	5	test_images/0.png
fuyu-8b	how many people are shown in the image?	In the image, there is a man wearing a suit and holding a golden ball or trophy.	5	test_images/0.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Can you please provide the image again?	NA	test_images/0.png
llava-13b	how many people are shown in the image?	There are two people shown in the image.	1	test_images/1.png
fuyu-8b	how many people are shown in the image?	The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater. \n\nThere are several people walking along the street, some carrying handbags. A few benches can be seen along the sidewalk, providing a place for pedestrians to rest. The atmosphere is lively, with people going about their daily activities.	4	test_images/1.png
minigpt-4	how many people are shown in the image?	I'm sorry, but I cannot see the image you provided. Can you please provide more information or a different image?	NA	test_images/1.png
llava-13b	how many people are shown in the image?	There are two people shown in the image.	4	test_images/2.png
fuyu-8b	how many people are shown in the image?	There are two people shown in the image.	4	test_images/2.png
minigpt-4	how many people are shown in the image?	There are two minions in the image.	5	test_images/2.png
llava-13b	is there anything unusual in the image?	Yes, there is an unusual element in the image. The man is holding a gold trophy, which is not uncommon. However, he is also holding a microphone, which is an unusual combination. Typically, a person holding a microphone would be a speaker or an announcer, while a trophy is usually associated with an award or achievement. The combination of the microphone and the trophy suggests that the man might be a presenter or an announcer who is also receiving an award or acknowledging someone else's achievement.	4	test_images/0.png
fuyu-8b	is there anything unusual in the image?	Yes, there is something unusual in the image, as there is a golden ball or trophy being held by a man wearing a suit. This is unusual because typically, sports balls or trophies are associated with competitions or achievements, rather than being worn by a person in a suit. The man holding the golden ball or trophy is wearing a suit, which adds to the peculiarity of the scene.	3	test_images/0.png
minigpt-4	is there anything unusual in the image?	The image shows a man in a black suit and tie holding a golden trophy. The trophy has a design on it that looks like a soccer ball. The man is smiling and appears to be happy.	3	test_images/0.png
llava-13b	is there anything unusual in the image?	Yes, there is an unusual aspect in the image. The Colosseum, a famous ancient Roman amphitheater, is lit up with colored lights, which is not a common sight. This illumination adds a unique and vibrant touch to the otherwise historical and majestic structure.	4	test_images/1.png
fuyu-8b	is there anything unusual in the image?	Yes, the image features an unusual scene, as multiple people are walking around the colossal structure, which is an old Roman amphitheater. The amphitheater is illuminated with colorful lights, giving it a vibrant and lively atmosphere.	3	test_images/1.png
minigpt-4	is there anything unusual in the image?	The image shows the Colosseum, a large amphitheater in Rome, Italy. It is a well known landmark and tourist attraction.	2	test_images/1.png
llava-13b	is there anything unusual in the image?	Yes, there is something unusual in the image. The two cartoon minions are standing next to each other, but one of them has a tooth missing. This is an unusual detail, as it is not common for animated characters to have imperfections like missing teeth. The missing tooth adds a unique and interesting aspect to the image, making it stand out from typical animated scenes.	3	test_images/2.png
fuyu-8b	is there anything unusual in the image?	Yes, there is an unusual aspect of the image, as there are two minions dressed in overalls, wearing goggles, and standing next to each other. This unusual combination is not typical, as minions are typically associated with their popular animation and movie franchises. The minions' overalls, goggles, and overalls-wearing, combined with the goggles they are wearing, adds to the peculiarity of the scene.	2	test_images/2.png
minigpt-4	is there anything unusual in the image?	The image appears to be a cartoon character with overalls and a yellow shirt. The character is smiling and has a blue hat on its head. There is nothing unusual in the image.	5	test_images/2.png
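A per-model average is one simple way to read the labels. The sketch below uses only the nine "how many people are shown in the image?" rows from the table above and keeps `NA` rows out of the mean while counting them separately. That handling of `NA` is an assumption of this sketch, and with three rows per model the numbers are illustrative only.

```python
from collections import defaultdict

# (model, score) pairs copied from the "how many people are shown in the image?" rows.
labels = [
    ("llava-13b", "5"), ("fuyu-8b", "5"), ("minigpt-4", "NA"),
    ("llava-13b", "1"), ("fuyu-8b", "4"), ("minigpt-4", "NA"),
    ("llava-13b", "4"), ("fuyu-8b", "4"), ("minigpt-4", "5"),
]


def mean_scores(pairs):
    """Average the numeric scores per model; 'NA' rows are counted separately."""
    totals = defaultdict(lambda: {"sum": 0, "rated": 0, "na": 0})
    for model, score in pairs:
        if score == "NA":
            totals[model]["na"] += 1
        else:
            totals[model]["sum"] += int(score)
            totals[model]["rated"] += 1
    return {
        model: {
            "mean": t["sum"] / t["rated"] if t["rated"] else None,
            "rated": t["rated"],
            "na": t["na"],
        }
        for model, t in totals.items()
    }


summary = mean_scores(labels)
for model, stats in summary.items():
    print(model, stats)
```

Reporting the `NA` count next to the mean matters here: MiniGPT-4's average on this prompt rests on a single rated answer.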

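Summary of preliminary findings on the evaluated multi-modal models¶

First, the purpose of this notebook is to show how to use Replicate to serve different multi-modal LLMs for image reasoning tasks. The comparison has some limitations:

We compared and evaluated LLaVa-13B, Fuyu-8B, and MiniGPT-4 on a small set of simple tasks/prompts.
Note that the hyperparameters are the same for all models in this example. Hyperparameter tuning could have a significant effect on the quality of multi-modal LLM output.
Human evaluation may carry some bias/subjectivity/noise.

Some preliminary findings:

MiniGPT-4 can sometimes give a more accurate answer, such as "There are two minions in the image." instead of "There are two people shown in the image." from LLaVa or Fuyu-8B. Another example is that MiniGPT-4 names the Colosseum directly when asked "what is shown in this image?" for the image of the Colosseum in Italy.
MiniGPT-4 failed to give a result in two cases (the "how many people are shown in the image?" prompt on test_images/0.png and test_images/1.png). It answered "I'm sorry, but I cannot see the image you provided." even though it could answer other questions about the same images. It is unclear whether this is an issue with Replicate inference or with the MiniGPT-4 model itself.
Fuyu-8B and LLaVa-13B usually give longer, more verbose answers that include more supporting context.
LLaVa-13B and Fuyu-8B sometimes hallucinate slightly more, especially for the question "is there anything unusual in the image?"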

Replicate stream complete, async complete, and async stream complete modes¶

Initialize the Fuyu-8B model¶

multi_modal_llm = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["fuyu-8b"],
    max_new_tokens=100,
    temperature=0.1,
    num_input_files=1,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1,
)

Using async stream complete¶

resp = await multi_modal_llm.astream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

async for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
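The `async for` loop prints each piece as it arrives. If the full text is needed afterwards, the pieces can also be accumulated. A minimal sketch with a stand-in async generator, so it runs without a Replicate token; `fake_stream` and `collect_text` are hypothetical names, and the stand-in yields plain strings rather than response objects with a `.delta` attribute:

```python
import asyncio


async def fake_stream(chunks):
    """Stand-in for a streaming response: yields text chunks one at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield chunk


async def collect_text(stream):
    """Consume an async stream and return the concatenated text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)


async def main():
    return await collect_text(fake_stream(["The image ", "features ", "a man."]))


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Inside a notebook, where an event loop is already running, use `await main()` instead of `asyncio.run(main())`.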

Using async complete¶

resp = await multi_modal_llm.acomplete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

print(resp)

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
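Because the call is awaited, several requests can in principle be issued together with `asyncio.gather`. The sketch below shows the pattern with a stand-in coroutine; `fake_acomplete` and `complete_all` are hypothetical names. Whether real requests actually overlap depends on the underlying client not blocking the event loop, which this sketch assumes and does not verify.

```python
import asyncio


async def fake_acomplete(prompt, delay=0.01):
    """Stand-in for an async completion call (hypothetical; no network access)."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def complete_all(prompts):
    """Start one request per prompt and wait for all of them together."""
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(fake_acomplete(p) for p in prompts))
```

In the notebook this would be called as `await complete_all(prompts)`, with the stand-in replaced by a real async completion call.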

Using stream complete¶

resp = multi_modal_llm.stream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)

for delta in resp:
    print(delta.delta, end="")

 The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award.

In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
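The synchronous loop can print incrementally and still keep the full text. A minimal sketch with a stand-in generator; `Chunk`, `fake_stream`, and `print_and_collect` are hypothetical names, and `Chunk` models only the `.delta` attribute that the loop above reads:

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical stand-in for one streamed piece; only `.delta` is modelled."""

    delta: str


def fake_stream(pieces):
    """Yield one Chunk per text piece, like a streaming response would."""
    for piece in pieces:
        yield Chunk(delta=piece)


def print_and_collect(stream):
    """Print each delta as it arrives and return the full text at the end."""
    parts = []
    for chunk in stream:
        print(chunk.delta, end="", flush=True)
        parts.append(chunk.delta)
    print()  # finish the line once the stream is exhausted
    return "".join(parts)
```

Passing `flush=True` keeps the output appearing piece by piece even when stdout is buffered.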