In [ ]:
%pip install llama-index-multi-modal-llms-replicate
In [ ]:
%pip install replicate
Load and initialize Replicate¶
In [ ]:
import os

REPLICATE_API_TOKEN = ""  # Your Replicate API token here
os.environ["REPLICATE_API_TOKEN"] = REPLICATE_API_TOKEN
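Hard-coding the token in a notebook cell makes it easy to leak. As a minimal sketch, the token can instead be read from the environment with an interactive fallback. The helper name `get_replicate_token` and its parameters are hypothetical and are not part of LlamaIndex or the Replicate client.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def get_replicate_token(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the token from the environment, or ask for it interactively."""
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        # Fall back to a hidden prompt so the token is never stored in the notebook.
        token = prompt("Replicate API token: ").strip()
    if not token:
        raise ValueError("REPLICATE_API_TOKEN is empty")
    return token
```

In the notebook this would be used as `os.environ["REPLICATE_API_TOKEN"] = get_replicate_token()`.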
In this section, we use Python to download images from the internet and load them locally.
In [ ]:
from PIL import Image
import requests
from io import BytesIO

from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core.schema import ImageDocument

if not os.path.exists("test_images"):
    os.makedirs("test_images")

# For now, the fuyu-8b model on Replicate mostly works well with JPG image URLs rather than local files
image_urls = [
    # "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg",
    "https://www.sportsnet.ca/wp-content/uploads/2023/11/CP1688996471-1040x572.jpg",
    "https://res.cloudinary.com/hello-tickets/image/upload/c_limit,f_auto,q_auto,w_1920/v1640835927/o3pfl41q7m5bj8jardk0.jpg",
    "https://www.cleverfiles.com/howto/wp-content/uploads/2018/03/minion.jpg",
]

# Save the images locally
for idx, image_url in enumerate(image_urls):
    response = requests.get(image_url)
    img = Image.open(BytesIO(response.content))
    img.save(f"test_images/{idx}.png")

# Option 1: load the images directly from the URLs
# image_documents = load_image_urls(image_urls)

# Option 2: load the images from local files
image_documents = [
    ImageDocument(image_path=f"test_images/{idx}.png")
    for idx in range(len(image_urls))
]
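Re-running the download cell fetches every image again. The sketch below shows one way to skip files that already exist. It uses only the standard library and stores the raw bytes rather than re-encoding to PNG with PIL as the cell above does. `fetch_bytes` and `download_if_missing` are hypothetical helper names introduced for illustration.

```python
import os
import urllib.request
from typing import Callable


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_if_missing(
    url: str,
    path: str,
    fetch: Callable[[str], bytes] = fetch_bytes,
) -> bool:
    """Write the URL's bytes to `path` unless the file already exists.

    Returns True when a download happened, False when the cached file was reused.
    """
    if os.path.exists(path):
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    data = fetch(url)
    with open(path, "wb") as f:
        f.write(data)
    return True
```

The `fetch` parameter is injectable so the caching logic can be exercised without network access.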
Visualize the images¶
In [ ]:
import matplotlib.pyplot as plt
from llama_index.core.response.notebook_utils import display_image_uris

image_paths = [str(img_doc.image_path) for img_doc in image_documents]
display_image_uris(image_paths)
Provide various prompts to test different multi-modal LLMs¶
In [ ]:
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.multi_modal_llms.replicate.base import (
    REPLICATE_MULTI_MODAL_LLM_MODELS,
)

prompts = [
    "what is shown in this image?",
    "how many people are shown in the image?",
    "is there anything unusual in the image?",
]
Generate image reasoning results for different images using different LLMs and different prompts¶
In [ ]:
res = []
for prompt_idx, prompt in enumerate(prompts):
    for image_idx, image_doc in enumerate(image_documents):
        for llm_idx, llm_model in enumerate(REPLICATE_MULTI_MODAL_LLM_MODELS):
            try:
                # Initialize the multi-modal LLM
                multi_modal_llm = ReplicateMultiModal(
                    model=REPLICATE_MULTI_MODAL_LLM_MODELS[llm_model],
                    max_new_tokens=100,
                    temperature=0.1,
                    num_input_files=1,
                    top_p=0.9,
                    num_beams=1,
                    repetition_penalty=1,
                )

                mm_resp = multi_modal_llm.complete(
                    prompt=prompt,
                    image_documents=[image_doc],
                )
            except Exception as e:
                print(
                    f"Error during LLM inference with prompt {prompt}, image {image_idx}, and MM model {llm_model}"
                )
                print("Inference failed due to: ", e)
                continue
            res.append(
                {
                    "model": llm_model,
                    "prompt": prompt,
                    "response": mm_resp,
                    "image": str(image_doc.image_path),
                }
            )
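The nested loops above print failures and move on, so failed combinations are lost once the cell finishes. A minimal sketch of the same prompt × image × model grid that also records errors is shown below. `run_grid` and the `infer` callable are hypothetical; in the notebook, `infer` would wrap the `ReplicateMultiModal(...).complete(...)` call.

```python
from itertools import product


def run_grid(prompts, images, models, infer):
    """Call infer(model, prompt, image) for every combination.

    Successful calls go to `results`; failures are kept in `errors`
    instead of only being printed.
    """
    results, errors = [], []
    # product() iterates prompts outermost and models innermost,
    # which matches the nesting order of the loops in the notebook.
    for prompt, image, model in product(prompts, images, models):
        try:
            response = infer(model, prompt, image)
        except Exception as exc:  # broad on purpose, mirroring the notebook's loop
            errors.append(
                {"model": model, "prompt": prompt, "image": image, "error": str(exc)}
            )
            continue
        results.append(
            {"model": model, "prompt": prompt, "response": response, "image": image}
        )
    return results, errors
```

Keeping `errors` as a list makes it possible to retry only the failed combinations later.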
Display sampled responses from the multi-modal LLMs¶
In [ ]:
from IPython.display import display
import pandas as pd

pd.options.display.max_colwidth = None
df = pd.DataFrame(res)
display(df[:5])
index | model | prompt | response | image |
---|---|---|---|---|
0 | llava-13b | what is shown in this image? | The image shows a man holding a gold trophy, possibly a soccer trophy, while wearing a suit and tie. | test_images/0.png |
1 | fuyu-8b | what is shown in this image? | The image shows a man wearing a suit and holding a golden ball trophy. | test_images/0.png |
2 | minigpt-4 | what is shown in this image? | The image shows a man in a black suit and tie holding a golden trophy. | test_images/0.png |
3 | llava-13b | what is shown in this image? | The image shows a large, illuminated building, which is the Colosseum in Rome, Italy. The building is lit up at night, and the lights create a beautiful and dramatic effect. | test_images/1.png |
4 | fuyu-8b | what is shown in this image? | The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater. | test_images/1.png |
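Human labeling of correctness and relevance for the multi-modal LLM reasoning results¶
Note that human labeling may carry some bias/subjectivity when judging relevance and correctness.
We label each result with a correctness and relevance score in [1, 5]:
- 5: answers the question perfectly
- 4: mostly answers the question
- 3: partially answers the question
- 2: answers the question with a wrong answer
- 1: no answer or hallucination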
Model | Prompt/Question | Model Reasoning Results | Correctness and relevance [1,5] | image |
---|---|---|---|---|
llava-13b | what is shown in this image? | The image shows a man holding a trophy, which appears to be a gold soccer ball. He is dressed in a suit and tie, and he is smiling as he holds the trophy. | 4 | test_images/0.png |
fuyu-8b | what is shown in this image? | The image shows a man wearing a suit and holding a golden ball trophy. | 4 | test_images/0.png |
minigpt-4 | what is shown in this image? | The image shows a man in a black suit and tie holding a golden trophy. | 4 | test_images/0.png |
llava-13b | what is shown in this image? | The image shows a large, illuminated building, likely a historical monument or a famous landmark, at night. The building is lit up with colored lights, creating a visually striking scene. | 4 | test_images/1.png |
fuyu-8b | what is shown in this image? | The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater. | 4 | test_images/1.png |
minigpt-4 | what is shown in this image? | The image shows the Colosseum, a large amphitheater in Rome, Italy. It is one of the most famous landmarks in the city and is a popular tourist attraction. | 5 | test_images/1.png |
llava-13b | what is shown in this image? | The image shows two animated minion characters from the movie "Despicable Me" standing next to each other. | 5 | test_images/2.png |
fuyu-8b | what is shown in this image? | The image features two minion figures, one wearing overalls and the other wearing goggles. They are standing next to each other, possibly engaged in conversation or posing for a photo. | 5 | test_images/2.png |
minigpt-4 | what is shown in this image? | The image shows two minions, one with a yellow overalls and the other with a blue overalls, both wearing sunglasses and standing next to each other. | 4 | test_images/2.png |
llava-13b | how many people are shown in the image? | There is only one person shown in the image, a man holding a trophy. | 5 | test_images/0.png |
fuyu-8b | how many people are shown in the image? | In the image, there is a man wearing a suit and holding a golden ball or trophy. | 5 | test_images/0.png |
minigpt-4 | how many people are shown in the image? | I'm sorry, but I cannot see the image you provided. Can you please provide the image again? | NA | test_images/0.png |
llava-13b | how many people are shown in the image? | There are two people shown in the image. | 1 | test_images/1.png |
fuyu-8b | how many people are shown in the image? | The image showcases a city street at night, with colorful lights illuminating the scene. The street is lined with buildings, including a prominent Roman-style amphitheater. \n\nThere are several people walking along the street, some carrying handbags. A few benches can be seen along the sidewalk, providing a place for pedestrians to rest. The atmosphere is lively, with people going about their daily activities. | 4 | test_images/1.png |
minigpt-4 | how many people are shown in the image? | I'm sorry, but I cannot see the image you provided. Can you please provide more information or a different image? | NA | test_images/1.png |
llava-13b | how many people are shown in the image? | There are two people shown in the image. | 4 | test_images/2.png |
fuyu-8b | how many people are shown in the image? | There are two people shown in the image. | 4 | test_images/2.png |
minigpt-4 | how many people are shown in the image? | There are two minions in the image. | 5 | test_images/2.png |
llava-13b | is there anything unusual in the image? | Yes, there is an unusual element in the image. The man is holding a gold trophy, which is not uncommon. However, he is also holding a microphone, which is an unusual combination. Typically, a person holding a microphone would be a speaker or an announcer, while a trophy is usually associated with an award or achievement. The combination of the microphone and the trophy suggests that the man might be a presenter or an announcer who is also receiving an award or acknowledging someone else's achievement. | 4 | test_images/0.png |
fuyu-8b | is there anything unusual in the image? | Yes, there is something unusual in the image, as there is a golden ball or trophy being held by a man wearing a suit. This is unusual because typically, sports balls or trophies are associated with competitions or achievements, rather than being worn by a person in a suit. The man holding the golden ball or trophy is wearing a suit, which adds to the peculiarity of the scene. | 3 | test_images/0.png |
minigpt-4 | is there anything unusual in the image? | The image shows a man in a black suit and tie holding a golden trophy. The trophy has a design on it that looks like a soccer ball. The man is smiling and appears to be happy. | 3 | test_images/0.png |
llava-13b | is there anything unusual in the image? | Yes, there is an unusual aspect in the image. The Colosseum, a famous ancient Roman amphitheater, is lit up with colored lights, which is not a common sight. This illumination adds a unique and vibrant touch to the otherwise historical and majestic structure. | 4 | test_images/1.png |
fuyu-8b | is there anything unusual in the image? | Yes, the image features an unusual scene, as multiple people are walking around the colossal structure, which is an old Roman amphitheater. The amphitheater is illuminated with colorful lights, giving it a vibrant and lively atmosphere. | 3 | test_images/1.png |
minigpt-4 | is there anything unusual in the image? | The image shows the Colosseum, a large amphitheater in Rome, Italy. It is a well known landmark and tourist attraction. | 2 | test_images/1.png |
llava-13b | is there anything unusual in the image? | Yes, there is something unusual in the image. The two cartoon minions are standing next to each other, but one of them has a tooth missing. This is an unusual detail, as it is not common for animated characters to have imperfections like missing teeth. The missing tooth adds a unique and interesting aspect to the image, making it stand out from typical animated scenes. | 3 | test_images/2.png |
fuyu-8b | is there anything unusual in the image? | Yes, there is an unusual aspect of the image, as there are two minions dressed in overalls, wearing goggles, and standing next to each other. This unusual combination is not typical, as minions are typically associated with their popular animation and movie franchises. The minions' overalls, goggles, and overalls-wearing, combined with the goggles they are wearing, adds to the peculiarity of the scene. | 2 | test_images/2.png |
minigpt-4 | is there anything unusual in the image? | The image appears to be a cartoon character with overalls and a yellow shirt. The character is smiling and has a blue hat on its head. There is nothing unusual in the image. | 5 | test_images/2.png |
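To summarize labels like the ones in the table, the scores can be averaged per model while skipping rows marked `NA`. This is only a sketch: `mean_score_per_model` is a hypothetical helper, and the sample contains just three rows copied from the table (the "how many people" prompt on test_images/0.png), not the full set of labels.

```python
from collections import defaultdict
from statistics import mean


def mean_score_per_model(rows):
    """Average the 1-5 labels per model, skipping rows labeled "NA"."""
    scores = defaultdict(list)
    for model, score in rows:
        if score == "NA":
            continue  # the model returned no usable answer, so there is nothing to rate
        scores[model].append(int(score))
    return {model: mean(values) for model, values in scores.items()}


# Three rows copied from the table above.
sample = [("llava-13b", "5"), ("fuyu-8b", "5"), ("minigpt-4", "NA")]
print(mean_score_per_model(sample))
```

A model whose rows are all `NA` is omitted from the result rather than given a score of zero, which avoids mixing "no answer returned" with "wrong answer".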
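Summary of preliminary findings for the evaluated multi-modal models¶
First, the purpose of this notebook is to show how to use Replicate to serve different multi-modal LLMs for image reasoning tasks. This comparison has some limitations:
- We compared and evaluated LLaVa-13B, Fuyu-8B, and MiniGPT-4 on a small set of simple tasks/prompts.
- Note that the hyperparameters are the same for all models in this example. Hyperparameter tuning could significantly affect the output quality of the multi-modal LLMs.
- Human evaluation may carry some bias/subjectivity/noise.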
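Some preliminary findings:
- MiniGPT-4 sometimes gives a more accurate answer, such as "There are two minions in the image." instead of "There are two people shown in the image." from LLaVa-13B or Fuyu-8B. Another example is that MiniGPT-4 directly answers "Colosseum" to the question "what is shown in this image?" for the image of the Colosseum in Italy.
- MiniGPT-4 failed to give results for two prompts. It answered "I'm sorry, but I cannot see the image you provided." even though it could answer other questions about the same images. It is unclear whether this is an issue with Replicate inference or with the MiniGPT-4 model itself.
- Fuyu-8B and LLaVa-13B usually give longer, more verbose answers with more supporting context.
- LLaVa-13B and Fuyu-8B sometimes hallucinate slightly more, especially for the question "is there anything unusual in the image?"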
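One way to probe the "cannot see the image" replies is to repeat the same call a few times and check whether the reply changes. The sketch below assumes nothing about the cause, and whether a retry actually helps is not established by this notebook. `looks_like_missing_image` and `complete_with_retry` are hypothetical helpers, and the substring check is a heuristic based on the replies in the table.

```python
from typing import Callable


def looks_like_missing_image(text: str) -> bool:
    """Heuristic check for the 'cannot see the image' reply seen in the table."""
    return "cannot see the image" in text.lower()


def complete_with_retry(call: Callable[[], object], attempts: int = 3) -> str:
    """Call `call()` up to `attempts` times, stopping at the first usable reply."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    reply = ""
    for _ in range(attempts):
        reply = str(call())
        if not looks_like_missing_image(reply):
            break
    return reply  # may still be the refusal if every attempt returned it
```

In the notebook, `call` could be a lambda wrapping `multi_modal_llm.complete(prompt=..., image_documents=[...])`.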
Replicate stream complete, async complete, and async stream complete modes¶
Initialize the Fuyu-8B model¶
In [ ]:
multi_modal_llm = ReplicateMultiModal(
    model=REPLICATE_MULTI_MODAL_LLM_MODELS["fuyu-8b"],
    max_new_tokens=100,
    temperature=0.1,
    num_input_files=1,
    top_p=0.9,
    num_beams=1,
    repetition_penalty=1,
)
Using async stream complete¶
In [ ]:
resp = await multi_modal_llm.astream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)
In [ ]:
async for delta in resp:
    print(delta.delta, end="")
The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award. In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
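When the full text is needed after streaming, the deltas can be accumulated instead of printed. The sketch below runs without Replicate access: `FakeChunk` and `fake_stream` are stand-ins that only model the `.delta` attribute used above, and `collect_text` is a hypothetical helper.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass
class FakeChunk:
    """Stand-in for a streamed response chunk; only `.delta` is modeled."""

    delta: str


async def fake_stream() -> AsyncIterator[FakeChunk]:
    """Hypothetical stream that yields three text fragments."""
    for piece in ["The image ", "features ", "a man."]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield FakeChunk(delta=piece)


async def collect_text(stream) -> str:
    """Concatenate the `.delta` of every chunk into the full response text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk.delta)
    return "".join(parts)
```

In the notebook, `await collect_text(resp)` would take the place of the `async for` loop that prints each delta.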
Using async complete¶
In [ ]:
resp = await multi_modal_llm.acomplete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)
In [ ]:
print(resp)
The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award. In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.
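An awaitable `acomplete` also makes it possible to send several prompts concurrently. The sketch below uses a stub in place of the real model so that it runs offline. `FakeLLM` and `ask_all` are hypothetical, and whether concurrent requests are appropriate for a given Replicate account is an assumption to check separately.

```python
import asyncio


class FakeLLM:
    """Stub with the same awaitable `acomplete` call shape used above."""

    async def acomplete(self, prompt, image_documents):
        await asyncio.sleep(0)  # simulate waiting on the network
        return f"answer to: {prompt}"


async def ask_all(llm, prompts, image_documents):
    """Send every prompt concurrently and return the answers in prompt order."""
    tasks = [
        llm.acomplete(prompt=p, image_documents=image_documents) for p in prompts
    ]
    # gather() preserves the order of the awaitables it was given.
    return await asyncio.gather(*tasks)
```

In the notebook this would be called as `await ask_all(multi_modal_llm, prompts, [image_documents[0]])`.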
Using stream complete¶
In [ ]:
resp = multi_modal_llm.stream_complete(
    prompt="tell me about this image",
    image_documents=[image_documents[0]],
)
In [ ]:
for delta in resp:
    print(delta.delta, end="")
The image features a man wearing a suit and tie, standing in front of a stage with a backdrop. He is holding a golden ball trophy, possibly an award, in his hands. The man appears to be posing for a photo, possibly celebrating his achievement or receiving an award. In the background, there are multiple people visible, possibly attending or participating in the event. The backdrop appears to be a large screen, possibly displaying information about the event or ceremony.