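Evaluating Multi-Modal RAG

In this notebook guide, we demonstrate how to evaluate a multi-modal RAG system. As in the text-only case, we evaluate the retriever and the generator separately. As alluded to in our blog post on evaluating multi-modal RAG, our approach applies adapted versions of the usual techniques for evaluating retrievers and generators in the text-only case. These adapted versions are part of the llama-index library (specifically, its evaluation module), and this notebook walks through how to apply them to your own evaluation use cases.

NOTE: The use case and its evaluation here are purely illustrative, meant only to demonstrate how our evaluation tools can be applied to a specific need. The results and analyses here are by no means rigorous, although we believe our tools can help you apply a higher standard of care to your own applications.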

In [ ]:

%pip install llama-index-llms-openai
%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-multi-modal-llms-replicate

In [ ]:

# %pip install llama_index ftfy regex tqdm -q
# %pip install git+https://github.com/openai/CLIP.git -q
# %pip install torch torchvision -q
# %pip install matplotlib scikit-image -q
# %pip install -U qdrant_client -q

In [ ]:

from PIL import Image
import matplotlib.pyplot as plt
import pandas as pd

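Use Case: Spelling In ASL

In this demonstration, we work with one specific use case: using images and text descriptions to sign the letters of the American Sign Language (ASL) alphabet.

The Query

For this demonstration, we use only one form of query. (This is not a truly representative use case, but the main focus here is to demonstrate how to evaluate with the llama-index evaluation tools.)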

In [ ]:

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?."

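The Dataset

Images

The images come from the ASL-Alphabet dataset on Kaggle. Note that they have been modified to simply include a label of the associated letter on the hand gesture image. We use these modified images as context for the user queries; they can be downloaded as a zip archive (see the cell below, where setting download_notebook_data to True downloads the dataset directly from this notebook).

Text Context

For text context, we use descriptions of each hand gesture sourced from https://www.deafblind.com/asl.html. We have conveniently stored these descriptions in a `json` file called `asl_text_descriptions.json`, which is included in the same zip download.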

In [ ]:

#######################################################################
## This notebook guide makes several calls to gpt-4v, which is       ##
## heavily rate limited. For convenience, you should download the    ##
## data files to avoid making such calls while still following along ##
## with the notebook. Unzip the zip file and store it in a folder    ##
## named asl_data in the same directory as this notebook.            ##
#######################################################################

download_notebook_data = False
if download_notebook_data:
    !wget "https://www.dropbox.com/scl/fo/tpesl5m8ye21fqza6wq6j/h?rlkey=zknd9pf91w30m23ebfxiva9xn&dl=1" -O asl_data.zip -q

To begin, let's load the context images into ImageDocuments and the context text into Documents.

In [ ]:

import json
from llama_index.core.multi_modal_llms.generic_utils import load_image_urls
from llama_index.core import SimpleDirectoryReader, Document

# context images
image_path = "./asl_data/images"
image_documents = SimpleDirectoryReader(image_path).load_data()

# context text
with open("asl_data/asl_text_descriptions.json") as json_file:
    asl_text_descriptions = json.load(json_file)
text_format_str = "To sign {letter} in ASL: {desc}."
text_documents = [
    Document(text=text_format_str.format(letter=k, desc=v))
    for k, v in asl_text_descriptions.items()
]

With our documents in hand, we can create our MultiModalVectorStoreIndex. To do so, we parse our Documents into nodes and then simply pass these nodes to the MultiModalVectorStoreIndex constructor.

In [ ]:

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter.from_defaults()
image_nodes = node_parser.get_nodes_from_documents(image_documents)
text_nodes = node_parser.get_nodes_from_documents(text_documents)

asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)

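Another RAG System For Consideration (GPT-4V Image Descriptions For Retrieval)

With the previous MultiModalVectorStoreIndex, the default embedding model for images is OpenAI CLIP. To draw comparisons with another RAG system (which is often the reason for performing RAG evaluation in the first place), we build a second RAG system that uses a different image embedding than the default one.

Specifically, we prompt GPT-4V to write a text description of every image, apply the usual text embeddings to these descriptions, and associate those embeddings with the images. In other words, the text-description embeddings are what this RAG system ultimately uses to perform retrieval.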
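
As a rough, self-contained sketch of this idea, the snippet below indexes images by an embedding of their text description rather than of their pixels. Everything in it is an assumption made for illustration: the descriptions are hypothetical stand-ins for GPT-4V output, and the bag-of-words `embed` function is a toy replacement for a real text embedding model.

```python
import math
from collections import Counter

# Hypothetical image descriptions, standing in for GPT-4V output.
descriptions = {
    "A.jpg": "a closed fist with the thumb resting against the side of the index finger",
    "B.jpg": "a flat open hand with four fingers together pointing up and the thumb folded across the palm",
    "R.jpg": "a hand with the index and middle fingers crossed and the other fingers folded down",
}


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call a text embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Each image is indexed by the embedding of its description, not of its pixels.
index = {path: embed(desc) for path, desc in descriptions.items()}


def retrieve(query: str, top_k: int = 1) -> list[str]:
    """Return the paths of the top_k images whose descriptions best match the query."""
    query_embedding = embed(query)
    ranked = sorted(
        index, key=lambda path: cosine(query_embedding, index[path]), reverse=True
    )
    return ranked[:top_k]


print(retrieve("index and middle fingers crossed"))
```

The retrieval result is still an image; only the representation used to find it has changed from an image embedding to a text embedding.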

In [ ]:

#######################################################################
## Set load_previously_generated_text_descriptions to True if you    ##
## would rather use the previously generated gpt-4v text             ##
## descriptions that are included in the .zip download               ##
#######################################################################

load_previously_generated_text_descriptions = True

In [ ]:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.schema import ImageDocument
import tqdm

if not load_previously_generated_text_descriptions:
    # define our lmm
    openai_mm_llm = OpenAIMultiModal(
        model="gpt-4-vision-preview", max_new_tokens=300
    )

    # make a new copy since we want to store text in its attribute
    image_with_text_documents = SimpleDirectoryReader(image_path).load_data()

    # get the text description and save it to the text attribute
    for img_doc in tqdm.tqdm(image_with_text_documents):
        response = openai_mm_llm.complete(
            prompt="Describe the images as an alternative text",
            image_documents=[img_doc],
        )
        img_doc.text = response.text

    # save so we don't have to incur expensive gpt-4v calls again
    desc_jsonl = [
        json.loads(img_doc.to_json()) for img_doc in image_with_text_documents
    ]
    with open("image_descriptions.json", "w") as f:
        json.dump(desc_jsonl, f)
else:
    # load up previously saved image descriptions and documents
    with open("asl_data/image_descriptions.json") as f:
        image_descriptions = json.load(f)

    image_with_text_documents = [
        ImageDocument.from_dict(el) for el in image_descriptions
    ]

# parse into nodes
image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

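A keen reader will notice that we stored the text descriptions in the text field of an ImageDocument. As before, to create a MultiModalVectorStoreIndex, we need to parse the ImageDocuments into ImageNodes and then pass the nodes to the constructor.

Note that when the ImageNodes used to build a MultiModalVectorStoreIndex have populated text fields, we can choose to use this text to build the embeddings used for retrieval. To do so, we simply set is_image_to_text to True.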

In [ ]:

image_with_text_nodes = node_parser.get_nodes_from_documents(
    image_with_text_documents
)

asl_text_desc_index = MultiModalVectorStoreIndex(
    nodes=image_with_text_nodes + text_nodes, is_image_to_text=True
)

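Build Our Multi-Modal RAG Systems

As in the text-only case, we need to "attach" a generator to our index (which can itself be used as a retriever) to finally assemble our RAG systems. In the multi-modal case, however, our generators are multi-modal LLMs (often referred to as Large Multi-Modal Models, or LMMs for short). In this notebook, to draw further comparisons between RAG systems, we use GPT-4V as well as LLaVA. We can "attach" a generator and get a queryable RAG interface by invoking the as_query_engine method of our indexes.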

In [ ]:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.replicate import ReplicateMultiModal
from llama_index.core import PromptTemplate

# define our QA prompt template
qa_tmpl_str = (
    "Images of hand gestures for ASL are provided.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "If the images provided cannot help in answering the query\n"
    "then respond that you are unable to answer the query. Otherwise,\n"
    "using only the context provided, and not prior knowledge,\n"
    "provide an answer to the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
qa_tmpl = PromptTemplate(qa_tmpl_str)

# define our lmms
openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview",
    max_new_tokens=300,
)

llava_mm_llm = ReplicateMultiModal(
    model="yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
    max_new_tokens=300,
)

# define our RAG query engines
rag_engines = {
    "mm_clip_gpt4v": asl_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_clip_llava": asl_index.as_query_engine(
        multi_modal_llm=llava_mm_llm,
        text_qa_template=qa_tmpl,
    ),
    "mm_text_desc_gpt4v": asl_text_desc_index.as_query_engine(
        multi_modal_llm=openai_mm_llm, text_qa_template=qa_tmpl
    ),
    "mm_text_desc_llava": asl_text_desc_index.as_query_engine(
        multi_modal_llm=llava_mm_llm, text_qa_template=qa_tmpl
    ),
}

# llava only supports 1 image per call at the moment
rag_engines["mm_clip_llava"].retriever.image_similarity_top_k = 1
rag_engines["mm_text_desc_llava"].retriever.image_similarity_top_k = 1

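Test Drive Our Multi-Modal RAG

Let's take one of these systems for a test drive. To display the response nicely, we use the notebook utility function display_query_and_multimodal_response.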

In [ ]:

letter = "R"
query = QUERY_STR_TEMPLATE.format(symbol=letter)
response = rag_engines["mm_text_desc_gpt4v"].query(query)

In [ ]:

from llama_index.core.response.notebook_utils import (
    display_query_and_multimodal_response,
)

display_query_and_multimodal_response(query, response)

Query: How can I sign a R?.
=======
Retrieved Images:

[retrieved image not shown]

=======
Response: To sign the letter "R" in American Sign Language (ASL), you would follow the instructions provided: the ring and little finger should be folded against the palm and held down by your thumb, while the index and middle finger are straight and crossed with the index finger in front to form the letter "R."
=======

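Retriever Evaluation

In this part of the notebook, we evaluate our retrievers. Recall that we essentially have two multi-modal retrievers: one that uses the default CLIP image embeddings, and another that uses embeddings of the associated gpt-4v text descriptions. Before the quantitative analysis of performance, we create a visualization of the top-1 retrieval for each ASL letter across all user queries for the text_desc_retriever (simply swap in clip_retriever if you prefer).

NOTE: Since we are not sending the retrieved documents to LLaVA, we can set image_similarity_top_k to a value greater than 1. When we perform generation evaluation, the RAG engines that use LLaVA must again be the ones from rag_engines defined above, which have this parameter set to 1.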

In [ ]:

# use the CLIP-based index as a retriever
clip_retriever = asl_index.as_retriever(image_similarity_top_k=2)

# use the text-description-based index as a retriever
text_desc_retriever = asl_text_desc_index.as_retriever(
    image_similarity_top_k=2
)

Visual

In [ ]:

from llama_index.core.schema import TextNode, ImageNode

f, axarr = plt.subplots(3, 9)
f.set_figheight(6)
f.set_figwidth(15)
ix = 0
for jx, letter in enumerate(asl_text_descriptions.keys()):
    retrieval_results = text_desc_retriever.retrieve(
        QUERY_STR_TEMPLATE.format(symbol=letter)
    )
    image_node = None
    text_node = None
    for r in retrieval_results:
        if isinstance(r.node, TextNode):
            text_node = r
        if isinstance(r.node, ImageNode):
            image_node = r
            break

    img_path = image_node.node.image_path
    image = Image.open(img_path).convert("RGB")
    axarr[int(jx / 9), jx % 9].imshow(image)
    axarr[int(jx / 9), jx % 9].set_title(f"Query: {letter}")

plt.setp(axarr, xticks=[0, 100, 200], yticks=[0, 100, 200])
f.tight_layout()
plt.show()

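As you can see, the retriever does a fairly decent job at top-1 retrieval. Now, we move on to a quantitative analysis of retriever performance.

Quantitative: Hit Rate and MRR

In our blog post (linked at the very beginning of this notebook), we mentioned that a sensible approach to evaluating multi-modal retrievers is to compute the usual retrieval evaluation metrics separately for image and text retrieval. This doubles the number of evaluation metrics compared to the text-only case, but it lets you debug your RAG/retriever in a more fine-grained fashion, which is important. If you want a single metric, then applying a weighted average, with weights tailored to your needs, seems a reasonable choice.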
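
A minimal sketch of the weighted-average idea, assuming you already have one score per modality. The function name `combined_metric`, the weights, and the scores are illustrative assumptions and not part of llama-index.

```python
def combined_metric(
    image_score: float, text_score: float, image_weight: float = 0.5
) -> float:
    """Weighted average of per-modality scores; image_weight must lie in [0, 1]."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    return image_weight * image_score + (1.0 - image_weight) * text_score


# Hypothetical per-modality hit rates, for illustration only.
image_hit_rate = 0.8
text_hit_rate = 1.0

# Equal weighting of both modalities.
print(combined_metric(image_hit_rate, text_hit_rate, image_weight=0.5))

# An application that cares more about image retrieval might weight it higher.
print(combined_metric(image_hit_rate, text_hit_rate, image_weight=0.75))
```

Keeping the per-modality scores around alongside the combined number preserves the fine-grained view needed for debugging.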

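To carry all of this out, we use MultiModalRetrieverEvaluator, which is similar to its uni-modal counterpart except that it can handle image and text retrieval evaluation separately, which is exactly what we want here.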

In [ ]:

from llama_index.core.evaluation import MultiModalRetrieverEvaluator

clip_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=clip_retriever
)

text_desc_retriever_evaluator = MultiModalRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=text_desc_retriever
)

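One important thing to note when computing evaluations is that you very often need ground-truth (sometimes also called labelled) data. For retrieval, this labelled data takes the form of (query, expected IDs) pairs, where the former is the user query and the latter represents the nodes (identified by their IDs) that should be retrieved.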
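
As a rough sketch of what such labelled data looks like, the snippet below builds the two mappings by hand with plain dictionaries. The node IDs and file paths are hypothetical placeholders, and the regular expression is an assumption about how letters appear in file names; only the query wording follows the template used in this notebook.

```python
import re
import uuid

# Hypothetical node IDs and file paths, for illustration only.
nodes = [
    {"id": "node-a", "file_path": "asl_data/images/A.jpg"},
    {"id": "node-b", "file_path": "asl_data/images/B.jpg"},
]

queries = {}  # query ID -> query string
relevant_docs = {}  # query ID -> list of node IDs that should be retrieved

for node in nodes:
    # Recover the letter that this node represents from its file name.
    match = re.search(r"([A-Z]+)\.jpg", node["file_path"])
    if match:
        query_id = str(uuid.uuid4())
        queries[query_id] = f"How can I sign a {match.group(1)}?."
        relevant_docs[query_id] = [node["id"]]

for query_id, query in queries.items():
    print(query, "->", relevant_docs[query_id])
```

Each query ID appears in both mappings, which is what ties a query to the nodes an evaluator expects to see in the retrieval results.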

For this guide, we write a specific helper function to build the LabelledQADataset object, which is precisely what we need.

In [ ]:

import uuid
import re
from llama_index.core.evaluation import LabelledQADataset


def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a LabelledQADataset that provides the expected node IDs
    for every query.

    NOTE: this is specific to the ASL use case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected IDs pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]

    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )

In [ ]:

# labelled dataset for image retrieval with asl_index.as_retriever()
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+)\.jpg)", image_nodes, "image"
)

# labelled dataset for text retrieval with asl_index.as_retriever()
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)

# labelled dataset for text-desc retrieval with asl_text_desc_index.as_retriever()
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+)\.jpg)", image_with_text_nodes, "image"
)

Now that we have our ground-truth data in hand, we can invoke the evaluate_dataset method (or its async version, aevaluate_dataset) of our MultiModalRetrieverEvaluator.

In [ ]:

eval_results_image = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_image
)
eval_results_text = await clip_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text
)
eval_results_text_desc = await text_desc_retriever_evaluator.aevaluate_dataset(
    qa_dataset_text_desc
)

We then use another notebook utility function, get_retrieval_results_df, which neatly renders our evaluation results as a pandas DataFrame.

In [ ]:

from llama_index.core.evaluation import get_retrieval_results_df

get_retrieval_results_df(
    names=["asl_index-image", "asl_index-text", "asl_text_desc_index"],
    results_arr=[
        eval_results_image,
        eval_results_text,
        eval_results_text_desc,
    ],
)

Out[ ]:

	retrievers	hit_rate	mrr
0	asl_index-image	0.814815	0.814815
1	asl_index-text	1.000000	1.000000
2	asl_text_desc_index	0.925926	0.925926

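Observations

As we can see, the text retrieval of the asl_index retriever is perfect. This is to be expected, since the QUERY_STR_TEMPLATE and the text_format_str used to create the texts stored in text_nodes are very similar.
CLIP embeddings for images do fairly well, though in this case the embedding representations derived from the GPT-4V text descriptions appear to lead to better retrieval performance.
Interestingly, whenever either retriever does retrieve the correct image, it places it in the first position, which is why hit_rate and mrr are identical for both: every hit contributes a reciprocal rank of 1 and every miss contributes 0, exactly as in the hit rate.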
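
To see why the two metrics coincide, here is a small sketch using the textbook definitions of hit rate and MRR (it is not the llama-index implementation). The rank lists are made up for illustration.

```python
def hit_rate(ranks: list, top_k: int) -> float:
    """Fraction of queries whose expected item appears within the top_k results."""
    return sum(1 for r in ranks if r is not None and r <= top_k) / len(ranks)


def mrr(ranks: list) -> float:
    """Mean reciprocal rank; a miss (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical 1-based ranks of the expected item per query; None marks a miss.
ranks_all_first = [1, 1, None, 1]
ranks_mixed = [1, 2, None, 1]

# Every hit is at rank 1, so the two metrics are equal.
print(hit_rate(ranks_all_first, top_k=2), mrr(ranks_all_first))

# One hit is at rank 2, so MRR drops below the hit rate.
print(hit_rate(ranks_mixed, top_k=2), mrr(ranks_mixed))
```

MRR can never exceed the hit rate, and the gap between them indicates how often the correct item is retrieved but ranked below first place.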

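Generation Evaluation

Let's now move on to evaluating the generated responses. For this, we consider the four multi-modal RAG systems built earlier:

mm_clip_gpt4v = multi-modal RAG with CLIP image encoder, lmm = GPT-4V, using both image_nodes and text_nodes
mm_clip_llava = multi-modal RAG with CLIP image encoder, lmm = LLaVA, using both image_nodes and text_nodes
mm_text_desc_gpt4v = multi-modal RAG with text-desc + ada image encoder, lmm = GPT-4V, using both image_with_text_nodes and text_nodes
mm_text_desc_llava = multi-modal RAG with text-desc + ada image encoder, lmm = LLaVA, using both image_with_text_nodes and text_nodes

As with the retriever evaluation, we now need ground-truth data for evaluating the generated responses. (Note that not all evaluation methods require ground truth, but we will use "Correctness", which requires a reference answer to compare against the generated one.)

Reference (Ground-Truth) Data

For this, we source another set of text descriptions of the ASL hand gestures. We found these to be more descriptive and felt they could very well represent reference answers to our ASL queries. The source is https://www.signingtime.com/dictionary/category/letters/; the descriptions have been pulled and stored in `human_responses.json`, which is also included in the data zip download linked at the very beginning of this notebook.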

In [ ]:

# references (ground truth) for our answers
with open("asl_data/human_responses.json") as json_file:
    human_answers = json.load(json_file)

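Generate Responses To ALL Queries For Each System

Now we loop through all of the queries and pass each one to all four RAG systems (i.e., via the QueryEngine.query() interface).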

In [ ]:

#######################################################################
## Set load_previous_responses to True if you would rather use       ##
## previously generated responses for all RAG systems. The json is   ##
## part of the .zip download                                         ##
#######################################################################

load_previous_responses = True

In [ ]:

import time
import tqdm

if not load_previous_responses:
    response_data = []
    for letter in tqdm.tqdm(asl_text_descriptions.keys()):
        data_entry = {}
        query = QUERY_STR_TEMPLATE.format(symbol=letter)
        data_entry["query"] = query

        responses = {}
        for name, engine in rag_engines.items():
            this_response = {}
            result = engine.query(query)
            this_response["response"] = result.response

            sources = {}
            source_image_nodes = []
            source_text_nodes = []

            # image sources
            source_image_nodes = [
                score_img_node.node.metadata["file_path"]
                for score_img_node in result.metadata["image_nodes"]
            ]

            # text sources
            source_text_nodes = [
                score_text_node.node.text
                for score_text_node in result.metadata["text_nodes"]
            ]

            sources["images"] = source_image_nodes
            sources["texts"] = source_text_nodes
            this_response["sources"] = sources

            responses[name] = this_response
        data_entry["responses"] = responses
        response_data.append(data_entry)

    # save the expensive gpt-4v responses
    with open("expensive_response_data.json", "w") as json_file:
        json.dump(response_data, json_file)
else:
    # load up the previously saved responses
    with open("asl_data/expensive_response_data.json") as json_file:
        response_data = json.load(json_file)

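Correctness, Faithfulness, Relevancy

With the generated responses in hand (stored in a custom data object tailored for this ASL use case, namely response_data), we can now compute their evaluation metrics:

Correctness (LLM-As-A-Judge)
Faithfulness (LMM-As-A-Judge)
Relevancy (LMM-As-A-Judge)

To compute all three, we prompt another generative model to provide a score assessing each criterion. For Correctness, since we do not consider context, the judge is an LLM. In contrast, to compute Faithfulness and Relevancy, we need to pass in the context, meaning both the images and the text that were supplied to the RAG to generate the response. Because both images and text must be passed in, the judge for Faithfulness and Relevancy must be an LMM (i.e., a multi-modal LLM).

We have these abstractions in our evaluation module, and we demonstrate their usage while looping over all of the generated responses.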

In [ ]:

from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

import os

judges = {}

judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)

In [ ]:

#######################################################################
## This section of the notebook can make a total of ~200 GPT-4V      ##
## calls, which is heavily rate limited (100 per day). To follow     ##
## along with previously generated evaluations, set                  ##
## load_previous_evaluations to True. To test out the evaluation     ##
## execution, set number_evals to any number between 1 and 27. The   ##
## json is part of the .zip download.                                ##
#######################################################################

load_previous_evaluations = True
number_evals = 27

In [ ]:

if not load_previous_evaluations:
    evals = {
        "names": [],
        "correctness": [],
        "relevancy": [],
        "faithfulness": [],
    }

    # loop through all responses and evaluate them
    for data_entry in tqdm.tqdm(response_data[:number_evals]):
        # extract the letter from the query; the question mark is escaped
        # so that it is matched literally
        reg_ex = r"How can I sign a ([A-Z]+)\?"
        match = re.search(reg_ex, data_entry["query"])

        batch_names = []
        batch_correctness = []
        batch_relevancy = []
        batch_faithfulness = []
        if match:
            letter = match.group(1)
            reference_answer = human_answers[letter]
            for rag_name, rag_response_data in data_entry["responses"].items():
                correctness_result = await judges["correctness"].aevaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    reference=reference_answer,
                )

                relevancy_result = judges["relevancy"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                faithfulness_result = judges["faithfulness"].evaluate(
                    query=data_entry["query"],
                    response=rag_response_data["response"],
                    contexts=rag_response_data["sources"]["texts"],
                    image_paths=rag_response_data["sources"]["images"],
                )

                batch_names.append(rag_name)
                batch_correctness.append(correctness_result)
                batch_relevancy.append(relevancy_result)
                batch_faithfulness.append(faithfulness_result)

            evals["names"] += batch_names
            evals["correctness"] += batch_correctness
            evals["relevancy"] += batch_relevancy
            evals["faithfulness"] += batch_faithfulness

    # save the evaluations
    evaluations_objects = {
        "names": evals["names"],
        "correctness": [e.dict() for e in evals["correctness"]],
        "faithfulness": [e.dict() for e in evals["faithfulness"]],
        "relevancy": [e.dict() for e in evals["relevancy"]],
    }
    with open("asl_data/evaluations.json", "w") as json_file:
        json.dump(evaluations_objects, json_file)
else:
    from llama_index.core.evaluation import EvaluationResult

    # load up the previously saved evaluations
    with open("asl_data/evaluations.json") as json_file:
        evaluations_objects = json.load(json_file)

    evals = {}
    evals["names"] = evaluations_objects["names"]
    evals["correctness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["correctness"]
    ]
    evals["faithfulness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["faithfulness"]
    ]
    evals["relevancy"] = [
        EvaluationResult.parse_obj(e) for e in evaluations_objects["relevancy"]
    ]
    evals["faithfulness"] = [
        EvaluationResult.parse_obj(e)
        for e in evaluations_objects["faithfulness"]
    ]
    evals["relevancy"] = [
        EvaluationResult.parse_obj(e) for e in evaluations_objects["relevancy"]
    ]

To view these results, we again use the notebook utility function get_eval_results_df.

In [ ]:

from llama_index.core.evaluation.notebook_utils import get_eval_results_df

deep_eval_df, mean_correctness_df = get_eval_results_df(
    evals["names"], evals["correctness"], metric="correctness"
)
_, mean_relevancy_df = get_eval_results_df(
    evals["names"], evals["relevancy"], metric="relevancy"
)
_, mean_faithfulness_df = get_eval_results_df(
    evals["names"], evals["faithfulness"], metric="faithfulness"
)

mean_scores_df = pd.concat(
    [
        mean_correctness_df.reset_index(),
        mean_relevancy_df.reset_index(),
        mean_faithfulness_df.reset_index(),
    ],
    axis=0,
    ignore_index=True,
)
mean_scores_df = mean_scores_df.set_index("index")
mean_scores_df.index = mean_scores_df.index.set_names(["metrics"])
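
The cell above stacks three one-row mean-score frames into a single table indexed by metric. The sketch below reproduces that reshaping on toy data so each step can be inspected without running the evaluators. The frame layout (one row per metric, one column per RAG, unnamed index) is an assumption inferred from the `reset_index` and `set_index("index")` calls; `toy_mean_df`, `rag_a`, and `rag_b` are hypothetical.

```python
import pandas as pd

rags = ["rag_a", "rag_b"]  # hypothetical RAG names


def toy_mean_df(metric_row, values):
    """Build a one-row frame shaped like the assumed mean-score output."""
    df = pd.DataFrame([values], columns=rags, index=[metric_row])
    df.columns.name = "rag"
    return df


mean_correctness = toy_mean_df("mean_correctness_score", [3.5, 4.0])
mean_relevancy = toy_mean_df("mean_relevancy_score", [0.75, 0.5])

# reset_index() moves the unnamed row label into a column called "index",
# so the frames can be stacked row-wise with a fresh integer index.
combined = pd.concat(
    [mean_correctness.reset_index(), mean_relevancy.reset_index()],
    axis=0,
    ignore_index=True,
)

# Promote the metric labels back to the row index and give the index a name.
combined = combined.set_index("index")
combined.index = combined.index.set_names(["metrics"])
```

The result has one row per metric and one column per RAG, which is the layout of the `mean_scores_df` table displayed later.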

In [ ]:

print(deep_eval_df[:4])

Out[ ]:

	rag	query	scores	feedbacks
0	mm_clip_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It accurately describes how to sign the letter 'A' in ASL, which matches the user query. However, it includes unnecessary information about images that were not mentioned in the user query, which slightly detracts from its overall correctness.
1	mm_clip_llava	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It provides the necessary steps to sign the letter 'A' in ASL, but it lacks the additional information about the hand position and the difference between 'A' and 'S' that the reference answer provides.
2	mm_text_desc_gpt4v	How can I sign a A?.	4.500000	The generated answer is relevant and mostly correct. It provides a clear description of how to sign the letter 'A' in American Sign Language, which matches the reference answer. However, it starts with an unnecessary statement about the lack of images, which is not relevant to the user's query.
3	mm_text_desc_llava	How can I sign a A?.	4.500000	The generated answer is relevant and almost fully correct. It accurately describes how to sign the letter 'A' in American Sign Language. However, it lacks the detail about the position of the hand (at shoulder height with palm facing out) that is present in the reference answer.

In [ ]:

mean_scores_df

Out[ ]:

rag	mm_clip_gpt4v	mm_clip_llava	mm_text_desc_gpt4v	mm_text_desc_llava
metrics
mean_correctness_score	3.685185	4.092593	3.722222	3.870370
mean_relevancy_score	0.777778	0.851852	0.703704	0.740741
mean_faithfulness_score	0.777778	0.888889	0.851852	0.851852

Observations¶

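In these results, the RAGs that use LLaVA score at least as well as those that use GPT-4V on correctness, relevancy, and faithfulness. LLaVA leads on every metric except faithfulness in the mm_text_desc setup, where the two tie at 0.851852.
On inspecting some of the responses, we noticed that GPT-4V answered the SPACE query as follows, even though the image had been retrieved correctly: "I'm sorry, but I'm unable to answer the query based on the images provided as the system doesn't allow me to visually analyze images at the moment. However, according to the context provided, to sign "SPACE" in ASL, you should hold your palm to the sky with your fingers curled upwards and thumb pointing up."
Generated responses of this kind may explain why the judges do not score GPT-4V's generations as highly as LLaVA's. A more thorough analysis would dig deeper into the generated responses and perhaps adjust the generation prompts and even the evaluation prompts.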
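
The LLaVA versus GPT-4V comparison can be checked directly against the mean-score table. The sketch below re-enters the values shown in that table by hand and subtracts the GPT-4V column from the LLaVA column for each retrieval setup. The `clip` and `text_desc` column labels are our own shorthand, and the rounding to six decimals simply matches the precision of the displayed table.

```python
import pandas as pd

# Values copied from the mean_scores_df output shown above.
mean_scores = pd.DataFrame(
    {
        "mm_clip_gpt4v": [3.685185, 0.777778, 0.777778],
        "mm_clip_llava": [4.092593, 0.851852, 0.888889],
        "mm_text_desc_gpt4v": [3.722222, 0.703704, 0.851852],
        "mm_text_desc_llava": [3.870370, 0.740741, 0.851852],
    },
    index=[
        "mean_correctness_score",
        "mean_relevancy_score",
        "mean_faithfulness_score",
    ],
)

# LLaVA minus GPT-4V, per retrieval setup; positive means LLaVA scored higher.
diff = pd.DataFrame(
    {
        "clip": mean_scores["mm_clip_llava"] - mean_scores["mm_clip_gpt4v"],
        "text_desc": mean_scores["mm_text_desc_llava"]
        - mean_scores["mm_text_desc_gpt4v"],
    }
).round(6)

print(diff)
```

Every difference is non-negative, and the only zero is faithfulness in the mm_text_desc setup. With a small number of evaluated queries, gaps of this size should be read as indicative rather than conclusive.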

Conclusion¶

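In this notebook, we demonstrated how to evaluate both the Retriever and the Generator of a multi-modal RAG. Specifically, we applied the existing llama-index evaluation tools to the ASL use case to illustrate how they can be adapted to your own evaluation needs. Note that multi-modal LLMs should still be considered beta, and special standards of care should be applied if they are going to be used in production systems to evaluate multi-modal responses.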