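Multimodal RAG integrates additional modalities into traditional text-based RAG, enhancing LLMs' question answering by providing extra context and grounding textual data for improved understanding.

Adopting the approach from the clothing matchmaker cookbook, we directly embed images for similarity search, bypassing the lossy process of text captioning, to boost retrieval accuracy.

Using CLIP-based embeddings further allows fine-tuning on specific data or updating with unseen images.

This technique is showcased by searching an enterprise knowledge base with user-provided tech images to deliver pertinent information.

Installation

First let's install the relevant packages.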

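# Installations
# CLIP itself is installed from the OpenAI GitHub repository below
%pip install torch
%pip install pillow
%pip install faiss-cpu
%pip install numpy
%pip install matplotlib
%pip install tqdm
%pip install git+https://github.com/openai/CLIP.git
%pip install openai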

Then let's import all the needed packages.

# model imports
import faiss
import json
import torch
from openai import OpenAI
import torch.nn as nn
from torch.utils.data import DataLoader
import clip
client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# helper imports
from tqdm import tqdm
import os
import numpy as np
import pickle
from typing import List, Union, Tuple

# visualisation imports
from PIL import Image
import matplotlib.pyplot as plt
import base64

Now let's load the CLIP model.

# Load the model on the device. The device you run inference/training on is either a CPU or a GPU, if you have one.
device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

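We will now:
1. Create the image embedding database
2. Set up a query to the vision model
3. Perform the semantic search
4. Pass a user query to the image

Creating the image embedding database

Next we will create our image embeddings knowledge base from a directory of images. This will be the knowledge base of technology that we search through to provide information to the user for an image they upload.

We pass in the directory in which we store our images (as JPEG files) and loop through each one to create our embeddings.

We also have a description.json file. It has an entry for every image in our knowledge base, with two keys: 'image_path' and 'description'. It maps each image to a useful description that helps answer the user's question.

Let's first write a function to get all the image paths in a given directory. We will then get all the .jpeg files from a directory called 'image_database'.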

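def get_image_paths(directory: str, number: int = None) -> List[str]:
    """Return the paths of all .jpeg files in `directory`.

    If `number` is given, return a one-element list holding only the image
    at that position (0-based) in the sorted listing.
    """
    image_paths = []
    count = 0
    # Sort so that positions are stable across calls and match the index order
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.jpeg'):
            image_paths.append(os.path.join(directory, filename))
            if number is not None and count == number:
                return [image_paths[-1]]
            count += 1
    return image_paths

direc = 'image_database/'
image_paths = get_image_paths(direc)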
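
FAISS returns positions rather than file names, so the order of this list has to stay the same between building the index and reading results back. The following is a minimal sketch of a variant listing function, assuming you also want to pick up .jpg files and match extensions case-insensitively. The names list_image_paths and IMAGE_SUFFIXES are hypothetical and not part of the original notebook.

```python
from pathlib import Path
from typing import List

# Hypothetical set of accepted extensions; the notebook's own function accepts only ".jpeg".
IMAGE_SUFFIXES = {".jpeg", ".jpg"}

def list_image_paths(directory: str) -> List[str]:
    """Return image paths in `directory`, sorted by name so positions are stable."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

With a stable ordering, position i in the list and row i in the vector index keep referring to the same image, as long as the directory contents do not change in between.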

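Next we will write a function to get the image embeddings from the CLIP model given a series of paths.

We first preprocess each image using the preprocess function we got earlier. This performs a few operations to ensure the input to the CLIP model has the right format and dimensions, including resizing, normalization, and color channel adjustment.

We then stack these preprocessed images together so we can pass them into the model in one batch rather than one at a time in a loop. Finally we return the model output, which is an array of embeddings.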

def get_features_from_image_path(image_paths):
    # Preprocess each image into a tensor of the shape CLIP expects
    images = [preprocess(Image.open(image_path).convert("RGB")) for image_path in image_paths]
    # Stack into a single batch and move it to the same device as the model
    image_input = torch.stack(images).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image_input).float()
    return image_features

image_features = get_features_from_image_path(image_paths)

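We can now create our vector database.

# FAISS expects a float32 NumPy array, so convert the torch tensor first
index = faiss.IndexFlatIP(image_features.shape[1])
index.add(image_features.cpu().numpy().astype(np.float32))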
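
One point worth knowing about IndexFlatIP is that it ranks by raw inner product, which depends on vector length as well as direction. If cosine similarity is what you want, a common approach is to L2-normalize the embeddings before adding them and before searching. The sketch below illustrates the effect in plain Python with made-up 2-D vectors, so it runs without the model or the index; l2_normalize and inner_product are hypothetical helper names. With FAISS itself, faiss.normalize_L2 applies the same normalization in place to a float32 array.

```python
import math
from typing import List

def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" (made-up values): b points the same way as the query
# but is short; a points in a different direction but is much longer.
query = [1.0, 0.0]
a = [3.0, 4.0]
b = [0.5, 0.0]

# Raw inner product favors the longer vector a.
raw = {"a": inner_product(query, a), "b": inner_product(query, b)}

# After normalization the inner product equals cosine similarity and favors b.
cos = {
    "a": inner_product(l2_normalize(query), l2_normalize(a)),
    "b": inner_product(l2_normalize(query), l2_normalize(b)),
}
```

Whether normalization changes the ranking in practice depends on how much the embedding norms vary across your images.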

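We also ingest our JSON file for the image-to-description mapping and create a list of JSON entries. We also create a helper function that searches this list for a given image so we can retrieve its description.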

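data = []
# Path of the example image that stands in for the user's upload
image_path = 'train1.jpeg'
# description.json holds one JSON object per line
with open('description.json', 'r') as file:
    for line in file:
        data.append(json.loads(line))

def find_entry(data, key, value):
    # Return the first entry whose `key` equals `value`, or None if there is no match
    for entry in data:
        if entry.get(key) == value:
            return entry
    return None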
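
The loader above reads one JSON object per line, each with an 'image_path' and a 'description' key. As a minimal sketch of that layout, the snippet below writes two made-up entries to a temporary file, reads them back the same way, and builds a dictionary keyed by path as an alternative to scanning the list. The file name, paths, and descriptions are invented for illustration.

```python
import json
import os
import tempfile

# Two made-up entries in the same shape as description.json:
# one JSON object per line, with 'image_path' and 'description' keys.
sample_entries = [
    {"image_path": "image_database/example1.jpeg", "description": "Example product A."},
    {"image_path": "image_database/example2.jpeg", "description": "Example product B."},
]

with tempfile.TemporaryDirectory() as tmp:
    sample_file = os.path.join(tmp, "description_sample.json")
    with open(sample_file, "w") as f:
        for entry in sample_entries:
            f.write(json.dumps(entry) + "\n")

    # Load it back line by line, skipping any blank lines.
    with open(sample_file, "r") as f:
        entries = [json.loads(line) for line in f if line.strip()]

# Index the entries by path so a lookup does not scan the whole list.
by_path = {entry["image_path"]: entry for entry in entries}
```

For a small knowledge base the linear scan in find_entry is fine; the dictionary mainly helps once there are many entries or many lookups.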

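Let's display an example image; this will be the user-uploaded image. It is a piece of tech that was unveiled at CES 2024: the DELTA Pro Ultra Whole House Battery Generator.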

im = Image.open(image_path)
plt.imshow(im)
plt.show()

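Querying the vision model

Now let's have a look at what GPT-4 Vision (which wouldn't have seen this technology before) will label it as.

First we need a function to encode our image in base64, as this is the format we will pass to the vision model. Then we will create a generic image_query function that lets us query the LLM with image input.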

def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return encoded_image.decode('utf-8')
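
The image_query function in this notebook wraps the base64 string in a data URL with a hard-coded image/jpeg type. If your knowledge base mixes formats, one option is to guess the MIME type from the file name. This is a minimal sketch using only the standard library; to_data_url is a hypothetical helper, and falling back to image/jpeg is an assumption that mirrors the notebook's default.

```python
import base64
import mimetypes

def to_data_url(image_path: str) -> str:
    """Encode an image file as a base64 data URL, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        # Assumed fallback, matching the type hard-coded in this notebook
        mime_type = "image/jpeg"
    with open(image_path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string could be used as the "url" value of the image_url content part in place of the f-string.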

def image_query(query, image_path):
    response = client.chat.completions.create(
        model='gpt-4-vision-preview',
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": query,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}",
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    # Extract the text of the model's reply from the response
    return response.choices[0].message.content
image_query('Write a short label of what is shown in this image?', image_path)

'Autonomous Delivery Robot'

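As we can see, the model does its best with what it was trained on, but it makes a mistake because it has not seen anything similar in its training data. The image is also ambiguous, which makes it hard to extrapolate and deduce what it shows.

Performing semantic search

Now let's perform a similarity search to find the two most similar images in our knowledge base. We do this by getting the embedding of the user-supplied image path and retrieving the indices and scores of the most similar images in our database. Because the index is an IndexFlatIP, the values FAISS returns (named distances in the code) are inner products, so a larger value means more similar. We then sort by that score in descending order, most similar first.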

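image_search_embedding = get_features_from_image_path([image_path])
# FAISS expects a float32 NumPy array of shape (n_queries, dim)
query_vector = image_search_embedding.cpu().numpy().astype(np.float32).reshape(1, -1)
distances, indices = index.search(query_vector, 2)  # 2 is the number of most similar images to return
distances = distances[0]
indices = indices[0]
indices_distances = list(zip(indices, distances))
# Inner-product scores: higher means more similar, so sort in descending order
indices_distances.sort(key=lambda x: x[1], reverse=True)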
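
To make the ranking concrete, here is a minimal brute-force sketch of the same computation a flat inner-product index performs: score every stored vector against the query and keep the k highest. It uses plain Python lists and made-up 3-D vectors rather than real CLIP embeddings, and top_k_inner_product is a hypothetical helper, not a FAISS function.

```python
from typing import List, Tuple

def top_k_inner_product(
    query: List[float], database: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k highest inner products, best first."""
    scores = [
        (i, sum(q * x for q, x in zip(query, row)))
        for i, row in enumerate(database)
    ]
    # Higher inner product means more similar, so sort in descending order
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]

# Made-up vectors standing in for image embeddings
database = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.6, 0.0, 0.8],
]
query = [1.0, 0.0, 0.0]
results = top_k_inner_product(query, database, k=2)
```

Here results holds the positions of the two best-matching rows, which play the same role as the indices used to look images up in the next step.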

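We need the indices because we will use them to look through our image directory and select the image at each index position to feed into the vision model for RAG.

Let's see what it brought back (we display these in order of similarity):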

# Display the similar images
for idx, distance in indices_distances:
    print(idx)
    path = get_image_paths(direc, idx)[0]
    im = Image.open(path)
    plt.imshow(im)
    plt.show()

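We can see here that it brought back two images that contain the DELTA Pro Ultra Whole House Battery Generator. One of the images also has some background that could be distracting, but the search still finds the right image.

User querying the most similar image

Now, for our most similar image, we want to pass it and its description to GPT-4 Vision together with a user query, so that users can ask about the technology they may have bought. This is where the power of the vision model comes in: you can ask general queries that the model has not been explicitly trained on, and it can respond with high accuracy.

In our example below, we will ask about the capacity of the item in question.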

similar_path = get_image_paths(direc, indices_distances[0][0])[0]
element = find_entry(data, 'image_path', similar_path)

user_query = 'What is the capacity of this item?'
prompt = f"""
Below is a user query, I want you to answer the query using the description and image provided.

user query:
{user_query}

description:
{element['description']}
"""
image_query(prompt, similar_path)

'The portable home battery DELTA Pro has a base capacity of 3.6kWh. This capacity can be expanded up to 25kWh with additional batteries. The image showcases the DELTA Pro, which has an impressive 3600W power capacity for AC output as well.'

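We see that it is able to answer the question. This was only possible by matching images directly and then gathering the relevant description as context.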
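
One fragile spot in this step is that find_entry returns None when no entry in description.json matches the retrieved path, in which case element['description'] raises an error. The following is a minimal sketch of a guard, assuming you would rather fall back to an image-only answer than fail. build_prompt is a hypothetical helper and the fallback wording is invented.

```python
from typing import Optional

def build_prompt(user_query: str, entry: Optional[dict]) -> str:
    """Build the RAG prompt, with a fallback when no description was found."""
    if entry is None or not entry.get("description"):
        # Hypothetical fallback wording; adjust to your use case
        description = "No description is available for this image."
    else:
        description = entry["description"]
    return (
        "Below is a user query, I want you to answer the query using the "
        "description and image provided.\n\n"
        f"user query:\n{user_query}\n\n"
        f"description:\n{description}\n"
    )
```

The result can be passed to image_query in place of the inline f-string, for example image_query(build_prompt(user_query, element), similar_path).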

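Conclusion

In this notebook, we have gone over how to use the CLIP model, shown an example of creating an image embedding database with it, performed semantic search, and finally passed a user query to the vision model to answer a question.

This usage pattern applies across many different domains and can easily be improved to further enhance the technique. For example, you could fine-tune CLIP, improve the retrieval process as in RAG, or prompt-engineer GPT-4 Vision.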