Multimodal RAG for processing videos using OpenAI GPT4V and LanceDB vector store¶

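In this notebook, we showcase a multimodal RAG architecture designed for video processing. We use the OpenAI GPT4V MultiModal LLM class together with CLIP, which generates the multimodal embeddings. We also use LanceDBVectorStore for efficient vector storage.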

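Steps:

1. Download the video from YouTube, process it, and store it.
2. Build a multimodal index and vector store for both texts and images.
3. Retrieve relevant images and context, and use both to augment the prompt.
4. Use GPT4V to reason about the correlation between the input query and the augmented data, and generate the final response.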

In [ ]:

%pip install llama-index-vector-stores-lancedb
%pip install llama-index-multi-modal-llms-openai

In [ ]:

%pip install llama-index-multi-modal-llms-openai
%pip install llama-index-vector-stores-lancedb
%pip install llama-index-embeddings-clip

In [ ]:

%pip install llama_index ftfy regex tqdm
%pip install -U openai-whisper
%pip install git+https://github.com/openai/CLIP.git
%pip install torch torchvision
%pip install matplotlib scikit-image
%pip install lancedb
%pip install moviepy
%pip install pytube
%pip install pydub
%pip install SpeechRecognition
%pip install ffmpeg-python
%pip install soundfile

In [ ]:

from moviepy.editor import VideoFileClip
from pathlib import Path
import speech_recognition as sr
from pytube import YouTube
from pprint import pprint

In [ ]:

import os

OPENAI_API_KEY = ""
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

Set the configuration for the inputs below¶

In [ ]:

video_url = "https://www.youtube.com/watch?v=d_qvLDhkg00"
output_video_path = "./video_data/"
output_folder = "./mixed_data/"
output_audio_path = "./mixed_data/output_audio.wav"

filepath = output_video_path + "input_vid.mp4"
Path(output_folder).mkdir(parents=True, exist_ok=True)

Download and process the video into a format suitable for generating and storing embeddings¶

In [ ]:

from PIL import Image
import matplotlib.pyplot as plt
import os


def plot_images(image_paths):
    images_shown = 0
    plt.figure(figsize=(16, 9))
    for img_path in image_paths:
        if os.path.isfile(img_path):
            image = Image.open(img_path)

            plt.subplot(2, 3, images_shown + 1)
            plt.imshow(image)
            plt.xticks([])
            plt.yticks([])

            images_shown += 1
            # The 2x3 grid holds at most 6 images
            if images_shown >= 6:
                break

In [ ]:

def download_video(url, output_path):
    """
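    Download a video from the given url and save it to the output path.

    Parameters:
    url (str): The url of the video to download.
    output_path (str): The path where the video is saved.

    Returns:
    dict: A dictionary containing the metadata of the video.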
    """
    yt = YouTube(url)
    metadata = {"Author": yt.author, "Title": yt.title, "Views": yt.views}
    yt.streams.get_highest_resolution().download(
        output_path=output_path, filename="input_vid.mp4"
    )
    return metadata


def video_to_images(video_path, output_folder):
    """
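    Convert a video to a sequence of images and save them to the output folder.

    Parameters:
    video_path (str): The path to the video file.
    output_folder (str): The path to the folder where the images are saved.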

    """
    clip = VideoFileClip(video_path)
    clip.write_images_sequence(
        os.path.join(output_folder, "frame%04d.png"), fps=0.2
    )


def video_to_audio(video_path, output_audio_path):
    """
    Convert a video to audio and save it to the output path.

    Parameters:
    video_path (str): The path to the video file.
    output_audio_path (str): The path where the audio is saved.

    """
    clip = VideoFileClip(video_path)
    audio = clip.audio
    audio.write_audiofile(output_audio_path)


def audio_to_text(audio_path):
    """
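    Convert audio to text using the SpeechRecognition library.

    Parameters:
    audio_path (str): The path to the audio file.

    Returns:
    text (str): The text recognized from the audio.

    """
    recognizer = sr.Recognizer()
    audio = sr.AudioFile(audio_path)
    text = ""  # returned as-is if recognition fails

    with audio as source:
        # Record the audio data
        audio_data = recognizer.record(source)

        try:
            # Recognize the speech
            text = recognizer.recognize_whisper(audio_data)
        except sr.UnknownValueError:
            print("Speech recognition could not understand the audio.")
        except sr.RequestError as e:
            print(f"Could not request results from service; {e}")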

    return text
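
With `fps=0.2`, `video_to_images` keeps one frame every five seconds of video. The sketch below maps a frame file name back to an approximate timestamp, which can help locate a retrieved frame in the original video. It assumes frame numbering starts at 0 and that frame `i` is sampled at `i / fps` seconds, which is how moviepy 1.x appears to behave. `frame_timestamp` is a hypothetical helper and is not part of the notebook.

```python
import re

FPS = 0.2  # same value passed to write_images_sequence in video_to_images


def frame_timestamp(frame_name: str, fps: float = FPS) -> float:
    """Return the approximate video time in seconds for a frame file name."""
    match = re.search(r"frame(\d+)\.png$", frame_name)
    if match is None:
        raise ValueError(f"Not a frame file name: {frame_name}")
    # Assumption: frame i was sampled at i / fps seconds
    return int(match.group(1)) / fps


print(frame_timestamp("./mixed_data/frame0012.png"))
```

If the numbering convention differs in your moviepy version, only the final division needs adjusting.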

In [ ]:

metadata_vid = download_video(video_url, output_video_path)
video_to_images(filepath, output_folder)
video_to_audio(filepath, output_audio_path)
text_data = audio_to_text(output_audio_path)

# The with-block closes the file automatically
with open(output_folder + "output_text.txt", "w") as file:
    file.write(text_data)
print("Text data saved to file")

# The audio file is only needed for transcription
os.remove(output_audio_path)
print("Audio file removed")
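
Before building the index, it can be worth checking that the processing step actually produced frames and a transcript. The following is a minimal sketch of such a check; `summarize_output` is a hypothetical helper, and the `frame*.png` and `*.txt` patterns simply mirror the file names used above.

```python
from pathlib import Path


def summarize_output(folder: str) -> dict:
    """Count the extracted frames and transcript files in a folder."""
    root = Path(folder)
    frames = sorted(root.glob("frame*.png"))
    transcripts = sorted(root.glob("*.txt"))
    return {
        "frames": len(frames),
        "transcripts": len(transcripts),
        # Total transcript length in characters, a rough signal that speech was recognized
        "transcript_chars": sum(len(p.read_text()) for p in transcripts),
    }


print(summarize_output("./mixed_data/"))
```

A result with zero frames or an empty transcript suggests the download or transcription step failed and should be rerun before indexing.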

Create the multimodal index¶

In [ ]:

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.lancedb import LanceDBVectorStore

text_store = LanceDBVectorStore(uri="lancedb", table_name="text_collection")
image_store = LanceDBVectorStore(uri="lancedb", table_name="image_collection")
storage_context = StorageContext.from_defaults(
    vector_store=text_store, image_store=image_store
)

# Create the MultiModal index
documents = SimpleDirectoryReader(output_folder).load_data()

index = MultiModalVectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

Use the index as a retriever to fetch the top k (5 in this example) results from the multimodal vector index¶

In [ ]:

retriever_engine = index.as_retriever(
    similarity_top_k=5, image_similarity_top_k=5
)
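
To make "top k by similarity" concrete, here is a small conceptual sketch in plain Python. It ranks toy 3-dimensional vectors by cosine similarity to a query vector and keeps the best `k`. The vectors, file names, and the `top_k` helper are illustrative assumptions; the actual distance metric and search strategy used by LanceDB may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=5):
    """Return the k (name, score) pairs most similar to the query vector."""
    scored = [
        (name, cosine_similarity(query, vec)) for name, vec in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real text and image embeddings have hundreds of dimensions
candidates = {
    "frame0001.png": [1.0, 0.0, 0.0],
    "frame0002.png": [0.7, 0.7, 0.0],
    "frame0003.png": [0.0, 0.0, 1.0],
}
results = top_k([1.0, 0.1, 0.0], candidates, k=2)
print(results)
```

In the notebook, `similarity_top_k` and `image_similarity_top_k` play the role of `k` for the text and image collections respectively.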

Set up the RAG prompt template¶

In [ ]:

import json

metadata_str = json.dumps(metadata_vid)

qa_tmpl_str = (
    "Given the provided information, including relevant images and retrieved context from the video, "
    "accurately and precisely answer the query without any additional prior knowledge.\n"
    "Please ensure honesty and responsibility, refraining from any racist or sexist remarks.\n"
    "---------------------\n"
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str} \n"
    "---------------------\n"
    "Query: {query_str}\n"
    "Answer: "
)
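
The template is filled later with `str.format`. The sketch below shows that step on a shortened stand-in template so the resulting prompt is easy to inspect. All values are hypothetical placeholders, not output from the video.

```python
import json

# Shortened stand-in for qa_tmpl_str; the values below are hypothetical
template = (
    "Context: {context_str}\n"
    "Metadata for video: {metadata_str}\n"
    "Query: {query_str}\n"
    "Answer: "
)

metadata_str = json.dumps(
    {"Author": "example author", "Title": "example title", "Views": 123}
)
prompt = template.format(
    context_str="first retrieved chunk second retrieved chunk",
    metadata_str=metadata_str,
    query_str="What does the video say about convolution?",
)
print(prompt)
```

The JSON metadata contains braces, but `str.format` only interprets braces in the template itself, not in the substituted values, so the metadata is inserted verbatim.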

Retrieve the text and image embeddings most similar to the user query from the database¶

In [ ]:

from llama_index.core.response.notebook_utils import display_source_node
from llama_index.core.schema import ImageNode


def retrieve(retriever_engine, query_str):
    retrieval_results = retriever_engine.retrieve(query_str)

    retrieved_image = []
    retrieved_text = []
    for res_node in retrieval_results:
        if isinstance(res_node.node, ImageNode):
            retrieved_image.append(res_node.node.metadata["file_path"])
        else:
            display_source_node(res_node, source_length=200)
            retrieved_text.append(res_node.text)

    return retrieved_image, retrieved_text

Now add a query, retrieve the relevant details (including images), and augment the prompt template¶

In [ ]:

query_str = "Using examples from video, explain all things covered in the video regarding the gaussian function"

img, txt = retrieve(retriever_engine=retriever_engine, query_str=query_str)
image_documents = SimpleDirectoryReader(
    input_dir=output_folder, input_files=img
).load_data()
context_str = "".join(txt)
plot_images(img)

Node ID: bda08ef1-137c-4d69-9bcc-b7005a41a13c
Similarity: 0.7431071996688843
Text: The basic function underlying a normal distribution, aka a Gaussian, is e to the negative x squared. But you might wonder why this function? Of all the expressions we could dream up that give you s...

Node ID: 7d6d0f32-ce16-461b-be54-883241252e50
Similarity: 0.7335695028305054
Text: This step is actually pretty technical, it goes a little beyond what I want to talk about here. Often use these objects called moment generating functions, that gives you a very abstract argument t...

Node ID: 519fb788-3927-4842-ad5c-88be61deaf65
Similarity: 0.7069740295410156
Text: The essence of what we want to compute is what the convolution between two copies of this function looks like. If you remember, in the last video, we had two different ways to visualize convolution...

Node ID: f265c3fb-3c9f-4f36-aa2a-fb15efff9783
Similarity: 0.706935465335846
Text: This is the important point. All of the stuff that's involving s is now entirely separate from the integrated variable. This remaining integral is a little bit tricky. I did a whole video on it. It...

Generate the final response using GPT4V¶

In [ ]:

from llama_index.multi_modal_llms.openai import OpenAIMultiModal

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4-vision-preview", api_key=OPENAI_API_KEY, max_new_tokens=1500
)


response_1 = openai_mm_llm.complete(
    prompt=qa_tmpl_str.format(
        context_str=context_str, query_str=query_str, metadata_str=metadata_str
    ),
    image_documents=image_documents,
)

pprint(response_1.text)

('The video by 3Blue1Brown, titled "A pretty reason why Gaussian + Gaussian = '
 'Gaussian," covers several aspects of the Gaussian function, also known as '
 "the normal distribution. Here's a summary of the key points discussed in the "
 'video:\n'
 '\n'
 '1. **Central Limit Theorem**: The video begins by discussing the central '
 'limit theorem, which states that the sum of multiple copies of a random '
 'variable tends to look like a normal distribution. As the number of '
 'variables increases, the approximation to a normal distribution becomes '
 'better.\n'
 '\n'
 '2. **Convolution of Random Variables**: The process of adding two random '
 'variables is mathematically represented by a convolution of their respective '
 'distributions. The video explains the concept of convolution and how it is '
 'used to find the distribution of the sum of two random variables.\n'
 '\n'
 '3. **Gaussian Function**: The Gaussian function is more complex than just '
 '\\( e^{-x^2} \\). The full formula includes a scaling factor to ensure the '
 'area under the curve is 1 (making it a valid probability distribution), a '
 'standard deviation parameter \\( \\sigma \\) to describe the spread, and a '
 'mean parameter \\( \\mu \\) to shift the center. However, the video focuses '
 'on centered distributions with \\( \\mu = 0 \\).\n'
 '\n'
 '4. **Visualizing Convolution**: The video presents a visual method to '
 'understand the convolution of two Gaussian functions using diagonal slices '
 'on the xy-plane. This method involves looking at the probability density of '
 'landing on a point (x, y) as \\( f(x) \\times g(y) \\), where f and g are '
 'the two distributions being convolved.\n'
 '\n'
 '5. **Rotational Symmetry**: A key property of the Gaussian function is its '
 'rotational symmetry, which is unique to bell curves. This symmetry is '
 'exploited in the video to simplify the calculation of the convolution. By '
 'rotating the graph 45 degrees, the computation becomes easier because the '
 'integral only involves one variable.\n'
 '\n'
 '6. **Result of Convolution**: The video demonstrates that the convolution of '
 'two Gaussian functions is another Gaussian function. This is a special '
 'property because convolutions typically result in a different kind of '
 'function. The standard deviation of the resulting Gaussian is \\( \\sqrt{2} '
 '\\times \\sigma \\) if the original Gaussians had the same standard '
 'deviation.\n'
 '\n'
 '7. **Proof of Central Limit Theorem**: The video explains that the '
 'convolution of two Gaussians being another Gaussian is a crucial step in '
 'proving the central limit theorem. It shows that the Gaussian function is a '
 'fixed point in the space of distributions, and since all distributions with '
 'finite variance tend towards a single universal shape, that shape must be '
 'the Gaussian.\n'
 '\n'
 '8. **Connection to Pi**: The video also touches on the connection between '
 'the Gaussian function and the number Pi, which appears in the formula for '
 'the normal distribution.\n'
 '\n'
 'The video aims to provide an intuitive geometric argument for why the sum of '
 'two normally distributed random variables is also normally distributed, and '
 'how this relates to the central limit theorem and the special properties of '
 'the Gaussian function.')