Processing and narrating a video with GPT-4o's visual capabilities and the TTS API

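This notebook demonstrates how to use GPT's visual capabilities with a video. GPT-4o does not accept video as input directly, but we can use vision and the 128K context window to describe the static frames of a whole video in a single request. We'll walk through two examples: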

1. Using GPT-4o to get a description of a video
2. Generating a voiceover for a video with GPT-4o and the TTS API

from IPython.display import display, Image, Audio

import cv2  # We're using OpenCV to read the video; to install it, run `!pip install opencv-python`
import base64
import time
from openai import OpenAI
import os
import requests

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

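1. Using GPT's visual capabilities to get a description of a video

First, we use OpenCV to extract frames from a nature video containing bison and wolves: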

video = cv2.VideoCapture("data/bison.mp4")

base64Frames = []
while video.isOpened():
    success, frame = video.read()
    if not success:
        break
    _, buffer = cv2.imencode(".jpg", frame)
    base64Frames.append(base64.b64encode(buffer).decode("utf-8"))

video.release()
print(len(base64Frames), "frames read.")

618 frames read.

Display the frames to make sure we've read them in correctly:

display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8"))))
    time.sleep(0.025)
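
All frames are now held in memory as base64 strings, but only a handful need to go to the model. As a minimal sketch (the helper `sample_frames`, the placeholder data, and the budget of 13 frames are illustrative assumptions, not part of the notebook), a frame budget can be turned into a constant stride like this:

```python
import math


def sample_frames(frames, max_frames):
    """Return at most max_frames items from frames, taken at a constant stride."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    if not frames:
        return []
    # Round the stride up so the result never exceeds the budget.
    stride = math.ceil(len(frames) / max_frames)
    return frames[::stride]


# Stand-in data: 618 placeholder strings instead of real base64 frames.
placeholder_frames = [f"frame-{i}" for i in range(618)]
sampled = sample_frames(placeholder_frames, 13)
print(len(sampled), "frames selected")  # prints: 13 frames selected
```

With the real data, `sample_frames(base64Frames, 13)` would play the same role as a fixed slice such as `base64Frames[0::50]`, which also yields 13 of the 618 frames.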

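Once we have the video frames, we craft our prompt and send a request to GPT (note that we don't need to send every frame for GPT to understand what's going on):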

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::50]),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 200,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

Title: "Epic Wildlife Showdown: Wolves vs. Bison in the Snow"

Description: 
Witness the raw power and strategy of nature in this intense and breathtaking video! A pack of wolves face off against a herd of bison in a dramatic battle for survival set against a stunning snowy backdrop. See how the wolves employ their cunning tactics while the bison demonstrate their strength and solidarity. This rare and unforgettable footage captures the essence of the wild like never before. Who will prevail in this ultimate test of endurance and skill? Watch to find out and experience the thrill of the wilderness! 🌨️🦊🐂 #Wildlife #NatureDocumentary #AnimalKingdom #SurvivalOfTheFittest #NatureLovers
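
The request above passes each frame in a compact `{"image": ..., "resize": 768}` form. The Chat Completions API reference describes image input as typed content parts instead, with the frame supplied as a base64 data URL. The sketch below builds a message in that documented shape. The helper name `build_vision_message` is hypothetical, and the `detail` setting is an assumption chosen here to keep the request small; it is not an exact equivalent of `resize`.

```python
def build_vision_message(prompt, base64_frames, detail="low"):
    """Build one user message from a text prompt and base64-encoded JPEG frames."""
    content = [{"type": "text", "text": prompt}]
    for b64 in base64_frames:
        content.append(
            {
                "type": "image_url",
                "image_url": {
                    # Frames were encoded as JPEG, so use the matching MIME type.
                    "url": f"data:image/jpeg;base64,{b64}",
                    "detail": detail,  # "low", "high", or "auto"
                },
            }
        )
    return {"role": "user", "content": content}


# Stand-in data: two short base64 strings instead of real frames.
example = build_vision_message("Describe these frames.", ["QUJD", "REVG"])
print(len(example["content"]), "content parts")
```

In the notebook, the result could be used as `messages=[build_vision_message(prompt_text, base64Frames[0::50])]` in the same `client.chat.completions.create` call, where `prompt_text` stands for the instruction string.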

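2. Generating a voiceover for a video with GPT-4o and the TTS API

Let's create a voiceover for this video in the style of David Attenborough. Using the same video frames, we prompt GPT to give us a short script: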

PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames of a video. Create a short voiceover script in the style of David Attenborough. Only include the narration.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::60]),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

In the frozen expanses of the North American wilderness, a battle unfolds—a testament to the harsh realities of survival.

The pack of wolves, relentless and coordinated, closes in on the mighty bison. Exhausted and surrounded, the bison relies on its immense strength and bulk to fend off the predators.

But the wolves are cunning strategists. They work together, each member playing a role in the hunt, nipping at the bison's legs, forcing it into a corner.

The alpha female leads the charge, her pack following her cues. They encircle their prey, tightening the noose with every passing second.

The bison makes a desperate attempt to escape, but the wolves latch onto their target, wearing it down through sheer persistence and teamwork.

In these moments, nature's brutal elegance is laid bare—a primal dance where only the strongest and the most cunning can thrive.

The bison, now overpowered and exhausted, faces its inevitable fate. The wolves have triumphed, securing a meal that will sustain their pack for days to come.

And so, the cycle of life continues, as it has for millennia, in this unforgiving land where the struggle for survival is an unending battle.

Now we can pass the script to the TTS API, which will generate an mp3 of the voiceover:

response = requests.post(
    "https://api.openai.com/v1/audio/speech",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    },
    json={
        "model": "tts-1-1106",
        "input": result.choices[0].message.content,
        "voice": "onyx",
    },
)

audio = b""
for chunk in response.iter_content(chunk_size=1024 * 1024):
    audio += chunk
Audio(audio)
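
The loop above concatenates the response chunks without checking the HTTP status, so an error payload would be handed to `Audio` as if it were sound. A minimal sketch of a more defensive version follows. The helpers `collect_audio` and `save_audio` and the file name `voiceover.mp3` are hypothetical names introduced for illustration; the only library calls assumed are `raise_for_status()` and `iter_content()` on a `requests` response.

```python
from pathlib import Path


def collect_audio(response, chunk_size=1024 * 1024):
    """Return the full body of a requests-style response as bytes.

    Raises the library's HTTP error first if the status code signals failure,
    so an error payload is never mistaken for audio.
    """
    response.raise_for_status()
    return b"".join(response.iter_content(chunk_size=chunk_size))


def save_audio(audio_bytes, path="voiceover.mp3"):  # hypothetical file name
    """Write the audio bytes to disk and return the path that was written."""
    out = Path(path)
    out.write_bytes(audio_bytes)
    return out
```

With the `response` object from the request above, usage would look like `audio = collect_audio(response)`, then `save_audio(audio)` to keep the file, and `Audio(audio)` to play it in the notebook.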
