
Introduction to GPT-4o


GPT-4o ("o" for "omni") is designed to handle a combination of text, audio, and video inputs, and can generate outputs in text, audio, and image formats.

Background

Before GPT-4o, users could interact with ChatGPT using Voice Mode, which operated with three separate models. GPT-4o will bring these capabilities together into a single model trained across text, vision, and audio. This unified approach ensures that all inputs, whether text, visual, or auditory, are processed cohesively by the same neural network.

Current API Capabilities

Currently, the API supports {text, image} inputs only, with {text} outputs, the same modalities as gpt-4-turbo. Additional modalities, including audio, will be introduced soon. This guide will help you get started with using GPT-4o for text, image, and video understanding.

Getting Started

Install the OpenAI SDK for Python

%pip install --upgrade openai --quiet

Configure the OpenAI client and submit a test request

To set up the client for our use, we need to create an API key to use with our requests. Skip these steps if you already have an API key you can use.

You can get an API key by following these steps:

1. Create a new project
2. Generate an API key in your project
3. (Recommended, but not required) Set your API key as an environment variable for all projects (see the sketch below)
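
If you set the key as an environment variable, a minimal sanity check like the one below (not part of the original notebook) can catch a missing key before any request is made:

import os

# Optional: fail early with a clear message if OPENAI_API_KEY is not set.
# The client configuration below also accepts a literal key, so skip this if you pass one directly.
assert os.environ.get("OPENAI_API_KEY"), "Set the OPENAI_API_KEY environment variable before running the requests below."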

Once this is set up, let's start with a simple {text} input to the model for our first request. We'll use both system and user messages for our first request, and we'll receive a response from the assistant role.

from openai import OpenAI
import os

# Set the API key and model name
MODEL = "gpt-4o"
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as an env var>"))

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Help me with my math homework!"},  # <-- This is the system message that provides context to the model
        {"role": "user", "content": "Hello! Could you solve 2+2?"}  # <-- This is the user message for which the model will generate a response
    ]
)

print("Assistant: " + completion.choices[0].message.content)

Assistant: Of course! 

\[ 2 + 2 = 4 \]

If you have any other questions, feel free to ask!

Image Processing

GPT-4o can directly process images and take intelligent actions based on the image. We can provide images in two formats:

1. Base64 Encoded
2. URL

Let's first view the image we'll use, then try sending this image to the API both as Base64 and as a URL link.

from IPython.display import Image, display, Audio, Markdown
import base64

IMAGE_PATH = "data/triangle.png"

# Preview image for context
display(Image(IMAGE_PATH))

Base64 Image Processing

# Open the image file and encode it as a Base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

base64_image = encode_image(IMAGE_PATH)

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{base64_image}"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

To find the area of the triangle, we can use Heron's formula. First, we need to find the semi-perimeter of the triangle.

The sides of the triangle are 6, 5, and 9.

1. Calculate the semi-perimeter \( s \):
\[ s = \frac{a + b + c}{2} = \frac{6 + 5 + 9}{2} = 10 \]

2. Use Heron's formula to find the area \( A \):
\[ A = \sqrt{s(s-a)(s-b)(s-c)} \]

Substitute the values:
\[ A = \sqrt{10(10-6)(10-5)(10-9)} \]
\[ A = \sqrt{10 \cdot 4 \cdot 5 \cdot 1} \]
\[ A = \sqrt{200} \]
\[ A = 10\sqrt{2} \]

So, the area of the triangle is \( 10\sqrt{2} \) square units.

URL Image Processing

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that responds in Markdown. Help me with my math homework!"},
        {"role": "user", "content": [
            {"type": "text", "text": "What's the area of the triangle?"},
            {"type": "image_url", "image_url": {
                "url": "https://upload.wikimedia.org/wikipedia/commons/e/e2/The_Algebra_of_Mohammed_Ben_Musa_-_page_82b.png"}
            }
        ]}
    ],
    temperature=0.0,
)

print(response.choices[0].message.content)

To find the area of the triangle, we can use Heron's formula. Heron's formula states that the area of a triangle with sides of length \(a\), \(b\), and \(c\) is:

\[ \text{Area} = \sqrt{s(s-a)(s-b)(s-c)} \]

where \(s\) is the semi-perimeter of the triangle:

\[ s = \frac{a + b + c}{2} \]

For the given triangle, the side lengths are \(a = 5\), \(b = 6\), and \(c = 9\).

First, calculate the semi-perimeter \(s\):

\[ s = \frac{5 + 6 + 9}{2} = \frac{20}{2} = 10 \]

Now, apply Heron's formula:

\[ \text{Area} = \sqrt{10(10-5)(10-6)(10-9)} \]
\[ \text{Area} = \sqrt{10 \cdot 5 \cdot 4 \cdot 1} \]
\[ \text{Area} = \sqrt{200} \]
\[ \text{Area} = 10\sqrt{2} \]

So, the area of the triangle is \(10\sqrt{2}\) square units.

Video Processing

While it isn't possible to send a video directly to the API, GPT-4o can understand videos if you sample frames and then provide them as images. It performs better at this task than GPT-4 Turbo.

Since GPT-4o in the API does not yet support audio input as of May 2024, we'll use a combination of GPT-4o and Whisper to process both the audio and visual for a provided video, and showcase two use cases:

1. Summarization
2. Question and Answering

Setup for Video Processing

We'll use two Python packages for video processing: opencv-python and moviepy.

These packages require ffmpeg, so make sure to install it beforehand. Depending on your OS, you may need to run brew install ffmpeg or sudo apt install ffmpeg. A quick way to verify the installation from Python is sketched after the install cells below.

%pip install opencv-python --quiet
%pip install moviepy --quiet
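
Optionally, you can confirm that ffmpeg is actually reachable before processing any video. This is a minimal sketch (not part of the original notebook), assuming ffmpeg only needs to be discoverable on the PATH:

import shutil

# Check that the ffmpeg binary is on the PATH; the video and audio extraction below expects it to be available.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found. Install it first, e.g. `brew install ffmpeg` or `sudo apt install ffmpeg`.")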

Process the video into two components: frames and audio

import cv2
from moviepy.editor import VideoFileClip
import time
import base64

# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"

def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path

# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)


MoviePy - Writing audio in data/keynote_recap.mp3
                                                                      
MoviePy - Done.
Extracted 218 frames
Extracted audio to data/keynote_recap.mp3

# Display the frames and audio for context
display_handle = display(None, display_id=True)
for img in base64Frames:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)

Audio(audio_path)
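
Note that sampling at 1 frame per second yields 218 frames for this video, and every frame is sent to the model in the requests below. If you need to control cost or latency, one option (a hypothetical adjustment, not part of the original notebook) is to evenly subsample the frame list before sending it:

MAX_FRAMES = 50  # assumed frame budget; tune it to your own cost/quality needs
if len(base64Frames) > MAX_FRAMES:
    step = len(base64Frames) / MAX_FRAMES
    base64Frames = [base64Frames[int(i * step)] for i in range(MAX_FRAMES)]
    print(f"Subsampled to {len(base64Frames)} frames")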

Example 1: Summarization

Now that we have both the video frames and the audio, let's run a few different tests to generate a video summary and compare the results of using the model with different modalities. We should expect to see that the summary generated with context from both the visual and audio inputs will be the most accurate, since the model is able to use the entire context of the video.

  1. Visual Summary
  2. Audio Summary
  3. Visual + Audio Summary

Visual Summary

The visual summary is generated by sending the model only the frames from the video. With just the frames, the model is likely to capture the visual aspects but will miss any details discussed by the speaker.

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames)
        ]}
    ],
    temperature=0,
)
print(response.choices[0].message.content)

## Video Summary: OpenAI DevDay Keynote Recap

The video appears to be a keynote recap from OpenAI's DevDay event. Here are the key points covered in the video:

1. **Introduction and Event Overview**:
- The video starts with the title "OpenAI DevDay" and transitions to "Keynote Recap."
- The event venue is shown, with attendees gathering and the stage set up.

2. **Keynote Presentation**:
- A speaker, presumably from OpenAI, takes the stage to present.
- The presentation covers various topics related to OpenAI's latest developments and announcements.

3. **Announcements**:
- **GPT-4 Turbo**: Introduction of GPT-4 Turbo, highlighting its enhanced capabilities and performance.
- **JSON Mode**: A new feature that allows for structured data output in JSON format.
- **Function Calling**: Demonstration of improved function calling capabilities, making interactions more efficient.
- **Context Length and Control**: Enhancements in context length and user control over the model's responses.
- **Better Knowledge Integration**: Improvements in the model's knowledge base and retrieval capabilities.

4. **Product Demonstrations**:
- **DALL-E 3**: Introduction of DALL-E 3 for advanced image generation.
- **Custom Models**: Announcement of custom models, allowing users to tailor models to specific needs.
- **API Enhancements**: Updates to the API, including threading, retrieval, and code interpreter functionalities.

5. **Pricing and Token Efficiency**:
- Discussion on GPT-4 Turbo pricing, emphasizing cost efficiency with reduced input and output tokens.

6. **New Features and Tools**:
- Introduction of new tools and features for developers, including a variety of GPT-powered applications.
- Emphasis on building with natural language and the ease of creating custom applications.

7. **Closing Remarks**:
- The speaker concludes the presentation, thanking the audience and highlighting the future of OpenAI's developments.

The video ends with the OpenAI logo and the event title "OpenAI DevDay."

The results are as expected: the model is able to capture the high-level aspects of the video visuals, but misses the details provided in the speech.

Audio Summary

The audio summary is generated by sending the model the audio transcript. With just the audio, the model is likely to bias toward the audio content and will miss the context provided by the presentations and visuals.

{audio} input for GPT-4o isn't currently available, but it is coming soon! For now, we use our existing whisper-1 model to process the audio.

# Transcribe the audio
transcription = client.audio.transcriptions.create(
    model="whisper-1",
    file=open(audio_path, "rb"),
)
# Optional: uncomment the following line to print the transcript
# print("Transcript: ", transcription.text + "\n\n")

response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown."""},
        {"role": "user", "content": [
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ]}
    ],
    temperature=0,
)
print(response.choices[0].message.content)

### Summary

Welcome to OpenAI's first-ever Dev Day. Key announcements include:

- **GPT-4 Turbo**: A new model supporting up to 128,000 tokens of context, featuring JSON mode for valid JSON responses, improved instruction following, and better knowledge retrieval from external documents or databases. It is also significantly cheaper than GPT-4.
- **New Features**:
- **Dolly 3**, **GPT-4 Turbo with Vision**, and a new **Text-to-Speech model** are now available in the API.
- **Custom Models**: A program where OpenAI researchers help companies create custom models tailored to their specific use cases.
- **Increased Rate Limits**: Doubling tokens per minute for established GPT-4 customers and allowing requests for further rate limit changes.
- **GPTs**: Tailored versions of ChatGPT for specific purposes, programmable through conversation, with options for private or public sharing, and a forthcoming GPT Store.
- **Assistance API**: Includes persistent threads, built-in retrieval, a code interpreter, and improved function calling.

OpenAI is excited about the future of AI integration and looks forward to seeing what users will create with these new tools. The event concludes with an invitation to return next year for more advancements.

The audio summary is biased toward the content discussed during the speech, but it comes out with much less structure than the video summary.

Audio + Visual Summary

The audio + visual summary is generated by sending the model both the visual and the audio from the video at once. When sending both, the model is expected to summarize better, since it can perceive the entire video at once.

# Generate a summary with visual and audio
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"""},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"}
        ]}
    ],
    temperature=0,
)
print(response.choices[0].message.content)

## Video Summary: OpenAI Dev Day

### Introduction
- The video begins with the title "OpenAI Dev Day" and transitions to a keynote recap.

### Event Overview
- The event is held at a venue with a sign reading "OpenAI Dev Day."
- Attendees are seen entering and gathering in a large hall.

### Keynote Presentation
- The keynote speaker introduces the event and announces the launch of GPT-4 Turbo.
- **GPT-4 Turbo**:
- Supports up to 128,000 tokens of context.
- Introduces a new feature called JSON mode for valid JSON responses.
- Improved function calling capabilities.
- Enhanced instruction-following and knowledge retrieval from external documents or databases.
- Knowledge updated up to April 2023.
- Available in the API along with DALL-E 3, GPT-4 Turbo with Vision, and a new Text-to-Speech model.

### Custom Models
- Launch of a new program called Custom Models.
- Researchers will collaborate with companies to create custom models tailored to specific use cases.
- Higher rate limits and the ability to request changes to rate limits and quotas directly in API settings.

### Pricing and Performance
- **GPT-4 Turbo**:
- 3x cheaper for prompt tokens and 2x cheaper for completion tokens compared to GPT-4.
- Doubling the tokens per minute for established GPT-4 customers.

### Introduction of GPTs
- **GPTs**:
- Tailored versions of ChatGPT for specific purposes.
- Combine instructions, expanded knowledge, and actions for better performance and control.
- Can be created without coding, through conversation.
- Options to make GPTs private, share publicly, or create for company use in ChatGPT Enterprise.
- Announcement of the upcoming GPT Store.

### Assistance API
- **Assistance API**:
- Includes persistent threads for handling long conversation history.
- Built-in retrieval and code interpreter with a working Python interpreter in a sandbox environment.
- Improved function calling.

### Conclusion
- The speaker emphasizes the potential of integrating intelligence everywhere, providing "superpowers on demand."
- Encourages attendees to return next year, hinting at even more advanced developments.
- The event concludes with thanks to the attendees.

### Closing
- The video ends with the OpenAI logo and a final thank you message.

After combining both the video and the audio, we're able to get a much more detailed and comprehensive summary of the event, which uses information from both the visual and audio elements of the video.

Example 2: Question and Answering

For the Q&A, we'll use the same concept as before to ask questions about our processed video, while running the same three tests to demonstrate the benefit of combining input modalities:

1. Visual Q&A
2. Audio Q&A
3. Visual + Audio Q&A

QUESTION = "Question: Why did Sam Altman have an example about raising windows and turning the radio on?"

qa_visual_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Use the video to answer the provided question. Respond in Markdown."},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            QUESTION
        ]}
    ],
    temperature=0,
)
print("Visual QA:\n" + qa_visual_response.choices[0].message.content)

Visual QA: 
Sam Altman used the example about raising windows and turning the radio on to demonstrate the function calling capability of GPT-4 Turbo. The example illustrated how the model can interpret and execute multiple commands in a more structured and efficient manner. The "before" and "after" comparison showed how the model can now directly call functions like `raise_windows()` and `radio_on()` based on natural language instructions, showcasing improved control and functionality.

qa_audio_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Use the transcription to answer the provided question. Respond in Markdown."""},
        {"role": "user", "content": f"The audio transcription is: {transcription.text}. \n\n {QUESTION}"},
    ],
    temperature=0,
)
print("Audio QA:\n" + qa_audio_response.choices[0].message.content)

Audio QA:
The provided transcription does not include any mention of Sam Altman or an example about raising windows and turning the radio on. Therefore, I cannot provide an answer based on the given transcription.

qa_both_response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": """Use the video and transcription to answer the provided question."""},
        {"role": "user", "content": [
            "These are the frames from the video.",
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}}, base64Frames),
            {"type": "text", "text": f"The audio transcription is: {transcription.text}"},
            QUESTION
        ]}
    ],
    temperature=0,
)
print("Both QA:\n" + qa_both_response.choices[0].message.content)

Both QA:
Sam Altman used the example of raising windows and turning the radio on to demonstrate the improved function calling capabilities of GPT-4 Turbo. The example illustrated how the model can now handle multiple function calls more effectively and follow instructions better. In the "before" scenario, the model had to be prompted separately for each action, whereas in the "after" scenario, the model could handle both actions in a single prompt, showcasing its enhanced ability to manage and execute multiple tasks simultaneously.

Comparing the three answers, the most accurate answer is generated by using both the audio and the visual from the video. Sam Altman did not discuss raising the windows or turning the radio on during the keynote itself, but he referenced the model's improved ability to execute multiple functions in a single request while the example was shown on the screen behind him.

Conclusion

Integrating multiple input modalities such as audio, vision, and text significantly enhances the model's performance on a diverse range of tasks. This multimodal approach allows for more comprehensive understanding and interaction, mirroring more closely how humans perceive and process information.

Currently, GPT-4o in the API supports text and image inputs, with audio capabilities coming soon.