
Batch processing with the Batch API


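The new Batch API lets you create asynchronous batch jobs at a lower price and with higher rate limits.

Batch jobs complete within 24 hours, but may be processed sooner depending on global usage.

Ideal use cases for the Batch API include:

  • Tagging, captioning, or enriching content on a marketplace or blog
  • Categorizing support tickets and suggesting answers
  • Performing sentiment analysis on large datasets of customer feedback
  • Generating summaries or translations of collections of documents or articles

and much more!

This guide walks you through how to use the Batch API with a couple of practical examples.

We will start with an example that categorizes movies using gpt-3.5-turbo, and then cover how to use the vision capabilities of gpt-4-turbo to caption images.

Note that multiple models are available through the Batch API, and that you can use the same parameters in your Batch API calls as with the Chat Completions endpoint.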

Setup

# Make sure you have the latest version of the SDK so that the Batch API is available
%pip install openai --upgrade

import json
from openai import OpenAI
import pandas as pd
from IPython.display import Image, display

# Initializing the OpenAI client - see https://platform.openai.com/docs/quickstart?context=python
client = OpenAI()

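First example: Categorizing movies

In this example, we will use gpt-3.5-turbo to extract movie categories from a movie description. We will also extract a 1-sentence summary from that description.

We will use JSON mode to extract the categories as an array of strings and the 1-sentence summary in a structured format.

For each movie, we want to get a result that looks like this:

{
    categories: ['category1', 'category2', 'category3'],
    summary: '1-sentence summary'
}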
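JSON mode guarantees syntactically valid JSON, but not that the object has the keys we asked for. The sketch below shows one way to check the shape described above before using a result. The helper name `parse_movie_result` and the lower-casing step are illustrative assumptions, not part of the original notebook.

```python
import json


def parse_movie_result(raw: str) -> dict:
    """Parse a JSON-mode response string and check it has the expected shape."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    categories = data.get("categories")
    summary = data.get("summary")
    if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
        raise ValueError("'categories' must be a list of strings")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("'summary' must be a non-empty string")
    # Normalize to lower case, as the system prompt requests (illustrative choice)
    return {"categories": [c.lower() for c in categories], "summary": summary.strip()}


example = '{"categories": ["Crime", "drama"], "summary": "An aging patriarch hands his empire to his son."}'
print(parse_movie_result(example))
```

A check like this can be applied both to the Chat Completions test calls and, later, to each line of the batch output.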

Loading data

We will use the IMDB top 1000 movies dataset for this example.

dataset_path = "data/imdb_top_1000.csv"

df = pd.read_csv(dataset_path)
df.head()

Poster_Link Series_Title Released_Year Certificate Runtime Genre IMDB_Rating Overview Meta_score Director Star1 Star2 Star3 Star4 No_of_Votes Gross
0 https://m.media-amazon.com/images/M/MV5BMDFkYT... The Shawshank Redemption 1994 A 142 min Drama 9.3 Two imprisoned men bond over a number of years... 80.0 Frank Darabont Tim Robbins Morgan Freeman Bob Gunton William Sadler 2343110 28,341,469
1 https://m.media-amazon.com/images/M/MV5BM2MyNj... The Godfather 1972 A 175 min Crime, Drama 9.2 An organized crime dynasty's aging patriarch t... 100.0 Francis Ford Coppola Marlon Brando Al Pacino James Caan Diane Keaton 1620367 134,966,411
2 https://m.media-amazon.com/images/M/MV5BMTMxNT... The Dark Knight 2008 UA 152 min Action, Crime, Drama 9.0 When the menace known as the Joker wreaks havo... 84.0 Christopher Nolan Christian Bale Heath Ledger Aaron Eckhart Michael Caine 2303232 534,858,444
3 https://m.media-amazon.com/images/M/MV5BMWMwMG... The Godfather: Part II 1974 A 202 min Crime, Drama 9.0 The early life and career of Vito Corleone in ... 90.0 Francis Ford Coppola Al Pacino Robert De Niro Robert Duvall Diane Keaton 1129952 57,300,000
4 https://m.media-amazon.com/images/M/MV5BMWU4N2... 12 Angry Men 1957 U 96 min Crime, Drama 9.0 A jury holdout attempts to prevent a miscarria... 96.0 Sidney Lumet Henry Fonda Lee J. Cobb Martin Balsam John Fiedler 689845 4,360,000

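Processing step

Here, we will prepare our requests by first trying them out with the Chat Completions endpoint.

Once we are happy with the results, we can move on to creating the batch file.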

categorize_system_prompt = '''
Your goal is to extract movie categories from movie descriptions, as well as a 1-sentence summary for these movies.
You will be provided with a movie description, and you will output a json object containing the following information:

{
categories: string[] // Array of categories based on the movie description,
summary: string // 1-sentence summary of the movie based on the movie description
}

Categories refer to the genre or type of the movie, like "action", "romance", "comedy", etc. Keep category names simple and use only lower case letters.
Movies can have several categories, but try to keep it under 3-4. Only mention the categories that are the most obvious based on the description.
'''

def get_categories(description):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.1,
        # This enables JSON mode, which ensures the response is a valid JSON object
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": categorize_system_prompt
            },
            {
                "role": "user",
                "content": description
            }
        ],
    )

    return response.choices[0].message.content

# Testing on a few examples
for _, row in df[:5].iterrows():
    description = row['Overview']
    title = row['Series_Title']
    result = get_categories(description)
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")

TITLE: The Shawshank Redemption
OVERVIEW: Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.

RESULT: {
"categories": ["drama"],
"summary": "Two imprisoned men bond over the years and find redemption through acts of common decency."
}


----------------------------


TITLE: The Godfather
OVERVIEW: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

RESULT: {
"categories": ["crime", "drama"],
"summary": "A crime drama about an aging patriarch passing on his empire to his son."
}


----------------------------


TITLE: The Dark Knight
OVERVIEW: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

RESULT: {
"categories": ["action", "thriller"],
"summary": "A thrilling action movie where Batman faces the chaotic Joker in a battle of justice."
}


----------------------------


TITLE: The Godfather: Part II
OVERVIEW: The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands and tightens his grip on the family crime syndicate.

RESULT: {
"categories": ["crime", "drama"],
"summary": "A portrayal of Vito Corleone's early life and career in 1920s New York City, as his son Michael expands the family crime syndicate."
}


----------------------------


TITLE: 12 Angry Men
OVERVIEW: A jury holdout attempts to prevent a miscarriage of justice by forcing his colleagues to reconsider the evidence.

RESULT: {
"categories": ["drama"],
"summary": "A gripping drama about a jury holdout trying to prevent a miscarriage of justice by challenging his colleagues to reconsider the evidence."
}


----------------------------

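Creating the batch file

The batch file, in the jsonl format, should contain one line (one JSON object) per request. Each request is defined as follows:

{
    "custom_id": <REQUEST_ID>,
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": <MODEL>,
        "messages": <MESSAGES>,
        // other parameters
    }
}

Note: the request ID should be unique within each batch. You can use it to match results to the initial input file, since requests are not guaranteed to be returned in the same order.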
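Because the whole batch depends on each line being a well-formed request with a unique ID, it can be worth checking the file contents before uploading. The following is a minimal sketch of such a check. The function `check_batch_lines` is a hypothetical helper written for illustration, and it only checks the four top-level fields shown above.

```python
import json
from collections import Counter


def check_batch_lines(lines):
    """Return a list of problems found in JSONL batch request lines."""
    problems = []
    ids = []
    for n, line in enumerate(lines, start=1):
        try:
            req = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {n}: not valid JSON")
            continue
        if not isinstance(req, dict):
            problems.append(f"line {n}: not a JSON object")
            continue
        # Every request needs the four top-level fields from the template above
        for key in ("custom_id", "method", "url", "body"):
            if key not in req:
                problems.append(f"line {n}: missing '{key}'")
        if "custom_id" in req:
            ids.append(str(req["custom_id"]))
    # custom_id values must be unique within a batch
    for cid, count in Counter(ids).items():
        if count > 1:
            problems.append(f"custom_id '{cid}' appears {count} times")
    return problems


lines = [
    json.dumps({"custom_id": "task-0", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
    json.dumps({"custom_id": "task-0", "method": "POST", "url": "/v1/chat/completions", "body": {}}),
]
print(check_batch_lines(lines))
```

An empty list means no problems were found by these particular checks; it does not guarantee that the API will accept every request body.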

# Creating an array of JSON tasks

tasks = []

for index, row in df.iterrows():

    description = row['Overview']

    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-3.5-turbo",
            "temperature": 0.1,
            "response_format": {
                "type": "json_object"
            },
            "messages": [
                {
                    "role": "system",
                    "content": categorize_system_prompt
                },
                {
                    "role": "user",
                    "content": description
                }
            ],
        }
    }

    tasks.append(task)

# Creating the file

file_name = "data/batch_tasks_movies.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')
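Since the "body" of each task holds exactly the keyword arguments of a Chat Completions call, the request envelope can be factored into a small helper. This is only a sketch: `make_batch_task` is a hypothetical name, and the placeholder system message stands in for the real prompt.

```python
import json


def make_batch_task(custom_id, **chat_params):
    """Wrap Chat Completions keyword arguments in a batch request envelope."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": chat_params,  # same parameters as a direct Chat Completions call
    }


task = make_batch_task(
    "task-0",
    model="gpt-3.5-turbo",
    temperature=0.1,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Placeholder system prompt."},
        {"role": "user", "content": "Two imprisoned men bond over a number of years."},
    ],
)

# json.dumps emits a single line by default, which is what the jsonl format needs
print(json.dumps(task))
```

With such a helper, the parameters tested against the Chat Completions endpoint can be passed through unchanged when building the batch file.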

Uploading the file

batch_file = client.files.create(
    file=open(file_name, "rb"),
    purpose="batch"
)

print(batch_file)

FileObject(id='file-nG1JDPSMRMinN8FOdaL30kVD', bytes=1127310, created_at=1714045723, filename='batch_tasks_movies.jsonl', object='file', purpose='batch', status='processed', status_details=None)

Creating the batch job

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

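Checking the batch status

Note: this can take up to 24 hours, but it will usually complete faster.

You can keep checking until the status is 'completed'.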

batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)

Batch(id='batch_xU74ytOBYUpaUQE3Cwi8SCbA', completion_window='24h', created_at=1714049780, endpoint='/v1/chat/completions', input_file_id='file-6y0JPmkHU42qtaEK8x8ZYzkp', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1714049914, error_file_id=None, errors=None, expired_at=None, expires_at=1714136180, failed_at=None, finalizing_at=1714049896, in_progress_at=1714049821, metadata=None, output_file_id='file-XPfkEFZSaM4Avps7mcD3i8BY', request_counts=BatchRequestCounts(completed=312, failed=0, total=312))
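Rather than re-running the retrieve call by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, not part of the original notebook: `wait_for_batch`, the polling interval, and the poll limit are illustrative choices, and the set of terminal statuses is taken from the Batch API's documented status values. The retrieve function is passed in so the loop can be tried without network access.

```python
import time

# Statuses after which a batch will not change any further
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, poll_seconds=60, max_polls=1440, sleep=time.sleep):
    """Call retrieve(batch_id) until the batch reaches a terminal status."""
    for _ in range(max_polls):
        batch = retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)
    raise TimeoutError(f"batch {batch_id} still running after {max_polls} polls")


# Usage with the client from this guide (not executed here):
# batch_job = wait_for_batch(client.batches.retrieve, batch_job.id)
# if batch_job.status != "completed":
#     print(f"Batch ended with status {batch_job.status}")
```

With the defaults, the loop polls once a minute for up to 24 hours, which matches the completion window used above. The returned status should still be checked, since a batch can also end as failed, expired, or cancelled.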

Retrieving results

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content

result_file_name = "data/batch_job_results_movies.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)

# Loading data from the saved file
results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending it to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

Reading results

Reminder: the results may not be in the same order as in the input file. Make sure to check the custom_id to match each result against its input request.

# Reading only the first few results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting the index from the task ID
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    movie = df.iloc[int(index)]
    description = movie['Overview']
    title = movie['Series_Title']
    print(f"TITLE: {title}\nOVERVIEW: {description}\n\nRESULT: {result}")
    print("\n\n----------------------------\n\n")

TITLE: American Psycho
OVERVIEW: A wealthy New York City investment banking executive, Patrick Bateman, hides his alternate psychopathic ego from his co-workers and friends as he delves deeper into his violent, hedonistic fantasies.

RESULT: {
"categories": ["thriller", "psychological", "drama"],
"summary": "A wealthy investment banker in New York City conceals his psychopathic alter ego while indulging in violent and hedonistic fantasies."
}


----------------------------


TITLE: Lethal Weapon
OVERVIEW: Two newly paired cops who are complete opposites must put aside their differences in order to catch a gang of drug smugglers.

RESULT: {
"categories": ["action", "comedy", "crime"],
"summary": "An action-packed comedy about two mismatched cops teaming up to take down a drug smuggling gang."
}


----------------------------


TITLE: A Star Is Born
OVERVIEW: A musician helps a young singer find fame as age and alcoholism send his own career into a downward spiral.

RESULT: {
"categories": ["drama", "music"],
"summary": "A musician's career spirals downward as he helps a young singer find fame amidst struggles with age and alcoholism."
}


----------------------------


TITLE: From Here to Eternity
OVERVIEW: In Hawaii in 1941, a private is cruelly punished for not boxing on his unit's team, while his captain's wife and second-in-command are falling in love.

RESULT: {
"categories": ["drama", "romance", "war"],
"summary": "A drama set in Hawaii in 1941, where a private faces punishment for not boxing on his unit's team, amidst a forbidden love affair between his captain's wife and second-in-command."
}


----------------------------


TITLE: The Jungle Book
OVERVIEW: Bagheera the Panther and Baloo the Bear have a difficult time trying to convince a boy to leave the jungle for human civilization.

RESULT: {
"categories": ["adventure", "animation", "family"],
"summary": "An adventure-filled animated movie about a panther and a bear trying to persuade a boy to leave the jungle for human civilization."
}


----------------------------
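The matching by custom_id used above can also be done once for the whole result set by building a lookup table, which makes it easy to walk the inputs in their original order. The sketch below uses toy result lines. `index_results` is a hypothetical helper, and treating a missing or empty "response" as a failed request is an assumption made for illustration.

```python
def index_results(results):
    """Map each custom_id to its message content (None if no response body)."""
    by_id = {}
    for res in results:
        # Assumption: a failed request may carry an empty or missing "response"
        response = res.get("response") or {}
        body = response.get("body") or {}
        choices = body.get("choices") or []
        content = choices[0]["message"]["content"] if choices else None
        by_id[res["custom_id"]] = content
    return by_id


# Toy result lines, deliberately out of order; real ones come from the output file
sample = [
    {"custom_id": "task-2", "response": {"body": {"choices": [{"message": {"content": "c"}}]}}},
    {"custom_id": "task-0", "response": {"body": {"choices": [{"message": {"content": "a"}}]}}},
    {"custom_id": "task-1", "response": None},
]

by_id = index_results(sample)
# Walk the inputs in their original order, looking up each result by ID
ordered = [by_id.get(f"task-{i}") for i in range(3)]
print(ordered)  # ['a', None, 'c']
```

Entries that come back as None can then be collected and resubmitted in a follow-up batch.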

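Second example: Captioning images

In this example, we will use gpt-4-turbo to caption images of furniture items.

We will use the vision capabilities of the model to analyze the images and generate the captions.

Loading data

We will use the Amazon furniture dataset for this example.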

dataset_path = "data/amazon_furniture_dataset.csv"
df = pd.read_csv(dataset_path)
df.head()

asin url title brand price availability categories primary_image images upc ... color material style important_information product_overview about_item description specifications uniq_id scraped_at
0 B0CJHKVG6P https://www.amazon.com/dp/B0CJHKVG6P GOYMFK 1pc Free Standing Shoe Rack, Multi-laye... GOYMFK $24.99 Only 13 left in stock - order soon. ['Home & Kitchen', 'Storage & Organization', '... https://m.media-amazon.com/images/I/416WaLx10j... ['https://m.media-amazon.com/images/I/416WaLx1... NaN ... White Metal Modern [] [{'Brand': ' GOYMFK '}, {'Color': ' White '}, ... ['Multiple layers: Provides ample storage spac... multiple shoes, coats, hats, and other items E... ['Brand: GOYMFK', 'Color: White', 'Material: M... 02593e81-5c09-5069-8516-b0b29f439ded 2024-02-02 15:15:08
1 B0B66QHB23 https://www.amazon.com/dp/B0B66QHB23 subrtex Leather ding Room, Dining Chairs Set o... subrtex NaN NaN ['Home & Kitchen', 'Furniture', 'Dining Room F... https://m.media-amazon.com/images/I/31SejUEWY7... ['https://m.media-amazon.com/images/I/31SejUEW... NaN ... Black Sponge Black Rubber Wood [] NaN ['【Easy Assembly】: Set of 2 dining room chairs... subrtex Dining chairs Set of 2 ['Brand: subrtex', 'Color: Black', 'Product Di... 5938d217-b8c5-5d3e-b1cf-e28e340f292e 2024-02-02 15:15:09
2 B0BXRTWLYK https://www.amazon.com/dp/B0BXRTWLYK Plant Repotting Mat MUYETOL Waterproof Transpl... MUYETOL $5.98 In Stock ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... https://m.media-amazon.com/images/I/41RgefVq70... ['https://m.media-amazon.com/images/I/41RgefVq... NaN ... Green Polyethylene Modern [] [{'Brand': ' MUYETOL '}, {'Size': ' 26.8*26.8 ... ['PLANT REPOTTING MAT SIZE: 26.8" x 26.8", squ... NaN ['Brand: MUYETOL', 'Size: 26.8*26.8', 'Item We... b2ede786-3f51-5a45-9a5b-bcf856958cd8 2024-02-02 15:15:09
3 B0C1MRB2M8 https://www.amazon.com/dp/B0C1MRB2M8 Pickleball Doormat, Welcome Doormat Absorbent ... VEWETOL $13.99 Only 10 left in stock - order soon. ['Patio, Lawn & Garden', 'Outdoor Décor', 'Doo... https://m.media-amazon.com/images/I/61vz1Igler... ['https://m.media-amazon.com/images/I/61vz1Igl... NaN ... A5589 Rubber Modern [] [{'Brand': ' VEWETOL '}, {'Size': ' 16*24INCH ... ['Specifications: 16x24 Inch ', " High-Quality... The decorative doormat features a subtle textu... ['Brand: VEWETOL', 'Size: 16*24INCH', 'Materia... 8fd9377b-cfa6-5f10-835c-6b8eca2816b5 2024-02-02 15:15:10
4 B0CG1N9QRC https://www.amazon.com/dp/B0CG1N9QRC JOIN IRON Foldable TV Trays for Eating Set of ... JOIN IRON Store $89.99 Usually ships within 5 to 6 weeks ['Home & Kitchen', 'Furniture', 'Game & Recrea... https://m.media-amazon.com/images/I/41p4d4VJnN... ['https://m.media-amazon.com/images/I/41p4d4VJ... NaN ... Grey Set of 4 Iron X Classic Style [] NaN ['Includes 4 Folding Tv Tray Tables And one Co... Set of Four Folding Trays With Matching Storag... ['Brand: JOIN IRON', 'Shape: Rectangular', 'In... bdc9aa30-9439-50dc-8e89-213ea211d66a 2024-02-02 15:15:11

5 rows × 25 columns

Processing step

Again, we will first prepare our requests with the Chat Completions endpoint, and create the batch file afterwards.

caption_system_prompt = '''
Your goal is to generate short, descriptive captions for images of items.
You will be provided with an item image and the name of that item and you will output a caption that captures the most important information about the item.
If there are multiple items depicted, refer to the name provided to understand which item you should describe.
Your generated caption should be short (1 sentence), and include only the most important information about the item.
The most important information could be: the type of item, the style (if mentioned), the material or color if especially relevant and/or any distinctive features.
Keep it short and to the point.
'''

def get_caption(img_url, title):
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0.2,
        max_tokens=300,
        messages=[
            {
                "role": "system",
                "content": caption_system_prompt
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": title
                    },
                    # The content type should be "image_url" to use gpt-4-turbo's vision capabilities
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": img_url
                        }
                    },
                ],
            }
        ]
    )

    return response.choices[0].message.content

# Testing on a few images
for _, row in df[:5].iterrows():
    img_url = row['primary_image']
    caption = get_caption(img_url, row['title'])
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {caption}\n\n")

CAPTION: White multi-layer metal shoe rack featuring eight double hooks for hanging accessories, ideal for organizing footwear and small items in living spaces.

CAPTION: A set of two elegant black leather dining chairs with a sleek design and vertical stitching detail on the backrest.

CAPTION: A green, waterproof, square, foldable repotting mat designed for indoor gardening, featuring raised edges and displayed with gardening tools and small potted plants.

CAPTION: A brown, absorbent non-slip doormat featuring the phrase "It's a good day to play PICKLEBALL" with a pickleball paddle graphic, ideal for sports enthusiasts.

CAPTION: Set of four foldable grey TV trays with a stand, featuring a sleek, space-saving design suitable for small areas.

Creating the batch job

As with the first example, we will create an array of JSON tasks to generate a jsonl file and use it to create the batch job.

# Creating an array of JSON tasks

tasks = []

for index, row in df.iterrows():

    title = row['title']
    img_url = row['primary_image']

    task = {
        "custom_id": f"task-{index}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            # This is what you would have in your Chat Completions API call
            "model": "gpt-4-turbo",
            "temperature": 0.2,
            "max_tokens": 300,
            "messages": [
                {
                    "role": "system",
                    "content": caption_system_prompt
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": title
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": img_url
                            }
                        },
                    ],
                }
            ]
        }
    }

    tasks.append(task)

# Creating the file

file_name = "data/batch_tasks_furniture.jsonl"

with open(file_name, 'w') as file:
    for obj in tasks:
        file.write(json.dumps(obj) + '\n')

# Uploading the file

batch_file = client.files.create(
    file=open(file_name, "rb"),
    purpose="batch"
)

# Creating the batch job

batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

batch_job = client.batches.retrieve(batch_job.id)
print(batch_job)

Batch(id='batch_xU74ytOBYUpaUQE3Cwi8SCbA', completion_window='24h', created_at=1714049780, endpoint='/v1/chat/completions', input_file_id='file-6y0JPmkHU42qtaEK8x8ZYzkp', object='batch', status='completed', cancelled_at=None, cancelling_at=None, completed_at=1714049914, error_file_id=None, errors=None, expired_at=None, expires_at=1714136180, failed_at=None, finalizing_at=1714049896, in_progress_at=1714049821, metadata=None, output_file_id='file-XPfkEFZSaM4Avps7mcD3i8BY', request_counts=BatchRequestCounts(completed=312, failed=0, total=312))

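Getting results

As with the first example, we can retrieve the results once the batch job is done.

Reminder: the results may not be in the same order as in the input file. Make sure to check the custom_id to match each result against its input request.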

# Retrieving the result file

result_file_id = batch_job.output_file_id
result = client.files.content(result_file_id).content

result_file_name = "data/batch_job_results_furniture.jsonl"

with open(result_file_name, 'wb') as file:
    file.write(result)

# Loading data from the saved file

results = []
with open(result_file_name, 'r') as file:
    for line in file:
        # Parsing the JSON string into a dict and appending it to the list of results
        json_object = json.loads(line.strip())
        results.append(json_object)

# Reading only the first few results
for res in results[:5]:
    task_id = res['custom_id']
    # Getting the index from the task ID
    index = task_id.split('-')[-1]
    result = res['response']['body']['choices'][0]['message']['content']
    item = df.iloc[int(index)]
    img_url = item['primary_image']
    img = Image(url=img_url)
    display(img)
    print(f"CAPTION: {result}\n\n")

CAPTION: Brushed brass pedestal towel rack with a sleek, modern design, featuring multiple bars for hanging towels, measuring 25.75 x 14.44 x 32 inches.

CAPTION: Black round end table featuring a tempered glass top and a metal frame, with a lower shelf for additional storage.

CAPTION: Black collapsible and height-adjustable telescoping stool, portable and designed for makeup artists and hairstylists, shown in various stages of folding for easy transport.

CAPTION: Ergonomic pink gaming chair featuring breathable fabric, adjustable height, lumbar support, a footrest, and a swivel recliner function.

CAPTION: A set of two Glitzhome adjustable bar stools featuring a mid-century modern design with swivel seats, PU leather upholstery, and wooden backrests.

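Wrapping up

In this guide, we have seen two examples of how to use the new Batch API, but keep in mind that the Batch API works the same way as the Chat Completions endpoint: it supports the same parameters and most of the recent models (gpt-3.5-turbo, gpt-4, gpt-4-vision-preview, gpt-4-turbo, and so on).

Using this API can significantly reduce costs, so we recommend switching every workload that can run asynchronously to a batch job with this new API.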