How to stream completions
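By default, when you request a completion from OpenAI, the entire completion is generated before it is sent back in a single response.
If you are generating long completions, waiting for the response can take many seconds.
To get responses sooner, you can "stream" the completion as it is being generated. This lets you start printing or processing the beginning of the completion before the full completion is finished.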
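To stream completions, set stream=True when calling the chat completions or completions endpoints. This returns an object that streams back the response as data-only server-sent events. Extract chunks from the delta field rather than the message field.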
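To make the phrase "data-only server-sent events" concrete, the sketch below parses a hand-written sample of such a stream using only the standard library. The sample payloads and the helper name `iter_sse_deltas` are illustrative assumptions, not captured API output, and in practice the `openai` Python client does this parsing for you. The sketch assumes each event is a `data: ` line holding a JSON chunk and that the stream ends with a `data: [DONE]` line.

```python
import json

# Hand-written sample of a data-only SSE stream (illustrative, not real API output).
SAMPLE_STREAM = """\
data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {"content": "Two"}, "finish_reason": null}]}

data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}

data: [DONE]
"""


def iter_sse_deltas(raw_stream):
    """Yield the delta dict of the first choice for each event (hypothetical helper)."""
    for line in raw_stream.splitlines():
        if not line.startswith("data: "):
            continue  # skip the blank lines that separate events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        chunk = json.loads(payload)
        yield chunk["choices"][0]["delta"]


deltas = list(iter_sse_deltas(SAMPLE_STREAM))
# A delta may lack "content" entirely, so fall back to an empty string.
text = "".join(d.get("content") or "" for d in deltas)
print(deltas)
print(text)
```

Note that the text lives under `delta` in every event; there is no `message` key in a streamed chunk.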
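Downsides
Note that using stream=True in a production application makes the content of completions harder to moderate, because partial completions can be more difficult to evaluate. This may have implications for approved usage.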
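One way to soften this trade-off is to buffer streamed text and release it only in sentence-sized pieces, so that each piece can be reviewed before it is shown. The sketch below is a minimal illustration under that assumption. `buffer_sentences` and `looks_ok` are hypothetical names, and `looks_ok` is a stand-in keyword check, not a real moderation service.

```python
def buffer_sentences(pieces, terminators=".!?"):
    """Group streamed text pieces into sentence-sized strings (hypothetical helper)."""
    buffer = ""
    for piece in pieces:
        buffer += piece
        # Release every complete sentence currently held in the buffer.
        while True:
            cut = next((i for i, ch in enumerate(buffer) if ch in terminators), -1)
            if cut == -1:
                break
            yield buffer[:cut + 1]
            buffer = buffer[cut + 1:]
    if buffer:
        yield buffer  # flush whatever is left when the stream ends


def looks_ok(text):
    """Placeholder check; a real application would call its own review step here."""
    return "forbidden" not in text.lower()


streamed_pieces = ["Hel", "lo the", "re. How a", "re you", "? Bye"]
for sentence in buffer_sentences(streamed_pieces):
    if looks_ok(sentence):
        print(repr(sentence))
```

Buffering like this gives back some of the latency benefit of streaming, so the right chunk size depends on how much delay the application can tolerate.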
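Example code
Below, this notebook shows:
1. What a typical chat completion response looks like
2. What a streaming chat completion response looks like
3. How much time is saved by streaming a chat completion
4. How to get token usage data for a streamed chat completion response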
# !pip install openai
# imports
import time  # for measuring the time duration of API calls
from openai import OpenAI
import os
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
1. What a typical chat completion response looks like
With a typical ChatCompletions API call, the response is first computed and then returned all at once.
# Example of an OpenAI ChatCompletion request
# https://platform.openai.com/docs/guides/text-generation/chat-completions-api
# record the time before the request is sent
start_time = time.time()
# send a ChatCompletion request to count to 100
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': 'Count to 100, with a comma between each number and no newlines. E.g., 1, 2, 3, ...'}
    ],
    temperature=0,
)
# calculate the time it took to receive the response
response_time = time.time() - start_time
# print the time delay and the text received
print(f"Full response received {response_time:.2f} seconds after request")
print(f"Full response received:\n{response}")
Full response received 5.27 seconds after request
Full response received:
ChatCompletion(id='chatcmpl-8ZB8ywkV5DuuJO7xktqUcNYfG8j6I', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.', role='assistant', function_call=None, tool_calls=None))], created=1703395008, model='gpt-3.5-turbo-0613', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=299, prompt_tokens=36, total_tokens=335))
The reply can be extracted with response.choices[0].message.
The content of the reply can be extracted with response.choices[0].message.content.
reply = response.choices[0].message
print(f"Extracted reply: \n{reply}")
reply_content = response.choices[0].message.content
print(f"Extracted content: \n{reply_content}")
Extracted reply:
ChatCompletionMessage(content='1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.', role='assistant', function_call=None, tool_calls=None)
Extracted content:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100.
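2. How to stream a chat completion
With a streaming API call, the response is sent back incrementally in chunks via an event stream. In Python, you can iterate over these events with a for loop.
Let's see what it looks like: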
# Example of an OpenAI ChatCompletion request with stream=True
# https://platform.openai.com/docs/api-reference/streaming#chat/create-stream
# a ChatCompletion request
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[
        {'role': 'user', 'content': "What's 1+1? Answer in one word."}
    ],
    temperature=0,
    stream=True  # this time, we set stream=True
)
# print each chunk, then just its text, as it arrives
for chunk in response:
    print(chunk)
    print(chunk.choices[0].delta.content)
    print("****************")
ChatCompletionChunk(id='chatcmpl-8ZB9m2Ubv8FJs3CIb84WvYwqZCHST', choices=[Choice(delta=ChoiceDelta(content='', function_call=None, role='assistant', tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703395058, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
****************
ChatCompletionChunk(id='chatcmpl-8ZB9m2Ubv8FJs3CIb84WvYwqZCHST', choices=[Choice(delta=ChoiceDelta(content='2', function_call=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None)], created=1703395058, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
2
****************
ChatCompletionChunk(id='chatcmpl-8ZB9m2Ubv8FJs3CIb84WvYwqZCHST', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1703395058, model='gpt-3.5-turbo-0613', object='chat.completion.chunk', system_fingerprint=None)
None
****************
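As you can see above, streaming responses have a delta field rather than a message field. delta can hold things like:
- a role token (e.g., {"role": "assistant"})
- a content token (e.g., {"content": "\n\n"})
- nothing (e.g., {}), when the stream is over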