
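This notebook demonstrates how to use the logprobs parameter in the Chat Completions API. When logprobs is enabled, the API returns the log probability of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:
* logprobs: Whether to return log probabilities of the output tokens. If true, the API returns the log probability of each output token in the message content. Currently not available on the gpt-4-vision-preview model.
* top_logprobs: An integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.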
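
The constraint between the two parameters can be sketched as a small validation step before a request is sent. The helper name build_logprob_params is hypothetical and is not part of the OpenAI client; the upper bound of 5 is taken from the description above and may differ between API versions.

```python
def build_logprob_params(logprobs=False, top_logprobs=None, max_top_logprobs=5):
    """Hypothetical helper: validate and assemble the logprob-related request fields."""
    params = {"logprobs": logprobs}
    if top_logprobs is not None:
        # top_logprobs is only meaningful when logprobs is enabled.
        if not logprobs:
            raise ValueError("top_logprobs requires logprobs=True")
        # Assumed range, based on the parameter description above.
        if not 0 <= top_logprobs <= max_top_logprobs:
            raise ValueError(f"top_logprobs must be between 0 and {max_top_logprobs}")
        params["top_logprobs"] = top_logprobs
    return params


print(build_logprob_params(logprobs=True, top_logprobs=2))
```

The returned dictionary could be merged into the keyword arguments of a Chat Completions request.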


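A logprob is log(p), where p is the probability of a token occurring at a specific position given the previous tokens in the context. Some key points about logprobs:
* Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.
* A logprob can be any negative number or 0.0. A value of 0.0 corresponds to 100% probability.
* Logprobs allow us to compute the joint log probability of a sequence as the sum of the logprobs of the individual tokens (exponentiating it gives the joint probability). This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation.
* We can examine the logprobs assigned to different candidate tokens to understand which options the model considered plausible or implausible.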
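
The arithmetic behind these points can be illustrated with a minimal sketch. The logprob values and candidate names below are made up for illustration; only the conversions (exp for linear probability, sum for joint logprob, mean for length-normalized score) come from the text above.

```python
import math

# Hypothetical per-token logprobs for two candidate generations.
candidate_logprobs = {
    "short answer": [-0.05, -0.10],
    "long answer": [-0.05, -0.05, -0.05, -0.05],
}


def to_probability(logprob):
    """Convert a natural-log probability to a linear probability."""
    return math.exp(logprob)


def joint_logprob(logprobs):
    """Joint log probability of a sequence: the sum of its token logprobs."""
    return sum(logprobs)


def mean_logprob(logprobs):
    """Average per-token logprob, which does not penalize longer sequences."""
    return sum(logprobs) / len(logprobs)


# Rank the candidates with both scoring rules.
best_by_joint = max(candidate_logprobs, key=lambda k: joint_logprob(candidate_logprobs[k]))
best_by_mean = max(candidate_logprobs, key=lambda k: mean_logprob(candidate_logprobs[k]))

print("Best by joint logprob:", best_by_joint)
print("Best by mean logprob:", best_by_mean)
```

With these made-up numbers the two rules disagree: the joint score favors the shorter sequence, while the per-token average favors the longer one whose individual tokens are more likely.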


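This notebook covers the following use cases:

  1. Classification tasks
  • Large language models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. logprobs provide a probability for each class prediction, enabling users to set their own classification or confidence thresholds.
  2. Retrieval (Q&A) evaluation
  • logprobs can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived has_sufficient_context_for_answer boolean, which can serve as a confidence score of whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and improve accuracy.
  3. Autocomplete
  • logprobs can help us decide how to suggest words as a user is typing.
  4. Token highlighting and outputting bytes
  • Users can easily create a token highlighter using the built-in tokenization that comes with enabling logprobs. Additionally, the bytes parameter includes the UTF-8 byte values of each output token, which is particularly useful for reproducing emojis and special characters.
  5. Calculating perplexity
  • logprobs can be used to help us assess the model's overall confidence in a result and to compare the confidence of results from different prompts.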

0. Imports and utils

from openai import OpenAI
from math import exp
import numpy as np
from IPython.display import display, HTML
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

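def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,  # assumed default; the value is not shown in this text
    temperature=0,  # assumed default
    stop=None,
    seed=123,  # assumed default
    tools=None,
    logprobs=None,  # Whether to return log probabilities of the output tokens. If True, returns the log probability of each output token in the message content.
    top_logprobs=None,  # Number of most likely tokens to return at each position; requires logprobs=True.
):
    """Call the Chat Completions API and return the full completion object."""
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion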

1. Using logprobs to assess confidence for classification tasks




CLASSIFICATION_PROMPT = """You will be given a headline of a news article.
Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article headline: {headline}"""

Let's look at three sample headlines, starting with a standard Chat Completions output without logprobs.

headlines = [
    "Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
    "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
    "Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut",
]

for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
    )
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")

Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Category: Technology

Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Category: Politics

Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Category: Art


for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        logprobs=True,
        top_logprobs=2,  # return the two most likely tokens at each position
    )
    # Candidate tokens for the first output position, most likely first
    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
    html_content = ""
    for i, logprob in enumerate(top_two_logprobs, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print("\n")

Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Output token 1: Technology, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Techn, logprobs: -13.781253, linear probability: 0.0%

Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Output token 1: Politics, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Technology, logprobs: -13.937503, linear probability: 0.0%

Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Output token 1: Art, logprobs: -0.009169078, linear probability: 99.09%
Output token 2: Sports, logprobs: -4.696669, linear probability: 0.91%

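As expected from the first two headlines, gpt-4 is nearly 100% confident in its classifications, since the content is clearly focused on technology and politics, respectively. The third headline, however, combines sports and art-related themes, so the model is slightly less confident in its choice: 99.09% for Art versus 0.91% for Sports.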
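
One way to act on these probabilities is to apply a confidence threshold and route uncertain predictions elsewhere. The following is a minimal sketch under stated assumptions: the function name classify_with_threshold, the NEEDS_REVIEW label, and the threshold values are all hypothetical, and the input is a plain list of (token, logprob) pairs copied from the output above rather than a live API response.

```python
import math


def classify_with_threshold(top_logprobs, threshold=0.95):
    """Return (label, probability) for the most likely token.

    top_logprobs: list of (token, logprob) pairs, most likely first.
    If the top probability is below the threshold, the hypothetical
    label "NEEDS_REVIEW" is returned instead of the predicted class.
    """
    token, logprob = top_logprobs[0]
    probability = math.exp(logprob)  # logprob -> linear probability
    label = token if probability >= threshold else "NEEDS_REVIEW"
    return label, probability


# Values copied from the output above.
tech = [("Technology", -2.4584822e-06), ("Techn", -13.781253)]
art = [("Art", -0.009169078), ("Sports", -4.696669)]

print(classify_with_threshold(tech, threshold=0.995))
print(classify_with_threshold(art, threshold=0.995))
```

With a strict (assumed) threshold of 0.995, the first headline is accepted while the mixed sports and art headline is flagged for review.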


2. Retrieval confidence scoring to reduce hallucinations



# Article retrieved
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = [
    "What nationality was Ada Lovelace?",
    "What was an important finding from Lovelace's seventh note?",
]

# Questions that are not fully covered in the article
medium_questions = [
    "Did Lovelace collaborate with Charles Dickens",
    "What concepts did Lovelace build with Charles Babbage",
]

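Now we can ask the model to respond to each question and then evaluate its response. Specifically, we will ask the model to output a boolean has_sufficient_context_for_answer. We can then evaluate the logprobs to see how confident the model is that its answer is contained in the provided context.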

PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""

html_output = ""

question_groups = [
    ("Questions clearly answered in article", easy_questions),
    ("Questions only partially covered in the article", medium_questions),
]

for title, questions in question_groups:
    html_output += title
    for question in questions:
        API_RESPONSE = get_completion(
            [
                {
                    "role": "user",
                    "content": PROMPT.format(
                        article=ada_lovelace_article, question=question
                    ),
                }
            ],
            logprobs=True,
        )
        html_output += f'<p style="color:green">Question: {question}</p>'
        # Each output token carries its own logprob; the answer is a single True/False token
        for logprob in API_RESPONSE.choices[0].logprobs.content:
            html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}</span>, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

display(HTML(html_output))


Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: True, logprobs: -0.06993677, linear probability: 93.25%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: False, logprobs: -0.61807257, linear probability: 53.9%
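
These scores can be turned into a simple gate that decides whether to answer from the retrieved article at all. The sketch below is an assumption-laden illustration: the function names and the 0.9 cutoff are hypothetical, and it approximates P(True) as 1 - P(False) when the model answered False, which presumes that nearly all probability mass sits on those two tokens.

```python
import math


def sufficient_context_probability(token, logprob):
    """Approximate P(model says True) from the single output token and its logprob.

    Assumes a binary True/False answer, so P(True) is taken as 1 - P(False)
    when the emitted token is False.
    """
    p = math.exp(logprob)
    return p if token.strip().lower() == "true" else 1.0 - p


def should_answer(token, logprob, min_probability=0.9):
    """Hypothetical gate: answer only when the context is judged sufficient."""
    return sufficient_context_probability(token, logprob) >= min_probability


# Values copied from the output above.
results = {
    "What nationality was Ada Lovelace?": ("True", -3.1281633e-07),
    "Did Lovelace collaborate with Charles Dickens": ("True", -0.06993677),
    "What concepts did Lovelace build with Charles Babbage": ("False", -0.61807257),
}

for question, (token, logprob) in results.items():
    p_true = sufficient_context_probability(token, logprob)
    print(f"{question} -> P(sufficient)={p_true:.4f}, answer={should_answer(token, logprob)}")
```

A stricter cutoff trades coverage for fewer answers that are not grounded in the retrieved text; the right value would need to be tuned on real data.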




3. Autocomplete

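Another use case for logprobs is autocomplete systems. Without building an entire autocomplete system end to end, let's demonstrate how logprobs can help us decide how to suggest words as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we want to dynamically recommend the next word or token as we type the sentence, but only when the model is quite sure what the next word will be. To demonstrate this, let's break the sentence into sequential components.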

sentence_list = [
    "My",
    "My least",
    "My least favorite",
    "My least favorite TV",
    "My least favorite TV show",
    "My least favorite TV show is",
    "My least favorite TV show is Breaking Bad",
]


high_prob_completions = {}
low_prob_completions = {}
html_output = ""

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        model="gpt-3.5-turbo",
        logprobs=True,
        top_logprobs=3,  # show the three most likely next tokens
    )
    html_output += f'<p>Sentence: {sentence}</p>'
    first_token = True
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        html_output += f'<p style="color:cyan">Predicted next token: {token.token}, <span style="color:darkorange">logprobs: {token.logprob}</span>, <span style="color:magenta">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'
        # Only the most likely candidate (listed first) is bucketed by confidence
        if first_token:
            if np.exp(token.logprob) > 0.95:
                high_prob_completions[sentence] = token.token
            if np.exp(token.logprob) < 0.60:
                low_prob_completions[sentence] = token.token
        first_token = False
    html_output += "<br>"

display(HTML(html_output))


Sentence: My

Predicted next token: favorite, logprobs: -0.18245785, linear probability: 83.32%

Predicted next token: dog, logprobs: -2.397172, linear probability: 9.1%

Predicted next token: ap, logprobs: -3.8732424, linear probability: 2.08%

Sentence: My least

Predicted next token: favorite, logprobs: -0.0146376295, linear probability: 98.55%

Predicted next token: My, logprobs: -4.2417912, linear probability: 1.44%

Predicted next token: favorite, logprobs: -9.748788, linear probability: 0.01%

Sentence: My least favorite

Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%

Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%

Predicted next token: color, logprobs: -1.3887696, linear probability: 24.94%

Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.0007898556, linear probability: 99.92%

Predicted next token: My, logprobs: -7.711523, linear probability: 0.04%

Predicted next token: series, logprobs: -9.348547, linear probability: 0.01%

Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.2851253, linear probability: 75.19%

Predicted next token: of, logprobs: -1.55335, linear probability: 21.15%

Predicted next token: My, logprobs: -3.4928775, linear probability: 3.04%

Sentence: My least favorite TV show is

Predicted next token: "My, logprobs: -0.69349754, linear probability: 49.98%

Predicted next token: "The, logprobs: -1.2899293, linear probability: 27.53%

Predicted next token: My, logprobs: -2.4170141, linear probability: 8.92%

Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.17786823, linear probability: 83.71%

Predicted next token: ,, logprobs: -2.3946173, linear probability: 9.12%

Predicted next token: ., logprobs: -3.1861975, linear probability: 4.13%



{'My least': 'favorite', 'My least favorite TV': 'show'}

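These look reasonable! We can feel confident in those suggestions. After writing 'My least favorite TV', it is very likely that you want to write 'show'! Now let's look at the autocompletion suggestions the model was less confident about: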


{'My least favorite': 'food', 'My least favorite TV show is': '"My'}

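These are logical as well. With only the prefix "My least favorite", it is unclear what the user is going to say, and which TV show the author likes least is really anyone's guess. So, using gpt-3.5-turbo, we can use logprobs to create the root of a dynamic autocompletion engine!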
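
The decision rule used above can be isolated into a small function that either returns a suggestion or stays silent. This is only a sketch: suggest_next_token is a hypothetical name, the 0.95 cutoff mirrors the high-confidence bucket used in this section, and the inputs are (token, logprob) pairs copied from the output above rather than live API objects.

```python
import math


def suggest_next_token(top_logprobs, min_probability=0.95):
    """Return the most likely next token if the model is confident enough, else None.

    top_logprobs: list of (token, logprob) pairs for one token position.
    """
    if not top_logprobs:
        return None
    # Pick the candidate with the highest logprob.
    token, logprob = max(top_logprobs, key=lambda pair: pair[1])
    return token if math.exp(logprob) > min_probability else None


# Values copied from the output above.
after_tv = [("show", -0.0007898556), ("My", -7.711523), ("series", -9.348547)]
after_favorite = [("food", -0.9481721), ("My", -1.3447137), ("color", -1.3887696)]

print(suggest_next_token(after_tv))
print(suggest_next_token(after_favorite))
```

Returning None for low-confidence positions lets a user interface simply show no suggestion rather than a likely-wrong one.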

4. Highlighter and bytes parameter


PROMPT = """What's the longest word in the English language?"""

API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4", logprobs=True, top_logprobs=5
)

def highlight_text(api_response):
    colors = [
        "#FF00FF",  # Magenta
        "#008000",  # Green
        "#FF8C00",  # Dark Orange
        "#FF0000",  # Red
        "#0000FF",  # Blue
    ]
    tokens = api_response.choices[0].logprobs.content

    color_idx = 0  # Initialize color index
    html_output = ""  # Initialize HTML output
    for t in tokens:
        token_str = bytes(t.bytes).decode("utf-8")  # Decode bytes to string

        # Add colored token to HTML output
        html_output += f"<span style='color: {colors[color_idx]}'>{token_str}</span>"

        # Move to the next color
        color_idx = (color_idx + 1) % len(colors)
    display(HTML(html_output))  # Display HTML output
    print(f"Total number of tokens: {len(tokens)}")

highlight_text(API_RESPONSE)


The longest word in the English language, according to the Guinness World Records, is 'pneumonoultramicroscopicsilicovolcanoconiosis'. It is a type of lung disease caused by inhaling ash and sand dust.
Total number of tokens: 51


PROMPT = """Output the blue heart emoji and its name."""
API_RESPONSE = get_completion(
    [{"role": "user", "content": PROMPT}], model="gpt-4", logprobs=True
)

aggregated_bytes = []
joint_logprob = 0.0

# Iterate over the tokens, aggregating bytes and accumulating the joint logprob
for token in API_RESPONSE.choices[0].logprobs.content:
    print("Token:", token.token)
    print("Log prob:", token.logprob)
    print("Linear prob:", np.round(exp(token.logprob) * 100, 2), "%")
    print("Bytes:", token.bytes, "\n")
    aggregated_bytes += token.bytes
    joint_logprob += token.logprob

# Decode the aggregated bytes to text
aggregated_text = bytes(aggregated_bytes).decode("utf-8")

# Assert that the decoded text matches the message content
assert API_RESPONSE.choices[0].message.content == aggregated_text

# Print the results
print("Bytes array:", aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print("Joint prob:", np.round(exp(joint_logprob) * 100, 2), "%")

Token: \xf0\x9f\x92
Log prob: -0.0003056686
Linear prob: 99.97 %
Bytes: [240, 159, 146]

Token: \x99
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [153]

Token: -
Log prob: -0.0096905725
Linear prob: 99.04 %
Bytes: [32, 45]

Token: Blue
Log prob: -0.00042042506
Linear prob: 99.96 %
Bytes: [32, 66, 108, 117, 101]

Token: Heart
Log prob: -7.302705e-05
Linear prob: 99.99 %
Bytes: [32, 72, 101, 97, 114, 116]

Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]
Decoded bytes: 💙 - Blue Heart
Joint prob: 98.96 %
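
The output above shows the emoji split across two tokens, which is why bytes have to be aggregated before decoding. The sketch below replays the byte values copied from that output using only the Python standard library; the incremental decoder is shown as one possible way to handle streamed tokens and is not something the original notebook uses.

```python
import codecs

# Per-token byte values copied from the output above.
token_bytes = [
    [240, 159, 146],
    [153],
    [32, 45],
    [32, 66, 108, 117, 101],
    [32, 72, 101, 97, 114, 116],
]

# The first token alone is an incomplete UTF-8 sequence, so decoding it fails.
try:
    bytes(token_bytes[0]).decode("utf-8")
    first_token_decodes = True
except UnicodeDecodeError:
    first_token_decodes = False

# Aggregating all bytes first yields valid UTF-8.
all_bytes = [b for chunk in token_bytes for b in chunk]
text = bytes(all_bytes).decode("utf-8")

# An incremental decoder buffers partial sequences until they are complete.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = [decoder.decode(bytes(chunk)) for chunk in token_bytes]

print("First token decodes on its own:", first_token_decodes)
print("Aggregated text:", text)
print("Streamed pieces:", streamed)
```

The incremental decoder emits an empty string for the first token and the complete emoji once the second token arrives, so the pieces can be displayed as they stream in.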



5. Calculating perplexity



prompts = [
    "In a short sentence, has artifical intelligence grown in the last decade?",
    "In a short sentence, what are your thoughts on the future of artificial intelligence?",
]

for prompt in prompts:
    API_RESPONSE = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo",
        logprobs=True,
    )

    logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]
    response_text = API_RESPONSE.choices[0].message.content
    response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]
    # Column widths used to align the printed rows
    max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])
    max_token_length = max(len(s) for s in response_text_tokens)

    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]
    formatted_lps = [f"{lp:.2f}".rjust(max_token_length) for lp in logprobs]

    # Perplexity is the exponential of the negative mean logprob
    perplexity_score = np.exp(-np.mean(logprobs))
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")

Prompt:     In a short sentence, has artifical intelligence grown in the last decade?
Response: Yes, artificial intelligence has grown significantly in the last decade.

Tokens: Yes , artificial intelligence has grown significantly in the last decade .
Logprobs: -0.00 -0.00 -0.00 -0.00 -0.00 -0.53 -0.11 -0.00 -0.00 -0.01 -0.00 -0.00
Perplexity: 1.0564125277713383

Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?
Response: The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed.

Tokens: The future of artificial intelligence holds great potential for transforming industries and improving efficiency , but also raises ethical and societal concerns that must be carefully addressed .
Logprobs: -0.19 -0.03 -0.00 -0.00 -0.00 -0.30 -0.51 -0.24 -0.03 -1.45 -0.23 -0.03 -0.22 -0.83 -0.48 -0.01 -0.38 -0.07 -0.47 -0.63 -0.18 -0.26 -0.01 -0.14 -0.00 -0.59 -0.55 -0.00
Perplexity: 1.3220795252314004

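In this example, gpt-3.5-turbo returned a lower perplexity score for the more deterministic question about recent history and a higher perplexity score for the more speculative assessment of the near future. Again, while these differences do not guarantee accuracy, they help guide our interpretation of the model's results and how we use them in the future.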
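
The perplexity formula used here, the exponential of the negative mean logprob, can be checked by hand without calling the API. The sketch below uses the rounded logprobs printed for the first response above, so its result only approximates the reported score; the function name perplexity is chosen for illustration.

```python
import math


def perplexity(logprobs):
    """Perplexity of a token sequence: exp of the negative mean logprob."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(-sum(logprobs) / len(logprobs))


# Two-decimal logprobs copied from the first response above (12 tokens).
first_response = [0.0, 0.0, 0.0, 0.0, 0.0, -0.53, -0.11, 0.0, 0.0, -0.01, 0.0, 0.0]

print("Approximate perplexity:", round(perplexity(first_response), 4))
print("Perplexity of a fully certain sequence:", perplexity([0.0, 0.0, 0.0]))
```

A sequence in which every token has probability 1 has perplexity 1, the lowest possible value; less certain tokens push the score upward.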

6. Conclusion


7. Possible extensions

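There are many other use cases for logprobs that are not covered in this cookbook. We can use logprobs for:
- Moderation
- Keyword selection
- Improving prompts and the interpretability of outputs
- Token healing
- and more!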