
This notebook demonstrates the use of the logprobs parameter in the Chat Completions API. When logprobs is enabled, the API returns the log probabilities of each output token, along with a limited number of the most likely tokens at each token position and their log probabilities. The relevant request parameters are:

  • logprobs: whether to return log probabilities of the output tokens. If true, returns the log probabilities of each output token in the message content. Currently unavailable on the gpt-4-vision-preview model.
  • top_logprobs: an integer between 0 and 5 specifying the number of most likely tokens to return at each token position, each with an associated log probability. logprobs must be set to true if this parameter is used.

Log probabilities of output tokens indicate the likelihood of each token occurring in the sequence given the context. To keep things simple, a logprob is log(p), where p = the probability of a token occurring at a specific position given the previous tokens in the context. Some key points about logprobs:

  • Higher log probabilities suggest a higher likelihood of the token in that context. This allows users to gauge the model's confidence in its output or explore alternative responses the model considered.
  • A logprob can be any negative number or 0.0; 0.0 corresponds to 100% probability.
  • Logprobs allow us to compute the joint probability of a sequence as the sum of the logprobs of the individual tokens. This is useful for scoring and ranking model outputs. Another common approach is to take the average per-token logprob of a sentence to choose the best generation.
  • We can inspect the logprobs assigned to different candidate tokens to understand what options the model considered plausible or implausible.
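
To make these relationships concrete, here is a minimal sketch (plain Python with made-up logprob values, no API call) showing the conversion from logprobs to linear probabilities, plus the joint and average scores described above:

from math import exp

# Hypothetical per-token logprobs for a three-token completion (illustrative values only)
token_logprobs = [-0.01, -0.25, -1.20]

# exp(logprob) converts a logprob to a linear probability; exp(0.0) == 1.0, i.e. 100%
linear_probs = [exp(lp) for lp in token_logprobs]  # [0.990, 0.779, 0.301]

# Joint probability of the sequence: sum the logprobs, then exponentiate
joint_prob = exp(sum(token_logprobs))  # ~0.232

# Average per-token logprob, a common score for ranking candidate generations
avg_logprob = sum(token_logprobs) / len(token_logprobs)  # ~-0.487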

While there are a wide array of use cases for logprobs, this notebook will focus on its use for:

  1. Classification tasks
  • Large language models excel at many classification tasks, but accurately measuring the model's confidence in its outputs can be challenging. logprobs provide a probability associated with each class prediction, enabling users to set their own classification or confidence thresholds.
  2. Retrieval (Q&A) evaluation
  • logprobs can assist with self-evaluation in retrieval applications. In the Q&A example, the model outputs a contrived has_sufficient_context_for_answer boolean, which can serve as a confidence score for whether the answer is contained in the retrieved content. Evaluations of this type can reduce retrieval-based hallucinations and improve accuracy.
  3. Autocomplete
  • logprobs can help us decide how to suggest words as a user is typing.
  4. Token highlighting and outputting bytes
  • Users can easily create a token highlighter using the built-in tokenization that comes with enabling logprobs. Additionally, the bytes parameter includes the ASCII encoding of each output character, which is particularly useful for reproducing emojis and special characters.
  5. Calculating perplexity
  • logprobs can be used to help us assess the model's overall confidence in a result and to compare the confidence of results from different prompts.

0. Imports and utils

from openai import OpenAI
from math import exp
import numpy as np
from IPython.display import display, HTML
import os

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

def get_completion(
    messages: list[dict[str, str]],
    model: str = "gpt-4",
    max_tokens=500,
    temperature=0,
    stop=None,
    seed=123,
    tools=None,
    logprobs=None,  # whether to return log probabilities of the output tokens. If true, returns the log probability of each output token in the message content.
    top_logprobs=None,
) -> str:
    params = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stop": stop,
        "seed": seed,
        "logprobs": logprobs,
        "top_logprobs": top_logprobs,
    }
    if tools:
        params["tools"] = tools

    completion = client.chat.completions.create(**params)
    return completion

1. Using logprobs to assess confidence for classification tasks

Let's say we want to create a system to classify news articles into a set of pre-defined categories. Without logprobs, we can use Chat Completions to do this, but it is much more difficult to assess how certain the model is in its classifications.

Now, with logprobs enabled, we can see exactly how confident the model is in its predictions, which is crucial for creating an accurate and trustworthy classifier. For example, if the log probability for the chosen category is high, this suggests the model is quite confident in its classification. If it is low, this suggests confidence is low. This is particularly useful in cases where the model's classification is not what you expected, or where its outputs need human review or verification.

We'll begin with a prompt that presents the model with four categories: Technology, Politics, Sports, and Art. The model is then asked to classify articles into these categories based solely on their headlines.

CLASSIFICATION_PROMPT = """You will be given a headline of a news article.
Classify the article into one of the following categories: Technology, Politics, Sports, and Art.
Return only the name of the category, and nothing else.
MAKE SURE your output is one of the four categories stated.
Article headline: {headline}"""


Let's look at three sample headlines, starting with a standard Chat Completions output without logprobs.

headlines = [
    "Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.",
    "Local Mayor Launches Initiative to Enhance Urban Public Transport.",
    "Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut",
]


for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4",
    )
    print(f"Category: {API_RESPONSE.choices[0].message.content}\n")


Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Category: Technology


Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Category: Politics


Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Category: Art

Here we can see the selected category for each headline. However, we have no visibility into how confident the model is in its predictions. Let's rerun the same prompt, but with logprobs enabled and top_logprobs set to 2 (this will show us the 2 most likely output tokens at each token position). Additionally, we can output the linear probability of each output token, converting the log probability to the more easily interpretable scale of 0-100%.

for headline in headlines:
    print(f"\nHeadline: {headline}")
    API_RESPONSE = get_completion(
        [{"role": "user", "content": CLASSIFICATION_PROMPT.format(headline=headline)}],
        model="gpt-4",
        logprobs=True,
        top_logprobs=2,
    )
    top_two_logprobs = API_RESPONSE.choices[0].logprobs.content[0].top_logprobs
    html_content = ""
    for i, logprob in enumerate(top_two_logprobs, start=1):
        html_content += (
            f"<span style='color: cyan'>Output token {i}:</span> {logprob.token}, "
            f"<span style='color: darkorange'>logprobs:</span> {logprob.logprob}, "
            f"<span style='color: magenta'>linear probability:</span> {np.round(np.exp(logprob.logprob)*100,2)}%<br>"
        )
    display(HTML(html_content))
    print("\n")


Headline: Tech Giant Unveils Latest Smartphone Model with Advanced Photo-Editing Features.
Output token 1: Technology, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Techn, logprobs: -13.781253, linear probability: 0.0%



Headline: Local Mayor Launches Initiative to Enhance Urban Public Transport.
Output token 1: Politics, logprobs: -2.4584822e-06, linear probability: 100.0%
Output token 2: Technology, logprobs: -13.937503, linear probability: 0.0%



Headline: Tennis Champion Showcases Hidden Talents in Symphony Orchestra Debut
Output token 1: Art, logprobs: -0.009169078, linear probability: 99.09%
Output token 2: Sports, logprobs: -4.696669, linear probability: 0.91%

As expected from the first two headlines, gpt-4 is nearly 100% confident in its classifications, as the content clearly focuses on technology and politics respectively. However, the third headline combines both sports- and art-related topics, so we see the model is less confident in its selection.

This shows how important using logprobs can be: if we are using LLMs for classification tasks, we can set confidence thresholds, or output several candidate tokens when the log probability of the selected output is not sufficiently high. For instance, if we are creating a recommendation engine to tag articles, we can automatically classify headlines that cross a certain threshold, and send the less certain ones for manual review.
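
A minimal sketch of such a routing rule, reusing the API_RESPONSE objects from the loop above (the 0.9 threshold and the route_headline helper are hypothetical, not part of the API):

import numpy as np

def route_headline(api_response, threshold=0.9):
    # Hypothetical routing rule: auto-accept only if the top token clears the threshold
    top_token = api_response.choices[0].logprobs.content[0]
    if np.exp(top_token.logprob) >= threshold:
        return {"category": top_token.token, "status": "auto-classified"}
    # Otherwise, surface the top alternatives for manual review
    candidates = [
        (t.token, np.round(np.exp(t.logprob) * 100, 2))
        for t in top_token.top_logprobs
    ]
    return {"candidates": candidates, "status": "needs manual review"}

On the outputs above, all three headlines would clear a 0.9 threshold; tightening it to 0.995 would route the ambiguous Sports/Art headline (99.09%) for review instead.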

2. Retrieval confidence scoring to reduce hallucinations

To reduce hallucinations and improve the performance of our RAG-based Q&A system, we can use logprobs to evaluate how confident the model is in its retrieval.

Let's say we have built a retrieval system using RAG for Q&A, but are struggling with hallucinated answers to our questions. *Note:* we will use a hardcoded article for this example, but see other entries in the cookbook for tutorials on using RAG for Q&A.

# Article retrieval
ada_lovelace_article = """Augusta Ada King, Countess of Lovelace (née Byron; 10 December 1815 – 27 November 1852) was an English mathematician and writer, chiefly known for her work on Charles Babbage's proposed mechanical general-purpose computer, the Analytical Engine. She was the first to recognise that the machine had applications beyond pure calculation.
Ada Byron was the only legitimate child of poet Lord Byron and reformer Lady Byron. All Lovelace's half-siblings, Lord Byron's other children, were born out of wedlock to other women. Byron separated from his wife a month after Ada was born and left England forever. He died in Greece when Ada was eight. Her mother was anxious about her upbringing and promoted Ada's interest in mathematics and logic in an effort to prevent her from developing her father's perceived insanity. Despite this, Ada remained interested in him, naming her two sons Byron and Gordon. Upon her death, she was buried next to him at her request. Although often ill in her childhood, Ada pursued her studies assiduously. She married William King in 1835. King was made Earl of Lovelace in 1838, Ada thereby becoming Countess of Lovelace.
Her educational and social exploits brought her into contact with scientists such as Andrew Crosse, Charles Babbage, Sir David Brewster, Charles Wheatstone, Michael Faraday, and the author Charles Dickens, contacts which she used to further her education. Ada described her approach as "poetical science" and herself as an "Analyst (& Metaphysician)".
When she was eighteen, her mathematical talents led her to a long working relationship and friendship with fellow British mathematician Charles Babbage, who is known as "the father of computers". She was in particular interested in Babbage's work on the Analytical Engine. Lovelace first met him in June 1833, through their mutual friend, and her private tutor, Mary Somerville.
Between 1842 and 1843, Ada translated an article by the military engineer Luigi Menabrea (later Prime Minister of Italy) about the Analytical Engine, supplementing it with an elaborate set of seven notes, simply called "Notes".
Lovelace's notes are important in the early history of computers, especially since the seventh one contained what many consider to be the first computer program—that is, an algorithm designed to be carried out by a machine. Other historians reject this perspective and point out that Babbage's personal notes from the years 1836/1837 contain the first programs for the engine. She also developed a vision of the capability of computers to go beyond mere calculating or number-crunching, while many others, including Babbage himself, focused only on those capabilities. Her mindset of "poetical science" led her to ask questions about the Analytical Engine (as shown in her notes) examining how individuals and society relate to technology as a collaborative tool.
"""

# Questions that can be easily answered given the article
easy_questions = [
    "What nationality was Ada Lovelace?",
    "What was an important finding from Lovelace's seventh note?",
]

# Questions not fully covered in the article
medium_questions = [
    "Did Lovelace collaborate with Charles Dickens",
    "What concepts did Lovelace build with Charles Babbage",
]

Now, we can ask the model to answer the question, but also to evaluate its own response. Specifically, we will ask the model to output a boolean has_sufficient_context_for_answer. We can then evaluate the logprobs to see just how confident the model is that its answer is contained in the provided context.

PROMPT = """You retrieved this article: {article}. The question is: {question}.
Before even answering the question, consider whether you have sufficient information in the article to answer the question fully.
Your output should JUST be the boolean true or false, of if you have sufficient information in the article to answer the question.
Respond with just one word, the boolean true or false. You must output the word 'True', or the word 'False', nothing else.
"""


html_output = ""
html_output += "Questions clearly answered in article"

for question in easy_questions:
API_RESPONSE = get_completion(
[
{
"role": "user",
"content": PROMPT.format(
article=ada_lovelace_article, question=question
),
}
],
model="gpt-4",
logprobs=True,
)
html_output += f'<p style="color:green">Question: {question}</p>'
for logprob in API_RESPONSE.choices[0].logprobs.content:
html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

html_output += "Questions only partially covered in the article"

for question in medium_questions:
API_RESPONSE = get_completion(
[
{
"role": "user",
"content": PROMPT.format(
article=ada_lovelace_article, question=question
),
}
],
model="gpt-4",
logprobs=True,
top_logprobs=3,
)
html_output += f'<p style="color:green">Question: {question}</p>'
for logprob in API_RESPONSE.choices[0].logprobs.content:
html_output += f'<p style="color:cyan">has_sufficient_context_for_answer: {logprob.token}, <span style="color:darkorange">logprobs: {logprob.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(logprob.logprob)*100,2)}%</span></p>'

display(HTML(html_output))

Questions clearly answered in article

Question: What nationality was Ada Lovelace?

has_sufficient_context_for_answer: True, logprobs: -3.1281633e-07, linear probability: 100.0%

Question: What was an important finding from Lovelace's seventh note?

has_sufficient_context_for_answer: True, logprobs: -7.89631e-07, linear probability: 100.0%

Questions only partially covered in the article

Question: Did Lovelace collaborate with Charles Dickens

has_sufficient_context_for_answer: True, logprobs: -0.06993677, linear probability: 93.25%

Question: What concepts did Lovelace build with Charles Babbage

has_sufficient_context_for_answer: False, logprobs: -0.61807257, linear probability: 53.9%

For the first two questions, our model asserts with (near) 100% confidence that the article has sufficient context to answer the posed questions.

On the other hand, for the trickier questions which are less clearly answered in the article, the model is less confident that it has sufficient context. This is a great guardrail to help ensure our retrieved content is sufficient.

This self-evaluation can help reduce hallucinations, as you can restrict answers or re-prompt the user when your sufficient_context_for_answer log probability is below a certain threshold. Methods like this have been shown to significantly reduce RAG-for-Q&A hallucinations and errors (example). A minimal sketch of such a guardrail follows below.
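
This sketch assumes the single-token True/False output used above; the 0.98 threshold and the helper name are hypothetical choices, not part of the API:

import numpy as np

def sufficient_context_guardrail(api_response, threshold=0.98):
    # The first (and only) output token is the has_sufficient_context_for_answer boolean
    token = api_response.choices[0].logprobs.content[0]
    if token.token == "True" and np.exp(token.logprob) >= threshold:
        return True  # confident enough to go ahead and generate an answer
    # Otherwise restrict the answer or re-prompt the user
    return False

Applied to the four responses above, this passes both easy questions and flags both of the trickier ones.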



3. Autocomplete

Another use case for logprobs is autocomplete systems. Without building an entire autocomplete system end-to-end, let's demonstrate how logprobs could help us decide how to suggest words as a user is typing.

First, let's come up with a sample sentence: "My least favorite TV show is Breaking Bad." Let's say we want the system to dynamically recommend the next word or token as we type the sentence, but only when the model is quite sure of what the next word will be. To demonstrate this, let's break the sentence into sequential components.

sentence_list = [
    "My",
    "My least",
    "My least favorite",
    "My least favorite TV",
    "My least favorite TV show",
    "My least favorite TV show is",
    "My least favorite TV show is Breaking Bad",
]

Now, we can ask gpt-3.5-turbo to act as an autocomplete engine with whatever context the model is given. We can enable logprobs and see how confident the model is in its predictions.

high_prob_completions = {}
low_prob_completions = {}
html_output = ""

for sentence in sentence_list:
    PROMPT = """Complete this sentence. You are acting as auto-complete. Simply complete the sentence to the best of your ability, make sure it is just ONE sentence: {sentence}"""
    API_RESPONSE = get_completion(
        [{"role": "user", "content": PROMPT.format(sentence=sentence)}],
        model="gpt-3.5-turbo",
        logprobs=True,
        top_logprobs=3,
    )
    html_output += f'<p>Sentence: {sentence}</p>'
    first_token = True
    for token in API_RESPONSE.choices[0].logprobs.content[0].top_logprobs:
        html_output += f'<p style="color:cyan">Predicted next token: {token.token}, <span style="color:darkorange">logprobs: {token.logprob}, <span style="color:magenta">linear probability: {np.round(np.exp(token.logprob)*100,2)}%</span></p>'
        if first_token:
            if np.exp(token.logprob) > 0.95:
                high_prob_completions[sentence] = token.token
            if np.exp(token.logprob) < 0.60:
                low_prob_completions[sentence] = token.token
        first_token = False
    html_output += "<br>"

display(HTML(html_output))

Sentence: My

Predicted next token: favorite, logprobs: -0.18245785, linear probability: 83.32%

Predicted next token: dog, logprobs: -2.397172, linear probability: 9.1%

Predicted next token: ap, logprobs: -3.8732424, linear probability: 2.08%


Sentence: My least

Predicted next token: favorite, logprobs: -0.0146376295, linear probability: 98.55%

Predicted next token: My, logprobs: -4.2417912, linear probability: 1.44%

Predicted next token: favorite, logprobs: -9.748788, linear probability: 0.01%


Sentence: My least favorite

Predicted next token: food, logprobs: -0.9481721, linear probability: 38.74%

Predicted next token: My, logprobs: -1.3447137, linear probability: 26.06%

Predicted next token: color, logprobs: -1.3887696, linear probability: 24.94%


Sentence: My least favorite TV

Predicted next token: show, logprobs: -0.0007898556, linear probability: 99.92%

Predicted next token: My, logprobs: -7.711523, linear probability: 0.04%

Predicted next token: series, logprobs: -9.348547, linear probability: 0.01%


Sentence: My least favorite TV show

Predicted next token: is, logprobs: -0.2851253, linear probability: 75.19%

Predicted next token: of, logprobs: -1.55335, linear probability: 21.15%

Predicted next token: My, logprobs: -3.4928775, linear probability: 3.04%


Sentence: My least favorite TV show is

Predicted next token: "My, logprobs: -0.69349754, linear probability: 49.98%

Predicted next token: "The, logprobs: -1.2899293, linear probability: 27.53%

Predicted next token: My, logprobs: -2.4170141, linear probability: 8.92%


Sentence: My least favorite TV show is Breaking Bad

Predicted next token: because, logprobs: -0.17786823, linear probability: 83.71%

Predicted next token: ,, logprobs: -2.3946173, linear probability: 9.12%

Predicted next token: ., logprobs: -3.1861975, linear probability: 4.13%


Let's look at the high confidence autocompletions:

high_prob_completions


{'My least': 'favorite', 'My least favorite TV': 'show'}

These look reasonable! We can feel confident in those suggestions. It's pretty likely you want to write 'show' after writing 'My least favorite TV'! Now let's look at the autocompletion suggestions the model was less confident about:

low_prob_completions


{'My least favorite': 'food', 'My least favorite TV show is': '"My'}

These are logical as well. It's pretty unclear what the user is going to say with just the prefix 'My least favorite', and the author's choice of least favorite TV show is anyone's guess. So, using gpt-3.5-turbo, we can create the root of a dynamic autocompletion engine with logprobs! A hedged sketch of the suggestion rule follows below.
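
Here, suggest_next_token is a hypothetical helper, and the 95% cutoff mirrors the threshold used in the loop above:

import numpy as np

def suggest_next_token(api_response, min_prob=0.95):
    # Only surface a suggestion when the model's top next token is near-certain
    top = api_response.choices[0].logprobs.content[0].top_logprobs[0]
    if np.exp(top.logprob) > min_prob:
        return top.token
    return None  # not confident enough: stay quiet rather than guess

On the responses above, this would return 'favorite' after 'My least' and 'show' after 'My least favorite TV', and None everywhere else, matching high_prob_completions.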

4. Highlighter and bytes parameter

Let's quickly touch on creating a simple token highlighter with logprobs and the bytes parameter. First, we can create a function that counts and highlights each token. While this doesn't use the log probabilities themselves, it uses the built-in tokenization that comes with enabling logprobs.

PROMPT = """英语中最长的单词是什么?"""

API_RESPONSE = get_completion(
[{"role": "user", "content": PROMPT}], model="gpt-4", logprobs=True, top_logprobs=5
)


def highlight_text(api_response):
colors = [
"#FF00FF", # Magenta
"#008000", # Green
"#FF8C00", # Dark Orange
"#FF0000", # Red
"#0000FF", # Blue
]
tokens = api_response.choices[0].logprobs.content

color_idx = 0 # Initialize color index
html_output = "" # Initialize HTML output
for t in tokens:
token_str = bytes(t.bytes).decode("utf-8") # Decode bytes to string

# Add colored token to HTML output
html_output += f"<span style='color: {colors[color_idx]}'>{token_str}</span>"

# 切换到下一个颜色
color_idx = (color_idx + 1) % len(colors)
display(HTML(html_output)) # 显示 HTML 输出
print(f"Total number of tokens: {len(tokens)}")

highlight_text(API_RESPONSE)


The longest word in the English language, according to the Guinness World Records, is 'pneumonoultramicroscopicsilicovolcanoconiosis'. It is a type of lung disease caused by inhaling ash and sand dust.
Total number of tokens: 51

Next, let's reconstruct a sentence using the bytes parameter. With logprobs enabled, we are given both each token and the ASCII (decimal utf-8) values of each token string. These ASCII values can be helpful when handling tokens containing emojis or special characters.

PROMPT = """输出蓝色心形表情符号及其名称。"""
API_RESPONSE = get_completion(
[{"role": "user", "content": PROMPT}], model="gpt-4", logprobs=True
)

aggregated_bytes = []
joint_logprob = 0.0

# 遍历各个词元,聚合字节并计算联合对数概率
for token in API_RESPONSE.choices[0].logprobs.content:
print("Token:", token.token)
print("Log prob:", token.logprob)
print("Linear prob:", np.round(exp(token.logprob) * 100, 2), "%")
print("Bytes:", token.bytes, "\n")
aggregated_bytes += token.bytes
joint_logprob += token.logprob

# 将聚合的字节解码为文本
aggregated_text = bytes(aggregated_bytes).decode("utf-8")

# 断言解码后的文本与消息内容相同
assert API_RESPONSE.choices[0].message.content == aggregated_text

# 打印结果
print("Bytes array:", aggregated_bytes)
print(f"Decoded bytes: {aggregated_text}")
print("Joint prob:", np.round(exp(joint_logprob) * 100, 2), "%")

Token: \xf0\x9f\x92
Log prob: -0.0003056686
Linear prob: 99.97 %
Bytes: [240, 159, 146]

Token: \x99
Log prob: 0.0
Linear prob: 100.0 %
Bytes: [153]

Token: -
Log prob: -0.0096905725
Linear prob: 99.04 %
Bytes: [32, 45]

Token: Blue
Log prob: -0.00042042506
Linear prob: 99.96 %
Bytes: [32, 66, 108, 117, 101]

Token: Heart
Log prob: -7.302705e-05
Linear prob: 99.99 %
Bytes: [32, 72, 101, 97, 114, 116]

Bytes array: [240, 159, 146, 153, 32, 45, 32, 66, 108, 117, 101, 32, 72, 101, 97, 114, 116]
Decoded bytes: 💙 - Blue Heart
Joint prob: 98.96 %

Here, we see that while the first token is \xf0\x9f\x92, we can get its ASCII values and append them to a bytes array. Then, we can easily decode this array into a full sentence, and validate with our assert statement that the decoded bytes are the same as our completion message!

Additionally, we can get the joint probability of the entire completion, which is the exponentiated sum of each token's log probability (equivalently, the product of each token's linear probability). This tells us how likely this completion was given the prompt. Since our prompt is quite directive (asking for a specific emoji and its name), the joint probability of this output is high! If we asked for a random output, however, we'd see a much lower joint probability. This can also be a good tactic for developers during prompt engineering. A small check of this equivalence follows below.
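
As a quick check (using the logprob values from the output above), exponentiating the summed logprobs gives the same number as multiplying the per-token linear probabilities:

from math import exp

logprobs = [-0.0003056686, 0.0, -0.0096905725, -0.00042042506, -7.302705e-05]

joint_from_sum = exp(sum(logprobs))
joint_from_product = 1.0
for lp in logprobs:
    joint_from_product *= exp(lp)

print(round(joint_from_sum * 100, 2), "%")  # 98.96 %, matching the joint prob above
assert abs(joint_from_sum - joint_from_product) < 1e-12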

5. Calculating perplexity

When looking to assess the model's confidence in a result, it can be useful to calculate perplexity, which is a measure of uncertainty. Perplexity can be calculated by exponentiating the negative of the average of the logprobs. Generally, a higher perplexity indicates a more uncertain result, and a lower perplexity indicates a more confident result. As such, perplexity can be used both to assess the result of an individual model run and to compare the relative confidence of results between model runs. While high confidence doesn't guarantee result accuracy, it can be a helpful signal to pair with other evaluation metrics to build a better understanding of your prompt's behavior. A minimal sketch of the calculation follows below.
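
A minimal sketch of the formula, with made-up logprob values for illustration:

import numpy as np

def perplexity(logprobs):
    # Exponentiate the negative mean of the token logprobs
    return np.exp(-np.mean(logprobs))

print(perplexity([-0.01, -0.02, -0.01]))  # ~1.01: confident, low perplexity
print(perplexity([-0.50, -1.20, -0.80]))  # ~2.30: uncertain, higher perplexity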

For example, let's say I want to use gpt-3.5-turbo to learn more about artificial intelligence. I could ask a question about recent history and a question about the future:

prompts = [
    "In a short sentence, has artifical intelligence grown in the last decade?",
    "In a short sentence, what are your thoughts on the future of artificial intelligence?",
]

for prompt in prompts:
    API_RESPONSE = get_completion(
        [{"role": "user", "content": prompt}],
        model="gpt-3.5-turbo",
        logprobs=True,
    )

    logprobs = [token.logprob for token in API_RESPONSE.choices[0].logprobs.content]
    response_text = API_RESPONSE.choices[0].message.content
    response_text_tokens = [token.token for token in API_RESPONSE.choices[0].logprobs.content]
    max_starter_length = max(len(s) for s in ["Prompt:", "Response:", "Tokens:", "Logprobs:", "Perplexity:"])
    max_token_length = max(len(s) for s in response_text_tokens)

    formatted_response_tokens = [s.rjust(max_token_length) for s in response_text_tokens]
    formatted_lps = [f"{lp:.2f}".rjust(max_token_length) for lp in logprobs]

    perplexity_score = np.exp(-np.mean(logprobs))
    print("Prompt:".ljust(max_starter_length), prompt)
    print("Response:".ljust(max_starter_length), response_text, "\n")
    print("Tokens:".ljust(max_starter_length), " ".join(formatted_response_tokens))
    print("Logprobs:".ljust(max_starter_length), " ".join(formatted_lps))
    print("Perplexity:".ljust(max_starter_length), perplexity_score, "\n")

Prompt:     In a short sentence, has artifical intelligence grown in the last decade?
Response: Yes, artificial intelligence has grown significantly in the last decade.

Tokens: Yes , artificial intelligence has grown significantly in the last decade .
Logprobs: -0.00 -0.00 -0.00 -0.00 -0.00 -0.53 -0.11 -0.00 -0.00 -0.01 -0.00 -0.00
Perplexity: 1.0564125277713383

Prompt: In a short sentence, what are your thoughts on the future of artificial intelligence?
Response: The future of artificial intelligence holds great potential for transforming industries and improving efficiency, but also raises ethical and societal concerns that must be carefully addressed.

Tokens: The future of artificial intelligence holds great potential for transforming industries and improving efficiency , but also raises ethical and societal concerns that must be carefully addressed .
Logprobs: -0.19 -0.03 -0.00 -0.00 -0.00 -0.30 -0.51 -0.24 -0.03 -1.45 -0.23 -0.03 -0.22 -0.83 -0.48 -0.01 -0.38 -0.07 -0.47 -0.63 -0.18 -0.26 -0.01 -0.14 -0.00 -0.59 -0.55 -0.00
Perplexity: 1.3220795252314004

In this example, gpt-3.5-turbo returned a lower perplexity score for the more deterministic question about recent history, and a higher perplexity score for the more speculative assessment about the near future. Again, while these differences don't guarantee accuracy, they help point the way for our interpretation of the model's results and our future use of them.

6. Conclusion

Nice! We were able to use the logprobs parameter to build a more robust classifier, evaluate retrieval for our Q&A system, and encode and decode each 'byte' of our tokens! logprobs adds useful information and signal to our completions output, and we are excited to see how developers incorporate it to improve applications.

7. Possible extensions

There are many other use cases for logprobs that are not covered in this cookbook. We can use logprobs for:

  • Moderation
  • Keyword selection
  • Improving prompts and interpretability of outputs
  • Token healing
  • And more!