
How to evaluate a summarization task


In this notebook we explore techniques for evaluating an abstractive summarization task through a simple example. We cover traditional evaluation methods such as ROUGE and BERTScore, and also demonstrate a more novel approach: using LLMs as evaluators.

Evaluating the quality of summaries is a time-consuming process, as it involves different quality dimensions such as coherence, conciseness, readability, and content. Traditional automatic metrics such as ROUGE and BERTScore are concrete and reliable, but they may not correlate well with the actual quality of a summary. Their correlation with human judgment is relatively low, especially for open-ended generation tasks (Liu et al., 2023). Where we need to rely on human evaluation, user feedback, or model-based metrics, we must be vigilant about potential biases. And while human judgment provides invaluable insights, it is often not scalable and can be cost-prohibitive.

In addition to these traditional metrics, we showcase a method (G-Eval) that leverages large language models (LLMs) as a novel, reference-free metric for assessing abstractive summaries. In this case, we use gpt-4 to score candidate outputs. gpt-4 has effectively learned an internal model of language quality that allows it to distinguish fluent, coherent text from low-quality text. Harnessing this internal scoring mechanism allows automatic evaluation of new candidate outputs generated by an LLM.

Setup

# Install the packages needed for the evaluation
# rouge: for evaluation with the ROUGE metric
# bert_score: for evaluation with BERTScore
# openai: to interact with OpenAI's API
!pip install rouge --quiet
!pip install bert_score --quiet
!pip install openai --quiet

from openai import OpenAI
import os
import re
import pandas as pd

# Python implementation of the ROUGE metric
from rouge import Rouge

# BERTScore leverages pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
from bert_score import BERTScorer

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))



Example task

For the purposes of this notebook, we will use the example summarization below. Notice that we provide two generated summaries to compare, along with a reference human-written summary, which evaluation metrics like ROUGE and BERTScore require.

Excerpt (excerpt):

OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges.

Summaries:

Reference Summary / ref_summary (human generated):
OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges.

Eval Summary 1 / eval_summary_1 (system generated):
OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good.

Eval Summary 2 / eval_summary_2 (system generated):
OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff.

Take a moment to decide which summary you personally prefer, and which one truly captures OpenAI's mission.

excerpt = "OpenAI's mission is to ensure that artificial general intelligence (AGI) benefits all of humanity. OpenAI will build safe and beneficial AGI directly, but will also consider its mission fulfilled if its work aids others to achieve this outcome. OpenAI follows several key principles for this purpose. First, broadly distributed benefits - any influence over AGI's deployment will be used for the benefit of all, and to avoid harmful uses or undue concentration of power. Second, long-term safety - OpenAI is committed to doing the research to make AGI safe, and to promote the adoption of such research across the AI community. Third, technical leadership - OpenAI aims to be at the forefront of AI capabilities. Fourth, a cooperative orientation - OpenAI actively cooperates with other research and policy institutions, and seeks to create a global community working together to address AGI's global challenges."
ref_summary = "OpenAI aims to ensure artificial general intelligence (AGI) is used for everyone's benefit, avoiding harmful uses or undue power concentration. It is committed to researching AGI safety, promoting such studies among the AI community. OpenAI seeks to lead in AI capabilities and cooperates with global research and policy institutions to address AGI's challenges."
eval_summary_1 = "OpenAI aims to AGI benefits all humanity, avoiding harmful uses and power concentration. It pioneers research into safe and beneficial AGI and promotes adoption globally. OpenAI maintains technical leadership in AI while cooperating with global institutions to address AGI challenges. It seeks to lead a collaborative worldwide effort developing AGI for collective good."
eval_summary_2 = "OpenAI aims to ensure AGI is for everyone's use, totally avoiding harmful stuff or big power concentration. Committed to researching AGI's safe side, promoting these studies in AI folks. OpenAI wants to be top in AI things and works with worldwide research, policy groups to figure AGI's stuff."


Evaluating using ROUGE

ROUGE, which stands for Recall-Oriented Understudy for Gisting Evaluation, primarily gauges the overlap of words between a generated output and a reference text. It is a prevalent metric for evaluating automatic summarization tasks. Among its variants, ROUGE-L offers insight into the longest common subsequence shared by the system-generated and reference summaries, gauging how well the system retains the essence of the reference.
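
As an aside, the core idea behind ROUGE-L can be sketched in a few lines. The simplified version below uses plain whitespace tokenization and a simple harmonic mean over the longest common subsequence, so its numbers will differ from those of the rouge package used in the next cell; it is meant only to illustrate what the metric measures.

# Illustrative only: a simplified LCS-based F-score, not the exact formula used by the rouge package
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def simple_rouge_l_f(candidate, reference):
    # Whitespace tokenization; real ROUGE implementations apply their own preprocessing
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


print("Summary 1 (simplified rouge-l):", simple_rouge_l_f(eval_summary_1, ref_summary))
print("Summary 2 (simplified rouge-l):", simple_rouge_l_f(eval_summary_2, ref_summary))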

# Function to calculate the ROUGE scores
def get_rouge_scores(text1, text2):
    rouge = Rouge()
    return rouge.get_scores(text1, text2)


rouge_scores_out = []

# Calculate the ROUGE scores for both summaries against the reference summary
eval_1_rouge = get_rouge_scores(eval_summary_1, ref_summary)
eval_2_rouge = get_rouge_scores(eval_summary_2, ref_summary)

for metric in ["rouge-1", "rouge-2", "rouge-l"]:
    for label in ["F-Score"]:
        eval_1_score = eval_1_rouge[0][metric][label[0].lower()]
        eval_2_score = eval_2_rouge[0][metric][label[0].lower()]

        row = {
            "Metric": f"{metric} ({label})",
            "Summary 1": eval_1_score,
            "Summary 2": eval_2_score,
        }
        rouge_scores_out.append(row)


def highlight_max(s):
    is_max = s == s.max()
    return [
        "background-color: lightgreen" if v else "background-color: white"
        for v in is_max
    ]


rouge_scores_out = (
    pd.DataFrame(rouge_scores_out)
    .set_index("Metric")
    .style.apply(highlight_max, axis=1)
)

rouge_scores_out

Metric             Summary 1  Summary 2
rouge-1 (F-Score)  0.488889   0.511628
rouge-2 (F-Score)  0.230769   0.163265
rouge-l (F-Score)  0.488889   0.511628

The table shows the ROUGE scores for evaluating the two different summaries against the reference text. In the case of rouge-1, Summary 2 outperforms Summary 1, indicating a better overlap of individual words, and for rouge-l, Summary 2 also scores higher, implying a closer match in the longest common subsequence and thus potentially a better job at capturing the main content and order of the original text. Since Summary 2 has many words and short phrases lifted directly from the excerpt, its overlap with the reference summary is likely to be higher, leading to higher ROUGE scores.

While ROUGE and similar metrics, such as BLEU and METEOR, offer quantitative measures, they often fail to capture the true essence of a well-generated summary, and they correlate poorly with human scores. Given the advancements in LLMs, which are adept at producing fluent and coherent summaries, traditional metrics like ROUGE may inadvertently penalize these models. This is especially true when the summaries are articulated differently but still accurately encapsulate the core information.

Evaluating using BERTScore

ROUGE relies on the exact presence of words in both the predicted and reference texts, failing to account for their underlying semantics. This is where BERTScore comes in: it leverages contextual embeddings from the BERT model, aiming to evaluate the similarity between a predicted and a reference sentence in the context of machine-generated text. By comparing the embeddings of the two sentences, BERTScore captures semantic similarities that traditional n-gram based metrics might miss.

# Instantiate the BERTScorer object for English
scorer = BERTScorer(lang="en")

# Calculate BERTScore for Summary 1 against the reference summary
# P1, R1, F1_1 represent precision, recall, and F1 score respectively
P1, R1, F1_1 = scorer.score([eval_summary_1], [ref_summary])

# Calculate BERTScore for Summary 2 against the reference summary
# P2, R2, F2_2 represent precision, recall, and F1 score respectively
P2, R2, F2_2 = scorer.score([eval_summary_2], [ref_summary])

print("Summary 1 F1 Score:", F1_1.tolist()[0])
print("Summary 2 F1 Score:", F2_2.tolist()[0])

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Summary 1 F1 Score: 0.9227314591407776
Summary 2 F1 Score: 0.9189572930335999

The close F1 scores between the summaries indicate that they may perform similarly in capturing the key information. However, this small difference should be interpreted with caution. Since BERTScore may not fully grasp the subtleties and high-level concepts that a human evaluator would, relying on this metric alone could lead to misinterpreting the actual quality and nuances of the summaries. An integrated approach that combines BERTScore with human judgment and other metrics is likely to offer a more reliable evaluation.
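
To see how the two metrics complement each other, the short check below contrasts an invented paraphrase pair that shares almost no vocabulary, reusing the get_rouge_scores helper and the scorer instantiated above (the sentence pair is an illustrative assumption, not part of the example task). ROUGE-1 collapses toward zero because no tokens match, while BERTScore still reports a comparatively high similarity because the contextual embeddings of the synonyms are close.

# Invented paraphrase pair, for illustration only
cand = "The physician recommended more rest."
ref = "A doctor advised additional sleep."

# Lexical overlap: no shared tokens, so the ROUGE-1 F-score is (near) zero
print("ROUGE-1 F:", get_rouge_scores(cand, ref)[0]["rouge-1"]["f"])

# Semantic similarity: the BERTScore F1 stays comparatively high
print("BERTScore F1:", scorer.score([cand], [ref])[2].item())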

Evaluating using GPT-4

Here we implement an example reference-free text evaluator using gpt-4, inspired by the G-Eval framework, which evaluates the quality of generated text with large language models. Unlike metrics such as ROUGE or BERTScore that rely on comparison to reference summaries, the gpt-4-based evaluator assesses the quality of generated content based solely on the input prompt and the text, without any ground-truth references. This makes it applicable to new datasets and tasks where human references are sparse or unavailable.

Here is an overview of this approach:

  1. We define four distinct criteria:
    1. Relevance: evaluates whether the summary includes only important information and excludes redundancies.
    2. Coherence: assesses the logical flow and organization of the summary.
    3. Consistency: checks whether the summary aligns with the facts in the source document.
    4. Fluency: rates the grammar and readability of the summary.
  2. We craft prompts for each of these criteria, taking the original document and the summary as inputs, and leveraging chain-of-thought generation to guide the model to output a numeric score from 1 to 5 for each criterion.
  3. We generate scores from gpt-4 with the defined prompts and compare them across the summaries.

In this demonstration, we use a direct scoring function in which gpt-4 generates a discrete score (1-5) for each metric. Normalizing the scores and taking a weighted sum could result in a more robust, continuous score that better reflects the quality and diversity of the summaries (a minimal sketch of such a composite follows the results table below).

# Evaluation prompt template based on G-Eval
EVALUATION_PROMPT_TEMPLATE = """
You will be given one summary written for an article. Your task is to rate the summary on one metric.
Please make sure you read and understand these instructions very carefully.
Please keep this document open while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Source Text:

{document}

Summary:

{summary}

Evaluation Form (scores ONLY):

- {metric_name}
"""

# Metric 1: Relevance

RELEVANCY_SCORE_CRITERIA = """
Relevance(1-5) - selection of important content from the source. \
The summary should include only important information from the source document. \
Annotators were instructed to penalize summaries which contained redundancies and excess information.
"""

RELEVANCY_SCORE_STEPS = """
1. Read the summary and the source document carefully.
2. Compare the summary to the source document and identify the main points of the article.
3. Assess how well the summary covers the main points of the article, and how much irrelevant or redundant information it contains.
4. Assign a relevance score from 1 to 5.
"""

# Metric 2: Coherence

COHERENCE_SCORE_CRITERIA = """
Coherence(1-5) - the collective quality of all sentences. \
We align this dimension with the DUC quality question of structure and coherence \
whereby "the summary should be well-structured and well-organized. \
The summary should not just be a heap of related information, but should build from sentence to a\
coherent body of information about a topic."
"""

COHERENCE_SCORE_STEPS = """
1. Read the article carefully and identify the main topic and key points.
2. Read the summary and compare it to the article. Check if the summary covers the main topic and key points of the article,
and if it presents them in a clear and logical order.
3. Assign a score for coherence on a scale of 1 to 5, where 1 is the lowest and 5 is the highest based on the Evaluation Criteria.
"""

# Metric 3: Consistency

CONSISTENCY_SCORE_CRITERIA = """
Consistency(1-5) - the factual alignment between the summary and the summarized source. \
A factually consistent summary contains only statements that are entailed by the source document. \
Annotators were also asked to penalize summaries that contained hallucinated facts.
"""

CONSISTENCY_SCORE_STEPS = """
1. Read the article carefully and identify the main facts and details it presents.
2. Read the summary and compare it to the article. Check if the summary contains any factual errors that are not supported by the article.
3. Assign a score for consistency based on the Evaluation Criteria.
"""

# Metric 4: Fluency

FLUENCY_SCORE_CRITERIA = """
Fluency(1-3): the quality of the summary in terms of grammar, spelling, punctuation, word choice, and sentence structure.
1: Poor. The summary has many errors that make it hard to understand or sound unnatural.
2: Fair. The summary has some errors that affect the clarity or smoothness of the text, but the main points are still comprehensible.
3: Good. The summary has few or no errors and is easy to read and follow.
"""

FLUENCY_SCORE_STEPS = """
Read the summary and evaluate its fluency based on the given criteria. Assign a fluency score from 1 to 3.
"""


def get_geval_score(
    criteria: str, steps: str, document: str, summary: str, metric_name: str
):
    # Fill the G-Eval template with the criteria, steps, source document, and summary
    prompt = EVALUATION_PROMPT_TEMPLATE.format(
        criteria=criteria,
        steps=steps,
        metric_name=metric_name,
        document=document,
        summary=summary,
    )
    # Ask gpt-4 for the score only; temperature=0 keeps the output deterministic
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=5,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )
    return response.choices[0].message.content


evaluation_metrics = {
    "Relevance": (RELEVANCY_SCORE_CRITERIA, RELEVANCY_SCORE_STEPS),
    "Coherence": (COHERENCE_SCORE_CRITERIA, COHERENCE_SCORE_STEPS),
    "Consistency": (CONSISTENCY_SCORE_CRITERIA, CONSISTENCY_SCORE_STEPS),
    "Fluency": (FLUENCY_SCORE_CRITERIA, FLUENCY_SCORE_STEPS),
}

summaries = {"Summary 1": eval_summary_1, "Summary 2": eval_summary_2}

data = {"Evaluation Type": [], "Summary Type": [], "Score": []}

# Score each summary on each metric
for eval_type, (criteria, steps) in evaluation_metrics.items():
    for summ_type, summary in summaries.items():
        data["Evaluation Type"].append(eval_type)
        data["Summary Type"].append(summ_type)
        result = get_geval_score(criteria, steps, excerpt, summary, eval_type)
        score_num = int(result.strip())
        data["Score"].append(score_num)

# Pivot the results into a metric-by-summary table and highlight the best score per row
pivot_df = pd.DataFrame(data, index=None).pivot(
    index="Evaluation Type", columns="Summary Type", values="Score"
)
styled_pivot_df = pivot_df.style.apply(highlight_max, axis=1)
display(styled_pivot_df)

Evaluation Type  Summary 1  Summary 2
Coherence        5          3
Consistency      5          5
Fluency          3          2
Relevance        5          4

Overall, Summary 1 appears to outperform Summary 2 in three of the four categories (coherence, relevance, and fluency). The two summaries tie on consistency. The result might suggest that Summary 1 is generally preferable based on the given evaluation criteria.
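
Following up on the earlier note about normalization and weighting, the sketch below shows one possible way to fold the discrete scores in pivot_df into a single continuous number per summary. The weights and per-metric maxima are illustrative assumptions, not part of the G-Eval procedure.

# Illustrative aggregation only; the weights and normalization are assumptions
max_per_metric = {"Relevance": 5, "Coherence": 5, "Consistency": 5, "Fluency": 3}
weights = {"Relevance": 0.3, "Coherence": 0.25, "Consistency": 0.3, "Fluency": 0.15}

composite = {}
for summary_col in pivot_df.columns:
    total = 0.0
    for metric, weight in weights.items():
        # Normalize each score to the 0-1 range before weighting
        total += weight * pivot_df.loc[metric, summary_col] / max_per_metric[metric]
    composite[summary_col] = round(total, 3)

print(composite)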

Limitations

Note that LLM-based metrics can be biased towards preferring LLM-generated text over human-written text. Additionally, LLM-based metrics are sensitive to system messages/prompts. We recommend experimenting with other techniques that can help improve performance and/or produce consistent scores, striking the right balance between high-quality but expensive evaluation and automated evaluation. It is also worth noting that this scoring methodology is currently limited by gpt-4's context window.
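
Regarding the context-window limitation, one possible (untested) workaround is to chunk long source documents, score the summary against each chunk, and aggregate the results. The helper below is only a rough sketch of that idea, reusing get_geval_score from above; the chunk size and the aggregation choice are assumptions.

# Hypothetical workaround for long documents; not part of the original method
def get_geval_score_chunked(criteria, steps, document, summary, metric_name, chunk_chars=6000):
    # Split the source into fixed-size character chunks that fit the context window
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    scores = [
        int(get_geval_score(criteria, steps, chunk, summary, metric_name).strip())
        for chunk in chunks
    ]
    # Averaging is a simplification; the right aggregation (mean, min, max) depends on the
    # metric, e.g. a fact only needs support from one chunk to be consistent.
    return sum(scores) / len(scores)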

Conclusion

Evaluating abstractive summarization remains an open area in need of further improvement. Traditional metrics like ROUGE, BLEU, and BERTScore provide useful automatic evaluation but have limitations in capturing semantic similarity and the more nuanced aspects of summary quality. Moreover, they require reference outputs, which can be expensive or difficult to collect and label. LLM-based metrics offer promise as a reference-free way to assess coherence, fluency, and relevance; however, they also carry potential bias towards LLM-generated text. Ultimately, a combination of automatic metrics and human evaluation is ideal for reliably assessing abstractive summarization systems. While human evaluation is essential for a comprehensive understanding of summary quality, it should be complemented with automated evaluation to enable efficient, large-scale testing. The field will continue to evolve more robust evaluation techniques that balance quality, scalability, and fairness. Advancing evaluation methods is crucial for driving progress in production applications.

References