Evaluate a Simple LLM Application

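This guide demonstrates a simple workflow for testing and evaluating an LLM application with ragas. It assumes only minimal knowledge of building and evaluating AI applications. See the installation instructions to install ragas.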

Evaluation

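In this guide, you will evaluate a text summarization pipeline. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information.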

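ragas offers a variety of methods, called metrics, for analyzing the performance of an LLM application. Each metric requires a predefined set of data points, which it uses to compute a score that indicates performance.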

Evaluating with a Non-LLM Metric

Here is a simple example that uses BleuScore to score a summary:

from ragas import SingleTurnSample
from ragas.metrics import BleuScore

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
    "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter."
}
metric = BleuScore()
test_data = SingleTurnSample(**test_data)
metric.single_turn_score(test_data)

Output

0.137

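Here we used:

- A test sample containing user_input, response (the output from the LLM), and reference (the expected output from the LLM) as the data points for evaluating the summary.
- A non-LLM metric called BleuScore.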

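As you may have noticed, this approach has two key limitations:

- Time-consuming preparation: evaluating the application requires preparing the expected output (reference) for each input, which is both time-consuming and challenging.
- Inaccurate scoring: although the response and reference are similar, the output score is low. This is a known limitation of non-LLM metrics such as BleuScore.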
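To see why an n-gram overlap metric can score a faithful paraphrase poorly, the following minimal sketch computes clipped n-gram precision, the core ingredient of BLEU, for the response and reference used above. It is a simplified illustration and not how ragas's BleuScore is implemented: it omits the brevity penalty and smoothing, uses naive lowercase whitespace tokenization, and the helper names `ngrams` and `ngram_precision` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Credit each candidate n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


response = (
    "The company experienced an 8% increase in Q3 2024, largely due to effective "
    "marketing strategies and product adaptation, with expectations of continued "
    "growth in the coming quarter."
)
reference = (
    "The company reported an 8% growth in Q3 2024, primarily driven by strong sales "
    "in the Asian market, attributed to strategic marketing and localized products, "
    "with continued growth anticipated in the next quarter."
)

for n in range(1, 5):
    print(f"{n}-gram precision: {ngram_precision(response, reference, n):.3f}")
```

A paraphrase shares many individual words with the reference but few long word sequences, so the higher-order precisions fall off quickly and pull the combined score down.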

Info

A non-LLM metric is a metric that does not rely on an LLM for evaluation.

To address these issues, let's try an LLM-based metric.

Evaluating with an LLM-based Metric

Choose your LLM

Providers covered: OpenAI, AWS, Google Cloud, Azure, and others.

OpenAI

Install the langchain-openai package:

pip install langchain-openai

Make sure your OpenAI key is available in your environment.

import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
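Hard-coding a key in source, as in the placeholder above, is convenient for a demo but easy to leak. A minimal sketch, assuming the key has already been exported in the shell, reads it from the environment and fails early when it is missing. The helper name `require_env` is hypothetical and not part of ragas or langchain.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Report whether the key is present without ever printing its value.
try:
    require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY found")
except RuntimeError as err:
    print(err)
```

Failing at startup gives a clearer error than a failed authentication in the middle of an evaluation run.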

Wrap the LLMs in LangchainLLMWrapper so that they can be used with ragas.

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

AWS

Install the langchain-aws package:

pip install langchain-aws

Then set your AWS credentials and configuration:

config = {
    "credentials_profile_name": "your-profile-name",  # E.g "default"
    "region_name": "your-region-name",  # E.g. "us-east-1"
    "llm": "your-llm-model-id",  # E.g "anthropic.claude-3-5-sonnet-20241022-v2:0"
    "embeddings": "your-embedding-model-id",  # E.g "amazon.titan-embed-text-v2:0"
    "temperature": 0.4,
}

Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.

from langchain_aws import ChatBedrockConverse
from langchain_aws import BedrockEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatBedrockConverse(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    base_url=f"https://bedrock-runtime.{config['region_name']}.amazonaws.com",
    model=config["llm"],
    temperature=config["temperature"],
))
evaluator_embeddings = LangchainEmbeddingsWrapper(BedrockEmbeddings(
    credentials_profile_name=config["credentials_profile_name"],
    region_name=config["region_name"],
    model_id=config["embeddings"],
))

If you want to learn more about how to use other AWS services, refer to the langchain-aws documentation.

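Google Cloud

Google offers two ways to access its models: Google AI Studio and Google Cloud Vertex AI. Google AI Studio requires only a Google account and an API key, while Vertex AI requires a Google Cloud account. If you are just getting started, Google AI Studio is the recommended option.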

First, install the required packages (only the ones needed for the API you chose):

# for Google AI Studio
pip install langchain-google-genai
# for Google Cloud Vertex AI
pip install langchain-google-vertexai

Then set up your credentials based on the API you chose:

For Google AI Studio:

import os
os.environ["GOOGLE_API_KEY"] = "your-google-ai-key"  # from https://ai.google.dev/

For Google Cloud Vertex AI:

import os

# Ensure you have credentials configured (gcloud, workload identity, etc.)
# Or set the service account JSON path:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/service-account.json"

Define your configuration:

config = {
    "model": "gemini-1.5-pro",  # or other model IDs
    "temperature": 0.4,
    "max_tokens": None,
    "top_p": 0.8,
    # For Vertex AI only:
    "project": "your-project-id",  # Required for Vertex AI
    "location": "us-central1",     # Required for Vertex AI
}

Initialize the LLM and wrap it for use with ragas:

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

# Choose the appropriate import based on your API:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_vertexai import ChatVertexAI

# Initialize with Google AI Studio
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
    model=config["model"],
    temperature=config["temperature"],
    max_tokens=config["max_tokens"],
    top_p=config["top_p"],
))

# Or initialize with Vertex AI
evaluator_llm = LangchainLLMWrapper(ChatVertexAI(
    model=config["model"],
    temperature=config["temperature"],
    max_tokens=config["max_tokens"],
    top_p=config["top_p"],
    project=config["project"],
    location=config["location"],
))

You can optionally configure safety settings:

from langchain_google_genai import HarmCategory, HarmBlockThreshold

safety_settings = {
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    # Add other safety settings as needed
}

# Apply to your LLM initialization
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
    model=config["model"],
    temperature=config["temperature"],
    safety_settings=safety_settings,
))

Initialize the embeddings and wrap them for use with ragas (choose one of the following):

# Google AI Studio Embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings

evaluator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",  # Google's text embedding model
    task_type="retrieval_document"  # Optional: specify the task type
))

# Vertex AI Embeddings
from langchain_google_vertexai import VertexAIEmbeddings

evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(
    model_name="textembedding-gecko@001",  # or other available model
    project=config["project"],  # Your GCP project ID
    location=config["location"]  # Your GCP location
))

For more information on available models, features, and configuration, see: Google AI Studio documentation, Google Cloud Vertex AI documentation, LangChain Google AI integration, LangChain Vertex AI integration.

Azure

Install the langchain-openai package:

pip install langchain-openai

Make sure your Azure OpenAI key is available in your environment.

import os
os.environ["AZURE_OPENAI_API_KEY"] = "your-azure-openai-key"

# other configuration
azure_config = {
    "base_url": "",  # your endpoint
    "model_deployment": "",  # your model deployment name
    "model_name": "",  # your model name
    "embedding_deployment": "",  # your embedding deployment name
    "embedding_name": "",  # your embedding name
}

Define your LLMs and wrap them in LangchainLLMWrapper so that they can be used with ragas.

from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
evaluator_llm = LangchainLLMWrapper(AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["model_deployment"],
    model=azure_config["model_name"],
    validate_base_url=False,
))

# init the embeddings for answer_relevancy, answer_correctness and answer_similarity
evaluator_embeddings = LangchainEmbeddingsWrapper(AzureOpenAIEmbeddings(
    openai_api_version="2023-05-15",
    azure_endpoint=azure_config["base_url"],
    azure_deployment=azure_config["embedding_deployment"],
    model=azure_config["embedding_name"],
))

If you want to learn more about how to use other Azure services, refer to the langchain-azure documentation.

Others

If you use a different LLM provider and interact with it through Langchain, you can wrap your LLM in LangchainLLMWrapper so that it can be used with ragas.

from ragas.llms import LangchainLLMWrapper
evaluator_llm = LangchainLLMWrapper(your_llm_instance)

For a more detailed walkthrough, check out the guide on customizing models.

If you are using LlamaIndex, you can wrap your LLM with LlamaIndexLLMWrapper so that it can be used with ragas.

from ragas.llms import LlamaIndexLLMWrapper
evaluator_llm = LlamaIndexLLMWrapper(your_llm_instance)

For more information on how to use LlamaIndex, refer to the LlamaIndex Integration guide.

If you still cannot use Ragas with your favorite LLM provider, let us know by commenting on this issue and we will add support for it 🙂.

Evaluation

Here we will use AspectCritic, an LLM-based metric that outputs pass or fail according to the evaluation criteria.

from ragas import SingleTurnSample
from ragas.metrics import AspectCritic

test_data = {
    "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
    "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
}

metric = AspectCritic(name="summary_accuracy", llm=evaluator_llm, definition="Verify if the summary is accurate.")
test_data = SingleTurnSample(**test_data)
await metric.single_turn_ascore(test_data)
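The bare `await` above works in a notebook or another environment that already runs an event loop. In a plain Python script, the coroutine has to be driven by an event loop, for instance via `asyncio.run`. The sketch below shows the pattern with a hypothetical stand-in coroutine, `fake_ascore`, so that it runs without an API key; in real use, the call to `metric.single_turn_ascore(test_data)` would take its place.

```python
import asyncio


async def fake_ascore(sample):
    """Hypothetical stand-in for metric.single_turn_ascore; not a real metric."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return 1 if sample.get("response") else 0


def score_in_script(sample):
    """Run the async scoring call to completion from synchronous code."""
    return asyncio.run(fake_ascore(sample))


print(score_in_script({"user_input": "summarise given text\n...", "response": "A short summary."}))
```

Note that `asyncio.run` cannot be called from inside an already running event loop, which is why the notebook form uses `await` directly.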

Output

Success! The sample passes: here 1 means pass and 0 means fail.

Info

ragas offers many other types of metrics (with and without reference), and if none of them fits your case, you can also create your own. To explore further, see more on metrics.

Evaluating on a Dataset

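In the examples above, we used only a single sample to evaluate our application. However, evaluating on one sample is not enough to trust the results. To make the evaluation reliable, you should add more test samples to your test data.

Here we will load a dataset from the Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just make sure that each sample contains all the attributes required by the chosen metric.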

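In our case, the required attributes are:
- user_input: the input provided to the application (here, the input text report).
- response: the output generated by the application (here, the generated summary).

For example: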

[
    # Sample 1
    {
        "user_input": "summarise given text\nThe Q2 earnings report revealed a significant 15% increase in revenue, ...",
        "response": "The Q2 earnings report showed a 15% revenue increase, ...",
    },
    # Additional samples in the dataset
    ....,
    # Sample N
    {
        "user_input": "summarise given text\nIn 2023, North American sales experienced a 5% decline, ...",
        "response": "Companies are strategizing to adapt to market challenges and ...",
    }
]
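Before building a dataset from your own source, it can help to check that every sample carries the required attributes. The following is a minimal sketch under the assumption that samples are plain dictionaries; the helper `find_incomplete_samples` and the two sample records are hypothetical.

```python
REQUIRED_FIELDS = ("user_input", "response")


def find_incomplete_samples(samples, required=REQUIRED_FIELDS):
    """Return (index, missing_fields) pairs for samples lacking a required, non-empty field."""
    problems = []
    for index, sample in enumerate(samples):
        missing = [field for field in required if not sample.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Made-up records for illustration.
samples = [
    {"user_input": "summarise given text\nReport A ...", "response": "Summary A ..."},
    {"user_input": "summarise given text\nReport B ..."},  # response is missing
]
print(find_incomplete_samples(samples))
```

A list that passes this check could then be turned into a dataset; ragas offers `EvaluationDataset.from_list` for lists of dictionaries, though it is worth confirming this against the version you have installed.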

from datasets import load_dataset
from ragas import EvaluationDataset
eval_dataset = load_dataset("explodinggradients/earning_report_summary", split="train")
eval_dataset = EvaluationDataset.from_hf_dataset(eval_dataset)
print("Features in dataset:", eval_dataset.features())
print("Total samples in dataset:", len(eval_dataset))

Output

Features in dataset: ['user_input', 'response']
Total samples in dataset: 50

Evaluate using the dataset

from ragas import evaluate

results = evaluate(eval_dataset, metrics=[metric])
results

Output

{'summary_accuracy': 0.84}

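This score shows that, across all the samples in our test data, only 84% of the summaries passed the given evaluation criterion. Now it is important to understand why.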
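Since AspectCritic returns 1 or 0 per sample, an aggregate of 0.84 can be read as a pass rate. The sketch below assumes the aggregate is the plain mean of the per-sample verdicts; under that assumption, 0.84 over 50 samples corresponds to 42 passes and 8 failures. The verdict list is made up for illustration and is not the actual per-sample output.

```python
def pass_rate(verdicts):
    """Fraction of samples with a passing (1) verdict."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    return sum(verdicts) / len(verdicts)


def failing_indices(verdicts):
    """Indices of samples that did not pass, i.e. the ones worth inspecting first."""
    return [index for index, verdict in enumerate(verdicts) if verdict == 0]


# Illustrative data only: 42 passes followed by 8 failures.
verdicts = [1] * 42 + [0] * 8
print(pass_rate(verdicts))
print(len(failing_indices(verdicts)))
```

Starting from the failing indices keeps the investigation focused on the samples that actually lowered the score.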

Export sample-level scores to a pandas DataFrame

results.to_pandas()

Output

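   user_input                                          response                                            summary_accuracy
0  summarise given text\nThe Q2 earnings report r...   The Q2 earnings report showed a 15% revenue in...   1
1  summarise given text\nIn 2023, North American ...   Companies are strategizing to adapt to market ...   1
2  summarise given text\nIn 2022, European expans...   Many companies experienced a notable 15% growt...   1
3  summarise given text\nSupply chain challenges ...   Supply chain challenges in North America, caus...   1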

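Viewing sample-level results in a CSV file, as shown above, is fine for quick checks but not ideal for detailed analysis or for comparing results across evaluation runs.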
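For comparing two evaluation runs, one option is to line up the per-sample scores and list the samples whose verdict changed. The sketch below is one hedged way to do this with plain Python lists; it assumes both runs scored the same samples in the same order, and the function name `diff_runs` and the score lists are hypothetical. The lists could come, for example, from the `summary_accuracy` column of each run's DataFrame.

```python
def diff_runs(baseline, candidate):
    """Compare per-sample 0/1 scores from two runs over the same samples, in the same order."""
    if len(baseline) != len(candidate):
        raise ValueError("runs must cover the same number of samples")
    pairs = list(enumerate(zip(baseline, candidate)))
    regressions = [i for i, (old, new) in pairs if new < old]
    improvements = [i for i, (old, new) in pairs if new > old]
    return {"regressions": regressions, "improvements": improvements}


# Made-up scores for illustration.
baseline_scores = [1, 1, 0, 1, 0]
candidate_scores = [1, 0, 1, 1, 0]
print(diff_runs(baseline_scores, candidate_scores))
```

Two runs with the same aggregate score can still differ sample by sample, which is what this kind of comparison surfaces.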

Up Next

Evaluate a simple RAG application