Griptape Integration

If you're familiar with Griptape's RAG Engine and want to start evaluating the performance of your RAG system, you're in the right place. In this tutorial, we'll explore how to use Ragas to evaluate the responses generated by your Griptape RAG Engine.

Griptape Setup

Setting Up Our Environment

First, let's make sure we have all the required packages installed:

%pip install "griptape[all]" ragas -q

Creating Our Dataset

We'll use a small dataset of text chunks about major LLM providers and build a simple RAG pipeline on top of it:

chunks = [
    "OpenAI is one of the most recognized names in the large language model space, known for its GPT series of models. These models excel at generating human-like text and performing tasks like creative writing, answering questions, and summarizing content. GPT-4, their latest release, has set benchmarks in understanding context and delivering detailed responses.",
    "Anthropic is well-known for its Claude series of language models, designed with a strong focus on safety and ethical AI behavior. Claude is particularly praised for its ability to follow complex instructions and generate text that aligns closely with user intent.",
    "DeepMind, a division of Google, is recognized for its cutting-edge Gemini models, which are integrated into various Google products like Bard and Workspace tools. These models are renowned for their conversational abilities and their capacity to handle complex, multi-turn dialogues.",
    "Meta AI is best known for its LLaMA (Large Language Model Meta AI) series, which has been made open-source for researchers and developers. LLaMA models are praised for their ability to support innovation and experimentation due to their accessibility and strong performance.",
    "Meta AI with it's LLaMA models aims to democratize AI development by making high-quality models available for free, fostering collaboration across industries. Their open-source approach has been a game-changer for researchers without access to expensive resources.",
    "Microsoft’s Azure AI platform is famous for integrating OpenAI’s GPT models, enabling businesses to use these advanced models in a scalable and secure cloud environment. Azure AI powers applications like Copilot in Office 365, helping users draft emails, generate summaries, and more.",
    "Amazon’s Bedrock platform is recognized for providing access to various language models, including its own models and third-party ones like Anthropic’s Claude and AI21’s Jurassic. Bedrock is especially valued for its flexibility, allowing users to choose models based on their specific needs.",
    "Cohere is well-known for its language models tailored for business use, excelling in tasks like search, summarization, and customer support. Their models are recognized for being efficient, cost-effective, and easy to integrate into workflows.",
    "AI21 Labs is famous for its Jurassic series of language models, which are highly versatile and capable of handling tasks like content creation and code generation. The Jurassic models stand out for their natural language understanding and ability to generate detailed and coherent responses.",
    "In the rapidly advancing field of artificial intelligence, several companies have made significant contributions with their large language models. Notable players include OpenAI, known for its GPT Series (including GPT-4); Anthropic, which offers the Claude Series; Google DeepMind with its Gemini Models; Meta AI, recognized for its LLaMA Series; Microsoft Azure AI, which integrates OpenAI’s GPT Models; Amazon AWS (Bedrock), providing access to various models including Claude (Anthropic) and Jurassic (AI21 Labs); Cohere, which offers its own models tailored for business use; and AI21 Labs, known for its Jurassic Series. These companies are shaping the landscape of AI by providing powerful models with diverse capabilities.",
]

Ingesting the Data into a Vector Store

import getpass
import os

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")
from griptape.drivers.embedding.openai import OpenAiEmbeddingDriver
from griptape.drivers.vector.local import LocalVectorStoreDriver

# Set up a simple vector store with our data
vector_store = LocalVectorStoreDriver(embedding_driver=OpenAiEmbeddingDriver())
vector_store.upsert_collection({"major_llm_providers": chunks})
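
Before wiring up the full engine, you can optionally sanity-check the ingestion with a direct similarity query. The snippet below is a minimal sketch that assumes LocalVectorStoreDriver exposes a query method accepting count and namespace keyword arguments (mirroring the query_params used by the retrieval module later on); check the Griptape API reference for the exact signature.

# Optional sanity check (sketch): retrieve the top matches for a test query.
# Assumes vector_store.query(query, count=..., namespace=...) is available.
entries = vector_store.query(
    "open-source language models",
    count=2,
    namespace="major_llm_providers",
)
print(len(entries))  # expect 2 entries if ingestion succeeded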

Setting Up the RAG Engine

from griptape.engines.rag import RagContext, RagEngine
from griptape.engines.rag.modules import (
    PromptResponseRagModule,
    VectorStoreRetrievalRagModule,
)
from griptape.engines.rag.stages import (
    ResponseRagStage,
    RetrievalRagStage,
)

# Create a basic RAG pipeline
rag_engine = RagEngine(
    # Stage for retrieving relevant chunks
    retrieval_stage=RetrievalRagStage(
        retrieval_modules=[
            VectorStoreRetrievalRagModule(
                name="VectorStore_Retriever",
                vector_store_driver=vector_store,
                query_params={"namespace": "major_llm_providers"},
            ),
        ],
    ),
    # Stage for generating a response
    response_stage=ResponseRagStage(
        response_modules=[
            PromptResponseRagModule(),
        ]
    ),
)

Testing Our RAG Pipeline

Let's make sure our RAG pipeline works by testing it on a sample query:

rag_context = RagContext(query="What makes Meta AI’s LLaMA models stand out?")
rag_context = rag_engine.process(rag_context)
rag_context.outputs[0].to_text()
Output:
"Meta AI’s LLaMA models stand out because of their open-source nature, which makes them accessible to researchers and developers. This accessibility supports innovation and experimentation, fostering collaboration across industries. By making high-quality models available for free, Meta AI aims to democratize AI development, which has been a game-changer for researchers without access to expensive resources."

Ragas Evaluation

Creating the Ragas Evaluation Dataset

questions = [
    "Who are the major players in the large language model space?",
    "What is Microsoft’s Azure AI platform known for?",
    "What kind of models does Cohere provide?",
]

references = [
    "The major players include OpenAI (GPT Series), Anthropic (Claude Series), Google DeepMind (Gemini Models), Meta AI (LLaMA Series), Microsoft Azure AI (integrating GPT Models), Amazon AWS (Bedrock with Claude and Jurassic), Cohere (business-focused models), and AI21 Labs (Jurassic Series).",
    "Microsoft’s Azure AI platform is known for integrating OpenAI’s GPT models, enabling businesses to use these models in a scalable and secure cloud environment.",
    "Cohere provides language models tailored for business use, excelling in tasks like search, summarization, and customer support.",
]

griptape_rag_contexts = []

for que in questions:
    rag_context = RagContext(query=que)
    griptape_rag_contexts.append(rag_engine.process(rag_context))

from ragas.integrations.griptape import transform_to_ragas_dataset

ragas_eval_dataset = transform_to_ragas_dataset(
    grip_tape_rag_contexts=griptape_rag_contexts, references=references
)
ragas_eval_dataset.to_pandas()
  | user_input | retrieved_contexts | response | reference
0 | Who are the major players in the large langua... | [In the rapidly advancing field of artificial... | The major players in the large language model... | The major players include OpenAI (GPT Series)...
1 | What is Microsoft’s Azure AI platform known for? | [Microsoft’s Azure AI platform is famous for i... | Microsoft’s Azure AI platform is known for int... | Microsoft’s Azure AI platform is known for int...
2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for b... | Cohere provides language models tailored for b...
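
Under the hood, the integration helper simply maps each RagContext onto a single-turn Ragas sample. A rough hand-rolled equivalent is sketched below for illustration; it assumes RagContext exposes the retrieved chunks as text_chunks, and in practice transform_to_ragas_dataset handles this mapping for you.

from ragas import EvaluationDataset
from ragas.dataset_schema import SingleTurnSample

# Sketch of the mapping performed by transform_to_ragas_dataset.
samples = []
for ctx, ref in zip(griptape_rag_contexts, references):
    samples.append(
        SingleTurnSample(
            user_input=ctx.query,
            retrieved_contexts=[chunk.to_text() for chunk in ctx.text_chunks],  # assumed field
            response=ctx.outputs[0].to_text(),
            reference=ref,
        )
    )
manual_dataset = EvaluationDataset(samples=samples)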

Running the Ragas Evaluation

Now, let's evaluate our RAG system using Ragas metrics:

Evaluating Retrieval

To evaluate our retrieval performance, we can use Ragas's built-in metrics or create custom metrics tailored to our specific needs. For a complete list of available metrics and customization options, visit the documentation.
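
As an illustration of a custom metric, here is a minimal sketch using Ragas's AspectCritic, a binary LLM-judged metric; the metric name and definition below are made up for this example and are not part of the original tutorial.

from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

# Hypothetical custom metric: does the answer explicitly name the provider(s)
# the question asks about? Returns 1 (pass) or 0 (fail).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
provider_coverage = AspectCritic(
    name="provider_coverage",
    definition="Return 1 if the response explicitly names the LLM provider(s) relevant to the question, otherwise return 0.",
    llm=evaluator_llm,
)
# It can then be passed to evaluate() alongside the built-in metrics.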

We will use ContextPrecision, ContextRecall, and ContextRelevance to measure retrieval performance:

  • ContextPrecision: Measures how well the RAG system's retriever ranks relevant chunks at the top of the retrieved contexts for a given query, computed as the mean of precision@k across the chunks (see the worked sketch after this list).
  • ContextRecall: Measures the proportion of relevant information that was successfully retrieved from the knowledge base.
  • ContextRelevance: Measures how well the retrieved contexts address the user's query, assessed through two LLM judgments.
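
To make the ContextPrecision definition concrete, here is a small hand-rolled illustration of the averaged precision@k computation. This is a simplified sketch of the idea with hard-coded relevance labels, not Ragas's internal implementation (Ragas derives the relevance judgments with an LLM).

# Simplified mean precision@k with binary relevance labels per retrieved chunk.
relevance = [1, 0, 1]  # e.g. 1st and 3rd retrieved chunks judged relevant

precisions_at_relevant_ranks = []
hits = 0
for k, rel in enumerate(relevance, start=1):
    hits += rel
    if rel:  # precision@k is only accumulated at ranks holding a relevant chunk
        precisions_at_relevant_ranks.append(hits / k)

context_precision = sum(precisions_at_relevant_ranks) / max(hits, 1)
print(context_precision)  # (1/1 + 2/3) / 2 ≈ 0.83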

from ragas.metrics import ContextPrecision, ContextRecall, ContextRelevance
from ragas import evaluate
from langchain_openai import ChatOpenAI
from ragas.llms import LangchainLLMWrapper

llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)

ragas_metrics = [
    ContextPrecision(llm=evaluator_llm),
    ContextRecall(llm=evaluator_llm),
    ContextRelevance(llm=evaluator_llm),
]

retrieval_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
retrieval_results.to_pandas()
Evaluating: 100%|██████████| 9/9 [00:15<00:00,  1.77s/it]

  | user_input | retrieved_contexts | response | reference | context_precision | context_recall | nv_context_relevance
0 | Who are the major players in the large langua... | [In the rapidly advancing field of artificial... | The major players in the large language model... | The major players include OpenAI (GPT Series)... | 1.000000 | 1.0 | 1.0
1 | What is Microsoft’s Azure AI platform known f... | [Microsoft’s Azure AI platform is famous for i... | Microsoft’s Azure AI platform is known for int... | Microsoft’s Azure AI platform is known for int... | 1.000000 | 1.0 | 1.0
2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for b... | Cohere provides language models tailored for b... | 0.833333 | 1.0 | 1.0

Evaluating Generation

To measure generation performance, we will use FactualCorrectness, Faithfulness, and ResponseGroundedness:

  • FactualCorrectness: Checks whether all the statements in the response are supported by the reference answer (see the worked sketch after this list).
  • Faithfulness: Measures how factually consistent the response is with the retrieved context.
  • ResponseGroundedness: Measures whether the response is grounded in the provided contexts, which helps identify hallucinated or fabricated information.
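
To see roughly where the factual_correctness(mode=f1) numbers in the results come from, here is a simplified sketch of the claim-level F1 computation. Ragas decomposes the response and the reference into claims with an LLM and verifies them; the claim counts below are hard-coded purely for illustration.

# Simplified claim-level F1 with illustrative, hard-coded verification counts.
tp = 2  # response claims supported by the reference
fp = 1  # response claims not supported by the reference
fn = 1  # reference claims missing from the response

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.67 with these illustrative counts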

from ragas.metrics import FactualCorrectness, Faithfulness, ResponseGroundedness

ragas_metrics = [
    FactualCorrectness(llm=evaluator_llm),
    Faithfulness(llm=evaluator_llm),
    ResponseGroundedness(llm=evaluator_llm),
]

generation_results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
generation_results.to_pandas()
Evaluating: 100%|██████████| 9/9 [00:17<00:00,  1.90s/it]

  | user_input | retrieved_contexts | response | reference | factual_correctness(mode=f1) | faithfulness | nv_response_groundedness
0 | Who are the major players in the large langua... | [In the rapidly advancing field of artificial... | The major players in the large language model... | The major players include OpenAI (GPT Series)... | 1.00 | 1.000000 | 1.0
1 | What is Microsoft’s Azure AI platform known f... | [Microsoft’s Azure AI platform is famous for i... | Microsoft’s Azure AI platform is known for int... | Microsoft’s Azure AI platform is known for int... | 0.57 | 0.833333 | 1.0
2 | What kind of models does Cohere provide? | [Cohere is well-known for its language models ... | Cohere provides language models tailored for b... | Cohere provides language models tailored for b... | 0.57 | 1.000000 | 1.0

Conclusion

Congratulations! You have successfully set up a Ragas evaluation pipeline for your Griptape RAG system. This evaluation provides valuable insights into how well your system retrieves relevant information and generates accurate responses.

Remember, RAG evaluation is an iterative process. Use these metrics to identify weak spots in your system, make improvements, and re-evaluate until you reach the desired level of performance.

Enjoy your RAG journey! 😄
