Llama 3.1 vs GPT：在自己的数据上进行基准测试

本指南描述了如何使用 promptfoo CLI 比较三个模型 - Llama 3.1 405B、GPT 4o 和 GPT 4o-mini。

LLM 的使用场景多种多样，没有一种通用的基准测试。我们将使用来自 Hacker News 上关于 Llama 的讨论的一些虚拟测试用例，但你可以替换为自己的数据。

最终结果是一个并排比较 Llama 和 GPT 性能的视图：

llama2 和 gpt 比较

查看最终示例代码这里。

要求

本指南假设你已经安装了 promptfoo。它还需要 OpenAI 和 Replicate 的访问权限，但原则上你可以按照这些说明使用任何本地 LLM。

设置配置

初始化一个新目录 llama-gpt-comparison，其中将包含我们的提示和测试用例：

npx promptfoo@latest init llama-gpt-comparison

现在让我们开始编辑 promptfooconfig.yaml。首先，我们将添加我们想要比较的模型列表：

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - replicate:meta/meta-llama-3.1-405b-instruct

前两个提供者引用内置的 OpenAI 模型。第三个提供者引用托管的 Replicate 版本的 Llama v2，具有 700 亿参数。

如果你更喜欢针对本地托管版本的 Llama 运行，可以通过 LocalAI 或 Ollama 实现。

设置提示

接下来，我们将添加一些提示。

首先，我们将 OpenAI 聊天提示放在 prompts/chat_prompt.json 中：

[
  {
    "role": "user",
    "content": "{{message}}"
  }
]

现在，让我们回到 promptfooconfig.yaml 并添加我们的提示。Replicate 提供者支持 OpenAI 格式。

prompts:
  - file://prompts/chat_prompt.json

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - replicate:meta/meta-llama-3.1-405b-instruct

关于为 Llama 设置自定义提示的信息

对于高级用法，你可能更喜欢控制底层 Llama 提示格式。在这种情况下，我们将有多个提示，因为 GPT 和 Llama 期望不同的提示格式。

首先，我们将 OpenAI 聊天提示放在 prompts/chat_prompt.json 中：

[
  {
    "role": "user",
    "content": "{{message}}"
  }
]

将 Llama 聊天提示放在 prompts/llama_prompt.txt 中：

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

{{message}}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

现在，让我们回到 promptfooconfig.yaml 并添加我们的提示。我们将它们分别命名为 chat_prompt 和 llama_prompt：

prompts:
  prompts/chat_prompt.json: chat_prompt
  prompts/llama_prompt.txt: llama_prompt

providers:
  - id: openai:gpt-4o
    label: GPT 4o
    prompts: chat_prompt
  - id: openai:gpt-4o-mini
    label: GPT 4o-mini
    prompts: chat_prompt
  - id: replicate:meta/meta-llama-3.1-405b-instruct
    label: Llama 3.1 405B
    prompts: llama_prompt

info

这些提示文件是 Nunjucks 模板，因此你可以使用 if 语句、for 循环和过滤器来创建更复杂的提示。

添加测试用例

promptfooconfig.yaml 文件中的 tests 字段是你添加测试用例的地方。每个测试用例是一个字典，vars 字段包含要插入到提示中的变量。

以下是我们将使用的测试用例：

# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
  - vars:
      message: are hippos good swimmers?
  - vars:
      message: Did Henry VIII have any grandchildren?
  - vars:
      message: Would a cannibal cult be harmful to society?
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?

让我们设置一些断言来自动评估输出的正确性：

# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
    assert:
      - type: llm-rubric
        value: 不声称知道纽约当前的天气
  - vars:
      message: are hippos good swimmers?
    assert:
      - type: llm-rubric
        value: 说河马不是好的游泳者
  - vars:
      message: Did Henry VIII have any grandchildren?
    assert:
      - type: llm-rubric
        value: 说亨利八世没有孙子
  - vars:
      message: Would a cannibal cult be harmful to society?
    assert:
      - type: llm-rubric
        value: 明确表示食人教会对社会有害
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?

info

了解更多关于设置测试断言的信息在此。

配置模型使用

每个模型都有一个 config 字段，您可以在其中指定其他参数。让我们为每个模型添加 temperature 和 max_tokens 或 max_length：

providers:
  - id: openai:gpt-4o
    config:
      temperature: 0
      max_tokens: 128
  - id: openai:gpt-4o-mini
    config:
      temperature: 0
      max_tokens: 128
  - id: replicate:meta/meta-llama-3.1-405b-instruct
    config:
      temperature: 0.01  # 最小温度
      max_length: 128

以下是每个参数的含义：

temperature：此参数控制模型输出的随机性。较低的值使输出更具确定性。
max_tokens 或 max_length：此参数控制模型输出的最大长度。

这些设置将应用于针对这些模型运行的所有测试用例。

设置环境变量

要配置 OpenAI 和 Replicate（Llama）提供商，请确保设置以下环境变量：

OPENAI_API_KEY=sk-abc123
REPLICATE_API_TOKEN=abc123

运行比较

一旦配置文件设置好，您可以使用 promptfoo eval 命令运行比较：

npx promptfoo@latest eval

这将针对每个模型运行每个测试用例并输出结果。

然后，要打开网页查看器，请运行 npx promptfoo@latest view。以下是我们看到的内容：

llama3 和 gpt 比较

您还可以输出 CSV：

npx promptfoo@latest eval -o output.csv

这将生成一个包含评估结果的简单电子表格（在 Google Sheets 上查看）。

结论

在这个我们构建的示例中，GPT-4o 得分为 100%，GPT-4o-mini 得分为 75.00%，Llama 3.1 405B 得分为 87.50%。

但关键在于，您的结果可能会根据您的 LLM 需求而有所不同，因此我鼓励您亲自尝试并选择最适合您的模型。

要求​

设置配置​

设置提示​

添加测试用例​

配置模型使用​

设置环境变量​

运行比较​

结论​

要求