使用知识蒸馏对GPT-3.5评判器进行微调（正确性）¶

这个笔记本涉及对LLM评判器进行微调，该评判器评估另一个LLM对用户查询的响应。更具体地，我们演示如何使用llama_index库从GPT-4评判器中提炼知识到GPT-3.5评判器。为此，我们将执行以下步骤：

生成数据集：train和test
执行知识蒸馏（使用train）
在test上评估提炼的模型

更具体地，我们将使用CorrectnessEvaluator作为我们的LLM评判器。

In [ ]:

Copied!





%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-huggingface
%pip install llama-index-readers-wikipedia
%pip install llama-index-finetuning
%pip install llama-index-llms-openai
%pip install llama-index-finetuning-callbacks
%pip install llama-index-llms-huggingface

In [ ]:

Copied!

# 注意：此笔记本进行了多次API调用，以使用OpenAI GPT模型和HuggingFace托管的模型生成文本。如果您不想等待这些生成过程，那么可以使用下面提供的`wget`命令获取本笔记本的数据。

# !wget "https://www.dropbox.com/scl/fo/3kkm8v6qvhxnu449xwp3d/h?rlkey=fxom1yixru1nags9mmao1hkg2&dl=1" -O correctness.zip
# 注意：此笔记本进行了多次API调用，以使用OpenAI GPT模型和HuggingFace托管的模型生成文本。如果您不想等待这些生成过程，那么可以使用下面提供的`wget`命令获取本笔记本的数据。

# !wget "https://www.dropbox.com/scl/fo/3kkm8v6qvhxnu449xwp3d/h?rlkey=fxom1yixru1nags9mmao1hkg2&dl=1" -O correctness.zip

In [ ]:

Copied!

import nest_asyncio

nest_asyncio.apply()
import nest_asyncio

nest_asyncio.apply()

In [ ]:

Copied!

# 我们将使用HuggingFace上的模型作为我们的LLM答案生成器
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")

# 我们将使用GPT-4和GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# 我们将使用HuggingFace上的模型作为我们的LLM答案生成器
HUGGING_FACE_TOKEN = os.getenv("HUGGING_FACE_TOKEN")

# 我们将使用GPT-4和GPT-3.5 + OpenAI Fine-Tuning
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

步骤1 生成数据集：`train_dataset` 和 `test_dataset`¶

对于我们将生成问题并提示各种LLM回答的数据集，我们将使用WikipediaReader来读取几个城市的“的历史”。

In [ ]:

Copied!

!pip install wikipedia -q
!pip install wikipedia -q

[notice] A new release of pip is available: 23.2.1 -> 23.3.1
[notice] To update, run: pip install --upgrade pip

In [ ]:

Copied!





# 维基百科页面
from llama_index.readers.wikipedia import WikipediaReader

城市 = [
    "旧金山",
    "多伦多",
    "纽约",
    "温哥华",
    "蒙特利尔",
    "东京",
    "新加坡",
    "巴黎",
]

文档 = WikipediaReader().load_data(
    pages=[f"{x}的历史" for x in cities]
)
# 维基百科页面
from llama_index.readers.wikipedia import WikipediaReader

城市 = [
    "旧金山",
    "多伦多",
    "纽约",
    "温哥华",
    "蒙特利尔",
    "东京",
    "新加坡",
    "巴黎",
]

文档 = WikipediaReader().load_data(
    pages=[f"{x}的历史" for x in cities]
)

使用`DatasetGenerator`构建`train_dataset`和`test_dataset`¶

现在我们已经有了Document的训练集和测试集，下一步是生成问题。为此，我们将使用DatasetGenerator，它使用LLM从给定的文档集生成问题。

在这个部分，我们将生成一些问题，以便进行后续的讨论和分析。

In [ ]:

Copied!





QUESTION_GEN_PROMPT = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)
QUESTION_GEN_PROMPT = (
    "You are a Teacher/ Professor. Your task is to setup "
    "a quiz/examination. Using the provided context, formulate "
    "a single question that captures an important fact from the "
    "context. Restrict the question to the context information provided."
)

In [ ]:

Copied!





# 根据块生成问题
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

# 为llm提供上下文
gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

# 实例化一个DatasetGenerator
dataset_generator = DatasetGenerator.from_documents(
    documents,
    question_gen_query=QUESTION_GEN_PROMPT,
    llm=gpt_35_llm,
    num_questions_per_chunk=25,
)
# 根据块生成问题
from llama_index.core.evaluation import DatasetGenerator
from llama_index.llms.openai import OpenAI

# 为llm提供上下文
gpt_35_llm = OpenAI(model="gpt-3.5-turbo", temperature=0.3)

# 实例化一个DatasetGenerator
dataset_generator = DatasetGenerator.from_documents(
    documents,
    question_gen_query=QUESTION_GEN_PROMPT,
    llm=gpt_35_llm,
    num_questions_per_chunk=25,
)

In [ ]:

Copied!

qrd = dataset_generator.generate_dataset_from_nodes(num=350)
qrd = dataset_generator.generate_dataset_from_nodes(num=350)

In [ ]:

Copied!

# 如果您想将其保存以备将来使用
# qrd.save_json("qrd.json")
# 如果您想将其保存以备将来使用
# qrd.save_json("qrd.json")

生成问题的答案¶

接下来的步骤是使用LLM生成答案。请记住，重点是评估这些生成的答案。因此，我们稍后将使用GPT模型来评判这些答案。

为了生成问题的答案，我们将使用另一个LLM，即Llama-2。为了做到这一点，我们首先需要为我们的文档创建一个向量存储和一个相关的检索器，这个LLM答案生成器将使用这些。

In [ ]:

Copied!





# from llama_index.core import VectorStoreIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

# 创建向量索引
the_index = VectorStoreIndex.from_documents(documents=documents)

# 在此索引上创建检索器
the_retriever = VectorIndexRetriever(
    index=the_index,
    similarity_top_k=2,
)
# from llama_index.core import VectorStoreIndex
from llama_index.core import VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever

# 创建向量索引
the_index = VectorStoreIndex.from_documents(documents=documents)

# 在此索引上创建检索器
the_retriever = VectorIndexRetriever(
    index=the_index,
    similarity_top_k=2,
)

从这里开始，我们将构建RetrieverQueryEngine，用于接收我们的查询（即问题）进行处理。请注意，我们使用HuggingFaceInferenceAPI来进行LLM答案生成器，而Llama-2需要权限。如果您还没有获得这些模型的访问权限，可以随意将Llama-2替换为您选择的其他模型。

在这一点上，我们将生成的问题分成两组：一组用于构建train_dataset，另一组用于构建我们将在下一节中构建的test_dataset。

In [ ]:

Copied!





from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=2048,  # 用于使用细化
    token=HUGGING_FACE_TOKEN,
)

query_engine = RetrieverQueryEngine.from_args(retriever=the_retriever, llm=llm)
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.huggingface import HuggingFaceInferenceAPI

llm = HuggingFaceInferenceAPI(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    context_window=2048,  # 用于使用细化
    token=HUGGING_FACE_TOKEN,
)

query_engine = RetrieverQueryEngine.from_args(retriever=the_retriever, llm=llm)

/Users/nerdai/Library/Caches/pypoetry/virtualenvs/llama-index-e6cjsBOJ-py3.10/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm

In [ ]:

Copied!





import tqdm

# 我们将使用生成的问题的65%用于训练
train_dataset = []
num_train_questions = int(0.65 * len(qrd.qr_pairs))

for q, a in tqdm.tqdm(qrd.qr_pairs[:num_train_questions]):
    # 这个问题的数据
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    train_dataset.append(data_entry)
import tqdm

# 我们将使用生成的问题的65%用于训练
train_dataset = []
num_train_questions = int(0.65 * len(qrd.qr_pairs))

for q, a in tqdm.tqdm(qrd.qr_pairs[:num_train_questions]):
    # 这个问题的数据
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    train_dataset.append(data_entry)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [08:30<00:00,  6.46s/it]

获取Mistral和LLama-2答案的GPT-4评估¶

正如之前多次提到的，本指南的目的是从GPT-4法官中微调LLM法官。因此，为了完成我们的train_dataset，我们现在需要实例化我们的GPT-4法官，并让它评估LLama-2提供的答案。为此，我们将使用CorrectnessEvaluator类。然后，这个法官将比较答案和参考答案，并根据提供的答案与参考答案的接近程度在1到5之间（分数越高越好）提供评分。

还要注意，我们使用OpenAIFineTuningHandler，它将收集我们最终需要微调GPT-3.5的所有聊天历史记录。

In [ ]:

Copied!





# 实例化gpt-4评估器
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import CorrectnessEvaluator

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
gpt_4_llm = OpenAI(
    temperature=0, model="gpt-4", callback_manager=callback_manager
)

gpt4_judge = CorrectnessEvaluator(llm=gpt_4_llm)
# 实例化gpt-4评估器
from llama_index.llms.openai import OpenAI
from llama_index.finetuning.callbacks import OpenAIFineTuningHandler
from llama_index.core.callbacks import CallbackManager
from llama_index.core.evaluation import CorrectnessEvaluator

finetuning_handler = OpenAIFineTuningHandler()
callback_manager = CallbackManager([finetuning_handler])
gpt_4_llm = OpenAI(
    temperature=0, model="gpt-4", callback_manager=callback_manager
)

gpt4_judge = CorrectnessEvaluator(llm=gpt_4_llm)

In [ ]:

Copied!





import tqdm

# 对于“训练”
for data_entry in tqdm.tqdm(train_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]
import tqdm

# 对于“训练”
for data_entry in tqdm.tqdm(train_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79/79 [12:31<00:00,  9.51s/it]

In [ ]:

Copied!

finetuning_handler.save_finetuning_events("correction_finetuning_events.jsonl")
finetuning_handler.save_finetuning_events("correction_finetuning_events.jsonl")

Wrote 79 examples to correction_finetuning_events.jsonl

第2步执行知识蒸馏¶

好的，现在是时候从GPT-4中提炼一些知识到GPT-3.5了。为了做到这一点，我们将利用OpenAIFinetuneEngine类以及我们刚刚创建的correction_finetuning_events.jsonl文件。

In [ ]:

Copied!





from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "correction_finetuning_events.jsonl",
)
from llama_index.finetuning import OpenAIFinetuneEngine

finetune_engine = OpenAIFinetuneEngine(
    "gpt-3.5-turbo",
    "correction_finetuning_events.jsonl",
)

In [ ]:

Copied!

# 我们可以通过以下方式检查当前作业的状态
# 这可能需要一些时间...
finetune_engine.finetune()
# 我们可以通过以下方式检查当前作业的状态
# 这可能需要一些时间...
finetune_engine.finetune()

Num examples: 79
First example:
{'role': 'system', 'content': '\nYou are an expert evaluation system for a question answering chatbot.\n\nYou are given the following information:\n- a user query,\n- a reference answer, and\n- a generated answer.\n\nYour job is to judge the relevance and correctness of the generated answer.\nOutput a single score that represents a holistic evaluation.\nYou must return your response in a line with only the score.\nDo not return answers in any other format.\nOn a separate line provide your reasoning for the score as well.\n\nFollow these guidelines for scoring:\n- Your score has to be between 1 and 5, where 1 is the worst and 5 is the best.\n- If the generated answer is not relevant to the user query, you should give a score of 1.\n- If the generated answer is relevant but contains mistakes, you should give a score between 2 and 3.\n- If the generated answer is relevant and fully correct, you should give a score between 4 and 5.\n\nExample Response:\n4.0\nThe generated answer has the exact same metrics as the reference answer,     but it is not as concise.\n\n'}
{'role': 'user', 'content': '\n## User Query\nWhat event in 1906 caused significant damage to San Francisco but was followed by a quick rebuild?\n\n## Reference Answer\nThe great earthquake and fire in 1906 caused significant damage to San Francisco but was followed by a quick rebuild.\n\n## Generated Answer\n1906 earthquake and fire.\n'}
{'role': 'assistant', 'content': '4.0\nThe generated answer is relevant and correct, but it lacks the detail and context provided in the reference answer.'}
No errors found
Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 315, 782
mean / median: 479.49367088607596, 465.0
p5 / p95: 355.6, 634.6

#### Distribution of num_assistant_tokens_per_example:
min / max: 19, 110
mean / median: 57.63291139240506, 56.0
p5 / p95: 29.6, 83.2

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~37880 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~113640 tokens
As of August 22, 2023, fine-tuning gpt-3.5-turbo is $0.008 / 1K Tokens.
This means your total cost for training will be $0.30304000000000003 per epoch.

In [ ]:

Copied!

finetune_engine.get_current_job()
finetune_engine.get_current_job()

Out[ ]:

<FineTuningJob fine_tuning.job id=ftjob-9y8G7rzbCkzPjsKtPMsfwRSu at 0x1778d6a70> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-9y8G7rzbCkzPjsKtPMsfwRSu",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1698851177,
  "finished_at": 1698851823,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:llamaindex::8G7FovVj",
  "organization_id": "org-1ZDAvajC6v2ZtAP9hLEIsXRz",
  "result_files": [
    "file-bx2ObrpVPq7Q2pmv743W1eFQ"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-xAwZ2NSzbck3p8u24kznzySX",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": 113166,
  "error": null
}

3 在测试数据集上评估经过微调的GPT-3.5评判器¶

现在我们已经有了经过微调的GPT-3.5，让我们看看它在测试集上的表现如何。但首先，请记住我们说过要推迟创建test_dataset直到我们需要它的时候？现在就是需要的时候了。因此，我们将重复在这里创建train_dataset的过程，但现在是为了test_dataset。

注意：生成这些答案和评估需要一些时间。

In [ ]:

Copied!





# 使用Llama-2生成测试问题的答案
test_dataset = []
for q, a in tqdm.tqdm(qrd.qr_pairs[num_train_questions:]):
    # 问题的数据
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    test_dataset.append(data_entry)
# 使用Llama-2生成测试问题的答案
test_dataset = []
for q, a in tqdm.tqdm(qrd.qr_pairs[num_train_questions:]):
    # 问题的数据
    data_entry = {"question": q, "reference": a}
    response = query_engine.query(q)
    response_struct = {}
    response_struct["model"] = "llama-2"
    response_struct["text"] = str(response)
    response_struct["context"] = (
        response.source_nodes[0].node.text[:1000] + "..."
    )

    data_entry["response_data"] = response_struct
    test_dataset.append(data_entry)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [05:07<00:00,  6.99s/it]

In [ ]:

Copied!





# 获取Llama-2答案的GPT-4评估
for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]
# 获取Llama-2答案的GPT-4评估
for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt4_judge.aevaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_4"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] = [judgement]

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [06:52<00:00,  9.37s/it]

In [ ]:

Copied!





from llama_index.core.evaluation import EvaluationResult

# 使用我们经过精调的GPT-3.5来评估答案
ft_llm = finetune_engine.get_finetuned_model()

ft_gpt_3p5_judge = CorrectnessEvaluator(llm=ft_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await ft_gpt_3p5_judge.evaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "ft_gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]
from llama_index.core.evaluation import EvaluationResult

# 使用我们经过精调的GPT-3.5来评估答案
ft_llm = finetune_engine.get_finetuned_model()

ft_gpt_3p5_judge = CorrectnessEvaluator(llm=ft_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await ft_gpt_3p5_judge.evaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "ft_gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:44<00:00,  1.02s/it]

In [ ]:

Copied!





# 同样地，使用一个未经微调的评估器来评估答案
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")

gpt_3p5_judge = CorrectnessEvaluator(llm=gpt_3p5_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt_3p5_judge.evaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]
# 同样地，使用一个未经微调的评估器来评估答案
gpt_3p5_llm = OpenAI(model="gpt-3.5-turbo")

gpt_3p5_judge = CorrectnessEvaluator(llm=gpt_3p5_llm)

for data_entry in tqdm.tqdm(test_dataset):
    eval_result = await gpt_3p5_judge.evaluate(
        query=data_entry["question"],
        response=data_entry["response_data"]["text"],
        context=data_entry["response_data"]["context"],
        reference=data_entry["reference"],
    )

    # 保存最终结果
    judgement = {}
    judgement["llm"] = "gpt_3p5"
    judgement["score"] = eval_result.score
    judgement["text"] = eval_result.response
    data_entry["evaluations"] += [judgement]

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [01:36<00:00,  2.19s/it]

评估指标¶

哇！现在我们已经生成了LLM法官对测试查询中Llama-2/Mistral答案的所有评估。现在让我们 quantitatively 观察一下，fine-tuned GPT-3.5 和 GPT-4 有多接近。

为此，我们报告了fine-tuned（和未fine-tuned）GPT-3.5的分数与GPT-4法官之间的相关性。

In [ ]:

Copied!





REPORT_FMT_STR = (
    "{model}\n"
    "-----------------\n"
    "Number of obs.: {total_obs}\n"
    "Correlation with GPT-4: {corr}\n"
)
REPORT_FMT_STR = (
    "{model}\n"
    "-----------------\n"
    "Number of obs.: {total_obs}\n"
    "Correlation with GPT-4: {corr}\n"
)

In [ ]:

Copied!





import numpy as np

scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
    for e in d["evaluations"]:
        scores[e["llm"]].append(e["score"])
import numpy as np

scores = {"gpt_4": [], "gpt_3p5": [], "ft_gpt_3p5": []}
for ix, d in enumerate(test_dataset):
    for e in d["evaluations"]:
        scores[e["llm"]].append(e["score"])

In [ ]:

Copied!





# numpy转换
np_scores_gpt_4 = np.array(scores["gpt_4"])  # 将"gpt_4"的分数转换为numpy数组
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])  # 将"gpt_3p5"的分数转换为numpy数组
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])  # 将"ft_gpt_3p5"的分数转换为numpy数组

# 相关性
corr_ft = np.corrcoef(np_scores_gpt_4, np_scores_ft_gpt_3p5)[0, 1]  # 计算"np_scores_gpt_4"和"np_scores_ft_gpt_3p5"的相关性
corr_no_ft = np.corrcoef(np_scores_gpt_4, np_scores_gpt_3p5)[0, 1]  # 计算"np_scores_gpt_4"和"np_scores_gpt_3p5"的相关性

print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/ fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_ft,
    )
)
print("\n")
print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/out fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_no_ft,
    )
)
# numpy转换
np_scores_gpt_4 = np.array(scores["gpt_4"])  # 将"gpt_4"的分数转换为numpy数组
np_scores_gpt_3p5 = np.array(scores["gpt_3p5"])  # 将"gpt_3p5"的分数转换为numpy数组
np_scores_ft_gpt_3p5 = np.array(scores["ft_gpt_3p5"])  # 将"ft_gpt_3p5"的分数转换为numpy数组

# 相关性
corr_ft = np.corrcoef(np_scores_gpt_4, np_scores_ft_gpt_3p5)[0, 1]  # 计算"np_scores_gpt_4"和"np_scores_ft_gpt_3p5"的相关性
corr_no_ft = np.corrcoef(np_scores_gpt_4, np_scores_gpt_3p5)[0, 1]  # 计算"np_scores_gpt_4"和"np_scores_gpt_3p5"的相关性

print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/ fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_ft,
    )
)
print("\n")
print(
    REPORT_FMT_STR.format(
        model="GPT-3.5 w/out fine-tuning",
        total_obs=np_scores_gpt_4.shape[0],
        corr=corr_no_ft,
    )
)

GPT-3.5 w/ fine-tuning
-----------------
Number of obs.: 44
Correlation with GPT-4: 0.9279850303778618



GPT-3.5 w/out fine-tuning
-----------------
Number of obs.: 44
Correlation with GPT-4: 0.8737418723878325

结论¶

从以上数字可以看出，对GPT-3.5评判器进行微调可以使其与GPT-4的相关性更高，而非经过微调的评判器则不然。因此，在这种情况下，我们可以看到微调帮助我们获得了一个更接近GPT-4评判器（因此间接地更接近人类判断）的GPT-3.5评判器。

使用知识蒸馏对GPT-3.5评判器进行微调（正确性）¶

步骤1 生成数据集：train_dataset 和 test_dataset¶

使用DatasetGenerator构建train_dataset和test_dataset¶

生成问题的答案¶

获取Mistral和LLama-2答案的GPT-4评估¶

第2步 执行知识蒸馏¶

3 在测试数据集上评估经过微调的GPT-3.5评判器¶

评估指标¶

结论¶

步骤1 生成数据集：`train_dataset` 和 `test_dataset`¶

使用`DatasetGenerator`构建`train_dataset`和`test_dataset`¶

第2步执行知识蒸馏¶