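Robust Question Answering with Chroma and OpenAI

This notebook guides you step by step through using Chroma, an open-source embeddings database, along with OpenAI's text embeddings and chat completion APIs, to answer questions about a collection of data.

Additionally, this notebook demonstrates some of the tradeoffs in making a question answering system more robust. As we shall see, simple querying doesn't always produce the best results!

Question Answering with LLMs

Large language models (LLMs) like OpenAI's ChatGPT can be used to answer questions about data that the model may not have been trained on, or does not have access to. For example:

Personal data, such as emails and notes
Highly specialized data, such as archival or legal documents
Newly created data, such as recent news stories

To overcome this limitation, we can use a data store which, like the LLM itself, is amenable to querying in natural language. An embeddings store like Chroma represents documents as embeddings, alongside the documents themselves.

By embedding a text query, Chroma can find relevant documents, which we can then pass to the LLM to answer our question. We'll show detailed examples and variants of this approach.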
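
To make the idea of embedding-based lookup concrete, here is a minimal sketch that ranks a few documents by their distance to a query vector. The three-dimensional vectors, the toy_store list and the helper names are all hypothetical, and cosine distance is used purely for illustration; the metric a real Chroma collection uses depends on how it is configured.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_documents(query_vector, store, n_results=2):
    """Rank stored documents by distance to the query, closest first."""
    ranked = sorted(store, key=lambda item: cosine_distance(query_vector, item["embedding"]))
    return [item["document"] for item in ranked[:n_results]]

# Hypothetical 3-dimensional "embeddings"; real ones have many more dimensions.
toy_store = [
    {"document": "Abstract about vitamin B12", "embedding": [0.9, 0.1, 0.0]},
    {"document": "Abstract about prion disease", "embedding": [0.1, 0.9, 0.1]},
    {"document": "Abstract about genome mapping", "embedding": [0.0, 0.2, 0.9]},
]

top = nearest_documents([0.8, 0.2, 0.1], toy_store)
print(top)
```

In the rest of the notebook, the embedding step and the ranking are handled by Chroma and the OpenAI embedding function rather than by hand-written code like this.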

Setup and preliminaries

First, we make sure the Python dependencies we need are installed.

%pip install -qU openai chromadb pandas

Note: you may need to restart the kernel to use updated packages.

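We use OpenAI's API throughout this notebook. You can get an API key from https://beta.openai.com/account/api-keys

You can add your API key as an environment variable by executing the command export OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx in a terminal. Note that you will need to reload the notebook if the environment variable wasn't set yet. Alternatively, you can set it in the notebook, see below.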

import os

# Uncomment the following line to set the environment variable in the notebook.
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
    import openai
    openai.api_key = os.getenv("OPENAI_API_KEY")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready

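Dataset

Throughout this notebook, we use the SciFact dataset. This is a curated dataset of expert-annotated scientific claims, with an accompanying text corpus of paper titles and abstracts. Each claim may be supported, contradicted, or have not enough evidence either way, according to the documents in the corpus.

Having the corpus available as ground truth allows us to investigate how well the following approaches to LLM question answering perform.

# Load the claim dataset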
import pandas as pd

data_path = '../../data'

claim_df = pd.read_json(f'{data_path}/scifact_claims.jsonl', lines=True)
claim_df.head()

	id	claim	evidence	cited_doc_ids
0	1	0-dimensional biomaterials show inductive prop...	{}	[31715818]
1	3	1,000 genomes project enables mapping of genet...	{'14717500': [{'sentences': [2, 5], 'label': '...	[14717500]
2	5	1/2000 in UK have abnormal PrP positivity.	{'13734012': [{'sentences': [4], 'label': 'SUP...	[13734012]
3	13	5% of perinatal mortality is due to low birth ...	{}	[1606628]
4	36	A deficiency of vitamin B12 increases blood le...	{}	[5152028, 11705328]

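Just asking the model

GPT-3.5 was trained on a large amount of scientific information. As a baseline, we'd like to understand what the model already knows without any further context. This will allow us to calibrate overall performance.

We construct an appropriate prompt, with some example facts, then query the model with each claim in the dataset. We ask the model to assess a claim as 'True', 'False', or 'NEE' if there is not enough evidence one way or the other.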

def build_prompt(claim):
    return [
        {"role": "system", "content": "I will ask you to assess a scientific claim. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."},
        {"role": "user", "content": f"""        
Example:

Claim:
0-dimensional biomaterials show inductive properties.

Assessment:
False

Claim:
1/2000 in UK have abnormal PrP positivity.

Assessment:
True

Claim:
Aspirin inhibits the production of PGE2.

Assessment:
False

End of examples. Assess the following claim:

Claim:
{claim}

Assessment:
"""}
    ]


def assess_claims(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_prompt(claim),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses
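
The model is asked to output exactly one of three labels, but nothing forces it to comply, and the confusion matrix used further down only has entries for 'True', 'False' and 'NEE'. A small guard can map unexpected replies onto a known label. This is a sketch only: normalize_label and its fallback to 'NEE' are hypothetical choices, not part of the original notebook.

```python
VALID_LABELS = ("True", "False", "NEE")

def normalize_label(raw, fallback="NEE"):
    """Map a raw model reply onto one of the three expected labels."""
    # Remove surrounding whitespace, quotes and punctuation, then compare case-insensitively.
    cleaned = raw.strip(" .,'\"\n").lower()
    for label in VALID_LABELS:
        if cleaned == label.lower():
            return label
    # Anything unrecognized is treated as "not enough evidence" (an assumption).
    return fallback

print([normalize_label(r) for r in ["True.", " false", "NEE", "Not enough"]])
```

Falling back to 'NEE' is a conservative default; raising an error instead would make unexpected replies more visible.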

We sample 50 claims from the dataset.

# Let's take a look at 50 claims
samples = claim_df.sample(50)

claims = samples['claim'].tolist() 

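We evaluate the ground truth according to the dataset. From the dataset description, each claim is either supported or contradicted by the evidence, or else there isn't enough evidence either way.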

def get_groundtruth(evidence):
    groundtruth = []
    for e in evidence:
        # Evidence is empty
        if len(e) == 0:
            groundtruth.append('NEE')
        else:
            # In this dataset, all evidence for a given claim is consistent, either SUPPORT or CONTRADICT
            if list(e.values())[0][0]['label'] == 'SUPPORT':
                groundtruth.append('True')
            else:
                groundtruth.append('False')
    return groundtruth

evidence = samples['evidence'].tolist()
groundtruth = get_groundtruth(evidence)

We also output the confusion matrix, comparing the model's assessments with the ground truth, in an easy to read table.

def confusion_matrix(inferred, groundtruth):
    assert len(inferred) == len(groundtruth)
    confusion = {
        'True': {'True': 0, 'False': 0, 'NEE': 0},
        'False': {'True': 0, 'False': 0, 'NEE': 0},
        'NEE': {'True': 0, 'False': 0, 'NEE': 0},
    }
    for i, g in zip(inferred, groundtruth):
        confusion[i][g] += 1

    # Pretty print the confusion matrix
    print('\tGroundtruth')
    print('\tTrue\tFalse\tNEE')
    for i in confusion:
        print(i, end='\t')
        for g in confusion[i]:
            print(confusion[i][g], end='\t')
        print()

    return confusion

We ask the model to directly assess the claims, without additional context.

gpt_inferred = assess_claims(claims)
confusion_matrix(gpt_inferred, groundtruth)

        Groundtruth
        True    False   NEE
True    15      5       14
False   0       2       1
NEE     3       3       7

{'True': {'True': 15, 'False': 5, 'NEE': 14},
 'False': {'True': 0, 'False': 2, 'NEE': 1},
 'NEE': {'True': 3, 'False': 3, 'NEE': 7}}

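Results

From these results we see that the LLM is strongly biased toward assessing claims as true, even when they are false (it labels 34 of the 50 claims 'True'), and that it also tends to assess false claims as not having enough evidence. Note that 'not enough evidence' here refers to the model's assessment of the claim in a vacuum, without additional context.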
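
The bias can be read directly off the confusion matrix by summing each row, since rows hold the model's inferred label. The sketch below does this for the matrix printed above; predicted_share is a hypothetical helper name.

```python
# Confusion matrix from the run above: confusion[inferred][groundtruth]
baseline = {
    'True': {'True': 15, 'False': 5, 'NEE': 14},
    'False': {'True': 0, 'False': 2, 'NEE': 1},
    'NEE': {'True': 3, 'False': 3, 'NEE': 7},
}

def predicted_share(confusion):
    """Fraction of all claims that the model assigned to each label."""
    total = sum(sum(row.values()) for row in confusion.values())
    return {label: sum(row.values()) / total for label, row in confusion.items()}

share = predicted_share(baseline)
for label, fraction in share.items():
    print(f"{label}: {fraction:.2f}")
```

For this sample the model labels 68% of claims 'True', 6% 'False' and 26% 'NEE'.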

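Adding context

We now add the additional context available from the corpus of paper titles and abstracts. This section shows how to load a text corpus into Chroma, using OpenAI text embeddings.

First, we load the text corpus.

# Load the corpus into a dataframe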
corpus_df = pd.read_json(f'{data_path}/scifact_corpus.jsonl', lines=True)
corpus_df.head()

	doc_id	title	abstract	structured
0	4983	Microstructural development of human newborn c...	[Alterations of the architecture of cerebral w...	False
1	5836	Induction of myelodysplasia by myeloid-derived...	[Myelodysplastic syndromes (MDS) are age-depen...	False
2	7912	BC1 RNA, the transcript from a master gene for...	[ID elements are short interspersed elements (...	False
3	18670	The DNA Methylome of Human Peripheral Blood Mo...	[DNA methylation plays an important role in bi...	False
4	19238	The human myelin basic protein gene is include...	[Two human Golli (for gene expressed in the ol...	False

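Loading the corpus into Chroma

The next step is to load the corpus into Chroma. Given an embedding function, Chroma will automatically handle embedding each document, and will store it alongside its text and metadata, making it simple to query.

We instantiate an (ephemeral) Chroma client, and create a collection for the SciFact title and abstract corpus. Chroma can also be instantiated in a persisted configuration; learn more in the Chroma docs.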

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# We initialize an embedding function, and provide it to the collection.
embedding_function = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))

chroma_client = chromadb.Client() # Ephemeral by default
scifact_corpus_collection = chroma_client.create_collection(name='scifact_corpus', embedding_function=embedding_function)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.

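Next we load the corpus into Chroma. Because this data loading is memory intensive, we recommend using a batched loading scheme, with batches of 50-1000 documents. For this example it should take just over one minute to load the entire corpus. The documents are embedded in the background, automatically, using the embedding_function we specified earlier.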

batch_size = 100

for i in range(0, len(corpus_df), batch_size):
    batch_df = corpus_df[i:i+batch_size]
    scifact_corpus_collection.add(
        ids=batch_df['doc_id'].apply(lambda x: str(x)).tolist(), # Chroma takes string IDs.
        documents=(batch_df['title'] + '. ' + batch_df['abstract'].apply(lambda x: ' '.join(x))).to_list(), # We concatenate the title and abstract.
        metadatas=[{"structured": structured} for structured in batch_df['structured'].to_list()] # We also store the metadata, though we don't use it in this example.
    )
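
The batching pattern used in the loop above can be isolated into a small generator, which makes the batch boundaries easy to check before sending anything to the database. This is a sketch with a hypothetical helper name (iter_batches) and stand-in data rather than the real corpus.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

doc_ids = [str(i) for i in range(250)]  # stand-in for the corpus IDs
sizes = [len(batch) for batch in iter_batches(doc_ids, 100)]
print(sizes)
```

The last batch is simply shorter when the corpus size is not a multiple of the batch size, which is also how the slicing in the loading loop behaves.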

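Retrieving context

Next we retrieve documents from the corpus which may be relevant to each claim in our sample. We want to provide these as context to the LLM for evaluating the claims. We retrieve the 3 most relevant documents for each claim, according to the embedding distance.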

claim_query_result = scifact_corpus_collection.query(query_texts=claims, include=['documents', 'distances'], n_results=3)
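
The query result is organized as one inner list per query, which is how the rest of this notebook indexes it. The sketch below uses a hand-written mock_result with that shape (the IDs, documents and distances are invented) to show how each claim lines up with its documents and distances.

```python
# Hypothetical result for two queries with two results each.
mock_result = {
    'ids': [['a', 'b'], ['c', 'd']],
    'documents': [['doc a', 'doc b'], ['doc c', 'doc d']],
    'distances': [[0.21, 0.34], [0.18, 0.40]],
}
queries = ['first claim', 'second claim']

def closest_per_query(queries, result):
    """Pair each query with its closest document and that document's distance."""
    pairs = {}
    for query, docs, dists in zip(queries, result['documents'], result['distances']):
        best = min(range(len(docs)), key=lambda i: dists[i])
        pairs[query] = (docs[best], dists[best])
    return pairs

closest = closest_per_query(queries, mock_result)
print(closest)
```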

We create a new prompt, this time taking into account the additional context we retrieve from the corpus.

def build_prompt_with_context(claim, context):
    return [{'role': 'system', 'content': "I will ask you to assess a particular scientific claim, based on evidence provided. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence."}, 
            {'role': 'user', 'content': f"""
The evidence is the following:

{' '.join(context)}

Assess the following claim on the basis of the evidence. Output only the text 'True' if the claim is true, 'False' if the claim is false, or 'NEE' if there's not enough evidence. Do not output any other text. 

Claim:
{claim}

Assessment:
"""}]


def assess_claims_with_context(claims, contexts):
    responses = []
    # Query the OpenAI API
    for claim, context in zip(claims, contexts):
        # If no evidence is provided, return NEE
        if len(context) == 0:
            responses.append('NEE')
            continue
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_prompt_with_context(claim=claim, context=context),
            max_tokens=3,
        )
        # Strip any punctuation or whitespace from the response
        responses.append(response.choices[0].message.content.strip('., '))

    return responses

We then ask the model to evaluate the claims with the retrieved context.

gpt_with_context_evaluation = assess_claims_with_context(claims, claim_query_result['documents'])
confusion_matrix(gpt_with_context_evaluation, groundtruth)

        Groundtruth
        True    False   NEE
True    16      2       8
False   1       6       5
NEE     1       2       9

{'True': {'True': 16, 'False': 2, 'NEE': 8},
 'False': {'True': 1, 'False': 6, 'NEE': 5},
 'NEE': {'True': 1, 'False': 2, 'NEE': 9}}

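Results

We see that the model is a lot less likely to evaluate a false claim as true (2 instances vs. 5 previously), but that claims without enough evidence are still often assessed as true or false.

Taking a look at the retrieved documents, we see that they are sometimes not relevant to the claim. This causes the model to be confused by the extra information, and it may decide that sufficient evidence is present, even when the information is irrelevant. This happens because we always ask for the 3 'most' relevant documents, but beyond a certain point these might not be relevant at all.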

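Filtering context on relevance

Along with the documents themselves, Chroma returns a distance score. We can try thresholding on distance, so that fewer irrelevant documents make it into the context we provide the model.

If, after filtering on the threshold, no context documents remain, we bypass the model and simply return that there is not enough evidence.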

def filter_query_result(query_result, distance_threshold=0.25):
    # For each query result, retain only the documents whose distance is at or below the threshold
    for ids, docs, distances in zip(query_result['ids'], query_result['documents'], query_result['distances']):
        for i in range(len(ids)-1, -1, -1):
            if distances[i] > distance_threshold:
                ids.pop(i)
                docs.pop(i)
                distances.pop(i)
    return query_result

filtered_claim_query_result = filter_query_result(claim_query_result)
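
Note that filter_query_result removes entries from the lists it is given, so claim_query_result itself is changed by the call above. If the unfiltered result is still needed afterwards, a non-mutating variant is one option. The sketch below is an alternative under that assumption, with a hypothetical name (filtered_copy) and invented mock data; it is not the notebook's own function.

```python
def filtered_copy(query_result, distance_threshold=0.25):
    """Return a new result dict keeping only documents within the threshold."""
    filtered = {'ids': [], 'documents': [], 'distances': []}
    for ids, docs, distances in zip(query_result['ids'], query_result['documents'], query_result['distances']):
        # Indices of the documents that pass the threshold for this query.
        keep = [i for i, d in enumerate(distances) if d <= distance_threshold]
        filtered['ids'].append([ids[i] for i in keep])
        filtered['documents'].append([docs[i] for i in keep])
        filtered['distances'].append([distances[i] for i in keep])
    return filtered

# Invented distances for two queries.
mock_result = {
    'ids': [['a', 'b', 'c'], ['d', 'e', 'f']],
    'documents': [['doc a', 'doc b', 'doc c'], ['doc d', 'doc e', 'doc f']],
    'distances': [[0.20, 0.24, 0.31], [0.27, 0.30, 0.45]],
}

kept = filtered_copy(mock_result)
print(kept['documents'])
print(mock_result['documents'])  # the input is left untouched
```

The second query keeps no documents at all, which is the case where the notebook bypasses the model and returns 'NEE'.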

Now we assess the claims using this cleaner context.

gpt_with_filtered_context_evaluation = assess_claims_with_context(claims, filtered_claim_query_result['documents'])
confusion_matrix(gpt_with_filtered_context_evaluation, groundtruth)

        Groundtruth
        True    False   NEE
True    10      2       1
False   0       2       1
NEE     8       6       20

{'True': {'True': 10, 'False': 2, 'NEE': 1},
 'False': {'True': 0, 'False': 2, 'NEE': 1},
 'NEE': {'True': 8, 'False': 6, 'NEE': 20}}

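Results

The model now assesses far fewer claims as true or false when there is not enough evidence present. However, it is now biased away from certainty. Most claims (34 of 50) are now assessed as having not enough evidence, because for a large fraction of them the retrieved documents are removed by the distance threshold. It's possible to tune the distance threshold to find the optimal operating point, but this can be difficult, and is dependent on the dataset and the embedding model.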
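
One way to get a feel for the threshold is to count, for several candidate values, how many queries would still have at least one document left. The sketch below does this on invented distances; mock_distances and queries_with_context are hypothetical, and in practice the 'distances' field of the query result would be used instead.

```python
# Invented distances: one inner list per query, three results each.
mock_distances = [
    [0.18, 0.27, 0.41],
    [0.22, 0.24, 0.35],
    [0.31, 0.33, 0.52],
    [0.26, 0.38, 0.44],
]

def queries_with_context(distances_per_query, threshold):
    """Count queries that keep at least one document at this threshold."""
    return sum(1 for dists in distances_per_query if any(d <= threshold for d in dists))

sweep = {t: queries_with_context(mock_distances, t) for t in (0.20, 0.25, 0.30, 0.35)}
for threshold, count in sweep.items():
    print(f"threshold={threshold:.2f}: {count} of {len(mock_distances)} queries keep context")
```

A count like this only shows how much context survives, not whether the surviving documents are relevant, so it complements rather than replaces evaluating against the ground truth.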

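Hypothetical Document Embeddings: Using hallucinations productively

We want to be able to retrieve relevant documents, without retrieving less relevant ones which might confuse the model. One way to accomplish this is to improve the retrieval query.

Until now, we have queried the dataset using claims, which are single-sentence statements, while the corpus contains abstracts describing scientific papers. Intuitively, while these might be related, there are significant differences in their structure and meaning. These differences are encoded by the embedding model, and so influence the distances between the query and the most relevant results.

We can overcome this by leveraging the power of LLMs to generate relevant text. While the facts might be hallucinated, the content and structure of the documents the model generates is more similar to the documents in our corpus than the queries are. This could lead to better queries and hence better results.

This approach is called Hypothetical Document Embeddings (HyDE), and has been shown to perform well at the retrieval task. It should help us bring more relevant information into the context, without polluting it.

In short:

You get better matches when you embed whole abstracts rather than single sentences
But claims are usually single sentences
So HyDE shows that using the LLM to expand claims into hallucinated abstracts and then searching based on those abstracts (claims -> abstracts -> results) works better than searching directly (claims -> results)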
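
The claims -> abstracts -> results flow can be sketched as a function that takes the generation and search steps as arguments. Everything here is a stand-in: hyde_retrieve, fake_generate_abstract and fake_search are hypothetical names, and the fakes exist only so the sketch runs without any API calls.

```python
def hyde_retrieve(claim, generate_abstract, search, n_results=3):
    """Expand a claim into a hypothetical abstract, then search with that text."""
    hypothetical_abstract = generate_abstract(claim)
    return search(hypothetical_abstract, n_results)

def fake_generate_abstract(claim):
    # Stand-in for an LLM call that writes an abstract about the claim.
    return f"BACKGROUND We investigate the following claim. {claim} METHODS ..."

def fake_search(query_text, n_results):
    # Stand-in for an embedding search; a real one would rank by distance.
    corpus = ["abstract one", "abstract two", "abstract three", "abstract four"]
    return {"query_length": len(query_text), "documents": corpus[:n_results]}

claim = "Aspirin inhibits the production of PGE2."
result = hyde_retrieve(claim, fake_generate_abstract, fake_search, n_results=2)
print(result["documents"])
```

The point of the structure is that the search step receives the generated abstract, not the original claim; the claim itself is only used again later, when the model assesses it against the retrieved context.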

First, we use in-context examples to prompt the model to generate documents similar to what's in the corpus, for each claim we want to assess.

def build_hallucination_prompt(claim):
    return [{'role': 'system', 'content': """I will ask you to write an abstract for a scientific paper which supports or refutes a given claim. It should be written in scientific language, include a title. Output only one abstract, then stop.
    
    An Example:

    Claim:
    A high microerythrocyte count raises vulnerability to severe anemia in homozygous alpha (+)- thalassemia trait subjects.

    Abstract:
    BACKGROUND The heritable haemoglobinopathy alpha(+)-thalassaemia is caused by the reduced synthesis of alpha-globin chains that form part of normal adult haemoglobin (Hb). Individuals homozygous for alpha(+)-thalassaemia have microcytosis and an increased erythrocyte count. Alpha(+)-thalassaemia homozygosity confers considerable protection against severe malaria, including severe malarial anaemia (SMA) (Hb concentration < 50 g/l), but does not influence parasite count. We tested the hypothesis that the erythrocyte indices associated with alpha(+)-thalassaemia homozygosity provide a haematological benefit during acute malaria.   
    METHODS AND FINDINGS Data from children living on the north coast of Papua New Guinea who had participated in a case-control study of the protection afforded by alpha(+)-thalassaemia against severe malaria were reanalysed to assess the genotype-specific reduction in erythrocyte count and Hb levels associated with acute malarial disease. We observed a reduction in median erythrocyte count of approximately 1.5 x 10(12)/l in all children with acute falciparum malaria relative to values in community children (p < 0.001). We developed a simple mathematical model of the linear relationship between Hb concentration and erythrocyte count. This model predicted that children homozygous for alpha(+)-thalassaemia lose less Hb than children of normal genotype for a reduction in erythrocyte count of >1.1 x 10(12)/l as a result of the reduced mean cell Hb in homozygous alpha(+)-thalassaemia. In addition, children homozygous for alpha(+)-thalassaemia require a 10% greater reduction in erythrocyte count than children of normal genotype (p = 0.02) for Hb concentration to fall to 50 g/l, the cutoff for SMA. We estimated that the haematological profile in children homozygous for alpha(+)-thalassaemia reduces the risk of SMA during acute malaria compared to children of normal genotype (relative risk 0.52; 95% confidence interval [CI] 0.24-1.12, p = 0.09).   
    CONCLUSIONS The increased erythrocyte count and microcytosis in children homozygous for alpha(+)-thalassaemia may contribute substantially to their protection against SMA. A lower concentration of Hb per erythrocyte and a larger population of erythrocytes may be a biologically advantageous strategy against the significant reduction in erythrocyte count that occurs during acute infection with the malaria parasite Plasmodium falciparum. This haematological profile may reduce the risk of anaemia by other Plasmodium species, as well as other causes of anaemia. Other host polymorphisms that induce an increased erythrocyte count and microcytosis may confer a similar advantage.

    End of example. 
    
    """}, {'role': 'user', 'content': f"""
    Perform the task for the following claim.

    Claim:
    {claim}

    Abstract:
    """}]


def hallucinate_evidence(claims):
    responses = []
    # Query the OpenAI API
    for claim in claims:
        response = openai.ChatCompletion.create(
            model='gpt-3.5-turbo',
            messages=build_hallucination_prompt(claim),
        )
        responses.append(response.choices[0].message.content)
    return responses

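We hallucinate a document for each claim.

NB: This can take a while, about 30 minutes for 100 claims. You can reduce the number of claims we want to assess to get results more quickly.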

hallucinated_evidence = hallucinate_evidence(claims)

We use the hallucinated documents as queries into the corpus, and filter the results using the same distance threshold.

hallucinated_query_result = scifact_corpus_collection.query(query_texts=hallucinated_evidence, include=['documents', 'distances'], n_results=3)
filtered_hallucinated_query_result = filter_query_result(hallucinated_query_result)

We then ask the model to assess the claims, using the new context.

gpt_with_hallucinated_context_evaluation = assess_claims_with_context(claims, filtered_hallucinated_query_result['documents'])
confusion_matrix(gpt_with_hallucinated_context_evaluation, groundtruth)

        Groundtruth
        True    False   NEE
True    15      2       5
False   1       5       4
NEE     2       3       13

{'True': {'True': 15, 'False': 2, 'NEE': 5},
 'False': {'True': 1, 'False': 5, 'NEE': 4},
 'NEE': {'True': 2, 'False': 3, 'NEE': 13}}

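Results

Combining HyDE with a simple distance threshold leads to more balanced results. The model is no longer biased toward assessing claims as true, nor toward saying that there is not enough evidence. It also recognizes a lack of evidence more often than it did with the unfiltered context (13 of the 22 such claims, versus 9).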
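
To compare the four runs side by side, we can compute a couple of summary numbers from the confusion matrices printed in this notebook. This is a sketch: accuracy and nee_recall are hypothetical helper names, and the run labels in the dictionary are ours.

```python
# Confusion matrices copied from the outputs above: confusion[inferred][groundtruth]
runs = {
    'no context': {'True': {'True': 15, 'False': 5, 'NEE': 14},
                   'False': {'True': 0, 'False': 2, 'NEE': 1},
                   'NEE': {'True': 3, 'False': 3, 'NEE': 7}},
    'top-3 context': {'True': {'True': 16, 'False': 2, 'NEE': 8},
                      'False': {'True': 1, 'False': 6, 'NEE': 5},
                      'NEE': {'True': 1, 'False': 2, 'NEE': 9}},
    'filtered context': {'True': {'True': 10, 'False': 2, 'NEE': 1},
                         'False': {'True': 0, 'False': 2, 'NEE': 1},
                         'NEE': {'True': 8, 'False': 6, 'NEE': 20}},
    'HyDE + filter': {'True': {'True': 15, 'False': 2, 'NEE': 5},
                      'False': {'True': 1, 'False': 5, 'NEE': 4},
                      'NEE': {'True': 2, 'False': 3, 'NEE': 13}},
}

def accuracy(confusion):
    """Fraction of claims where the inferred label matches the ground truth."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[label][label] for label in confusion)
    return correct / total

def nee_recall(confusion):
    """Share of ground-truth NEE claims that the model also labelled NEE."""
    total_nee = sum(row['NEE'] for row in confusion.values())
    return confusion['NEE']['NEE'] / total_nee

for name, confusion in runs.items():
    print(f"{name}: accuracy={accuracy(confusion):.2f}, NEE recall={nee_recall(confusion):.2f}")
```

On this 50-claim sample the overall accuracy is 0.48, 0.62, 0.64 and 0.66 respectively. With a sample this small, a difference of one or two claims should not be over-interpreted.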

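Conclusion

Equipping LLMs with context based on a corpus of documents is a powerful technique for bringing the general reasoning and natural language interactions of LLMs to your own data. However, it's important to know that naive query and retrieval may not produce the best possible results! Ultimately, understanding the data will help get the most out of the retrieval-based question answering approach.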
