
Retrieval Augmented Generative Question Answering with Pinecone


Fixing LLMs that Hallucinate

In this notebook we will learn how to query Pinecone for contexts relevant to our questions and pass these to a generative OpenAI model to generate answers backed by real data sources.

A common problem with using GPT-3 to factually answer questions is that GPT-3 can sometimes make things up. The GPT models have broad general knowledge, but this does not necessarily extend to more specific information. For that reason we use the Pinecone vector database as our _"external knowledge base"_ — like long-term memory for GPT-3.

Required installs for this notebook are:

!pip install -qU openai pinecone-client datasets

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 55.3/55.3 KB 1.7 MB/s eta 0:00:00
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 170.6/170.6 KB 13.7 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 452.9/452.9 KB 30.4 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 58.3/58.3 KB 6.8 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 213.0/213.0 KB 17.3 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.0/132.0 KB 13.7 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 182.4/182.4 KB 18.6 MB/s eta 0:00:00
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.6/140.6 KB 6.7 MB/s eta 0:00:00
Building wheel for openai (pyproject.toml) ... done
import openai

# get API key from the top-right dropdown on the OpenAI website
openai.api_key = "OPENAI_API_KEY"

For many questions, state-of-the-art LLMs are more than capable of answering correctly.

query = "who was the 12th person on the moon and when did they land?"

# now query `gpt-3.5-turbo-instruct` WITHOUT context
res = openai.Completion.create(
    engine='gpt-3.5-turbo-instruct',
    prompt=query,
    temperature=0,
    max_tokens=400,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

res['choices'][0]['text'].strip()

'The 12th person on the moon was Harrison Schmitt, and he landed on December 11, 1972.'

However, that isn't always the case. First, let's rewrite the above into a simple function so we're not rewriting it every time.

def complete(prompt):
    res = openai.Completion.create(
        engine='gpt-3.5-turbo-instruct',
        prompt=prompt,
        temperature=0,
        max_tokens=400,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )
    return res['choices'][0]['text'].strip()

Now let's ask a more specific question about training a type of transformer model called a sentence transformer. The ideal answer we'd be looking for is _"Multiple negatives ranking (MNR) loss"_.

Don't worry if this is a new term to you; it isn't required to understand what we're doing or demonstrating here.

query = (
    "Which training method should I use for sentence transformers when " +
    "I only have pairs of related sentences?"
)

complete(query)

'If you only have pairs of related sentences, then the best training method to use for sentence transformers is the supervised learning approach. This approach involves providing the model with labeled data, such as pairs of related sentences, and then training the model to learn the relationships between the sentences. This approach is often used for tasks such as natural language inference, semantic similarity, and paraphrase identification.'

One of the common answers we get to this is:

The best training method to use for fine-tuning a pre-trained model is Masked Language Model (MLM) training. MLM training involves randomly masking some of the words in a sentence and then training the model to predict the masked words. This helps the model learn the context of the sentence and better understand the relationships between words.

This answer seems pretty convincing, right? Yet it's wrong. MLM is typically used in the _pretraining_ step of transformer models, but it cannot be used to fine-tune a sentence transformer, and it has nothing to do with having _"pairs of related sentences"_.

An alternative answer we receive (and the one returned above) is that a supervised learning approach is the most suitable. This is completely true, but it's not specific and doesn't answer the question.

We have two options for enabling our LLM to understand and correctly answer this question:

  1. We fine-tune the LLM on text data covering the topic in question, most likely articles and papers discussing sentence transformers, semantic search training methods, and so on.

  2. We use Retrieval Augmented Generation (RAG), a technique that adds an information retrieval component to the generation process. This allows us to retrieve relevant information and feed it into the generation model as a secondary source of information.

We will demonstrate option 2.


Building a Knowledge Base

With option 2, retrieving relevant information requires an external _"knowledge base"_ — a place where we can store, and efficiently retrieve, information. We can think of this as the external _long-term memory_ of our LLM.

We need to retrieve information that is semantically related to our queries, and to do this we use _"dense vector embeddings"_. These can be thought of as numerical representations of the _meaning_ behind our sentences.

To create these dense vectors we use the text-embedding-ada-002 model.

We have already authenticated our OpenAI connection; to create an embedding we just do:

embed_model = "text-embedding-ada-002"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=embed_model
)

In the response res we will find a JSON-like object containing our new embeddings within the 'data' field.

res.keys()

dict_keys(['object', 'data', 'model', 'usage'])

Inside 'data' we will find two records, one for each of the two sentences we just embedded. Each vector embedding contains 1536 dimensions (the output dimensionality of the text-embedding-ada-002 model).

len(res['data'])

2
len(res['data'][0]['embedding']), len(res['data'][1]['embedding'])

(1536, 1536)
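
Since each embedding is a numerical representation of meaning, semantically similar texts map to nearby vectors. As a quick aside — a minimal sketch, not part of the original flow, assuming numpy is available — we can compare the two embeddings we just created with cosine similarity, the same metric (metric='cosine') the Pinecone index will use later:

import numpy as np

# pull the two 1536-dimensional vectors out of the response
a = np.array(res['data'][0]['embedding'])
b = np.array(res['data'][1]['embedding'])

# cosine similarity = dot product of the L2-normalized vectors
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(float(cos_sim), 4))  # closer to 1 means more semantically similar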

We will apply this same embedding logic to a dataset containing information relevant to our query (and to many other queries on the topics of ML and AI).

Data Preparation

The dataset we will be using is jamescalam/youtube-transcriptions from Hugging Face Datasets. It contains transcribed audio from several ML and tech YouTube channels. We download it with:

from datasets import load_dataset

data = load_dataset('jamescalam/youtube-transcriptions', split='train')
data

Using custom data configuration jamescalam--youtube-transcriptions-6a482f3df0aedcdb
Reusing dataset json (/Users/jamesbriggs/.cache/huggingface/datasets/jamescalam___json/jamescalam--youtube-transcriptions-6a482f3df0aedcdb/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)
Dataset({
features: ['title', 'published', 'url', 'video_id', 'channel_id', 'id', 'text', 'start', 'end'],
num_rows: 208619
})
data[0]

{'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
'published': '2021-07-06 13:00:03 UTC',
'url': 'https://youtu.be/35Pdoyi6ZoQ',
'video_id': '35Pdoyi6ZoQ',
'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'id': '35Pdoyi6ZoQ-t0.0',
'text': 'Hi, welcome to the video.',
'start': 0.0,
'end': 9.36}

The dataset contains many small snippets of text data. We need to merge many snippets from each video to create larger chunks of text that contain more information.

from tqdm.auto import tqdm

new_data = []

window = 20  # number of sentences to combine
stride = 4  # number of sentences to 'stride' over, used to create overlap

for i in tqdm(range(0, len(data), stride)):
    i_end = min(len(data)-1, i+window)
    if data[i]['title'] != data[i_end]['title']:
        # in this case we skip this entry as we have the start/end of two videos
        continue
    text = ' '.join(data[i:i_end]['text'])
    # create the new merged dataset
    new_data.append({
        'start': data[i]['start'],
        'end': data[i_end]['end'],
        'title': data[i]['title'],
        'text': text,
        'id': data[i]['id'],
        'url': data[i]['url'],
        'published': data[i]['published'],
        'channel_id': data[i]['channel_id']
    })

  0%|          | 0/52155 [00:00<?, ?it/s]
new_data[0]

{'start': 0.0,
'end': 74.12,
'title': 'Training and Testing an Italian BERT - Transformers From Scratch #4',
'text': "Hi, welcome to the video. So this is the fourth video in a Transformers from Scratch mini series. So if you haven't been following along, we've essentially covered what you can see on the screen. So we got some data. We built a tokenizer with it. And then we've set up our input pipeline ready to begin actually training our model, which is what we're going to cover in this video. So let's move over to the code. And we see here that we have essentially everything we've done so far. So we've built our input data, our input pipeline. And we're now at a point where we have a data loader, PyTorch data loader, ready. And we can begin training a model with it. So there are a few things to be aware of. So I mean, first, let's just have a quick look at the structure of our data.",
'id': '35Pdoyi6ZoQ-t0.0',
'url': 'https://youtu.be/35Pdoyi6ZoQ',
'published': '2021-07-06 13:00:03 UTC',
'channel_id': 'UCv83tO5cePwHMt1952IVVHw'}

Now we need a place to store these embeddings and enable an efficient _vector search_ through them all. To do that we use Pinecone; we can get a free API key, enter it below, and then initialize our connection to Pinecone and create a new index.

import pinecone

index_name = 'openai-youtube-transcriptions'

# initialize connection to Pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="PINECONE_API_KEY",
    environment="us-east1-gcp"  # may vary, check at app.pinecone.io
)

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=len(res['data'][0]['embedding']),
        metric='cosine',
        metadata_config={'indexed': ['channel_id', 'published']}
    )
# connect to index
index = pinecone.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
'index_fullness': 0.0,
'namespaces': {},
'total_vector_count': 0}

We can see the index is currently empty, with a total_vector_count of 0. We can begin populating it with embeddings built using OpenAI's text-embedding-ada-002 model, like so:

from tqdm.auto import tqdm
from time import sleep

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(new_data), batch_size)):
    # find end of batch
    i_end = min(len(new_data), i+batch_size)
    meta_batch = new_data[i:i_end]
    # get ids
    ids_batch = [x['id'] for x in meta_batch]
    # get texts to encode
    texts = [x['text'] for x in meta_batch]
    # create embeddings (try-except added to avoid RateLimitError)
    done = False
    while not done:
        try:
            res = openai.Embedding.create(input=texts, engine=embed_model)
            done = True
        except:
            sleep(5)
    embeds = [record['embedding'] for record in res['data']]
    # clean up metadata
    meta_batch = [{
        'start': x['start'],
        'end': x['end'],
        'title': x['title'],
        'text': x['text'],
        'url': x['url'],
        'published': x['published'],
        'channel_id': x['channel_id']
    } for x in meta_batch]
    to_upsert = list(zip(ids_batch, embeds, meta_batch))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

  0%|          | 0/487 [00:00<?, ?it/s]

Now we search; for this we need to create a _query vector_ xq:

res = openai.Embedding.create(
    input=[query],
    engine=embed_model
)

# retrieve from Pinecone
xq = res['data'][0]['embedding']

# get relevant contexts (including the questions)
res = index.query(xq, top_k=2, include_metadata=True)

res

{'matches': [{'id': 'pNvujJ1XyeQ-t418.88',
'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'end': 568.4,
'published': datetime.date(2021, 11, 24),
'start': 418.88,
'text': 'pairs of related sentences you can go '
'ahead and actually try training or '
'fine-tuning using NLI with multiple '
"negative ranking loss. If you don't have "
'that fine. Another option is that you have '
'a semantic textual similarity data set or '
'STS and what this is is you have so you '
'have sentence A here, sentence B here and '
'then you have a score from from 0 to 1 '
'that tells you the similarity between '
'those two scores and you would train this '
'using something like cosine similarity '
"loss. Now if that's not an option and your "
'focus or use case is on building a '
'sentence transformer for another language '
'where there is no current sentence '
'transformer you can use multilingual '
'parallel data. So what I mean by that is '
'so parallel data just means translation '
'pairs so if you have for example a English '
'sentence and then you have another '
'language here so it can it can be anything '
"I'm just going to put XX and that XX is "
'your target language you can fine-tune a '
'model using something called multilingual '
'knowledge distillation and what that does '
'is takes a monolingual model for example '
'in English and using those translation '
'pairs it distills the knowledge the '
'semantic similarity knowledge from that '
'monolingual English model into a '
'multilingual model which can handle both '
'English and your target language. So '
"they're three options quite popular very "
'common that you can go for and as a '
'supervised methods the chances are that '
'probably going to outperform anything you '
'do with unsupervised training at least for '
'now. So if none of those sound like '
'something',
'title': 'Today Unsupervised Sentence Transformers, '
'Tomorrow Skynet (how TSDAE works)',
'url': 'https://youtu.be/pNvujJ1XyeQ'},
'score': 0.865277052,
'sparseValues': {},
'values': []},
{'id': 'WS1uVMGhlWQ-t737.28',
'metadata': {'channel_id': 'UCv83tO5cePwHMt1952IVVHw',
'end': 900.72,
'published': datetime.date(2021, 10, 20),
'start': 737.28,
'text': "were actually more accurate. So we can't "
"really do that. We can't use this what is "
'called a mean pooling approach. Or we '
"can't use it in its current form. Now the "
'solution to this problem was introduced by '
'two people in 2019 Nils Reimers and Irenia '
'Gurevich. They introduced what is the '
'first sentence transformer or sentence '
'BERT. And it was found that sentence BERT '
'or S BERT outformed all of the previous '
'Save the Art models on pretty much all '
'benchmarks. Not all of them but most of '
'them. And it did it in a very quick time. '
'So if we compare it to BERT, if we wanted '
'to find the most similar sentence pair '
'from 10,000 sentences in that 2019 paper '
'they found that with BERT that took 65 '
'hours. With S BERT embeddings they could '
'create all the embeddings in just around '
'five seconds. And then they could compare '
'all those with cosine similarity in 0.01 '
"seconds. So it's a lot faster. We go from "
'65 hours to just over five seconds which '
'is I think pretty incredible. Now I think '
"that's pretty much all the context we need "
'behind sentence transformers. And what we '
'do now is dive into a little bit of how '
'they actually work. Now we said before we '
'have the core transform models and what S '
'BERT does is fine tunes on sentence pairs '
'using what is called a Siamese '
'architecture or Siamese network. What we '
'mean by a Siamese network is that we have '
'what we can see, what can view as two BERT '
'models that are identical and the weights '
'between those two models are tied. Now in '
'reality when implementing this we just use '
'a single BERT model. And what we do is we '
'process one sentence, a sentence A through '
'the model and then we process another '
'sentence, sentence B through the model. '
"And that's the sentence pair. So with our "
'cross-linked we were processing the '
'sentence pair together. We were putting '
'them both together, processing them all at '
'once. This time we process them '
'separately. And during training what '
'happens is the weights',
'title': 'Intro to Sentence Embeddings with '
'Transformers',
'url': 'https://youtu.be/WS1uVMGhlWQ'},
'score': 0.85855335,
'sparseValues': {},
'values': []}],
'namespace': ''}
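
As an aside, the index was created with metadata_config={'indexed': ['channel_id', 'published']}, so those metadata fields can also be used to filter queries. A minimal, optional sketch (the channel_id value is taken from the dataset sample shown earlier):

# restrict the search to a single channel via a metadata filter
res_filtered = index.query(
    xq,
    top_k=2,
    include_metadata=True,
    filter={'channel_id': {'$eq': 'UCv83tO5cePwHMt1952IVVHw'}}
)
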
limit = 3750

def retrieve(query):
    res = openai.Embedding.create(
        input=[query],
        engine=embed_model
    )

    # retrieve from Pinecone
    xq = res['data'][0]['embedding']

    # get relevant contexts
    res = index.query(xq, top_k=3, include_metadata=True)
    contexts = [
        x['metadata']['text'] for x in res['matches']
    ]

    # build our prompt with the retrieved contexts included
    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # append contexts until hitting the limit
    for i in range(1, len(contexts)):
        if len("\n\n---\n\n".join(contexts[:i])) >= limit:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(contexts[:i-1]) +
                prompt_end
            )
            break
        elif i == len(contexts)-1:
            prompt = (
                prompt_start +
                "\n\n---\n\n".join(contexts) +
                prompt_end
            )
    return prompt

# first we retrieve relevant items from Pinecone
query_with_contexts = retrieve(query)
query_with_contexts

"Answer the question based on the context below.\n\nContext:\npairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B here and then you have a score from from 0 to 1 that tells you the similarity between those two scores and you would train this using something like cosine similarity loss. Now if that's not an option and your focus or use case is on building a sentence transformer for another language where there is no current sentence transformer you can use multilingual parallel data. So what I mean by that is so parallel data just means translation pairs so if you have for example a English sentence and then you have another language here so it can it can be anything I'm just going to put XX and that XX is your target language you can fine-tune a model using something called multilingual knowledge distillation and what that does is takes a monolingual model for example in English and using those translation pairs it distills the knowledge the semantic similarity knowledge from that monolingual English model into a multilingual model which can handle both English and your target language. So they're three options quite popular very common that you can go for and as a supervised methods the chances are that probably going to outperform anything you do with unsupervised training at least for now. So if none of those sound like something\n\n---\n\nwere actually more accurate. So we can't really do that. We can't use this what is called a mean pooling approach. Or we can't use it in its current form. Now the solution to this problem was introduced by two people in 2019 Nils Reimers and Irenia Gurevich. They introduced what is the first sentence transformer or sentence BERT. And it was found that sentence BERT or S BERT outformed all of the previous Save the Art models on pretty much all benchmarks. Not all of them but most of them. And it did it in a very quick time. So if we compare it to BERT, if we wanted to find the most similar sentence pair from 10,000 sentences in that 2019 paper they found that with BERT that took 65 hours. With S BERT embeddings they could create all the embeddings in just around five seconds. And then they could compare all those with cosine similarity in 0.01 seconds. So it's a lot faster. We go from 65 hours to just over five seconds which is I think pretty incredible. Now I think that's pretty much all the context we need behind sentence transformers. And what we do now is dive into a little bit of how they actually work. Now we said before we have the core transform models and what S BERT does is fine tunes on sentence pairs using what is called a Siamese architecture or Siamese network. What we mean by a Siamese network is that we have what we can see, what can view as two BERT models that are identical and the weights between those two models are tied. Now in reality when implementing this we just use a single BERT model. And what we do is we process one sentence, a sentence A through the model and then we process another sentence, sentence B through the model. And that's the sentence pair. So with our cross-linked we were processing the sentence pair together. We were putting them both together, processing them all at once. This time we process them separately. 
And during training what happens is the weights\n\n---\n\nTransformer-based Sequential Denoising Autoencoder. So what we'll do is jump straight into it and take a look at where we might want to use this training approach and and how we can actually implement it. So the first question we need to ask is do we really need to resort to unsupervised training? Now what we're going to do here is just have a look at a few of the most popular training approaches and what sort of data we need for that. So the first one we're looking at here is Natural Language Inference or NLI and NLI requires that we have pairs of sentences that are labeled as either contradictory, neutral which means they're not necessarily related or as entailing or as inferring each other. So you have pairs that entail each other so they are both very similar pairs that are neutral and also pairs that are contradictory. And this is the traditional NLI data. Now using another version of fine-tuning with NLI called a multiple negatives ranking loss you can get by with only entailment pairs so pairs that are related to each other or positive pairs and it can also use contradictory pairs to improve the performance of training as well but you don't need it. So if you have positive pairs of related sentences you can go ahead and actually try training or fine-tuning using NLI with multiple negative ranking loss. If you don't have that fine. Another option is that you have a semantic textual similarity data set or STS and what this is is you have so you have sentence A here, sentence B\n\nQuestion: Which training method should I use for sentence transformers when I only have pairs of related sentences?\nAnswer:"
# then we complete the context-infused query
complete(query_with_contexts)

'You should use Natural Language Inference (NLI) with multiple negative ranking loss.'

And straight away we get a great answer, specifying the use of _multiple-rankings loss_ (also known as _multiple negatives ranking loss_).
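
To run the whole retrieval-augmented flow end to end, the two helpers defined above can simply be chained. A minimal sketch (the rag_answer wrapper is illustrative, not part of the original notebook):

def rag_answer(question):
    # retrieve relevant contexts from Pinecone and build the prompt
    prompt = retrieve(question)
    # generate an answer grounded in those contexts
    return complete(prompt)

rag_answer(
    "Which training method should I use for sentence transformers when "
    "I only have pairs of related sentences?"
)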