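Using Pinecone for Embeddings Search
This notebook walks you through a simple flow: download some data, embed it, then index and search it using a vector database. This is a common requirement for customers who want to store and search our embeddings in a secure environment to support production use cases such as chatbots and topic modelling.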
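What is a Vector Database
A vector database is a database built to store, manage, and search embedding vectors. Embeddings encode unstructured data (text, audio, video, and more) as vectors that machine learning models can consume. Their use has grown rapidly in recent years as AI has become more effective at use cases involving natural language, image recognition, and other unstructured data. Vector databases have emerged as an effective way for enterprises to deliver and scale these use cases.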
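To make the idea concrete, here is a minimal sketch of what a vector database does at its core: store vectors and return the ones nearest to a query. It uses brute-force cosine similarity with NumPy on made-up three-dimensional vectors. The names `cosine_top_k`, `stored`, and `query` are illustrative only, and production vector databases typically rely on approximate indexes rather than scanning every vector.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, stored: np.ndarray, k: int = 2) -> list[tuple[int, float]]:
    """Return (row_index, cosine_similarity) for the k most similar stored vectors."""
    # Normalise so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    scores = s @ q
    # Sort by similarity, highest first, and keep the first k.
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]

# Toy data (hypothetical): three stored vectors and one query.
stored = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0]])
query = np.array([1.0, 0.0, 0.0])

results = cosine_top_k(query, stored, k=2)
print(results)
```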
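Why use a Vector Database
Vector databases let enterprises take many of the embeddings use cases shared in this repository (question answering, chatbots, and recommendation services, for example) and run them in a secure, scalable environment. Many customers use embeddings to solve problems at small scale, but performance and security hold them back from going into production. We see vector databases as a key component in solving that. In this guide we cover the basics of embedding text data, storing it in a vector database, and using it for semantic search.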
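Demo Flow
The demo flow is:
- Setup: Import packages and set any required variables
- Load data: Load a dataset and embed it using OpenAI embeddings
- Pinecone
  - Setup: Here we'll set up the Python client for Pinecone. For more details, see the Pinecone documentation
  - Index Data: We'll create an index with namespaces for titles and content
  - Search Data: We'll test out both namespaces with search queries to confirm they work
Once you've run through this notebook you should have a basic understanding of how to set up and use a vector database, and can move on to more complex use cases that make use of our embeddings.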
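Setup
Import the required libraries and set the embedding model that we'd like to use.
# We'll need to install the Pinecone client (this notebook was run with pinecone-client 2.2.2)
!pip install pinecone-client
# Install wget to pull the zip file
!pip install wget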
import openai
from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval
# Pinecone's client library for Python
import pinecone
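# Set to our new embeddings model; this can be changed to the embedding model of your choice.
# Queries must be embedded with the same model as the stored vectors for similarity scores to be meaningful.
EMBEDDING_MODEL = "text-embedding-3-small"
# Ignore unclosed SSL socket warnings - optional, in case you run into these errors.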
import warnings
warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
Load data
In this section we'll load embedded data that was prepared in advance.
embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'
# The file is ~700 MB so this will take some time
wget.download(embeddings_url)
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../data")
article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')
article_df.head()
| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |
# Read vectors from strings back into lists
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)
# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.info(show_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 25000 non-null int64
1 url 25000 non-null object
2 title 25000 non-null object
3 text 25000 non-null object
4 title_vector 25000 non-null object
5 content_vector 25000 non-null object
6 vector_id 25000 non-null object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB
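Before indexing, it can be worth confirming that every parsed vector has the same length, since the index is created with a single fixed dimension. The sketch below runs that check on a tiny made-up DataFrame. `toy_df` and `vector_dimensions` are hypothetical names, and the three-element vectors stand in for the 1536-dimensional ones in the dataset.

```python
from ast import literal_eval
import pandas as pd

def vector_dimensions(df: pd.DataFrame, column: str) -> set[int]:
    """Return the set of distinct vector lengths found in a column of lists."""
    return set(df[column].map(len))

# Toy stand-in for the CSV: vectors arrive as strings, as they do after read_csv.
toy_df = pd.DataFrame({
    "vector_id": ["0", "1"],
    "content_vector": ["[0.1, 0.2, 0.3]", "[0.4, 0.5, 0.6]"],
})
toy_df["content_vector"] = toy_df["content_vector"].apply(literal_eval)

dims = vector_dimensions(toy_df, "content_vector")
print(dims)
```

A result containing more than one length would indicate malformed rows that should be fixed before upserting.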
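Pinecone
Next we'll look at Pinecone, a managed vector database that offers a cloud-native option.
Before you proceed, go to Pinecone, sign up, and save your API key as an environment variable named PINECONE_API_KEY.
In this section we will:
- Create an index with multiple namespaces for article titles and content
- Store our data in the index, with separate searchable namespaces for article titles and content
- Fire some similarity search queries to verify that our setup is working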
api_key = os.getenv("PINECONE_API_KEY")
pinecone.init(api_key=api_key)
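`os.getenv` returns `None` when the variable is missing, which may only surface later as a less obvious error. A small guard, sketched here with a hypothetical helper `require_env`, fails early with a clearer message. The variable name `EXAMPLE_API_KEY` and its value are placeholders for the demonstration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a placeholder variable (not a real key).
os.environ["EXAMPLE_API_KEY"] = "placeholder-key"
api_key = require_env("EXAMPLE_API_KEY")
```

In the notebook, the same helper could be called with `"PINECONE_API_KEY"` in place of the placeholder name.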
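Create Index
First we need to create an index, which we'll call wikipedia-articles. Once we have an index, we can create multiple namespaces, which lets a single index serve several use cases. For more details, consult the Pinecone documentation.
If you want to batch insert to your index in parallel to increase insertion speed, the Pinecone documentation has a guide on batch inserts in parallel.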
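As a rough illustration of the parallel pattern, the sketch below splits records into fixed-size chunks and submits each chunk to a thread pool. The `upload_batch` function is a hypothetical stand-in for whatever call sends a batch to the index (here it only counts records), and the chunk size and worker count are arbitrary. Consult the Pinecone documentation for the client's own parallel upsert options.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, List[float]]

def chunks(records: Iterable[Record], batch_size: int = 100) -> Iterator[List[Record]]:
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            return
        yield chunk

def upload_batch(batch: List[Record]) -> int:
    """Hypothetical uploader: a real version would send the batch to the index."""
    return len(batch)

# 250 made-up records with two-dimensional vectors.
records = [(str(i), [float(i), 0.0]) for i in range(250)]

# pool.map keeps the results in the same order as the submitted chunks.
with ThreadPoolExecutor(max_workers=4) as pool:
    uploaded = list(pool.map(upload_batch, chunks(records, batch_size=100)))

print(uploaded)
```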
import math

# Simple batch generator that splits an input DataFrame into chunks
class BatchGenerator:

    def __init__(self, batch_size: int = 10) -> None:
        self.batch_size = batch_size

    # Split the input DataFrame into chunks
    def to_batches(self, df: pd.DataFrame) -> Iterator[pd.DataFrame]:
        splits = self.splits_num(df.shape[0])
        if splits <= 1:
            yield df
        else:
            for chunk in np.array_split(df, splits):
                yield chunk

    # Determine how many chunks the DataFrame is split into
    # (rounding up keeps every chunk at or below batch_size)
    def splits_num(self, elements: int) -> int:
        return math.ceil(elements / self.batch_size)

    __call__ = to_batches

df_batcher = BatchGenerator(300)
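The number of splits determines the actual batch sizes. The following check works on row positions only, not the real DataFrame, and compares rounding to the nearest integer with rounding up for 25,000 rows and a target batch size of 300. It is a sketch to illustrate the arithmetic, and `batch_sizes` is a hypothetical helper.

```python
import math
import numpy as np

def batch_sizes(n_rows: int, n_splits: int) -> list[int]:
    """Sizes of the chunks np.array_split produces for n_rows items."""
    return [len(chunk) for chunk in np.array_split(np.arange(n_rows), n_splits)]

n_rows, target = 25_000, 300

rounded = batch_sizes(n_rows, round(n_rows / target))      # 83 splits
ceiled = batch_sizes(n_rows, math.ceil(n_rows / target))   # 84 splits

print(len(rounded), max(rounded))
print(len(ceiled), max(ceiled))
```

With rounding to nearest, the largest batch holds 302 rows, slightly above the target. Rounding up gives 84 batches of at most 298 rows.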
# Pick a name for the new index
index_name = 'wikipedia-articles'
# Check whether an index with the same name already exists - if so, delete it
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)
# Create the new index
pinecone.create_index(name=index_name, dimension=len(article_df['content_vector'][0]))
index = pinecone.Index(index_name=index_name)
# Confirm our index was created
pinecone.list_indexes()
['podcasts', 'wikipedia-articles']
# Upsert content vectors in the content namespace - this can take a few minutes
print("Uploading vectors to content namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')
Uploading vectors to content namespace..
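Each `upsert` call above receives an iterable of `(id, vector)` pairs built with `zip`. The toy example below shows the shape of those pairs. The DataFrame contents are made up and only the column names match `article_df`.

```python
import pandas as pd

# Made-up batch with the same column names as article_df.
batch_df = pd.DataFrame({
    "vector_id": ["0", "1"],
    "content_vector": [[0.1, 0.2], [0.3, 0.4]],
})

# zip pairs each ID with its vector, row by row.
pairs = list(zip(batch_df.vector_id, batch_df.content_vector))
print(pairs)
```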
# Upsert title vectors in the title namespace - this can also take a few minutes
print("Uploading vectors to title namespace..")
for batch_df in df_batcher(article_df):
    index.upsert(vectors=zip(batch_df.vector_id, batch_df.title_vector), namespace='title')
Uploading vectors to title namespace..
# Check the index size for each namespace to confirm that all documents have loaded
index.describe_index_stats()
{'dimension': 1536,
'index_fullness': 0.1,
'namespaces': {'content': {'vector_count': 25000},
'title': {'vector_count': 25000}},
'total_vector_count': 50000}
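The same check can be automated. Assuming the statistics are available as a plain dictionary shaped like the output above, a hypothetical helper `missing_vectors` reports any namespace whose count falls short of the expected number of rows.

```python
def missing_vectors(stats: dict, expected: int) -> dict:
    """Map each under-filled namespace to the number of vectors still missing."""
    return {
        name: expected - info["vector_count"]
        for name, info in stats["namespaces"].items()
        if info["vector_count"] < expected
    }

# Dictionary copied from the output shown above.
stats = {
    "dimension": 1536,
    "index_fullness": 0.1,
    "namespaces": {"content": {"vector_count": 25000},
                   "title": {"vector_count": 25000}},
    "total_vector_count": 50000,
}

print(missing_vectors(stats, expected=25000))
```

An empty dictionary means every namespace holds the expected number of vectors.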
Search data
Now we'll enter some dummy searches and check that we get sensible results back.
# First we'll create dictionaries mapping vector IDs to their outputs so we can retrieve the text for our search results
titles_mapped = dict(zip(article_df.vector_id, article_df.title))
content_mapped = dict(zip(article_df.vector_id, article_df.text))
def query_article(query, namespace, top_k=5):
    '''Queries the given namespace of the index with an embedded query and prints the results.'''
    # Create a vector embedding of the query text
    # (openai.Embedding.create is the pre-1.0 openai SDK interface)
    embedded_query = openai.Embedding.create(
        input=query,
        model=EMBEDDING_MODEL,
    )["data"][0]['embedding']
    # Query the namespace passed as a parameter using the embedded query
    query_result = index.query(embedded_query,
                               namespace=namespace,
                               top_k=top_k)
    # Print the query results
    print(f'\nMost similar results to {query} in "{namespace}" namespace:\n')
    if not query_result.matches:
        print('no query result')
    matches = query_result.matches
    ids = [res.id for res in matches]
    scores = [res.score for res in matches]
    # Look up the title and text for each matched vector ID
    df = pd.DataFrame({'id': ids,
                       'score': scores,
                       'title': [titles_mapped[_id] for _id in ids],
                       'content': [content_mapped[_id] for _id in ids],
                       })
    for _, row in df.iterrows():
        print(f"{row['title']} (score = {row['score']})")
    print('\n')
    return df
query_output = query_article('modern art in Europe','title')
Most similar results to modern art in Europe in "title" namespace:
Museum of Modern Art (score = 0.875177085)
Western Europe (score = 0.867441177)
Renaissance art (score = 0.864156306)
Pop art (score = 0.860346854)
Northern Europe (score = 0.854658186)
content_query_output = query_article("Famous battles in Scottish history",'content')
Most similar results to Famous battles in Scottish history in "content" namespace:
Battle of Bannockburn (score = 0.869336188)
Wars of Scottish Independence (score = 0.861470938)
1651 (score = 0.852588475)
First War of Scottish Independence (score = 0.84962213)
Robert I of Scotland (score = 0.846214116)
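Because `query_article` returns a DataFrame, results can be post-processed with ordinary pandas operations. The sketch below filters by a similarity threshold, using the titles and scores from the "title" namespace output above. The threshold of 0.86 is an arbitrary choice for illustration.

```python
import pandas as pd

# Titles and scores copied from the "title" namespace output above.
results = pd.DataFrame({
    "title": ["Museum of Modern Art", "Western Europe", "Renaissance art",
              "Pop art", "Northern Europe"],
    "score": [0.875177085, 0.867441177, 0.864156306, 0.860346854, 0.854658186],
})

threshold = 0.86  # arbitrary cut-off, chosen only for this example
strong_matches = results[results["score"] >= threshold]
print(strong_matches["title"].tolist())
```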