
Using Tair as a vector database for OpenAI embeddings


This notebook takes you step by step through using Tair as a vector database for OpenAI embeddings.

This notebook presents an end-to-end process of:

  1. Using precomputed embeddings created by the OpenAI API.
  2. Storing the embeddings in a cloud instance of Tair.
  3. Converting a raw text query to an embedding with the OpenAI API.
  4. Using Tair to perform a nearest-neighbor search in the created collection.

What is Tair

Tair is a cloud-native in-memory database service developed by Alibaba Cloud. Tair is compatible with open-source Redis and provides a variety of data models and enterprise-class capabilities to support your real-time online scenarios. Tair also offers persistent-memory-optimized instances based on the new non-volatile memory (NVM) storage medium. These instances can reduce costs by about 30%, ensure data persistence, and deliver almost the same performance as in-memory databases. Tair has been widely used in areas such as government affairs, finance, manufacturing, healthcare, and pan-Internet to meet their high-speed query and computing requirements.

TairVector is an in-house data structure that provides high-performance, real-time storage and retrieval of vectors. TairVector provides two indexing algorithms: Hierarchical Navigable Small World (HNSW) and Flat Search. Additionally, TairVector supports a number of distance functions, such as Euclidean distance, inner product, and Jaccard distance. Compared with traditional vector retrieval services, TairVector has the following advantages:

  - Stores all data in memory and supports real-time index updates to reduce the latency of read and write operations.
  - Uses optimized in-memory data structures to make better use of storage capacity.
  - Functions as an out-of-the-box data structure in a simple and efficient architecture, without complex modules or dependencies.
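To make these options concrete, here is a minimal sketch of how the index algorithm and distance function map onto the tair Python client. The index names, dimension, and connection URL are hypothetical, and the "IP" and "FLAT" values are assumptions based on the TairVector feature list above rather than anything used later in this notebook, so check the TairVector documentation before relying on them:

from tair import Tair as TairClient

client = TairClient.from_url("redis://localhost:6379/0")  # hypothetical local instance

# HNSW index with Euclidean (L2) distance: approximate search, fast on large collections
client.tvs_create_index(name="demo_hnsw", dim=4, distance_type="L2",
                        index_type="HNSW", data_type="FLOAT32")

# Flat-search index with inner-product distance: exact but brute-force, so slower at scale
client.tvs_create_index(name="demo_flat", dim=4, distance_type="IP",
                        index_type="FLAT", data_type="FLOAT32")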

Deployment options

Prerequisites

For the purposes of this exercise, we need to prepare a couple of things:

  1. A Tair cloud server instance.
  2. The 'tair' library to interact with the Tair database.
  3. An OpenAI API key.

Install requirements

This notebook obviously requires the openai and tair packages, but we will also use a few additional libraries. The following command installs them all:

! pip install openai redis tair pandas wget

Looking in indexes: http://sg.mirrors.cloud.aliyuncs.com/pypi/simple/
Requirement already satisfied: openai in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (0.28.0)
Requirement already satisfied: redis in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (5.0.0)
Requirement already satisfied: tair in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (1.3.6)
Requirement already satisfied: pandas in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (2.1.0)
Requirement already satisfied: wget in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (3.2)
Requirement already satisfied: requests>=2.20 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (2.31.0)
Requirement already satisfied: tqdm in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (4.66.1)
Requirement already satisfied: aiohttp in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from openai) (3.8.5)
Requirement already satisfied: async-timeout>=4.0.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from redis) (4.0.3)
Requirement already satisfied: numpy>=1.22.4 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (1.25.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3.post1)
Requirement already satisfied: tzdata>=2022.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from pandas) (2023.3)
Requirement already satisfied: six>=1.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: charset-normalizer<4,>=2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from requests>=2.20->openai) (2023.7.22)
Requirement already satisfied: attrs>=17.3.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (22.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.4.0)
Requirement already satisfied: aiosignal>=1.1.2 in /root/anaconda3/envs/notebook/lib/python3.10/site-packages (from aiohttp->openai) (1.3.1)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Prepare your OpenAI API key

The OpenAI API key is used for vectorization of the documents and queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you get your key, add it via getpass.

import getpass
import openai

openai.api_key = getpass.getpass("Input your OpenAI API key:")

Input your OpenAI API key:········

Connect to Tair

First, add your Tair connection URL to your environment variables.

Connecting to a running Tair server instance is straightforward with the official Python library.

# URL format: redis://[[username]:[password]]@localhost:6379/0
TAIR_URL = getpass.getpass("Input your tair url:")

Input your tair url:········
from tair import Tair as TairClient

# Connect to Tair from the URL and create a client

url = TAIR_URL
client = TairClient.from_url(url)

We can test the connection with the ping command:

client.ping()

True
import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is about 700 MB, so it will take some time to download.
wget.download(embeddings_url)

100% [......................................................................] 698933052 / 698933052
'vector_database_wikipedia_articles_embedded (1).zip'

The downloaded file then has to be extracted:

import zipfile
import os
import re
import tempfile

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# Check if the CSV file exists
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")


The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

Create index

Tair stores data in indexes, where each object is described by one key. Each key contains a vector and multiple attribute keys.

We will start by creating two indexes, one for title_vector and the other for content_vector, and then fill them with our precomputed embeddings.

# Set index parameters
index = "openai_test"
embedding_dim = 1536
distance_type = "L2"
index_type = "HNSW"
data_type = "FLOAT32"

# Create two indexes, one for title_vector and one for content_vector; skip creation if they already exist.
index_names = [index + "_title_vector", index+"_content_vector"]
for index_name in index_names:
    index_connection = client.tvs_get_index(index_name)
    if index_connection is not None:
        print("Index already exists")
    else:
        client.tvs_create_index(name=index_name, dim=embedding_dim, distance_type=distance_type,
                                index_type=index_type, data_type=data_type)

Index already exists
Index already exists
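As a quick illustration of the key layout described above, a single record pairs one vector with arbitrary attribute key-value pairs. The snippet below is a hedged sketch: the key and attribute values are made up, and tvs_hgetall and tvs_del are assumed to be the read-back and delete counterparts of tvs_hset in the tair client.

# Write one record: a vector plus attribute keys under a single key
client.tvs_hset(index=index_names[0], key="demo_key", vector=[0.1] * embedding_dim,
                is_binary=False, **{"url": "https://example.com", "title": "Demo"})

# Read back everything stored under that key
print(client.tvs_hgetall(index_names[0], "demo_key"))

# Remove the demo record so it does not mix with the real data
client.tvs_del(index_names[0], "demo_key")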

Load data

In this section, we will load the data prepared prior to this session, so that you don't have to recompute the embeddings of the Wikipedia articles with your own credits.

import pandas as pd
from ast import literal_eval
# Path to the local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'
article_df = pd.read_csv(csv_file_path)

# Read vectors from strings back into lists
article_df['title_vector'] = article_df.title_vector.apply(literal_eval).values
article_df['content_vector'] = article_df.content_vector.apply(literal_eval).values

# Add/update data in the indexes
for i in range(len(article_df)):
    # Add the title vector and its attributes to the title index
    client.tvs_hset(index=index_names[0], key=article_df.id[i].item(), vector=article_df.title_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})
    # Add the content vector and its attributes to the content index
    client.tvs_hset(index=index_names[1], key=article_df.id[i].item(), vector=article_df.content_vector[i], is_binary=False,
                    **{"url": article_df.url[i], "title": article_df.title[i], "text": article_df.text[i]})

# Check the data count to make sure all the points have been stored
for index_name in index_names:
    stats = client.tvs_get_index(index_name)
    count = int(stats["current_record_count"]) - int(stats["delete_record_count"])
    print(f"Count in {index_name}:{count}")


Count in openai_test_title_vector:25000
Count in openai_test_content_vector:25000
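The loading loop above relies on upsert semantics: calling tvs_hset again with an existing key overwrites that record, which is why re-running the cell does not inflate the counts. Individual records and fields can also be removed. The following is a hedged sketch, where the key is hypothetical and tvs_hdel/tvs_del are assumptions about the client's per-key API:

# Upsert: writing to an existing key replaces its vector and attributes
client.tvs_hset(index=index_names[0], key="scratch_key", vector=[0.0] * embedding_dim,
                is_binary=False, **{"title": "old title"})
client.tvs_hset(index=index_names[0], key="scratch_key", vector=[1.0] * embedding_dim,
                is_binary=False, **{"title": "new title"})

# Remove a single attribute field, then the whole record
client.tvs_hdel(index_names[0], "scratch_key", "title")
client.tvs_del(index_names[0], "scratch_key")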

Search data

Once the data is put into Tair, we will start querying the collection for the closest vectors. We can provide an additional parameter vector_name to switch from title-based to content-based search. Since the precomputed embeddings were created with the text-embedding-3-small OpenAI model, we also have to use it during search.

import openai
import numpy as np

def query_tair(client, query, vector_name="title_vector", top_k=5):

    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query,
        model="text-embedding-3-small",
    )["data"][0]['embedding']
    embedded_query = np.array(embedded_query)

    # Search the index for the top k approximate nearest neighbors of the vector
    query_result = client.tvs_knnsearch(index=index+"_"+vector_name, k=top_k, vector=embedded_query)

    return query_result

query_result = query_tair(client=client, query="modern art in Europe", vector_name="title_vector")
for i in range(len(query_result)):
    title = client.tvs_hmget(index+"_"+"content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})")

1. Museum of Modern Art (Distance: 0.125)
2. Western Europe (Distance: 0.133)
3. Renaissance art (Distance: 0.136)
4. Pop art (Distance: 0.14)
5. Northern Europe (Distance: 0.145)
# This time we'll query using content vector
query_result = query_tair(client=client, query="Famous battles in Scottish history", vector_name="content_vector")
for i in range(len(query_result)):
    title = client.tvs_hmget(index+"_"+"content_vector", query_result[i][0].decode('utf-8'), "title")
    print(f"{i + 1}. {title[0].decode('utf-8')} (Distance: {round(query_result[i][1],3)})")

1. Battle of Bannockburn (Distance: 0.131)
2. Wars of Scottish Independence (Distance: 0.139)
3. 1651 (Distance: 0.147)
4. First War of Scottish Independence (Distance: 0.15)
5. Robert I of Scotland (Distance: 0.154)
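When you are done experimenting, you may want to drop the test indexes to free memory on the instance. This is a minimal sketch, assuming the tair client exposes tvs_del_index for removing an index together with everything stored in it:

# Drop both test indexes and all the records they contain
for index_name in index_names:
    client.tvs_del_index(index_name)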