
Running Hybrid VSS Queries with Redis and OpenAI


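This notebook introduces using Redis as a vector database with OpenAI embeddings and running hybrid queries that combine VSS (vector similarity search) with lexical search through the Redis Query and Search capability. Redis is a scalable, real-time database that can be used as a vector database when the RediSearch module is loaded. The Redis Query and Search capability lets you index and search vectors in Redis. This notebook shows how to use it to index and search vectors that were created with the OpenAI API and stored in Redis.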

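Hybrid queries combine vector similarity with the traditional Redis Query and Search filtering capabilities on GEO, NUMERIC, TAG or TEXT data, which simplifies application code. A common example of a hybrid query in an e-commerce use case is finding items that are visually similar to a given query image, limited to items available in a GEO location and within a price range.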
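
To make the idea concrete, here is a minimal, Redis-free sketch of what a hybrid query does conceptually: apply a conventional filter (a price range here) and rank the remaining items by vector distance. The catalogue, the 2-D vectors and the `hybrid_search` helper are hypothetical illustrations, not part of the notebook or of Redis.

```python
from math import dist

# Hypothetical in-memory catalogue: each item has a price and a toy 2-D embedding.
ITEMS = [
    {"name": "red sneaker", "price": 40.0, "vector": (0.9, 0.1)},
    {"name": "red boot", "price": 120.0, "vector": (0.8, 0.2)},
    {"name": "blue sandal", "price": 35.0, "vector": (0.1, 0.9)},
]


def hybrid_search(query_vector, min_price, max_price, k=2):
    """Keep items inside the price range, then rank them by Euclidean distance."""
    candidates = [
        item for item in ITEMS if min_price <= item["price"] <= max_price
    ]
    candidates.sort(key=lambda item: dist(item["vector"], query_vector))
    return [item["name"] for item in candidates[:k]]


# "red boot" is close to the query vector but falls outside the price range.
print(hybrid_search((1.0, 0.0), min_price=0, max_price=50))
```

In Redis the filtering and the nearest-neighbour ranking happen inside a single query, so the application does not have to combine two result sets itself.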

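Prerequisites

Before we start this project, we need to set up the following:

  1. Start a Redis database with RediSearch (Redis Stack)
  2. Install the required libraries (redis, pandas, openai)
  3. Prepare an OpenAI API key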

===========================================================

Start Redis

To keep this example simple, we will use the Redis Stack docker container, which can be started as follows:

$ docker-compose up -d

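This also includes the RedisInsight GUI for managing your Redis database, which you can view at http://localhost:8001 once the docker container is running.

You're all set up and ready to go! Next, we import and create the client for communicating with the Redis database we just created.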

Install requirements

Redis-py is the Python client for communicating with Redis. We will use it to communicate with our Redis Stack database.

! pip install redis pandas openai


Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: redis in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (4.5.4)
Requirement already satisfied: pandas in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (2.0.1)
Requirement already satisfied: openai in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (0.27.6)
Requirement already satisfied: async-timeout>=4.0.2 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from redis) (4.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from pandas) (2023.3)
Requirement already satisfied: tzdata>=2022.1 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from pandas) (2023.3)
Requirement already satisfied: numpy>=1.20.3 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from pandas) (1.23.4)
Requirement already satisfied: requests>=2.20 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from openai) (2.28.1)
Requirement already satisfied: tqdm in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from openai) (4.64.1)
Requirement already satisfied: aiohttp in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from openai) (3.8.4)
Requirement already satisfied: six>=1.5 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: charset-normalizer<3,>=2 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from requests>=2.20->openai) (2.1.1)
Requirement already satisfied: idna<4,>=2.5 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from requests>=2.20->openai) (3.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from requests>=2.20->openai) (1.26.12)
Requirement already satisfied: certifi>=2017.4.17 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from requests>=2.20->openai) (2022.9.24)
Requirement already satisfied: attrs>=17.3.0 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from aiohttp->openai) (23.1.0)
Requirement already satisfied: multidict<7.0,>=4.5 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from aiohttp->openai) (6.0.4)
Requirement already satisfied: yarl<2.0,>=1.0 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from aiohttp->openai) (1.9.2)
Requirement already satisfied: frozenlist>=1.1.1 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from aiohttp->openai) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /Users/michael.yuan/Library/Python/3.9/lib/python/site-packages (from aiohttp->openai) (1.3.1)

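===========================================================

Prepare your OpenAI API key

The OpenAI API key is used to vectorize the query data.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you have your key, add it to your environment variables as OPENAI_API_KEY, for example with the following code: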

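# Verify that your OpenAI API key is correctly set as an environment variable.
# Note: if you run this notebook locally, you need to reload your terminal and the notebook for the environment variable to take effect.
import os
import openai

# Replace the placeholder with your own key (or remove this line if the variable is already set).
os.environ["OPENAI_API_KEY"] = '<YOUR_OPENAI_API_KEY>'

if os.getenv("OPENAI_API_KEY") is not None:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")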


OPENAI_API_KEY is ready

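Load data

In this section we load and clean an e-commerce dataset. We then use OpenAI to generate embeddings, use this data to create an index in Redis, and search for similar vectors.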

import pandas as pd
import numpy as np
from typing import List

from utils.embeddings_utils import (
    get_embeddings,
    distances_from_embeddings,
    tsne_components_from_embeddings,
    chart_from_components,
    indices_of_nearest_neighbors_from_distances,
)

EMBEDDING_MODEL = "text-embedding-3-small"

# Load the data, clean up the data types and drop rows that contain null values
df = pd.read_csv("../../data/styles_2k.csv", on_bad_lines='skip')
df.dropna(inplace=True)
df["year"] = df["year"].astype(int)
df.info()

# Print the first rows of the dataframe
n_examples = 5
df.head(n_examples)


<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1978 non-null int64
1 gender 1978 non-null object
2 masterCategory 1978 non-null object
3 subCategory 1978 non-null object
4 articleType 1978 non-null object
5 baseColour 1978 non-null object
6 season 1978 non-null object
7 year 1978 non-null int64
8 usage 1978 non-null object
9 productDisplayName 1978 non-null object
dtypes: int64(2), object(8)
memory usage: 170.0+ KB
   id     gender  masterCategory  subCategory  articleType  baseColour  season  year  usage   productDisplayName
0  15970  Men     Apparel         Topwear      Shirts       Navy Blue   Fall    2011  Casual  Turtle Check Men Navy Blue Shirt
1  39386  Men     Apparel         Bottomwear   Jeans        Blue        Summer  2012  Casual  Peter England Men Party Blue Jeans
2  59263  Women   Accessories     Watches      Watches      Silver      Winter  2016  Casual  Titan Women Silver Watch
3  21379  Men     Apparel         Bottomwear   Track Pants  Black       Fall    2011  Casual  Manchester United Men Solid Black Track Pants
4  53759  Men     Apparel         Topwear      Tshirts      Grey        Summer  2012  Casual  Puma Men Grey T-shirt

df["product_text"] = df.apply(lambda row: f"name {row['productDisplayName']} category {row['masterCategory']} subcategory {row['subCategory']} color {row['baseColour']} gender {row['gender']}".lower(), axis=1)
df.rename({"id":"product_id"}, inplace=True, axis=1)

df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 product_id 1978 non-null int64
1 gender 1978 non-null object
2 masterCategory 1978 non-null object
3 subCategory 1978 non-null object
4 articleType 1978 non-null object
5 baseColour 1978 non-null object
6 season 1978 non-null object
7 year 1978 non-null int64
8 usage 1978 non-null object
9 productDisplayName 1978 non-null object
10 product_text 1978 non-null object
dtypes: int64(2), object(9)
memory usage: 185.4+ KB

# Look at one of the texts we will use to create the semantic embeddings
df["product_text"][0]


'name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men'
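
The same text construction can be sketched without pandas, which makes it easier to see exactly which columns end up in the embedded string. The `build_product_text` helper below is a hypothetical name introduced for illustration; the row values are copied from the first row of the dataframe shown above.

```python
def build_product_text(row: dict) -> str:
    """Concatenate the descriptive columns into one lower-cased string."""
    return (
        f"name {row['productDisplayName']} "
        f"category {row['masterCategory']} "
        f"subcategory {row['subCategory']} "
        f"color {row['baseColour']} "
        f"gender {row['gender']}"
    ).lower()


# First row of the dataset, as printed by df.head().
example_row = {
    "productDisplayName": "Turtle Check Men Navy Blue Shirt",
    "masterCategory": "Apparel",
    "subCategory": "Topwear",
    "baseColour": "Navy Blue",
    "gender": "Men",
}

print(build_product_text(example_row))
# name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men
```

Note that columns such as season, year and articleType are not part of the embedded text; they are only available to the query through the index fields defined later.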

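Connect to Redis

Now that our Redis database is running, we can connect to it using the Redis-py client. We will use the default host and port for the Redis database, which is localhost:6379.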

import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TagField,
    NumericField,
    TextField,
    VectorField
)

REDIS_HOST = "localhost"
REDIS_PORT = 6379
REDIS_PASSWORD = "" # default for passwordless Redis

# Connect to Redis
redis_client = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    password=REDIS_PASSWORD
)
redis_client.ping()


True

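Creating a search index in Redis

The cells below show how to specify and create a search index in Redis. We will:

  1. Set some constants that define our index, such as the distance metric and the index name
  2. Define the index schema with RediSearch fields
  3. Create the index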

# Constants
INDEX_NAME = "product_embeddings"         # name of the search index
PREFIX = "doc"                            # prefix for the document keys
DISTANCE_METRIC = "L2"                    # distance metric for the vectors (e.g. COSINE, IP, L2)
NUMBER_OF_VECTORS = len(df)


# Define RediSearch fields for each of the columns in the dataset
name = TextField(name="productDisplayName")
category = TagField(name="masterCategory")
articleType = TagField(name="articleType")
gender = TagField(name="gender")
season = TagField(name="season")
year = NumericField(name="year")
text_embedding = VectorField("product_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": NUMBER_OF_VECTORS,
    }
)
fields = [name, category, articleType, gender, season, year, text_embedding]


# Check if the index exists
try:
    redis_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # The index does not exist yet, so create the RediSearch index
    redis_client.ft(INDEX_NAME).create_index(
        fields = fields,
        definition = IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH)
    )
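
For readers who want to see what the schema above amounts to at the command level, the sketch below assembles the arguments of a roughly equivalent raw `FT.CREATE` command. The `ft_create_args` helper is hypothetical and only builds a list; it does not talk to Redis, and the exact command that redis-py sends may differ in detail.

```python
def ft_create_args(index_name, prefix, dim, distance_metric, initial_cap):
    """Build the argument list of an FT.CREATE command for the product index."""
    # Vector attributes are passed as name/value pairs, preceded by their count.
    vector_params = [
        "TYPE", "FLOAT32",
        "DIM", dim,
        "DISTANCE_METRIC", distance_metric,
        "INITIAL_CAP", initial_cap,
    ]
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", 1, prefix,
        "SCHEMA",
        "productDisplayName", "TEXT",
        "masterCategory", "TAG",
        "articleType", "TAG",
        "gender", "TAG",
        "season", "TAG",
        "year", "NUMERIC",
        "product_vector", "VECTOR", "FLAT", len(vector_params), *vector_params,
    ]


args = ft_create_args("product_embeddings", "doc", 1536, "L2", 1978)
print(" ".join(str(arg) for arg in args))
```

A list like this could be passed to `redis_client.execute_command(*args)`, although the `create_index` call shown above is the more convenient route.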


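Generate OpenAI embeddings and load documents into the index

Now that we have a search index, we can load documents into it. We will use the dataframe containing the styles dataset that we loaded earlier. In Redis, documents can be stored with either the HASH or the JSON data type (JSON requires RedisJSON in addition to RediSearch). In this example we will use the HASH data type. The cells below show how to get OpenAI embeddings for the different products and load the documents into the index.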
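
Because a HASH field holds a string of bytes, each embedding has to be serialized before it is stored. The notebook does this with NumPy; the sketch below shows the same idea with only the standard library, assuming little-endian 32-bit floats (which is what `np.float32(...).tobytes()` produces on common little-endian machines). The helper names are hypothetical.

```python
import struct


def floats_to_bytes(vector):
    """Pack a list of floats as consecutive little-endian 32-bit floats."""
    return struct.pack(f"<{len(vector)}f", *vector)


def bytes_to_floats(blob):
    """Inverse of floats_to_bytes: every 4 bytes become one float."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))


blob = floats_to_bytes([0.5, -1.0, 2.25])
print(len(blob))              # 12 (3 values x 4 bytes)
print(bytes_to_floats(blob))  # [0.5, -1.0, 2.25]
```

With the index defined for FLOAT32 and DIM 1536, each stored vector would therefore occupy 1536 x 4 = 6144 bytes.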

# Use OpenAI batch requests to speed up embedding generation
def embeddings_batch_request(documents: pd.DataFrame):
    records = documents.to_dict("records")
    print("Records to process: ", len(records))
    product_vectors = []
    docs = []
    batchsize = 1000

    for idx, doc in enumerate(records, start=1):
        # Collect the product text for the current batch
        docs.append(doc["product_text"])
        if idx % batchsize == 0:
            product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
            docs.clear()
            print("Vectors processed ", len(product_vectors), end='\r')
    # Embed the remaining texts; skip the request if the last batch was full
    if docs:
        product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
    print("Vectors processed ", len(product_vectors), end='\r')
    return product_vectors


def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    product_vectors = embeddings_batch_request(documents)
    records = documents.to_dict("records")
    batchsize = 500

    # Use a Redis pipeline to batch calls and save round-trip network communication
    pipe = client.pipeline()
    for idx, doc in enumerate(records, start=1):
        key = f"{prefix}:{str(doc['product_id'])}"

        # Create the byte vector
        text_embedding = np.array((product_vectors[idx-1]), dtype=np.float32).tobytes()

        # Store the byte vector (instead of a list of floats) in the document
        doc["product_vector"] = text_embedding

        pipe.hset(key, mapping = doc)
        if idx % batchsize == 0:
            pipe.execute()
    pipe.execute()


%%time
index_documents(redis_client, PREFIX, df)
print(f"Loaded {redis_client.info()['db0']['keys']} documents in Redis search index with name: {INDEX_NAME}")


Records to process:  1978
Loaded 1978 documents in Redis search index with name: product_embeddings
CPU times: user 619 ms, sys: 78.9 ms, total: 698 ms
Wall time: 3.34 s
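
The loading code sends texts to the embeddings endpoint in batches and writes to Redis in batches. The batching pattern itself can be sketched independently of both services. Below, `chunked` and `embed_in_batches` are hypothetical helpers, and `fake_embed` is a stand-in for a real embedding call, used only so that the sketch runs offline.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` items; never yields an empty batch."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(texts: Sequence[str], embed_fn: Callable, batch_size: int) -> List:
    """Call embed_fn once per batch and concatenate the results in input order."""
    vectors = []
    for batch in chunked(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


def fake_embed(batch):
    # Stand-in for an embeddings API call: one 1-D "vector" per input text.
    return [[float(len(text))] for text in batch]


print(list(chunked([1, 2, 3, 4, 5], 2)))
print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

Slicing by range means that an input whose length is an exact multiple of the batch size does not produce a trailing empty request.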

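Simple vector search queries with OpenAI query embeddings

Now that we have a search index and documents loaded into it, we can run search queries. Below we provide a function that runs a search query and returns the results. Using this function, we run a few queries that show how you can use Redis as a vector database.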

def search_redis(
    redis_client: redis.Redis,
    user_query: str,
    index_name: str = "product_embeddings",
    vector_field: str = "product_vector",
    return_fields: list = ["productDisplayName", "masterCategory", "gender", "season", "year", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:

    # Create an embedding vector from the user query with OpenAI
    embedded_query = openai.Embedding.create(input=user_query,
                                            model="text-embedding-3-small",
                                            )["data"][0]['embedding']

    # Prepare the query: an optional filter expression followed by the KNN clause
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # Run the vector search
    results = redis_client.ft(index_name).search(query, params_dict)
    if print_results:
        for i, product in enumerate(results.docs):
            # vector_score is a distance (lower is closer); print 1 - distance as a rough score
            score = 1 - float(product.vector_score)
            print(f"{i}. {product.productDisplayName} (Score: {round(score ,3) })")
    return results.docs


# Run a simple vector search in Redis
results = search_redis(redis_client, 'man blue jeans', k=10)


0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
4. Lee Men Blue Chicago Fit Jeans (Score: 0.761)
5. French Connection Men Blue Jeans (Score: 0.74)
6. Locomotive Men Washed Blue Jeans (Score: 0.739)
7. Locomotive Men Washed Blue Jeans (Score: 0.739)
8. Do U Speak Green Men Blue Shorts (Score: 0.736)
9. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
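
The score printed above is 1 minus the distance returned by Redis, and the index was created with the L2 metric, so it should not be read as a cosine similarity. The sketch below shows how the two relate under two assumptions that the notebook does not state: the embeddings are unit length, and the reported distance is the squared Euclidean distance. Under those assumptions, cosine similarity equals 1 - d / 2. All function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_from_squared_l2(distance):
    """For unit vectors: |a - b|^2 = 2 - 2 * cos, hence cos = 1 - d / 2."""
    return 1.0 - distance / 2.0


a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])
d = squared_l2(a, b)
print(round(cosine_from_squared_l2(d), 6) == round(cosine_similarity(a, b), 6))  # True
```

If a cosine-style score is what you want to display, creating the index with the COSINE metric is the more direct option.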

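Hybrid queries with Redis

The previous examples showed how to run vector search queries with RediSearch. In this section, we show how to combine vector search with other RediSearch fields for hybrid search. In the example below, we combine vector search with full-text search.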

# improve search quality by adding hybrid query for "man blue jeans" in the product vector combined with a phrase search for "blue jeans"
results = search_redis(redis_client,
"man blue jeans",
vector_field="product_vector",
k=10,
hybrid_fields='@productDisplayName:"blue jeans"'
)


0. John Players Men Blue Jeans (Score: 0.791)
1. Lee Men Tino Blue Jeans (Score: 0.775)
2. Peter England Men Party Blue Jeans (Score: 0.763)
3. French Connection Men Blue Jeans (Score: 0.74)
4. Locomotive Men Washed Blue Jeans (Score: 0.739)
5. Locomotive Men Washed Blue Jeans (Score: 0.739)
6. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.732)
7. Denizen Women Blue Jeans (Score: 0.725)
8. Jealous 21 Women Washed Blue Jeans (Score: 0.713)
9. Jealous 21 Women Washed Blue Jeans (Score: 0.713)
# hybrid query for shirt in the product vector and only include results with the phrase "slim fit" in the title
results = search_redis(redis_client,
"shirt",
vector_field="product_vector",
k=10,
hybrid_fields='@productDisplayName:"slim fit"'
)


0. Basics Men White Slim Fit Striped Shirt (Score: 0.633)
1. ADIDAS Men's Slim Fit White T-shirt (Score: 0.628)
2. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
3. Basics Men Blue Slim Fit Checked Shirt (Score: 0.627)
4. Basics Men Red Slim Fit Checked Shirt (Score: 0.623)
5. Basics Men Navy Slim Fit Checked Shirt (Score: 0.613)
6. Lee Rinse Navy Blue Slim Fit Jeans (Score: 0.558)
7. Tokyo Talkies Women Navy Slim Fit Jeans (Score: 0.552)
# hybrid query for watch in the product vector and only include results with the tag "Accessories" in the masterCategory field
results = search_redis(redis_client,
"watch",
vector_field="product_vector",
k=10,
hybrid_fields='@masterCategory:{Accessories}'
)


0. Titan Women Gold Watch (Score: 0.544)
1. Being Human Men Grey Dial Blue Strap Watch (Score: 0.544)
2. Police Men Black Dial Watch PL12170JSB (Score: 0.544)
3. Titan Men Black Watch (Score: 0.543)
4. Police Men Black Dial Chronograph Watch PL12777JS-02M (Score: 0.542)
5. CASIO Youth Series Digital Men Black Small Dial Digital Watch W-210-1CVDF I065 (Score: 0.542)
6. Titan Women Silver Watch (Score: 0.542)
7. Police Men Black Dial Watch PL12778MSU-61 (Score: 0.541)
8. Titan Raga Women Gold Watch (Score: 0.539)
9. ADIDAS Original Men Black Dial Chronograph Watch ADH2641 (Score: 0.539)

# Hybrid query for sandals in the product vector, only including results within the 2011-2012 year range
results = search_redis(redis_client,
"sandals",
vector_field="product_vector",
k=10,
hybrid_fields='@year:[2011 2012]'
)


0. Enroute Teens Orange Sandals (Score: 0.701)
1. Fila Men Camper Brown Sandals (Score: 0.692)
2. Clarks Men Black Leather Closed Sandals (Score: 0.691)
3. Coolers Men Black Sandals (Score: 0.69)
4. Coolers Men Black Sandals (Score: 0.69)
5. Enroute Teens Brown Sandals (Score: 0.69)
6. Crocs Dora Boots Pink Sandals (Score: 0.69)
7. Enroute Men Leather Black Sandals (Score: 0.685)
8. ADIDAS Men Navy Blue Benton Sandals (Score: 0.684)
9. Coolers Men Black Sports Sandals (Score: 0.684)

# Hybrid query for blue sandals in the product vector, only including results from the Summer season within the 2011-2012 year range
results = search_redis(redis_client,
"blue sandals",
vector_field="product_vector",
k=10,
hybrid_fields='(@year:[2011 2012] @season:{Summer})'
)


0. ADIDAS Men Navy Blue Benton Sandals (Score: 0.691)
1. Enroute Teens Brown Sandals (Score: 0.681)
2. ADIDAS Women's Adi Groove Blue Flip Flop (Score: 0.672)
3. Enroute Women Turquoise Blue Flats (Score: 0.671)
4. Red Tape Men Black Sandals (Score: 0.67)
5. Enroute Teens Orange Sandals (Score: 0.661)
6. Vans Men Blue Era Scilla Plaid Shoes (Score: 0.658)
7. FILA Men Aruba Navy Blue Sandal (Score: 0.657)
8. Quiksilver Men Blue Flip Flops (Score: 0.656)
9. Reebok Men Navy Twist Sandals (Score: 0.656)

# Hybrid query for a brown belt, filtering results by year (NUMERIC), by specific article types (TAG) and by brand name (TEXT)
results = search_redis(redis_client,
"brown belt",
vector_field="product_vector",
k=10,
hybrid_fields='(@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:"Wrangler")'
)


0. Wrangler Men Leather Brown Belt (Score: 0.67)
1. Wrangler Women Black Belt (Score: 0.639)
2. Wrangler Men Green Striped Shirt (Score: 0.575)
3. Wrangler Men Purple Striped Shirt (Score: 0.549)
4. Wrangler Men Griffith White Shirt (Score: 0.543)
5. Wrangler Women Stella Green Shirt (Score: 0.542)
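
The filter expressions used in the hybrid queries above follow a small set of patterns: `[low high]` for NUMERIC ranges, `{a | b}` for TAG values and a quoted phrase for TEXT. The sketch below assembles them with a few hypothetical helper functions and then builds the full KNN query string in the same shape that `search_redis` uses. It is only string formatting and assumes values that need no escaping.

```python
def numeric_range(field, low, high):
    """NUMERIC filter, inclusive on both ends: @year:[2011 2012]."""
    return f"@{field}:[{low} {high}]"


def tag_filter(field, values):
    """TAG filter matching any of the given values: @season:{Summer}."""
    return "@" + field + ":{" + " | ".join(values) + "}"


def text_phrase(field, phrase):
    """TEXT filter for an exact phrase: @productDisplayName:"blue jeans"."""
    return f'@{field}:"{phrase}"'


def all_of(*filters):
    """Combine filters so that all of them must match (space means AND)."""
    return "(" + " ".join(filters) + ")"


def knn_query(filter_expr="*", k=10, vector_field="product_vector"):
    """Full query string: pre-filter on the left, KNN clause on the right."""
    return f"{filter_expr}=>[KNN {k} @{vector_field} $vector AS vector_score]"


belt_filter = all_of(
    numeric_range("year", 2012, 2012),
    tag_filter("articleType", ["Shirts", "Belts"]),
    text_phrase("productDisplayName", "Wrangler"),
)
print(belt_filter)
print(knn_query(belt_filter, k=10))
```

The first printed line matches the `hybrid_fields` value of the last query above. Tag values that contain spaces or punctuation generally need escaping, which this sketch does not handle.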