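Using Hologres as a vector database for OpenAI embeddings

This notebook guides you step by step through using Hologres as a vector database for OpenAI embeddings.

This notebook presents an end-to-end process of:
1. Using precomputed embeddings created by the OpenAI API.
2. Storing the embeddings in a cloud instance of Hologres.
3. Converting a raw text query to an embedding with the OpenAI API.
4. Using Hologres to perform a nearest neighbor search in the created collection.
5. Providing the search results as context to large language models for prompt engineering.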

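What is Hologres

Hologres is a unified real-time data warehousing service developed by Alibaba Cloud. You can use Hologres to write, update, process, and analyze large amounts of data in real time. Hologres supports standard SQL syntax, is compatible with PostgreSQL, and supports most PostgreSQL functions. Hologres supports online analytical processing (OLAP) and ad hoc analysis for up to petabytes of data, and provides high-concurrency, low-latency online data services. Hologres supports fine-grained isolation of multiple workloads and enterprise-level security capabilities. Hologres is deeply integrated with MaxCompute, Realtime Compute for Apache Flink, and DataWorks, and provides full-stack online and offline data warehousing solutions for enterprises.

Hologres provides vector database functionality by adopting Proxima.

Proxima is a high-performance software library developed by Alibaba DAMO Academy. It allows you to search for the nearest neighbors of vectors. Proxima provides higher stability and performance than similar open source software such as Facebook AI Similarity Search (Faiss). Proxima provides basic modules with industry-leading performance and effects, and allows you to search for similar images, videos, or human faces. Hologres is deeply integrated with Proxima to provide a high-performance vector search service.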
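
To make "searching for the nearest neighbors of vectors" concrete, here is a minimal brute-force sketch in plain Python. It is not how Proxima works internally (Proxima builds approximate indexes so it does not have to scan every vector); it only illustrates the exact result that such an index approximates. The function names and the toy 2-dimensional vectors are hypothetical and chosen for illustration.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(query, vectors, top_k=2):
    """Return (index, distance) pairs for the top_k vectors closest to query."""
    scored = [(i, euclidean_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:top_k]

# Toy 2-dimensional "embeddings"; the embeddings used in this notebook have 1536 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [4.0, 4.0]]
nearest = brute_force_knn([0.9, 0.1], toy_vectors, top_k=2)
print(nearest)
```

A full scan like this costs time proportional to the number of stored vectors for every query, which is the cost a vector index is meant to avoid.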

Deployment options

Click here to quickly deploy a Hologres data warehouse.

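Prerequisites

To complete this exercise, we need to prepare a few things:

1. A Hologres cloud server instance.
2. The psycopg2-binary library to interact with the vector database. Any other PostgreSQL client library also works.
3. An OpenAI API key.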

We can verify that the server was launched successfully by running a simple curl command:

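Install requirements

This notebook requires the openai and psycopg2-binary packages, and we will also use a few additional libraries. The following command installs them all: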

! pip install openai psycopg2-binary pandas wget

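Prepare your OpenAI API key

The OpenAI API key is used to vectorize the documents and queries.

If you don't have an OpenAI API key, you can get one from https://beta.openai.com/account/api-keys.

Once you have your key, add it to your environment variables as OPENAI_API_KEY.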

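# Test that your OpenAI API key is correctly set as an environment variable.
# Note: if you run this notebook locally, you will need to reload your terminal and the notebook for the environment variables to take effect.
import os

# Note: alternatively, you can set a temporary environment variable like this:
# os.environ["OPENAI_API_KEY"] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"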

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready

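Connect to Hologres

First add the connection settings to your environment variables, or change the psycopg2.connect parameters below directly.

Connecting to a running Hologres instance is easy with the psycopg2 Python library: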

import os
import psycopg2

# Note: alternatively, you can set temporary environment variables like this:
# os.environ["PGHOST"] = "your_host"
# os.environ["PGPORT"] = "5432"
# os.environ["PGDATABASE"] = "postgres"
# os.environ["PGUSER"] = "user"
# os.environ["PGPASSWORD"] = "password"

connection = psycopg2.connect(
    host=os.environ.get("PGHOST", "localhost"),
    port=os.environ.get("PGPORT", "5432"),
    database=os.environ.get("PGDATABASE", "postgres"),
    user=os.environ.get("PGUSER", "user"),
    password=os.environ.get("PGPASSWORD", "password")
)
connection.set_session(autocommit=True)

# Create a new cursor object
cursor = connection.cursor()

We can test the connection by running a simple query:

# Execute a simple query to test the connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Connection successful!")
else:
    print("Connection failed.")

Connection successful!

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is about 700 MB, so the download will take some time.
wget.download(embeddings_url)

The downloaded file then has to be extracted:

import zipfile
import os

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# Check that the CSV file exists
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The file {file_name} exists in the data directory.")
else:
    print(f"The file {file_name} does not exist in the data directory.")

The file vector_database_wikipedia_articles_embedded.csv exists in the data directory.

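Load data

In this section we load the data that was prepared before this session, so you don't have to recompute the embeddings of the Wikipedia articles with your own credits.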

!unzip -n vector_database_wikipedia_articles_embedded.zip
!ls -lh vector_database_wikipedia_articles_embedded.csv

Archive:  vector_database_wikipedia_articles_embedded.zip
-rw-r--r--@ 1 geng  staff   1.7G Jan 31 01:19 vector_database_wikipedia_articles_embedded.csv

Take a look at the data.

import pandas, json
data = pandas.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')
data

	id	url	title	text	title_vector	content_vector	vector_id
0	1	https://simple.wikipedia.org/wiki/April	April	April is the fourth month of the year in the J...	[0.001009464613161981, -0.020700545981526375, ...	[-0.011253940872848034, -0.013491976074874401,...	0
1	2	https://simple.wikipedia.org/wiki/August	August	August (Aug.) is the eighth month of the year ...	[0.0009286514250561595, 0.000820168002974242, ...	[0.0003609954728744924, 0.007262262050062418, ...	1
2	6	https://simple.wikipedia.org/wiki/Art	Art	Art is a creative activity that expresses imag...	[0.003393713850528002, 0.0061537534929811954, ...	[-0.004959689453244209, 0.015772193670272827, ...	2
3	8	https://simple.wikipedia.org/wiki/A	A	A or a is the first letter of the English alph...	[0.0153952119871974, -0.013759135268628597, 0....	[0.024894846603274345, -0.022186409682035446, ...	3
4	9	https://simple.wikipedia.org/wiki/Air	Air	Air refers to the Earth's atmosphere. Air is a...	[0.02224554680287838, -0.02044147066771984, -0...	[0.021524671465158463, 0.018522677943110466, -...	4
...	...	...	...	...	...	...	...
24995	98295	https://simple.wikipedia.org/wiki/Geneva	Geneva	Geneva (, , , , ) is the second biggest cit...	[-0.015773078426718712, 0.01737344264984131, 0...	[0.008000412955880165, 0.02008531428873539, 0....	24995
24996	98316	https://simple.wikipedia.org/wiki/Concubinage	Concubinage	Concubinage is the state of a woman in a relat...	[-0.00519518880173564, 0.005898841191083193, 0...	[-0.01736736111342907, -0.002740012714639306, ...	24996
24997	98318	https://simple.wikipedia.org/wiki/Mistress%20%...	Mistress (lover)	A mistress is a man's long term female sexual ...	[-0.023164259269833565, -0.02052430994808674, ...	[-0.017878392711281776, -0.0004517830966506153...	24997
24998	98326	https://simple.wikipedia.org/wiki/Eastern%20Front	Eastern Front	Eastern Front can be one of the following: ...	[-0.00681863259524107, 0.002171179046854377, 8...	[-0.0019235472427681088, -0.004023272544145584...	24998
24999	98327	https://simple.wikipedia.org/wiki/Italian%20Ca...	Italian Campaign	Italian Campaign can mean the following: Th...	[-0.014151256531476974, -0.008553029969334602,...	[-0.011758845299482346, -0.01346028596162796, ...	24999

25000 rows × 7 columns

title_vector_length = len(json.loads(data['title_vector'].iloc[0]))
content_vector_length = len(json.loads(data['content_vector'].iloc[0]))

print(title_vector_length, content_vector_length)

1536 1536
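
The length check above relies on the fact that each vector column is stored in the CSV as a JSON-style string such as "[0.1, -0.2, ...]". As a small sketch of how one might validate such a string before loading it anywhere, the hypothetical helper below parses it and checks the dimension. The name parse_vector and the 3-dimensional sample are assumptions for illustration; the default of 1536 comes from the output above.

```python
import json

EXPECTED_DIM = 1536  # dimension printed by the length check above

def parse_vector(raw, expected_dim=EXPECTED_DIM):
    """Parse a JSON array string into a list of floats and verify its length."""
    vector = json.loads(raw)
    if not isinstance(vector, list):
        raise ValueError("expected a JSON array")
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    return [float(x) for x in vector]

# Tiny 3-dimensional sample so the example stays readable.
sample = "[0.1, -0.2, 0.3]"
parsed = parse_vector(sample, expected_dim=3)
print(parsed)
```

Applied to the DataFrame, this could look like data['title_vector'].map(parse_vector), although for 25,000 rows of 1536 values that conversion takes noticeable time and memory.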

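Create table and Proxima vector index

Hologres stores data in tables, where each object is described by at least one vector. Our table will be called articles, and each object will be described by both a title vector and a content vector.

We will start by creating a table with Proxima indexes on both title and content, and then fill it with our precomputed embeddings.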

cursor.execute('CREATE EXTENSION IF NOT EXISTS proxima;')
create_proxima_table_sql = '''
BEGIN;
DROP TABLE IF EXISTS articles;
CREATE TABLE articles (
    id INT PRIMARY KEY NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector float4[] check(
        array_ndims(title_vector) = 1 and 
        array_length(title_vector, 1) = 1536
    ), -- define the vectors
    content_vector float4[] check(
        array_ndims(content_vector) = 1 and 
        array_length(content_vector, 1) = 1536
    ),
    vector_id INT
);

-- Create indexes for the vector fields.
call set_table_property(
    'articles',
    'proxima_vectors', 
    '{
        "title_vector":{"algorithm":"Graph","distance_method":"Euclidean","builder_params":{"min_flush_proxima_row_count" : 10}},
        "content_vector":{"algorithm":"Graph","distance_method":"Euclidean","builder_params":{"min_flush_proxima_row_count" : 10}}
    }'
);  

COMMIT;
'''

# Execute the SQL statements (autocommit is enabled)
cursor.execute(create_proxima_table_sql)
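
The proxima_vectors property in the SQL above is a JSON document with one entry per vector column. Writing it by hand is easy to get wrong once there are several columns, so here is a sketch that generates the same JSON with json.dumps. The helper name build_proxima_property is hypothetical; the keys and values (algorithm "Graph", distance_method "Euclidean", min_flush_proxima_row_count 10) are copied from the statement above rather than taken from any additional documentation.

```python
import json

def build_proxima_property(columns, algorithm="Graph",
                           distance_method="Euclidean", min_flush_rows=10):
    """Build the JSON string for the 'proxima_vectors' table property."""
    spec = {
        column: {
            "algorithm": algorithm,
            "distance_method": distance_method,
            "builder_params": {"min_flush_proxima_row_count": min_flush_rows},
        }
        for column in columns
    }
    return json.dumps(spec)

property_json = build_proxima_property(["title_vector", "content_vector"])
print(property_json)
```

The resulting string could be substituted for the hand-written JSON literal in the set_table_property call, assuming the column names match the table definition.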

Upload data

Now let's upload the data to the Hologres cloud instance using the COPY statement. This might take 5-10 minutes, depending on network bandwidth.

import io

# Path to the unzipped CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# In SQL, arrays are surrounded by braces {} rather than square brackets [].
def process_file(file_path):
    with open(file_path, 'r') as file:
        for line in file:
            # Replace '[' with '{' and ']' with '}'
            modified_line = line.replace('[', '{').replace(']', '}')
            yield modified_line

# Create a StringIO object to hold the modified lines
modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))

# Create the COPY command for the copy_expert method
copy_command = '''
COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY command using the copy_expert method
cursor.copy_expert(copy_command, modified_lines)
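
Note that the process_file helper above replaces every square bracket on a line, including brackets that appear inside the article text, and that it joins the whole file into one in-memory string. As a hedged alternative, the sketch below uses the standard csv module to rewrite only the vector columns, row by row. The function name convert_vectors and the two-row sample are hypothetical; the column names are those of the dataset.

```python
import csv
import io

VECTOR_COLUMNS = ("title_vector", "content_vector")

def convert_vectors(source, target, vector_columns=VECTOR_COLUMNS):
    """Copy CSV rows from source to target, rewriting only the vector columns
    from JSON-style [..] to PostgreSQL array-literal {..} syntax."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for column in vector_columns:
            value = row[column].strip()
            if value.startswith("[") and value.endswith("]"):
                row[column] = "{" + value[1:-1] + "}"
        writer.writerow(row)

# Small in-memory sample; the text column keeps its brackets untouched.
sample_csv = (
    "id,text,title_vector,content_vector\n"
    '1,"See [note] here","[0.1, 0.2]","[0.3, 0.4]"\n'
)
converted = io.StringIO()
convert_vectors(io.StringIO(sample_csv), converted)
print(converted.getvalue())
```

For the real file, source could be the opened CSV and target a temporary file that is then passed to copy_expert, which keeps memory use flat. Very long article texts may require raising the parser limit with csv.field_size_limit first.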

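The Proxima index is built in the background. We can search during this period, but queries will be slow without the vector index. Use this command to wait for the index build to finish.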

cursor.execute('vacuum articles;')

# Check the table size to make sure all rows have been stored.
count_sql = "select count(*) from articles;"
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")

Count:25000

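Search data

Once the data is uploaded, we can start querying the table for the closest vectors. We may provide an additional parameter, vector_name, to switch from title-based to content-based search. Since the precomputed embeddings were created with the text-embedding-3-small OpenAI model, we also have to use it during search.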

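import openai

# Assumes openai>=1.0; the client reads OPENAI_API_KEY from the environment.
client = openai.OpenAI()

def query_knn(query, table_name, vector_name="title_vector", top_k=20):

    # Create an embedding vector from the user query
    embedded_query = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding

    # Convert the embedded query to a PostgreSQL-compatible array literal
    embedded_query_pg = "{" + ",".join(map(str, embedded_query)) + "}"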

    # Build the SQL query
    query_sql = f"""
    SELECT id, url, title, pm_approx_euclidean_distance({vector_name}, '{embedded_query_pg}'::float4[]) AS distance
FROM {table_name}
ORDER BY distance
LIMIT {top_k};
    """
    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results

query_results = query_knn("modern art in Europe", "Articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Museum of Modern Art (Score: 0.501)
2. Western Europe (Score: 0.485)
3. Renaissance art (Score: 0.479)
4. Pop art (Score: 0.472)
5. Northern Europe (Score: 0.461)
6. Hellenistic art (Score: 0.458)
7. Modernist literature (Score: 0.447)
8. Art film (Score: 0.44)
9. Central Europe (Score: 0.439)
10. Art (Score: 0.437)
11. European (Score: 0.437)
12. Byzantine art (Score: 0.436)
13. Postmodernism (Score: 0.435)
14. Eastern Europe (Score: 0.433)
15. Cubism (Score: 0.433)
16. Europe (Score: 0.432)
17. Impressionism (Score: 0.432)
18. Bauhaus (Score: 0.431)
19. Surrealism (Score: 0.429)
20. Expressionism (Score: 0.429)
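
The score printed above is simply 1 minus the distance returned by the query, so it is not a cosine similarity. As a sketch of how the two relate, under two assumptions that should be verified for your setup (the embeddings are normalized to unit length, and pm_approx_euclidean_distance returns the plain, non-squared Euclidean distance), the squared distance d and the cosine similarity c of two vectors satisfy d^2 = 2 - 2c. The toy vectors and function names below are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_from_distance(distance):
    """Valid only for unit-length vectors: c = 1 - d^2 / 2."""
    return 1.0 - distance * distance / 2.0

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
d = euclidean_distance(a, b)
c = cosine_similarity(a, b)
print(d * d, 2 - 2 * c, cosine_from_distance(d))
```

Because the mapping from distance to cosine similarity is monotonic for unit vectors, ordering by Euclidean distance gives the same ranking as ordering by cosine similarity; only the printed score differs.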

# This time we will query using the content vector.
query_results = query_knn("Famous battles in Scottish history", "Articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Battle of Bannockburn (Score: 0.489)
2. Wars of Scottish Independence (Score: 0.474)
3. 1651 (Score: 0.457)
4. First War of Scottish Independence (Score: 0.452)
5. Robert I of Scotland (Score: 0.445)
6. 841 (Score: 0.441)
7. 1716 (Score: 0.441)
8. 1314 (Score: 0.429)
9. 1263 (Score: 0.428)
10. William Wallace (Score: 0.426)
11. Stirling (Score: 0.419)
12. 1306 (Score: 0.419)
13. 1746 (Score: 0.418)
14. 1040s (Score: 0.414)
15. 1106 (Score: 0.412)
16. 1304 (Score: 0.411)
17. David II of Scotland (Score: 0.408)
18. Braveheart (Score: 0.407)
19. 1124 (Score: 0.406)
20. July 27 (Score: 0.405)
