Vector similarity search using Neon Postgres

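This notebook guides you through using Neon Serverless Postgres as a vector database for OpenAI embeddings. It demonstrates how to:

Use embeddings created by the OpenAI API.
Store embeddings in a Neon Serverless Postgres database.
Convert a raw text query to an embedding with the OpenAI API.
Use Neon with the pgvector extension to perform vector similarity search.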

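Prerequisites

Before you begin, ensure that you have the following:

A Neon Postgres database. You can create an account and set up a project with a ready-to-use neondb database in a few simple steps. For instructions, see Sign up and Create your first project.
A connection string for your Neon database. You can copy it from the Connection Details widget on the Neon Dashboard. See Connect from any application.
The pgvector extension. Install the extension in Neon by running CREATE EXTENSION vector;. For instructions, see Enable the pgvector extension.
Your OpenAI API key.
Python and pip.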

Install required modules

This notebook requires the openai, psycopg2, pandas, wget, and python-dotenv packages. You can install them with pip:

! pip install openai psycopg2 pandas wget python-dotenv

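Prepare your OpenAI API key

An OpenAI API key is required to generate vectors for documents and queries.

If you do not have an OpenAI API key, obtain one from https://platform.openai.com/account/api-keys.

Add the OpenAI API key as an operating system environment variable or provide it for the session when prompted. If you define an environment variable, name the variable OPENAI_API_KEY.

For information about configuring your OpenAI API key as an environment variable, refer to Best Practices for API Key Safety.

Test your OpenAI API key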

# Test to ensure that your OpenAI API key is defined as an environment variable, or provide it when prompted.
# If you run this notebook locally, you may have to reload the terminal and the notebook to make the environment available.

import os
from getpass import getpass

# Check if OPENAI_API_KEY is set as an environment variable.
if os.getenv("OPENAI_API_KEY") is not None:
    print("Your OPENAI_API_KEY is ready")
else:
    # If not, prompt for it now.
    api_key = getpass("Enter your OPENAI_API_KEY: ")
    if api_key:
        print("Your OPENAI_API_KEY is now available for this session")
        # Also set it as an environment variable for the current session.
        os.environ["OPENAI_API_KEY"] = api_key
    else:
        print("You did not enter your OPENAI_API_KEY")

Your OPENAI_API_KEY is ready

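Connect to your Neon database

Provide your Neon database connection string below or define it in an .env file using a DATABASE_URL variable. For information about obtaining a Neon connection string, see Connect from any application.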

import os
import psycopg2
from dotenv import load_dotenv

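# Load environment variables from the .env file
load_dotenv()

# The connection string can be provided directly here.
# Replace the next line with your Neon connection string,
# or set it to "" to fall back to the DATABASE_URL variable.
connection_string = "postgres://<user>:<password>@<hostname>/<dbname>"

# If connection_string is not provided directly above,
# check whether DATABASE_URL is set in the environment or the .env file.
if not connection_string:
    connection_string = os.environ.get("DATABASE_URL")

    # If neither method provides a connection string, raise an error.
    if not connection_string:
        raise ValueError("Please provide a valid connection string either in the code or in the .env file as DATABASE_URL.")

# Connect using the connection string
connection = psycopg2.connect(connection_string)

# Create a new cursor object
cursor = connection.cursor()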
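
Note that the placeholder string in the cell above is non-empty, so the `if not connection_string` check only falls back to DATABASE_URL if you blank the string out. The sketch below shows one way to make that fallback automatic. The helper name `resolve_connection_string` and the rule "anything containing `<...>` is a template" are assumptions made for illustration, not part of the notebook or of psycopg2.

```python
import os
import re

# Hypothetical helper: treats "" or a template such as
# "postgres://<user>:<password>@<hostname>/<dbname>" as "not provided".
_PLACEHOLDER = re.compile(r"<[^<>]+>")


def resolve_connection_string(inline_value, env=None):
    """Return inline_value if it looks real, otherwise fall back to DATABASE_URL."""
    env = os.environ if env is None else env

    # A non-empty value without <...> markers is assumed to be a real string.
    if inline_value and not _PLACEHOLDER.search(inline_value):
        return inline_value

    fallback = env.get("DATABASE_URL")
    if not fallback:
        raise ValueError(
            "Provide a connection string in the code or as DATABASE_URL."
        )
    return fallback
```

You could then call `psycopg2.connect(resolve_connection_string(connection_string))`. The check would misfire for a real password that happens to contain angle brackets, so treat it as a convenience rather than validation.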

Test the connection to your database:

# Execute this query to test the database connection
cursor.execute("SELECT 1;")
result = cursor.fetchone()

# Check the query result
if result == (1,):
    print("Your database connection was successful!")
else:
    print("Your connection failed.")

Your database connection was successful!

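This guide uses pre-computed Wikipedia article embeddings available in the OpenAI Cookbook examples directory so that you do not have to compute embeddings with your own OpenAI credits.

Import the pre-computed embeddings zip file: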

import wget

embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB. Importing it will take several minutes.
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

Extract the downloaded zip file:

import zipfile
import os

current_directory = os.getcwd()
zip_file_path = os.path.join(current_directory, "vector_database_wikipedia_articles_embedded.zip")
output_directory = os.path.join(current_directory, "../../data")

with zipfile.ZipFile(zip_file_path, "r") as zip_ref:
    zip_ref.extractall(output_directory)


# Check whether the csv file was extracted
file_name = "vector_database_wikipedia_articles_embedded.csv"
data_directory = os.path.join(current_directory, "../../data")
file_path = os.path.join(data_directory, file_name)


if os.path.exists(file_path):
    print(f"The csv file {file_name} exists in the data directory.")
else:
    print(f"The csv file {file_name} does not exist in the data directory.")

The csv file vector_database_wikipedia_articles_embedded.csv exists in the data directory.
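
Because the archive is large, you may want to skip extraction when the CSV is already in place. The following is a minimal sketch of that idea. `extract_if_missing` is a hypothetical helper, and it assumes the CSV sits at the root of the archive, which is what the existence check above implies.

```python
import os
import zipfile


def extract_if_missing(zip_path, output_dir, member_name):
    """Extract member_name from zip_path unless it already exists in output_dir."""
    target = os.path.join(output_dir, member_name)

    # Nothing to do if a previous run already extracted the file.
    if os.path.exists(target):
        return target

    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        if member_name not in zip_ref.namelist():
            raise FileNotFoundError(f"{member_name} is not in {zip_path}")
        # Extract only the requested member; output_dir is created if needed.
        zip_ref.extract(member_name, output_dir)

    return target
```

With the variables from the cell above, the call would be `extract_if_missing(zip_file_path, output_directory, file_name)`.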

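Create a table and add indexes for your vector embeddings

The vector table created in your database is called articles. Each object has title and content vectors.

An index is defined on both the title and content vector columns.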

create_table_sql = '''
CREATE TABLE IF NOT EXISTS public.articles (
    id INTEGER NOT NULL,
    url TEXT,
    title TEXT,
    content TEXT,
    title_vector vector(1536),
    content_vector vector(1536),
    vector_id INTEGER
);

ALTER TABLE public.articles ADD PRIMARY KEY (id);
'''

# SQL statement for creating indexes
create_indexes_sql = '''
CREATE INDEX ON public.articles USING ivfflat (content_vector) WITH (lists = 1000);

CREATE INDEX ON public.articles USING ivfflat (title_vector) WITH (lists = 1000);
'''

# Execute the SQL statements
cursor.execute(create_table_sql)
cursor.execute(create_indexes_sql)

# Commit the changes
connection.commit()
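
The notebook sets lists to 1000 for both indexes. For comparison, the pgvector README suggests, as a starting point only, rows / 1000 lists for tables of up to 1M rows and sqrt(rows) above that, with sqrt(lists) probes at query time. It also recommends creating an ivfflat index after the table has some data. The sketch below encodes that heuristic. The function names are hypothetical, and the values are tuning starting points rather than settings this notebook prescribes.

```python
import math


def suggested_lists(row_count):
    """Starting value for the ivfflat `lists` option (pgvector README heuristic)."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return math.isqrt(row_count)


def suggested_probes(lists):
    """Starting value for the `ivfflat.probes` setting: roughly sqrt(lists)."""
    return max(1, math.isqrt(lists))


rows = 25_000  # number of records loaded in this notebook
lists = suggested_lists(rows)
probes = suggested_probes(lists)
print(f"lists={lists}, probes={probes}")
```

If you try other values, the probes setting is applied per session with `cursor.execute("SET ivfflat.probes = 5;")` before running a search. More probes generally improve recall at the cost of speed.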

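Load the data

Load the pre-computed vector data into your articles table from the .csv file. There are 25,000 records, so expect the operation to take several minutes.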

import io

# Path to your local CSV file
csv_file_path = '../../data/vector_database_wikipedia_articles_embedded.csv'

# Define a generator function to process the csv file
def process_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            yield line

# Create a StringIO object to hold the file's lines
modified_lines = io.StringIO(''.join(list(process_file(csv_file_path))))

# Create the COPY command for copy_expert
copy_command = '''
COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
'''

# Execute the COPY command using copy_expert
cursor.copy_expert(copy_command, modified_lines)

# Commit the changes
connection.commit()
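
The cell above reads the entire CSV into memory before copying it. Since psycopg2's copy_expert accepts any readable file-like object, the open file can be passed directly instead. The following is a sketch of that variant. `copy_csv_to_articles` is a hypothetical helper name, and it assumes the same table and column layout as above.

```python
def copy_csv_to_articles(cursor, csv_path):
    """Stream csv_path into public.articles with COPY, without buffering it in memory."""
    copy_command = """
    COPY public.articles (id, url, title, content, title_vector, content_vector, vector_id)
    FROM STDIN WITH (FORMAT CSV, HEADER true, DELIMITER ',');
    """
    with open(csv_path, "r", encoding="utf-8") as csv_file:
        # The open file is handed to copy_expert as-is; psycopg2 reads it in chunks.
        cursor.copy_expert(copy_command, csv_file)
```

Usage would be `copy_csv_to_articles(cursor, csv_file_path)` followed by `connection.commit()`.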

Check the number of records to ensure the data has been loaded. There should be 25,000 records.

# Check the size of the data
count_sql = """SELECT count(*) FROM public.articles;"""
cursor.execute(count_sql)
result = cursor.fetchone()
print(f"Count:{result[0]}")

Count:25000

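Search your data

After the data is stored in your Neon database, you can query the data for nearest neighbors.

Start by defining the query_neon function, which is executed when you run the vector similarity search. The function creates an embedding based on the user's query, prepares the SQL query, and runs the SQL query with the embedding. The pre-computed embeddings that you loaded into your database were created with the text-embedding-3-small OpenAI model, so you must use the same model to create an embedding for the similarity search.

A vector_name parameter is provided that allows you to search based on "title" or "content".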

def query_neon(query, collection_name, vector_name="title_vector", top_k=20):

    # Create an embedding vector from the user query.
    # openai>=1.0 removed openai.Embedding.create; the module-level
    # openai.embeddings.create call is its replacement.
    embedded_query = openai.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    ).data[0].embedding

    # Convert the embedded query to a PostgreSQL-compatible format
    embedded_query_pg = "[" + ",".join(map(str, embedded_query)) + "]"

    # Create the SQL query
    query_sql = f"""
    SELECT id, url, title, l2_distance({vector_name},'{embedded_query_pg}'::VECTOR(1536)) AS similarity
    FROM {collection_name}
    ORDER BY {vector_name} <-> '{embedded_query_pg}'::VECTOR(1536)
    LIMIT {top_k};
    """
    # Execute the query
    cursor.execute(query_sql)
    results = cursor.fetchall()

    return results
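
The function above interpolates the column name, the embedding, and the limit directly into the SQL text. As a sketch of a more defensive variant, the helper below restricts the column to a known set and passes the values as query parameters. `ALLOWED_VECTOR_COLUMNS`, `to_pgvector_literal`, and `build_search_query` are hypothetical names introduced for illustration. The `%s` placeholders are psycopg2's standard parameter style.

```python
# Hypothetical allowlist: the two vector columns defined on public.articles.
ALLOWED_VECTOR_COLUMNS = {"title_vector", "content_vector"}


def to_pgvector_literal(embedding):
    """Format a sequence of floats as pgvector's text form, e.g. "[0.1,0.2]"."""
    return "[" + ",".join(str(x) for x in embedding) + "]"


def build_search_query(embedding, vector_name="title_vector", top_k=20):
    """Return (sql, params) for a nearest-neighbor search by L2 distance."""
    # Column names cannot be bound as parameters, so validate them instead.
    if vector_name not in ALLOWED_VECTOR_COLUMNS:
        raise ValueError(f"Unknown vector column: {vector_name}")

    sql = f"""
    SELECT id, url, title, {vector_name} <-> %s::vector AS distance
    FROM public.articles
    ORDER BY {vector_name} <-> %s::vector
    LIMIT %s;
    """
    literal = to_pgvector_literal(embedding)
    # The literal appears twice: once in SELECT and once in ORDER BY.
    return sql, (literal, literal, int(top_k))
```

Inside query_neon, this would be used as `sql, params = build_search_query(embedded_query, vector_name, top_k)` followed by `cursor.execute(sql, params)`.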

Run a similarity search based on title_vector embeddings:

# Query based on `title_vector` embeddings
import openai

query_results = query_neon("Greek mythology", "Articles")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. Greek mythology (Score: 0.998)
2. Roman mythology (Score: 0.7)
3. Greek underworld (Score: 0.637)
4. Mythology (Score: 0.635)
5. Classical mythology (Score: 0.629)
6. Japanese mythology (Score: 0.615)
7. Norse mythology (Score: 0.569)
8. Greek language (Score: 0.566)
9. Zeus (Score: 0.534)
10. List of mythologies (Score: 0.531)
11. Jupiter (mythology) (Score: 0.53)
12. Greek (Score: 0.53)
13. Gaia (mythology) (Score: 0.526)
14. Titan (mythology) (Score: 0.522)
15. Mercury (mythology) (Score: 0.521)
16. Ancient Greece (Score: 0.52)
17. Greek alphabet (Score: 0.52)
18. Venus (mythology) (Score: 0.515)
19. Pluto (mythology) (Score: 0.515)
20. Athena (Score: 0.514)
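
The score printed above is 1 minus the L2 distance, which is a convenient ranking value but not a cosine similarity. For unit-length vectors the two are related by ||a - b||² = 2 - 2·cos(a, b). The sketch below applies that identity, assuming the embeddings are normalized to length 1 (OpenAI documents its embeddings as normalized). The function names are illustrative.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_from_l2(distance):
    """Cosine similarity of two unit-length vectors, given their L2 distance."""
    return 1.0 - (distance ** 2) / 2.0


# Two orthogonal unit vectors: the cosine similarity should be about 0.
d = l2_distance([1.0, 0.0], [0.0, 1.0])
print(round(cosine_from_l2(d), 6))
```

Applied to a result row from query_neon, this would be `cosine_from_l2(result[3])`. The ranking order is unchanged, since cosine similarity decreases monotonically as the distance grows.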

Run a similarity search based on content_vector embeddings:

# Query based on `content_vector` embeddings
query_results = query_neon("Famous battles in Greek history", "Articles", "content_vector")
for i, result in enumerate(query_results):
    print(f"{i + 1}. {result[2]} (Score: {round(1 - result[3], 3)})")

1. 222 BC (Score: 0.489)
2. Trojan War (Score: 0.458)
3. Peloponnesian War (Score: 0.456)
4. History of the Peloponnesian War (Score: 0.449)
5. 430 BC (Score: 0.441)
6. 168 BC (Score: 0.436)
7. Ancient Greece (Score: 0.429)
8. Classical Athens (Score: 0.428)
9. 499 BC (Score: 0.427)
10. Leonidas I (Score: 0.426)
11. Battle (Score: 0.421)
12. Greek War of Independence (Score: 0.421)
13. Menelaus (Score: 0.419)
14. Thebes, Greece (Score: 0.417)
15. Patroclus (Score: 0.417)
16. 427 BC (Score: 0.416)
17. 429 BC (Score: 0.413)
18. August 2 (Score: 0.412)
19. Ionia (Score: 0.411)
20. 323 (Score: 0.409)
