This notebook demonstrates how to build a semantic search application using OpenAI and MongoDB Atlas Vector Search.

!pip install pymongo openai

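Step 1: Set up the environment

There are two prerequisites:

MongoDB Atlas cluster: To create a forever-free MongoDB Atlas cluster, you first need a MongoDB Atlas account. If you do not have one yet, visit the MongoDB Atlas website and click "Register". Then go to the MongoDB Atlas dashboard and set up your cluster. To use the $vectorSearch operator in an aggregation pipeline, your cluster must run MongoDB version 6.0.11 or later. This tutorial can be built with a free cluster. While setting up the deployment, you will be prompted to create a database user and network access rules. Store the username and password somewhere safe, and set the IP access rules so that your client can connect to the cluster. If you need more help getting started, see our MongoDB Atlas tutorial.
OpenAI API key: To create an OpenAI key, you need an OpenAI account. Once you have one, visit the OpenAI platform, click the profile icon in the top right of the screen to open the dropdown menu, and select "View API keys".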

import getpass

MONGODB_ATLAS_CLUSTER_URI = getpass.getpass("MongoDB Atlas Cluster URI:")
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")

MongoDB Atlas Cluster URI:··········
OpenAI API Key:··········

Note: Running the cell above prompts you to enter your credentials.
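When the notebook runs non-interactively, the same two values could come from environment variables instead of a prompt. The following is a minimal sketch, not part of the original notebook: read_secret is a hypothetical helper, and the environment variable names shown in the usage comment are an assumption.

```python
import getpass
import os


def read_secret(env_name: str, prompt: str) -> str:
    """Return the value of env_name if it is set and non-empty; otherwise prompt for it."""
    value = os.environ.get(env_name)
    if value:
        return value
    # Fall back to the same hidden prompt the notebook uses.
    return getpass.getpass(prompt)


# Hypothetical usage (variable names are an assumption):
# MONGODB_ATLAS_CLUSTER_URI = read_secret("MONGODB_ATLAS_CLUSTER_URI", "MongoDB Atlas Cluster URI:")
# OPENAI_API_KEY = read_secret("OPENAI_API_KEY", "OpenAI API Key:")
```

Either way, the credentials stay out of the notebook source.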

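In this tutorial we use the MongoDB sample dataset. Load the sample dataset using the Atlas UI. We use the "sample_mflix" database, which contains a "movies" collection in which each document has fields such as title, plot, genres, cast, and directors.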

import openai
import pymongo

client = pymongo.MongoClient(MONGODB_ATLAS_CLUSTER_URI)
db = client.sample_mflix
collection = db.movies

openai.api_key = OPENAI_API_KEY

ATLAS_VECTOR_SEARCH_INDEX_NAME = "default"
EMBEDDING_FIELD_NAME = "embedding_openai_nov19_23"

Step 2: Set up the embedding generation function

model = "text-embedding-3-small"
def generate_embedding(text: str) -> list[float]:
    return openai.embeddings.create(input = [text], model=model).data[0].embedding
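The function above sends one text per request. Because the embeddings endpoint takes a list as input, several texts could be sent per call. The following is a minimal sketch of that batching logic, under stated assumptions: batched and embed_in_batches are hypothetical helpers, embed_batch is any callable you supply that maps a list of texts to a list of vectors in the same order, and the batch size of 100 is an arbitrary choice.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    embeddings: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch must return one vector per input text")
        embeddings.extend(vectors)
    return embeddings


# One possible embed_batch built on the client configured earlier (assumption):
# def openai_embed_batch(batch):
#     response = openai.embeddings.create(input=batch, model=model)
#     return [item.embedding for item in response.data]
```

Passing the embedding call in as a parameter keeps the batching logic testable without network access.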

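Step 3: Create and store embeddings

Each document in the sample dataset sample_mflix.movies corresponds to a movie. We create a vector embedding for the data in the "plot" field and store it in the database. Creating vector embeddings with the OpenAI embeddings endpoint is necessary for running similarity searches based on intent.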

from pymongo import ReplaceOne

# Update the collection with the embeddings
requests = []

for doc in collection.find({'plot':{"$exists": True}}).limit(500):
  doc[EMBEDDING_FIELD_NAME] = generate_embedding(doc['plot'])
  requests.append(ReplaceOne({'_id': doc['_id']}, doc))

collection.bulk_write(requests)

BulkWriteResult({'writeErrors': [], 'writeConcernErrors': [], 'nInserted': 0, 'nUpserted': 0, 'nMatched': 50, 'nModified': 50, 'nRemoved': 0, 'upserted': []}, acknowledged=True)

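After the operation above runs, documents in the "movies" collection contain an additional embedding field, whose name is set by the EMBEDDING_FIELD_NAME variable, alongside the existing fields such as title, plot, genres, cast, and directors.

Note: To save time, we limit this to 500 documents. Running it over all 23,000+ documents in our sample_mflix database may take a while. Alternatively, you can use the sample_mflix.embedded_movies collection, which includes a pre-populated plot_embedding field containing embeddings created with OpenAI's text-embedding-3-small embedding model, and use it with the Atlas Search vector search feature.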
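The loop in this step rewrites each whole document with ReplaceOne. An alternative worth considering is to set only the embedding field with a $set update, which leaves the other fields untouched. The following is a minimal sketch, assuming a hypothetical helper build_embedding_updates and an embed callable such as generate_embedding; it returns plain (filter, update) pairs so the logic can be checked without a database.

```python
from typing import Any, Callable, Iterable


def build_embedding_updates(
    docs: Iterable[dict[str, Any]],
    embed: Callable[[str], list[float]],
    field_name: str,
) -> list[tuple[dict[str, Any], dict[str, Any]]]:
    """Return (filter, update) pairs that add an embedding of each doc's plot."""
    updates = []
    for doc in docs:
        plot = doc.get("plot")
        if not plot:
            # Skip documents without a usable plot instead of embedding empty text.
            continue
        updates.append(({"_id": doc["_id"]}, {"$set": {field_name: embed(plot)}}))
    return updates


# Hypothetical usage with the names defined in this notebook:
# from pymongo import UpdateOne
# pairs = build_embedding_updates(
#     collection.find({"plot": {"$exists": True}}).limit(500),
#     generate_embedding,
#     EMBEDDING_FIELD_NAME,
# )
# collection.bulk_write([UpdateOne(f, u) for f, u in pairs])
```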

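Step 4: Create a vector search index

We create an Atlas Vector Search index on this collection, which lets us run approximate KNN searches and so supports semantic search. We cover two ways to create the index: the Atlas UI and the MongoDB Python driver.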
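To make "approximate KNN" concrete, it helps to see the exact search it approximates: score every stored vector against the query and keep the top k. The following is a toy sketch in plain Python with made-up three-dimensional vectors, not real embeddings and not how Atlas is implemented. The vectors are normalized to unit length first, so the dot product equals cosine similarity.

```python
import math


def normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def exact_knn(query: list[float], vectors: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (name, score) pairs with the highest dot-product similarity."""
    q = normalize(query)
    scored = [(name, dot(q, normalize(vec))) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up toy vectors for illustration only.
toy_vectors = {
    "space war": [0.9, 0.1, 0.0],
    "romance": [0.0, 0.2, 0.9],
    "alien invasion": [0.8, 0.3, 0.1],
}
top_two = exact_knn([1.0, 0.2, 0.0], toy_vectors, k=2)
print([name for name, _ in top_two])
```

An exact scan like this touches every vector on every query. An approximate index trades a small amount of accuracy for much faster lookups on large collections.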

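(Optional) Documentation: Create a Vector Search Index

Now go to the Atlas UI and create an Atlas Vector Search index by following the steps described in that documentation. The "dimensions" value of 1536 matches the length of the vectors produced by the text-embedding-3-small model used above (OpenAI's text-embedding-ada-002 has the same dimensionality).

Use the following definition in the Atlas UI JSON editor: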

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding_openai_nov19_23": {
        "dimensions": 1536,
        "similarity": "dotProduct",
        "type": "knnVector"
      }
    }
  }
}

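(Optional) Alternatively, we can create the vector search index programmatically with the pymongo driver. The Python command in the cell below creates the index (this works only with the latest version of the Python driver and an Atlas cluster running MongoDB server version 7.0 or later).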

collection.create_search_index(
    {"definition":
        {"mappings": {"dynamic": True, "fields": {
            EMBEDDING_FIELD_NAME : {
                "dimensions": 1536,
                "similarity": "dotProduct",
                "type": "knnVector"
                }}}},
     "name": ATLAS_VECTOR_SEARCH_INDEX_NAME
    }
)

'default'
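The index only works if the mapped field name and the dimension count match the stored embeddings, so it can help to generate the definition from the same constants used elsewhere in the notebook. The following is a minimal sketch, assuming a hypothetical helper knn_index_definition; it only builds the dictionary and does not contact Atlas.

```python
def knn_index_definition(field_name: str, dimensions: int, similarity: str = "dotProduct") -> dict:
    """Build a knnVector index definition for a single embedding field."""
    allowed = {"euclidean", "cosine", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    if dimensions < 1:
        raise ValueError("dimensions must be a positive integer")
    return {
        "mappings": {
            "dynamic": True,
            "fields": {
                field_name: {
                    "dimensions": dimensions,
                    "similarity": similarity,
                    "type": "knnVector",
                }
            },
        }
    }


# Hypothetical usage with the names defined in this notebook:
# collection.create_search_index({
#     "definition": knn_index_definition(EMBEDDING_FIELD_NAME, 1536),
#     "name": ATLAS_VECTOR_SEARCH_INDEX_NAME,
# })
```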

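Step 5: Query your data

The query below finds movies whose plots are semantically similar to the text in the query string, rather than matching on keywords.

(Optional) Documentation: Run Vector Search Queries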

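def query_results(query, k):
  # Return the k movies whose plot embeddings are closest to the query text.
  results = collection.aggregate([
    {
        '$vectorSearch': {
            "index": ATLAS_VECTOR_SEARCH_INDEX_NAME,
            "path": EMBEDDING_FIELD_NAME,
            "queryVector": generate_embedding(query),
            "numCandidates": 50,  # nearest neighbors considered during the search
            "limit": k,  # use the caller's k instead of a hard-coded value
        }
    }
    ])
  return results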

query="imaginary characters from outerspace at war with earthlings"
movies = query_results(query, 5)

for movie in movies:
    print(f'Movie Name: {movie["title"]},\nMovie Plot: {movie["plot"]}\n')
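The query above returns whole documents, including the long embedding array. A follow-up $project stage can trim the output and expose the similarity score through the vectorSearchScore metadata field. The following is a minimal sketch, assuming a hypothetical helper build_vector_search_pipeline; it only builds the pipeline, so the guard and the structure can be checked without a cluster.

```python
def build_vector_search_pipeline(
    index_name: str,
    path: str,
    query_vector: list[float],
    k: int,
    num_candidates: int = 50,
) -> list[dict]:
    """Build a $vectorSearch pipeline that returns title, plot, and score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if k > num_candidates:
        raise ValueError("k (limit) cannot exceed num_candidates")
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": k,
            }
        },
        {
            # Keep only the readable fields and attach the similarity score.
            "$project": {
                "_id": 0,
                "title": 1,
                "plot": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


# Hypothetical usage with the names defined in this notebook:
# pipeline = build_vector_search_pipeline(
#     ATLAS_VECTOR_SEARCH_INDEX_NAME, EMBEDDING_FIELD_NAME, generate_embedding(query), k=5
# )
# for movie in collection.aggregate(pipeline):
#     print(movie["title"], movie["score"])
```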