跳到主要内容

使用Chroma进行嵌入搜索

nbviewer

本笔记本将带您完成一个简单的流程,下载一些数据,对其进行嵌入,然后使用一些向量数据库进行索引和搜索。这是客户经常需要的需求,他们希望在安全环境中存储和搜索我们的嵌入数据,以支持生产用例,如聊天机器人、主题建模等。

什么是向量数据库

向量数据库是一种用于存储、管理和搜索嵌入向量的数据库。近年来,使用嵌入来将非结构化数据(文本、音频、视频等)编码为向量,供机器学习模型使用的做法已经蓬勃发展,这是由于人工智能在解决涉及自然语言、图像识别和其他非结构化数据形式的用例时日益有效。向量数据库已经成为企业提供和扩展这些用例的有效解决方案。

为什么使用向量数据库

向量数据库使企业能够利用我们在这个存储库中分享的许多嵌入用例(例如问答、聊天机器人和推荐服务),并在安全、可扩展的环境中使用它们。许多客户在小规模上使用嵌入来解决问题,但性能和安全性阻碍了它们投入生产 - 我们认为向量数据库是解决这一问题的关键组成部分,在本指南中,我们将介绍嵌入文本数据的基础知识,将其存储在向量数据库中,并将其用于语义搜索。

演示流程

演示流程如下: - 设置:导入包并设置任何必需的变量 - 加载数据:加载数据集并使用OpenAI嵌入进行嵌入 - Chroma: - 设置:在这里,我们将设置Chroma的Python客户端。更多详情请查看这里 - 索引数据:我们将为__标题__和__内容__创建带有向量的集合 - 搜索数据:我们将运行一些搜索以确认其有效性

完成本笔记后,您应该对如何设置和使用向量数据库有基本的了解,并可以继续进行更复杂的用例,利用我们的嵌入。

设置

导入所需的库并设置我们想要使用的嵌入模型。

# 确保已安装 OpenAI 库。
%pip install openai

# 我们需要安装Chroma客户端。
%pip install chromadb

# 安装 wget 以拉取 zip 文件
%pip install wget

# 安装用于数据处理的numpy库
%pip install numpy

Collecting openai
Obtaining dependency information for openai from https://files.pythonhosted.org/packages/67/78/7588a047e458cb8075a4089d721d7af5e143ff85a2388d4a28c530be0494/openai-0.27.8-py3-none-any.whl.metadata
Downloading openai-0.27.8-py3-none-any.whl.metadata (13 kB)
Collecting requests>=2.20 (from openai)
Obtaining dependency information for requests>=2.20 from https://files.pythonhosted.org/packages/70/8e/0e2d847013cb52cd35b38c009bb167a1a26b2ce6cd6965bf26b47bc0bf44/requests-2.31.0-py3-none-any.whl.metadata
Using cached requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from openai)
Using cached tqdm-4.65.0-py3-none-any.whl (77 kB)
Collecting aiohttp (from openai)
Obtaining dependency information for aiohttp from https://files.pythonhosted.org/packages/fa/9e/49002fde2a97d7df0e162e919c31cf13aa9f184537739743d1239edd0e67/aiohttp-3.8.5-cp310-cp310-macosx_11_0_arm64.whl.metadata
Downloading aiohttp-3.8.5-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting charset-normalizer<4,>=2 (from requests>=2.20->openai)
Obtaining dependency information for charset-normalizer<4,>=2 from https://files.pythonhosted.org/packages/ec/a7/96835706283d63fefbbbb4f119d52f195af00fc747e67cc54397c56312c8/charset_normalizer-3.2.0-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached charset_normalizer-3.2.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (31 kB)
Collecting idna<4,>=2.5 (from requests>=2.20->openai)
Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting urllib3<3,>=1.21.1 (from requests>=2.20->openai)
Obtaining dependency information for urllib3<3,>=1.21.1 from https://files.pythonhosted.org/packages/9b/81/62fd61001fa4b9d0df6e31d47ff49cfa9de4af03adecf339c7bc30656b37/urllib3-2.0.4-py3-none-any.whl.metadata
Downloading urllib3-2.0.4-py3-none-any.whl.metadata (6.6 kB)
Collecting certifi>=2017.4.17 (from requests>=2.20->openai)
Using cached certifi-2023.5.7-py3-none-any.whl (156 kB)
Collecting attrs>=17.3.0 (from aiohttp->openai)
Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting multidict<7.0,>=4.5 (from aiohttp->openai)
Using cached multidict-6.0.4-cp310-cp310-macosx_11_0_arm64.whl (29 kB)
Collecting async-timeout<5.0,>=4.0.0a3 (from aiohttp->openai)
Using cached async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0 (from aiohttp->openai)
Using cached yarl-1.9.2-cp310-cp310-macosx_11_0_arm64.whl (62 kB)
Collecting frozenlist>=1.1.1 (from aiohttp->openai)
Obtaining dependency information for frozenlist>=1.1.1 from https://files.pythonhosted.org/packages/67/6a/55a49da0fa373ac9aa49ccd5b6393ecc183e2a0904d9449ea3ee1163e0b1/frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata
Downloading frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (5.2 kB)
Collecting aiosignal>=1.1.2 (from aiohttp->openai)
Using cached aiosignal-1.3.1-py3-none-any.whl (7.6 kB)
Using cached openai-0.27.8-py3-none-any.whl (73 kB)
Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Downloading aiohttp-3.8.5-cp310-cp310-macosx_11_0_arm64.whl (343 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 343.9/343.9 kB 11.4 MB/s eta 0:00:00
Using cached charset_normalizer-3.2.0-cp310-cp310-macosx_11_0_arm64.whl (124 kB)
Downloading frozenlist-1.4.0-cp310-cp310-macosx_11_0_arm64.whl (46 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 46.0/46.0 kB 4.4 MB/s eta 0:00:00
Downloading urllib3-2.0.4-py3-none-any.whl (123 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 123.9/123.9 kB 20.0 MB/s eta 0:00:00
Installing collected packages: urllib3, tqdm, multidict, idna, frozenlist, charset-normalizer, certifi, attrs, async-timeout, yarl, requests, aiosignal, aiohttp, openai
Successfully installed aiohttp-3.8.5 aiosignal-1.3.1 async-timeout-4.0.2 attrs-23.1.0 certifi-2023.5.7 charset-normalizer-3.2.0 frozenlist-1.4.0 idna-3.4 multidict-6.0.4 openai-0.27.8 requests-2.31.0 tqdm-4.65.0 urllib3-2.0.4 yarl-1.9.2
Note: you may need to restart the kernel to use updated packages.
Collecting chromadb
Obtaining dependency information for chromadb from https://files.pythonhosted.org/packages/47/b7/41d975f02818c965cdb8a119cab5a38cfb08e0c1abb18efebe9a373ea97b/chromadb-0.4.2-py3-none-any.whl.metadata
Downloading chromadb-0.4.2-py3-none-any.whl.metadata (6.9 kB)
Collecting pandas>=1.3 (from chromadb)
Obtaining dependency information for pandas>=1.3 from https://files.pythonhosted.org/packages/4a/f6/f620ca62365d83e663a255a41b08d2fc2eaf304e0b8b21bb6d62a7390fe3/pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (18 kB)
Requirement already satisfied: requests>=2.28 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from chromadb) (2.31.0)
Collecting pydantic<2.0,>=1.9 (from chromadb)
Obtaining dependency information for pydantic<2.0,>=1.9 from https://files.pythonhosted.org/packages/79/3e/6b4d0fb2174beceac9a991ba8e67158b45c35faca9ea4545ae32d47096cd/pydantic-1.10.11-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached pydantic-1.10.11-cp310-cp310-macosx_11_0_arm64.whl.metadata (148 kB)
Collecting chroma-hnswlib==0.7.1 (from chromadb)
Obtaining dependency information for chroma-hnswlib==0.7.1 from https://files.pythonhosted.org/packages/a5/d5/54947127f5cb2a1fcef40877fb3e6044495eec0a158ba0956babe4ab2a77/chroma_hnswlib-0.7.1-cp310-cp310-macosx_13_0_arm64.whl.metadata
Using cached chroma_hnswlib-0.7.1-cp310-cp310-macosx_13_0_arm64.whl.metadata (252 bytes)
Collecting fastapi<0.100.0,>=0.95.2 (from chromadb)
Obtaining dependency information for fastapi<0.100.0,>=0.95.2 from https://files.pythonhosted.org/packages/73/eb/03b691afa0b5ffa1e93ed34f97ec1e7855c758efbdcfb16c209af0b0506b/fastapi-0.99.1-py3-none-any.whl.metadata
Using cached fastapi-0.99.1-py3-none-any.whl.metadata (23 kB)
Collecting uvicorn[standard]>=0.18.3 (from chromadb)
Obtaining dependency information for uvicorn[standard]>=0.18.3 from https://files.pythonhosted.org/packages/5d/07/b9eac057f7efa56900640a233c1ed63db83568322c6bcbabe98f741d5289/uvicorn-0.23.1-py3-none-any.whl.metadata
Using cached uvicorn-0.23.1-py3-none-any.whl.metadata (6.2 kB)
Collecting numpy>=1.21.6 (from chromadb)
Obtaining dependency information for numpy>=1.21.6 from https://files.pythonhosted.org/packages/1b/cd/9e8313ffd849626c836fffd7881296a74f53a7739bd9ce7a6e22b1fc843b/numpy-1.25.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached numpy-1.25.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (5.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
Using cached posthog-3.0.1-py2.py3-none-any.whl (37 kB)
Requirement already satisfied: typing-extensions>=4.5.0 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from chromadb) (4.7.1)
Collecting pulsar-client>=3.1.0 (from chromadb)
Obtaining dependency information for pulsar-client>=3.1.0 from https://files.pythonhosted.org/packages/43/85/ab0455008ce3335a1c75a7c500fd8921ab166f34821fa67dc91ae9687a40/pulsar_client-3.2.0-cp310-cp310-macosx_10_15_universal2.whl.metadata
Using cached pulsar_client-3.2.0-cp310-cp310-macosx_10_15_universal2.whl.metadata (1.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
Obtaining dependency information for onnxruntime>=1.14.1 from https://files.pythonhosted.org/packages/cf/06/0c6e355b9ddbebc34d0e21bc5be1e4bd2c124ebd9030525838fa6e65eaa8/onnxruntime-1.15.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached onnxruntime-1.15.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (4.0 kB)
Collecting tokenizers>=0.13.2 (from chromadb)
Using cached tokenizers-0.13.3-cp310-cp310-macosx_12_0_arm64.whl (3.9 MB)
Collecting pypika>=0.48.9 (from chromadb)
Using cached PyPika-0.48.9-py2.py3-none-any.whl
Requirement already satisfied: tqdm>=4.65.0 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from chromadb) (4.65.0)
Collecting overrides>=7.3.1 (from chromadb)
Using cached overrides-7.3.1-py3-none-any.whl (17 kB)
Collecting importlib-resources (from chromadb)
Obtaining dependency information for importlib-resources from https://files.pythonhosted.org/packages/29/d1/bed03eca30aa05aaf6e0873de091f9385c48705c4a607c2dfe3edbe543e8/importlib_resources-6.0.0-py3-none-any.whl.metadata
Using cached importlib_resources-6.0.0-py3-none-any.whl.metadata (4.2 kB)
Collecting starlette<0.28.0,>=0.27.0 (from fastapi<0.100.0,>=0.95.2->chromadb)
Obtaining dependency information for starlette<0.28.0,>=0.27.0 from https://files.pythonhosted.org/packages/58/f8/e2cca22387965584a409795913b774235752be4176d276714e15e1a58884/starlette-0.27.0-py3-none-any.whl.metadata
Using cached starlette-0.27.0-py3-none-any.whl.metadata (5.8 kB)
Collecting coloredlogs (from onnxruntime>=1.14.1->chromadb)
Using cached coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting flatbuffers (from onnxruntime>=1.14.1->chromadb)
Obtaining dependency information for flatbuffers from https://files.pythonhosted.org/packages/6f/12/d5c79ee252793ffe845d58a913197bfa02ae9a0b5c9bc3dc4b58d477b9e7/flatbuffers-23.5.26-py2.py3-none-any.whl.metadata
Using cached flatbuffers-23.5.26-py2.py3-none-any.whl.metadata (850 bytes)
Requirement already satisfied: packaging in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from onnxruntime>=1.14.1->chromadb) (23.1)
Collecting protobuf (from onnxruntime>=1.14.1->chromadb)
Obtaining dependency information for protobuf from https://files.pythonhosted.org/packages/cb/d3/a164038605494d49acc4f9cda1c0bc200b96382c53edd561387263bb181d/protobuf-4.23.4-cp37-abi3-macosx_10_9_universal2.whl.metadata
Using cached protobuf-4.23.4-cp37-abi3-macosx_10_9_universal2.whl.metadata (540 bytes)
Collecting sympy (from onnxruntime>=1.14.1->chromadb)
Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from pandas>=1.3->chromadb) (2.8.2)
Collecting pytz>=2020.1 (from pandas>=1.3->chromadb)
Using cached pytz-2023.3-py2.py3-none-any.whl (502 kB)
Collecting tzdata>=2022.1 (from pandas>=1.3->chromadb)
Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Requirement already satisfied: six>=1.5 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from posthog>=2.4.0->chromadb) (1.16.0)
Collecting monotonic>=1.5 (from posthog>=2.4.0->chromadb)
Using cached monotonic-1.6-py2.py3-none-any.whl (8.2 kB)
Collecting backoff>=1.10.0 (from posthog>=2.4.0->chromadb)
Using cached backoff-2.2.1-py3-none-any.whl (15 kB)
Requirement already satisfied: certifi in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from pulsar-client>=3.1.0->chromadb) (2023.5.7)
Requirement already satisfied: charset-normalizer<4,>=2 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from requests>=2.28->chromadb) (3.2.0)
Requirement already satisfied: idna<4,>=2.5 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from requests>=2.28->chromadb) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (from requests>=2.28->chromadb) (2.0.4)
Collecting click>=7.0 (from uvicorn[standard]>=0.18.3->chromadb)
Obtaining dependency information for click>=7.0 from https://files.pythonhosted.org/packages/1a/70/e63223f8116931d365993d4a6b7ef653a4d920b41d03de7c59499962821f/click-8.1.6-py3-none-any.whl.metadata
Using cached click-8.1.6-py3-none-any.whl.metadata (3.0 kB)
Collecting h11>=0.8 (from uvicorn[standard]>=0.18.3->chromadb)
Using cached h11-0.14.0-py3-none-any.whl (58 kB)
Collecting httptools>=0.5.0 (from uvicorn[standard]>=0.18.3->chromadb)
Obtaining dependency information for httptools>=0.5.0 from https://files.pythonhosted.org/packages/8f/71/d535e9f6967958d21b8fe1baeb7efb6304b86e8fcff44d0bda8690e0aec9/httptools-0.6.0-cp310-cp310-macosx_10_9_universal2.whl.metadata
Using cached httptools-0.6.0-cp310-cp310-macosx_10_9_universal2.whl.metadata (3.6 kB)
Collecting python-dotenv>=0.13 (from uvicorn[standard]>=0.18.3->chromadb)
Using cached python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Collecting pyyaml>=5.1 (from uvicorn[standard]>=0.18.3->chromadb)
Obtaining dependency information for pyyaml>=5.1 from https://files.pythonhosted.org/packages/5b/07/10033a403b23405a8fc48975444463d3d10a5c2736b7eb2550b07b367429/PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl.metadata
Using cached PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl.metadata (2.1 kB)
Collecting uvloop!=0.15.0,!=0.15.1,>=0.14.0 (from uvicorn[standard]>=0.18.3->chromadb)
Using cached uvloop-0.17.0-cp310-cp310-macosx_10_9_universal2.whl (2.1 MB)
Collecting watchfiles>=0.13 (from uvicorn[standard]>=0.18.3->chromadb)
Using cached watchfiles-0.19.0-cp37-abi3-macosx_11_0_arm64.whl (388 kB)
Collecting websockets>=10.4 (from uvicorn[standard]>=0.18.3->chromadb)
Using cached websockets-11.0.3-cp310-cp310-macosx_11_0_arm64.whl (121 kB)
Collecting anyio<5,>=3.4.0 (from starlette<0.28.0,>=0.27.0->fastapi<0.100.0,>=0.95.2->chromadb)
Obtaining dependency information for anyio<5,>=3.4.0 from https://files.pythonhosted.org/packages/19/24/44299477fe7dcc9cb58d0a57d5a7588d6af2ff403fdd2d47a246c91a3246/anyio-3.7.1-py3-none-any.whl.metadata
Using cached anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntime>=1.14.1->chromadb)
Using cached humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Collecting mpmath>=0.19 (from sympy->onnxruntime>=1.14.1->chromadb)
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Collecting sniffio>=1.1 (from anyio<5,>=3.4.0->starlette<0.28.0,>=0.27.0->fastapi<0.100.0,>=0.95.2->chromadb)
Using cached sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting exceptiongroup (from anyio<5,>=3.4.0->starlette<0.28.0,>=0.27.0->fastapi<0.100.0,>=0.95.2->chromadb)
Obtaining dependency information for exceptiongroup from https://files.pythonhosted.org/packages/fe/17/f43b7c9ccf399d72038042ee72785c305f6c6fdc6231942f8ab99d995742/exceptiongroup-1.1.2-py3-none-any.whl.metadata
Using cached exceptiongroup-1.1.2-py3-none-any.whl.metadata (6.1 kB)
Downloading chromadb-0.4.2-py3-none-any.whl (399 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 399.3/399.3 kB 12.8 MB/s eta 0:00:00
Using cached chroma_hnswlib-0.7.1-cp310-cp310-macosx_13_0_arm64.whl (195 kB)
Using cached fastapi-0.99.1-py3-none-any.whl (58 kB)
Using cached numpy-1.25.1-cp310-cp310-macosx_11_0_arm64.whl (14.0 MB)
Using cached onnxruntime-1.15.1-cp310-cp310-macosx_11_0_arm64.whl (6.1 MB)
Using cached pandas-2.0.3-cp310-cp310-macosx_11_0_arm64.whl (10.8 MB)
Using cached pulsar_client-3.2.0-cp310-cp310-macosx_10_15_universal2.whl (10.8 MB)
Using cached pydantic-1.10.11-cp310-cp310-macosx_11_0_arm64.whl (2.5 MB)
Using cached importlib_resources-6.0.0-py3-none-any.whl (31 kB)
Using cached click-8.1.6-py3-none-any.whl (97 kB)
Using cached httptools-0.6.0-cp310-cp310-macosx_10_9_universal2.whl (237 kB)
Using cached PyYAML-6.0.1-cp310-cp310-macosx_11_0_arm64.whl (169 kB)
Using cached starlette-0.27.0-py3-none-any.whl (66 kB)
Using cached flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Using cached protobuf-4.23.4-cp37-abi3-macosx_10_9_universal2.whl (400 kB)
Using cached uvicorn-0.23.1-py3-none-any.whl (59 kB)
Using cached anyio-3.7.1-py3-none-any.whl (80 kB)
Using cached exceptiongroup-1.1.2-py3-none-any.whl (14 kB)
Installing collected packages: tokenizers, pytz, pypika, mpmath, monotonic, flatbuffers, websockets, uvloop, tzdata, sympy, sniffio, pyyaml, python-dotenv, pydantic, pulsar-client, protobuf, overrides, numpy, importlib-resources, humanfriendly, httptools, h11, exceptiongroup, click, backoff, uvicorn, posthog, pandas, coloredlogs, chroma-hnswlib, anyio, watchfiles, starlette, onnxruntime, fastapi, chromadb
Successfully installed anyio-3.7.1 backoff-2.2.1 chroma-hnswlib-0.7.1 chromadb-0.4.2 click-8.1.6 coloredlogs-15.0.1 exceptiongroup-1.1.2 fastapi-0.99.1 flatbuffers-23.5.26 h11-0.14.0 httptools-0.6.0 humanfriendly-10.0 importlib-resources-6.0.0 monotonic-1.6 mpmath-1.3.0 numpy-1.25.1 onnxruntime-1.15.1 overrides-7.3.1 pandas-2.0.3 posthog-3.0.1 protobuf-4.23.4 pulsar-client-3.2.0 pydantic-1.10.11 pypika-0.48.9 python-dotenv-1.0.0 pytz-2023.3 pyyaml-6.0.1 sniffio-1.3.0 starlette-0.27.0 sympy-1.12 tokenizers-0.13.3 tzdata-2023.3 uvicorn-0.23.1 uvloop-0.17.0 watchfiles-0.19.0 websockets-11.0.3
Note: you may need to restart the kernel to use updated packages.
Collecting wget
Using cached wget-3.2.zip (10 kB)
Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
Building wheel for wget (setup.py) ... done
Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=b2d83c5fcdeab398d0a4e9808a470bbf725fffea4a6130e731c6097b9561005b
Stored in directory: /Users/antontroynikov/Library/Caches/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2
Note: you may need to restart the kernel to use updated packages.
Requirement already satisfied: numpy in /Users/antontroynikov/miniforge3/envs/chroma-openai-cookbook/lib/python3.10/site-packages (1.25.1)
Note: you may need to restart the kernel to use updated packages.
import openai
import pandas as pd
import os
import wget
from ast import literal_eval

# Chroma's client library for Python
import chromadb

# I've set this to our new embeddings model, this can be changed to the embedding model of your choice
EMBEDDING_MODEL = "text-embedding-3-small"

# 忽略未关闭的SSL套接字警告——如果你遇到这些错误,可以选择此项。
import warnings

warnings.filterwarnings(action="ignore", message="unclosed", category=ResourceWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

加载数据

在本部分,我们将加载在本次会话之前准备好的嵌入数据。

embeddings_url = 'https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip'

# 文件大小约为700MB,因此需要一些时间来完成。
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'
import zipfile
with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
zip_ref.extractall("../data")

article_df = pd.read_csv('../data/vector_database_wikipedia_articles_embedded.csv')

article_df.head()

id url title text title_vector content_vector vector_id
0 1 https://simple.wikipedia.org/wiki/April April April is the fourth month of the year in the J... [0.001009464613161981, -0.020700545981526375, ... [-0.011253940872848034, -0.013491976074874401,... 0
1 2 https://simple.wikipedia.org/wiki/August August August (Aug.) is the eighth month of the year ... [0.0009286514250561595, 0.000820168002974242, ... [0.0003609954728744924, 0.007262262050062418, ... 1
2 6 https://simple.wikipedia.org/wiki/Art Art Art is a creative activity that expresses imag... [0.003393713850528002, 0.0061537534929811954, ... [-0.004959689453244209, 0.015772193670272827, ... 2
3 8 https://simple.wikipedia.org/wiki/A A A or a is the first letter of the English alph... [0.0153952119871974, -0.013759135268628597, 0.... [0.024894846603274345, -0.022186409682035446, ... 3
4 9 https://simple.wikipedia.org/wiki/Air Air Air refers to the Earth's atmosphere. Air is a... [0.02224554680287838, -0.02044147066771984, -0... [0.021524671465158463, 0.018522677943110466, -... 4
# 从字符串中读取向量并将其转换为列表
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# 将 `vector_id` 设置为一个字符串
article_df['vector_id'] = article_df['vector_id'].apply(str)

article_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 25000 non-null int64
1 url 25000 non-null object
2 title 25000 non-null object
3 text 25000 non-null object
4 title_vector 25000 non-null object
5 content_vector 25000 non-null object
6 vector_id 25000 non-null object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB

色度

我们将在一个向量数据库中索引这些嵌入式文档并进行搜索。我们将首先看一下的选项是Chroma,这是一个易于使用的开源自托管内存中向量数据库,专为与LLM一起使用嵌入式设计。

在本节中,我们将: - 实例化Chroma客户端 - 为每个嵌入类创建集合 - 查询每个集合

实例化Chroma客户端

创建Chroma客户端。默认情况下,Chroma是临时的,运行在内存中。 然而,您可以轻松地设置一个持久化配置,将数据写入磁盘。

chroma_client = chromadb.EphemeralClient() # 相当于 chromadb.Client(),临时性的。
# 取消注释以启用持久客户端
# chroma_client = chromadb.PersistentClient()

创建集合

Chroma集合允许您存储和过滤具有任意元数据的数据,从而可以轻松查询嵌入数据的子集。

Chroma已经集成了OpenAI的嵌入函数。最佳方法是在构建集合时使用它们,如下所示。 或者,您可以“自己带嵌入”。更多信息可以在这里找到。

from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# 确保你的OpenAI API密钥已正确设置为环境变量。
# 注意:如果您在本地运行此笔记本,您需要重新加载终端和笔记本,以使环境变量生效。

# 注意:或者,您也可以像这样设置一个临时的环境变量:
# os.environ["OPENAI_API_KEY"] = 'sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

if os.getenv("OPENAI_API_KEY") is not None:
openai.api_key = os.getenv("OPENAI_API_KEY")
print ("OPENAI_API_KEY is ready")
else:
print ("OPENAI_API_KEY environment variable not found")


embedding_function = OpenAIEmbeddingFunction(api_key=os.environ.get('OPENAI_API_KEY'), model_name=EMBEDDING_MODEL)

wikipedia_content_collection = chroma_client.create_collection(name='wikipedia_content', embedding_function=embedding_function)
wikipedia_title_collection = chroma_client.create_collection(name='wikipedia_titles', embedding_function=embedding_function)

OPENAI_API_KEY is ready

填充集合

Chroma集合允许您填充和过滤任何您喜欢的元数据。Chroma还可以将文本存储在向量旁边,并在更方便时在单个query调用中返回所有内容。

对于这种用例,我们将只存储嵌入和ID,并使用它们来索引原始数据框。

# 添加内容向量
wikipedia_content_collection.add(
ids=article_df.vector_id.tolist(),
embeddings=article_df.content_vector.tolist(),
)

# 添加标题向量
wikipedia_title_collection.add(
ids=article_df.vector_id.tolist(),
embeddings=article_df.title_vector.tolist(),
)

搜索集合

如果设置了嵌入函数,Chroma会为您处理嵌入查询,就像这个例子中一样。

def query_collection(collection, query, max_results, dataframe):
results = collection.query(query_texts=query, n_results=max_results, include=['distances'])
df = pd.DataFrame({
'id':results['ids'][0],
'score':results['distances'][0],
'title': dataframe[dataframe.vector_id.isin(results['ids'][0])]['title'],
'content': dataframe[dataframe.vector_id.isin(results['ids'][0])]['text'],
})

return df

title_query_result = query_collection(
collection=wikipedia_title_collection,
query="modern art in Europe",
max_results=10,
dataframe=article_df
)
title_query_result.head()

id score title content
2 23266 0.249646 Art Art is a creative activity that expresses imag...
11777 15436 0.271688 Hellenistic art The art of the Hellenistic time (from 400 B.C....
12178 23265 0.279306 Byzantine art Byzantine art is a form of Christian Greek art...
13215 11777 0.294415 Art film Art films are a type of movie that is very dif...
15436 22108 0.305937 Renaissance art Many of the most famous and best-loved works o...
content_query_result = query_collection(
collection=wikipedia_content_collection,
query="Famous battles in Scottish history",
max_results=10,
dataframe=article_df
)
content_query_result.head()

id score title content
2923 13135 0.261328 1651 Events January 1 – Charles II crowned K...
3694 13571 0.277058 Stirling Stirling () is a city in the middle of Scotlan...
6248 2923 0.294823 841 Events June 25: Battle of Fontenay – Lo...
6297 13568 0.300756 1746 Events January 8 – Bonnie Prince Charli...
11702 11708 0.307572 William Wallace William Wallace was a Scottish knight who foug...

现在您已经运行了基本的嵌入搜索,您可以跳转到Chroma文档了解如何向查询添加过滤器、更新/删除集合中的数据以及部署Chroma。