Couchbase Vector Store
Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI by providing coding assistance for developers and vector search for their applications.
Vector search is part of the Full Text Search Service (Search Service) in Couchbase.
This tutorial explains how to use vector search in Couchbase. You can use it with both Couchbase Capella and a self-managed Couchbase Server.
Installation
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-couchbase
!pip install llama-index
创建Couchbase连接¶
我们首先创建一个到Couchbase集群的连接,然后将集群对象传递给Vector Store。
在这里,我们使用用户名和密码进行连接。您也可以使用任何其他支持的方式连接到您的集群。
有关连接到Couchbase集群的更多信息,请查看Python SDK文档。
COUCHBASE_CONNECTION_STRING = (
    "couchbase://localhost"  # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
from datetime import timedelta
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
# Connect to the cluster with username/password authentication
auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)
# Wait until the cluster is ready for use, up to 5 seconds
cluster.wait_until_ready(timedelta(seconds=5))
Creating the Search Index
Currently, the search index needs to be created from the Couchbase Capella or Server UI, or using the REST interface.
Let us define a search index named vector-index on the testing bucket.
For this example, let us use the Import Index feature in the Search Service on the UI.
We are defining an index on the _default collection in the _default scope of the testing bucket, with the vector field set to embedding with 1536 dimensions and the text field set to text. We are also indexing and storing all the fields under metadata in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to dot_product.
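As a side note on the similarity metric: dot_product ranks candidates by the dot product of the query and document embeddings, and for unit-length embeddings (OpenAI embeddings are normalized) this is equivalent to cosine similarity. A minimal illustration of the computation, using toy 3-dimensional vectors instead of the real 1536-dimensional embeddings:

```python
# Sketch: how dot_product similarity compares two embedding vectors.
# Real embeddings have 1536 dimensions; 3 are used here for brevity.

def dot_product(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0, 0.0]     # unit-length query embedding
doc_same = [1.0, 0.0, 0.0]  # identical direction -> maximal score
doc_orth = [0.0, 1.0, 0.0]  # orthogonal direction -> no similarity
doc_mid = [0.5, 0.5, 0.0]   # partial overlap -> intermediate score

print(dot_product(query, doc_same))  # 1.0
print(dot_product(query, doc_orth))  # 0.0
print(dot_product(query, doc_mid))   # 0.5
```

A higher dot product means the document embedding points in a more similar direction to the query, which is why it works as a relevance score.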
How do you import an index into the Full Text Search service?
- Couchbase Server
  - Click on Search -> Add Index -> Import
  - Copy the index definition below into the Import screen
  - Click on Create Index to create the index.
- Couchbase Capella
  - Copy the index definition below into a new file, index.json
  - Import the file in Capella using the instructions in the documentation.
  - Click on Create Index to create the index.
Index Definition
{
"name": "vector-index",
"type": "fulltext-index",
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "type_field",
"type_field": "type"
},
"mapping": {
"default_analyzer": "standard",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"dynamic": true,
"enabled": true,
"properties": {
"metadata": {
"dynamic": true,
"enabled": true
},
"embedding": {
"enabled": true,
"dynamic": false,
"fields": [
{
"dims": 1536,
"index": true,
"name": "embedding",
"similarity": "dot_product",
"type": "vector",
"vector_index_optimized_for": "recall"
}
]
},
"text": {
"enabled": true,
"dynamic": false,
"fields": [
{
"index": true,
"name": "text",
"store": true,
"type": "text"
}
]
}
}
},
"default_type": "_default",
"docvalues_dynamic": false,
"index_dynamic": true,
"store_dynamic": true,
"type_field": "_type"
},
"store": {
"indexType": "scorch",
"segmentVersion": 16
}
},
"sourceType": "gocbcore",
"sourceName": "testing",
"sourceParams": {},
"planParams": {
"maxPartitionsPerPIndex": 103,
"indexPartitions": 10,
"numReplicas": 0
}
}
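As mentioned above, the index can also be created over the Search service's REST interface instead of the UI. A hedged sketch, assuming the index definition above is saved as index.json, the Search service is reachable on its default port 8094, and the credentials from earlier are used; the helper name index_request is hypothetical:

```python
import base64
import json
import urllib.request

# Assumption: Search service on localhost at its default port 8094.
SEARCH_HOST = "http://localhost:8094"

def index_request(
    name: str, definition: dict, username: str, password: str
) -> urllib.request.Request:
    """Build the PUT request that creates/updates a Search index (hypothetical helper)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{SEARCH_HOST}/api/index/{name}",
        data=json.dumps(definition).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

# To actually create the index against a running cluster:
# with open("index.json") as f:
#     definition = json.load(f)
# urllib.request.urlopen(index_request("vector-index", definition, DB_USERNAME, DB_PASSWORD))
```

The network call is left commented out so the sketch can be read without a running cluster; consult the Search REST API documentation for the authoritative endpoint details.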
We will now set the bucket, scope, and collection names in the Couchbase cluster to use for vector search.
For this example, we are using the default scope and collection.
BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"
# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseVectorStore
For this tutorial, we will use OpenAI embeddings.
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key: ········
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-04-09 23:31:46--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.008s

2024-04-09 23:31:46 (8.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Load the Documents
# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
vector_store = CouchbaseVectorStore(
cluster=cluster,
bucket_name=BUCKET_NAME,
scope_name=SCOPE_NAME,
collection_name=COLLECTION_NAME,
index_name=SEARCH_INDEX_NAME,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Basic Example
We will ask the query engine a question about the essay we just indexed.
query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were $6k per founder, totaling $12k in the typical two-founder case, in return for 6% equity.
Metadata Filters
We will create some example documents with metadata so that we can see how to filter documents based on metadata.
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text="The Shawshank Redemption",
metadata={
"author": "Stephen King",
"theme": "Friendship",
},
),
TextNode(
text="The Godfather",
metadata={
"director": "Francis Ford Coppola",
"theme": "Mafia",
},
),
TextNode(
text="Inception",
metadata={
"director": "Christopher Nolan",
},
),
]
vector_store.add(nodes)
['5abb42cf-7312-46eb-859e-60df4f92842a', 'b90525f4-38bf-453c-a51a-5f0718bccc98', '22f732d0-da17-4bad-b3cd-b54e2102367a']
# Metadata filters
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(
filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is Inception about?")
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[NodeWithScore(node=TextNode(id_='b90525f4-38bf-453c-a51a-5f0718bccc98', embedding=None, metadata={'director': 'Francis Ford Coppola', 'theme': 'Mafia'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='The Godfather', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3068528194400547)]
Custom Search Options
You can pass custom search options to the underlying Search index via cb_search_options. Here, a full-text match query on the text field is passed alongside the vector search. A custom_query callable can also be supplied to inspect the search request before it is executed.
# Callback to inspect (or modify) the search query before it is executed
def custom_query(query, query_str):
    print("custom query", query)
    return query
query_engine = index.as_query_engine(
vector_store_kwargs={
"cb_search_options": {
"query": {"match": "growing up", "field": "text"}
},
"custom_query": custom_query,
}
)
response = query_engine.query("what were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were based on a combination of the deal he did with Julian ($10k for 10%) and what Robert said MIT grad students got for the summer ($6k). He invested $6k per founder, which in the typical two-founder case was $12k, in return for 6%.