Couchbase Vector Store
Couchbase is an award-winning distributed NoSQL cloud database that delivers unmatched versatility, performance, scalability, and financial value for your cloud, mobile, AI, and edge computing applications. Couchbase embraces AI by providing coding assistance for developers and vector search for their applications.
Vector search is part of the Full Text Search Service (Search Service) in Couchbase.
This tutorial explains how to use vector search in Couchbase. You can use it with both Couchbase Capella and a self-managed Couchbase Server.
Installation
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-couchbase
!pip install llama-index
创建Couchbase连接¶
我们首先创建一个到Couchbase集群的连接,然后将集群对象传递给Vector Store。
在这里,我们使用用户名和密码进行连接。您也可以使用任何其他支持的方式连接到您的集群。
有关连接到Couchbase集群的更多信息,请查看Python SDK文档。
COUCHBASE_CONNECTION_STRING = (
    "couchbase://localhost"  # or "couchbases://localhost" if using TLS
)
DB_USERNAME = "Administrator"
DB_PASSWORD = "P@ssword1!"
from datetime import timedelta
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
# Connect to the cluster with username/password authentication
auth = PasswordAuthenticator(DB_USERNAME, DB_PASSWORD)
options = ClusterOptions(auth)
cluster = Cluster(COUCHBASE_CONNECTION_STRING, options)
# Wait until the cluster is ready for use, up to 5 seconds
cluster.wait_until_ready(timedelta(seconds=5))
Creating the Search Index
Currently, the search index needs to be created from the Couchbase Capella or Server UI, or using the REST interface.
Let us define a search index named vector-index on the testing bucket.
For this example, let us use the Import Index feature in the Search Service on the UI.
We are defining an index on the _default collection in the _default scope of the testing bucket, with the vector field set to embedding with 1536 dimensions and the text field set to text. We are also indexing and storing all the fields under metadata in the document as a dynamic mapping to account for varying document structures. The similarity metric is set to dot_product.
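As a side note on the similarity metric: dot_product ranks candidates by the dot product of the query and document embeddings, and for unit-length embeddings (OpenAI embeddings are normalized) this is equivalent to cosine similarity. A minimal illustration of the computation, using toy 3-dimensional vectors instead of the real 1536-dimensional embeddings:

```python
# Sketch: how dot_product similarity compares two embedding vectors.
# Real embeddings have 1536 dimensions; 3 are used here for brevity.

def dot_product(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0, 0.0]     # unit-length query embedding
doc_same = [1.0, 0.0, 0.0]  # identical direction -> maximal score
doc_orth = [0.0, 1.0, 0.0]  # orthogonal direction -> no similarity
doc_mid = [0.5, 0.5, 0.0]   # partial overlap -> intermediate score

print(dot_product(query, doc_same))  # 1.0
print(dot_product(query, doc_orth))  # 0.0
print(dot_product(query, doc_mid))   # 0.5
```

A higher dot product means the document embedding points in a more similar direction to the query, which is why it works as a relevance score.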
How do you import an index into the Full Text Search service?
- Couchbase Server
  - Click on Search -> Add Index -> Import
  - Copy the index definition below into the Import screen
  - Click on Create Index to create the index.
- Couchbase Capella
  - Copy the index definition below into a new file, index.json
  - Import the file in Capella using the instructions in the documentation.
  - Click on Create Index to create the index.
Index Definition
{
"name": "vector-index",
"type": "fulltext-index",
"params": {
"doc_config": {
"docid_prefix_delim": "",
"docid_regexp": "",
"mode": "type_field",
"type_field": "type"
},
"mapping": {
"default_analyzer": "standard",
"default_datetime_parser": "dateTimeOptional",
"default_field": "_all",
"default_mapping": {
"dynamic": true,
"enabled": true,
"properties": {
"metadata": {
"dynamic": true,
"enabled": true
},
"embedding": {
"enabled": true,
"dynamic": false,
"fields": [
{
"dims": 1536,
"index": true,
"name": "embedding",
"similarity": "dot_product",
"type": "vector",
"vector_index_optimized_for": "recall"
}
]
},
"text": {
"enabled": true,
"dynamic": false,
"fields": [
{
"index": true,
"name": "text",
"store": true,
"type": "text"
}
]
}
}
},
"default_type": "_default",
"docvalues_dynamic": false,
"index_dynamic": true,
"store_dynamic": true,
"type_field": "_type"
},
"store": {
"indexType": "scorch",
"segmentVersion": 16
}
},
"sourceType": "gocbcore",
"sourceName": "testing",
"sourceParams": {},
"planParams": {
"maxPartitionsPerPIndex": 103,
"indexPartitions": 10,
"numReplicas": 0
}
}
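As mentioned above, the index can also be created over the Search service's REST interface instead of the UI. A hedged sketch, assuming the index definition above is saved as index.json, the Search service is reachable on its default port 8094, and the credentials from earlier are used; the helper name index_request is hypothetical:

```python
import base64
import json
import urllib.request

# Assumption: Search service on localhost at its default port 8094.
SEARCH_HOST = "http://localhost:8094"

def index_request(
    name: str, definition: dict, username: str, password: str
) -> urllib.request.Request:
    """Build the PUT request that creates/updates a Search index (hypothetical helper)."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{SEARCH_HOST}/api/index/{name}",
        data=json.dumps(definition).encode(),
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )

# To actually create the index against a running cluster:
# with open("index.json") as f:
#     definition = json.load(f)
# urllib.request.urlopen(index_request("vector-index", definition, DB_USERNAME, DB_PASSWORD))
```

The network call is left commented out so the sketch can be read without a running cluster; consult the Search REST API documentation for the authoritative endpoint details.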
We will now set the bucket, scope, and collection names in the Couchbase cluster to use for vector search.
For this example, we are using the default scope and collection.
BUCKET_NAME = "testing"
SCOPE_NAME = "_default"
COLLECTION_NAME = "_default"
SEARCH_INDEX_NAME = "vector-index"
# Import required packages
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext
from llama_index.core import Settings
from llama_index.vector_stores.couchbase import CouchbaseVectorStore
For this tutorial, we will use OpenAI embeddings.
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
OpenAI API Key: ········
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2024-04-09 23:31:46--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’

data/paul_graham/pa 100%[===================>]  73.28K  --.-KB/s    in 0.008s

2024-04-09 23:31:46 (8.97 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Load the Documents
# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
vector_store = CouchbaseVectorStore(
cluster=cluster,
bucket_name=BUCKET_NAME,
scope_name=SCOPE_NAME,
collection_name=COLLECTION_NAME,
index_name=SEARCH_INDEX_NAME,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Basic Example
We will ask the query engine a question about the essay we just indexed.
query_engine = index.as_query_engine()
response = query_engine.query("What were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were $6k per founder, totaling $12k in the typical two-founder case, in return for 6% equity.
Metadata Filters
We will create some example documents with metadata so that we can see how to filter documents based on metadata.
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text="The Shawshank Redemption",
metadata={
"author": "Stephen King",
"theme": "Friendship",
},
),
TextNode(
text="The Godfather",
metadata={
"director": "Francis Ford Coppola",
"theme": "Mafia",
},
),
TextNode(
text="Inception",
metadata={
"director": "Christopher Nolan",
},
),
]
vector_store.add(nodes)
['5abb42cf-7312-46eb-859e-60df4f92842a', 'b90525f4-38bf-453c-a51a-5f0718bccc98', '22f732d0-da17-4bad-b3cd-b54e2102367a']
# Metadata filters
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
filters = MetadataFilters(
filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is Inception about?")
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
[NodeWithScore(node=TextNode(id_='b90525f4-38bf-453c-a51a-5f0718bccc98', embedding=None, metadata={'director': 'Francis Ford Coppola', 'theme': 'Mafia'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='The Godfather', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3068528194400547)]
Custom Search Options
You can pass custom search options to the underlying Search index via cb_search_options. Here, a full-text match query on the text field is passed alongside the vector search. A custom_query callable can also be supplied to inspect the search request before it is executed.
# Callback to inspect (or modify) the search query before it is executed
def custom_query(query, query_str):
    print("custom query", query)
    return query
query_engine = index.as_query_engine(
vector_store_kwargs={
"cb_search_options": {
"query": {"match": "growing up", "field": "text"}
},
"custom_query": custom_query,
}
)
response = query_engine.query("what were his investments in Y Combinator?")
print(response)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK" INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK" His investments in Y Combinator were based on a combination of the deal he did with Julian ($10k for 10%) and what Robert said MIT grad students got for the summer ($6k). He invested $6k per founder, which in the typical two-founder case was $12k, in return for 6%.