Baidu VectorDB

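Baidu VectorDB is an enterprise-grade distributed database service developed and fully managed by Baidu Intelligent Cloud. It is built for storing, retrieving, and analyzing multi-dimensional vector data. At its core, VectorDB runs on Baidu's proprietary "Mochow" vector database kernel, which provides high performance, availability, and security together with scalability and ease of use.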

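The service supports a range of index types and similarity calculation methods to cover different use cases. A notable feature of VectorDB is its ability to manage collections of up to 10 billion vectors while sustaining millions of queries per second (QPS) with millisecond-level query latency.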
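
To make "similarity calculation methods" concrete, here is a minimal pure-Python sketch of three metrics commonly offered by vector databases. The text above does not list which metrics Baidu VectorDB exposes, so the choice of inner product, L2 distance, and cosine similarity is an assumption, and the function names are illustrative only.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two vectors: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors; 1.0 means same direction."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Toy data (not taken from the notebook): one query and two candidates.
query = [1.0, 0.0]
candidates = {"same_direction": [2.0, 0.0], "orthogonal": [0.0, 3.0]}
for name, vector in candidates.items():
    print(
        name,
        inner_product(query, vector),
        l2_distance(query, vector),
        cosine_similarity(query, vector),
    )
```

Note that cosine similarity ignores vector length while inner product and L2 distance do not, which is why the metric should match how the embeddings were produced.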

This notebook shows the basic usage of BaiduVectorDB as a vector store in LlamaIndex.

To run it, you need a database instance.

Setup

If you are opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [ ]:

%pip install llama-index-vector-stores-baiduvectordb

In [ ]:

!pip install llama-index

In [ ]:

!pip install pymochow

In [ ]:

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
)
from llama_index.vector_stores.baiduvectordb import (
    BaiduVectorDB,
    TableParams,
    TableField,
)
import pymochow

Provide your OpenAI access key

To use the embeddings provided by OpenAI, you need to supply an OpenAI API key:

In [ ]:

import getpass

import openai

# Prompt for the key so it is not stored in the notebook
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY

Download data

In [ ]:

!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

Create and populate the vector store

You will now load some essays by Paul Graham from a local file and store them in Baidu VectorDB.

In [ ]:

# Load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Initialize Baidu VectorDB

Creating the vector store also creates the underlying database collection if it does not exist yet:

In [ ]:

vector_store = BaiduVectorDB(
    endpoint="http://192.168.X.X",
    api_key="*******",
    table_params=TableParams(dimension=1536, drop_exists=True),
)
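
The `dimension=1536` value above presumably has to match the length of the embedding vectors being stored. The notebook does not state this, so treat it as an assumption about how fixed-dimension vector tables generally behave. A small pre-flight check along these lines can catch a mismatch before any data is written; `find_dimension_mismatches` is a hypothetical helper, not part of LlamaIndex or pymochow.

```python
from typing import List, Sequence

# Same value as the dimension passed to TableParams in the cell above.
EXPECTED_DIMENSION = 1536


def find_dimension_mismatches(
    vectors: Sequence[Sequence[float]], expected: int
) -> List[int]:
    """Return the positions of vectors whose length differs from `expected`."""
    return [i for i, vector in enumerate(vectors) if len(vector) != expected]


# Toy vectors: the second one has the wrong length on purpose.
toy_vectors = [[0.0] * 1536, [0.0] * 768, [0.0] * 1536]
print(find_dimension_mismatches(toy_vectors, EXPECTED_DIMENSION))  # [1]
```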

Now wrap this store in a LlamaIndex index abstraction so that it can be queried later:

In [ ]:

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

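Note that the from_documents call above does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes an embedding vector for each node, and stores them all in Baidu VectorDB.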
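
The three steps can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not LlamaIndex's implementation: the fixed-width character splitter, the `toy_embed` function (bucketed character counts instead of a real embedding model), and the list-backed store are all hypothetical stand-ins.

```python
from typing import Dict, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]


def toy_embed(chunk: str, dimension: int = 4) -> List[float]:
    """Hypothetical stand-in for an embedding model: bucketed character counts."""
    vector = [0.0] * dimension
    for ch in chunk:
        vector[ord(ch) % dimension] += 1.0
    return vector


def ingest(text: str, chunk_size: int, store: List[Dict]) -> int:
    """Chunk the text, embed each chunk, append the rows to store; return the node count."""
    chunks = split_into_chunks(text, chunk_size)
    for i, chunk in enumerate(chunks):
        store.append(
            {"id": f"node-{i}", "text": chunk, "embedding": toy_embed(chunk)}
        )
    return len(chunks)


toy_store: List[Dict] = []
node_count = ingest("abcdefghij", chunk_size=4, store=toy_store)
print(node_count, [row["text"] for row in toy_store])  # 3 ['abcd', 'efgh', 'ij']
```

A real splitter would typically respect sentence or token boundaries rather than cutting at fixed character offsets.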

Querying the store

Basic querying

In [ ]:

query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)

MMR-based queries

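The MMR (maximal marginal relevance) method retrieves text chunks from the store that are both relevant to the query and as different from each other as possible. The goal is to provide broader context for building the final answer.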

In [ ]:

query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
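
To illustrate the trade-off MMR makes, here is a minimal sketch of the textbook greedy selection rule using cosine similarity. The exact scoring used by LlamaIndex or Baidu VectorDB may differ; `mmr_select` and its `lambda_mult` parameter are names chosen for this example only.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick up to k candidate indexes, balancing relevance and diversity.

    lambda_mult=1.0 ranks by relevance only; lower values penalize
    candidates that resemble ones already selected.
    """
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1.0 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 points elsewhere.
toy_query = [1.0, 0.0]
toy_candidates = [[1.0, 0.1], [1.0, 0.12], [1.0, -1.0]]
print(mmr_select(toy_query, toy_candidates, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr_select(toy_query, toy_candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With `lambda_mult=0.5` the near-duplicate candidate 1 is skipped in favor of the less relevant but more distinct candidate 2. Pure relevance ranking keeps both near-duplicates.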

Connecting to an existing store

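Because this store is backed by Baidu VectorDB, it is persistent by definition. If you want to connect to a store that was created and populated earlier, you can do so as follows: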

In [ ]:

new_vector_store = BaiduVectorDB(
    endpoint="http://192.168.X.X",
    api_key="*******",
    table_params=TableParams(dimension=1536, drop_exists=False),
)

# Create the index (from the pre-stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_vector_store
)

# Now you can run queries and other operations:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response)

Metadata filtering

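The Baidu VectorDB vector store supports metadata filtering at query time in the form of exact-match key=value pairs. The cells below demonstrate this feature on a brand-new collection.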

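For brevity, this demo loads a single source file (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to show how a query can be restricted by conditions on the metadata attached to the documents.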

In [ ]:

# Declare the metadata field that will be used for filtering
filter_fields = [
    TableField(name="source_type"),
]

md_storage_context = StorageContext.from_defaults(
    vector_store=BaiduVectorDB(
        endpoint="http://192.168.X.X",
        api_key="*******",
        table_params=TableParams(
            dimension=1536, drop_exists=True, filter_fields=filter_fields
        ),
    )
)


def my_file_metadata(file_name: str):
    """Associate different metadata depending on the input file name."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # This (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Load documents and build the index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)

In [ ]:

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters

In [ ]:

md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[MetadataFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "How long did it take the author to write his thesis?"
)
print(md_response.response)
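
The exact-match key=value semantics described in this section can be sketched without a database. This is only an illustration of the filtering rule. How Baidu VectorDB evaluates filters internally is not covered here, and `matches_all` and `filter_nodes` are hypothetical helper names.

```python
from typing import Any, Dict, List


def matches_all(metadata: Dict[str, Any], required: Dict[str, Any]) -> bool:
    """True if every required key is present in metadata with an equal value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in required.items()
    )


def filter_nodes(
    nodes: List[Dict[str, Any]], required: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Keep only the nodes whose metadata satisfies all required key=value pairs."""
    return [node for node in nodes if matches_all(node["metadata"], required)]


# Toy nodes (not produced by the notebook) tagged like my_file_metadata would tag them.
toy_nodes = [
    {"text": "essay chunk", "metadata": {"source_type": "essay"}},
    {"text": "dinosaur chunk", "metadata": {"source_type": "dinos"}},
    {"text": "untagged chunk", "metadata": {}},
]
kept = filter_nodes(toy_nodes, {"source_type": "essay"})
print([node["text"] for node in kept])  # ['essay chunk']
```

In a real query the filter narrows the candidate set for the similarity search, so only chunks tagged `source_type="essay"` can contribute to the answer.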