Tencent Cloud VectorDB¶
Tencent Cloud VectorDB is a fully managed, self-developed, enterprise-level distributed database service designed for storing, retrieving, and analyzing multi-dimensional vector data. The database supports multiple index types and similarity calculation methods. A single index can support a vector scale of up to 1 billion and can support millions of QPS with millisecond-level query latency. Tencent Cloud VectorDB can not only provide an external knowledge base for large models to improve the accuracy of their responses, but can also be widely used in AI fields such as recommendation systems, NLP services, computer vision, and intelligent customer service.
This notebook shows the basic usage of Tencent Cloud VectorDB as a vector store in LlamaIndex.
To run this notebook, you need a running database instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-tencentvectordb
!pip install llama-index
!pip install tcvectordb
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
StorageContext,
)
from llama_index.vector_stores.tencentvectordb import TencentVectorDB
from llama_index.core.vector_stores.tencentvectordb import (
CollectionParams,
FilterField,
)
import tcvectordb
tcvectordb.debug.DebugEnable = False
Please provide OpenAI access key¶
In order to use the embeddings provided by OpenAI, you need to supply an OpenAI API key:
import getpass

import openai
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
OpenAI API Key: ········
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Creating and populating the Vector Store¶
You will now load some essays by Paul Graham from a local file and store them into the Tencent Cloud VectorDB.
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    f"First document, text ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1
First document, id: 5b7489b6-0cca-4088-8f30-6de32d540fdf
First document, hash: 4c702b4df575421e1d1af4b1fd50511b226e0c9863dbfffeccb8b689b8448f35
First document, text (75019 characters):
====================
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ...
Initialize the Tencent Cloud VectorDB¶
Creation of the vector store entails creation of the underlying database collection if it does not exist yet:
vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=True),
)
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each node, and stores them all into the Tencent Cloud VectorDB.
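The node-splitting step can be pictured with a naive sketch (illustrative only: the actual LlamaIndex node parser is sentence-aware, and the chunk size and overlap shown here are hypothetical values, not the library's defaults):

```python
def naive_chunk(text: str, chunk_size: int = 1024, overlap: int = 128) -> list:
    """Split text into overlapping fixed-size chunks ("nodes")."""
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
    return chunks
```

Each resulting node then gets its own embedding vector before being written to the collection.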
Querying the store¶
Basic querying¶
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because of his fascination with the novel The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also drawn to the idea that AI could be used to explore the ultimate truths that other fields could not.
MMR-based queries¶
The MMR (Maximal Marginal Relevance) method is designed to fetch text chunks from the store that are at once relevant to the query but as different as possible from each other, with the goal of providing a broader context to the building of the final answer.
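The idea behind MMR can be sketched in a few lines of pure Python (an illustration of the technique, not the LlamaIndex implementation; the function name and the default lambda weight are made up for this example):

```python
def mmr_select(query_sims, pairwise_sims, k, lambda_weight=0.5):
    """Greedily pick k items, trading off relevance to the query
    against similarity to items that were already selected."""
    selected, candidates = [], list(range(len(query_sims)))
    while candidates and len(selected) < k:

        def score(i):
            # penalty: similarity to the closest already-selected item
            penalty = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lambda_weight * query_sims[i] - (1 - lambda_weight) * penalty

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two near-duplicate relevant chunks and one dissimilar chunk, plain similarity search would return both duplicates; MMR instead picks the first duplicate and then jumps to the dissimilar chunk.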
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response)
The author chose to work on AI because he was impressed and envious of his friend who had built a computer kit and was able to type programs into it. He was also inspired by a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. He was also disappointed with philosophy courses in college, which he found to be boring, and he wanted to work on something that seemed more powerful.
Connecting to an existing store¶
Since this store is backed by Tencent Cloud VectorDB, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_vector_store = TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(dimension=1536, drop_exists=False),
)
# Create the index (from the preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
vector_store=new_vector_store
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response)
The author studied philosophy and painting, worked on spam filters, and wrote essays prior to working on AI.
Removing documents from the index¶
First get an explicit list of a document's pieces, or "nodes", from a Retriever spawned from the index:
retriever = new_index_instance.as_retriever(
vector_store_query_mode="mmr",
similarity_top_k=3,
vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
print(f" [{idx}] score = {node_with_score.score}")
print(f" id = {node_with_score.node.node_id}")
print(f" text = {node_with_score.node.text[:90]} ...")
Found 3 nodes.
    [0] score = 0.42589144520149874
        id = 05f53f06-9905-461a-bc6d-fa4817e5a776
        text = What I Worked On February 2021 Before college the two main things I worked on, outside o ...
    [1] score = -0.0012061281453193962
        id = 2f9f843e-6495-4646-a03d-4b844ff7c1ab
        text = been explored. But all I wanted was to get out of grad school, and my rapidly written diss ...
    [2] score = 0.025454533089838027
        id = 28ad32da-25f9-4aaa-8487-88390ec13348
        text = showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress ...
But wait! When using the vector store, you should consider the document as the sensible unit of deletion, and not any individual node belonging to it. Well, in this case, you just inserted a single text file, so all nodes will have the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id:
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
5b7489b6-0cca-4088-8f30-6de32d540fdf
Now suppose you need to remove the text file you uploaded:
new_vector_store.delete(nodes_with_scores[0].node.ref_doc_id)
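Conceptually, the delete call above removes every node whose ref_doc_id matches, which a tiny in-memory sketch can illustrate (hypothetical data; the real deletion happens server-side in the collection):

```python
def delete_by_ref_doc_id(nodes, ref_doc_id):
    """Keep only the nodes that did not originate from the given document."""
    return [n for n in nodes if n["ref_doc_id"] != ref_doc_id]


nodes = [
    {"node_id": "n1", "ref_doc_id": "doc-A"},
    {"node_id": "n2", "ref_doc_id": "doc-A"},
    {"node_id": "n3", "ref_doc_id": "doc-B"},
]
remaining = delete_by_ref_doc_id(nodes, "doc-A")
# only the node derived from doc-B survives
```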
Repeat the same query and check the results now. You should see that no results are found:
nodes_with_scores = retriever.retrieve(
"What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata filtering¶
The Tencent Cloud VectorDB vector store supports metadata filtering at query time, in the form of exact-match key=value pairs. The following cells, which work on a brand-new collection, demonstrate this feature.
In this demo, for the sake of brevity, a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
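The effect of an exact-match filter can be pictured with a small in-memory sketch (hypothetical data; in practice the filtering is applied by the database at query time, before similarity ranking):

```python
def exact_match_filter(nodes, key, value):
    """Keep only the nodes whose metadata contains key == value."""
    return [n for n in nodes if n["metadata"].get(key) == value]


nodes = [
    {"text": "essay chunk 1", "metadata": {"source_type": "essay"}},
    {"text": "essay chunk 2", "metadata": {"source_type": "essay"}},
    {"text": "unrelated chunk", "metadata": {"source_type": "other"}},
]
```

A query constrained to source_type="essay" would only ever see the first two nodes.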
filter_fields = [
FilterField(name="source_type"),
]
md_storage_context = StorageContext.from_defaults(
vector_store=TencentVectorDB(
url="http://10.0.X.X",
key="eC4bLRy2va******************************",
collection_params=CollectionParams(
dimension=1536, drop_exists=True, filter_fields=filter_fields
),
)
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# Load documents and build the index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
filters=MetadataFilters(
filters=[ExactMatchFilter(key="source_type", value="essay")]
)
)
md_response = md_query_engine.query(
"How long it took the author to write his thesis?"
)
print(md_response.response)
It took the author five weeks to write his thesis.
To test that the filtering is at play, try to change it to use only "dinos" documents... and this time, there will be no answer :)