Cassandra Vector Store
Apache Cassandra® is a NoSQL, row-oriented, highly scalable and highly available database. Starting with version 5.0, the database ships with vector search capabilities.
DataStax Astra DB through CQL is a managed serverless database built on Cassandra, offering the same interface and strengths.
This notebook shows the basic usage of the Cassandra vector store in LlamaIndex.
To run the full code, you need either a running Cassandra cluster equipped with vector search capabilities or a DataStax Astra DB instance.
Setup
%pip install llama-index-vector-stores-cassandra
!pip install --quiet "astrapy>=0.5.8"
import os
from getpass import getpass
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.vector_stores.cassandra import CassandraVectorStore
The next step is to initialize CassIO with a global DB connection: this is the only step that is done slightly differently for a Cassandra cluster and for Astra DB:
Initialization (Cassandra cluster)
In this case, you first need to create a cassandra.cluster.Session object, as described in the Cassandra driver documentation. The details vary (e.g. network settings and authentication), but this could look like:
from cassandra.cluster import Cluster
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
import cassio
CASSANDRA_KEYSPACE = input("CASSANDRA_KEYSPACE = ")
cassio.init(session=session, keyspace=CASSANDRA_KEYSPACE)
Initialization (Astra DB through CQL)
In this case you initialize CassIO with the following connection parameters:
- the Database ID, e.g. 01234567-89ab-cdef-0123-456789abcdef
- the Token, e.g. AstraCS:6gBhNmsk135.... (it must be a "Database Administrator" token)
- optionally a keyspace name (if omitted, the database default one will be used)
ASTRA_DB_ID = input("ASTRA_DB_ID = ")
ASTRA_DB_TOKEN = getpass("ASTRA_DB_TOKEN = ")
desired_keyspace = input("ASTRA_DB_KEYSPACE (optional, can be left empty) = ")
if desired_keyspace:
    ASTRA_DB_KEYSPACE = desired_keyspace
else:
    ASTRA_DB_KEYSPACE = None
ASTRA_DB_ID = 01234567-89ab-cdef-0123-456789abcdef ASTRA_DB_TOKEN = ········ ASTRA_DB_KEYSPACE (optional, can be left empty) =
import cassio
cassio.init(
    database_id=ASTRA_DB_ID,
    token=ASTRA_DB_TOKEN,
    keyspace=ASTRA_DB_KEYSPACE,
)
OpenAI Key
In order to use embeddings by OpenAI you need to supply an OpenAI API Key:
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API Key:")
OpenAI API Key: ········
Download Data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
--2023-11-10 01:44:05-- https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ... Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected. HTTP request sent, awaiting response... 200 OK Length: 75042 (73K) [text/plain] Saving to: ‘data/paul_graham/paul_graham_essay.txt’ data/paul_graham/pa 100%[===================>] 73.28K --.-KB/s in 0.01s 2023-11-10 01:44:06 (4.80 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
Create and Populate the Vector Store
You will now load some essays by Paul Graham from a local file and store them into the Cassandra vector store.
# Load the documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)
Total documents: 1 First document, id: 12bc6987-366a-49eb-8de0-7b52340e4958 First document, hash: abe31930a1775c78df5a5b1ece7108f78fedbf5fe4a9cf58d7a21808fccaef34 First document, text (75014 characters): ==================== What I Worked On February 2021 Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...
Initialize the Cassandra Vector Store
Creation of the vector store entails creation of the underlying database table if it does not exist yet:
cassandra_store = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
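Behind the scenes, CassIO issues the DDL needed for the table and its vector index. The exact schema depends on the cassio version; purely as an illustrative sketch (column names and index details here are assumptions, not guaranteed), the result is roughly:

```sql
-- Illustrative sketch only: the actual DDL is generated by cassio and may differ.
CREATE TABLE IF NOT EXISTS <keyspace>.cass_v_table (
    row_id TEXT PRIMARY KEY,        -- node identifier
    body_blob TEXT,                 -- the node's text
    metadata_s MAP<TEXT, TEXT>,     -- queryable metadata
    vector VECTOR<FLOAT, 1536>      -- the embedding, dimension as configured
);
CREATE CUSTOM INDEX IF NOT EXISTS ON <keyspace>.cass_v_table (vector)
    USING 'StorageAttachedIndex';
```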
Now wrap this store into an index LlamaIndex abstraction for later querying:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
Note that the above from_documents call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each node, and stores them all in the Cassandra vector store.
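The splitting step can be pictured with a minimal, self-contained sketch. This naive fixed-size splitter is for illustration only: LlamaIndex's actual node parsers are sentence-aware and configurable, and the chunk sizes below are made up:

```python
def naive_split(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Cut text into fixed-size chunks, each overlapping the previous one."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = naive_split("abcdefghij" * 10)  # 100 characters
print(len(chunks))  # 4 chunks, starting at offsets 0, 30, 60, 90
print(chunks[0][-10:] == chunks[1][:10])  # True: consecutive chunks overlap
```

Each such chunk then becomes a node; the node's text, embedding vector, and metadata are what actually get written to the Cassandra table.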
Querying the Store
Basic Querying
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they were inspired by a novel called The Moon is a Harsh Mistress, which featured an intelligent computer, and a PBS documentary that showed Terry Winograd using SHRDLU. These experiences sparked the author's interest in AI and motivated them to pursue it as a field of study and work.
MMR-Based Queries
The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are at the same time relevant to the query and as different as possible from each other, with the goal of providing a broader context to the building of the final answer.
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)
The author chose to work on AI because they believed that teaching SHRDLU more words would eventually lead to the development of intelligent programs. They were fascinated by the potential of AI and saw it as an opportunity to expand their understanding of programming and push the limits of what could be achieved.
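The re-ranking idea behind MMR can be sketched in plain Python. This is a conceptual illustration with made-up 2-D vectors, not LlamaIndex's implementation; lambda_mult is the usual relevance-vs-diversity trade-off parameter:

```python
import math

def cos_sim(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def mmr_select(query, candidates, k=2, lambda_mult=0.3):
    """Greedily pick k candidates, trading query relevance against redundancy."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cos_sim(query, candidates[i])
            redundancy = max(
                (cos_sim(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.05], [0.2, 1.0]]  # first two nearly identical
print(mmr_select(query, candidates))  # [1, 2]: one near-duplicate, then the distinct vector
```

In the query engine, related knobs are passed through vector_store_kwargs (such as the mmr_prefetch_factor used later in this notebook).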
Connecting to an Existing Store
Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:
new_store_instance = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)
# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_store_instance
)
# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
print(response.response)
The author studied philosophy prior to working on AI.
Removing Documents from the Index
First, get an explicit list of a document's nodes from a Retriever spawned from the index.
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")
Found 3 nodes. [0] score = 0.4251742327832831 id = 7e628668-58fa-4548-9c92-8c31d315dce0 text = What I Worked On February 2021 Before college the two main things I worked on, outside o ... [1] score = -0.020323897262800816 id = aa279d09-717f-4d68-9151-594c5bfef7ce text = This was now only weeks away. My nice landlady let me leave my stuff in her attic. I had s ... [2] score = 0.011198131320563909 id = 50b9170d-6618-4e8b-aaf8-36632e2801a6 text = It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDL ...
But wait! When using a vector store, you should consider the document as the sensible unit to delete, rather than any individual node belonging to it. Well, in this case you just inserted a single text file, so all nodes will share the same ref_doc_id:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))
Nodes' ref_doc_id: 12bc6987-366a-49eb-8de0-7b52340e4958 12bc6987-366a-49eb-8de0-7b52340e4958 12bc6987-366a-49eb-8de0-7b52340e4958
Now, let's suppose you need to remove the text file you uploaded:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)
Repeat the same query and check the results now. You should find no results:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)
print(f"Found {len(nodes_with_scores)} nodes.")
Found 0 nodes.
Metadata Filtering
The Cassandra vector store supports metadata filtering in the form of exact-match key=value pairs at query time. The following cells, which work on a brand new Cassandra table, demonstrate this feature.
In this demo, for the sake of brevity, only a single source document is loaded (the ../data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        table="cass_v_table_md", embedding_dimension=1536
    )
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)
That's it: you can now add filtering to your query engine:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "did the author appreciate Lisp and painting?"
)
print(md_response.response)
Yes, the author appreciated Lisp and painting. They mentioned spending a significant amount of time working on Lisp and even building a new dialect of Lisp called Arc. Additionally, the author mentioned spending most of 2014 painting and experimenting with different techniques.
To test that filtering is at play, try to change it to use only "dinos" documents... and this time there will be no answer :)