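TiDB Vector Store
TiDB Cloud is a comprehensive Database-as-a-Service (DBaaS) solution that offers both dedicated and serverless options. TiDB Serverless is now integrating built-in vector search into the MySQL landscape. With this enhancement, you can develop AI applications on TiDB Serverless without needing a new database or an additional technical stack. To be among the first to try it, join the waitlist for the private beta at https://tidb.cloud/ai.
This notebook provides a detailed guide to using TiDB Vector Search in LlamaIndex.
Set up the environment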
In [ ]:
import textwrap
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore
In [ ]:
# Read the secrets with getpass so they are not echoed in the notebook
import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
tidb_connection_url = getpass.getpass(
    "TiDB connection URL (format - mysql+pymysql://root@127.0.0.1:4000/test): "
)
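The prompt above expects a SQLAlchemy-style URL of the form mysql+pymysql://user@host:port/database. As a minimal sketch, the hypothetical helper below (build_tidb_url is not part of LlamaIndex or TiDB) assembles such a URL from its parts using only the standard library and percent-encodes the credentials, which matters when a password contains characters such as @ or /. The host, port, and database values are the placeholders from the example format, not real connection details.

```python
from urllib.parse import quote_plus, urlsplit


def build_tidb_url(user, password, host, port, database):
    """Assemble a mysql+pymysql URL (hypothetical helper, for illustration only)."""
    auth = quote_plus(user)
    if password:
        # Percent-encode the password so '@', '/' and ':' cannot break the URL.
        auth += ":" + quote_plus(password)
    return f"mysql+pymysql://{auth}@{host}:{port}/{database}"


# Placeholder values taken from the example format in the prompt.
url = build_tidb_url("root", "", "127.0.0.1", 4000, "test")
parts = urlsplit(url)
print(url)
print(parts.scheme, parts.hostname, parts.port, parts.path)
```

Splitting the result again with urlsplit is a cheap sanity check that the scheme, host, port, and database ended up where you intended before handing the URL to the vector store.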
Prepare the data used for the demonstration
In [ ]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
In [ ]:
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)

# Tag every document so that it can be targeted by metadata filters later
for document in documents:
    document.metadata = {"book": "paul_graham"}
Document ID: d970e919-4469-414b-967e-24dd9b2eb014
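Note that the loop assigns a new dictionary to document.metadata, so any metadata already attached to a document is replaced rather than extended. The sketch below uses plain dictionaries with a made-up file_name key (an assumption for illustration, not output from this notebook) to contrast replacing with merging.

```python
# Hypothetical metadata standing in for whatever a reader attached earlier.
existing = {"file_name": "paul_graham_essay.txt"}

# Plain assignment, as in the loop above: earlier keys are discarded.
replaced = {"book": "paul_graham"}

# Merging keeps the earlier keys and adds the new one.
merged = {**existing, "book": "paul_graham"}

print(sorted(replaced))
print(sorted(merged))
```

Replacing is fine for this walkthrough, since only the book key is used by the filters later on; merging is the safer choice when the earlier keys are still needed.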
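Create a TiDB vector store
The code snippet below creates a table in TiDB that is optimized for vector search. The table name is taken from the VECTOR_TABLE_NAME variable (paul_graham_test in this example). After the code runs successfully, you can view and access that table directly in your TiDB database.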
In [ ]:
VECTOR_TABLE_NAME = "paul_graham_test"
tidbvec = TiDBVectorStore(
    connection_string=tidb_connection_url,
    table_name=VECTOR_TABLE_NAME,
    distance_strategy="cosine",
    vector_dimension=1536,
    drop_existing_table=False,
)
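To make the distance_strategy="cosine" setting concrete, here is a minimal pure-Python sketch of cosine distance, taken here as one minus the cosine similarity. It illustrates the metric only and is not how TiDB computes it internally; the function name and the sample vectors are made up. Real embeddings would have vector_dimension components (1536 in this example) rather than two.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy two-dimensional vectors, chosen only to show the range of the metric.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

The dimension check mirrors why vector_dimension has to agree with the embedding model: vectors of different lengths cannot be compared.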
Create a query engine based on the TiDB vector store
In [ ]:
storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 8.76it/s]
Generating embeddings: 100%|██████████| 21/21 [00:02<00:00, 8.22it/s]
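Semantic similarity search
This section covers the basics of vector search and shows how to refine the results with metadata filters. Note that the TiDB vector store supports only the default VectorStoreQueryMode.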
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The author worked on writing, programming, building microcomputers, giving talks at conferences, publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing a building for office use.
Filter with metadata
Search with metadata filters to retrieve a specific number of nearest-neighbor results that satisfy the applied filters.
In [ ]:
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="!="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
Empty Response
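Query again, this time with the == operator so that the filter matches the metadata stored with the documents.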
In [ ]:
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)

query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="=="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI. Later on, the author attended art school in both the US and Italy, where they observed a lack of substantial teaching in the painting department.
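The two queries above differ only in the filter operator, yet one returns nothing and the other returns an answer. The sketch below reproduces that behaviour on toy in-memory rows, assuming the filter is applied first and the surviving rows are then ranked by cosine distance. The helper names, row layout, and two-dimensional embeddings are assumptions for illustration; this is not the TiDBVectorStore implementation, and only the == and != operators are sketched.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def matches(metadata, key, value, operator):
    """Evaluate one metadata filter; only '==' and '!=' are sketched here."""
    if operator == "==":
        return metadata.get(key) == value
    if operator == "!=":
        return metadata.get(key) != value
    raise ValueError(f"unsupported operator: {operator}")


def filtered_top_k(rows, query, key, value, operator, k):
    """Keep rows that pass the filter, then return the k closest to the query."""
    candidates = [r for r in rows if matches(r["metadata"], key, value, operator)]
    candidates.sort(key=lambda r: cosine_distance(r["embedding"], query))
    return candidates[:k]


# Toy rows: every chunk carries the same metadata, as in this notebook.
rows = [
    {"id": "a", "embedding": [1.0, 0.0], "metadata": {"book": "paul_graham"}},
    {"id": "b", "embedding": [0.8, 0.6], "metadata": {"book": "paul_graham"}},
    {"id": "c", "embedding": [0.0, 1.0], "metadata": {"book": "paul_graham"}},
]
query = [1.0, 0.0]

not_equal = filtered_top_k(rows, query, "book", "paul_graham", "!=", k=2)
equal = filtered_top_k(rows, query, "book", "paul_graham", "==", k=2)
print([r["id"] for r in not_equal])
print([r["id"] for r in equal])
```

Because every row has book set to paul_graham, the != filter leaves no candidates to rank, which corresponds to the Empty Response seen earlier, while the == filter keeps all rows and the top two by distance are returned.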
Delete documents
In [ ]:
tidbvec.delete(documents[0].doc_id)
Check whether the document has been deleted.
In [ ]:
query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
Empty Response
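The empty response is expected: the essay was loaded as a single document and split into several chunks (the progress output earlier shows 21 embeddings), and delete is called with the document ID, so all of those chunks are removed together. The sketch below mimics that behaviour on a toy in-memory list. The row layout and the delete_by_doc_id helper are assumptions made for illustration, not the TiDB table schema or the TiDBVectorStore implementation.

```python
# Toy in-memory table: each row is one chunk and records its source document.
rows = [
    {"node_id": "n1", "doc_id": "doc-1", "text": "chunk one"},
    {"node_id": "n2", "doc_id": "doc-1", "text": "chunk two"},
    {"node_id": "n3", "doc_id": "doc-2", "text": "other document"},
]


def delete_by_doc_id(rows, doc_id):
    """Return the rows that remain after removing every chunk of doc_id."""
    return [row for row in rows if row["doc_id"] != doc_id]


remaining = delete_by_doc_id(rows, "doc-1")
print([row["node_id"] for row in remaining])
```

In this notebook the table held chunks from only one document, so nothing is left to retrieve after the delete.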