从向量数据库自动检索¶
本指南展示了如何在LlamaIndex中执行自动检索。
许多流行的向量数据库除了语义搜索的查询字符串外,还支持一组元数据过滤器。给定自然语言查询,我们首先使用LLM推断一组元数据过滤器以及传递给向量数据库的正确查询字符串(也可以为空)。然后对向量数据库执行整个查询包。
这允许进行比top-k语义搜索更动态、更具表现力的检索形式。对于给定查询的相关上下文,可能只需要在元数据标签上进行过滤,或者需要在过滤集合内进行过滤+语义搜索的联合组合,或者只需要原始语义搜索。
我们以Chroma为例进行演示,但自动检索也已在许多其他向量数据库中实现(例如Pinecone、Weaviate等)。
设置¶
首先我们定义导入项并创建一个空的Chroma集合。
如果您在colab上打开这个笔记本,您可能需要安装LlamaIndex 🦙。
In [ ]:
Copied!
%pip install llama-index-vector-stores-chroma
%pip install llama-index-vector-stores-chroma
In [ ]:
Copied!
!pip install llama-index
!pip install llama-index
In [ ]:
Copied!
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
In [ ]:
Copied!
# 设置OpenAIimport osimport getpassos.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API密钥:")import openaiopenai.api_key = os.environ["OPENAI_API_KEY"]
# 设置OpenAIimport osimport getpassos.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API密钥:")import openaiopenai.api_key = os.environ["OPENAI_API_KEY"]
In [ ]:
Copied!
import chromadb
import chromadb
In [ ]:
Copied!
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
chroma_client = chromadb.EphemeralClient()
chroma_collection = chroma_client.create_collection("quickstart")
INFO:chromadb.telemetry.posthog:Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information. Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
定义一些样本数据¶
我们向向量数据库中插入一些包含文本块的样本节点。请注意,每个 TextNode
不仅包含文本,还包含元数据,例如 category
和 country
。这些元数据字段将被转换/存储在底层的向量数据库中。
In [ ]:
Copied!
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
In [ ]:
Copied!
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text=(
"Michael Jordan is a retired professional basketball player,"
" widely regarded as one of the greatest basketball players of all"
" time."
),
metadata={
"category": "Sports",
"country": "United States",
},
),
TextNode(
text=(
"Angelina Jolie is an American actress, filmmaker, and"
" humanitarian. She has received numerous awards for her acting"
" and is known for her philanthropic work."
),
metadata={
"category": "Entertainment",
"country": "United States",
},
),
TextNode(
text=(
"Elon Musk is a business magnate, industrial designer, and"
" engineer. He is the founder, CEO, and lead designer of SpaceX,"
" Tesla, Inc., Neuralink, and The Boring Company."
),
metadata={
"category": "Business",
"country": "United States",
},
),
TextNode(
text=(
"Rihanna is a Barbadian singer, actress, and businesswoman. She"
" has achieved significant success in the music industry and is"
" known for her versatile musical style."
),
metadata={
"category": "Music",
"country": "Barbados",
},
),
TextNode(
text=(
"Cristiano Ronaldo is a Portuguese professional footballer who is"
" considered one of the greatest football players of all time. He"
" has won numerous awards and set multiple records during his"
" career."
),
metadata={
"category": "Sports",
"country": "Portugal",
},
),
]
from llama_index.core.schema import TextNode
nodes = [
TextNode(
text=(
"Michael Jordan is a retired professional basketball player,"
" widely regarded as one of the greatest basketball players of all"
" time."
),
metadata={
"category": "Sports",
"country": "United States",
},
),
TextNode(
text=(
"Angelina Jolie is an American actress, filmmaker, and"
" humanitarian. She has received numerous awards for her acting"
" and is known for her philanthropic work."
),
metadata={
"category": "Entertainment",
"country": "United States",
},
),
TextNode(
text=(
"Elon Musk is a business magnate, industrial designer, and"
" engineer. He is the founder, CEO, and lead designer of SpaceX,"
" Tesla, Inc., Neuralink, and The Boring Company."
),
metadata={
"category": "Business",
"country": "United States",
},
),
TextNode(
text=(
"Rihanna is a Barbadian singer, actress, and businesswoman. She"
" has achieved significant success in the music industry and is"
" known for her versatile musical style."
),
metadata={
"category": "Music",
"country": "Barbados",
},
),
TextNode(
text=(
"Cristiano Ronaldo is a Portuguese professional footballer who is"
" considered one of the greatest football players of all time. He"
" has won numerous awards and set multiple records during his"
" career."
),
metadata={
"category": "Sports",
"country": "Portugal",
},
),
]
使用Chroma向量存储构建向量索引¶
在这里,我们将数据加载到向量存储中。如上所述,每个节点的文本和元数据都将转换为Chroma中的相应表示。现在我们可以从Chroma对这些数据运行语义查询,还可以进行元数据过滤。
In [ ]:
Copied!
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
In [ ]:
Copied!
index = VectorStoreIndex(nodes, storage_context=storage_context)
index = VectorStoreIndex(nodes, storage_context=storage_context)
定义 VectorIndexAutoRetriever
¶
我们定义了核心的 VectorIndexAutoRetriever
模块。该模块接收 VectorStoreInfo
,其中包含向量存储集合的结构化描述以及其支持的元数据过滤器。然后这些信息将被用于自动检索提示,LLM 将推断元数据过滤器。
In [ ]:
Copied!
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
content_info="brief biography of celebrities",
metadata_info=[
MetadataInfo(
name="category",
type="str",
description=(
"Category of the celebrity, one of [Sports, Entertainment,"
" Business, Music]"
),
),
MetadataInfo(
name="country",
type="str",
description=(
"Country of the celebrity, one of [United States, Barbados,"
" Portugal]"
),
),
],
)
retriever = VectorIndexAutoRetriever(
index, vector_store_info=vector_store_info
)
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores.types import MetadataInfo, VectorStoreInfo
vector_store_info = VectorStoreInfo(
content_info="brief biography of celebrities",
metadata_info=[
MetadataInfo(
name="category",
type="str",
description=(
"Category of the celebrity, one of [Sports, Entertainment,"
" Business, Music]"
),
),
MetadataInfo(
name="country",
type="str",
description=(
"Country of the celebrity, one of [United States, Barbados,"
" Portugal]"
),
),
],
)
retriever = VectorIndexAutoRetriever(
index, vector_store_info=vector_store_info
)
运行一些示例数据¶
我们尝试运行一些示例数据。请注意元数据过滤器是如何被推断出来的 - 这有助于更精确地检索数据!
In [ ]:
Copied!
retriever.retrieve("Tell me about two celebrities from United States")
retriever.retrieve("Tell me about two celebrities from United States")
INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using query str: celebrities Using query str: celebrities INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using filters: {'country': 'United States'} Using filters: {'country': 'United States'} INFO:llama_index.indices.vector_store.retrievers.auto_retriever.auto_retriever:Using top_k: 2 Using top_k: 2
Out[ ]:
[NodeWithScore(node=TextNode(id_='b2ab3b1a-5731-41ec-b884-405016de5a34', embedding=None, metadata={'category': 'Entertainment', 'country': 'United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='28e1d0d600908a5e9f0c388f0d49b0cd58920dc13e4f2743becd135ac0f18799', text='Angelina Jolie is an American actress, filmmaker, and humanitarian. She has received numerous awards for her acting and is known for her philanthropic work.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.32621567877748514), NodeWithScore(node=TextNode(id_='e0104b6a-676a-4c83-95b7-b018cb8b39b2', embedding=None, metadata={'category': 'Sports', 'country': 'United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='7456e8d70b089c3830424e49b2a03c8d6d3f5cd0de42b0669a8ee518eca01012', text='Michael Jordan is a retired professional basketball player, widely regarded as one of the greatest basketball players of all time.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.3734030955060519)]
In [ ]:
Copied!
retriever.retrieve("Tell me about Sports celebrities from United States")
retriever.retrieve("Tell me about Sports celebrities from United States")