Alibaba Cloud OpenSearch Vector Store¶
Alibaba Cloud OpenSearch Vector Search Edition is a large-scale distributed search engine developed by Alibaba Group. It provides search services across the group, including Taobao, Tmall, Cainiao, Youku, and the e-commerce platforms that serve customers outside the Chinese mainland, and it is also the base engine of Alibaba Cloud OpenSearch. After years of development, it meets business requirements for high availability, high timeliness, and cost-effectiveness. It also provides an automated O&M system on which you can build a custom search service tailored to your business.
To run this notebook, you need an Alibaba Cloud OpenSearch Vector Search Edition instance.
Setup¶
If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.
%pip install llama-index-vector-stores-alibabacloud-opensearch
%pip install llama-index
import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
Provide your OpenAI access key¶
To use OpenAI's embeddings, you need to supply an OpenAI API key:
import getpass
import openai
OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY
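When the notebook runs non-interactively, a prompt is inconvenient. Here is a minimal sketch, assuming the key may already be exported as the `OPENAI_API_KEY` environment variable. The helper name `resolve_openai_key` is hypothetical and not part of any library.

```python
import getpass
import os


def resolve_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to an interactive prompt that does not echo the input.
        key = getpass.getpass("OpenAI API Key:").strip()
    if not key:
        raise ValueError("No OpenAI API key was provided.")
    return key
```

The result can then be assigned in the same way as above, for example `openai.api_key = resolve_openai_key()`.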
Download Data¶
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Load documents¶
from llama_index.core import SimpleDirectoryReader
from IPython.display import Markdown, display
# load documents
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print(f"Total documents: {len(documents)}")
Total documents: 1
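Create the Alibaba Cloud OpenSearch Vector Store object:¶
To run the next step, you should have an Alibaba Cloud OpenSearch Vector Service instance and a configured table.
# if you hit an asyncio exception when running the following cells, run this code first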
import nest_asyncio
nest_asyncio.apply()
# initialize without metadata filters
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
    endpoint="*****",
    instance_id="*****",
    username="your_username",
    password="your_password",
    table_name="llama",
)
vector_store = AlibabaCloudOpenSearchStore(config)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
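Hard-coding credentials in a notebook is easy to leak. The following is a minimal sketch of reading the same five settings from environment variables instead. The variable names and the helper `opensearch_config_kwargs` are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical environment variable names; pick whatever fits your deployment.
_ENV_TO_FIELD = {
    "OPENSEARCH_ENDPOINT": "endpoint",
    "OPENSEARCH_INSTANCE_ID": "instance_id",
    "OPENSEARCH_USERNAME": "username",
    "OPENSEARCH_PASSWORD": "password",
    "OPENSEARCH_TABLE_NAME": "table_name",
}


def opensearch_config_kwargs(environ=None) -> dict:
    """Collect the connection settings, failing early if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in _ENV_TO_FIELD if not environ.get(name)]
    if missing:
        raise KeyError(
            f"Missing environment variables: {', '.join(sorted(missing))}"
        )
    return {field: environ[name] for name, field in _ENV_TO_FIELD.items()}
```

The returned dictionary uses the keyword names from the cell above, so it could be passed as `AlibabaCloudOpenSearchConfig(**opensearch_config_kwargs())`.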
Query Index¶
This example demonstrates how to query the index.
# set logging to DEBUG for more detailed outputs
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))
Before college, the author worked on writing and programming. They wrote short stories and tried writing programs on the IBM 1401 in 9th grade using an early version of Fortran.
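Connecting to an existing store¶
Since this store is backed by Alibaba Cloud OpenSearch, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how: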
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
    endpoint="***",
    instance_id="***",
    username="your_username",
    password="your_password",
    table_name="llama",
)
vector_store = AlibabaCloudOpenSearchStore(config)
# create the index from the vectors that are already stored
index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
response = query_engine.query(
    "What did the author study prior to working on AI?"
)
display(Markdown(f"<b>{response}</b>"))
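Metadata filtering¶
The Alibaba Cloud OpenSearch vector store supports metadata filtering at query time. The following cells, which work on a brand new table, demonstrate this feature.
In this demo, for the sake of brevity, a single source document is loaded (the ./data/paul_graham/paul_graham_essay.txt text file). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.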
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.alibabacloud_opensearch import (
    AlibabaCloudOpenSearchStore,
    AlibabaCloudOpenSearchConfig,
)
config = AlibabaCloudOpenSearchConfig(
    endpoint="****",
    instance_id="****",
    username="your_username",
    password="your_password",
    table_name="llama",
)
md_storage_context = StorageContext.from_defaults(
    vector_store=AlibabaCloudOpenSearchStore(config)
)
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}
# load documents and build the index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)
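To check which metadata a file would receive before indexing anything, the function can be called directly. The sketch below repeats `my_file_metadata` so that it runs on its own. The file names are made up for illustration and are not part of the demo data.

```python
def my_file_metadata(file_name: str):
    """Depending on the input file name, associate different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Hypothetical file names, used only to exercise each branch.
for name in ["paul_graham_essay.txt", "dinosaur_facts.txt", "notes.md"]:
    print(name, "->", my_file_metadata(name))
```

Only the first branch is reached in this demo, because the data directory contains a single essay file.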
Add a filter to the query engine:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[MetadataFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "How long did it take the author to write his thesis?"
)
display(Markdown(f"<b>{md_response}</b>"))
To test that the filtering is at play, try changing it to use only "dinos" documents... there will be no answer this time :)
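The reason a "dinos" filter returns nothing can be illustrated without touching the service. The following is a plain-Python sketch of exact-match filtering over metadata. It is not how Alibaba Cloud OpenSearch evaluates filters internally, and the `chunks` list and `exact_match` helper are hypothetical stand-ins.

```python
from typing import Dict, List

# Hypothetical stand-ins for indexed chunks; only the metadata matters here.
chunks: List[Dict] = [
    {"text": "chunk 1 of the essay", "metadata": {"source_type": "essay"}},
    {"text": "chunk 2 of the essay", "metadata": {"source_type": "essay"}},
]


def exact_match(items: List[Dict], key: str, value: str) -> List[Dict]:
    """Keep only the items whose metadata[key] equals value."""
    return [item for item in items if item["metadata"].get(key) == value]


print(len(exact_match(chunks, "source_type", "essay")))  # 2
print(len(exact_match(chunks, "source_type", "dinos")))  # 0
```

Every chunk in this demo carries `source_type="essay"`, so a filter on any other value leaves the query engine with no context to answer from.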