Managed Index with Zilliz Cloud Pipelines¶

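Zilliz Cloud Pipelines is a scalable API service for retrieval. You can use Zilliz Cloud Pipelines as a managed index in llama-index. The service transforms documents into vector embeddings and stores them in Zilliz Cloud for effective semantic search.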

Setup¶

Install the llama-index dependencies

In [ ]:

%pip install llama-index-indices-managed-zilliz

In [ ]:

%pip install llama-index

Configure the credentials of your Zilliz Cloud account.

In [ ]:

from getpass import getpass

ZILLIZ_PROJECT_ID = getpass("Enter your Zilliz Project ID:")
ZILLIZ_CLUSTER_ID = getpass("Enter your Zilliz Cluster ID:")
ZILLIZ_TOKEN = getpass("Enter your Zilliz API Key:")

Find your OpenAI API key

Find your Zilliz Cloud credentials

Indexing documents¶

Adding metadata to each document is optional. Metadata can be used to filter documents during retrieval.

From Signed URL¶

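Zilliz Cloud Pipelines accepts files from AWS S3 and Google Cloud Storage. You can generate a pre-signed URL from the object storage and use from_document_url() to ingest the file. It automatically indexes the document and stores the document chunks as vectors in Zilliz Cloud.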
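If the file sits in a private S3 bucket, a pre-signed URL has to be generated first. The following is a minimal sketch, assuming a boto3 S3 client is available (boto3 is not otherwise used in this notebook). The helper name make_presigned_get_url, the bucket and the key are hypothetical.

```python
def make_presigned_get_url(s3_client, bucket, key, expires_in=3600):
    """Return a time-limited download URL for the object s3://bucket/key.

    Assumption: `s3_client` is a boto3 S3 client, e.g. boto3.client("s3").
    `expires_in` is the URL lifetime in seconds.
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

With AWS credentials configured, something like make_presigned_get_url(boto3.client("s3"), "my-bucket", "docs/milvus_doc.md") should yield a URL that can be passed as the url argument of from_document_url().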

In [ ]:

from llama_index.indices.managed.zilliz import ZillizCloudPipelineIndex

# Create pipelines: skip this step if you already have valid pipelines
pipeline_ids = ZillizCloudPipelineIndex.create_pipelines(
    project_id=ZILLIZ_PROJECT_ID,
    cluster_id=ZILLIZ_CLUSTER_ID,
    api_key=ZILLIZ_TOKEN,
    data_type="doc",
    collection_name="zcp_llamalection_doc",  # change this value to customize the collection name
    metadata_schema={"user_id": "VarChar"},
)
print(pipeline_ids)

{'INGESTION': 'pipe-d639f220f27320e2e381de', 'SEARCH': 'pipe-47bd43fe8fd54502874a08', 'DELETION': 'pipe-bd434c99e064282f1a28e8'}
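As the output shows, create_pipelines() returns a dictionary keyed by pipeline role. If you reuse existing pipelines instead of creating new ones, a small check such as the sketch below can catch a missing ID early. The helper check_pipeline_ids is hypothetical and not part of llama-index; the three role names are taken from the printed output, and the sketch assumes you want all three roles available.

```python
REQUIRED_PIPELINE_ROLES = ("INGESTION", "SEARCH", "DELETION")


def check_pipeline_ids(pipeline_ids):
    """Return a copy of `pipeline_ids` if every role has a non-empty ID.

    Raises ValueError naming the roles that are missing or empty.
    """
    missing = [
        role for role in REQUIRED_PIPELINE_ROLES if not pipeline_ids.get(role)
    ]
    if missing:
        raise ValueError(f"Missing pipeline IDs for: {', '.join(missing)}")
    return dict(pipeline_ids)
```

The validated dictionary can then be passed as pipeline_ids in the same way as the value returned by create_pipelines().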

In [ ]:

zcp_doc_index = ZillizCloudPipelineIndex.from_document_url(
    # A public or pre-signed URL of a file stored on AWS S3 or Google Cloud Storage
    url="https://publicdataset.zillizcloud.com/milvus_doc.md",
    pipeline_ids=pipeline_ids,
    api_key=ZILLIZ_TOKEN,
    metadata={
        "user_id": "user-001"
    },  # optional, can be used for filtering
)

# # Delete documents by document name
# zcp_doc_index.delete_by_expression(expression="doc_name == 'milvus_doc_22.md'")

From Document Nodes¶

Zilliz Cloud Pipelines also supports text as data input. The following example prepares data with a sample document node.

In [ ]:

from llama_index.core import Document
from llama_index.indices.managed.zilliz import ZillizCloudPipelineIndex

# Prepare documents
documents = [Document(text="The number that is being searched for is ten.")]

# Create pipelines: skip this step if you already have valid pipelines
pipeline_ids = ZillizCloudPipelineIndex.create_pipelines(
    project_id=ZILLIZ_PROJECT_ID,
    cluster_id=ZILLIZ_CLUSTER_ID,
    api_key=ZILLIZ_TOKEN,
    data_type="text",
    collection_name="zcp_llamalection_text",  # change this value to customize the collection name
)
print(pipeline_ids)

{'INGESTION': 'pipe-2bbab10f273a57eb987024', 'SEARCH': 'pipe-e1914a072ec5e6f83e446a', 'DELETION': 'pipe-72bbabf273a51af0b0c447'}

In [ ]:

zcp_text_index = ZillizCloudPipelineIndex.from_documents(
    # The document nodes prepared above
    documents=documents,
    pipeline_ids=pipeline_ids,
    api_key=ZILLIZ_TOKEN,
)

Working as Query Engine¶

To run semantic search with ZillizCloudPipelineIndex, call as_query_engine() and specify a few parameters:

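search_top_k: How many text nodes/chunks to retrieve. Optional, defaults to DEFAULT_SIMILARITY_TOP_K (2).
filters: The metadata filters. Optional, defaults to None.
output_metadata: The metadata fields to return with the retrieved text nodes. Optional, defaults to [].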
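To make the effect of search_top_k and filters concrete, here is a purely local illustration in plain Python. It is not how Zilliz Cloud Pipelines is implemented (the service runs the vector search server-side); the function name, the chunk dictionaries and the scores are invented for this example.

```python
def top_k_with_filter(scored_chunks, top_k=2, required_metadata=None):
    """Keep chunks whose metadata matches exactly, then return the best `top_k`."""
    required = required_metadata or {}
    matching = [
        chunk
        for chunk in scored_chunks
        if all(chunk["metadata"].get(k) == v for k, v in required.items())
    ]
    # Higher score means more similar, so sort in descending order.
    return sorted(matching, key=lambda c: c["score"], reverse=True)[:top_k]


# Made-up chunks standing in for indexed document chunks.
chunks = [
    {"text": "A", "score": 0.74, "metadata": {"user_id": "user-001"}},
    {"text": "B", "score": 0.64, "metadata": {"user_id": "user_002"}},
    {"text": "C", "score": 0.55, "metadata": {"user_id": "user_002"}},
]

print([c["text"] for c in top_k_with_filter(chunks, top_k=2)])
# ['A', 'B']
print(
    [
        c["text"]
        for c in top_k_with_filter(chunks, required_metadata={"user_id": "user_002"})
    ]
)
# ['B', 'C']
```

The filter is applied before ranking, so a tenant only ever sees its own chunks regardless of how high other chunks score.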

In [ ]:

import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API Key:")

In [ ]:

query_engine = zcp_doc_index.as_query_engine(search_top_k=3)

The query engine is now ready for semantic search or retrieval-augmented generation over the Milvus 2.3 documents:

Retrieve (semantic search powered by Zilliz Cloud Pipelines):

In [ ]:

question = "Can users delete entities by filtering non-primary fields?"
retrieved_nodes = query_engine.retrieve(question)
print(retrieved_nodes)

[NodeWithScore(node=TextNode(id_='449755997496672548', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\nThis topic describes how to delete entities in Milvus.  \nMilvus supports deleting entities by primary key or complex boolean expressions. Deleting entities by primary key is much faster and lighter than deleting them by complex boolean expressions. This is because Milvus executes queries first when deleting data by complex boolean expressions.  \nDeleted entities can still be retrieved immediately after the deletion if the consistency level is set lower than Strong.\nEntities deleted beyond the pre-specified span of time for Time Travel cannot be retrieved again.\nFrequent deletion operations will impact the system performance.  \nBefore deleting entities by comlpex boolean expressions, make sure the collection has been loaded.\nDeleting entities by complex boolean expressions is not an atomic operation. Therefore, if it fails halfway through, some data may still be deleted.\nDeleting entities by complex boolean expressions is supported only when the consistency is set to Bounded. For details, see Consistency.\\\n\\\n# Delete Entities\n## Prepare boolean expression\nPrepare the boolean expression that filters the entities to delete.  \nMilvus supports deleting entities by primary key or complex boolean expressions. For more information on expression rules and supported operators, see Boolean Expression Rules.\\\n\\\n# Delete Entities\n## Prepare boolean expression\n### Simple boolean expression\nUse a simple expression to filter data with primary key values of 0 and 1:  \n```python\nexpr = "book_id in [0,1]"\n```\\\n\\\n# Delete Entities\n## Prepare boolean expression\n### Complex boolean expression\nTo filter entities that meet specific conditions, define complex boolean expressions.  
\nFilter entities whose word_count is greater than or equal to 11000:  \n```python\nexpr = "word_count >= 11000"\n```  \nFilter entities whose book_name is not Unknown:  \n```python\nexpr = "book_name != Unknown"\n```  \nFilter entities whose primary key values are greater than 5 and word_count is smaller than or equal to 9999:  \n```python\nexpr = "book_id > 5 && word_count <= 9999"\n```', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.742070198059082), NodeWithScore(node=TextNode(id_='449755997496672549', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\n## Delete entities\nDelete the entities with the boolean expression you created. Milvus returns the ID list of the deleted entities.\n```python\nfrom pymilvus import Collection\ncollection = Collection("book")      # Get an existing collection.\ncollection.delete(expr)\n```  \nParameter\tDescription\nexpr\tBoolean expression that specifies the entities to delete.\npartition_name (optional)\tName of the partition to delete entities from.\\\n\\\n# Upsert Entities\nThis topic describes how to upsert entities in Milvus.  \nUpserting is a combination of insert and delete operations. In the context of a Milvus vector database, an upsert is a data-level operation that will overwrite an existing entity if a specified field already exists in a collection, and insert a new entity if the specified value doesn’t already exist.  \nThe following example upserts 3,000 rows of randomly generated data as the example data. When performing upsert operations, it\'s important to note that the operation may compromise performance. This is because the operation involves deleting data during execution.\\\n\\\n# Upsert Entities\n## Prepare data\nFirst, prepare the data to upsert. 
The type of data to upsert must match the schema of the collection, otherwise Milvus will raise an exception.  \nMilvus supports default values for scalar fields, excluding a primary key field. This indicates that some fields can be left empty during data inserts or upserts. For more information, refer to Create a Collection.  \n```python\n# Generate data to upsert\n\nimport random\nnb = 3000\ndim = 8\nvectors = [[random.random() for _ in range(dim)] for _ in range(nb)]\ndata = [\n[i for i in range(nb)],\n[str(i) for i in range(nb)],\n[i for i in range(10000, 10000+nb)],\nvectors,\n[str("dy"*i) for i in range(nb)]\n]\n```', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.6409814953804016), NodeWithScore(node=TextNode(id_='449755997496672550', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Upsert Entities\n## Upsert data\nUpsert the data to the collection.  \n```python\nfrom pymilvus import Collection\ncollection = Collection("book") # Get an existing collection.\nmr = collection.upsert(data)\n```  \nParameter\tDescription\ndata\tData to upsert into Milvus.\npartition_name (optional)\tName of the partition to upsert data into.\ntimeout (optional)\tAn optional duration of time in seconds to allow for the RPC. If it is set to None, the client keeps waiting until the server responds or error occurs.\nAfter upserting entities into a collection that has previously been indexed, you do not need to re-index the collection, as Milvus will automatically create an index for the newly upserted data. For more information, refer to Can indexes be created after inserting vectors?\\\n\\\n# Upsert Entities\n## Flush data\nWhen data is upserted into Milvus it is updated and inserted into segments. Segments have to reach a certain size to be sealed and indexed. Unsealed segments will be searched brute force. 
In order to avoid this with any remainder data, it is best to call flush(). The flush() call will seal any remaining segments and send them for indexing. It is important to only call this method at the end of an upsert session. Calling it too often will cause fragmented data that will need to be cleaned later on.\\\n\\\n# Upsert Entities\n## Limits\nUpdating primary key fields is not supported by upsert().\nupsert() is not applicable and an error can occur if autoID is set to True for primary key fields.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.5456743240356445)]
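The raw list printed above is hard to scan. A small formatting helper along the following lines can condense it to one line per node. This is a sketch: summarize_nodes is a hypothetical name, and it assumes each item exposes score (a float) and node.text, as the NodeWithScore and TextNode objects in the output do.

```python
def summarize_nodes(nodes_with_scores, max_chars=80):
    """Return one readable line per retrieved node: rank, score, text snippet."""
    lines = []
    for rank, item in enumerate(nodes_with_scores, start=1):
        # Flatten newlines so each node fits on a single line.
        text = item.node.text.replace("\n", " ")
        snippet = text[:max_chars] + ("..." if len(text) > max_chars else "")
        lines.append(f"{rank}. score={item.score:.3f}  {snippet}")
    return lines
```

In the notebook this could be used as print("\n".join(summarize_nodes(retrieved_nodes))).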

Query (RAG powered by Zilliz Cloud Pipelines as the retriever and OpenAI's LLM):

In [ ]:

response = query_engine.query(question)
print(response.response)

Users can delete entities by filtering non-primary fields using complex boolean expressions in Milvus.

Multi-Tenancy¶

With a tenant-specific value (e.g. a user ID) stored as metadata, the managed index can achieve multi-tenancy by applying metadata filters.

By specifying metadata values, each document is tagged with the tenant-specific field at ingestion.

In [ ]:

zcp_doc_index._insert_doc_url(
    url="https://publicdataset.zillizcloud.com/milvus_doc_22.md",
    metadata={"user_id": "user_002"},
)

Out[ ]:

{'token_usage': 984, 'doc_name': 'milvus_doc_22.md', 'num_chunks': 3}

The managed index can then build a query engine for each tenant by filtering on the tenant-specific field.

In [ ]:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

query_engine_for_user_002 = zcp_doc_index.as_query_engine(
    search_top_k=3,
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="user_id", value="user_002")]
    ),
    output_metadata=["user_id"],  # optional, shows user_id in the output
)

Change filters to build query engines with different conditions.
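When there are several tenants, the same call can be repeated once per user ID. The sketch below is one possible way to do that; build_tenant_query_engines and make_filters are hypothetical names, and the only index method it relies on is as_query_engine() with the three parameters described earlier.

```python
def build_tenant_query_engines(index, user_ids, make_filters, top_k=3):
    """Return a dict mapping each user ID to a query engine filtered to that user.

    `make_filters` is a callable that turns a user ID into the filters object
    expected by `index.as_query_engine()`.
    """
    return {
        user_id: index.as_query_engine(
            search_top_k=top_k,
            filters=make_filters(user_id),
            output_metadata=["user_id"],
        )
        for user_id in user_ids
    }
```

In this notebook, make_filters could be lambda uid: MetadataFilters(filters=[ExactMatchFilter(key="user_id", value=uid)]), with zcp_doc_index passed as index.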

In [ ]:

question = "Can I delete entities by filtering non-primary fields?"

# search_results = query_engine_for_user_002.retrieve(question)
response = query_engine_for_user_002.query(question)
print(response.response)

Milvus only supports deleting entities by primary key filtered with boolean expressions. Other operators can be used only in query or scalar filtering in vector search.