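A Simple to Advanced Guide with Auto-Retrieval (with Pinecone + Arize Phoenix)¶

In this notebook we show how to perform auto-retrieval against Pinecone, which lets you run a broad range of semi-structured queries that go beyond what standard top-k semantic search can do.

We show how to set up basic auto-retrieval, and how to extend it by customizing the prompt and through dynamic metadata retrieval.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.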

In [ ]:

%pip install llama-index-vector-stores-pinecone

In [ ]:

# Version specifiers are quoted so the shell does not treat ">=" as a redirect.
# !pip install "llama-index>=0.9.31" scikit-learn==1.2.2 arize-phoenix==2.4.1 "pinecone-client>=3.0.0"

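Part 1: Setup Auto-Retrieval¶

To set up auto-retrieval, we do the following:

1. We do some setup, load data, and build a Pinecone vector index.
2. We define our auto-retriever and run some sample queries.
3. We use Phoenix to observe each trace and visualize the prompt inputs/outputs.
4. We show you how to customize the auto-retrieval prompt.

1.a Setup Pinecone/Phoenix, Load Data, and Build Vector Index¶

In this section we set up Pinecone and ingest some toy data on books/movies (with text data and metadata).

We also set up Phoenix so that it captures downstream traces.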

In [ ]:

# set up Phoenix
import phoenix as px
import llama_index.core

px.launch_app()
llama_index.core.set_global_handler("arize_phoenix")

🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix

In [ ]:

import os

os.environ[
    "PINECONE_API_KEY"
] = "<Your Pinecone API key, from app.pinecone.io>"
# os.environ["OPENAI_API_KEY"] = "sk-..."

In [ ]:

from pinecone import Pinecone
from pinecone import ServerlessSpec

api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

In [ ]:

# delete the index if needed
# pc.delete_index("quickstart-index")

In [ ]:

# the text embedding dimension is 1536
try:
    pc.create_index(
        "quickstart-index",
        dimension=1536,
        metric="euclidean",
        spec=ServerlessSpec(cloud="aws", region="us-west-2"),
    )
except Exception as e:
    # most likely the index already exists
    print(e)
    pass

In [ ]:

pinecone_index = pc.Index("quickstart-index")

Load documents, build the PineconeVectorStore and VectorStoreIndex¶

In [ ]:

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.pinecone import PineconeVectorStore

In [ ]:

from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        text="The Shawshank Redemption",
        metadata={
            "author": "Stephen King",
            "theme": "Friendship",
            "year": 1994,
        },
    ),
    TextNode(
        text="The Godfather",
        metadata={
            "director": "Francis Ford Coppola",
            "theme": "Mafia",
            "year": 1972,
        },
    ),
    TextNode(
        text="Inception",
        metadata={
            "director": "Christopher Nolan",
            "theme": "Fiction",
            "year": 2010,
        },
    ),
    TextNode(
        text="To Kill a Mockingbird",
        metadata={
            "author": "Harper Lee",
            "theme": "Fiction",
            "year": 1960,
        },
    ),
    TextNode(
        text="1984",
        metadata={
            "author": "George Orwell",
            "theme": "Totalitarianism",
            "year": 1949,
        },
    ),
    TextNode(
        text="The Great Gatsby",
        metadata={
            "author": "F. Scott Fitzgerald",
            "theme": "The American Dream",
            "year": 1925,
        },
    ),
    TextNode(
        text="Harry Potter and the Sorcerer's Stone",
        metadata={
            "author": "J.K. Rowling",
            "theme": "Fiction",
            "year": 1997,
        },
    ),
]

In [ ]:

vector_store = PineconeVectorStore(
    pinecone_index=pinecone_index,
    namespace="test",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

In [ ]:

index = VectorStoreIndex(nodes, storage_context=storage_context)

Upserted vectors:   0%|          | 0/7 [00:00<?, ?it/s]

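1.b Define Autoretriever, Run Some Sample Queries¶

Setup the `VectorIndexAutoRetriever`¶

One of the inputs is a schema describing what content the vector store collection contains. This is similar to a table schema describing a table in a SQL database. This schema information is then injected into the prompt, which is passed to the LLM to infer what the full query should be (including metadata filters).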

In [ ]:

from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# The metadata names must match the keys stored on the nodes
# ("director", "theme", "year"), otherwise inferred filters match nothing.
vector_store_info = VectorStoreInfo(
    content_info="famous books and movies",
    metadata_info=[
        MetadataInfo(
            name="director",
            type="str",
            description="Name of the director",
        ),
        MetadataInfo(
            name="theme",
            type="str",
            description="Theme of the book/movie",
        ),
        MetadataInfo(
            name="year",
            type="int",
            description="Year of the book/movie",
        ),
    ],
)
retriever = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    verbose=True,
)
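
As a rough illustration of how a schema description can end up inside a prompt, the sketch below serializes a schema to JSON and splices it into a prompt string. It is a simplified stand-in, not LlamaIndex's actual implementation: the plain dicts, the shortened `TEMPLATE`, and the `build_prompt` helper are all hypothetical, and only the metadata keys mirror the toy nodes.

```python
import json

# Hypothetical stand-in for the schema object: plain dicts instead of
# VectorStoreInfo / MetadataInfo.
store_info = {
    "content_info": "famous books and movies",
    "metadata_info": [
        {"name": "director", "type": "str", "description": "Name of the director"},
        {"name": "theme", "type": "str", "description": "Theme of the book/movie"},
        {"name": "year", "type": "int", "description": "Year of the book/movie"},
    ],
}

# Hypothetical, heavily shortened template; the real prompt is much longer.
TEMPLATE = (
    "Data Source:\n{info_str}\n\n"
    "User Query:\n{query_str}\n\n"
    "Structured Request:\n"
)


def build_prompt(info: dict, query: str) -> str:
    """Serialize the schema as JSON and place it next to the user query."""
    return TEMPLATE.format(info_str=json.dumps(info, indent=4), query_str=query)


prompt = build_prompt(
    store_info, "Tell me about some books/movies after the year 2000"
)
print(prompt)
```

The LLM only sees the attribute names, types, and descriptions given here, so a name that does not match the stored metadata key would lead it to emit filters that match nothing.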

Let's run some queries¶

Let's run some sample queries that make use of the structured information.

In [ ]:

nodes = retriever.retrieve(
    "Tell me about some books/movies after the year 2000"
)

Using query str: 
Using filters: [('year', '>', 2000)]

In [ ]:

for node in nodes:
    print(node.text)
    print(node.metadata)

Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
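
To see why only Inception comes back, the inferred filter can be replayed by hand on the toy metadata. The sketch below uses plain Python; the `matches` helper and the `OPS` table are hypothetical and only approximate what a vector store's filter engine does.

```python
import operator

# Toy entries: (text, metadata), copied from the nodes defined earlier.
items = [
    ("The Shawshank Redemption", {"author": "Stephen King", "theme": "Friendship", "year": 1994}),
    ("The Godfather", {"director": "Francis Ford Coppola", "theme": "Mafia", "year": 1972}),
    ("Inception", {"director": "Christopher Nolan", "theme": "Fiction", "year": 2010}),
    ("To Kill a Mockingbird", {"author": "Harper Lee", "theme": "Fiction", "year": 1960}),
    ("1984", {"author": "George Orwell", "theme": "Totalitarianism", "year": 1949}),
    ("The Great Gatsby", {"author": "F. Scott Fitzgerald", "theme": "The American Dream", "year": 1925}),
    ("Harry Potter and the Sorcerer's Stone", {"author": "J.K. Rowling", "theme": "Fiction", "year": 1997}),
]

# Map operator strings to Python comparisons (only a handful are covered).
OPS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches(metadata: dict, filters: list) -> bool:
    """True if every (key, op, value) condition holds; missing keys never match."""
    return all(
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    )


inferred = [("year", ">", 2000)]
hits = [title for title, md in items if matches(md, inferred)]
print(hits)
```

Note that the query string is empty in the trace above, so the metadata filter alone decides which entries are returned.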

In [ ]:

nodes = retriever.retrieve("Tell me about some books that are Fiction")

Using query str: Fiction
Using filters: [('theme', '==', 'Fiction')]

In [ ]:

for node in nodes:
    print(node.text)
    print(node.metadata)

Inception
{'director': 'Christopher Nolan', 'theme': 'Fiction', 'year': 2010}
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}

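Pass in Additional Metadata Filters¶

If you have additional metadata filters that you want to pass in explicitly, rather than have them auto-inferred, do the following.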

In [ ]:

from llama_index.core.vector_stores import MetadataFilters

filter_dicts = [{"key": "year", "operator": "==", "value": 1997}]
filters = MetadataFilters.from_dicts(filter_dicts)
retriever2 = VectorIndexAutoRetriever(
    index,
    vector_store_info=vector_store_info,
    empty_query_top_k=10,
    # this is a hack to allow for blank queries in pinecone
    default_empty_query_vector=[0] * 1536,
    extra_filters=filters,
)

In [ ]:

nodes = retriever2.retrieve("Tell me about some books that are Fiction")
for node in nodes:
    print(node.text)
    print(node.metadata)

Harry Potter and the Sorcerer's Stone
{'author': 'J.K. Rowling', 'theme': 'Fiction', 'year': 1997}
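
The result above is consistent with the extra filter being combined with the inferred one so that both must hold. The sketch below illustrates that assumed AND behaviour on three of the toy entries; `matches_all` is a hypothetical equality-only helper, not part of LlamaIndex.

```python
# Three of the toy entries are enough to show the effect.
items = [
    ("Inception", {"director": "Christopher Nolan", "theme": "Fiction", "year": 2010}),
    ("To Kill a Mockingbird", {"author": "Harper Lee", "theme": "Fiction", "year": 1960}),
    ("Harry Potter and the Sorcerer's Stone", {"author": "J.K. Rowling", "theme": "Fiction", "year": 1997}),
]


def matches_all(metadata: dict, conditions: list) -> bool:
    """Equality-only check: every (key, value) pair must be present and equal."""
    return all(metadata.get(key) == value for key, value in conditions)


inferred = [("theme", "Fiction")]  # what the LLM inferred from the query
extra = [("year", 1997)]           # what the caller passed in explicitly

# Assumption: both sets of conditions are applied together (logical AND).
combined = inferred + extra

only_inferred = [title for title, md in items if matches_all(md, inferred)]
hits = [title for title, md in items if matches_all(md, combined)]
print(only_inferred)
print(hits)
```

With the inferred filter alone all three entries qualify; adding the explicit year condition narrows the result to the single 1997 title.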

Example of a failing Query¶

Note that no results are retrieved! We'll fix this later on.

In [ ]:

nodes = retriever.retrieve("Tell me about some books that are mafia-themed")

Using query str: books
Using filters: [('theme', '==', 'mafia')]

In [ ]:

for node in nodes:
    print(node.text)
    print(node.metadata)
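
The empty result comes from the inferred filter value 'mafia', while the stored value is 'Mafia'. A minimal plain-Python sketch of that mismatch, assuming the `==` filter behaves like a case-sensitive string comparison (the `exact_match` helper is hypothetical):

```python
# Theme values as stored on the toy nodes.
stored_themes = [
    "Friendship",
    "Mafia",
    "Fiction",
    "Totalitarianism",
    "The American Dream",
]


def exact_match(stored: list, wanted: str) -> list:
    """Return stored values equal to the wanted value, compared case-sensitively."""
    return [value for value in stored if value == wanted]


lowercase_hits = exact_match(stored_themes, "mafia")    # value the LLM inferred
capitalized_hits = exact_match(stored_themes, "Mafia")  # value actually stored
print(lowercase_hits)
print(capitalized_hits)
```

The semantic part of the query never gets a chance to help here, because the filter removes every candidate before similarity ranking applies.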

Visualize Traces¶

Let's open up Phoenix to take a look at the traces!


Let's take a look at the auto-retrieval prompt. We can see that it makes use of two few-shot examples.

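Part 2: Extending Auto-Retrieval (with Dynamic Metadata Retrieval)¶

We now extend auto-retrieval by customizing the prompt. In the first part (2.a), we explicitly add some rules.

In the second part (2.b), we implement dynamic metadata retrieval, which does a first-stage retrieval pass against the vector db to fetch relevant metadata and inserts it into the auto-retrieval prompt as few-shot examples. (Of course, the second-stage retrieval pass retrieves the actual items from the vector db.)

2.a Improve the Auto-retrieval Prompt¶

Our auto-retrieval prompt works, but it can be improved in various ways. For instance, it includes 2 hardcoded few-shot examples (how can you include your own?), and auto-retrieval doesn't always infer the right metadata filters.

For instance, all the "theme" values are capitalized. How do we tell the LLM that, so it doesn't erroneously infer a lowercase "theme" value?

Let's take a stab at modifying the prompt!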

In [ ]:

from llama_index.core.prompts import display_prompt_dict
from llama_index.core import PromptTemplate

In [ ]:

prompts_dict = retriever.get_prompts()

In [ ]:

display_prompt_dict(prompts_dict)

In [ ]:

# look at the required template variables.
prompts_dict["prompt"].template_vars

Out[ ]:

['schema_str', 'info_str', 'query_str']
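
For intuition about where such a list comes from, the standard library can extract the named fields of a Python format string. The sketch below is only an illustration with a hypothetical shortened template; it is not necessarily how LlamaIndex derives `template_vars`.

```python
from string import Formatter

# Hypothetical shortened template. Doubled braces are escapes for literal
# braces, so the JSON example line contributes no template variables.
template = (
    "Schema:\n{schema_str}\n"
    'Example: {{"query": "books", "filters": []}}\n'
    "Data Source:\n{info_str}\n"
    "User Query:\n{query_str}\n"
)


def find_template_vars(fmt: str) -> list:
    """Return the named replacement fields of a format string, in first-seen order."""
    seen = []
    for _literal, field_name, _spec, _conversion in Formatter().parse(fmt):
        if field_name and field_name not in seen:
            seen.append(field_name)
    return seen


template_vars = find_template_vars(template)
print(template_vars)
```

This also shows why literal JSON inside a prompt template has to be written with doubled braces: single braces would be read as additional variables.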

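Customize the Prompt¶

Let's customize the prompt a little bit. We do the following:

1. Take out the first few-shot example to save tokens.
2. Add an instruction to always capitalize the first letter of the value whenever a "theme" filter is inferred.

Note that the prompt template expects schema_str, info_str, and query_str to be defined.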

In [ ]:

# write the prompt template, and modify it.

prompt_tmpl_str = """\
Your goal is to structure the user's query to match the request schema provided below.

<< Structured Request Schema >>
When responding, use a markdown code snippet with a JSON object formatted in the following schema:

{schema_str}

The query string should contain only text that is expected to match the contents of documents. Any conditions in the filter should not be mentioned in the query as well.

Make sure that filters only refer to attributes that exist in the data source.
Make sure that filters take into account the descriptions of attributes.
Make sure that filters are only used as needed. If there are no filters that should be applied, return [] for the filter value.
If the user's query explicitly mentions the number of documents to retrieve, set top_k to that number, otherwise do not set top_k.
Do NOT EVER infer a null value for a filter. This will break the downstream program. Instead, don't include the filter.

<< Example 1. >>
Data Source:
{{
    "metadata_info": [
        {{
            "name": "author",
            "type": "str",
            "description": "Author name"
        }},
        {{
            "name": "book_title",
            "type": "str",
            "description": "Book title"
        }},
        {{
            "name": "year",
            "type": "int",
            "description": "Year published"
        }},
        {{
            "name": "pages",
            "type": "int",
            "description": "Number of pages"
        }},
        {{
            "name": "summary",
            "type": "str",
            "description": "A short summary of the book"
        }}
    ],
    "content_info": "Classic literature"
}}

User Query:
What are some books by Jane Austen published after 1813 that explore the theme of marriage for social standing?

Additional Instructions:
None

Structured Request:
{{"query": "Books related to theme of marriage for social standing", "filters": [{{"key": "year", "value": "1813", "operator": ">"}}, {{"key": "author", "value": "Jane Austen", "operator": "=="}}], "top_k": null}}

<< Example 2. >>
Data Source:
{info_str}

User Query:
{query_str}

Additional Instructions:
{additional_instructions}

Structured Request:
"""

In [ ]:

prompt_tmpl = PromptTemplate(prompt_tmpl_str)

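You'll notice we added an additional_instructions template variable. This allows us to insert instructions that are specific to the vector collection.

We'll use partial_format to add the instruction.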

In [ ]:

add_instrs = """\
If one of the filters is 'theme', please make sure that the first letter of the inferred value is capitalized. Only words that are capitalized are valid values for "theme". \
"""
prompt_tmpl = prompt_tmpl.partial_format(additional_instructions=add_instrs)
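
The idea behind partial formatting is to fill in some template variables now and leave the rest for later. The sketch below reproduces that idea with the standard library only; the `partial_format` function and `_KeepMissing` class are hypothetical helpers written for this illustration, not the LlamaIndex implementation.

```python
class _KeepMissing(dict):
    """Mapping that leaves unknown fields in place as '{name}' for a later pass."""

    def __missing__(self, key: str) -> str:
        return "{" + key + "}"


def partial_format(template: str, **known: str) -> str:
    """Fill only the given variables; all other fields stay as placeholders.

    Caveat of this sketch: escaped braces ('{{' and '}}') are unescaped by the
    first pass, so it is only safe for templates without literal braces.
    """
    return template.format_map(_KeepMissing(known))


template = (
    "User Query:\n{query_str}\n\n"
    "Additional Instructions:\n{additional_instructions}\n"
)

# Step 1: bind the collection-specific instruction once.
partial = partial_format(
    template,
    additional_instructions="Capitalize the first letter of any 'theme' value.",
)

# Step 2: fill the remaining variable at query time.
final = partial.format(query_str="books that are mafia-themed")
print(final)
```

Binding the instruction once keeps the per-query call unchanged: callers still supply only the query-time variables.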

In [ ]:

retriever.update_prompts({"prompt": prompt_tmpl})

Re-run some queries¶

Now let's re-run some queries; we should see that the filter value is inferred with the correct capitalization.

In [ ]:

nodes = retriever.retrieve(
    "Tell me about some books that are friendship-themed"
)

In [ ]:

for node in nodes:
    print(node.text)
    print(node.metadata)

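2.b Implement Dynamic Metadata Retrieval¶

Instead of hardcoding rules in the prompt, another option is to retrieve relevant few-shot examples of metadata, to help the LLM better infer the correct metadata filters.

This better prevents the LLM from making mistakes when inferring "where" clauses, especially around aspects like spelling and the correct formatting of values.

We can do this via vector retrieval. The existing vector db collection stores the raw text + metadata; we could query this collection directly, or separately index only the metadata and retrieve from that. In this section we choose to do the former, but in practice you may want to do the latter.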

In [ ]:

# define a retriever that fetches the top 2 examples.
metadata_retriever = index.as_retriever(similarity_top_k=2)

We use the same prompt_tmpl_str defined in the previous section.

In [ ]:

from typing import Any


def format_additional_instrs(**kwargs: Any) -> str:
    """Format the retrieved metadata examples into a string."""
    nodes = metadata_retriever.retrieve(kwargs["query_str"])
    context_str = (
        "Here is the metadata of relevant entries from the database collection. "
        "This should help you infer the right filters: \n"
    )
    for node in nodes:
        context_str += str(node.node.metadata) + "\n"
    return context_str


# additional_instructions is now computed per query instead of being fixed.
ext_prompt_tmpl = PromptTemplate(
    prompt_tmpl_str,
    function_mappings={"additional_instructions": format_additional_instrs},
)
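
The mechanism here is a template variable whose value is computed from the query at format time. The stand-alone sketch below imitates that flow under loud assumptions: word-overlap ranking replaces vector retrieval, and `retrieve_metadata`, `format_with_functions`, and the shortened `TEMPLATE` are hypothetical helpers, not LlamaIndex APIs.

```python
import re
from typing import Any, Callable, Dict, List, Tuple

# Toy collection: (text, metadata) pairs taken from three of the notebook's nodes.
ENTRIES: List[Tuple[str, Dict[str, Any]]] = [
    ("The Godfather", {"director": "Francis Ford Coppola", "theme": "Mafia", "year": 1972}),
    ("Inception", {"director": "Christopher Nolan", "theme": "Fiction", "year": 2010}),
    ("To Kill a Mockingbird", {"author": "Harper Lee", "theme": "Fiction", "year": 1960}),
]


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_metadata(query: str, top_k: int = 1) -> List[Dict[str, Any]]:
    """Toy stand-in for vector retrieval: rank entries by word overlap with the query."""
    query_tokens = _tokens(query)

    def score(entry: Tuple[str, Dict[str, Any]]) -> int:
        text, metadata = entry
        values = " ".join(str(value) for value in metadata.values())
        return len(query_tokens & _tokens(text + " " + values))

    ranked = sorted(ENTRIES, key=score, reverse=True)
    return [metadata for _text, metadata in ranked[:top_k]]


def format_with_functions(
    template: str, function_mappings: Dict[str, Callable[..., str]], **kwargs: Any
) -> str:
    """Compute each mapped variable by calling its function with the other kwargs."""
    computed = {name: fn(**kwargs) for name, fn in function_mappings.items()}
    return template.format(**kwargs, **computed)


def additional_instructions(**kwargs: Any) -> str:
    """Build the instruction text from metadata relevant to the current query."""
    lines = [str(metadata) for metadata in retrieve_metadata(kwargs["query_str"])]
    return "Metadata of relevant entries:\n" + "\n".join(lines)


TEMPLATE = (
    "User Query:\n{query_str}\n\n"
    "Additional Instructions:\n{additional_instructions}\n"
)
prompt = format_with_functions(
    TEMPLATE,
    {"additional_instructions": additional_instructions},
    query_str="Tell me about some books that are mafia-themed",
)
print(prompt)
```

Because the retrieved metadata shows the stored spelling ('Mafia'), the LLM sees a correctly formatted value right next to the query it has to structure.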

In [ ]:

retriever.update_prompts({"prompt": ext_prompt_tmpl})

Re-run some queries¶

Now let's re-run some queries; we should see that the filter value is inferred correctly.

In [ ]:

nodes = retriever.retrieve("Tell me about some books that are mafia-themed")
for node in nodes:
    print(node.text)
    print(node.metadata)

Using query str: books
Using filters: [('theme', '==', 'Mafia')]
The Godfather
{'director': 'Francis Ford Coppola', 'theme': 'Mafia', 'year': 1972}

In [ ]:

nodes = retriever.retrieve("Tell me some books authored by HARPER LEE")
for node in nodes:
    print(node.text)
    print(node.metadata)

Using query str: Books authored by Harper Lee
Using filters: [('author', '==', 'Harper Lee')]
To Kill a Mockingbird
{'author': 'Harper Lee', 'theme': 'Fiction', 'year': 1960}
