MLflow LlamaIndex 风格

注意

llama_index 风格正在积极开发中，并被标记为实验性。公共API可能会发生变化，随着风格的演进，可能会添加新功能。

介绍

LlamaIndex 🦙 是一个强大的以数据为中心的框架，旨在无缝连接自定义数据源与大型语言模型（LLMs）。它提供了一套全面的数据结构和工具，简化了摄取、结构化和访问私有或领域特定数据以供LLMs使用的过程。LlamaIndex 通过提供高效的索引和检索机制，擅长于实现上下文感知的AI应用，使得构建先进的问答系统、聊天机器人和其他需要集成外部知识的AI驱动应用变得更加容易。

为什么要将 LlamaIndex 与 MLflow 一起使用？

LlamaIndex 库与 MLflow 的集成提供了管理和部署 LlamaIndex 引擎的无缝体验。以下是使用 LlamaIndex 与 MLflow 的一些主要优势：

MLflow Tracking 允许你在 MLflow 中跟踪你的索引，并管理构成你的 LlamaIndex 项目的众多移动部件，例如提示、LLMs、工作流、工具、全局配置等。
MLflow 模型将您的 LlamaIndex 索引/引擎/工作流及其所有依赖版本、输入和输出接口以及其他必要的元数据打包。这使您能够轻松部署 LlamaIndex 模型进行推理，确保在整个 ML 生命周期的不同阶段环境一致。
MLflow Evaluate 在 MLflow 中提供了评估 GenAI 应用的原生功能。此功能有助于高效评估 LlamaIndex 模型的推理结果，确保稳健的性能分析并促进快速迭代。
MLflow 追踪是一个强大的可观测性工具，用于监控和调试 LlamaIndex 模型内部发生的情况，帮助您快速识别潜在的瓶颈或问题。凭借其强大的自动日志记录功能，您可以在不添加任何代码的情况下，通过运行单个命令来检测您的 LlamaIndex 应用程序。

入门指南

在这些入门教程中，您将学习LlamaIndex最基本的组件，以及如何利用与MLflow的集成，为您的LlamaIndex应用程序带来更好的可维护性和可观察性。

Get started with MLflow and LLamaIndex by building a simple agentic Workflow. Learn how to log and load the Workflow for inference, as well as enable tracing for observability.

Get started with MLflow and LlamaIndex by exploring the simplest possible index configuration of a VectorStoreIndex.

概念

备注

工作流集成仅在 LlamaIndex >= 0.11.0 和 MLflow >= 2.17.0 中可用。

`工作流程` 🆕

Workflow 是 LlamaIndex 的事件驱动编排框架。它被设计为一个灵活且可解释的框架，用于构建任意 LLM 应用程序，如代理、RAG 流程、数据提取管道等。MLflow 支持对 Workflow 对象进行跟踪、评估和追踪，这使得它们更具可观察性和可维护性。

`索引`

Index 对象是一个为快速信息检索而索引的文档集合，为诸如检索增强生成（RAG）和代理等应用提供能力。Index 对象可以直接记录到 MLflow 运行中，并加载回来作为推理引擎使用。

`Engine`

Engine 是一个基于 Index 对象构建的通用接口，它提供了一组 API 来与索引进行交互。LlamaIndex 提供了两种类型的引擎：QueryEngine 和 ChatEngine。QueryEngine 简单地接受一个查询并根据索引返回响应。ChatEngine 是为对话代理设计的，它会跟踪对话历史。

`设置`

Settings 对象是一个全局服务上下文，它捆绑了 LlamaIndex 应用程序中常用的资源。它包括诸如 LLM 模型、嵌入模型、回调等设置。当记录 LlamaIndex 索引/引擎/工作流时，MLflow 会跟踪 Settings 对象的状态，以便在加载模型进行推理时可以轻松重现相同的结果（注意，某些对象如 API 密钥、不可序列化的对象等，不会被跟踪）。

用法

在 MLflow 实验中保存和加载索引

创建索引

index 对象是 LlamaIndex 和 MLflow 集成中的核心。通过 LlamaIndex，你可以从一组文档或外部向量存储中创建索引。以下代码从 LlamaIndex 仓库中可用的 Paul Graham 的论文数据创建了一个示例索引。

mkdir -p data
curl -L https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt -o ./data/paul_graham_essay.txt

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

将索引记录到 MLflow

你可以使用 mlflow.llama_index.log_model() 函数将 index 对象记录到 MLflow 实验中。

这里的关键步骤之一是指定 engine_type 参数。引擎类型的选择不会影响索引本身，但决定了你在加载索引进行推理时查询索引的接口。

查询引擎 (engine_type="query") 是为一个简单的查询-响应系统设计的，它接受一个单一的查询字符串并返回一个响应。
ChatEngine (engine_type="chat") 是为一个能够跟踪对话历史并响应用户消息的对话代理设计的。
检索器 (engine_type="retriever") 是一个较低级别的组件，它返回与查询匹配的前 k 个相关文档。

以下代码是使用 chat 引擎类型记录索引到 MLflow 的示例。

import mlflow

mlflow.set_experiment("llama-index-demo")

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        artifact_path="index",
        engine_type="chat",
        input_example="What did the author do growing up?",
    )

备注

上述代码片段直接将索引对象传递给 log_model 函数。此方法仅适用于默认的 SimpleVectorStore 向量存储，该存储仅将嵌入的文档保存在内存中。如果你的索引使用 外部向量存储 如 QdrantVectorStore 或 DatabricksVectorSearch，你可以使用代码日志记录方法。更多详情请参阅如何使用外部向量存储记录索引。

小技巧

在底层，MLflow 调用索引对象上的 as_query_engine() / as_chat_engine() / as_retriever() 方法，将其转换为相应的引擎实例。

加载索引以进行推理

保存的索引可以使用 mlflow.pyfunc.load_model() 函数加载回来进行推理。此函数提供了一个由 LlamaIndex 引擎支持的 MLflow Python 模型，引擎类型在记录时指定。

import mlflow

model = mlflow.pyfunc.load_model(model_info.model_uri)

response = model.predict("What was the first program the author wrote?")
print(response)
# >> The first program the author wrote was on the IBM 1401 ...

# The chat engine keeps track of the conversation history
response = model.predict("How did the author feel about it?")
print(response)
# >> The author felt puzzled by the first program ...

小技巧

要加载索引本身而不是引擎，请使用 mlflow.llama_index.load_model() 函数。

index = mlflow.llama_index.load_model("runs:/<run_id>/index")

启用跟踪

你可以通过调用 mlflow.llama_index.autolog() 函数为你的 LlamaIndex 代码启用跟踪。MLflow 会自动将 LlamaIndex 执行的输入和输出记录到活动的 MLflow 实验中，为你提供模型行为的详细视图。

import mlflow

mlflow.llama_index.autolog()

chat_engine = index.as_chat_engine()
response = chat_engine.chat("What was the first program the author wrote?")

然后你可以导航到 MLflow UI，选择实验，并打开“Traces”标签来找到引擎所做的预测的记录跟踪。看到聊天引擎如何协调和执行一系列任务来回答你的问题，真是令人印象深刻！

你可以通过运行相同的函数并将 disable 参数设置为 True 来禁用跟踪：

mlflow.llama_index.autolog(disable=True)

备注

跟踪支持异步预测和流式响应，但是，它不支持异步和流式的组合，例如 astream_chat 方法。

常见问题解答

如何使用外部向量存储记录和加载索引？

如果你的索引使用默认的 SimpleVectorStore，你可以使用 mlflow.llama_index.log_model() 函数直接将索引记录到 MLflow。MLflow 将内存中的索引数据（嵌入文档）持久化到 MLflow 工件存储中，这使得可以在不重新索引文档的情况下，使用相同的数据加载回索引。

然而，当索引使用像 DatabricksVectorSearch 和 QdrantVectorStore 这样的外部向量存储时，索引数据存储在远程位置，并且它们不支持本地序列化。因此，您不能直接使用这些存储记录索引。对于这种情况，您可以使用 Model-from-Code 记录，它提供了对索引保存过程的更多控制，并允许您使用外部向量存储。

要使用 model-from-code 日志记录，您首先需要创建一个单独的 Python 文件来定义索引。如果您使用的是 Jupyter notebook，可以使用 %%writefile 魔法命令将单元格代码保存到 Python 文件中。

%%writefile index.py

# Create Qdrant client with your own settings.
client = qdrant_client.QdrantClient(
    host="localhost",
    port=6333,
)

# Here we simply load vector store from the existing collection to avoid
# re-indexing documents, because this Python file is executed every time
# when the model is loaded. If you don't have an existing collection, create
# a new one by following the official tutorial:
# https://docs.llamaindex.ai/en/stable/examples/vector_stores/QdrantIndexDemo/
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

# IMPORTANT: call set_model() method to tell MLflow to log this index
mlflow.models.set_model(index)

然后，您可以通过将Python文件路径传递给 mlflow.llama_index.log_model() 函数来记录索引。全局 Settings 对象通常作为模型的一部分保存。

import mlflow

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        "index.py",
        artifact_path="index",
        engine_type="query",
    )

可以使用 mlflow.llama_index.load_model() 或 mlflow.pyfunc.load_model() 函数加载已记录的索引，方式与本地索引相同。

index = mlflow.llama_index.load_model(model_info.model_uri)
index.as_query_engine().query("What is MLflow?")

备注

传递给 set_model() 方法的对象必须是一个与记录期间指定的引擎类型兼容的 LlamaIndex 索引。未来版本将添加更多对象支持。

如何记录和加载 LlamaIndex 工作流？

Mlflow 支持通过 Model-from-Code 功能记录和加载 LlamaIndex 工作流。有关记录和加载 LlamaIndex 工作流的详细示例，请参阅 LlamaIndex Workflows with MLflow 笔记本。

import mlflow

with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        "/path/to/workflow.py",
        artifact_path="model",
        input_example={"input": "What is MLflow?"},
    )

可以使用 mlflow.llama_index.load_model() 或 mlflow.pyfunc.load_model() 函数加载已记录的工作流。

# Use mlflow.llama_index.load_model to load the workflow object as is
workflow = mlflow.llama_index.load_model(model_info.model_uri)
await workflow.run(input="What is MLflow?")

# Use mlflow.pyfunc.load_model to load the workflow as a MLflow Pyfunc Model
# with standard inference APIs for deployment and evaluation.
pyfunc = mlflow.pyfunc.load_model(model_info.model_uri)
pyfunc.predict({"input": "What is MLflow?"})

警告

MLflow PyFunc 模型不支持异步推理。当你使用 mlflow.pyfunc.load_model() 加载工作流时，predict 方法变为同步并会阻塞，直到工作流执行完成。当使用 MLflow Deployment 或 Databricks Model Serving 将记录的 LlamaIndex 工作流部署到生产端点时，这也适用。

我有一个使用 `query` 引擎类型记录的索引。我可以将其加载回 `chat` 引擎吗？

虽然无法就地更新已记录模型的引擎类型，但您始终可以重新加载索引并以所需的引擎类型重新记录。此过程**不需要重新创建索引**，因此是切换不同引擎类型的有效方法。

import mlflow

# Log the index with the query engine type first
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        artifact_path="index-query",
        engine_type="query",
    )

# Load the index back and re-log it with the chat engine type
index = mlflow.llama_index.load_model(model_info.model_uri)
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index,
        artifact_path="index-chat",
        # Specify the chat engine type this time
        engine_type="chat",
    )

或者，您可以利用他们在加载的 LlamaIndex 原生索引对象上的标准推理 API，特别是：

index.as_chat_engine().chat("hi")
index.as_query_engine().query("hi")
index.as_retriever().retrieve("hi")

如何使用加载的引擎进行推理的不同LLMs？

当将索引保存到 MLflow 时，它会持久化全局 Settings 对象作为模型的一部分。该对象包含设置，如 LLM 和嵌入模型，这些模型将由引擎使用。

import mlflow
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI("gpt-4o-mini")

# MLflow saves GPT-4o-Mini as the LLM to use for inference
with mlflow.start_run():
    model_info = mlflow.llama_index.log_model(
        index, artifact_path="index", engine_type="chat"
    )

然后当你稍后重新加载索引时，持久化的设置也会全局应用。这意味着加载的引擎将使用与记录时相同的LLM。

然而，有时你可能希望使用不同的LLM进行推理。在这种情况下，你可以在加载索引后直接更新全局的 Settings 对象。

import mlflow

# Load the index back
loaded_index = mlflow.llama_index.load_model(model_info.model_uri)

assert Settings.llm.model == "gpt-4o-mini"


# Update the settings to use GPT-4 instead
Settings.llm = OpenAI("gpt-4")
query_engine = loaded_index.as_query_engine()
response = query_engine.query("What is the capital of France?")