Deploy MLflow Model as a Local Inference Server

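MLflow allows you to deploy your model locally with a single command. This approach is ideal for lightweight applications or for testing your model locally before moving it to a staging or production environment.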

If you are new to MLflow model deployment, please read the MLflow Deployment guide first to understand the basic concepts of MLflow models and deployments.

Deploying Inference Server

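Before deploying, you must have an MLflow Model. If you don't have one, you can create a sample scikit-learn model by following the MLflow Tracking Quickstart. Remember to note down the model URI, such as runs:/<run_id>/<artifact_path> (or models:/<model_name>/<model_version> if you registered the model in the MLflow Model Registry).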

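Once you have the model ready, deploying to a local server is straightforward. The mlflow models serve command deploys the model in a single step: it starts a local server that listens on the specified port and serves your model. Refer to the CLI reference for available options.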

mlflow models serve -m runs:/<run_id>/model -p 5000

You can then send a test request to the server as follows:

curl http://127.0.0.1:5000/invocations -H "Content-Type:application/json"  --data '{"inputs": [[1, 2], [3, 4], [5, 6]]}'
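The same test request can also be built from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on 127.0.0.1:5000, and the helper names `build_invocation_request` and `send_request` are illustrative, not part of MLflow.

```python
import json
import urllib.request


def build_invocation_request(base_url, inputs):
    """Build a POST request for the /invocations endpoint (hypothetical helper)."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request, timeout=10.0):
    """Send the request and return the response body as text.

    This needs a running inference server, so it is not called here.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Same payload as the curl command: three rows with two features each.
request = build_invocation_request("http://127.0.0.1:5000", [[1, 2], [3, 4], [5, 6]])
print(request.full_url, request.data.decode("utf-8"))
```

With the server running, `send_request(request)` would return the raw response body for inspection.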

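Several command line options are available to customize the server's behavior. For instance, the --env-manager option lets you choose a specific environment manager, like Anaconda, to create the virtual environment. The mlflow models module also provides other useful commands, such as building a Docker image or generating a Dockerfile. For comprehensive details, please refer to the MLflow CLI Reference.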

Inference Server Specification

Endpoints

The inference server provides 4 endpoints:

/invocations: An inference endpoint that accepts POST requests with input data and returns predictions.
/ping: Used for health checks.
/health: Same as /ping.
/version: Returns the MLflow version.
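A client or startup script can poll the health endpoint before sending inference requests. The sketch below uses only the standard library and assumes that a healthy server answers GET /ping with HTTP 200; the function names, retry count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url, timeout=2.0):
    """Return True if GET /ping answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def wait_until_ready(base_url, attempts=10, delay=1.0):
    """Poll /ping up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if is_healthy(base_url):
            return True
        time.sleep(delay)
    return False
```

For example, `wait_until_ready("http://127.0.0.1:5000")` returns False if the server does not come up within the retry budget.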

Accepted Input Formats

The /invocations endpoint accepts CSV or JSON inputs. The input format must be specified in the Content-Type header as either application/json or application/csv.

CSV Input

CSV input must be a valid pandas.DataFrame CSV representation. For example:

curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/csv' --data '1,2,3,4'
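For tabular data with named columns, the CSV body can be generated rather than typed by hand. This is a minimal sketch with the standard `csv` module; it assumes the server treats the first line as column names, as `pandas.read_csv` does by default, and the column names `a` and `b` are hypothetical.

```python
import csv
import io


def rows_to_csv(columns, rows):
    """Serialize a header row plus data rows into a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical two-column input; real column names must match the model's inputs.
csv_body = rows_to_csv(["a", "b"], [[1, 2], [3, 4]])
print(csv_body)
```

The resulting string would be sent as the request body with the Content-Type header set to application/csv.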

JSON Input

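You can either pass a flat dictionary corresponding to the desired model payload, or wrap the payload in a dict with a key that specifies the payload format.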

Wrapped Payload Dict

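If your model format is not among the supported raw payload formats, or you want to avoid transforming your input data to the required payload format, you can use the dict payload structures below.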

Field	Description	Example
`dataframe_split`	Pandas DataFrames in the `split` orientation.	`{"dataframe_split": pandas_df.to_dict(orient="split")}`
`dataframe_records`	Pandas DataFrame in the records orientation. We do not recommend using this format because it is not guaranteed to preserve column ordering.	`{"dataframe_records": pandas_df.to_dict(orient="records")}`
`instances`	Tensor input formatted as described in TF Serving’s API docs where the provided inputs will be cast to Numpy arrays.	`{"instances": [1.0, 2.0, 5.0]}`
`inputs`	Same as `instances` but with a different key.	`{"inputs": [["Cheese"], ["and", "Crackers"]]}`
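To make the four wrapper keys concrete, the sketch below builds each payload from the same small table without requiring pandas. The column names and values are hypothetical, and the `dataframe_split` form omits the `index` key that `pandas_df.to_dict(orient="split")` would add, on the assumption that the server accepts it without one.

```python
import json

# Hypothetical tabular input: two columns, two rows.
columns = ["a", "b"]
rows = [[1, 2], [3, 4]]

payloads = {
    # Column names and row data kept separate, so column order is preserved.
    "dataframe_split": {"dataframe_split": {"columns": columns, "data": rows}},
    # One dict per row; column order is not guaranteed to survive.
    "dataframe_records": {
        "dataframe_records": [dict(zip(columns, row)) for row in rows]
    },
    # Tensor-style inputs: nested lists that the server casts to arrays.
    "instances": {"instances": rows},
    "inputs": {"inputs": rows},
}

for name, payload in payloads.items():
    print(name, json.dumps(payload))
```

Each payload has exactly one top-level format key, which tells the server how to interpret the data.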

Example
# Prerequisite: serve a custom pyfunc OpenAI model (not mlflow.openai) on localhost:5678
#   that defines inputs in the below format and params of `temperature` and `max_tokens`

import json
import requests

payload = json.dumps(
    {
        "inputs": {"messages": [{"role": "user", "content": "Tell a joke!"}]},
        "params": {
            "temperature": 0.5,
            "max_tokens": 20,
        },
    }
)
response = requests.post(
    url="http://localhost:5678/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())

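JSON input can also include an optional params field for passing additional parameters. Valid parameter types are Union[DataType, List[DataType], None], where DataType is one of the MLflow data types. To pass parameters, a valid Model Signature with params must be defined.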

curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "inputs": {"question": ["What color is it?"],
                "context": ["Some people said it was green but I know that it is pink."]},
    "params": {"max_answer_len": 10}
}'

Note

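Since JSON discards type information, MLflow casts the JSON input to the input type specified in the model's schema, if one is available. If your model is sensitive to input types, it is recommended that you provide a schema for the model to ensure that type mismatch errors do not occur at inference time. In particular, deep learning models are typically strict about input types and need a model schema in order to score correctly. For complex data types, see Encoding Complex Data below.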

Raw Payload Dict

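If your payload is in a format that your MLflow-served model accepts and it is one of the supported model types below, you can pass a raw payload dict directly.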

Supported Request Format	Description	Example
OpenAI Chat	OpenAI chat request payload†	`{"messages": [{"role": "user", "content": "Tell a joke!"}], "temperature": 0.0}`

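† Note that the model argument should not be included when using the OpenAI APIs, since its configuration is set by the MLflow model instance. All other parameters can be used freely, provided that they are defined in the params argument of the logged model signature.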

Example
# Prerequisite: serve a pyfunc model that accepts OpenAI-compatible chat requests on localhost:5678
#   and defines `temperature` and `max_tokens` as parameters within the logged model signature

import json
import requests

payload = json.dumps(
    {
        "messages": [{"role": "user", "content": "Tell a joke!"}],
        "temperature": 0.5,
        "max_tokens": 20,
    }
)
# Keep the returned Response object so its JSON body can be read afterwards
response = requests.post(
    url="http://localhost:5678/invocations",
    data=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())
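Because the model argument must not be sent, an existing OpenAI-style request can be cleaned up before it is posted. The helper below is a hypothetical sketch, not an MLflow API: it drops the `model` key and leaves every other field untouched.

```python
import json


def to_raw_chat_payload(openai_request):
    """Return a copy of an OpenAI-style chat request without the `model` key."""
    payload = {key: value for key, value in openai_request.items() if key != "model"}
    if "messages" not in payload:
        raise ValueError("an OpenAI chat payload needs a 'messages' list")
    return payload


# Hypothetical request as it might be written for the OpenAI API directly.
original = {
    "model": "some-model-name",
    "messages": [{"role": "user", "content": "Tell a joke!"}],
    "temperature": 0.5,
    "max_tokens": 20,
}
print(json.dumps(to_raw_chat_payload(original)))
```

The remaining keys, here `temperature` and `max_tokens`, still need to be declared as params in the logged model signature for the server to accept them.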

Encoding Complex Data

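Complex data types, such as dates or binary, do not have a native JSON representation. If you include a model signature, MLflow can automatically decode supported data types from JSON. The following data type conversions are supported: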

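binary: Data is expected to be base64 encoded; MLflow automatically base64-decodes it.
datetime: Data is expected to be encoded as a string according to the ISO 8601 specification. MLflow parses it into the appropriate datetime representation on the given platform.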

Example requests:

# record-oriented DataFrame input with binary column "b"
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "dataframe_records": [
        {"a": 0, "b": "dGVzdCBiaW5hcnkgZGF0YSAw"},
        {"a": 1, "b": "dGVzdCBiaW5hcnkgZGF0YSAx"},
        {"a": 2, "b": "dGVzdCBiaW5hcnkgZGF0YSAy"}
    ]
}'

# record-oriented DataFrame input with datetime column "b"
curl http://127.0.0.1:5000/invocations -H 'Content-Type: application/json' -d '{
    "dataframe_records": [
        {"a": 0, "b": "2020-01-01T00:00:00Z"},
        {"a": 1, "b": "2020-02-01T12:34:56Z"},
        {"a": 2, "b": "2021-03-01T00:00:00Z"}
    ]
}'
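The encoded values in these requests can be produced on the client with the standard library. The sketch below base64-encodes bytes and formats timezone-aware datetimes as ISO 8601 strings in UTC; the helper names are illustrative, and the sample bytes are chosen so that they reproduce the binary strings used above.

```python
import base64
import json
from datetime import datetime, timezone


def encode_binary(value: bytes) -> str:
    """Base64-encode raw bytes into an ASCII string suitable for JSON."""
    return base64.b64encode(value).decode("ascii")


def encode_datetime(value: datetime) -> str:
    """Format a timezone-aware datetime as an ISO 8601 UTC string ending in Z."""
    return value.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Records with a binary column "b", one dict per row.
records = [
    {"a": i, "b": encode_binary(f"test binary data {i}".encode("utf-8"))}
    for i in range(3)
]
print(json.dumps(records, indent=4))
print(encode_datetime(datetime(2020, 2, 1, 12, 34, 56, tzinfo=timezone.utc)))
```

Decoding on the server side still depends on the model signature declaring the corresponding column as binary or datetime.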

Serving Frameworks

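By default, MLflow uses FastAPI, a modern ASGI web application framework for Python, to serve the inference endpoint. FastAPI handles requests asynchronously and is recognized as one of the fastest Python frameworks. This production-ready framework works well for most use cases. Additionally, MLflow integrates with MLServer as an alternative serving engine. MLServer achieves higher performance and scalability by leveraging an asynchronous request/response paradigm and workload offloading. MLServer is also used as the core Python inference server in Kubernetes-native frameworks such as Seldon Core and KServe (formerly known as KFServing), and therefore provides advanced features such as canary deployment and autoscaling out of the box.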


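	FastAPI	MLServer
Use Case	Standard use cases, including local testing.	High-scale production environments.
Set Up	FastAPI is installed by default with MLflow.	Needs to be installed separately.
Performance	FastAPI natively supports asynchronous request handling, making it well suited for I/O-bound tasks, including ML workloads. Refer to the FastAPI benchmarks for comparisons with other Python frameworks.	Designed for high-performance ML workloads, often delivering better throughput and efficiency. MLServer supports the asynchronous request/response paradigm by offloading ML inference workloads to a separate worker pool (processes), so that the server can continue to accept new requests while inference is being processed. Refer to MLServer Parallel Inference for details on how this works. Additionally, MLServer supports adaptive batching, which transparently batches requests together to improve throughput and efficiency.
Scalability	While FastAPI works well in a distributed environment in general, MLflow simply runs it with `uvicorn` and does not support horizontal scaling out of the box.	In addition to the support for parallel inference mentioned above, MLServer is used as the core inference server in Kubernetes-native frameworks such as Seldon Core and KServe (formerly known as KFServing). By deploying MLflow models to Kubernetes with MLServer, you can leverage advanced features of these frameworks, such as autoscaling, to achieve high scalability.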

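MLServer exposes the same scoring API through the /invocations endpoint. To deploy with MLServer, first install the additional dependencies with pip install mlflow[extras], then run the deployment command with the --enable-mlserver option. For example: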

mlflow models serve -m runs:/<run_id>/model -p 5000 --enable-mlserver

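To read more about the integration between MLflow and MLServer, please check the end-to-end example in the MLServer documentation. You can also find guides for deploying MLflow models to a Kubernetes cluster using MLServer in Deploying a Model to Kubernetes.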

Running Batch Inference

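Instead of running an online inference endpoint, you can execute a single batch inference job on local files using the mlflow models predict command. The following command runs the model prediction on input.csv and writes the results to output.csv.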

Bash

mlflow models predict -m runs:/<run_id>/model -i input.csv -o output.csv

Python

import mlflow
import pandas as pd

model = mlflow.pyfunc.load_model("runs:/<run_id>/model")
predictions = model.predict(pd.read_csv("input.csv"))
# predict() may return a NumPy array, so wrap the result before writing CSV
pd.DataFrame(predictions).to_csv("output.csv")
