Serving LLMs with MLflow: Leveraging Custom PyFunc

Introduction

This tutorial guides you through saving and deploying large language models (LLMs) using a custom pyfunc with MLflow, an ideal fit for models that are not directly supported by MLflow's default transformers flavor.

Learning Objectives

  • Understand why custom pyfunc definitions are needed in certain model scenarios.

  • Learn how to create a custom pyfunc that manages model dependencies and interface data.

  • Gain insight into how a custom pyfunc can simplify the user-facing interface in deployment environments.

Challenges with the Default Implementation

While MLflow's transformers flavor generally handles models from the HuggingFace Transformers library, some models or configurations do not conform to this standard approach. In scenarios like ours, where the model cannot use the default pipeline type, we face a unique challenge when deploying these models with MLflow.
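For reference only, a model that does fit the standard approach would typically be logged with the built-in flavor roughly as in the sketch below. The model and task here are illustrative assumptions, not what we will do with MPT-7B in this tutorial:

# Sketch: the standard path for a pipeline-compatible model, shown only for contrast.
# MPT-7B-instruct does not fit this path, which is why we build a custom pyfunc instead.
import transformers

import mlflow

pipe = transformers.pipeline(task="text-generation", model="gpt2")  # illustrative model

with mlflow.start_run():
    mlflow.transformers.log_model(transformers_model=pipe, artifact_path="text_generation_model")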

The Power of Custom PyFunc

To address this, we turn to MLflow's custom pyfunc. It allows us to:

  • Handle model loading and its dependencies efficiently.

  • Customize the inference process to suit a specific model's requirements.

  • Adapt interface data to create a user-friendly environment in deployed applications.

Our focus will be on the practical application of a custom pyfunc to deploy LLMs effectively within the MLflow ecosystem.
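At its core, a custom pyfunc is simply a subclass of mlflow.pyfunc.PythonModel that implements two hooks: one for loading the model and its dependencies, and one for inference. The minimal sketch below (the class name and method bodies are placeholders) shows the shape of what we will build out in full later in this tutorial:

import mlflow


class MyCustomModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Runs once when the model is loaded: resolve logged artifacts and
        # initialize heavy dependencies (tokenizer, weights, etc.) here.
        ...

    def predict(self, context, model_input, params=None):
        # Runs for every inference request: translate the user-facing input
        # into what the wrapped model expects and return its output.
        ...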

By the end of this tutorial, you will have the knowledge to tackle similar challenges in your own machine learning projects, leveraging MLflow's full potential for custom model deployment.

Important Considerations Before Proceeding

Hardware Recommendations

This guide demonstrates the use of a particularly large and complex large language model (LLM). Given its complexity:

  • GPU Requirements: It is strongly advised to run this example on a system with a CUDA-capable GPU that has at least 64GB of VRAM.

  • CPU Caution: While technically feasible, running the model on a CPU leads to extremely long inference times; even on top-tier CPUs, a single prediction can take tens of minutes. The last cell of this notebook is intentionally left unexecuted because of the performance constraints of running this model on a CPU-only system. On an appropriately powerful GPU, however, the total runtime of this notebook is roughly 8 minutes.

Execution Recommendations

If you are considering running the code in this notebook:

  • Performance: For a smoother experience and to truly harness the model's capabilities, use hardware aligned with the model's design.

  • Dependencies: Make sure the recommended dependencies are installed for optimal model performance. These are critical for efficient model loading, initialization, attention computation, and inference processing:

pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python
[1]:
# Load necessary libraries

import accelerate
import torch
import transformers
from huggingface_hub import snapshot_download

import mlflow
/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_fields.py:128: UserWarning: Field "model_server_url" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(
/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/pydantic/_internal/_config.py:317: UserWarning: Valid config keys have changed in V2:
* 'schema_extra' has been renamed to 'json_schema_extra'
  warnings.warn(message, UserWarning)

Downloading the Model and Tokenizer

First, we need to download our model and tokenizer. Here's how we do it:

[2]:
# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")

Defining the Custom PyFunc

Now, let's define our custom pyfunc. It dictates how the model loads its dependencies and how it performs predictions. Note how the model's complexities are encapsulated within this class.

[3]:
class MPT(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """
        This method initializes the tokenizer and language model
        using the specified model snapshot directory.
        """
        # Initialize tokenizer and language model
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts["snapshot"], padding_side="left"
        )

        config = transformers.AutoConfig.from_pretrained(
            context.artifacts["snapshot"], trust_remote_code=True
        )
        # If you are running this in a system that has a sufficiently powerful GPU with available VRAM,
        # uncomment the configuration setting below to leverage triton.
        # Note that triton dramatically improves the inference speed performance

        # config.attn_config["attn_impl"] = "triton"

        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts["snapshot"],
            config=config,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )

        # NB: If you do not have a CUDA-capable device or have torch installed with CUDA support
        # this setting will not function correctly. Setting device to 'cpu' is valid, but
        # the performance will be very slow.
        self.model.to(device="cpu")
        # If running on a GPU-compatible environment, uncomment the following line:
        # self.model.to(device="cuda")

        self.model.eval()

    def _build_prompt(self, instruction):
        """
        This method generates the prompt for the model.
        """
        INSTRUCTION_KEY = "### Instruction:"
        RESPONSE_KEY = "### Response:"
        INTRO_BLURB = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
        )

        return f"""{INTRO_BLURB}
        {INSTRUCTION_KEY}
        {instruction}
        {RESPONSE_KEY}
        """

    def predict(self, context, model_input, params=None):
        """
        This method generates prediction for the given input.
        """
        prompt = model_input["prompt"][0]

        # Retrieve or use default values for temperature and max_tokens
        temperature = params.get("temperature", 0.1) if params else 0.1
        max_tokens = params.get("max_tokens", 1000) if params else 1000

        # Build the prompt
        prompt = self._build_prompt(prompt)

        # Encode the input and generate prediction
        # NB: Sending the tokenized inputs to the GPU here explicitly will not work if your system does not have CUDA support.
        # If attempting to run this with GPU support, change 'cpu' to 'cuda' for maximum performance
        encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
        output = self.model.generate(
            encoded_input,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
        )

        # Removing the prompt from the generated text
        prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
        generated_response = self.tokenizer.decode(
            output[0][prompt_length:], skip_special_tokens=True
        )

        return {"candidates": [generated_response]}

Building the Prompt

A key aspect of our custom pyfunc is the construction of the model prompt. Instead of requiring end users to understand and build it themselves, our custom pyfunc handles the prompt on their behalf. This keeps the end-user interface simple and consistent, no matter how complex the model's requirements are.

See the _build_prompt() method in the class above for how custom input processing logic can be added to a custom pyfunc, handling the translation needed to convert user input into a format compatible with the wrapped model instance.
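As a quick illustration, the prompt builder can be exercised directly without loading any model weights; the call below is an optional, illustrative check rather than a cell from the original notebook:

# Inspect the prompt string that the wrapper constructs on the user's behalf.
# This only exercises string formatting, so it is safe to run without a GPU.
print(MPT()._build_prompt("What is machine learning?"))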

[4]:
import numpy as np
import pandas as pd

import mlflow
from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, ParamSchema, ParamSpec, Schema

# Define input and output schema
input_schema = Schema(
    [
        ColSpec(DataType.string, "prompt"),
    ]
)
output_schema = Schema([ColSpec(DataType.string, "candidates")])

parameters = ParamSchema(
    [
        ParamSpec("temperature", DataType.float, np.float32(0.1), None),
        ParamSpec("max_tokens", DataType.integer, np.int32(1000), None),
    ]
)

signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=parameters)


# Define input example
input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})
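If you prefer not to declare the schemas by hand, MLflow can also infer a signature (including the params schema) from example data via mlflow.models.infer_signature. The sketch below is an assumption-laden alternative rather than part of the original notebook; the example output value is illustrative, and the inferred numeric types may be wider (e.g. long instead of integer) than the hand-declared ones above:

from mlflow.models import infer_signature

# Infer a comparable signature from an example input, an example output,
# and the default inference-time parameters.
inferred_signature = infer_signature(
    model_input=input_example,
    model_output=pd.DataFrame({"candidates": ["Machine learning is..."]}),
    params={"temperature": 0.1, "max_tokens": 1000},
)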

Set the experiment that we're going to be logging our custom model to

If the experiment doesn't already exist, MLflow will create a new experiment with this name and will alert you that it has created a new experiment.

[5]:
# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")
2023/11/29 17:33:23 INFO mlflow.tracking.fluent: Experiment with name 'mpt-7b-instruct-evaluation' does not exist. Creating a new experiment.
[5]:
<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/custom-pyfunc-for-llms/notebooks/mlruns/265930820950682761', creation_time=1701297203895, experiment_id='265930820950682761', last_update_time=1701297203895, lifecycle_stage='active', name='mpt-7b-instruct-evaluation', tags={}>
[6]:
# Get the current base version of torch that is installed, without specific version modifiers
torch_version = torch.__version__.split("+")[0]

# Start an MLflow run context and log the MPT-7B model wrapper along with the param-included signature to
# allow for overriding parameters at inference time
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "mpt-7b-instruct",
        python_model=MPT(),
        # NOTE: the artifacts dictionary mapping is critical! This dict is used by the load_context() method in our MPT() class.
        artifacts={"snapshot": snapshot_location},
        pip_requirements=[
            f"torch=={torch_version}",
            f"transformers=={transformers.__version__}",
            f"accelerate=={accelerate.__version__}",
            "einops",
            "sentencepiece",
        ],
        input_example=input_example,
        signature=signature,
    )
2023/11/29 17:33:24 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
/Users/benjamin.wilson/miniconda3/envs/mlflow-dev-env/lib/python3.8/site-packages/_distutils_hack/__init__.py:30: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

Load the saved model

[7]:
loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)
/Users/benjamin.wilson/.cache/huggingface/modules/transformers_modules/mpt-7b/configuration_mpt.py:97: UserWarning: alibi is turned on, setting `learned_pos_emb` to `False.`
  warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')

Testing the model's inference capabilities

[ ]:
# The execution of this is commented out for the purposes of runtime on CPU.
# If you are running this on a system with a sufficiently powerful GPU, you may uncomment and interface with the model!

# loaded_model.predict(pd.DataFrame(
#     {"prompt": ["What is machine learning?"]}), params={"temperature": 0.6}
# )
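Beyond in-process inference, the logged model can also be served behind a REST endpoint with the MLflow scoring server. The sketch below is an illustration rather than a cell from the original notebook; the port is arbitrary, and the model URI refers to the model logged earlier:

# In a separate shell, start the scoring server first (URI and port are illustrative):
#   mlflow models serve -m <model_uri> --port 5000

import requests

# Query the /invocations endpoint with a pandas-split payload plus inference params.
response = requests.post(
    "http://127.0.0.1:5000/invocations",
    json={
        "dataframe_split": {"columns": ["prompt"], "data": [["What is machine learning?"]]},
        "params": {"temperature": 0.6},
    },
)
print(response.json())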

Conclusion

Throughout this tutorial, we've seen the power and flexibility of MLflow's custom pyfunc. By understanding our model's specific needs and defining a custom pyfunc to meet them, we can ensure a seamless deployment process and a user-friendly interface.