Introduction to Sentence Transformers and MLflow

Welcome to this tutorial on using Sentence Transformers with MLflow for advanced natural language processing and model management.

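Learning objectives

Set up a sentence-embedding pipeline with sentence-transformers.
Log models and configurations with MLflow.
Understand model signatures in MLflow and apply them to sentence-transformers.
Deploy models and run inference with them using MLflow's capabilities.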

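What are Sentence Transformers?

Sentence Transformers is an extension of the Hugging Face Transformers library designed to produce semantically rich sentence embeddings. It builds on models such as BERT and RoBERTa, fine-tuned for tasks like semantic search and text clustering, to generate high-quality sentence-level embeddings.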

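Benefits of integrating MLflow with Sentence Transformers

Combining MLflow with Sentence Transformers strengthens NLP projects by:

Streamlining experiment management and logging.
Providing better control over model versions and configurations.
Ensuring that results and model predictions are reproducible.
Simplifying deployment to production environments.

Together, these make it efficient to track, manage, and deploy NLP applications.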

[1]:

# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false

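Setting up the environment for sentence embedding

Begin working with Sentence Transformers and MLflow by establishing the core working environment.

Key steps for initialization

Import the necessary libraries: SentenceTransformer and mlflow.
Initialize the "all-MiniLM-L6-v2" sentence transformer model.

Model initialization

The compact and efficient "all-MiniLM-L6-v2" model is chosen for its effectiveness at generating meaningful sentence embeddings. Explore more models on the Hugging Face Hub.

Purpose of the model

The model excels at converting sentences into semantically rich embeddings, which suit NLP tasks such as semantic search and clustering.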
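
To make "semantically rich" concrete: embeddings of related sentences tend to point in similar directions, which is what semantic search and clustering rely on. The sketch below computes cosine similarity in plain Python on made-up three-dimensional vectors. The vectors and the `cosine_similarity` helper are illustrative assumptions, not output of the model.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for sentence embeddings (hypothetical values).
query = [0.9, 0.1, 0.0]
related = [0.8, 0.2, 0.1]
unrelated = [0.0, 0.1, 0.9]

# The related vector should score higher against the query than the unrelated one.
print(cosine_similarity(query, related) > cosine_similarity(query, unrelated))
```

With real embeddings, the same function would be applied to the vectors returned by the model instead of the toy values.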

[2]:

from sentence_transformers import SentenceTransformer

import mlflow

model = SentenceTransformer("all-MiniLM-L6-v2")

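Defining the model signature with MLflow

Defining the model signature is a crucial step in setting up our Sentence Transformer model, because it ensures consistent and expected behavior during inference.

Steps for signature definition

Prepare example sentences: Define example sentences that demonstrate the model's input and output format.
Generate the model signature: Use the mlflow.models.infer_signature function with the model's input and output to define the signature automatically.

Importance of the model signature

Clarity in data formats: Clearly documents the types and structure of the data the model expects and produces.
Model deployment and usage: Crucial for deploying the model to production, ensuring that the model receives inputs in the correct format and produces the expected outputs.
Error prevention: Helps prevent errors during model inference by enforcing consistent data formats.

Note: The List[str] input type is equivalent at inference time to str. The MLflow flavor uses a ColSpec[str] definition for the input type.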
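
One way to picture the note above: a single string and a list of strings can share one code path if the single string is first wrapped in a list. The `normalize_sentences` helper below is a hypothetical illustration of that idea, and of the kind of type check a signature makes possible. It is not MLflow's implementation.

```python
def normalize_sentences(data):
    """Return the input as a list of strings, accepting a str or a list of str."""
    if isinstance(data, str):
        # A lone sentence is treated as a one-element batch.
        return [data]
    if isinstance(data, list) and all(isinstance(item, str) for item in data):
        return data
    raise TypeError("expected a string or a list of strings")


print(normalize_sentences("A sentence to encode."))
print(normalize_sentences(["A sentence to encode.", "Another sentence to encode."]))
```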

[3]:

example_sentences = ["A sentence to encode.", "Another sentence to encode."]

# Infer the signature of the custom model by providing an input example and the resultant prediction output.
# We're not including any custom inference parameters in this example, but you can include them as a third argument
# to infer_signature(), as you will see in the advanced tutorials for Sentence Transformers.
signature = mlflow.models.infer_signature(
    model_input=example_sentences,
    model_output=model.encode(example_sentences),
)

# Visualize the signature
signature

[3]:

inputs:
  [string]
outputs:
  [Tensor('float32', (-1, 384))]
params:
  None

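Creating an experiment

We create a new MLflow Experiment so that the run we log our model to does not go into the default experiment and instead has its own contextually relevant entry.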

[4]:

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Introduction to Sentence Transformers")

[4]:

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434', creation_time=1701280211449, experiment_id='469990615226680434', last_update_time=1701280211449, lifecycle_stage='active', name='Introduction to Sentence Transformers', tags={}>

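Logging the Sentence Transformer model with MLflow

With the Sentence Transformer model initialized and its signature defined, logging it in MLflow is the step that enables tracking, versioning, and deployment.

Steps for logging the model

Start an MLflow run: Start a new run with mlflow.start_run(), which groups all logging operations.
Log the model: Use mlflow.sentence_transformers.log_model to log the model, providing the model object, artifact path, signature, and input example.

Importance of model logging

Model management: Facilitates managing the model's lifecycle from training to deployment.
Reproducibility and tracking: Enables tracking of model versions and aids reproducibility.
Ease of deployment: Simplifies deployment by allowing the model to be deployed easily for inference.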

[5]:

with mlflow.start_run():
    logged_model = mlflow.sentence_transformers.log_model(
        model=model,
        artifact_path="sbert_model",
        signature=signature,
        input_example=example_sentences,
    )

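Loading the model and testing inference

After logging the Sentence Transformer model in MLflow, we demonstrate how to load it and test it for real-time inference.

Loading the model as a PyFunc

Why PyFunc: Load the logged model with mlflow.pyfunc.load_model for seamless integration into Python-based services or applications.
Model URI: Use logged_model.model_uri to accurately locate and load the model from MLflow.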
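
For orientation, the URI that appears in this tutorial's log output has the form runs:/<run_id>/<artifact_path>. The sketch below composes such a string by hand. The `build_runs_uri` helper is hypothetical, and the run ID is copied from the log output shown later in this tutorial, so yours will differ.

```python
def build_runs_uri(run_id, artifact_path):
    """Compose a 'runs:/<run_id>/<artifact_path>' URI string (hypothetical helper)."""
    if not run_id or not artifact_path:
        raise ValueError("run_id and artifact_path must be non-empty")
    # Strip stray slashes so the path segments are joined by exactly one '/'.
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


# Run ID taken from this tutorial's log output; it is specific to that run.
print(build_runs_uri("eeab3c1b13594fdea13e07585b1c0596", "sbert_model"))
```

In practice, reading logged_model.model_uri is preferable to composing the string yourself, since the exact URI form may depend on the MLflow version.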

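Running an inference test

Test sentences: Define sentences to test the model's embedding generation.
Perform the prediction: Call the model's predict method on the test sentences to obtain embeddings.
Print the embedding lengths: Verify embedding generation by checking the length of each embedding array, which corresponds to the dimensionality of each sentence representation.

Importance of inference testing

Model validation: Confirms the model's expected behavior and data handling once it is loaded.
Deployment readiness: Validates that the model is ready for real-time integration into application services.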

[6]:

inference_test = ["I enjoy pies of both apple and cherry.", "I prefer cookies."]

# Load our custom model by providing the uri for where the model was logged.
loaded_model_pyfunc = mlflow.pyfunc.load_model(logged_model.model_uri)

# Perform a quick test to ensure that our loaded model generates the correct output
embeddings_test = loaded_model_pyfunc.predict(inference_test)

# Verify that the output contains one embedding per input sentence (our expected output format)
print(f"The return structure length is: {len(embeddings_test)}")

for i, embedding in enumerate(embeddings_test):
    print(f"The size of embedding {i + 1} is: {len(embedding)}")

The return structure length is: 2
The size of embedding 1 is: 384
The size of embedding 2 is: 384
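
The same check can be turned into an assertion-style helper, which is convenient in automated tests. The sketch below uses nested Python lists as a stand-in for the model output. `check_embeddings` and the stand-in data are hypothetical, and 384 is the embedding size reported above for this model.

```python
def check_embeddings(embeddings, expected_count, expected_dim):
    """Raise ValueError unless there are expected_count vectors of expected_dim entries each."""
    if len(embeddings) != expected_count:
        raise ValueError(f"expected {expected_count} embeddings, got {len(embeddings)}")
    for i, embedding in enumerate(embeddings):
        if len(embedding) != expected_dim:
            raise ValueError(
                f"embedding {i + 1} has {len(embedding)} entries, expected {expected_dim}"
            )
    return True


# Stand-in for the output of predict(): two vectors of 384 zeros (not real embeddings).
fake_embeddings = [[0.0] * 384, [0.0] * 384]
print(check_embeddings(fake_embeddings, expected_count=2, expected_dim=384))
```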

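Displaying samples of the generated embeddings

Inspect the embedding contents to verify their quality and understand the model's output.

Inspecting the embedding samples

Purpose of sampling: Inspect a subset of the entries in each embedding to understand the vector representations the model generates.
Printing embedding samples: Print the first 10 entries of each embedding vector with embedding[:10] to get a first look at the model's output.

Why sampling matters

Quality check: Sampling offers a quick way to verify the quality of the embeddings and confirm that they are meaningful and non-degenerate.
Understanding model output: Seeing part of each embedding vector gives an intuitive sense of the model's output, which helps with debugging and development.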
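
Beyond eyeballing the numbers, a few cheap programmatic checks can catch degenerate output. The sketch below flags vectors that contain non-finite values or are all zeros, and confirms that two embeddings differ. The `is_degenerate` helper and its criteria are assumptions chosen for illustration, not a standard definition. The sample values are the first five entries of each embedding from this tutorial's output.

```python
import math


def is_degenerate(vector):
    """Return True if the vector has non-finite entries or is entirely zero."""
    if any(not math.isfinite(x) for x in vector):
        return True
    return all(x == 0.0 for x in vector)


# First five entries of each embedding, copied from the tutorial output.
sample_1 = [0.04866192, -0.03687946, 0.02408808, 0.03534171, -0.12739632]
sample_2 = [-0.03879027, -0.02373698, 0.01314073, 0.03589077, -0.01641303]

print(is_degenerate(sample_1), is_degenerate(sample_2))
# Different sentences should not collapse to the same vector.
print(sample_1 != sample_2)
```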

[7]:

for i, embedding in enumerate(embeddings_test):
    print(f"The sample of the first 10 entries in embedding {i + 1} is: {embedding[:10]}")

The sample of the first 10 entries in embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
  0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the first 10 entries in embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
  0.08282158 -0.03173266  0.04507608  0.02777079]

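Native model loading in MLflow for extended functionality

Explore the full range of Sentence Transformer functionality through MLflow's support for native model loading.

Why support native loading?

Access to native functionality: Native loading unlocks all the features of the Sentence Transformer model, which is essential for advanced NLP tasks.
Loading the model natively: Use mlflow.sentence_transformers.load_model to load the model with its full capabilities, for greater flexibility and efficiency.

Generating embeddings with the native model

Model encoding: Generate embeddings with the model's native encode method, which benefits from its optimized functionality.
Importance of native encoding: Native encoding makes the model's full embedding-generation capabilities available, which suits large-scale or complex NLP applications.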

[8]:

# Load the saved model as a native Sentence Transformers model (unlike above, where we loaded as a generic python function)
loaded_model_native = mlflow.sentence_transformers.load_model(logged_model.model_uri)

# Use the native model to generate embeddings by calling encode() (unlike for the generic python function which uses the single entrypoint of `predict`)
native_embeddings = loaded_model_native.encode(inference_test)

for i, embedding in enumerate(native_embeddings):
    print(
        f"The sample of the native library encoding call for embedding {i + 1} is: {embedding[:10]}"
    )

2023/11/30 15:50:24 INFO mlflow.sentence_transformers: 'runs:/eeab3c1b13594fdea13e07585b1c0596/sbert_model' resolved as 'file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/quickstart/mlruns/469990615226680434/eeab3c1b13594fdea13e07585b1c0596/artifacts/sbert_model'

The sample of the native library encoding call for embedding 1 is: [ 0.04866192 -0.03687946  0.02408808  0.03534171 -0.12739632  0.00999414
  0.07135344 -0.01433522  0.04296691 -0.00654414]
The sample of the native library encoding call for embedding 2 is: [-0.03879027 -0.02373698  0.01314073  0.03589077 -0.01641303 -0.0857707
  0.08282158 -0.03173266  0.04507608  0.02777079]
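
The native output above matches the PyFunc output shown earlier. A comparison like this can be automated with a tolerance-based check, since floating-point results are not guaranteed to be bit-identical across environments. In the sketch below, `all_close` and the tolerance value are assumptions. The numbers are the first five entries of embedding 1 from both outputs in this tutorial.

```python
import math


def all_close(a, b, abs_tol=1e-6):
    """Return True if both sequences have equal length and element-wise near-equal values."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=abs_tol) for x, y in zip(a, b)
    )


# First five entries of embedding 1 from the PyFunc and native outputs.
pyfunc_sample = [0.04866192, -0.03687946, 0.02408808, 0.03534171, -0.12739632]
native_sample = [0.04866192, -0.03687946, 0.02408808, 0.03534171, -0.12739632]

print(all_close(pyfunc_sample, native_sample))
```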

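Conclusion: Embracing the power of Sentence Transformers with MLflow

As we reach the end of this introduction to Sentence Transformers, we have covered the basics of integrating the Sentence Transformers library with MLflow. This foundation sets the stage for more advanced and specialized applications in natural language processing (NLP).

Recap of key learnings

Integration basics: We covered the essential steps for loading and logging a Sentence Transformer model with MLflow. The process shows how simple and effective it is to integrate cutting-edge NLP tools within the MLflow ecosystem.
Signature and inference: By creating a model signature and running inference, we showed how to operationalize a Sentence Transformer model so that it is ready for real-world use.
Model loading and prediction: We explored two ways of loading the model: as a PyFunc model and through the native Sentence Transformers loading mechanism. This dual approach highlights MLflow's versatility in accommodating different ways of interacting with a model.
Embeddings exploration: By generating and examining sentence embeddings, we glimpsed the potential of transformer models to capture semantic information from text.

Looking ahead

Expanding horizons: This tutorial focused on the fundamentals of Sentence Transformers and MLflow, but many advanced applications remain to be explored. Potential use cases range from semantic similarity analysis to paraphrase mining.
Continued learning: We strongly encourage you to work through the other tutorials in this series, which cover use cases such as similarity analysis, semantic search, and paraphrase mining. They offer a broader understanding of Sentence Transformers and more practical applications across NLP tasks.

Final thoughts

The journey into NLP with Sentence Transformers and MLflow is just beginning. With the skills and insights gained from this tutorial, you are equipped to explore more complex and exciting applications. Combining advanced NLP models with MLflow's robust management and deployment capabilities opens new avenues for innovation and exploration in language understanding and beyond.

Thank you for joining us on this introductory journey. We look forward to seeing how you apply these tools and concepts in your NLP work!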