Introduction to Advanced Semantic Similarity Analysis with Sentence Transformers and MLflow

This tutorial walks through advanced semantic similarity analysis using Sentence Transformers and MLflow.

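Learning Objectives

Configure sentence-transformers for semantic similarity analysis.
Explore a custom PythonModel implementation in MLflow.
Log models and manage configurations with MLflow.
Deploy and apply the model for inference using MLflow's features.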

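Unveiling the Power of Sentence Transformers for NLP

Sentence Transformers are specialized adaptations of transformer models that excel at producing semantically rich sentence embeddings. They are well suited to semantic search and similarity analysis, and they bring a deeper level of semantic understanding to NLP tasks.

MLflow: Pioneering Flexible Model Management and Deployment

Integrating MLflow with Sentence Transformers adds experiment tracking and flexible model management, both of which matter for NLP projects. Learn how to implement a custom PythonModel in MLflow to extend functionality for unique requirements.

In this tutorial you will gain hands-on experience managing and deploying sophisticated NLP models with MLflow, building your skills in semantic similarity analysis and model lifecycle management.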

[1]:

# Disable tokenizers warnings when constructing pipelines
%env TOKENIZERS_PARALLELISM=false

import warnings

# Disable a few less-than-useful UserWarnings from setuptools and pydantic
warnings.filterwarnings("ignore", category=UserWarning)

env: TOKENIZERS_PARALLELISM=false

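Implementing a Custom SimilarityModel with MLflow

Learn how to create a custom SimilarityModel class using MLflow's PythonModel to assess the semantic similarity between sentences.

Overview of SimilarityModel

SimilarityModel is a tailored Python class built on MLflow's flexible PythonModel interface. It encapsulates the details of computing the semantic similarity between a pair of sentences using sentence embeddings.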

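Key Components of the Custom Model

Importing Libraries: The necessary libraries from MLflow, data handling, and Sentence Transformers are imported to support the model's functionality.
Custom PythonModel - SimilarityModel:
- The load_context method handles efficient and safe model loading, which matters for complex models such as Sentence Transformers.
- The predict method includes input type checking and error handling, and returns a cosine similarity score that reflects how semantically related the two sentences are.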
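
To make the term "cosine similarity score" concrete, here is a minimal standard-library sketch of the calculation. It is an illustration only: the tutorial itself relies on `util.cos_sim` from sentence-transformers, and the function name `cosine_similarity` and the three-dimensional vectors below are made-up stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length.")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector.")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up values, not real model output)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
```

The score depends only on the direction of the vectors, not their length: vectors pointing the same way score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1.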

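Significance of the Custom SimilarityModel

Flexibility and Customization: The model's design allows specialized handling of inputs and outputs, matching the needs of semantic similarity tasks.
Robust Error Handling: Detailed input type checking guards against common input mistakes and keeps the model's behavior predictable.
Efficient Model Loading: Initializing the model inside load_context sidesteps serialization issues, because the Sentence Transformer is loaded from its saved artifact rather than being pickled together with the Python object.
Targeted Functionality: The custom predict method computes similarity scores directly, giving task-specific output.

This custom SimilarityModel shows how adaptable MLflow's PythonModel is for building tailored NLP solutions, and the same pattern carries over to other machine learning projects.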
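
The serialization point can be illustrated without MLflow at all. The sketch below is a simplified analogy, not MLflow's actual save logic: a plain dict stands in for the model object's state, and a `threading.Lock` stands in for any live resource that is awkward or impossible to pickle. The helper `can_pickle` is a hypothetical name introduced for this example.

```python
import pickle
import threading


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


# State of a model object right after construction: nothing heavy attached yet.
state_before_load = {"model": None}

# State after a load step has attached a live resource. The lock is only a
# stand-in for something that cannot be pickled.
state_after_load = {"model": threading.Lock()}

before_ok = can_pickle(state_before_load)
after_ok = can_pickle(state_after_load)
```

Deferring the heavy initialization to a load step means the object that gets serialized stays small and picklable, while the real model is rebuilt from files on disk when it is needed.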

[2]:

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer, util

import mlflow
from mlflow.models.signature import infer_signature
from mlflow.pyfunc import PythonModel


class SimilarityModel(PythonModel):
    def load_context(self, context):
        """Load the model context for inference."""
        from sentence_transformers import SentenceTransformer

        try:
            self.model = SentenceTransformer.load(context.artifacts["model_path"])
        except Exception as e:
            raise ValueError(f"Error loading model: {e}")

    def predict(self, context, model_input, params=None):
        """Predict method for comparing similarity between two sentences."""
        from sentence_transformers import util

        if isinstance(model_input, pd.DataFrame):
            if model_input.shape[1] != 2:
                raise ValueError("DataFrame input must have exactly two columns.")
            sentence_1, sentence_2 = model_input.iloc[0, 0], model_input.iloc[0, 1]
        elif isinstance(model_input, dict):
            sentence_1 = model_input.get("sentence_1")
            sentence_2 = model_input.get("sentence_2")
            if sentence_1 is None or sentence_2 is None:
                raise ValueError(
                    "Both 'sentence_1' and 'sentence_2' must be provided in the input dictionary."
                )
        else:
            raise TypeError(
                f"Unexpected type for model_input: {type(model_input)}. Must be either a Dict or a DataFrame."
            )

        embedding_1 = self.model.encode(sentence_1)
        embedding_2 = self.model.encode(sentence_2)

        return np.array(util.cos_sim(embedding_1, embedding_2).tolist())

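Preparing the Sentence Transformer Model and Signature

Walk through the steps needed to set up a Sentence Transformer model for logging and deployment with MLflow.

Loading and Saving the Pre-trained Model

Model Initialization: The pre-trained Sentence Transformer model "all-MiniLM-L6-v2" is loaded, chosen for its efficiency in generating high-quality embeddings suitable for a range of NLP tasks.
Model Storage: The model is saved locally to /tmp/sbert_model so that MLflow can access it, which is a prerequisite for logging the model in the platform.

Preparing the Input Example and Artifacts

Input Example Creation: A DataFrame with sample sentences is prepared. It represents typical model input and helps define the model's input format.
Defining Artifacts: The saved model's file path is specified as an artifact in MLflow, which is how the saved model gets associated with the MLflow run.

Generating Test Output for the Signature

Test Output Calculation: The cosine similarity between the sentence embeddings is computed, providing a concrete example of the model's output.
Signature Inference: MLflow's infer_signature function generates a signature that captures the expected input and output formats.

Importance of These Steps

Model Readiness: These preparatory steps make the model ready for logging and deployment within the MLflow ecosystem.
Input-Output Contract: The signature serves as an explicit contract for the model's inputs and outputs, which helps keep behavior consistent across deployment scenarios.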
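
As a rough illustration of what "signature as contract" means, the sketch below checks an input row against an expected schema. This is not how MLflow enforces signatures internally; `EXPECTED_INPUTS` and `check_row` are hypothetical names, and the schema simply mirrors the two string columns of the input example.

```python
# Hypothetical schema mirroring the two string columns of the input example.
EXPECTED_INPUTS = {"sentence_1": str, "sentence_2": str}


def check_row(row, schema=EXPECTED_INPUTS):
    """Return a list of problems found in one input row (empty if it conforms)."""
    problems = []
    for name, expected_type in schema.items():
        if name not in row:
            problems.append(f"missing column: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"column {name} should be {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
    return problems


valid = check_row({"sentence_1": "I like apples", "sentence_2": "I like oranges"})
invalid = check_row({"sentence_1": "I like apples", "sentence_2": 42})
incomplete = check_row({"sentence_1": "I like apples"})
```

A conforming row produces an empty list, while a wrong type or a missing column is reported by name, which is the kind of early, explicit failure a declared schema is meant to provide.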

With the Sentence Transformer model and its signature prepared, we can now move on to integrating and managing it in MLflow.

[3]:

# Load a pre-trained sentence transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Create an input example DataFrame
input_example = pd.DataFrame([{"sentence_1": "I like apples", "sentence_2": "I like oranges"}])

# Save the model in the /tmp directory
model_directory = "/tmp/sbert_model"
model.save(model_directory)

# Define artifacts with the absolute path
artifacts = {"model_path": model_directory}

# Generate test output for signature
test_output = np.array(
    util.cos_sim(
        model.encode(input_example["sentence_1"][0]), model.encode(input_example["sentence_2"][0])
    ).tolist()
)

# Define the signature associated with the model
signature = infer_signature(input_example, test_output)

# Visualize the signature
signature

[3]:

inputs:
  ['sentence_1': string, 'sentence_2': string]
outputs:
  [Tensor('float64', (-1, 1))]
params:
  None

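Creating an Experiment

We create a new MLflow Experiment so that the run we log our model to is not recorded in the default experiment and instead has its own contextually relevant entry.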

[4]:

# If you are running this tutorial in local mode, leave the next line commented out.
# Otherwise, uncomment the following line and set your tracking uri to your local or remote tracking server.

# mlflow.set_tracking_uri("http://127.0.0.1:8080")

mlflow.set_experiment("Semantic Similarity")

[4]:

<Experiment: artifact_location='file:///Users/benjamin.wilson/repos/mlflow-fork/mlflow/docs/source/llms/sentence-transformers/tutorials/semantic-similarity/mlruns/577235153137414660', creation_time=1701280997564, experiment_id='577235153137414660', last_update_time=1701280997564, lifecycle_stage='active', name='Semantic Similarity', tags={}>

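Logging the Custom Model with MLflow

Learn how to log the custom SimilarityModel with MLflow for effective model management and deployment.

Creating a Path for the PyFunc Model

We define pyfunc_path as a temporary storage location for the Python model. Note that the log_model call in the cell that follows does not pass this path; the logged model is stored with the MLflow run.

Logging the Model in MLflow

Initiating the MLflow Run: An MLflow run is started, which groups all model logging steps under a single run.
Model Logging Details:
- The model is identified as "similarity", giving a clear reference for later retrieval and analysis.
- An instance of SimilarityModel is logged, encapsulating the Sentence Transformer model and the similarity prediction logic.
- An example DataFrame shows the expected model input format.
- The inferred signature, which describes the input and output schema, is included to reinforce correct use of the model.
- The artifacts dictionary specifies the location of the saved Sentence Transformer model, which is needed to reconstruct the model.
- Dependencies such as sentence_transformers and numpy are listed so that the model works across deployment environments.

Significance of Model Logging

Model Tracking and Versioning: Logging enables tracking and version control, which supports model lifecycle management.
Reproducibility and Deployment: The logged model, together with its dependencies, input example, and signature, is straightforward to reproduce and deploy consistently across environments.

With our SimilarityModel logged in MLflow, it is ready for uses such as comparative analysis, version management, and deployment for inference.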

[5]:

pyfunc_path = "/tmp/sbert_pyfunc"

with mlflow.start_run() as run:
    model_info = mlflow.pyfunc.log_model(
        "similarity",
        python_model=SimilarityModel(),
        input_example=input_example,
        signature=signature,
        artifacts=artifacts,
        pip_requirements=["sentence_transformers", "numpy"],
    )

2023/11/30 16:10:34 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false

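Model Inference and Testing the Similarity Prediction

Demonstrate how to use the SimilarityModel to compute the semantic similarity between sentences after it has been logged with MLflow.

Loading the Model for Inference

Loading with MLflow: mlflow.pyfunc.load_model is called with the model's URI to load the custom SimilarityModel for inference.
Model Readiness: The loaded model, named loaded_dynamic, carries the logic defined in SimilarityModel and is ready to compute similarities.

Preparing Data for Similarity Prediction

Creating Input Data: A DataFrame, similarity_data, is built with the sentence pair whose similarity will be computed, demonstrating one of the input formats the model accepts.

Computing and Displaying the Similarity Score

Predicting Similarity: The predict method is called on loaded_dynamic with similarity_data to compute the cosine similarity between the sentence embeddings.
Interpreting the Result: The resulting similarity_score is a numeric measure of semantic similarity.

Importance of This Test

Model Verification: Predicting on new input confirms that the custom model behaves as expected.
Practical Application: The test shows the model's use in a realistic scenario for semantic similarity analysis.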

[6]:

# Load our custom semantic similarity model implementation by providing the uri that the model was logged to
loaded_dynamic = mlflow.pyfunc.load_model(model_info.model_uri)

# Create an evaluation test DataFrame
similarity_data = pd.DataFrame([{"sentence_1": "I like apples", "sentence_2": "I like oranges"}])

# Verify that the model generates a reasonable prediction
similarity_score = loaded_dynamic.predict(similarity_data)

print(f"The similarity between these sentences is: {similarity_score}")

The similarity between these sentences is: [[0.63414472]]
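
The printed result is a nested 1x1 structure rather than a bare number. A small sketch of pulling out the scalar for display follows; it uses a plain nested list holding the value from the output above, on the assumption that the same `[0][0]` indexing applies to the array returned by predict.

```python
# Value copied from the tutorial's printed output: a 1x1 nested result.
similarity_score = [[0.63414472]]

# Pull out the single scalar for display or comparison.
score = float(similarity_score[0][0])

# Format to two decimal places for a more readable message.
formatted = f"{score:.2f}"
print(f"The similarity between these sentences is: {formatted}")
```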

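Evaluating Semantic Similarity with Distinct Text Pairs

Explore the model's ability to discern different degrees of semantic similarity using carefully chosen text pairs.

Selection of Text Pairs

Low Similarity Pair: Texts on unrelated topics are expected to yield a low similarity score, showing that the model recognizes contrasting semantic content.
High Similarity Pair: Texts with similar themes and tone are expected to yield a high similarity score, showing that the model detects semantic parallels.

The Role of the sBERT Model in Similarity Calculation

Semantic Understanding: sBERT encodes the meaning of each text into a vector.
Cosine Similarity: The cosine similarity between the vectors quantifies semantic closeness.

Computing and Displaying Similarity Scores

Predicting the Low Similarity Pair: Observe how the model scores semantically distant texts.
Predicting the High Similarity Pair: Assess how well the model detects semantic similarity in contextually related texts.

Why This Matters

Model Validation: These tests check that the model captures nuances of language and quantifies semantic relationships.
Practical Implications: Insight into how the model handles semantic content informs applications such as content recommendation, information retrieval, and text comparison.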
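
For the information retrieval use case, pairwise scores are typically turned into a ranking. The sketch below shows that step with the standard library only. Everything in it is illustrative: the vectors are made-up three-dimensional stand-ins for real embeddings, and the document labels and the helper `cosine` are hypothetical names chosen for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Made-up toy vectors; a real system would use model-generated embeddings.
query = [0.9, 0.1, 0.0]
documents = {
    "pyramids travel diary": [0.8, 0.2, 0.1],
    "software install guide": [0.0, 0.1, 0.9],
    "rainforest expedition": [0.5, 0.5, 0.2],
}

# Rank document labels from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine(query, documents[name]),
    reverse=True,
)
```

With these toy values the ranking is the travel diary first, the expedition second, and the install guide last, which mirrors how a retrieval system would surface the closest matches first.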

[7]:

low_similarity = {
    "sentence_1": "The explorer stood at the edge of the dense rainforest, "
    "contemplating the journey ahead. The untamed wilderness was "
    "a labyrinth of exotic plants and unknown dangers, a challenge "
    "for even the most seasoned adventurer, brimming with the "
    "prospect of new discoveries and uncharted territories.",
    "sentence_2": "To install the software, begin by downloading the latest "
    "version from the official website. Once downloaded, run the "
    "installer and follow the on-screen instructions. Ensure that "
    "your system meets the minimum requirements and agree to the "
    "license terms to complete the installation process successfully.",
}

high_similarity = {
    "sentence_1": "Standing in the shadow of the Great Pyramids of Giza, I felt a "
    "profound sense of awe. The towering structures, a testament to "
    "ancient ingenuity, rose majestically against the clear blue sky. "
    "As I walked around the base of the pyramids, the intricate "
    "stonework and sheer scale of these wonders of the ancient world "
    "left me speechless, enveloped in a deep sense of history.",
    "sentence_2": "My visit to the Great Pyramids of Giza was an unforgettable "
    "experience. Gazing upon these monumental structures, I was "
    "captivated by their grandeur and historical significance. Each "
    "step around these ancient marvels filled me with a deep "
    "appreciation for the architectural prowess of a civilization long "
    "gone, yet still speaking through these timeless monuments.",
}

# Validate that semantically unrelated texts return a low similarity score
low_similarity_score = loaded_dynamic.predict(low_similarity)

print(f"The similarity score for the 'low_similarity' pair is: {low_similarity_score}")

# Validate that semantically similar texts return a high similarity score
high_similarity_score = loaded_dynamic.predict(high_similarity)

print(f"The similarity score for the 'high_similarity' pair is: {high_similarity_score}")

The similarity score for the 'low_similarity' pair is: [[-0.00052751]]
The similarity score for the 'high_similarity' pair is: [[0.83703309]]

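Conclusion: Harnessing the Power of Custom MLflow Python Functions in NLP

As this tutorial ends, let's recap what we covered in understanding and applying advanced NLP techniques with Sentence Transformers and MLflow.

Key Takeaways from the Tutorial

Versatile NLP Modeling: We explored how to use Sentence Transformers for semantic similarity analysis, a key task in many NLP applications.
Custom MLflow Python Functions: Implementing the custom SimilarityModel in MLflow showed how Python functions can extend and adapt a pre-trained model to specific project needs.
Model Management and Deployment: We walked through logging, managing, and deploying these models with MLflow, and saw how MLflow simplifies these parts of the machine learning lifecycle.
Practical Semantic Analysis: Through hands-on examples, we showed that the model can distinguish different degrees of semantic similarity between sentence pairs.

The Power and Flexibility of MLflow's Python Functions

Customization for Specific Needs: One highlight of this tutorial is customizing MLflow's PythonModel. This customization is necessary when an NLP task goes beyond what the standard model functionality provides.
Adaptability and Extensibility: The PythonModel framework in MLflow provides a solid foundation for implementing a variety of NLP models. It lets you extend a base model's functionality, for example turning a sentence embedding model into a semantic similarity comparison tool.

Empowering Advanced NLP Applications

Ease of Modification: The tutorial showed that modifying a PythonModel implementation for different flavors in MLflow is relatively easy, so you can build models that fit your project's requirements.
Wide Applicability: Whether for semantic search, content recommendation, or automated text comparison, the approach outlined here adapts to a broad range of NLP tasks.

Moving Forward

With the knowledge and skills from this tutorial, you are equipped to apply these NLP techniques in your own projects. Combining Sentence Transformers with MLflow's model management and deployment capabilities supports building sophisticated and efficient NLP solutions.

Thank you for joining us on this journey through advanced NLP modeling with Sentence Transformers and MLflow. We hope this tutorial inspires you to explore further and innovate in your NLP work!

Happy Modeling!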