Approximate nearest neighbors in TSNE

This example shows how to chain KNeighborsTransformer and TSNE in a pipeline. It also shows how to wrap the packages nmslib and pynndescent to replace KNeighborsTransformer and perform approximate nearest neighbors search. These packages can be installed with pip install nmslib pynndescent.

Note: in KNeighborsTransformer we use the definition which includes each training point as its own neighbor in the count of n_neighbors, and for compatibility reasons, one extra neighbor is computed when mode == 'distance'. Please note that we do the same in the proposed nmslib wrapper.
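
The short sketch below (not part of the original example) illustrates this convention with plain scikit-learn: with mode="distance", the transformer stores n_neighbors + 1 values per row, including an explicit 0.0 distance of each sample to itself.

import numpy as np
from sklearn.neighbors import KNeighborsTransformer

# Tiny random dataset, used only to inspect the neighbor-counting convention.
rng = np.random.RandomState(0)
X_demo = rng.rand(6, 3)

graph = KNeighborsTransformer(n_neighbors=2, mode="distance").fit_transform(X_demo)

# One extra neighbor is computed, so each row stores n_neighbors + 1 = 3 values.
print(np.diff(graph.indptr))  # [3 3 3 3 3 3]
print(graph[0].data)          # stored distances include an explicit 0.0 for the sample itself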

# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

First we try to import the packages and warn the user in case they are missing.

import sys

try:
    import nmslib
except ImportError:
    print("The package 'nmslib' is required to run this example.")
    sys.exit()

try:
    from pynndescent import PyNNDescentTransformer
except ImportError:
    print("The package 'pynndescent' is required to run this example.")
    sys.exit()

We define a wrapper class to implement the scikit-learn API for nmslib, as well as a loading function.

import joblib
import numpy as np
from scipy.sparse import csr_matrix

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle


class NMSlibTransformer(TransformerMixin, BaseEstimator):
    """用于将nmslib作为sklearn的KNeighborsTransformer的包装器"""

    def __init__(self, n_neighbors=5, metric="euclidean", method="sw-graph", n_jobs=-1):
        self.n_neighbors = n_neighbors
        self.method = method
        self.metric = metric
        self.n_jobs = n_jobs

    def fit(self, X):
        self.n_samples_fit_ = X.shape[0]

        # see more metrics in the manual
        # https://github.com/nmslib/nmslib/tree/master/manual
        space = {
            "euclidean": "l2",
            "cosine": "cosinesimil",
            "l1": "l1",
            "l2": "l2",
        }[self.metric]

        self.nmslib_ = nmslib.init(method=self.method, space=space)
        self.nmslib_.addDataPointBatch(X.copy())
        self.nmslib_.createIndex()
        return self

    def transform(self, X):
        n_samples_transform = X.shape[0]

        # For compatibility reasons, as each sample is considered as its own
        # neighbor, one extra neighbor will be computed.
        n_neighbors = self.n_neighbors + 1

        if self.n_jobs < 0:
            # Same handling as done in joblib for negative values of n_jobs:
            # in particular, `n_jobs == -1` means "as many threads as CPUs".
            num_threads = joblib.cpu_count() + self.n_jobs + 1
        else:
            num_threads = self.n_jobs

        results = self.nmslib_.knnQueryBatch(
            X.copy(), k=n_neighbors, num_threads=num_threads
        )
        indices, distances = zip(*results)
        indices, distances = np.vstack(indices), np.vstack(distances)

        indptr = np.arange(0, n_samples_transform * n_neighbors + 1, n_neighbors)
        kneighbors_graph = csr_matrix(
            (distances.ravel(), indices.ravel(), indptr),
            shape=(n_samples_transform, self.n_samples_fit_),
        )

        return kneighbors_graph


def load_mnist(n_samples):
    """加载MNIST,打乱数据,只返回n_samples。"""
    mnist = fetch_openml("mnist_784", as_frame=False)
    X, y = shuffle(mnist.data, mnist.target, random_state=2)
    return X[:n_samples] / 255, y[:n_samples]
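
As a quick smoke test of the wrapper (hypothetical, not part of the original example), we can fit it on a small random array and check the shape and sparsity structure of the returned graph, which by construction stores n_neighbors + 1 values per row:

# Hypothetical smoke test of NMSlibTransformer on a tiny random dataset.
rng = np.random.RandomState(42)
X_small = rng.rand(100, 10)

ann = NMSlibTransformer(n_neighbors=3, metric="euclidean")
graph = ann.fit(X_small).transform(X_small)

# The graph has one row per query sample and, by construction,
# n_neighbors + 1 stored values per row (each sample counts as its own neighbor).
print(graph.shape)            # (100, 100)
print(np.diff(graph.indptr))  # all entries equal to 4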

We benchmark the different exact/approximate nearest neighbors transformers.

import time

from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsTransformer
from sklearn.pipeline import make_pipeline

datasets = [
    ("MNIST_10000", load_mnist(n_samples=10_000)),
    ("MNIST_20000", load_mnist(n_samples=20_000)),
]

n_iter = 500
perplexity = 30
metric = "euclidean"
# TSNE requires a certain number of neighbors, which depends on the perplexity
# parameter. Add one since we include each sample as its own neighbor.
n_neighbors = int(3.0 * perplexity + 1) + 1

tsne_params = dict(
    init="random",  # pca not supported for sparse matrices
    perplexity=perplexity,
    method="barnes_hut",
    random_state=42,
    n_iter=n_iter,
    learning_rate="auto",
)

transformers = [
    (
        "KNeighborsTransformer",
        KNeighborsTransformer(n_neighbors=n_neighbors, mode="distance", metric=metric),
    ),
    (
        "NMSlibTransformer",
        NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),
    ),
    (
        "PyNNDescentTransformer",
        PyNNDescentTransformer(
            n_neighbors=n_neighbors, metric=metric, parallel_batch_queries=True
        ),
    ),
]

for dataset_name, (X, y) in datasets:
    msg = f"Benchmarking on {dataset_name}:"
    print(f"\n{msg}\n" + str("-" * len(msg)))

    for transformer_name, transformer in transformers:
        longest = np.max([len(name) for name, model in transformers])
        start = time.time()
        transformer.fit(X)
        fit_duration = time.time() - start
        print(f"{transformer_name:<{longest}} {fit_duration:.3f} sec (fit)")
        start = time.time()
        Xt = transformer.transform(X)
        transform_duration = time.time() - start
        print(f"{transformer_name:<{longest}} {transform_duration:.3f} sec (transform)")
        if transformer_name == "PyNNDescentTransformer":
            start = time.time()
            Xt = transformer.transform(X)
            transform_duration = time.time() - start
            print(
                f"{transformer_name:<{longest}} {transform_duration:.3f} sec"
                " (transform)"
            )

Sample output:

Benchmarking on MNIST_10000:
----------------------------
KNeighborsTransformer  0.007 sec (fit)
KNeighborsTransformer  1.139 sec (transform)
NMSlibTransformer      0.208 sec (fit)
NMSlibTransformer      0.315 sec (transform)
PyNNDescentTransformer 4.823 sec (fit)
PyNNDescentTransformer 4.884 sec (transform)
PyNNDescentTransformer 0.744 sec (transform)

Benchmarking on MNIST_20000:
----------------------------
KNeighborsTransformer  0.011 sec (fit)
KNeighborsTransformer  5.769 sec (transform)
NMSlibTransformer      0.733 sec (fit)
NMSlibTransformer      1.077 sec (transform)
PyNNDescentTransformer 14.448 sec (fit)
PyNNDescentTransformer 7.103 sec (transform)
PyNNDescentTransformer 1.759 sec (transform)

Note that PyNNDescentTransformer takes more time during the first fit and the first transform due to the overhead of the numba just-in-time compiler. After the first call, the compiled Python code is kept in a cache by numba and subsequent calls do not suffer from this initial overhead. Both KNeighborsTransformer and NMSlibTransformer are only run once here, as they show more stable fit and transform times (they do not have the cold-start problem of PyNNDescentTransformer).
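
If one wanted to exclude this compilation overhead from the timings altogether, a possible approach (not done in this example) is to warm up the transformer on a small subset before benchmarking, since numba keeps the compiled functions cached for the rest of the process:

# Hypothetical warm-up run to trigger numba JIT compilation ahead of the benchmark.
X_warmup, _ = load_mnist(n_samples=1_000)
warmup = PyNNDescentTransformer(n_neighbors=n_neighbors, metric=metric)
warmup.fit(X_warmup).transform(X_warmup)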

import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter

transformers = [
    ("TSNE with internal NearestNeighbors", TSNE(metric=metric, **tsne_params)),
    (
        "TSNE with KNeighborsTransformer",
        make_pipeline(
            KNeighborsTransformer(
                n_neighbors=n_neighbors, mode="distance", metric=metric
            ),
            TSNE(metric="precomputed", **tsne_params),
        ),
    ),
    (
        "TSNE with NMSlibTransformer",
        make_pipeline(
            NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),
            TSNE(metric="precomputed", **tsne_params),
        ),
    ),
]

# init the plot
nrows = len(datasets)
ncols = np.sum([1 for name, model in transformers if "TSNE" in name])
fig, axes = plt.subplots(
    nrows=nrows, ncols=ncols, squeeze=False, figsize=(5 * ncols, 4 * nrows)
)
axes = axes.ravel()
i_ax = 0

for dataset_name, (X, y) in datasets:
    msg = f"Benchmarking on {dataset_name}:"
    print(f"\n{msg}\n" + str("-" * len(msg)))

    for transformer_name, transformer in transformers:
        longest = np.max([len(name) for name, model in transformers])
        start = time.time()
        Xt = transformer.fit_transform(X)
        transform_duration = time.time() - start
        print(
            f"{transformer_name:<{longest}} {transform_duration:.3f} sec"
            " (fit_transform)"
        )

        # plot TSNE embedding, which should be very similar across methods
        axes[i_ax].set_title(transformer_name + "\non " + dataset_name)
        axes[i_ax].scatter(
            Xt[:, 0],
            Xt[:, 1],
            c=y.astype(np.int32),
            alpha=0.2,
            cmap=plt.cm.viridis,
        )
        axes[i_ax].xaxis.set_major_formatter(NullFormatter())
        axes[i_ax].yaxis.set_major_formatter(NullFormatter())
        axes[i_ax].axis("tight")
        i_ax += 1

fig.tight_layout()
plt.show()

Sample output:

Benchmarking on MNIST_10000:
----------------------------
TSNE with internal NearestNeighbors 24.828 sec (fit_transform)
TSNE with KNeighborsTransformer     20.111 sec (fit_transform)
TSNE with NMSlibTransformer         21.757 sec (fit_transform)

Benchmarking on MNIST_20000:
----------------------------
TSNE with internal NearestNeighbors 51.955 sec (fit_transform)
TSNE with KNeighborsTransformer     50.994 sec (fit_transform)
TSNE with NMSlibTransformer         43.536 sec (fit_transform)

We can observe that the default TSNE estimator with its internal NearestNeighbors implementation is roughly equivalent, in terms of performance, to the pipeline composed of KNeighborsTransformer and TSNE. This is expected because both pipelines rely internally on the same NearestNeighbors implementation that performs an exact neighbors search. The approximate NMSlibTransformer is already slightly faster than the exact search on the smallest dataset, and this speed difference is expected to become more significant on datasets with a larger number of samples.

Notice however that not all approximate search methods are guaranteed to improve on the speed of the default exact search method: indeed, the exact search implementation has improved significantly since scikit-learn 1.1. Furthermore, the brute-force exact search method does not require building an index at fit time. So, to get an overall performance improvement in the context of the TSNE pipeline, the gains of the approximate search at transform time need to be larger than the extra time spent building the approximate search index at fit time.
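
To make this trade-off explicit, a small sketch (reusing the transformers and the load_mnist helper defined above, not part of the original benchmark) can sum the fit and transform durations and compare the totals:

# Sketch: compare total (fit + transform) time of the exact and approximate transformers.
X_check, _ = load_mnist(n_samples=10_000)
candidates = [
    (
        "KNeighborsTransformer (exact)",
        KNeighborsTransformer(n_neighbors=n_neighbors, mode="distance", metric=metric),
    ),
    (
        "NMSlibTransformer (approximate)",
        NMSlibTransformer(n_neighbors=n_neighbors, metric=metric),
    ),
]
for name, model in candidates:
    start = time.time()
    model.fit(X_check)
    fit_duration = time.time() - start
    start = time.time()
    model.transform(X_check)
    transform_duration = time.time() - start
    print(
        f"{name}: {fit_duration:.3f} + {transform_duration:.3f}"
        f" = {fit_duration + transform_duration:.3f} sec (fit + transform)"
    )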

Finally, the TSNE algorithm itself is also computationally intensive, irrespective of the nearest neighbors search. So speeding up the nearest neighbors search step by a factor of 5 would not result in a factor-of-5 speed-up of the overall pipeline.
