.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/neighbors/approximate_nearest_neighbors.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_neighbors_approximate_nearest_neighbors.py: ===================================== TSNE中的近似最近邻 ===================================== 本示例展示了如何在管道中链接KNeighborsTransformer和TSNE。 它还展示了如何包装 `nmslib` 和 `pynndescent` 包以替换KNeighborsTransformer并执行近似最近邻。这些包可以通过 `pip install nmslib pynndescent` 进行安装。 注意:在KNeighborsTransformer中,我们使用的定义包括每个训练点作为其自身的邻居在 `n_neighbors` 的计数中,并且为了兼容性原因,当 `mode == 'distance'` 时会计算一个额外的邻居。 请注意,我们在提议的 `nmslib` 包装中也做了同样的处理。 .. GENERATED FROM PYTHON SOURCE LINES 12-16 .. code-block:: Python # 作者:scikit-learn 开发者 # SPDX-License-Identifier: BSD-3-Clause .. GENERATED FROM PYTHON SOURCE LINES 17-18 首先我们尝试导入这些包,并在缺失时警告用户。 .. GENERATED FROM PYTHON SOURCE LINES 18-33 .. code-block:: Python import sys try: import nmslib except ImportError: print("The package 'nmslib' is required to run this example.") sys.exit() try: from pynndescent import PyNNDescentTransformer except ImportError: print("The package 'pynndescent' is required to run this example.") sys.exit() .. GENERATED FROM PYTHON SOURCE LINES 34-35 我们定义了一个包装类,用于实现 `nmslib` 的 scikit-learn API,以及一个加载函数。 .. GENERATED FROM PYTHON SOURCE LINES 35-106 .. code-block:: Python import joblib import numpy as np from scipy.sparse import csr_matrix from sklearn.base import BaseEstimator, TransformerMixin from sklearn.datasets import fetch_openml from sklearn.utils import shuffle class NMSlibTransformer(TransformerMixin, BaseEstimator): """用于将nmslib作为sklearn的KNeighborsTransformer的包装器""" def __init__(self, n_neighbors=5, metric="euclidean", method="sw-graph", n_jobs=-1): self.n_neighbors = n_neighbors self.method = method self.metric = metric self.n_jobs = n_jobs def fit(self, X): self.n_samples_fit_ = X.shape[0] # 请参阅手册中的更多指标 # https://github.com/nmslib/nmslib/tree/master/manual space = { "euclidean": "l2", "cosine": "cosinesimil", "l1": "l1", "l2": "l2", }[self.metric] self.nmslib_ = nmslib.init(method=self.method, space=space) self.nmslib_.addDataPointBatch(X.copy()) self.nmslib_.createIndex() return self def transform(self, X): n_samples_transform = X.shape[0] # 出于兼容性原因,由于每个样本都被视为其自身的邻居,因此将多计算一个额外的邻居。 n_neighbors = self.n_neighbors + 1 if self.n_jobs < 0: # 与 joblib 对负 n_jobs 值的处理相同: # 特别地, `n_jobs == -1` 表示“线程数与 CPU 数相同”。 num_threads = joblib.cpu_count() + self.n_jobs + 1 else: num_threads = self.n_jobs results = self.nmslib_.knnQueryBatch( X.copy(), k=n_neighbors, num_threads=num_threads ) indices, distances = zip(*results) indices, distances = np.vstack(indices), np.vstack(distances) indptr = np.arange(0, n_samples_transform * n_neighbors + 1, n_neighbors) kneighbors_graph = csr_matrix( (distances.ravel(), indices.ravel(), indptr), shape=(n_samples_transform, self.n_samples_fit_), ) return kneighbors_graph def load_mnist(n_samples): """加载MNIST,打乱数据,只返回n_samples。""" mnist = fetch_openml("mnist_784", as_frame=False) X, y = shuffle(mnist.data, mnist.target, random_state=2) return X[:n_samples] / 255, y[:n_samples] .. GENERATED FROM PYTHON SOURCE LINES 107-108 我们对不同的精确/近似最近邻变换器进行基准测试。 .. GENERATED FROM PYTHON SOURCE LINES 108-176 .. code-block:: Python import time from sklearn.manifold import TSNE from sklearn.neighbors import KNeighborsTransformer from sklearn.pipeline import make_pipeline datasets = [ ("MNIST_10000", load_mnist(n_samples=10_000)), ("MNIST_20000", load_mnist(n_samples=20_000)), ] n_iter = 500 perplexity = 30 metric = "euclidean" # TSNE 需要一定数量的邻居,这取决于困惑度参数。 # 由于我们将每个样本都包含为其自身的邻居,因此需要增加一个。 n_neighbors = int(3.0 * perplexity + 1) + 1 tsne_params = dict( init="random", # pca not supported for sparse matrices perplexity=perplexity, method="barnes_hut", random_state=42, n_iter=n_iter, learning_rate="auto", ) transformers = [ ( "KNeighborsTransformer", KNeighborsTransformer(n_neighbors=n_neighbors, mode="distance", metric=metric), ), ( "NMSlibTransformer", NMSlibTransformer(n_neighbors=n_neighbors, metric=metric), ), ( "PyNNDescentTransformer", PyNNDescentTransformer( n_neighbors=n_neighbors, metric=metric, parallel_batch_queries=True ), ), ] for dataset_name, (X, y) in datasets: msg = f"Benchmarking on {dataset_name}:" print(f"\n{msg}\n" + str("-" * len(msg))) for transformer_name, transformer in transformers: longest = np.max([len(name) for name, model in transformers]) start = time.time() transformer.fit(X) fit_duration = time.time() - start print(f"{transformer_name:<{longest}} {fit_duration:.3f} sec (fit)") start = time.time() Xt = transformer.transform(X) transform_duration = time.time() - start print(f"{transformer_name:<{longest}} {transform_duration:.3f} sec (transform)") if transformer_name == "PyNNDescentTransformer": start = time.time() Xt = transformer.transform(X) transform_duration = time.time() - start print( f"{transformer_name:<{longest}} {transform_duration:.3f} sec" " (transform)" ) .. GENERATED FROM PYTHON SOURCE LINES 177-200 示例输出: 在MNIST_10000上的基准测试: ---------------------------- KNeighborsTransformer 0.007 秒(拟合) KNeighborsTransformer 1.139 秒(转换) NMSlibTransformer 0.208 秒(拟合) NMSlibTransformer 0.315 秒(转换) PyNNDescentTransformer 4.823 秒(拟合) PyNNDescentTransformer 4.884 秒(转换) PyNNDescentTransformer 0.744 秒(转换) 在MNIST_20000上的基准测试: ---------------------------- KNeighborsTransformer 0.011 秒(拟合) KNeighborsTransformer 5.769 秒(转换) NMSlibTransformer 0.733 秒(拟合) NMSlibTransformer 1.077 秒(转换) PyNNDescentTransformer 14.448 秒(拟合) PyNNDescentTransformer 7.103 秒(转换) PyNNDescentTransformer 1.759 秒(转换) 请注意, `PyNNDescentTransformer` 在第一次 `fit` 和第一次 `transform` 时会花费更多时间,这是由于 numba 即时编译器的开销。但在第一次调用之后,编译后的 Python 代码会被 numba 缓存在缓存中,后续调用不会再有这种初始开销。:class:`~sklearn.neighbors.KNeighborsTransformer` 和 `NMSlibTransformer` 在这里只运行一次,因为它们的 `fit` 和 `transform` 时间更稳定(它们没有 PyNNDescentTransformer 的冷启动问题)。 .. GENERATED FROM PYTHON SOURCE LINES 202-265 .. code-block:: Python import matplotlib.pyplot as plt from matplotlib.ticker import NullFormatter transformers = [ ("TSNE with internal NearestNeighbors", TSNE(metric=metric, **tsne_params)), ( "TSNE with KNeighborsTransformer", make_pipeline( KNeighborsTransformer( n_neighbors=n_neighbors, mode="distance", metric=metric ), TSNE(metric="precomputed", **tsne_params), ), ), ( "TSNE with NMSlibTransformer", make_pipeline( NMSlibTransformer(n_neighbors=n_neighbors, metric=metric), TSNE(metric="precomputed", **tsne_params), ), ), ] # 初始化图表 nrows = len(datasets) ncols = np.sum([1 for name, model in transformers if "TSNE" in name]) fig, axes = plt.subplots( nrows=nrows, ncols=ncols, squeeze=False, figsize=(5 * ncols, 4 * nrows) ) axes = axes.ravel() i_ax = 0 for dataset_name, (X, y) in datasets: msg = f"Benchmarking on {dataset_name}:" print(f"\n{msg}\n" + str("-" * len(msg))) for transformer_name, transformer in transformers: longest = np.max([len(name) for name, model in transformers]) start = time.time() Xt = transformer.fit_transform(X) transform_duration = time.time() - start print( f"{transformer_name:<{longest}} {transform_duration:.3f} sec" " (fit_transform)" ) # 绘制 TSNE 嵌入图,该图在不同方法之间应该非常相似 axes[i_ax].set_title(transformer_name + "\non " + dataset_name) axes[i_ax].scatter( Xt[:, 0], Xt[:, 1], c=y.astype(np.int32), alpha=0.2, cmap=plt.cm.viridis, ) axes[i_ax].xaxis.set_major_formatter(NullFormatter()) axes[i_ax].yaxis.set_major_formatter(NullFormatter()) axes[i_ax].axis("tight") i_ax += 1 fig.tight_layout() plt.show() .. GENERATED FROM PYTHON SOURCE LINES 266-285 示例输出: 在MNIST_10000上的基准测试: ---------------------------- 使用内部NearestNeighbors的TSNE 24.828秒(fit_transform) 使用KNeighborsTransformer的TSNE 20.111秒(fit_transform) 使用NMSlibTransformer的TSNE 21.757秒(fit_transform) 在MNIST_20000上的基准测试: ---------------------------- 使用内部NearestNeighbors的TSNE 51.955秒(fit_transform) 使用KNeighborsTransformer的TSNE 50.994秒(fit_transform) 使用NMSlibTransformer的TSNE 43.536秒(fit_transform) 我们可以观察到,默认的 :class:`~sklearn.manifold.TSNE` 估计器及其内部的 :class:`~sklearn.neighbors.NearestNeighbors` 实现,在性能方面大致等同于包含 :class:`~sklearn.manifold.TSNE` 和 :class:`~sklearn.neighbors.KNeighborsTransformer` 的管道。这是预期的,因为这两个管道在内部都依赖于相同的 :class:`~sklearn.neighbors.NearestNeighbors` 实现来执行精确的邻居搜索。近似的 `NMSlibTransformer` 在最小的数据集上已经比精确搜索稍快,但这种速度差异预计在样本数量较大的数据集上会更加显著。 请注意,并非所有的近似搜索方法都能保证比默认的精确搜索方法提高速度:实际上,自从scikit-learn 1.1以来,精确搜索的实现已经显著改进。此外,蛮力精确搜索方法在 `fit` 时不需要构建索引。因此,要在:class:`~sklearn.manifold.TSNE` 管道的上下文中获得整体性能提升,近似搜索在 `transform` 时的收益需要大于在 `fit` 时构建近似搜索索引所花费的额外时间。 最终,TSNE算法本身也是计算密集型的,与最近邻搜索无关。因此,将最近邻搜索步骤加速5倍,并不会使整个流程加速5倍。 .. _sphx_glr_download_auto_examples_neighbors_approximate_nearest_neighbors.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/neighbors/approximate_nearest_neighbors.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: approximate_nearest_neighbors.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: approximate_nearest_neighbors.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: approximate_nearest_neighbors.zip ` .. include:: approximate_nearest_neighbors.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_