.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/release_highlights/plot_release_highlights_1_5_0.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code or to run this example in your
        browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_release_highlights_plot_release_highlights_1_5_0.py:

=======================================
Release Highlights for scikit-learn 1.5
=======================================

.. currentmodule:: sklearn

We are pleased to announce the release of scikit-learn 1.5! This release
includes many bug fixes and improvements, as well as some key new features.
The highlights are detailed below. **For an exhaustive list of all the
changes**, please refer to the :ref:`release notes `.

To install the latest version (with pip)::

    pip install --upgrade scikit-learn

or with conda::

    conda install -c conda-forge scikit-learn

.. GENERATED FROM PYTHON SOURCE LINES 22-25

FixedThresholdClassifier: Setting the decision threshold of a binary classifier
-------------------------------------------------------------------------------------

All binary classifiers in scikit-learn use a fixed decision threshold of 0.5
to convert probability estimates (i.e. the output of `predict_proba`) into
class predictions. However, 0.5 is almost never the ideal threshold for a
given problem. :class:`~model_selection.FixedThresholdClassifier` allows
wrapping any binary classifier and setting a custom decision threshold.

.. GENERATED FROM PYTHON SOURCE LINES 25-38

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import ConfusionMatrixDisplay

    # Imbalanced binary problem: roughly 10% of the samples are positive.
    X, y = make_classification(n_samples=10_000, weights=[0.9, 0.1], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    # A very large C makes the regularization negligible.
    classifier_05 = LogisticRegression(C=1e6, random_state=0).fit(X_train, y_train)
    _ = ConfusionMatrixDisplay.from_estimator(classifier_05, X_test, y_test)

.. image-sg:: /auto_examples/release_highlights/images/sphx_glr_plot_release_highlights_1_5_0_001.png
   :alt: plot release highlights 1 5 0
   :srcset: /auto_examples/release_highlights/images/sphx_glr_plot_release_highlights_1_5_0_001.png
   :class: sphx-glr-single-img

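
To make the role of the threshold concrete, the following sketch converts
positive-class probabilities into class labels by hand. It is only an
illustration: the array `proba_positive` is hypothetical data invented for
this example, and the helper `predict_with_threshold` is an assumed name, not
part of scikit-learn.

.. code-block:: Python

    import numpy as np


    def predict_with_threshold(proba_positive, threshold=0.5):
        """Label a sample positive (1) when its probability reaches the threshold."""
        return (np.asarray(proba_positive) >= threshold).astype(int)


    # Hypothetical positive-class probabilities for six samples, standing in
    # for the second column of a `predict_proba` output.
    proba_positive = np.array([0.03, 0.08, 0.12, 0.45, 0.55, 0.97])

    print(predict_with_threshold(proba_positive, threshold=0.5))  # [0 0 0 0 1 1]
    print(predict_with_threshold(proba_positive, threshold=0.1))  # [0 0 1 1 1 1]

With the lower threshold, two additional samples are labelled positive;
whether that helps depends on how many of them are true positives.
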
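.. GENERATED FROM PYTHON SOURCE LINES 39-40

Lowering the threshold, i.e. allowing more samples to be classified as the
positive class, increases the number of true positives at the cost of more
false positives (as is well known from the concavity of the ROC curve).

.. GENERATED FROM PYTHON SOURCE LINES 40-47

.. code-block:: Python

    from sklearn.model_selection import FixedThresholdClassifier

    # Wrap the fitted classifier and predict positive from a probability of 0.1.
    classifier_01 = FixedThresholdClassifier(classifier_05, threshold=0.1)
    classifier_01.fit(X_train, y_train)
    _ = ConfusionMatrixDisplay.from_estimator(classifier_01, X_test, y_test)

.. image-sg:: /auto_examples/release_highlights/images/sphx_glr_plot_release_highlights_1_5_0_002.png
   :alt: plot release highlights 1 5 0
   :srcset: /auto_examples/release_highlights/images/sphx_glr_plot_release_highlights_1_5_0_002.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 48-62

TunedThresholdClassifierCV: Tuning the decision threshold of a binary classifier
-------------------------------------------------------------------------------------

The decision threshold of a binary classifier can be tuned to optimize a
given metric, using :class:`~model_selection.TunedThresholdClassifierCV`.

This is particularly useful for finding the best decision threshold when the
model is meant to be deployed in a specific application context where we can
assign different gains or costs to true positives, true negatives, false
positives, and false negatives.

Let's illustrate this by considering an arbitrary case where:

- each true positive gains 1 unit of profit, e.g. euro, year of life in good
  health, etc.;
- true negatives gain or cost nothing;
- each false negative costs 2 units;
- each false positive costs 0.1 units.

Our metric quantifies the average profit per sample, and is defined by the
following Python function:

.. GENERATED FROM PYTHON SOURCE LINES 62-73

.. code-block:: Python

    from sklearn.metrics import confusion_matrix


    def custom_score(y_observed, y_pred):
        # normalize="all" turns the counts into fractions of all samples,
        # so the result is an average profit per sample.
        tn, fp, fn, tp = confusion_matrix(y_observed, y_pred, normalize="all").ravel()
        return tp - 2 * fn - 0.1 * fp


    print("Untuned decision threshold: 0.5")
    print(f"Custom score: {custom_score(y_test, classifier_05.predict(X_test)):.2f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Untuned decision threshold: 0.5
    Custom score: -0.12

.. GENERATED FROM PYTHON SOURCE LINES 74-77

It is interesting to observe that the average gain per prediction is
negative, which means that this decision system is making a loss on average.

Tuning the threshold to optimize this custom metric gives a smaller
threshold that allows more samples to be classified as the positive class.
As a result, the average gain per prediction improves.

.. GENERATED FROM PYTHON SOURCE LINES 77-90
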
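
The idea behind such tuning can be sketched without scikit-learn: score every
candidate threshold with the profit metric described above and keep the best
one. The labels and probabilities below are hypothetical, as are the helper
names `average_profit` and `best_threshold`. This brute-force search on a
single data set is only an illustration; it is not how
:class:`~model_selection.TunedThresholdClassifierCV` is implemented, which
among other things relies on cross-validation.

.. code-block:: Python

    import numpy as np


    def average_profit(y_true, y_pred):
        """Average profit per sample: +1 per TP, -2 per FN, -0.1 per FP."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        return (tp - 2 * fn - 0.1 * fp) / y_true.size


    def best_threshold(y_true, proba_positive, thresholds):
        """Return the candidate threshold with the highest average profit."""
        scores = [
            average_profit(y_true, (proba_positive >= t).astype(int))
            for t in thresholds
        ]
        best = int(np.argmax(scores))
        return thresholds[best], scores[best]


    # Hypothetical labels and positive-class probabilities for ten samples.
    y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
    proba_positive = np.array(
        [0.02, 0.04, 0.07, 0.11, 0.22, 0.31, 0.60, 0.17, 0.42, 0.90]
    )

    profit_at_05 = average_profit(y_true, (proba_positive >= 0.5).astype(int))
    thresholds = np.linspace(0.05, 0.95, 19)  # candidates 0.05, 0.10, ..., 0.95
    threshold, score = best_threshold(y_true, proba_positive, thresholds)

    print(f"Profit at threshold 0.5: {profit_at_05:.2f}")  # -0.31
    print(f"Best threshold: {threshold:.2f}, profit: {score:.2f}")  # 0.15, 0.27

On this toy data the search also settles on a threshold well below 0.5,
because missing a positive sample is far more expensive than raising a false
alarm.
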
.. code-block:: Python

    from sklearn.model_selection import TunedThresholdClassifierCV
    from sklearn.metrics import make_scorer

    # Wrap the profit function as a scorer evaluated on hard class predictions.
    custom_scorer = make_scorer(
        custom_score, response_method="predict", greater_is_better=True
    )
    tuned_classifier = TunedThresholdClassifierCV(
        classifier_05, cv=5, scoring=custom_scorer
    ).fit(X, y)

    print(f"Tuned decision threshold: {tuned_classifier.best_threshold_:.3f}")
    print(f"Custom score: {custom_score(y_test, tuned_classifier.predict(X_test)):.2f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Tuned decision threshold: 0.071
    Custom score: 0.04

.. GENERATED FROM PYTHON SOURCE LINES 91-96

We observe that tuning the decision threshold can turn a machine
learning-based system that makes a loss on average into a beneficial one.

In practice, defining a meaningful application-specific metric might involve
making the costs of bad predictions and the gains of good predictions depend
on auxiliary metadata specific to each individual data point, such as the
amount of a transaction in a fraud detection system.

To achieve this, :class:`~model_selection.TunedThresholdClassifierCV`
leverages metadata routing support (:ref:`Metadata Routing User Guide`),
which allows optimizing complex business metrics as detailed in
:ref:`Post-tuning the decision threshold for cost-sensitive learning`.

.. GENERATED FROM PYTHON SOURCE LINES 98-101

Performance improvements in PCA
-------------------------------

:class:`~decomposition.PCA` has a new solver, `"covariance_eigh"`, which is
up to an order of magnitude faster and more memory efficient than the other
solvers for datasets with many data points and few features.

.. GENERATED FROM PYTHON SOURCE LINES 101-113

.. code-block:: Python

    from sklearn.datasets import make_low_rank_matrix
    from sklearn.decomposition import PCA

    # Tall and narrow data: many samples, few features.
    X = make_low_rank_matrix(
        n_samples=10_000, n_features=100, tail_strength=0.1, random_state=0
    )

    pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)
    print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Explained variance: 0.88

.. GENERATED FROM PYTHON SOURCE LINES 114-115

The new solver also accepts sparse input data:

.. GENERATED FROM PYTHON SOURCE LINES 115-123

.. code-block:: Python

    from scipy.sparse import random

    # Random sparse matrix in CSR format.
    X = random(10_000, 100, format="csr", random_state=0)

    pca = PCA(n_components=10, svd_solver="covariance_eigh").fit(X)
    print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2f}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Explained variance: 0.13

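
To see why a covariance-based approach suits data with many samples and few
features, note that the covariance matrix has shape
`(n_features, n_features)` regardless of the number of samples. The NumPy
sketch below illustrates that general idea with a hypothetical helper,
`explained_variance_from_covariance`, on synthetic data; it is not
scikit-learn's implementation of the `"covariance_eigh"` solver.

.. code-block:: Python

    import numpy as np


    def explained_variance_from_covariance(X, n_components):
        """Largest eigenvalues of the feature covariance matrix."""
        X_centered = X - X.mean(axis=0)
        # The covariance matrix is only (n_features, n_features).
        covariance = X_centered.T @ X_centered / (X.shape[0] - 1)
        # eigvalsh is meant for symmetric matrices; eigenvalues come back
        # in ascending order, so reverse them.
        eigenvalues = np.linalg.eigvalsh(covariance)
        return eigenvalues[::-1][:n_components]


    # Synthetic tall-and-narrow data: 10,000 samples, 5 features.
    rng = np.random.default_rng(0)
    X_tall = rng.normal(size=(10_000, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

    variances = explained_variance_from_covariance(X_tall, n_components=2)

    # Cross-check against the singular values of the centered data.
    singular_values = np.linalg.svd(X_tall - X_tall.mean(axis=0), compute_uv=False)
    expected = singular_values[:2] ** 2 / (X_tall.shape[0] - 1)
    print(np.allclose(variances, expected))  # True

The eigendecomposition here operates on a 5 x 5 matrix, whereas the SVD used
for the cross-check works on the full 10,000 x 5 data matrix.
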
.. GENERATED FROM PYTHON SOURCE LINES 124-127

The `"full"` solver has also been improved to use less memory and to allow
faster transformation. The default `svd_solver="auto"` option takes advantage
of the new solver and is now able to select an appropriate solver for sparse
datasets.

Similarly to most other PCA solvers, the new `"covariance_eigh"` solver can
leverage GPU computation if the input data is passed as a PyTorch or CuPy
array and the experimental support for :ref:`Array API ` is enabled.

.. GENERATED FROM PYTHON SOURCE LINES 129-132

ColumnTransformer is subscriptable
----------------------------------

The transformers of a :class:`~compose.ColumnTransformer` can now be accessed
directly by indexing with their name.

.. GENERATED FROM PYTHON SOURCE LINES 132-146

.. code-block:: Python

    import numpy as np
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    X = np.array([[0, 1, 2], [3, 4, 5]])
    column_transformer = ColumnTransformer(
        [("std_scaler", StandardScaler(), [0]), ("one_hot", OneHotEncoder(), [1, 2])]
    )

    column_transformer.fit(X)

    # Look up each transformer by the name given in the constructor.
    print(column_transformer["std_scaler"])
    print(column_transformer["one_hot"])

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    StandardScaler()
    OneHotEncoder()

.. GENERATED FROM PYTHON SOURCE LINES 147-151

Custom imputation strategies for the SimpleImputer
--------------------------------------------------

:class:`~impute.SimpleImputer` now supports custom imputation strategies,
using a callable that computes a scalar value from the non-missing values of
a column vector.

.. GENERATED FROM PYTHON SOURCE LINES 151-175

.. code-block:: Python

    from sklearn.impute import SimpleImputer

    X = np.array(
        [
            [-1.1, 1.1, 1.1],
            [3.9, -1.2, np.nan],
            [np.nan, 1.3, np.nan],
            [-0.1, -1.4, -1.4],
            [-4.9, 1.5, -1.5],
            [np.nan, 1.6, 1.6],
        ]
    )


    def smallest_abs(arr):
        """Return the smallest absolute value of a 1D array."""
        return np.min(np.abs(arr))


    # Each missing entry is replaced by the value computed for its column.
    imputer = SimpleImputer(strategy=smallest_abs)
    imputer.fit_transform(X)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([[-1.1,  1.1,  1.1],
           [ 3.9, -1.2,  1.1],
           [ 0.1,  1.3,  1.1],
           [-0.1, -1.4, -1.4],
           [-4.9,  1.5, -1.5],
           [ 0.1,  1.6,  1.6]])

.. GENERATED FROM PYTHON SOURCE LINES 176-179

Pairwise distances with non-numeric arrays
------------------------------------------

:func:`~metrics.pairwise_distances` can now compute distances between
non-numeric arrays using a callable metric.

.. GENERATED FROM PYTHON SOURCE LINES 179-200

.. code-block:: Python

    from sklearn.metrics import pairwise_distances

    X = ["cat", "dog"]
    Y = ["cat", "fox"]


    def levenshtein_distance(x, y):
        """Return the Levenshtein distance between two strings."""
        # If either string is empty, the distance is the length of the other.
        if x == "" or y == "":
            return max(len(x), len(y))
        # Matching first characters cost nothing.
        if x[0] == y[0]:
            return levenshtein_distance(x[1:], y[1:])
        # Otherwise pay 1 for a deletion, an insertion, or a substitution.
        return 1 + min(
            levenshtein_distance(x[1:], y),
            levenshtein_distance(x, y[1:]),
            levenshtein_distance(x[1:], y[1:]),
        )


    pairwise_distances(X, Y, metric=levenshtein_distance)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    array([[0., 3.],
           [3., 2.]])

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.734 seconds)

.. _sphx_glr_download_auto_examples_release_highlights_plot_release_highlights_1_5_0.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/release_highlights/plot_release_highlights_1_5_0.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_release_highlights_1_5_0.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_release_highlights_1_5_0.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_release_highlights_1_5_0.zip `

.. include:: plot_release_highlights_1_5_0.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_