.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/preprocessing/plot_target_encoder_cross_val.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end
        <sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py>`
        to download the full example code or to run this example in your browser via Binder

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_preprocessing_plot_target_encoder_cross_val.py:


=======================================
Target Encoder's Internal Cross fitting
=======================================

.. currentmodule:: sklearn.preprocessing

The :class:`TargetEncoder` replaces each category of a categorical feature with
the shrunk mean of the target variable for that category. This method is useful
in cases where there is a strong relationship between the categorical feature
and the target. To prevent overfitting, :meth:`TargetEncoder.fit_transform` uses
an internal :term:`cross fitting` scheme to encode the training data to be used
by a downstream model. This scheme involves splitting the data into *k* folds
and encoding each fold using the encodings learnt from the other *k-1* folds.
In this example, we demonstrate the importance of the cross fitting procedure
to prevent overfitting.

.. GENERATED FROM PYTHON SOURCE LINES 12-21

Create Synthetic Dataset
========================
For this example, we build a dataset with three categorical features:

* an informative feature with medium cardinality ("informative")
* an uninformative feature with medium cardinality ("shuffled")
* an uninformative feature with high cardinality ("near_unique")

First, we generate the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 21-45

.. code-block:: Python

    import numpy as np

    from sklearn.preprocessing import KBinsDiscretizer

    n_samples = 50_000

    rng = np.random.RandomState(42)
    y = rng.randn(n_samples)
    noise = 0.5 * rng.randn(n_samples)
    n_categories = 100

    kbins = KBinsDiscretizer(
        n_bins=n_categories,
        encode="ordinal",
        strategy="uniform",
        random_state=rng,
        subsample=None,
    )
    X_informative = kbins.fit_transform((y + noise).reshape(-1, 1))

    # Remove the linear relationship between y and the bin index by permuting
    # the values of X_informative:
    permuted_categories = rng.permutation(n_categories)
    X_informative = permuted_categories[X_informative.astype(np.int32)]

.. GENERATED FROM PYTHON SOURCE LINES 46-47

The uninformative feature with medium cardinality is generated by permuting the
informative feature, which removes its relationship with the target:

.. GENERATED FROM PYTHON SOURCE LINES 47-49

.. code-block:: Python

    X_shuffled = rng.permutation(X_informative)

.. GENERATED FROM PYTHON SOURCE LINES 50-51

The uninformative feature with high cardinality is generated independently of
the target variable. We will show that target encoding without
:term:`cross fitting` causes catastrophic overfitting for the downstream
regressor. These high cardinality features are basically unique identifiers
for samples and should generally be removed from machine learning datasets.
In this example, we generate them to show how :class:`TargetEncoder`'s default
:term:`cross fitting` behavior mitigates the overfitting issue automatically.

.. GENERATED FROM PYTHON SOURCE LINES 51-56

.. code-block:: Python

    X_near_unique_categories = rng.choice(
        int(0.9 * n_samples), size=n_samples, replace=True
    ).reshape(-1, 1)

.. GENERATED FROM PYTHON SOURCE LINES 57-58

Finally, we assemble the dataset and perform a train test split:

.. GENERATED FROM PYTHON SOURCE LINES 58-72

.. code-block:: Python

    import pandas as pd

    from sklearn.model_selection import train_test_split

    X = pd.DataFrame(
        np.concatenate(
            [X_informative, X_shuffled, X_near_unique_categories],
            axis=1,
        ),
        columns=["informative", "shuffled", "near_unique"],
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 73-76

Training a Ridge Regressor
==========================
In this section, we train a Ridge regressor on the dataset with and without
encoding and explore the influence of the target encoder with and without the
internal cross fitting. First, we see that the Ridge model trained on the raw
features has low performance. This is because we permuted the order of the
informative feature, meaning `X_informative` is not informative when raw:

.. GENERATED FROM PYTHON SOURCE LINES 76-89

.. code-block:: Python

    import sklearn
    from sklearn.linear_model import Ridge

    # Configure transformers to always output DataFrames
    sklearn.set_config(transform_output="pandas")

    ridge = Ridge(alpha=1e-6, solver="lsqr", fit_intercept=False)

    raw_model = ridge.fit(X_train, y_train)
    print("Raw Model score on training set: ", raw_model.score(X_train, y_train))
    print("Raw Model score on test set: ", raw_model.score(X_test, y_test))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Raw Model score on training set:  0.0049896314219657345
    Raw Model score on test set:  0.004577621580701519
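Before moving on to the encoded models, it may help to make the "shrunk mean"
mentioned in the introduction concrete. The sketch below is an illustration
only, not :class:`TargetEncoder`'s exact internals: it assumes a fixed smoothing
strength (the estimator's default ``smooth="auto"`` instead derives the
smoothing from an empirical Bayes estimate) and computes the encoding of a
single category of the "informative" feature by blending the per-category
target mean with the global target mean:

.. code-block:: Python

    # Illustration only: shrunk-mean encoding of one category, assuming a fixed
    # smoothing strength rather than TargetEncoder's default "auto" behaviour.
    smooth = 10.0  # hypothetical smoothing value chosen for this sketch
    global_mean = y_train.mean()

    category = X_train["informative"].iloc[0]
    mask = (X_train["informative"] == category).to_numpy()
    n_i = mask.sum()
    category_mean = y_train[mask].mean()

    # Blend the per-category mean with the global mean, weighted by the
    # category count and the smoothing strength.
    shrunk_mean = (n_i * category_mean + smooth * global_mean) / (n_i + smooth)
    print(f"category {category}: raw mean {category_mean:.3f}, "
          f"shrunk mean {shrunk_mean:.3f}")

The larger a category, the closer its shrunk mean stays to its raw mean, while
rare categories are pulled towards the global mean, which limits the variance
of the encoding.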
.. GENERATED FROM PYTHON SOURCE LINES 90-91

Next, we create a pipeline with the target encoder and ridge model. The pipeline
uses :meth:`TargetEncoder.fit_transform`, which uses :term:`cross fitting`. We
see that the model fits the data well and generalizes to the test set:

.. GENERATED FROM PYTHON SOURCE LINES 91-100

.. code-block:: Python

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import TargetEncoder

    model_with_cf = make_pipeline(TargetEncoder(random_state=0), ridge)
    model_with_cf.fit(X_train, y_train)
    print("Model with CF on train set: ", model_with_cf.score(X_train, y_train))
    print("Model with CF on test set: ", model_with_cf.score(X_test, y_test))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model with CF on train set:  0.8000184677460311
    Model with CF on test set:  0.7927845601690897

.. GENERATED FROM PYTHON SOURCE LINES 101-102

The coefficients of the linear model show that most of the weight is on the
feature at column index 0, which is the informative feature:

.. GENERATED FROM PYTHON SOURCE LINES 102-118

.. code-block:: Python

    import matplotlib.pyplot as plt
    import pandas as pd

    plt.rcParams["figure.constrained_layout.use"] = True

    coefs_cf = pd.Series(
        model_with_cf[-1].coef_, index=model_with_cf[-1].feature_names_in_
    ).sort_values()
    ax = coefs_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded with cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )

.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :alt: Target encoded with cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 119-120

While :meth:`TargetEncoder.fit_transform` uses an internal :term:`cross fitting`
scheme to learn encodings for the training set, :meth:`TargetEncoder.transform`
itself does not. It uses the complete training set to learn encodings and to
transform the categorical features. Thus, we can use :meth:`TargetEncoder.fit`
followed by :meth:`TargetEncoder.transform` to disable the
:term:`cross fitting`. This encoding is then passed to the ridge model:

.. GENERATED FROM PYTHON SOURCE LINES 120-128

.. code-block:: Python

    target_encoder = TargetEncoder(random_state=0)
    target_encoder.fit(X_train, y_train)
    X_train_no_cf_encoding = target_encoder.transform(X_train)
    X_test_no_cf_encoding = target_encoder.transform(X_test)

    model_no_cf = ridge.fit(X_train_no_cf_encoding, y_train)

.. GENERATED FROM PYTHON SOURCE LINES 129-130

We evaluate the model that did not use :term:`cross fitting` when encoding and
see that it overfits:

.. GENERATED FROM PYTHON SOURCE LINES 130-143

.. code-block:: Python

    print(
        "Model without CF on training set: ",
        model_no_cf.score(X_train_no_cf_encoding, y_train),
    )
    print(
        "Model without CF on test set: ",
        model_no_cf.score(
            X_test_no_cf_encoding,
            y_test,
        ),
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Model without CF on training set:  0.858486250088675
    Model without CF on test set:  0.6338211367101004

.. GENERATED FROM PYTHON SOURCE LINES 144-145

The ridge model overfits because it assigns far more weight to the
uninformative, extremely high cardinality ("near_unique") and medium
cardinality ("shuffled") features than it does when :term:`cross fitting` is
used to encode the features:

.. GENERATED FROM PYTHON SOURCE LINES 145-156

.. code-block:: Python

    coefs_no_cf = pd.Series(
        model_no_cf.coef_, index=model_no_cf.feature_names_in_
    ).sort_values()
    ax = coefs_no_cf.plot(kind="barh")
    _ = ax.set(
        title="Target encoded without cross fitting",
        xlabel="Ridge coefficient",
        ylabel="Feature",
    )

.. image-sg:: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :alt: Target encoded without cross fitting
   :srcset: /auto_examples/preprocessing/images/sphx_glr_plot_target_encoder_cross_val_002.png
   :class: sphx-glr-single-img
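Before concluding, it can help to see roughly what
:meth:`TargetEncoder.fit_transform` does internally. The sketch below is a
simplified emulation of the :term:`cross fitting` scheme described above, not
the estimator's exact implementation: the hypothetical helper
`cross_fit_encode` encodes each fold of the training data with a
:class:`TargetEncoder` fitted only on the remaining folds, while the test set
is still encoded with an encoder fitted on the full training set, mirroring
what :meth:`TargetEncoder.transform` does after :meth:`TargetEncoder.fit`:

.. code-block:: Python

    from sklearn.model_selection import KFold


    def cross_fit_encode(X, y, n_splits=5, random_state=0):
        """Encode each fold with a TargetEncoder fitted on the other folds."""
        X_encoded = np.zeros(X.shape, dtype=float)
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
        for fit_idx, encode_idx in kfold.split(X):
            encoder = TargetEncoder(random_state=random_state)
            encoder.fit(X.iloc[fit_idx], y[fit_idx])
            X_encoded[encode_idx] = encoder.transform(X.iloc[encode_idx])
        return pd.DataFrame(X_encoded, columns=X.columns, index=X.index)


    X_train_manual_cf = cross_fit_encode(X_train, y_train)

    # At test time the encodings learnt on the full training set are used.
    test_time_encoder = TargetEncoder(random_state=0).fit(X_train, y_train)
    X_test_manual_cf = test_time_encoder.transform(X_test)

    manual_cf_model = Ridge(alpha=1e-6, solver="lsqr", fit_intercept=False)
    manual_cf_model.fit(X_train_manual_cf, y_train)
    print("Manual CF on train set: ",
          manual_cf_model.score(X_train_manual_cf, y_train))
    print("Manual CF on test set: ",
          manual_cf_model.score(X_test_manual_cf, y_test))

Because each row's encoding never uses that row's own target value, a ridge
model trained on this manually cross fitted encoding should generalize in a
way similar to `model_with_cf` above rather than overfit like `model_no_cf`.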
.. GENERATED FROM PYTHON SOURCE LINES 157-160

Conclusion
==========
This example demonstrates the importance of :class:`TargetEncoder`'s internal
:term:`cross fitting`. It is important to use
:meth:`TargetEncoder.fit_transform` to encode training data before passing it
to a machine learning model. When a :class:`TargetEncoder` is part of a
:class:`~sklearn.pipeline.Pipeline` and the pipeline is fit, the pipeline will
correctly call :meth:`TargetEncoder.fit_transform` and use
:term:`cross fitting` when encoding the training data.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.183 seconds)


.. _sphx_glr_download_auto_examples_preprocessing_plot_target_encoder_cross_val.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/preprocessing/plot_target_encoder_cross_val.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_target_encoder_cross_val.ipynb <plot_target_encoder_cross_val.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_target_encoder_cross_val.py <plot_target_encoder_cross_val.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_target_encoder_cross_val.zip <plot_target_encoder_cross_val.zip>`

.. include:: plot_target_encoder_cross_val.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_