.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/ensemble/plot_feature_transformation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_ensemble_plot_feature_transformation.py>`
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_ensemble_plot_feature_transformation.py:

===============================================
Feature transformations with ensembles of trees
===============================================

Transform your features into a higher-dimensional, sparse space. Then train a
linear model on these features.

First, fit an ensemble of trees (totally random trees, a random forest, or
gradient boosted trees) on the training set. Each leaf of each tree in the
ensemble is then assigned a fixed, arbitrary feature index in a new feature
space. These leaf indices are then one-hot encoded.

Each sample goes through the decisions of every tree in the ensemble and ends
up in one leaf per tree. The sample is encoded by setting the feature values
for these leaves to 1 and all other feature values to 0.

The resulting transformer has thus learned a supervised, sparse,
high-dimensional categorical embedding of the data.

.. GENERATED FROM PYTHON SOURCE LINES 15-19

.. code-block:: Python

    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause

.. GENERATED FROM PYTHON SOURCE LINES 20-27

First, we will create a large dataset and split it into three sets:

- a set to train the ensemble methods, which are later used as feature
  engineering transformers;
- a set to train the linear model;
- a set to test the linear model.

It is important to split the data this way so that data leakage does not cause
overfitting.

.. GENERATED FROM PYTHON SOURCE LINES 27-40

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=80_000, random_state=10)

    # Hold out half of the data for testing the final models.
    X_full_train, X_test, y_full_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=10
    )
    # Split the remaining half between the tree ensembles and the linear models.
    X_train_ensemble, X_train_linear, y_train_ensemble, y_train_linear = train_test_split(
        X_full_train, y_full_train, test_size=0.5, random_state=10
    )

.. GENERATED FROM PYTHON SOURCE LINES 41-42

For each of the ensemble methods, we will use 10 estimators and a maximum
depth of 3 levels.

.. GENERATED FROM PYTHON SOURCE LINES 42-47

.. code-block:: Python

    n_estimators = 10
    max_depth = 3

.. GENERATED FROM PYTHON SOURCE LINES 48-49

First, we will train the random forest and gradient boosting models on the
separate training set.

.. GENERATED FROM PYTHON SOURCE LINES 49-63

.. code-block:: Python

    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

    random_forest = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=10
    )
    random_forest.fit(X_train_ensemble, y_train_ensemble)

    gradient_boosting = GradientBoostingClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=10
    )
    _ = gradient_boosting.fit(X_train_ensemble, y_train_ensemble)

.. GENERATED FROM PYTHON SOURCE LINES 64-67

Note that :class:`~sklearn.ensemble.HistGradientBoostingClassifier` is much
faster than :class:`~sklearn.ensemble.GradientBoostingClassifier`, especially
on intermediate-sized datasets (`n_samples >= 10_000`), which is not the case
in the present example.

:class:`~sklearn.ensemble.RandomTreesEmbedding` is an unsupervised method and
therefore does not need to be trained separately.

.. GENERATED FROM PYTHON SOURCE LINES 67-74

.. code-block:: Python

    from sklearn.ensemble import RandomTreesEmbedding

    random_tree_embedding = RandomTreesEmbedding(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )

.. GENERATED FROM PYTHON SOURCE LINES 75-78

Now, we will create three pipelines that use the above embeddings as a
preprocessing stage.

The random trees embedding can be pipelined directly with the logistic
regression because it is a standard scikit-learn transformer.

.. GENERATED FROM PYTHON SOURCE LINES 78-85

.. code-block:: Python

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    rt_model = make_pipeline(random_tree_embedding, LogisticRegression(max_iter=1000))
    rt_model.fit(X_train_linear, y_train_linear)

.. code-block:: none

    Pipeline(steps=[('randomtreesembedding',
                     RandomTreesEmbedding(max_depth=3, n_estimators=10,
                                          random_state=0)),
                    ('logisticregression', LogisticRegression(max_iter=1000))])

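To make the encoding concrete, here is a minimal sketch that is not part of
the original example. It fits a
:class:`~sklearn.ensemble.RandomTreesEmbedding` on a small toy dataset and
checks that every sample activates exactly one leaf per tree. The names
`X_demo` and `demo_embedding` and the sample size are illustrative
assumptions.

.. code-block:: Python

    import numpy as np

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomTreesEmbedding

    # Small toy dataset, used only for this illustration.
    X_demo, _ = make_classification(n_samples=200, random_state=0)

    demo_embedding = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)
    X_embedded = demo_embedding.fit_transform(X_demo)  # sparse output by default

    # One column per leaf over all trees; a tree of depth 3 has at most 8 leaves,
    # so there are at most 10 * 8 = 80 columns.
    print(X_embedded.shape)

    # Each sample falls into exactly one leaf per tree, so every row contains
    # `n_estimators` ones and zeros everywhere else.
    active_leaves_per_sample = np.asarray(X_embedded.sum(axis=1)).ravel()
    print(np.unique(active_leaves_per_sample))

The number of columns depends on how many leaves the fitted trees actually
have, which is why the sketch only bounds it rather than stating a fixed
value.
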
.. GENERATED FROM PYTHON SOURCE LINES 86-87

Then, we can pipeline the random forest or the gradient boosting model with a
logistic regression. However, their feature transformation is performed by
calling the method `apply`, whereas a scikit-learn pipeline expects a call to
`transform`. Therefore, we wrap the call to `apply` in a
`FunctionTransformer`.

.. GENERATED FROM PYTHON SOURCE LINES 87-106

.. code-block:: Python

    from sklearn.preprocessing import FunctionTransformer, OneHotEncoder


    def rf_apply(X, model):
        # Return, for each sample, the index of the leaf reached in each tree.
        return model.apply(X)


    rf_leaves_yielder = FunctionTransformer(rf_apply, kw_args={"model": random_forest})

    rf_model = make_pipeline(
        rf_leaves_yielder,
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(max_iter=1000),
    )
    rf_model.fit(X_train_linear, y_train_linear)

.. code-block:: none

    Pipeline(steps=[('functiontransformer',
                     FunctionTransformer(func=<function rf_apply at 0xffff7f3cfa60>,
                                         kw_args={'model': RandomForestClassifier(max_depth=3,
                                                                                  n_estimators=10,
                                                                                  random_state=10)})),
                    ('onehotencoder', OneHotEncoder(handle_unknown='ignore')),
                    ('logisticregression', LogisticRegression(max_iter=1000))])

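As a rough illustration of what the wrapped `apply` call produces, the sketch
below (not part of the original example; `X_demo`, `y_demo`, `demo_forest`,
and the sample size are illustrative assumptions) inspects the leaf indices
returned by a small random forest and one-hot encodes them.

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import OneHotEncoder

    # Small toy dataset, used only for this illustration.
    X_demo, y_demo = make_classification(n_samples=500, random_state=0)

    demo_forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
    demo_forest.fit(X_demo, y_demo)

    # `apply` returns, for each sample, the index of the leaf reached in each tree.
    leaf_indices = demo_forest.apply(X_demo)
    print(leaf_indices.shape)  # (n_samples, n_estimators)

    # Each tree's leaf index is treated as a categorical feature and one-hot
    # encoded. With `handle_unknown="ignore"`, a leaf index that was not seen
    # while fitting the encoder is encoded as all zeros instead of raising.
    encoder = OneHotEncoder(handle_unknown="ignore")
    leaf_one_hot = encoder.fit_transform(leaf_indices)
    print(leaf_one_hot.shape)

In the pipeline above, the encoder is fitted on the linear training set, so
`handle_unknown="ignore"` guards against test samples that land in leaves
which no sample of that set reached.
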
.. GENERATED FROM PYTHON SOURCE LINES 107-122

.. code-block:: Python

    def gbdt_apply(X, model):
        # `apply` returns a 3D array here; keep only the first slice of the
        # last axis to get one leaf index per tree.
        return model.apply(X)[:, :, 0]


    gbdt_leaves_yielder = FunctionTransformer(
        gbdt_apply, kw_args={"model": gradient_boosting}
    )

    gbdt_model = make_pipeline(
        gbdt_leaves_yielder,
        OneHotEncoder(handle_unknown="ignore"),
        LogisticRegression(max_iter=1000),
    )
    gbdt_model.fit(X_train_linear, y_train_linear)

.. code-block:: none

    Pipeline(steps=[('functiontransformer',
                     FunctionTransformer(func=<function gbdt_apply at 0xffff79083ce0>,
                                         kw_args={'model': GradientBoostingClassifier(n_estimators=10,
                                                                                      random_state=10)})),
                    ('onehotencoder', OneHotEncoder(handle_unknown='ignore')),
                    ('logisticregression', LogisticRegression(max_iter=1000))])

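The indexing `[:, :, 0]` in `gbdt_apply` may look surprising. The following
minimal sketch (not part of the original example; `X_demo`, `y_demo`,
`demo_gbdt`, and the sample size are illustrative assumptions) shows the shape
returned by :class:`~sklearn.ensemble.GradientBoostingClassifier` for a binary
problem, where the last axis has length 1.

.. code-block:: Python

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Small binary toy dataset, used only for this illustration.
    X_demo, y_demo = make_classification(n_samples=500, random_state=0)

    demo_gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
    demo_gbdt.fit(X_demo, y_demo)

    # For gradient boosting, `apply` returns a 3D array of leaf indices.
    leaves_3d = demo_gbdt.apply(X_demo)
    print(leaves_3d.shape)

    # Dropping the last axis gives the 2D layout that `OneHotEncoder` expects:
    # one column of leaf indices per boosting stage.
    leaves_2d = leaves_3d[:, :, 0]
    print(leaves_2d.shape)
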
.. GENERATED FROM PYTHON SOURCE LINES 123-124

We can finally show the ROC curves of all the models.

.. GENERATED FROM PYTHON SOURCE LINES 124-147

.. code-block:: Python

    import matplotlib.pyplot as plt

    from sklearn.metrics import RocCurveDisplay

    _, ax = plt.subplots()

    models = [
        ("RT embedding -> LR", rt_model),
        ("RF", random_forest),
        ("RF embedding -> LR", rf_model),
        ("GBDT", gradient_boosting),
        ("GBDT embedding -> LR", gbdt_model),
    ]

    # Keep each display so that the curves can be re-plotted without recomputing.
    model_displays = {}
    for name, pipeline in models:
        model_displays[name] = RocCurveDisplay.from_estimator(
            pipeline, X_test, y_test, ax=ax, name=name
        )
    _ = ax.set_title("ROC curve")

.. image-sg:: /auto_examples/ensemble/images/sphx_glr_plot_feature_transformation_001.png
   :alt: ROC curve
   :srcset: /auto_examples/ensemble/images/sphx_glr_plot_feature_transformation_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 148-155

.. code-block:: Python

    _, ax = plt.subplots()
    for name, pipeline in models:
        model_displays[name].plot(ax=ax)

    # Zoom in on the top-left corner, where the curves differ the most.
    ax.set_xlim(0, 0.2)
    ax.set_ylim(0.8, 1)
    _ = ax.set_title("ROC curve (zoomed in at top left)")

.. image-sg:: /auto_examples/ensemble/images/sphx_glr_plot_feature_transformation_002.png
   :alt: ROC curve (zoomed in at top left)
   :srcset: /auto_examples/ensemble/images/sphx_glr_plot_feature_transformation_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.538 seconds)

.. _sphx_glr_download_auto_examples_ensemble_plot_feature_transformation.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/ensemble/plot_feature_transformation.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_feature_transformation.ipynb <plot_feature_transformation.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_feature_transformation.py <plot_feature_transformation.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_feature_transformation.zip <plot_feature_transformation.zip>`

.. include:: plot_feature_transformation.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_