Comparing Target Encoder with Other Encoders#

The TargetEncoder uses the value of the target to encode each categorical feature. In this example, we compare four approaches for handling categorical features: TargetEncoder, OrdinalEncoder, OneHotEncoder, and dropping the categorical features altogether.

Note

fit(X, y).transform(X) does not equal fit_transform(X, y) because a cross-fitting scheme is used in fit_transform for encoding. See the User Guide for details.
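
To see this concretely, the following snippet is a minimal sketch on hypothetical toy data (X_toy, y_toy are not part of the original example): fit(X, y).transform(X) uses encodings learned on the full data, while fit_transform(X, y) produces cross-fitted (out-of-fold) encodings, so the two generally differ.

import numpy as np
from sklearn.preprocessing import TargetEncoder

# Hypothetical toy data: one categorical feature and a continuous target.
X_toy = np.array([["a"], ["a"], ["b"], ["b"], ["c"], ["c"]], dtype=object)
y_toy = np.array([90.0, 92.0, 85.0, 88.0, 95.0, 94.0])

enc = TargetEncoder(target_type="continuous", random_state=0)
full_fit = enc.fit(X_toy, y_toy).transform(X_toy)  # encodings learned on all rows
cross_fit = enc.fit_transform(X_toy, y_toy)        # cross-fitted (out-of-fold) encodings

# The two results generally differ because of the internal cross fitting.
print(np.allclose(full_fit, cross_fit))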

Loading Data from OpenML#

First, we load the wine reviews dataset, where the target is the points given by a reviewer:

from sklearn.datasets import fetch_openml

wine_reviews = fetch_openml(data_id=42074, as_frame=True)

df = wine_reviews.frame
df.head()
country description designation points price province region_1 region_2 variety winery
0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley
3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude


For this example, we use the following subset of numerical and categorical features in the data. The target is a continuous value from 80 to 100:

numerical_features = ["price"]
categorical_features = [
    "country",
    "province",
    "region_1",
    "region_2",
    "variety",
    "winery",
]
target_name = "points"

X = df[numerical_features + categorical_features]
y = df[target_name]

_ = y.hist()
[Figure: histogram of the target values (points)]

Training and Evaluating Pipelines with Different Encoders#

In this section, we evaluate pipelines with HistGradientBoostingRegressor using different encoding strategies. First, we list the encoders that will be used to preprocess the categorical features:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, TargetEncoder

categorical_preprocessors = [
    ("drop", "drop"),
    ("ordinal", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    (
        "one_hot",
        OneHotEncoder(handle_unknown="ignore", max_categories=20, sparse_output=False),
    ),
    ("target", TargetEncoder(target_type="continuous")),
]

Next, we evaluate the models using cross-validation and record the results:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

n_cv_folds = 3
max_iter = 20
results = []


def evaluate_model_and_store(name, pipe):
    result = cross_validate(
        pipe,
        X,
        y,
        scoring="neg_root_mean_squared_error",
        cv=n_cv_folds,
        return_train_score=True,
    )
    rmse_test_score = -result["test_score"]
    rmse_train_score = -result["train_score"]
    results.append(
        {
            "preprocessor": name,
            "rmse_test_mean": rmse_test_score.mean(),
            "rmse_test_std": rmse_train_score.std(),
            "rmse_train_mean": rmse_train_score.mean(),
            "rmse_train_std": rmse_train_score.std(),
        }
    )


for name, categorical_preprocessor in categorical_preprocessors:
    preprocessor = ColumnTransformer(
        [
            ("numerical", "passthrough", numerical_features),
            ("categorical", categorical_preprocessor, categorical_features),
        ]
    )
    pipe = make_pipeline(
        preprocessor, HistGradientBoostingRegressor(random_state=0, max_iter=max_iter)
    )
    evaluate_model_and_store(name, pipe)
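
As an optional quick check (not part of the original example), the entries recorded so far can already be inspected as a DataFrame before the plotting section below:

import pandas as pd

# Peek at the scores collected so far; the full comparison is plotted later.
print(pd.DataFrame(results)[["preprocessor", "rmse_test_mean", "rmse_test_std"]])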

Native Categorical Feature Support#

In this section, we build and evaluate a pipeline that uses the native categorical feature support in HistGradientBoostingRegressor, which only supports up to 255 unique categories. In our dataset, most of the categorical features have more than 255 unique categories:

n_unique_categories = df[categorical_features].nunique().sort_values(ascending=False)
n_unique_categories
winery      14810
region_1     1236
variety       632
province      455
country        48
region_2       18
dtype: int64

To work around the limitation above, we group the categorical features into low-cardinality and high-cardinality features. The high-cardinality features are target encoded, while the low-cardinality features use the native categorical feature support in gradient boosting.

high_cardinality_features = n_unique_categories[n_unique_categories > 255].index
low_cardinality_features = n_unique_categories[n_unique_categories <= 255].index
mixed_encoded_preprocessor = ColumnTransformer(
    [
        ("numerical", "passthrough", numerical_features),
        (
            "high_cardinality",
            TargetEncoder(target_type="continuous"),
            high_cardinality_features,
        ),
        (
            "low_cardinality",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
            low_cardinality_features,
        ),
    ],
    verbose_feature_names_out=False,
)

# The output of the preprocessor must be set to pandas so that the
# gradient boosting model can detect the low-cardinality features.
mixed_encoded_preprocessor.set_output(transform="pandas")
mixed_pipe = make_pipeline(
    mixed_encoded_preprocessor,
    HistGradientBoostingRegressor(
        random_state=0, max_iter=max_iter, categorical_features=low_cardinality_features
    ),
)
mixed_pipe
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('numerical', 'passthrough',
                                                  ['price']),
                                                 ('high_cardinality',
                                                  TargetEncoder(target_type='continuous'),
                                                  Index(['winery', 'region_1', 'variety', 'province'], dtype='object')),
                                                 ('low_cardinality',
                                                  OrdinalEncoder(handle_unknown='use_encoded_value',
                                                                 unknown_value=-1),
                                                  Index(['country', 'region_2'], dtype='object'))],
                                   verbose_feature_names_out=False)),
                ('histgradientboostingregressor',
                 HistGradientBoostingRegressor(categorical_features=Index(['country', 'region_2'], dtype='object'),
                                               max_iter=20, random_state=0))])


Finally, we evaluate the pipeline using cross-validation and record the results:

evaluate_model_and_store("mixed_target", mixed_pipe)

Plotting the Results#

In this section, we display the results by plotting the test and train scores:

import matplotlib.pyplot as plt
import pandas as pd

results_df = (
    pd.DataFrame(results).set_index("preprocessor").sort_values("rmse_test_mean")
)

fig, (ax1, ax2) = plt.subplots(
    1, 2, figsize=(12, 8), sharey=True, constrained_layout=True
)
xticks = range(len(results_df))
name_to_color = dict(
    zip((r["preprocessor"] for r in results), ["C0", "C1", "C2", "C3", "C4"])
)

for subset, ax in zip(["test", "train"], [ax1, ax2]):
    mean, std = f"rmse_{subset}_mean", f"rmse_{subset}_std"
    data = results_df[[mean, std]].sort_values(mean)
    ax.bar(
        x=xticks,
        height=data[mean],
        yerr=data[std],
        width=0.9,
        color=[name_to_color[name] for name in data.index],
    )
    ax.set(
        title=f"RMSE ({subset.title()})",
        xlabel="Encoding Scheme",
        xticks=xticks,
        xticklabels=data.index,
    )
[Figure: bar charts of RMSE (Test) and RMSE (Train) for each encoding scheme]

When evaluating the predictive performance on the test set, dropping the categories performs the worst and the target encoder performs the best. This can be explained as follows:

  • Dropping the categorical features makes the pipeline less expressive and results in underfitting;

  • Due to the high cardinality, and in order to reduce training time, the one-hot encoding scheme uses max_categories=20, which prevents the features from expanding too much but can still result in underfitting;

  • If we had not set max_categories=20, the one-hot encoding scheme would likely have made the pipeline overfit, as the number of features explodes with rare category occurrences that are correlated with the target by chance on the training set only;

  • The ordinal encoding imposes an arbitrary order on the categories, which are then treated as numerical values by the HistGradientBoostingRegressor. Since this model groups numerical features into 256 bins per feature, many unrelated categories can end up in the same bin, and as a result the overall pipeline can underfit;

  • With the target encoder, the same binning happens, but since the encoded values are statistically ordered by marginal association with the target variable, the binning used by the HistGradientBoostingRegressor makes sense and leads to good results: the combination of smoothed target encoding and binning acts as a good regularizing strategy against overfitting while not limiting the expressiveness of the pipeline too much (see the sketch after this list for a small illustration of the ordering difference).
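
To make the last two points more concrete, here is a minimal sketch on hypothetical toy data (not part of the original example) contrasting the arbitrary integer codes of OrdinalEncoder with the target-ordered values produced by TargetEncoder:

import numpy as np
from sklearn.preprocessing import OrdinalEncoder, TargetEncoder

# Hypothetical toy data: the categories' alphabetical order is unrelated to y.
X_toy = np.array(
    [["zinfandel"], ["merlot"], ["zinfandel"], ["riesling"], ["merlot"], ["riesling"]],
    dtype=object,
)
y_toy = np.array([95.0, 84.0, 93.0, 88.0, 85.0, 89.0])

# OrdinalEncoder assigns alphabetical integer codes; their order carries no
# information about the target.
print(OrdinalEncoder().fit_transform(X_toy).ravel())

# TargetEncoder (fitted on the full data, so no cross fitting here) produces
# values shrunk toward each category's mean target, so their ordering reflects
# the marginal association with the target.
print(TargetEncoder(target_type="continuous").fit(X_toy, y_toy).transform(X_toy).ravel())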

Total running time of the script: (0 minutes 12.336 seconds)
