HistGradientBoostingClassifier#

class sklearn.ensemble.HistGradientBoostingClassifier(loss='log_loss', *, learning_rate=0.1, max_iter=100, max_leaf_nodes=31, max_depth=None, min_samples_leaf=20, l2_regularization=0.0, max_features=1.0, max_bins=255, categorical_features='warn', monotonic_cst=None, interaction_cst=None, warm_start=False, early_stopping='auto', scoring='loss', validation_fraction=0.1, n_iter_no_change=10, tol=1e-07, verbose=0, random_state=None, class_weight=None)#

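Histogram-based Gradient Boosting Classification Tree.

This estimator is much faster than GradientBoostingClassifier for big datasets (n_samples >= 10000).

This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples.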

This implementation is inspired by LightGBM.

Read more in the User Guide.

Added in version 0.21.

Parameters:

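loss : {'log_loss'}, default='log_loss'

The loss function to use in the boosting process.

For binary classification problems, 'log_loss' is also known as logistic loss, binomial deviance or binary crossentropy. Internally, the model fits one tree per boosting iteration and uses the logistic sigmoid function (expit) as inverse link function to compute the predicted positive class probability.

For multiclass classification problems, 'log_loss' is also known as multinomial deviance or categorical crossentropy. Internally, the model fits one tree per boosting iteration and per class and uses the softmax function as inverse link function to compute the predicted probabilities of the classes.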

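learning_rate : float, default=0.1

The learning rate, also known as shrinkage. This is used as a multiplicative factor for the leaf values. Use 1 for no shrinkage.

max_iter : int, default=100

The maximum number of iterations of the boosting process, i.e. the maximum number of trees for binary classification. For multiclass classification, n_classes trees are built per iteration.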

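max_leaf_nodes : int or None, default=31

The maximum number of leaves for each tree. Must be strictly greater than 1. If None, there is no maximum limit.

max_depth : int or None, default=None

The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Depth isn't constrained by default.

min_samples_leaf : int, default=20

The minimum number of samples per leaf. For small datasets with fewer than a few hundred samples, it is recommended to lower this value, since otherwise only very shallow trees would be built.

l2_regularization : float, default=0

The L2 regularization parameter penalizing leaves with small hessians. Use 0 for no regularization (default).

max_features : float, default=1.0

Proportion of randomly chosen features in each and every node split. This is a form of regularization: smaller values make the trees weaker learners and might prevent overfitting. If interaction constraints from interaction_cst are present, only allowed features are taken into account for the subsampling.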

Added in version 1.4.

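max_bins : int, default=255

The maximum number of bins to use for non-missing values. Before training, each feature of the input array X is binned into integer-valued bins, which allows for a much faster training stage. Features with a small number of unique values may use fewer than max_bins bins. In addition to the max_bins bins, one more bin is always reserved for missing values. Must be no larger than 255.

categorical_features : array-like of {bool, int, str} of shape (n_features) or shape (n_categorical_features,), default=None

Indicates the categorical features.

None : no feature will be considered categorical.
boolean array-like : boolean mask indicating categorical features.
integer array-like : integer indices indicating categorical features.
str array-like : names of categorical features (assuming the training data has feature names).
"from_dtype" : dataframe columns with dtype "category" are considered to be categorical features. The input must be an object exposing a __dataframe__ method, such as pandas or polars DataFrames, to use this feature.

For each categorical feature, there must be at most max_bins unique categories. Negative values for categorical features encoded as numeric dtypes are treated as missing values. All categorical values are converted to floating point numbers. This means that categorical values of 1.0 and 1 are treated as the same category.

Read more in the User Guide.

Added in version 0.24.

Changed in version 1.2: Added support for feature names.

Changed in version 1.4: Added "from_dtype" option. The default will change to "from_dtype" in v1.6.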

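monotonic_cst : array-like of int of shape (n_features) or dict, default=None

Monotonic constraints to enforce on each feature, specified using the following integer values:

1: monotonic increase
0: no constraint
-1: monotonic decrease

If a dict with str keys, features are mapped to monotonic constraints by name. If an array, features are mapped to constraints by position. See monotonic_cst_features_names for a usage example.

The constraints are only valid for binary classification and hold over the probability of the positive class. Read more in the User Guide.

Added in version 0.23.

Changed in version 1.2: Accept dict of constraints with feature names as keys.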

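interaction_cst : {"pairwise", "no_interactions"} or sequence of lists/tuples/sets of int, default=None

Specify interaction constraints, i.e. the sets of features which can interact with each other in child node splits.

Each item specifies the set of feature indices that are allowed to interact with each other. If there are more features than specified in these constraints, they are treated as if they were specified as an additional set.

The strings "pairwise" and "no_interactions" are shorthands for allowing only pairwise or no interactions, respectively.

For instance, with 5 features in total, interaction_cst=[{0, 1}] is equivalent to interaction_cst=[{0, 1}, {2, 3, 4}], and specifies that each branch of a tree will either only split on features 0 and 1 or only split on features 2, 3 and 4.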

Added in version 1.2.

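warm_start : bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble. For results to be valid, the estimator should be re-trained on the same data only. See the Glossary.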

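early_stopping : 'auto' or bool, default='auto'

If 'auto', early stopping is enabled if the sample size is larger than 10000. If True, early stopping is enabled, otherwise early stopping is disabled.

Added in version 0.23.

scoring : str or callable or None, default='loss'

Scoring parameter to use for early stopping. It can be a single string (see scoring_parameter) or a callable (see Defining your scoring strategy from metric functions). If None, the estimator's default scorer is used. If scoring='loss', early stopping is checked with respect to the loss value. Only used if early stopping is performed.

validation_fraction : int or float or None, default=0.1

Proportion (or absolute size) of training data to set aside as validation data for early stopping. If None, early stopping is done on the training data. Only used if early stopping is performed.

n_iter_no_change : int, default=10

Used to determine when to "early stop". The fitting process is stopped when none of the last n_iter_no_change scores are better than the (n_iter_no_change - 1)-th-to-last one, up to some tolerance. Only used if early stopping is performed.

tol : float, default=1e-7

The absolute tolerance to use when comparing scores. The higher the tolerance, the more likely we are to early stop: higher tolerance means that it will be harder for subsequent iterations to be considered an improvement upon the reference score.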

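verbose : int, default=0

The verbosity level. If not zero, print some information about the fitting process.

random_state : int, RandomState instance or None, default=None

Pseudo-random number generator to control the subsampling in the binning process, and the train/validation data split if early stopping is enabled. Pass an int for reproducible output across multiple function calls. See the Glossary.

class_weight : dict or 'balanced', default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.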

Added in version 1.2.
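The "balanced" formula can be worked through by hand. The sketch below uses a made-up label vector and compares the manual computation with sklearn.utils.class_weight.compute_class_weight, which implements the same heuristic.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels: class 0 is three times as frequent as 1 or 2.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])

n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]

# n_samples / (n_classes * np.bincount(y)), as in the formula above.
manual = n_samples / (n_classes * np.bincount(y))

reference = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(manual, reference)
```

Here the majority class gets a weight of 10 / (3 * 6) and each minority class gets 10 / (3 * 2), so rarer classes count more in the loss.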

Attributes:

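classes_ : array, shape = (n_classes,): Class labels.
do_early_stopping_ : bool: Indicates whether early stopping is used during training.
n_iter_ : int: The number of iterations of the boosting process.
n_trees_per_iteration_ : int: The number of trees that are built at each iteration. This is equal to 1 for binary classification, and to n_classes for multiclass classification.
train_score_ : ndarray, shape (n_iter_+1,): The scores at each iteration on the training data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. If scoring is not 'loss', scores are computed on a subset of at most 10000 samples. Empty if no early stopping.
validation_score_ : ndarray, shape (n_iter_+1,): The scores at each iteration on the held-out validation data. The first entry is the score of the ensemble before the first iteration. Scores are computed according to the scoring parameter. Empty if no early stopping or if validation_fraction is None.
is_categorical_ : ndarray, shape (n_features, ) or None: Boolean mask for the categorical features. None if there are no categorical features.
n_features_in_ : int: Number of features seen during fit.

Added in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,): Names of features seen during fit. Defined only when X has feature names that are all strings.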

Added in version 1.0.

See also

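GradientBoostingClassifier: Exact gradient boosting method that does not scale as well on datasets with a large number of samples.
sklearn.tree.DecisionTreeClassifier: A decision tree classifier.
RandomForestClassifier: A meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
AdaBoostClassifier: A meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.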

Examples

>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> from sklearn.datasets import load_iris
>>> X, y = load_iris(return_X_y=True)
>>> clf = HistGradientBoostingClassifier().fit(X, y)
>>> clf.score(X, y)
1.0

decision_function(X)#

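Compute the decision function of X.

Parameters:

X : array-like, shape (n_samples, n_features): The input samples.

Returns:

decision : ndarray, shape (n_samples,) or (n_samples, n_trees_per_iteration): The raw predicted values (i.e. the sum of the trees' leaves) for each sample. n_trees_per_iteration is equal to the number of classes in multiclass classification.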

fit(X, y, sample_weight=None)#

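Fit the gradient boosting model.

Parameters:

X : array-like of shape (n_samples, n_features): The input samples.
y : array-like of shape (n_samples,): Target values.
sample_weight : array-like of shape (n_samples,), default=None: Weights of training data.

Added in version 0.23.

Returns:

self : object: Fitted estimator.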

get_metadata_routing()#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing : MetadataRequest: A MetadataRequest encapsulating routing information.

get_params(deep=True)#

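Get parameters for this estimator.

Parameters:

deep : bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : dict: Parameter names mapped to their values.

property n_iter_#: Number of iterations of the boosting process.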

predict(X)#

Predict classes for X.

Parameters:

X : array-like, shape (n_samples, n_features): The input samples.

Returns:

y : ndarray, shape (n_samples,): The predicted classes.

predict_proba(X)#

Predict class probabilities for X.

Parameters:

X : array-like, shape (n_samples, n_features): The input samples.

Returns:

p : ndarray, shape (n_samples, n_classes): The class probabilities of the input samples.

score(X, y, sample_weight=None)#

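Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that the entire label set of each sample be correctly predicted.

Parameters:

X : array-like of shape (n_samples, n_features): Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight : array-like of shape (n_samples,), default=None: Sample weights.

Returns:

score : float: Mean accuracy of self.predict(X) with respect to y.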

set_fit_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → HistGradientBoostingClassifier#

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in fit.

Returns:

self : object: The updated object.

set_params(**params)#

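Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params : dict: Estimator parameters.

Returns:

self : estimator instance: Estimator instance.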

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → HistGradientBoostingClassifier#

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

sample_weight : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self : object: The updated object.

staged_decision_function(X)#

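Compute the decision function of X for each iteration.

This method allows monitoring (i.e. determining the error on a test set) after each stage.

Parameters:

X : array-like of shape (n_samples, n_features): The input samples.

Yields:

decision : generator of ndarray of shape (n_samples,) or (n_samples, n_trees_per_iteration): The decision function of the input samples, which corresponds to the raw values predicted from the trees of the ensemble. The classes correspond to those in the attribute classes_.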

staged_predict(X)#

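Predict classes at each iteration.

This method allows monitoring (i.e. determining the error on a test set) after each stage.

Added in version 0.24.

Parameters:

X : array-like of shape (n_samples, n_features): The input samples.

Yields:

y : generator of ndarray of shape (n_samples,): The predicted classes of the input samples, for each iteration.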

staged_predict_proba(X)#

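Predict class probabilities at each iteration.

This method allows monitoring (i.e. determining the error on a test set) after each stage.

Parameters:

X : array-like of shape (n_samples, n_features): The input samples.

Yields:

y : generator of ndarray of shape (n_samples, n_classes): The predicted class probabilities of the input samples, for each iteration.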

Gallery examples#

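Release Highlights for scikit-learn 1.4

Release Highlights for scikit-learn 0.24

Release Highlights for scikit-learn 0.23

Release Highlights for scikit-learn 0.22

Feature transformations with ensembles of trees

Comparing Random Forests and Histogram Gradient Boosting models

Post-tuning the decision threshold for cost-sensitive learning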