DecisionTreeClassifier

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0, monotonic_cst=None)

A decision tree classifier.

Read more in the User Guide.

Parameters:

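criterion {"gini", "entropy", "log_loss"}, default="gini"

The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity, and "log_loss" and "entropy", both for the Shannon information gain; see Mathematical formulation.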
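As a rough illustration of the two impurity measures named above, the sketch below computes them by hand from class proportions. The helper names `gini` and `entropy` are hypothetical, and the base-2 logarithm (entropy in bits) is an assumption made for this example rather than a statement about the library internals.

```python
import math

def gini(proportions):
    """Gini impurity: 1 - sum(p_k ** 2) over class proportions p_k."""
    return 1.0 - sum(p * p for p in proportions)

def entropy(proportions):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# A perfectly mixed two-class node versus a pure node.
gini_balanced = gini([0.5, 0.5])
entropy_balanced = entropy([0.5, 0.5])
gini_pure = gini([1.0, 0.0])
entropy_pure = entropy([1.0, 0.0])
```

Both measures are zero on a pure node and largest on an evenly mixed one, which is why either can be used to rank candidate splits.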

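splitter {"best", "random"}, default="best"

The strategy used to choose the split at each node. Supported strategies are "best" to choose the best split and "random" to choose the best random split.

max_depth int, default=None

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.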

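min_samples_split int or float, default=2

The minimum number of samples required to split an internal node:

If int, then consider min_samples_split as the minimum number.
If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples for each split.

Changed in version 0.18: Added float values for fractions.

min_samples_leaf int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.
If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) is the minimum number of samples for each node.

Changed in version 0.18: Added float values for fractions.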
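The fractional forms of min_samples_split and min_samples_leaf can be worked out by hand. This is a minimal sketch of the ceil(...) rule stated above; the dataset size and the fractions are hypothetical values chosen for illustration.

```python
import math

n_samples = 150            # hypothetical training-set size
min_samples_split = 0.125  # fraction of n_samples
min_samples_leaf = 0.05    # fraction of n_samples

# ceil(fraction * n_samples) gives the effective sample counts.
effective_split = math.ceil(min_samples_split * n_samples)  # ceil(18.75)
effective_leaf = math.ceil(min_samples_leaf * n_samples)    # ceil(7.5)
```

With these values a node needs at least 19 samples to be split, and every leaf must keep at least 8.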

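min_weight_fraction_leaf float, default=0.0

The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.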

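max_features int, float or {"sqrt", "log2"}, default=None

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a fraction and max(1, int(max_features * n_features_in_)) features are considered at each split.

If "sqrt", then max_features=sqrt(n_features).

If "log2", then max_features=log2(n_features).

If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.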
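One way to see how each max_features setting resolves is to fit a tree and read back the fitted max_features_ attribute. The sketch below assumes the iris dataset (4 features) purely as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Map each max_features setting to the value inferred during fit.
resolved = {}
for setting in (None, 3, 0.5, "sqrt", "log2"):
    clf = DecisionTreeClassifier(max_features=setting, random_state=0).fit(X, y)
    resolved[setting] = clf.max_features_
```

With 4 features, None resolves to 4, the int 3 is used as is, and 0.5, "sqrt" and "log2" all resolve to 2.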

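random_state int, RandomState instance or None, default=None

Controls the randomness of the estimator. The features are always randomly permuted at each split, even if splitter is set to "best". When max_features < n_features, the algorithm selects max_features features at random at each split before finding the best split among them. The best split found may still vary across runs even if max_features=n_features: this happens when the improvement of the criterion is identical for several splits and one of them has to be selected at random. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer. See Glossary for details.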
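The effect of fixing random_state can be checked by fitting twice with the same seed and comparing the resulting tree structures. This is a small sketch on the iris dataset; `fit_tree` is a hypothetical helper introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def fit_tree(seed):
    # max_features=2 forces random feature subsampling at every split.
    return DecisionTreeClassifier(max_features=2, random_state=seed).fit(X, y)

a, b = fit_tree(0), fit_tree(0)

# Same seed: the split features and thresholds of both trees coincide.
same_structure = bool(
    np.array_equal(a.tree_.feature, b.tree_.feature)
    and np.array_equal(a.tree_.threshold, b.tree_.threshold)
)
```

Different seeds may or may not produce different trees, so only the same-seed case is checked here.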

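max_leaf_nodes int, default=None

Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by their relative reduction in impurity. If None, the number of leaf nodes is unlimited.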
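A quick way to observe the cap is to compare the leaf count of an unconstrained tree with one grown under max_leaf_nodes. The sketch below uses the iris dataset and a cap of 4, both chosen arbitrarily for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

unbounded = DecisionTreeClassifier(random_state=0).fit(X, y)
capped = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)

unbounded_leaves = unbounded.get_n_leaves()
capped_leaves = capped.get_n_leaves()  # never more than 4
```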

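min_impurity_decrease float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

The weighted impurity decrease equation is the following:

N_t / N * (impurity - N_t_R / N_t * right_impurity
                    - N_t_L / N_t * left_impurity)

where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.

N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.

Added in version 0.19.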
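The weighted impurity decrease equation above can be evaluated directly. In this sketch the function name `weighted_impurity_decrease` and all the numbers are hypothetical; only the formula itself comes from the text.

```python
def weighted_impurity_decrease(N, N_t, N_t_L, N_t_R,
                               impurity, left_impurity, right_impurity):
    """Evaluate the weighted impurity decrease of one candidate split."""
    return N_t / N * (impurity
                      - N_t_R / N_t * right_impurity
                      - N_t_L / N_t * left_impurity)

# A node holding 40 of 100 samples, with impurity 0.5, split into two pure
# children of 20 samples each.
decrease = weighted_impurity_decrease(
    N=100, N_t=40, N_t_L=20, N_t_R=20,
    impurity=0.5, left_impurity=0.0, right_impurity=0.0,
)
```

Here the decrease is 40/100 * 0.5 = 0.2, so this split would be accepted for any min_impurity_decrease up to 0.2.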

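class_weight dict, list of dict or "balanced", default=None

Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel), weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification the weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as n_samples / (n_classes * np.bincount(y)).

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.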
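The "balanced" formula can be reproduced with NumPy alone. The label vector below is a hypothetical toy example with a 3:1 class imbalance.

```python
import numpy as np

y = np.array([0, 0, 0, 1])  # hypothetical labels: three of class 0, one of class 1
n_samples = y.shape[0]
n_classes = np.unique(y).shape[0]

# n_samples / (n_classes * np.bincount(y)), as in the "balanced" mode formula.
balanced_weights = n_samples / (n_classes * np.bincount(y))
```

The minority class receives weight 2.0 and the majority class 2/3, so both classes contribute the same total weight.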

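ccp_alpha non-negative float, default=0.0

Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.

Added in version 0.22.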

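monotonic_cst array-like of int of shape (n_features), default=None

Indicates the monotonicity constraint to enforce on each feature.

1: monotonic increase
0: no constraint
-1: monotonic decrease

If monotonic_cst is None, no constraints are applied.

Monotonicity constraints are not supported for:

multiclass classification (i.e. when n_classes > 2),
multioutput classification (i.e. when n_outputs_ > 1),
classification trained on data with missing values.

The constraints hold over the probability of the positive class.

Read more in the User Guide.

Added in version 1.4.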

Attributes:

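classes_ ndarray of shape (n_classes,) or list of ndarray: The class labels (single output problem), or a list of arrays of class labels (multi-output problem).
feature_importances_ ndarray of shape (n_features,): Return the feature importances.
max_features_ int: The inferred value of max_features.
n_classes_ int or list of int: The number of classes (for single output problems), or a list containing the number of classes for each output (for multi-output problems).
n_features_in_ int: Number of features seen during fit.

Added in version 0.24.
feature_names_in_ ndarray of shape (n_features_in_,): Names of features seen during fit. Defined only when X has feature names that are all strings.

Added in version 1.0.
n_outputs_ int: The number of outputs when fit is performed.
tree_ Tree instance: The underlying Tree object. Please refer to help(sklearn.tree._tree.Tree) for attributes of the Tree object, and to Understanding the decision tree structure for basic usage of these attributes.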

See also

DecisionTreeRegressor: A decision tree regressor.

Notes

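The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees, which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values.

The predict method operates using the numpy.argmax function on the outputs of predict_proba. This means that if the highest predicted probabilities are tied, the classifier will predict the tied class with the lowest index in classes_.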
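The tie-breaking behaviour follows from how numpy.argmax treats equal values: it returns the first maximal index. The sketch below mimics this with hand-written probabilities; the arrays `classes_` and `proba` are hypothetical stand-ins, not outputs of a fitted tree.

```python
import numpy as np

classes_ = np.array(["a", "b", "c"])   # hypothetical class labels
proba = np.array([
    [0.4, 0.4, 0.2],    # tie between "a" and "b"
    [0.1, 0.45, 0.45],  # tie between "b" and "c"
])

# argmax returns the first maximal index, so the lower class index wins ties.
predicted = classes_[np.argmax(proba, axis=1)]
```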

Examples

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([ 1.     ,  0.93...,  0.86...,  0.93...,  0.93...,
        0.93...,  0.93...,  1.     ,  0.93...,  1.      ])

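apply(X, check_input=True)

Return the index of the leaf that each sample is predicted as.

Added in version 0.17.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
check_input bool, default=True: Allows bypassing several input checks. Don't use this parameter unless you know what you're doing.

Returns:

X_leaves array-like of shape (n_samples,): For each datapoint x in X, return the index of the leaf x ends up in. Leaves are numbered within [0; self.tree_.node_count), possibly with gaps in the numbering.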
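A short sketch of apply on the iris dataset follows. It additionally checks the returned ids against tree_.children_left, where a value of -1 marks a leaf; the shallow max_depth=2 is an arbitrary choice for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One leaf id per sample.
leaf_ids = clf.apply(X)

# Leaves have no children: children_left is -1 at every returned node id.
all_leaves = bool((clf.tree_.children_left[leaf_ids] == -1).all())
```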

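cost_complexity_pruning_path(X, y, sample_weight=None)

Compute the pruning path during Minimal Cost-Complexity Pruning.

See Minimal Cost-Complexity Pruning for details on the pruning process.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The training input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
y array-like of shape (n_samples,) or (n_samples, n_outputs): The target values (class labels) as integers or strings.
sample_weight array-like of shape (n_samples,), default=None: Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.

Returns:

ccp_path Bunch

Dictionary-like object, with the following attributes.

ccp_alphas ndarray: Effective alphas of subtrees during pruning.
impurities ndarray: Sum of the impurities of the subtree leaves for the corresponding alpha value in ccp_alphas.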
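As a hedged sketch of how the pruning path might be used, the example below computes it on the iris dataset and refits with one of the returned alphas. Picking the middle alpha is an arbitrary choice for illustration; in practice the value would usually be selected by validation.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Take an alpha from the middle of the path; clip at 0.0 to guard against
# tiny negative values from floating-point round-off.
mid_alpha = max(float(ccp_alphas[len(ccp_alphas) // 2]), 0.0)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=mid_alpha).fit(X, y)

full_leaves = full.get_n_leaves()
pruned_leaves = pruned.get_n_leaves()
```

Larger alphas prune more aggressively, so the pruned tree has at most as many leaves as the fully grown one.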

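decision_path(X, check_input=True)

Return the decision path in the tree.

Added in version 0.18.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
check_input bool, default=True: Allows bypassing several input checks. Don't use this parameter unless you know what you're doing.

Returns:

indicator sparse matrix of shape (n_samples, n_nodes): Return a node indicator CSR matrix where non-zero elements indicate that the sample goes through the corresponding nodes.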
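The node indicator matrix can be unpacked row by row through its CSR index arrays. This sketch uses the iris dataset and the first five samples, both arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

indicator = clf.decision_path(X[:5])

# Node ids visited by sample 0, read from the CSR structure of row 0.
start, stop = indicator.indptr[0], indicator.indptr[1]
first_path = indicator.indices[start:stop]

# The path starts at the root (node 0) and ends in the leaf given by apply.
leaf_of_first = clf.apply(X[:1])[0]
```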

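property feature_importances_

Return the feature importances.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

Warning: impurity-based feature importances can be misleading for high cardinality features (many unique values). See sklearn.inspection.permutation_importance as an alternative.

Returns:

feature_importances_ ndarray of shape (n_features,): Normalized total reduction of the criterion brought by each feature (Gini importance).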
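Because the importances are normalized, they sum to one for any tree that contains at least one split. A minimal check on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = clf.feature_importances_  # one non-negative value per feature
total = float(importances.sum())        # normalized, so close to 1.0
```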

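fit(X, y, sample_weight=None, check_input=True)

Build a decision tree classifier from the training set (X, y).

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The training input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csc_matrix.
y array-like of shape (n_samples,) or (n_samples, n_outputs): The target values (class labels) as integers or strings.
sample_weight array-like of shape (n_samples,), default=None: Sample weights. If None, then samples are equally weighted. Splits that would create child nodes with net zero or negative weight are ignored while searching for a split in each node. Splits are also ignored if they would result in any single class carrying a negative weight in either child node.
check_input bool, default=True: Allows bypassing several input checks. Don't use this parameter unless you know what you're doing.

Returns:

self DecisionTreeClassifier: Fitted estimator.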
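The following sketch fits a tree on a tiny hand-made dataset with string labels and per-sample weights. The data and weights are hypothetical; the point is only to show the call shape and that fit returns the estimator itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array(["neg", "neg", "pos", "pos"])       # string class labels
sample_weight = np.array([1.0, 1.0, 2.0, 2.0])   # hypothetical weights

clf = DecisionTreeClassifier(random_state=0)
returned = clf.fit(X, y, sample_weight=sample_weight)  # returns clf itself

predictions = clf.predict([[0.5], [2.5]])
```

Since fit returns self, construction and fitting can also be chained in one expression.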

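get_depth()

Return the depth of the decision tree.

The depth of a tree is the maximum distance between the root and any leaf.

Returns:

self.tree_.max_depth int: The maximum depth of the tree.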

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing MetadataRequest: A MetadataRequest encapsulating routing information.

get_n_leaves()

Return the number of leaves of the decision tree.

Returns:

self.tree_.n_leaves int: Number of leaves.
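A decision stump makes the meaning of get_depth and get_n_leaves easy to verify: one split gives depth 1 and two leaves. The sketch assumes the iris dataset.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A stump: the root is split once and both children are leaves.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

depth = stump.get_depth()
n_leaves = stump.get_n_leaves()
```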

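get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params dict: Parameter names mapped to their values.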

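predict(X, check_input=True)

Predict class or regression value for X.

For a classification model, the predicted class for each sample in X is returned. For a regression model, the predicted value based on X is returned.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
check_input bool, default=True: Allows bypassing several input checks. Don't use this parameter unless you know what you're doing.

Returns:

y array-like of shape (n_samples,) or (n_samples, n_outputs): The predicted classes, or the predicted values.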

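predict_log_proba(X)

Predict class log-probabilities of the input samples X.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.

Returns:

proba ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1: The class log-probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.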

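predict_proba(X, check_input=True)

Predict class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same class in a leaf.

Parameters:

X {array-like, sparse matrix} of shape (n_samples, n_features): The input samples. Internally, it will be converted to dtype=np.float32 and, if a sparse matrix is provided, to a sparse csr_matrix.
check_input bool, default=True: Allows bypassing several input checks. Don't use this parameter unless you know what you're doing.

Returns:

proba ndarray of shape (n_samples, n_classes) or list of n_outputs such arrays if n_outputs > 1: The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.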
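The relationship between predict_proba, classes_ and predict can be checked directly. This sketch uses a shallow tree on the iris dataset so that some leaves are impure and the probabilities are not all 0 or 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, n_classes)
labels = clf.predict(X)

# Columns follow the order of classes_, so argmax over columns recovers predict.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]
rows_sum_to_one = bool(np.allclose(proba.sum(axis=1), 1.0))
```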

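score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that the entire label set of each sample be correctly predicted.

Parameters:

X array-like of shape (n_samples, n_features): Test samples.
y array-like of shape (n_samples,) or (n_samples, n_outputs): True labels for X.
sample_weight array-like of shape (n_samples,), default=None: Sample weights.

Returns:

score float: Mean accuracy of self.predict(X) with respect to y.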
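For a single-output classifier, score should agree with computing the accuracy of predict by hand. The sketch below checks this on a held-out split of the iris dataset; the split itself is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

mean_accuracy = clf.score(X_test, y_test)
manual_accuracy = accuracy_score(y_test, clf.predict(X_test))
```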

set_fit_request(*, check_input: bool | None | str = '$UNCHANGED$', sample_weight: bool | None | str = '$UNCHANGED$') → DecisionTreeClassifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to fit .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

check_input str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for check_input parameter in fit.
sample_weight str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in fit.

Returns:

self object: The updated object.

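set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.

Parameters:

**params dict: Estimator parameters.

Returns:

self estimator instance: Estimator instance.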
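The two forms of set_params, plain and nested, can be sketched as follows. The step names "scale" and "tree" are hypothetical labels chosen for this example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Simple estimator: parameters are addressed by name.
clf = DecisionTreeClassifier()
clf.set_params(max_depth=3, criterion="entropy")
clf_params = clf.get_params()

# Nested object: parameters use the <component>__<parameter> form.
pipe = Pipeline([("scale", StandardScaler()), ("tree", DecisionTreeClassifier())])
pipe.set_params(tree__max_depth=5)
pipe_params = pipe.get_params()
```

The same double-underscore names are what get_params(deep=True) returns, so they can be used directly as keys in a parameter grid.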

set_predict_proba_request(*, check_input: bool | None | str = '$UNCHANGED$') → DecisionTreeClassifier

Request metadata passed to the predict_proba method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to predict_proba if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to predict_proba .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

check_input str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for check_input parameter in predict_proba.

Returns:

self object: The updated object.

set_predict_request(*, check_input: bool | None | str = '$UNCHANGED$') → DecisionTreeClassifier

Request metadata passed to the predict method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to predict .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

check_input str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for check_input parameter in predict.

Returns:

self object: The updated object.

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → DecisionTreeClassifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to score .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

sample_weight str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED: Metadata routing for sample_weight parameter in score.

Returns:

self object: The updated object.

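Gallery examples

Release Highlights for scikit-learn 1.3

Classifier comparison

Understanding the decision tree structure

Plot the decision surface of decision trees trained on the iris dataset

Post pruning decision trees with cost complexity pruning

Two-class AdaBoost

Multi-class AdaBoosted Decision Trees

Plot the decision boundaries of a VotingClassifier

Plot the decision surfaces of ensembles of trees on the iris dataset

Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV

Overview of multiclass training meta-estimators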