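TunedThresholdClassifierCV

class sklearn.model_selection.TunedThresholdClassifierCV(estimator, *, scoring='balanced_accuracy', response_method='auto', thresholds=100, cv=None, refit=True, n_jobs=None, random_state=None, store_cv_results=False)

Classifier that post-tunes the decision threshold using cross-validation.

This estimator post-tunes the decision threshold (cut-off point) used to convert posterior probability estimates (i.e. the output of predict_proba) or decision scores (i.e. the output of decision_function) into a class label. The tuning is done by optimizing a binary metric, potentially constrained by another metric.

Read more in the User Guide.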

Added in version 1.5.

Parameters:

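estimator (estimator instance)

The classifier, fitted or not, for which we want to optimize the decision threshold used during predict.

scoring (str or callable, default="balanced_accuracy")

The objective metric to be optimized. Can be one of:

a string associated with a scoring function for binary classification (see scoring_parameter);
a scorer callable object created with make_scorer.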

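response_method ({"auto", "decision_function", "predict_proba"}, default="auto")

Method of the classifier estimator corresponding to the decision function for which we want to find a threshold. It can be:

if "auto", it will try to invoke "predict_proba" or "decision_function" on the classifier, in that order;
otherwise, one of "predict_proba" or "decision_function". If the method is not implemented by the classifier, an error is raised.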

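thresholds (int or array-like, default=100)

The number of decision thresholds to use when discretizing the output of the classifier's response method. Pass an array-like to manually specify the thresholds to use.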

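cv (int, float, cross-validation generator, iterable or "prefit", default=None)

Determines the cross-validation splitting strategy used to train the classifier. Possible inputs for cv are:

None, to use the default 5-fold stratified K-fold cross-validation;
an integer, to specify the number of folds in a stratified k-fold;
a float, to specify a single shuffle split. The float should be in (0, 1) and represents the size of the validation set;
an object to be used as a cross-validation generator;
an iterable yielding train, test splits;
"prefit", to bypass the cross-validation.

Refer to the User Guide for the various cross-validation strategies that can be used here.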

Warning

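Using cv="prefit" and passing the same dataset for fitting the estimator and tuning the cut-off point is subject to undesired overfitting. Refer to TunedThresholdClassifierCV_no_cv for an example.

This option should only be used when the set used to fit estimator is different from the one used to tune the cut-off point (by calling TunedThresholdClassifierCV.fit).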

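refit (bool, default=True)

Whether or not to refit the classifier on the entire training set once the decision threshold has been found. Note that forcing refit=False on a cross-validation with more than a single split will raise an error. Similarly, refit=True in conjunction with cv="prefit" will raise an error.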

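n_jobs (int, default=None)

The number of jobs to run in parallel. When cv represents a cross-validation strategy, the fitting and scoring on each data split is done in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

random_state (int, RandomState instance or None, default=None)

Controls the randomness of cross-validation when cv is a float. See Glossary.

store_cv_results (bool, default=False)

Whether to store all scores and thresholds computed during the cross-validation process.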

Attributes:

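estimator_ (estimator instance): The fitted classifier used when predicting.
best_threshold_ (float): The new decision threshold.
best_score_ (float or None): The optimal score of the objective metric, evaluated at best_threshold_.
cv_results_ (dict or None): A dictionary containing the scores and thresholds computed during the cross-validation process. Only exists if store_cv_results=True. The keys are "thresholds" and "scores".
classes_ (ndarray of shape (n_classes,)): The class labels.
n_features_in_ (int): Number of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.
feature_names_in_ (ndarray of shape (n_features_in_,)): Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.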

See also

sklearn.model_selection.FixedThresholdClassifier: Classifier that uses a constant threshold.
sklearn.calibration.CalibratedClassifierCV: Estimator that calibrates probabilities.

Examples

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.metrics import classification_report
>>> from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split
>>> X, y = make_classification(
...     n_samples=1_000, weights=[0.9, 0.1], class_sep=0.8, random_state=42
... )
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, stratify=y, random_state=42
... )
>>> classifier = RandomForestClassifier(random_state=0).fit(X_train, y_train)
>>> print(classification_report(y_test, classifier.predict(X_test)))
              precision    recall  f1-score   support

           0       0.94      0.99      0.96       224
           1       0.80      0.46      0.59        26

    accuracy                           0.93       250
   macro avg       0.87      0.72      0.77       250
weighted avg       0.93      0.93      0.92       250

>>> classifier_tuned = TunedThresholdClassifierCV(
...     classifier, scoring="balanced_accuracy"
... ).fit(X_train, y_train)
>>> print(
...     f"Cut-off point found at {classifier_tuned.best_threshold_:.3f}"
... )
Cut-off point found at 0.342
>>> print(classification_report(y_test, classifier_tuned.predict(X_test)))
              precision    recall  f1-score   support

           0       0.96      0.95      0.96       224
           1       0.61      0.65      0.63        26

    accuracy                           0.92       250
   macro avg       0.78      0.80      0.79       250
weighted avg       0.92      0.92      0.92       250

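property classes_: The class labels.

decision_function(X)

Decision function for samples in X using the fitted estimator.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns:

decisions (ndarray of shape (n_samples,)): The decision function computed by the fitted estimator.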

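fit(X, y, **params)

Fit the classifier.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training data.
y (array-like of shape (n_samples,)): Target values.
**params (dict): Parameters to pass to the fit method of the underlying classifier.

Returns:

self (object): Returns an instance of self.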

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:

routing (MetadataRouter): A MetadataRouter encapsulating routing information.

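get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.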

predict(X)

Predict the target of new samples.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): The samples, as accepted by estimator.predict.

Returns:

class_labels (ndarray of shape (n_samples,)): The predicted class.

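predict_log_proba(X)

Predict logarithm class probabilities for X using the fitted estimator.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns:

log_probabilities (ndarray of shape (n_samples, n_classes)): The logarithm class probabilities of the input samples.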

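predict_proba(X)

Predict class probabilities for X using the fitted estimator.

Parameters:

X ({array-like, sparse matrix} of shape (n_samples, n_features)): Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns:

probabilities (ndarray of shape (n_samples, n_classes)): The class probabilities of the input samples.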

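score(X, y, sample_weight=None)

Return the mean accuracy on the given test data and labels.

In multi-label classification, this is the subset accuracy, which is a harsh metric since it requires that each label set be correctly predicted for each sample.

Parameters:

X (array-like of shape (n_samples, n_features)): Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)): True labels for X.
sample_weight (array-like of shape (n_samples,), default=None): Sample weights.

Returns:

score (float): Mean accuracy of self.predict(X) w.r.t. y.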

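set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.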

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → TunedThresholdClassifierCV

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config ). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

True : metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False : metadata is not requested and the meta-estimator will not pass it to score .
None : metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str : metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default ( sklearn.utils.metadata_routing.UNCHANGED ) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline . Otherwise it has no effect.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED): Metadata routing for sample_weight parameter in score.

Returns:

self (object): The updated object.

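Gallery examples

Release Highlights for scikit-learn 1.5

Post-hoc tuning the cut-off point of decision function

Post-tuning the decision threshold for cost-sensitive learning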