mlxtend.feature_selection

mlxtend version: 0.23.1

ColumnSelector

ColumnSelector(cols=None, drop_axis=False)

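Object for selecting specific columns from a data set.

Parameters

cols : array-like (default: None)

A list specifying the feature indices to be selected. For example, [1, 4, 5] selects the 2nd, 5th, and 6th feature columns, and ['A','C','D'] selects the feature columns named A, C, and D. If None, returns all columns in the array.
drop_axis : bool (default=False)

Drops the last axis if True and only one column is selected. This is useful, e.g., when the ColumnSelector is used to select a single column and the resulting array should be passed on to, e.g., a scikit-learn column selector. For example, instead of returning an array with shape (n_samples, 1), drop_axis=True returns an array with shape (n_samples,).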

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/

Methods

fit(X, y=None)

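Mock method. Does nothing.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)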

Returns

self

fit_transform(X, y=None)

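Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features.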

get_metadata_routing()

Get metadata routing of this object.

Please check :ref:`User Guide <metadata_routing>` on how the routing
mechanism works.

Returns

routing : MetadataRequest

A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects
(such as :class:`~sklearn.pipeline.Pipeline`). The latter have
parameters of the form ``<component>__<parameter>`` so that it's
possible to update each component of a nested object.

Parameters

**params : dict

Estimator parameters.

Returns

self : estimator instance

Estimator instance.
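The ``<component>__<parameter>`` naming described above can be sketched with a plain scikit-learn Pipeline. The step names "scale" and "clf" are arbitrary labels chosen for this example; the sketch assumes scikit-learn is installed.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Step names ("scale", "clf") are illustrative labels, not fixed names.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update parameter C of the nested component registered under "clf".
pipe.set_params(clf__C=0.5)

# The nested estimator now carries the new value.
print(pipe.get_params()["clf__C"])
print(pipe.named_steps["clf"].C)
```

The same addressing scheme applies when one of the selectors in this module is placed inside a pipeline step.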

transform(X, y=None)

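Return a slice of the input array.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples] (default: None)

Returns

X_slice : shape = [n_samples, k_features]

Subset of the feature space where k_features <= n_features.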

ExhaustiveFeatureSelector

ExhaustiveFeatureSelector(estimator, min_features=1, max_features=1, print_progress=True, scoring='accuracy', cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)

Exhaustive Feature Selection for Classification and Regression. (new in v0.4.3)

Parameters

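estimator : scikit-learn classifier or regressor
min_features : int (default: 1)

Minimum number of features to select.
max_features : int (default: 1)

Maximum number of features to select. If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.
print_progress : bool (default: True)

Prints progress as the number of epochs to stderr.
scoring : str, (default: 'accuracy')

Scoring metric in {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'r2'} for regressors, or a callable object or function with signature scorer(estimator, X, y).
cv : int (default: 5)

Scikit-learn cross-validation generator or int. If the estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed, and regular k-fold cross-validation otherwise. No cross-validation is performed if cv is None, False, or 0.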
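n_jobs : int (default: 1)

The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch : int, or string (default: '2*n_jobs')

Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

- An int, giving the exact number of total jobs that are spawned.

- A string, giving an expression as a function of n_jobs, as in 2*n_jobs.

clone_estimator : bool (default: True)

Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0 and n_jobs=1.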
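fixed_features : tuple (default: None)

If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features). In other words, ensure that k_features > len(fixed_features).
feature_groups : list or None (default: None)

Optional argument for treating certain features as a group. This means that the features within a group are always selected together and never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups. In this case, the possible feature selection results with k_features=2 are [[1], [2]], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring.) New in mlxtend v. 0.21.0.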
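To make the interaction between min_features, max_features, and feature_groups concrete, here is a small standard-library sketch that enumerates the candidate subsets an exhaustive search would have to score. The helper candidate_subsets is hypothetical and written for this illustration; it is not the library's implementation.

```python
from itertools import combinations


def candidate_subsets(feature_groups, min_features, max_features):
    """Yield each candidate as a flat tuple of feature indices.

    Hypothetical helper: subset sizes are counted in groups, so
    max_features may not exceed len(feature_groups).
    """
    if not 1 <= min_features <= max_features <= len(feature_groups):
        raise ValueError(
            "need 1 <= min_features <= max_features <= number of groups"
        )
    for k in range(min_features, max_features + 1):
        for combo in combinations(feature_groups, k):
            # Flatten the chosen groups into one tuple of column indices.
            yield tuple(idx for group in combo for idx in group)


groups = [[1], [2], [3, 4, 5]]
subsets = list(candidate_subsets(groups, min_features=2, max_features=2))
print(subsets)
```

With the three groups from the feature_groups description and a subset size of 2, this yields the same three combinations listed there, with each group flattened into its feature indices.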

Attributes

best_idx_ : array-like, shape = [n_predictions]

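Feature indices of the selected feature subset.
best_feature_names_ : array-like, shape = [n_predictions]

Feature names of the selected feature subset. If pandas DataFrames are used in the fit method, the feature names correspond to the column names. Otherwise, the feature names are string representations of the feature array indices. New in v 0.13.0.
best_score_ : float

Cross-validation average score of the selected subset.
subsets_ : dict

A dictionary of the feature subsets selected during the exhaustive selection, where the dictionary keys are the lengths k of these feature subsets. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset), 'feature_names' (tuple of feature names of the feature subset), 'cv_scores' (list of individual cross-validation scores), 'avg_score' (average cross-validation score). Note that if pandas DataFrames are used in the fit method, 'feature_names' corresponds to the column names. Otherwise, the feature names are string representations of the feature array indices. 'feature_names' is new in v. 0.13.0.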

Notes

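(1) If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the max_features value cannot exceed 4.

(2) Although two or more individual features may be considered as one group throughout the feature-selection process, it does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficients of features 2 and 3 can be different even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each feature group contains the fixed_features selection. E.g., for a 3-feature set, fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid; fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

(4) In case of KeyboardInterrupt, the dictionary subsets_ may not be complete. If users are still interested in getting the best score, they can use the method `finalize_fit`.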
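Note (3) can be read as "no feature group may be split by the fixed features". The following sketch checks that condition for the two cases given in the note. The function groups_respect_fixed is a hypothetical helper based on that reading, not part of mlxtend.

```python
def groups_respect_fixed(fixed_features, feature_groups):
    """Return True if no group is split by the fixed features.

    Hypothetical check: a group is acceptable when it lies either
    entirely inside fixed_features or entirely outside of it.
    """
    fixed = set(fixed_features)
    for group in feature_groups:
        members = set(group)
        overlap = members & fixed
        if overlap and overlap != members:
            return False
    return True


# The valid and the invalid configuration from note (3).
print(groups_respect_fixed([0, 1], [[0, 1], [2]]))
print(groups_respect_fixed([0, 1], [[0], [1, 2]]))
```

Running a check like this before fitting can catch an inconsistent configuration early, although the library may report such cases in its own way.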

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/ExhaustiveFeatureSelector/

Methods

finalize_fit()

None

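fit(X, y, groups=None, **fit_params)

Perform feature selection and learn a model from the training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values.
groups : array-like, shape = [n_samples], optional

Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
fit_params : dict of string -> object, optional

Parameters to pass to the fit method of the classifier.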

Returns

self : object

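fit_transform(X, y, groups=None, **fit_params)

Fit to the training data and return the best selected features from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values.
groups : array-like, shape = [n_samples], optional

Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
fit_params : dict of string -> object, optional

Parameters to pass to the fit method of the classifier.

Returns

Feature subset of X, shape={n_samples, k_features}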

get_metadata_routing()

Get metadata routing of this object.

Please check :ref:`User Guide <metadata_routing>` on how the routing
mechanism works.

Returns

routing : MetadataRequest

A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.

get_metric_dict(confidence_interval=0.95)

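Return metric dictionary.

Parameters

confidence_interval : float (default: 0.95)

A positive float between 0.0 and 1.0 used to compute the confidence interval bounds of the CV score averages.

Returns

Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows:

- 'feature_idx': tuple of the indices of the feature subset

- 'cv_scores': list with the individual CV scores

- 'avg_score': average of the CV scores

- 'std_dev': standard deviation of the CV score average

- 'std_err': standard error of the CV score average

- 'ci_bound': confidence interval bound of the CV score average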
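For orientation, the summary statistics named above can be computed from a list of CV scores roughly as follows. This is a sketch under stated assumptions, not the library's code: summarize_cv_scores is a hypothetical helper, the sample scores are made up, and the choice of a t-based interval with n - 1 degrees of freedom is an assumption that may differ from what mlxtend does internally.

```python
import numpy as np
from scipy import stats


def summarize_cv_scores(cv_scores, confidence_interval=0.95):
    """Hypothetical helper: summary statistics for one feature subset."""
    scores = np.asarray(cv_scores, dtype=float)
    n = scores.shape[0]
    # Standard error of the mean (sample standard deviation / sqrt(n)).
    std_err = stats.sem(scores)
    # Assumed: half-width of a two-sided t interval with n - 1 degrees of freedom.
    ci_bound = std_err * stats.t.ppf((1.0 + confidence_interval) / 2.0, df=n - 1)
    return {
        "avg_score": float(scores.mean()),
        "std_dev": float(scores.std()),  # population form (ddof=0), assumed
        "std_err": float(std_err),
        "ci_bound": float(ci_bound),
    }


# Made-up fold scores for illustration.
summary = summarize_cv_scores([0.90, 0.95, 0.92, 0.97, 0.91])
print(summary)
```

A larger confidence_interval widens ci_bound, while std_dev and std_err depend only on the scores themselves.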

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_fit_request(self: mlxtend.feature_selection.exhaustive_feature_selector.ExhaustiveFeatureSelector, *, groups: Union[bool, NoneType, str] = '$UNCHANGED$') -> mlxtend.feature_selection.exhaustive_feature_selector.ExhaustiveFeatureSelector

Request metadata passed to the fit method.

Note that this method is only relevant if
``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
Please see :ref:`User Guide <metadata_routing>` on how the routing
mechanism works.

The options for each parameter are:

- ``True``: metadata is requested, and passed to ``fit`` if provided. The request is ignored if metadata is not provided.

- ``False``: metadata is not requested and the meta-estimator will not pass it to ``fit``.

- ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

- ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
existing request. This allows you to change the request for some
parameters and not others.

.. versionadded:: 1.3

.. note::
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
:class:`~sklearn.pipeline.Pipeline`. Otherwise it has no effect.

Parameters

groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for groups parameter in fit.

Returns

self : object

The updated object.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects
(such as :class:`~sklearn.pipeline.Pipeline`). The latter have
parameters of the form ``<component>__<parameter>`` so that it's
possible to update each component of a nested object.

Parameters

**params : dict

Estimator parameters.

Returns

self : estimator instance

Estimator instance.

transform(X)

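Return the best selected features from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns

Feature subset of X, shape={n_samples, k_features}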

SequentialFeatureSelector

SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True, fixed_features=None, feature_groups=None)

Sequential Feature Selection for Classification and Regression.

Parameters

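estimator : scikit-learn classifier or regressor
k_features : int or tuple or str (default: 1)

Number of features to select, where k_features < the full feature set. New in 0.4.2: A tuple containing a min and max value can be provided, and the SFS will consider returning any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination from 1 up to 4 features instead of a fixed number of features k. New in 0.8.0: A string argument "best" or "parsimonious". If "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided as an argument, the smallest feature subset that is within one standard error of the best cross-validation performance will be selected.
forward : bool (default: True)

Forward selection if True, backward selection otherwise.
floating : bool (default: False)

Adds a conditional exclusion/inclusion if True.
verbose : int (default: 0), level of verbosity to use in logging.

If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and CV scores at each step.
scoring : str, callable, or None (default: None)

If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors. If a callable object or function is provided, it has to conform with sklearn's signature scorer(estimator, X, y); see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html for more information.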
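cv : int (default: 5)

Integer or iterable yielding train, test splits. If cv is an integer and the estimator is a classifier (or y consists of integer class labels), stratified k-fold is performed. Otherwise regular k-fold cross-validation is performed. No cross-validation is performed if cv is None, False, or 0.
n_jobs : int (default: 1)

The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch : int, or string (default: '2*n_jobs')

Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:

- None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.

- An int, giving the exact number of total jobs that are spawned.

- A string, giving an expression as a function of n_jobs, as in 2*n_jobs.

clone_estimator : bool (default: True)

Clones the estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0 and n_jobs=1.
fixed_features : tuple (default: None)

If not None, the feature indices provided as a tuple will be regarded as fixed by the feature selector. For example, if fixed_features=(1, 3, 7), the 2nd, 4th, and 8th feature are guaranteed to be present in the solution. Note that if fixed_features is not None, make sure that the number of features to be selected is greater than len(fixed_features). In other words, ensure that k_features > len(fixed_features). New in mlxtend v. 0.18.0.
feature_groups : list or None (default: None)

Optional argument for treating certain features as a group. This means that the features within a group are always selected together and never split. For example, feature_groups=[[1], [2], [3, 4, 5]] specifies 3 feature groups. In this case, the possible feature selection results with k_features=2 are [[1], [2]], [[1], [3, 4, 5]], or [[2], [3, 4, 5]]. Feature groups can be useful for interpretability, for example, if features 3, 4, 5 are one-hot encoded features. (For more details, please read the notes at the bottom of this docstring.) New in mlxtend v. 0.21.0.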
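The forward=True, floating=False case described in the parameters above amounts to a greedy search: start from an empty set and repeatedly add the feature that improves the score most. The sketch below shows that idea with a toy scoring function. Both forward_select and toy_score are hypothetical and stand in for the library's cross-validated evaluation.

```python
def forward_select(n_features, k_features, score_fn):
    """Greedy forward selection sketch: grow the subset one feature at a time."""
    selected = ()
    history = {}
    while len(selected) < k_features:
        remaining = [f for f in range(n_features) if f not in selected]
        # Score every one-feature extension of the current subset.
        scored = [(score_fn(selected + (f,)), selected + (f,)) for f in remaining]
        best_score, selected = max(scored, key=lambda pair: pair[0])
        # Record the best subset found for this subset size.
        history[len(selected)] = {"feature_idx": selected, "avg_score": best_score}
    return history


# Made-up per-feature usefulness; the toy subset score is simply their sum.
usefulness = [0.1, 0.5, 0.3, 0.2]


def toy_score(subset):
    return sum(usefulness[f] for f in subset)


history = forward_select(n_features=4, k_features=2, score_fn=toy_score)
print(history[1]["feature_idx"], history[2]["feature_idx"])
```

Backward selection would run the same loop in reverse, starting from all features and removing one per step, and the floating variants add a conditional step that may undo an earlier choice.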

Attributes

k_feature_idx_ : array-like, shape = [n_predictions]

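Feature indices of the selected feature subset.
k_feature_names_ : array-like, shape = [n_predictions]

Feature names of the selected feature subset. If pandas DataFrames are used in the fit method, the feature names correspond to the column names. Otherwise, the feature names are string representations of the feature array indices. New in v 0.13.0.
k_score_ : float

Cross-validation average score of the selected subset.
subsets_ : dict

A dictionary of the feature subsets selected during the sequential selection, where the dictionary keys are the lengths k of these feature subsets. If the parameter feature_groups is not None, the value of a key indicates the number of groups that are selected together. The dictionary values are dictionaries themselves with the following keys: 'feature_idx' (tuple of indices of the feature subset), 'feature_names' (tuple of feature names of the feature subset), 'cv_scores' (list of individual cross-validation scores), 'avg_score' (average cross-validation score). Note that if pandas DataFrames are used in the fit method, 'feature_names' corresponds to the column names. Otherwise, the feature names are string representations of the feature array indices. 'feature_names' is new in v 0.13.0.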
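Given a dictionary shaped like subsets_, the "smallest subset within one standard error of the best" rule that the k_features documentation mentions for "parsimonious" could be sketched as follows. The function pick_parsimonious and the toy scores are made up for illustration, and the standard-error formula used here is an assumption; the library's exact threshold may differ.

```python
import statistics


def pick_parsimonious(subsets):
    """Hypothetical helper: smallest k whose avg_score is within one
    standard error of the best avg_score."""
    best_k = max(subsets, key=lambda k: subsets[k]["avg_score"])
    best = subsets[best_k]
    n_folds = len(best["cv_scores"])
    # Assumed definition: sample standard deviation / sqrt(number of folds).
    std_err = statistics.stdev(best["cv_scores"]) / n_folds ** 0.5
    threshold = best["avg_score"] - std_err
    for k in sorted(subsets):
        if subsets[k]["avg_score"] >= threshold:
            return k
    return best_k


# Hand-written stand-in for subsets_ (values invented for this example).
toy_subsets = {
    1: {"feature_idx": (3,), "cv_scores": [0.80, 0.82, 0.78], "avg_score": 0.80},
    2: {"feature_idx": (2, 3), "cv_scores": [0.94, 0.96, 0.92], "avg_score": 0.94},
    3: {"feature_idx": (0, 2, 3), "cv_scores": [0.95, 0.97, 0.93], "avg_score": 0.95},
}
print(pick_parsimonious(toy_subsets))
```

In this toy case the three-feature subset scores best, but the two-feature subset falls inside the one-standard-error band and is preferred for being smaller.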

Notes

(1) If parameter feature_groups is not None, the number of features is equal to the number of feature groups, i.e. len(feature_groups). For example, if feature_groups = [[0], [1], [2, 3], [4]], then the k_features value cannot exceed 4.

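(2) Although two or more individual features may be considered as one group throughout the feature-selection process, it does not mean the individual features of that group have the same impact on the outcome. For instance, in linear regression, the coefficients of features 2 and 3 can be different even if they are considered as one group in feature_groups.

(3) If both fixed_features and feature_groups are specified, ensure that each feature group contains the fixed_features selection. E.g., for a 3-feature set, fixed_features=[0, 1] and feature_groups=[[0, 1], [2]] is valid; fixed_features=[0, 1] and feature_groups=[[0], [1, 2]] is not valid.

(4) In case of KeyboardInterrupt, the dictionary subsets_ may not be complete. If users are still interested in getting the best score, they can use the method `finalize_fit`.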

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

Methods

finalize_fit()

None

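fit(X, y, groups=None, **fit_params)

Perform feature selection and learn a model from the training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values. New in v 0.13.0: pandas DataFrames are now also accepted as argument for y.
groups : array-like, shape = (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
fit_params : various, optional

Additional parameters that are passed to the estimator. For example, sample_weights=weights.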

Returns

self : object

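fit_transform(X, y, groups=None, **fit_params)

Fit to the training data, then reduce X to its most important features.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.
y : array-like, shape = [n_samples]

Target values. New in v 0.13.0: a pandas Series is now also accepted as argument for y.
groups : array-like, shape = (n_samples,), optional

Group labels for the samples used while splitting the dataset into train/test sets. Passed to the fit method of the cross-validator.
fit_params : various, optional

Additional parameters that are passed to the estimator. For example, sample_weights=weights.

Returns

Reduced feature subset of X, shape={n_samples, k_features}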

generate_error_message_k_features(name)

None

get_metadata_routing()

Get metadata routing of this object.

Please check :ref:`User Guide <metadata_routing>` on how the routing
mechanism works.

Returns

routing : MetadataRequest

A :class:`~sklearn.utils.metadata_routing.MetadataRequest` encapsulating routing information.

get_metric_dict(confidence_interval=0.95)

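Return metric dictionary.

Parameters

confidence_interval : float (default: 0.95)

A positive float between 0.0 and 1.0 used to compute the confidence interval bounds of the CV score averages.

Returns

Dictionary with items where each dictionary value is a list with the number of iterations (number of feature subsets) as its length. The dictionary keys corresponding to these lists are as follows:

- 'feature_idx': tuple of the indices of the feature subset

- 'cv_scores': list with the individual CV scores

- 'avg_score': average of the CV scores

- 'std_dev': standard deviation of the CV score average

- 'std_err': standard error of the CV score average

- 'ci_bound': confidence interval bound of the CV score average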

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : dict

Parameter names mapped to their values.

set_fit_request(self: mlxtend.feature_selection.sequential_feature_selector.SequentialFeatureSelector, *, groups: Union[bool, NoneType, str] = '$UNCHANGED$') -> mlxtend.feature_selection.sequential_feature_selector.SequentialFeatureSelector

Request metadata passed to the fit method.

Note that this method is only relevant if
``enable_metadata_routing=True`` (see :func:`sklearn.set_config`).
Please see :ref:`User Guide <metadata_routing>` on how the routing
mechanism works.

The options for each parameter are:

- ``True``: metadata is requested, and passed to ``fit`` if provided. The request is ignored if metadata is not provided.

- ``False``: metadata is not requested and the meta-estimator will not pass it to ``fit``.

- ``None``: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

- ``str``: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (``sklearn.utils.metadata_routing.UNCHANGED``) retains the
existing request. This allows you to change the request for some
parameters and not others.

.. versionadded:: 1.3

.. note::
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
:class:`~sklearn.pipeline.Pipeline`. Otherwise it has no effect.

Parameters

groups : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED

Metadata routing for groups parameter in fit.

Returns

self : object

The updated object.

set_params(**params)

Set the parameters of this estimator. Valid parameter keys can be listed with get_params().

Returns

self

transform(X)

Reduce X to its most important features.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features. New in v 0.13.0: pandas DataFrames are now also accepted as argument for X.

Returns

Reduced feature subset of X, shape={n_samples, k_features}

Properties

named_estimators

Returns

List of named estimator tuples, like [('svc', SVC(...))]