
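Pipeline ANOVA SVM

This example shows how feature selection can be integrated into a machine learning pipeline.

We also show how to inspect part of the pipeline.

We start by generating a binary classification dataset. We then split it into a training subset and a test subset.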

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

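A common mistake in feature selection is to search for a subset of discriminative features on the full dataset instead of using only the training set. Using a scikit-learn Pipeline prevents this mistake.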
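To make the point concrete, here is a minimal sketch of the leak-free manual approach, which is what a pipeline automates. It assumes the same dataset settings as above; the variable names `selector`, `X_train_sel`, and `X_test_sel` are illustrative and not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the selector on the training subset only, so the test samples and
# their labels never influence which features are kept.
selector = SelectKBest(f_classif, k=3).fit(X_train, y_train)

# Reuse the stored selection on both subsets; nothing is refit here.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
print(X_train_sel.shape, X_test_sel.shape)
```

Calling `fit` on the full `X, y` instead would let the test samples influence the selection, which is the mistake described above.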

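Here, we demonstrate how to build a pipeline whose first step is feature selection.

When fit is called on the training data, a subset of features is selected and the indices of the selected features are stored. The feature selector then reduces the number of features and passes this subset to the classifier, which is trained on it.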

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

anova_filter = SelectKBest(f_classif, k=3)
clf = LinearSVC()
anova_svm = make_pipeline(anova_filter, clf)
anova_svm.fit(X_train, y_train)

Pipeline(steps=[('selectkbest', SelectKBest(k=3)), ('linearsvc', LinearSVC())])
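As a rough check of what fit did, the sketch below rebuilds the same pipeline and looks at the data as the classifier saw it. The variable name `X_train_reduced` is illustrative; `n_features_in_` is the attribute scikit-learn estimators set during fit, assuming a reasonably recent scikit-learn release.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Slicing off the last step gives a sub-pipeline holding only the selector,
# so transform returns the reduced data that was handed to the classifier.
X_train_reduced = anova_svm[:-1].transform(X_train)
print(X_train.shape, "->", X_train_reduced.shape)

# The classifier records how many features it was trained on.
print(anova_svm[-1].n_features_in_)
```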


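Once training is complete, we can predict on new, unseen samples. In this case, the feature selector keeps only the most discriminative features, based on the information stored during training. The reduced data is then passed to the classifier, which makes the prediction.

Here, we show the final metrics via a classification report.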

from sklearn.metrics import classification_report

y_pred = anova_svm.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.80      0.86        15
           1       0.75      0.90      0.82        10

    accuracy                           0.84        25
   macro avg       0.84      0.85      0.84        25
weighted avg       0.85      0.84      0.84        25
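The prediction path described above can be sketched by running the two steps by hand and comparing the result with the pipeline's own prediction. This is an illustration under the same dataset settings; the names `selector`, `classifier`, and `y_pred_manual` are not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Take the fitted steps out of the pipeline and apply them one by one.
selector, classifier = anova_svm[0], anova_svm[-1]
X_test_selected = selector.transform(X_test)  # uses the selection stored at fit time
y_pred_manual = classifier.predict(X_test_selected)

# The manual path and the pipeline should agree on every test sample.
same = np.array_equal(y_pred_manual, anova_svm.predict(X_test))
print(same)
```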

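Note that you can inspect a step in the pipeline. For instance, we might be interested in the parameters of the classifier. Since we selected three features, we expect three coefficients.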

anova_svm[-1].coef_

array([[0.75788833, 0.27161955, 0.26113448]])

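However, we do not know which features of the original dataset were selected. There are several ways to find out. Here, we apply the inverse of the selection step to these coefficients, which maps them back to the original feature space.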

anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)

array([[0.        , 0.        , 0.75788833, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.27161955,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.26113448]])

We can see that the features with non-zero coefficients are the ones selected by the first step.
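Another of the several ways mentioned above is to ask the selector directly. The sketch below uses `get_support(indices=True)` on the first step and compares it with the non-zero positions from the inverse transform. The variable names are illustrative, and the actual indices printed may depend on the scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_features=20,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

anova_svm = make_pipeline(SelectKBest(f_classif, k=3), LinearSVC())
anova_svm.fit(X_train, y_train)

# Column indices (in the original 20-feature space) kept by the selector.
selected = anova_svm[0].get_support(indices=True)

# Positions of the non-zero entries after mapping the coefficients back.
coef_full = anova_svm[:-1].inverse_transform(anova_svm[-1].coef_)
nonzero = np.flatnonzero(coef_full[0])

print(selected)
print(nonzero)
```

The two arrays should match unless a fitted coefficient happens to be exactly zero, in which case `nonzero` would be a subset of `selected`.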
