Feature Importance Permutation: Estimating feature importance via feature permutation.

A function to estimate the feature importance of classifiers and regressors based on permutation importance.

> `from mlxtend.evaluate import feature_importance_permutation`

Overview

Permutation importance is an intuitive, model-agnostic method for estimating the feature importance of classifiers and regression models. The approach is relatively simple and straightforward (a minimal from-scratch sketch is shown right after the list below):

  1. Take a model that has been fit to the training dataset
  2. Estimate the predictive performance of the model on an independent dataset (e.g., a validation dataset) and record it as the baseline performance
  3. For each feature i:
       1. randomly permute feature column i in the original dataset
       2. record the predictive performance of the model on the dataset with the permuted column
       3. compute the feature importance as the difference between the baseline performance (step 2) and the performance on the permuted dataset
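
To make these steps concrete, here is a minimal from-scratch sketch of the procedure (not the mlxtend implementation; the logistic regression model and the toy data are only used for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# baseline performance on an independent (validation) dataset (step 2)
baseline = model.score(X_valid, y_valid)

rng = np.random.RandomState(0)
importances = []
for i in range(X_valid.shape[1]):
    X_permuted = X_valid.copy()
    # randomly permute feature column i
    X_permuted[:, i] = rng.permutation(X_permuted[:, i])
    # performance on the dataset with the permuted column
    permuted_score = model.score(X_permuted, y_valid)
    # importance = baseline performance - permuted performance
    importances.append(baseline - permuted_score)

print(importances)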

Permutation importance is generally considered a relatively efficient technique that works well in practice [1], while a drawback is that the importance of correlated features may be overestimated [2].

References

Example 1 -- Feature Importance for Classifiers

The following example illustrates the feature importance estimation via permutation importance for classification models.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from mlxtend.evaluate import feature_importance_permutation

Generate a toy dataset

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=10000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

Feature importance via random forest

First, we compute the feature importance directly from the random forest via the mean impurity decrease (described after the code section):

forest = RandomForestClassifier(n_estimators=250,
                                random_state=0)

forest.fit(X_train, y_train)

print('Training accuracy:', np.mean(forest.predict(X_train) == y_train)*100)
print('Test accuracy:', np.mean(forest.predict(X_test) == y_test)*100)

importance_vals = forest.feature_importances_
print(importance_vals)

Training accuracy: 100.0
Test accuracy: 95.06666666666666
[0.283357   0.30846795 0.24204291 0.02229767 0.02364941 0.02390578
 0.02501543 0.0234225  0.02370816 0.0241332 ]

There are several strategies for computing the feature importance in random forests. The method implemented in scikit-learn (used in the next code example) is based on Breiman and Friedman's CART (Breiman, Friedman, "Classification and regression trees", 1984), the so-called mean impurity decrease. Here, the importance value of a feature is computed by averaging the impurity decrease for that feature, when splitting a parent node into two child nodes, across all the trees in the ensemble. Note that the impurity decrease values are weighted by the number of samples that are in the respective nodes. This process is repeated for all features in the dataset, and the feature importance values are then normalized so that they sum up to 1. In CART, the authors also note that this fast way of computing feature importance values is relatively consistent with the permutation importance.
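
As an optional sanity check (a sketch based on the assumption that scikit-learn's forest-level feature_importances_ is essentially the average of the normalized per-tree values), we can approximately reproduce the numbers above by averaging over the individual trees:

# Assumes `forest` has been fit as in the previous code block
tree_importances = np.array([tree.feature_importances_
                             for tree in forest.estimators_])

# Averaging the per-tree mean impurity decrease values should closely
# match the forest-level attribute (both sum to roughly 1)
print(tree_importances.mean(axis=0))
print(forest.feature_importances_)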

Next, let's visualize the feature importance values from the random forest, including a measure of the variability of the mean impurity decrease (here: the standard deviation):

std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
indices = np.argsort(importance_vals)[::-1]

# Plot the feature importances of the forest
plt.figure()
plt.title("Random Forest feature importance")
plt.bar(range(X.shape[1]), importance_vals[indices],
        yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.ylim([0, 0.5])
plt.show()

[Figure: Random Forest feature importance]

As we can see, features 1, 0, and 2 are estimated to be the most informative ones for the random forest classifier. Next, let's compute the feature importance via the permutation importance approach.

Permutation Importance

imp_vals, _ = feature_importance_permutation(
    predict_method=forest.predict, 
    X=X_test,
    y=y_test,
    metric='accuracy',
    num_rounds=1,
    seed=1)

imp_vals

array([ 0.26833333,  0.26733333,  0.261     , -0.002     , -0.00033333,
        0.00066667,  0.00233333,  0.00066667,  0.00066667, -0.00233333])

Note that feature_importance_permutation returns two arrays. The first array (here: imp_vals) contains the actual importance values we are interested in. If num_rounds > 1, the permutation is repeated multiple times (with different random seeds), and in this case the first array contains the average importance computed from the different runs. The second array (here assigned to _, because we are not using it) contains all the individual values from these runs (more about that later).
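
For example, the relationship between the two returned arrays can be illustrated as follows (a small sketch; the variable names are placeholders):

mean_vals, all_vals = feature_importance_permutation(
    predict_method=forest.predict,
    X=X_test,
    y=y_test,
    metric='accuracy',
    num_rounds=5,
    seed=1)

print(mean_vals.shape)  # (n_features,), averaged over the 5 rounds
print(all_vals.shape)   # (n_features, 5), one column per round
# per the description above, the first array is the mean over the rounds
print(np.allclose(mean_vals, all_vals.mean(axis=1)))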

Now, let's visualize the importance values in a bar plot:

indices = np.argsort(imp_vals)[::-1]
plt.figure()
plt.title("Random Forest feature importance via permutation importance")
plt.bar(range(X.shape[1]), imp_vals[indices])
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.ylim([0, 0.5])
plt.show()

[Figure: Random Forest feature importance via permutation importance]

As we can see, also here, features 1, 0, and 2 are found to be the most important ones, which is consistent with the feature importance values we computed via the mean impurity decrease method earlier.

(Note that in the context of random forests, the feature importance via permutation importance is typically computed using the out-of-bag samples of the random forest, whereas in this implementation an independent dataset is used.)

As mentioned previously, the permutation is repeated multiple times if num_rounds > 1. In this case, the second array returned by feature_importance_permutation contains the importance values for these individual runs (the array has shape [num_features, num_rounds]), which we can use to compute some sort of variability between these runs.

imp_vals, imp_all = feature_importance_permutation(
    predict_method=forest.predict, 
    X=X_test,
    y=y_test,
    metric='accuracy',
    num_rounds=10,
    seed=1)


std = np.std(imp_all, axis=1)
indices = np.argsort(imp_vals)[::-1]

plt.figure()
plt.title("Random Forest feature importance via permutation importance w. std. dev.")
plt.bar(range(X.shape[1]), imp_vals[indices],
        yerr=std[indices])
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

[Figure: Random Forest feature importance via permutation importance with standard deviations]

Note that the feature importance values do not sum up to 1, since they are not normalized (you can normalize them by dividing by the sum of the importance values if you'd like). The main point here is to look at the importance values relative to each other, and not to over-interpret the absolute values.
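
If normalized values are preferred for plotting, one possible way to do this is the following one-liner (a sketch; keep in mind that permutation importances can be negative, so the normalized values should still be interpreted with care):

# divide by the sum of the importance values so they sum to ~1
imp_vals_normalized = imp_vals / imp_vals.sum()
print(imp_vals_normalized.sum())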

Support Vector Machines

While the permutation importance approach yields results that are generally consistent with the mean impurity decrease feature importance values from a random forest, it is a model-agnostic method that can be used with any kind of classifier or regressor. The example below applies the feature_importance_permutation function to a support vector machine:

from sklearn.svm import SVC


svm = SVC(C=1.0, kernel='rbf')
svm.fit(X_train, y_train)

print('Training accuracy', np.mean(svm.predict(X_train) == y_train)*100)
print('Test accuracy', np.mean(svm.predict(X_test) == y_test)*100)

Training accuracy 94.87142857142857
Test accuracy 94.89999999999999

imp_vals, imp_all = feature_importance_permutation(
    predict_method=svm.predict, 
    X=X_test,
    y=y_test,
    metric='accuracy',
    num_rounds=10,
    seed=1)


std = np.std(imp_all, axis=1)
indices = np.argsort(imp_vals)[::-1]

plt.figure()
plt.title("SVM feature importance via permutation importance")
plt.bar(range(X.shape[1]), imp_vals[indices],
        yerr=std[indices])
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()

[Figure: SVM feature importance via permutation importance]

Example 2 -- Feature Importance for Regressors

import numpy as np
import matplotlib.pyplot as plt
from mlxtend.evaluate import feature_importance_permutation
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.svm import SVR


X, y = make_regression(n_samples=1000,
                       n_features=5,
                       n_informative=2,
                       n_targets=1,
                       random_state=123,
                       shuffle=False)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)    

svm = SVR(kernel='rbf')
svm.fit(X_train, y_train)

imp_vals, _ = feature_importance_permutation(
    predict_method=svm.predict, 
    X=X_test,
    y=y_test,
    metric='r2',
    num_rounds=1,
    seed=1)

imp_vals

array([ 0.43309137,  0.22058866,  0.00148447,  0.01613953, -0.00529505])

plt.figure()
plt.bar(range(X.shape[1]), imp_vals)
plt.xticks(range(X.shape[1]))
plt.xlim([-1, X.shape[1]])
plt.ylim([0, 0.5])
plt.show()

[Figure: permutation importance values for the regression model]

Example 3 -- Feature Importance for One-Hot Encoded Features

When one-hot encoding a feature variable with, say, 10 different categories, it is split into 10 new feature columns (or 9 if you drop one redundant column). If we want to treat each of these new feature columns as an independent feature variable, we can use feature_importance_permutation as usual.

The example below illustrates this.
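
As a quick aside before the full example, the column counts can be verified on a tiny toy column (hypothetical category values, only to illustrate the effect of drop='first'):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

toy = np.array([['a'], ['b'], ['c'], ['a']])  # toy column with 3 distinct categories

# all categories kept -> 3 columns
print(OneHotEncoder().fit_transform(toy).toarray().shape)
# one redundant column dropped -> 2 columns
print(OneHotEncoder(drop='first').fit_transform(toy).toarray().shape)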

Prepare the dataset

Here, we look at a dataset consisting of one categorical feature ('categorical') and three numerical features ('measurement1', 'measurement2', and 'measurement3').

import pandas as pd


df_data = pd.read_csv('https://gist.githubusercontent.com/rasbt/b99bf69079bc0d601eeae8a49248d358/raw/a114be9801647ec5460089f3a9576713dabf5f1f/onehot-numeric-mixed-data.csv')
df_data.head()

  categorical  measurement1  measurement2  label  measurement3
0           F      1.428571      2.721313      0         2.089
1           R      0.685939      0.982976      0         0.637
2           P      1.055817      0.624210      0         0.226
3           S      0.995956      0.321101      0         0.138
4           R      1.376773      1.578309      0         0.478

from sklearn.model_selection import train_test_split


df_X = df_data[['measurement1', 'measurement2', 'measurement3', 'categorical']]
df_y = df_data['label']


df_X_train, df_X_test, df_y_train, df_y_test = train_test_split(
     df_X, df_y, test_size=0.33, random_state=42, stratify=df_y)

Here, we one-hot encode the categorical feature and merge it with the numerical columns:

from sklearn.preprocessing import OneHotEncoder
import numpy as np


ohe = OneHotEncoder(drop='first')
ohe.fit(df_X_train[['categorical']])

df_X_train_ohe = df_X_train.drop(columns=['categorical'])
df_X_test_ohe = df_X_test.drop(columns=['categorical'])

ohe_train = np.asarray(ohe.transform(df_X_train[['categorical']]).todense())
ohe_test = np.asarray(ohe.transform(df_X_test[['categorical']]).todense())

X_train_ohe = np.hstack((df_X_train_ohe.values, ohe_train))
X_test_ohe = np.hstack((df_X_test_ohe.values, ohe_test))

# Show the first 3 rows
print(X_train_ohe[:3])

[[0.65747208 0.95105388 0.36       0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         1.         0.         0.
  0.         0.         0.        ]
 [1.17503636 1.01094494 0.653      0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         1.         0.
  0.         0.         0.        ]
 [1.25516647 0.67575824 0.176      0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.         0.         0.         0.
  0.         0.         0.        ]]

Fit a baseline model for the analysis

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV


pipe = make_pipeline(StandardScaler(),
                     MLPClassifier(max_iter=10000, random_state=123))

params = {
    'mlpclassifier__hidden_layer_sizes': [(30, 20, 10), 
                                          (20, 10), 
                                          (20,),
                                          (10,)],
    'mlpclassifier__activation': ['tanh', 'relu'],
    'mlpclassifier__solver': ['sgd'],
    'mlpclassifier__alpha': [0.0001],
    'mlpclassifier__learning_rate': ['adaptive'],
}

gs = GridSearchCV(estimator=pipe, 
                  param_grid=params, 
                  scoring='accuracy', 
                  refit=True,
                  n_jobs=-1,
                  cv=10)

gs = gs.fit(X_train_ohe, df_y_train.values)
model = gs.best_estimator_

Regular permutation importance

Here, we compute the feature importance as usual, treating each one-hot encoded feature as an individual variable.

from mlxtend.evaluate import feature_importance_permutation

imp_vals, imp_all = feature_importance_permutation(
    predict_method=model.predict, 
    X=X_test_ohe,
    y=df_y_test.values,
    metric='accuracy',
    num_rounds=50,
    seed=1)

feat_names_with_ohe = ['measurement1', 'measurement2', 'measurement3'] \
    + [f'categorical_ohe_{i}' for i in range(2, 20)]

%matplotlib inline
import matplotlib.pyplot as plt


std = np.std(imp_all, axis=1)
indices = np.argsort(imp_vals)[::-1]

plt.figure()
#plt.title("Feature importance via permutation importance w. std. dev.")
plt.bar(range(len(feat_names_with_ohe)), imp_vals[indices],
        yerr=std[indices])
plt.xticks(range(len(feat_names_with_ohe)),
           np.array(feat_names_with_ohe)[indices], rotation=90)
plt.xlim([-1, len(feat_names_with_ohe)])
plt.show()

[Figure: permutation importance values for the individual one-hot encoded features, with standard deviations]

However, note that if there are many category values, the importance of the individual binary features after one-hot encoding becomes hard to interpret. In certain cases, it is desirable to treat the one-hot encoded binary features as a single variable in the feature permutation importance evaluation. We can achieve this by using feature groups.

Using feature groups

In the example below, all the one-hot encoded variables are treated as one feature group. That means they are all shuffled together and analyzed as a single feature in the feature permutation importance analysis.

feature_groups = [0, 1, 2, range(3, 21)]

imp_vals, imp_all = feature_importance_permutation(
    predict_method=model.predict, 
    X=X_test_ohe,
    y=df_y_test.values,
    metric='accuracy',
    num_rounds=50,
    feature_groups=feature_groups,
    seed=1)

feature_names = ['measurement1', 'measurement2', 'measurement3', 'categorical']

std = np.std(imp_all, axis=1)
indices = np.argsort(imp_vals)[::-1]

plt.figure()
plt.bar(range(len(feature_names)), imp_vals[indices],
        yerr=std[indices])
plt.xticks(range(len(feature_names)),
           np.array(feature_names)[indices], rotation=90)
plt.xlim([-1, len(feature_names)])
plt.show()

[Figure: permutation importance values with the one-hot encoded features treated as a single feature group]

API

feature_importance_permutation(X, y, predict_method, metric, num_rounds=1, feature_groups=None, seed=None)

Feature importance estimation via permutation importance

Parameters

Returns

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/
