How a squashing function can affect feature importance

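When you apply a non-linear function to the output of a machine learning model, the importance of its features can change significantly. The most common transform of this kind is a "squashing" function. Squashing functions such as the logistic transform are often used to convert an unbounded "margin" space into a bounded probability space. Values in the margin space are in units of information (log-odds), while values in the probability space are in units of probability.

Which space you care about depends on the situation. The margin space is better suited to addition and subtraction, and it corresponds directly to "evidence" in the information-theoretic sense. If you only care about changes in probability, not evidence, the probability space is the better choice. Choosing the probability space means that a large amount of strong evidence that moves a prediction from 98% to 99.99% counts for less than a small amount of evidence that moves it from 50% to 60%. The first step takes more evidence because evidence accumulates in log-odds: going from 98% to 99.99% is a change of about 5.3 nats, while going from 50% to 60% is a change of only about 0.4 nats.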
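The log-odds arithmetic behind this comparison can be checked directly. The following is a minimal sketch, not part of the original notebook; the helper name `log_odds` is introduced here purely for illustration.

```python
import math


def log_odds(p):
    """Return the logit (log-odds, in nats) of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


# Evidence needed for each step, measured as a difference in log-odds
small_step = log_odds(0.60) - log_odds(0.50)    # 50% -> 60%
large_step = log_odds(0.9999) - log_odds(0.98)  # 98% -> 99.99%

print(f"50% -> 60%:    {small_step:.2f} nats")
print(f"98% -> 99.99%: {large_step:.2f} nats")
```

The second step needs more than ten times the evidence of the first, even though it moves the probability by less than two percentage points.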

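Note that although the logistic function is a monotonic transformation, it can still change the ordering of which features matter most in a model. The ordering can change because some features are important for reaching a probability of 99.9%, while others typically help reach a probability of 60%. The simple example below shows how applying a squashing function changes feature importance: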

[3]:

import numpy as np
import pandas as pd
import scipy.special  # import the submodule explicitly so scipy.special.expit resolves on all SciPy versions

import shap

[4]:

shap.initjs()

[5]:

# build a simple dataset
N = 500
M = 4
X = np.random.randn(N, M)
X[0, 0] = 0
X[0, 1] = 0
X = pd.DataFrame(X, columns=["A", "B", "C", "D"])


# a function (a made up ML model) with an output in "margin" space...
def f(X):
    return (X[:, 0] > 0) * 1 + (X[:, 1] > 1.5) * 100


# ...and then also change its output to probability space
def f_logistic(X):
    return scipy.special.expit(f(X))

[7]:

# explain both functions
explainer = shap.KernelExplainer(f, X)
shap_values_f = explainer.shap_values(X.values[0:2, :])

explainer_logistic = shap.KernelExplainer(f_logistic, X)
shap_values_f_logistic = explainer_logistic.shap_values(X.values[0:2, :])

Using 500 background data samples could cause slower run times. Consider using shap.kmeans(data, K) to summarize the background as K weighted samples.

Using 500 background data samples could cause slower run times. Consider using shap.kmeans(data, K) to summarize the background as K weighted samples.

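Explanation in the margin space

In the margin space, feature B is very important: because its value is 0, we do not trigger the +100 effect that occurs when B is greater than 1.5. Although B exceeding 1.5 is rare, the feature still matters a great deal because the effect is so large.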

[8]:

shap_values_f[0, :]

[8]:

array([-0.506, -6.   ,  0.   ,  0.   ])
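The size of B's value above can be sanity-checked by estimating how often a standard normal feature exceeds the 1.5 threshold used in `f`. The sketch below is an approximation under stated assumptions: it draws a fresh random sample with an arbitrarily chosen seed, so it does not reproduce the notebook's unseeded background data.

```python
import numpy as np
from scipy.stats import norm

# Exact tail probability for a standard normal feature: P(B > 1.5)
p_trigger = norm.sf(1.5)

# Average contribution of the +100 term to the margin output
expected_bonus = 100 * p_trigger

# Empirical check on a fresh sample of the same size as the notebook's dataset
rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
b = rng.standard_normal(500)
empirical_bonus = 100 * np.mean(b > 1.5)

print(f"P(B > 1.5)      = {p_trigger:.4f}")
print(f"expected bonus  = {expected_bonus:.2f}")
print(f"empirical bonus = {empirical_bonus:.2f}")
```

Because `f` is additive in A and B, fixing B at 0 removes the average bonus that the background data would otherwise contribute. An expected bonus of roughly 6.7 is therefore in line with the SHAP value of -6.0 reported for B.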

[9]:

shap.force_plot(float(explainer.expected_value), shap_values_f[0, :], X.iloc[0, :])

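Explanation in the probability space

In the probability space, feature B is no longer very important, because the logistic function squashes the +100 effect from the margin space down to at most +1 (here at most +0.5, since the logistic function maps a margin of 0 to a probability of 0.5). B being greater than 1.5 is therefore now both rare and not very consequential.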
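To see the squashing concretely, the sketch below evaluates the logistic function at the four margin values that `f` can produce. It is an illustration added here, not a cell from the original notebook, and the variable names are chosen for this example only.

```python
import numpy as np
from scipy.special import expit

# The four margin outputs f can produce: A term in {0, 1}, B term in {0, 100}
margins = np.array([0.0, 1.0, 100.0, 101.0])
probs = expit(margins)

for m, p in zip(margins, probs):
    print(f"margin {m:6.1f} -> probability {p:.4f}")

# Largest possible change from each feature in probability space
effect_a = probs[1] - probs[0]  # A switches on while B is off
effect_b = probs[2] - probs[0]  # B switches on while A is off
print(f"A: +{effect_a:.3f}, B: +{effect_b:.3f}")
```

In margin space B's effect is 100 times larger than A's, but after squashing it is only about twice as large. Since B exceeds its threshold far less often than A exceeds 0, B's average effect ends up smaller than A's.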

[10]:

shap_values_f_logistic[0, :]

[10]:

array([-0.11344976, -0.02653412,  0.        ,  0.        ])
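The change in ordering can be read off by ranking the features by absolute SHAP value in each space. The sketch below copies the two arrays printed in this notebook's outputs; the helper `rank_features` is a hypothetical name introduced for illustration.

```python
feature_names = ["A", "B", "C", "D"]

# SHAP values for the first sample, copied from the notebook outputs
margin_values = [-0.506, -6.0, 0.0, 0.0]
probability_values = [-0.11344976, -0.02653412, 0.0, 0.0]


def rank_features(values):
    """Return feature names sorted by decreasing absolute SHAP value."""
    pairs = sorted(zip(feature_names, values), key=lambda pair: -abs(pair[1]))
    return [name for name, _ in pairs]


print("margin space:     ", rank_features(margin_values))
print("probability space:", rank_features(probability_values))
```

B ranks first in the margin space and A ranks first in the probability space, even though the logistic transform is monotonic.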

[11]:

shap.force_plot(
    float(explainer_logistic.expected_value), shap_values_f_logistic[0, :], X.iloc[0, :]
)
