How a squashing function can affect feature importance

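When you use a non-linear function to transform a machine learning model's output, the importance of its features can change substantially. The most common transformation of this kind is a "squashing" function. Squashing functions such as the logistic transformation are often used to convert an unbounded "margin" space into a bounded probability space. Values in the margin space are in units of information, while values in the probability space are in units of probability. Which space you care about depends on the situation. The margin space is more amenable to addition and subtraction, and it corresponds directly to "evidence" in the information-theoretic sense. However, if you only care about changes in probability, not evidence, then the probability space is the better choice. Choosing the probability space means that a large amount of strong evidence that moves a prediction from 98% to 99.99% probability counts for much less than a small amount of evidence that moves it from 50% to 60%. Why does it take more evidence to go from 98% to 99.99% than from 50% to 60%? Because evidence is measured in log-odds, and the log-odds change much faster than the probability near 0 or 1: going from 98% to 99.99% raises the log-odds by about 5.3 (from ln 49 ≈ 3.9 to ln 9999 ≈ 9.2), while going from 50% to 60% raises them by only about 0.4.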
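The comparison between the two probability changes can be checked numerically. The sketch below is illustrative only: `evidence_needed` is a hypothetical helper name, and it assumes evidence is measured as a difference in natural-log odds, computed with `scipy.special.logit`.

```python
from scipy.special import logit


def evidence_needed(p_from, p_to):
    """Change in log-odds (natural-log units) between two probabilities."""
    return logit(p_to) - logit(p_from)


small_step = evidence_needed(0.50, 0.60)    # modest change near 50%
large_step = evidence_needed(0.98, 0.9999)  # change close to certainty

print(f"50% -> 60%:    {small_step:.3f} log-odds")
print(f"98% -> 99.99%: {large_step:.3f} log-odds")
```

The second step covers a smaller distance in probability (about 2 percentage points versus 10) but needs more than ten times as much evidence in log-odds.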

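Note that even though the logistic function is a monotonic transformation, it can still change the ordering of which features are most important in a model. The ordering can change because some features may matter a great deal for reaching 99.9% probability, while other features typically help reach 60% probability. The simple example below shows how applying a squashing function changes feature importance: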

[3]:
import numpy as np
import pandas as pd
import scipy.special  # explicit submodule import so scipy.special.expit is available

import shap
[4]:
shap.initjs()
[5]:
# build a simple dataset
N = 500
M = 4
X = np.random.randn(N, M)
X[0, 0] = 0
X[0, 1] = 0
X = pd.DataFrame(X, columns=["A", "B", "C", "D"])


# a function (a made up ML model) with an output in "margin" space...
def f(X):
    return (X[:, 0] > 0) * 1 + (X[:, 1] > 1.5) * 100


# ...and then also change its output to probability space
def f_logistic(X):
    return scipy.special.expit(f(X))
[7]:
# explain both functions
explainer = shap.KernelExplainer(f, X)
shap_values_f = explainer.shap_values(X.values[0:2, :])

explainer_logistic = shap.KernelExplainer(f_logistic, X)
shap_values_f_logistic = explainer_logistic.shap_values(X.values[0:2, :])
Using 500 background data samples could cause slower run times. Consider using shap.kmeans(data, K) to summarize the background as K weighted samples.
Using 500 background data samples could cause slower run times. Consider using shap.kmeans(data, K) to summarize the background as K weighted samples.


Margin space explanation

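When we consider the margin space, feature B is very important because its value of 0 means we do not trigger the +100 effect that occurs when B is greater than 1.5. Even though B being greater than 1.5 is rare, it still matters a lot because its effect is so large.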
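One way to see where B's large attribution comes from: in margin space the toy model is a sum of one term per feature, so each feature's SHAP value should be close to its own term at the explained row minus the average of that term over the background data. The sketch below illustrates this under stated assumptions rather than reproducing what `KernelExplainer` does internally. It uses a freshly sampled dataset `X_demo` with an arbitrary seed (not the notebook's `X`), and assumes features are perturbed independently.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for this illustration
X_demo = rng.standard_normal((500, 4))  # stand-in for the notebook's dataset

# The two additive terms of the margin-space model f
term_a = (X_demo[:, 0] > 0) * 1
term_b = (X_demo[:, 1] > 1.5) * 100

# Row being explained: A = 0 and B = 0, so both terms are switched off
x = np.zeros(4)

# For an additive model, attribution = term at x minus the mean term
phi_a = (x[0] > 0) * 1 - term_a.mean()
phi_b = (x[1] > 1.5) * 100 - term_b.mean()

print("phi_A:", phi_a)
print("phi_B:", phi_b)
```

Because only a few percent of rows have B above 1.5 but each of them contributes 100, the mean of B's term is several units, which is far larger than the at most 1 unit that A can contribute.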

[8]:
shap_values_f[0, :]
[8]:
array([-0.506, -6.   ,  0.   ,  0.   ])
[9]:
shap.force_plot(float(explainer.expected_value), shap_values_f[0, :], X.iloc[0, :])

Probability space explanation

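When we consider the probability space, feature B is no longer very important, because the logistic function squashes the +100 effect in margin space to at most +1 (here in fact at most +0.5, since f is never negative and expit(0) = 0.5). So B being greater than 1.5 is now both rare and not very impactful.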
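The squashing can be made concrete by passing the four margin values the toy model can produce (0, 1, 100 and 101) through `scipy.special.expit`. This is a small illustrative sketch; the variable names are not part of the original notebook.

```python
import numpy as np
from scipy.special import expit

# Possible margin outputs: neither term on, only A on, only B on, both on
margins = np.array([0.0, 1.0, 100.0, 101.0])
probs = expit(margins)

effect_a = probs[1] - probs[0]  # turning on A alone (+1 in margin space)
effect_b = probs[2] - probs[0]  # turning on B alone (+100 in margin space)

print("probabilities:", probs)
print("effect of A alone:", effect_a)
print("effect of B alone:", effect_b)
```

In margin space B's effect is 100 times A's, while in probability space it is only about twice as large, and B's condition is triggered far less often.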

[10]:
shap_values_f_logistic[0, :]
[10]:
array([-0.11344976, -0.02653412,  0.        ,  0.        ])
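The change in ranking can also be reproduced with an exact two-feature Shapley calculation, which avoids sampling noise. The sketch below is not how `KernelExplainer` works internally. It assumes A and B are independent, uses idealized frequencies `P_A = 0.5` and `P_B = 0.06` (assumed values chosen for illustration), and ignores C and D because the model does not depend on them. The helper names `margin`, `value` and `shapley_two_features` are hypothetical.

```python
from itertools import product

from scipy.special import expit

P_A = 0.5   # assumed probability that A > 0
P_B = 0.06  # assumed probability that B > 1.5 (illustrative value)


def margin(a_on, b_on):
    """Margin-space model written in terms of its two indicator terms."""
    return 1.0 * a_on + 100.0 * b_on


def value(model, known):
    """Expected output when features in `known` are fixed to the explained row.

    The explained row has A = 0 and B = 0, so both indicators are off.
    Features not in `known` are averaged over their assumed distribution.
    """
    total = 0.0
    for a_on, b_on in product([0, 1], repeat=2):
        # Known features are pinned to the explained row (indicator off)
        if ("A" in known and a_on) or ("B" in known and b_on):
            continue
        w_a = 1.0 if "A" in known else (P_A if a_on else 1.0 - P_A)
        w_b = 1.0 if "B" in known else (P_B if b_on else 1.0 - P_B)
        total += w_a * w_b * model(a_on, b_on)
    return total


def shapley_two_features(model):
    """Exact Shapley values for two features: average over both orderings."""
    v0 = value(model, set())
    v_a = value(model, {"A"})
    v_b = value(model, {"B"})
    v_ab = value(model, {"A", "B"})
    phi_a = 0.5 * ((v_a - v0) + (v_ab - v_b))
    phi_b = 0.5 * ((v_b - v0) + (v_ab - v_a))
    return phi_a, phi_b


phi_margin = shapley_two_features(margin)
phi_prob = shapley_two_features(lambda a, b: expit(margin(a, b)))

print("margin space      (A, B):", phi_margin)
print("probability space (A, B):", phi_prob)
```

Under these assumptions B has the larger attribution in margin space (about -6 versus -0.5), while A has the larger one in probability space (about -0.112 versus -0.027), which is the same flip seen in the sampled estimates.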
[11]:
shap.force_plot(
    float(explainer_logistic.expected_value), shap_values_f_logistic[0, :], X.iloc[0, :]
)