Simple California Demo

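This notebook demonstrates how to build a hierarchical clustering of the input features and use it to explain a single instance. This approach works well for explaining a single instance when the number of input features is large. Given a balanced partition tree, PartitionExplainer runs in \(O(M^2)\) time, where \(M\) is the number of input features. That is far better than the \(O(2^M)\) runtime of KernelExplainer.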
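To make the \(O(2^M)\) figure concrete, the sketch below computes exact Shapley values by brute force for a three-feature toy function. It is not the implementation of KernelExplainer or PartitionExplainer; `exact_shapley`, `toy_model`, and the all-zero baseline are hypothetical names and values chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, baseline):
    """Brute-force Shapley values: enumerates every subset of the other
    features for each feature, so the cost grows exponentially in M."""
    M = len(x)

    def value(subset):
        # Features in `subset` take the instance's value, the rest the baseline's.
        z = [x[i] if i in subset else baseline[i] for i in range(M)]
        return f(z)

    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for k in range(M):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(M - k - 1) / factorial(M)
            for coalition in combinations(others, k):
                S = set(coalition)
                phi[i] += weight * (value(S | {i}) - value(S))
    return phi


def toy_model(z):
    # Hypothetical model: one additive term and one pairwise interaction.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # assumed reference point
phi = exact_shapley(toy_model, x, baseline)
print(phi)
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, and the interaction term is split evenly between the two features that share it. Doubling the number of features squares the number of subsets to enumerate, which is why a polynomial-time alternative matters for wide datasets.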

[1]:

import sys

import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.cluster
from xgboost import XGBRegressor

import shap

seed = 2023
np.random.seed(seed)

Train the model

[2]:

X, y = shap.datasets.california()
model = XGBRegressor(n_estimators=100, subsample=0.3)
model.fit(X, y)

instance = X[0:1]
references = X[1:100]

is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead

Compute a hierarchical clustering of the input features

[3]:

partition_tree = shap.utils.partition_tree(X)
plt.figure(figsize=(15, 6))
sp.cluster.hierarchy.dendrogram(partition_tree, labels=X.columns)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("feature")
plt.ylabel("distance")
plt.show()

../../../_images/example_notebooks_tabular_examples_model_agnostic_Simple_California_Demo_6_0.png
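The array passed to `dendrogram` is a SciPy linkage matrix. As a rough illustration of how such a matrix can be built from feature correlations, here is a minimal sketch on synthetic data. The distance `1 - |correlation|` and the `"average"` linkage method are assumptions made for this example and are not necessarily what `shap.utils.partition_tree` uses internally.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)

# Four synthetic features: columns 0 and 1 share signal `a`,
# columns 2 and 3 share signal `b`.
data = np.column_stack(
    [
        a,
        a + 0.1 * rng.normal(size=n),
        b,
        b + 0.1 * rng.normal(size=n),
    ]
)

# Feature-to-feature distance: strongly correlated features are "close".
corr = np.corrcoef(data, rowvar=False)
dist = 1.0 - np.abs(corr)
np.fill_diagonal(dist, 0.0)

# linkage expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(dist, checks=False)
tree = linkage(condensed, method="average")

# Each row is one merge: [cluster index, cluster index, distance, leaf count].
print(tree)
```

With \(M\) features the matrix has \(M - 1\) rows. Here the two correlated pairs are merged first, and the final row joins them into a single cluster containing all four features.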

Explain the instance

[4]:

# build a masker from partition tree
masker = shap.maskers.Partition(X, clustering=partition_tree)

# build explainer objects
raw_explainer = shap.PartitionExplainer(model.predict, X)
masker_explainer = shap.PartitionExplainer(model.predict, masker)

# compute SHAP values
raw_shap_values = raw_explainer(instance)
masker_shap_values = masker_explainer(instance)

[5]:

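# compare the sizes of the masker and the original data
# note: sys.getsizeof does not follow references, so the masker figure excludes any data it holds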
print(f"X size: {sys.getsizeof(X)/1024:.2f} kB")
print(f"masker size: {sys.getsizeof(masker)} B")

X size: 1290.16 kB
masker size: 56 B
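The two numbers above should be read with care: `sys.getsizeof` reports only an object's own footprint and does not follow the references it holds, so the 56 B figure most likely reflects the masker object itself rather than any data attached to it. The sketch below shows the effect with a hypothetical `Wrapper` class standing in for any object that stores a reference to a large payload.

```python
import sys


class Wrapper:
    """Hypothetical stand-in for an object that keeps a reference to a large payload."""

    def __init__(self, payload):
        self.payload = payload


payload = list(range(100_000))
wrapper = Wrapper(payload)

shallow = sys.getsizeof(wrapper)       # size of the wrapper object only
payload_size = sys.getsizeof(payload)  # size of the list's own storage

print(f"wrapper (shallow): {shallow} B")
print(f"payload list:      {payload_size} B")
```

The wrapper reports a few dozen bytes even though it keeps the whole list alive. A fair memory comparison would need to account for everything the object references.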

Compare with Tree SHAP

[6]:

tree_explainer = shap.TreeExplainer(model, X)
tree_shap_values = tree_explainer(instance)

plt.figure(figsize=(15, 6))
plt.plot(tree_shap_values[0].values, label="Tree SHAP")
plt.plot(masker_shap_values[0].values, "g--", label="Partition SHAP")
plt.plot(raw_shap_values[0].values, "r--", label="Raw SHAP")

plt.legend()
plt.show()

../../../_images/example_notebooks_tabular_examples_model_agnostic_Simple_California_Demo_11_0.png

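The Partition SHAP values computed with the partition tree are a good estimate of the SHAP values. A partition tree is a good way to reduce the effective number of input features and speed up the computation.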
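A plot gives a visual impression of agreement; a number makes it easier to track. The helper below is a minimal sketch of one way to quantify how far an estimate is from a reference attribution vector. `attribution_gap` is a hypothetical name, and the two arrays are made-up placeholder values, not results from this notebook.

```python
import numpy as np


def attribution_gap(reference, estimate):
    """Return the (max, mean) absolute difference between two attribution vectors."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if reference.shape != estimate.shape:
        raise ValueError("attribution vectors must have the same shape")
    diff = np.abs(reference - estimate)
    return float(diff.max()), float(diff.mean())


# Placeholder values for illustration only.
reference = [0.50, -0.20, 0.10]
estimate = [0.40, -0.25, 0.10]

max_gap, mean_gap = attribution_gap(reference, estimate)
print(f"max abs gap:  {max_gap:.3f}")
print(f"mean abs gap: {mean_gap:.3f}")
```

In this notebook the same helper could be called as `attribution_gap(tree_shap_values[0].values, masker_shap_values[0].values)` to compare Partition SHAP against Tree SHAP for the explained instance.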

Plots to explain the instance

[7]:

shap.plots.waterfall(masker_shap_values[0])

../../../_images/example_notebooks_tabular_examples_model_agnostic_Simple_California_Demo_14_0.png