Simple California Demo
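This notebook shows how to build a hierarchical clustering of the input features and use it to explain a single instance. This approach works well when the number of input features is large. Given a balanced partition tree, PartitionExplainer runs in \(O(M^2)\) time, where \(M\) is the number of input features. This scales far better than the \(O(2^M)\) runtime of KernelExplainer.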
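To get a feel for the gap between the two growth rates quoted above, the short sketch below tabulates \(M^2\) and \(2^M\) for a few feature counts. The helper name `growth_terms` is hypothetical, and the numbers are growth terms only, not measured model-evaluation counts for either explainer.

```python
def growth_terms(num_features):
    """Return (M**2, 2**M), the two growth terms, for M input features."""
    return num_features ** 2, 2 ** num_features


# Print both terms side by side for a few feature counts.
for m in (8, 16, 32):
    quadratic, exponential = growth_terms(m)
    print(f"M={m:>2}  M^2={quadratic:>5}  2^M={exponential}")
```

The California housing data used below has 8 features, so the difference is modest here; it becomes decisive as \(M\) grows.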
[1]:
import sys
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.cluster
from xgboost import XGBRegressor
import shap
seed = 2023
np.random.seed(seed)
Train the model
[2]:
X, y = shap.datasets.california()
model = XGBRegressor(n_estimators=100, subsample=0.3)
model.fit(X, y)
instance = X[0:1]
references = X[1:100]
Compute a hierarchical clustering of the input features
[3]:
partition_tree = shap.utils.partition_tree(X)
plt.figure(figsize=(15, 6))
sp.cluster.hierarchy.dendrogram(partition_tree, labels=X.columns)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("feature")
plt.ylabel("distance")
plt.show()
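Since the tree is passed straight to SciPy's `dendrogram`, it is a SciPy linkage matrix. The sketch below builds one by hand on toy data to show what such a tree encodes. It assumes correlation distance and complete linkage; this is an illustration of the idea, not necessarily what `shap.utils.partition_tree` does internally. All variable names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import complete
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)

# Four toy features: columns 0 and 1 track `a`, columns 2 and 3 track `b`.
features = np.column_stack([
    a,
    a + 0.1 * rng.normal(size=n),
    b,
    b + 0.1 * rng.normal(size=n),
])

# Pairwise correlation distance between features (columns), hence the transpose.
distances = pdist(features.T, metric="correlation")

# Complete-linkage clustering; each row records one merge as
# (cluster_i, cluster_j, distance, number_of_leaves).
toy_tree = complete(distances)
print(toy_tree.shape)  # (3, 4): M - 1 merges for M = 4 features
```

The two strongly correlated pairs are merged first, which is the behavior that lets a partition-based explainer treat redundant features as a group.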
Explain the instance
[4]:
# build a masker from partition tree
masker = shap.maskers.Partition(X, clustering=partition_tree)
# build explainer objects
raw_explainer = shap.PartitionExplainer(model.predict, X)
masker_explainer = shap.PartitionExplainer(model.predict, masker)
# compute SHAP values
raw_shap_values = raw_explainer(instance)
masker_shap_values = masker_explainer(instance)
[5]:
# compare the sizes of the masker and the original data
print(f"X size: {sys.getsizeof(X)/1024:.2f} kB")
print(f"masker size: {sys.getsizeof(masker)} B")
X size: 1290.16 kB
masker size: 56 B
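The size comparison above should be read with some care. `sys.getsizeof` reports only the size that an object itself declares; it does not follow references to other objects. Assuming the masker does not override `__sizeof__` (which the 56 B figure suggests, though the notebook does not confirm it), the number says little about the data the masker holds. The sketch below shows the effect with a hypothetical `Wrapper` class.

```python
import sys


class Wrapper:
    """Hypothetical container that only holds a reference to its payload."""

    def __init__(self, payload):
        self.payload = payload


payload = list(range(100_000))
wrapper = Wrapper(payload)

# Shallow sizes: the wrapper looks tiny although it references the whole list.
shallow_wrapper = sys.getsizeof(wrapper)
shallow_payload = sys.getsizeof(payload)
print(f"wrapper: {shallow_wrapper} B, payload: {shallow_payload} B")
```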
Compare to Tree SHAP
[6]:
tree_explainer = shap.TreeExplainer(model, X)
tree_shap_values = tree_explainer(instance)
plt.figure(figsize=(15, 6))
plt.plot(tree_shap_values[0].values, label="Tree SHAP")
plt.plot(masker_shap_values[0].values, "g--", label="Partition SHAP")
plt.plot(raw_shap_values[0].values, "r--", label="Raw SHAP")
plt.legend()
plt.show()
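Partition SHAP values computed with a partition tree are a good estimate of the SHAP values. The partition tree is a good way to reduce the number of input features and speed up the computation.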
Plots to explain the instance
[7]:
shap.plots.waterfall(masker_shap_values[0])
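A waterfall plot reads as a running sum: it starts at the base value, adds each feature's SHAP value, and ends at the model output for the instance. The sketch below checks that additivity property on a tiny example by computing exact Shapley values through brute-force enumeration of feature subsets. The model `toy_model`, the helper `exact_shapley`, and the use of a single all-zeros baseline (instead of a background dataset) are assumptions made for illustration; this is not how the SHAP explainers above are implemented.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    """Hypothetical model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + x[0] * x[1] - x[2]


def exact_shapley(model, instance, baseline):
    """Exact Shapley values by enumerating every subset of the other features."""
    m = len(instance)

    def value(subset):
        # Features in `subset` take the instance value, the rest the baseline.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(m)]
        return model(mixed)

    phi = []
    for i in range(m):
        others = [j for j in range(m) if j != i]
        total = 0.0
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = exact_shapley(toy_model, instance, baseline)
base_value = toy_model(baseline)

# Additivity: base value plus all attributions recovers the prediction.
print(phi, base_value + sum(phi), toy_model(instance))
```

The interaction term `x[0] * x[1]` is split evenly between features 0 and 1. The nested loops visit every subset of the remaining features, which is the exponential cost that partition-based and tree-based explainers are designed to avoid.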