Explain Python Machine Learning Models with SHAP Library


  • 11 September 2021
  • Muhammad Fawi
  • Machine Learning

Using the SHapley Additive exPlanations (SHAP) Library to Explain Python ML Models

Almost always after developing an ML model, we find ourselves needing to explain it. Even when the model is very good, it is still a black box that needs to be deciphered. Explaining a model is an important step in a data science project that we usually overlook. The SHAP library makes explaining Python machine learning models, even deep learning ones, easy, with intuitive visualizations. It also shows feature importances and how each feature affects the model output.

Here we are going to explore some of SHAP’s power in explaining a Logistic Regression model.

We will use the Bank Marketing dataset[1] to predict whether a customer will subscribe to a term deposit.

Data Exploration

We will start by importing all necessary libraries and reading the data. We will use the smaller dataset in the bank-additional zip file.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import zipfile

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_recall_curve
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, auc, roc_curve

zf = zipfile.ZipFile("bank-additional.zip")
df = pd.read_csv(zf.open("bank-additional/bank-additional.csv"), sep = ";")

df.shape
# (4119, 21)

Let’s look closely at the data and its structure. We will not go in depth into exploratory data analysis, but we will see what the data looks like and compute some summary and descriptive stats.

df.isnull().sum().sum() # no NAs
# 0

## looking at numeric variables summary stats
df.describe()

Let’s have a quick look at how the object variables are distributed between the two classes, yes and no.

## counts
df.groupby("y").size()
# y
# no     3668
# yes     451
# dtype: int64

num_cols = list(df.select_dtypes(np.number).columns)
print(num_cols)
# ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']

obj_cols = list(df.select_dtypes(object).drop("y", axis = 1).columns)
print(obj_cols)
# ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']

df[obj_cols + ["y"]].groupby("y").agg(["nunique"])
#          job  marital  education  default  housing  loan  contact  month  day_of_week  poutcome
#      nunique  nunique    nunique  nunique  nunique  nunique  nunique  nunique  nunique  nunique
# y
# no        12        4          8        3        3     3        2     10            5         3
# yes       12        4          7        2        3     3        2     10            5         3

The categorical variables seem to be similarly distributed between the two classes.
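Unique counts alone don’t show proportions, though. As a quick optional check (a minimal sketch in plain pandas), we can compare how a single categorical variable, e.g. job, distributes within each class:

## share of each job category within each class
df.groupby("y")["job"].value_counts(normalize = True)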

This is admittedly a quick and shallow analysis, but EDA is out of the scope of this post.

Feature Preprocessing

Now it is time to prepare the features for the LR model: scale the numeric variables and one-hot encode the categorical ones. We will use ColumnTransformer to apply different preprocessors to different columns and wrap everything in a pipeline.

## change classes to float
df["y"] = np.where(df["y"] == "yes", 1., 0.)

## the pipeline
scaler = Pipeline(steps = [
    ## there are no NAs anyways
    ("imputer", SimpleImputer(strategy = "median")),
    ("scaler", StandardScaler())
])

encoder = Pipeline(steps = [
    ("imputer", SimpleImputer(strategy = "constant", fill_value = "missing")),
    ("onehot", OneHotEncoder(handle_unknown = "ignore")),
])

preprocessor = ColumnTransformer(
    transformers = [
        ("num", scaler, num_cols),
        ("cat", encoder, obj_cols)
    ])

pipe = Pipeline(steps = [("preprocessor", preprocessor)])

Split the data into train and test sets, fit the pipeline on the train data, and transform both train and test.

X_train, X_test, y_train, y_test = train_test_split(
    df.drop("y", axis = 1), df.y,
    stratify = df.y,
    random_state = 13,
    test_size = 0.25)

X_train = pipe.fit_transform(X_train)
X_test = pipe.transform(X_test)

Back to the exploratory phase for a moment. A good way to visualize one-hot encoded data, sparse matrices of 1s and 0s, is imshow(). We will look at the last contact month column, which is now converted into several columns with a 1 in the month when the contact happened. The plot will also be split between yes and no.

First let’s get the new feature names from the pipeline.

## getting feature names from the pipeline
nums = pipe["preprocessor"].transformers_[0][2]
obj = list(pipe["preprocessor"].transformers_[1][1]["onehot"].get_feature_names(obj_cols))
fnames = nums + obj

len(fnames) ## new number of columns due to one hot encoder
# 62
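A note on versions: in newer scikit-learn releases (1.0+), get_feature_names was replaced on OneHotEncoder. If the call above raises an AttributeError, the equivalent (assuming a newer scikit-learn) is:

obj = list(pipe["preprocessor"].transformers_[1][1]["onehot"].get_feature_names_out(obj_cols))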

Let’s now visualize!

from matplotlib.colors import ListedColormap

print([i for i in obj if "month" in i])
# ['month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep']

## filter the train data on the month columns
tr = X_train[:, [True if "month" in i else False for i in fnames]]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (15, 7))
fig.suptitle("Subscription per Contact Month", fontsize = 20)
cmapmine1 = ListedColormap(["w", "r"], N = 2)
cmapmine2 = ListedColormap(["w", "b"], N = 2)
ax1.imshow(tr[y_train == 0.0], cmap = cmapmine1, interpolation = "none", extent = [3, 6, 9, 12])
ax1.set_title("Not Subscribed")
ax2.imshow(tr[y_train == 1.0], cmap = cmapmine2, interpolation = "none", extent = [3, 6, 9, 12])
ax2.set_title("Subscribed")
plt.show()

Of course, we should sort the columns in calendar month order and add labels to make the plot more readable, as sketched below. But this is just a quick way to visualize sparse matrices of 1s and 0s.
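A minimal sketch of that reordering (month_order is an assumption based on the month abbreviations this dataset uses; there are no jan/feb contacts):

## calendar order of the months present in the data
month_order = ["mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
month_cols = [i for i in fnames if "month" in i]
## indices that rearrange the month columns into calendar order
order = np.argsort([month_order.index(c.replace("month_", "")) for c in month_cols])
tr_sorted = tr[:, order] ## columns now run mar..dec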

Model Development

Now it is time to develop the model and fit it.

clf = LogisticRegression(
    solver = "newton-cg", max_iter = 50, C = .1, penalty = "l2"
)
clf.fit(X_train, y_train)
# LogisticRegression(C=0.1, max_iter=50, solver='newton-cg')

Now we will look at the model’s AUC and then set the threshold used to predict on the test data.

y_pred_proba = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, ls = "--", label = "LR AUC = %0.2f" % roc_auc)
plt.plot([0, 1], [0, 1], c = "r", label = "No Skill AUC = 0.5")
plt.legend(loc = "lower right")
plt.ylabel("true positive rate")
plt.xlabel("false positive rate")
plt.show()

The model shows a very good AUC. Let’s now choose the threshold that gives the best trade-off between recall and precision.

precision, recall, threshold = precision_recall_curve(
    y_test, y_pred_proba)

tst_prt = pd.DataFrame({
    "threshold": threshold,
    "recall": recall[1:],
    "precision": precision[1:]
})

tst_prt_melted = pd.melt(tst_prt, id_vars = ["threshold"],
                         value_vars = ["recall", "precision"])

sns.lineplot(x = "threshold", y = "value",
             hue = "variable", data = tst_prt_melted)

We can see that 0.3 looks like a very good threshold. Let’s try it on the test data.

y_pred = np.zeros(len(y_test))
y_pred[y_pred_proba >= 0.3] = 1.

print("Accuracy: %.2f%%" % (100 * accuracy_score(y_test, y_pred)))
print("Precision: %.2f%%" % (100 * precision_score(y_test, y_pred)))
print("Recall: %.2f%%" % (100 * recall_score(y_test, y_pred)))
# Accuracy: 91.65%
# Precision: 61.54%
# Recall: 63.72%
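We already imported confusion_matrix, so we can also inspect the errors behind these numbers directly:

## rows are actual classes (no, yes), columns are predicted ones
print(confusion_matrix(y_test, y_pred))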

Great! The model is performing well. It could probably be enhanced, but for now let’s try to explain how it behaves with SHAP.

Model Explanation and Feature Importance

Introducing SHAP

From SHAP’s documentation: "SHAP (SHapley Additive exPlanations) is a game theoretic approach to explain the output of any machine learning model. It connects optimal credit allocation with local explanations using the classic Shapley values from game theory and their related extensions."

In brief, setting aside the underlying math, this is how it works. When we pass a model and a training dataset, a base value is calculated: the average model output over the training dataset. Shap values are then calculated for each feature of each example. Each feature, through its shap values, pushes the model output away from that base value. In a binary classification model, features that push the model output above the base value contribute to the positive class, while features contributing to the negative class push the output below the base value.
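In other words, the explanation for a single example x is additive. With phi_i denoting the shap value of feature i and f the model output (the log-odds in our logistic regression case), it reads:

f(x) = base_value + phi_1 + phi_2 + ... + phi_M

where base_value is the average model output over the training data and M is the number of features.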

Let’s see what this looks like. First we define our explainer and calculate the shap values.

explainer = shap.Explainer(clf, X_train, feature_names = np.array(fnames))
shap_values = explainer(X_test)
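As a quick sanity check (a minimal sketch; for this linear model the explainer works on the log-odds scale), the base value plus the sum of an example’s shap values should reproduce the model’s decision function:

ind = 0
print(shap_values[ind].base_values + shap_values[ind].values.sum())
print(clf.decision_function(X_test[ind:ind+1])[0])
## the two numbers should match up to numerical precision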

Now let’s visualize how this works in an example.

Individual Visualization

## we init JS once in our session
shap.initjs()

ind = np.argmax(y_test == 0)
print("actual is:", y_test.values[ind], "while pred is:", y_pred[ind])
shap.plots.force(shap_values[ind])
# actual is: 0.0 while pred is: 0.0

We can see how the (scaled) observed values of duration, number of employees, 3-month euribor, and contact via telephone = 1 push the model output below the base value (-3.03), resulting in a negative example. Meanwhile, last contact in June rather than May and a scaled consumer price index of 1.53 pushed to the right but couldn’t beat the blue force.

We can also look at the same example with a waterfall plot, which represents the cumulative sum of how the shap values add up to move the model output away from the base value.

shap.plots.waterfall(shap_values[ind])

We can see the collision between the features pushing left and right until we get the output. The numbers on the left side are the actual observed values in the data, while the numbers inside the graph are the shap values of each feature for this example.
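The same numbers drawn on the waterfall can be pulled out of the Explanation object directly, e.g. to list the strongest contributions for this example:

print("base value:", shap_values[ind].base_values)
## pair each feature with its shap value and sort by magnitude
contrib = sorted(zip(fnames, shap_values[ind].values), key = lambda t: abs(t[1]), reverse = True)
print(contrib[:5]) ## the five strongest contributions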

Let’s look at a positive example using the same two graphs.

ind = np.argmax(y_test == 1)
print("actual is:", y_test.values[ind], "while pred is:", y_pred[ind])
shap.plots.force(shap_values[ind])
# actual is: 1.0 while pred is: 1.0

shap.plots.waterfall(shap_values[ind])

It is now obvious how the values contribute to the positive class. We can see from the two examples that a high duration contributes to the positive class while a low duration contributes to the negative one. The opposite holds for number of employees: high nr_employed contributes to the negative class and low nr_employed contributes to the positive one.
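To see one of these relationships over the whole test set rather than two hand-picked examples, the same plots API offers a scatter (dependence) plot, e.g. for duration:

## duration's shap values against its (scaled) observed values
shap.plots.scatter(shap_values[:, "duration"])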

Collective Visualization

We saw how the force plot shows how features explain the model output. However, it shows only one observation. We will now look at the same force plot, but for multiple observations at the same time.

shap.force_plot(explainer.expected_value, shap_values.values, X_test, feature_names = fnames)

This plot (interactive in the notebook) is the same as the individual force plot. Just imagine multiple force plots rotated 90 degrees and stacked side by side, one per example. A heatmap can also be used to view the effect of each feature on each example.

shap.plots.heatmap(shap_values)

The heatmap shows the shap value of each feature for each example in the data. Above the map, the model output per example is shown as a small line plot going above and below the base line.

Another very useful graph is the beeswarm. It gives an overview of which features are most important for the model. Like the heatmap, it plots the shap values of every feature for every sample, and it sorts the features by the sum of their shap value magnitudes over all examples.

shap.plots.beeswarm(shap_values)

We can see that duration is the most important variable: a high duration increases the probability of the positive class, subscription in our example, while a high number of employees decreases the probability of subscription.

We can also get the mean of the absolute shap values for each feature and plot a bar chart.

shap.plots.bar(shap_values)
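The bar chart is simply the mean absolute shap value per feature; as a sketch, the same ranking can be computed by hand:

mean_abs = np.abs(shap_values.values).mean(axis = 0)
imp = pd.Series(mean_abs, index = fnames).sort_values(ascending = False)
print(imp.head()) ## should match the top bars of the plot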

Fantastic! We have seen how SHAP can help explain our logistic regression model with very useful visualizations. The library can explain many kinds of models, including neural networks, and the project’s GitHub repo has many notebook examples.
