Step 100+22, Learning with ChatGPT: Probability Calibration with Platt Scaling

Demonstrated with Python 3.9

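I. Preface

I recently read a machine learning classification paper published in the Lancet-family journal eClinicalMedicine: "Development of a novel dementia risk prediction model in the general population: A large, longitudinal, population-based machine-learning study".

From it I picked up a neat trick called "probability calibration", so I took the chance to study it systematically with GPT.

The technique used in the paper is isotonic regression.

To go beyond that one example, I also asked GPT which other methods can perform probability calibration. It listed quite a few, so let's work through them one at a time.

First up is a method called Platt Scaling.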

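II. Platt Scaling

Platt Scaling is a post-processing method that converts the output scores of a machine learning classifier (especially a binary classifier) into probabilities. It fits a simple logistic regression model that maps the classifier's output scores into the interval [0, 1], yielding probability estimates. The method was proposed by John Platt in 1999, originally to obtain probability estimates from support vector machines (SVMs).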
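
As a minimal sketch of the mapping just described: Platt's method passes the raw score through a sigmoid with two parameters, P(y=1 | s) = 1 / (1 + exp(A*s + B)). The function name `platt_probability` and the values of A and B below are illustrative assumptions, not taken from the paper; in practice A and B are learned from data.

```python
import math

def platt_probability(score, a, b):
    """Map a raw classifier score to a probability: 1 / (1 + exp(a * score + b))."""
    return 1.0 / (1.0 + math.exp(a * score + b))

# Made-up parameters for illustration only.
# A is negative so that larger scores give larger probabilities.
A, B = -2.0, 0.5

for s in (-3.0, 0.0, 3.0):
    print(f"score={s:+.1f} -> p={platt_probability(s, A, B):.3f}")
```

Whatever the score, the output stays strictly between 0 and 1, and with a negative A it increases with the score.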

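(1) Basic steps of Platt Scaling

1) Train the classifier: first, train a classifier (for example an SVM, a neural network, or a decision tree) and obtain its output scores (unnormalized scores or logits).

2) Fit a logistic regression model: use the classifier's output scores as the input of a logistic regression model and the true labels as its target. Train the logistic regression model to find the best parameters, usually by minimizing the log loss. Ideally the scores used here come from data the classifier was not trained on, so that the calibration is not fitted to overconfident scores.

3) Generate probability estimates: use the trained logistic regression model to convert new classifier output scores into probability estimates.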
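
The three steps can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the scores and labels are made-up toy values standing in for the output of step 1, and `fit_platt` is a hypothetical helper that fits p = sigmoid(w*s + c) by plain gradient descent on the log loss. That form is equivalent to Platt's 1 / (1 + exp(A*s + B)) with A = -w and B = -c.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_loss(scores, labels, w, c):
    """Mean negative log-likelihood of the labels under p = sigmoid(w * s + c)."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(w * s + c), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit w and c by gradient descent on the mean log loss."""
    w, c = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_w = grad_c = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(w * s + c) - y   # derivative of the log loss w.r.t. the logit
            grad_w += err * s
            grad_c += err
        w -= lr * grad_w / n
        c -= lr * grad_c / n
    return w, c

# Step 1 (assumed already done): toy classifier scores with their true labels.
# The classes overlap on purpose so that the fitted parameters stay finite.
scores = [-2.5, -1.8, -1.0, -0.4, 0.1, 0.3, 0.9, 1.5, 2.2, 3.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Step 2: fit the one-feature logistic regression.
w, c = fit_platt(scores, labels)

# Step 3: convert a new score into a probability.
new_score = 0.5
p_new = sigmoid(w * new_score + c)
print(f"w={w:.3f}, c={c:.3f}, P(y=1 | score={new_score}) = {p_new:.3f}")
```

In real use the fitting would normally be left to a library optimizer; the loop is spelled out here only to show what "minimizing the log loss" means.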

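(2) What Platt Scaling does for classification

1) Probability calibration: Platt Scaling calibrates the classifier's output scores into probabilities, which makes the output easier to interpret. For example, an output of 0.7 for a sample can be read as a 70% probability that the sample belongs to the positive class.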

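2) Better decisions: once scores are converted into probabilities, it becomes easier to choose a decision threshold, which can improve the classifier's performance. Because the sigmoid mapping is monotonic, it does not change how a single model ranks the samples.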

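3) Works with many classifiers: although Platt Scaling was originally designed for SVMs, it can be applied to a wide range of classifiers, such as neural networks and decision trees.

4) Combining the probability outputs of different models: in ensemble learning, Platt Scaling can convert the output scores of different models into probabilities, which can then be combined by weighted averaging to improve the ensemble's performance.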
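
To make the distinction between ranking and probability scale concrete, here is a small sketch on made-up scores. The helper names `auc` and `brier` and the two sigmoid slopes are assumptions chosen for illustration. It shows that any increasing sigmoid map leaves the AUC of a single model unchanged, while a probability-quality measure such as the Brier score does depend on which map is used.

```python
import math

def auc(labels, values):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def brier(labels, probs):
    """Mean squared difference between the predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy scores and labels, made up for this example
scores = [-2.5, -1.8, -1.0, -0.4, 0.1, 0.3, 0.9, 1.5, 2.2, 3.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

# Two different increasing sigmoid maps of the same scores
probs_a = [sigmoid(s) for s in scores]
probs_b = [sigmoid(0.3 * s) for s in scores]

print("AUC   raw / map a / map b:",
      auc(labels, scores), auc(labels, probs_a), auc(labels, probs_b))
print("Brier map a / map b:",
      round(brier(labels, probs_a), 4), round(brier(labels, probs_b), 4))
```

So when a calibrated model shows a different AUC from the uncalibrated one, the difference comes from something other than the sigmoid itself, for example from refitting the underlying model on cross-validation folds.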

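III. Implementing Platt Scaling in code

Below I test on a mildly imbalanced dataset that I put together, with a 1:3 class ratio. The control is an SVM model without calibration and the experimental arm is the same SVM with calibration added. How much does performance improve?

(1) SVM model without calibration (default parameters)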

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Load the data
dataset = pd.read_csv('8PSMjianmo.csv')
X = dataset.iloc[:, 1:20].values
Y = dataset.iloc[:, 0].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=666)

# Standardize the features (fit on the training set only)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit the SVM classifier
classifier = SVC(kernel='linear', probability=True)
classifier.fit(X_train, y_train)

# Predictions: class labels and raw decision scores
y_pred = classifier.predict(X_test)
y_testprba = classifier.decision_function(X_test)

y_trainpred = classifier.predict(X_train)
y_trainprba = classifier.decision_function(X_train)

# Confusion matrices
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)

# Plot the test-set confusion matrix
classes = list(set(y_test))
classes.sort()
plt.imshow(cm_test, cmap=plt.cm.Blues)
indices = range(len(cm_test))
plt.xticks(indices, classes)
plt.yticks(indices, classes)
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('Actual')
for first_index in range(len(cm_test)):
    for second_index in range(len(cm_test[first_index])):
        # plt.text takes (x, y): x is the column (predicted), y is the row (actual)
        plt.text(second_index, first_index, cm_test[first_index][second_index])

plt.show()

# Plot the training-set confusion matrix
classes = list(set(y_train))
classes.sort()
plt.imshow(cm_train, cmap=plt.cm.Blues)
indices = range(len(cm_train))
plt.xticks(indices, classes)
plt.yticks(indices, classes)
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('Actual')
for first_index in range(len(cm_train)):
    for second_index in range(len(cm_train[first_index])):
        # plt.text takes (x, y): x is the column (predicted), y is the row (actual)
        plt.text(second_index, first_index, cm_train[first_index][second_index])

plt.show()

# Compute the performance metrics from a 2x2 confusion matrix
def calculate_metrics(cm, y_true, y_pred_prob):
    a = cm[0, 0]
    b = cm[0, 1]
    c = cm[1, 0]
    d = cm[1, 1]
    acc = (a + d) / (a + b + c + d)
    error_rate = 1 - acc
    sen = d / (d + c)
    sep = a / (a + b)
    precision = d / (b + d)
    F1 = (2 * precision * sen) / (precision + sen)
    MCC = (d * a - b * c) / (np.sqrt((d + b) * (d + c) * (a + b) * (a + c)))
    auc_score = roc_auc_score(y_true, y_pred_prob)
    
    metrics = {
        "Accuracy": acc,
        "Error Rate": error_rate,
        "Sensitivity": sen,
        "Specificity": sep,
        "Precision": precision,
        "F1 Score": F1,
        "MCC": MCC,
        "AUC": auc_score
    }
    return metrics

metrics_test = calculate_metrics(cm_test, y_test, y_testprba)
metrics_train = calculate_metrics(cm_train, y_train, y_trainprba)

print("Performance Metrics (Test):")
for key, value in metrics_test.items():
    print(f"{key}: {value:.4f}")

print("\nPerformance Metrics (Train):")
for key, value in metrics_train.items():
    print(f"{key}: {value:.4f}")

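Output:

Keep these numbers in mind.

With these parameters, the SVM does not even perform as well as LR.

(2) SVM model with calibration (default parameters)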

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.calibration import CalibratedClassifierCV

# Load the data
dataset = pd.read_csv('8PSMjianmo.csv')
X = dataset.iloc[:, 1:20].values
Y = dataset.iloc[:, 0].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=666)

# Standardize the features (fit on the training set only)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fit the SVM classifier with Platt Scaling (method='sigmoid')
svm = SVC(kernel='linear', probability=True)
calibrated_svm = CalibratedClassifierCV(svm, method='sigmoid')
calibrated_svm.fit(X_train, y_train)

# Predictions: class labels and calibrated probabilities of the positive class
y_pred = calibrated_svm.predict(X_test)
y_testprba = calibrated_svm.predict_proba(X_test)[:, 1]

y_trainpred = calibrated_svm.predict(X_train)
y_trainprba = calibrated_svm.predict_proba(X_train)[:, 1]

# Confusion matrices
cm_test = confusion_matrix(y_test, y_pred)
cm_train = confusion_matrix(y_train, y_trainpred)
print(cm_train)
print(cm_test)

# Plot the test-set confusion matrix
classes = list(set(y_test))
classes.sort()
plt.imshow(cm_test, cmap=plt.cm.Blues)
indices = range(len(cm_test))
plt.xticks(indices, classes)
plt.yticks(indices, classes)
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('Actual')
for first_index in range(len(cm_test)):
    for second_index in range(len(cm_test[first_index])):
        # plt.text takes (x, y): x is the column (predicted), y is the row (actual)
        plt.text(second_index, first_index, cm_test[first_index][second_index])

plt.show()

# Plot the training-set confusion matrix
classes = list(set(y_train))
classes.sort()
plt.imshow(cm_train, cmap=plt.cm.Blues)
indices = range(len(cm_train))
plt.xticks(indices, classes)
plt.yticks(indices, classes)
plt.colorbar()
plt.xlabel('Predicted')
plt.ylabel('Actual')
for first_index in range(len(cm_train)):
    for second_index in range(len(cm_train[first_index])):
        # plt.text takes (x, y): x is the column (predicted), y is the row (actual)
        plt.text(second_index, first_index, cm_train[first_index][second_index])

plt.show()

# Compute the performance metrics from a 2x2 confusion matrix
def calculate_metrics(cm, y_true, y_pred_prob):
    a = cm[0, 0]
    b = cm[0, 1]
    c = cm[1, 0]
    d = cm[1, 1]
    acc = (a + d) / (a + b + c + d)
    error_rate = 1 - acc
    sen = d / (d + c)
    sep = a / (a + b)
    precision = d / (b + d)
    F1 = (2 * precision * sen) / (precision + sen)
    MCC = (d * a - b * c) / (np.sqrt((d + b) * (d + c) * (a + b) * (a + c)))
    auc_score = roc_auc_score(y_true, y_pred_prob)
    
    metrics = {
        "Accuracy": acc,
        "Error Rate": error_rate,
        "Sensitivity": sen,
        "Specificity": sep,
        "Precision": precision,
        "F1 Score": F1,
        "MCC": MCC,
        "AUC": auc_score
    }
    return metrics

metrics_test = calculate_metrics(cm_test, y_test, y_testprba)
metrics_train = calculate_metrics(cm_train, y_train, y_trainprba)

print("Performance Metrics (Test):")
for key, value in metrics_test.items():
    print(f"{key}: {value:.4f}")

print("\nPerformance Metrics (Train):")
for key, value in metrics_train.items():
    print(f"{key}: {value:.4f}")

Let's look at the results:

Overall, it worked. The AUC improved slightly on both the training set and the test set, but not by much.
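
AUC only measures ranking, so it says little about whether the probabilities themselves are well calibrated. One common way to check that directly is a reliability table: bin the predicted probabilities and compare each bin's mean prediction with the observed fraction of positives. The sketch below uses made-up probabilities and labels, and `reliability_table` is a hypothetical helper written for this example; scikit-learn offers `sklearn.calibration.calibration_curve` and `sklearn.metrics.brier_score_loss` for the same purpose.

```python
def reliability_table(labels, probs, n_bins=5):
    """Group predictions into equal-width bins and compare the mean predicted
    probability with the observed positive fraction in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls into the last bin
        bins[idx].append((y, p))
    rows = []
    for i, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for _, p in items) / len(items)
        frac_pos = sum(y for y, _ in items) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_pred, frac_pos))
    return rows

# Made-up predicted probabilities and true labels, for illustration only
probs = [0.05, 0.10, 0.15, 0.30, 0.35, 0.55, 0.62, 0.65, 0.85, 0.90, 0.95, 0.97]
labels = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]

rows = reliability_table(labels, probs)
for lo, hi, n, mean_pred, frac_pos in rows:
    print(f"[{lo:.1f}, {hi:.1f})  n={n}  mean predicted={mean_pred:.3f}  observed={frac_pos:.3f}")
```

For a well-calibrated model the last two columns are close in every bin. Running this on `y_testprba` before and after calibration would show whether Platt Scaling actually moved the probabilities toward the observed frequencies.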

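IV. A different strategy

This follows the strategy in that paper: build and evaluate the model with five-fold cross-validation, using four folds for training and one for evaluation. Within the training set, three folds are used to build the SVM model and the remaining fold is used for Platt Scaling probability calibration, and hyperparameters are tuned by cross-validation inside the training set.

Code: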

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import confusion_matrix, roc_auc_score, make_scorer

# Load the data
dataset = pd.read_csv('8PSMjianmo.csv')
X = dataset.iloc[:, 1:20].values
Y = dataset.iloc[:, 0].values

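# Standardize the features
# Note: the scaler is fitted on the full dataset before the folds are split,
# so test-fold statistics leak into training; fitting it inside each fold avoids this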
sc = StandardScaler()
X = sc.fit_transform(X)

# Five-fold cross-validation (outer loop)
kf = KFold(n_splits=5, shuffle=True, random_state=666)

# Parameter grid for hyperparameter tuning
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf']
}

# Compute the performance metrics from a 2x2 confusion matrix
def calculate_metrics(cm, y_true, y_pred_prob):
    a = cm[0, 0]
    b = cm[0, 1]
    c = cm[1, 0]
    d = cm[1, 1]
    acc = (a + d) / (a + b + c + d)
    error_rate = 1 - acc
    sen = d / (d + c)
    sep = a / (a + b)
    precision = d / (b + d)
    F1 = (2 * precision * sen) / (precision + sen)
    MCC = (d * a - b * c) / (np.sqrt((d + b) * (d + c) * (a + b) * (a + c)))
    auc_score = roc_auc_score(y_true, y_pred_prob)
    
    metrics = {
        "Accuracy": acc,
        "Error Rate": error_rate,
        "Sensitivity": sen,
        "Specificity": sep,
        "Precision": precision,
        "F1 Score": F1,
        "MCC": MCC,
        "AUC": auc_score
    }
    return metrics

# Initialize the result lists
results_train = []
results_test = []

# Initialize variables that track the best model
best_auc = 0
best_model = None
best_X_train = None
best_X_test = None
best_y_train = None
best_y_test = None

# Cross-validation loop
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

    # Inner cross-validation for hyperparameter tuning and model training
    inner_kf = KFold(n_splits=4, shuffle=True, random_state=666)
    grid_search = GridSearchCV(SVC(probability=True), param_grid, cv=inner_kf, scoring='roc_auc')
    grid_search.fit(X_train, y_train)
    model = grid_search.best_estimator_

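    # Platt Scaling probability calibration on the already-fitted model
    # Note: the calibrator is fitted on the same X_train the SVM was trained on,
    # not on a separate held-out fold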
    calibrated_svm = CalibratedClassifierCV(model, method='sigmoid', cv='prefit')
    calibrated_svm.fit(X_train, y_train)

    # Evaluate the model
    y_trainpred = calibrated_svm.predict(X_train)
    y_trainprba = calibrated_svm.predict_proba(X_train)[:, 1]
    cm_train = confusion_matrix(y_train, y_trainpred)
    metrics_train = calculate_metrics(cm_train, y_train, y_trainprba)
    results_train.append(metrics_train)
    
    y_pred = calibrated_svm.predict(X_test)
    y_testprba = calibrated_svm.predict_proba(X_test)[:, 1]
    cm_test = confusion_matrix(y_test, y_pred)
    metrics_test = calculate_metrics(cm_test, y_test, y_testprba)
    results_test.append(metrics_test)
    
    # Update the best model
    # Note: selecting by test-fold AUC makes the best-fold metrics optimistic
    if metrics_test['AUC'] > best_auc:
        best_auc = metrics_test['AUC']
        best_model = calibrated_svm
        best_X_train = X_train
        best_X_test = X_test
        best_y_train = y_train
        best_y_test = y_test
        best_params = grid_search.best_params_

    print("Performance Metrics (Train):")
    for key, value in metrics_train.items():
        print(f"{key}: {value:.4f}")
    
    print("\nPerformance Metrics (Test):")
    for key, value in metrics_test.items():
        print(f"{key}: {value:.4f}")
    print("\n" + "="*40 + "\n")

# Evaluate performance with the best model
y_trainpred = best_model.predict(best_X_train)
y_trainprba = best_model.predict_proba(best_X_train)[:, 1]
cm_train = confusion_matrix(best_y_train, y_trainpred)
metrics_train = calculate_metrics(cm_train, best_y_train, y_trainprba)

y_pred = best_model.predict(best_X_test)
y_testprba = best_model.predict_proba(best_X_test)[:, 1]
cm_test = confusion_matrix(best_y_test, y_pred)
metrics_test = calculate_metrics(cm_test, best_y_test, y_testprba)

print("Performance Metrics of the Best Model (Train):")
for key, value in metrics_train.items():
    print(f"{key}: {value:.4f}")

print("\nPerformance Metrics of the Best Model (Test):")
for key, value in metrics_test.items():
    print(f"{key}: {value:.4f}")

# Print the parameters of the best model
print("\nBest Model Parameters:")
for key, value in best_params.items():
    print(f"{key}: {value}")

That is a sizable improvement, though it may be due to the hyperparameter search.

The best SVM parameters are:

Best Model Parameters:

C: 0.1

kernel: rbf

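When you have time, try these parameters yourself and see how an uncalibrated SVM performs.

Not that it matters much, though; a good result is what counts.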
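
The strategy described at the start of this section reserves part of the training folds for calibration only, whereas the listing above fits the calibrator on the same X_train as the SVM. Below is a minimal sketch of the held-out variant, under several assumptions: synthetic data from `make_classification` stands in for the private dataset, the grid search is skipped and the SVM simply uses the C=0.1, rbf setting reported above, and Platt Scaling is written out as a one-feature `LogisticRegression` on the SVM's decision scores rather than through `CalibratedClassifierCV`. It is not the paper's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the private dataset: 19 features, roughly 1:3 class ratio
X, y = make_classification(n_samples=600, n_features=19, n_informative=6,
                           weights=[0.75, 0.25], random_state=666)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=666)
fold_aucs = []

for train_idx, test_idx in outer.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Of the four training folds, set about one fold's worth aside for calibration only
    X_fit, X_cal, y_fit, y_cal = train_test_split(
        X_train, y_train, test_size=0.25, stratify=y_train, random_state=666)

    # The scaler sits inside the pipeline, so it is fitted on X_fit only
    svm = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=0.1))
    svm.fit(X_fit, y_fit)

    # Platt Scaling: one-feature logistic regression on held-out decision scores
    platt = LogisticRegression(C=1e6)   # large C approximates no regularization
    platt.fit(svm.decision_function(X_cal).reshape(-1, 1), y_cal)

    # Evaluate the calibrated probabilities on the untouched test fold
    test_scores = svm.decision_function(X_test).reshape(-1, 1)
    test_proba = platt.predict_proba(test_scores)[:, 1]
    fold_aucs.append(roc_auc_score(y_test, test_proba))

print("Test AUC per fold:", np.round(fold_aucs, 4))
print("Mean test AUC:", round(float(np.mean(fold_aucs)), 4))
```

Reporting the mean across folds, rather than the single best fold, gives a less optimistic picture of how the calibrated model is likely to behave on new data.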

V. Finally

Feel free to try this on other datasets or with other machine learning classifiers and see how it works.

I won't be sharing the data.