对一篇影像组学的的论文(《Development and validation of an MRI-based radiomics nomogram for distinguishing Warthin’s tumour from pleomorphic adenomas of the parotid gland》)中方法进行复现。完整地跑通影像组学全流程,对临床+影像组学特征进行建模并绘制Lasso回归图和ROC曲线、诺模图、校准曲线、决策曲线等。
需要完整代码到这里https://aistudio.baidu.com/aistudio/projectdetail/3270880 fork项目之后下载glioma.7z压缩包,然后用RStudio的RMarkdown进行运行
1.项目介绍
2012年荷兰学者提出影像组学(radiomics)概念:借助计算机,从医学影像中挖掘海量定量影像特征,使用统计学/机器学习方法,筛选出最有价值的影像学特征,用来解析临床信息。
可以说影像组学是人工智能在医学领域的一种特定研究方式。
做影像组学一般会经过以下步骤1.收集数据,2.标注数据,3.特征提取,4.特征工程,5.模型设计,6.结果分析和绘图。
1.1 项目目的
而对于大多数医学专业的朋友来说,毕竟学医的对写代码不一定擅长,或者第一次接触影像组学。因此想打算写一个影像组学相关的项目,对某篇组学论文方法进行复现。通用这个项目的代码希望可以稍微帮助到他们,可以让他们把时间更多花在疾病的分析上,而不是解决某个bug花大量的时间。
1.2 项目大概情况
这个项目主要是对《Development and validation of an MRI-based radiomics nomogram for distinguishing Warthin’s tumour from pleomorphic adenomas of the parotid gland》这篇影像组学论文中的方法进行复现,完整的跑一个影像组学流程。包括:
1.对临床特征进行建模
2.提取影像组学特征,通过LassoCv进行特征筛选再建模
3.结合影像组学特征+age临床特征进行建模
4.绘制三个模型的ROC曲线进行对比
5.对组合模型绘制诺模图+决策曲线+校准曲线
1.3运行环境
由于有些图例,例如诺模图(列线图)需要用到R语言来绘制,但是提取特征用到python语言的Pyradiomics库,所以这个项目是一个混合编程语言的项目。刚好RMarkdown可以同时运行Python和R语言,所以请有需要运行项目的请离线安装RMarkdown来运行。
虽然不能通过在线BML运行项目,但是可以展示复现论文方法的代码,主要可以容易复制粘贴代码~~~ 。
注:对于如何安装python和安装R语言和如何使用RMarkdown不是这篇项目的范围,自己通过百度自学学习。
注:这个项目也是我在学习影像组学中做下的笔记,如有错误请纠正和谅解
2.关于数据
虽然是论文复现,但是论文中涉及的腮腺MR数据是不公开的,无法和论文实验数据一致,不过这个项目主要是为了实现论文中大部分方法。因此采用了一个公开分割比赛的数据集作为这个项目的数据。
在数据上用到的是一个公开的胶质瘤数据集–【BraTS2019】。是一个两分类的数据集,Hgg是高级别胶质瘤,Lgg是低级别的胶质瘤,数量上分别是240和76。每个病例有四个模态的磁共振图像。Hgg患者是带有age的年龄数据,但是lgg的患者是没有的。为了不影响临床+影像组学的建模。在(25到55)之间随机生成一个随机数赋予个Lgg患者作为他的年龄。所以Lgg患者的年龄是虚拟数据。提取影像组学的特征只用到T1增强的模态。
因为原始的数据集Hgg和Lgg的比例差不多达到3比1,所以为了搭到数据均衡一点,把Hgg删掉一部分,剩下101例。Lgg为76例。
因此在实际运用到自己的数据的时候,收集病例尽量做到数据均衡一点。不然有点自己坑自己的感觉。
3.复现的论文
《Development and validation of an MRI-based radiomics nomogram for distinguishing Warthin’s tumour from pleomorphic adenomas of the parotid gland》
3.1论文地址
3.2论文大概内容
这篇论文主要对腮腺肿瘤(Warthin和多形性腺瘤)的磁共振图像使用计算机提取大量高通量的特征,通过对特征进行筛选和建立模型来鉴别MR图像是Warthin还是多形性腺瘤。
两种腺瘤的治疗方法和预后不同,建立一个良好的分类模型对临床来说至关重要。论文通过建立三种不同特征的模型,分别是1.单纯临床特征模型,2.单纯影像组学特征模型,3.影像组学特征+年龄模型。经过实验得到模型3的auc最好。并根据模型3构建影像组学列线图(如下图),使预测模型的结果更具有可读性,让模型更加方便运用实际临床决策之中。
3.2论文中的方法流程:
4.开始影像组学
4.1目录结构和数据准备
重要的一点就是临床特征的index列需要和每一例病例文件夹要一一对应
4.2 特征提取
特征提取是采用pyradiomics库进行提取,可以通过配置文件来设置提取那些特征,例如设置提取特征前是否重采样,设置滤波核的大小等等。如下图
可以到pyradiomics官网查看
"""
经过一两个小时提取后会生成HGG.csv和LGG.csv文件,
生成的csv文件每一行都有接近一千个特征,数量会根据不同yaml文件设置不同而不同
"""
#导入相关的库
import sys
import pandas as pd
import os
import random
import shutil
import numpy as np
import radiomics
from radiomics import featureextractor
import SimpleITK as sitk
kinds = ['HGG','LGG']
#这个是特征处理配置文件,具体可以参考pyradiomics官网
para_path = 'yaml/MR_1mm.yaml'
extractor = featureextractor.RadiomicsFeatureExtractor(para_path)
dir = 'data/MyData/'
for kind in kinds:
print("{}:开始提取特征".format(kind))
features_dict = dict()
df = pd.DataFrame()
path = dir + kind
# 使用配置文件初始化特征抽取器
for index, folder in enumerate( os.listdir(path)):
for f in os.listdir(os.path.join(path, folder)):
if 't1ce' in f:
ori_path = os.path.join(path,folder, f)
break
lab_path = ori_path.replace('t1ce','seg')
features = extractor.execute(ori_path,lab_path) #抽取特征
#新增一列用来保存病例文件夹名字
features_dict['index'] = folder
for key, value in features.items(): #输出特征
features_dict[key] = value
df = df.append(pd.DataFrame.from_dict(features_dict.values()).T,ignore_index=True)
print(index)
df.columns = features_dict.keys()
df.to_csv('csv/' +'{}.csv'.format(kind),index=0)
print('Done')
print("完成")
"""
再对HGG.csv和LGG.csv文件进行处理,去掉字符串特征,插入label标签。
HGG标签为1,LGG标签为0
"""
import matplotlib.pyplot as plt
import seaborn as sns
hgg_data = pd.read_csv('csv/HGG.csv')
lgg_data = pd.read_csv('csv/LGG.csv')
hgg_data.insert(1,'label', 1) #插入标签
lgg_data.insert(1,'label', 0) #插入标签
#因为有些特征是字符串,直接删掉
cols=[x for i,x in enumerate(hgg_data.columns) if type(hgg_data.iat[1,i]) == str]
cols.remove('index')
hgg_data=hgg_data.drop(cols,axis=1)
cols=[x for i,x in enumerate(lgg_data.columns) if type(lgg_data.iat[1,i]) == str]
cols.remove('index')
lgg_data=lgg_data.drop(cols,axis=1)
#再合并成一个新的csv文件。
total_data = pd.concat([hgg_data, lgg_data])
total_data.to_csv('csv/TotalOMICS.csv',index=False)
#简单查看数据的分布
fig, ax = plt.subplots()
sns.set()
ax = sns.countplot(x='label',hue='label',data=total_data)
plt.show()
print(total_data['label'].value_counts())
4.3划分数据
对影像组学数据集进行划分训练集合测试集。比例为8:2。同样把临床特征也进行同样的划分(顺序和影响组学特征是一致的)。
#导入常用R包
library(glmnet)
library(rms)
library(foreign)
library(ggplot2)
library(pROC)
#设置种子为了保证每次结果都一样
set.seed(888)
data <- read.csv("csv/TotalOMICS.csv")
nn=0.8
print(paste('总样本数:',length(data[,1])))
sub<-sample(1:nrow(data),round(nrow(data)*nn))
trainOmics<-data[sub,]#取0.8的数据做训练集
testOmics<-data[-sub,]#取0.2的数据做测试集
print(paste('训练集样本数:',length(trainOmics[,1])))
print(paste('测试集样本数:',length(testOmics[,1])))
write.csv(trainOmics,"csv/trainOmics.csv",row.names = FALSE )
write.csv(testOmics,"csv/testOmics.csv",row.names = FALSE )
"""
R语言输出结果
[1] "总样本数: 177"
[1] "训练集样本数: 142"
[1] "测试集样本数: 35"
"""
"""
根据上面对影像组学TotalOMICS.csv的数据划分,对TotalClinic.csv同样的顺序划分
"""
import pandas as pd
def compare(data1, data2,filename):
# 读取两个表
dt1 = pd.read_csv(data1,encoding='utf-8')
dt2 = pd.read_csv(data2,encoding='gb18030')
dt2.head()
df = pd.DataFrame()
dt1_name = dt1['index'].values.tolist()
dt2_name = dt2['index'].values.tolist()
for i in dt1_name:
if i in dt2_name:
dt2_row = dt2.loc[dt2['index'] == i]
df = df.append(dt2_row)
df.to_csv('./csv/'+filename+'.csv',header=True,index=False,encoding="utf_8_sig")
data_train= "./csv/trainOmics.csv"
data_test = "./csv/testOmics.csv"
data_clinic= "./csv/TotalClinic.csv"
compare(data_train,data_clinic,"trainClinic")
compare(data_test,data_clinic,"testClinic")
4.4对单纯临床特征进行建模
因为这个数据集只有age这个临床特征,所以没有做进一步的特征分析和筛选,直接用age进行建模
4.4.1处理临床数据
#读取临床数据trainClinic.csv
trainClinic<-read.csv("./csv/trainClinic.csv",fileEncoding = "UTF-8-BOM")
trainClinic<- data.frame(trainClinic)
#把age变量转换成数值型。
trainClinic$Age <- as.numeric(trainClinic$Age)
#打印临床特征名
names(trainClinic)
"""
R语言输出结果
[1] "index" "Age" "Label"
"""
4.4.2 使用逻辑回归对临床特征进行建模
#建模,使用逻辑回归
model.Clinic<-glm(Label~Age,data = trainClinic,family=binomial(link='logit'))
summary(model.Clinic)
"""
R语言输出结果
Call:
glm(formula = Label ~ Age, family = binomial(link = "logit"),
data = trainClinic)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7106 -0.6673 0.1870 0.6294 2.6505
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.6037 1.1252 -5.869 4.39e-09 ***
Age 0.1419 0.0233 6.090 1.13e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 194.57 on 141 degrees of freedom
Residual deviance: 121.48 on 140 degrees of freedom
AIC: 125.48
Number of Fisher Scoring iterations: 5
"""
4.4.2查看模型在训练集的验证情况
probClinicTrain<-predict.glm(object =model.Clinic,newdata=trainClinic,type = "response")
predClinicTrain<-ifelse(probClinicTrain>=0.5,1,0)
#计算模型精度
error=predClinicTrain-trainClinic$Label
#accuracy:判断正确的数量占总数的比例
accuracy=(nrow(trainClinic)-sum(abs(error)))/nrow(trainClinic)
#precision:真实值预测值全为1 / 预测值全为1 --- 提取出的正确信息条数/提取出的信息条数
precision=sum(trainClinic$Label & predClinicTrain)/sum(predClinicTrain)
#recall:真实值预测值全为1 / 真实值全为1 --- 提取出的正确信息条数 /样本中的信息条数
recall=sum(predClinicTrain & trainClinic$Label)/sum(trainClinic$Label)
#P和R指标有时候会出现的矛盾的情况,这样就需要综合考虑他们,最常见的方法就是F-Measure(又称为F-Score)
#F-Measure是Precision和Recall加权调和平均,是一个综合评价指标
F_measure=2*precision*recall/(precision+recall)
#输出以上各结果
#精确率和召回率和F_measure是预测Hgg的值
#可以模型预测HGG的能力比较强,但是预测Lgg的能力比较弱
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(trainClinic$Label,predClinicTrain)
"""
R语言输出结果
[1] "准确率accuracy: 0.78169014084507"
[1] "精确率precision: 0.795180722891566"
[1] "召回率recall: 0.825"
[1] "F_measure: 0.809815950920245"
predClinicTrain
0 1
0 45 17
1 14 66
"""
4.4.3 查看模型在测试集的验证情况
testClinic<-read.csv("./csv/testClinic.csv",fileEncoding = "UTF-8-BOM")
testClinic$Age <- as.numeric(testClinic$Age)
probClinicTest<-predict.glm(object =model.Clinic,newdata=testClinic,type = "response")
predClinicTest<-ifelse(probClinicTest>=0.5,1,0)
error=predClinicTest-testClinic$Label
accuracy=(nrow(testClinic)-sum(abs(error)))/nrow(testClinic)
precision=sum(testClinic$Label & predClinicTest)/sum(predClinicTest)
recall=sum(predClinicTest & testClinic$Label)/sum(testClinic$Label)
F_measure=2*precision*recall/(precision+recall)
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(testClinic$Label,predClinicTest)
"""
R语言输出结果
[1] "准确率accuracy: 0.828571428571429"
[1] "精确率precision: 0.857142857142857"
[1] "召回率recall: 0.857142857142857"
[1] "F_measure: 0.857142857142857"
predClinicTest
0 1
0 11 3
1 3 18
"""
4.5 对单纯影像组学进行建模
先对影像组学特征进行T检验筛选一部分特征,然后用lasso回归进一步筛选特征,最后剩下5个特征用来建立逻辑回归方程
4.5.1先做T检验筛选一部分特征
"""
用python做T检验,筛选后剩下的特征数:562个
"""
from scipy.stats import levene, ttest_ind
tData = pd.read_csv('./csv/trainOmics.csv')
classinformation = tData["label"].unique()
for temp_classinformation in classinformation:
temp_data = tData[tData['label'].isin([temp_classinformation])]
exec("df%s=temp_data"%temp_classinformation)
counts = 0
columns_index =[]
for column_name in tData.columns[2:]:
if levene(df1[column_name], df0[column_name])[1] > 0.05:
if ttest_ind(df1[column_name],df0[column_name],equal_var=True)[1] < 0.05:
columns_index.append(column_name)
else:
if ttest_ind(df1[column_name],df0[column_name],equal_var=False)[1] < 0.05:
columns_index.append(column_name)
print("筛选后剩下的特征数:{}个".format(len(columns_index)))
#print(columns_index)
#数据只保留从T检验筛选出的特征数据,重新组合成data
if not 'label' in columns_index:
columns_index = ['label'] + columns_index
if not 'index' in columns_index:
columns_index = ['index'] + columns_index
df1 = df1[columns_index]
df0 = df0[columns_index]
tData = pd.concat([df1, df0])
tData.to_csv('./csv/tData_train.csv',header=True,index=False,encoding="utf_8_sig")
4.5.2再用lasso回归进一步筛选特征
tData_train <- read.csv("csv/tData_train.csv",fileEncoding = "UTF-8-BOM")
dim(tData_train)
Y <-as.data.frame(tData_train$label)
#[,-1]是为了去掉截距
Y <- model.matrix(~.,data=Y)[,-1]
#除去因变量,提取自变量
yavars<-names(tData_train) %in% c("label","index")
X <- as.data.frame(tData_train[!yavars])
X <- model.matrix(~.,data=X)[,-1]
#Lasso回归
fit <- glmnet(X,Y, alpha=1, family = "binomial")
plot(fit, xvar="lambda", label=TRUE)
这个图显示随着lambda变化,特征数量的变化
#采用k折交叉验证进行lasso回归
cv.fit <- cv.glmnet(X,Y, alpha=1,nfolds = 10,family="binomial")
plot(cv.fit)
abline(v=log(c(cv.fit$lambda.min, cv.fit$lambda.lse)), lty=2)
#把lambda取1倍标准误时的值绘制到图中
plot(cv.fit$glmnet.fit,xvar="lambda")
abline(v=log(cv.fit$lambda.1se), lty=2,)
这个图显示随着lambda增大,MSE的变化,右边的垂直虚线是1倍标准误时lambda的取值。
4.5.3经过lasso回归筛选抽出5个特征
分别是
[2] “wavelet.LHL_glcm_Imc2”
[3] “wavelet.LHL_glszm_ZoneEntropy”
[4] “wavelet.HLL_glszm_ZoneEntropy”
[5] “wavelet.LLL_firstorder_90Percentile”
[6] “wavelet.LLL_firstorder_RootMeanSquared”
#如果取1倍标准误时,获取筛选后的特征
lambda = cv.fit$lambda.1se
Coefficients <- coef(fit, s = lambda)
Active.Index <- which(Coefficients != 0)
Active.Coefficients <- Coefficients[Active.Index]
Active.Index
Active.Coefficients
row.names(Coefficients)[Active.Index]
"""
R语言输出结果
[1] 1 273 300 396 507 518
[1] -1.135967e+01 5.551516e+00 5.053583e-01 4.978627e-01 4.710207e-04
[6] 1.236226e-03
[1] "(Intercept)"
[2] "wavelet.LHL_glcm_Imc2"
[3] "wavelet.LHL_glszm_ZoneEntropy"
[4] "wavelet.HLL_glszm_ZoneEntropy"
[5] "wavelet.LLL_firstorder_90Percentile"
[6] "wavelet.LLL_firstorder_RootMeanSquared"
"""
4.6 对5个影像组学特征建立逻辑回归
#建立公式
formulalse <-as.formula(label ~wavelet.LHL_glcm_Imc2+wavelet.LHL_glszm_ZoneEntropy+wavelet.HLL_glszm_ZoneEntropy+wavelet.LLL_firstorder_90Percentile+wavelet.LLL_firstorder_RootMeanSquared)
#逻辑回归
model.Omics <- glm(formula=formulalse,data=tData_train,family=binomial(link="logit"))
#查看结果
summary(model.Omics)
"""
R语言输出结果
Call:
glm(formula = formulalse, family = binomial(link = "logit"),
data = tData_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8353 -0.3845 0.1660 0.3471 2.3193
Coefficients:
Estimate Std. Error z value
(Intercept) -31.638928 7.898076 -4.006
wavelet.LHL_glcm_Imc2 14.026526 5.980053 2.346
wavelet.LHL_glszm_ZoneEntropy 0.428890 1.381598 0.310
wavelet.HLL_glszm_ZoneEntropy 2.143145 1.333109 1.608
wavelet.LLL_firstorder_90Percentile 0.002704 0.003126 0.865
wavelet.LLL_firstorder_RootMeanSquared 0.004004 0.004434 0.903
Pr(>|z|)
(Intercept) 6.18e-05 ***
wavelet.LHL_glcm_Imc2 0.019 *
wavelet.LHL_glszm_ZoneEntropy 0.756
wavelet.HLL_glszm_ZoneEntropy 0.108
wavelet.LLL_firstorder_90Percentile 0.387
wavelet.LLL_firstorder_RootMeanSquared 0.367
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 194.566 on 141 degrees of freedom
Residual deviance: 84.983 on 136 degrees of freedom
AIC: 96.983
Number of Fisher Scoring iterations: 6
"""
4.6.1 查看在训练集的情况
probOmicsTrain<-predict.glm(object =model.Omics,newdata=tData_train,type = "response")
predOmicsTrain<-ifelse(probOmicsTrain>=0.5,1,0)
error=predOmicsTrain-tData_train$label
accuracy=(nrow(tData_train)-sum(abs(error)))/nrow(tData_train)
precision=sum(tData_train$label & predOmicsTrain)/sum(predOmicsTrain)
recall=sum(predOmicsTrain & tData_train$label)/sum(tData_train$label)
F_measure=2*precision*recall/(precision+recall)
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(tData_train$label,predOmicsTrain)
"""
R语言输出结果
[1] "准确率accuracy: 0.908450704225352"
[1] "精确率precision: 0.935064935064935"
[1] "召回率recall: 0.9"
[1] "F_measure: 0.917197452229299"
predOmicsTrain
0 1
0 57 5
1 8 72
"""
4.6.2 查看模型在测试集的结果
tData_test <- read.csv("csv/testOmics.csv",fileEncoding = "UTF-8-BOM")
probOmicsTest<-predict.glm(object =model.Omics,newdata=tData_test,type = "response")
predOmicsTest<-ifelse(probOmicsTest>=0.5,1,0)
error=predOmicsTest-tData_test$label
accuracy=(nrow(tData_test)-sum(abs(error)))/nrow(tData_test)
precision=sum(tData_test$label & predOmicsTest)/sum(predOmicsTest)
recall=sum(predOmicsTest & tData_test$label)/sum(tData_test$label)
F_measure=2*precision*recall/(precision+recall)
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(tData_test$label,predOmicsTest)
"""
R语言输出结果
[1] "准确率accuracy: 0.857142857142857"
[1] "精确率precision: 0.863636363636364"
[1] "召回率recall: 0.904761904761905"
[1] "F_measure: 0.883720930232558"
predOmicsTest
0 1
0 11 3
1 2 19
"""
4.7结合临床特征和影像组学特征进行建模
先处理数据,把临床特征的age赋值到影像组学中新增age一列。然后根据lasso模型计算每个病例的得分。把这个得分和age作为新的特征建立逻辑回归模型。
4.7.1处理数据
#读取经过T检验筛选后的影像组学训练集文件
tData_train2 <- read.csv("csv/tData_train.csv",fileEncoding = "UTF-8-BOM")
#把临床特征的Age放到影像组学中作为新的一列
#放的时候是根据病例名一一对应的
tData_train2$Age <- trainClinic$Age
for (i in trainClinic$index){
if(tData_train2[tData_train2$index == i,]$index == i){
Age <- trainClinic[trainClinic$index == i,]$Age
tData_train2[tData_train2$index == i,]$Age <- Age
}
}
#对测试集也这样处理
tData_test2 <- read.csv("csv/testOmics.csv",fileEncoding = "UTF-8-BOM")
testClinic2<-read.csv("./csv/testClinic.csv",fileEncoding = "UTF-8-BOM")
testClinic2<- data.frame(testClinic2)
testClinic2$Age <- as.numeric(testClinic2$Age)
columns <- colnames(tData_train2)[1:564]#经过T检验筛选后的562个特征+index+label
tData_test2 <- as.data.frame(tData_test2[,columns])
tData_test2$Age <- testClinic2$Age
for (i in testClinic2$index){
if(tData_test2[tData_test2$index == i,]$index == i){
Age <- testClinic2[testClinic2$index == i,]$Age
tData_test2[tData_test2$index == i,]$Age <- Age
}
}
4.7.2根据lasso模型计算得分
#分别计算影像组学得分(RS)
y_vad <-as.data.frame(tData_test2$label)
y_vad <- model.matrix(~.,data=y_vad)[,-1]
#除去因变量,提取自变量
yavars<-names(tData_test2) %in% c("label")
x_vad <- as.data.frame(tData_test2[!yavars])
columns <- colnames(tData_train2)[3:564]#T检验筛选后的564个特征不要index+label
x_vad <- as.data.frame(tData_test2[,columns])
x_vad <- model.matrix(~.,data=x_vad)[,-1]
x_dev <- X
y_dev <- Y
#fit是lassoCV模型
tData_train2$RS <-predict(fit,type="link",
newx=x_dev,newy=y_dev,s=cv.fit$lambda.1se)#训练集的RS
tData_test2$RS<-predict(fit,type="link",
newx=x_vad,newy=y_vad,cv.fit$lambda.1se)#测试集的RS
4.7.3建立逻辑回归模型
#因为后续要绘制诺模图,所以用rms这包建立逻辑回归
#通过下列例子,发现两种的逻辑回归结果是一样的
tData_train2$RS <- as.numeric(tData_train2$RS)
model.and1 = glm(label ~ RS+Age,data=tData_train2,binomial(link='logit'))
print(model.and1$coef)
print("#####")
model.and2 <- lrm(label ~ RS+Age,data=tData_train2,x=TRUE,y=TRUE)
print(model.and2$coef)
"""
R语言输出结果
(Intercept) RS Age
-5.8730358 2.4543423 0.1192463
[1] "#####"
Intercept RS Age
-5.8730331 2.4543415 0.1192462
"""
4.7.4查看模型在临床结合影像组学的训练集的情况
probOmicsTrainAnd<-predict.glm(object =model.and1,newdata=tData_train2,type = "response")
predOmicsTrainAnd<-ifelse(probOmicsTrainAnd>=0.5,1,0)
error=predOmicsTrainAnd-tData_train2$label
accuracy=(nrow(tData_train2)-sum(abs(error)))/nrow(tData_train2)
precision=sum(tData_train2$label & predOmicsTrainAnd)/sum(predOmicsTrainAnd)
recall=sum(predOmicsTrainAnd & tData_train2$label)/sum(tData_train2$label)
F_measure=2*precision*recall/(precision+recall)
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(tData_train2$label,predOmicsTrainAnd)
"""
R语言输出结果
[1] "准确率accuracy: 0.929577464788732"
[1] "精确率precision: 0.948717948717949"
[1] "召回率recall: 0.925"
[1] "F_measure: 0.936708860759494"
predOmicsTrainAnd
0 1
0 58 4
1 6 74
"""
4.7.5 查看模型在临床结合影像组学的测试集的情况
tData_test2$RS <- as.numeric(tData_test2$RS)
probOmicsTestAnd<-predict.glm(object =model.and1,newdata=tData_test2,type = "response")
predOmicsTestAnd<-ifelse(probOmicsTestAnd>=0.5,1,0)
error=predOmicsTestAnd-tData_test2$label
accuracy=(nrow(tData_test2)-sum(abs(error)))/nrow(tData_test2)
precision=sum(tData_test2$label & predOmicsTestAnd)/sum(predOmicsTestAnd)
recall=sum(predOmicsTestAnd & tData_test2$label)/sum(tData_test2$label)
F_measure=2*precision*recall/(precision+recall)
print(paste('准确率accuracy:',accuracy))
print(paste('精确率precision:',precision))
print(paste('召回率recall:',recall))
print(paste('F_measure:',F_measure))
table(tData_test2$label,predOmicsTestAnd)
"""
R语言输出结果
[1] "准确率accuracy: 0.857142857142857"
[1] "精确率precision: 0.9"
[1] "召回率recall: 0.857142857142857"
[1] "F_measure: 0.878048780487805"
predOmicsTestAnd
0 1
0 12 2
1 3 18
"""
4.8绘制ROC
把三个模型都绘制到同一个ROC曲线里面,进行比较
4.8.1绘制训练集的ROC曲线
#计算roc和auc
rocClinic <- roc(trainClinic$Label, probClinicTrain)
rocOmics <- roc(tData_train$label, probOmicsTrain)
rocClinicAndOmics <-roc(tData_train2$label, probOmicsTrainAnd)
# 绘图
plot(rocClinic,
print.auc=TRUE, # 图像上输出AUC的值
print.auc.x=0.4, print.auc.y=0.6, # 设置AUC值坐标为(x,y)
auc.polygon=TRUE, # 将ROC曲线下面积转化为多边形
auc.polygon.col="#fff7f7", # 设置ROC曲线下填充色
grid=FALSE,
print.thres=FALSE, # 图像上输出最佳截断值
main=" Train ROC curves", # 添加图形标题
col="red", # 设置ROC曲线颜色
legacy.axes=TRUE,# 使x轴从0到1,表示为1-特异度
xlim=c(1,0),
mgp=c(1.5, 1, 0),
lty=3)
# 再添加1条ROC曲线
plot.roc(rocOmics,
add=TRUE, # 增加曲线,
col="green", # 设置ROC曲线颜色
print.thres=FALSE, # 图像上输出最佳截断值
print.auc=TRUE, # 图像上输出AUC
print.auc.x=0.4,print.auc.y=0.5,# AUC的坐标为(x,y)
lty=2)
# 再添加1条ROC曲线
plot.roc(rocClinicAndOmics,
add=TRUE, # 增加曲线,
col="blue", # 设置ROC曲线颜色
print.thres=FALSE, # 图像上输出最佳截断值
print.auc=TRUE, # 图像上输出AUC
print.auc.x=0.4,print.auc.y=0.4,# AUC的坐标为(x,y)
lty=1)
# 添加图例
legend(0.45, 0.30, # 图例位置x,y
bty = "n", # 图例样式
legend=c("Clinical model","Radiomics signature","Clinical and Radiomics"), # 添加分组
col=c("red","green","blue"), # 颜色跟前面一致
lwd=1,
lty=c(3,2,1)) # 线条粗细
4.8.1绘制测试集的ROC曲线
rocClinicTest <- roc(testClinic$Label, probClinicTest)
rocOmicsTest <- roc(tData_test$label, probOmicsTest)
rocClinicAndOmicsTest <-roc(tData_test2$label, probOmicsTestAnd)
plot(rocClinicTest,
print.auc=TRUE, # 图像上输出AUC的值
print.auc.x=0.4, print.auc.y=0.6, # 设置AUC值坐标为(x,y)
auc.polygon=TRUE, # 将ROC曲线下面积转化为多边形
auc.polygon.col="#fff7f7", # 设置ROC曲线下填充色
grid=FALSE,
print.thres=FALSE, # 图像上输出最佳截断值
main=" Test ROC curves", # 添加图形标题
col="red", # 设置ROC曲线颜色
legacy.axes=TRUE,# 使x轴从0到1,表示为1-特异度
xlim=c(1,0),
mgp=c(1.5, 1, 0),
lty=3)
# 再添加1条ROC曲线
plot.roc(rocOmicsTest,
add=TRUE, # 增加曲线,
col="green", # 设置ROC曲线颜色
print.thres=FALSE, # 图像上输出最佳截断值
print.auc=TRUE, # 图像上输出AUC
print.auc.x=0.4,print.auc.y=0.5,# AUC的坐标为(x,y)
lty=2)
# 再添加1条ROC曲线
plot.roc(rocClinicAndOmicsTest,
add=TRUE, # 增加曲线,
col="blue", # 设置ROC曲线颜色
print.thres=FALSE, # 图像上输出最佳截断值
print.auc=TRUE, # 图像上输出AUC
print.auc.x=0.4,print.auc.y=0.4,# AUC的坐标为(x,y)
lty=1)
# 添加图例
legend(0.45, 0.30, # 图例位置x,y
bty = "n", # 图例样式
legend=c("Clinical model","Radiomics signature","Clinical and Radiomics"), # 添加分组
col=c("red","green","blue"), # 颜色跟前面一致
lwd=1,
lty=c(3,2,1)) # 线条粗细
4.8绘制诺模图
临床结合影像组学建立的逻辑回归模型整体来说比较良好。
把这个模型绘制成诺模图,使模型可视化。
library(rms)
formula <- as.formula(label ~ Age+RS)
#数据打包
dd = datadist(tData_train2)
options(datadist="dd")
fitnomogram <- lrm(formula,data=tData_train2, x=TRUE, y=TRUE)
nom1 <- nomogram(fitnomogram,#模型对象
fun=function(x)1/(1+exp(-x)),#保持不变,logistic固定
lp=F,#是否显示线性概率
fun.at=c(0.1,0.2,0.5,0.7,0.9),#预测结果坐标轴
funlabel="Risk")#坐标轴名称
#可以使用Cairo导出pdf
#library(Cairo)
#CairoPDF("nomogram.pdf",width=6,height=6)
plot(nom1)
诺模图怎样看?现在知道某个患者的age师多少岁,找到对应顶部Points的分数,同样知道患者的RS得分,也找到对应Points的得分,把两个等分加起来。在Total Points找到底部Risk得分,这个得分就是判断患者是HGG的概率。
4.9绘制校准曲线
cal1 <- calibrate(fitnomogram, cmethod="hare", method="boot", B=1000)
plot(cal1, xlim=c(0,1.0), ylim=c(0,1.0),
xlab = "Predicted Probability",
ylab = "Observed Probability",
legend = FALSE,subtitles = FALSE)
#abline对角线
abline(0,1,col = "black",lty = 2,lwd = 2)
#再画一条模型预测的实际曲线
lines(cal1[,c("predy","calibrated.orig")],
type = "l",lwd = 2,col="red",pch =16)
#再画一条模型Bias-corrected是校准曲线
lines(cal1[,c("predy","calibrated.corrected")],
type = "l",lwd = 2,col="green",pch =16)
legend(0.55,0.35,
c("Ideal","Apparent","Bias-corrected"),
lty = c(2,1,1),
lwd = c(2,1,1),
col = c("black","red","green"),
bty = "n") # "o"为加边框
彩色的曲线越接近点斜线,模型的性能就比较好
4.10 绘制决策曲线
把三个模型都绘制在同一个图中,进行比较
library(rmda)
formulClinic<-as.formula(Label~Age)
formulaOmics <- as.formula(label ~wavelet.LHL_glcm_Imc2+wavelet.LHL_glszm_ZoneEntropy+wavelet.HLL_glszm_ZoneEntropy+wavelet.LLL_firstorder_90Percentile+wavelet.LLL_firstorder_RootMeanSquared)
formulaAnd <- as.formula(label~Age+RS)
model_Clinic <- decision_curve(formulClinic, data=trainClinic,
family=binomial(link='logit'),
thresholds=seq(0,1,by=0.01),
confidence.intervals=0.95,
study.design = 'case-control',
population.prevalence=0.3)
model_Omics <- decision_curve(formulaOmics, data=tData_train,
family=binomial(link='logit'),
thresholds=seq(0,1,by=0.01),
confidence.intervals=0.95,
study.design = 'case-control',
population.prevalence=0.3)
model_And <- decision_curve(formulaAnd, data=tData_train2,
family=binomial(link='logit'),
thresholds=seq(0,1,by=0.01),
confidence.intervals=0.95,
study.design = 'case-control',
population.prevalence=0.3)
model_all <- list(model_Clinic,model_Omics,model_And)
plot_decision_curve(model_all,curve.names=c('Clinical model','Radiomics signature','Clinical and Radiomics'),
xlim=c(0,0.8),
cost.benefit.axis=F,col=c('green','red','blue'),
confidence.intervals = F,
standardize = F,
ain2,
family=binomial(link='logit'),
thresholds=seq(0,1,by=0.01),
confidence.intervals=0.95,
study.design = 'case-control',
population.prevalence=0.3)
model_all <- list(model_Clinic,model_Omics,model_And)
plot_decision_curve(model_all,curve.names=c('Clinical model','Radiomics signature','Clinical and Radiomics'),
xlim=c(0,0.8),
cost.benefit.axis=F,col=c('green','red','blue'),
confidence.intervals = F,
standardize = F,
legend.position="bottomleft")
一般曲线越靠左上角的,模型的性能就比较好,可以看到蓝色曲线(临床结合影像组学)的模型是在最外面的,可见这个模型性能比较好。
来源:
https://blog.csdn.net/weixin_42990106/article/details/122102397?ops_request_misc=%257B%2522request%255Fid%2522%253A%2522167634472316800215041801%2522%252C%2522scm%2522%253A%252220140713.130102334…%2522%257D&request_id=167634472316800215041801&biz_id=0&utm_medium=distribute.pc_search_result.none-task-blog-2blogsobaiduend~default-2-122102397-null-null.blog_rank_default&utm_term=python%E5%BD%B1%E5%83%8F%E7%BB%84%E5%AD%A6%E5%85%A8%E6%B5%81%E7%A8%8B&spm=1018.2226.3001.4450