Prediction with the random forest algorithm (plus hyperparameter tuning and variable importance): hemophilia count data


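1. Reading the data

The data used here is a hemophilia dataset; if you need it, it is available in the resources section of the author's homepage. The data looks as follows: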

import pandas as pd
import numpy as np
hemophilia = pd.read_csv('D:/my_files/data.csv')  # read the data

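2. Data preprocessing

When using machine learning methods, always check the data type of each variable. Here, for example, hiv is really a categorical (dummy) variable, but because it is coded as 1 and 2 the computer would treat it as a numeric variable. Variables of this kind need to be processed first; here they are converted directly with astype and then encoded with pd.get_dummies, which is a form of one-hot encoding.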
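To make the point about data types concrete, here is a minimal sketch on a toy table. The values are made up for illustration and are not taken from the hemophilia file; only the column names hiv, factor and age mirror the real data.

```python
import pandas as pd

# Toy data (hypothetical values): 'hiv' is coded 1/2 and 'factor' 1..3,
# but both are really category labels, not quantities.
toy = pd.DataFrame({
    "hiv": [1, 2, 2, 1],
    "factor": [1, 3, 2, 3],
    "age": [4, 7, 10, 2],
})

# While the columns are still integers, get_dummies leaves them untouched.
untouched = pd.get_dummies(toy, drop_first=True)

# Cast the coded columns to a categorical dtype, then one-hot encode.
for col in ["hiv", "factor"]:
    toy[col] = toy[col].astype("category")
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(untouched.columns))
print(list(encoded.columns))
```

With drop_first=True the first level of each variable becomes the reference category, so a two-level variable such as hiv is represented by a single 0/1 column.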

print("Number of missing values per feature in hemophilia:\n", hemophilia.isnull().sum())  # check for missing values
print(hemophilia.dtypes)   # check the data types
'''
hiv        object
factor     object
year        int64
age         int64
py        float64
deaths      int64
dtype: object
'''

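# convert the numerically coded variables to categorical variables
hemophilia['hiv'] = hemophilia['hiv'].astype(object)
hemophilia['factor'] = hemophilia['factor'].astype(object)
new_hemophilia = pd.get_dummies(hemophilia, drop_first=True)
    # drop_first=True drops one dummy column per variable; hiv, for example, would give two 0/1 columns, but one is enough to represent both states
new_data = new_hemophilia
new_data.describe(include='all')  # basic summary statistics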
'''
              year          age  ...     factor_4     factor_5
count  2144.000000  2144.000000  ...  2144.000000  2144.000000
mean     86.847481     7.236007  ...     0.217351     0.199627
std       4.899319     3.810280  ...     0.412539     0.399813
min      78.000000     1.000000  ...     0.000000     0.000000
25%      83.000000     4.000000  ...     0.000000     0.000000
50%      87.000000     7.000000  ...     0.000000     0.000000
75%      91.000000    10.000000  ...     0.000000     0.000000
max      95.000000    14.000000  ...     1.000000     1.000000
'''

from sklearn.model_selection import train_test_split
x = new_data.drop(['deaths'], axis=1)   # drop the label column
X_train, X_test, y_train, y_test = train_test_split(x, new_data.deaths, test_size=0.3, random_state=0)
    # split the data: 70% training set, 30% test set

3. Random forest prediction and hyperparameter tuning


from sklearn import metrics   # used to compute evaluation metrics
from sklearn.ensemble import RandomForestRegressor   # random forest regression
rf = RandomForestRegressor(random_state=0)  # random_state=0 fixes the random seed
rf.fit(X_train, y_train)  # train directly with default parameters, no tuning
y_pred = rf.predict(X_test)
print('MSE of RF: %.3f' % metrics.mean_squared_error(y_test, y_pred))  # %.3f keeps three decimal places
'''
MSE of RF: 0.306
'''

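Tuning a random forest means giving each of a few important parameters a range of candidate values and then evaluating every combination of those values to find the best one. Here we tune the number of trees 'n_estimators', the tree depth 'max_depth', and the minimum number of samples required to split a node 'min_samples_split'; other parameters could be chosen as well. Do not give each parameter too many values in a single search, or it will be very slow. Instead, find a promising range first, then reduce the step size and repeat. Alternatively, tune two parameters (or one) at a time, fix their best values, then tune the remaining parameters two by two (or one by one), and finally build the model with the tuned parameters.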
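The coarse-then-fine strategy can be sketched as follows. This is only an illustration under stated assumptions: it uses synthetic data from make_regression as a stand-in for X_train and y_train, the helper search is hypothetical, and the grids are kept tiny so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; replace with X_train, y_train).
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=0)

def search(grid):
    """Hypothetical helper: run one grid search and return the fitted object."""
    gs = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                      scoring="neg_mean_squared_error", cv=3)
    return gs.fit(X_demo, y_demo)

# Round 1: a coarse grid with wide steps.
coarse = search({"n_estimators": [10, 40], "max_depth": [2, 4, 6]})
best_depth = coarse.best_params_["max_depth"]

# Round 2: keep the best n_estimators and try depths next to the coarse optimum.
fine_depths = [d for d in (best_depth - 1, best_depth, best_depth + 1) if d >= 1]
fine = search({"n_estimators": [coarse.best_params_["n_estimators"]],
               "max_depth": fine_depths})

# best_score_ is the negated MSE, so flip the sign to report a plain MSE.
print(coarse.best_params_, -coarse.best_score_)
print(fine.best_params_, -fine.best_score_)
```

Because the fine grid contains the best coarse combination, the second round can only match or improve the cross-validated score of the first.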

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators':np.arange(10,200,30),      
              'max_depth' : np.arange(2,7,1),
              'min_samples_split' :  np.arange(2,10,2)}
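GS = GridSearchCV(rf, param_grid, scoring='neg_mean_squared_error', cv=5)
    # scoring='neg_mean_squared_error' ranks each parameter combination by its negated MSE (higher is better), estimated with 5-fold cross-validation
GS.fit(X_train, y_train)
GS.best_params_  # best parameter combination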
'''
{'max_depth': 4, 'min_samples_split': 8, 'n_estimators': 40}

'''
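rf = RandomForestRegressor(max_depth=4, n_estimators=40, min_samples_split=8, random_state=0)
    # build the model with the tuned parameters
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print('MSE of RF: %.3f' % metrics.mean_squared_error(y_test, y_pred))
# Tuning has little effect here; consider other candidate values or other parameters. The data may also simply not suit a random forest, since different data behave differently.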
'''
MSE of RF: 0.305
'''
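When two test-set MSE values are this close, one way to judge whether the difference is meaningful is to compare cross-validated errors and look at their spread across folds. The sketch below is an illustration on synthetic data (an assumption, standing in for the hemophilia data); the tuned settings are the ones found above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; replace with x and new_data.deaths).
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=0)

models = {
    "default": RandomForestRegressor(random_state=0),
    "tuned": RandomForestRegressor(max_depth=4, n_estimators=40,
                                   min_samples_split=8, random_state=0),
}

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    cv_mse[name] = -scores  # flip the sign to get per-fold MSE
    print(f"{name}: mean MSE {cv_mse[name].mean():.3f}, "
          f"std {cv_mse[name].std():.3f}")
```

If the gap between the two mean values is small compared with the standard deviation across folds, the tuned and default models are effectively indistinguishable on that data.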

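4. Random forest variable importance

A random forest produces variable importances as a by-product of model fitting. They can be extracted and plotted as follows.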

import matplotlib.pyplot as plt
importances = list(rf.feature_importances_)  # random forest feature importances
feature_list = list(x.columns)   # variable names
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
#feature_importances.pop(0)
feature_importances = sorted(feature_importances, key=lambda pair: pair[1], reverse=True)  # sort by importance, descending
f_list = []
importances_list = []
for i in range(min(8, len(feature_importances))):  # keep the top 8 variables (or fewer if there are fewer)
    feature = feature_importances[i][0]
    importances_r = feature_importances[i][1]
    f_list.append(feature)
    importances_list.append(importances_r)
x_values = list(range(len(importances_list)))
plt.figure(figsize=(14, 9))
plt.bar(x_values, importances_list, orientation='vertical')
plt.xticks(x_values, f_list, rotation=25, size =18)
plt.yticks(size =18)
plt.ylabel('Importance',size = 20)
plt.xlabel('Variable',size = 20)
plt.title('RF Variable Importances',size = 22)
#plt.savefig('D:/files/rf变量重要性.png', dpi=1500)    
    # save the figure to the given path; dpi sets the resolution
plt.show()
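As a more compact alternative to building the sorted lists by hand, the importances can be put into a pandas Series indexed by variable name. The sketch below is an assumption-laden illustration: it fits a small forest on synthetic data with made-up column names x0 to x4 rather than on the hemophilia data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with hypothetical column names x0..x4.
X_arr, y_demo = make_regression(n_samples=200, n_features=5,
                                n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(5)])

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, labelled by column and sorted in descending order.
ranked = pd.Series(forest.feature_importances_,
                   index=X_demo.columns).sort_values(ascending=False)
print(ranked)
print(ranked.sum())  # the importances are normalized, so they sum to 1
```

Calling ranked.plot.bar() would then draw a bar chart comparable to the one produced by the manual code, without the explicit loop.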

The plotting code above produces the following variable importance bar chart: