[Study Notes and Exercises] Chen Qiang - Machine Learning - Python - Ch12 Random Forest

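Series contents

Supervised learning: parametric methods

[Study Notes] Chen Qiang - Machine Learning - Python - Ch4 Linear Regression
[Study Notes] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression
[Exercises] Chen Qiang - Machine Learning - Python - Ch5 Logistic Regression (SAheart.csv)
[Study Notes] Chen Qiang - Machine Learning - Python - Ch6 Multinomial Logistic Regression
[Study Notes and Exercises] Chen Qiang - Machine Learning - Python - Ch7 Discriminant Analysis
[Study Notes] Chen Qiang - Machine Learning - Python - Ch8 Naive Bayes
[Study Notes] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression
[Exercises] Chen Qiang - Machine Learning - Python - Ch9 Penalized Regression (student-mat.csv)

Supervised learning: nonparametric methods

[Study Notes and Exercises] Chen Qiang - Machine Learning - Python - Ch10 KNN
[Study Notes] Chen Qiang - Machine Learning - Python - Ch11 Decision Tree

Ensemble learning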

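Preface

These study notes are mainly here so I don't forget the material, and I share them as a reference for others studying the same book. If you have different opinions or suggestions, friendly discussion is welcome.

All code and data used in these notes can be downloaded from Professor Chen Qiang's personal homepage.

Reference book: 陈强. 机器学习及Python应用. 北京：高等教育出版社, 2021.

For the mathematical background, see Professor Chen Qiang's PPT slides.

Also consulted: "Python Machine Learning 09: Random Forest" by the blogger 阡之尘埃.

One more thing: could anyone take a look at the exercise (random forest for classification, Mushroom.csv)? Did I preprocess X incorrectly? Whether I use a single decision tree or a random forest, the test score (accuracy for a classifier, which I had been calling R^2) is always 1.0! I have been stuck on this for days.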

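I. Ensemble Learning: Random Forest

Ensemble learning combines the predictions of several models to reduce the overfitting and bias that a single model may suffer from, improving predictive performance and stability.
The main ensemble methods are Bagging (Bootstrap Aggregating), Boosting, Stacking (Stacked Generalization), and Voting.

Bagging (Bootstrap Aggregating): bootstrap sampling draws multiple training subsets from the original data set, and each subset trains one base learner (for example a decision tree). The final prediction is the average (regression) or the majority vote (classification) of the base learners' predictions. Random forest is a typical bagging-style method that additionally restricts each split to a random subset of the features.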
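To make the bootstrap step concrete, here is a minimal pure-Python sketch (not from the book; the sample size `n = 1000` and the seed are arbitrary assumptions). It draws one bootstrap sample and measures the share of observations left out of it. Those left-out observations are the out-of-bag (OOB) samples used for OOB error later in these notes.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n) (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)   # fixed seed, chosen arbitrarily
n = 1000                 # hypothetical sample size
in_bag = set(bootstrap_indices(n, rng))
oob_fraction = 1 - len(in_bag) / n

# Theory: an observation is left out with probability (1 - 1/n)^n, about 0.368
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

Roughly one third of the data is therefore available as a built-in validation set for each tree, which is why OOB error needs no separate hold-out sample.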

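A random forest combines the predictions of many decision trees to improve accuracy and stability.
Each tree in the forest is trained independently, on its own bootstrap sample.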

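Parameter: mtry

The most important tuning parameter of a random forest is $m$, the number of features randomly drawn as split candidates at each node. It controls overfitting and improves model stability.
The literature usually calls it "mtry"; sklearn calls it "max_features".

For regression trees, the usual recommendation is to draw $m = p/3$ variables ($p$ is the number of features).
For classification trees, the usual recommendation is $m=\sqrt{p}$.

Choosing $m$ (mtry) involves a bias-variance trade-off and can be done with the out-of-bag error or cross-validation.

Also, for a random forest the number of bootstrap samples $B$ matters little, as long as it is large enough for the error to stabilize.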
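The two rules of thumb can be written as a small helper. This is only a sketch: the function name `default_mtry` is hypothetical, and rounding down is an assumption (it mirrors the `int(...)` used later in these notes).

```python
import math

def default_mtry(p, task):
    """Rule-of-thumb mtry: p/3 for regression, sqrt(p) for classification.

    Both are rounded down and kept at 1 or above (an assumption, see lead-in).
    """
    if task == "regression":
        return max(1, p // 3)
    if task == "classification":
        return max(1, math.isqrt(p))
    raise ValueError("task must be 'regression' or 'classification'")

print(default_mtry(13, "regression"))      # Boston housing has 13 features
print(default_mtry(60, "classification"))  # Sonar has 60 features
```

These values are only starting points; the sections below tune mtry with the test set and with cross-validation.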

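Variable importance

Measuring "variable importance" is an important step in assessing how much each feature contributes to the model's predictions.
For each variable, in every tree of the forest, one can measure the decrease in the splitting criterion (residual sum of squares or Gini index) caused by splits on that variable. Averaging this decrease over all trees gives the importance measure for the variable.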
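As an illustration of the "decrease in the splitting criterion" for a regression tree, the sketch below computes the drop in residual sum of squares (RSS) for one split. The six response values are made up, and the helper names are hypothetical.

```python
def rss(values):
    """Residual sum of squares around the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_gain(y_left, y_right):
    """Decrease in RSS achieved by splitting a node into two children."""
    return rss(y_left + y_right) - rss(y_left) - rss(y_right)

# Hypothetical node with six responses, split into two groups of three
gain = split_gain([1, 2, 3], [10, 11, 12])
print(gain)
```

Summing such gains over all splits that use a given variable, then averaging over the trees, gives that variable's importance. In sklearn, `feature_importances_` reports these quantities normalized so that they sum to 1.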

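Partial dependence plots

Variable importance only quantifies the importance of each feature (`.feature_importances_`) and ranks it (`.feature_importances_.argsort()`).
A partial dependence plot (PDP), by contrast, shows the marginal effect of each variable on $y$.
PDPs, drawn with `PartialDependenceDisplay.from_estimator()`, apply to any black-box method whose $f(\cdot)$ has no analytical expression.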
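What a partial dependence curve computes can be sketched by hand: fix the feature of interest at a grid value for every observation, predict, and average. The black-box function `f` and the three data rows below are hypothetical stand-ins for a fitted model and a training set.

```python
def partial_dependence(f, rows, feature, grid):
    """Average f over the data with one feature forced to each grid value."""
    curve = []
    for v in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature] = v   # overwrite the feature of interest
            total += f(modified)
        curve.append(total / len(rows))
    return curve

# Hypothetical black box: linear in x[0], nonlinear in x[1]
f = lambda x: 2 * x[0] + x[1] ** 2
rows = [(0.0, 1.0), (5.0, 2.0), (9.0, 3.0)]
pd_curve = partial_dependence(f, rows, feature=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)
```

Because `f` is additive here, the curve rises by exactly 2 per unit of `x[0]`; the other feature is averaged out rather than held at a single value.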

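II. Regression: Random Forest Case Study

This section uses the Boston housing data, boston (see [Study Notes] Chen Qiang - Machine Learning - Python - Ch4 Linear Regression and [Study Notes] Chen Qiang - Machine Learning - Python - Ch11 Decision Tree).

1. Load and prepare the data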

import numpy as np
import pandas as pd

# Load the data from the original source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)

# Reassemble the records (each record spans two rows of the raw file)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

# Build the DataFrame
columns = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX",
    "PTRATIO", "B", "LSTAT"
]
df = pd.DataFrame(data, columns=columns)
df['MEDV'] = target

# Features and response
X = df.drop(columns=['MEDV'])
y = df['MEDV']

# Split the data into a training set (70%) and a test set (30%)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

2. Bagging estimation (BaggingRegressor)

1) Fit the bagging estimator

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

model = BaggingRegressor(
    estimator=DecisionTreeRegressor(random_state=123), 
    n_estimators=500, oob_score=True, random_state=0)
  
model.fit(X_train, y_train)
model.score(X_test, y_test)

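Output: 0.9054552734857662

`Note: BaggingRegressor()`

BaggingRegressor (Bootstrap Aggregating Regressor) is an ensemble learner for regression tasks that builds a collection of regression models to improve predictive performance and stability. It trains the base learners by bagging and combines their predictions into the final prediction.
BaggingRegressor() is the scikit-learn class that implements the regression version of bagging. It draws bootstrap samples from the data, trains one regression model per sample, and averages the models' predictions, which reduces variance and improves accuracy.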

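# Basic syntax and parameters
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Initialize BaggingRegressor
model = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # base learner
    n_estimators=10,  # number of base learners
    max_samples=1.0,  # number or fraction of samples drawn for each base learner
    max_features=1.0,  # number or fraction of features drawn for each base learner
    bootstrap=True,  # draw samples with replacement?
    bootstrap_features=False,  # draw features with replacement?
    oob_score=False,  # score with out-of-bag samples?
    n_jobs=None,  # number of parallel jobs
    random_state=None)

estimator: the base learner from which every member of the ensemble is cloned. Usually a regression model such as DecisionTreeRegressor or LinearRegression.
n_estimators: number of base learners. Default 10. Larger values usually make the model more stable but increase computation.
max_samples: number or fraction of samples drawn to train each base learner. Default 1.0, i.e. as many draws as there are training samples (with replacement when bootstrap=True). A float such as 0.8 draws 80% of that number.
max_features: number or fraction of features drawn to train each base learner. Default 1.0 (all features). A float such as 0.8 uses 80% of the features.
bootstrap: whether samples are drawn with replacement. Default True. When False, samples are drawn without replacement.
bootstrap_features: whether features are drawn with replacement. Default False.
oob_score: whether to score the model with out-of-bag samples. Default False. When True, the out-of-bag score is computed as an estimate of generalization performance.
n_jobs: number of parallel jobs. Default None, which means 1 (no parallelism). -1 uses all available processors.
random_state: seed for the random number generator, for reproducible results. Default None.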

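2) Out-of-bag performance

① Enable out-of-bag scoring: set oob_score=True when creating the BaggingRegressor instance, so that the model computes the out-of-bag score during training.
② .oob_prediction_ (available only when oob_score=True) returns the out-of-bag (OOB) predictions: for each training sample, the prediction from the base learners whose bootstrap sample did not contain it.
③ mean_squared_error() computes the OOB mean squared error (MSE).
④ .oob_score_ returns the OOB score. For regression this is the R² (coefficient of determination) of the OOB predictions, i.e. the goodness of fit on the out-of-bag samples.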

# OOB predictions
pred_oob = model.oob_prediction_
# OOB mean squared error
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_train, pred_oob))
# R² of the OOB predictions
model.oob_score_

Output: 11.323470116238905
0.8605296124474909, which is lower than the model's R² on the test set (0.905)
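The OOB MSE and the OOB R² are two summaries of the same OOB predictions, linked by R² = 1 - SSE/SST. The sketch below shows the relationship on four made-up numbers (not the Boston data); the helper names `mse` and `r2` are hypothetical.

```python
def mse(y, pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)

def r2(y, pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean = sum(y) / len(y)
    sst = sum((a - mean) ** 2 for a in y)
    sse = sum((a - b) ** 2 for a, b in zip(y, pred))
    return 1 - sse / sst

y_demo = [1.0, 2.0, 3.0, 4.0]       # hypothetical responses
pred_demo = [1.5, 2.0, 3.0, 3.5]    # hypothetical OOB predictions
print(mse(y_demo, pred_demo), r2(y_demo, pred_demo))
```

A lower MSE therefore always means a higher R² on the same sample, so either quantity can be used to compare models on the OOB predictions.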

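`Note: attributes of BaggingRegressor`

oob_score_: R² of the out-of-bag predictions (available only when oob_score=True).
oob_prediction_: out-of-bag predictions for the training samples (available only when oob_score=True).
estimators_: the fitted base learners. Type: list
estimators_samples_: the samples drawn for each base learner.
n_features_in_: number of features seen during fit. Type: int
Note that BaggingRegressor has no feature_importances_ attribute; that attribute belongs to tree-based estimators such as RandomForestRegressor.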

3) Comparison: linear regression

# Comparison: R² of linear regression
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X_train, y_train)
model.score(X_test, y_test)

Output: 0.7836295385076302, which is lower than the R² of BaggingRegressor

4) Effect of the number of trees (n_estimators) on the OOB error

oob_errors = []  # stores the OOB mean squared error (MSE) of each model
for n_estimators in range(100,301):
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=123),
        n_estimators=n_estimators,
        n_jobs=-1,  # use all CPU cores
        oob_score=True,  # enable OOB scoring
        random_state=0)
    model.fit(X_train, y_train)
    pred_oob = model.oob_prediction_
    oob_errors.append(mean_squared_error(y_train, pred_oob))
# Line plot
import matplotlib.pyplot as plt
plt.plot(range(100, 301), oob_errors)
plt.xlabel('Number of Trees')
plt.ylabel('OOB MSE')
plt.title('Bagging OOB Errors')

[Figure: Bagging OOB Errors, OOB MSE versus number of trees]
As the figure shows, once the number of trees (B) exceeds 200 the OOB error is essentially stable, so increasing B further neither raises nor lowers it noticeably.

3. Random forest estimation (mtry=p/3)

1) For regression trees, mtry=p/3 is the usual choice

# Set mtry=p/3
max_features=int(X_train.shape[1] / 3)
max_features

Output: 4

# Fit the random forest
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(
    n_estimators=5000, 
    max_features=max_features, 
    random_state=0)
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

Output: 0.8959042792611034, which is lower than the test-set R² of the BaggingRegressor model (0.905)

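`Note: RandomForestRegressor()`

RandomForestRegressor is an ensemble method in scikit-learn for regression problems. It trains many decision trees and combines their predictions to improve accuracy and stability.

# Basic syntax and parameters
from sklearn.ensemble import RandomForestRegressor

RandomForestRegressor(
    n_estimators=100,  # number of base learners (trees), default 100
    criterion='squared_error',  # split quality measure, default 'squared_error'; other options include 'absolute_error', 'friedman_mse', 'poisson'
    max_depth=None,  # maximum tree depth, default None
    min_samples_split=2,  # minimum number of samples needed to split a node, default 2
    min_samples_leaf=1,  # minimum number of samples in a leaf, default 1
    min_weight_fraction_leaf=0.0, 
    max_features=1.0,  # number of features considered at each split; default 1.0 (all features) in recent scikit-learn versions, where 'auto' has been removed; options: 'sqrt', 'log2', an int or a float
    max_leaf_nodes=None,
    min_impurity_decrease=0.0, 
    bootstrap=True,  # train each tree on a bootstrap sample (drawn with replacement)? default True
    oob_score=False,  # estimate performance with out-of-bag samples? default False
    n_jobs=None,  # number of parallel jobs; -1 uses all processors, None uses one
    random_state=None, 
    verbose=0,  # verbosity of the output, default 0
    warm_start=False)  # reuse the previous fit and add more trees to it? default False

Basic attributes

oob_score_: R² of the out-of-bag predictions (available only when oob_score=True).
estimators_: all fitted decision trees. Type: list
n_features_in_: number of features seen during fit. Type: int
feature_importances_: the feature importances.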

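2) Predictions

# Predict on the test set
pred = model_rf.predict(X_test)

# Scatter plot: predicted versus actual values
plt.scatter(pred, y_test, alpha=0.6)
w = np.linspace(min(pred), max(pred), 100)
plt.plot(w, w)
plt.xlabel('pred')
plt.ylabel('y_test')
plt.title('Random Forest Prediction')

[Figure: Random Forest Prediction, y_test versus pred with the 45-degree line]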

3) Variable importance

# Variable importance
model_rf.feature_importances_

Output: array([0.05973776, 0.00710239, 0.06430075, 0.00453907, 0.06608221, 0.24163514, 0.04537915, 0.07261151, 0.00782096, 0.02861349, 0.06775422, 0.02357072, 0.31085264])

# Sort the variables by importance
sorted_index = model_rf.feature_importances_.argsort()
# Bar chart of variable importance
plt.barh(range(X.shape[1]), model_rf.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')
plt.tight_layout()

[Figure: Random Forest, feature importance bar chart]

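4) Partial dependence plots

Draw the partial dependence plots of the two most important variables, LSTAT and RM.

# Partial dependence plots (use the fitted random forest model_rf;
# at this point `model` refers to the last bagging model from the loop above)
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(model_rf, X, ['LSTAT', 'RM'])

[Figure: partial dependence plots for LSTAT and RM]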

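`Note: PartialDependenceDisplay()`

PartialDependenceDisplay is the scikit-learn tool for visualizing a model's partial dependence. A partial dependence plot shows the effect of a feature on the prediction after averaging over the values of the other features.
PartialDependenceDisplay.from_estimator() is the class method that draws partial dependence plots from a fitted model.

# Basic syntax and parameters (abbreviated; only the most common keyword arguments are listed)
from sklearn.inspection import PartialDependenceDisplay
# Draw the plot
PartialDependenceDisplay.from_estimator(
    estimator,  # fitted model instance (e.g. RandomForestRegressor or GradientBoostingClassifier)
    X,  # feature data used to compute the partial dependence
    features,  # list of feature indices or names to plot
    *,  # everything below is optional and keyword-only
    target=None,  # for multiclass problems, the class label to plot; for multi-output regression, the output index. Default None
    ax=None,  # Matplotlib axes object
    grid_resolution=100,  # number of grid points per feature, default 100
    percentiles=(0.05, 0.95))  # lower and upper percentiles that bound the plotted feature range, default (0.05, 0.95)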

4. Choosing the best mtry: test-set R²

1) for loop: best R²

scores = []
for max_features in range(1, X.shape[1] + 1):
    model = RandomForestRegressor(
        max_features=max_features,
        n_estimators=500, 
        random_state=123)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score) 
index = np.argmax(scores)
range(1, X.shape[1] + 1)[index]

Output: 9

2) Plot: test-set R² versus mtry

# Plot: test-set R² versus mtry
best_max_features = range(1, X.shape[1] + 1)[index]
best_score=max(scores)
plt.plot(range(1, X.shape[1] + 1), scores, 'o-')
plt.axvline(range(1, X.shape[1] + 1)[index], linestyle='--', color='k', linewidth=1)
plt.axhline(max(scores), linestyle='--', color='r', linewidth=1)

# Mark the intersection
plt.scatter(best_max_features, best_score, color='b', zorder=5)
plt.text(best_max_features, best_score,
         f'({best_max_features},  {best_score:.6f})',
         horizontalalignment='right', 
         verticalalignment='bottom', 
         fontsize=11, color='b')

plt.xlabel('max_features')
plt.ylabel('R2')
plt.title('Choose max_features via Test Set')

[Figure: Choose max_features via Test Set, R2 versus max_features]

3) Test-set R² in detail

print(scores)

Output: [0.8437131192861476, 0.8739752162372, 0.8933972149960787, 0.8945791364193233, 0.9023805203648395, 0.9064183313773803, 0.9047043224433458, 0.9052402402199933, 0.9095729082813647, 0.9065784479617953, 0.9059220121673773, 0.9049744409525852, 0.9046686863266287]

5. Comparing models: number of trees (n_estimators) versus test error, at the best mtry

1) Random forest

scores_rf = []
for n_estimators in range(1, 301):
    model = RandomForestRegressor(
        max_features=9,
        n_estimators=n_estimators, 
        random_state=123)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores_rf.append(mse)

2）Bagging

scores_bag = []
for n_estimators in range(1, 301):
    model = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=123), 
        n_estimators=n_estimators, 
        random_state=0)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    scores_bag.append(mse)

3) Single decision tree

from sklearn.model_selection import KFold, GridSearchCV

# Candidate ccp_alpha values from the cost-complexity pruning path
model = DecisionTreeRegressor()
path = model.cost_complexity_pruning_path(X_train, y_train)
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(
    DecisionTreeRegressor(random_state=123), 
    param_grid, cv=kfold, 
    scoring='neg_mean_squared_error')
model.fit(X_train, y_train)
# score() returns the negative MSE here, so flip the sign
score_tree = -model.score(X_test, y_test)
scores_tree = [score_tree for i in range(1, 301)]

4) Plot the test error of the three algorithms

plt.plot(range(1, 301), scores_tree, 'k--', label='Single Tree')
plt.plot(range(1, 301), scores_bag, 'r-', label='Bagging')
plt.plot(range(1, 301), scores_rf, 'g-', label='Random Forest')
plt.xlabel('Number of Trees')
plt.ylabel('MSE')
plt.title('Test Error')
plt.legend()

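[Figure: Test Error, MSE versus number of trees for Single Tree, Bagging and Random Forest]
Random forest and bagging are close to each other, and both are below the best single tree.

6. Choosing the best mtry: 10-fold cross-validation, minimum MSE

1) 10-fold cross-validation, minimum MSE

max_features = range(1, X.shape[1] + 1)                   
param_grid = {'max_features': max_features }
# 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(
    RandomForestRegressor(n_estimators=300, random_state=123), 
    param_grid, cv=kfold, 
    scoring='neg_mean_squared_error', 
    return_train_score=True)
 
model.fit(X_train, y_train)
# Best mtry
model.best_params_

Output: {'max_features': 5}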

2) Plot: MSE versus mtry

# Extract the cross-validated MSE
cv_mse = -model.cv_results_['mean_test_score']
# Line plot
plt.plot(max_features, cv_mse, 'o-')
plt.axvline(max_features[np.argmin(cv_mse)], linestyle='--', color='k', linewidth=1)
plt.axhline(min(cv_mse), linewidth=1, linestyle='--', color='r')

# Mark the intersection
best_max_features = max_features[np.argmin(cv_mse)]
min_mse=min(cv_mse)
plt.scatter(best_max_features, min_mse, color='b', zorder=5)
plt.text(best_max_features, min_mse,
         f'({best_max_features},  {min_mse:.6f})', 
         horizontalalignment='right', 
         verticalalignment='bottom', fontsize=11, color='b')

plt.xlabel('max_features')
plt.ylabel('MSE')
plt.title('CV Error for Random Forest')

[Figure: CV Error for Random Forest, MSE versus max_features]

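III. Classification: Random Forest Case Study

This section uses the sonar data from the UCI Machine Learning Repository, Sonar.csv. The response variable Class indicates whether the sonar echo came from a metal cylinder (M) or a rock (R).
There are 60 features (V1-V60).

1. Load and prepare the data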

import numpy as np
import pandas as pd

# Path of the CSV file
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\Sonar.csv'
Sonar = pd.read_csv(csv_path)
Sonar.Class.value_counts()

Output:
Class
M 111
R 97
Name: count, dtype: int64

# Extract X and y
X = Sonar.iloc[:, :-1]
y = Sonar.iloc[:, -1]
# Heatmap of the correlations between the features
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(X.corr(), cmap='Blues')
plt.title('Correlation Matrix')
plt.tight_layout()

[Figure: Correlation Matrix heatmap of the 60 features]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=50, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

Output: ((158, 60), (50, 60), (158,), (50,))

2. Single decision tree

# Single decision tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
path = model.cost_complexity_pruning_path(
    X_train, y_train)

from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV
param_grid = {'ccp_alpha': path.ccp_alphas}
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model_tree = GridSearchCV(
    DecisionTreeClassifier(random_state=123), 
    param_grid, cv=kfold)
model_tree.fit(X_train, y_train)   
model_tree.score(X_test, y_test)

Output: 0.74

3. Comparison: logistic regression

from sklearn.linear_model import LogisticRegression
model_logit = LogisticRegression(
    C=1e10, max_iter=500)
model_logit.fit(X_train, y_train)   
model_logit.score(X_test, y_test)

Output: 0.7, which is lower than the single decision tree

4. Random forest with mtry = $\sqrt{p}$

from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(
    n_estimators=500, 
    max_features='sqrt', 
    random_state=123)
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

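Output: 0.78, which is higher than both the single decision tree and logistic regression

5. Choosing the best mtry: 10-fold cross-validation

# 1. Encode y as a 0/1 dummy variable (column R). This follows the book;
#    scikit-learn classifiers also accept string labels, so the step is optional.
y_train_dummy = pd.get_dummies(y_train)
y_train_dummy = y_train_dummy.iloc[:, 1]

# 2. Cross-validation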
param_grid = {'max_features': range(1, 11) }
kfold = StratifiedKFold(n_splits=10,shuffle=True,random_state=1)
model_cv = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=123), 
    param_grid, cv=kfold)
model_cv.fit(X_train, y_train_dummy)
 
model_cv.best_params_

Output: {'max_features': 8}

# Test-set accuracy at the best mtry
model_best = RandomForestClassifier(
    n_estimators=500, 
    max_features=8, 
    random_state=123)
model_best.fit(X_train, y_train)
model_best.score(X_test, y_test)

Output: 0.82, which is higher than with the rule-of-thumb mtry

6. Variable importance

sorted_index = model_best.feature_importances_.argsort()
plt.barh(range(X.shape[1]), model_best.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')

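[Figure: Random Forest, feature importance bar chart for V1-V60]

7. Partial dependence plot

Draw the partial dependence plot of the most important variable, V11.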

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(
    model_best,
    X_train,
    features=['V11'])

[Figure: partial dependence plot for V11]

8. Predictive performance

# Predict on the test set: confusion matrix
pred = model_best.predict(X_test)
table = pd.crosstab(
    y_test, 
    pred, 
    rownames=['Actual'], 
    colnames=['Predicted'])
table

This matches the result in Professor Chen Qiang's errata.

table = np.array(table)
Accuracy = (table[0, 0] + table[1, 1]) / np.sum(table)
Sensitivity = table[1, 1] / (table[1, 0] + table[1, 1])
Specificity = table[0, 0] / (table[0, 0] + table[0, 1])
# TP / (TP + FP) is precision; recall is the same quantity as sensitivity
Precision = table[1, 1] / (table[0, 1] + table[1, 1])

print(f'Accuracy: {Accuracy}')
print(f'Sensitivity: {Sensitivity}')
print(f'Specificity: {Specificity}')
print(f'Precision: {Precision}')

Output: Accuracy: 0.82
Sensitivity: 0.8260869565217391
Specificity: 0.8148148148148148
Precision: 0.7916666666666666

# Compute Cohen's kappa
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(y_test, pred)

Output: 0.6388443017656501
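The metrics above can be reproduced by hand from a 2x2 confusion matrix. The notes do not print the table, so the counts below are an assumption: they are the only counts for 50 test samples that are consistent with the four reported ratios. Class R is treated as the positive class, as in the code above, and the fourth reported ratio (0.7917) is TP / (TP + FP), which is precision.

```python
# Rows = actual (M, R), columns = predicted (M, R).
# Reconstructed from the reported metrics, not printed in the notes.
table = [[22, 5],
         [4, 19]]
n = sum(sum(row) for row in table)

tn, fp = table[0]
fn, tp = table[1]               # class R is the positive class
accuracy = (tp + tn) / n
sensitivity = tp / (tp + fn)    # also called recall
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

# Cohen's kappa: agreement beyond what the row and column totals alone imply
p_observed = accuracy
p_expected = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / n ** 2
kappa = (p_observed - p_expected) / (1 - p_expected)
print(accuracy, sensitivity, specificity, precision, kappa)
```

Kappa is lower than accuracy because about half of the agreement (p_expected is roughly 0.50) would be expected from the class proportions alone.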

# Plot the ROC curve
from sklearn.metrics import RocCurveDisplay

RocCurveDisplay.from_estimator(
    model_best, 
    X_test,
    y_test, 
    plot_chance_level=True)

plt.title('ROC Curve for Random Forest')

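[Figure: ROC Curve for Random Forest]
AUC=0.9 indicates good predictive performance.

9. Drawing the decision boundary: load_iris

'''Draw the decision boundary'''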
from sklearn.datasets import load_iris
X,y = load_iris(return_X_y=True)
X2 = X[:, 2:4]
 
model = RandomForestClassifier(
    n_estimators=500, 
    max_features=1, 
    random_state=1)
model.fit(X2,y)
model.score(X2,y)

Output: 0.9933333333333333

from mlxtend.plotting import plot_decision_regions
plot_decision_regions(X2, y, model)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.title('Decision Boundary for Random Forest')

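[Figure: Decision Boundary for Random Forest, petal_width versus petal_length]
The decision boundary is still not very smooth.

IV. Exercises

1. Random forest for a regression problem: concrete.csv

'''1. Load the data'''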
import numpy as np
import pandas as pd

# Path of the CSV file
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\concrete.csv'
concrete = pd.read_csv(csv_path)

# Extract the features and the target
X = concrete.drop(columns=['CompressiveStrength'])
y = concrete['CompressiveStrength']

# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=300, random_state=0)

'''2. Fit the random forest'''
from sklearn.ensemble import RandomForestRegressor
model_rf = RandomForestRegressor(
    n_estimators=100, 
    max_features=3, 
    random_state=123)
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

Output: 0.9026461454417134

'''3. Variable importance plot'''
# Variable importance
model_rf.feature_importances_
# Sort the variables by importance
sorted_index = model_rf.feature_importances_.argsort()
# Bar chart of variable importance
import matplotlib.pyplot as plt
plt.barh(range(X.shape[1]), model_rf.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')
plt.tight_layout()

[Figure: Random Forest, feature importance bar chart]

'''4. Partial dependence plots for Age and Cement'''
from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(
    model_rf,
    X_train,
    features=['Age', 'Cement'])

[Figure: partial dependence plots for Age and Cement]

'''5. Predict on the test set and compute the mean squared error'''
# Predict on the test set
pred = model_rf.predict(X_test)
# Compute the mean squared error
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)

Output: 24.71151870998681

'''6. Find the best mtry via the test-set R², and plot it'''
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV

scores = []
for max_features in range(1, X.shape[1] + 1):
    model = RandomForestRegressor(
        max_features=max_features,
        n_estimators=500, 
        random_state=123)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    scores.append(score) 
index = np.argmax(scores)
best_max_features = range(1, X.shape[1] + 1)[index]
best_score = max(scores)

# Line plot
plt.plot(range(1, X.shape[1] + 1), scores, 'o-')
plt.axvline(best_max_features, linestyle='--', color='k', linewidth=1)
plt.axhline(max(scores), linewidth=1, linestyle='--', color='r')

# Mark the intersection
plt.scatter(best_max_features, best_score, color='b', zorder=5)
plt.text(best_max_features, 
         best_score,  f'({best_max_features},  {best_score:.6f})', 
         horizontalalignment='right', 
         verticalalignment='bottom', fontsize=11, color='b')

plt.xlabel('max_features')
plt.ylabel('R2')
plt.title('Choose max_features via Test Set')

[Figure: Choose max_features via Test Set, R2 versus max_features]

'''7. Find the best mtry via 10-fold cross-validation, and plot it'''
max_features = range(1, X.shape[1] + 1)                   
param_grid = {'max_features': max_features }

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = GridSearchCV(
    RandomForestRegressor(n_estimators=500, random_state=123), 
    param_grid, cv=kfold, 
    scoring='neg_mean_squared_error', 
    return_train_score=True)
 
model.fit(X_train, y_train)
# Best mtry
model.best_params_
# Cross-validated MSE for every candidate mtry
cv_mse = -model.cv_results_['mean_test_score']
# Line plot
import matplotlib.pyplot as plt
plt.plot(max_features, cv_mse, 'o-')
plt.axvline(max_features[np.argmin(cv_mse)], linestyle='--', color='k', linewidth=1)
plt.axhline(min(cv_mse), linewidth=1, linestyle='--', color='r')

# Mark the intersection
max_features_mse = max_features[np.argmin(cv_mse)]
min_mse = min(cv_mse)
plt.scatter(max_features_mse, min_mse, color='b', zorder=5)
plt.text(max_features_mse, 
         min_mse,  f'({max_features_mse},  {min_mse:.6f})', 
         horizontalalignment='right', 
         verticalalignment='bottom', fontsize=11, color='b')

plt.xlabel('max_features')
plt.ylabel('MSE')
plt.title('CV Error for Random Forest')

[Figure: CV Error for Random Forest, MSE versus max_features]

2. Random forest for a classification problem: Mushroom.csv

'''1. Load the data'''
import numpy as np
import pandas as pd

# Path of the CSV file
csv_path = r'D:\桌面文件\Python\【陈强-机器学习】MLPython-PPT-PDF\MLPython_Data\Mushroom.csv'
Mushroom = pd.read_csv(csv_path)

# Extract the features and the target
X = Mushroom.drop(columns=['Class'])
y = Mushroom['Class']

# Convert the categorical columns of X into dummy variables
X = pd.get_dummies(X, drop_first=True)

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=1000, random_state=1)

'''2. Fit the random forest'''
from sklearn.ensemble import RandomForestClassifier
model_rf = RandomForestClassifier(
    n_estimators=1000, 
    max_features='sqrt', 
    random_state=123)
model_rf.fit(X_train, y_train)
model_rf.score(X_test, y_test)

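Output: 1.0 !!! This is the one!!! For days now, whatever I try gives 1.0!!! Who on earth gets a test score of 1.0?! (For a classifier, score() is accuracy, not R2.)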
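A test accuracy of 1.0 is not necessarily a preprocessing error. As far as I know, the UCI Mushroom data are widely reported to be almost perfectly separable by a few categorical features, so a perfect score is plausible there. One hedged way to check is a "one-rule" baseline: for each feature, predict the majority class within each category and see how far a single feature gets. The sketch below uses made-up toy values, not the real Mushroom.csv; the function name and the columns `odor` and `shape` are hypothetical.

```python
from collections import Counter, defaultdict

def one_rule_accuracy(feature_values, labels):
    """Accuracy of predicting, for each category, its majority class."""
    by_category = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        by_category[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_category.values())
    return correct / len(labels)

# Hypothetical toy data (NOT the real Mushroom.csv values)
labels = ["p", "p", "e", "e", "e", "p"]
odor   = ["f", "f", "n", "a", "n", "f"]   # determines the class exactly here
shape  = ["x", "b", "x", "b", "x", "b"]   # only weakly related to the class

print(one_rule_accuracy(odor, labels), one_rule_accuracy(shape, labels))
```

If some single feature already scores close to 1.0 on the real data, a random forest reaching 1.0 on the test set is believable. If no feature comes close, it is worth checking whether the response leaked into X.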

'''3. Variable importance plot'''
# Variable importance
sorted_index = model_rf.feature_importances_.argsort()

import matplotlib.pyplot as plt
plt.barh(range(X.shape[1]), model_rf.feature_importances_[sorted_index])
plt.yticks(np.arange(X.shape[1]), X.columns[sorted_index])
plt.xlabel('Feature Importance')
plt.ylabel('Feature')
plt.title('Random Forest')

[Figure: Random Forest, feature importance bar chart]

'''4. Predict on the test set: confusion matrix'''
pred = model_rf.predict(X_test)
table = pd.crosstab(
    y_test, 
    pred, 
    rownames=['Actual'], 
    colnames=['Predicted'])
table