医学数据分析实训 项目七 集成学习--空气质量指标--天气质量分析和预测

news2024/11/15 7:01:41

项目七:集成学习

实践目的
  1. 理解集成学习算法原理;
  2. 熟悉并掌握常用集成学习算法的使用方法;
  3. 熟悉模型性能评估的方法;
  4. 掌握模型优化的方法。
实践平台
  • 操作系统:Windows7及以上
  • Python版本:3.8.x及以上
  • 集成开发环境:PyCharm或Anoconda
实践内容

数据集文件名为“aqi.csv”,包含了2020年全国空气质量数据,该数据集主要记录了2020年1月至2020年9月的空气质量指标,包括日期、AQI、质量等级、PM2.5含量(ppm)、PM10含量(ppm)、SO2含量(ppm)、CO含量(ppm)、NO2含量(ppm)、O3_8h含量(ppm)等字段。

本项目实践所涉及的业务为天气质量分析和预测。将数据分为训练集和测试集,通过集成学习建立算法模型预测AQI值和质量等级。

(一)数据理解及准备
  1. 导入本案例所需的Python包;
  2. 通过describe()、info()方法、shape属性等对读入的数据对象进行探索性分析。
  3. 结合实际数据情况,对数据集进行适当的预处理;
  4. 提取用于数据分析的特征,并划分训练集和测试集。
(二)模型建立、预测及优化
任务一:随机森林
  1. 回归模型

    • 通过RandomForestRegressor()方法建立模型并训练;
    • 使用该模型预测AQI值;
    • 使用评价指标对模型进行评价,包括平方绝对误差(MAE)、均方误差(MSE)、均方根误差(RMSE)、平方绝对百分比误差(MAPE)、回归系数score;
    • 使用GridSearchCV网格搜索函数对模型进行优化,并通过best_params_属性返回性能最好的参数组合;
    • 根据以上参数对模型进行优化,并输出新模型的平方绝对误差(MAE)、均方误差(MSE)、均方根误差(RMSE)、平方绝对百分比误差(MAPE)、回归系数score评价指标,与优化前的指标进行对比;
    • 使用feature_importances_属性输出模型每个特征的重要度,并按重要程度进行排序;
    • 使用优化后的模型进行预测,并输出预测结果;
    • 可视化展示预测值和测试值的对比情况。
  2. 分类模型

    • 通过RandomForestClassifier()方法建立模型并训练;
    • 使用该模型预测空气质量等级;
    • 使用confusion_matrix()、accuracy_scorer()、precision_score()、recall_score()、f1_score()方法分别对模型的混淆矩阵、准确率、精确率、召回率、f1值指标进行评价,并输出评价结果;
    • 如评价结果不理想需对模型进行优化。
任务二:梯度提升机 (GBM)
  1. 回归模型

    • 通过GradientBoostingRegressor()方法建立模型并训练;
    • 使用该模型预测AQI值;
    • 使用评价指标对模型进行评价,包括平方绝对误差(MAE)、均方误差(MSE)、均方根误差(RMSE)、平方绝对百分比误差(MAPE)、回归系数score;
    • 使用GridSearchCV网格搜索函数对模型进行优化,并通过best_params_属性返回性能最好的参数组合;
    • 根据以上参数对模型进行优化,并输出新模型的平方绝对误差(MAE)、均方误差(MSE)、均方根误差(RMSE)、平方绝对百分比误差(MAPE)、回归系数score评价指标,与优化前的指标进行对比;
    • 使用feature_importances_属性输出模型每个特征的重要度,并按重要程度进行排序;
    • 使用优化后的模型进行预测,并输出预测结果;
    • 可视化展示预测值和测试值的对比情况。
  2. 分类模型

    • 通过GradientBoostingClassifier()方法建立模型并训练;
    • 使用该模型预测空气质量等级;
    • 使用confusion_matrix()、accuracy_scorer()、precision_score()、recall_score()、f1_score()方法分别对模型的混淆矩阵、准确率、精确率、召回率、f1值指标进行评价,并输出评价结果;
    • 如评价结果不理想需对模型进行优化。
任务三:轻量级梯度提升机 (LightGBM)
  1. 回归模型

    • 通过LGBMRegressor()方法建立模型并训练;
    • 使用该模型预测AQI值;
    • 使用评价指标对模型进行评价,并输出评价结果;
    • 如评价结果不理想需对模型进行优化。
  2. 分类模型

    • 通过LGBMClassifier()方法建立模型并训练;
    • 使用该模型预测空气质量等级;
    • 使用评价指标对模型进行评价,并输出评价结果;
    • 如评价结果不理想需对模型进行优化。

(一)数据理解及准备

# 导入必要的库
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, r2_score, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt
import lightgbm as lgb

# 读取数据
data = pd.read_csv('output/modified_data.csv')


# 显示数据基本信息
print("数据信息:")
print(data.info())
print("\n数据描述:")
print(data.describe())
print("\n数据形状:", data.shape)
# 检查并处理缺失值
if data.isnull().sum().sum() > 0:
    # 可以选择填充缺失值或删除含有缺失值的行
    # 这里简单地用列的平均值填充
    data.fillna(data.mean(), inplace=True)

# 转换日期格式
data['Date'] = pd.to_datetime(data['Date'])
print(data.head)

# 特征提取
features = ['PM2_5_(ppm)', 'PM10_(ppm)', 'SO2_(ppm)', 'CO_(ppm)', 'NO2_(ppm)', 'O3_8h_(ppm)']
target_aqi = 'AQI'
target_quality = 'Quality_Level'

# 划分训练集和测试集
X = data[features]
y_aqi = data[target_aqi]
y_quality = data[target_quality]

X_train, X_test, y_aqi_train, y_aqi_test, y_quality_train, y_quality_test = train_test_split(X, y_aqi, y_quality, test_size=0.2, random_state=42)

(二)模型建立、预测及优化

任务一:随机森林
# 1 建立随机森林回归模型 训练模型
rf_reg = RandomForestRegressor(random_state=42)
rf_reg.fit(X_train, y_aqi_train)

# 2 预测 AQI 值
y_aqi_pred = rf_reg.predict(X_test)
print('随机森林回归模型预测 AQI 值:', y_aqi_pred)
# 3 计算评估指标
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("随机森林回归模型评价指标:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')

随机森林回归模型预测 AQI 值: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
随机森林回归模型评价指标:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552

# 4. 使用 GridSearchCV 网格搜索函数对模型进行优化
# 定义参数网格
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 创建 GridSearchCV 对象
grid_search = GridSearchCV(estimator=rf_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# 进行网格搜索
grid_search.fit(X_train, y_aqi_train)
# 获取最佳参数组合
best_params = grid_search.best_params_
print(f'最佳参数组合: {best_params}')

最佳参数组合: {‘max_depth’: 20, ‘min_samples_leaf’: 1, ‘min_samples_split’: 2, ‘n_estimators’: 100}

# 5. 根据最佳参数重新训练模型
best_rf_reg = RandomForestRegressor(**best_params, random_state=42)
best_rf_reg.fit(X_train, y_aqi_train)

# 预测并评价优化后的模型
y_aqi_pred_optimized = best_rf_reg.predict(X_test)
print('优化后的随机森林回归模型预测 AQI 值:', y_aqi_pred_optimized)
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)
    
print("优化后的随机森林回归模型评价指标:")
print(f'MAE: {mae_optimized}, \nMSE: {mse_optimized}, \nRMSE: {rmse_optimized}, \nMAPE: {mape_optimized}, \nR2_SCORE: {r2_optimized}')


优化后的随机森林回归模型预测 AQI 值: [124.48 81.15 72.38 71.58 45.77 82.31 34.6 31.42 80.58 83.7
47.59 74.32 75.95 47.13 39.53 33.33 45.76 75.11 80.87 42.57
39.87 58.52 44.34 45.51 60.06 40.73 51.15 45.06 51.46 43.2
70.71 37.2 127.29 31.26 86.79 43.56 90.83 66. 111.21 80.26
33.47 53.14 47.4 130.66 73.89 47.37 47.58 47.16 66.56 39.78
44.36 115.1 105.81 110.77 74.06]
优化后的随机森林回归模型评价指标:
MAE: 3.0150909090909086,
MSE: 25.965259999999997,
RMSE: 5.095611837650116,
MAPE: 0.05528859263399927,
R2_SCORE: 0.965415994168552

# 比较优化前后的指标
print("优化前后指标对比:")
print(f"优化前: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"优化后: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

优化前后指标对比:
优化前: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552
优化后: MAE: 3.0150909090909086, MSE: 25.965259999999997, RMSE: 5.095611837650116, MAPE: 0.05528859263399927, R2_SCORE: 0.965415994168552

未优化成功

# 6. 使用feature_importances_属性输出模型每个特征的重要度 
# 特征重要度
importances = best_rf_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
# 7. 输出预测结果
print(feature_importances)

# 8. 可视化展示预测值和测试值的对比情况
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.400419
PM2_5_(ppm) 0.291729
O3_8h_(ppm) 0.288429
CO_(ppm) 0.008184
NO2_(ppm) 0.007934
SO2_(ppm) 0.003305
dtype: float64

在这里插入图片描述

任务二:GBM
回归模型
# 1. 通过 GradientBoostingRegressor()方法建立模型并训练
gb_reg = GradientBoostingRegressor(random_state=42)
gb_reg.fit(X_train, y_aqi_train)
# 2. 使用该模型预测 AQI 值
y_aqi_pred = gb_reg.predict(X_test)
print('GBM回归模型预测 AQI 值:', y_aqi_pred)

# 3. 使用评价指标对模型进行评价
mae = mean_absolute_error(y_aqi_test, y_aqi_pred)
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
rmse = np.sqrt(mse)
mape = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)
print("Gradient Boosting Regression Model Evaluation Metrics:")
print(f'MAE: {mae}, \nMSE: {mse}, \nRMSE: {rmse}, \nMAPE: {mape}, \nR2_SCORE: {r2}')


GBM回归模型预测 AQI 值: [122.57416247 83.36233285 73.90280417 71.61249735 45.90407098
83.09407824 35.38809475 32.1115523 81.92797541 83.40916295
48.82405535 74.28270394 74.96495747 45.69629863 39.59354642
33.09971192 45.41896268 75.52727318 81.71507209 47.02496198
41.96486507 59.76085878 45.10753769 46.1912337 59.05166283
49.05189862 53.29885368 47.58476507 46.59894793 42.17298408
70.67172663 35.57436497 130.76443134 33.12142879 85.93142525
41.04272972 88.25804535 64.42863259 112.47587802 80.12500147
32.96123373 55.09504267 50.37469809 125.99062665 75.72767345
48.10707457 51.29551088 47.94867709 70.66198919 40.51320902
40.7250176 115.95276244 114.3584965 112.04106305 74.86570745]
Gradient Boosting Regression Model Evaluation Metrics:
MAE: 3.017067274506405,
MSE: 19.567961603563685,
RMSE: 4.4235688763218874,
MAPE: 0.058486439950287586,
R2_SCORE: 0.9739367717401174

# 4. 使用 GridSearchCV 网格搜索函数对模型进行优化
# 定义参数网格
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# 创建 GridSearchCV 对象
grid_search = GridSearchCV(estimator=gb_reg, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error')
# 进行网格搜索
grid_search.fit(X_train, y_aqi_train)

# 获取最佳参数组合
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {‘learning_rate’: 0.1, ‘max_depth’: 5, ‘min_samples_leaf’: 1, ‘min_samples_split’: 10, ‘n_estimators’: 300}

# 5. 根据最佳参数重新训练模型
best_gb_reg = GradientBoostingRegressor(**best_params, random_state=42)
best_gb_reg.fit(X_train, y_aqi_train)

# 使用优化后的模型进行预测
y_aqi_pred_optimized = best_gb_reg.predict(X_test)
print('优化后的模型预测结果:', y_aqi_pred_optimized)
# 计算优化后的模型评估指标
mae_optimized = mean_absolute_error(y_aqi_test, y_aqi_pred_optimized)
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
rmse_optimized = np.sqrt(mse_optimized)
mape_optimized = mean_absolute_percentage_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)

print("优化梯度增强回归模型评价指标:")
print(f'Optimized MAE: {mae_optimized}, \nOptimized MSE: {mse_optimized}, \nOptimized RMSE: {rmse_optimized}, \nOptimized MAPE: {mape_optimized}, \nOptimized R2_SCORE: {r2_optimized}')


优化后的模型预测结果: [124.50273685 83.46470773 76.67313626 71.43908717 46.06087546
82.48270218 35.45781481 30.29347664 80.68483493 83.63975494
48.01910073 75.04391558 74.6780025 44.02048381 39.16875902
31.57326064 45.52152266 74.54621085 81.98742113 41.15229431
40.05067005 60.0349372 43.40693783 42.44777993 60.0874834
46.4533299 53.98613726 45.00781228 51.56679542 38.97574632
73.97473389 36.03646256 131.65412729 30.82872235 86.88627133
44.17166092 89.64827072 66.71578258 112.06193027 80.82544043
32.13607404 53.33558888 48.52689834 125.55765644 77.38396113
48.52990476 51.07272122 48.89955218 69.66154718 40.70715896
49.21862157 117.74301294 107.39395475 111.89285961 75.11097803]
优化梯度增强回归模型评价指标:
Optimized MAE: 2.7372135954667853,
Optimized MSE: 20.880137541908642,
Optimized RMSE: 4.569478913608054,
Optimized MAPE: 0.048316013543643156,
Optimized R2_SCORE: 0.9721890403365572

# 比较优化前后的指标
print("优化前后评价指标的比较:")
print(f"优化前: MAE: {mae}, MSE: {mse}, RMSE: {rmse}, MAPE: {mape}, R2_SCORE: {r2}")
print(f"优化后: MAE: {mae_optimized}, MSE: {mse_optimized}, RMSE: {rmse_optimized}, MAPE: {mape_optimized}, R2_SCORE: {r2_optimized}")

优化前后评价指标的比较:
优化前: MAE: 3.017067274506405, MSE: 19.567961603563685, RMSE: 4.4235688763218874, MAPE: 0.058486439950287586, R2_SCORE: 0.9739367717401174
优化后: MAE: 2.7372135954667853, MSE: 20.880137541908642, RMSE: 4.569478913608054, MAPE: 0.048316013543643156, R2_SCORE: 0.9721890403365572

# 6. 输出特征重要性
importances = best_gb_reg.feature_importances_
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
print(feature_importances)

# 可视化预测值和测试值的对比
plt.figure(figsize=(10, 6))
plt.scatter(y_aqi_test, y_aqi_pred_optimized, alpha=0.5)
plt.xlabel('Actual AQI')
plt.ylabel('Predicted AQI')
plt.title('Actual vs Predicted AQI')
plt.show()

PM10_(ppm) 0.422780
O3_8h_(ppm) 0.303769
PM2_5_(ppm) 0.265263
NO2_(ppm) 0.006753
SO2_(ppm) 0.001192
CO_(ppm) 0.000243
dtype: float64

在这里插入图片描述

分类模型
# 1 建立模型并训练
gbm_clf = GradientBoostingClassifier(random_state=42)
gbm_clf.fit(X_train, y_quality_train)

# 2 预测空气质量等级
y_quality_pred = gbm_clf.predict(X_test)
print('GBM分类模型预测结果:', y_quality_pred)
# 3 评价模型
conf_matrix = confusion_matrix(y_quality_test, y_quality_pred)
accuracy = accuracy_score(y_quality_test, y_quality_pred)
precision = precision_score(y_quality_test, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test, y_quality_pred, average='weighted')

print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}, \nPrecision: {precision}, \nRecall: {recall}, \nF1 Score: {f1}')

GBM分类模型预测结果: [‘C’ ‘B’ ‘B’ ‘B’ ‘A’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘A’ ‘B’ ‘B’ ‘A’ ‘A’ ‘A’ ‘A’ ‘B’
‘B’ ‘A’ ‘A’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘B’ ‘B’ ‘B’ ‘A’ ‘B’ ‘A’ ‘C’ ‘A’ ‘B’ ‘A’
‘B’ ‘B’ ‘C’ ‘B’ ‘A’ ‘B’ ‘B’ ‘C’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘A’ ‘B’ ‘C’ ‘C’ ‘C’
‘B’]
Confusion Matrix:
[[20 3 0]
[ 0 25 0]
[ 0 0 7]]
Accuracy: 0.9454545454545454,
Precision: 0.9512987012987013,
Recall: 0.9454545454545454,
F1 Score: 0.9450955363197574


# 4. 对模型进行优化
# 定义参数网格
param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 创建 StratifiedKFold 对象
# stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 创建 GridSearchCV 对象
grid_search = GridSearchCV(estimator=gbm_clf, param_grid=param_grid, cv=2, scoring='accuracy')
# 进行网格搜索
grid_search.fit(X_train, y_quality_train)

# 获取最佳参数组合
best_params = grid_search.best_params_
print(f'Best Parameters: {best_params}')

Best Parameters: {‘learning_rate’: 0.1, ‘max_depth’: 5, ‘min_samples_leaf’: 4, ‘min_samples_split’: 2, ‘n_estimators’: 200}

# 5. 根据最佳参数重新训练模型
best_gbm_clf = GradientBoostingClassifier(**best_params, random_state=42)
best_gbm_clf.fit(X_train, y_quality_train)

# 6. 使用优化后的模型进行预测
y_quality_pred_optimized = best_gbm_clf.predict(X_test)
print('优化后的模型预测空气API结果:', y_quality_pred_optimized)
# 7. 计算优化后的模型评估指标
conf_matrix_optimized = confusion_matrix(y_quality_test, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test, y_quality_pred_optimized, average='weighted')

print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}, \nOptimized Precision: {precision_optimized}, \nOptimized Recall: {recall_optimized}, \nOptimized F1 Score: {f1_optimized}')

# 比较前后的指标
print("优化前后评价指标的比较:\n")
print(f"优化前: Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}")
print(f"优化后: Accuracy: {accuracy_optimized}, Precision: {precision_optimized}, Recall: {recall_optimized}, F1 Score: {f1_optimized}")

优化后的模型预测空气API结果: [‘C’ ‘B’ ‘B’ ‘B’ ‘A’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘A’ ‘B’ ‘B’ ‘A’ ‘A’ ‘A’ ‘A’ ‘B’
‘B’ ‘A’ ‘A’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘B’ ‘A’ ‘A’ ‘A’ ‘B’ ‘A’ ‘C’ ‘A’ ‘B’ ‘A’
‘B’ ‘B’ ‘C’ ‘B’ ‘A’ ‘B’ ‘A’ ‘C’ ‘B’ ‘A’ ‘A’ ‘B’ ‘B’ ‘A’ ‘B’ ‘C’ ‘C’ ‘C’
‘B’]
Optimized Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 0 7]]
Optimized Accuracy: 0.9636363636363636,
Optimized Precision: 0.9636363636363636,
Optimized Recall: 0.9636363636363636,
Optimized F1 Score: 0.9636363636363636
优化前后评价指标的比较:

优化前: Accuracy: 0.9454545454545454, Precision: 0.9512987012987013, Recall: 0.9454545454545454, F1 Score: 0.9450955363197574
优化后: Accuracy: 0.9636363636363636, Precision: 0.9636363636363636, Recall: 0.9636363636363636, F1 Score: 0.9636363636363636

任务三:LIGHTGBM
# 1. 使用 LGBMRegressor() 方法建立回归模型并训练
# 1.1 使用 LGBMRegressor() 方法建立回归模型并训练
lgb_reg = lgb.LGBMRegressor(random_state=42)
lgb_reg.fit(X_train, y_aqi_train)

# 1.2 使用该模型预测 AQI 值
y_aqi_pred = lgb_reg.predict(X_test)

# 1.3 对模型进行评价
mse = mean_squared_error(y_aqi_test, y_aqi_pred)
r2 = r2_score(y_aqi_test, y_aqi_pred)

print("LGBMRegressor Model Evaluation Metrics:")
print(f'Mean Squared Error (MSE): {mse}')
print(f'R^2 Score: {r2}')

LGBMRegressor Model Evaluation Metrics:
Mean Squared Error (MSE): 70.6065599506794
R^2 Score: 0.9059567406190894

from sklearn.preprocessing import LabelEncoder

# 2. 使用 LGBMClassifier() 方法建立分类模型并训练
# 2.1 使用 LGBMClassifier() 方法建立分类模型并训练
# 将类别标签转换为整数
label_encoder = LabelEncoder()
y_quality_train_encoded = label_encoder.fit_transform(y_quality_train)
y_quality_test_encoded = label_encoder.transform(y_quality_test)

lgb_clf = lgb.LGBMClassifier(random_state=42)
lgb_clf.fit(X_train, y_quality_train_encoded)

# 2.2 使用该模型预测空气质量等级
y_quality_pred = lgb_clf.predict(X_test)
print('预测空气质量等级结果:', y_quality_pred)

# 2.3 对模型进行评价
conf_matrix = confusion_matrix(y_quality_test_encoded, y_quality_pred)
accuracy = accuracy_score(y_quality_test_encoded, y_quality_pred)
precision = precision_score(y_quality_test_encoded, y_quality_pred, average='weighted')
recall = recall_score(y_quality_test_encoded, y_quality_pred, average='weighted')
f1 = f1_score(y_quality_test_encoded, y_quality_pred, average='weighted')

print("LGBMClassifier Model Evaluation Metrics:")
print(f'Confusion Matrix:\n{conf_matrix}')
print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')
print(f'F1 Score: {f1}')

预测空气质量等级结果: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 1 2 1 2 1]
LGBMClassifier Model Evaluation Metrics:
Confusion Matrix:
[[22 1 0]
[ 1 24 0]
[ 0 1 6]]
Accuracy: 0.9454545454545454
Precision: 0.9468531468531469
Recall: 0.9454545454545454
F1 Score: 0.9452900041135335

# 3. 如评价结果不理想需对模型进行优化
# 3.1 定义参数网格
param_grid_reg = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}

param_grid_clf = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'num_leaves': [31, 63, 127],
    'max_depth': [-1, 5, 10],
    'min_child_samples': [20, 50, 100]
}

# 3.2 创建 GridSearchCV 对象
grid_search_reg = GridSearchCV(estimator=lgb_reg, param_grid=param_grid_reg, cv=5, scoring='neg_mean_squared_error')
grid_search_clf = GridSearchCV(estimator=lgb_clf, param_grid=param_grid_clf, cv=5, scoring='accuracy')

# 3.3 进行网格搜索
grid_search_reg.fit(X_train, y_aqi_train)
grid_search_clf.fit(X_train, y_quality_train_encoded)

# 3.4 获取最佳参数组合
best_params_reg = grid_search_reg.best_params_
best_params_clf = grid_search_clf.best_params_

print(f'Best Parameters for Regression: {best_params_reg}')
print(f'Best Parameters for Classification: {best_params_clf}')

Best Parameters for Regression: {‘learning_rate’: 0.1, ‘max_depth’: 5, ‘min_child_samples’: 20, ‘n_estimators’: 100, ‘num_leaves’: 31}
Best Parameters for Classification: {‘learning_rate’: 0.1, ‘max_depth’: 5, ‘min_child_samples’: 20, ‘n_estimators’: 100, ‘num_leaves’: 31}

# 3.5 根据最佳参数重新训练模型
best_lgb_reg = lgb.LGBMRegressor(**best_params_reg, random_state=42)
best_lgb_clf = lgb.LGBMClassifier(**best_params_clf, random_state=42)

best_lgb_reg.fit(X_train, y_aqi_train)
best_lgb_clf.fit(X_train, y_quality_train_encoded)

# 3.6 使用优化后的模型进行预测
y_aqi_pred_optimized = best_lgb_reg.predict(X_test)
y_quality_pred_optimized = best_lgb_clf.predict(X_test)
print('优化后的模型预测空气质量结果:', y_aqi_pred_optimized)
print('优化后的模型预测空气API结果:', y_quality_pred_optimized)

# 3.7 对优化后的模型进行评价
mse_optimized = mean_squared_error(y_aqi_test, y_aqi_pred_optimized)
r2_optimized = r2_score(y_aqi_test, y_aqi_pred_optimized)

conf_matrix_optimized = confusion_matrix(y_quality_test_encoded, y_quality_pred_optimized)
accuracy_optimized = accuracy_score(y_quality_test_encoded, y_quality_pred_optimized)
precision_optimized = precision_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
recall_optimized = recall_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')
f1_optimized = f1_score(y_quality_test_encoded, y_quality_pred_optimized, average='weighted')

print("Optimized LGBMRegressor Model Evaluation Metrics:")
print(f'Optimized Mean Squared Error (MSE): {mse_optimized}')
print(f'Optimized R^2 Score: {r2_optimized}')

print("Optimized LGBMClassifier Model Evaluation Metrics:")
print(f'Optimized Confusion Matrix:\n{conf_matrix_optimized}')
print(f'Optimized Accuracy: {accuracy_optimized}')
print(f'Optimized Precision: {precision_optimized}')
print(f'Optimized Recall: {recall_optimized}')
print(f'Optimized F1 Score: {f1_optimized}')

优化后的模型预测空气质量结果: [119.6722501 90.43625783 76.95094102 69.7134339 45.49522429
85.11016632 35.29020956 34.373013 76.12352252 86.39110431
48.60966258 71.83512479 73.98876859 44.13139587 40.82771554
34.96190592 45.33698962 73.97657317 84.40383692 48.74370587
42.31917891 54.61740284 43.89328402 50.84420449 61.99838848
44.00117867 54.84723723 47.00982841 47.98332788 49.8258541
62.28614705 36.04575205 113.75560249 34.31105093 88.98552298
45.43941569 106.8158533 62.86787307 111.01787045 82.98067324
34.80876636 65.3185259 50.05687814 115.46064086 84.07845619
49.74122766 52.93800566 54.78650467 54.40771277 40.1914266
36.17207261 107.11934225 97.1210987 100.3162011 74.79308805]
优化后的模型预测空气API结果: [2 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 0 2 0 1 0 1
1 2 1 0 1 0 2 1 0 0 1 1 0 0 2 1 2 1]
Optimized LGBMRegressor Model Evaluation Metrics:
Optimized Mean Squared Error (MSE): 75.10416813845606
Optimized R^2 Score: 0.8999662245297593
Optimized LGBMClassifier Model Evaluation Metrics:
Optimized Confusion Matrix:
[[22 1 0]
[ 2 23 0]
[ 0 1 6]]
Optimized Accuracy: 0.9272727272727272
Optimized Precision: 0.9287878787878787
Optimized Recall: 0.9272727272727272
Optimized F1 Score: 0.9271536973664632


# 比较优化前后的指标
print("Comparison of Evaluation Metrics Before and After Optimization:")
print(f"Regression: MSE: {mse} -> {mse_optimized}, R^2 Score: {r2} -> {r2_optimized}")
print(f"Classification: Accuracy: {accuracy} -> {accuracy_optimized}, Precision: {precision} -> {precision_optimized}, Recall: {recall} -> {recall_optimized}, F1 Score: {f1} -> {f1_optimized}")

Comparison of Evaluation Metrics Before and After Optimization:
Regression: MSE: 70.6065599506794 -> 75.10416813845606, R^2 Score: 0.9059567406190894 -> 0.8999662245297593
Classification: Accuracy: 0.9454545454545454 -> 0.9272727272727272, Precision: 0.9468531468531469 -> 0.9287878787878787, Recall: 0.9454545454545454 -> 0.9272727272727272, F1 Score: 0.9452900041135335 -> 0.9271536973664632

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2145363.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

ubuntu安装wordpress(基于LNMP环境)

参考链接 Ubuntu安装LNMP 安装步骤 环境需要LNMP环境,如果没有安装可以参考ZATA—LNMP简单安装 在mysql中设置wordpress所用的用户名和密码 #1. 登录mysql mysql -uroot -p #2. 创建wordpress数据库 create database wordpress; #3. 创建新用户user,…

【有啥问啥】深入解析 OpenAI o1 模型家族:推理能力的跃升与应用场景

深入解析 OpenAI o1 模型家族:推理能力的跃升与应用场景 随着人工智能的不断发展,推理能力已经成为影响 AI 系统性能的关键因素。2024 年 9 月 12 日【好家伙,在笔者生日当天ヘ(ー`ヘ)搞事情】,O…

腾讯百度阿里华为常见算法面试题TOP100(5):子串、堆

之前总结过字节跳动TOP50算法面试题: 字节跳动常见算法面试题top50整理_沉迷单车的追风少年-CSDN博客_字节算法面试题 子串 560.和为K的子数组

谷歌云推出全新区块链RPC服务:简化Web3开发

2024年9月,谷歌云(Google Cloud)宣布推出区块链RPC(远程过程调用)服务的预览版,进一步表明其支持Web3开发者的承诺。此次发布旨在简化开发者与区块链数据的交互,降低Web3应用开发的技术门槛。这…

制作U盘安装操作系统(启动盘、系统盘、Windows、Linux)

第一种(Windows) 官网windows制作启动盘 1. 打开Win11下载官网 下载 Windows 11https://www.microsoft.com/zh-cn/software-download/windows11 2. 下载制作操作系统工具 这里不要下载错了 3. 启动工具 选择U盘,选择你的U盘即可&#xf…

[Redis][环境配置]详细讲解

目录 1.安装 && 简单配置2.文件目录说明3.客户端 1.安装 && 简单配置 Ubuntu下,直接使用sudo apt install redis -y即可支持远程连接:修改/etc/redis/redis.conf 将bind 127.0.0.1改为bing 0.0.0.0作为学习用途,可以将prote…

vue3前端开发-小兔鲜超市-本地购物车列表页面的统计计算

vue3前端开发-小兔鲜超市-本地购物车列表页面的统计计算!这一次,实现了一些本地购物车列表页面的,简单的计算。 代码如下所示: import { computed, ref } from vue import { defineStore } from pinia export const useCartStor…

新升级|优化航拍/倾斜摄影模型好消息,支持处理多套贴图模型!

【天元轻量化软件】一直在不断地追求进步和完善,以满足更多用户的各种需求。 电脑登录天元官网免费体验:天元轻量化软件官网 本次我们对“智能PBR”功能进行了更新。更新后的“智能PBR”支持带多套贴图的模型进行使用。 本轮更新后,主要受益…

统信服务器操作系统【1050e版】安装手册

统信服务器操作系统1050e版本的安装 文章目录 功能概述一、准备环境二、安装方式介绍安装步骤步骤一:制作启动盘步骤二:系统的安装步骤三:安装引导界面步骤四:图形化界面安装步骤五:选择安装引导程序语言步骤六:进入安装界面步骤七:设置键盘步骤八:设置系统语言步骤九:…

音视频入门基础:AAC专题(8)——FFmpeg源码中计算AAC裸流AVStream的time_base的实现

一、引言 本文讲解FFmpeg源码对AAC裸流行解复用(解封装)时,其AVStream的time_base是怎样被计算出来的。 二、FFmpeg源码中计算AAC裸流AVStream的time_base的实现 FFmpeg对AAC裸流进行解复用(解封装)时,其…

Docker 镜像制作(Dockerfile)

1 Dockerfile 概念 Dockerfile 是什么? 镜像的定制实际上就是定制每一层所添加的配置、文件。如果我们可以把每一层修改、安装、构建、操作的命令都写入一个脚本,用这个脚本来构建、定制镜像,这个脚本就是 Dockerfile。 Dockerfile 是一个文本文件&a…

CVE-2024-2389 未经身份验证的命令注入

什么是 Progress Flowmon? Progress Flowmon 是一种网络监控和分析工具,可提供对网络流量、性能和安全性的全面洞察。Flowmon 将 Nette PHP 框架用于其 Web 应用程序。 未经身份验证的路由 我们开始在“AllowedModulesDecider.php”文件中枚举未经身份验证的端点,这是一个描…

superset 解决在 mac 电脑上发送 slack 通知的问题

参考文档: https://superset.apache.org/docs/configuration/alerts-reports/ 核心配置: FROM apache/superset:3.1.0USER rootRUN apt-get update && \apt-get install --no-install-recommends -y firefox-esrENV GECKODRIVER_VERSION0.29.0 RUN wget -q https://g…

边缘智能-大模型架构初探

R2Cloud接口 机器人注册 请求和应答 注册是一个简单的 HTTP 接口,根据机器人/用户信息注册,创建一个新机器人。 请求 URL URLhttp://ip/robot/regTypePOSTHTTP Version1.1Content-Typeapplication/json 请求参数 Param含义Rule是否必须缺省roboti…

活动系统开发之采用设计模式与非设计模式的区别-后台功能总结

1、数据库ER图 2、后台功能字段 题目功能字段 数据列表 编号题目名称选项数量状态 1启用0禁用创建时间修改时间保存 题目名称选项集 选项内容是否正确答案 1正确0错误启禁用删除素材图库功能字段 数据列表 编号原文件名称文件类型文件大小加密后文件名文件具体路径上传类型状态…

为您的任意模型赋能——RAG

随着大语言模型的参数规模越来越大,微调模型的代价越来越大,于是知识检索增强的方式成为越来越主流的选择。通过提前准备好的知识库,在模型进行推理之前进行知识检索作为上下文一同交给大模型进行推理,从而提升大模型对领域知识的…

kafka 一步步探究消费者组与分区分配策略

本期主要聊聊kafka消费者组与分区 消费者组 & 消费者 每个消费者都需要归属每个消费者组,每个分区只能被消费者组中一个消费者消费 上面这段话还不够直观,我们举个例子来说明。 订单系统 订单消息通过 order_topic 发送,该topic 有 5个分区 结算系…

Cursor免费 GPT-4 IDE 工具的保姆级使用教程

Cursor免费 GPT-4 IDE 工具的保姆级使用教程 简介 Cursor 是一款基于人工智能技术的代码生成工具。 它利用先进的自然语言处理和深度学习算法,可根据用户的输入或需求,自动生成高质量代码。 不管是初学者,还是资深开发者,Curs…

网络爬虫到底难在哪里?

如果你是自己做爬虫脚本开发,那确实难,因为你需要掌握Python、HTML、JS、xpath、database等技术,而且还要处理反爬、动态网页、逆向等情况,不然压根不知道怎么去写代码,这些技术和经验储备起码得要个三五年。 比如这几…

Qt5详细安装教程(包含导入pycharm)

1.自行下载Qt 2.双击进行安装 3.设置完成后勾选接受,跳转下一步 4.可选择安装位置,比较习惯安装在D盘 5.根据需求勾选对应组件安装 6.安装完成后,打开pycharm,进入settings—>选择ExternalTools,根据以下步骤进行配…