Machine Learning / Data Analysis: A Plain-Language Introduction to Random Forests, with a Weather Classification Example (Accuracy 0.92)

🍨 This post is a study log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
🍖 Original author: K同学啊

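Preface

Machine learning underpins both deep learning and data analysis. This series will cover common machine learning algorithms together with worked examples.
Note: machine learning is also widely used in mathematical modeling competitions, so it is worth learning alongside them.
Bookmarks, likes, and follows are welcome.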

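Contents

1. Introduction
- 1. Ensemble learning: Bagging
- 2. Introduction to random forests
2. Case study: classifying weather types
- 1. Loading the data
- 2. Data checks and preprocessing
- 3. Data analysis
- 4. Building the model
  - 1. Label encoding
  - 2. Model creation
- 5. Model prediction and evaluation
- 6. Feature importance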
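1. Introduction

1. Ensemble learning: Bagging

The core idea of Bagging (bootstrap aggregating): draw N random samples from the dataset, train one model on each sample, and combine the predictions of all N models by voting to get the final result.

[Figure: schematic of the Bagging ensemble]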

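Bootstrap sampling:

Bootstrap sampling draws samples with replacement. Given a dataset of m samples, pick one sample at random, put it back, and repeat, so the same sample can be picked again on a later draw. After m draws, roughly 63.2% of the original samples appear at least once in the drawn set (see the derivation that follows).

Derivation:

Each draw selects a given sample with probability 1/m, so the probability that this sample is never selected in m draws is:

$(1-\frac1m)^m$

As $m\to\infty$, this tends to $\frac1e\approx0.368$. Subtracting it from 1 gives the probability that a sample is selected at least once: $1-\frac1e\approx0.632$.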
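The 63.2% figure is easy to check numerically. The sketch below compares the closed-form value $1-(1-\frac1m)^m$ with a simulated bootstrap sample for a few values of m. The helper name `unique_fraction` and the chosen values of m are illustrative assumptions, not part of the original post.

```python
import math
import numpy as np

def unique_fraction(m: int, rng: np.random.Generator) -> float:
    """Fraction of the m original samples that appear at least once
    in one bootstrap sample of size m (hypothetical helper)."""
    drawn = rng.integers(0, m, size=m)  # m draws with replacement
    return np.unique(drawn).size / m

rng = np.random.default_rng(0)
for m in (10, 100, 10_000):
    analytic = 1 - (1 - 1 / m) ** m      # probability of being picked at least once
    simulated = unique_fraction(m, rng)  # one random bootstrap sample
    print(f"m={m:>6}: analytic={analytic:.4f}, simulated={simulated:.4f}")

limit = 1 - 1 / math.e  # value the analytic expression approaches as m grows
print(f"limit 1 - 1/e = {limit:.4f}")
```

For small m the simulated fraction is noisy, but for large m both columns should settle close to the limit.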

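2. Introduction to random forests

A random forest is an ensemble learning algorithm that combines many decision trees following the Bagging idea and takes a vote over the trees' predictions. It is well suited to complex classification problems.

[Figure: overall structure of a random forest]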
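To make the "bootstrap, train, vote" idea concrete, here is a minimal hand-rolled sketch built from scikit-learn decision trees. It is an illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the data is synthetic (`make_classification`), and the number of trees and the `max_features="sqrt"` setting are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Majority vote: row t of `votes` holds tree t's predictions for all test samples.
votes = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
bagged_pred = np.array([np.bincount(col, minlength=4).argmax() for col in votes.T])

single_acc = (trees[0].predict(X_test) == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree: {single_acc:.3f}, bagged vote: {bagged_acc:.3f}")
```

Note that scikit-learn's own random forest combines trees by averaging their predicted class probabilities rather than by a hard vote, so this sketch follows the description in the text more closely than the library does.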

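2. Case study: classifying weather types

Overview: this project uses a synthetic weather dataset that simulates four weather types: rainy, sunny, cloudy, and snowy.

Task: analyze the data and build a random forest model that predicts the weather type.

1. Loading the data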

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 

data = pd.read_csv('weather_classification_data.csv')
data

	Temperature	Humidity	Wind Speed	Precipitation (%)	Cloud Cover	Atmospheric Pressure	UV Index	Season	Visibility (km)	Location	Weather Type
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

# Rename the columns to Chinese names, in the same order as the original English headers
names = ['温度', '湿度', '风速', '降水量(%)', '云量', '气压', '紫外线指数', '季节' ,'能见度(km)', '地点', '天气类型']
data.columns = names
data

	温度	湿度	风速	降水量(%)	云量	气压	紫外线指数	季节	能见度(km)	地点	天气类型
0	14.0	73	9.5	82.0	partly cloudy	1010.82	2	Winter	3.5	inland	Rainy
1	39.0	96	8.5	71.0	partly cloudy	1011.43	7	Spring	10.0	inland	Cloudy
2	30.0	64	7.0	16.0	clear	1018.72	5	Spring	5.5	mountain	Sunny
3	38.0	83	1.5	82.0	clear	1026.25	7	Spring	1.0	coastal	Sunny
4	27.0	74	17.0	66.0	overcast	990.67	1	Winter	2.5	mountain	Rainy
...	...	...	...	...	...	...	...	...	...	...	...
13195	10.0	74	14.5	71.0	overcast	1003.15	1	Summer	1.0	mountain	Rainy
13196	-1.0	76	3.5	23.0	cloudy	1067.23	1	Winter	6.0	coastal	Snowy
13197	30.0	77	5.5	28.0	overcast	1012.69	3	Autumn	9.0	coastal	Cloudy
13198	3.0	76	10.0	94.0	overcast	984.27	0	Winter	2.0	inland	Snowy
13199	-5.0	38	0.0	92.0	overcast	1015.37	5	Autumn	10.0	mountain	Rainy

13200 rows × 11 columns

2. Data checks and preprocessing

# Check for missing values
data.isnull().sum()

温度         0
湿度         0
风速         0
降水量(%)     0
云量         0
气压         0
紫外线指数      0
季节         0
能见度(km)    0
地点         0
天气类型       0
dtype: int64

# Inspect column dtypes and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13200 entries, 0 to 13199
Data columns (total 11 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   温度       13200 non-null  float64
 1   湿度       13200 non-null  int64
 2   风速       13200 non-null  float64
 3   降水量(%)   13200 non-null  float64
 4   云量       13200 non-null  object
 5   气压       13200 non-null  float64
 6   紫外线指数    13200 non-null  int64
 7   季节       13200 non-null  object
 8   能见度(km)  13200 non-null  float64
 9   地点       13200 non-null  object
 10  天气类型     13200 non-null  object
dtypes: float64(5), int64(2), object(4)
memory usage: 1.1+ MB

# Print the unique values of each categorical column (云量, 季节, 地点, 天气类型)
columns = ['云量', '季节', '地点', '天气类型']
for i in columns:
    print(f'{i}')
    print(data[i].unique())
    print('*' * 50)

云量
['partly cloudy' 'clear' 'overcast' 'cloudy']
**************************************************
季节
['Winter' 'Spring' 'Summer' 'Autumn']
**************************************************
地点
['inland' 'mountain' 'coastal']
**************************************************
天气类型
['Rainy' 'Cloudy' 'Sunny' 'Snowy']
**************************************************

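Analysis:

云量 (cloud cover): four categories
季节 (season): four categories
地点 (location): three categories
天气类型 (weather type): four categories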

# Box plots for spotting outliers in the numeric columns
import seaborn as sns

# Font settings
plt.rcParams["font.sans-serif"] = ["SimHei"]  # render Chinese characters
plt.rcParams['axes.unicode_minus'] = False     # render the minus sign correctly

feature_map = {
    '温度': '温度',
    '湿度': '湿度百分比',
    '风速': '风速',
    '降水量(%)': '降水量百分比',
    '气压': '大气压力',
    '紫外线指数': '紫外线指数',
    '能见度(km)': '能见度'
}

plt.figure(figsize=(15, 10))

for i, (col, col_name) in enumerate(feature_map.items(), 1):     # start=1 because subplot indices are 1-based
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(f'{col_name}的箱线图', fontsize=14)
    plt.ylabel('数值', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)

plt.tight_layout()   # adjust subplot spacing automatically
plt.show()

[Figure: box plots of the seven numeric features]

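Outlier analysis

Temperature (温度): values above 60 are physically implausible; remove them
Humidity (湿度): some values exceed 100; remove them
Wind speed (风速): affected by many factors; left unchanged
Precipitation (降水量(%)): some values exceed 100; remove them
Atmospheric pressure (气压): affected by many factors, such as high altitude; left unchanged
Visibility (能见度(km)): may be affected by haze or the rainy season; left unchanged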
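The thresholds above (60 for temperature, 100 for the percentage columns) come from domain knowledge. The box plots themselves use a different rule: by default, seaborn draws points as outliers when they fall more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below computes those whisker bounds by hand. The toy values and the helper `iqr_bounds` are made up for illustration and are not taken from the weather dataset.

```python
import pandas as pd

# Toy values, invented for illustration only.
toy = pd.DataFrame({
    "Temperature": [12.0, 15.0, 18.0, 21.0, 24.0, 27.0, 30.0, 95.0],
    "Humidity": [40, 55, 60, 65, 70, 75, 80, 109],
})

def iqr_bounds(s: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Lower and upper limits of the k * IQR rule (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

for col in toy.columns:
    low, high = iqr_bounds(toy[col])
    flagged = toy.loc[(toy[col] < low) | (toy[col] > high), col]
    print(f"{col}: limits=({low:.2f}, {high:.2f}), flagged={flagged.tolist()}")
```

The IQR rule is purely statistical, which is why the post keeps the flagged wind speed, pressure, and visibility values: being far from the quartiles does not make a value physically impossible.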

# Share of rows flagged as outliers in each column
print('温度: ', data[data['温度'] > 60.0]['温度'].count() / data['温度'].count())
print('湿度: ', data[data['湿度'] > 100.0]['湿度'].count() / data['湿度'].count())
print('降水量(%): ', data[data['降水量(%)'] > 100.0]['降水量(%)'].count() / data['降水量(%)'].count())

温度:  0.015681818181818182
湿度:  0.03151515151515152
降水量(%):  0.029696969696969697

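Analysis:

Each group of outliers is a small share of the data (about 1.6% to 3.2% per column), so these rows are removed. In total the filter drops 840 of the 13,200 rows, about 6.4%.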

# Drop the rows that contain outliers
print(f'删除前数据维度{data.shape}')
data = data[(data['温度'] <= 60.0) & (data['湿度'] <= 100.0) & (data['降水量(%)'] <= 100.0)]
print(f'删除后数据维度{data.shape}')

删除前数据维度(13200, 11)
删除后数据维度(12360, 11)

3. Data analysis

# Summary statistics
data.describe()

	温度	湿度	风速	降水量(%)	气压	紫外线指数	能见度(km)
count	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000	12360.000000
mean	18.071359	66.937460	9.356837	50.864968	1005.713743	3.791262	5.535801
std	15.804363	19.390333	6.318334	30.967846	38.300471	3.720638	3.377554
min	-24.000000	20.000000	0.000000	0.000000	800.120000	0.000000	0.000000
25%	4.000000	56.000000	5.000000	19.000000	994.587500	1.000000	3.000000
50%	21.000000	69.000000	8.500000	54.000000	1007.495000	2.000000	5.000000
75%	30.000000	81.000000	13.000000	79.000000	1016.750000	6.000000	7.500000
max	60.000000	100.000000	48.500000	100.000000	1199.210000	14.000000	20.000000

# Plot every column to check whether the cleaned data looks reasonable
plt.figure(figsize=(20, 15))
plt.subplot(3, 4, 1)
sns.histplot(data['温度'], kde=True, bins=20)   # kde: overlay a kernel density curve; bins: number of histogram bars
plt.title('温度分布')
plt.xlabel('温度')
plt.ylabel('频率')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['湿度'])
plt.title('湿度百分比图')
plt.ylabel('湿度占比')

plt.subplot(3, 4, 3)
sns.histplot(data['风速'], kde=True, bins=20)
plt.title('风速分布')
plt.xlabel('风速(km/h)')
plt.ylabel('频率')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['降水量(%)'])
plt.title('降水量百分比箱线图')
plt.ylabel('降水量占比')

plt.subplot(3, 4, 5)
sns.countplot(x='云量', data=data)
plt.title('云量分布')
plt.xlabel('云量描述')
plt.ylabel('频率')

plt.subplot(3, 4, 6)
sns.histplot(data['气压'], kde=True, bins=20)
plt.title('气压分布')
plt.xlabel('气压')
plt.ylabel('频率')

plt.subplot(3, 4, 7)
sns.histplot(data['紫外线指数'], kde=True, bins=20)
plt.title('紫外线等级分布')
plt.xlabel('紫外线')
plt.ylabel('频率')

plt.subplot(3, 4, 8)
season_counts = data['季节'].value_counts()
plt.pie(season_counts, labels=season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('季节分布')

plt.subplot(3, 4, 9)
sns.histplot(data['能见度(km)'], kde=True, bins=20)
plt.title('能见度分布')
plt.xlabel('能见度(km)')
plt.ylabel('频率')

plt.subplot(3, 4, 10)
sns.countplot(x='地点', data=data)
plt.title('地点分布')
plt.xlabel('地点')
plt.ylabel('频数')

plt.subplot(3, 4, (11,12))
sns.countplot(x='天气类型', data=data)
plt.title('天气类型分布')
plt.xlabel('天气类型')
plt.ylabel('频数')

plt.tight_layout()
plt.show()

[Figure: distributions of all features after cleaning]

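Data analysis

Temperature: values above 60 have been removed; most values fall within (-10, 5) and (10, 40), which is plausible
Humidity: ranges from 20% to 100% and is concentrated around the median, which is plausible
Wind speed: concentrated between 0 and 20; speeds are mostly low and extreme values are rare, which is plausible
Precipitation: ranges from 0 to 100, mostly between 20 and 80, with a median around 50 and a roughly symmetric distribution, so it reflects precipitation under most weather conditions
Cloud cover: partly cloudy and overcast are the most common; cloudy is the least common
Atmospheric pressure: extreme values are rare and most values are near 1000, which is plausible
UV index: mostly low, with few high values, which is plausible
Season: winter accounts for the largest share of records
Visibility: mostly around 5 km, which is normal
Location: shows the number of records for each location
Weather type: the four weather types have roughly equal counts, so the classes are balanced

Overall, the cleaned data shows no problems and is ready for the next step.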

4. Building the model

1. Label encoding

from sklearn.preprocessing import LabelEncoder
new_data = data.copy()
columns = ['云量', '季节', '地点', '天气类型']

feature_encoders = {}

for i in columns:
    le = LabelEncoder()
    new_data[i] = le.fit_transform(data[i])
    feature_encoders[i] = le   # keep each fitted encoder so codes can be mapped back to labels

# Show the integer code assigned to each label
for i in columns:
    print(i, ': ')
    for index, class_ in enumerate(feature_encoders[i].classes_):
        print(f'index: {index}: {class_}')

云量 :
index: 0: clear
index: 1: cloudy
index: 2: overcast
index: 3: partly cloudy
季节 :
index: 0: Autumn
index: 1: Spring
index: 2: Summer
index: 3: Winter
地点 :
index: 0: coastal
index: 1: inland
index: 2: mountain
天气类型 :
index: 0: Cloudy
index: 1: Rainy
index: 2: Snowy
index: 3: Sunny
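The listing above shows that `LabelEncoder` assigns integer codes in sorted label order. Keeping the fitted encoder around is what makes it possible to turn predicted codes back into readable labels later. A minimal sketch on a tiny hand-written list (the list itself is illustrative, not a slice of the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels for illustration.
weather_labels = ["Rainy", "Cloudy", "Sunny", "Snowy", "Rainy"]

le = LabelEncoder()
codes = le.fit_transform(weather_labels)   # labels -> integer codes
print(le.classes_.tolist())                # sorted labels; the position is the code
print(codes.tolist())

# Map integer codes (for example, model predictions) back to label names.
decoded = le.inverse_transform([0, 3, 2])
print(decoded.tolist())
```

The same `inverse_transform` call applies to the encoder stored for the weather type column when predictions need to be reported as names rather than 0 to 3.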

2. Model creation

from sklearn.model_selection import train_test_split
# Separate features and target, then split into training and test sets
X = new_data.drop('天气类型', axis=1)
y = new_data['天气类型']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn.ensemble import RandomForestClassifier
# Create the random forest model with default hyperparameters
model = RandomForestClassifier()
# Train the model
model.fit(X_train, y_train)
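The model above is created with default settings and no fixed seed, so its scores can differ slightly between runs. The sketch below shows two commonly used options: `random_state` for reproducibility and `oob_score=True`, which evaluates each tree on the samples left out of its bootstrap sample (the roughly 36.8% from the derivation earlier). The synthetic data and the value `n_estimators=200` are assumptions chosen for the demo, not settings from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 4-class data stands in for the weather features (assumption).
X_demo, y_demo = make_classification(n_samples=1500, n_features=10,
                                     n_informative=6, n_classes=4,
                                     random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,   # number of trees in the forest
    oob_score=True,     # score each sample using only trees that did not train on it
    random_state=42,    # fixed seed so repeated runs give the same forest
)
rf.fit(X_demo, y_demo)

print(f"trees: {len(rf.estimators_)}")
print(f"out-of-bag accuracy: {rf.oob_score_:.3f}")
```

The out-of-bag accuracy gives a rough generalization estimate without setting aside a separate validation split, which can be handy when data is scarce.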

5. Model prediction and evaluation

from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
model_evaluation = classification_report(y_test, y_pred)
print(model_evaluation)

              precision    recall  f1-score   support

           0       0.89      0.93      0.91       636
           1       0.92      0.90      0.91       622
           2       0.94      0.93      0.93       614
           3       0.92      0.92      0.92       600

    accuracy                           0.92      2472
   macro avg       0.92      0.92      0.92      2472
weighted avg       0.92      0.92      0.92      2472

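Analysis:
Accuracy and the macro-averaged precision, recall, and F1 score are all about 0.92 on the 2,472 test samples, and every class scores at least 0.89 on each metric, so the model performs well across all four weather types.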
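To see where the numbers in a classification report come from, it helps to compute them on a case small enough to check by hand. The labels below are invented for illustration (12 samples, 4 classes) and have nothing to do with the weather model's actual predictions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Invented labels: three samples per class, two of them misclassified.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_hat  = [0, 0, 1, 1, 1, 1, 2, 2, 0, 3, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
print(cm)

acc = accuracy_score(y_true, y_hat)                              # correct / total
macro_precision = precision_score(y_true, y_hat, average="macro")  # mean of per-class precision
macro_recall = recall_score(y_true, y_hat, average="macro")        # mean of per-class recall
print(f"accuracy={acc:.3f}, macro precision={macro_precision:.3f}, "
      f"macro recall={macro_recall:.3f}")
```

Precision for a class divides the diagonal entry by its column sum, recall divides it by its row sum, and the macro average is the unweighted mean over the classes. A confusion matrix on the real test set would also show which weather types get mixed up with each other.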

6. Feature importance

feature_importances = model.feature_importances_
feature_rf = pd.DataFrame({'特征': X.columns, '重要度': feature_importances})
feature_rf.sort_values(by='重要度', inplace=True, ascending=False)  # ascending=False sorts in descending order
plt.figure(figsize=(10, 8))
sns.barplot(x='重要度', y='特征', data=feature_rf)
plt.title('特征影响程度')
plt.xlabel('特征重要占比')
plt.ylabel('特征名字')
plt.show()
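`feature_importances_` is computed from impurity reduction on the training data. A common cross-check is permutation importance, which shuffles one feature at a time on held-out data and measures how much the score drops. The sketch below runs both on synthetic data; the dataset, the split, and `n_repeats=5` are assumptions for the demo, and this is a swapped-in complementary technique rather than what the original post used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 4 informative features plus 4 noise features (assumption).
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     n_informative=4, n_redundant=0,
                                     n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the accuracy drop.
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: impurity={rf.feature_importances_[i]:.3f}, "
          f"permutation={result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

When the two rankings roughly agree, the impurity-based chart is easier to trust; large disagreements are a hint to look more closely at the features involved.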