L8 Check-in Study Notes

🍨 This post is a study log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
🍖 Original author: K同学啊

SVM and Ensemble Learning

SVM
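- Linear SVM Model
- Nonlinear SVM Model
- Common SVM Parameters
Ensemble Learning
Random Forest
- Importing the Data
- Inspecting the Data
- Data Analysis
- Random Forest Model
- Prediction Results
- Result Analysis
Personal Summary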

SVM

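Hyperplane: an SVM searches the feature space for the hyperplane that maximizes the margin between classes, called the maximum-margin hyperplane. This hyperplane is the decision boundary that separates the classes in the dataset.
Support vectors: the support vectors are the samples closest to the separating hyperplane, and they determine its position and orientation. In other words, only these samples affect the learned boundary; removing any other sample leaves it unchanged.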
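
For the linearly separable (hard-margin) case, the idea above is usually written as the optimization problem below. The symbols w, b, x_i and y_i (with labels in {-1, +1}) are notation introduced here for illustration; they do not appear in the original notes.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{s.t.} \quad
y_i\,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, n
```

The margin width is 2 / ||w||, so minimizing ||w|| maximizes the margin. The support vectors are the samples for which the constraint holds with equality.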

Linear SVM Model

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

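# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features; fit the scaler on the training set only to avoid test-set leakage
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create the SVM model
svm = SVC(kernel='linear', C=1.0)

# Train the model
svm.fit(X_train, y_train)

# Predict
y_pred = svm.predict(X_test)

# Evaluate model performance (accuracy as a percentage)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: %.2f%%' % (accuracy * 100.0))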

Nonlinear SVM Model

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

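# Load the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features; fit the scaler on the training set only to avoid test-set leakage
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Create the SVM model with an RBF (nonlinear) kernel
svm = SVC(kernel='rbf', C=1.0, gamma=0.1)

# Train the model
svm.fit(X_train, y_train)

# Predict
y_pred = svm.predict(X_test)

# Evaluate model performance (accuracy as a percentage)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: %.2f%%' % (accuracy * 100.0))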

Common SVM Parameters

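C (default: 1.0)
○ Purpose: penalty (regularization) parameter that trades off maximizing the margin against penalizing misclassified samples.
○ Explanation: a larger C penalizes misclassification more heavily, so the model classifies more training points correctly, but the margin shrinks and the model may overfit. A smaller C favors a wider margin and tolerates more misclassifications, which usually improves generalization.
○ Typical range: usually tuned between 0.001 and 1000, typically on a logarithmic scale.
kernel (default: 'rbf')
○ Purpose: selects the kernel function, that is, the nonlinear mapping to use.
○ Options:
■ 'linear': linear kernel; no nonlinear mapping is applied.
■ 'poly': polynomial kernel, usually used when the classes can be separated by a polynomial boundary.
■ 'rbf': radial basis function (RBF) kernel, also called the Gaussian kernel; the most commonly used nonlinear kernel.
■ 'sigmoid': similar to a neural-network activation function; rarely used.
■ A custom kernel can also be supplied by passing a callable.
degree (default: 3)
○ Purpose: degree of the polynomial when kernel='poly'.
○ Explanation: with the polynomial kernel (poly), degree sets the order of the polynomial; 2 or 3 is typical.
gamma (default: 'scale')
○ Purpose: kernel coefficient for the 'rbf', 'poly', and 'sigmoid' kernels.
○ Options:
■ 'scale': uses 1 / (n_features * X.var()), so the value adapts to the number of input features and their variance.
■ 'auto': uses 1 / n_features.
■ A positive float can also be passed directly, as in gamma=0.1 in the nonlinear example above.
○ Explanation: the larger gamma is, the more closely the model fits the training data, at the risk of overfitting; the smaller gamma is, the smoother the decision boundary.
coef0 (default: 0.0)
○ Purpose: independent term in the kernel function; only meaningful when kernel='poly' or kernel='sigmoid'.
○ Explanation: controls the offset in the polynomial and sigmoid kernels.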
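
To make the effect of gamma concrete, here is a minimal pure-Python sketch of the RBF kernel formula exp(-gamma * ||x - z||^2). The function name rbf_kernel and the two sample points are illustrative choices made here, not part of the original notes or of scikit-learn's internals.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF (Gaussian) kernel value: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two illustrative points; their squared distance is 2.0
x = [0.0, 0.0]
z = [1.0, 1.0]

# The same pair of points looks less and less "similar" as gamma grows
for gamma in (0.01, 0.1, 1.0, 10.0):
    print(f"gamma={gamma:<5} K(x, z)={rbf_kernel(x, z, gamma):.6f}")
```

With a large gamma the kernel value drops to almost zero even for nearby points, so each training sample only influences its immediate neighborhood. That is why a large gamma tends to overfit and a small gamma gives a smoother boundary.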

Ensemble Learning

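At prediction time, Bagging uses simple majority voting for classification and simple averaging for regression. If two classes receive the same number of votes in a classification task, one of them is chosen at random.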
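
The aggregation rules above can be sketched in a few lines of standard-library Python. The helper names (bootstrap_sample, majority_vote, average) and the sample predictions are hypothetical and only illustrate the idea; this is not how scikit-learn implements Bagging.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement (one bootstrap sample)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions, rng):
    """Return the most common label; break ties uniformly at random."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == top)
    return rng.choice(tied)

def average(predictions):
    """Simple average of the base learners' regression outputs."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)
print(majority_vote(["Rainy", "Sunny", "Rainy"], rng))  # Rainy
print(average([20.0, 22.0, 27.0]))                      # 23.0
```

Each base learner would be trained on its own bootstrap_sample of the training set; only the aggregation step is shown in full here.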
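How Boosting works:

Weak learners: the weak learners in Boosting are models that perform only slightly better than random guessing; simple models such as shallow decision trees are typical.
Weighted training: in each iteration, Boosting adjusts the weight of every sample, increasing the weights of the samples the previous model misclassified so that later learners focus on these hard-to-classify samples.
Weighted voting: the final model combines the predictions of all weak learners with weights, usually by weighted voting (classification) or weighted averaging (regression).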
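
The reweighting step can be illustrated with one round of an AdaBoost-style update. AdaBoost is used here as one concrete Boosting algorithm; the notes above describe Boosting only in general terms. The function name adaboost_round and the toy labels are assumptions made for this sketch.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost-style reweighting step for labels in {-1, +1}.

    Returns (alpha, new_weights): alpha is the weak learner's vote weight,
    new_weights are the renormalized sample weights.
    """
    # Weighted error rate of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Lower error gives the learner a larger say (requires 0 < err < 1)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Mistakes (t * p == -1) are up-weighted, correct samples down-weighted
    raw = [w * math.exp(-alpha * t * p)
           for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, +1, -1, -1, -1]   # the last sample is misclassified
weights = [0.2] * 5             # start from uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

In this toy round the misclassified sample's weight rises from 0.2 to 0.5 while each correctly classified sample drops to 0.125, so the next weak learner is pushed toward the hard sample.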

Random Forest

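Random forest is an ensemble learning algorithm used mainly for classification and regression. It combines many decision trees to improve accuracy and robustness. The steps are as follows:

Random sampling: draw multiple sample sets at random, with replacement, from the original training data (usually all of the same size) to serve as the training data for each decision tree.
Building the trees: for each sample set, grow a decision tree in which each split considers only a randomly selected subset of the features. Nodes are split using a criterion such as information gain or the Gini index.
Ensemble prediction:
For classification, the random forest takes a vote over the predictions of all trees and returns the class with the most votes.
For regression, it returns the average of the predictions of all trees.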
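
The split criteria mentioned in the tree-building step can be computed by hand. Below is a minimal standard-library sketch of Gini impurity, entropy, and information gain; the function names and the four-sample toy split are illustrative assumptions, not code from the original notes.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy split that separates the two classes perfectly
parent = ["Sunny", "Sunny", "Rainy", "Rainy"]
left, right = ["Sunny", "Sunny"], ["Rainy", "Rainy"]

print(gini(parent))                           # 0.5
print(information_gain(parent, left, right))  # 1.0
```

A tree-building routine would evaluate candidate splits over the randomly chosen feature subset and keep the one with the lowest weighted impurity or, equivalently, the highest gain.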

Importing the Data

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

data = pd.read_csv(r'C:\Users\11054\Desktop\kLearning\L678_learning\data.csv')
data

Inspecting the Data

data.info()

import matplotlib.pyplot as plt

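plt.rcParams['font.family'] = 'SimHei'  # SimHei renders CJK glyphs in labels
plt.rcParams['axes.unicode_minus'] = False  # use ASCII '-' so the missing U+2212 glyph warning goes away
# Map each numeric column to the label shown in its plot title
feature_map = {
    'Temperature': 'Temperature',
    'Humidity': 'Humidity (%)',
    'Wind Speed': 'Wind speed',
    'Precipitation (%)': 'Precipitation (%)',
    'Atmospheric Pressure': 'Atmospheric pressure',
    'UV Index': 'UV index',
    'Visibility (km)': 'Visibility (km)'
}
plt.figure(figsize=(15, 10))

# One box plot per numeric feature, laid out on a 2 x 4 grid
for i, (col, col_name) in enumerate(feature_map.items(), 1):
    plt.subplot(2, 4, i)
    sns.boxplot(y=data[col])
    plt.title(f'Box plot of {col_name}', fontsize=14)
    plt.ylabel('Value', fontsize=12)
    plt.grid(axis='y', linestyle='--', alpha=0.7)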

plt.tight_layout()
plt.show()

[Figure: box plots of the seven numeric features]

print(f"Rows with temperature above 60°C: {data[data['Temperature'] > 60].shape[0]} ({round(data[data['Temperature'] > 60].shape[0] / data.shape[0] * 100,2)}% of the data).")
print(f"Rows with humidity above 100%: {data[data['Humidity'] > 100].shape[0]} ({round(data[data['Humidity'] > 100].shape[0] / data.shape[0] * 100,2)}% of the data).")
print(f"Rows with precipitation above 100%: {data[data['Precipitation (%)'] > 100].shape[0]} ({round(data[data['Precipitation (%)'] > 100].shape[0] / data.shape[0] * 100,2)}% of the data).")

Rows with temperature above 60°C: 207 (1.57% of the data).
Rows with humidity above 100%: 416 (3.15% of the data).
Rows with precipitation above 100%: 392 (2.97% of the data).

Data Analysis

data.describe(include='all')

plt.figure(figsize=(20, 15))
plt.subplot(3, 4, 1)
sns.histplot(data['Temperature'], kde=True, bins=20)
plt.title('Temperature distribution')
plt.xlabel('Temperature')
plt.ylabel('Count')

plt.subplot(3, 4, 2)
sns.boxplot(y=data['Humidity'])
plt.title('Humidity (%) box plot')
plt.ylabel('Humidity (%)')

plt.subplot(3, 4, 3)
sns.histplot(data['Wind Speed'], kde=True, bins=20)
plt.title('Wind speed distribution')
plt.xlabel('Wind speed (km/h)')
plt.ylabel('Count')

plt.subplot(3, 4, 4)
sns.boxplot(y=data['Precipitation (%)'])
plt.title('Precipitation (%) box plot')
plt.ylabel('Precipitation (%)')

plt.subplot(3, 4, 5)
sns.countplot(x='Cloud Cover', data=data)
plt.title('Cloud cover (description) distribution')
plt.xlabel('Cloud cover (description)')
plt.ylabel('Count')

plt.subplot(3, 4, 6)
sns.histplot(data['Atmospheric Pressure'], kde=True, bins=10)
plt.title('Atmospheric pressure distribution')
plt.xlabel('Pressure (hPa)')
plt.ylabel('Count')

plt.subplot(3, 4, 7)
sns.histplot(data['UV Index'], kde=True, bins=14)
plt.title('UV index distribution')
plt.xlabel('UV index')
plt.ylabel('Count')

plt.subplot(3, 4, 8)
Season_counts = data['Season'].value_counts()
plt.pie(Season_counts, labels=Season_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Season distribution')

plt.subplot(3, 4, 9)
sns.histplot(data['Visibility (km)'], kde=True, bins=10)
plt.title('Visibility distribution')
plt.xlabel('Visibility (km)')
plt.ylabel('Count')

plt.subplot(3, 4, 10)
sns.countplot(x='Location', data=data)
plt.title('Location distribution')
plt.xlabel('Location')
plt.ylabel('Count')

# The target variable spans the last two grid cells
plt.subplot(3, 4, (11, 12))
sns.countplot(x='Weather Type', data=data)
plt.title('Weather type distribution')
plt.xlabel('Weather type')
plt.ylabel('Count')

plt.tight_layout()
plt.show()

[Figure: distributions of each feature and of the weather type]

Random Forest Model

new_data = data.copy()
label_encoders = {}
categorical_features = ['Cloud Cover', 'Season', 'Location', 'Weather Type']
for feature in categorical_features:
    le = LabelEncoder()
    new_data[feature] = le.fit_transform(data[feature])
    label_encoders[feature] = le

# Print the integer code assigned to each category
for feature in categorical_features:
    print(f"Mapping for feature '{feature}':")
    for index, class_ in enumerate(label_encoders[feature].classes_):
        print(f"  {index}: {class_}")

Mapping for feature 'Cloud Cover':
  0: clear
  1: cloudy
  2: overcast
  3: partly cloudy
Mapping for feature 'Season':
  0: Autumn
  1: Spring
  2: Summer
  3: Winter
Mapping for feature 'Location':
  0: coastal
  1: inland
  2: mountain
Mapping for feature 'Weather Type':
  0: Cloudy
  1: Rainy
  2: Snowy
  3: Sunny
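
The mapping printed above follows from the way label encoding assigns codes: the distinct values are sorted and numbered from 0. The sketch below reproduces that behavior in pure Python; label_encode is a hypothetical stand-in written for illustration, not scikit-learn's implementation.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order.

    Returns (codes, classes), mirroring the sorted-classes behavior
    that the printed mapping above shows.
    """
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes

codes, classes = label_encode(
    ["overcast", "clear", "partly cloudy", "cloudy", "clear"]
)
print(classes)  # ['clear', 'cloudy', 'overcast', 'partly cloudy']
print(codes)    # [2, 0, 3, 1, 0]
```

Note that the integer codes carry an alphabetical order that has no physical meaning. Tree-based models such as random forests are fairly tolerant of this, but distance- or margin-based models may not be.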

# Build the feature matrix x and the target y
x = new_data.drop(['Weather Type'], axis=1)
y = new_data['Weather Type']

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.3,
                                                    random_state=15)

# Build and train the random forest model
rf_clf = RandomForestClassifier(random_state=15)
rf_clf.fit(x_train, y_train)

Prediction Results

y_pred_rf = rf_clf.predict(x_test)
class_report_rf = classification_report(y_test, y_pred_rf)
print(class_report_rf)

              precision    recall  f1-score   support

           0       0.87      0.93      0.90      1018
           1       0.93      0.91      0.92       967
           2       0.96      0.92      0.94      1007
           3       0.91      0.91      0.91       968

    accuracy                           0.92      3960
   macro avg       0.92      0.91      0.92      3960
weighted avg       0.92      0.92      0.92      3960
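
As a reading aid for the report above, the sketch below shows how the per-class precision, recall, f1-score, and support columns are defined. The function per_class_metrics and the six toy labels are made up for illustration; they are not the weather data and do not reproduce the numbers in the report.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, F1 and support for one class, treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support

# Toy labels only, chosen so the arithmetic is easy to follow
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]

for label in (0, 1, 2):
    p, r, f1, s = per_class_metrics(toy_true, toy_pred, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={s}")
```

The macro avg row is the unweighted mean of the per-class values, while the weighted avg row weights each class by its support.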

Result Analysis

# Rank the features by the importance the trained forest assigns to them
feature_importances = rf_clf.feature_importances_
features_rf = pd.DataFrame({'Feature': x.columns, 'Importance': feature_importances})
features_rf.sort_values(by='Importance', ascending=False, inplace=True)
plt.figure(figsize=(10, 8))
sns.barplot(x='Importance', y='Feature', data=features_rf)
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.title('Random forest feature importances')
plt.show()