2022 Higher Education Press Cup China Undergraduate Mathematical Contest in Modeling (CUMCM), Problem C, Question 2(1): Python Code

2025/4/16 18:39:25

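- Question 2
  - 2.1 Analyze the classification rules of high-potassium and lead-barium glass based on the attached data
    - Categorical data encoding
    - Handling imbalanced data
    - Classification models
      - Decision tree classification
      - Random forest classification
      - XGBoost classification
      - LightGBM classification
      - CatBoost classification
      - Histogram-Based Gradient Boosting
      - Gradient Boosting Tree
      - Logistic regression
      - Naive Bayes
      - Support vector machine (SVM)
      - Neural network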

Question 2

2.1 Analyze the classification rules of high-potassium and lead-barium glass based on the attached data

Categorical data encoding

d12 = d12.drop('rowSum', axis=1)

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Label-encode every categorical (object-dtype) column.
# apply() calls fit_transform once per column, so each column gets its own integer codes.
label_encoder = LabelEncoder()
x_categorical = d12.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
x_numerical = d12.select_dtypes(exclude=['object'])

# Numeric columns first, then the encoded categorical columns.
df_encode = pd.concat([pd.DataFrame(x_numerical.values), x_categorical], axis=1)

# Restore the column names from the source frames instead of hard-coded positions.
df_encode.columns = list(x_numerical.columns) + list(x_categorical.columns)
df_encode.head()

	文物编号	二氧化硅(SiO2)	氧化钾(K2O)	氧化钙(CaO)	氧化镁(MgO)	氧化铝(Al2O3)	氧化铁(Fe2O3)	氧化铜(CuO)	氧化铅(PbO)	...	五氧化二磷(P2O5)	氧化锶(SrO)	二氧化硫(SO2)	纹饰	类型	颜色	表面风化	文物采样点	风化标记
0	1.0	69.33	9.99	6.32	0.87	3.93	1.74	3.87	0.00	...	1.17	0.00	0.39	2	1	6	0	0	1
1	2.0	36.28	1.05	2.34	1.18	5.73	1.86	0.26	47.43	...	3.57	0.19	0.00	0	0	1	1	1	1
2	3.0	87.05	5.19	2.01	0.00	4.06	0.00	0.78	0.25	...	0.66	0.00	0.00	0	1	6	0	2	1
3	3.0	61.71	12.37	5.87	1.11	5.50	2.16	5.09	1.41	...	0.70	0.10	0.00	0	1	6	0	3	1
4	4.0	65.88	9.67	7.12	1.56	6.44	2.06	2.18	0.00	...	0.79	0.00	0.36	0	1	6	0	4	1

5 rows × 21 columns
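
To make the integer codes in the table above easier to read back, here is a minimal sketch of what label encoding does, written in plain Python rather than with scikit-learn. It assumes, as LabelEncoder does, that the distinct values of a column are sorted and numbered from 0. The toy rows and the helper name fit_label_mapping are hypothetical; only the column names and category values come from the data shown in this post.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


# Hypothetical toy rows; column names and category values follow the post's data.
toy = {
    "类型": ["高钾", "铅钡", "高钾", "铅钡"],          # glass type
    "表面风化": ["无风化", "风化", "无风化", "无风化"],  # surface weathering
}

# One mapping per column, kept in a dict so codes can be decoded later.
mappings = {col: fit_label_mapping(vals) for col, vals in toy.items()}
encoded = {col: [mappings[col][v] for v in vals] for col, vals in toy.items()}
inverse = {col: {code: label for label, code in m.items()} for col, m in mappings.items()}

print(mappings["类型"])
print(encoded["类型"])
print(inverse["表面风化"])
```

Keeping one mapping per column is a design choice worth considering in the real notebook too: a single LabelEncoder reused across columns only remembers the classes of the last column it was fitted on, so the earlier columns cannot be decoded from it afterwards.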

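from sklearn.model_selection import train_test_split

# Features: every column except the target '类型' (glass type).
X = df_encode.drop('类型', axis=1)
y = df_encode['类型']

# Hold out 20% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)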
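
With so few samples, a purely random split can leave the minority class over- or under-represented in the test set. The sketch below illustrates the idea of a stratified split in plain Python, assuming the 49/18 class counts reported in this post; the function name stratified_split and its rounding rule are illustrative assumptions, not the code used here. scikit-learn's train_test_split offers the same idea through its stratify argument.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=1):
    """Return (train_idx, test_idx) with about test_size of each class held out."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class; otherwise round to the nearest count.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts taken from the post: 49 samples of class 0, 18 of class 1.
labels = [0] * 49 + [1] * 18
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "minority samples in the test set")
```

Switching the post's split to a stratified one would change the class counts reported in the following steps, so it is shown only as an alternative.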

Handling imbalanced data

# Class counts before encoding: 铅钡 = lead-barium, 高钾 = high-potassium.
d12['类型'].value_counts()

类型
铅钡    49
高钾    18
Name: count, dtype: int64

# Same counts after label encoding: 0 = 铅钡, 1 = 高钾.
df_encode['类型'].value_counts()

类型
0    49
1    18
Name: count, dtype: int64

from imblearn.over_sampling import SMOTE

# Oversample the minority class on the training split only, so the test set stays untouched.
oversample = SMOTE()
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

y_train_smote.value_counts()

类型
0    37
1    37
Name: count, dtype: int64
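
SMOTE balances the classes by creating synthetic minority samples rather than duplicating existing ones. The following is a simplified sketch of the interpolation step in plain Python, not imblearn's implementation: pick a minority sample, pick one of its k nearest minority neighbours, and place a new point at a random position on the segment between them. The function name smote_like and the toy points are hypothetical.

```python
import math
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of base, excluding base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class points with two numeric features.
minority_points = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 4.0]]
synthetic_points = smote_like(minority_points, n_new=3, k=2)

for point in synthetic_points:
    print([round(v, 3) for v in point])
```

Because the new points are interpolated, label-encoded categorical columns and the 文物编号 (artifact ID) column in X can receive fractional values that have no real meaning. If that matters for the analysis, dropping the ID column or using imblearn's SMOTENC for mixed numeric and categorical features may be worth considering.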

Classification models

Decision tree classification

Model evaluation: https://www.statology.org/sklearn-classification-report/
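
The classification report described at the link above lists precision, recall, F1-score, and support for each class. As a rough sketch of how those numbers are derived, the plain-Python function below computes them from true and predicted labels. The label lists are made up for illustration and are not results from this dataset; the function name per_class_metrics is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true samples of this class
    return precision, recall, f1, support


# Made-up labels for illustration only (0 = 铅钡, 1 = 高钾).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]

for label in (0, 1):
    precision, recall, f1, support = per_class_metrics(y_true_demo, y_pred_demo, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

On an imbalanced problem like this one, the per-class recall of the minority class is usually more informative than overall accuracy.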

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics 
from sklearn.metrics import classification_report