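Contents
- Problem 2
  - 2.1 Analyzing the classification rules for high-potassium and lead-barium glass from the attachment data
    - Categorical data encoding
    - Handling imbalanced data
    - Classification models
      - Decision tree classification
      - Random forest classification
      - XGBoost classification
      - LightGBM classification
      - CatBoost classification
      - Histogram-Based Gradient Boosting
      - Gradient Boosting Tree
      - Logistic regression
      - Naive Bayes
      - Support vector machine (SVM)
      - Neural network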
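Problem 2
2.1 Analyzing the classification rules for high-potassium and lead-barium glass from the attachment data
Categorical data encoding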
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Drop the 'rowSum' helper column before encoding
d12 = d12.drop('rowSum', axis=1)

# Label-encode every categorical (object) column.
# Note: the single encoder is refit on each column, so it only retains the mapping of the last one.
label_encoder = LabelEncoder()
x_categorical = d12.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
x_numerical = d12.select_dtypes(exclude=['object']).values
# Numeric columns first, then the encoded categorical columns (assumes d12 has a default 0..n-1 index)
df_encode = pd.concat([pd.DataFrame(x_numerical), x_categorical], axis=1)
# Restore column names: numeric columns sit at positions 0 and 6-19 of d12, followed by the 6 categorical columns
colnames = [d12.columns[i] for i in [0] + list(range(6, 20))] + list(df_encode.columns[15:21])
df_encode.columns = colnames
df_encode.head()
 | 文物编号 | 二氧化硅(SiO2) | 氧化钠(Na2O) | 氧化钾(K2O) | 氧化钙(CaO) | 氧化镁(MgO) | 氧化铝(Al2O3) | 氧化铁(Fe2O3) | 氧化铜(CuO) | 氧化铅(PbO) | ... | 五氧化二磷(P2O5) | 氧化锶(SrO) | 氧化锡(SnO2) | 二氧化硫(SO2) | 纹饰 | 类型 | 颜色 | 表面风化 | 文物采样点 | 风化标记 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 69.33 | 0.0 | 9.99 | 6.32 | 0.87 | 3.93 | 1.74 | 3.87 | 0.00 | ... | 1.17 | 0.00 | 0.0 | 0.39 | 2 | 1 | 6 | 0 | 0 | 1 |
1 | 2.0 | 36.28 | 0.0 | 1.05 | 2.34 | 1.18 | 5.73 | 1.86 | 0.26 | 47.43 | ... | 3.57 | 0.19 | 0.0 | 0.00 | 0 | 0 | 1 | 1 | 1 | 1 |
2 | 3.0 | 87.05 | 0.0 | 5.19 | 2.01 | 0.00 | 4.06 | 0.00 | 0.78 | 0.25 | ... | 0.66 | 0.00 | 0.0 | 0.00 | 0 | 1 | 6 | 0 | 2 | 1 |
3 | 3.0 | 61.71 | 0.0 | 12.37 | 5.87 | 1.11 | 5.50 | 2.16 | 5.09 | 1.41 | ... | 0.70 | 0.10 | 0.0 | 0.00 | 0 | 1 | 6 | 0 | 3 | 1 |
4 | 4.0 | 65.88 | 0.0 | 9.67 | 7.12 | 1.56 | 6.44 | 2.06 | 2.18 | 0.00 | ... | 0.79 | 0.00 | 0.0 | 0.36 | 0 | 1 | 6 | 0 | 4 | 1 |
5 rows × 21 columns
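The integer codes in the table are easier to interpret when the mapping behind them is kept. The following is a minimal sketch, in plain Python, of the sorted-order coding that `LabelEncoder` applies, storing one mapping per column. The helper names `fit_label_mapping` and `encode_columns`, the `color` column, and the toy rows are hypothetical and serve only as illustration; the two `类型` values are the ones reported in this section.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode_columns(table):
    """Encode every column of a dict-of-lists table.

    Returns the encoded table and the per-column mappings, so that each
    integer code can later be traced back to its original label.
    """
    mappings = {name: fit_label_mapping(col) for name, col in table.items()}
    encoded = {name: [mappings[name][v] for v in col] for name, col in table.items()}
    return encoded, mappings


# Hypothetical toy rows (not taken from the attachment data)
toy = {
    "类型": ["高钾", "铅钡", "高钾", "铅钡"],
    "color": ["green", "blue", "blue", "green"],
}
encoded, mappings = encode_columns(toy)
print(mappings["类型"])
print(encoded["类型"])
```

Under this ordering 铅钡 (lead-barium) receives code 0 and 高钾 (high-potassium) receives code 1, which agrees with the encoded `类型` counts shown later in this section.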
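from sklearn.model_selection import train_test_split
# '类型' (glass type) is the target; all remaining columns are used as features
X = df_encode.drop('类型', axis=1)
y = df_encode['类型']
# Hold out 20% of the samples for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)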
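The split above is purely random, so with only 18 samples in the smaller class the test set may end up with very few of them. A stratified split keeps the class proportions in both parts. The sketch below illustrates the idea in plain Python; `stratified_split` is a hypothetical helper, and the label list only mirrors the 49/18 class counts reported in this section. scikit-learn's `train_test_split` offers a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=1):
    """Split sample indices so each class contributes about test_size of its members to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # At least one test sample per class, otherwise the rounded share
        n_test = max(1, round(len(idxs) * test_size))
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Labels mirroring the class counts in this section: 49 of class 0, 18 of class 1
labels = [0] * 49 + [1] * 18
train_idx, test_idx = stratified_split(labels)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in (0, 1)}
print(len(train_idx), len(test_idx), test_counts)
```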
Handling imbalanced data
d12['类型'].value_counts()
类型
铅钡 49
高钾 18
Name: count, dtype: int64
df_encode['类型'].value_counts()
类型
0 49
1 18
Name: count, dtype: int64
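from imblearn.over_sampling import SMOTE
# Oversample the minority class in the training set only; random_state makes the result reproducible
oversample = SMOTE(random_state=1)
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)
y_train_smote.value_counts()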
类型
0 37
1 37
Name: count, dtype: int64
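SMOTE reaches this balance by synthesizing new minority-class samples rather than duplicating existing ones. The following is a simplified illustration of the interpolation step it is based on, not imblearn's implementation: pick a minority sample, pick one of its nearest minority neighbours, and place a new point on the segment between them. The function name `smote_like_sample` and the toy points are hypothetical.

```python
import random


def smote_like_sample(minority, k=2, seed=0):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    neighbours = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
    )[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # interpolation weight in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Hypothetical two-feature minority samples (not taken from the attachment data)
minority_points = [[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [5.0, 5.0]]
synthetic = smote_like_sample(minority_points)
print(synthetic)
```

Because the new point is a weighted average of two real samples, label-encoded categorical columns can receive fractional values that correspond to no category. imblearn provides `SMOTENC` for data that mixes numeric and categorical features, which may be worth considering for the encoded columns here.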
Classification models
Decision tree classification
Model evaluation reference: https://www.statology.org/sklearn-classification-report/
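As a reminder of what the per-class figures in a classification report mean, the sketch below computes precision, recall, F1, and support for one class by hand. The function `binary_report` and the toy label lists are hypothetical; they are not results from the models in this section.

```python
def binary_report(y_true, y_pred, positive=1):
    """Compute precision, recall, F1, and support for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}


# Hypothetical labels: 0 = lead-barium, 1 = high-potassium
y_true_toy = [0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 0]
report_pos = binary_report(y_true_toy, y_pred_toy, positive=1)
report_neg = binary_report(y_true_toy, y_pred_toy, positive=0)
print(report_pos)
print(report_neg)
```

With an imbalanced test set, the minority-class row is usually the more informative one, since overall accuracy can stay high even when most minority samples are misclassified.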
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics
from sklearn.metrics import classification_report