This article introduces several optimized variants of GBDT:
- XGBoost
XGBoost performs a second-order Taylor expansion of the loss function, which lets the boosted trees approximate the true loss more closely, and adds a regularization term to prevent overfitting. The XGBoost objective is:

$$\mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

$L(y_i, \hat{y}_i)$: the loss function
$\Omega(f_k)$: the regularization term

The split gain incorporates the second derivative. Writing $G$ for the sum of first derivatives and $H$ for the sum of second derivatives over a node's samples, the gain of splitting a node into left and right children is:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

G: sum of first derivatives
H: sum of second derivatives
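To make the formula concrete, here is a minimal sketch (not XGBoost's actual implementation; the function `split_gain` and the toy data are invented for illustration) that computes the gain of a candidate split directly from per-sample gradients and Hessians:

```python
import numpy as np

def split_gain(g, h, left_mask, lam=1.0, gamma=0.0):
    """Gain of a candidate split, following the XGBoost gain formula.

    g, h       : per-sample first and second derivatives of the loss
    left_mask  : boolean array marking samples routed to the left child
    lam, gamma : L2 regularization and split-complexity penalties
    """
    G_L, H_L = g[left_mask].sum(), h[left_mask].sum()
    G_R, H_R = g[~left_mask].sum(), h[~left_mask].sum()

    def score(G, H):
        return G * G / (H + lam)

    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Toy example with squared-error loss, where g = y_hat - y and h = 1
y = np.array([1.0, 1.2, 3.0, 3.1])
y_hat = np.zeros_like(y)                 # initial prediction
g, h = y_hat - y, np.ones_like(y)
print(split_gain(g, h, np.array([True, True, False, False])))
```

A positive gain means the split reduces the regularized loss; at each node the algorithm greedily picks the candidate with the largest gain.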
```bash
# Install the dependency
pip install xgboost
```
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
```
### Prepare the data
```python
from sklearn import datasets

# Load the iris dataset
data = datasets.load_iris()
# Features and labels
X, y = data.data, data.target
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)
```
```python
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt

# Model parameters
params = {
    'booster': 'gbtree',           # tree-based booster
    'objective': 'multi:softmax',  # multiclass classification, outputs class labels
    'num_class': 3,                # number of classes
    'gamma': 0.1,                  # minimum loss reduction required to split
    'max_depth': 2,                # maximum tree depth
    'lambda': 2,                   # L2 regularization on leaf weights
    'subsample': 0.7,              # row subsampling ratio per tree
    'colsample_bytree': 0.7,       # column subsampling ratio per tree
    'min_child_weight': 3,         # minimum sum of Hessians in a child
    'eta': 0.001,                  # learning rate
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 200
model = xgb.train(params, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
y_pred = model.predict(dtest)

# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Plot feature importance
plot_importance(model)
plt.show()
```
- LightGBM
The computational complexity of XGBoost's search for the best split point can be estimated as: number of features × number of candidate split points × number of samples. LightGBM improves on XGBoost along exactly these three dimensions:
- Histogram-based split finding: compute split gains per bucket of feature values instead of trying every candidate split point, with each bucket covering many feature values (see the sketch after this list).
- Exclusive Feature Bundling (EFB): merge mutually exclusive features into one, which effectively reduces the number of features.
- Leaf-wise growth: compared with level-wise growth, leaf-wise only expands the nodes that effectively reduce the loss; the downside is that it can overfit if regularization is not set appropriately.
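To illustrate the histogram idea from the first bullet, here is a small sketch (illustrative only, not LightGBM's internals; the helper `histogram_split_candidates` and its defaults are made up for this example): bucket a feature into a fixed number of bins, accumulate gradient statistics per bin, and only evaluate bin boundaries as split candidates:

```python
import numpy as np

def histogram_split_candidates(x, g, h, n_bins=16):
    """Bucket feature x into n_bins and return per-bin gradient statistics.

    Instead of testing every distinct value of x as a split point,
    only the n_bins - 1 bin boundaries need to be evaluated.
    """
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])  # bin boundaries
    bins = np.searchsorted(edges, x)                             # bin index per sample
    G = np.bincount(bins, weights=g, minlength=n_bins)           # gradient sum per bin
    H = np.bincount(bins, weights=h, minlength=n_bins)           # Hessian sum per bin
    return edges, G, H

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
g = rng.normal(size=1000)   # stand-in gradients
h = np.ones(1000)
edges, G, H = histogram_split_candidates(x, g, h)
print(len(edges), "candidate split points instead of", len(np.unique(x)))
```

Prefix sums of `G` and `H` across the bins give the left/right statistics for every boundary, so once the histogram is built, evaluating splits costs O(#bins) per feature rather than O(#samples).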
```bash
# Install the dependency
pip install lightgbm
```
```python
# Imports
import lightgbm as lgb
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load the iris dataset
iris = load_iris()
data = iris.data
target = iris.target

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=43)

# Build the LightGBM classifier
gbm = lgb.LGBMClassifier(objective='multiclass',
                         num_class=3,
                         num_leaves=31,
                         learning_rate=0.05,
                         n_estimators=20)

# Train with early stopping so that best_iteration_ is populated
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)],
        callbacks=[lgb.early_stopping(stopping_rounds=5)])

# Predict on the test set using the best iteration
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

# Evaluate
print('Accuracy of lightgbm:', accuracy_score(y_test, y_pred))
lgb.plot_importance(gbm)
plt.show()
```
- CatBoost
CatBoost is a boosting framework designed around categorical features. It encodes categories with target statistics instead of one-hot encoding, and it uses ordered boosting: each sample's statistics are computed using only the samples that precede it in a random ordering, so a tree never sees data (in particular, labels) that should be hidden from it. This substantially improves training quality.
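As a rough sketch of ordered target statistics (a simplification, not CatBoost's exact smoothing scheme; the function `ordered_target_stats` and its parameters are invented for illustration), each sample's category is encoded using only the labels of samples that come before it in a random permutation:

```python
import numpy as np

def ordered_target_stats(cat, y, prior=0.5, a=1.0, seed=0):
    """Ordered target-statistic encoding for one categorical column.

    Each sample is encoded with the mean label of *earlier* samples (in a
    random permutation) sharing its category, smoothed toward a prior, so
    a sample never sees its own label or labels from "future" rows.
    """
    order = np.random.default_rng(seed).permutation(len(cat))
    sums, counts = {}, {}
    encoded = np.empty(len(cat))
    for i in order:
        c = cat[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (n + a)   # uses only past samples
        sums[c], counts[c] = s + y[i], n + 1     # then reveal this label
    return encoded

cat = np.array(["a", "b", "a", "a", "b"])
y = np.array([1, 0, 1, 0, 1])
print(ordered_target_stats(cat, y))
```

Because a sample's own label never leaks into its encoding, the statistic avoids the target leakage that plain mean-target encoding suffers from.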
```bash
# Install the dependency
pip install catboost
```
```python
import pandas as pd
from sklearn.model_selection import train_test_split
import catboost as cb
from sklearn.metrics import f1_score

# Read the data
data = pd.read_csv('./adult.data', header=None)

# Rename the columns
data.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

# Encode the label as integer codes
data['income'] = data['income'].astype("category").cat.codes

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(data.drop(['income'], axis=1), data['income'],
                                                    random_state=10, test_size=0.3)

# Configure the model
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=4, iterations=500, l2_leaf_reg=1,
                            learning_rate=0.1)

# Indices of the categorical feature columns
cat_features_index = [1, 3, 5, 6, 7, 8, 9, 13]

# Train
clf.fit(X_train, y_train, cat_features=cat_features_index)

# Predict
y_pred = clf.predict(X_test)

# F1 score on the test set
print(f1_score(y_test, y_pred))
```
### Summary
This article introduced three optimized GBDT algorithms: XGBoost, LightGBM, and CatBoost. Which one to use can be decided according to the situation at hand.