【Kaggle】Playground Series: "Multi-Class Prediction of Obesity Risk"


Preface

For machine-learning beginners, Kaggle is an excellent place to practice and learn. Its Playground Series offers fun, approachable datasets for honing machine-learning skills, with a new competition every month. It is a great opportunity for newcomers, and many experienced competitors share their code. This post covers the February 2024 competition, "Multi-Class Prediction of Obesity Risk", and records my experience throughout the competition.

Problem Description

Goal: predict an individual's obesity risk level.

Dataset introduction:

| Column | Full Meaning | Details |
| --- | --- | --- |
| id | ID | A unique number for each person |
| Gender | Gender | The individual's gender |
| Age | Age | Between 14 and 61 years |
| Height | Height | In meters, between 1.45 m and 1.98 m |
| Weight | Weight | Between 39 and 165 kg |
| family_history_with_overweight | Family history | Whether a family member has had problems with overweight |
| FAVC | Frequent consumption of high-caloric food | A yes/no question; presumably "Do you eat high-caloric food frequently?" |
| FCVC | Frequency of vegetable consumption | A question similar to FAVC |
| NCP | Number of main meals | Float between 1 and 4; it should be 1, 2, 3, or 4, but the data is synthetic, hence floats |
| CAEC | Consumption of food between meals | Four values: Sometimes, Frequently, Always, and no |
| SMOKE | Smoking | Presumably the question "Do you smoke?" |
| CH2O | Daily water intake | Between 1 and 3; again synthetic data, hence a float |
| SCC | Calorie consumption monitoring | Whether the person monitors their calorie intake |
| FAF | Physical activity frequency | Between 0 and 3; 0 means no physical activity, 3 means intense exercise |
| TUE | Time using technology devices | Between 0 and 2; the question is "How long do you spend using technology devices to track your health?" |
| CALC | Consumption of alcohol | Three values: Sometimes, Frequently, and no |
| MTRANS | Mode of transportation | Five values: Public_Transportation, Automobile, Walking, Motorbike, and Bike |
| NObeyesdad | Target | Our target, with seven classes; in this competition we must submit class names, not probabilities (probabilities are what most competitions ask for) |
NObeyesdad (target variable), defined by BMI bands (a sketch of the mapping follows the list):
  • Insufficient_Weight: BMI below 18.5
  • Normal_Weight: BMI 18.5 to 24.9
  • Overweight_Level_I and Overweight_Level_II: BMI 25 to 29.9
  • Obesity_Type_I: BMI 30.0 to 34.9
  • Obesity_Type_II: BMI 35.0 to 39.9
  • Obesity_Type_III: BMI above 40
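Since the classes follow BMI bands, here is a minimal sketch of the mapping, mirroring the commented-out PseudoTarget feature later in this post. The helper name and the collapsed "Overweight" label are my own, since the Level I/II split is not determined by BMI alone:

import pandas as pd

# Illustrative only: approximate BMI bands behind the seven classes.
# The competition data is synthetic, so actual labels need not match exactly.
def bmi_band(weight_kg, height_m):
    bmi = weight_kg / height_m ** 2
    bins = [0, 18.5, 25, 30, 35, 40, float("inf")]
    labels = ["Insufficient_Weight", "Normal_Weight", "Overweight",
              "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"]
    return labels[pd.cut([bmi], bins=bins, labels=False)[0]]

print(bmi_band(80, 1.75))  # BMI ≈ 26.1 -> "Overweight"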

Loading Libraries

(omitted in the original post)
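Since the import cell is omitted, here is a minimal sketch of the setup the later code assumes. Every name below (FILE_PATH, TARGET, RANDOM_SEED, n_splits, raw_num_cols, raw_cat_cols, full_form) is inferred from how it is used later, not copied from the original notebook:

import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import optuna

from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

from category_encoders import MEstimateEncoder
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

FILE_PATH = "/kaggle/input/playground-series-s4e2"  # assumed competition path
TARGET = "NObeyesdad"
RANDOM_SEED = 42   # assumed; any fixed seed works
n_splits = 10      # the fold logs below show 10 folds

# Column groups used by the plotting code (inferred from the data summary)
raw_num_cols = ["Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
raw_cat_cols = ["Gender", "family_history_with_overweight", "FAVC", "CAEC",
                "SMOKE", "SCC", "CALC", "MTRANS"]
# Hypothetical display names for the abbreviated columns
full_form = {"FCVC": "Vegetable consumption", "NCP": "Number of main meals",
             "CH2O": "Daily water intake", "FAF": "Physical activity",
             "TUE": "Technology use time"}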

Loading Data

# Load all the data
train = pd.read_csv(os.path.join(FILE_PATH, "train.csv"))
test = pd.read_csv(os.path.join(FILE_PATH, "test.csv"))
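Later cells also reference train_org (the original UCI obesity dataset from which the synthetic competition data was generated) and sample_sub. A plausible loading cell, with the file names being assumptions:

# Assumed file names: sample_submission.csv ships with the competition data;
# ObesityDataSet.csv is the original UCI dataset (concatenated onto train later).
sample_sub = pd.read_csv(os.path.join(FILE_PATH, "sample_submission.csv"))
train_org = pd.read_csv("ObesityDataSet.csv")  # hypothetical local path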

Exploring the Data

Train Data
Total number of rows: 20758
Total number of columns: 18

Test Data
Total number of rows: 13840
Total number of columns: 17

  • Summary statistics of the training data:
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
| Column Name | count |  dtype  | nunique | %nunique | %null |         min         |         max         |
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
|      id     | 20758 |  int64  |  20758  |  100.0   |  0.0  |          0          |        20757        |
|    Gender   | 20758 |  object |    2    |   0.01   |  0.0  |        Female       |         Male        |
|     Age     | 20758 | float64 |   1703  |  8.204   |  0.0  |         14.0        |         61.0        |
|    Height   | 20758 | float64 |   1833  |   8.83   |  0.0  |         1.45        |       1.975663      |
|    Weight   | 20758 | float64 |   1979  |  9.534   |  0.0  |         39.0        |      165.057269     |
|     FHWO    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FAVC    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FCVC    | 20758 | float64 |   934   |  4.499   |  0.0  |         1.0         |         3.0         |
|     NCP     | 20758 | float64 |   689   |  3.319   |  0.0  |         1.0         |         4.0         |
|     CAEC    | 20758 |  object |    4    |  0.019   |  0.0  |        Always       |          no         |
|    SMOKE    | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     CH2O    | 20758 | float64 |   1506  |  7.255   |  0.0  |         1.0         |         3.0         |
|     SCC     | 20758 |  object |    2    |   0.01   |  0.0  |          no         |         yes         |
|     FAF     | 20758 | float64 |   1360  |  6.552   |  0.0  |         0.0         |         3.0         |
|     TUE     | 20758 | float64 |   1297  |  6.248   |  0.0  |         0.0         |         2.0         |
|     CALC    | 20758 |  object |    3    |  0.014   |  0.0  |      Frequently     |          no         |
|    MTRANS   | 20758 |  object |    5    |  0.024   |  0.0  |      Automobile     |       Walking       |
|  NObeyesdad | 20758 |  object |    7    |  0.034   |  0.0  | Insufficient_Weight | Overweight_Level_II |
+-------------+-------+---------+---------+----------+-------+---------------------+---------------------+
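A hedged sketch of how a summary like this can be built with prettytable (which matches the output format above; the helper is my reconstruction, not the author's exact code):

from prettytable import PrettyTable

def summarize(df):
    # Per-column count, dtype, cardinality, null rate, and min/max.
    table = PrettyTable(["Column Name", "count", "dtype", "nunique",
                         "%nunique", "%null", "min", "max"])
    for col in df.columns:
        s = df[col]
        table.add_row([col, s.count(), s.dtype, s.nunique(),
                       round(100 * s.nunique() / len(s), 3),
                       round(100 * s.isna().mean(), 3),
                       s.min(), s.max()])
    return table

print(summarize(train))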
  • Target classes broken down by gender:
+---------------------+--------+--------------+---------------+--------------------+---------------------+
|      NObeyesdad     | Gender | gender_count | %gender_count | target_class_count | %target_class_count |
+---------------------+--------+--------------+---------------+--------------------+---------------------+
| Insufficient_Weight | Female |     1621     |      0.64     |        2523        |         0.12        |
|                     |  Male  |     902      |      0.36     |        2523        |         0.12        |
|    Normal_Weight    | Female |     1660     |      0.54     |        3082        |         0.15        |
|                     |  Male  |     1422     |      0.46     |        3082        |         0.15        |
|    Obesity_Type_I   | Female |     1267     |      0.44     |        2910        |         0.14        |
|                     |  Male  |     1643     |      0.56     |        2910        |         0.14        |
|   Obesity_Type_II   | Female |      8       |      0.00     |        3248        |         0.16        |
|                     |  Male  |     3240     |      1.00     |        3248        |         0.16        |
|   Obesity_Type_III  | Female |     4041     |      1.00     |        4046        |         0.19        |
|                     |  Male  |      5       |      0.00     |        4046        |         0.19        |
|  Overweight_Level_I | Female |     1070     |      0.44     |        2427        |         0.12        |
|                     |  Male  |     1357     |      0.56     |        2427        |         0.12        |
| Overweight_Level_II | Female |     755      |      0.30     |        2522        |         0.12        |
|                     |  Male  |     1767     |      0.70     |        2522        |         0.12        |
+---------------------+--------+--------------+---------------+--------------------+---------------------+
From the table above, we can see:
  • Everyone in Obesity_Type_II is male, and everyone in Obesity_Type_III is female
  • Overweight_Level_II is about 70% male, while Insufficient_Weight is over 60% female
  • From this we can say that gender is an important feature for predicting obesity
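A minimal sketch of how such a breakdown can be computed with a groupby (my reconstruction, not the author's exact code):

# Count and share of each gender within each target class.
g = train.groupby([TARGET, "Gender"]).size().rename("gender_count").to_frame()
g["target_class_count"] = g.groupby(level=0)["gender_count"].transform("sum")
g["%gender_count"] = (g["gender_count"] / g["target_class_count"]).round(2)
print(g)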

Data Visualization

In this section we will look at:

  • Individual numerical plots
  • Individual categorical plots
  • A numerical correlation plot
  • Combined numerical plots

Target distribution by gender
fig, axs = plt.subplots(1,2,figsize = (12,5))
plt.suptitle("Target Distribution")

sns.histplot(binwidth=0.5,x=TARGET,data=train,hue='Gender',palette="dark",ax=axs[0],discrete=True)
axs[0].tick_params(axis='x', rotation=60)

axs[1].pie(
        train[TARGET].value_counts(),
        shadow = True,
        explode=[.1 for i in range(train[TARGET].nunique())],
        labels = train[TARGET].value_counts().index,
        autopct='%1.f%%',
    )

plt.tight_layout()
plt.show()

[Figure: target distribution histogram and pie chart]

Individual numerical plots
fig,axs = plt.subplots(len(raw_num_cols),1,figsize=(12,len(raw_num_cols)*2.5),sharex=False)
for i, col in enumerate(raw_num_cols):
    sns.violinplot(x=TARGET, y=col,hue="Gender", data=train,ax = axs[i], split=False)
    if col in full_form.keys():
        axs[i].set_ylabel(full_form[col])

plt.tight_layout()
plt.show()

[Figure: violin plots of each numerical feature by target class]

Individual categorical plots
_,axs = plt.subplots(int(len(raw_cat_cols)-1),2,figsize=(12,len(raw_cat_cols)*3),width_ratios=[1, 4])
for i,col in enumerate(raw_cat_cols[1:]):
    sns.countplot(y=col,data=train,palette="bright",ax=axs[i,0])
    sns.countplot(x=col,data=train,hue=TARGET,palette="bright",ax=axs[i,1])
    if col in full_form.keys():
        axs[i,0].set_ylabel(full_form[col])


plt.tight_layout()
plt.show()

[Figure: count plots of each categorical feature, overall and by target class]

Numerical correlation plot
tmp = train[raw_num_cols].corr("pearson")
sns.heatmap(tmp,annot=True,cmap ="crest")

[Figure: Pearson correlation heatmap of the numerical features]

Combined numerical plots
  • Height vs. Weight
sns.jointplot(data=train, x="Height", y="Weight", hue=TARGET,height=6)

[Figure: joint plot of Height vs. Weight]

  • Age vs. Height
[Figure: joint plot of Age vs. Height]
Principal Component Analysis (PCA) and KMeans
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

#PCA
pca = PCA(n_components=2)
pca_top_2 = pca.fit_transform(train[raw_num_cols])

tmp = pd.DataFrame(data = pca_top_2, columns = ['pca_1','pca_2'])
tmp['TARGET'] = train[TARGET]

fig,axs = plt.subplots(2,1,figsize = (12,6))
sns.scatterplot(data=tmp, y="pca_1", x="pca_2", hue='TARGET',ax=axs[0])
axs[0].set_title("Top 2 Principal Components")

#KMeans
kmeans = KMeans(7,random_state=RANDOM_SEED)
kmeans.fit(tmp[['pca_1','pca_2']])
sns.scatterplot( y= tmp['pca_1'],x = tmp['pca_2'],c = kmeans.labels_,cmap='viridis', marker='o', edgecolor='k', s=50, alpha=0.8,ax = axs[1])
axs[1].set_title("Kmean Clustring on First 2 Principal Components")
plt.tight_layout()
plt.show()

[Figure: scatter plots of the top two principal components and the KMeans clusters]
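One caveat: PCA is scale-sensitive, and the numeric columns here have very different ranges (Weight runs to 165 while TUE tops out at 2), so Weight dominates the components. A quick check of how much variance the two components capture, plus what standardizing first looks like (my own addition):

# Fraction of total variance captured by each principal component.
print(pca.explained_variance_ratio_)

# Standardizing first usually gives a more balanced projection.
from sklearn.preprocessing import StandardScaler
pca_scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(train[raw_num_cols]))
print(pca_scaled.explained_variance_ratio_)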

Feature Engineering and Processing

# In age_rounder and height_rounder we scale the values by 100 and cast to
# integers; this sometimes improves the model's CV score.
# In extract_features we combine existing features to derive new ones (e.g., BMI).

def age_rounder(x):
    x_copy = x.copy()
    x_copy['Age'] = (x_copy['Age']*100).astype(np.uint16)
    return x_copy

def height_rounder(x):
    x_copy = x.copy()
    x_copy['Height'] = (x_copy['Height']*100).astype(np.uint16)
    return x_copy

def extract_features(x):
    x_copy = x.copy()
    x_copy['BMI'] = (x_copy['Weight']/x_copy['Height']**2)
#     x_copy['PseudoTarget'] = pd.cut(x_copy['BMI'],bins = [0,18.4,24.9,29,34.9,39.9,100],labels = [0,1,2,3,4,5],)    
    return x_copy

def col_rounder(x):
    x_copy = x.copy()
    cols_to_round = ['FCVC',"NCP","CH2O","FAF","TUE"]
    for col in cols_to_round:
        x_copy[col] = round(x_copy[col])
        x_copy[col] = x_copy[col].astype('int')
    return x_copy

AgeRounder = FunctionTransformer(age_rounder)
HeightRounder = FunctionTransformer(height_rounder)
ExtractFeatures = FunctionTransformer(extract_features)
ColumnRounder = FunctionTransformer(col_rounder)
# FeatureDropper lets us drop columns. This matters when we want to pass
# different feature sets to different models.
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureDropper(BaseEstimator, TransformerMixin):
    """Drop the given columns from the feature set."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, x, y=None):
        return self
    def transform(self, x):
        return x.drop(self.cols, axis=1)
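For example, a pipeline that hides one column from a particular model might look like this (illustrative usage, not from the original notebook):

# Hypothetical usage: drop SMOKE before fitting this one model.
example_pipe = make_pipeline(
    ExtractFeatures,
    FeatureDropper(cols=["SMOKE"]),
    MEstimateEncoder(cols=["Gender", "family_history_with_overweight", "FAVC",
                           "CAEC", "SCC", "CALC", "MTRANS"]),
    RandomForestClassifier(random_state=RANDOM_SEED),
)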
Next we define cross_val_model, which will be used to train and validate every model in this post.

The cross_val_model function provides three things: val_scores, valid_predictions, and test_predictions.

  • val_scores: the accuracy scores on the validation data
  • valid_predictions: an array storing the model's out-of-fold predictions on the validation sets
  • test_predictions: the test predictions, averaged over the number of splits used
# Cross-validate the models,
# combined with stratified K-fold.

# Encode the target classes
target_mapping = {
                  'Insufficient_Weight':0,
                  'Normal_Weight':1,
                  'Overweight_Level_I':2,
                  'Overweight_Level_II':3, 
                  'Obesity_Type_I':4,
                  'Obesity_Type_II':5 ,
                  'Obesity_Type_III':6
                  }

# Define the stratified K-fold cross-validation method
skf = StratifiedKFold(n_splits=n_splits)

def cross_val_model(estimators,cv = skf, verbose = True):
    '''
        estimators : pipeline consists preprocessing, encoder & model
        cv : Method for cross validation (default: StratifiedKfold)
        verbose : print train/valid score (yes/no)
    '''
    
    X = train.copy()
    y = X.pop(TARGET)

    y = y.map(target_mapping)
    test_predictions = np.zeros((len(test),7))
    valid_predictions = np.zeros((len(X),7))

    val_scores, train_scores = [],[]
    for fold, (train_ind, valid_ind) in enumerate(cv.split(X, y)):  # use the cv argument, not the global skf
        model = clone(estimators)
        #define train set
        X_train = X.iloc[train_ind]
        y_train = y.iloc[train_ind]
        #define valid set
        X_valid = X.iloc[valid_ind]
        y_valid = y.iloc[valid_ind]

        model.fit(X_train, y_train)
        if verbose:
            print("-" * 100)
            print(f"Fold: {fold}")
            print(f"Train Accuracy Score:-{accuracy_score(y_true=y_train,y_pred=model.predict(X_train))}")
            print(f"Valid Accuracy Score:-{accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid))}")
            print("-" * 100)

        
        test_predictions += model.predict_proba(test)/cv.get_n_splits()
        valid_predictions[valid_ind] = model.predict_proba(X_valid)
        val_scores.append(accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid)))
    if verbose: 
        print(f"Average Mean Accuracy Score:- {np.array(val_scores).mean()}")
    return val_scores, valid_predictions, test_predictions
# Merge the original and the synthetic competition data

train.drop(['id'],axis = 1, inplace = True)
test_ids = test['id']
test.drop(['id'],axis = 1, inplace=True)

train = pd.concat([train,train_org],axis = 0)
train = train.drop_duplicates()
train.reset_index(drop=True, inplace=True)

# Create empty DataFrames to store scores, OOF (validation) predictions, and test predictions
score_list, oof_list, predict_list = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

Modeling

In this competition, rather than focusing on a single model, it pays to combine the predictions of several high-performing models. In this post we train four different types of model and combine their predictions to produce the final submission:

  • Random forest
  • LGBM
  • XGB
  • CatBoost

Random Forest Model
# Define Random Forest Model Pipeline

RFC = make_pipeline(
                        ExtractFeatures,
                        MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                       RandomForestClassifier(random_state=RANDOM_SEED)
                    )
# Run the random forest model
val_scores,val_predictions,test_predictions = cross_val_model(RFC)

# Save the corresponding results
for k,v in target_mapping.items():
    oof_list[f"rfc_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"rfc_{k}"] = test_predictions[:,v]
# 0.8975337326149792
# 0.9049682643904575
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9999027237354086
Valid Accuracy Score:-0.8954048140043763
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9999513618677043
Valid Accuracy Score:-0.9010940919037199
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9999513618677043
Valid Accuracy Score:-0.8940919037199124
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9999027237354086
Valid Accuracy Score:-0.8905908096280087
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9998540856031128
Valid Accuracy Score:-0.9102844638949672
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9999027284665143
Valid Accuracy Score:-0.8975481611208407
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.9998054569330286
Valid Accuracy Score:-0.8966725043782837
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9998540926997714
Valid Accuracy Score:-0.9080560420315237
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9998540926997714
Valid Accuracy Score:-0.9063047285464098
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9999513642332571
Valid Accuracy Score:-0.9610332749562172
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9061080794184259
LGBM Model

The LGBM model has a great many parameters, most of them hyperparameters, so this post uses Optuna to tune it.

# Define the Optuna objective for tuning the model

def lgbm_objective(trial):
    params = {
        'learning_rate' : trial.suggest_float('learning_rate', .001, .1, log = True),
        'max_depth' : trial.suggest_int('max_depth', 2, 20),
        'subsample' : trial.suggest_float('subsample', .5, 1),
        'min_child_weight' : trial.suggest_float('min_child_weight', .1, 15, log = True),
        'reg_lambda' : trial.suggest_float('reg_lambda', .1, 20, log = True),
        'reg_alpha' : trial.suggest_float('reg_alpha', .1, 10, log = True),
        'n_estimators' : 1000,
        'random_state' : RANDOM_SEED,
        'device_type' : "gpu",
        'num_leaves': trial.suggest_int('num_leaves', 10, 1000),

        #'boosting_type' : 'dart',
    }
    
    optuna_model = make_pipeline(
                                 ExtractFeatures,
                                 MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                                LGBMClassifier(**params,verbose=-1)
                                )
    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

lgbm_study = optuna.create_study(direction = 'maximize',study_name="LGBM")

If you turn the tuning switch on, it will run for a very long time, so proceed with care.

# Tuning switch
TUNE = False

warnings.filterwarnings("ignore")
if TUNE:
    lgbm_study.optimize(lgbm_objective, 50)
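Once a study has finished, the tuned values can be read back from the study object (standard Optuna attributes):

if TUNE:
    print(lgbm_study.best_value)   # best mean CV accuracy found
    print(lgbm_study.best_params)  # feed these into the final pipeline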

Split the original columns into numerical and categorical types, for the different preprocessing steps below:

numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = train.select_dtypes(include=['object']).columns.tolist()
categorical_columns.remove('NObeyesdad')

The following parameters are the result of my tuning:

best_params = {
    "objective": "multiclass",          # Objective function for the model
    "metric": "multi_logloss",          # Evaluation metric
    "verbosity": -1,                    # Verbosity level (-1 for silent)
    "boosting_type": "gbdt",            # Gradient boosting type
    "random_state": 42,       # Random state for reproducibility
    "num_class": 7,                     # Number of classes in the dataset
    'learning_rate': 0.030962211546832760,  # Learning rate for gradient boosting
    'n_estimators': 500,                # Number of boosting iterations
    'lambda_l1': 0.009667446568254372,  # L1 regularization term
    'lambda_l2': 0.04018641437301800,   # L2 regularization term
    'max_depth': 10,                    # Maximum depth of the trees
    'colsample_bytree': 0.40977129346872643,  # Fraction of features to consider for each tree
    'subsample': 0.9535797422450176,    # Fraction of samples to consider for each boosting iteration
    'min_child_samples': 26             # Minimum number of data needed in a leaf
}

Training then proceeds just as it did for the random forest:

lgbm = make_pipeline(    
                        ColumnTransformer(
                        transformers=[('num', StandardScaler(), numerical_columns),
                                  ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                        LGBMClassifier(**best_params,verbose=-1)
                    )
# Train LGBM Model

val_scores,val_predictions,test_predictions = cross_val_model(lgbm)

for k,v in target_mapping.items():
    oof_list[f"lgbm_{k}"] = val_predictions[:,v]
    
for k,v in target_mapping.items():
    predict_list[f"lgbm_{k}"] = test_predictions[:,v]
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9771400778210116
Valid Accuracy Score:-0.9089715536105033
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9767509727626459
Valid Accuracy Score:-0.9076586433260394
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9776264591439688
Valid Accuracy Score:-0.9059080962800875
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9775291828793774
Valid Accuracy Score:-0.9089715536105033
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9770428015564202
Valid Accuracy Score:-0.9164113785557987
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9779679976654831
Valid Accuracy Score:-0.9076182136602452
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.9779193618987403
Valid Accuracy Score:-0.9058669001751314
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9779193618987403
Valid Accuracy Score:-0.9194395796847635
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.977676183065026
Valid Accuracy Score:-0.908493870402802
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9742230436262828
Valid Accuracy Score:-0.9527145359019265
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.91420543252078
XGB Model

The XGB model is handled the same way as LGBM:

# Optuna objective for XGB
def xgb_objective(trial):
    params = {
        'grow_policy': trial.suggest_categorical('grow_policy', ["depthwise", "lossguide"]),
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'gamma' : trial.suggest_float('gamma', 1e-9, 1.0),
        'subsample': trial.suggest_float('subsample', 0.25, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.25, 1.0),
        'max_depth': trial.suggest_int('max_depth', 0, 24),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
    }

    params['booster'] = 'gbtree'
    params['objective'] = 'multi:softmax'
    params["device"] = "cuda"
    params["verbosity"] = 0
    params['tree_method'] = "gpu_hist"
    
    
    optuna_model = make_pipeline(
#                     ExtractFeatures,
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    XGBClassifier(**params,seed=RANDOM_SEED)
                   )
    
    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

xgb_study = optuna.create_study(direction = 'maximize')
# Optuna tuning switch
TUNE = False
if TUNE:
    xgb_study.optimize(xgb_objective, 50)
# XGB Pipeline

params = {
    'n_estimators': 1312,
    'learning_rate': 0.018279520260162645,
    'gamma': 0.0024196354156454324,
    'reg_alpha': 0.9025931173755949,
    'reg_lambda': 0.06835667255875388,
    'max_depth': 5,
    'min_child_weight': 5,
    'subsample': 0.883274050086088,
    'colsample_bytree': 0.6579828557036317
}
# {'eta': 0.018387615982905264, 'max_depth': 29, 'subsample': 0.8149303101087905, 'colsample_bytree': 0.26750463604831476, 'min_child_weight': 0.5292380065098192, 'reg_lambda': 0.18952063379457604, 'reg_alpha': 0.7201451827004944}

params = {'grow_policy': 'depthwise', 'n_estimators': 690, 
               'learning_rate': 0.31829021594473056, 'gamma': 0.6061120644431842, 
               'subsample': 0.9032243794829076, 'colsample_bytree': 0.44474031945048287,
               'max_depth': 10, 'min_child_weight': 22, 'reg_lambda': 4.42638097284094,
               'reg_alpha': 5.927900973354344e-07,'seed':RANDOM_SEED}

best_params = {'grow_policy': 'depthwise', 'n_estimators': 982, 
               'learning_rate': 0.050053726931263504, 'gamma': 0.5354391952653927, 
               'subsample': 0.7060590452456204, 'colsample_bytree': 0.37939433412123275, 
               'max_depth': 23, 'min_child_weight': 21, 'reg_lambda': 9.150224029846654e-08,
               'reg_alpha': 5.671063656994295e-08}
best_params['booster'] = 'gbtree'
best_params['objective'] = 'multi:softmax'
best_params["device"] = "cuda"
best_params["verbosity"] = 0
best_params['tree_method'] = "gpu_hist"
    
XGB = make_pipeline(
#                     ExtractFeatures,
#                     MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
#                                            'SMOKE','SCC','CALC','MTRANS']),
#                     FeatureDropper(['FAVC','FCVC']),
#                     ColumnRounder,
#                     ColumnTransformer(
#                     transformers=[('num', StandardScaler(), numerical_columns),
#                                   ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    XGBClassifier(**best_params,seed=RANDOM_SEED)
                   )
# The different parameter sets above give different results
val_scores,val_predictions,test_predictions = cross_val_model(XGB)

for k,v in target_mapping .items():
    oof_list[f"xgb_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"xgb_{k}"] = test_predictions[:,v]
    
# 0.90634942296329
#0.9117093455898445 with rounder
#0.9163506382522121
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9452821011673151
Valid Accuracy Score:-0.9111597374179431
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.945136186770428
Valid Accuracy Score:-0.9063457330415755
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9449902723735408
Valid Accuracy Score:-0.9080962800875274
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.9454280155642023
Valid Accuracy Score:-0.9059080962800875
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9432392996108949
Valid Accuracy Score:-0.9199124726477024
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9460629346821653
Valid Accuracy Score:-0.9128721541155866
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.946160206215651
Valid Accuracy Score:-0.9106830122591943
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.9456252127814795
Valid Accuracy Score:-0.9168126094570929
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9446524974466223
Valid Accuracy Score:-0.9106830122591943
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9407130003404504
Valid Accuracy Score:-0.9610332749562172
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9163506382522121
CatBoost Model

Hyperparameters are set with Optuna again:

# Optuna Function For Catboost Model
def cat_objective(trial):
    
    params = {
        
        'iterations': 1000,  # High number of estimators
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'depth': trial.suggest_int('depth', 3, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.01, 10.0),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'random_seed': RANDOM_SEED,
        'verbose': False,
        'task_type':"GPU"
    }
    
    cat_features = ['Gender','family_history_with_overweight','FAVC','FCVC','NCP',
                'CAEC','SMOKE','CH2O','SCC','FAF','TUE','CALC','MTRANS']
    optuna_model = make_pipeline(
                        ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),
                        CatBoostClassifier(**params,cat_features=cat_features)
                        )
    val_scores,_,_ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()
    
cat_study = optuna.create_study(direction = 'maximize')

The resulting parameters are:

params = {'learning_rate': 0.13762007048684638, 'depth': 5, 
          'l2_leaf_reg': 5.285199432056192, 'bagging_temperature': 0.6029582154263095,
         'random_seed': RANDOM_SEED,
        'verbose': False,
        'task_type':"GPU",
         'iterations':1000}


CB = make_pipeline(
#                         ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),
#                         CatBoostEncoder(cols = cat_features),
                        CatBoostClassifier(**params, cat_features=categorical_columns)
                        )

Train the model with the parameters above:

# Train Catboost Model
val_scores,val_predictions,test_predictions = cross_val_model(CB)
for k,v in target_mapping.items():
    oof_list[f"cat_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"cat_{k}"] = test_predictions[:,v]

# best 0.91179835368868 with extract features, n_splits = 10
# best 0.9121046227778054 without extract features, n_splits = 10
----------------------------------------------------------------------------------------------------
Fold: 0
Train Accuracy Score:-0.9478599221789883
Valid Accuracy Score:-0.9050328227571116
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 1
Train Accuracy Score:-0.9498540856031128
Valid Accuracy Score:-0.9054704595185996
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 2
Train Accuracy Score:-0.9500972762645914
Valid Accuracy Score:-0.9024070021881838
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 3
Train Accuracy Score:-0.949124513618677
Valid Accuracy Score:-0.9050328227571116
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 4
Train Accuracy Score:-0.9482976653696498
Valid Accuracy Score:-0.912472647702407
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 5
Train Accuracy Score:-0.9502456106220515
Valid Accuracy Score:-0.9089316987740805
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 6
Train Accuracy Score:-0.950780604056223
Valid Accuracy Score:-0.9045534150612959
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 7
Train Accuracy Score:-0.95073196828948
Valid Accuracy Score:-0.9098073555166375
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 8
Train Accuracy Score:-0.9513155974903944
Valid Accuracy Score:-0.9111208406304728
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
Fold: 9
Train Accuracy Score:-0.9446524974466223
Valid Accuracy Score:-0.957968476357268
----------------------------------------------------------------------------------------------------
Average Mean Accuracy Score:- 0.9122797541263168

Model Ensembling and Evaluation

The four models above are blended with a set of weights:

# skf = StratifiedKFold(n_splits=5)
weights = {"rfc_":0,
           "lgbm_":3,
           "xgb_":1,
           "cat_":0}
tmp = oof_list.copy()
for k,v in target_mapping.items():
    tmp[f"{k}"] = (weights['rfc_']*tmp[f"rfc_{k}"] +
              weights['lgbm_']*tmp[f"lgbm_{k}"]+
              weights['xgb_']*tmp[f"xgb_{k}"]+
              weights['cat_']*tmp[f"cat_{k}"])    
tmp['pred'] = tmp[list(target_mapping.keys())].idxmax(axis = 1)
tmp['label'] = train[TARGET]
print(f"Ensemble Accuracy Score: {accuracy_score(train[TARGET],tmp['pred'])}")
    
cm = confusion_matrix(y_true = tmp['label'].map(target_mapping),
                      y_pred = tmp['pred'].map(target_mapping),
                     normalize='true')

cm = cm.round(2)
plt.figure(figsize=(8,8))
disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                              display_labels = target_mapping.keys())
disp.plot(xticks_rotation=50)
plt.tight_layout()
plt.show()

"""   BEST     """

# Best LB [0,1,0,0]
# Average Train Score:0.9142044335854003
# Average Valid Score:0.91420543252078

# Best CV [1,3, 1,1]
# Average Train Score:0.9168308163711971
# Average Valid Score:0.9168308163711971
# adding original data improves the score

[Figure: normalized confusion matrix of the ensemble's OOF predictions]
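The weights above were picked by hand against the OOF and leaderboard scores. A minimal sketch of automating that with a small grid search over integer weights (my own addition, not the author's method):

from itertools import product

# Brute-force small integer weights (rfc, lgbm, xgb, cat) against the OOF predictions.
best = (0, None)
for w in product(range(4), repeat=4):
    if sum(w) == 0:
        continue
    blend = pd.DataFrame({
        k: (w[0] * oof_list[f"rfc_{k}"] + w[1] * oof_list[f"lgbm_{k}"] +
            w[2] * oof_list[f"xgb_{k}"] + w[3] * oof_list[f"cat_{k}"])
        for k in target_mapping
    })
    score = accuracy_score(train[TARGET], blend.idxmax(axis=1))
    if score > best[0]:
        best = (score, w)
print(best)  # best OOF accuracy and its weight tuple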

Final Submission

for k,v in target_mapping.items():
    predict_list[f"{k}"] = (weights['rfc_']*predict_list[f"rfc_{k}"]+
                            weights['lgbm_']*predict_list[f"lgbm_{k}"]+
                            weights['xgb_']*predict_list[f"xgb_{k}"]+
                            weights['cat_']*predict_list[f"cat_{k}"])

final_pred = predict_list[list(target_mapping.keys())].idxmax(axis = 1)

sample_sub[TARGET] = final_pred
sample_sub.to_csv("submission.csv",index=False)
sample_sub
|       |   id  | NObeyesdad          |
| ----- | ----- | ------------------- |
| 0     | 20758 | Obesity_Type_II     |
| 1     | 20759 | Overweight_Level_I  |
| 2     | 20760 | Obesity_Type_III    |
| 3     | 20761 | Obesity_Type_I      |
| 4     | 20762 | Obesity_Type_III    |
| ...   | ...   | ...                 |
| 13835 | 34593 | Overweight_Level_II |
| 13836 | 34594 | Normal_Weight       |
| 13837 | 34595 | Insufficient_Weight |
| 13838 | 34596 | Normal_Weight       |
| 13839 | 34597 | Obesity_Type_II     |

13840 rows × 2 columns

Conclusion

  1. This post documents the entire workflow, from exploratory data analysis (EDA) and visualization (VIS), through feature engineering (FE), cross-validation (CV), modeling (MOD), and model evaluation (EV), to the final submission (SUB), providing a standard template for machine-learning beginners.
  2. The task is multi-class classification, and only a confusion matrix was used for evaluation here; in practice several other tools are available, such as Accuracy, Precision, Recall, AUC score, and F1 score.
  3. The models were combined with a weighted average; other ensembling options include stacking, blending, and voting (hard and soft), which are covered in my earlier articles.
  4. My public score at submission time was 0.92+; the final result is shown below, ranked 34th, within the top 1%.
[Figure: final leaderboard rank]

