Kaggle (3): Predict CO2 Emissions in Rwanda


1. Introduction

In this competition, the task is to predict 2022 CO2 emissions for 497 different locations in Africa. The training data covers CO2 emissions for 2019-2021.

What this notebook does:

1. Remove the one-off 2020 COVID trend by using a pre-smoothed dataset. Alternatively, imputing 2020 with the average of the 2019 and 2021 values is also valid, but is not implemented here (a sketch of that alternative follows this list).
2. Observe that locations close to the maximum-emission location also tend to have higher emission levels. Run K-Means clustering on the data points' locations, so that points with similar emissions are grouped together.
3. Use 2019 and 2020 as training data and experiment with several ensemble models, testing their CV on the 2021 data.
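
A minimal sketch of that averaging alternative (hypothetical, not run below; assumes train is loaded as in section 2 with latitude, longitude, year, week_no and emission columns):

# Hypothetical sketch: impute each 2020 emission with the mean of the matching
# 2019 and 2021 observations for the same location and week.
key = ["latitude", "longitude", "week_no"]
avg_1921 = (train[train["year"].isin([2019, 2021])]
            .groupby(key, as_index=False)["emission"].mean()
            .rename(columns={"emission": "avg_emission"}))
m2020 = train["year"] == 2020
train.loc[m2020, "emission"] = (train.loc[m2020, key]
                                .merge(avg_1921, on=key, how="left")["avg_emission"].values)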

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
from sklearn.preprocessing import SplineTransformer
from holidays import CountryHoliday
from tqdm.notebook import tqdm
from typing import List



from category_encoders import OneHotEncoder, MEstimateEncoder, GLMMEncoder, OrdinalEncoder
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, KFold, RepeatedKFold, TimeSeriesSplit, train_test_split, cross_val_score
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import HistGradientBoostingRegressor, VotingRegressor, StackingRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor, LogisticRegression
from sklearn.linear_model import PassiveAggressiveRegressor, ARDRegression
from sklearn.linear_model import TheilSenRegressor, RANSACRegressor, HuberRegressor
from sklearn.cross_decomposition import PLSRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, roc_auc_score, roc_curve
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform
from sklearn.feature_selection import RFECV
from sklearn.decomposition import PCA
from xgboost import XGBRegressor, XGBClassifier
import lightgbm as lgbm
from lightgbm import LGBMRegressor, LGBMClassifier
from lightgbm import log_evaluation, early_stopping, record_evaluation
from catboost import CatBoostRegressor, CatBoostClassifier, Pool
from sklearn import set_config
from sklearn.multioutput import MultiOutputClassifier
from datetime import datetime, timedelta
import gc

import warnings
warnings.filterwarnings('ignore')

set_config(transform_output = 'pandas')

pal = sns.color_palette('viridis')

pd.set_option('display.max_rows', 100)
M = 1.07  # post-processing multiplier applied to predictions (see section 2.2)

2. Examine Data

2.1

Here we try to smooth the 2020 data to remove the COVID trend. Two options:

1. Use a pre-smoothed imported dataset (done below).
2. Use the average of the 2019 and 2021 values [https://www.kaggle.com/code/kacperrabczewski/rwanda-co2-step-by-step-guide].

# Pre-smoothed training data with the 2020 COVID dip removed; keep only the 2020 rows
extrp = pd.read_csv("./data/PS3E20_train_covid_updated")
extrp = extrp[(extrp["year"] == 2020)]
extrp
[Output truncated: 26341 rows × 76 columns — ID_LAT_LON_YEAR_WEEK, latitude, longitude, year, week_no, the SulphurDioxide_* / Cloud_* satellite features, and emission, for the smoothed 2020 rows]

DATA_DIR = "./data/"
train = pd.read_csv(DATA_DIR + "train.csv")
test = pd.read_csv(DATA_DIR + "test.csv")

def add_features(df):
    #df["week"] = df["year"].astype(str) + "-" + df["week_no"].astype(str)
    #df["date"] = df["week"].apply(lambda x: get_date_from_week_string(x))
    #df = df.drop(columns = ["week"])
    # Continuous week index across years: 0-52 for 2019, 53-105 for 2020, 106-158 for 2021
    df["week"] = (df["year"] - 2019) * 53 + df["week_no"]
    #df["lat_long"] = df["latitude"].astype(str) + "#" + df["longitude"].astype(str)
    return df

train = add_features(train)
test = add_features(test)

2.2

Some risky post-processing on the predictions.

For each location, let MAX = max(2019 emission, 2020 emission, 2021 emission).

If the 2021 emission is greater than the 2019 emission, we assign MAX * 1.07 as the prediction; otherwise we just assign MAX. Reference: https://www.kaggle.com/competitions/playground-series-s3e20/discussion/430152
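
The rule in isolation, with made-up numbers (the cells below implement the same logic vectorised over the test frame):

# Hypothetical single-location example of the post-processing rule.
e2019, e2020, e2021 = 10.0, 12.0, 11.0
mx = max(e2019, e2020, e2021)
pred = mx * M if e2021 > e2019 else mx   # 2021 > 2019, so pred = 12.0 * 1.07 = 12.84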

vals = set()
for x in train[["latitude", "longitude"]].values:
    vals.add(tuple(x))
    
vals = list(vals)
zeros = []

for lat, long in vals:
    subset = train[(train["latitude"] == lat) & (train["longitude"] == long)]
    em_vals = subset["emission"].values
    if all(x == 0 for x in em_vals):
        zeros.append([lat, long])
# Initialise per-year emission columns with a correctly-sized placeholder;
# the loop below fills in the real values.
test["2021_emission"] = test["week_no"]
test["2020_emission"] = test["week_no"]
test["2019_emission"] = test["week_no"]

for lat, long in vals:
    test.loc[(test.latitude == lat) & (test.longitude == long), "2021_emission"] = train.loc[(train.latitude == lat) & (train.longitude == long) & (train.year == 2021) & (train.week_no <= 48), "emission"].values
    test.loc[(test.latitude == lat) & (test.longitude == long), "2020_emission"] = train.loc[(train.latitude == lat) & (train.longitude == long) & (train.year == 2020) & (train.week_no <= 48), "emission"].values
    test.loc[(test.latitude == lat) & (test.longitude == long), "2019_emission"] = train.loc[(train.latitude == lat) & (train.longitude == long) & (train.year == 2019) & (train.week_no <= 48), "emission"].values
    #print(train.loc[(train.latitude == lat) & (train.longitude == long) & (train.year == 2021), "emission"])
    
test["ratio"] = (test["2021_emission"] / test["2019_emission"]).replace(np.nan, 0)
# pos_ratio ends up as 1.07 where the 2021 emission exceeded the 2019 emission, else 1
test["pos_ratio"] = test["ratio"].apply(lambda x: max(x, 1))
test["pos_ratio"] = test["pos_ratio"].apply(lambda x: 1.07 if x > 1 else x)
test["max"] = test[["2019_emission", "2020_emission", "2021_emission"]].max(axis=1)
test["lazy_pred"] = test["max"] * test["pos_ratio"]
test = test.drop(columns = ["ratio", "pos_ratio", "max", "2019_emission", "2020_emission", "2021_emission"])
# Overwrite 2020 with the smoothed emissions (row counts match: 497 locations × 53 weeks)
train.loc[train.year == 2020, "emission"] = extrp["emission"].values
train
[Output truncated: 79023 rows × 77 columns — the original satellite feature columns plus emission and the new week index]

test
[Output truncated: 24353 rows × 77 columns — the satellite feature columns plus the new week index and lazy_pred]

Insights

The training set has 79023 observations and the test set has 24353. As we can observe, some columns contain null values.
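
A quick way to quantify that missingness (a sketch, not in the original notebook):

# Count nulls per column and show the worst offenders.
null_counts = train.isna().sum().sort_values(ascending=False)
print(null_counts.head(10))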

3. EDA and Data Distribution

def plot_emission(train):
    
    plt.figure(figsize=(15, 6))
    sns.lineplot(data=train, x="week", y="emission", label="Emission", alpha=0.7, color='blue')

    plt.xlabel('Week')
    plt.ylabel('Emission')
    plt.title('Emission over time')

    plt.legend()
    plt.tight_layout()
    plt.show()
    
plot_emission(train)

[Figure: Emission over time — line plot of emission against the week index]

sns.histplot(train["emission"])

[Figure: histogram of emission]

4. Data Transformation

print(len(vals))
497

Insights

There are 497 unique latitude/longitude combinations.

4.1

Most of the features are just noise, so we drop them and keep only location, time, and the target. (Reference: multiple discussion posts)

#train = train.drop(columns = ["ID_LAT_LON_YEAR_WEEK", "lat_long"])
#test = test.drop(columns = ["ID_LAT_LON_YEAR_WEEK", "lat_long"])

train = train[["latitude", "longitude", "year", "week_no", "emission"]]
test = test[["latitude", "longitude", "year", "week_no", "lazy_pred"]]

4.2

K-Means clustering + distance to the highest-emission location.

#https://www.kaggle.com/code/lucasboesen/simple-catboost-6-features-cv-21-7
from sklearn.cluster import KMeans
import haversine as hs

km_train = train.groupby(by=['latitude', 'longitude'], as_index=False)['emission'].mean()
# K-Means is fit on (latitude, longitude, mean emission), so locations that are
# close together and emit similar amounts land in the same cluster
model = KMeans(n_clusters = 7, random_state = 42)
model.fit(km_train)
yhat_train = model.predict(km_train)
km_train['kmeans_group'] = yhat_train

""" Own Groups """
# Some locations have emission == 0 (flag kept for reference; it is not merged into train/test below)
km_train['is_zero'] = km_train['emission'].apply(lambda x: 'no_emission_recorded' if x==0 else 'emission_recorded')

# Distance to the highest emission location
max_lat_lon_emission = km_train.loc[km_train['emission']==km_train['emission'].max(), ['latitude', 'longitude']]
km_train['distance_to_max_emission'] = km_train.apply(lambda x: hs.haversine((x['latitude'], x['longitude']), (max_lat_lon_emission['latitude'].values[0], max_lat_lon_emission['longitude'].values[0])), axis=1)

train = train.merge(km_train[['latitude', 'longitude', 'kmeans_group', 'distance_to_max_emission']], on=['latitude', 'longitude'])
test = test.merge(km_train[['latitude', 'longitude', 'kmeans_group', 'distance_to_max_emission']], on=['latitude', 'longitude'])
#train = train.drop(columns = ["latitude", "longitude"])
#test = test.drop(columns = ["latitude", "longitude"])
train
       latitude  longitude  year  week_no   emission  kmeans_group  distance_to_max_emission
0        -0.510     29.290  2019        0   3.750994             6                207.849890
1        -0.510     29.290  2019        1   4.025176             6                207.849890
2        -0.510     29.290  2019        2   4.231381             6                207.849890
3        -0.510     29.290  2019        3   4.305286             6                207.849890
4        -0.510     29.290  2019        4   4.347317             6                207.849890
...         ...        ...   ...      ...        ...           ...                       ...
79018    -3.299     30.301  2021       48  29.404171             6                157.630611
79019    -3.299     30.301  2021       49  29.186497             6                157.630611
79020    -3.299     30.301  2021       50  29.131205             6                157.630611
79021    -3.299     30.301  2021       51  28.125792             6                157.630611
79022    -3.299     30.301  2021       52  27.239302             6                157.630611

79023 rows × 7 columns

test
       latitude  longitude  year  week_no  lazy_pred  kmeans_group  distance_to_max_emission
0        -0.510     29.290  2022        0   3.753601             6                207.849890
1        -0.510     29.290  2022        1   4.051966             6                207.849890
2        -0.510     29.290  2022        2   4.231381             6                207.849890
3        -0.510     29.290  2022        3   4.305286             6                207.849890
4        -0.510     29.290  2022        4   4.347317             6                207.849890
...         ...        ...   ...      ...        ...           ...                       ...
24348    -3.299     30.301  2022       44  30.327420             6                157.630611
24349    -3.299     30.301  2022       45  30.811167             6                157.630611
24350    -3.299     30.301  2022       46  31.162886             6                157.630611
24351    -3.299     30.301  2022       47  31.439606             6                157.630611
24352    -3.299     30.301  2022       48  29.944366             6                157.630611

24353 rows × 7 columns
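
To sanity-check the grouping visually, a quick scatter of the locations coloured by cluster is one option (a sketch reusing the km_train frame built above):

# Plot each location coloured by its K-Means cluster.
plt.figure(figsize=(8, 6))
sns.scatterplot(data=km_train, x='longitude', y='latitude', hue='kmeans_group', palette='viridis')
plt.title('K-Means location clusters')
plt.show()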

cat_params = {
    
    'n_estimators': 799, 
    'learning_rate': 0.09180872710592884,
    'depth': 8, 
    'l2_leaf_reg': 1.0242996861886846, 
    'subsample': 0.38227256755249117, 
    'colsample_bylevel': 0.7183481537623551,
    'random_state': 42,
    "silent": True,
}

lgb_params = {
    
    'n_estimators': 835, 
    'max_depth': 12, 
    'reg_alpha': 3.849279869880706, 
    'reg_lambda': 0.6840221712299135, 
    'min_child_samples': 10, 
    'subsample': 0.6810493885301987, 
    'learning_rate': 0.0916362259866008, 
    'colsample_bytree': 0.3133780298325982, 
    'colsample_bynode': 0.7966712089198238,
    "random_state": 42,
}

xgb_params = {
    
    "random_state": 42,
}

rf_params = {
    
    'n_estimators': 263, 
    'max_depth': 41, 
    'min_samples_split': 10, 
    'min_samples_leaf': 3,
    "random_state": 42,
    "verbose": 0
}

et_params = {
    
    "random_state": 42,
    "verbose": 0
}

5. Validate Performance on 2021 data

def rmse(a, b):
    # squared=False makes sklearn's mean_squared_error return the RMSE
    return mean_squared_error(a, b, squared=False)

validation = train[train.year == 2021].copy()  # .copy() so writing predictions below cannot touch train
clusters = train["kmeans_group"].unique()

for i in range(len(clusters)):
               
    cluster = clusters[i]
    
    print("==============================================")
    print(f" Cluster {cluster} ")
    
    
    train_c = train[train["kmeans_group"] == cluster]
    
    X_train = train_c[train_c.year < 2021].drop(columns = ["emission", "kmeans_group"])
    y_train = train_c[train_c.year < 2021]["emission"].copy()
    X_val = train_c[train_c.year >= 2021].drop(columns = ["emission", "kmeans_group"])
    y_val = train_c[train_c.year >= 2021]["emission"].copy()
    
    
    
    #=======================================================================================
    catboost_reg = CatBoostRegressor(**cat_params)
    catboost_reg.fit(X_train, y_train, eval_set=(X_val, y_val))

    catboost_pred = catboost_reg.predict(X_val) * M
    print(f"RMSE of CatBoost: {rmse(catboost_pred, y_val)}")

    #=======================================================================================
    lightgbm_reg = LGBMRegressor(**lgb_params,verbose=-1)
    lightgbm_reg.fit(X_train, y_train, eval_set=(X_val, y_val))

    lightgbm_pred = lightgbm_reg.predict(X_val) * M
    print(f"RMSE of LightGBM: {rmse(lightgbm_pred, y_val)}")

    #=======================================================================================
    xgb_reg = XGBRegressor(**xgb_params)
    xgb_reg.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose = False)

    xgb_pred = xgb_reg.predict(X_val) * M
    print(f"RMSE of XGBoost: {rmse(xgb_pred, y_val)}")

    #=======================================================================================
    rf_reg = RandomForestRegressor(**rf_params)
    rf_reg.fit(X_train, y_train)

    rf_pred = rf_reg.predict(X_val) * M
    print(f"RMSE of Random Forest: {rmse(rf_pred, y_val)}")

    #=======================================================================================
    et_reg = ExtraTreesRegressor(**et_params)
    et_reg.fit(X_train, y_train)

    et_pred = et_reg.predict(X_val) * M
    print(f"RMSE of Extra Trees: {rmse(et_pred, y_val)}")
    
    
    overall_pred = lightgbm_pred #(catboost_pred + lightgbm_pred) / 2
    validation.loc[validation["kmeans_group"] == cluster, "emission"] = overall_pred
    
    print(f"RMSE Overall: {rmse(overall_pred, y_val)}")

print("==============================================")
print(f"[DONE] RMSE of all clusters: {rmse(validation['emission'], train[train.year == 2021]['emission'])}")
print(f"[DONE] RMSE of all clusters Week 1-20: {rmse(validation[validation.week_no < 21]['emission'], train[(train.year == 2021) & (train.week_no < 21)]['emission'])}")
print(f"[DONE] RMSE of all clusters Week 21+: {rmse(validation[validation.week_no >= 21]['emission'], train[(train.year == 2021) & (train.week_no  >= 21)]['emission'])}")
==============================================
 Cluster 6 
RMSE of CatBoost: 2.3575606902299895
RMSE of LightGBM: 2.2103640167714094
RMSE of XGBoost: 2.5018849673349863
RMSE of Random Forest: 2.6335510523545556
RMSE of Extra Trees: 3.0029623116826776
RMSE Overall: 2.2103640167714094
==============================================
 Cluster 5 
RMSE of CatBoost: 19.175306730779514
RMSE of LightGBM: 17.910821889134688
RMSE of XGBoost: 19.6677120674706
RMSE of Random Forest: 18.856743714624777
RMSE of Extra Trees: 20.70417439300032
RMSE Overall: 17.910821889134688
==============================================
 Cluster 1 
RMSE of CatBoost: 9.26195004601851
RMSE of LightGBM: 8.513309514506675
RMSE of XGBoost: 10.137965612920658
RMSE of Random Forest: 9.838001199034126
RMSE of Extra Trees: 11.043246766709913
RMSE Overall: 8.513309514506675
==============================================
 Cluster 4 
RMSE of CatBoost: 44.564695183442716
RMSE of LightGBM: 43.946690922308754
RMSE of XGBoost: 50.18811358270916
RMSE of Random Forest: 46.39201148051631
RMSE of Extra Trees: 50.58999576441371
RMSE Overall: 43.946690922308754
==============================================
 Cluster 0 
RMSE of CatBoost: 28.408461784012662
RMSE of LightGBM: 26.872533954605416
RMSE of XGBoost: 30.622689084145943
RMSE of Random Forest: 28.46657485784377
RMSE of Extra Trees: 31.733046766544884
RMSE Overall: 26.872533954605416
==============================================
 Cluster 3 
RMSE of CatBoost: 263.29528869714665
RMSE of LightGBM: 326.12883397111284
RMSE of XGBoost: 336.5771065570381
RMSE of Random Forest: 303.9321016178147
RMSE of Extra Trees: 336.67756932119914
RMSE Overall: 326.12883397111284
==============================================
 Cluster 2 
RMSE of CatBoost: 206.96165808156715
RMSE of LightGBM: 222.40891682146665
RMSE of XGBoost: 281.12604107718465
RMSE of Random Forest: 232.11332438348992
RMSE of Extra Trees: 281.29392713471816
RMSE Overall: 222.40891682146665
==============================================
[DONE] RMSE of all clusters: 23.275548123498453
[DONE] RMSE of all clusters Week 1-20: 31.92891146501802
[DONE] RMSE of all clusters Week 21+: 15.108200701163458
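
Insights

LightGBM has the lowest RMSE in five of the seven clusters; CatBoost wins only in clusters 2 and 3, the two clusters with by far the largest errors. The first 20 weeks of 2021 are also much harder to predict (RMSE 31.9) than the remaining weeks (15.1).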

6. Predicting 2022 result

clusters = train["kmeans_group"].unique()

for i in tqdm(range(len(clusters))):
    
    cluster = clusters[i]
    
    train_c = train[train["kmeans_group"] == cluster]
    if "emission" in test.columns:
        test_c = test[test["kmeans_group"] == cluster].drop(columns = ["emission", "kmeans_group", "lazy_pred"])
    else:
        test_c = test[test["kmeans_group"] == cluster].drop(columns = ["kmeans_group", "lazy_pred"])
    
    X = train_c.drop(columns = ["emission", "kmeans_group"])
    y = train_c["emission"].copy()
    #=======================================================================================
    catboost_reg = CatBoostRegressor(**cat_params)
    catboost_reg.fit(X, y)
    #print(test_c)

    catboost_pred = catboost_reg.predict(test_c)

    #=======================================================================================
    lightgbm_reg = LGBMRegressor(**lgb_params,verbose=-1)
    lightgbm_reg.fit(X, y)
    #print(test_c)

    lightgbm_pred = lightgbm_reg.predict(test_c)

    #=======================================================================================
    #xgb_reg = XGBRegressor(**xgb_params)
    #xgb_reg.fit(X, y, verbose = False)

    #xgb_pred = xgb_reg.predict(test)

    #=======================================================================================
    rf_reg = RandomForestRegressor(**rf_params)
    rf_reg.fit(X, y)

    rf_pred = rf_reg.predict(test_c)

    #=======================================================================================
    #et_reg = ExtraTreesRegressor(**et_params)
    #et_reg.fit(X, y)

    #et_pred = et_reg.predict(test)

    overall_pred = lightgbm_pred #(catboost_pred + lightgbm_pred) / 2
    test.loc[test["kmeans_group"] == cluster, "emission"] = overall_pred
test["emission"] = test["emission"] * M  # the same 1.07 multiplier validated in section 5
test.to_csv('submission.csv', index=False)
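
Note that ID_LAT_LON_YEAR_WEEK was dropped in section 4.1, so the file written above contains the feature columns rather than the usual (ID, emission) pair. A sketch of rebuilding a standard submission, assuming the earlier merge preserved the test row order (pandas' inner merge keeps the left frame's key order):

# Re-read the raw test file for the ID column and pair it with the predictions by position.
sub = pd.read_csv(DATA_DIR + "test.csv")[["ID_LAT_LON_YEAR_WEEK"]]
sub["emission"] = test["emission"].values
sub.to_csv("submission.csv", index=False)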
