Preface
The first part used each region's GDP data to fill the missing values in the target label. The middle part followed the same idea to forecast future sales: first aggregate over country, product, and store; then forecast total daily sales for the next three years; finally decompose the forecast back into sales per country, product, and store. This part instead predicts future sales with a conventional machine-learning workflow.
Several schemes are compared. To capture the time-series characteristics found in the first part's EDA, periodic features such as sine and cosine columns are added.
Load the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the data
train = pd.read_csv("/kaggle/input/playground-series-s5e1/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s5e1/test.csv")
Scheme 1
# Make a copy of the raw data
train_data=train.copy()
test_data=test.copy()
# Check for missing values
train_data.isna().sum().sort_values(ascending=False)
num_sold 8871
id 0
date 0
country 0
store 0
product 0
dtype: int64
# Simply drop the rows with missing values
train_data = train_data.dropna()
EDA
train_data['date'] = pd.to_datetime(train_data['date'])
test_data['date'] = pd.to_datetime(test_data['date'])
train_data['Year'] = train_data['date'].dt.year
train_data['Month'] = train_data['date'].dt.month
train_data['Day'] = train_data['date'].dt.day
test_data['Year'] = test_data['date'].dt.year
test_data['Month'] = test_data['date'].dt.month
test_data['Day'] = test_data['date'].dt.day
Simple year/month/day extraction, following the basic features from the first part's EDA.
# Drop the date column
train_data.drop('date',axis=1,inplace=True)
test_data.drop('date',axis=1,inplace=True)
Separate the categorical and numerical features
train_data = train_data.drop('id', axis = 1)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns.difference(['num_sold']))
cat_cols = list(train_data.select_dtypes(include=['object']).columns)
test_data = test_data.drop('id', axis = 1)
num_cols_test = list(test_data.select_dtypes(exclude=['object']).columns.difference(['id']))
cat_cols_test = list(test_data.select_dtypes(include=['object']).columns)
Encode the categorical features
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoders = {col: LabelEncoder() for col in cat_cols}
# Apply LabelEncoder to each categorical column
for col in cat_cols:
    train_data[col] = label_encoders[col].fit_transform(train_data[col])
    test_data[col] = label_encoders[col].transform(test_data[col])
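One caveat with LabelEncoder: transform raises an error on categories never seen during fit. That is harmless here, since train and test share the same countries, stores, and products, but a more defensive sketch using sklearn's OrdinalEncoder (the -1 fill value is an arbitrary choice) would be:

from sklearn.preprocessing import OrdinalEncoder
# Unseen test categories are mapped to -1 instead of raising an error
ordinal = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_data[cat_cols] = ordinal.fit_transform(train_data[cat_cols])
test_data[cat_cols] = ordinal.transform(test_data[cat_cols])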
Separate X and y, then split into training and validation sets
from sklearn.model_selection import train_test_split
X = train_data.drop(['num_sold'], axis=1)
y = train_data['num_sold']
# Split data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
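Note that a random split interleaves future and past dates between training and validation, which can be optimistic for a forecasting task. A chronological hold-out is stricter; a minimal sketch, assuming 2016 is the last year in the training data:

# Hold out the final year chronologically instead of splitting at random
mask = X['Year'] < 2016
X_train, X_valid = X[mask], X[~mask]
y_train, y_valid = y[mask], y[~mask]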
Import the XGBoost regressor and define the evaluation functions
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
# Define MAPE and MSE metrics
def mape(y_true, y_pred):
    return mean_absolute_percentage_error(y_true, y_pred)

def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)
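For reference, the MAPE used throughout this post is defined as

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

Because it is a relative error, it must be computed on the original sales scale, not a transformed one; this matters once the target is log-transformed in Scheme 2 below.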
Build the model with the XGBoost regressor
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
Predict and evaluate
# Predict and evaluate
y_pred = xgb.predict(X_valid)
score = mape(y_valid, y_pred)
me=mse(y_valid, y_pred)
score,me
(0.30150839272752505, 9599.83371487707)
Scheme 2
Check the skewness and kurtosis of the target
train_data['num_sold'].skew(),train_data['num_sold'].kurtosis()
(1.4153734524983919, 2.61233506292136)
The distribution is clearly right-skewed.
Plot the distribution
sns.histplot(data=y, kde=True, stat="density");
The plot shows a pronounced skew, so apply a log transform to the target and plot again
sns.histplot(data=np.log(y), kde=True, stat="density");
After the log transform the skew is much improved, so the log-transformed values are used as the modeling target.
Apply the log transform
y=np.log1p(y)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
Instead of np.log we use np.log1p, i.e. log(1 + x), which safely handles zeros; correspondingly, evaluation and prediction must be transformed back to the original scale with np.expm1.
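A quick sanity check of the round trip, since np.expm1 is the exact inverse of np.log1p:

x = np.array([0.0, 1.0, 973.0])
# log1p(0) = 0, so zeros survive the transform, and expm1 recovers the originals
assert np.allclose(np.expm1(np.log1p(x)), x)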
Modeling, prediction, and evaluation
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
# Predict and evaluate
y_pred = xgb.predict(X_valid)
# score = mape(y_valid, y_pred)  # Wrong: transform back to the original scale first, otherwise the scores are not comparable
# me = mse(y_valid, y_pred)
# (0.012581281054301561, 0.00724846629002101)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me
(0.06777639649910845, 8359.441860251725)
A clear improvement.
Evaluation results
Scheme | MAPE | Processing |
---|---|---|
Scheme 1 | 0.3015 | Split date into year/month/day |
Scheme 2 | 0.0678 | Scheme 1 plus log-transformed target |
Scheme 3
train_data=train.copy()
test_data=test.copy()
train_data = train_data.dropna()
The usual year/month/day processing
train_data['date'] = pd.to_datetime(train_data['date'])
test_data['date'] = pd.to_datetime(test_data['date'])
train_data['year'] = train_data['date'].dt.year
train_data['month'] = train_data['date'].dt.month
train_data['day'] = train_data['date'].dt.day
test_data['year'] = test_data['date'].dt.year
test_data['month'] = test_data['date'].dt.month
test_data['day'] = test_data['date'].dt.day
Add per-capita GDP as a feature
Unlike the first part, here the GDP data is fetched directly through the World Bank API, which is flexible and convenient.
# Function to fetch GDP per capita
import requests
def get_gdp_per_capita(country, year):
    alpha3 = {
        'Canada': 'CAN', 'Finland': 'FIN', 'Italy': 'ITA',
        'Kenya': 'KEN', 'Norway': 'NOR', 'Singapore': 'SGP'
    }
    url = f"https://api.worldbank.org/v2/country/{alpha3[country]}/indicator/NY.GDP.PCAP.CD?date={year}&format=json"
    response = requests.get(url).json()
    try:
        return response[1][0]['value']
    except (IndexError, TypeError):
        return None
countries = ['Canada', 'Finland', 'Italy', 'Kenya', 'Norway', 'Singapore']
years = range(2010, 2020)
gdp_data = {}
for country in countries:
    for year in years:
        gdp_data[(country, year)] = get_gdp_per_capita(country, year)
# Add GDP feature to train and test DataFrames
def add_gdp_feature(df):
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year  # Extract year from the date
    df['gdp'] = df.apply(lambda row: gdp_data.get((row['country'], row['year']), None), axis=1)
    return df
# Apply to train and test datasets
train_data = add_gdp_feature(train_data)
test_data = add_gdp_feature(test_data)
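The loop above issues one HTTP request per (country, year) pair, 60 in total. The World Bank API also accepts a year range in the date parameter, so each country's whole series can be fetched in a single request; a minimal sketch:

def get_gdp_series(country, start=2010, end=2019):
    alpha3 = {'Canada': 'CAN', 'Finland': 'FIN', 'Italy': 'ITA',
              'Kenya': 'KEN', 'Norway': 'NOR', 'Singapore': 'SGP'}
    url = (f"https://api.worldbank.org/v2/country/{alpha3[country]}"
           f"/indicator/NY.GDP.PCAP.CD?date={start}:{end}&format=json&per_page=100")
    records = requests.get(url).json()[1]
    # Each record carries its year in the 'date' field
    return {(country, int(rec['date'])): rec['value'] for rec in records}

gdp_data = {}
for country in countries:
    gdp_data.update(get_gdp_series(country))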
With GDP now filled in for both train and test, encode the categorical features as before
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoders = {col: LabelEncoder() for col in cat_cols}
# Apply LabelEncoder to each categorical column
for col in cat_cols:
    train_data[col] = label_encoders[col].fit_transform(train_data[col])
    test_data[col] = label_encoders[col].transform(test_data[col])
train_date=train_data.pop('date')
test_date=test_data.pop('date')
train_data = train_data.drop('id', axis = 1)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns.difference(['num_sold']))
cat_cols = list(train_data.select_dtypes(include=['object']).columns)
test_data = test_data.drop('id', axis = 1)
num_cols_test = list(test_data.select_dtypes(exclude=['object']).columns.difference(['id']))
cat_cols_test = list(test_data.select_dtypes(include=['object']).columns)
Inspect the data
train_data.head()
- | country | store | product | num_sold | year | month | day | gdp |
---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 1 | 973.0 | 2010 | 1 | 1 | 47560.666601 |
2 | 0 | 0 | 2 | 906.0 | 2010 | 1 | 1 | 47560.666601 |
3 | 0 | 0 | 3 | 423.0 | 2010 | 1 | 1 | 47560.666601 |
4 | 0 | 0 | 4 | 491.0 | 2010 | 1 | 1 | 47560.666601 |
5 | 0 | 2 | 0 | 300.0 | 2010 | 1 | 1 | 47560.666601 |
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])
# Re-split, otherwise the fit below would reuse the stale split from Scheme 2
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
# Predict and evaluate
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me
(0.06931576130020592, 8393.297674737376)
test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log.csv', index=False)
- | id | num_sold |
---|---|---|
0 | 230130 | 134.386810 |
1 | 230131 | 759.479980 |
2 | 230132 | 680.182922 |
3 | 230133 | 350.549652 |
4 | 230134 | 413.237976 |
Scheme 4
Add day-of-week information
train_data['weekday']=train_date.dt.weekday
test_data['weekday'] = test_date.dt.weekday
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
# Predict and evaluate
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me
(0.045671587590535107, 3451.7557869198386)
test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log11.csv', index=False)
- | id | num_sold |
---|---|---|
0 | 230130 | 162.664078 |
1 | 230131 | 892.149536 |
2 | 230132 | 819.040649 |
3 | 230133 | 412.940216 |
4 | 230134 | 496.703644 |
Scheme 5
train_data=train.copy()
test_data=test.copy()
train_data = train_data.dropna()
def eda(df):
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['day'] = df['date'].dt.day
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['month_name'] = df['date'].dt.month_name()
    df['day_of_week'] = df['date'].dt.day_name()
    df['week'] = df['date'].dt.isocalendar().week
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['quarter_sin'] = np.sin(2 * np.pi * df['quarter'] / 4)
    df['quarter_cos'] = np.cos(2 * np.pi * df['quarter'] / 4)
    df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)
    df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
    # Roughly week-sized time index, counted relative to 2020
    df['group'] = (df['year'] - 2020) * 48 + df['month'] * 4 + df['day'] // 7
    df.drop('date', axis=1, inplace=True)
    df['cos_year'] = np.cos(df['year'] * (2 * np.pi) / 100)
    df['sin_year'] = np.sin(df['year'] * (2 * np.pi) / 100)
    # Why sin/cos? It tells the model that January follows December.
    # With a plain 1..12 encoding, the wrap-around from 12 back to 1 is lost;
    # the same applies to day of week. This works for any date feature.
    dummy_prefixes = ['country', 'store', 'product', 'month_name', 'day_of_week']
    df = pd.get_dummies(df, columns=dummy_prefixes, drop_first=True)
    return df
train_data = eda(train_data)
test_data = eda(test_data)
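To see why the sin/cos encoding matters, compare December and January: as raw month numbers they are 11 apart, but on the unit circle they are neighbors:

months = np.array([1, 6, 12])
pts = np.c_[np.sin(2 * np.pi * months / 12), np.cos(2 * np.pi * months / 12)]
print(np.linalg.norm(pts[2] - pts[0]))  # Dec vs Jan: ~0.52, adjacent on the circle
print(np.linalg.norm(pts[1] - pts[0]))  # Jun vs Jan: ~1.93, opposite sides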
train_data = train_data.drop('id',axis=1)
test_data = test_data.drop('id',axis=1)
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
# Predict and evaluate
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me
(0.04549245301889767, 3233.53332460358)
test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log_up.csv', index=False)
- | id | num_sold |
---|---|---|
0 | 230130 | 144.742188 |
1 | 230131 | 869.433594 |
2 | 230132 | 759.550293 |
3 | 230133 | 403.059479 |
4 | 230134 | 456.856323 |
The 0.0454924 model was submitted, and its leaderboard score was far worse than the middle part's ratio-based calculation, which used no model at all.
Overall evaluation results
Scheme | MAPE | Processing |
---|---|---|
Scheme 1 | 0.3015 | Split date into year/month/day |
Scheme 2 | 0.06777 | Scheme 1 plus log-transformed target |
Scheme 3 | 0.06931 | Scheme 2 plus GDP data |
Scheme 4 | 0.04567 | Scheme 3 plus day-of-week feature |
Scheme 5 | 0.04549 | Feature engineering redone |
Summary

First, on feature handling for time-series data:

- The first step is usually to split the date into year, month, and day;
- In some countries, holidays can also carry useful signal, especially for product sales;
- GDP data can directly affect the result, though in Scheme 3 its effect was not obvious (slightly worse than Scheme 2);
- Weekends show a clearly distinct pattern in sales;
- The first part's EDA showed that sales follow a regular periodic cycle, so trigonometric (sine/cosine) encodings work well.

Second, this post only explores feature engineering and does not cover model selection or hyperparameter tuning. Regression algorithms such as RandomForest, LightGBM, and CatBoost are also worth trying, and hyperparameters can be optimized with Optuna, covered in an earlier post (see the sketch at the end); these steps should further improve the score.

Third, the evaluation results above only reflect performance on the validation set; the final score is only visible after submission, and there can still be a sizeable gap between the two.
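As mentioned in the second point, a minimal Optuna tuning sketch for the XGBoost model above might look like this (the search ranges are arbitrary choices, not tuned values):

import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 200, 1000),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = XGBRegressor(random_state=42, **params)
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    # Score on the original scale, matching the competition metric
    return mape(np.expm1(y_valid), np.expm1(pred))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print(study.best_params)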