线性回归 (Linear Regression)案例分析1

广告费用与产品销量

工欲善其事必先利其器
数据分析
- 1. 检查缺失值、异常值
- 3. 散点图查看特征、响应相关性
- 3. 热力图查看特征、响应相关性
特征工程
- 1、导入必要工具包
- 2、读取数据
- 3、数据标准化
- 4、保存特征工程的结果到文件，供机器学习模型使用
模型选择
- 读取数据
- 数据准备
- 1. 最小二乘线性回归（OLS）
- - 查看残差分布
- 2. 岭回归：正则参数 $\lambda$
- 3. Lasso：正则参数 $\lambda$
- 4. 默认超参数的Huber损失回归
模型评价指标
模型测试结果
- 回归系数
- $R^2$ 分数评价

工欲善其事必先利其器

预备知识： 数据分析入门操作
相关工具库的参考文档： 多维数组&矩阵处理-numpy 文件操作与处理-pandas
数据分析可视化工具1-matplotlib
数据分析可视化工具2-seaborn
机器学习算法库scikit-learn
*深度学习框架TensorFlow
*深度学习框架Pytorch

数据分析

1. 检查缺失值、异常值

导入相关库

#数据处理
import numpy as np 
import pandas as pd 

#数据可视化
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#显示中文
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

读取数据并查看列属性

#读取数据
dpath = "./data/"
df = pd.read_csv(dpath + "Advertising.csv")

#df.columns = ['记录号','电视广告费用', '广播广告费用', '报纸广告费用', '产品销量']
df.columns = ['ID','TV', 'radio', 'newspaper', 'sales']
#通过观察前5行，了解数据每列（特征）的概况
df.head()

# 数据总体信息
df.info()

在这里插入图片描述

# 对数值型特征，得到每个特征的描述统计量，查看特征的大致分布
df.describe()

200个样本，没有缺失值，看不出来有噪声数据
每个样本有3个数值型特征： TV, radio, newspaper
标签：产品销量

3. 散点图查看特征、响应相关性

#散点图查看单个特征与目标之间的关系
plt.scatter(df['TV'], df['sales'])
plt.xlabel(u'电视广告费用')
plt.ylabel(u'销量')

在这里插入图片描述

log_TV = np.log1p(df['TV'])
log_sales = np.log1p(df['sales'])

plt.scatter(log_TV, log_sales)
plt.xlabel(u'log(电视广告费用)')
plt.ylabel(u'log(销量)')

在这里插入图片描述

plt.scatter(df['radio'], df['sales'])
plt.xlabel(u'广播广告费用')
plt.ylabel(u'销量')

在这里插入图片描述

plt.scatter(df['newspaper'], df['sales'])
plt.xlabel(u'报纸广告费用')
plt.ylabel(u'销量')

在这里插入图片描述

plt.xlim(0,300)
plt.ylim(0,300)

plt.scatter(df['TV'], df['radio'])
plt.xlabel(u'电视广告费用')
plt.ylabel(u'广播广告费用')

在这里插入图片描述

特征与特征之间的相关性不太高
特征TV, radio与标签相关性较高，特征newspaper相关性不太高

3. 热力图查看特征、响应相关性

# 得到相关系数的绝对值，通常认为相关系数的绝对值大于0.6的特征为强相关
data_corr = df.corr()
data_corr = data_corr.abs()
sns.heatmap(data_corr,annot=True)
plt.show()

在这里插入图片描述

特征工程

特征均为数值型特征，暂时不做过多处理，只是对特征做标准化处理，使得各特征处理后均为0均值（数据围绕 0 中心化）、1标准差（数据的离散程度被统一为相同的尺度）（并不要求是正态分布：虽然标准正态分布的均值为 0，标准差为 1，但并不是说处理后的特征必须服从正态分布。标准化处理后，特征的分布形状可能仍与原始分布相似（只要经过均值和方差的调整即可））。

1、导入必要工具包

#数据处理
import numpy as np 
import pandas as pd 

#数据可视化
import matplotlib.pyplot as plt
%matplotlib inline

#显示中文
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

2、读取数据

#读取数据
dpath = "./data/"
df = pd.read_csv(dpath + "Advertising.csv")

#通过观察前5行，了解数据每列（特征）的概况
df.head()

#去掉索引列
df.drop(['Unnamed: 0'], axis = 1, inplace = True)

# 数据总体信息
df.info()

3、数据标准化

# 从原始数据中分离输入特征x和输出y
y = df['sales']
X = df.drop('sales', axis = 1)

#特征名称，用于后续显示权重系数对应的特征
feat_names = X.columns

# 数据标准化
# 本数据集中3个特征的单位相同，可以不做特征缩放，不影响正则
# 但3个特征的取值范围不同，如果采用梯度下降/随机梯度下降法求解，
# 还是将所有特征的取值范围缩放到相同区间
from sklearn.preprocessing import StandardScaler

# 分别初始化对特征和目标值的标准化器
ss_X = StandardScaler()

# 对训练数据，先调用fit方法训练模型，得到模型参数；然后对训练数据和测试数据进行transform
X = ss_X.fit_transform(X)

4、保存特征工程的结果到文件，供机器学习模型使用

fe_df = pd.DataFrame(data = X, columns = feat_names, index = df.index)

#加上标签y
fe_df["sales"] = y

#保存结果到文件
fe_df.to_csv(dpath + 'FE_Advertising.csv', index=False)

模型选择

读取数据

#数据处理
import numpy as np 
import pandas as pd 

#数据可视化
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# 使用r2_score作为回归模型性能的评价
from sklearn.metrics import r2_score  

#显示中文
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']

#读取数据
dpath = "./data/"
df = pd.read_csv(dpath + "FE_Advertising.csv")

#通过观察前5行，了解数据每列（特征）的概况
df.head()

# 数据总体信息
df.info()

数据准备

# 从原始数据中分离输入特征x和输出y
y = df['sales']
X = df.drop('sales', axis = 1)

#特征名称，用于后续显示权重系数对应的特征
feat_names = X.columns

#将数据分割训练数据与测试数据
from sklearn.model_selection import train_test_split

# 随机采样20%的数据构建测试样本，其余作为训练样本
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2)
#X_train.shape

1. 最小二乘线性回归（OLS）

# 线性回归
from sklearn.linear_model import LinearRegression

# 1.使用默认配置初始化学习器实例
lr = LinearRegression()

# 2.用训练数据训练模型参数
lr.fit(X_train, y_train)

# 3. 用训练好的模型对测试集进行预测
y_test_pred_lr = lr.predict(X_test)
y_train_pred_lr = lr.predict(X_train)


#性能评估，R方分数
print("The r2 score of LinearRegression on test is %f" %(r2_score(y_test, y_test_pred_lr)))
print("The r2 score of LinearRegression on train is %f" %(r2_score(y_train, y_train_pred_lr)))

# 看看各特征的权重系数，系数的绝对值大小可视为该特征的重要性
fs = pd.DataFrame({"columns":list(feat_names), "coef":list((lr.coef_.T))})
#fs.sort_values(by=['coef'],ascending=False)
fs = fs.append([{'columns':'intercept','coef':lr.intercept_}], ignore_index=True)
fs

在这里插入图片描述

查看残差分布

#在训练集上观察预测残差的分布，看是否符合模型假设：噪声为0均值的高斯噪声
figsize = 11,9
res = y_train_pred_lr - y_train
sns.distplot(res, bins=40, kde = False)
plt.xlabel(u'残差', fontsize = 16)

在这里插入图片描述

figsize = 11,9
plt.scatter(y_train, res)
plt.xlabel(u'真实值', fontsize = 16)
plt.ylabel(u'残差', fontsize = 16)

在这里插入图片描述

从真实值和残差的散点图来看，真实值和残差不是没有关系。看起来真实值较小和较大时，预测残差大多<0,其余情况残差大多>0。也就是说，模型还没有完全建模y与x之间的关系，还有一部分关系残留在残差中。

2. 岭回归：正则参数 $\lambda$

from sklearn.linear_model import  RidgeCV

#1. 设置超参数（正则参数）范围
alphas = [ 0.01, 0.1, 1, 10, 100]

#2. 生成一个RidgeCV实例
ridge = RidgeCV(alphas=alphas, store_cv_values=True)  

#3. 模型训练
ridge.fit(X_train, y_train)    

#4. 预测
y_test_pred_ridge = ridge.predict(X_test)
y_train_pred_ridge = ridge.predict(X_train)

#模型性能评估
print("The r2 score of Ridge on test is %f" %(r2_score(y_test, y_test_pred_ridge)))
print("The r2 score of Ridge on train is %f" %(r2_score(y_train, y_train_pred_ridge)))

# 看看各特征的权重系数，系数的绝对值大小可视为该特征的重要性
fs = pd.DataFrame({"columns":list(feat_names), "coef":list((ridge.coef_.T))})
#fs.sort_values(by=['coef'],ascending=False)
fs = fs.append([{'columns':'intercept','coef':ridge.intercept_}], ignore_index=True)
fs

在这里插入图片描述

mse_mean = np.mean(ridge.cv_values_, axis = 0)
plt.plot(np.log10(alphas), mse_mean.reshape(len(alphas),1)) 

#最佳超参数
plt.axvline(np.log10(ridge.alpha_), color='r', ls='--')

plt.xlabel('log(alpha)', fontsize = 16)
plt.ylabel('MSE', fontsize = 16)
plt.show()

print ('alpha is:', ridge.alpha_)

在这里插入图片描述

3. Lasso：正则参数 $\lambda$

from sklearn.linear_model import LassoCV

#1. 设置超参数搜索范围
#Lasso可以自动确定最大的alpha，所以另一种设置alpha的方式是设置最小的alpha值（eps） 和 超参数的数目（n_alphas），
#然后LassoCV对最小值和最大值之间在log域上均匀取值n_alphas个
# np.logspace(np.log10(alpha_max * eps), np.log10(alpha_max),num=n_alphas)[::-1]

#2 生成LassoCV实例（默认超参数搜索范围）
lasso = LassoCV()  

#3. 训练（内含CV）
lasso.fit(X_train, y_train)  

#4. 测试
y_test_pred_lasso = lasso.predict(X_test)
y_train_pred_lasso = lasso.predict(X_train)


# 评估，使用r2_score评价模型在测试集和训练集上的性能
print("The r2 score of lasso on test is %f" %(r2_score(y_test, y_test_pred_lasso)))
print("The r2 score of lasso on train is %f" %(r2_score(y_train, y_train_pred_lasso)))

# 看看各特征的权重系数，系数的绝对值大小可视为该特征的重要性
fs = pd.DataFrame({"columns":list(feat_names), "coef_lr":list((lr.coef_.T)), "coef_ridge":list((ridge.coef_.T)), "coef_lasso":list((lasso.coef_.T))})
#fs.sort_values(by=['coef_lr'],ascending=False)
fs = fs.append([{'columns':'intercept','coef_lr':lr.intercept_, 'coef_ridge':ridge.intercept_, 'coef_lasso':lasso.intercept_}], ignore_index=True)
fs

在这里插入图片描述

mses = np.mean(lasso.mse_path_, axis = 1)
plt.plot(np.log10(lasso.alphas_), mses) 

#最佳超参数
plt.axvline(np.log10(lasso.alpha_), color='r', ls='--')

plt.xlabel('log(alpha)', fontsize = 16)
plt.ylabel('MSE', fontsize = 16)
plt.show()    
            
print ('alpha is:', lasso.alpha_)

在这里插入图片描述

from sklearn.linear_model import ElasticNetCV

#1. 设置超参数搜索范围
#Lasso可以自动确定最大的alpha，所以另一种设置alpha的方式是设置最小的alpha值（eps） 和 超参数的数目（n_alphas），
#然后LassoCV对最小值和最大值之间在log域上均匀取值n_alphas个
# np.logspace(np.log10(alpha_max * eps), np.log10(alpha_max),num=n_alphas)[::-1]
l1_ratio = [0.01, 0.1, .5, .7, .9, .95, .99, 1]

#2 ElasticNetCV（设置超参数搜索范围）
elastic_net = ElasticNetCV(l1_ratio = l1_ratio )  

#3. 训练（内含CV）
elastic_net.fit(X_train, y_train)  

#4. 测试
y_test_pred_elastic_net = elastic_net.predict(X_test)
y_train_pred_elastic_net = elastic_net.predict(X_train)


# 评估，使用r2_score评价模型在测试集和训练集上的性能
print("The r2 score of elastic_net on test is %f" %(r2_score(y_test, y_test_pred_elastic_net)))
print("The r2 score of elastic_net on train is %f" %(r2_score(y_train, y_train_pred_elastic_net)))

# 看看各特征的权重系数，系数的绝对值大小可视为该特征的重要性
fs = pd.DataFrame({"columns":list(feat_names), "coef_lr":list((lr.coef_.T)), "coef_ridge":list((ridge.coef_.T)), "coef_lasso":list((lasso.coef_.T)), 'coef_elastic_net':list((elastic_net.coef_.T))})
#fs.sort_values(by=['coef_lr'],ascending=False)
fs = fs.append([{'columns':'intercept','coef_lr':lr.intercept_, 'coef_ridge':ridge.intercept_, 'coef_lasso':lasso.intercept_, 'coef_elastic_net':elastic_net.intercept_}], ignore_index=True)
fs

在这里插入图片描述

mses = np.mean(elastic_net.mse_path_, axis = 2)

# plot results
n_l1_ratio = len(l1_ratio)
n_alpha = elastic_net.alphas_.shape[1]

for i, l1 in enumerate(l1_ratio):
    plt.plot(np.log10(elastic_net.alphas_[i]), mses[i], label= u'l1正则比例:' + str(l1))

#最佳超参数
plt.axvline(np.log10(elastic_net.alpha_), color='r', ls='--')

plt.xlabel('log(alpha)', fontsize = 16)
plt.ylabel('MSE', fontsize = 16)
plt.legend(fontsize = 12)
plt.show()    
            
print ('alpha is:', elastic_net.alpha_)
print ('l1_ratio is:', elastic_net.l1_ratio_)

在这里插入图片描述

4. 默认超参数的Huber损失回归

# Huber回归
from sklearn.linear_model import HuberRegressor

# 1.使用默认配置初始化学习器实例
hr = HuberRegressor()

# 2.用训练数据训练模型参数
hr.fit(X_train, y_train)

# 3. 用训练好的模型对测试集进行预测
y_test_pred_hr = hr.predict(X_test)
y_train_pred_hr = hr.predict(X_train)


# 看看各特征的权重系数，系数的绝对值大小可视为该特征的重要性
fs = pd.DataFrame({"columns":list(feat_names), "coef":list((hr.coef_.T))})
#fs.sort_values(by=['coef'],ascending=False)
fs = fs.append([{'columns':'intercept','coef':hr.intercept_}], ignore_index=True)
fs

在这里插入图片描述

#在训练集上观察预测残差的分布，看是否符合模型假设：噪声为0均值的高斯噪声
figsize = 11,9
res =  y_train_pred_hr - y_train
sns.distplot(res, bins=40, kde = False)
plt.xlabel(u'残差')
plt.title(u'残差直方图')

在这里插入图片描述

figsize = 11,9

res = y_train - y_train_pred_hr
plt.scatter(y_train_pred_hr, res)
plt.xlabel(u'预测值')
plt.ylabel(u'残差')

在这里插入图片描述

模型评价指标

超参数调优评价指标：MSE（红色竖线为最佳超参数）
在这里插入图片描述

模型测试结果

回归系数

在数据分析中，我们经常需要从数据集中随机抽取一部分样本作为测试数据，而剩下的样本则用作训练数据。例如，在这个例子中，我们随机选择了20%的样本作为测试数据，而其余80%的样本则用作训练数据。

接下来，我们来看一下广告数据集上不同线性回归模型的系数。这些模型包括最小二乘线性回归、岭回归、Lasso回归以及弹性网络回归。每种模型都对不同的特征（如TV、radio和newspaper）赋予了不同的回归系数。此外，还有一个截距项用于调整模型的基线预测值。

以下是这些模型的回归系数：

特征	最小二乘线性回归系数	岭回归系数	Lasso系数	弹性网络系数
TV	$3.983944$	$3.981524$	$3.921642$	$3.921642$
radio	$2.860230$	$2.858304$	$2.806374$	$2.806374$
newspaper	$0.038194$	$0.038925$	$0.000000$	$0.000000$
截距项	$13.969091$	$13.969282$	$13.972528$	$13.972528$

从这些系数中，我们可以看到岭回归、Lasso和弹性网络的回归系数的绝对值比最小二乘线性回归（OLS）的小。这表明这些模型在一定程度上进行了权值收缩，有助于防止模型过拟合。此外，Lasso回归还展示了其稀疏性特点，即它将特征newspaper的系数缩减为 $0$ ，这意味着在Lasso模型中，newspaper对预测结果的影响被完全忽略。

这种系数的缩减有助于提高模型的解释性和泛化能力，特别是在特征数量较多的情况下。通过这种方式，我们可以更清晰地理解哪些特征对预测结果有实质性的影响，同时也能够减少模型的复杂度。