Ridge Regression, the Lasso, and Principal Components Regression in R: Linear Model Selection and Regularization


Overview and Definitions

In this article we consider some alternative fitting procedures for linear models beyond the usual ordinary least squares (OLS). These alternatives can sometimes provide better prediction accuracy and better model interpretability.


  • Prediction accuracy: when the true relationship is approximately linear, the OLS estimates have low bias. OLS also performs well when n >> p. However, if n is not much larger than p, the fit can be highly variable, leading to overfitting and/or poor predictions. If p > n, there is no longer a unique least-squares estimate and the method cannot be used at all.

This problem is another aspect of the curse of dimensionality. As p grows, observations tend to lie closer to the boundaries between classes than to neighboring observations, which creates serious problems for prediction. Moreover, for large p the training sample is often sparse, making it difficult to identify trends and to make predictions.

By constraining and shrinking the estimated coefficients, we can often greatly reduce the variance at the cost of a negligible increase in bias, which usually leads to a marked improvement in accuracy.

  • Model interpretability: irrelevant variables add unnecessary complexity to the resulting model. By removing them (setting their coefficients to 0) we obtain a model that is easier to interpret. With OLS, however, it is extremely unlikely that any coefficient is exactly zero.

    • Subset selection: we fit a least-squares model using only a subset of the features.

Although we discuss these techniques in the context of linear models, they also apply to other methods, such as classification.

Methods in Detail

Subset Selection

Best Subset Selection

Here we fit a separate OLS regression for every possible combination of the p predictors and then examine the resulting fits. The problem with this approach is that the best model is hidden among 2^p possibilities. The algorithm has two stages: (1) fit all models containing k predictors, for each k up to the maximum model size; (2) select a single model using cross-validated prediction error. More specific prediction-error criteria, such as AIC and BIC, are discussed below.

This carries over to other types of model selection, such as logistic regression, but the score used to choose among models changes. For logistic regression we use the deviance instead of RSS and R^2.

Choosing the Optimal Model

Each of the three algorithms mentioned above requires us to decide which model works best. As noted earlier, when training error is used, the model with the most predictors usually has the smallest RSS and the largest R^2. To instead select the model with the smallest test error, we need to estimate the test error. There are two ways to do this.

  1. Indirectly estimate the test error by adjusting the training error to account for the bias due to overfitting.

  2. Directly estimate the test error using a validation set or cross-validation.

Validation and Cross-Validation

In general, cross-validation provides a more direct estimate of the test error and makes fewer assumptions about the underlying model. It can also be used across a wider range of model types.

Ridge Regression

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS, but it also incurs a shrinkage penalty as the coefficients move away from zero. The effect of this penalty is to shrink the coefficient estimates toward zero. The tuning parameter λ controls the amount of shrinkage: with λ = 0, ridge regression behaves exactly like OLS. Choosing a good value of λ is therefore critical and should be done with cross-validation. Ridge regression requires the predictors X to be centered to mean = 0, so the data must be standardized beforehand.
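As a minimal sketch of this objective, the glmnet package (not used in this article's examples, so purely illustrative) fits the ridge penalty when `alpha = 0` and standardizes the predictors by default:

```r
# Ridge minimizes RSS + lambda * sum(beta_j^2); alpha = 0 selects the
# ridge penalty in glmnet.
library(ISLR)
library(glmnet)

x <- model.matrix(Salary ~ ., na.omit(Hitters))[, -1]  # predictor matrix, intercept dropped
y <- na.omit(Hitters)$Salary

cv.out <- cv.glmnet(x, y, alpha = 0)  # choose lambda by cross-validation
cv.out$lambda.min                     # the CV-selected lambda
```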


Why can ridge regression be better than least squares?

The advantage is apparent in the bias-variance trade-off. As λ increases, the flexibility of the ridge fit decreases, producing a decrease in variance at the cost of a smaller increase in bias. Plain OLS regression has high variance but little bias. Since the lowest test MSE tends to occur near the crossover point between variance and bias, by tuning λ appropriately to trade a little bias for less variance we can reach a lower achievable MSE.

Ridge regression works best in situations where the least-squares estimates have high variance. It is also far more computationally efficient than any subset method, since the fits for all values of λ can be computed at essentially the same time.

The Lasso

Ridge regression has at least one drawback: it includes all p predictors in the final model. The penalty term pushes many of them close to zero but never exactly to zero. This is usually not a problem for prediction accuracy, but it makes the model harder to interpret. The lasso overcomes this drawback: when its tuning parameter s is small enough, it forces some coefficients to be exactly zero. Since s = 1 gives ordinary OLS regression, the coefficients shrink toward zero as s approaches 0. The lasso therefore also performs variable selection.
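A matching glmnet sketch for the lasso (`alpha = 1`), again purely illustrative, shows that some coefficients come out exactly zero:

```r
library(ISLR)
library(glmnet)

x <- model.matrix(Salary ~ ., na.omit(Hitters))[, -1]
y <- na.omit(Hitters)$Salary

cv.out <- cv.glmnet(x, y, alpha = 1)          # lasso penalty
lasso.coef <- coef(cv.out, s = "lambda.min")  # coefficients at the CV-chosen lambda
sum(lasso.coef == 0)                          # some coefficients are exactly zero
```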

Dimension Reduction Methods

The methods discussed so far have controlled variance either by using a subset of the original variables or by shrinking their coefficients toward zero. We now explore a class of models that transform the predictors and then fit a least-squares model using the transformed variables. Dimension reduction turns the problem of estimating p + 1 coefficients into the simpler problem of estimating M + 1 coefficients, where M < p. Two approaches for this task are principal components regression and partial least squares.

Principal Components Regression (PCR)

PCA can be described as a method for deriving a low-dimensional set of features from a large number of variables.

In regression, we construct M principal components and then use those components as predictors in a linear regression fit by least squares. We can often fit a better model than with ordinary least squares because we reduce the effect of overfitting.

Partial Least Squares

The PCR approach described above identifies the linear combinations of X that best represent the predictors; these directions are found without using the response.

PLS instead chooses its directions in a supervised way, placing higher weights on the variables most strongly related to the response.

In practice, PLS often performs no better than ridge regression or PCR: although PLS can reduce bias, it can also increase variance, so the overall gain is usually a wash.

Interpreting Results in High Dimensions

We must always be careful about how we report model results, especially in high-dimensional settings. There, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all the other variables in the model.

Examples

Subset Selection Methods

Best Subset Selection

We wish to predict a baseball player's Salary on the basis of various statistics from the previous year.

library(ISLR)
attach(Hitters)
names(Hitters)
##  [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"      
##  [6] "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"   
## [11] "CRuns"     "CRBI"      "CWalks"    "League"    "Division" 
## [16] "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"
dim(Hitters)
## [1] 322  20
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
# check the percentage of missing Salary values
sum(is.na(Hitters$Salary))/length(Hitters[,1])*100
## [1] 18.32

It turns out that about 18% of the Salary values are missing. We simply omit the observations with missing data.

Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263  20

We perform best subset selection, using RSS to quantify fit.

library(leaps)

regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 ) " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 ) " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 ) "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 ) " "   "*"  " "   " "  " " "*"   " "   "*"    "*"   "*"    " "  
## 8  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   " "    " "   "*"    "*"  
##          CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 ) "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 ) "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 ) "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 ) " "  " "    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 ) " "  "*"    " "     "*"       "*"     " "     " "    " "

An asterisk indicates that the variable is included in the corresponding model.
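The R^2 values below were presumably produced by refitting with `nvmax = 19` so that all model sizes are considered; a sketch of the assumed call:

```r
# Refit allowing models of up to all 19 variables, then extract the
# R^2 for each model size.
library(leaps)
regfit.full <- regsubsets(Salary ~ ., Hitters, nvmax = 19)
round(summary(regfit.full)$rsq, 4)
```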

##  [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461

Across these models of up to 19 variables, R^2 increases monotonically as predictors are added.

We can use the built-in plot functions to display RSS, adjusted R^2, Cp, AIC, and BIC.

Note: all of the fit measures shown above (except R^2) serve as estimates of the test error.
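A sketch of that built-in plotting, assuming the full `nvmax = 19` fit:

```r
library(leaps)
regfit.full <- regsubsets(Salary ~ ., Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)

# Plot each criterion against model size.
par(mfrow = c(2, 2))
plot(reg.summary$rss,   type = "l", xlab = "Number of variables", ylab = "RSS")
plot(reg.summary$adjr2, type = "l", xlab = "Number of variables", ylab = "Adjusted R^2")
plot(reg.summary$cp,    type = "l", xlab = "Number of variables", ylab = "Cp")
plot(reg.summary$bic,   type = "l", xlab = "Number of variables", ylab = "BIC")

which.max(reg.summary$adjr2)  # model size maximizing adjusted R^2
which.min(reg.summary$bic)    # model size minimizing BIC
```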

Forward and Backward Stepwise Selection
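The forward and backward stepwise output below appears to come from calls like these (the forward call matches the `Call:` line shown):

```r
library(leaps)
# Forward stepwise selection
regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)
# Backward stepwise selection
regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)
```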

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: forward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 4  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 5  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    " "  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    " "  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  "*"  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  "*"  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 5  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 6  ( 1 )  "*"  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"
## Subset selection object
## 19 Variables  (and intercept)
##            Forced in Forced out
## AtBat          FALSE      FALSE
## Hits           FALSE      FALSE
## HmRun          FALSE      FALSE
## Runs           FALSE      FALSE
## RBI            FALSE      FALSE
## Walks          FALSE      FALSE
## Years          FALSE      FALSE
## CAtBat         FALSE      FALSE
## CHits          FALSE      FALSE
## CHmRun         FALSE      FALSE
## CRuns          FALSE      FALSE
## CRBI           FALSE      FALSE
## CWalks         FALSE      FALSE
## LeagueN        FALSE      FALSE
## DivisionW      FALSE      FALSE
## PutOuts        FALSE      FALSE
## Assists        FALSE      FALSE
## Errors         FALSE      FALSE
## NewLeagueN     FALSE      FALSE
## 1 subsets of each size up to 19
## Selection Algorithm: backward
##           AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun CRuns
## 1  ( 1 )  " "   " "  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 2  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 3  ( 1 )  " "   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 4  ( 1 )  "*"   "*"  " "   " "  " " " "   " "   " "    " "   " "    "*"  
## 5  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 6  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 7  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 8  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   " "    " "   " "    "*"  
## 9  ( 1 )  "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 10  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 11  ( 1 ) "*"   "*"  " "   " "  " " "*"   " "   "*"    " "   " "    "*"  
## 12  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 13  ( 1 ) "*"   "*"  " "   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 14  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    " "   " "    "*"  
## 15  ( 1 ) "*"   "*"  "*"   "*"  " " "*"   " "   "*"    "*"   " "    "*"  
## 16  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 17  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   " "   "*"    "*"   " "    "*"  
## 18  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   " "    "*"  
## 19  ( 1 ) "*"   "*"  "*"   "*"  "*" "*"   "*"   "*"    "*"   "*"    "*"  
##           CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
## 1  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 2  ( 1 )  " "  " "    " "     " "       " "     " "     " "    " "       
## 3  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 4  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 5  ( 1 )  " "  " "    " "     " "       "*"     " "     " "    " "       
## 6  ( 1 )  " "  " "    " "     "*"       "*"     " "     " "    " "       
## 7  ( 1 )  " "  "*"    " "     "*"       "*"     " "     " "    " "       
## 8  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 9  ( 1 )  "*"  "*"    " "     "*"       "*"     " "     " "    " "       
## 10  ( 1 ) "*"  "*"    " "     "*"       "*"     "*"     " "    " "       
## 11  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 12  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     " "    " "       
## 13  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 14  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 15  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 16  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    " "       
## 17  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 18  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"       
## 19  ( 1 ) "*"  "*"    "*"     "*"       "*"     "*"     "*"    "*"

Here we can see that the one- through six-variable models are identical for best subset and forward stepwise selection.

Ridge Regression and the Lasso

Setting Up Cross-Validation

We will also apply cross-validation within the regularization methods.

The Validation Set

Instead of estimating the test error rate with adjusted R^2, Cp, and BIC, we can use cross-validation directly. We must use only the training observations for every aspect of model fitting and variable selection; the test error is then computed by applying the trained model to the test (validation) data.
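The ridge output below appears to come from caret with its default bootstrap resampling (25 reps). A sketch of the split and fit, where the seed and partition are assumptions (the original code is not shown):

```r
library(caret)
set.seed(1)  # assumed seed; the original split is not shown

# Roughly 50/50 split, consistent with the 133 training samples below.
inTrain <- createDataPartition(Hitters$Salary, p = 0.5, list = FALSE)
train <- Hitters[inTrain, ]
test  <- Hitters[-inTrain, ]

# caret's default resampling is the bootstrap with 25 repetitions,
# matching the "Bootstrapped (25 reps)" line in the output below.
ridge <- train(Salary ~ ., data = train, method = "ridge",
               preProcess = c("center", "scale"))
ridge.pred <- predict(ridge, test)
```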

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       400   0.4       40       0.09       
##   1e-04   400   0.4       40       0.09       
##   0.1     300   0.5       40       0.09       
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 0.1.

mean(ridge.pred - test$Salary)^2  # note: squared *mean* error, not the MSE mean((ridge.pred - test$Salary)^2)
## [1] 30.1

k-Fold Cross-Validation

We use k-fold cross-validation to choose the best lambda.

To perform cross-validation, we split the data into training and test sets.
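A sketch of the 10-fold cross-validated ridge fit that appears to produce the output below; the lambda grid matches the values shown, and `train`/`test` are the training and test sets just described:

```r
library(caret)

fitControl <- trainControl(method = "cv", number = 10)   # 10-fold CV
lambdaGrid <- expand.grid(lambda = c(0, 1e-4, 0.1))      # grid matching the output

ridge <- train(Salary ~ ., data = train, method = "ridge",
               trControl = fitControl, tuneGrid = lambdaGrid,
               preProcess = c("center", "scale"))
ridge.pred <- predict(ridge, test)
```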

## Ridge Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       300   0.6       70       0.1        
##   1e-04   300   0.6       70       0.1        
##   0.1     300   0.6       70       0.1        
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was lambda = 1e-04.
# extract the coefficients of the final model
predict(ridge$finalModel, type='coef', mode='norm')$coefficients[19,]
##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -157.221    313.860    -18.996      0.000    -70.392    171.242 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -27.543      0.000      0.000     51.811    202.537    187.933 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -224.951     12.839    -38.595     -9.128     13.288    -18.620 
## NewLeagueN 
##     22.326
sqrt(mean(ridge.pred - test$Salary)^2)  # absolute mean error, not RMSE
## [1] 17.53

Thus the average salary prediction error is about $17,500 (Salary is recorded in thousands of dollars). The regression coefficients do not really appear to tend toward zero, but that is because we standardized the data first.

Now we should check whether this does better than an ordinary lm() model.
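The linear-regression baseline below appears to use the same caret setup; a sketch (the exact call is an assumption):

```r
library(caret)
lmfit <- train(Salary ~ ., data = train, method = "lm",
               trControl = trainControl(method = "cv", number = 10),
               preProcess = c("center", "scale"))
lmfit.pred <- predict(lmfit, test)
```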

## Linear Regression 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ... 
## 
## Resampling results
## 
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   300   0.5       70       0.2        
## 
## 
coef(lmfit$finalModel)
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI 
##     535.958    -327.835     591.667      73.964    -169.699    -162.024 
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns 
##     234.093     -60.557     125.017    -529.709     -45.888     680.654 
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists 
##     393.276    -399.506      19.118     -46.679      -4.898      41.271 
##      Errors  NewLeagueN 
##     -22.672      22.469
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 17.62

As we can see, the ridge fit does achieve a slightly lower error and a higher cross-validated R^2 than the linear model.

The Lasso
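The lasso output below tunes a `fraction` parameter, which suggests caret's `method = "lasso"` (backed by the elasticnet package); a sketch:

```r
library(caret)
lasso <- train(Salary ~ ., data = train, method = "lasso",
               trControl = trainControl(method = "cv", number = 10),
               preProcess = c("center", "scale"))
lasso.pred <- predict(lasso, test)

# Coefficients at the CV-selected fraction
predict(lasso$finalModel, type = "coef",
        s = lasso$bestTune$fraction, mode = "fraction")
```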

## The lasso 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: scaled, centered 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ... 
## 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0.1       300   0.6       70       0.2        
##   0.5       300   0.6       60       0.2        
##   0.9       300   0.6       70       0.2        
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was fraction = 0.5.
## $s
## [1] 0.5
## 
## $fraction
##   0 
## 0.5 
## 
## $mode
## [1] "fraction"
## 
## $coefficients
##      AtBat       Hits      HmRun       Runs        RBI      Walks 
##   -227.113    406.285      0.000    -48.612    -93.740    197.472 
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI 
##    -47.952      0.000      0.000     82.291    274.745    166.617 
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors 
##   -287.549     18.059    -41.697     -7.001     30.768    -26.407 
## NewLeagueN 
##     19.190
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 14.35

In the lasso we see that several coefficients have been forced to exactly zero. Its test error here is even a bit lower than the ridge fit's, and it retains an interpretability advantage over the linear regression model.

PCR and PLS

Principal Components Regression
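The output below ("Fit method: svdpc", cross-validated with 10 random segments) is consistent with `pls::pcr` using `validation = "CV"`; a sketch, with the seed and component count assumptions:

```r
library(pls)
set.seed(2)  # assumed; the original seed is not shown

# Scale the predictors and compute 10-fold CV error for each number of components.
pcr.fit <- pcr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)

# Predict on the test set using a small number of components.
pcr.pred <- predict(pcr.fit, test, ncomp = 3)
```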

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    336.9    323.9    328.5    328.4    329.9    337.1
## adjCV        451.5    336.3    323.6    327.8    327.5    328.8    335.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       335.2    333.7    338.5     334.3     337.8     340.4     346.7
## adjCV    332.5    331.7    336.4     332.0     335.5     337.6     343.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.1     345.7     329.4     337.3     343.5     338.7
## adjCV     341.2     341.6     325.7     332.7     338.4     333.9
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         36.55    60.81    71.75    80.59    85.72    89.76    92.74
## Salary    45.62    50.01    51.19    51.98    53.23    53.36    55.63
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.37    96.49     97.45     98.09     98.73     99.21     99.52
## Salary    56.48    56.73     58.57     58.92     59.34     59.44     62.01
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.77     99.90     99.97     99.99    100.00
## Salary     62.65     65.29     66.48     66.77     67.37

The summary reports the cross-validated error as RMSEP, along with the percentage of variance explained on the training data. Plotting the MSE shows that the lowest cross-validated error is reached with only a few components. This suggests a large simplification relative to least squares, since we can capture much of the variance using just 3 components instead of all 19.

We evaluate the fit on the test data set.

sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 374.8

Note that this figure is a true RMSE (the square root of the mean squared error), so it is not directly comparable with the earlier mean-error figures for the lasso and linear regression.

## Principal Component Analysis 
## 
## 133 samples
##  19 predictors
## 
## Pre-processing: centered, scaled 
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ... 
## 
## Resampling results across tuning parameters:
## 
##   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
##   1      300   0.5       100      0.2        
##   2      300   0.5       100      0.2        
##   3      300   0.6       100      0.2        
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was ncomp = 3.

The best model selected uses 3 components.

sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 21.86

However, the PCR results are not easy to interpret.

Partial Least Squares
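Likewise, the PLS output below ("Fit method: kernelpls") is consistent with `pls::plsr`; a sketch under the same assumptions:

```r
library(pls)
set.seed(1)  # assumed

pls.fit <- plsr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)

pls.pred <- predict(pls.fit, test, ncomp = 2)  # M = 2, as chosen in the text
```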

## Data:    X dimension: 133 19 
##  Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    328.9    328.4    332.6    329.2    325.4    323.4
## adjCV        451.5    328.2    327.4    330.6    326.9    323.0    320.9
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       318.7    318.7    316.3     317.6     316.5     317.0     319.2
## adjCV    316.2    315.5    313.5     314.9     313.6     313.9     315.9
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        323.0     323.8     325.4     324.5     323.6     321.4
## adjCV     319.3     320.1     321.4     320.5     319.9     317.8
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         35.94    55.11    67.37    74.29    79.65    85.17    89.17
## Salary    51.56    54.90    57.72    59.78    61.50    62.94    63.96
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.55    93.49     95.82     97.05     97.67     98.45     98.67
## Salary    65.34    65.75     66.03     66.44     66.69     66.77     66.94
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.02     99.26     99.42     99.98    100.00
## Salary     67.02     67.11     67.24     67.26     67.37

The best M here is 2. We evaluate the corresponding test error.

sqrt(mean(pls.pred - test$Salary)^2)
## [1] 14.34

Compared with PCR, we see an improvement in the test error here.

