Optimization in deep learning is built on gradient descent: choose a suitable initial value for params, then iterate to minimize the objective function until convergence. Since the negative gradient is the direction of steepest descent, each iteration updates params along the negative gradient, driving the function value down.
Optimizer
class Optimizer:
    """
    Optimizer base class; weight_decay applies L2 regularization by default.
    """
    def __init__(self, lr, weight_decay):
        self.lr = lr
        self.weight_decay = weight_decay

    def step(self, grads, params):
        # compute the descent step for the current iteration
        decrement = self.compute_step(grads)
        if self.weight_decay:
            decrement += self.weight_decay * params
        # update the parameters in place
        params -= decrement

    def compute_step(self, grads):
        raise NotImplementedError
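To see how the base class drives a training loop, here is a quick standalone sanity check (the class definitions are repeated, and the trivial `PlainSGD` subclass is mine, so the snippet runs on its own). It minimizes $f(x) = x^2$, whose gradient is $2x$:

```python
import numpy as np

class Optimizer:
    """Optimizer base class; weight_decay applies L2 regularization."""
    def __init__(self, lr, weight_decay):
        self.lr = lr
        self.weight_decay = weight_decay

    def step(self, grads, params):
        decrement = self.compute_step(grads)
        if self.weight_decay:
            decrement += self.weight_decay * params
        params -= decrement  # in-place: params must be a mutable array

    def compute_step(self, grads):
        raise NotImplementedError

class PlainSGD(Optimizer):
    """Hypothetical minimal subclass, just for this demo."""
    def compute_step(self, grads):
        return self.lr * grads

params = np.array([5.0])           # start at x = 5
opt = PlainSGD(lr=0.1, weight_decay=0.0)
for _ in range(100):
    grads = 2 * params             # gradient of f(x) = x^2
    opt.step(grads, params)
print(params)                      # very close to 0
```

Note that `step` mutates `params` in place via `-=`, which is why the parameters are held in a NumPy array rather than a Python float.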
SGD
Stochastic gradient descent.
$$\theta_t = \theta - \eta \cdot g_t$$
- Each step draws a random batch of samples and takes a gradient-descent step on it.
- Sensitive to the learning rate: too small and convergence is very slow; too large and the iterate oscillates around the minimum.
- On non-convex functions it can get stuck in local minima or saddle points.
class SGD(Optimizer):
    """
    Stochastic gradient descent.
    """
    def __init__(self, lr=0.1, weight_decay=0.0):
        super().__init__(lr, weight_decay)

    def compute_step(self, grads):
        return self.lr * grads
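The learning-rate sensitivity above is easy to demonstrate on $f(x) = x^2$ (gradient $2x$): the update multiplies $x$ by $(1 - 2\eta)$ each step, so a learning rate above 1 makes the factor larger than 1 in magnitude and the iterate diverges. A standalone sketch:

```python
def sgd_run(lr, steps=50):
    """Run plain SGD on f(x) = x^2 (gradient 2x) and return the final |x|."""
    x = 1.0
    for _ in range(steps):
        x -= lr * 2 * x    # x <- x * (1 - 2 * lr)
    return abs(x)

print(sgd_run(0.1))   # small lr: converges toward 0
print(sgd_run(1.1))   # lr too large: |x| blows up
```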
SGDm
SGDm adds momentum to SGD, mimicking the inertia of a moving object: each update partially keeps the previous update direction, while the gradient of the current batch fine-tunes the final direction. This adds some stability, which speeds up learning, and also gives the method some ability to escape local optima.
$$\upsilon_t = \gamma \upsilon_{t-1} + g_t \qquad \theta_t = \theta_{t-1} - \eta \upsilon_t$$
- $g_t$ is the gradient at the current step; $\upsilon_t$ is the distance the parameters move at the current step.
- Like a ball rolling downhill with momentum, it may overshoot and miss the valley.
class SGDm(Optimizer):
    """
    Stochastic gradient descent with momentum.
    """
    def __init__(self, lr=0.1, momentum=0.9, weight_decay=0.0):
        super().__init__(lr, weight_decay)
        self.momentum = momentum
        self.beta = 0

    def compute_step(self, grads):
        # exponential-moving-average form of the momentum update
        self.beta = self.momentum * self.beta + (1 - self.momentum) * grads
        return self.lr * self.beta
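Note the class uses the exponential-moving-average form $(1-\gamma)g_t$ rather than adding $g_t$ directly as in the formula above; the two differ only by a constant rescaling of the effective learning rate. A standalone sanity check of this EMA variant on $f(x) = x^2$:

```python
def sgdm_run(lr=0.1, momentum=0.9, steps=200):
    """Momentum SGD (EMA form, as in the SGDm class above) on f(x) = x^2."""
    x, beta = 5.0, 0.0
    for _ in range(steps):
        grads = 2 * x
        beta = momentum * beta + (1 - momentum) * grads  # EMA of gradients
        x -= lr * beta
    return abs(x)

print(sgdm_run())  # converges near 0, with some oscillation along the way
```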
Adagrad
$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\sum_{i=0}^{t-1}(g_i)^2}} \, g_{t-1}$$
- Adapts the learning rate automatically.
- Makes larger updates for infrequently updated parameters and smaller updates for frequent ones; as a result it performs well on sparse data and improves the robustness of SGD.
- Drawback: the squared gradients keep accumulating in the denominator, so the updates eventually shrink to almost nothing.
class Adagrad(Optimizer):
    """
    Divide the learning rate of each parameter by the
    root-mean-square of its previous derivatives.
    """
    def __init__(self, lr=0.1, eps=1e-8, weight_decay=0.0):
        super().__init__(lr, weight_decay)
        self.eps = eps
        self.state_sum = 0

    def compute_step(self, grads):
        self.state_sum += grads ** 2
        decrement = grads / (self.state_sum ** 0.5 + self.eps) * self.lr
        return decrement
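The shrinking-update drawback is visible even with a constant gradient: the accumulator grows like $t \cdot g^2$, so the effective step decays like $\eta / \sqrt{t}$. A standalone sketch mirroring the `compute_step` logic above:

```python
lr, eps, state_sum = 0.1, 1e-8, 0.0
g = 1.0                   # constant gradient, for illustration
steps = []
for t in range(1, 101):
    state_sum += g ** 2   # accumulates to t * g^2
    steps.append(lr * g / (state_sum ** 0.5 + eps))
print(steps[0], steps[99])  # step 1 is ~lr = 0.1; step 100 has decayed to ~0.01
```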
RMSProp
RMSProp keeps an exponential moving average of the squared gradients; it was proposed to fix Adagrad's rapidly shrinking learning rate.
$$\upsilon_1 = g_0^2 \qquad \upsilon_t = \alpha \upsilon_{t-1} + (1-\alpha)(g_{t-1})^2$$

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\upsilon_t}} \, g_{t-1}$$
class RMSProp(Optimizer):
    """
    Root Mean Square Prop optimizer.
    """
    def __init__(self, lr=0.1, alpha=0.99, eps=1e-8, weight_decay=0.0):
        super().__init__(lr, weight_decay)
        self.eps = eps
        self.alpha = alpha
        self.state_sum = 0

    def compute_step(self, grads):
        self.state_sum = self.alpha * self.state_sum + (1 - self.alpha) * grads ** 2
        decrement = grads / (self.state_sum ** 0.5 + self.eps) * self.lr
        return decrement
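Repeating the constant-gradient experiment shows the fix: the exponential moving average saturates at $g^2$ instead of growing without bound, so the effective step stays near $\eta$ rather than decaying to zero. A standalone sketch:

```python
lr, alpha, eps, state_sum = 0.1, 0.99, 1e-8, 0.0
g = 1.0                    # constant gradient, for illustration
for t in range(1000):
    state_sum = alpha * state_sum + (1 - alpha) * g ** 2  # EMA saturates at g^2
step = lr * g / (state_sum ** 0.5 + eps)
print(step)  # stays close to lr = 0.1, unlike Adagrad's lr / sqrt(t) decay
```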
Adam
Adam combines SGDm and RMSProp: it computes first- and second-moment estimates of the gradients to maintain an independent adaptive learning rate for each parameter.
- SGDm

$$\theta_t = \theta_{t-1} - m_t \qquad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_{t-1}$$

- RMSProp

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\upsilon_t}} \, g_{t-1} \qquad \upsilon_1 = g_0^2 \qquad \upsilon_t = \beta_2 \upsilon_{t-1} + (1-\beta_2)(g_{t-1})^2$$

- Adam

$$\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\upsilon_t' + \varepsilon}} \, m_t'$$

$$m_t' = \frac{m_t}{1-\beta_1^t} \qquad \upsilon_t' = \frac{\upsilon_t}{1-\beta_2^t} \qquad \beta_1 = 0.9 \quad \beta_2 = 0.999$$
class Adam(Optimizer):
    """
    Combination of SGDm and RMSProp.
    """
    def __init__(self, lr=0.1, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
        super().__init__(lr, weight_decay)
        self.eps = eps
        self.beta1, self.beta2 = betas
        self.mt = self.vt = 0
        self._t = 0

    def compute_step(self, grads):
        self._t += 1
        self.mt = self.beta1 * self.mt + (1 - self.beta1) * grads
        self.vt = self.beta2 * self.vt + (1 - self.beta2) * (grads ** 2)
        # bias correction for the zero-initialized moment estimates
        mt = self.mt / (1 - self.beta1 ** self._t)
        vt = self.vt / (1 - self.beta2 ** self._t)
        decrement = mt / (vt ** 0.5 + self.eps) * self.lr
        return decrement
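The bias correction $m_t' = m_t / (1-\beta_1^t)$ matters most at $t = 1$, where it exactly undoes the zero initialization: $m_1' = g_0$ and $\upsilon_1' = g_0^2$, so the first update is roughly $\eta \cdot \mathrm{sign}(g_0)$ regardless of gradient scale. A standalone check of the first step (the helper function is mine, mirroring `compute_step` above at $t = 1$):

```python
def adam_first_step(g, lr=0.1, betas=(0.9, 0.999), eps=1e-8):
    """First Adam step with bias correction, following the class above."""
    beta1, beta2 = betas
    mt = (1 - beta1) * g            # first-moment EMA after one step
    vt = (1 - beta2) * g ** 2       # second-moment EMA after one step
    mt_hat = mt / (1 - beta1)       # bias-corrected: recovers g at t = 1
    vt_hat = vt / (1 - beta2)       # bias-corrected: recovers g^2 at t = 1
    return mt_hat / (vt_hat ** 0.5 + eps) * lr

print(adam_first_step(1e-3), adam_first_step(100.0))  # both ~= lr = 0.1
```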
In my own vision work I mainly use the SGDm and Adam optimizers. SGDm with weight decay works very well in my experience; after that it is a matter of tuning the learning rate and its decay schedule.
References:
torch.optim — PyTorch documentation
tinynn: A lightweight deep learning library