批次与动量

文章目录

批次与动量
- 1. Small batch or Large batch?
- 2. Gradient descent + Momentum
- 3. 总结

1. Small batch or Large batch?

在使用 gradient descent 进行 optimization 时，在每一次 epoch 前，要 shuffle 所有的资料，然后再分成一个个 batch，每次更新参数是用 $\theta$ 对当前 batch 的 Loss 进行微分。
在这里插入图片描述
那么 batch size 是 small 一点好呢？还是 large 一点好呢？
我们在这里取两个比较极端的例子，batch size 最小是 1，最大是 N

当 batch size = 20 时，在看完 20 笔训练资料后才能 update，每一个 epoch 要 update 一次；
当 batch size = 1 时，在看完 1 笔训练资料后就 update，每一个 epoch 要有 20 次 update。

我们会想当然地以为使用 small batch 比 large batch 所需的时间更少。
在这里插入图片描述
其实不然，现在的 GPU 都有并行计算的能力，只要 batch size 不是太大，small batch 和 batch size 完成一次 update 所需要用的时间是差不多的。
例如，在 Tesla V100 GPU 这张显卡上训练 MNIST 数据集，完成 0 - 9 的数字分类任务，从下图中可以看出，batch size = 1 和 batch size = 1000 时每次更新参数所需的时间是差不多的，只有 batch size = 60000 时，数据量的确太大，大概需要 10 s.
在这里插入图片描述
刚刚比较的是使用 small batch 和 large batch 各完成一次 update 所需的时间。
如右下图所示，虽然 small batch 和 large batch(not too large) 各完成一次 update 所需的时间差不多，但是在一个 epoch 内，使用 small batch 训练模型比使用 large batch 训练模型所需要的时间要多。
在这里插入图片描述
因此一开始，我们想当然地认为使用 small batch 所需的时间更少，其实这是错误的。

比较完了训练时间后，再来比较性能。
当 batch size = 1 时，取得的效果比较好，在 validation set 上的 accuracy 要比在 training set 上的 accuracy 要高；而当 batch size = 60000 时，表现不是很好，无论是在 MNIST 数据集还是在 CIFAR-10 数据集上的 accuracy 都很低，说明 optimization 没有做好。
在这里插入图片描述
为什么 small batch size 会有着更好的表现呢？
还是以 full batch 和 small batch 作对比，使用 full batch size 训练的时候，在参数更新过程中，一旦遇到了 saddle point 或者 local minima 便会使得 gradient 为 0，模型训练也就早早地停了下来；
相较于 full batch size，将训练资料分为一个个 small batch，每一次都是在不同 batch 的训练资料上取得的 Loss 作 gradient descent 的，如果在这个 batch 上遇到了 critical point，那么在下个 batch 上说不定就没有 critical point，这样便能使得小球在 error surface 上滚下去得更远，optimization 做得更好，Loss 也能便得更小。
在这里插入图片描述
那现在再将两者在 testing set 上进行性能比较，发现使用 small batch 比使用 large batch 取得的效果更好一些。

在 error surface 中会有各种各样的 local minima，有类似平原的 minima，也有类似峡谷的 minima，假设训练过程中出现了 mismatch 现象，在平原上 training loss 和 testing loss 只差一点，而在峡谷中 training loss 和 testing loss 会差很多。
而使用 large batch 训练，会更倾向地遇到峡谷；使用 small batch 训练，会更倾向地遇到平原，所以使用 small batch 性能会更好一点。
在这里插入图片描述
因此，综合各项评判指标来看，使用 small batch 训练，虽然时间更长，但能把 optimization 做好，且泛化性能也比较好。