Day 11: When the gradient is small...
data:image/s3,"s3://crabby-images/9c290/9c290eb2b0cd763c86bfc1c5cfb1e403197f73b3" alt="image-20240411165229093"
data:image/s3,"s3://crabby-images/da0f0/da0f06ceee12f436456a922ee145cf5f447d975e" alt="image-20240411170420173"
How do we tell whether a critical point is a local minimum or a saddle point?
Using math
data:image/s3,"s3://crabby-images/39cd8/39cd87a12a44eefddc231433cc8913d39ed503ca" alt="image-20240411170714889"
data:image/s3,"s3://crabby-images/21dbc/21dbcb3a458748681f8d7991e01bcb583808a751" alt="image-20240411170827423"
data:image/s3,"s3://crabby-images/676d4/676d4b12381fbac6805a2194b7c45ffe4d2d2045" alt="image-20240411183651821"
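This neatly shows that the Hessian matrix H determines the curvature of the second-order approximation of the loss around a critical point θ, and therefore whether θ is a local minimum or a local maximum. In the last case, where the quadratic term is positive in some directions and negative in others, θ is a saddle point.
So all we need to check is the definiteness of the Hessian: positive definite (all eigenvalues > 0) means a local minimum, negative definite (all eigenvalues < 0) means a local maximum, and mixed signs mean a saddle point (see linear algebra for details).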
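The eigenvalue check above can be sketched in a few lines of NumPy. This is only an illustration: the function name `classify_critical_point`, the tolerance `tol`, and the three test matrices are my own assumptions, not something from the lecture.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-8):
    """Classify a critical point (gradient = 0) from the eigenvalues of its Hessian."""
    # eigvalsh assumes a symmetric matrix, which a Hessian is
    eigenvalues = np.linalg.eigvalsh(hessian)
    if np.all(eigenvalues > tol):
        return "local minimum"   # positive definite
    if np.all(eigenvalues < -tol):
        return "local maximum"   # negative definite
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"    # eigenvalues of both signs
    return "inconclusive"        # some eigenvalue is numerically zero

# Hypothetical Hessians, chosen only to exercise each branch
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 1.0]])))
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))
print(classify_critical_point(np.array([[0.0, -2.0], [-2.0, 0.0]])))
```

When an eigenvalue is zero the second-order test says nothing, so the sketch returns "inconclusive" rather than guessing.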
Example: using the Hessian
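Odd, why aren't the nonzero entries on the main diagonal here, and are the two of them really the same? (They are: the Hessian is symmetric, because the mixed second partial derivatives of a smooth function are equal, so the two off-diagonal entries always match.) Ugh, I need to review linear algebra.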
Don't be afraid of saddle points
data:image/s3,"s3://crabby-images/36b17/36b17ddae976dfc99df695515db317b2926ce800" alt="image-20240411191229522"
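An eigenvector u and its corresponding eigenvalue λ are defined as the vector and scalar that satisfy Hu = λu.
In gradient descent we want to move θ in a direction that decreases L(θ). At a critical point the gradient term vanishes, so the change in loss is governed by the quadratic term; along an eigenvector u this term is u^T H u = λ‖u‖². If λ < 0, moving θ in the direction of u therefore decreases the loss L(θ).
In other words, once we find a negative eigenvalue λ and its eigenvector u, we can lower the loss by updating θ along u. That is what "Decrease L" means in the figure.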
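Here is a small sketch of escaping a saddle point along a negative-eigenvalue direction. The toy loss L(w1, w2) = (1 - w1·w2)², its hand-derived Hessian at the origin, and the step size are all my own assumptions for illustration.

```python
import numpy as np

def loss(theta):
    # Hypothetical toy loss: L(w1, w2) = (1 - w1 * w2)^2
    w1, w2 = theta
    return (1.0 - w1 * w2) ** 2

# The gradient of this loss is zero at the origin, so it is a critical point.
theta_saddle = np.zeros(2)

# Hessian at the origin, derived by hand for the toy loss above
H = np.array([[0.0, -2.0],
              [-2.0, 0.0]])

# eigh returns eigenvalues in ascending order, so index 0 is the most negative
eigenvalues, eigenvectors = np.linalg.eigh(H)
lam = eigenvalues[0]
u = eigenvectors[:, 0]

step = 0.1  # small step so the second-order approximation stays accurate
theta_new = theta_saddle + step * u

print("eigenvalue:", lam)
print("loss at saddle:", loss(theta_saddle))
print("loss after moving along u:", loss(theta_new))
```

In practice this is rarely done for real networks, since computing the Hessian and its eigenvectors is expensive. The point is only that a saddle point always has a direction that lowers the loss.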
data:image/s3,"s3://crabby-images/8db1e/8db1e6a2379a87d63d80bdd0a565708c1e91b4e7" alt="image-20240411192321757"
Local minima vs. saddle points
data:image/s3,"s3://crabby-images/33e35/33e359185ff4a2b17a1771032a480b23ae722fe4" alt="image-20240411193129424"
data:image/s3,"s3://crabby-images/50fd1/50fd1b3c658661577744640092ef30cb708bce1a" alt="image-20240411193238944"
Taking a high-dimensional view addresses the worry about local minima: in practice we rarely run into a true local minimum.
data:image/s3,"s3://crabby-images/ffa87/ffa879a743b26934b0042b9954280b1146af12b9" alt="image-20240411193304041"
Day 12: Tips for training: Batch and Momentum
Why do we use batches?
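This was covered earlier, so here is a quick recap.
Ask yourself one question: how many updates happen in one epoch? Every parameter is updated once per batch, so one epoch performs (number of batches) update steps, i.e. (number of batches) × (number of parameters) individual parameter updates in total.
For example, with 100 batches, every parameter has been updated 100 times by the end of one epoch.
Shuffle: after each epoch ends, the data may be split into batches again, so the batches differ from epoch to epoch.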
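The counting and the reshuffling described above can be sketched like this. The helper `iterate_minibatches`, the dataset size of 1000, and the batch size of 10 are assumptions picked so that one epoch has 100 batches, as in the example.

```python
import numpy as np

def iterate_minibatches(num_examples, batch_size, rng):
    """Yield index arrays covering the dataset once, in a freshly shuffled order."""
    order = rng.permutation(num_examples)  # new shuffle on every call, i.e. every epoch
    for start in range(0, num_examples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
num_examples, batch_size = 1000, 10  # hypothetical sizes: 1000 / 10 = 100 batches

for epoch in range(2):
    updates = 0
    for batch_indices in iterate_minibatches(num_examples, batch_size, rng):
        # A real training loop would compute the gradient on this batch
        # and update every parameter once here.
        updates += 1
    print(f"epoch {epoch}: {updates} updates per parameter")
```

If the dataset size is not a multiple of the batch size, the last batch is simply smaller, and the number of updates is the ceiling of num_examples / batch_size.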
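Small batch vs. large batch
Two extreme cases are used here, which is also a common way to learn things: push to the limit and see what happens.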
data:image/s3,"s3://crabby-images/9d33d/9d33d3cc577f10c68cf2aa328147711ddcce0146" alt="image-20240411194814890"
Without taking parallel computation (GPU) into account
data:image/s3,"s3://crabby-images/9831b/9831bf89219b795146cf86854e412d09472680d2" alt="image-20240411195047480"
data:image/s3,"s3://crabby-images/2becd/2becd7fcbdbd7090e091f9f082f337fd5bac95a9" alt="image-20240411195201626"
data:image/s3,"s3://crabby-images/d4570/d4570a4ea147983fa8cc95522dcda96401e07b87" alt="image-20240411195334111"
data:image/s3,"s3://crabby-images/12a86/12a8674217d95d68f1a303cc63e7fff69563e320" alt="image-20240411195439085"
Overfitting: compare training and testing results
data:image/s3,"s3://crabby-images/978b9/978b950517026d2bd56f9c6a24fb3cb59949cef1" alt="image-20240411195730628"
data:image/s3,"s3://crabby-images/1d47c/1d47cd391c0c97e1698de78a2e911ae760c92821" alt="image-20240411195807268"
Aspect | Small batch size (100 samples) | Large batch size (10,000 samples)
---|---|---
Speed for one update (no parallel) | Faster | Slower
Speed for one update (with parallel) | Same | Same (if not too large)
Time for one epoch | Slower | Faster
Gradient | Noisy | Stable
Optimization | Better | Worse
Generalization | Better | Worse
Batch size is a hyperparameter...
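To make the "Noisy vs. Stable" row concrete, here is a rough simulation that measures how much the mini-batch gradient fluctuates for two batch sizes. Everything in it is assumed for illustration: the synthetic 1D regression data, the weight `w = 0`, the batch sizes 10 and 1000, and the helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + noise, with a single weight w to learn
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=10_000)
w = 0.0  # current (untrained) weight

def batch_gradient(indices):
    # Gradient of the mean squared error mean((w*x - y)^2) with respect to w
    residual = w * x[indices] - y[indices]
    return np.mean(2.0 * residual * x[indices])

def gradient_std(batch_size, trials=200):
    # Standard deviation of the gradient estimate over many random batches
    grads = [
        batch_gradient(rng.choice(len(x), size=batch_size, replace=False))
        for _ in range(trials)
    ]
    return float(np.std(grads))

small = gradient_std(10)
large = gradient_std(1000)
print("gradient std, batch size 10:  ", small)
print("gradient std, batch size 1000:", large)
```

The small-batch gradients scatter much more widely around the full-data gradient. That noise is what the table credits for the better optimization and generalization behaviour of small batches.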
Momentum
Inertia
data:image/s3,"s3://crabby-images/a5b5b/a5b5b717d4a35347593d1b8fe5f2349fb955185a" alt="image-20240411200222463"
data:image/s3,"s3://crabby-images/c768f/c768fec2c41abb4d0fc7984c3cdc75eee1ddb3cc" alt="image-20240411200312757"
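Know what this reminded me of? Particle swarm optimization. If you have seen its update formula, look at the w*v_i term below. This idea of carrying over the previous movement seems to be quite common as a way of getting past local minima.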
data:image/s3,"s3://crabby-images/18bd9/18bd94cc364b18a26cef2e0c045d24320091e86b" alt="image-20240411201955070"
data:image/s3,"s3://crabby-images/93b9a/93b9af8c07a86b96bffd215ab33745e7f423dd86" alt="image-20240411202039014"
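A minimal sketch of gradient descent with momentum, assuming the usual form of the update: the movement is `momentum * previous_movement - lr * gradient`, and θ moves by that amount. The function name, the quadratic toy loss, and the hyperparameter values are my own choices, not taken from the slides.

```python
def gradient_descent_with_momentum(grad_fn, theta0, lr=0.1, momentum=0.9, steps=200):
    """Plain gradient descent plus a momentum term that carries over the last movement."""
    theta = float(theta0)
    movement = 0.0  # initial movement is zero
    for _ in range(steps):
        g = grad_fn(theta)
        # New movement = weighted previous movement minus the gradient step
        movement = momentum * movement - lr * g
        theta = theta + movement
    return theta

# Hypothetical toy loss L(theta) = (theta - 2)^2, whose gradient is 2 * (theta - 2)
theta_final = gradient_descent_with_momentum(lambda t: 2.0 * (t - 2.0), theta0=-5.0)
print("final theta:", theta_final)
```

Because the movement depends on the previous movement as well as the current gradient, the update can keep going through a point where the gradient is zero, which is the "inertia" intuition above.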