Contents
- 1. Purpose
- 2. Sigmoidal Function
- 2.1 S2 uses a sigmoidal function
- 2.2 Definition of a sigmoidal function
- 3. Squashing Function
- 3.1 Switching to the term "sigmoid squashing function"
- 3.2 Narrowing down to the hyperbolic tangent as the squashing function
- 4. Implementing the squashing function
- References
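1. Purpose
Pin down the formula of the squashing function in the original LeNet-5 paper, LeCun-98.pdf, and implement it in C.
The paper's wording is roundabout, but it boils down to $f(a) = 1.7159 \tanh(\frac{2}{3}a)$. In the spirit of a faithful reproduction, this post first traces step by step how the paper itself states the formula (the tedious part), then gives a C implementation (the easy part).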
2. Sigmoidal Function
2.1 S2 uses a sigmoidal function
In its description of layer S2, the LeCun-98.pdf paper [1] mentions a sigmoidal function:
Layer S2 is a sub-sampling layer with 6 feature maps of size 14x14. Each unit in each feature map is connected to a 2x2 neighborhood in the corresponding feature map in C1. The four inputs to a unit in S2 are added, then multiplied by a trainable coefficient, and added to a trainable bias. The result is passed through a sigmoidal function. The 2x2 receptive fields are non-overlapping, therefore feature maps in S2 have half the number of rows and column as feature maps in C1. Layer S2 has 12 trainable parameters and 5880 connections.
In other words, at least for layer S2, the last step of the computation applies a sigmoidal function:

$$
\begin{aligned}
\text{sum} &= x_{11} + x_{12} + x_{21} + x_{22} \\
a &= \text{sum} \times \text{coeff} + \text{bias} \\
x &= f(a), \quad f \text{ a sigmoidal function}
\end{aligned}
$$
2.2 Definition of a sigmoidal function
The paper Approximation by Superpositions of a Sigmoidal Function [2] gives a definition:
Definition. We say that $\sigma$ is sigmoidal if

$$
\sigma(t) = \begin{cases} 1 & \text{as } t \rightarrow +\infty \\ 0 & \text{as } t \rightarrow -\infty \end{cases}
$$
That is, if $\sigma(t)$ tends to 1 as $t$ tends to $+\infty$ and tends to 0 as $t$ tends to $-\infty$, then $\sigma(t)$ is called a sigmoidal function.
For example,

$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$

is one concrete sigmoidal function; its S-shaped curve rises from 0 toward 1.
3. Squashing Function
3.1 Switching to the term "sigmoid squashing function"
The paper [1] does not stick to the term sigmoidal function; it switches to sigmoid squashing function. The new term appears in the description of layer F6:
Layer F6, contains 84 units (the reason for this number comes from the design of the output layer, explained below) and is fully connected to C5. It has 10164 trainable parameters.
As in classical neural networks, units in layers up to F6 compute a dot product between their input vector and their weight vector, to which a bias is added. This weighted sum, denoted $a_i$ for unit i, is then passed through a sigmoid squashing function to produce the state of unit i, denoted by $x_i$:

$$
x_i = f(a_i)
$$
At this point, my guess is that sigmoidal function and sigmoid squashing function mean the same thing.
3.2 Narrowing down to the hyperbolic tangent as the squashing function
Reading further in the paper [1], we find:
The squashing function is a scaled hyperbolic tangent:
$$
f(a) = A \tanh(Sa)
$$
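Here the authors are quietly stretching the terminology: $\tanh(x)$ does not satisfy the definition of a sigmoidal function given in section 2.2, because its limit as $x \rightarrow -\infty$ is $-1$ rather than $0$. By definition: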
$$
\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
$$
The authors give their reasoning in Appendix A (p. 41):
For our simulation, we use $A = 1.7159$ and $S = \frac{2}{3}$. With this choice of parameters, the equalities $f(1) = 1$ and $f(-1) = -1$ are satisfied. The rationale behind this is that the overall gain of the squashing transformation is around 1 in normal operating conditions, and the interpretation of the state of the network is simplified. Moreover, the absolute value of the second derivative of $f$ is maximum at +1 and -1, which improves the convergence towards the end of the learning session. This particular choice of parameters is merely a convenience, and does not affect the result.
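In other words, the authors choose the scaled tanh, even though it breaks the strict definition of a sigmoidal function, so that $f(\pm 1) = \pm 1$ and the second derivative of $f$ reaches its largest absolute value at $a = 1$ and $a = -1$. According to the paper this improves convergence toward the end of learning, and the particular constants are a convenience that does not affect the result.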
P.S. LeCun-98b [3] gives the same formula and explanation again.
The same question has also been asked on Mathematics Stack Exchange [4]:
https://math.stackexchange.com/questions/1755711/why-is-there-a-factor-of-1-7159-with-the-tanh-function-used-in-neural-network-ac
4. Implementing the squashing function
We now know the concrete form of the squashing function [5][6]:

$$
f(a) = 1.7159 \tanh\left(\frac{2}{3}a\right)
$$
The computation of $\tanh(x)$ was covered in scratch lenet(9): C语言实现tanh的计算 [7], which approximates it with Gauss's continued fraction formula.
The resulting C implementation of the squashing function is:
```c
// compute hyperbolic tangent by Continued Fraction formula, found by Gauss in 1812.
// https://math.stackexchange.com/a/107295
// The fraction is truncated after the term 11, so it is accurate near 0
// and loses accuracy as |x| grows.
static double m_tanh(double x)
{
    double s = x * x;
    double y = 9 + s / 11;
    y = 7 + s / y;
    y = 5 + s / y;
    y = 3 + s / y;
    y = 1 + s / y;
    y = x / y;
    return y;
}

/// f(a) = A tanh(Sa)
/// c.f. LeCun-98.pdf formula(6) in page 8
double squashing(double a)
{
    const double A = 1.7159;
    const double S = 2.0 / 3.0; /// c.f. LeCun-98.pdf page 41
    double res = A * m_tanh(S * a);
    return res;
}
```
References
[1] lecun-98.pdf
[2] Approximation by Superpositions of a Sigmoidal Function
[3] lecun-98b.pdf
[4] Why is there a factor of 1.7159 with the tanh function used in neural network activation?
[5] 论文笔记:Gradient-Based Learning Applied to Document Recognition
[6] 論文筆記 - LeCun 1998 - Gradient-Based Learning Applied to Document Recognition
[7] scratch lenet(9): C语言实现tanh的计算