Table of Contents
- 1. Introduction to Audio Signal Processing
- 2. Audio Signal Preprocessing
- 3. Features
- 4. Feature Quantization
- 5. Speech Recognition
- 6. AdaBoost
- 7. Face Recognition
- 8. Neural Networks
- 9. Convolutional Neural Networks
- 10. Auto-Encoder
- 11. Recurrent Neural Networks and LSTM
- 12. Word Representation
- 13. Decision Trees
1. Introduction to Audio Signal Processing
- x kHz, y bit, n s is how many bytes: $x \cdot 1000 \cdot y / 8 \cdot n$ bytes
- (Analog-to-digital) sampling rate 20 kHz: the highest recoverable sound frequency is 20/2 kHz
- Sound frequency 20 kHz: the (analog-to-digital) sampling rate must be 20*2 kHz
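The byte-count rule above can be checked with a short Python sketch (mono, uncompressed PCM assumed; the function name is illustrative):

```python
def audio_bytes(sample_rate_khz, bit_depth, seconds):
    """Raw mono audio size in bytes: x * 1000 * y / 8 * n."""
    return sample_rate_khz * 1000 * bit_depth // 8 * seconds

# 20 kHz, 16-bit, 3 s -> 20*1000*16/8*3 = 120000 bytes
print(audio_bytes(20, 16, 3))  # 120000
```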
2. Audio Signal Preprocessing
- frame blocking
  - frame size (ms) + overlapping rate → samples per frame = sample rate (Hz) * frame size (ms) / 1000
- windowing (window size = N), here the Hamming window:
  - $\tilde{s}(k) = s(k)\cdot W(k)$
  - $W(k) = 0.54-0.46\cos\left(\frac{2\pi k}{N-1}\right),\quad 0\le k\le N-1$
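The windowing step, sketched in plain Python (the names `hamming` and `apply_window` are illustrative):

```python
import math

def hamming(N):
    """W(k) = 0.54 - 0.46*cos(2*pi*k/(N-1)) for 0 <= k <= N-1."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (N - 1)) for k in range(N)]

def apply_window(s):
    """Element-wise product s~(k) = s(k) * W(k)."""
    return [sk * wk for sk, wk in zip(s, hamming(len(s)))]
```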
- Fourier Transform
  - $X_m = \sum_{k=0}^{N-1}s(k)\cdot e^{-2\pi i km/N} = \sum_{k=0}^{N-1}s(k)\cdot\left(\cos\left(\frac{2\pi km}{N}\right)-i\sin\left(\frac{2\pi km}{N}\right)\right)$, by Euler's formula $e^{i\theta} = \cos(\theta)+i\sin(\theta)$
  - $energy = (X_m.real)^2+(X_m.imaginary)^2$
  - $magnitude = \sqrt{energy}$
- Inverse Fourier Transform
  - $s(k) = \frac{1}{N}\sum_{m=0}^{N-1}X_m\cdot e^{2\pi i km/N}$
- Computing the start and end positions of the 7th frame, with frame size N = 256 and frame shift m = 243:
  - q = 7: start = 243*6 = 1458
  - q = 7: end = 243*6 + 256 - 1 = 1713
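The frame-position arithmetic above can be sketched as (1-based frame index q, 0-based sample indices):

```python
def frame_bounds(q, frame_size=256, frame_shift=243):
    """Inclusive start/end sample indices of the q-th frame (q starts at 1)."""
    start = frame_shift * (q - 1)
    end = start + frame_size - 1
    return start, end

print(frame_bounds(7))  # (1458, 1713)
```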
3. Features
- Mel Scale
  - $m = 2595\log_{10}\left(1+\frac{f}{700}\right)$
  - $\Delta m = 2595\log_{10}\left(\frac{700+f_1}{700+f_2}\right)$
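Both Mel formulas as a quick sketch (function names are illustrative):

```python
import math

def hz_to_mel(f):
    """m = 2595 * log10(1 + f/700)."""
    return 2595 * math.log10(1 + f / 700)

def mel_diff(f1, f2):
    """Delta m = 2595 * log10((700 + f1) / (700 + f2))."""
    return 2595 * math.log10((700 + f1) / (700 + f2))
```

Note that `mel_diff(f1, f2)` equals `hz_to_mel(f1) - hz_to_mel(f2)`, since the log of a quotient is the difference of the logs.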
- LPC (Linear Predictive Coding) filter
  - Pipeline: windowing → pre-emphasis → autocorrelation → LPC → cepstral coefficients
  - pre-emphasis
    - $s^\prime(k) = s(k)-\tilde{a}\cdot s(k-1)$, with $s^\prime(0)=s^\prime(1)$
    - $\tilde{a}$ is a given constant
- LPC of order $p$
  - First compute $r_0$ through $r_p$ (the autocorrelation values):
    - $r_i = \sum_{n=0}^{N-1-i} s_n\cdot s_{n+i}$
  - Example with $p = 4$, $N = 6$ (each pair of digits $n\,n{+}i$ stands for one product $s_n\cdot s_{n+i}$):
    - $r_0$: 00 11 22 33 44 55
    - $r_1$: 01 12 23 34 45
    - $r_2$: 02 13 24 35
    - $r_3$: 03 14 25
    - $r_4$: 04 15
  - Build the matrix and vector, then solve for $a_1$ through $a_p$:
    - $\left[\begin{array}{ccccc} r_0 & r_1 & r_2 & \cdots & r_{p-1} \\ r_1 & r_0 & r_1 & \cdots & r_{p-2} \\ r_2 & r_1 & r_0 & \cdots & r_{p-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ r_{p-1} & r_{p-2} & r_{p-3} & \cdots & r_0 \end{array}\right]\left[\begin{array}{c} a_1 \\ a_2 \\ a_3 \\ \vdots \\ a_p \end{array}\right]=\left[\begin{array}{c} r_1 \\ r_2 \\ r_3 \\ \vdots \\ r_p \end{array}\right]$
    - $a = A^{-1}b$
    - To invert by hand (Gauss-Jordan): append the identity matrix on the right, row-reduce the left side to the identity, and the right side becomes $A^{-1}$
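The autocorrelation system above can be solved numerically; a minimal NumPy sketch using a general-purpose solve (the Levinson-Durbin recursion is the usual faster choice for this Toeplitz system, but a plain solve shows the structure):

```python
import numpy as np

def lpc_coefficients(r):
    """Solve R a = b for the LPC coefficients a_1 .. a_p.

    r: autocorrelation values r_0 .. r_p (length p + 1).
    """
    p = len(r) - 1
    # Toeplitz matrix A[i][j] = r_|i-j|, right-hand side b[i] = r_{i+1}
    A = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)], dtype=float)
    b = np.array(r[1:], dtype=float)
    return np.linalg.solve(A, b)
```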
- Cepstrum
  - s'(k) = window(s(k))
  - |X(m)| = dft(s'(k))
  - Log(|X(m)|)
  - C(n) = idft(Log(|X(m)|))
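The four cepstrum steps, read literally as a NumPy sketch (real cepstrum; assumes the input frame is already windowed and its spectrum has no zero bins, otherwise the log diverges):

```python
import numpy as np

def cepstrum(s_windowed):
    """C(n) = idft(log(|dft(s'(k))|)) for an already-windowed frame."""
    spectrum = np.abs(np.fft.fft(s_windowed))  # |X(m)|
    return np.fft.ifft(np.log(spectrum)).real  # C(n)
```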
4. Feature Quantization
- Vector Quantization
  - An order-p LPC vector has p dimensions. If the vectors are separable, the mean (centroid) of all LPC vectors with the same pronunciation can represent that pronunciation.
  - Standard K-means
  - Binary-split K-means
- 10 kHz, 8-bit, frame size = 25 ms with no overlapping, LPC order 10; find the compression ratio:
  - raw frame: 10*1000 samples/s * 25/1000 s * 8/8 bytes/sample = 250 bytes
  - LPC frame: one float is 4 bytes, so 10*4 = 40 bytes
  - ratio = 250/40 = 6.25
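The compression-ratio computation as a sketch (no overlapping, 4-byte floats assumed):

```python
def compression_ratio(rate_hz, bit_depth, frame_ms, lpc_order, float_bytes=4):
    """Raw frame bytes divided by LPC frame bytes."""
    raw = rate_hz * frame_ms // 1000 * bit_depth // 8
    lpc = lpc_order * float_bytes
    return raw / lpc

print(compression_ratio(10_000, 8, 25, 10))  # 6.25
```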
5. Speech Recognition
- Pipeline: end-point detection → pre-emphasis → frame blocking and windowing → LPC/MFCC → distortion
- end-point detection
  - a frame is speech when its energy and zero-crossing rate exceed their thresholds
- frame blocking and windowing
  - yields two sequences of feature vectors, one per utterance
  - distance between two vectors: sum the squared element-wise differences, no square root, e.g. [1,2,3] vs [2,3,4] gives 1+1+1 = 3
- distortion (work through it step by step)
  - comparing two utterances yields a single value
  - the pairwise values among n utterances form a confusion matrix
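The no-square-root frame distance from above, as a sketch:

```python
def frame_distance(u, v):
    """Sum of squared element-wise differences, without the square root."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

print(frame_distance([1, 2, 3], [2, 3, 4]))  # 3
```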
6. AdaBoost
7. Face Recognition

| Attribute | Calculation |
|---|---|
| Accuracy | $\frac{TP+TN}{TP+TN+FP+FN}$ |
| Precision | $\frac{TP}{TP+FP}$ |
| Recall | $\frac{TP}{TP+FN}$ |
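The three formulas in the table, sketched in Python (assumes at least one predicted positive and one actual positive so no division by zero):

```python
def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall
```

For example, `metrics(8, 80, 2, 10)` gives accuracy 0.88 and precision 0.8.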
8. Neural Networks
- Forward propagation
- Backpropagation
  - $f(u_i) = x_i$: the activation function maps the net input $u_i$ to the unit's output $x_i$
  - Between the hidden layer and the output layer:
    - $\Delta w_{j,i}=-\eta \frac{\partial\varepsilon }{\partial w_{j,i}}= -\eta\left[(x_i-t_i)\cdot f(u_i)\left(1-f(u_i)\right)\right]\cdot x_j$
    - $t_i$ is the correct (target) value supplied for training
  - Between two hidden layers:
    - $\Delta w_{k,j} = -\eta \frac{\partial\varepsilon }{\partial w_{k,j}}= -\eta \left(\sum_{i=0}^{I}s_i\cdot w_{j,i}\right)\cdot \left[f(u_j)\cdot \left(1-f(u_j)\right)\right]\cdot x_k$
    - $s_i$ is the error term of the downstream unit $i$ propagated back through $w_{j,i}$; for an output unit it is the bracketed factor in the previous rule, $s_i = (x_i-t_i)\cdot f(u_i)\left(1-f(u_i)\right)$
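The hidden-to-output weight update, sketched with a sigmoid activation (the sigmoid is an assumption here; it is the activation whose derivative is $f(u)(1-f(u))$, which matches the update rule above):

```python
import math

def sigmoid(u):
    return 1 / (1 + math.exp(-u))

def delta_w_output(eta, t_i, u_i, x_j):
    """Delta w_{j,i} = -eta * (x_i - t_i) * f(u_i)(1 - f(u_i)) * x_j, with x_i = f(u_i)."""
    x_i = sigmoid(u_i)
    s_i = (x_i - t_i) * x_i * (1 - x_i)  # error term of output unit i
    return -eta * s_i * x_j
```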
9. Convolutional Neural Networks
- Convolution
  - each convolution kernel has one bias
  - feature map size: $(N-m+2p)/s + 1$, for input size $N$, kernel size $m$, padding $p$, stride $s$
- Pooling (subsampling)
  - no bias
10. Auto-Encoder
- In both the traditional and the newer Auto-Encoder, the input and output have the same dimensionality
- The exam covered conversion between two distributions; review this carefully!
11. Recurrent Neural Networks and LSTM
- RNN
  - One hidden unit: $\tanh(W_{hx}(1,:)\cdot X_t + W_{hh}(1,:)\cdot h_t + bias(1)) = h_{t+1}(1)$
  - In matrix form: $\tanh(W_{hx}\cdot X_t + W_{hh}\cdot h_t + bias) = h_{t+1}$
  - $y\_out = W_{hy}\cdot h_t$
  - $softmax\_y\_out = \mathrm{Softmax}(y\_out)$
- LSTM
  - Counting the weights
    - cells = m, input = n, output = y, hidden layer number = l
    - $W = 4m(m+n) + 4m(l-1)(m+m) + ym$
    - $B = 4lm$
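The weight and bias counts from the formula above, as a sketch (assumes 4 gates per cell and stacked layers of equal size m, as the formula implies):

```python
def lstm_weight_count(m, n, y, l):
    """W = 4m(m+n) + 4m(l-1)(m+m) + y*m."""
    first_layer = 4 * m * (m + n)              # 4 gates; input n plus recurrent state m
    deeper_layers = 4 * m * (l - 1) * (m + m)  # each later layer: m-dim input plus m-dim state
    output = y * m                             # output projection
    return first_layer + deeper_layers + output

def lstm_bias_count(m, l):
    """B = 4 * l * m: 4 gate biases per cell, m cells, l layers."""
    return 4 * l * m
```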
12. Word Representation
- BOW (Bag Of Words) + cosine similarity
- TF-IDF + cosine similarity
  - the product of the term's frequency within the sentence and the log of the inverse fraction of sentences it appears in
- cosine similarity
  - the dot product of the two vectors divided by the product of their lengths
- Word2Vec
  - N-gram, skip-gram
  - 3-skip-2-gram
    - includes the plain 2-grams
    - includes the 1-skip and 2-skip (and 3-skip) 2-grams
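Cosine similarity as described (dot product over the product of vector lengths), sketched in plain Python:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0, 1], [1, 1, 1]))  # 2/sqrt(6), about 0.8165
```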
13. Decision Trees
- Gini index
  - $Gini_{index} = 1-\sum_i p_i^2$
  - $GINI_{split} = \frac{|S_1|}{|S|}GINI(S_1)+\frac{|S_2|}{|S|}GINI(S_2)$
- Entropy
  - $Entropy = -\sum_i p_i\log_2(p_i)$
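Both impurity measures as a sketch (inputs are class probability lists summing to 1):

```python
import math

def gini(probabilities):
    """Gini index: 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in probabilities)

def entropy(probabilities):
    """Entropy: -sum(p_i * log2(p_i)); classes with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0
```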