Table of Contents
- Preface
- Conformer encoder model architecture
- Convolution Subsampling
- Linear & Dropout
- Conformer Blocks
- Feed Forward Module
- Multi-Headed Self-Attention Module
- Convolution Module
Preface
Conformer encoder model architecture
For speech, the input to the whole architecture has shape (batch_size, time_steps, feature_dim), and the output has the same form (batch_size, time_steps, feature_dim).
- batch_size: the number of samples in each batch
- time_steps: the length of each sample's speech sequence, e.g. 128 frames
- feature_dim: the feature dimension, e.g. the number of Mel-spectrogram coefficients
Convolution Subsampling
An effective dimensionality-reduction step that shrinks both the input feature dimension and the number of time steps; the convolutional layers cut computational cost through downsampling.
It applies convolutional subsampling to the input sequence while expanding the channel dimension. Concretely, two Conv2d operations reduce the time dimension to roughly 1/4 of its original length, and the output feature dimension depends on encoder_dim.
Input: [batch_size, time_steps, input_dim]
The input is first given a channel dimension, becoming [batch_size, 1, time_steps, input_dim]. The convolutions then expand that single channel to encoder_dim while shrinking the time and feature dimensions to roughly 1/4 of their original size, i.e.
new_input_dim = ((input_dim - 1) // 2 - 1) // 2
new_time_steps = ((time_steps - 1) // 2 - 1) // 2
giving [batch_size, encoder_dim, new_time_steps, new_input_dim]. This tensor is then flattened into a 3-D tensor: [batch_size, new_time_steps, new_input_dim * encoder_dim]
e.g. with encoder_dim = 80:
[25, 128, 40] -> [25, 31, 720]
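A minimal sketch of this subsampling stage, assuming two Conv2d layers with kernel size 3 and stride 2 (which is what yields the ((n - 1) // 2 - 1) // 2 reduction above); the class name and the ReLU activations are illustrative, not taken from any particular implementation:
import torch
import torch.nn as nn
from torch import Tensor

class Conv2dSubsampling(nn.Module):
    """Two stride-2 Conv2d layers: time and feature dims shrink to ~1/4."""
    def __init__(self, encoder_dim: int = 80) -> None:
        super().__init__()
        self.sequential = nn.Sequential(
            nn.Conv2d(1, encoder_dim, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv2d(encoder_dim, encoder_dim, kernel_size=3, stride=2),
            nn.ReLU(),
        )

    def forward(self, inputs: Tensor) -> Tensor:
        # [batch_size, time_steps, input_dim] -> [batch_size, 1, time_steps, input_dim]
        outputs = self.sequential(inputs.unsqueeze(1))
        batch_size, channels, new_time_steps, new_input_dim = outputs.size()
        # flatten to [batch_size, new_time_steps, new_input_dim * encoder_dim]
        outputs = outputs.permute(0, 2, 1, 3).contiguous()
        return outputs.view(batch_size, new_time_steps, channels * new_input_dim)

x = torch.randn(25, 128, 40)
print(Conv2dSubsampling(encoder_dim=80)(x).shape)   # torch.Size([25, 31, 720])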
Linear & Dropout
- Linear
The role of the Linear layer after convolution subsampling is to map the flattened feature dimension onto encoder_dim. Although the convolution expands the number of features, we ultimately want a Linear layer to project them to encoder_dim so that the subsequent Conformer blocks can process them (see the sketch after this list).
input:[batch_size,new_time_steps,new_input_dim*encoder_dim]
output:[batch_size,new_time_steps,encoder_dim]
- Dropout
A dropout layer that randomly drops some features with a given probability during training, to prevent overfitting.
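A minimal sketch of this projection stage, reusing the shapes from the subsampling example above (variable names are illustrative):
import torch
import torch.nn as nn

encoder_dim, input_dim, dropout_p = 80, 40, 0.1
new_input_dim = ((input_dim - 1) // 2 - 1) // 2              # 9
input_projection = nn.Sequential(
    nn.Linear(encoder_dim * new_input_dim, encoder_dim),     # 720 -> 80
    nn.Dropout(p=dropout_p),
)

subsampled = torch.randn(25, 31, 720)                        # output of the subsampling stage
print(input_projection(subsampled).shape)                    # torch.Size([25, 31, 80])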
Conformer Blocks
Note that every module inside a block is wrapped in a residual connection.
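To show where those residual connections sit, here is a structural sketch of one block, with the sub-modules (described in the sections below) passed in as arguments. The half-step (0.5-weighted) residuals around the two feed-forward modules and the final LayerNorm follow the macaron structure of the Conformer paper; they are an addition to what the text above states.
import torch.nn as nn
from torch import Tensor

class ConformerBlock(nn.Module):
    """Sketch of one block: FFN -> MHSA -> Conv -> FFN, each sub-module
    wrapped in a residual connection, with a final LayerNorm."""
    def __init__(
        self,
        encoder_dim: int,
        feed_forward1: nn.Module,   # feed forward module (next section)
        attention: nn.Module,       # multi-headed self-attention module
        conv: nn.Module,            # convolution module
        feed_forward2: nn.Module,   # second feed forward module
    ) -> None:
        super().__init__()
        self.ff1, self.attn, self.conv, self.ff2 = feed_forward1, attention, conv, feed_forward2
        self.norm = nn.LayerNorm(encoder_dim)

    def forward(self, x: Tensor) -> Tensor:
        x = x + 0.5 * self.ff1(x)   # half-step residual around the first FFN
        x = x + self.attn(x)        # residual around self-attention
        x = x + self.conv(x)        # residual around the convolution module
        x = x + 0.5 * self.ff2(x)   # half-step residual around the second FFN
        return self.norm(x)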
Feed Forward Module
After applying LayerNorm to the input, the first linear layer expands the dimension by an expansion factor of 4, and the second linear layer projects it back to the model dimension. A Swish activation followed by dropout sits between the two linear layers, and a final dropout is applied at the end.
input: [batch_size,new_time_steps,encoder_dim]
output:[batch_size,new_time_steps,encoder_dim]
import torch.nn as nn
from torch import Tensor

# `Linear` and `Swish` are helper modules from the reference Conformer
# implementation; their plain PyTorch equivalents work as drop-in replacements:
Linear, Swish = nn.Linear, nn.SiLU

class FeedForwardModule(nn.Module):
    def __init__(
        self,
        encoder_dim: int = 512,
        expansion_factor: int = 4,
        dropout_p: float = 0.1,
    ) -> None:
        super(FeedForwardModule, self).__init__()
        self.sequential = nn.Sequential(
            nn.LayerNorm(encoder_dim),
            Linear(encoder_dim, encoder_dim * expansion_factor, bias=True),
            Swish(),
            nn.Dropout(p=dropout_p),
            Linear(encoder_dim * expansion_factor, encoder_dim, bias=True),
            nn.Dropout(p=dropout_p),
        )

    def forward(self, inputs: Tensor) -> Tensor:
        return self.sequential(inputs)
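A quick shape check of the module as defined above:
import torch

ffn = FeedForwardModule(encoder_dim=80)
x = torch.randn(25, 31, 80)      # [batch_size, new_time_steps, encoder_dim]
print(ffn(x).shape)              # torch.Size([25, 31, 80])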
Multi-Headed Self-Attention Module
input: [batch_size,new_time_steps,encoder_dim]
output:[batch_size,new_time_steps,encoder_dim]
A multi-head self-attention module; it uses relative positional encoding (taken from Transformer-XL).
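As a structural sketch only: the module is LayerNorm followed by multi-head self-attention and dropout. The code below substitutes PyTorch's standard nn.MultiheadAttention for the Transformer-XL-style relative-position attention the actual module uses, so it reproduces the shapes but not the relative positional encoding; the class name and num_heads are illustrative.
import torch
import torch.nn as nn
from torch import Tensor

class MHSAModuleSketch(nn.Module):
    """Shape-compatible stand-in: LayerNorm -> self-attention -> Dropout.
    The real module uses relative-position multi-head attention (Transformer-XL
    style) instead of the vanilla nn.MultiheadAttention used here."""
    def __init__(self, encoder_dim: int = 512, num_heads: int = 8, dropout_p: float = 0.1) -> None:
        super().__init__()
        self.layer_norm = nn.LayerNorm(encoder_dim)
        self.attention = nn.MultiheadAttention(encoder_dim, num_heads, batch_first=True)
        self.dropout = nn.Dropout(p=dropout_p)

    def forward(self, inputs: Tensor) -> Tensor:
        x = self.layer_norm(inputs)
        x, _ = self.attention(x, x, x, need_weights=False)
        return self.dropout(x)

x = torch.randn(25, 31, 80)
print(MHSAModuleSketch(encoder_dim=80, num_heads=4)(x).shape)   # torch.Size([25, 31, 80])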
Convolution Module
input: [batch_size,new_time_steps,encoder_dim]
output:[batch_size,new_time_steps,encoder_dim]
The convolution module contains a pointwise convolution with an expansion factor of 2, whose channels are projected back down by a GLU activation layer, followed by a 1-D depthwise convolution. The depthwise convolution is followed by BatchNorm, then a Swish activation, then another pointwise convolution, and finally dropout.
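A minimal sketch following that description. Two assumptions go beyond the text above: the Conformer paper also places a LayerNorm at the front of this module, which is included here, and the depthwise kernel size of 31 is an illustrative choice; nn.SiLU stands in for Swish.
import torch
import torch.nn as nn
from torch import Tensor

class ConformerConvModule(nn.Module):
    """Pointwise conv (2x expansion) + GLU -> depthwise conv -> BatchNorm
    -> Swish -> pointwise conv -> dropout."""
    def __init__(self, encoder_dim: int = 512, kernel_size: int = 31, dropout_p: float = 0.1) -> None:
        super().__init__()
        self.layer_norm = nn.LayerNorm(encoder_dim)
        self.sequential = nn.Sequential(
            nn.Conv1d(encoder_dim, encoder_dim * 2, kernel_size=1),          # pointwise, expansion factor 2
            nn.GLU(dim=1),                                                   # gates channels back to encoder_dim
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size,
                      padding=(kernel_size - 1) // 2, groups=encoder_dim),   # 1-D depthwise conv
            nn.BatchNorm1d(encoder_dim),
            nn.SiLU(),                                                       # Swish
            nn.Conv1d(encoder_dim, encoder_dim, kernel_size=1),              # pointwise
            nn.Dropout(p=dropout_p),
        )

    def forward(self, inputs: Tensor) -> Tensor:
        # [batch, time, dim] -> [batch, dim, time] for Conv1d, then back
        x = self.layer_norm(inputs).transpose(1, 2)
        return self.sequential(x).transpose(1, 2)

x = torch.randn(25, 31, 80)
print(ConformerConvModule(encoder_dim=80)(x).shape)   # torch.Size([25, 31, 80])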