I. References

Slides: 10_Transformer_1.pdf
Video: Transformer Model (1/2): Remove the RNN, Keep Attention

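II. Attention without RNN

An attention model can see global information: each output can draw on every input position.
This section uses a Seq2Seq (encoder + decoder) model as an example to introduce the attention mechanism.

1. Definitions of Keys, Values, and Queries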

Encoder’s inputs are vectors $\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_m$ .
Decoder’s inputs are vectors $\color{red}{\mathbf{x}_1^{\prime}},\color{red}{\mathbf{x}_2^{\prime}},\cdots,\color{red}{\mathbf{x}_t^{\prime}}$ .
$\color{red}{Keys}$ and $\color {red}{Values}$ are based on encoder’s inputs $\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_m$ .
$\color {red}{Queries}$ are based on decoder’s inputs $\color{red}{\mathbf{x}_1^{\prime}},\color{red}{\mathbf{x}_2^{\prime}},\cdots,\color{red}{\mathbf{x}_t^{\prime}}$ .
$\color{red}{Keys}$: $\mathbf{k}_{:i}=\mathbf{W}_K\mathbf{x}_i$ .
$\color{red}{Values}$: $\mathbf{v}_{:i}=\mathbf{W}_V\mathbf{x}_i$ .
$\color{red}{Query}$: $\mathbf{q}_{:j}=\mathbf{W}_Q\mathbf{x}_j^{\prime}$ .

2. How the Attention Mechanism Works

2.1 Compute weights

$\alpha_{:1}=\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:1})\in\mathbb{R}^m$

$\alpha_{:2}=\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:2})\in\mathbb{R}^m$

2.2 Compute context vector

$\mathbf{c}_{:1}=\alpha_{11}\mathbf{v}_{:1}+\cdots+\alpha_{m1}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:1}$

$\mathbf{c}_{:2}=\alpha_{12}\mathbf{v}_{:1}+\cdots+\alpha_{m2}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:2}$

$\mathbf{c}_{:j}=\alpha_{1j}\mathbf{v}_{:1}+\cdots+\alpha_{mj}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:j}$

2.3 Output of attention layer

$\mathbf{C}=[\mathbf{c}_{:1},\mathbf{c}_{:2},\mathbf{c}_{:3},\cdots,\mathbf{c}_{:t}]$ .
$\mathbf{c}_{:j}=\mathbf{V}\cdot\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:j})$ .
$\mathbf{c}_{:j}$ is a function of $\mathbf{x}_j^{\prime}$ and $[\mathbf{x}_1,\cdots,\mathbf{x}_m]$ .

2.4 Attention Layer

Attention layer: $\mathbf{C}=\mathrm{Attn}(\mathbf{X},\mathbf{X}^{\prime})$ .
Encoder’s inputs: $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_m]$ .
Decoder’s inputs: $\mathbf{X}^{\prime}=[\mathbf{x}_1^{\prime},\mathbf{x}_2^{\prime},\cdots,\mathbf{x}_t^{\prime}]$ .
Parameters: $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ .
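
The layer $\mathrm{Attn}(\mathbf{X},\mathbf{X}^{\prime})$ can be sketched in plain Python as follows. This is only an illustrative sketch, not code from the slides: the helper names (`matvec`, `dot`, `softmax`, `attn`), the toy inputs, and the identity weight matrices are all assumptions made for this example, and vectors are stored as Python lists.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn(X, X_prime, W_Q, W_K, W_V):
    """Return one context vector c_{:j} per decoder input x'_j."""
    keys = [matvec(W_K, x) for x in X]        # k_{:i} = W_K x_i
    values = [matvec(W_V, x) for x in X]      # v_{:i} = W_V x_i
    dim_v = len(values[0])
    contexts = []
    for x_p in X_prime:
        q = matvec(W_Q, x_p)                  # q_{:j} = W_Q x'_j
        alpha = softmax([dot(k, q) for k in keys])   # Softmax(K^T q_{:j})
        # c_{:j} = alpha_{1j} v_{:1} + ... + alpha_{mj} v_{:m}
        contexts.append([sum(a * v[d] for a, v in zip(alpha, values))
                         for d in range(dim_v)])
    return contexts

# Toy data (hypothetical): m = 3 encoder inputs, t = 2 decoder inputs.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
X_prime = [[1.0, 0.0], [0.0, 2.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]   # identity weights keep the numbers easy to follow
C = attn(X, X_prime, I2, I2, I2)
print(len(C), len(C[0]))        # one 2-dimensional context vector per decoder input
```

Each context vector depends on a single decoder input but on all $m$ encoder inputs, which matches the statement that $\mathbf{c}_{:j}$ is a function of $\mathbf{x}_j^{\prime}$ and $[\mathbf{x}_1,\cdots,\mathbf{x}_m]$.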

2.5 Machine Translation

This section shows how the attention mechanism is applied to machine translation, translating English into German.

3. Recent Research on Attention

197x faster than standard attention! Meta introduces "Hydra", a multi-head attention mechanism
Hydra Attention: Efficient Attention with Many Heads

III. Self-Attention without RNN

The Illustrated Transformer

Attention Mechanism Explained (2): Self-Attention and Transformer

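1. Introduction

Before introducing self-attention, consider an example from language understanding:

“The animal didn’t cross the street because it was too tired.”

A human reader easily understands that "it" refers to "animal", but how can a machine learn to associate "it" with "animal"?

Self-attention was developed to meet this need: we want a structure that expresses the relationship between each word and every other word in the sentence.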

2. How the Self-Attention Mechanism Works

2.0 Definitions of Keys, Values, and Queries

The inputs are $\color{red}{\mathbf{x}_1,\mathbf{x}_2,\mathbf{x}_3,\cdots,\mathbf{x}_m}$ .
$\color{red}{Query}$: $\mathbf{q}_{:i}=\mathbf{W}_Q\mathbf{x}_i$ .
$\color{red}{Key}$: $\mathbf{k}_{:i}=\mathbf{W}_K\mathbf{x}_i$ .
$\color{red}{Value}$: $\mathbf{v}_{:i}=\mathbf{W}_V\mathbf{x}_i$ .

2.1 Compute Weights

$\alpha_{:1}=\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:1})\in\mathbb{R}^m$

$\alpha_{:2}=\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:2})\in\mathbb{R}^m$

$\alpha_{:j}=\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:j})\in\mathbb{R}^m$

2.2 Compute Context vector

$\mathbf{c}_{:1}=\alpha_{11}\mathbf{v}_{:1}+\cdots+\alpha_{m1}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:1}$

$\mathbf{c}_{:2}=\alpha_{12}\mathbf{v}_{:1}+\cdots+\alpha_{m2}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:2}$

$\mathbf{c}_{:j}=\alpha_{1j}\mathbf{v}_{:1}+\cdots+\alpha_{mj}\mathbf{v}_{:m}=\mathbf{V}\alpha_{:j}$

2.3 Output of self-attention layer

$\mathbf{c}_{:j}=\mathbf{V}\cdot\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:j})$ .
$\mathbf{c}_{:j}$ is a function of all the $m$ vectors $\mathbf{x}_1,\cdots,\mathbf{x}_m$ .

2.4 Self-Attention Layer

Self-attention layer: $\mathbf{C}=\mathrm{Attn}(\mathbf{X},\mathbf{X})$ .
Inputs: $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_m]$ .
Parameters: $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$ .

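3. An Intuitive View of Self-Attention

The self-attention mechanism was first proposed in NLP. Its core idea is to use the other words in a text to enrich the feature representation of the target word, producing a sentence representation that focuses on what matters.

In code, Q, K, and V are simply three linear transformations of X (three separate fully connected layers), giving three transformed versions of X. Which of the three outputs is called Q, K, or V is arbitrary at the outset (though once assigned, the roles must stay fixed). With Q, K, and V in hand, $\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$ is where attention is actually computed. Q, K, and V exist to introduce trainable parameters and to map X into different feature spaces. What matters in practice is therefore the parameter matrices of the three fully connected layers; Q, K, and V need no deeper intuitive interpretation than that they are linear transformations.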

4. How Self-Attention Is Computed

4.1 Main Steps

In self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input.

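To compute self-attention for the word "Thinking", the main steps are:

First, compute the dot product between the Q vector and each K vector.
Then, to keep the result from growing too large, divide by a scaling factor $\sqrt{d_{k}}$, where $d_{k}$ is the dimension of a query or key vector.
Next, apply softmax to normalize the result into a probability distribution (the attention vector). For example, the vector [0.88, 0.12] means that, to interpret "Thinking" in this sentence, we take 0.88 of the original meaning of "Thinking" and 0.12 of the original meaning of "Machines"; this weighted combination is the meaning of "Thinking" in the sentence.
Finally, multiply by the V vectors and sum to obtain the weighted representation.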

The self-attention operation can be written as:
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$
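
The four steps and the formula above can be sketched in plain Python. This is a minimal sketch under stated assumptions: matrices are lists of rows, the function names `softmax` and `attention` are chosen for this example, and the toy Q, K, V values are made up.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for matrices stored as lists of rows."""
    scale = math.sqrt(len(K[0]))          # sqrt(d_k)
    output = []
    for q in Q:
        # Steps 1 and 2: dot product with every key, then scaling.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Step 3: normalize the scores into attention weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the value rows.
        output.append([sum(w * v[d] for w, v in zip(weights, V))
                       for d in range(len(V[0]))])
    return output

# Toy matrices (hypothetical values): two tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
Z = attention(Q, K, V)
print(Z)
```

Because each row of weights sums to 1, every output row is a convex combination of the rows of V; here each token leans towards its own value vector.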

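4.2 A Worked Example

Suppose we want to translate the phrase "Thinking Machines", where the embedding vector of "Thinking" is denoted $X_1$ and the embedding vector of "Machines" is denoted $X_2$. In computer vision, "Thinking" and "Machines" can be thought of as two patches cut from an image.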

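When we process the word "Thinking", we need to compute the attention score between it and every word in the sentence. Put simply, the current word acts as a search query that is matched against the keys of all words in the sentence (including itself) to see how relevant each one is. $W^Q$ is the weight matrix applied to $X_1$, so that $q_1 = X_1 W^Q$; we use $q_1$ for the query vector of "Thinking", and $k_1$ and $k_2$ for the key vectors of "Thinking" and "Machines" respectively. Computing the attention scores for "Thinking" requires the dot products of $q_1$ with $k_1$ and $k_2$; likewise, computing them for "Machines" requires the dot products of $q_2$ with $k_1$ and $k_2$. After obtaining the dot products of $q_1$ with $k_1$ and $k_2$, we scale them and then normalize with softmax.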

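The current word's attention score with itself is usually the largest, and the other words receive scores according to how relevant they are to the current word. We then multiply these attention scores by the V vectors to obtain weighted vectors. The sum of those weighted vectors represents the meaning of each word in the context of this sentence.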
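
The scale-then-softmax step for "Thinking" can be checked numerically. The raw scores 112 and 96 and the dimension $d_k = 64$ below are assumed values, chosen because they reproduce the [0.88, 0.12] weights mentioned in section 4.1; the value vectors are made up purely for illustration.

```python
import math

d_k = 64                      # assumed query/key dimension
raw_scores = [112.0, 96.0]    # assumed dot products q1.k1 and q1.k2

# Scale by sqrt(d_k) = 8, giving [14.0, 12.0].
scaled = [s / math.sqrt(d_k) for s in raw_scores]

# Softmax over the scaled scores.
peak = max(scaled)
exps = [math.exp(s - peak) for s in scaled]
weights = [e / sum(exps) for e in exps]

# Made-up value vectors for "Thinking" and "Machines".
v1, v2 = [1.0, 0.0], [0.0, 1.0]
z1 = [weights[0] * a + weights[1] * b for a, b in zip(v1, v2)]

print([round(w, 2) for w in weights])   # [0.88, 0.12]
```

With these assumed numbers, the output `z1` for "Thinking" is 0.88 of its own value vector plus 0.12 of the value vector of "Machines".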

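4.3 The Q, K, and V Matrices

If all input vectors are stacked into a matrix $X$, the query, key, and value vectors can likewise be written in matrix form:
$Q = XW^{Q},\quad K = XW^{K},\quad V = XW^{V}$

where $W^{Q},W^{K},W^{V}$ are parameters learned during training.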

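One piece of mathematical background is needed here. If two vectors $a$ and $b$ point in the same direction, $a \cdot b = |a||b|$; if they are perpendicular, $a \cdot b = 0$; if they point in opposite directions, $a \cdot b = -|a||b|$. The dot product of two vectors can therefore serve as a measure of their similarity: the more closely their directions agree, the larger $a \cdot b$ becomes. The self-attention computation can then be summarized as:
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^{T}}{\sqrt{d_{k}}})V$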

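The expression above is the self-attention formula. The product of Q and K measures the similarity between the rows of Q and K, but this similarity is not normalized, so a softmax is applied to the result. The output of the softmax is a matrix whose entries all lie between 0 and 1 (it can be viewed as the attention score matrix, and acts like a soft mask), while V holds the features of the input after a linear transformation. Multiplying this matrix by V yields the weighted features. In summary, Q and K are introduced to produce a weight matrix with entries between 0 and 1, and V is introduced to carry the features of the input.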

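Q, K, and V all derive from the same sentence representation: Q is the matrix of target words, K the matrix of keywords, and V the original features. The computation has three steps:

Compute the dot product of Q and K to obtain the similarity matrix.
Normalize the similarity matrix with softmax to obtain the similarity weights.
Take the weighted sum of V using the similarity weights to obtain the enhanced representation Z.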

4.4 The Multi-Head Attention Unit

Multi-head attention uses several different Q, K, V representations and combines their results at the end.

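This is the basic multi-head attention unit. The encoder is built by stacking these basic units. Its K, Q, and V all come from the output of the previous encoder layer, so every position in the encoder can attend to all positions in the previous encoder layer.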
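
The idea of running several attention heads and combining their results can be sketched as follows. This is a simplified illustration, not the tensor2tensor implementation: the names `split_heads` and `toy_multihead_attention` are hypothetical, each head simply takes a slice of the feature dimension, and the learned per-head projections and the final output projection are left out.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on matrices stored as lists of rows."""
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out

def split_heads(M, num_heads):
    """Split each row of M into num_heads equal-width chunks, one matrix per head."""
    width = len(M[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in M]
            for h in range(num_heads)]

def toy_multihead_attention(Q, K, V, num_heads):
    """Run attention independently per head, then concatenate the head outputs."""
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    # Concatenate the per-head outputs row by row.
    return [sum((head[i] for head in heads), []) for i in range(len(Q))]

# Hypothetical input: two tokens with four features, split across two heads.
X4 = [[1.0, 0.0, 0.0, 1.0],
      [0.0, 1.0, 1.0, 0.0]]
out = toy_multihead_attention(X4, X4, X4, num_heads=2)
print(len(out), len(out[0]))   # same shape as the input: 2 rows, 4 features
```

Each head sees only its own slice of the features, so different heads can assign different attention weights to the same pair of tokens.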

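The decoder differs from the encoder in two ways. The first sub-layer is a masked multi-head attention, and the second multi-head attention sub-layer receives not only the output of the preceding sub-layer but also the output of the encoder.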

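In the decoder's first sub-layer, the key, query, and value all come from the output of the previous decoder layer, but a mask is added so that each position can attend only to output words that have already been translated. During translation the next output word is not yet known; it is predicted only afterwards.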
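
One common way to realize such a mask is to set the scores of future positions to negative infinity before the softmax, so that their weights become exactly zero. The sketch below assumes this approach; the function name `causal_softmax` and the all-zero score matrix are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def causal_softmax(scores):
    """Row-wise softmax in which position j may attend only to positions i <= j."""
    weights = []
    for j, row in enumerate(scores):
        # Hide every future position by replacing its score with -inf.
        masked = [s if i <= j else NEG_INF for i, s in enumerate(row)]
        peak = max(masked)                            # finite: position j itself is visible
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a three-token output sequence.
scores = [[0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0]]
W = causal_softmax(scores)
for row in W:
    print(row)
```

With equal raw scores, the first position attends only to itself, the second splits its weight over the first two positions, and the third spreads it over all three.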

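The decoder's second sub-layer is also called the encoder-decoder attention layer. Its Q comes from the output of the preceding decoder sub-layer, while its key and value come from the output of the encoder, which lets every position in the decoder attend to every position in the input sequence.

In summary, the key and value always share the same source. In the encoder and in the decoder's first sub-layer, the query shares that source too; in the encoder-decoder attention layer, the query comes from a different source than the key and value.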

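5. Limitations of Self-Attention

In a self-attention model the input is a whole row of tokens. For a human it is easy to recognize positional information about the tokens, for example:

Absolute position: a1 is the first token, a2 is the second token, and so on.
Relative position: a2 comes one position after a1, a4 comes two positions after a2, and so on.
Distance between positions: a1 and a3 are two positions apart, a1 and a4 are three positions apart, and so on.

Self-attention cannot distinguish any of this, because its computation has no notion of order: shuffling the input tokens simply shuffles the outputs in the same way.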
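
This order-blindness can be demonstrated with a small experiment: reorder the tokens and check that each token still receives the same output vector. The sketch below is illustrative only; it assumes identity matrices for $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$ to stay short, and the token vectors are made up.

```python
import math

def self_attention(X):
    """Self-attention with identity W_Q, W_K, W_V (an assumption for brevity)."""
    d_k = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[d] for e, v in zip(exps, X))
                    for d in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]   # hypothetical tokens a1, a2, a3
perm = [2, 0, 1]                            # new order: a3, a1, a2
X_shuffled = [X[i] for i in perm]

Z = self_attention(X)
Z_shuffled = self_attention(X_shuffled)

# Position i of the shuffled sequence holds token perm[i]; compare its output
# with the output that token received in the original order.
same = all(math.isclose(a, b, abs_tol=1e-12)
           for i, p in enumerate(perm)
           for a, b in zip(Z_shuffled[i], Z[p]))
print(same)
```

The outputs match up to floating-point rounding, so the layer by itself carries no information about where a token sits in the sequence.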

6. Attention vs. Self-Attention

6.1 Attention layer

Attention layer: $\mathbf{C}=\mathrm{Attn}(\mathbf{X},\mathbf{X}^{\prime})$ .
Query: $\mathbf{q}_{:j}=\mathbf{W}_Q\mathbf{x}_j^{\prime}$ .
Key: $\mathbf{k}_{:i}=\mathbf{W}_K\mathbf{x}_i$ .
Value: $\mathbf{v}_{:i}=\mathbf{W}_V\mathbf{x}_i$ .
Output: $\mathbf{c}_{:j}=\mathbf{V}\cdot\mathrm{Softmax}(\mathbf{K}^T\mathbf{q}_{:j})$ .

6.2 Self-Attention Layer

Attention layer: $\mathbf{C}=\mathrm{Attn}(\mathbf{X},\mathbf{X}^{\prime})$ .
Self-Attention layer: $\mathbf{C}=\mathrm{Attn}(\mathbf{X},\mathbf{X})$ .

7. Self-Attention Code Implementation

Only the core code is analyzed here; for the full code, see tensor2tensor/layers/common_attention.py

`multihead_attention()`

def multihead_attention(query_antecedent,
                        memory_antecedent,
                        ...):
    """Multihead scaled-dot-product attention with input/output transformations.
    Args:
      query_antecedent: a Tensor with shape [batch, length_q, channels]
      memory_antecedent: a Tensor with shape [batch, length_m, channels] or None
      ...
    Returns:
      The result of the attention transformation. The output shape is
          [batch_size, length_q, hidden_dim]
    """
    # Compute the q, k, v matrices.
    q, k, v = compute_qkv(query_antecedent, memory_antecedent, ...)
    # Compute dot-product attention.
    x = dot_product_attention(q, k, v, ...)
    # Apply the output transformation (a dense layer).
    x = common_layers.dense(x, ...)
    return x

`compute_qkv()`

def compute_qkv(query_antecedent,
                memory_antecedent,
                ...):
    """Computes query, key and value.
    Args:
      query_antecedent: a Tensor with shape [batch, length_q, channels]
      memory_antecedent: a Tensor with shape [batch, length_m, channels]
      ...
    Returns:
      q, k, v : [batch, length, depth] tensors
    """
    # Note: if memory_antecedent is None, it is set to query_antecedent. The
    # encoder's self-attention passes None for memory_antecedent.
    if memory_antecedent is None:
        memory_antecedent = query_antecedent
    q = compute_attention_component(
        query_antecedent,
        ...)
    # Note: k and v both come from memory_antecedent.
    k = compute_attention_component(
        memory_antecedent,
        ...)
    v = compute_attention_component(
        memory_antecedent,
        ...)
    return q, k, v

def compute_attention_component(antecedent,
                                ...):
    """Computes attention component (query, key or value).
    Args:
      antecedent: a Tensor with shape [batch, length, channels]
      name: a string specifying scope name.
      ...
    Returns:
      c : [batch, length, depth] tensor
    """
    return common_layers.dense(antecedent, ...)

`dot_product_attention()`

def dot_product_attention(q,
                          k,
                          v,
                          ...):
    """Dot-product attention.
    Args:
      q: Tensor with shape [..., length_q, depth_k].
      k: Tensor with shape [..., length_kv, depth_k]. Leading dimensions must
        match with q.
      v: Tensor with shape [..., length_kv, depth_v] Leading dimensions must
        match with q.
    Returns:
      Tensor with shape [..., length_q, depth_v].
    """
    # Compute the matrix product of Q and K^T (no 1/sqrt(d_k) scaling appears
    # in this excerpt).
    logits = tf.matmul(q, k, transpose_b=True)
    # Normalize the result with softmax.
    weights = tf.nn.softmax(logits, name="attention_weights")
    # Multiply by V to obtain the weighted representation.
    return tf.matmul(weights, v)

`transformer_encoder()`

def transformer_encoder(encoder_input,
                        hparams,
                        ...):
    """A stack of transformer layers.
    Args:
      encoder_input: a Tensor
      hparams: hyperparameters for model
      ...
    Returns:
      y: a Tensor
    """
    x = encoder_input
    with tf.variable_scope(name):
        for layer in range(hparams.num_encoder_layers or hparams.num_hidden_layers):
            with tf.variable_scope("layer_%d" % layer):
                with tf.variable_scope("self_attention"):
                    # layer_preprocess and layer_postprocess wrap operations such as
                    # layer normalization, residual connections, and dropout.
                    y = common_attention.multihead_attention(
                        common_layers.layer_preprocess(x, hparams),
                        # Note: the encoder passes None for memory_antecedent.
                        None,
                        ...)
                    x = common_layers.layer_postprocess(x, y, hparams)
                with tf.variable_scope("ffn"):
                    # Feed-forward network sub-layer.
                    y = transformer_ffn_layer(
                        common_layers.layer_preprocess(x, hparams),
                        hparams,
                        ...)
                    x = common_layers.layer_postprocess(x, y, hparams)
        # Return only after every layer has been applied.
        return common_layers.layer_preprocess(x, hparams)

`transformer_decoder()`

def transformer_decoder(decoder_input,
                        encoder_output,
                        hparams,
                        ...):
    """A stack of transformer layers.
    Args:
      decoder_input: a Tensor
      encoder_output: a Tensor
      hparams: hyperparameters for model
      ...
    Returns:
      y: a Tensor
    """
    x = decoder_input
    with tf.variable_scope(name):
        for layer in range(hparams.num_decoder_layers or hparams.num_hidden_layers):
            layer_name = "layer_%d" % layer
            with tf.variable_scope(layer_name):
                with tf.variable_scope("self_attention"):
                    # First decoder sub-layer: memory_antecedent is None.
                    y = common_attention.multihead_attention(
                        common_layers.layer_preprocess(x, hparams),
                        None,
                        ...)
                    x = common_layers.layer_postprocess(x, y, hparams)
                if encoder_output is not None:
                    with tf.variable_scope("encdec_attention"):
                        # Second decoder sub-layer: memory_antecedent is encoder_output.
                        y = common_attention.multihead_attention(
                            common_layers.layer_preprocess(x, hparams),
                            encoder_output,
                            ...)
                        x = common_layers.layer_postprocess(x, y, hparams)
                with tf.variable_scope("ffn"):
                    # Feed-forward network sub-layer.
                    y = transformer_ffn_layer(
                        common_layers.layer_preprocess(x, hparams),
                        hparams,
                        ...)
                    x = common_layers.layer_postprocess(x, y, hparams)
        # Return only after every layer has been applied.
        return common_layers.layer_preprocess(x, hparams)