[Archaeology Series] Attention Is All You Need

news2025/1/16 20:08:46

Transformer

Table of Contents

  • Transformer
  • 1. What
  • 2. Why
  • 3. How
    • 3.1 Encoder
    • 3.2 Decoder
    • 3.3 Attention
    • 3.4 Application
    • 3.5 Position-wise Feed-Forward Networks (The second sublayer)
    • 3.6 Embeddings and Softmax
    • 3.7 Positional Encoding
    • 3.8 Why Self-Attention

1. What

The Transformer is a new, simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

2. Why

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder.

Recurrent neural networks have been state of the art in most sequence modeling tasks, but their inherently sequential computation precludes parallelization within a training example, and memory constraints limit batching across examples.

The goal of reducing sequential computation also motivates the use of convolutional neural networks. However, it is hard for them to handle dependencies between distant positions, because a single convolution sees only a small window of the input (like a small patch of a whole image). Relating two far-apart positions requires stacking several convolutions. In the Transformer, this is reduced to a constant number of operations, because attention sees the whole sequence at once.

Meanwhile, similar to the way a CNN uses many convolution kernels to produce multiple output channels, the Transformer introduces Multi-Head Attention to recover this ability.

3. How

[Figure: the Transformer model architecture]

3.1 Encoder

The encoder is the block on the left: a stack of 6 identical layers, each with two sub-layers. With a residual connection around each sub-layer, the output of each sub-layer is:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

Each sub-layer is followed by layer normalization, which we now look at in detail.

First, compare batch normalization and layer normalization, shown below as the blue and yellow squares.

[Figure: batch normalization (blue) and layer normalization (yellow) on 2D data]

For 2D data, the input can be represented as a feature $\times$ batch matrix. Batch normalization normalizes each feature across the samples in a batch. Layer normalization is equivalent to batch normalization on the transposed data: it normalizes each sample across its features.
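The difference between the two can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: it assumes rows are samples and columns are features, and it omits the learnable gain and bias as well as the running statistics that real batch normalization keeps. The function names and the `eps` value are illustrative choices.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x has shape (batch, feature): normalize each feature (column)
    # using statistics computed across the batch.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layer_norm(x, eps=1e-5):
    # Normalize each sample (row) using statistics computed
    # across its own features.
    mean = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])
bn = batch_norm(x)   # every column now has mean ~0
ln = layer_norm(x)   # every row now has mean ~0
```

Applying `batch_norm` to `x.T` and transposing the result back gives the same values as `layer_norm(x)`, which is the "transposition" relationship described above.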

For 3D data, every sentence is a sequence and each word is a vector, so we can visualize it as below:

[Figure: batch normalization (blue) and layer normalization (yellow) on 3D data]

The blue and yellow squares represent batch normalization and layer normalization on 3D data. If the sequence length differs among sentences, the two normalizations cover different regions:

[Figure: batch normalization and layer normalization with varying sequence lengths]

Batch normalization computes its statistics over all of the data, so if a new sample has an extreme length, the statistics estimated beforehand will be inaccurate for it. Layer normalization, in contrast, computes statistics only within each sample's own sequence and is not affected by the global data, so the Transformer uses layer normalization.

3.2 Decoder

In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. The decoder's self-attention sub-layer is also masked to prevent positions from attending to subsequent positions.

3.3 Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key. That is:
[Figure: attention output as a weighted sum of the values]

The key and value are paired. The weight for each value depends on the compatibility between the query and key.

Mathematically,

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V.$$

where $Q$, $K$, and $V$ are matrices that pack the queries, keys, and values together, and $\text{softmax}$ turns the scores into relative weights. Dividing by $\sqrt{d_k}$ keeps the dot products from growing too large in magnitude when $d_k$ is large, which would otherwise push the softmax into regions with extremely small gradients.

The matrix multiplication can be represented as:

[Figure: the matrix multiplications in scaled dot-product attention]
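The formula can be written out as a minimal NumPy sketch. The function names, the random inputs, and the sizes (3 positions, $d_k = 4$, $d_v = 2$) are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # output is a weighted sum of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries of dimension d_k = 4
K = rng.normal(size=(3, 4))   # 3 keys of dimension d_k = 4
V = rng.normal(size=(3, 2))   # 3 values of dimension d_v = 2
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each row of `weights` holds the relative weights one query assigns to the values, and `out` has one row per query with the dimension of the values.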

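Masks are also used in this block. For the query at position $t$, the mask sets the attention scores of all positions after $t$ (those paired with the values after $v_t$) to a large negative number, so their weights are close to zero after the softmax.

[Figure: masking the attention scores before the softmax]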
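A small sketch of the masking step, assuming a square score matrix where row $i$ is the query at position $i$ and column $j$ is the key at position $j$. The helper names and the constant `-1e9` standing in for "a large negative number" are illustrative choices.

```python
import numpy as np

def causal_mask(n):
    # True wherever the key position j comes after the query position i.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask, neg=-1e9):
    # Replace masked scores with a large negative number, then apply softmax.
    scores = np.where(mask, neg, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                        # uniform scores for illustration
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With uniform scores, position 0 can only attend to itself, position 1 splits its weight evenly over positions 0 and 1, and so on; every entry above the diagonal is zero.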

For multi-head attention, $h$ different learned linear projections are applied to $Q$, $K$, and $V$, reducing their dimension from $d_{\text{model}}$ to $d_k$, $d_k$, and $d_v$ respectively. It is shown below:

[Figure: multi-head attention]

And mathematically,

$$\begin{aligned}\mathrm{MultiHead}(Q,K,V)&=\mathrm{Concat}(\mathrm{head}_{1},...,\mathrm{head}_{h})W^{O}\\\text{where}~\mathrm{head}_{i}&=\mathrm{Attention}(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V})\end{aligned}$$

where the projections are parameter matrices $W_{i}^{Q}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}$, $W_{i}^{K}\in\mathbb{R}^{d_{\text{model}}\times d_{k}}$, $W_{i}^{V}\in\mathbb{R}^{d_{\text{model}}\times d_{v}}$ and $W^{O}\in\mathbb{R}^{hd_{v}\times d_{\text{model}}}$.

In practice, $h=8$ and $d_{k}=d_{v}=d_{\text{model}}/h=64$.

In this way, we also have more parameters to learn in the linear layers compared with single-head attention.
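The following NumPy sketch puts the formulas together using the sizes stated above ($d_{\text{model}}=512$, $h=8$, $d_k=d_v=64$). The random weight initialization, the sequence length of 10, and the choice to stack the per-head projection matrices along a leading axis are assumptions made for illustration only.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    # W_Q, W_K: (h, d_model, d_k); W_V: (h, d_model, d_v); W_O: (h * d_v, d_model)
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the h heads along the feature axis, then project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_O

d_model, h = 512, 8
d_k = d_v = d_model // h
rng = np.random.default_rng(0)
W_Q = rng.normal(scale=0.02, size=(h, d_model, d_k))
W_K = rng.normal(scale=0.02, size=(h, d_model, d_k))
W_V = rng.normal(scale=0.02, size=(h, d_model, d_v))
W_O = rng.normal(scale=0.02, size=(h * d_v, d_model))

x = rng.normal(size=(10, d_model))   # a sequence of 10 positions
out = multi_head_attention(x, x, x, W_Q, W_K, W_V, W_O)
print(out.shape)
```

Because $h \cdot d_v = d_{\text{model}}$, the concatenated heads have the same width as the input, and the output keeps the shape of the input sequence.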

3.4 Application

The model uses multi-head attention in three ways. The first two are self-attention, as shown below:

[Figure: self-attention, with queries, keys, and values from the same source]

All of the keys, values, and queries come from the same place and have the same size. The output size is $n \times d$.

As for the third one, the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.

[Figure: attention with queries from the decoder and keys and values from the encoder]

The sizes of $K$ and $V$ are $n \times d$ and the size of $Q$ is $m \times d$, so the final output has size $m \times d$. Semantically, each position in the output sequence retrieves the parts of the input word sequence that are most similar in meaning to it.

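The shapes in both cases can be checked with a short NumPy sketch. The sizes ($n=5$, $m=3$, $d=4$), the variable names, and the random inputs are illustrative assumptions, and the learned projections are left out to keep the focus on shapes.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

n, m, d = 5, 3, 4
rng = np.random.default_rng(0)
encoder_out = rng.normal(size=(n, d))   # input sequence, length n
decoder_in = rng.normal(size=(m, d))    # output sequence so far, length m

# Self-attention: queries, keys, and values all come from the same place.
self_out = attention(encoder_out, encoder_out, encoder_out)

# Third case: queries from the decoder, keys and values from the encoder.
cross_out = attention(decoder_in, encoder_out, encoder_out)
print(self_out.shape, cross_out.shape)
```

The number of output rows always follows the queries, which is why the third case yields an $m \times d$ result even though the keys and values have $n$ rows.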

3.5 Position-wise Feed-Forward Networks (The second sublayer)

It is simply an MLP:

$$\mathrm{FFN}(x)=\max(0,xW_1+b_1)W_2+b_2.$$

The input $x$ has dimension $d_{\text{model}}=512$, $W_1\in\mathbb{R}^{512\times2048}$, and $W_2\in\mathbb{R}^{2048\times512}$.

Position-wise means the mapping is applied to every word in the sequence separately, and all positions share the same MLP.

[Figure: the same MLP applied to each position]

This is also how it differs from an RNN, which must feed the output of the previous time step into the MLP as part of the input for the current step.
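The position-wise property can be checked directly: applying the MLP to the whole sequence gives the same result as applying it to a single position on its own. This sketch uses the dimensions given above; the random weights, zero biases, and sequence length of 10 are assumptions for illustration.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # max(0, x W1 + b1) W2 + b2, applied independently to each row of x.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

seq = rng.normal(size=(10, d_model))     # 10 positions
out_all = ffn(seq, W1, b1, W2, b2)       # whole sequence at once
out_one = ffn(seq[3], W1, b1, W2, b2)    # position 3 on its own
print(np.allclose(out_all[3], out_one))
```

No information flows between positions inside this sub-layer; mixing across positions happens only in the attention sub-layer.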

3.6 Embeddings and Softmax

Embeddings map word tokens to vectors of dimension $d_{\text{model}}$. The linear transformation and softmax function convert the decoder output to predicted next-token probabilities. The two embedding layers and the pre-softmax linear transformation share the same weight matrix.
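A minimal sketch of the weight sharing, assuming a toy vocabulary of 100 tokens and $d_{\text{model}}=16$ (both hypothetical sizes chosen to keep the example small). The $\sqrt{d_{\text{model}}}$ scaling of the embeddings follows the original paper and is not covered in the text above.

```python
import numpy as np

vocab_size, d_model = 100, 16            # hypothetical toy sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, d_model))   # the single shared weight matrix

def embed(token_ids):
    # Look up rows of E; the paper scales embeddings by sqrt(d_model).
    return E[token_ids] * np.sqrt(d_model)

def next_token_probs(decoder_out):
    # The pre-softmax linear layer reuses E (transposed) as its weights.
    logits = decoder_out @ E.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

tokens = np.array([3, 14, 15])
vectors = embed(tokens)                                      # (3, d_model)
probs = next_token_probs(rng.normal(size=(3, d_model)))      # (3, vocab_size)
print(vectors.shape, probs.shape)
```

Here a random array stands in for the decoder output; in the real model it would come from the decoder stack.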

3.7 Positional Encoding

In order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

Every positional encoding (PE) has dimension $d_{\text{model}}$ and is added to the input embedding. The formula is:

$$\begin{aligned}PE_{(pos,2i)}&=\sin\left(pos/10000^{2i/d_{\text{model}}}\right)\\PE_{(pos,2i+1)}&=\cos\left(pos/10000^{2i/d_{\text{model}}}\right)\end{aligned}$$

where $pos$ is the position and $i$ is the dimension.
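The two formulas translate almost directly into NumPy. This sketch assumes $d_{\text{model}}$ is even; the function name and the maximum length of 50 are illustrative choices.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)             # odd dimensions 2i + 1
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)
```

Row `pos` of `pe` is the vector that gets added to the embedding of the token at that position. At position 0 every sine entry is 0 and every cosine entry is 1.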

3.8 Why Self-Attention

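[Table: comparison of different layer types]

Use this table to compare different models. Three metrics are used: complexity per layer, sequential operations, and maximum path length.

As for the complexity per layer of self-attention, $O(n^2 \cdot d)$ comes from the multiplication of the matrices $Q$ and $K$. Self-Attention (restricted) means each query attends only to a neighborhood of nearby positions rather than the whole sequence.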
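As a rough worked count, assuming $Q$ and $K$ each have $n$ rows of dimension $d$ (the self-attention case): the product $QK^T$ has $n \times n$ entries, and each entry is a dot product of two length-$d$ vectors, so

$$Q\in\mathbb{R}^{n\times d},\quad K\in\mathbb{R}^{n\times d}\quad\Rightarrow\quad QK^T\in\mathbb{R}^{n\times n},\qquad \text{cost}\approx n\cdot n\cdot d=O(n^2\cdot d).$$

Multiplying the resulting $n\times n$ weight matrix by $V$ costs the same order, so the total stays $O(n^2\cdot d)$.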

Ref:

Transformer paper read paragraph by paragraph [Paper Close Reading] (bilibili)

Summary of common Transformer questions and answers
