Exploring Layout Models: From LayoutLM to LayoutLMv2 and LayoutXLM

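LayoutLMv2

Differences from LayoutLM:

Text, layout, and image information are modeled jointly by a single Transformer during pre-training, whereas LayoutLM only brings in image information at the fine-tuning stage;
A spatial-aware relative attention mechanism represents the relation between token pairs, whereas LayoutLM uses only absolute 2-D positions;
Two new pre-training strategies are added, text-image alignment and text-image matching, which teach the model whether the text and the image correspond;

Model architecture

The model takes text, layout, and image as input and models cross-modal interactions:

Text embedding

Same as BERT, each text embedding is the sum of a token embedding, a 1-D position embedding, and a segment embedding: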

$$\bm t_i=\text{TokEmb}(w_i) + \text{PosEmb1D}(i) + \text{SegEmb}(s_i)$$

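Visual embedding

The page image is resized to 224×224 and fed to a ResNeXt-FPN encoder (whose parameters are updated during pre-training). The resulting 3-D feature map is average-pooled to a fixed spatial size $W\times H$ and then flattened into a 2-D sequence of $WH$ feature vectors;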

$$\bm v_i=\text{Proj}(\text{VisTokEmb}(I)_i) + \text{PosEmb1D}(i) + \text{SegEmb}([\text{C}]),\quad 0\leq i < WH$$
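As a rough illustration of the pooling-and-flattening step, the sketch below average-pools a single-channel feature map to a fixed grid and flattens it row by row. The function name `adaptive_avg_pool`, the 14×14 input size, and the single-channel simplification are assumptions made for this example; the actual encoder operates on multi-channel ResNeXt-FPN features.

```python
def adaptive_avg_pool(feature, out_h, out_w):
    """Average-pool a 2-D grid (list of rows) down to out_h x out_w cells."""
    in_h, in_w = len(feature), len(feature[0])

    def bounds(idx, in_size, out_size):
        # Floor of the start and ceiling of the end, so every input cell is covered.
        start = idx * in_size // out_size
        end = -((-(idx + 1) * in_size) // out_size)
        return start, end

    pooled = []
    for r in range(out_h):
        r0, r1 = bounds(r, in_h, out_h)
        row = []
        for c in range(out_w):
            c0, c1 = bounds(c, in_w, out_w)
            cells = [feature[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


# Hypothetical 14x14 single-channel feature map, pooled to W x H = 7 x 7.
feature_map = [[y * 14 + x for x in range(14)] for y in range(14)]
pooled = adaptive_avg_pool(feature_map, 7, 7)
visual_tokens = [value for row in pooled for value in row]  # flatten row by row
print(len(visual_tokens))
```

Each flattened entry stands in for one visual token; in the model each token would be a feature vector that is then linearly projected.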

Layout embedding

All coordinates are normalized and discretized to integers in $[0, 1000]$, and the $x$ and $y$ coordinates each use their own embedding layer. For the bounding box $\text{box}_i=(x_{min},x_{max},y_{min},y_{max},w,h)$ of token $i$, the layout embedding is

$$\bm l_i=\text{Concat}(\text{PosEmb2D}_x(x_{min},x_{max},w),\text{PosEmb2D}_y(y_{min},y_{max},h))$$

The empty box $\text{box}_\text{PAD}=(0,0,0,0,0,0)$ is used for the special tokens [CLS], [SEP], and [PAD].
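A minimal sketch of how the six layout features could be derived from a pixel-space box. The helper name `normalize_box`, the `(x0, y0, x1, y1)` input order, and the choice to compute width and height after normalization are assumptions for illustration; the scale of 1000 follows the LayoutLM convention and is left as a parameter.

```python
def normalize_box(box, page_width, page_height, scale=1000):
    """Map a pixel box (x0, y0, x1, y1) to integer layout features in [0, scale].

    Returns (x_min, x_max, y_min, y_max, w, h), the tuple fed to the
    x-axis and y-axis embedding layers.
    """
    x0, y0, x1, y1 = box
    x_min = int(scale * x0 / page_width)
    x_max = int(scale * x1 / page_width)
    y_min = int(scale * y0 / page_height)
    y_max = int(scale * y1 / page_height)
    return (x_min, x_max, y_min, y_max, x_max - x_min, y_max - y_min)


# Box attached to the special tokens [CLS], [SEP], and [PAD].
BOX_PAD = (0, 0, 0, 0, 0, 0)

# Hypothetical word box on a 1000 x 500 pixel page.
print(normalize_box((100, 50, 300, 100), page_width=1000, page_height=500))
```

The first, second, and fifth values would go to the x-axis embedding layer and the rest to the y-axis layer, matching the Concat in the formula above.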

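Spatial-aware multimodal encoder

The visual and text embeddings are concatenated into one sequence, and the layout embedding is added to produce the first-layer input (here $L$ is the number of text tokens):

$$\bm x_i^{(0)}=X_i+\bm l_i,\quad X=\{\bm v_0,\dots,\bm v_{WH-1}, \bm t_0, \dots,\bm t_{L-1}\}$$

The first-layer input only encodes absolute positions. To model local invariance in document layouts, spatial-aware relative attention adds bias terms for the 1-D and 2-D relative positions to the attention scores:

$$\alpha_{ij}=\frac{1}{\sqrt{d_{head}}}(\bm x_i\bm W^Q)(\bm x_j\bm W^K)^{\top},\quad \alpha'_{ij}=\alpha_{ij}+\bm b_{j-i}^{1D}+\bm b_{x_j-x_i}^{2D_x}+\bm b_{y_j-y_i}^{2D_y},\quad \bm h_i=\sum_j\frac{\exp\alpha'_{ij}}{\sum_k\exp\alpha'_{ik}}\bm x_j\bm W^V$$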
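The attention formula above can be traced with a small single-head sketch in plain Python. It assumes the queries, keys, and values have already been projected, and it takes the three biases as ordinary functions of the relative offset. The function and argument names are hypothetical. In the real model the biases are learned parameters, and details such as per-head biases or offset bucketing are not covered here.

```python
import math


def spatial_aware_attention(q, k, v, xs, ys, bias_1d, bias_x, bias_y):
    """Single-head attention with additive 1-D and 2-D relative position biases.

    q, k, v: lists of already-projected vectors, one per token.
    xs, ys:  per-token anchor coordinates.
    bias_*:  callables mapping a relative offset to a scalar bias.
    """
    n, d_head = len(q), len(q[0])
    outputs = []
    for i in range(n):
        scores = []
        for j in range(n):
            alpha = sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_head)
            alpha += bias_1d(j - i) + bias_x(xs[j] - xs[i]) + bias_y(ys[j] - ys[i])
            scores.append(alpha)
        # Numerically stable softmax over j.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(weights[j] / total * v[j][c] for j in range(n))
            for c in range(len(v[0]))
        ])
    return outputs


# Toy example: with zero biases and identical queries/keys, attention is uniform.
zero = lambda offset: 0.0
q = k = [[0.0, 0.0], [0.0, 0.0]]
v = [[1.0, 0.0], [3.0, 0.0]]
print(spatial_aware_attention(q, k, v, xs=[0, 10], ys=[0, 0],
                              bias_1d=zero, bias_x=zero, bias_y=zero))
```

Replacing `bias_1d` with a function that strongly penalizes non-zero offsets makes each token attend mostly to itself, which shows how the bias alone can reshape the attention pattern.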

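Pre-training tasks

The pre-training dataset is the same as the one used for LayoutLM.

MVLM, Masked Visual-Language Modeling: some text tokens are masked at random, and the model must recover them from the layout information and the remaining context. To keep the model from using visual clues, the image regions corresponding to the masked tokens are masked as well;

TIA, Text-Image Alignment: some text lines are selected at random and their image regions are covered. The model predicts for each token whether its image region is covered, i.e. [Covered] or [Not Covered], which pushes it to learn the relationship between bounding-box coordinates and the image;

TIM, Text-Image Matching: a coarse-grained modality alignment task that predicts whether the text and the image come from the same source (whether the current text belongs to the current image). Negative samples are built by randomly replacing or dropping the image, and for negative samples all TIA labels are set to [Covered].

Why does TIA cover whole lines?

Some document elements (signs, bars) already look like covered regions, so finding word-level covered regions in the image is noisy. Covering whole lines avoids this noise.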
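The line-level labeling used by TIA can be sketched as follows. The function name `build_tia_labels`, the `token_line_ids` input, the default `cover_ratio` of 0.15, and the rule of covering at least one line are illustrative assumptions; the sketch only produces labels and does not touch the image.

```python
import random


def build_tia_labels(token_line_ids, cover_ratio=0.15, seed=0):
    """Pick whole text lines to cover and label every token accordingly.

    token_line_ids: for each token, the id of the text line it belongs to.
    Returns the set of covered line ids and one label per token.
    """
    rng = random.Random(seed)
    lines = sorted(set(token_line_ids))
    # Assumption: always cover at least one line so the example is non-trivial.
    n_cover = max(1, round(cover_ratio * len(lines)))
    covered = set(rng.sample(lines, n_cover))
    labels = [
        "[Covered]" if line_id in covered else "[Not Covered]"
        for line_id in token_line_ids
    ]
    return covered, labels


# Eight tokens spread over four lines.
covered_lines, tia_labels = build_tia_labels([0, 0, 1, 1, 1, 2, 3, 3])
print(covered_lines, tia_labels)
```

Because the decision is made per line, all tokens of one line always share the same label, which is the property the question above is about.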

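Experimental details

Network parameters are initialized from UniLMv2;
The ResNeXt-FPN backbone comes from a Mask R-CNN model trained on PubLayNet;
For long documents, a random sliding window crops 512 tokens from the text;
The average-pooling layer of the visual encoder outputs $W\times H=7\times 7$, i.e. 49 visual tokens in total;
MVLM: the token masking probability and scheme are the same as in LayoutLM;
TIA: 15% of the text lines are covered;
TIM: 15% of the images are replaced and 5% are dropped;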
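The random sliding window for long documents might look like the sketch below. The helper name `random_window` and the decision to return short sequences unchanged are assumptions; only the window length of 512 comes from the notes above.

```python
import random


def random_window(tokens, max_len=512, rng=random):
    """Return a random contiguous window of at most max_len tokens."""
    if len(tokens) <= max_len:
        return tokens
    start = rng.randrange(len(tokens) - max_len + 1)
    return tokens[start:start + max_len]


# A hypothetical 1000-token document, cropped to one 512-token window.
window = random_window(list(range(1000)), rng=random.Random(0))
print(len(window))
```

The bounding boxes of the tokens would need to be cropped with the same start and end indices so that text and layout stay aligned.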

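LayoutXLM

Overview

An extension of LayoutLMv2 for multilingual tasks;
Same architecture as LayoutLMv2, with parameters initialized from InfoXLM, a state-of-the-art multilingual model;
Pre-trained on the IIT-CDIP dataset together with publicly available multilingual PDF files;
Releases XFUND, an open multilingual information-extraction dataset (including Chinese, Japanese, Spanish, Italian, and German).

Why not initialize from LayoutLMv2?

LayoutLMv2 does not cover multiple languages, and its vocabulary is different.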

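Model architecture

Same as LayoutLMv2.

Pre-training

Uses the same three tasks as LayoutLMv2: MVLM, TIA, and TIM.

Pre-training data

Documents in 53 languages;
PyMuPDF is used to parse and clean the data and to extract the page text, layout, and images;
BlingFire is used to detect the language of each document;
Documents in language $l$ are sampled with probability proportional to $(n_l/n)^\alpha$ with $\alpha=0.7$, where $n_l$ is the number of documents in language $l$ and $n$ is the total; this yields 22 million visually rich documents;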
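To see what the exponent does, the sketch below turns per-language document counts into normalized sampling probabilities. The function name and the document counts are made up for illustration; only the formula and $\alpha=0.7$ come from the notes above.

```python
def language_sampling_probs(doc_counts, alpha=0.7):
    """Sampling probability per language, proportional to (n_l / n) ** alpha."""
    total = sum(doc_counts.values())
    weights = {lang: (count / total) ** alpha for lang, count in doc_counts.items()}
    norm = sum(weights.values())
    return {lang: weight / norm for lang, weight in weights.items()}


# Hypothetical counts: one high-resource and one low-resource language.
probs = language_sampling_probs({"en": 900, "zh": 100})
print(probs)
```

With an exponent below 1, the low-resource language is sampled more often than its raw share of the data (10% in this toy example), and the high-resource language less often.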

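XFUND: a multilingual form understanding benchmark

Extends FUNSD to 7 languages.

Task description

Two tasks: semantic entity recognition and relation extraction.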

[Figure: examples of Semantic Entity Recognition and Relation Extraction]

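Baselines

Semantic Entity Recognition

Uses the BIO tagging scheme: a task-specific layer is built on top of the text part of LayoutXLM's output to predict a tag for each token.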
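For readers unfamiliar with BIO tagging, the sketch below decodes a tag sequence into entity spans. The helper name `decode_bio` and the label names in the example are illustrative assumptions, and the rule of ignoring an `I-` tag that does not continue an open span is one of several possible conventions.

```python
def decode_bio(tags):
    """Group BIO tags into (label, start, end) spans, with end exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


tags = ["B-QUESTION", "I-QUESTION", "O", "B-ANSWER", "I-ANSWER", "I-ANSWER"]
print(decode_bio(tags))
```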

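Relation Extraction

All candidate entity pairs are enumerated. For each pair, the representations of the first token of the head entity and of the tail entity are concatenated, then passed through a projection and a biaffine layer to classify the relation.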
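The candidate-construction step can be sketched as below. It enumerates every ordered entity pair and concatenates the first-token vectors of head and tail. The function name, the `(label, start, end)` entity format, and the choice of all ordered pairs are assumptions, and the projection and biaffine classifier are deliberately left out.

```python
from itertools import permutations


def relation_candidates(entities, token_vectors):
    """Build (head_index, tail_index, features) for every ordered entity pair.

    entities:      list of (label, start, end) spans over the token sequence.
    token_vectors: one vector (list of floats) per token.
    """
    candidates = []
    for head, tail in permutations(range(len(entities)), 2):
        head_vec = token_vectors[entities[head][1]]  # first token of the head entity
        tail_vec = token_vectors[entities[tail][1]]  # first token of the tail entity
        candidates.append((head, tail, head_vec + tail_vec))  # list concatenation
    return candidates


# Three hypothetical entities over five tokens with 2-dimensional vectors.
entities = [("QUESTION", 0, 2), ("ANSWER", 2, 3), ("ANSWER", 3, 5)]
token_vectors = [[float(i), float(i) + 0.5] for i in range(5)]
pairs = relation_candidates(entities, token_vectors)
print(len(pairs))
```

A classifier would then score each feature vector to decide whether the pair is linked.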

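Experiments

Both base and large models are pre-trained;
The models are fine-tuned on XFUND to evaluate language-specific fine-tuning, zero-shot transfer, and multitask fine-tuning, and are compared with two multilingual pre-trained models (XLM-R and InfoXLM);

1️⃣ Language-specific fine-tuning: fine-tune on language X and test on language X;

2️⃣ Zero-shot transfer learning: fine-tune on English and test on the other languages;

3️⃣ Multitask fine-tuning: train the model on all languages.

Language-specific fine-tuning

Zero-shot transfer learning

Multitask fine-tuning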

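LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Existing approaches

DocFormer: learns to reconstruct image pixels through a CNN decoder; this objective tends to learn noisy details rather than high-level structure such as document layout;
SelfDoc: regresses masked region features; this objective is noisier and harder to learn than classifying discrete features from a small vocabulary;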

The different granularities of image (e.g., dense image pixels or contiguous region features) and text (i.e., discrete tokens) objectives further add difficulty to cross-modal alignment learning.

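Key features of LayoutLMv3

Does not rely on a pre-trained CNN or Faster R-CNN backbone to extract visual features, which reduces the number of parameters and removes the need for region annotations;
Uses the MLM and MIM objectives to narrow the gap between text and visual representations, and the WPA objective to align the two modalities;
A general-purpose pre-trained model that applies to both text-centric tasks (via MLM) and image-centric tasks (via MIM);

LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked.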

https://blog.csdn.net/sinat_34072381/article/details/106993856

Model architecture

Text Embeddings

word embeddings:
initialized from the pre-trained RoBERTa model.
position embeddings:
- 1D position:
  the index of the token within the text sequence.
- 2D position:
  the bounding-box coordinates of the text (as in LayoutLM, using x-axis, y-axis, width, and height). Unlike LayoutLM, segment-level layout positions are adopted: words in a segment share the same 2D position, since they usually express the same semantic meaning.

Image Embedding

Document images are represented by linear projections of image patches, in the following steps:

1. resize the image to $H\times W$ and denote it by $\pmb I\in \mathbb{R}^{C\times H \times W}$, where $C$, $H$, and $W$ are the channel count, height, and width
2. split the image into a sequence of uniform $P\times P$ patches
3. linearly project the patches to $D$ dimensions and flatten them into a sequence of vectors of length $M=HW/P^2$
4. add standard learnable 1D position embeddings to each patch.
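The patch-splitting step can be checked with a short sketch. It works on a single-channel image stored as a list of rows and skips the linear projection and position embeddings. The function name and the sizes used (a 224×224 image with 16×16 patches) are assumptions for illustration, not values stated in these notes.

```python
def split_into_patches(image, patch):
    """Split an H x W grid into flattened patch x patch blocks, in row-major order."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return patches


# Assumed sizes: H = W = 224 and P = 16, so M = HW / P^2 patches.
image = [[0] * 224 for _ in range(224)]
patches = split_into_patches(image, 16)
print(len(patches), len(patches[0]))
```

Each flattened patch would then be multiplied by a projection matrix to obtain a $D$-dimensional vector.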

We insert semantic 1D relative position and spatial 2D relative position as bias terms in the self-attention networks for the text and image modalities, following LayoutLMv2.

Pre-training tasks

LayoutLMv3 learns to reconstruct masked word tokens of the text modality and symmetrically reconstruct masked patch tokens of the image modality.

I. Masked Language Modeling (MLM)

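Inspired by BERT, 30% of the text tokens are masked with a span masking strategy, with span lengths drawn from a Poisson distribution ($\lambda=3$).

The model predicts the masked tokens from the layout information and the context of the corrupted text and image sequences, and thereby models the correspondence between the layout, text, and image modalities.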
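One possible way to realize span masking with Poisson-distributed lengths is sketched below. The sampler uses Knuth's multiplication method. The function names, the minimum span length of 1, and the rule of clipping the last span to hit the budget exactly are assumptions; actual implementations may differ in these details.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson-distributed integer using Knuth's multiplication method."""
    threshold, count, product = math.exp(-lam), 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def span_mask(num_tokens, mask_ratio=0.3, lam=3, seed=0):
    """Choose token positions to mask in contiguous spans until the budget is met."""
    rng = random.Random(seed)
    budget = round(mask_ratio * num_tokens)
    masked = set()
    while len(masked) < budget:
        length = max(1, sample_poisson(lam, rng))      # assumption: spans are never empty
        length = min(length, budget - len(masked))     # never exceed the budget
        start = rng.randrange(num_tokens - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)


positions = span_mask(100)
print(len(positions), positions[:10])
```

Spans may overlap, in which case the loop simply keeps sampling until exactly the budgeted number of positions is masked.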

II. Masked Image Modeling (MIM)

40% of the image tokens are masked at random with a blockwise masking strategy, symmetric to the MLM objective.

The MIM labels come from an image tokenizer, which transforms dense image pixels into discrete tokens according to a visual vocabulary; this facilitates learning high-level layout structures rather than low-level noisy details.