Multimodal Paper Notes: BEiT, BEiT V2, BEiT V3

news2025/4/18 22:20:54

Table of Contents

Overview
BEiT
- 1.0. Summary
- 1.1. BEiT VS BERT
- 2.1. Two Views: visual tokens
- 2.1. Two Views: image patches
- 3. Results
BEiT V2
- 1.0. Summary
- 1.1. Motivation
- 2.1. Methods -- VQ-KD
- 2.2. Methods -- patch aggregation
- 3.1. Results -- image classification & semantic segmentation
- 3.2. Results -- Ablation studies about VQ-KD
- 3.3. Results -- Ablation studies about patch aggregation
- 3.4. Results -- Visualization
VLMO
- 1.0. Summary
- 2.1. Contribution 1: MoME
- 2.1. Contribution 2: Stagewise Pre-Training
BEiT V3
- 1.0. Summary
- 1.1. Motivations & Contributions
- 2.1. Method -- Multiway Transformers
- 2.2. Method -- Masked Data Modeling
- 2.3. Method -- Scaling up
- 2.4. Method -- Transfer to downstream tasks
- 3. Experiments

Overview

BEiT

1.0. Summary

Title: BEiT: BERT Pre-Training of Image Transformers
Institution: Microsoft
Paper: https://arxiv.org/abs/2106.08254
Code: https://github.com/microsoft/unilm/tree/master/beit
Task: a BERT for CV; unimodal (image-only) pre-training
Features:
Method:
Prerequisite work: BERT

1.1. BEiT VS BERT

BEiT: Bidirectional Encoder representation from Image Transformers
BERT: Bidirectional Encoder Representations from Transformers

Model	Pretraining Task	Mask Method	Special Tokens
BERT	masked language modeling (MLM)	mask 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged	[CLS], [SEP]
BEiT	masked image modeling (MIM)	blockwise masking of image patches	[CLS]
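
To make the BERT row concrete, here is a minimal sketch of the 15% / 80-10-10 corruption rule. It is a simplification: each token is selected independently with probability 0.15 rather than sampling an exact 15% of positions, and the function name `bert_mask` and the toy vocabulary are hypothetical, not taken from any library.

```python
import random

SPECIAL_TOKENS = ("[CLS]", "[SEP]")

def bert_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in SPECIAL_TOKENS:
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # token not selected for prediction
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

BEiT replaces this per-token rule with blockwise masking over the patch grid, so contiguous regions of the image are hidden together.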

2.1. Two Views: visual tokens

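The tokenizer must be trained before pre-training begins; alternatively, use publicly released weights (e.g., the DALL-E tokenizer).
Role: it supplies the supervision signal (the prediction targets) for the pre-training stage, analogous to the tokenizer in NLP.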

2.1. Two Views: image patches

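Split the image into patches.
Apply blockwise masking to some of the patches.
Prepend [CLS] and add position embeddings.
Pass the sequence through the Transformer encoder and predict the visual tokens that correspond to the masked patches.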
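
The masking and target-selection steps above can be sketched as follows. This is a rough illustration, not the official implementation: the 14x14 patch grid (a 224x224 image with 16x16 patches), the 40% mask ratio, the minimum block area of 16 patches, and the aspect-ratio range are values recalled from the BEiT paper and should be treated as assumptions, and `blockwise_mask` / `mim_targets` are hypothetical names.

```python
import math
import random

def blockwise_mask(grid_h=14, grid_w=14, mask_ratio=0.4, min_block=16, rng=None):
    """Return a set of (row, col) patch positions to mask, chosen block by block."""
    rng = rng or random.Random()
    target = int(mask_ratio * grid_h * grid_w)
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # Sample a block area (in patches) and an aspect ratio.
        area = rng.uniform(min_block, max(min_block, remaining))
        aspect = rng.uniform(0.3, 1 / 0.3)
        block_h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
        block_w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
        # Place the block at a random position that fits inside the grid.
        top = rng.randint(0, grid_h - block_h)
        left = rng.randint(0, grid_w - block_w)
        masked.update(
            (i, j)
            for i in range(top, top + block_h)
            for j in range(left, left + block_w)
        )
    return masked

def mim_targets(visual_token_grid, masked):
    """Prediction targets: the tokenizer's visual token id at each masked position."""
    return {(i, j): visual_token_grid[i][j] for (i, j) in masked}
```

Because blocks may overlap and the last block can overshoot, the final number of masked patches is at least the target but not exactly equal to it.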

3. Results

On image classification and semantic segmentation, BEiT outperforms:
(1) training from scratch(ViT, DeiT)
(2) Supervised Pre-Training on ImageNet-22K(ViT)
(3) Self-Supervised Pre-Training on ImageNet-1K(ViT, iGPT, MoCo v3, DINO)

BEiT V2

1.0. Summary

Title: BEIT V2: Masked Image Modeling with Vector-Quantized Visual Tokenizers
Institution: Microsoft
Paper: https://arxiv.org/pdf/2208.06366.pdf
Code: https://github.com/microsoft/unilm/tree/master/beit2
Task:
Features:
Method:
Prerequisite work:

1.1. Motivation

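(1) Current MIM work focuses mostly on low-level image elements (pixel values) and pays little attention to high-level ones (semantic information). NLP pre-training, by contrast, mines high-level semantics, so the ability of MIM to learn semantic information is worth exploring.
(2) MIM emphasizes patch reconstruction and pays less attention to learning a global representation of the image.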

2.1. Methods – VQ-KD

The feature maps of an existing model serve as the reconstruction target; the teacher models are CLIP and DINO.
Both the encoder output and the codebook embeddings are L2-normalized.
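
A minimal sketch of the quantization step, assuming the usual nearest-neighbor codebook lookup: with both the encoder output and the codebook entries L2-normalized, the nearest code in Euclidean distance is the one with the highest cosine similarity. The function names and the toy 2-D codebook are hypothetical, and using cosine similarity to score a reconstruction against the teacher feature is an assumption based on my reading of the paper, not something stated in these notes.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity, computed as a dot product of unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))

def quantize(encoder_output, codebook):
    """Return the index of the codebook entry closest to the encoder output.

    On unit vectors, minimizing Euclidean distance is equivalent to
    maximizing cosine similarity, so the lookup ignores vector magnitude.
    """
    sims = [cosine(encoder_output, code) for code in codebook]
    return max(range(len(codebook)), key=sims.__getitem__)

def teacher_alignment(decoder_output, teacher_feature):
    """Score to maximize: similarity between the reconstruction and the teacher feature."""
    return cosine(decoder_output, teacher_feature)
```

The returned index is the discrete visual token for that patch; during MIM pre-training these indices play the role the DALL-E tokens played in BEiT.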

2.2. Methods – patch aggregation

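Adds an extra MIM loss that combines the patch tokens from an intermediate layer l with the [CLS] token from the final layer L and feeds them to a shallow network.
This pushes [CLS] to capture global information about the image.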
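
The token selection for this extra loss can be sketched as below. The indexing conventions are assumptions made for illustration: `hidden_states[k]` holds the token sequence after layer k+1, and position 0 of each sequence is [CLS]. The shallow network that consumes the result is omitted.

```python
def aggregate_patches(hidden_states, l):
    """Build the input for the shallow MIM head.

    Takes [CLS] from the final layer and the patch tokens from layer `l`
    (1-indexed), so the masked patches can only be predicted well if the
    final [CLS] carries global information.
    """
    final_cls = hidden_states[-1][0]
    intermediate_patches = hidden_states[l - 1][1:]
    return [final_cls] + list(intermediate_patches)

# Toy example with 4 layers and 2 patches; strings stand in for token vectors.
layers = [["cls%d" % k, "p%da" % k, "p%db" % k] for k in range(1, 5)]
shallow_input = aggregate_patches(layers, l=2)
```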

3.1. Results – image classification & semantic segmentation

3.2. Results – Ablation studies about VQ-KD

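The more complex the decoder, the lower the reconstruction loss, but codebook utilization drops and downstream performance degrades.
The larger the codebook dimension, the lower the utilization.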
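
For reference, codebook utilization can be read as the fraction of codebook entries that are assigned at least once over a dataset. That definition is an assumption here (the notes do not spell it out), and the helper name is hypothetical.

```python
def codebook_utilization(token_ids, codebook_size):
    """Fraction of codebook entries used at least once in `token_ids`."""
    return len(set(token_ids)) / codebook_size

# Five tokens that touch only three distinct codes out of eight.
usage = codebook_utilization([0, 0, 3, 5, 3], codebook_size=8)
```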

3.3. Results – Ablation studies about patch aggregation

3.4. Results – Visualization

VLMO

1.0. Summary

Title: VLMO: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts
Institution: Microsoft
Paper: http://export.arxiv.org/pdf/2111.02358
Code: https://github.com/microsoft/unilm/tree/master/vlmo
Task:
Features:
Method:
Prerequisite work:

2.1. Contribution 1: MoME

motivation
(1) dual encoder models
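Examples: CLIP, ALIGN
Strength: efficient on retrieval tasks (T2I, I2T)
Weakness: the fusion of the two modalities is shallow (cosine similarity or a linear projection), so these models do poorly on tasks such as VR (visual reasoning) and VQA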
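
A small sketch of why dual encoders are efficient for retrieval: image and text embeddings are computed independently, so the gallery can be encoded once and each query only needs a cosine similarity per candidate. The embeddings below are made-up 2-D vectors and `retrieve` is a hypothetical helper, not an API of CLIP or ALIGN.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, gallery_embs, top_k=1):
    """Rank precomputed gallery embeddings by similarity to the query."""
    scored = [(cosine(query_emb, g), idx) for idx, g in enumerate(gallery_embs)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:top_k]]

# Text-to-image retrieval over three precomputed image embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text_emb = [0.9, 0.1]
best_two = retrieve(text_emb, image_embs, top_k=2)
```

The same property is the weakness noted above: the only interaction between the modalities is this final similarity score.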

(2) fusion encoder models
Examples: ViLT, ALBEF
Strength: better on reasoning tasks such as VR and VQA
Weakness: slow on retrieval tasks

Performance comparison on retrieval tasks

MoME: Mixture-of-Modality-Experts Transformer
pretraining

fine-tuning
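
As a rough sketch of the MoME idea, based on my reading of the VLMo paper rather than on these notes: every block shares one self-attention module across modalities, then routes the result to a modality-specific feed-forward expert. Everything here is a toy stand-in. Tokens are scalars, `uniform_attention` replaces real self-attention with a plain average, the experts are simple lambdas, and residual connections and layer normalization are omitted.

```python
def uniform_attention(tokens):
    """Stand-in for shared self-attention: every token attends equally to all tokens."""
    mean = sum(tokens) / len(tokens)
    return [mean for _ in tokens]

def mome_block(tokens, modality, experts):
    """Shared attention followed by the feed-forward expert chosen by input modality."""
    attended = uniform_attention(tokens)  # parameters shared by all modalities
    expert = experts[modality]            # modality-specific feed-forward expert
    return [expert(x) for x in attended]

# Hypothetical experts: one per input type.
experts = {
    "vision": lambda x: 2 * x,
    "language": lambda x: x + 1,
    "vision-language": lambda x: -x,
}
image_out = mome_block([1.0, 3.0], "vision", experts)
text_out = mome_block([1.0, 3.0], "language", experts)
```

Routing by modality is what would let the same network act as a dual encoder (separate image and text passes) for retrieval or as a fusion encoder (joint pass) for reasoning tasks.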

2.1. Contribution 2: Stagewise Pre-Training

motivation
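(1) Image-text pairs are relatively scarce, and the accompanying texts are mostly short.
(2) Image-only and text-only data are far more plentiful.
Hence the Stagewise Pre-Training strategy, which gives multimodal pre-training a good weight initialization.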
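
One way to picture the strategy is as a schedule of stages, each with its own data source and set of trainable modules. The stage details below (which modules are frozen when, and the objectives) are recalled from the VLMo paper and are not stated in these notes, so treat them as assumptions; the module names are hypothetical.

```python
ALL_MODULES = {"self_attention", "vision_ffn", "language_ffn", "vision_language_ffn"}

# Assumed schedule: image-only data first, then text-only, then image-text pairs.
STAGES = [
    {
        "name": "vision",
        "data": "image-only",
        "objective": "masked image modeling",
        "trainable": {"self_attention", "vision_ffn"},
    },
    {
        "name": "language",
        "data": "text-only",
        "objective": "masked language modeling",
        "trainable": {"language_ffn"},  # attention and vision expert stay frozen
    },
    {
        "name": "vision-language",
        "data": "image-text pairs",
        "objective": "contrastive + matching + masked language modeling",
        "trainable": set(ALL_MODULES),
    },
]

def frozen_modules(stage):
    """Modules whose weights are kept fixed during the given stage."""
    return ALL_MODULES - stage["trainable"]

for stage in STAGES:
    print(stage["name"], "frozen:", sorted(frozen_modules(stage)))
```

The point of the ordering is that the scarce image-text pairs are only needed in the last stage, after the plentiful unimodal data has already initialized most of the weights.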

BEiT V3

1.0. Summary

Title: Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Institution: Microsoft
Paper: https://arxiv.org/pdf/2208.10442v1.pdf
Code: https://github.com/microsoft/unilm/tree/master/beit3
Task:
Features:
Method:
Prerequisite work: