BLIP: A Summary

Paper: Bootstrapping Language-Image Pre-training (BLIP)

Code: https://github.com/salesforce/BLIP

1 motivation

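Multimodal models currently excel at image understanding and generation tasks mainly because of scaling: larger models and more data. However, most vision-language pre-training (VLP) datasets are crawled from the web (web datasets), and their captions are noisy, which makes them a suboptimal source of supervision. BLIP proposes a caption bootstrapping scheme that "purifies" noisy web datasets and thereby further improves the multimodal model.

In short: the paper designs a denoising scheme that purifies web datasets, and the cleaner data yields better accuracy.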

2 method

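2.1 Model architecture

BLIP starts from a two-tower layout: an image encoder and a text encoder. The paper calls the full model a multimodal mixture of encoder-decoder (MED), because the text transformer can also run as an image-grounded text encoder or an image-grounded text decoder. Three vision-language pre-training (VLP) tasks are used to bring out the model's multimodal abilities.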

[Figure: BLIP model architecture]

2.2 Multimodal pre-training tasks

2.2.1 Image-Text Contrastive Loss (ITC)

The core idea is the same as CLIP's training objective. Given a batch of image-text feature pairs $\{(\mathrm{fea}_{\mathrm{img}}^{(1)}, \mathrm{fea}_{\mathrm{text}}^{(1)}), (\mathrm{fea}_{\mathrm{img}}^{(2)}, \mathrm{fea}_{\mathrm{text}}^{(2)}), \cdots, (\mathrm{fea}_{\mathrm{img}}^{(N)}, \mathrm{fea}_{\mathrm{text}}^{(N)}) \}$, the goal is to make the similarity of matched pairs $(\mathrm{fea}_{\mathrm{img}}^{(i)}, \mathrm{fea}_{\mathrm{text}}^{(i)})$ as high as possible and the similarity of mismatched pairs $(\mathrm{fea}_{\mathrm{img}}^{(i)}, \mathrm{fea}_{\mathrm{text}}^{(j)}),\ i \neq j$ as low as possible. With a one-hot target on the matched pair, the cross-entropy (CE) reduces to a negative log-softmax, so the loss has the following form, where the first term is image-to-text and the second is text-to-image:
$-\frac{1}{2} \left( \sum_{i=1}^{N} \log \frac{ \exp ( \mathrm{fea}^{(i)}_{\mathrm{img}} \cdot \mathrm{fea}^{(i)}_{\mathrm{text}})} {\sum_{j=1}^{N} \exp( \mathrm{fea}^{(i)}_{\mathrm{img}} \cdot \mathrm{fea}^{(j)}_{\mathrm{text}}) } + \sum_{j=1}^{N} \log \frac{ \exp ( \mathrm{fea}^{(j)}_{\mathrm{img}} \cdot \mathrm{fea}^{(j)}_{\mathrm{text}})} {\sum_{i=1}^{N} \exp( \mathrm{fea}^{(i)}_{\mathrm{img}} \cdot \mathrm{fea}^{(j)}_{\mathrm{text}}) } \right)$
The CLIP paper gives the following pseudocode:

[Figure: CLIP pseudocode]

With this background, the ITC steps are straightforward:

STEP1: The image encoder maps the images to image embeddings: $\mathbb{R}^{B\times 3 \times H \times W} \stackrel{\mathrm{Encoder}_{\mathrm{img}}} \longrightarrow \mathbb{R}^ {B\times L_{\mathrm{img}} \times d}$

STEP2: The text encoder maps the text to text embeddings: $\mathbb{R}^{B\times L \times d'} \stackrel{\mathrm{Encoder}_{\mathrm{text}}} \longrightarrow \mathbb{R}^ {B\times L_{\mathrm{text}} \times d}$

STEP3: Take the embedding of the [CLS] token from the image embeddings, $\mathrm{fea}_{\mathrm{img}} \in \mathbb{R} ^{B \times d}$, and the embedding of the [CLS] token from the text embeddings, $\mathrm{fea}_{\mathrm{text}} \in \mathbb{R} ^{B \times d}$.

STEP4: Project $\mathrm{fea}_{\mathrm{img}}$ and $\mathrm{fea}_{\mathrm{text}}$ to the same dimension.

STEP5: Make the similarity of matched pairs $(\mathrm{fea}_{\mathrm{img}}^{(i)}, \mathrm{fea}_{\mathrm{text}}^{(i)})$ as high as possible and the similarity of mismatched pairs $(\mathrm{fea}_{\mathrm{img}}^{(i)}, \mathrm{fea}_{\mathrm{text}}^{(j)}),\ i \neq j$ as low as possible.
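
STEP5 can be sketched in a few lines of plain Python. This is only an illustration of the symmetric contrastive loss, not the BLIP implementation: the temperature value, the L2 normalization, and the averaging over the batch are assumptions that do not appear in the summary above, and the function names are made up for this example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def itc_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over N matched (image, text) feature pairs.

    temperature is an assumed hyperparameter, not taken from the text above.
    """
    img = [l2_normalize(v) for v in img_feats]
    txt = [l2_normalize(v) for v in txt_feats]
    n = len(img)
    # sim[i][j] = scaled similarity between image i and text j
    sim = [[dot(img[i], txt[j]) / temperature for j in range(n)] for i in range(n)]
    # image-to-text: row i should peak at column i
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # text-to-image: column j should peak at row j
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Matched pairs line up on the diagonal, so the loss is close to zero.
aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Swapping the texts puts every match off the diagonal, so the loss is large.
swapped = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In a real training loop the feature lists would be the projected [CLS] embeddings from STEP3 and STEP4, and the loss would be computed with a tensor library rather than Python lists.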

2.2.2 Image-text matching (ITM)

ITM is another common VLP task, and it can be implemented in many ways. The core idea: given image-text feature pairs $\{(\mathrm{fea}_{\mathrm{img}}^{(1)}, \mathrm{fea}_{\mathrm{text}}^{(1)}), (\mathrm{fea}_{\mathrm{img}}^{(2)}, \mathrm{fea}_{\mathrm{text}}^{(2)}), \cdots, (\mathrm{fea}_{\mathrm{img}}^{(N)}, \mathrm{fea}_{\mathrm{text}}^{(N)}) \}$, predict whether $(\mathrm{fea}_{\mathrm{img}}^{(i)}, \mathrm{fea}_{\mathrm{text}}^{(j)})$ come from the same pair. The label is 1 if they do and 0 otherwise. The loss has the form:
$\mathrm{Loss} = \sum_{i} \sum_{j} \begin{cases} \mathrm{CE}(\mathrm{Logit}(\mathrm{fea}^{(i)}_{\mathrm{img}}, \mathrm{fea}^{(j)}_{\mathrm{text}}), 1) & \text{if } i = j \\ \mathrm{CE}(\mathrm{Logit}(\mathrm{fea}^{(i)}_{\mathrm{img}}, \mathrm{fea}^{(j)}_{\mathrm{text}}), 0) & \text{if } i \neq j \end{cases}$
The concrete implementation is as follows.

STEP1: The image encoder maps the images to image embeddings: $\mathbb{R}^{B\times 3 \times H \times W} \stackrel{\mathrm{Encoder}_{\mathrm{img}}} \longrightarrow \mathbb{R}^ {B\times L_{\mathrm{img}} \times d}$

STEP2: The text encoder maps the text to text embeddings: $\mathbb{R}^{B\times L \times d'} \stackrel{\mathrm{Encoder}_{\mathrm{text}}} \longrightarrow \mathbb{R}^ {B\times L_{\mathrm{text}} \times d}$. Unlike ITC, the image embeddings are also fed into the text encoder as encoder_hidden_states. The two modalities interact in the cross-attention layers, where the image sequence embeddings serve as key and value and the text embeddings serve as query. The output text embeddings therefore also carry information from the image sequence embeddings. The authors call the text encoder in this mode the image-grounded text encoder.

For the implementation of this cross-attention interaction between image and text embeddings, see the BertSelfAttention class in the transformers library.

STEP3: Take the embedding of the [CLS] token from the text embeddings, $\mathrm{fea}_{\mathrm{text}} \in \mathbb{R} ^{B \times d}$.

When the image sequence embeddings and the text come from the same pair, the class label of $\mathrm{fea}_{\mathrm{text}}$ is 1.
When they do not come from the same pair, the class label of $\mathrm{fea}_{\mathrm{text}}$ is 0.

The loss is then computed with cross-entropy.

After training, we obtain the image-grounded text encoder.
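
The labeling rule in STEP3 can be illustrated with a small plain-Python sketch. The callable match_logits is a hypothetical stand-in for the image-grounded text encoder followed by a two-way classification head, and the toy scorers below are assumptions made only so the example runs. The sketch enumerates every (i, j) combination and averages the loss, whereas the formula above sums; the BLIP paper instead describes sampling hard negatives.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example given class logits and an integer label."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def itm_loss(images, texts, match_logits):
    """Average 2-way cross-entropy over all (image i, text j) combinations.

    match_logits(image, text) -> [logit_no_match, logit_match] is a
    hypothetical stand-in for the image-grounded text encoder plus its head.
    """
    losses = []
    for i, image in enumerate(images):
        for j, text in enumerate(texts):
            label = 1 if i == j else 0  # 1 only for the original pair
            losses.append(cross_entropy(match_logits(image, text), label))
    return sum(losses) / len(losses)


# Toy scorer (assumption): an image matches a text when their ids agree.
def toy_logits(image, text):
    return [0.0, 4.0] if image == text else [4.0, 0.0]


# The same scorer with its two outputs flipped, i.e. always wrong.
def wrong_logits(image, text):
    return toy_logits(image, text)[::-1]


good = itm_loss(["a", "b"], ["a", "b"], toy_logits)
bad = itm_loss(["a", "b"], ["a", "b"], wrong_logits)
print(good < bad)
```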


2.2.3 Language modeling loss (LM)

LM is the pre-training task of the GPT series: predict the next token from the preceding ones. Unlike LM in NLP, the VLP variant also adds the image embeddings to the context.

STEP1: The image encoder maps the images to image embeddings ($\mathrm{fea}_{\mathrm{img}}$): $\mathbb{R}^{B\times 3 \times H \times W} \stackrel{\mathrm{Encoder}_{\mathrm{img}}} \longrightarrow \mathbb{R}^ {B\times L_{\mathrm{img}} \times d}$

STEP2: The image embeddings ($\mathrm{fea}_{\mathrm{img}}$) are fed as key and value into the cross-attention layers of the text decoder, where they interact with the text embeddings. The authors call the text decoder in this mode the image-grounded text decoder.

STEP3: Train by maximizing the autoregressive log-likelihood of the caption tokens $y_1, \ldots, y_L$:
$\sum_{i=1} ^ {L} \log (p(y_i|y_{<i}, \mathrm{fea}_{\mathrm{img}}; \Theta))$
After training, we obtain the image-grounded text decoder.
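
Maximizing the log-likelihood in STEP3 is equivalent to minimizing its negative, which is what the following plain-Python sketch computes. The callable next_token_probs is a hypothetical stand-in for the image-grounded text decoder; the image features are assumed to be captured inside it, and the toy decoder exists only to make the example runnable.

```python
import math


def lm_loss(caption_ids, next_token_probs):
    """Negative log-likelihood of a caption under an autoregressive decoder.

    next_token_probs(prefix_ids) -> list of probabilities over the vocabulary
    is a hypothetical stand-in for the image-grounded text decoder.
    """
    nll = 0.0
    for i, token in enumerate(caption_ids):
        probs = next_token_probs(caption_ids[:i])  # condition on y_{<i}
        nll -= math.log(probs[token])
    return nll


# Toy decoder over a 3-token vocabulary (assumption): it prefers token
# (len(prefix) % 3) with probability 0.8 and gives 0.1 to each other token.
def toy_decoder(prefix):
    probs = [0.1, 0.1, 0.1]
    probs[len(prefix) % 3] = 0.8
    return probs


likely = lm_loss([0, 1, 2], toy_decoder)    # follows the decoder's preference
unlikely = lm_loss([2, 2, 0], toy_decoder)  # contradicts it at every step
print(likely < unlikely)
```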

These pre-training tasks yield:

image encoder
image-grounded text encoder
image-grounded text decoder

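2.3 bootstrapping caption

The pre-training tasks in Section 2.2 give us three models: 1) the image encoder; 2) the image-grounded text encoder; 3) the image-grounded text decoder.

These three models are then combined to "purify" the web dataset. The main steps are: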

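STEP1: Fine-tune the pre-trained image-grounded text encoder and image-grounded text decoder on a human-annotated dataset $\{(I_h, T_h) \mid h=1,2,\ldots\}$, where $(I_h, T_h)$ is an image-text pair.

STEP2: Iterate over the web dataset $\{(I_w, T_w) \mid w=1,2,\ldots\}$ and do the following:

STEP2.1 Use the image-grounded text decoder (called the Captioner in the paper) to predict a caption $T_s$ for $I_w$. The image $I_w$ now has two image-text pairs, $(I_w, T_w)$ and $(I_w, T_s)$.
STEP2.2 Use the image-grounded text encoder (called the Filter in the paper) to judge whether the pairs $(I_w, T_w)$ and $(I_w, T_s)$ match. Discard the pairs that do not match, $(I_w, T^{\mathrm{Not \, matching}}_w)$ and $(I_w, T^{\mathrm{Not \, matching}}_s)$.

STEP3: Collect all remaining image-text pairs: $\{(I_w, T_w^{\mathrm{match}})\} \cup \{(I_w, T_s^{\mathrm{match}})\} \cup \{(I_h, T_h)\}$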

The purified dataset is then used to train again with the pre-training tasks from Section 2.2.
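
The data flow of STEP2 and STEP3 can be sketched as follows. The callables captioner and filter_fn are hypothetical stand-ins for the fine-tuned image-grounded text decoder and encoder, and the toy data and lambdas are assumptions chosen only to make the flow visible.

```python
def bootstrap_captions(web_pairs, human_pairs, captioner, filter_fn):
    """Return the purified dataset: kept web and synthetic captions plus human pairs.

    captioner(image) -> str and filter_fn(image, text) -> bool are hypothetical
    stand-ins for the fine-tuned Captioner and Filter models.
    """
    purified = []
    for image, web_text in web_pairs:
        synthetic_text = captioner(image)        # STEP2.1: generate T_s
        for text in (web_text, synthetic_text):  # STEP2.2: keep matching pairs only
            if filter_fn(image, text):
                purified.append((image, text))
    purified.extend(human_pairs)                 # STEP3: add the human-annotated pairs
    return purified


# Toy stand-ins (assumptions): an image is represented by a tag string, the
# captioner echoes the tag, and the filter checks that the tag occurs in the text.
web = [("dog", "a dog on grass"), ("cat", "click here to buy")]
human = [("bird", "a bird on a branch")]
result = bootstrap_captions(
    web,
    human,
    captioner=lambda image: f"a photo of a {image}",
    filter_fn=lambda image, text: image in text,
)
for pair in result:
    print(pair)
```

In this toy run the noisy web caption "click here to buy" is dropped, while the synthetic caption for the same image is kept.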


3 result

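According to the authors' experiments, caption bootstrapping brings some improvement on both retrieval and captioning tasks. However, once both the dataset and the model are scaled up, the gain becomes quite limited (most visibly on captioning), as the last row of the table below shows.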

[Table: retrieval and captioning results with and without caption bootstrapping]