Faithful Vision-Language Interpretation via Concept Bottleneck Models (FVLC)

This paper was published at ICLR 2024.

Paper link: https://openreview.net/attachment?id=rp0EdI8X4e&name=pdf

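I. Overview

This paper is another member of the CBM "family". CBMs are known to require a large amount of human annotation, and Label-Free CBM partially addresses this by using pre-trained multimodal models to generate concepts automatically. Relying on pre-trained models, however, makes the resulting concepts unstable, so the authors build on Label-Free CBM and propose a more stable model: the Faithful Vision-Language Concept (FVLC) model.

The authors argue that a faithful concept should have four properties: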

Consistency with the original concept: significant overlap between the top-k indices of the "faithful concept" and the original concept, ensuring interpretability.
Robustness to noise and perturbations during concept generation: inherent stability, with the concept vector remaining robust against random noise and perturbations during LLM concept set generation.
Predictions consistent with vanilla CBMs: a prediction distribution close to that of the vanilla CBMs, preserving their outstanding performance.
A stable output distribution: the output distribution remains robust during self-supervised learning and LLM concept set generation, even in the presence of perturbations.

II. Method

Before introducing the proposed method, let us review some background.

1. Concept Bottleneck Models (CBMs)

First, Concept Bottleneck Models, which I have already covered in many earlier posts. If you are familiar with CBMs, you know they have two main drawbacks: 1. a loss in performance caused by incomplete extraction of the original data features; 2. the need for a large amount of human annotation. Many papers have proposed potential remedies for these two issues, such as SENN, PCBM, and Label-Free CBM.

To recap the CBM notation: we consider a classification task with a concept set $\mathcal{C}=\left \{ p_1,...,p_k \right \}$ and a training dataset $\left \{ (x_i,y_i,c_i) \right \}_{i=1}^N$ , where for each $i \in [N]$ , $x_i \in \mathbb{R}^d$ is the feature vector, $y_i \in \mathbb{R}^{d_z}$ is the label ( $d_z$ being the number of classes), and $c_i \in \mathbb{R}^k$ is the concept vector whose $j$ -th entry is the weight of concept $p_j$ . A CBM learns two mappings: $g:\mathbb{R}^d\rightarrow \mathbb{R}^{k}$ from the input space to the concept space, and $f:\mathbb{R}^k\rightarrow \mathbb{R}^{d_z}$ from the concept space to the prediction space. For any input $x$ , the predicted concept vector $\hat{c}=g(x)$ and the prediction $\hat{y}=f(g(x))$ should be close to their ground-truth values.
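The composition $\hat{y}=f(g(x))$ can be sketched in a few lines of NumPy. This is only an illustration of the notation: the linear form of both maps, the softmax output, and the names `W_c`, `W_F`, and the dimensions are assumptions made for the example, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, d_z = 8, 5, 3              # feature dim, number of concepts, number of classes

W_c = rng.normal(size=(k, d))    # hypothetical weights of the concept predictor g
W_F = rng.normal(size=(d_z, k))  # hypothetical weights of the label predictor f

def g(x):
    """Concept predictor g: R^d -> R^k (assumed linear here)."""
    return W_c @ x

def f(c):
    """Label predictor f: R^k -> R^{d_z}, returned as class probabilities."""
    logits = W_F @ c
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()

x = rng.normal(size=d)   # a random stand-in for one feature vector
c_hat = g(x)             # predicted concept vector
y_hat = f(c_hat)         # predicted class distribution
```

Every prediction passes through `c_hat`, which is what makes the concept layer a bottleneck: changing a concept activation is the only way to change the output.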

2. Label-free CBMs

Label-free CBMs consist of four steps:

Step 1: Concept set creation and filtering.

Query GPT-3 with a series of questions and filter the answers to produce the concept set $\mathcal{C}$ .

Step 2 and 3: Learning the Concept Bottleneck Layer (CBL).

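Learn the projection weights $W_c$ from the feature space to the concept space. First, CLIP is used to build the concept activation matrix $M_{i,j}=E_I(x_i)\cdot E_T(p_j)$ , where $E_I$ and $E_T$ are CLIP's image encoder and text encoder. Rows of $M$ correspond to images and columns to concepts, so each entry, an inner product of the two embeddings, measures how strongly concept $j$ is present in image $i$ . $W_c$ is a $k \times d$ matrix that maps the feature space to the concept space, $g(x)=W_c x$ . Let $l \in [k]$ index a neuron of interest in this layer; its activation pattern over all images is $q_l=\left[g_l(x_1),\cdots,g_l(x_N)\right]$ . The objective is to align the $i$ -th neuron with the $i$ -th concept as closely as possible: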

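$\mathcal{L}(W_c)=\sum_{i=1}^k-\operatorname{sim}(p_i,q_i)=\sum_{i=1}^k-\frac{\bar{q_i}^3\cdot\bar{M_{:,i}}^3}{||\bar{q_i}^3||_2||\bar{M_{:,i}}^3||_2}$

Here $\operatorname{sim}$ is the cos-cubed similarity used by Label-free CBM: the bar denotes normalization to zero mean and unit standard deviation, and the cube is applied elementwise.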
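A minimal NumPy sketch of this loss, assuming the bar means standardization over the $N$ images and the cube is elementwise. The random arrays stand in for real CLIP embeddings, and the function names are hypothetical.

```python
import numpy as np

def cos_cubed_similarity(q, m, eps=1e-8):
    """Cosine similarity between the cubed, standardized vectors q and m."""
    q_bar = (q - q.mean()) / (q.std() + eps)   # zero mean, unit std
    m_bar = (m - m.mean()) / (m.std() + eps)
    q3, m3 = q_bar ** 3, m_bar ** 3            # elementwise cube
    return float(q3 @ m3 / (np.linalg.norm(q3) * np.linalg.norm(m3) + eps))

def cbl_loss(Q, M):
    """Q: (N, k) neuron activations, M: (N, k) concept activation matrix."""
    return -sum(cos_cubed_similarity(Q[:, i], M[:, i]) for i in range(Q.shape[1]))

rng = np.random.default_rng(0)
N, k = 50, 4
image_emb = rng.normal(size=(N, 16))   # stand-in for E_I(x_i)
text_emb = rng.normal(size=(k, 16))    # stand-in for E_T(p_j)
M = image_emb @ text_emb.T             # M[i, j] = E_I(x_i) . E_T(p_j)

loss_aligned = cbl_loss(M, M)                        # neurons match concepts exactly
loss_random = cbl_loss(rng.normal(size=(N, k)), M)   # unrelated activations
```

When the activation patterns match the columns of `M`, each similarity is close to 1 and the loss approaches its minimum of $-k$; unrelated activations give a larger loss.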

Step 4: After successfully learning the Concept Bottleneck Layer, the next step involves training the final predictor using the fully connected layer.

Learn the mapping from concepts to classes, $W_F\in\mathbb{R}^{d_z\times k}$ , with $y(x,\boldsymbol{c})=W_{F}g(x)$ .

Next, we turn to FVLC, the method proposed in this paper.

3. Faithful Vision-Language Concept

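Because the concept set in Label-free CBMs is generated by GPT-3, it may introduce instability and perturbations. Moreover, not only the concepts but also the input images are inevitably at risk of being perturbed, which makes it all the more important to keep the concepts stable in these situations. This is what "faithful concept" refers to.

So what is a faithful concept? From the discussion above, a faithful concept must keep the concept vector stable when the input or the concept set itself is perturbed. This needs a proper definition. (The formal definitions were shown as screenshots from the original paper.)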

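Definition 1:

The top-k overlap $V_k(x,x^{\prime})$ : the degree of overlap between the top k concepts of two concept vectors after sorting by activation value in descending order.

This definition is introduced so that faithful concepts can later be compared with the original concepts.

(Note: $T_k$ is a set of concept indices, not the concepts themselves. So when the concepts are perturbed later on, the index set $T_k$ of a stable and faithful concept should not change much, even though the concept itself has changed.)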
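A small sketch of the overlap measure, assuming it is the size of the intersection of the two top-k index sets divided by k, i.e. $V_k=\frac{1}{k}|T_k\cap T_k^{\prime}|$. That normalization is my reading of the description above, and the function names are hypothetical.

```python
import numpy as np

def top_k_indices(c, k):
    """Index set T_k: positions of the k largest activations."""
    return set(np.argsort(-np.asarray(c))[:k].tolist())

def top_k_overlap(c1, c2, k):
    """Fraction of top-k indices shared by the two concept vectors."""
    return len(top_k_indices(c1, k) & top_k_indices(c2, k)) / k

c = np.array([0.9, 0.1, 0.8, 0.2, 0.7])
c_perturbed = np.array([0.7, 0.1, 0.9, 0.2, 0.8])  # values change, top-3 index set does not
c_shifted = np.array([0.1, 0.9, 0.8, 0.2, 0.7])    # index 0 drops out, index 1 enters

same = top_k_overlap(c, c_perturbed, 3)
partial = top_k_overlap(c, c_shifted, 3)
```

The first comparison illustrates the note above: the activations differ, but the index set is unchanged, so the overlap is still complete.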

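Definition 2:

Similarity of Explanation: the top-$k_1$ overlap between the faithful concept $\tilde{g}(x)$ and the original concept $g(x)$ is at least $\beta_1$ ; $\beta_1=1$ means the two share exactly the same top-$k_1$ concepts. This keeps the faithful concept as consistent as possible with the original concept on the top $k_1$ concepts.
Stability of Explanation: the top-$k_2$ overlap between the perturbed concept $\tilde{g}(x)+\rho$ and the unperturbed concept $\tilde{g}(x)$ is at least $\beta_2$ ; $\beta_2=1$ means the two top-$k_2$ sets are identical. This ensures the concept vector does not change much after perturbation (more precisely, the ranking of the perturbed concepts stays as close as possible to the original one).
Closeness of Prediction: predictions made from the faithful concept and from the original concept should agree as closely as possible. $D$ is some distance measure such as the KL divergence, and $\alpha_1=0$ means the two predictions are identical.
Stability of Prediction: the prediction should not change much after the faithful concept is perturbed by $\delta$ ; $\alpha_2=0$ means the two predictions are identical.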
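Written out as inequalities, the four conditions read roughly as follows. This is a reconstruction from the descriptions above, with the perturbation radii $R_1,R_2$ taken from the objective function further down; the exact quantifiers and norms should be checked against the original paper.

$\begin{aligned}&\text{(i) Similarity of explanation:} && V_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))\geq\beta_1,\\&\text{(ii) Stability of explanation:} && V_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\rho)\geq\beta_2\quad\text{for all }\|\rho\|\leq R_1,\\&\text{(iii) Closeness of prediction:} && D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))\leq\alpha_1,\\&\text{(iv) Stability of prediction:} && D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\delta))\leq\alpha_2\quad\text{for all }\|\delta\|\leq R_2.\end{aligned}$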

When all four conditions hold, we say:

For any given $x$ , $\tilde{c}=\tilde{\boldsymbol{g}}(x)$ is a $(D,R,\alpha,\beta,k_1,k_2)$ -Faithful Vision-Language Concept.

4. FVLC Framework

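The writing in this section of the paper is a bit messy, so just try to get the gist.

Sensitivity: beyond the similarity and stability discussed above, sensitivity means that the prediction should change noticeably when a key concept is excluded, while remaining stable under small perturbations of it.

Returning to Definition 2, let us summarize the ideal value of each parameter:

Top-k approach: $\beta_1$ should be as close to 1 as possible;

Stability: $R_1$ should be as large as possible, and $\beta_2$ as close to 1 as possible;

Prediction: $R_2$ should be as large as possible, and $\alpha_1,\alpha_2$ as close to 0 as possible;

Overall network architecture: see the figure in the original paper.

The overall procedure is essentially the same as Label-free CBM, except that four loss terms $\mathcal{L}_1,\mathcal{L}_2,\mathcal{L}_3,\mathcal{L}_4$ constrain the network to produce faithful concepts. The overall objective is: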

$\begin{aligned}&\min_{\tilde{W}_c}\mathbb{E}_x[\lambda_1D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))-\lambda_2V_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))+\lambda_3\max_{||\delta||\leq R_2}D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\delta))\\&{-\lambda_4\max_{||\rho||\leq R_1}V_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\rho)}],\end{aligned}$

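The four terms $\mathcal{L}_1,\mathcal{L}_2,\mathcal{L}_3,\mathcal{L}_4$ correspond to prediction closeness, concept similarity, prediction stability, and concept stability, respectively.

This optimization problem can be solved with PSGD, but because the top-k overlap function $V_k$ is non-differentiable, it has to be replaced by a surrogate loss.

Concretely, only the top k entries are optimized, and a simple $\ell_{1}\operatorname{-norm}$ pulls them as close together as possible; the exact formula is given in the original paper.

(However, moving from set intersection to the "pointwise matching" of the $\ell_{1}\operatorname{-norm}$ makes the loss differentiable but also constrains the ranking of the concepts. With the original intersection, it is enough for the concepts to appear in the top-k, and their order does not matter. For example, if the top-k concept indices in rank order are 1, 3, 5, 7 before the perturbation and 3, 1, 7, 5 after it, the intersection reports that the two "coincide completely", whereas the $\ell_{1}\operatorname{-norm}$ does not.)

The relaxed objective therefore becomes: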
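The example in the remark above can be checked numerically. The surrogate here is an assumed form, the mean absolute difference over the reference vector's top-k entries; the paper's exact definition may differ, and the function names are hypothetical.

```python
import numpy as np

def top_k_indices(c, k):
    """Indices of the k largest activations, in rank order."""
    return np.argsort(-c)[:k]

def top_k_overlap(c1, c2, k):
    """Set-based overlap: ignores the order inside the top-k."""
    s1 = set(top_k_indices(c1, k).tolist())
    s2 = set(top_k_indices(c2, k).tolist())
    return len(s1 & s2) / k

def l1_top_k_surrogate(c_new, c_ref, k):
    """Assumed surrogate: mean |c_new - c_ref| over the top-k entries of c_ref."""
    idx = top_k_indices(c_ref, k)
    return float(np.abs(c_new[idx] - c_ref[idx]).sum() / k)

# Before the perturbation the top-4 indices rank as 1, 3, 5, 7.
c_ref = np.array([0.0, 0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6])
# After the perturbation the same indices rank as 3, 1, 7, 5.
c_new = np.array([0.0, 0.8, 0.1, 0.9, 0.2, 0.6, 0.3, 0.7])

overlap = top_k_overlap(c_new, c_ref, 4)         # the index sets coincide
surrogate = l1_top_k_surrogate(c_new, c_ref, 4)  # but the values differ entrywise
```

The overlap is complete while the surrogate is strictly positive, so the relaxed loss penalizes a reordering that the original intersection-based measure would accept.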

$\begin{aligned}&\min_{\tilde{W}_c}\mathbb{E}_x\Big[\underbrace{D(y(x,\tilde{\boldsymbol{c}}),y(x,\boldsymbol{c}))}_{\mathcal{L}_1}+\lambda_1\underbrace{\mathcal{L}_{k_1}(\tilde{\boldsymbol{g}}(x),\boldsymbol{g}(x))}_{\mathcal{L}_2}+\lambda_2\underbrace{\max_{\|\delta\|\leq R_2}D(y(x,\tilde{\boldsymbol{c}}),y(x,\tilde{\boldsymbol{c}}+\boldsymbol{\delta}))}_{\mathcal{L}_3}\\&+\lambda_3\underbrace{\max_{\|\rho\|\leq R_1}\mathcal{L}_{k_2}(\tilde{\boldsymbol{g}}(x),\tilde{\boldsymbol{g}}(x)+\boldsymbol{\rho})}_{\mathcal{L}_4}\Big].\end{aligned}$
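To make the four terms concrete, here is a rough sketch that evaluates the relaxed objective for a single input. Several choices are assumptions for illustration only: linear maps `W_c` and `W_F` with a softmax output, KL divergence as $D$, the mean-absolute-difference surrogate for $\mathcal{L}_k$, and plain random search in place of a proper inner maximization. It only evaluates the loss and does not train $\tilde{W}_c$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl(p, q, eps=1e-12):
    """KL divergence, playing the role of the distance D."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def l1_top_k(c_new, c_ref, k):
    """Assumed surrogate L_k: mean |c_new - c_ref| over the top-k entries of c_ref."""
    idx = np.argsort(-c_ref)[:k]
    return float(np.abs(c_new[idx] - c_ref[idx]).mean())

def sample_in_ball(rng, dim, radius):
    """Random vector with L2 norm at most `radius` (not uniform in volume)."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v) * radius * rng.uniform()

def relaxed_objective(x, W_c_tilde, W_c, W_F, k1, k2, R1, R2, lams, rng, n_samples=32):
    c, c_t = W_c @ x, W_c_tilde @ x                  # original and faithful concepts
    y, y_t = softmax(W_F @ c), softmax(W_F @ c_t)    # their predictions
    L1 = kl(y_t, y)                                  # prediction closeness
    L2 = l1_top_k(c_t, c, k1)                        # concept similarity
    # Inner maxima approximated by random search over the perturbation balls.
    L3 = max(kl(y_t, softmax(W_F @ (c_t + sample_in_ball(rng, c.size, R2))))
             for _ in range(n_samples))              # prediction stability
    L4 = max(l1_top_k(c_t, c_t + sample_in_ball(rng, c.size, R1), k2)
             for _ in range(n_samples))              # concept stability
    lam1, lam2, lam3 = lams
    return L1 + lam1 * L2 + lam2 * L3 + lam3 * L4

rng = np.random.default_rng(0)
d, k, d_z = 8, 6, 3
W_c, W_F = rng.normal(size=(k, d)), rng.normal(size=(d_z, k))
W_c_tilde = W_c + 0.1 * rng.normal(size=(k, d))      # hypothetical perturbed copy
x = rng.normal(size=d)

args = dict(k1=3, k2=3, R1=0.1, R2=0.1, lams=(1.0, 1.0, 1.0), rng=rng)
value_same = relaxed_objective(x, W_c, W_c, W_F, **args)        # L1 = L2 = 0 here
value_other = relaxed_objective(x, W_c_tilde, W_c, W_F, **args)
```

With $\tilde{W}_c=W_c$ the first two terms vanish and only the two stability terms remain, which shows how the objective trades closeness to the original model against robustness to perturbation. A real implementation would replace the random search with a gradient-based inner maximization.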