[ICLR 2025] Biologically Plausible Brain Graph Transformer

news2025/4/8 21:59:27

Paper link: Biologically Plausible Brain Graph Transformer

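The English below was typed entirely by hand! It is a summary and paraphrase of the original paper. Spelling and grammar mistakes may be unavoidable; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.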

1. Takeaways

2. Paragraph-by-Paragraph Close Reading

2.1. Abstract

2.2. Introduction

2.3. Preliminaries

2.3.1. Problem Definition

2.3.2. Graph Transformers

2.4. Biologically Plausible Brain Graph Transformer

2.4.1. Network Entanglement-based Node Importance Encoding

2.4.2. Functional Module-Aware Self-Attention

2.5. Experiments

2.5.1. Experimental Setup

2.5.2. Results

2.5.3. Ablation Studies

2.5.4. Comparative Analysis of Node Importance Measurement

2.5.5. Biological Plausibility Analysis

2.6. Related Work

2.6.1. Brain Graph Analysis

2.6.2. Graph Transformers

2.7. Conclusion

1. Takeaways

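（1）Read with caution: the paper feels very physics-heavy and needs some background, e.g. quantum entanglement. I do not really understand entanglement.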

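（2）(Off topic) I am going to diss every idiot reviewer. My paper's results are almost identical to this ICLR 2025 paper (same baseline models, same datasets, same numbers), yet reviewers told me "your model only reaches 70-something on binary classification, that is terrible" / "almost no improvement" / "mediocre performance". As for the baselines I reproduced, this paper shows that BrainGNN on ABIDE really does score in the fifties, and so does GAT. Reviewers accused me of "deliberately lowering the baselines" / "greatly inconsistent with the original papers, suspected academic fraud~" / "doubting the authenticity of the results". Care to get lost first, dear reviewers? And no reply after reading my rebuttal; I guess the dear reviewers' own papers all got rejected, right~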

（3）Table first. Can someone please tell the whole world that this is the sorry state of the ABIDE dataset in 2025:

（4）There is some drama at the end of this post. BNT gets blasted through no fault of its own.

2. Paragraph-by-Paragraph Close Reading

2.1. Abstract

①Existing works fail to represent the brain's organizational framework (e.g., hubs and functional modules)

2.2. Introduction

①(a) Hub and functional modules, and (b) functional connectivity (FC) in different brain regions of ADHD:

2.3. Preliminaries

2.3.1. Problem Definition

①Graph: $G=(V,E,\mathbf{X})$, with node set $V$, edge set $E$, feature matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$, and $n$ ROIs (nodes)

2.3.2. Graph Transformers

①Transformer block: an attention module and a feed-forward network (FFN)

②Attention mechanism with $\mathbf{W}_{Q}\in\mathbb{R}^{d\times d_{\mathcal{K}}},\mathbf{W}_{K}\in\mathbb{R}^{d\times d_{\mathcal{K}}},\mathbf{W}_{V}\in\mathbb{R}^{d\times d_{\mathcal{K}}}$ :

$\mathbf{Q}=\mathbf{X}\mathbf{W}_Q,\quad\mathbf{K}=\mathbf{X}\mathbf{W}_K,\quad\mathbf{V}=\mathbf{X}\mathbf{W}_V$

$\mathbf{A}=\frac{\mathbf{QK}^{\mathsf{T}}}{\sqrt{d_{\mathcal{K}}}},\quad Attn(\mathbf{X})=softmax(\mathbf{A})\mathbf{V}$

where $\mathbf{A}$ denotes similarity between queries and keys

③Output of attention blocks:

$\tilde{\mathbf{X}}=\mathbf{X}+Attn(\mathbf{X}),\quad\hat{\mathbf{X}}=\mathbf{W}_{2}ReLU(\mathbf{W}_{1}\tilde{\mathbf{X}})$
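The attention and output formulas above can be sketched in NumPy as follows. This is only a minimal illustration under a few assumptions of mine: rows of $\mathbf{X}$ are nodes, so the weights are applied by right-multiplication (`X @ W`) rather than the left-multiplication written in the formula; $d_{\mathcal{K}}=d$ so that the residual sum $\mathbf{X}+Attn(\mathbf{X})$ is well-shaped; and the toy sizes and random weights are arbitrary.

```python
import numpy as np

def softmax_rows(A):
    """Numerically stable row-wise softmax."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def attn(X, W_Q, W_K, W_V):
    """Attn(X) = softmax(Q K^T / sqrt(d_K)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = Q @ K.T / np.sqrt(W_K.shape[1])   # (n, n) query-key similarity
    return softmax_rows(A) @ V

def transformer_block(X, W_Q, W_K, W_V, W_1, W_2):
    """Residual attention followed by a two-layer ReLU feed-forward network."""
    X_tilde = X + attn(X, W_Q, W_K, W_V)          # requires d_K == d
    return np.maximum(X_tilde @ W_1, 0.0) @ W_2   # FFN with ReLU

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 16                     # toy sizes (assumed)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
W_1 = rng.normal(size=(d, d_ff))
W_2 = rng.normal(size=(d_ff, d))
X_hat = transformer_block(X, W_Q, W_K, W_V, W_1, W_2)
print(X_hat.shape)
```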

2.4. Biologically Plausible Brain Graph Transformer

①Rewrite $\tilde{\mathbf{X}}=\mathbf{X}+Attn(\mathbf{X})$ to:

$\tilde{\mathbf{x}}_i=\Phi(\mathbf{x}_i)+\mathrm{FM-}Attn(i)$

(the mark after FM is a hyphen, not a minus sign) where $\Phi(\cdot)$ denotes a network entanglement-based node importance encoding method

②Overall framework of BioBGT:

2.4.1. Network Entanglement-based Node Importance Encoding

①Normalized information diffusion propagator:

$\rho_{G}=\frac{e^{-\gamma\mathbf{L}}}{Z}$

where $e^{-\gamma\mathbf{L}}$ denotes the information diffusion propagator, $\gamma$ is a positive parameter, $\mathbf{L}$ is the Laplacian matrix, and $Z=\mathrm{Tr}(e^{-\gamma\mathbf{L}})$ is the partition function
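As a rough numerical illustration of the propagator, the sketch below builds $\rho_G$ for a toy path graph. It assumes the combinatorial Laplacian $\mathbf{L}=\mathbf{D}-\mathbf{A}$ (the notes only say "Laplacian matrix"), a symmetric adjacency matrix, and an arbitrary $\gamma=0.5$; the matrix exponential is computed through the eigendecomposition of $\mathbf{L}$.

```python
import numpy as np

def density_matrix(adj, gamma=1.0):
    """Return rho_G = exp(-gamma L) / Z and the partition function Z."""
    L = np.diag(adj.sum(axis=1)) - adj        # assumed: combinatorial Laplacian D - A
    eigvals, eigvecs = np.linalg.eigh(L)      # L is symmetric, so eigh applies
    weights = np.exp(-gamma * eigvals)        # eigenvalues of exp(-gamma L)
    Z = weights.sum()                         # Z = Tr(exp(-gamma L))
    rho = (eigvecs * (weights / Z)) @ eigvecs.T
    return rho, Z

# Toy 4-node path graph 0-1-2-3 (illustrative only)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
rho, Z = density_matrix(adj, gamma=0.5)
print(np.trace(rho))
```

By construction $\rho_G$ is symmetric, positive definite, and has unit trace, which is what allows it to be treated like a density matrix in the entropy that follows.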

②von Neumann entropy, to capture global topology and information diffusion process of graphs:

$\mathcal{S}(G)=-Tr(\rho_G\log_2\rho_G)$

where $\mathcal{S}(G)$ is the density matrix-based spectral entropy and $Tr(\cdot)$ is the trace operation, here applied to the product of the density matrix $\rho_G$ and its base-2 logarithm
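Because $\rho_G$ and $\log_2\rho_G$ share eigenvectors, the trace reduces to a Shannon-style sum over the eigenvalues of $\rho_G$. The sketch below uses that shortcut; it again assumes $\mathbf{L}=\mathbf{D}-\mathbf{A}$ and an arbitrary $\gamma$, and the function name is my own.

```python
import numpy as np

def von_neumann_entropy(adj, gamma=1.0):
    """S(G) = -Tr(rho log2 rho), computed from the Laplacian spectrum."""
    L = np.diag(adj.sum(axis=1)) - adj    # assumed: combinatorial Laplacian
    lam = np.linalg.eigvalsh(L)
    p = np.exp(-gamma * lam)
    p /= p.sum()                          # eigenvalues of rho_G, summing to 1
    p = p[p > 0]                          # guard against log2(0) after underflow
    return float(-(p * np.log2(p)).sum())

path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]], dtype=float)
empty = np.zeros((4, 4))
print(von_neumann_entropy(path, gamma=0.5), von_neumann_entropy(empty, gamma=0.5))
```

For a graph with no edges, $\mathbf{L}=\mathbf{0}$, so $\rho_G$ is uniform and the entropy reaches its maximum $\log_2 n$; adding edges lowers it.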

③Node importance (node entanglement value (NE value)):

$\mathcal{NE}(i)=\|\mathcal{S}(G_{i})-\mathcal{S}(G)\|$

where $G_i$ is the $i$ -control graph obtained after the perturbation of node $i$
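A brute-force sketch of the NE value follows. The notes do not say how the $i$-control graph is built, so the perturbation here is a hypothetical stand-in: all edges incident to node $i$ are removed while the node itself is kept. The Laplacian choice ($\mathbf{D}-\mathbf{A}$), $\gamma=0.5$, and the toy star graph are also my assumptions.

```python
import numpy as np

def spectral_entropy(adj, gamma=1.0):
    """von Neumann entropy of rho_G = exp(-gamma L) / Z."""
    L = np.diag(adj.sum(axis=1)) - adj
    p = np.exp(-gamma * np.linalg.eigvalsh(L))
    p /= p.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def node_entanglement(adj, i, gamma=1.0):
    """NE(i) = |S(G_i) - S(G)| under an ASSUMED perturbation (isolate node i)."""
    perturbed = adj.copy()
    perturbed[i, :] = 0.0     # hypothetical perturbation: drop all edges of node i
    perturbed[:, i] = 0.0
    return abs(spectral_entropy(perturbed, gamma) - spectral_entropy(adj, gamma))

# Toy star graph: node 0 is the hub, nodes 1..4 are leaves
star = np.zeros((5, 5))
star[0, 1:] = 1.0
star[1:, 0] = 1.0
ne = [node_entanglement(star, i, gamma=0.5) for i in range(5)]
print(ne)
```

In this toy case the hub gets a clearly larger value than any leaf, which matches the intuition that NE is meant to measure node importance. Each call needs an eigendecomposition, which is presumably why an approximation is used in practice.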

④To approximate NE value:

$\mathcal{NE}(i)\approx\left\|\frac{2m\gamma n^2}{\ln2(n-\alpha)^2}\frac{\Delta Z}{ZZ_i}+\log_2\left(\frac{Z_i}{Z}\right)\right\|$

where $n$ and $m$ are the numbers of nodes and edges respectively, $Z_i$ is the partition function of $G_i$, and $\Delta Z=Z_{i}-Z$

⑤Node representation:

$\mathbf{x^{\prime}}_{i}=\Phi(\mathbf{x}_{i})=\mathbf{x}_{i}+\mathbf{x}_{\mathcal{NE}(i)}$

where $\mathbf{x}_{\mathcal{NE}(i)}$ denotes a learnable embedding vector specified by $\mathcal{NE}(i)$
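One way to picture this encoding is an embedding-table lookup added to the node features. The notes do not say how a continuous NE value selects an embedding, so the equal-width binning below is purely a hypothetical choice of mine, as are the function name, the number of bins, and the random table (which would be a trainable parameter in a real model).

```python
import numpy as np

def encode_node_importance(X, ne_values, embedding_table):
    """x'_i = x_i + table[bin(NE(i))], with an ASSUMED equal-width binning."""
    num_bins = embedding_table.shape[0]
    # Interior bin edges over the observed NE range (hypothetical discretisation)
    edges = np.linspace(ne_values.min(), ne_values.max(), num_bins + 1)[1:-1]
    bin_ids = np.digitize(ne_values, edges)       # integers in [0, num_bins - 1]
    return X + embedding_table[bin_ids], bin_ids

rng = np.random.default_rng(0)
n, d, num_bins = 5, 8, 4
X = rng.normal(size=(n, d))
table = rng.normal(size=(num_bins, d))            # stands in for learnable embeddings
ne_values = np.array([0.23, 0.03, 0.03, 0.03, 0.03])   # made-up NE values
X_prime, bin_ids = encode_node_importance(X, ne_values, table)
print(bin_ids)
```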

2.4.2. Functional Module-Aware Self-Attention

（1）Community Contrastive Strategy-based Functional Module Extractor

①Updating $\mathbf{x^{\prime}}_{i}$ by $\mathbf{h}_i:=\psi(i,\mathcal{M}_i)$, where $\psi(\cdot)$ denotes the functional module extractor and $\mathcal{M}_i$ is the functional module that node $i$ belongs to

②Augment graph $G$ to $G^1$ and $G^2$ by edge drop
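Edge dropping can be sketched as below. The drop probability (0.2), the independent per-edge sampling, and the function name are my assumptions; the notes only state that the two views come from edge drop. Edges are sampled on the upper triangle so that the augmented graph stays undirected.

```python
import numpy as np

def drop_edges(adj, drop_prob, rng):
    """Remove each undirected edge independently with probability drop_prob."""
    upper = np.triu(adj, k=1)                     # one entry per undirected edge
    keep = rng.random(adj.shape) >= drop_prob     # True where the edge survives
    upper = upper * keep
    return upper + upper.T                        # symmetrise again

rng = np.random.default_rng(0)
adj = np.ones((6, 6)) - np.eye(6)                 # toy fully connected graph
G1 = drop_edges(adj, 0.2, rng)                    # first augmented view
G2 = drop_edges(adj, 0.2, rng)                    # second augmented view
print(int(G1.sum() // 2), int(G2.sum() // 2))     # remaining edge counts
```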

③Employing contrastive learning by regarding nodes in the same functional module as positive samples and nodes in different functional modules as negative samples. They use the InfoNCE loss:

$\mathcal{L}=-\frac{1}{n}\sum_{i=1}^{n}\log\frac{exp(Sim(\mathbf{h}_{i}^{1},\mathbf{h}_{i}^{pos}))}{\sum_{j=1}^{n^{\mathrm{Neg}}}exp(Sim(\mathbf{h}_{i}^{1},\mathbf{h}_{j}^{1}))+\sum_{j=1}^{n^{\mathrm{Neg}}}exp(Sim(\mathbf{h}_{i}^{1},\mathbf{h}_{j}^{2}))}$

where node features are represented as $\mathbf{h}_i^j$ in graph $G^j$
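The loss above can be written out literally as follows. Several details are not given in the notes and are assumed here: $Sim$ is taken to be cosine similarity without a temperature, the positive $\mathbf{h}_i^{pos}$ is taken to be the centroid of same-module nodes in the second view, and every node is assumed to have at least one negative. As in the formula, the denominator sums over negatives only.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def module_contrastive_loss(H1, H2, modules):
    """Module-aware InfoNCE-style loss; H1, H2 are node features of the two views."""
    modules = np.asarray(modules)
    s11 = np.exp(cosine_sim(H1, H1))              # anchor vs. view-1 nodes
    s12 = np.exp(cosine_sim(H1, H2))              # anchor vs. view-2 nodes
    losses = []
    for i in range(H1.shape[0]):
        same = modules == modules[i]
        neg = ~same                               # nodes in other modules
        pos_vec = H2[same].mean(axis=0)           # ASSUMED positive: module centroid
        pos = np.exp(cosine_sim(H1[i:i + 1], pos_vec[None, :]))[0, 0]
        denom = s11[i, neg].sum() + s12[i, neg].sum()
        losses.append(-np.log(pos / denom))
    return float(np.mean(losses))

H = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good = module_contrastive_loss(H, H, [0, 0, 1, 1])   # labels match the clusters
bad = module_contrastive_loss(H, H, [0, 1, 0, 1])    # labels cut across clusters
print(good, bad)
```

In this toy example the loss is lower when the module labels agree with the clusters in feature space, which is the behaviour the contrastive objective is meant to reward.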

（2）Updated Self-Attention Mechanism

①Attention module with exponential kernels:

$\mathrm{FM-}Attn(i)=\sum_{j\in V}\frac{exp\left(\langle\mathbf{W}_{Q}\mathbf{h}_{i},\mathbf{W}_{K}\mathbf{h}_{j}\rangle/\sqrt{d_{\mathcal{K}}}\right)}{\sum_{u\in V}exp\left(\langle\mathbf{W}_{Q}\mathbf{h}_{i},\mathbf{W}_{K}\mathbf{h}_{u}\rangle/\sqrt{d_{\mathcal{K}}}\right)}f(\mathbf{h}_{j})$

where $exp\left(\langle\mathbf{W}_{Q}\mathbf{h}_{a},\mathbf{W}_{K}\mathbf{h}_{b}\rangle/\sqrt{d_{\mathcal{K}}}\right)$ denotes non-negative kernel, $\langle\cdot,\cdot\rangle$ is dot product, $f\left ( \cdot \right )$ is linear value function
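To make the kernel form concrete, the sketch below computes $\mathrm{FM-}Attn(i)$ node by node exactly as the sum is written, then checks it against the usual vectorised softmax attention. The random `H` stands in for the module-aware representations $\mathbf{h}_i$; the shapes of `W_Q`, `W_K`, the choice $f(\mathbf{h})=\mathbf{W}_V\mathbf{h}$, and all sizes are assumptions for illustration.

```python
import numpy as np

def fm_attn_node(i, H, W_Q, W_K, W_V):
    """FM-Attn(i) as an explicit kernel-weighted sum over all nodes j."""
    d_k = W_K.shape[0]
    q = W_Q @ H[i]
    scores = np.array([q @ (W_K @ H[j]) for j in range(H.shape[0])]) / np.sqrt(d_k)
    kernel = np.exp(scores - scores.max())   # the shift cancels in the ratio
    weights = kernel / kernel.sum()          # normalised non-negative kernel
    values = H @ W_V.T                       # f(h_j) = W_V h_j (assumed linear map)
    return weights @ values

def fm_attn_all(H, W_Q, W_K, W_V):
    """Vectorised equivalent: row-wise softmax attention on H."""
    S = (H @ W_Q.T) @ (H @ W_K.T).T / np.sqrt(W_K.shape[0])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ (H @ W_V.T)

rng = np.random.default_rng(0)
n, d, d_k = 6, 8, 4
H = rng.normal(size=(n, d))                  # placeholder for psi(i, M_i) outputs
W_Q, W_K = rng.normal(size=(d_k, d)), rng.normal(size=(d_k, d))
W_V = rng.normal(size=(d, d))
per_node = np.stack([fm_attn_node(i, H, W_Q, W_K, W_V) for i in range(n)])
print(np.allclose(per_node, fm_attn_all(H, W_Q, W_K, W_V)))
```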

②Functional module-aware self-attention bound:

$\|\mathrm{FM-}Attn(a)-\mathrm{FM-}Attn(b)\|\leq C_{\mathcal{M}}\|\mathbf{h}_a-\mathbf{h}_b\|$

where $\mathbf{h}_a:=\psi(a,\mathcal{M}_a)$ and $\mathbf{h}_{b}:=\psi(b,\mathcal{M}_{b})$ are representations of nodes $a$ and $b$ after the functional module extractor