[Paper Close Reading] Do Transformers Really Perform Bad for Graph Representation?

Paper URL: [2106.05234] Do Transformers Really Perform Bad for Graph Representation? (arxiv.org)


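The English below is typed entirely by hand! It summarizes and paraphrases the original paper. Spelling and grammar mistakes are hard to avoid, so corrections in the comments are welcome! This post leans toward personal notes, so read with caution!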

1. TL;DR

1.1. Takeaways

1.2. Paper Summary Figure

2. Section-by-Section Close Reading

2.1. Abstract

        ①Transformers have not achieved competitive performance on graph-level prediction compared with mainstream GNN variants

        ②The authors put forward Graphormer to change this situation

leaderboard  n. a ranked list of top performers; (in advertising) a banner ad

2.2. Introduction

        ①Graphormer performs outstandingly on the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) and on several popular leaderboards such as OGB and Benchmarking-GNN

        ②The Transformer's self-attention only takes the similarity between nodes into consideration and does not consider the structural relationship between them. Ergo, Graphormer adds structural encodings

        ③They capture node importance with Centrality Encoding, which is based on degree centrality, and represent the structural relationship between nodes with Spatial Encoding

        ④Graphormer occupies the top spot on the OGB-LSC, MolHIV, MolPCBA, ZINC and other rankings

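de-facto  existing in fact: holding a certain status or power in practice rather than in law or officially

canonical  adj. according to canon law or religious rules; belonging to the scriptural canon; standard, exemplary; accurate, authoritative; generally accepted, following scientific rules; (of a mathematical expression) in its simplest form; relating to an axiom (or standard formula); relating to the church (or clergy)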

2.3. Preliminary

(1)Graph Neural Network (GNN)

        ①A graph is denoted as G=\left ( V,E \right ), where V=\{v_{1},v_{2},\cdots,v_{n}\} is the node set and n=\left | V \right | is the number of nodes. The feature vector of v_i is x_i, and the node representation of v_i at the l-th layer is h_i^{\left ( l \right )}, with h_i^{\left ( 0 \right )}=x_i

        ②A typical GNN is represented as:

a_i^{(l)}=\text{AGGREGATE}^{(l)}\left(\left\{h_j^{(l-1)}:j\in\mathcal{N}(v_i)\right\}\right)\\\quad h_i^{(l)}=\text{COMBINE}^{(l)}\left(h_i^{(l-1)},a_i^{(l)}\right)\\h_G=\text{READOUT}\left(\left\{h_i^{(L)}\mid v_i\in G\right\}\right)

where \mathcal{N}(v_i) denotes the set of first-order or higher-order neighbors of v_i
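To make the AGGREGATE / COMBINE / READOUT pattern concrete, here is a minimal sketch in plain Python. The specific choices (mean aggregation over first-order neighbors, element-wise sum for COMBINE, sum pooling for READOUT) and the names `gnn_layer` and `readout` are assumptions for illustration only; the notes do not fix any particular functions.

```python
def gnn_layer(h, neighbors):
    """One AGGREGATE + COMBINE step over all nodes.

    h: list of feature vectors (one list of floats per node).
    neighbors: adjacency list, neighbors[i] = indices j in N(v_i).
    """
    new_h = []
    for i, h_i in enumerate(h):
        dim = len(h_i)
        # AGGREGATE (assumed here: mean over neighbor representations).
        a_i = [0.0] * dim
        for j in neighbors[i]:
            for k in range(dim):
                a_i[k] += h[j][k]
        if neighbors[i]:
            a_i = [v / len(neighbors[i]) for v in a_i]
        # COMBINE (assumed here: element-wise sum of h_i and a_i).
        new_h.append([h_i[k] + a_i[k] for k in range(dim)])
    return new_h


def readout(h):
    """READOUT (assumed here: sum pooling over all node representations)."""
    dim = len(h[0])
    return [sum(h_i[k] for h_i in h) for k in range(dim)]


# Path graph 0 - 1 - 2 with 1-dimensional node features.
x = [[1.0], [2.0], [3.0]]
neighbors = [[1], [0, 2], [1]]
h1 = gnn_layer(x, neighbors)   # node representations after one layer
h_G = readout(h1)              # graph-level representation
```

Stacking several such layers lets each node see neighbors that are more hops away, which is why the neighbor set can be first-order or higher-order.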


(2)Transformer

        ①Each layer in Transformer contains a self-attention module and a position-wise feed-forward network (FFN)

        ②The input of self-attention module is {H}=\left[h_{1}^{\top},\cdots,h_{n}^{\top}\right]^{\top}\in\mathbb{R}^{n\times d}, where d represents the hidden dimension, h_{i}\in\mathbb{R}^{1\times d} denotes the hidden representation at position i

        ③The function of attention mechanism:

\begin{aligned}Q&=HW_Q,W_{Q}\in\mathbb{R}^{d\times d_{K}} \quad K=HW_K,W_{K}\in\mathbb{R}^{d\times d_{K}} \quad V=HW_V,W_{V}\in\mathbb{R}^{d\times d_{V}} \\A&=\frac{QK^\top}{\sqrt{d_K}},\quad\mathrm{Attn}\left(H\right)=\mathrm{softmax}\left(A\right)V\end{aligned}

where A is a similarity matrix of queries and keys

        ④For simplicity of illustration, they consider single-head self-attention and assume d_K=d_V=d. The extension to multi-head attention is standard, and bias terms are omitted
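The attention formula above can be sketched in a few lines of plain Python. This is only an illustrative single-head version without bias terms; the helper names (`matmul`, `transpose`, `softmax_rows`, `self_attention`) and the toy identity weight matrices are assumptions, not anything taken from the paper's code.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(H, W_Q, W_K, W_V):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_K)) V."""
    Q, K, V = matmul(H, W_Q), matmul(H, W_K), matmul(H, W_V)
    d_K = len(W_K[0])
    # A is the similarity matrix of queries and keys.
    A = [[s / math.sqrt(d_K) for s in row] for row in matmul(Q, transpose(K))]
    return matmul(softmax_rows(A), V)


# Three positions with hidden dimension d = 2 and toy identity projections.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out = self_attention(H, I2, I2, I2)
```

Each output row is a weighted average of the rows of V, with weights determined only by query-key similarity. Nothing in this computation knows about graph structure, which is the gap the structural encodings in the next section are meant to fill.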

2.4. Graphormer

2.4.1. Structural Encodings in Graphormer

        The overall framework of Graphormer contains three modules:

(1)Centrality Encoding

        ①For a directed graph, the centrality encoding is added to the input node features:

h_{i}^{(0)}=x_{i}+z_{\deg^{-}(v_{i})}^{-}+z_{\deg^{+}(v_{i})}^{+}

where z_{\deg^{-}(v_{i})}^{-}\in \mathbb{R}^d is the learnable embedding vector indexed by the indegree \deg^{-}(v_{i}), and z_{\deg^{+}(v_{i})}^{+}\in \mathbb{R}^d is the learnable embedding vector indexed by the outdegree \deg^{+}(v_{i}) (uh, I can't quite picture what z actually looks like yet)

        ②For an undirected graph, a single \deg(v_{i}) replaces both \deg^{+}(v_{i}) and \deg^{-}(v_{i})
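One way to picture z is as a lookup table with one d-dimensional row per possible degree value; the row selected by a node's degree is simply added to that node's input feature. The sketch below illustrates this for a directed graph. The function name `centrality_encoding` and the fixed table values are hypothetical; in a real model the tables would be learnable parameters.

```python
def centrality_encoding(x, edges, z_in, z_out):
    """Return h^(0) with h_i = x_i + z_in[indeg(v_i)] + z_out[outdeg(v_i)].

    x: node features, one list of floats per node.
    edges: directed edges as (src, dst) pairs.
    z_in, z_out: embedding tables indexed by degree (learnable in practice).
    """
    n = len(x)
    indeg, outdeg = [0] * n, [0] * n
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    return [
        [x[i][k] + z_in[indeg[i]][k] + z_out[outdeg[i]][k]
         for k in range(len(x[i]))]
        for i in range(n)
    ]


# Hypothetical embedding tables for degrees 0, 1, 2 with d = 2.
z_in = [[0.0, 0.0], [0.5, 0.5], [0.25, 0.25]]
z_out = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
x = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
edges = [(0, 1), (0, 2), (1, 2)]
h0 = centrality_encoding(x, edges, z_in, z_out)
```

All three nodes start with identical features, yet their encoded inputs differ because their degrees differ. For an undirected graph, the same idea applies with a single table indexed by \deg(v_{i}).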

(2)Spatial Encoding

        ①Unlike a sequence, a graph has no canonical ordering of its nodes. To this end, they provide a new spatial encoding to represent the spatial relation between v_i and v_j:

{\phi\left(v_{i},v_{j}\right):V\times V\rightarrow\mathbb{R}}

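where they choose \phi\left(v_{i},v_{j}\right) to be the shortest path distance (SPD) between v_i and v_j. If the two nodes are not connected, the output is set to a special value, i.e., -1.

        ②⭐A learnable scalar is assigned to each feasible output value and serves as a bias term in the self-attention module (my grasp of attention is still too weak)

        ③The (i,j)-th element of the Query-Key product matrix A can be calculated by:

A_{ij}=\frac{(h_{i}W_{Q})(h_{j}W_{K})^{\top}}{\sqrt{d}}+b_{\phi(v_{i},v_{j})}

where b_{\phi(v_i,v_j)} is a learnable scalar indexed by \phi(v_i,v_j) and shared across all layers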
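The two steps of the spatial encoding (compute the SPD for every node pair, then look up a scalar bias by that distance and add it to the attention logit) can be sketched as follows. This assumes an undirected, unweighted graph and uses breadth-first search; the function names and the bias values in `b` are hypothetical, and in a real model `b` would be learned.

```python
from collections import deque


def shortest_path_distances(n, edges):
    """All-pairs SPD on an undirected, unweighted graph via BFS; -1 if unreachable."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    spd = [[-1] * n for _ in range(n)]
    for s in range(n):
        spd[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if spd[s][v] == -1:          # not visited yet from source s
                    spd[s][v] = spd[s][u] + 1
                    queue.append(v)
    return spd


def add_spatial_bias(A, spd, b):
    """Add the scalar b[phi(v_i, v_j)] to every attention logit A[i][j].

    b maps each SPD value (including -1) to a scalar; learnable in practice.
    """
    n = len(A)
    return [[A[i][j] + b[spd[i][j]] for j in range(n)] for i in range(n)]


# Nodes 0 - 1 - 2 form a path; node 3 is isolated.
spd = shortest_path_distances(4, [(0, 1), (1, 2)])
b = {-1: -5.0, 0: 0.0, 1: 1.0, 2: 0.5}    # hypothetical bias values
A = [[0.0] * 4 for _ in range(4)]          # placeholder for Q K^T / sqrt(d)
A_biased = add_spatial_bias(A, spd, b)
```

Because the bias is added before the softmax, a strongly negative value for a given distance makes a node pay less attention to nodes at that distance, while every node can still attend to every other node.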


(3)Edge Encoding in the Attention

        ①Previous works use two traditional edge encoding methods: adding edge features to the features of the associated nodes, or using edge features together with node features in the aggregation. However, these approaches are limited in that they only propagate edge information to the associated nodes, so they express adjacency rather than global relationships.

        ②For each node pair (v_i,v_j), they find (one of) the shortest paths from v_i to v_j: SP_{ij}=\left ( e_1,e_2,...,e_N \right )

        ③They average, along the path, the dot-products of the feature x_{e_{n}} of the n-th edge e_n and the n-th learnable weight embedding w_{n}^{E}\in\mathbb{R}^{d_{E}}, and add the result c_{ij} to the attention as another bias term:

A_{ij}=\frac{(h_{i}W_{Q})(h_{j}W_{K})^{\top}}{\sqrt{d}}+b_{\phi(v_{i},v_{j})}+c_{ij},\quad\text{where}\quad c_{ij}=\frac{1}{N}\sum_{n=1}^{N}x_{e_{n}}\left(w_{n}^{E}\right)^{\top}

where d_E denotes the dimensionality of the edge feature
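The averaging of per-edge dot-products along the shortest path can be sketched like this. The function name `edge_encoding`, the toy feature values, and the handling of an empty path (returning 0.0) are assumptions for illustration; the notes do not say what happens when there is no path edge.

```python
def edge_encoding(path_edge_features, w_E):
    """Mean over the shortest path of dot(x_{e_n}, w_n^E).

    path_edge_features: features of edges e_1..e_N on the path, in order.
    w_E: one weight vector per path position n (learnable in practice).
    """
    N = len(path_edge_features)
    if N == 0:
        # Assumption: a pair with no path edges contributes no edge bias.
        return 0.0
    total = 0.0
    for n, x_e in enumerate(path_edge_features):
        # Dot-product of the n-th edge feature with the n-th weight vector.
        total += sum(x * w for x, w in zip(x_e, w_E[n]))
    return total / N


# A shortest path with N = 2 edges and edge feature dimension d_E = 2.
path = [[1.0, 0.0], [0.0, 2.0]]
w_E = [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]]   # hypothetical weights per position
c_ij = edge_encoding(path, w_E)              # (0.5 + 2.0) / 2
```

Note that the weight vector depends on the position n along the path rather than on the edge itself, so the same edge feature can contribute differently depending on how far along the path it sits.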

2.4.2. Implementation Details of Graphormer

(1)Graphormer Layer

(2)Special Node

2.4.3. How Powerful is Graphormer?

(1)Fact 1

(2)Fact 2

2.5. Experiments

2.5.1. OGB Large-Scale Challenge




2.5.2. Graph Representation




2.5.3. Ablation Studies

(1)Node Relation Encoding

(2)Centrality Encoding

(3)Edge Encoding

2.6. Related Work

2.6.1. Graph Transformer

2.6.2. Structural Encodings in GNNs

(1)Path and Distance in GNNs

(2)Positional Encoding in Transformer on Graph

(3)Edge Feature

2.7. Conclusion

2.8. Proofs

2.8.1. SPD can Be Used to Improve WL-Test

2.8.2. Proof of Fact 1

2.8.3. Proof of Fact 2

2.9. Experiment Details

2.9.1. Details of Datasets

2.9.2. Details of Training Strategies





2.9.3. Details of Hyper-parameters for Baseline Methods




2.10. More Experiments

2.11. Discussion & Future Work

3. Supplementary Knowledge

3.1. Permutation invariant

Reference 1: [Neural Networks] Deep Sets: a general network architecture satisfying permutation invariance and permutation equivariance - Zhihu (zhihu.com)

4. Reference List

Ying C. et al. (2021) 'Do Transformers Really Perform Bad for Graph Representation?', NeurIPS 2021. doi: https://doi.org/10.48550/arXiv.2106.05234





