Paper link: [2106.05234] Do Transformers Really Perform Bad for Graph Representation? (arxiv.org)
Paper code: https://github.com/Microsoft/Graphormer
The English is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is closer to reading notes, so take it with a grain of salt!
1. TL;DR
1.1. Takeaways
1.2. Paper Summary Figure
2. Paragraph-by-Paragraph Close Reading
2.1. Abstract
①The Transformer has not achieved ideal performance on graph representation learning compared with mainstream GNN variants
②The authors put forward Graphormer to change this situation
leaderboard n. ranking board; banner advertisement
2.2. Introduction
①Graphormer performs outstandingly on the Open Graph Benchmark Large-Scale Challenge (OGB-LSC) and on several popular leaderboards such as OGB and Benchmarking-GNN
②The Transformer only takes node similarity into consideration and does not model structural relationships. Hence, Graphormer adds structural encodings
③They capture node importance with Centrality Encoding (based on degree centrality) and represent structural relationships with Spatial Encoding
④Graphormer occupies the top spot on the OGB-LSC, MolHIV, MolPCBA, ZINC and other leaderboards
de facto: in practice; holding a status or power in fact rather than legally or formally
canonical adj. according to canon law; standard, exemplary; accurate, authoritative; generally accepted, based on scientific principles; (of a mathematical expression) in its simplest form; relating to a canon or standard formula; relating to the church or clergy
2.3. Preliminary
(1)Graph Neural Network (GNN)
①Present a graph as $G=(V,E)$, where $V=\{v_1,v_2,\dots,v_n\}$ denotes the node set and $n=|V|$ denotes the number of nodes. Define the feature vector of $v_i$ as $x_i$, and the node representation of $v_i$ at the $l$-th layer as $h_i^{(l)}$, with $h_i^{(0)}=x_i$
②The $l$-th layer of a typical GNN is represented as an AGGREGATE-COMBINE step (a sketch follows below):
$$a_i^{(l)}=\operatorname{AGGREGATE}^{(l)}\left(\left\{h_j^{(l-1)}:j\in\mathcal{N}(v_i)\right\}\right),\qquad h_i^{(l)}=\operatorname{COMBINE}^{(l)}\left(h_i^{(l-1)},\,a_i^{(l)}\right)$$
where $\mathcal{N}(v_i)$ denotes the set of neighbors (first-hop or higher, depending on the variant) of $v_i$; for graph-level tasks, a READOUT function additionally pools the node representations into a graph representation $h_G$
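To make the AGGREGATE-COMBINE step concrete, here is a minimal NumPy sketch of one mean-aggregation message-passing layer. This is not the paper's code: `gnn_layer`, `W_self`, and `W_agg` are illustrative names, and mean aggregation plus ReLU is just one possible instantiation.

```python
import numpy as np

def gnn_layer(h, neighbors, W_self, W_agg):
    """a_i = AGGREGATE (mean over N(v_i)); h_i' = COMBINE (linear + ReLU)."""
    a = np.zeros_like(h)
    for i, nbrs in enumerate(neighbors):
        if nbrs:                                  # AGGREGATE over first-hop neighbors
            a[i] = h[nbrs].mean(axis=0)
    return np.maximum(h @ W_self + a @ W_agg, 0)  # COMBINE with a ReLU nonlinearity

# usage: path graph 0-1-2, feature dimension 4
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 4))
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
h1 = gnn_layer(h0, [[1], [0, 2], [1]], W1, W2)
h_G = h1.mean(axis=0)                             # a simple mean READOUT for graph-level tasks
```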
(2)Transformer
①Each layer in Transformer contains a self-attention module and a position-wise feed-forward network (FFN)
②The input of the self-attention module is $H=[h_1^\top,h_2^\top,\dots,h_n^\top]^\top\in\mathbb{R}^{n\times d}$, where $d$ denotes the hidden dimension and $h_i\in\mathbb{R}^{1\times d}$ denotes the hidden representation at position $i$
③The attention mechanism is computed as:
$$Q=HW_Q,\quad K=HW_K,\quad V=HW_V$$
$$A=\frac{QK^\top}{\sqrt{d_K}},\qquad \operatorname{Attn}(H)=\operatorname{softmax}(A)V$$
where $A$ is a matrix capturing the similarity between queries and keys
④For simplicity they apply single-head self-attention and define $d_K=d_V=d$. Moreover, they omit the bias term in the multi-head attention part (a sketch follows below)
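As a sanity check of the formulas above, here is a minimal NumPy sketch of this simplified single-head self-attention, assuming $d_K=d$ and no bias terms (weight names are illustrative):

```python
import numpy as np

def self_attention(H, W_Q, W_K, W_V):
    """Attn(H) = softmax(Q K^T / sqrt(d)) V with Q = H W_Q, K = H W_K, V = H W_V."""
    d = H.shape[1]
    A = (H @ W_Q) @ (H @ W_K).T / np.sqrt(d)        # query-key similarity matrix
    A = np.exp(A - A.max(axis=1, keepdims=True))    # numerically stable row-wise softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ (H @ W_V)

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                          # n = 5 positions, hidden dimension d = 8
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(H, W_Q, W_K, W_V)               # shape (5, 8)
```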
2.4. Graphormer
2.4.1. Structural Encodings in Graphormer
The overall framework of Graphormer contains three structural encoding modules:
(1)Centrality Encoding
①For a directed graph, the centrality encoding added to the input is:
$$h_i^{(0)}=x_i+z^-_{\deg^-(v_i)}+z^+_{\deg^+(v_i)}$$
where $z^-\in\mathbb{R}^d$ is the learnable embedding vector indexed by the indegree $\deg^-(v_i)$, and $z^+\in\mathbb{R}^d$ is the learnable embedding vector indexed by the outdegree $\deg^+(v_i)$ (er, I still have trouble picturing what exactly $z$ looks like)
②For an undirected graph, a single degree $\deg(v_i)$ replaces $\deg^-(v_i)$ and $\deg^+(v_i)$ (a sketch follows below)
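Here is a minimal sketch of how the centrality encoding could be applied, assuming degrees are used directly as indices into NumPy embedding tables; `z_in`, `z_out`, and `max_degree` are illustrative names, not from the released code.

```python
import numpy as np

def centrality_encoding(x, in_deg, out_deg, z_in, z_out):
    """h_i^(0) = x_i + z^-_{indeg(v_i)} + z^+_{outdeg(v_i)} for every node i."""
    return x + z_in[in_deg] + z_out[out_deg]

rng = np.random.default_rng(0)
n, d, max_degree = 4, 8, 16
x = rng.normal(size=(n, d))                   # raw node features
z_in = rng.normal(size=(max_degree, d))       # learnable indegree embedding table
z_out = rng.normal(size=(max_degree, d))      # learnable outdegree embedding table
in_deg = np.array([1, 2, 0, 3])               # indegree of each node
out_deg = np.array([2, 1, 3, 0])              # outdegree of each node
h0 = centrality_encoding(x, in_deg, out_deg, z_in, z_out)
# For an undirected graph, index a single table by deg(v_i) instead of two tables.
```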
(2)Spatial Encoding
①Unlike sequences, there is no natural ordering of the nodes in a graph. To this end, they propose a new spatial encoding $\phi(v_i,v_j):V\times V\to\mathbb{R}$ to represent the spatial relation between $v_i$ and $v_j$:
where they choose $\phi(v_i,v_j)$ to be the distance of the shortest path (SPD) between $v_i$ and $v_j$ if the two nodes are connected. If there is no path, the value is set to -1.
②⭐A learnable scalar is assigned to each feasible output value of $\phi$ and serves as a bias term in the self-attention module (my grasp of attention is still pretty weak)
③The $(i,j)$-th element of the Query-Key product matrix $A$ can then be calculated by:
$$A_{ij}=\frac{(h_iW_Q)(h_jW_K)^\top}{\sqrt{d}}+b_{\phi(v_i,v_j)}$$
where $b_{\phi(v_i,v_j)}$ denotes the learnable scalar indexed by $\phi(v_i,v_j)$ and shared across all layers
④(I couldn't keep the paraphrase going here, so in plain words:) compared with conventional GNNs, the formula above gives the model a global receptive field, since every node can attend to every other node. It is also adaptive: the authors note, for example, that if $b_{\phi(v_i,v_j)}$ is learned to be a decreasing function of $\phi(v_i,v_j)$, each node will pay more attention to the nodes near it (a sketch of the SPD bias follows below).
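A minimal sketch of the spatial encoding bias, assuming the SPD is computed with a plain BFS and unreachable pairs share a dedicated bias slot; the array names and the bucketing scheme are illustrative, not from the released code.

```python
import numpy as np
from collections import deque

def spd_matrix(neighbors):
    """All-pairs shortest-path distances via BFS; -1 marks unreachable pairs."""
    n = len(neighbors)
    dist = -np.ones((n, n), dtype=int)
    for s in range(n):
        dist[s, s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    queue.append(v)
    return dist

def spatial_bias(dist, b):
    """Map each SPD value to its learnable scalar b[phi]; the last slot handles 'no path'."""
    return np.where(dist >= 0, b[np.clip(dist, 0, len(b) - 2)], b[-1])

neighbors = [[1], [0, 2], [1], []]                 # node 3 is isolated
b = np.random.default_rng(0).normal(size=6)        # one scalar per SPD bucket, plus one for -1
bias = spatial_bias(spd_matrix(neighbors), b)      # added elementwise to Q K^T / sqrt(d)
```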
(3)Edge Encoding in the Attention
①In previous works, the two traditional ways of encoding edges are adding edge features into the corresponding node features, or feeding edge features together with aggregated node features into the node update. However, both are superficial and limited: they only propagate edge information to the adjacent nodes of each edge rather than letting it contribute to global, pairwise relationships.
②(One of) the shortest path(s) between each node pair $(v_i,v_j)$ can be written as $SP_{ij}=(e_1,e_2,\dots,e_N)$
③They calculate the average of the dot-products between the feature $x_{e_n}$ of the $n$-th edge $e_n$ and a learnable weight embedding $w_n^E$ at the $n$-th position along the path (a sketch follows below):
$$c_{ij}=\frac{1}{N}\sum_{n=1}^{N}x_{e_n}\left(w_n^E\right)^\top$$
where $w_n^E\in\mathbb{R}^{d_E}$ and $d_E$ denotes the dimensionality of the edge features; this $c_{ij}$ is added to $A_{ij}$ as an additional bias term
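A minimal sketch of the edge-encoding term for one node pair, assuming the edge features along the shortest path are already collected into an array; `path_feats` and `w_E` are illustrative names.

```python
import numpy as np

def edge_encoding_bias(path_feats, w_E):
    """c_ij = (1/N) * sum_n x_{e_n} . w_n^E over the N edges of the shortest path SP_ij."""
    N = len(path_feats)
    if N == 0:                                      # i == j or no path: contribute nothing
        return 0.0
    return float(np.mean([path_feats[n] @ w_E[n] for n in range(N)]))

rng = np.random.default_rng(0)
d_E, max_path_len = 4, 8
w_E = rng.normal(size=(max_path_len, d_E))          # learnable weight embeddings w_n^E
path_feats = rng.normal(size=(3, d_E))              # features of the 3 edges on SP_ij
c_ij = edge_encoding_bias(path_feats, w_E)          # added to A_ij together with b_phi
```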
2.4.2. Implementation Details of Graphormer
(1)Graphormer Layer
(2)Special Node
2.4.3. How Powerful is Graphormer?
(1)Fact 1
(2)Fact 2
2.5. Experiments
2.5.1. OGB Large-Scale Challenge
(1)Baselines
(2)Settings
(3)Results
2.5.2. Graph Representation
(1)Baselines
(2)Settings
(3)Results
2.5.3. Ablation Studies
(1)Node Relation Encoding
(2)Centrality Encoding
(3)Edge Encoding
2.6. Related Work
2.6.1. Graph Transformer
2.6.2. Structural Encodings in GNNs
(1)Path and Distance in GNNs
(2)Positional Encoding in Transformer on Graph
(3)Edge Feature
2.7. Conclusion
2.8. Proofs
2.8.1. SPD can Be Used to Improve WL-Test
2.8.2. Proof of Fact 1
2.8.3. Proof of Fact 2
2.9. Experiment Details
2.9.1. Details of Datasets
2.9.2. Details of Training Strategies
(1)PCQM4M-LSC
(2)OGBG-MolPCBA
(3)OGBG-MolHIV
(4)ZINC
2.9.3. Details of Hyper-parameters for Baseline Methods
(1)PCQM4M-LSC
(2)OGBG-MolPCBA
(3)OGBG-MolHIV
2.10. More Experiments
2.11. Discussion & Future Work
3. Supplementary Knowledge
3.1. Permutation invariant
Reference for further study 1: 【神经网络】Deep Sets:满足置换不变性(permutation-invariant)与置换同变性(permutation-equivariant)的通用网络结构 - 知乎 (zhihu.com)
4. Reference List
Ying C. et al. (2021) 'Do Transformers Really Perform Bad for Graph Representation?', NeurIPS 2021. doi: https://doi.org/10.48550/arXiv.2106.05234