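Paper: [1710.10903] Graph Attention Networks (arxiv.org)
Code: https://github.com/PetarV-/GAT
The English here was typed entirely by hand! It summarizes and paraphrases the original paper. Some spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!
1. TL;DR
1.1. Takeaways
(1)The Intro seems to already include the related work?
(2)Big praise for the Datasets table; I don't even need to summarize it
1.2. Paper framework diagram
2. Close reading, section by section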
2.1. Abstract
①They propose graph attention networks (GATs), which are suitable for both inductive and transductive problems
②There is no need for special, costly matrix operations (such as inversion), nor for knowing the graph structure upfront
③They test their model on the Cora, Citeseer and Pubmed citation network datasets and on a protein-protein interaction dataset
upfront adj. paid in advance; frank; honest; straightforward adv. in advance, as an advance payment
2.2. Introduction
①CNNs have been widely used in machine translation, image classification and semantic segmentation. However, they cannot be applied directly to non-grid, i.e. irregular, data such as social/telecommunication/biological networks, 3D meshes and brain connectomes. Graphs describe such structures more accurately
②Early works adopted recursive neural networks to process directed acyclic graphs
③They introduced spectral and non-spectral methods of graph processing
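④Because it can handle inputs of variable size, the attention mechanism has been successfully used in NLP
⑤The attention mechanism can be computed in parallel across node-neighbor pairs, assigns different weights to different neighbors, and is applicable to inductive learning
acyclic adj. without cycles; non-cyclic; non-periodic; not ring-shaped
reminiscent adj. nostalgic; reminding one of (a person or thing); recalling the past n. one who reminisces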
2.3. GAT architecture
2.3.1. Graph attention layer
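①Input node features:
$$\mathbf{h}=\{\vec{h}_1,\vec{h}_2,\ldots,\vec{h}_N\},\quad \vec{h}_i\in\mathbb{R}^{F}$$
where $N$ denotes the number of nodes and $F$ denotes the number of features per node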
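②Then transform the node features into higher-level features with a shared weight matrix $\mathbf{W}\in\mathbb{R}^{F'\times F}$ and compute attention coefficients:
$$e_{ij}=a(\mathbf{W}\vec{h}_i,\mathbf{W}\vec{h}_j)$$
where $a:\mathbb{R}^{F'}\times\mathbb{R}^{F'}\to\mathbb{R}$ is an attention mechanism;
$j$ is a neighbor node in the neighborhood of $i$;
also, $e_{ij}$ indicates the importance of node $j$'s features to node $i$.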
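③Normalize across neighbors with softmax:
$$\alpha_{ij}=\mathrm{softmax}_j(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}$$
where $\mathcal{N}_i$ denotes the neighborhood of node $i$, restricted to order 1, i.e. first-order neighbors (including $i$ itself).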
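④Expanding the function $a$ further:
$$\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{\mathbf{a}}^{T}[\mathbf{W}\vec{h}_i\,\|\,\mathbf{W}\vec{h}_j]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(\vec{\mathbf{a}}^{T}[\mathbf{W}\vec{h}_i\,\|\,\mathbf{W}\vec{h}_k]\right)\right)}$$
where $a$ is a single-layer feedforward neural network,
$\vec{\mathbf{a}}\in\mathbb{R}^{2F'}$ denotes its weight vector;
the LeakyReLU negative slope is $\alpha=0.2$;
$\|$ denotes concatenation.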
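⑤Applying a nonlinearity $\sigma$ to get the final output:
$$\vec{h}'_i=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\vec{h}_j\right)$$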
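⑥They further introduce multi-head attention with concatenation:
$$\vec{h}'_i=\Big\Vert_{k=1}^{K}\,\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}\mathbf{W}^{k}\vec{h}_j\right)$$
where $\alpha_{ij}^{k}$ denotes the normalized attention coefficients calculated by the $k$-th attention mechanism and $\mathbf{W}^{k}$ is the corresponding weight matrix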
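⑦In the final (prediction) layer, averaging the heads is more sensible than concatenating them:
$$\vec{h}'_i=\sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j\in\mathcal{N}_i}\alpha_{ij}^{k}\mathbf{W}^{k}\vec{h}_j\right)$$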
⑧The figure of this model:
where the left is the attention mechanism and the right is multi-head attention with $K=3$ heads
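The steps of the graph attention layer can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `gat_head` and `multi_head` are made up for these notes, the graph is stored as a dense adjacency matrix that already contains self-loops, and dropout is left out. It relies on the fact that the product of the attention vector with the concatenation of two projected feature vectors splits into a sum of two dot products, one per half of the vector.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def elu(x):
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0)))

def gat_head(h, adj, W, a, negative_slope=0.2):
    """One attention head (hypothetical helper).

    h:   (N, F) node features
    adj: (N, N), adj[i, j] > 0 if j is in the neighborhood of i (self-loops included)
    W:   (F_out, F) shared weight matrix
    a:   (2 * F_out,) attention weight vector
    Returns (alpha, aggregated) where aggregated[i] = sum_j alpha[i, j] * W h_j.
    """
    Wh = h @ W.T                                   # (N, F_out)
    f_out = Wh.shape[1]
    # a^T [Wh_i || Wh_j] = a[:F_out] . Wh_i + a[F_out:] . Wh_j
    src = Wh @ a[:f_out]                           # (N,)
    dst = Wh @ a[f_out:]                           # (N,)
    e = leaky_relu(src[:, None] + dst[None, :], negative_slope)
    e = np.where(adj > 0, e, -np.inf)              # masked attention: neighbors only
    e = e - e.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)       # softmax over each neighborhood
    return alpha, alpha @ Wh

def multi_head(h, adj, Ws, As, mode="concat"):
    """Combine K heads by concatenation (hidden layers) or averaging (final layer)."""
    outs = [gat_head(h, adj, W, a)[1] for W, a in zip(Ws, As)]
    if mode == "concat":
        return np.concatenate([elu(o) for o in outs], axis=1)   # (N, K * F_out)
    return np.mean(outs, axis=0)   # (N, F_out); caller applies softmax/sigmoid

# Toy usage on a 4-node path graph with self-loops
rng = np.random.default_rng(0)
N, F, F_out, K = 4, 5, 3, 2
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
h = rng.normal(size=(N, F))
Ws = [rng.normal(size=(F_out, F)) for _ in range(K)]
As = [rng.normal(size=2 * F_out) for _ in range(K)]

alpha, _ = gat_head(h, adj, Ws[0], As[0])
hidden = multi_head(h, adj, Ws, As, mode="concat")
output = multi_head(h, adj, Ws, As, mode="average")
```

Each row of `alpha` sums to 1 and is zero outside the neighborhood, `hidden` has K times as many columns as one head, and `output` keeps the width of a single head.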
2.3.2. Comparisons to related work
(1)Their improvements:
①There is no need for eigendecomposition or other time-consuming calculation. Furthermore, multi-head operations can also be parallelized
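②GAT allows assigning different weights to different nodes within the same neighborhood
③Adapting to directed graphs by omitting $\alpha_{ij}$ when the edge $j\to i$ is not present
④Applicable to inductive learning
⑤GraphSAGE samples a fixed-size neighborhood of each node, whereas GAT works with the entire neighborhood
⑥Compared with MoNet, which computes weights from node structural properties, GAT uses node features for similarity computations
on par with: comparable to
par n. face value (of a share); par (golf); average amount, normal state, usual level (or standard); standard (especially of one's work or health) adj. at par, equal to face value; average, normal vt. to score par (golf)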
2.4. Evaluation
Datasets information:
2.4.1. Datasets
(1)Transductive learning
①In the left three datasets, nodes represent documents, undirected edges represent citations and node features represent elements of a bag-of-words representation of a document
②Training: only 20 labeled nodes per class are used
(2)Inductive learning
①Pre-processing: provided by Hamilton et al. (GraphSAGE)
2.4.2. State-of-the-art method
(1)Transductive learning
Comparison table:
(2)Inductive learning
Comparison table:
where Const-GAT adopts a constant attention mechanism, i.e. the same weight for every neighbor
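To make the Const-GAT baseline concrete: if every unnormalized score is the same constant, the softmax over a neighborhood gives each neighbor the weight one over the neighborhood size. A minimal sketch under that assumption (the helper name `constant_attention` is made up here, and the adjacency matrix is assumed to include self-loops):

```python
import numpy as np

def constant_attention(adj):
    """Uniform coefficients: 1/|N_i| for each j in N_i, 0 elsewhere (hypothetical helper)."""
    adj = np.asarray(adj, dtype=float)
    return adj / adj.sum(axis=1, keepdims=True)

adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
alpha = constant_attention(adj)
# Node 0 has two neighbors (weights 1/2 each); node 1 has three (weights 1/3 each).
```

With these coefficients the layer reduces to plain mean aggregation over each neighborhood, which is why comparing it with GAT isolates the contribution of learned attention.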
(3)Summary
They also report a per-node shared MLP classifier, which does not use the graph structure at all, as a baseline
2.4.3. Experimental setup
(1)Transductive learning
①They adopt a 2-layer model. The first layer uses $K=8$ attention heads with $F'=8$ features each (64 features in total), followed by ELU. The second layer computes $C$ features (the number of classes) with a single attention head, followed by softmax.
②Moreover, L2 regularization with $\lambda=0.0005$
③Dropout rate: 0.6
(2)Inductive learning
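①They choose a 3-layer model. The first two layers each use $K=4$ attention heads with $F'=256$ features (1024 features in total), followed by ELU. The third layer adopts $K=6$ attention heads with 121 features each, which are averaged and followed by a logistic sigmoid.
②The training set is large enough that L2 regularization and dropout are not needed
③Adopting skip connections across the intermediate attention layer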
(3)Summary
①Initialization: Glorot
②Optimizer: Adam SGD
③Learning rate: 0.01 for Pubmed, and 0.005 for others
④Early stopping strategy: patience of 100 epochs
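A patience-based early stopping rule can be sketched as follows. This is a simplified illustration, not the authors' code: the class name `EarlyStopping` is made up for these notes and it tracks only the validation loss, whereas the paper monitors the validation loss together with accuracy (transductive) or micro-F1 (inductive). The toy run uses a patience of 3 instead of 100 to keep it short.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=100):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run with made-up validation losses
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this toy run the best loss (0.8) is reached at epoch 1, the next three epochs fail to improve on it, and training stops at epoch 4 before the last value is ever seen.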
2.4.4. Results
①For a fair comparison, they tune and adjust the other models to be similar to GAT
②Transformed feature representations visualization:
2.5. Conclusions
3. Supplementary knowledge
3.1. Spectral and non-spectral approaches for GNN
3.2. Spectral domain and frequency domain
(1)Spectral domain: mainly used in GNNs, applying the Fourier transform along the spatial dimension, i.e. over the nodes of the graph
(2)Frequency domain: mainly used in signal and image processing, applying the Fourier transform along the temporal dimension
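As a small illustration of what "Fourier transform over the nodes" means in spectral GNNs, the sketch below uses the common definition based on the eigendecomposition of the graph Laplacian. The choice of the unnormalized Laplacian (degree matrix minus adjacency matrix) and the 4-node path graph are assumptions made for this example only; this eigendecomposition is also the costly step that GAT avoids.

```python
import numpy as np

# Undirected 4-node path graph (no self-loops); chosen only for illustration
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

deg = np.diag(adj.sum(axis=1))
lap = deg - adj                      # unnormalized graph Laplacian L = D - A

# L is symmetric, so eigh returns real eigenvalues (ascending) and an orthonormal basis U
eigvals, U = np.linalg.eigh(lap)

x = np.array([1.0, 2.0, 3.0, 4.0])   # one scalar signal value per node
x_hat = U.T @ x                      # graph Fourier transform
x_rec = U @ x_hat                    # inverse transform recovers the signal
```

The eigenvalues play the role of frequencies: the smallest one is zero and corresponds to a constant signal over the graph.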
4. Reference List
Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903