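[Paper Close Reading] Graph Attention Networks

Original paper: [1710.10903] Graph Attention Networks (arxiv.org)


The English here is typed entirely by hand! It is a summary and paraphrase of the original paper. Spelling and grammar mistakes may be unavoidable; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!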

1. TL;DR

1.1. Thoughts

(1)The Introduction already seems to cover the related work?


1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

        ①They propose graph attention networks (GATs), which are suitable for both inductive and transductive problems

        ②There is no need for costly matrix operations (such as inversion)

        ③They test their model on the Cora, Citeseer and Pubmed citation network datasets and on a protein-protein interaction dataset

upfront  adj. paid in advance; frank; honest; straightforward  adv. in advance, as an advance payment

2.2. Introduction

        ①CNNs have been widely used in machine translation, image classification and semantic segmentation. However, they cannot be applied directly to non-grid (i.e. irregular) data such as social/telecommunication/biological networks, 3D meshes and brain connectomes. Graphs describe such data more naturally

        ②Early works adopted recursive neural networks to process directed acyclic graphs

        ③They review spectral and non-spectral approaches to graph processing

        ④Because it allows inputs of different sizes, the attention mechanism has been successfully used in NLP

        ⑤The attention mechanism can be computed in parallel across node-neighbor pairs, assigns different weights to different neighbors, and is applicable to inductive learning

acyclic  adj. containing no cycles; non-cyclic; non-periodic; not ring-shaped

reminiscent  adj. nostalgic; bringing (a person or thing) to mind; recalling the past  n. one who reminisces

2.3. GAT architecture

2.3.1. Graph attention layer

        ①Input matrix:

\mathbf{h}=\{\vec{h}_{1},\vec{h}_{2},\ldots,\vec{h}_{N}\},\vec{h}_{i}\in\mathbb{R}^{F},\mathbf{h}\in \mathbb{R}^{F\times N}

where N denotes the number of nodes, F denotes the number of features

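        ②Then transform the node features into higher-level features with a shared weight matrix \mathbf{W}\in\mathbb{R}^{F^{'}\times F}, and compute attention coefficients:

e_{ij}=a(\mathbf{W}\vec{h}_{i},\mathbf{W}\vec{h}_{j})

where a:\mathbb{R}^{F^{'}}\times \mathbb{R}^{F^{'}}\rightarrow \mathbb{R} is a shared attention mechanism;

j is a node in the neighborhood of i;

and e_{ij} indicates the importance of node j's features to node i.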

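        ③Normalize the raw coefficients e_{ij} across the neighbors with softmax:

\alpha_{ij}=\mathrm{softmax}_j(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}

where \mathcal{N}_i denotes the neighborhood of node i; the paper uses first-order neighbors, including i itself.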
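
The neighborhood softmax in step ③ can be sketched in plain Python as follows. This is only a minimal illustration: the function name `neighborhood_softmax`, the dictionary layout and the toy scores are assumptions made for this note, not something taken from the paper.

```python
import math

def neighborhood_softmax(scores):
    """Normalize raw attention scores e_ij over the neighbors j of one node i.

    `scores` maps neighbor index j -> e_ij (hypothetical data layout).
    """
    # Subtracting the maximum improves numerical stability and leaves the result unchanged.
    m = max(scores.values())
    exps = {j: math.exp(e - m) for j, e in scores.items()}
    total = sum(exps.values())
    return {j: v / total for j, v in exps.items()}

# Toy example: node 0 with first-order neighborhood {0, 1, 2} (self-loop included).
e_0 = {0: 1.0, 1: 2.0, 2: 0.5}
alpha_0 = neighborhood_softmax(e_0)
```

The weights in `alpha_0` sum to 1, and neighbors with larger raw scores receive larger weights.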

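        ④Expanding the attention function a(\cdot,\cdot) gives:

\alpha_{ij}=\frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{\mathbf{a}}^{T}[\mathbf{W}\vec{h}_{i}\parallel\mathbf{W}\vec{h}_{j}]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\mathrm{LeakyReLU}\left(\vec{\mathbf{a}}^{T}[\mathbf{W}\vec{h}_{i}\parallel\mathbf{W}\vec{h}_{k}]\right)\right)}

where a is a single-layer feedforward neural network,

parameterized by a weight vector \vec{\mathbf{a}}\in\mathbb{R}^{2F^{\prime}};

the LeakyReLU uses negative slope \alpha =0.2 ;

|| denotes concatenation.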
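
A raw score LeakyReLU(a^T [W h_i || W h_j]) as described in step ④ can be computed as in the sketch below. The helper names (`matvec`, `attention_score`) and all numbers are made up for illustration; only the negative slope 0.2 comes from the notes above.

```python
def leaky_relu(x, negative_slope=0.2):
    return x if x > 0 else negative_slope * x

def matvec(W, h):
    """Multiply an F' x F matrix (list of rows) by a length-F vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def attention_score(W, a, h_i, h_j):
    """e_ij = LeakyReLU(a^T [W h_i || W h_j])."""
    z = matvec(W, h_i) + matvec(W, h_j)  # list concatenation plays the role of ||
    return leaky_relu(sum(a_k * z_k for a_k, z_k in zip(a, z)))

# Toy sizes: F = 3 input features, F' = 2 transformed features.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.0]]
a = [1.0, -1.0, 0.5, 0.5]  # length 2 * F'
h_i = [1.0, 2.0, 3.0]
h_j = [0.0, 1.0, 0.0]
e_ij = attention_score(W, a, h_i, h_j)
e_ji = attention_score(W, a, h_j, h_i)
```

Because the two halves of the concatenation meet different halves of the weight vector, the score is generally asymmetric: here `e_ij` and `e_ji` differ.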

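        ⑤Applying a nonlinearity \sigma to the attention-weighted sum gives the final output:

\vec h'_i=\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\vec h_j\right)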
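
Putting steps ①-⑤ together, a single-head graph attention layer that computes sigma(sum_j alpha_ij W h_j) might look like the sketch below. It is a loop-based illustration rather than the authors' implementation: the adjacency-list format, the function name `gat_layer`, the toy weights and the choice of ELU as the nonlinearity are all assumptions.

```python
import math

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def gat_layer(h, neighbors, W, a, activation=elu):
    """Single-head graph attention layer (illustrative, not optimized).

    h: list of N feature vectors (length F each)
    neighbors: neighbors[i] lists the nodes j in N_i (should include i itself)
    W: F' x F weight matrix; a: attention vector of length 2 * F'
    """
    Wh = [matvec(W, h_i) for h_i in h]
    out = []
    for i, nbrs in enumerate(neighbors):
        # Raw scores e_ij = LeakyReLU(a^T [W h_i || W h_j]).
        e = [leaky_relu(sum(p * q for p, q in zip(a, Wh[i] + Wh[j]))) for j in nbrs]
        # Softmax over the neighborhood of i.
        m = max(e)
        w = [math.exp(x - m) for x in e]
        s = sum(w)
        alpha = [x / s for x in w]
        # Attention-weighted sum of transformed neighbor features, then the nonlinearity.
        agg = [sum(al * Wh[j][d] for al, j in zip(alpha, nbrs)) for d in range(len(W))]
        out.append([activation(x) for x in agg])
    return out

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1], [0, 1, 2], [1, 2]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # F = 2 -> F' = 3
a = [0.1, 0.2, 0.3, -0.1, 0.0, 0.2]
h_out = gat_layer(h, neighbors, W, a)
```

With an all-zero attention vector every neighbor gets the same weight, so the layer reduces to averaging the transformed neighbor features.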


        ⑥They further introduce multi-head attention with concatenation:

\vec h'_i=\parallel _{k=1}^{K}\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)

where \alpha_{ij}^k denotes the normalized attention coefficients calculated by the k-th attention mechanism a^{k}, and \mathbf{W}^k is the corresponding weight matrix

        ⑦In the prediction (final) layer, averaging the heads is more sensible than concatenating them:

\vec h'_i=\sigma\left(\frac1K\sum_{k=1}^K\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)
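
The difference between the two ways of combining heads (steps ⑥ and ⑦) can be sketched as follows. The function `combine_heads` and the toy per-head sums are hypothetical; each inner list stands for the pre-activation sum over neighbors of alpha_ij^k W^k h_j for one head k, and ELU and sigmoid are only example choices for the nonlinearity.

```python
import math

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine_heads(head_sums, mode, activation):
    """Combine K per-head aggregates (each of length F') for one node."""
    if mode == "concat":
        # Hidden layers: apply the nonlinearity per head, then concatenate -> K * F' features.
        return [activation(x) for agg in head_sums for x in agg]
    if mode == "average":
        # Prediction layer: average over heads first, then apply the nonlinearity -> F' features.
        K = len(head_sums)
        mean = [sum(col) / K for col in zip(*head_sums)]
        return [activation(x) for x in mean]
    raise ValueError(f"unknown mode: {mode}")

# K = 3 heads, F' = 2 features per head.
head_sums = [[1.0, -1.0], [3.0, 0.0], [2.0, 1.0]]
hidden = combine_heads(head_sums, "concat", elu)       # length K * F' = 6
output = combine_heads(head_sums, "average", sigmoid)  # length F' = 2
```

Concatenation grows the feature width with the number of heads, while averaging keeps it at F', which is why averaging fits a layer whose width must equal the number of classes.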

        ⑧The figure of this model:

where the left panel shows the attention mechanism and the right panel shows multi-head attention with K=3 heads

2.3.2. Comparisons to related work

(1)Their improvements:

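        ①There is no need for eigendecomposition or other time-consuming calculations. Furthermore, the K attention heads can be computed in parallel

        ②GAT can assign different weights to different neighbors

        ③It adapts to directed graphs by simply omitting \alpha_{ij} when the edge j\rightarrow i is absent

        ④It is applicable to inductive learning

        ⑤GraphSAGE samples a fixed-size neighborhood, whereas GAT works with the entire neighborhood

        ⑥Compared with MoNet, which uses the structural properties of nodes, GAT uses node features for its similarity computations

on par with  on the same level as; comparable to

par  n. face value (of a stock); <golf> the standard number of strokes; the average amount, normal state, usual level (or standard); the standard level (especially of someone's work or health)  adj. at par, equal to face value; average, normal  vt. <golf> to score par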

2.4. Evaluation

        Dataset information:

2.4.1. Datasets

(1)Transductive learning

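        ①In the three citation datasets (Cora, Citeseer and Pubmed, the left three columns of the table), nodes represent documents, undirected edges represent citations, and node features are the elements of a bag-of-words representation of a document

        ②Training labels: only 20 nodes per class are used for training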

(2)Inductive learning

        ①Pre-processing: provided by Hamilton et al. (GraphSAGE)

2.4.2. State-of-the-art methods

(1)Transductive learning

        Comparison table:

(2)Inductive learning

        Comparison table:

where Const-GAT uses a constant attention mechanism, i.e. the same weight for every neighbor
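
To make the Const-GAT baseline concrete: if every pair receives the same raw score, the softmax over a neighborhood becomes uniform, so each neighbor gets weight 1/|N_i|. The helper below is a hypothetical sketch of that special case with a made-up adjacency list, not code from the paper.

```python
def constant_attention(neighbors):
    """Attention weights when all raw scores are equal: softmax of equal scores is uniform."""
    return [{j: 1.0 / len(nbrs) for j in nbrs} for nbrs in neighbors]

# neighbors[i] lists the neighborhood of node i (self-loops included).
neighbors = [[0, 1], [0, 1, 2, 3], [1, 2], [1, 3]]
alpha = constant_attention(neighbors)
```

Comparing GAT against this variant isolates how much of the gain comes from learning different weights for different neighbors.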


        They also report a per-node shared MLP classifier, which ignores the graph structure entirely, as a baseline

2.4.3. Experimental setup

(1)Transductive learning

        ①They adopt a 2-layer model. The first layer uses K=8 attention heads with F'=8 features each, followed by ELU. The second layer uses a single attention head that computes C features (C is the number of classes), followed by softmax.

        ②L2 regularization with \lambda =0.0005

        ③Dropout with p=0.6, applied to the inputs of both layers and to the normalized attention coefficients

(2)Inductive learning

        ①They choose a 3-layer model. The first two layers use K=4 heads with F'=256 features each, followed by ELU. The third layer uses K=6 heads, which are averaged and followed by a logistic sigmoid.

        ②The training set is large enough that L2 regularization and dropout are not needed

        ③Skip connections are used across the intermediate attention layer
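
The layer widths implied by the two architectures above can be checked with a small calculation. This is only a bookkeeping sketch: the helper `layer_width` is hypothetical, C=7 assumes Cora's 7 classes, and 121 assumes the number of PPI labels reported in the paper (neither number appears in the notes above).

```python
def layer_width(num_heads, features_per_head, combine):
    """Output width of one multi-head attention layer.

    combine="concat": head outputs are concatenated (hidden layers).
    combine="average": head outputs are averaged (prediction layer).
    """
    if combine == "concat":
        return num_heads * features_per_head
    if combine == "average":
        return features_per_head
    raise ValueError(f"unknown combine mode: {combine}")

# Transductive model; C = 7 is an assumption (Cora).
C = 7
transductive = [
    layer_width(8, 8, "concat"),   # K = 8 heads, F' = 8 each, then ELU
    layer_width(1, C, "average"),  # single head with C features, then softmax
]

# Inductive model; 121 labels is an assumption (PPI).
num_labels = 121
inductive = [
    layer_width(4, 256, "concat"),          # then ELU
    layer_width(4, 256, "concat"),          # then ELU
    layer_width(6, num_labels, "average"),  # then logistic sigmoid
]
```

So the transductive hidden layer has 64 features, while each inductive hidden layer has 1024.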


(3)Both settings

        ①Initialization: Glorot

        ②Optimizer: Adam SGD

        ③Learning rate: 0.01 for Pubmed, and 0.005 for others

        ④Early stopping strategy: patience of 100 epochs on the validation set
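
Patience-based early stopping can be sketched as below. This is a simplified illustration that watches a single validation loss, whereas the paper monitors both the loss and the accuracy (or micro-F1); the function name and the toy loss values are assumptions, and a patience of 3 is used only to keep the example short.

```python
def early_stopping_epochs(val_losses, patience=100):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience window.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop here.
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.6]
best_epoch, stop_epoch = early_stopping_epochs(losses, patience=3)
```

In this toy run training stops at epoch 5 and the model from epoch 2 is kept, so the later improvement at epoch 7 is never seen; that is the usual trade-off when choosing the patience.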

2.4.4. Results

        ①For a fair comparison, they further tune and adjust the baseline models so that they are comparable to GAT

        ②Visualization of the transformed feature representations:


2.5. Conclusions

3. Supplementary knowledge

3.1. Spectral and non-spectral approaches for GNN

3.2. Spectral domain and frequency domain

(1)Spectral domain: mainly used in GNNs; the Fourier transform is applied over the spatial (graph) dimension

(2)Frequency domain: mainly used in signal and image processing; the Fourier transform is applied over the temporal dimension
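
As a reminder of what the spectral domain in (1) means for graphs, the commonly used definitions are given below. They are standard background rather than something stated in the notes above: A is the adjacency matrix, D the degree matrix, U the matrix of eigenvectors of the normalized Laplacian L, \Lambda the diagonal matrix of its eigenvalues, x a signal on the nodes and g_\theta a spectral filter.

```latex
L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^{T}, \qquad
\hat{x} = U^{T} x, \qquad
x = U \hat{x}, \qquad
g_{\theta} \star x = U \, g_{\theta}(\Lambda) \, U^{T} x
```

Computing U requires an eigendecomposition of L, which is the expensive step that GAT avoids.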

4. Reference List

Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903




