Reading notes: *On Clustering Using Random Walks*
1. Problem formulation
1.1 Problem description
Let $G(V,E,\omega)$ be a weighted graph, where $V$ is the set of nodes, $E$ is the set of edges between nodes in $V$, and $\omega$ is a function $\omega: E \to \mathbb{R}$ that measures the similarity between pairs of items (a higher value means more similar).
$$p_{ij} = \frac{\omega(i,j)}{d_i}$$
$$d_i = \sum_{k=1}^n \omega(i,k)$$
$M^G \in \mathbb{R}^{n \times n}$ is the associated transition matrix,
$$M^G_{ij} = \begin{cases} p_{ij} & \langle i,j \rangle \in E \\ 0 & \textrm{otherwise} \end{cases}$$
Question:
- $\omega$ represents the similarity between nodes, but in practice we only have an undirected graph that tells us whether two nodes are connected. How do we construct $\omega$ from the information we actually have?
Answer: the similarity here can be taken to be the weight of the edge between two nodes, so $M^G_{ij}$ can be viewed as the result of normalizing the adjacency matrix.
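As a concrete illustration (my own sketch, not code from the paper), for an unweighted undirected graph we can simply set $\omega(i,j)=1$ for every edge, so $M^G$ is the adjacency matrix with each row divided by the node's degree $d_i$:

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize a (weighted) adjacency matrix into M^G:
    M[i, j] = w(i, j) / d_i, with d_i = sum_k w(i, k)."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1, keepdims=True)  # d_i for every node i
    return adjacency / degrees

# A small undirected graph with omega(i, j) = 1 on every edge.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 0, 0],
              [1, 1, 0, 0]])
M = transition_matrix(A)  # every row of M sums to 1
```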
This part is a real trap: I could not find anywhere in the paper how $P^{k}_{\textrm{visit}}(i)$ is computed, and I was stuck here for a very long time.
The original paper describes it as follows:
Now, denote by $P^k_{visit}(i) \in \mathbb{R}^n$ the vector whose j-th component is the probability that a random walk originating at $i$ will visit node $j$ in its k-th step. Thus, $P^k_{visit}(i)$ is the i-th row in the matrix $(M^G)^k$, the k'th power of $M^G$.
We already know how $M^G$ is computed, but what about $(M^G)^k$? The paper only says "the k'th power of $M^G$", which I read as the matrix $M^G$ raised to the k-th power (repeated matrix multiplication).
$P^k_{visit}(i)$ is the i-th row in the matrix $(M^G)^k$:
$$P^k_{visit}(i) = (M^G)^k_i$$
$$(M^G)^k = \begin{pmatrix} P^k_{visit}(1)^{\mathbf{T}} \\ P^k_{visit}(2)^{\mathbf{T}} \\ \vdots \\ P^k_{visit}(n)^{\mathbf{T}} \end{pmatrix}$$
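A minimal sketch of this (assuming NumPy and the `transition_matrix` helper above; `p_visit` is my own name, not the paper's):

```python
import numpy as np

def p_visit(M, i, k):
    """P^k_visit(i): the i-th row of (M^G)^k, i.e. the probability of
    being at each node after exactly k steps of a walk starting at i."""
    return np.linalg.matrix_power(M, k)[i]
```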
Notice: up to this point, this is the same as the Markov Cluster algorithm (MCL). MCL keeps iterating until the matrix no longer changes; here, since that is computationally expensive, the authors instead use the sum of the first k powers.
We now offer two methods for performing the edge separation, both based on deterministic analysis of random walks.
The two edge-separation (sharpening) methods are:
- NS: Separation by neighborhood similarity.
- CE: Separation by circular escape.
Vocabulary: the weighted neighborhood; bipartite subgraph.
$$P^{\leq k}_{\textrm{visit}}(v) = \sum_{i=1}^k P^{i}_{\textrm{visit}}(v)$$
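A sketch of this cumulative quantity (again my own code, reusing the hypothetical `transition_matrix`/`p_visit` setup above; summing the powers incrementally avoids recomputing each one from scratch):

```python
import numpy as np

def p_visit_upto(M, v, k):
    """P^{<=k}_visit(v): sum of P^i_visit(v) for i = 1..k."""
    total = np.zeros(M.shape[0])
    Mi = np.eye(M.shape[0])   # (M^G)^0
    for _ in range(k):
        Mi = Mi @ M           # now (M^G)^i
        total += Mi[v]        # accumulate P^i_visit(v)
    return total
```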
2. NS: Separation by neighborhood similarity.
Now, in order to estimate the closeness of two nodes $v$ and $u$, we fix some small $k$ (e.g. $k = 3$) and compare $P^{\leq k}_{\textrm{visit}}(v)$ and $P^{\leq k}_{\textrm{visit}}(u)$. The smaller the difference, the greater the intimacy between $u$ and $v$.
$$NS(G) \xlongequal{dfn} G_s(V, E, \omega_s),$$
where
$$\forall \langle v, u \rangle \in E,\quad \omega_s(u, v) = sim^k\left(P^{\leq k}_{visit}(v),\, P^{\leq k}_{visit}(u)\right)$$
$sim^k(x,y)$ is some similarity measure of the vectors $x$ and $y$, whose value increases as $x$ and $y$ become more similar.
A suitable choice for $sim^k(x,y)$ is:
$$f^k(x,y) \xlongequal{dfn} \exp\left(2k - \|x - y\|_{L_1}\right) - 1 \tag{1}$$
$$\|x - y\|_{L_1} = \sum_{i=1}^n |x_i - y_i|$$
Another choice is:
$$\cos(x,y) = \frac{(x,y)}{\sqrt{(x,x)} \cdot \sqrt{(y,y)}} \tag{2}$$
where $(\cdot,\cdot)$ denotes the inner product.
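Here is a sketch of the NS reweighting built from these pieces (my own code, not the authors'; it reuses the hypothetical `p_visit_upto` helper above, and `ns_weights`/`edges` are names I made up):

```python
import numpy as np

def f_k(x, y, k):
    """Similarity (1): f^k(x, y) = exp(2k - ||x - y||_L1) - 1."""
    return np.exp(2 * k - np.abs(x - y).sum()) - 1

def cos_sim(x, y):
    """Similarity (2): cosine similarity of x and y."""
    return (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))

def ns_weights(M, edges, k=3):
    """omega_s(u, v) = sim^k(P^{<=k}_visit(v), P^{<=k}_visit(u)) per edge,
    using f^k as the similarity measure (cos_sim is a drop-in replacement)."""
    return {(u, v): f_k(p_visit_upto(M, u, k), p_visit_upto(M, v, k), k)
            for u, v in edges}
```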
3. CE: Separation by circular escape.
4. Code implementation
```python
import numpy as np

def markovCluster(adjacencyMat, dimension, numIter, power=2, inflation=2):
    # Normalize columns so the matrix is column-stochastic.
    columnSum = np.sum(adjacencyMat, axis=0)
    probabilityMat = adjacencyMat / columnSum

    # Expand by taking the power-th matrix power.
    def _expand(probabilityMat, power):
        expandMat = probabilityMat
        for i in range(power - 1):
            expandMat = np.dot(expandMat, probabilityMat)
        return expandMat

    expandMat = _expand(probabilityMat, power)

    # Inflate by taking the element-wise inflation-th power of the
    # resulting matrix, then re-normalizing each column.
    def _inflate(expandMat, inflation):
        powerMat = expandMat
        for i in range(inflation - 1):
            powerMat = powerMat * expandMat
        inflateColumnSum = np.sum(powerMat, axis=0)
        inflateMat = powerMat / inflateColumnSum
        return inflateMat

    inflateMat = _inflate(expandMat, inflation)

    # Alternate expansion and inflation for numIter rounds.
    for i in range(numIter):
        expand = _expand(inflateMat, power)
        inflateMat = _inflate(expand, inflation)
    print(inflateMat)
    # Nonzero entries mark the clusters. (The original compared against
    # np.zeros((7, 7)), which breaks for the 4x4 example; use the
    # matrix's own shape instead.)
    print(np.zeros(inflateMat.shape) != inflateMat)

if __name__ == "__main__":
    dimension = 4   # unused by markovCluster; kept from the original
    numIter = 10
    adjacencyMat = np.array([[1, 1, 1, 1],
                             [1, 1, 0, 1],
                             [1, 0, 1, 0],
                             [1, 1, 0, 1]])
    # adjacencyMat = np.array([[1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 1, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [0, 1, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1]])
    markovCluster(adjacencyMat, dimension, numIter)
```
Output:

```
[[1.00000000e+000 1.00000000e+000 1.00000000e+000 1.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]]
[[ True  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True  True  True  True]]
```
From this output we can read off the clustering $\{\{1,2,4\},\{3\}\}$.
Related links:
- Spectral clustering
- MCL
- MCL GitHub