Spectral Clustering
Contents
- Spectral Clustering
- 1. Measuring Information Gain
- 2. Spectral Clustering: Finding the Optimal Function Vector $\boldsymbol{f}$
- 2.1 Finding a Single Optimal Function Vector $\boldsymbol{f}$
- 2.2 Finding Multiple, More Robust Function Vectors
- 2.3 The Spectral Clustering Algorithm
- Summary
1. Measuring Information Gain
Unlike image pixels, which carry pixel values, the samples of a data set $X=[\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots, \boldsymbol{x}_n]$ with a graph structure have no regular layout and no predefined function $f$. We therefore assign each sample a function value $\{f_i\}_{i=1}^n$, collected into a function vector $\boldsymbol{f}$, in order to capture the convexity (or variation) of the whole data set.
This variation can be expressed through the graph Laplacian as
$$\boldsymbol{h} = L\boldsymbol{f}=(D-W)\boldsymbol{f}=D\boldsymbol{f}-W\boldsymbol{f}$$
The $i$-th component $h_i$ of $\boldsymbol{h}$ can then be written as
$$h_i = d_if_i-w_{i,:}\boldsymbol{f}=\sum_{j\in N_i}w_{ij}f_i-\sum_{j\in N_i}w_{ij}f_j=\sum_{j\in N_i}w_{ij}(f_i-f_j)$$
That is, $h_i$ is the sum of the differences between vertex $v_i$ and its neighbors. A quantity measuring the total difference between all vertices and their neighbors can be constructed in several ways:
- Option 1 (hard to compute): $\sum_{i=1}^n h_i^2 =\sum_{i=1}^n\Big(\sum_{j\in N_i}w_{ij}(f_i-f_j)\Big)^2$
- Option 2 (positive and negative terms cancel): $\sum_{i=1}^n h_i =\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_i-f_j)$
- Option 3 (a non-negative sum): $\sum_{i=1}^n f_i\cdot h_i =\sum_{i=1}^nf_i\sum_{j\in N_i}w_{ij}(f_i-f_j)$
Of these three, Option 3 is the most suitable measure of the total difference: as the derivation below shows, it expands into a weighted sum of squared differences, so it is non-negative and suffers no cancellation.
$$\begin{aligned} \sum_{i=1}^n f_i\cdot h_i &=\sum_{i=1}^nf_i\sum_{j\in N_i}w_{ij}(f_i-f_j)\\ &=\sum_{i=1}^n\sum_{j\in N_i}w_{ij}f_i(f_i-f_j)\\ &=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(2f_if_i-2f_if_j)\\ &=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_if_i-2f_if_j+f_jf_j)\\ &=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_i-f_j)^2 \end{aligned}$$
where the second-to-last step replaces one of the two $w_{ij}f_if_i$ terms by $w_{ij}f_jf_j$, using the symmetry $w_{ij}=w_{ji}$.
Therefore, the total information gain of the data set can be measured as
$$\begin{aligned} \sum_{i=1}^n f_i\cdot h_i&=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_i-f_j)^2\\ &=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_i^2+f_j^2-2f_if_j)\\ &=\sum_{i=1}^n\sum_{j\in N_i}w_{ij}f_i^2-\sum_{i=1}^n\sum_{j\in N_i}w_{ij}f_if_j\\ &=\sum_{i=1}^n(w_{i1}f_1^2+w_{i2}f_2^2+\cdots+w_{in}f_n^2)-\boldsymbol f^{\top}W\boldsymbol f\\ &=(f_1,f_2,\cdots,f_n)\begin{pmatrix} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & &d_n \end{pmatrix}\begin{pmatrix} f_1\\ f_2\\ \vdots \\ f_n \end{pmatrix}-(f_1,f_2,\cdots,f_n)\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22}& \cdots & w_{2n}\\ \vdots &\vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots &w_{nn} \end{pmatrix}\begin{pmatrix} f_1\\ f_2\\ \vdots \\ f_n \end{pmatrix}\\ &=\boldsymbol{f}^{\top}D\boldsymbol{f}-\boldsymbol{f}^{\top}W\boldsymbol{f}\\ &=\boldsymbol{f}^{\top}(D-W)\boldsymbol{f}\\ &=\boldsymbol{f}^{\top}L\boldsymbol{f} \end{aligned}$$
In summary, for a function $\boldsymbol{f}=(f_1,f_2,\cdots,f_n)^\top$ on the discrete sample points, with $w_{ij}$ the weight between two sample points, the total information gain over the data set is
$$\boldsymbol{f}^{\top}L\boldsymbol{f}=\frac{1}{2}\sum_{i=1}^n\sum_{j\in N_i}w_{ij}(f_i-f_j)^2$$
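As a quick sanity check of this identity, here is a minimal sketch (the 4-vertex graph and its weights are made up for illustration) comparing both sides numerically:

```python
import numpy as np

# toy symmetric weight matrix of a 4-vertex graph (values chosen arbitrarily)
W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.2, 0.0],
              [0.5, 0.2, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
D = np.diag(W.sum(axis=1))            # degree matrix
L = D - W                             # graph Laplacian
f = np.array([0.3, -1.0, 0.7, 2.0])   # an arbitrary function vector

lhs = f @ L @ f                                         # f^T L f
rhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)  # (1/2) sum_ij w_ij (f_i - f_j)^2
print(lhs, rhs)  # the two values agree
```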
2. Spectral Clustering: Finding the Optimal Function Vector $\boldsymbol{f}$
2.1 Finding a Single Optimal Function Vector $\boldsymbol{f}$
We seek a suitable function value $f_i$ for each sample point, i.e. the evaluations $\Big\{f_i(\boldsymbol{x}_i)\Big\}_{i=1}^n$. To make the values comparable on a common scale, we normalize the function vector to unit length, which gives the constrained optimization problem
$$\min_{\boldsymbol{f}}\ \boldsymbol{f}^{\top}L\boldsymbol{f}\qquad \mathrm{s.t.}\quad \boldsymbol{f}^{\top}\boldsymbol{f}=1$$
Using a Lagrange multiplier, this equality-constrained problem becomes the unconstrained problem
$$\min_{\boldsymbol{f}}Q(\boldsymbol{f})=\boldsymbol{f}^{\top}L\boldsymbol{f}-\lambda\boldsymbol{f}^{\top}\boldsymbol{f}$$
Taking the derivative and setting it to zero:
$$\frac{\partial Q(\boldsymbol{f})}{\partial \boldsymbol{f}} =2L\boldsymbol{f}-2\lambda\boldsymbol{f}=2(L\boldsymbol{f}-\lambda\boldsymbol{f})=0$$
that is,
$$L\boldsymbol{f}=\lambda\boldsymbol{f}$$
so the problem reduces to finding eigenvectors of the Laplacian.
This is the eigenvalue equation of the Laplacian matrix $L$. Since every row of the Laplacian sums to $0$, the equation has the trivial solution $\lambda=0$ with eigenvector $\boldsymbol{1}$; this solution is independent of the data and is not the optimum we seek. Moreover, the Laplacian is positive semidefinite, so its eigenvalues are non-negative; hence $\boldsymbol{f}$ is the eigenvector associated with the second-smallest eigenvalue.
Left-multiplying both sides of the eigenvalue equation by $\boldsymbol{f}^{\top}$ shows that
$$\min\ \boldsymbol{f}^{\top}L\boldsymbol{f}\ \Leftrightarrow\ \min\ \lambda\,\boldsymbol{f}^{\top}\boldsymbol{f}=\min\ \lambda=\lambda_{\min}$$
that is,
$$\min_{\boldsymbol{f}}\frac{\boldsymbol{f}^{\top}L\boldsymbol{f}}{\boldsymbol{f}^{\top}\boldsymbol{f}}=\lambda_{\min},$$
where the left-hand side is known as the Rayleigh quotient.
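As a small illustration (reusing the toy graph above), the following sketch computes the eigenvector of the second-smallest eigenvalue; its sign pattern already suggests a two-way partition of the vertices:

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.2, 0.0],
              [0.5, 0.2, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

# eigh returns the eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(L)
print(eigvals[0])        # ~0: the trivial solution with constant eigenvector
fiedler = eigvecs[:, 1]  # eigenvector of the second-smallest eigenvalue
print(fiedler > 0)       # the sign pattern splits the vertices into two groups
```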
2.2 Finding Multiple, More Robust Function Vectors
To obtain a more robust result, we can seek several function vectors and then fuse their information. The total information gain over the data set is then
$$\min_{\boldsymbol{f}_1,\cdots,\boldsymbol{f}_k}\sum_{i=1}^k\boldsymbol{f}_i^{\top}L\boldsymbol{f}_i$$
$$\sum_{i=1}^k\boldsymbol{f}_i^{\top}L\boldsymbol{f}_i=\mathrm{tr}\left[\begin{pmatrix} \boldsymbol{f}_1^{\top}\\ \vdots \\ \boldsymbol{f}_k^{\top} \end{pmatrix}\begin{pmatrix} L_{11} & L_{12} & \cdots & L_{1n} \\ L_{21} & L_{22}& \cdots & L_{2n}\\ \vdots &\vdots & \ddots & \vdots \\ L_{n1} & L_{n2} & \cdots &L_{nn} \end{pmatrix}\begin{pmatrix} \boldsymbol{f}_1 & \cdots & \boldsymbol{f}_k \end{pmatrix}\right]=\mathrm{tr}(F^{\top}LF),\qquad F=(\boldsymbol{f}_1,\cdots,\boldsymbol{f}_k),$$
so the objective becomes $\min_{F}\ \mathrm{tr}(F^{\top}LF)$.
Conclusion: this problem is equivalent to finding the eigenvectors of the several smallest eigenvalues of $L$ (with the columns of $F$ constrained to be orthonormal, $F^{\top}F=I$, as in Section 2.1), as the sketch below illustrates.
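A brief numerical check of this equivalence on the same toy Laplacian (with the dense `np.linalg.eigh` standing in for a sparse eigensolver): the trace attained by the $k$ smallest eigenvectors equals the sum of the $k$ smallest eigenvalues.

```python
import numpy as np

W = np.array([[0.0, 1.0, 0.5, 0.0],
              [1.0, 0.0, 0.2, 0.0],
              [0.5, 0.2, 0.0, 2.0],
              [0.0, 0.0, 2.0, 0.0]])
L = np.diag(W.sum(axis=1)) - W

k = 2
eigvals, eigvecs = np.linalg.eigh(L)  # ascending eigenvalues, orthonormal eigenvectors
F = eigvecs[:, :k]                    # eigenvectors of the k smallest eigenvalues
print(np.trace(F.T @ L @ F))          # minimal trace ...
print(eigvals[:k].sum())              # ... equals the sum of the k smallest eigenvalues
```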
2.3 The Spectral Clustering Algorithm
The Laplacian matrix is obtained by constructing the graph's incidence matrix, and the optimization problem is then solved.
- Graph-Laplacian version built from the incidence matrix; this version is somewhat inefficient
```python
import numpy as np
from scipy.sparse import linalg
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.cluster import k_means
import matplotlib.pyplot as plt


class spectralClust_graph:
    def __init__(self, nClust=2, gamma=13.5, tau=0.1):
        self.nClust = nClust  # number of clusters
        self.gamma = gamma    # RBF kernel parameter
        self.tau = tau        # neighborhood-radius (thresholding) parameter

    # Pairwise distance matrix; each column of X is a sample.
    def pairwise_distances(self, X):
        n = X.shape[1]
        G = X.T @ X
        H = np.diag(G).reshape(-1, 1) @ np.ones((1, n))
        dist = np.maximum(H + H.T - 2 * G, 0)  # clamp tiny negatives from round-off
        return np.sqrt(dist)

    # Edge weights of the graph.
    def create_graph_weights(self, X, gamma, tau):
        distance_matrix = self.pairwise_distances(X.T)
        n = distance_matrix.shape[0]
        weights = np.exp(-gamma * distance_matrix)
        weights = weights * (weights >= tau)  # drop edges below the threshold
        graph_weights = []
        for i in range(n):
            for j in range(n):
                if i != j:
                    graph_weights.append(weights[i, j])
        return np.array(graph_weights)

    # Incidence matrix: one row per edge, +/- sqrt(weight) at its endpoints.
    def construct_incidence_matrix(self, X):
        weights = self.create_graph_weights(X, self.gamma, self.tau)
        no_of_samples = X.shape[0]
        # assign an index to every ordered pair of distinct vertices
        edges = []
        for i in range(no_of_samples):
            for j in range(no_of_samples):
                if i != j:
                    edges.append([i, j])
        no_of_edges = len(edges)
        incidence_matrix = np.zeros(shape=(no_of_edges, no_of_samples))
        for index in range(no_of_edges):
            i, j = edges[index]
            incidence_matrix[index, i] = -np.sqrt(weights[index])
            incidence_matrix[index, j] = np.sqrt(weights[index])
        # remove all-zero rows (edges pruned by the threshold)
        id0 = (incidence_matrix == 0).all(1)
        incidence_matrix = incidence_matrix[~id0]
        return incidence_matrix

    def fit(self, X):
        incidence_matrix = self.construct_incidence_matrix(X)
        graph_laplacian = incidence_matrix.T @ incidence_matrix  # L = B^T B
        # smallest-magnitude eigenpairs of L
        eigenvalues, eigenvectors = linalg.eigs(graph_laplacian, k=self.nClust, which='SM')
        Labels = k_means(eigenvectors.real, self.nClust)
        return Labels[1]  # k_means returns (centers, labels, inertia)


if __name__ == "__main__":
    # build the data set
    seed = 13
    np.random.seed(seed)
    no_of_samples = 1000
    X, y = make_moons(n_samples=no_of_samples, noise=0.1, random_state=seed)
    scg = spectralClust_graph()
    unsupervised_labels = scg.fit(X)
    # plot the raw data and the clustering result side by side
    colormap_bright = ListedColormap(['#FF0000', '#0000FF'])  # cluster colors
    plt.figure(figsize=(12, 6))
    plt.subplot(121)
    plt.scatter(X[:, 0], X[:, 1])
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.subplot(122)
    plt.scatter(X[:, 0], X[:, 1], c=unsupervised_labels, cmap=colormap_bright, edgecolors='k')
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.show()
```
- A general spectral clustering version
The Laplacian matrix can also be constructed from the degree matrix as $L=D-W$ and then optimized; the implementation below uses the symmetrically normalized variant.
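For reference, the normalized Laplacian that the `clustering` method below actually forms is the standard symmetric normalization (the code adds a tiny $\varepsilon$ to the degrees for numerical stability):

$$L_{\mathrm{sym}} = I - D^{-1/2}\,W\,D^{-1/2}$$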
```python
import numpy as np
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from scipy.sparse import identity
import matplotlib.pyplot as plt


class spectralCLUST:
    def __init__(self, nClust=2, gamma=13.5, tau=0.1, affinity=None):
        self.nClust = nClust      # number of clusters
        self.gamma = gamma        # RBF kernel parameter
        self.tau = tau            # neighborhood-radius (thresholding) parameter
        self.affinity = affinity  # None: raw samples; 'precomputed': weight matrix

    # Pairwise distance matrix; each column of X is a sample.
    def pairwise_distances(self, X):
        n = X.shape[1]
        G = X.T @ X
        H = np.diag(G).reshape(-1, 1) @ np.ones((1, n))
        dist = np.maximum(H + H.T - 2 * G, 0)  # clamp tiny negatives from round-off
        return np.sqrt(dist)

    # Weight matrix; each column of X is a sample.
    def create_Weights(self, X, gamma, tau):
        distance_matrix = self.pairwise_distances(X)
        weights = np.exp(-gamma * distance_matrix)
        Weights = weights * (weights >= tau)  # drop edges below the threshold
        return Weights

    # Spectral clustering on the symmetrically normalized Laplacian.
    def clustering(self, CKSym):
        N = CKSym.shape[1]
        n = self.nClust
        DN = np.diag(np.divide(1, np.sqrt(np.sum(CKSym, axis=0) + np.finfo(float).eps)))
        LapN = identity(N).toarray().astype(float) - np.matmul(np.matmul(DN, CKSym), DN)
        _, _, vN = np.linalg.svd(LapN)
        vN = vN.T
        kerN = vN[:, N - n:N]  # singular vectors of the n smallest singular values
        normN = np.sqrt(np.sum(np.square(kerN), axis=1))
        kerNS = np.divide(kerN, normN.reshape(len(normN), 1) + np.finfo(float).eps)  # row-normalize
        km = KMeans(n_clusters=n).fit(kerNS)
        return km.labels_

    # Fit; each column of X is a data point, or X is a weight matrix if affinity='precomputed'.
    def fit(self, X):
        if self.affinity == 'precomputed':
            Labels = self.clustering(X)
        else:
            W = self.create_Weights(X, self.gamma, self.tau)
            Labels = self.clustering(W)
        return Labels


if __name__ == "__main__":
    # build the data set
    seed = 13
    np.random.seed(seed)
    no_of_samples = 1000
    X, y = make_moons(n_samples=no_of_samples, noise=0.1, random_state=seed)
    clt = spectralCLUST()
    unsupervised_labels = clt.fit(X.T)
    # plot the raw data and the clustering result side by side
    colormap_bright = ListedColormap(['#FF0000', '#0000FF'])  # cluster colors
    plt.figure(figsize=(12, 6))
    plt.subplot(121)
    plt.scatter(X[:, 0], X[:, 1])
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.subplot(122)
    plt.scatter(X[:, 0], X[:, 1], c=unsupervised_labels, cmap=colormap_bright, edgecolors='k')
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.show()
```
Summary
The second implementation is the more general one: it can handle a raw sample data set as well as a precomputed weight matrix, which will be useful in later chapters.
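For instance, a precomputed affinity matrix can be clustered directly; a hypothetical usage sketch (the block-structured `W` below is made up for illustration):

```python
import numpy as np

# hypothetical weight matrix: 6 points in two strongly connected groups
W = np.zeros((6, 6))
W[:3, :3] = 0.9   # links inside group 1
W[3:, 3:] = 0.9   # links inside group 2
np.fill_diagonal(W, 0)

clt = spectralCLUST(nClust=2, affinity='precomputed')
print(clt.fit(W))  # e.g. [0 0 0 1 1 1] (label order may swap)
```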