
Prototype-Based Clustering (1): The k-Means Algorithm and a Python Implementation

Prototype-Based Clustering

Prototype-based clustering algorithms assume that the clustering structure can be characterized by a set of prototypes, and they are widely used in real-world clustering tasks. Typically, such an algorithm first initializes the prototypes and then refines them through iterative updates. The way I read it, a "prototype" is essentially "the original model": algorithms of this family try to recover a model of the process that generated the dataset.

The k-Means Algorithm (k-means)

Given a sample set $D=\{x_1, x_2, \ldots, x_m\}$, the k-means algorithm requires the user to supply the number of clusters $k$, and partitions the samples of $D$ into a cluster set $C=\{C_1, C_2, \ldots, C_k\}$ so as to minimize the squared error

$$E=\sum_{i=1}^{k}\sum_{x\in C_{i}}\left\| x-\mu_{i} \right\|_{2}^{2}$$

where $\mu_{i}=\frac{1}{\left| C_{i} \right|}\sum_{x\in C_{i}}x$ is the mean vector of cluster $C_{i}$. The smaller $E$ is, the higher the similarity of the samples within each cluster, and the better the clustering.
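
To make the objective concrete, here is a minimal NumPy sketch that computes each mean vector and the error E for a given hard assignment (the function name sse and the labels array are illustrative additions of mine, not part of the implementation further below):

import numpy as np

def sse(X, labels, k):
    # squared-error objective E for a hard partition of X into k clusters
    E = 0.0
    for i in range(k):
        cluster = X[labels == i]           # samples assigned to cluster C_i
        mu = cluster.mean(axis=0)          # mean vector mu_i
        E += ((cluster - mu) ** 2).sum()   # adds ||x - mu_i||^2 over C_i
    return E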

Minimizing $E$ is an NP-hard problem (NP stands for non-deterministic polynomial), so the k-means algorithm adopts a greedy strategy and approximates a solution by iterating: each sample is assigned to the cluster of its nearest mean vector, every mean vector is then recomputed from the samples assigned to it, and the two steps repeat until the mean vectors stop changing.

[Figure: pseudocode of the k-means algorithm]

(Figure from *Machine Learning* by Zhou Zhihua.) In the pseudocode above, the initial $\mu$ values are chosen at random, which makes the algorithm unstable. My implementation instead selects the $k$ sample points that lie farthest apart from one another as the initial $\mu$, as sketched below.
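
Here is a minimal sketch of that initialization idea (the function name farthest_first_init and the seed parameter are my own illustration; the full implementation below does the same thing in its get_mu method):

import numpy as np

def farthest_first_init(X, k, seed=0):
    # Pick one random sample, then repeatedly add the sample whose distance
    # to its nearest already-chosen mean is largest.
    rng = np.random.default_rng(seed)
    mus = [X[rng.integers(X.shape[0])]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - mu, axis=1) for mu in mus], axis=0)
        mus.append(X[np.argmax(d)])
    return np.array(mus)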

Drawbacks:

  1. The value of k must be supplied by the user, and different values of k produce different results
  2. The result is sensitive to the choice of the initial cluster centers (a quick demonstration follows this list)
  3. It is ill-suited to discovering non-convex clusters or clusters whose sizes differ greatly
  4. Extreme values (outliers) have a large influence on the model
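
Drawback 2 is easy to observe with scikit-learn's KMeans: run it with a single random initialization under different seeds and compare the final value of the objective. This is a standalone demo of my own, separate from the class implemented below:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
for seed in (0, 1, 2):
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    # inertia_ is the squared-error objective E; it can differ between runs
    print(seed, km.inertia_)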

Advantages:

  1. Easy to understand, with decent clustering quality
  2. It scales well to large datasets (it performs well on data of various sizes, and its efficiency does not drop off quickly as the data grows) and is efficient
  3. When the clusters are approximately Gaussian, the results are very good

Python 3.6 implementation:

# -*- coding: utf-8 -*-
import copy
from collections import Counter

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_blobs, make_moons


class KMeans():
    def __init__(self, k=3, max_iter=300):
        self.k = k
        self.max_iter = max_iter

    def dist(self, x1, x2):
        # Euclidean distance between two points
        return np.linalg.norm(x1 - x2)

    def get_label(self, x):
        # index of the mean vector closest to sample x
        min_dist_with_mu = float('inf')
        label = -1
        for i in range(self.mus_array.shape[0]):
            dist_with_mu = self.dist(self.mus_array[i], x)
            if min_dist_with_mu > dist_with_mu:
                min_dist_with_mu = dist_with_mu
                label = i
        return label

    def get_mu(self, X):
        # farthest-first initialization: start from one random sample, then
        # repeatedly add the sample farthest from all means chosen so far
        index = np.random.choice(X.shape[0], 1, replace=False)
        mus = [X[index]]
        for _ in range(self.k - 1):
            max_dist_index = 0
            max_distance = 0
            for j in range(X.shape[0]):
                min_dist_with_mu = float('inf')
                for mu in mus:
                    dist_with_mu = self.dist(mu, X[j])
                    if min_dist_with_mu > dist_with_mu:
                        min_dist_with_mu = dist_with_mu
                if max_distance < min_dist_with_mu:
                    max_distance = min_dist_with_mu
                    max_dist_index = j
            mus.append(X[max_dist_index])
        mus_array = np.array([])
        for i in range(self.k):
            if i == 0:
                mus_array = mus[i]
            else:
                mus[i] = mus[i].reshape(mus[0].shape)
                mus_array = np.append(mus_array, mus[i], axis=0)
        return mus_array

    def init_mus(self):
        # zero the means so they can be re-accumulated from scratch
        for i in range(self.mus_array.shape[0]):
            self.mus_array[i] = np.array([0] * self.mus_array.shape[1])

    def fit(self, X):
        self.mus_array = self.get_mu(X)
        n_iter = 0
        while n_iter < self.max_iter:
            old_mus_array = copy.deepcopy(self.mus_array)
            Y = []
            # assign every sample in X to its nearest mean vector
            for i in range(X.shape[0]):
                y = self.get_label(X[i])
                Y.append(y)
            self.init_mus()
            # accumulate the samples belonging to each cluster
            for i in range(len(Y)):
                self.mus_array[Y[i]] += X[i]
            count = Counter(Y)
            # recompute each mean from its cluster's accumulated sum
            for i in range(self.k):
                if count[i] > 0:
                    self.mus_array[i] = self.mus_array[i] / count[i]
                else:
                    # keep the previous mean if a cluster received no samples
                    self.mus_array[i] = old_mus_array[i]
            diff = 0
            for i in range(self.mus_array.shape[0]):
                diff += np.linalg.norm(self.mus_array[i] - old_mus_array[i])
            if diff == 0:
                # the means did not move: the algorithm has converged
                break
            n_iter += 1
        self.E = 0
        for i in range(X.shape[0]):
            # squared distance, matching the definition of E above
            self.E += self.dist(X[i], self.mus_array[Y[i]]) ** 2
        print('E = {}'.format(self.E))
        return np.array(Y)


if __name__ == '__main__':
    fig = plt.figure(1)
    plt.subplot(221)
    center = [[1, 1], [-1, -1], [1, -1]]
    cluster_std = 0.35
    X1, Y1 = make_blobs(n_samples=1000, centers=center, n_features=2,
                        cluster_std=cluster_std, random_state=1)
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
    plt.subplot(222)
    km1 = KMeans(k=3)
    km_Y1 = km1.fit(X1)
    mus = km1.mus_array
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=km_Y1)
    plt.scatter(mus[:, 0], mus[:, 1], marker='^', c='r')
    plt.subplot(223)
    X2, Y2 = make_moons(n_samples=1000, noise=0.1)
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)
    plt.subplot(224)
    km2 = KMeans(k=2)
    km_Y2 = km2.fit(X2)
    mus = km2.mus_array
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=km_Y2)
    plt.scatter(mus[:, 0], mus[:, 1], marker='^', c='r')
    plt.show()
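
As an optional sanity check (my own addition, not part of the original post), the E printed above can be compared against scikit-learn's KMeans on the same blob data by appending something like this inside the __main__ block; sklearn's inertia_ is the same squared-error objective:

    from sklearn.cluster import KMeans as SKKMeans

    sk = SKKMeans(n_clusters=3, n_init=10, random_state=1).fit(X1)
    # should be close to the E reported by km1.fit(X1) above
    print('sklearn E = {}'.format(sk.inertia_))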

Running the script produces the figure below:

[Figure: clustering results on the blob and moon datasets]

In the figure, the left column shows the raw generated data and the right column shows the k-means clustering result, with red triangles marking the final mean vectors. k-means recovers the Gaussian blobs in the top row well, but, as expected from drawback 3, it cannot separate the two non-convex moons in the bottom row.

References

Zhou Zhihua. Machine Learning (机器学习). Tsinghua University Press, 2016.