Prototype Clustering (1): the k-means Algorithm and a Python Implementation
阿新 • Published: 2018-12-11
Prototype Clustering
Prototype clustering algorithms assume that the cluster structure can be captured by a set of prototypes; they are extremely common in practical clustering tasks. Typically, the algorithm first initializes the prototypes and then iteratively updates them. I take "prototype" here to mean essentially the "original model": algorithms of this kind try to recover the model that generated the data set.
The k-means algorithm
Given a sample set D and a user-specified number of clusters k, the k-means algorithm partitions the samples of D into clusters $C = \{C_1, C_2, \ldots, C_k\}$ so as to minimize the squared error

$$E = \sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_i} \lVert \boldsymbol{x} - \boldsymbol{\mu}_i \rVert_2^2$$

where $\boldsymbol{\mu}_i = \frac{1}{|C_i|}\sum_{\boldsymbol{x} \in C_i} \boldsymbol{x}$ is the mean vector of cluster $C_i$. The smaller E is, the more similar the samples within each cluster are, and the better the clustering.
Minimizing E is NP-hard (NP stands for non-deterministic polynomial), so k-means adopts a greedy strategy and approximates the solution by iterating: alternately assign each sample to its nearest mean vector, then recompute each mean from its assigned samples.
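The two alternating steps of that greedy iteration (often called Lloyd's algorithm) can be sketched in a few lines of NumPy. This is a minimal illustration, separate from the full implementation below; it assumes `X` is an (n, d) sample array, `mus` is a (k, d) array of current means, and every cluster keeps at least one sample:

```python
import numpy as np

def lloyd_step(X, mus):
    """One iteration of Lloyd's algorithm: assign, then update the means."""
    # Assignment step: label each sample with the index of its nearest mean
    dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each mean becomes the centroid of the samples assigned to it
    # (assumes no cluster is left empty)
    new_mus = np.array([X[labels == j].mean(axis=0) for j in range(len(mus))])
    # Squared error E of the current assignment against the updated means
    E = ((X - new_mus[labels]) ** 2).sum()
    return new_mus, labels, E
```

Repeating `lloyd_step` until the means stop moving gives exactly the iteration described above.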
Drawbacks:
- The value of k is chosen by the user, and different k values give different results
- Sensitive to the choice of initial cluster centers
- Not suited to discovering non-convex clusters or clusters of very different sizes
- Outliers have a large influence on the model
Strengths:
- Easy to understand, with decent clustering results
- Scales well to large data sets (efficiency degrades slowly as the data grows) and runs efficiently
- Classifies very well when the clusters are approximately Gaussian
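The first drawback, having to choose k by hand, is commonly handled by scanning several values of k and comparing the squared error (the "elbow" method). A minimal sketch using scikit-learn's own `KMeans` (its `inertia_` attribute is the sum of squared distances to the nearest center, i.e. the E above), not the implementation below:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated Gaussian blobs, so the "true" k is 3
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Inertia always decreases as k grows, but it drops sharply
# only up to the true number of clusters: the "elbow"
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 6)}
```

Plotting `inertias` against k and picking the bend in the curve is a rough but widely used way to choose k.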
Python 3.6 implementation:
```python
# -*- coding: utf-8 -*-
import copy
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt
# samples_generator was removed from sklearn; import make_blobs directly
from sklearn.datasets import make_blobs, make_moons


class KMeans():
    def __init__(self, k=3, max_iter=300):
        self.k = k
        self.max_iter = max_iter

    def dist(self, x1, x2):
        # Euclidean distance between two sample vectors
        return np.linalg.norm(x1 - x2)

    def get_label(self, x):
        # Assign x to the cluster whose mean vector is closest
        min_dist_with_mu = float('inf')
        label = -1
        for i in range(self.mus_array.shape[0]):
            dist_with_mu = self.dist(self.mus_array[i], x)
            if min_dist_with_mu > dist_with_mu:
                min_dist_with_mu = dist_with_mu
                label = i
        return label

    def get_mu(self, X):
        # Farthest-point initialization: pick one sample at random, then
        # repeatedly add the sample farthest from all means chosen so far
        index = np.random.choice(X.shape[0], 1, replace=False)
        mus = []
        mus.append(X[index])
        for _ in range(self.k - 1):
            max_dist_index = 0
            max_distance = 0
            for j in range(X.shape[0]):
                min_dist_with_mu = float('inf')
                for mu in mus:
                    dist_with_mu = self.dist(mu, X[j])
                    if min_dist_with_mu > dist_with_mu:
                        min_dist_with_mu = dist_with_mu
                if max_distance < min_dist_with_mu:
                    max_distance = min_dist_with_mu
                    max_dist_index = j
            mus.append(X[max_dist_index])
        mus_array = np.array([])
        for i in range(self.k):
            if i == 0:
                mus_array = mus[i]
            else:
                mus[i] = mus[i].reshape(mus[0].shape)
                mus_array = np.append(mus_array, mus[i], axis=0)
        return mus_array

    def init_mus(self):
        # Zero out the mean vectors before re-accumulating the cluster sums
        for i in range(self.mus_array.shape[0]):
            self.mus_array[i] = np.array([0] * self.mus_array.shape[1])

    def fit(self, X):
        self.mus_array = self.get_mu(X)
        iter = 0
        while iter < self.max_iter:
            old_mus_array = copy.deepcopy(self.mus_array)
            Y = []
            # Assign each sample in X to its nearest mean
            for i in range(X.shape[0]):
                y = self.get_label(X[i])
                Y.append(y)
            self.init_mus()
            # Sum the samples belonging to each cluster
            for i in range(len(Y)):
                self.mus_array[Y[i]] += X[i]
            count = Counter(Y)
            # Compute the new mean of each cluster
            for i in range(self.k):
                self.mus_array[i] = self.mus_array[i] / count[i]
            diff = 0
            for i in range(self.mus_array.shape[0]):
                diff += np.linalg.norm(self.mus_array[i] - old_mus_array[i])
            if diff == 0:
                # The means did not move: the algorithm has converged
                break
            iter += 1
        self.E = 0
        for i in range(X.shape[0]):
            self.E += self.dist(X[i], self.mus_array[Y[i]])
        print('E = {}'.format(self.E))
        return np.array(Y)


if __name__ == '__main__':
    fig = plt.figure(1)
    plt.subplot(221)
    center = [[1, 1], [-1, -1], [1, -1]]
    cluster_std = 0.35
    # centers are 2-D, so n_features is determined by them
    X1, Y1 = make_blobs(n_samples=1000, centers=center,
                        cluster_std=cluster_std, random_state=1)
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=Y1)
    plt.subplot(222)
    km1 = KMeans(k=3)
    km_Y1 = km1.fit(X1)
    mus = km1.mus_array
    plt.scatter(X1[:, 0], X1[:, 1], marker='o', c=km_Y1)
    plt.scatter(mus[:, 0], mus[:, 1], marker='^', c='r')
    plt.subplot(223)
    X2, Y2 = make_moons(n_samples=1000, noise=0.1)
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=Y2)
    plt.subplot(224)
    km2 = KMeans(k=2)
    km_Y2 = km2.fit(X2)
    mus = km2.mus_array
    plt.scatter(X2[:, 0], X2[:, 1], marker='o', c=km_Y2)
    plt.scatter(mus[:, 0], mus[:, 1], marker='^', c='r')
    plt.show()
```
The resulting plots are shown below:
In each row, the left panel shows the raw generated data and the right panel shows the k-means clustering; the red triangles are the final mean vectors.
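The contrast between the two rows can also be quantified rather than just eyeballed: the adjusted Rand index measures agreement between the generating labels and the k-means labels. A sketch on the same two data sets, using scikit-learn's `KMeans` and `adjusted_rand_score` (an addition for illustration, not part of the original script):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

X1, Y1 = make_blobs(n_samples=1000, centers=[[1, 1], [-1, -1], [1, -1]],
                    cluster_std=0.35, random_state=1)
X2, Y2 = make_moons(n_samples=1000, noise=0.1, random_state=1)

# Convex, Gaussian-like blobs: agreement close to 1
ari_blobs = adjusted_rand_score(
    Y1, KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X1))
# Non-convex moons: k-means cuts across the crescents, agreement is poor
ari_moons = adjusted_rand_score(
    Y2, KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X2))
```

This matches the drawback listed earlier: k-means handles the blobs almost perfectly but cannot recover non-convex cluster shapes.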