K-Means演算法之K值的選擇

K-means 演算法 · 發表 2018-09-17 23:52:39

摘要： K-Means 是一個超級簡單的聚類方法，說他簡單，主要原因是使用它時只需設定一個K值（設定需要將資料聚成幾類）。但問題是，有時候我們拿到的資料根本不知道要分為幾類，對於二維的資料，我們還能通過肉眼觀察法進行確定，超過二維的資料怎麼辦？今天就一起來學習下。拍腦袋法一個...

K-Means 是一個超級簡單的聚類方法，說他簡單，主要原因是使用它時只需設定一個K值（設定需要將資料聚成幾類）。但問題是，有時候我們拿到的資料根本不知道要分為幾類，對於二維的資料，我們還能通過肉眼觀察法進行確定，超過二維的資料怎麼辦？今天就一起來學習下。

拍腦袋法

一個非常快速的，拍腦袋的方法是將樣本量除以2再平方出來的值作為K值，具體公式為：

肘部法則（Elbow Method）

Elbow Method ：Elbow意思是手肘，如下圖左所示，此種方法適用於 K 值相對較小的情況，當選擇的k值小於真正的時，k每增加1，cost值就會大幅的減小；當選擇的k值大於真正的K時， k每增加1，cost值的變化就不會那麼明顯。這樣，正確的k值就會在這個轉折點，類似elbow的地方。如下圖：

通過畫K與cost function的關係曲線圖，如左圖所示，肘部的值(cost function開始時下降很快，在肘部開始平緩了)做為K值，K=3。並不是所有的問題都可以通過畫肘部圖來解決，有的問題如右邊的那個圖，肘點位置不明顯（肘點可以是3，4，5），這時就無法確定K值了。故肘部圖是可以嘗試的一種方法，但是並不是對所有的問題都能畫出如左邊那麼好的圖來確定K值。

Elbow Method公式：

Python實現：

# clustering dataset
# determine k using elbow method

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6, 7, 8, 9, 8, 9, 9, 8])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, 1, 2, 1, 2, 3, 2, 3])

plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()

# create new plot and data
plt.plot()
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
colors = ['b', 'g', 'r']
markers = ['o', 'v', 's']

# k means determine k
distortions = []
K = range(1, 10)
for k in K:
kmeanModel = KMeans(n_clusters=k).fit(X)
kmeanModel.fit(X)
distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])

# Plot the elbow
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

間隔統計量 Gap Statistic

根據肘部法則選擇最合適的K值有事並不是那麼清晰，因此斯坦福大學的Robert等教授提出了Gap Statistic方法。

Gap Statistic的基本思路是：引入參考的測值，這個參考值可以有Monte Carlo取樣的方法獲得。

B是sampling的次數。為了修正MC帶來的誤差，我們計算sk也即標準差來矯正Gap Statistic。

選擇滿足的最小的k作為最優的聚類個數。下圖闡釋了Gap Statistic的過程。

Python實現：

import scipy
fromscipy.spatial.distance import euclidean
from sklearn.cluster import KMeans as k_means

dst = euclidean

k_means_args_dict = {
'n_clusters': 0,
# drastically saves convergence time
'init': 'k-means++',
'max_iter': 100,
'n_init': 1,
'verbose': False,
# 'n_jobs':8
}


def gap(data, refs=None, nrefs=20, ks=range(1, 11)):
"""
I: NumPy array, reference matrix, number of reference boxes, number of clusters to test
O: Gaps NumPy array, Ks input list

Give the list of k-values for which you want to compute the statistic in ks. By Gap Statistic
from Tibshirani, Walther.
"""
shape = data.shape

if not refs:
tops = data.max(axis=0)
bottoms = data.min(axis=0)
dists = scipy.matrix(scipy.diag(tops - bottoms))
rands = scipy.random.random_sample(size=(shape[0], shape[1], nrefs))
for i in range(nrefs):
rands[:, :, i] = rands[:, :, i] * dists + bottoms
else:
rands = refs

gaps = scipy.zeros((len(ks),))

for (i, k) in enumerate(ks):
k_means_args_dict['n_clusters'] = k
kmeans = k_means(**k_means_args_dict)
kmeans.fit(data)
(cluster_centers, point_labels) = kmeans.cluster_centers_, kmeans.labels_

disp = sum(
[dst(data[current_row_index, :], cluster_centers[point_labels[current_row_index], :]) for current_row_index
in range(shape[0])])

refdisps = scipy.zeros((rands.shape[2],))

for j in range(rands.shape[2]):
kmeans = k_means(**k_means_args_dict)
kmeans.fit(rands[:, :, j])
(cluster_centers, point_labels) = kmeans.cluster_centers_, kmeans.labels_
refdisps[j] = sum(
[dst(rands[current_row_index, :, j], cluster_centers[point_labels[current_row_index], :]) for
current_row_index in range(shape[0])])

# let k be the index of the array 'gaps'
gaps[i] = scipy.mean(scipy.log(refdisps)) - scipy.log(disp)

return ks, gaps

輪廓係數（Silhouette Coefficient）

Silhouette method 會衡量物件和所屬簇之間的相似度——即內聚性（cohesion）。當把它與其他簇做比較，就稱為分離性（separation）。該對比通過 silhouette 值來實現，後者在 [-1, 1] 範圍內。Silhouette 值接近 1，說明物件與所屬簇之間有密切聯絡；反之則接近 -1。若某模型中的一個數據簇，生成的基本是比較高的 silhouette 值，說明該模型是合適、可接受的。

方法：

1）計算樣本i到同簇其他樣本的平均距離a(i)。a(i)越小，說明樣本i越應該被聚類到該簇。將a(i)稱為樣本i的簇內不相似度。簇C中所有樣本的a(i)均值稱為簇C的簇不相似度。

2）計算樣本i到其他某簇C(j)的所有樣本的平均距離b(ij)，稱為樣本i與簇C(j)的不相似度。定義為樣本i的簇間不相似度：b(i) =min{bi1, bi2, …, bik}，b(i)越大，說明樣本i越不屬於其他簇。

3）根據樣本i的簇內不相似度a i 和簇間不相似度b i ，定義樣本i的輪廓係數：

4）判斷：

s(i)接近1，則說明樣本i聚類合理
s(i)接近-1，則說明樣本i更應該分類到另外的簇
若s(i) 近似為0，則說明樣本i在兩個簇的邊界上

所有樣本的s(i )的均值稱為聚類結果的輪廓係數，是該聚類是否合理、有效的度量。但是，其缺陷是計算複雜度為O(n^2)，需要計算距離矩陣，那麼當資料量上到百萬，甚至千萬級別時，計算開銷會非常巨大。

Python實現：

import numpy as np
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 10))
plt.subplot(3, 2, 1)
x1 = np.array([1, 2, 3, 1, 5, 6, 5, 5, 6, 7, 8, 9, 7, 9])
x2 = np.array([1, 3, 2, 2, 8, 6, 7, 6, 7, 1, 2, 1, 1, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Sample')
plt.scatter(x1, x2)
colors = ['b', 'g', 'r', 'c', 'm', 'y', 'k', 'b']
markers = ['o', 's', 'D', 'v', '^', 'p', '*', '+']
tests = [2, 3, 4, 5, 8]
subplot_counter = 1
for t in tests:
subplot_counter += 1
plt.subplot(3, 2, subplot_counter)
kmeans_model = KMeans(n_clusters=t).fit(X)
for i, l in enumerate(kmeans_model.labels_):
plt.plot(x1[i], x2[i], color=colors[l], marker=markers[l],ls='None')
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('K = %s, Silhouette method = %.03f' % (t, metrics.silhouette_score(X, kmeans_model.labels_,metric='euclidean')))
plt.show()

Canopy演算法

肘部法則（Elbow Method）和輪廓係數（Silhouette Coefficient）來對k值進行最終的確定，但是這些方法都是屬於“事後”判斷的，而Canopy演算法的作用就在於它是通過事先粗聚類的方式，為k-means演算法確定初始聚類中心個數和聚類中心點。

與傳統的聚類演算法(比如K-Means)不同，Canopy聚類最大的特點是不需要事先指定k值(即clustering的個數)，因此具有很大的實際應用價值。與其他聚類演算法相比，Canopy聚類雖然精度較低，但其在速度上有很大優勢，因此可以使用Canopy聚類先對資料進行“粗”聚類，得到k值，以及大致的k箇中心點，再使用K-Means進行進一步“細”聚類。所以Canopy+K-Means這種形式聚類演算法聚類效果良好。

Canopy演算法解析：

原始資料集合List按照一定的規則進行排序（這個規則是任意的，但是一旦確定就不再更改），初始距離閾值為T1、T2，且T1>T2（T1、T2的設定可以根據使用者的需要，或者使用交叉驗證獲得）。
在List中隨機挑選一個數據向量A，使用一個粗糙距離計算方式計算A與List中其他樣本資料向量之間的距離d。
根據第2步中的距離d，把d小於T1的樣本資料向量劃到一個canopy中，同時把d小於T2的樣本資料向量從候選中心向量名單（這裡可以理解為就是List）中移除。
重複第2、3步，直到候選中心向量名單為空，即List為空，演算法結束。

演算法原理比較簡單，就是對資料進行不斷遍歷，T2<dis<T1的可以作為中心名單，dis<T2的認為與canopy太近了，以後不會作為中心點，從list中刪除，這樣的話一個點可能屬於多個canopy。

canopy效果圖如下：

Canopy演算法優勢：

Kmeans對噪聲抗干擾較弱，通過Canopy對比較小的NumPoint的Cluster直接去掉有利於抗干擾。
Canopy選擇出來的每個Canopy的centerPoint作為Kmeans比較科學。
只是針對每個Canopy的內容做Kmeans聚類，減少相似計算的數量。

Canopy演算法缺點：

演算法中 T1、T2（T2 < T1）的確定問題

Python實現：

# -*- coding: utf-8 -*-
import math
import random
import numpy as np
import matplotlib.pyplot as plt

class Canopy:
def __init__(self, dataset):
self.dataset = dataset
self.t1 = 0
self.t2 = 0

# 設定初始閾值
def setThreshold(self, t1, t2):
if t1 > t2:
self.t1 = t1
self.t2 = t2
else:
print('t1 needs to be larger than t2!')

# 使用歐式距離進行距離的計算
def euclideanDistance(self, vec1, vec2):
return math.sqrt(((vec1 - vec2)**2).sum())

# 根據當前dataset的長度隨機選擇一個下標
def getRandIndex(self):
return random.randint(0, len(self.dataset) - 1)

def clustering(self):
if self.t1 == 0:
print('Please set the threshold.')
else:
canopies = []# 用於存放最終歸類結果
while len(self.dataset) != 0:
rand_index = self.getRandIndex()
current_center = self.dataset[rand_index]# 隨機獲取一箇中心點，定為P點
current_center_list = []# 初始化P點的canopy類容器
delete_list = []# 初始化P點的刪除容器
self.dataset = np.delete(
self.dataset, rand_index, 0)# 刪除隨機選擇的中心點P
for datum_j in range(len(self.dataset)):
datum = self.dataset[datum_j]
distance = self.euclideanDistance(
current_center, datum)# 計算選取的中心點P到每個點之間的距離
if distance < self.t1:
# 若距離小於t1，則將點歸入P點的canopy類
current_center_list.append(datum)
if distance < self.t2:
delete_list.append(datum_j)# 若小於t2則歸入刪除容器
# 根據刪除容器的下標，將元素從資料集中刪除
self.dataset = np.delete(self.dataset, delete_list, 0)
canopies.append((current_center, current_center_list))
return canopies


def showCanopy(canopies, dataset, t1, t2):
fig = plt.figure()
sc = fig.add_subplot(111)
colors = ['brown', 'green', 'blue', 'y', 'r', 'tan', 'dodgerblue', 'deeppink', 'orangered', 'peru', 'blue', 'y', 'r',
'gold', 'dimgray', 'darkorange', 'peru', 'blue', 'y', 'r', 'cyan', 'tan', 'orchid', 'peru', 'blue', 'y', 'r', 'sienna']
markers = ['*', 'h', 'H', '+', 'o', '1', '2', '3', ',', 'v', 'H', '+', '1', '2', '^',
'<', '>', '.', '4', 'H', '+', '1', '2', 's', 'p', 'x', 'D', 'd', '|', '_']
for i in range(len(canopies)):
canopy = canopies[i]
center = canopy[0]
components = canopy[1]
sc.plot(center[0], center[1], marker=markers[i],
color=colors[i], markersize=10)
t1_circle = plt.Circle(
xy=(center[0], center[1]), radius=t1, color='dodgerblue', fill=False)
t2_circle = plt.Circle(
xy=(center[0], center[1]), radius=t2, color='skyblue', alpha=0.2)
sc.add_artist(t1_circle)
sc.add_artist(t2_circle)
for component in components:
sc.plot(component[0], component[1],
marker=markers[i], color=colors[i], markersize=1.5)
maxvalue = np.amax(dataset)
minvalue = np.amin(dataset)
plt.xlim(minvalue - t1, maxvalue + t1)
plt.ylim(minvalue - t1, maxvalue + t1)
plt.show()


if __name__ == "__main__":
dataset = np.random.rand(500, 2)# 隨機生成500個二維[0,1)平面點
t1 = 0.6
t2 = 0.4
gc = Canopy(dataset)
gc.setThreshold(t1, t2)
canopies = gc.clustering()
print('Get %s initial centers.' % len(canopies))
showCanopy(canopies, dataset, t1, t2)

參考資料：

K-Means演算法之K值的選擇

拍腦袋法

肘部法則（Elbow Method）

間隔統計量 Gap Statistic

輪廓係數（Silhouette Coefficient）

Canopy演算法

您可能也會喜歡…