Python Machine Learning: The K-means Clustering Algorithm

阿新 • Published: 2019-01-01

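To build a clearer overall picture of machine learning and learn it efficiently, I plan to summarize the main algorithms in outline form. Each summary covers the algorithm's execution flow, its pseudocode, a Python implementation, and the corresponding functions in the sklearn machine learning library, so that I can review and look things up later. I start with the k-means algorithm.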
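I. K-means algorithm flow
First, randomly choose k initial points as centroids. Then, for every point in the dataset, find the nearest centroid and assign the point to that centroid's cluster. Finally, update each cluster's centroid to the mean of all points assigned to that cluster. The assignment and update steps repeat until no point changes cluster.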
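To make one pass of this procedure concrete, here is a minimal sketch on four one-dimensional points. The data and the two starting centroids are made up for illustration and do not come from the post.

```python
import numpy as np

# Toy data and starting centroids, chosen by hand for illustration
points = np.array([1.0, 2.0, 9.0, 10.0])
centroids = np.array([0.0, 5.0])

# Assignment step: each point goes to its nearest centroid
distances = np.abs(points[:, None] - centroids[None, :])  # shape (4, 2)
labels = np.argmin(distances, axis=1)

# Update step: each centroid moves to the mean of its assigned points
new_centroids = np.array([points[labels == k].mean() for k in range(len(centroids))])

print(labels)         # [0 0 1 1]
print(new_centroids)  # [1.5 9.5]
```

Repeating the two steps with new_centroids leaves the labels unchanged, which is the stopping condition.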
II. Pseudocode

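Create k points as the initial centroids
While the cluster assignment of any point changes
	For each data point in the dataset
		For each centroid
			Compute the distance between the centroid and the data point
		Assign the data point to the cluster with the nearest centroid
	For each cluster, compute the mean of all its points and use that mean as the new centroid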

III. Python implementation

import numpy as np
from sklearn import datasets

def prepare_data(data):
    region = np.zeros((2, data.shape[1]))
    region[0, :] = np.min(data, axis=0)
    region[1, :] = np.max(data, axis=0)
    return region
    
def initial_centers(region, num_centers):
    center_raw = np.random.rand(num_centers, region.shape[1])
    interval = region[1, :] - region[0, :]
    inter_mat = np.repeat(np.expand_dims(interval, axis=0), num_centers, axis=0)
    min_mat = np.repeat(np.expand_dims(region[0, :], axis=0), num_centers, axis=0)
    centers = min_mat + center_raw * inter_mat
    return centers

def compute_distance(data, centers):
    # data: ndarray with shape [n_sample, n_feature]
    # centers: ndarray with shape [n_center, n_feature]
    dis_x2c = np.zeros((data.shape[0], centers.shape[0]))
    for i in range(data.shape[0]):
        for j in range(centers.shape[0]):
            dis_x2c[i, j] = distance(data[i, :], centers[j, :])
    return dis_x2c

def distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2)**2))

def assign_nodes(dis_mat):
    # dis_mat: ndarray with shape [n_sample, n_center] which includes the distance between sample and center
    # returns an integer ndarray with shape (n_sample,) holding the index of the nearest center
    return np.argmin(dis_mat, axis=1)

def re_compute_centers(data, x_predict, old_centers):
    # data: ndarray with shape [n_sample, n_feature]
    # x_predict: ndarray with shape (n_sample,)
    # old_centers: ndarray with shape [n_center, n_feature]
    new_centers = old_centers.copy()
    for i in range(old_centers.shape[0]):
        mask = x_predict == i
        # an empty cluster keeps its previous center, which avoids dividing by zero
        if np.any(mask):
            new_centers[i, :] = np.mean(data[mask], axis=0)
    return new_centers

def compute_loss(dis_x2c):
    # mean distance from each sample to its nearest center
    return np.mean(np.min(dis_x2c, axis=1))

def Kmeans(data, num_centers, max_iter=10, threshold=20):
    region = prepare_data(data=data)
    centers = initial_centers(region=region, num_centers=num_centers)
    n_iter, loss, x_predict = 0, np.inf, None
    while n_iter < max_iter and loss > threshold:
        dis_x2c = compute_distance(data=data, centers=centers)
        loss = compute_loss(dis_x2c=dis_x2c)
        new_predict = assign_nodes(dis_mat=dis_x2c)
        # stop once no sample changes cluster, as in the pseudocode
        converged = x_predict is not None and np.array_equal(new_predict, x_predict)
        x_predict = new_predict
        if converged:
            break
        centers = re_compute_centers(data=data, x_predict=x_predict, old_centers=centers)
        n_iter += 1
    return centers, x_predict

if __name__ == '__main__':
    iris = datasets.load_iris()
    X = iris.data
    # Y = iris.target
    # region = prepare_data(X)
    # centers = initial_centers(region, num_centers=3)
    #
    # print(centers)
    centers, x_predict = Kmeans(data=X, num_centers=5, threshold=1)
    print('centers:', centers)
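
The double loop in compute_distance is easy to read but slow on large inputs. As a hedged alternative, the sketch below computes the same Euclidean distance matrix with NumPy broadcasting. The name compute_distance_vectorized and the small check data are hypothetical and not part of the original post.

```python
import numpy as np

def compute_distance_vectorized(data, centers):
    # data: ndarray with shape [n_sample, n_feature]
    # centers: ndarray with shape [n_center, n_feature]
    # diff has shape [n_sample, n_center, n_feature] through broadcasting
    diff = data[:, np.newaxis, :] - centers[np.newaxis, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Small check against a case that can be verified by hand
data = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 0.0]])
dis_x2c = compute_distance_vectorized(data, centers)
print(dis_x2c)
```

The intermediate array holds n_sample * n_center * n_feature values, so this version trades memory for speed.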

IV. Calling the sklearn library
Python's sklearn machine learning library includes the k-means algorithm, so in practice it can be called directly.
1. The main class KMeans

sklearn.cluster.KMeans(
    n_clusters=8,
    init='k-means++',
    n_init=10,
    max_iter=300,
    tol=0.0001,
    precompute_distances='auto',
    verbose=0,
    random_state=None,
    copy_x=True,
    n_jobs=1,
    algorithm='auto'
)

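Parameters:
n_clusters: the number of clusters to form.
init: the method used to choose the initial cluster centers.
n_init: the number of times the algorithm is run with different initial centroids. To reduce the influence of the initial centroids, the algorithm by default initializes them 10 times, runs to completion each time, and returns the best result.
max_iter: the maximum number of iterations of a single run (k-means is an iterative algorithm).
tol: the tolerance, that is, the threshold at which a run is declared converged.
precompute_distances: whether to compute distances in advance, which trades memory for speed. True keeps the whole distance matrix in memory; 'auto' does not precompute when n_samples * n_clusters exceeds 12 million (12e6); with False, the assignment is done by the core Cython routines instead.
verbose: verbosity mode. 0 prints nothing and larger values print progress information during fitting; the default is usually left unchanged.
random_state: controls the random number generation used to initialize the centers. Pass an integer to make results reproducible.
copy_x: a boolean flag for whether the input data may be modified. If True, the data is copied first and the user's input is left untouched. Many scikit-learn interfaces take a similar parameter; understanding how Python passes references to arrays makes it clearer why it is needed.
n_jobs: the number of parallel jobs.
algorithm: the k-means implementation to use, one of 'auto', 'full', or 'elkan'. 'full' is the classical EM-style algorithm.
Although there are many parameters, all of them have default values, so usually none need to be passed; set individual ones as the task requires. This signature matches the scikit-learn releases that were current when the post was written. Later releases removed precompute_distances and n_jobs and renamed 'full' to 'lloyd', so check the documentation for the installed version.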
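To illustrate what n_init does, the sketch below runs k-means several times with a single random initialization each and keeps the run with the lowest inertia. The iris data, the choice of 3 clusters, and the seeds 0 to 4 are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Run k-means several times, each with a single random initialization
inertias = []
for seed in range(5):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    km.fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest center

# Keeping the run with the lowest inertia is what n_init > 1 automates
best_seed = int(np.argmin(inertias))
print('inertia per run:', np.round(inertias, 2))
print('best run:', best_seed)
```

Runs that start from a poor initialization can end with a noticeably higher inertia, which is why several restarts are the default.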
Example:

...
km = KMeans(n_clusters=4, max_iter=2000)  # four clusters, at most 2000 iterations
km.fit(...)  # fit on the training data
d = km.predict(...)  # predict the cluster of each sample
print('KMeans clustering:', ...)
...
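
The snippet above elides its data. As a runnable counterpart, here is a minimal sketch on the iris dataset used in section III. The choice of 3 clusters and random_state=0 are assumptions for this illustration, not values from the post.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 samples, 4 features

km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
km.fit(X)               # learn the cluster centers
labels = km.predict(X)  # index of the nearest center for each sample

print('cluster centers:\n', km.cluster_centers_)
print('first ten labels:', labels[:10])
```

After fitting, the same labels are also available as km.labels_, and km.inertia_ gives the sum of squared distances of the samples to their nearest center.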
