[Repost] Hierarchical and k-means clustering with scipy

阿新 • Published: 2018-04-18


Introduction to the scipy cluster library

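scipy.cluster is the clustering package within scipy. It contains two families of clustering methods:
1. Vector quantization (scipy.cluster.vq): supports vector quantization and k-means clustering
2. Hierarchical clustering (scipy.cluster.hierarchy): supports hierarchical clustering and agglomerative clustering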

Implementing the clustering methods: k-means and hierarchical clustering.

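### cluster.py
# Import the required packages
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from scipy.cluster.vq import vq, kmeans, whiten
from scipy.spatial.distance import pdist


# Generate the data to cluster: 20 points with 4 dimensions each
points = np.random.randn(20, 4)

# 1. Hierarchical clustering
# Compute the pairwise distances between points (condensed form), using Euclidean distance
disMat = pdist(points, 'euclidean')
# Run hierarchical clustering with average linkage
Z = sch.linkage(disMat, method='average')
# Draw the result as a dendrogram and save it as plot_dendrogram.png
P = sch.dendrogram(Z)
plt.savefig('plot_dendrogram.png')
# Derive flat cluster labels from the linkage matrix Z
cluster = sch.fcluster(Z, t=1, criterion='inconsistent')

print("Original cluster by hierarchy clustering:")
print(cluster)

# 2. k-means clustering
# Normalize the raw data (whiten scales each feature to unit variance)
data = whiten(points)

# Cluster with kmeans: the first argument is the data, the second is the number of clusters k.
# When the final number of clusters is unknown, one option is to take it from the
# hierarchical clustering result. A fixed number can of course be passed instead.
# kmeans returns two values: the cluster centers and the distortion.
# Only the first is needed here, hence the trailing [0].
centroid = kmeans(data, int(max(cluster)))[0]

# Use vq to assign every point to its nearest cluster center. vq also returns
# two values; [0] holds the label of each point.
label = vq(data, centroid)[0]

print("Final clustering by k-means:")
print(label)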

Run it in a terminal: python cluster.py
Output (the points are random, so the labels differ from run to run):
Original cluster by hierarchy clustering:
[4 3 3 1 3 3 2 3 2 3 2 3 3 2 3 1 3 3 2 2]
Final clustering by k-means:
[1 2 1 3 1 2 0 2 0 0 0 2 1 0 1 3 2 2 0 0]
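The label values themselves are assigned arbitrarily and can be ignored; what matters is which points share a label. The output shows that the hierarchical clustering result does differ from the k-means result.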
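Since only the grouping matters, one way to compare two label vectors is to check, for every pair of points, whether both labelings agree on putting the pair together or apart. The sketch below does this for the two label vectors printed in the sample output. The helper names `co_membership` and `rand_index` are made up for this illustration and are not part of scipy.

```python
import numpy as np

# Label vectors copied from the sample output above
hier = np.array([4, 3, 3, 1, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 1, 3, 3, 2, 2])
km = np.array([1, 2, 1, 3, 1, 2, 0, 2, 0, 0, 0, 2, 1, 0, 1, 3, 2, 2, 0, 0])


def co_membership(labels):
    """Boolean matrix whose (i, j) entry is True when points i and j share a label."""
    labels = np.asarray(labels)
    return labels[:, None] == labels[None, :]


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (together vs. apart)."""
    same_a, same_b = co_membership(a), co_membership(b)
    upper = np.triu_indices(len(a), k=1)  # visit each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))


print("hierarchical vs. k-means:", rand_index(hier, km))
# Renaming the labels leaves the grouping, and therefore the score, unchanged
print("hierarchical vs. renamed copy:", rand_index(hier, hier * 10))
```

A score of 1.0 means the two labelings group the points identically, whatever the label values are; lower scores mean more pairs are treated differently.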

Supplement: how some of the functions are used

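1. linkage(y, method='single', metric='euclidean')
It takes three parameters:
y is the condensed distance matrix produced by pdist; method selects how the distance between two clusters is computed; metric is the distance metric, which is only used when y holds raw observations rather than distances. Three methods are most common:
(1) single: nearest neighbor; the distance between two clusters is the smallest distance between their members
(2) complete: farthest neighbor; the distance between two clusters is the largest distance between their members
(3) average: average distance; the mean over all pairs drawn from the two clusters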

Other methods include weighted, centroid, and so on. For details see: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html#scipy.cluster.hierarchy.linkage
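To see how the choice of method changes the result, the three common methods can be run on the same distance matrix. This is a minimal sketch; the fixed seed and the variable names are arbitrary choices made for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)        # fixed seed, chosen only for repeatability
points = rng.standard_normal((20, 4))
dists = pdist(points, "euclidean")    # condensed distance matrix

results = {}
for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)
    results[method] = Z
    # Each row of Z describes one merge:
    # [index of cluster a, index of cluster b, merge distance, size of the new cluster]
    print(method, "shape:", Z.shape, "final merge distance:", Z[-1, 2])
```

With n points, Z always has n - 1 rows, and the last row is the merge that joins everything into a single cluster.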

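2. fcluster(Z, t, criterion='inconsistent', depth=2, R=None, monocrit=None)
The first argument Z is the matrix returned by linkage, which records the merge hierarchy. t is the clustering threshold ("The threshold to apply when forming flat clusters"). In practice, the choice of this threshold seems to matter a great deal. scipy also offers several ways of applying the threshold (criterion):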

inconsistent : If a cluster node and all its descendants have an inconsistent value less than or equal to t then all its leaf descendants belong to the same flat cluster. When no non-singleton cluster meets this criterion, every node is assigned to its own cluster. (Default)

distance : Forms flat clusters so that the original observations in each flat cluster have no greater a cophenetic distance than t.

……

I left the other parameters at their defaults. For details see:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.fcluster.html#scipy.cluster.hierarchy.fcluster
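The meaning of t depends on the criterion, which the following sketch illustrates. It also uses 'maxclust', a criterion that scipy provides but that is not quoted above; the threshold values are arbitrary and were picked only for the illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)        # fixed seed, chosen only for repeatability
points = rng.standard_normal((20, 4))
Z = linkage(pdist(points), method="average")

# 'inconsistent' (the default): t bounds the inconsistency coefficient
by_inconsistency = fcluster(Z, t=1.0, criterion="inconsistent")
# 'distance': t is a cophenetic distance at which the tree is cut
by_distance = fcluster(Z, t=2.0, criterion="distance")
# 'maxclust': t is the largest number of flat clusters allowed
by_count = fcluster(Z, t=3, criterion="maxclust")

for name, labels in [("inconsistent", by_inconsistency),
                     ("distance", by_distance),
                     ("maxclust", by_count)]:
    print(name, "->", len(np.unique(labels)), "clusters:", labels)
```

When the number of clusters is known in advance, 'maxclust' avoids having to guess a suitable distance or inconsistency threshold.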

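3. kmeans(obs, k_or_guess, iter=20, thresh=1e-05, check_finite=True)
The input obs is the data matrix: each row is an observation and each column a feature dimension. k_or_guess is the number of clusters (or an array of initial centroids). iter is the number of times k-means is run; the cluster centers from the run with the lowest distortion are returned.
There are two outputs: the first is the cluster centers (codebook), and the second is the distortion, which is the mean Euclidean distance from each data point to its nearest cluster center.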

Reference: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html#scipy.cluster.vq.kmeans
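The two return values can be inspected directly. The sketch below, on made-up random data, unpacks both outputs and compares the reported distortion with the mean and the sum of the per-point distances recomputed with vq. The recomputed mean may differ slightly from the reported value, because kmeans stops once the change in distortion falls below thresh.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

rng = np.random.default_rng(0)                 # fixed seed for the data only
data = whiten(rng.standard_normal((200, 4)))   # 200 observations, 4 features

# Unpack both outputs instead of keeping only [0]
codebook, distortion = kmeans(data, 3, iter=20)

# Recompute the distance from every point to its nearest center
_, dists = vq(data, codebook)

print("codebook shape:", codebook.shape)
print("distortion reported by kmeans:", distortion)
print("mean distance to nearest center:", dists.mean())
print("sum of distances to nearest center:", dists.sum())
```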

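4. vq(obs, code_book, check_finite=True)
Assigns every observation to a cluster center. obs is the data, and code_book is the set of cluster centers produced by kmeans.
It also has two outputs: the first is the label of the cluster each observation belongs to; the second is the distortion of each observation, that is, its distance to the nearest center. Unlike the single averaged value returned by kmeans, this is an array with one entry per observation.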
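A tiny hand-made example makes the two outputs concrete. The observations and the code book below are invented for the illustration, and the manual distance computation is there only as a cross-check.

```python
import numpy as np
from scipy.cluster.vq import vq

# Four 2-D observations and two hand-picked centers (illustrative values)
obs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 4.0]])
code_book = np.array([[0.0, 0.0], [5.0, 5.0]])

labels, dists = vq(obs, code_book)
print(labels)  # index of the nearest center for each observation: 0, 0, 1, 1
print(dists)   # distance to that center for each observation: 0, 0.1, 0, 1

# Cross-check: Euclidean distance from every observation to every center
manual = np.linalg.norm(obs[:, None, :] - code_book[None, :, :], axis=2)
print(manual.argmin(axis=1), manual.min(axis=1))
```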

References:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.vq.html#scipy.cluster.vq.vq
https://blog.csdn.net/elaine_bao/article/details/50242867
