Chapter 11: K-means (Document Clustering Analysis)

阿新 • Published: 2019-01-05

Loading the dataset

A dataset of documents already labeled with four categories (network security, electronics, medicine, space):

import matplotlib.pyplot as plt
import numpy as np

from time import time
from sklearn.datasets import load_files

t = time()
docs = load_files(r'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data')
print("summary: {0} documents in {1} categories." 
.format(len(docs.data), len(docs.target_names)))
print("done in {0} seconds".format(time() - t))

summary: 3949 documents in 4 categories.
done in 5.72968864440918 seconds

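TF-IDF vectorization

A few TfidfVectorizer parameters deserve attention here.

max_df=0.4 means that a word appearing in more than 40% of the documents is treated as a high-frequency word that does not help clustering, so it is dropped when the vocabulary is built.

min_df=2 means that a word that is too rare, appearing in fewer than two documents, is also dropped from the vocabulary.

max_features further limits the size of the vocabulary: terms are ranked by their term frequency across the corpus, and only the top max_features terms are kept.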
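To make the three filters concrete, here is a minimal sketch on a hypothetical six-sentence corpus (the sentences and variable names are illustrative assumptions, not part of the dataset used in this chapter). It only relies on TfidfVectorizer and its vocabulary_ attribute.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, made up for illustration only.
toy_docs = [
    "the rocket reached orbit",
    "the rocket engine failed",
    "the cipher uses a secret key",
    "the cipher key was leaked",
    "the patient took the medicine",
    "the doctor prescribed rest",
]

# max_df=0.4 drops "the" (present in all 6 documents, i.e. more than 40%).
# min_df=2 drops every word that occurs in only one document.
filtered = TfidfVectorizer(max_df=0.4, min_df=2).fit(toy_docs)
filtered_vocab = sorted(filtered.vocabulary_)
print(filtered_vocab)

# max_features=1 keeps only the term with the highest total count in the corpus.
top_one = TfidfVectorizer(max_features=1).fit(toy_docs)
top_one_vocab = sorted(top_one.vocabulary_)
print(top_one_vocab)
```

In this toy case only the words shared by exactly two documents (cipher, key, rocket) survive the document-frequency filters, while max_features=1 on its own keeps the most frequent term, "the".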

from sklearn.feature_extraction.text import TfidfVectorizer
max_features = 20000

t = time()
vectorizer = TfidfVectorizer(max_df=0.4,
                             min_df=2,
                             max_features=max_features,
                             encoding='latin-1')
X = vectorizer.fit_transform((d for d in docs.data))
print("n_samples: %d, n_features: %d" % X.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    docs.filenames[0], X[0].getnnz()))
print("done in {0} seconds".format(time() - t))

n_samples: 3949, n_features: 20000

number of non-zero features in sample [C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.electronics\11902-54322]: 56

done in 2.2819015979766846 seconds

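The output shows that the vector for a single document is sparse: most of its elements are 0. This is easy to understand, since the vocabulary holds 20000 terms while the sample document contains only 56 distinct terms from that vocabulary.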
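For the sample document, 56 non-zero entries out of 20000 is a density of about 0.28%. The same bookkeeping can be sketched on a small hypothetical matrix (the values below are made up), using the same getnnz() call as above plus the nnz attribute of SciPy sparse matrices.

```python
import numpy as np
from scipy import sparse

# Hypothetical 2 x 5 TF-IDF-like matrix; most entries are zero.
toy_dense = np.array([
    [0.0, 0.0, 0.5, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0, 0.7],
])
toy_sparse = sparse.csr_matrix(toy_dense)

# Non-zero entries in the first row, as in X[0].getnnz() above.
first_row_nnz = toy_sparse[0].getnnz()

# Fraction of all entries that are non-zero.
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])
print(first_row_nnz, density)
```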

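K-means clustering

sklearn.cluster.KMeans

Parameters:

n_clusters: the number of clusters; we choose 4.

max_iter=100 means at most 100 k-means iterations are performed in each run.

tol=0.01 is the convergence tolerance: once the cluster centers move less than the tolerance between two consecutive iterations, the algorithm is considered converged and stops iterating. scikit-learn rescales this value using the variance of the data, which is why the log reports a much smaller tolerance.

verbose=1 prints progress information for each iteration.

n_init=3 means k-means is run 3 times and the run with the lowest inertia is kept. As introduced earlier, the algorithm starts from randomly chosen cluster centers, and different starting centers can converge to different results, so running several times and keeping the best run makes the result more stable.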

Attributes：

inertia_ : float，Sum of squared distances of samples to their closest cluster center.

cluster_centers_ : array, [n_clusters, n_features]，Coordinates of cluster centers.

labels_ : Labels of each point
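The roles of max_iter, tol, n_init and inertia_ can be illustrated with a bare-bones NumPy version of Lloyd's algorithm. This is a simplified sketch, not scikit-learn's implementation: the function name lloyd_kmeans, the plain random initialization, the absolute (unscaled) tolerance, and the synthetic two-blob data are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=100, tol=1e-4, seed=0):
    """One k-means run: returns (labels, centers, inertia)."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct samples as starting centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every center, shape (n, k).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:  # centers barely moved: treat as converged
            break
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()  # sum of squared distances to closest center
    return labels, centers, inertia

# Synthetic data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# Equivalent of n_init=3: run three times, keep the run with the lowest inertia.
runs = [lloyd_kmeans(points, k=2, seed=s) for s in range(3)]
all_inertias = [run[2] for run in runs]
best_labels, best_centers, best_inertia = min(runs, key=lambda run: run[2])
print(all_inertias, best_inertia)
```

The key design point is the last step: the runs are not averaged, the single best one (lowest inertia) is returned.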

from sklearn.cluster import KMeans

t = time()
n_clusters = 4
kmean = KMeans(n_clusters=n_clusters, 
               max_iter=100,
               tol=0.01,
               verbose=1,
               n_init=3)
kmean.fit(X);
print("kmean: k={}, cost={}".format(n_clusters, int(kmean.inertia_)))
print("done in {0} seconds".format(time() - t))

Initialization complete
Iteration 0, inertia 7446.126
Iteration 1, inertia 3842.619
…
Iteration 11, inertia 3822.036
Iteration 12, inertia 3822.010
Converged at iteration 12: center shift 0.000000e+00 within tolerance 4.896692e-07
Initialization complete
Iteration 0, inertia 7589.733
Iteration 1, inertia 3842.690
…
Iteration 49, inertia 3814.997
Iteration 50, inertia 3814.995
Converged at iteration 50: center shift 0.000000e+00 within tolerance 4.896692e-07
Initialization complete
Iteration 0, inertia 7565.903
Iteration 1, inertia 3852.316
…
Iteration 26, inertia 3818.030
Iteration 27, inertia 3818.028
Converged at iteration 27: center shift 0.000000e+00 within tolerance 4.896692e-07
kmean: k=4, cost=3814
done in 50.47910118103027 seconds

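The output shows that k-means was run 3 times, converging after 12, 50, and 27 iterations respectively. The reported cost of 3814 is the inertia of the best of these runs (the second one). With that, the 3949 documents have been grouped automatically.

kmean.labels_ holds the cluster label assigned to each document.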

len(kmean.labels_)

3949

As expected, len(kmean.labels_) is 3949. We can also use kmean.labels_[1000:1010] to look at the cluster assignments of the 10 documents at indices 1000 through 1009.

kmean.labels_[1000:1010]

array([0, 0, 0, 1, 2, 1, 2, 0, 1, 1])

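Comparing with the true categories shows that the 9th document was put in the wrong cluster: it actually belongs to sci.space but was grouped with the sci.electronics documents.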

docs.filenames[1000:1010]

array(['C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.crypt\10888-15289',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.crypt\11490-15880',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.crypt\11270-15346',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.electronics\12383-53525',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.space\13826-60862',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.electronics\11631-54106',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.space\14235-61437',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.crypt\11508-15928',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.space\13593-60824',
'C:\Users\Qiuyi\Desktop\scikit-learn code\code\datasets\clustering\data\sci.electronics\12304-52801'],
dtype='<U98')
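Because cluster ids are arbitrary, the comparison has to go through the majority category of each cluster rather than through the id itself. The sketch below does this for the ten documents shown above; the cluster ids and categories are copied from the two outputs, while the variable names and the majority-vote approach are illustrative choices.

```python
from collections import Counter, defaultdict

# Cluster ids from kmean.labels_[1000:1010] above.
cluster_ids = [0, 0, 0, 1, 2, 1, 2, 0, 1, 1]
# True categories, read off the directory names in docs.filenames[1000:1010].
true_categories = [
    "sci.crypt", "sci.crypt", "sci.crypt", "sci.electronics", "sci.space",
    "sci.electronics", "sci.space", "sci.crypt", "sci.space", "sci.electronics",
]

# Count how often each true category appears inside each cluster.
by_cluster = defaultdict(Counter)
for cid, category in zip(cluster_ids, true_categories):
    by_cluster[cid][category] += 1

# Treat the most common category in a cluster as what that cluster "means".
majority = {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}

# Positions (within the slice) whose category differs from their cluster's majority.
mismatches = [
    i for i, (cid, category) in enumerate(zip(cluster_ids, true_categories))
    if majority[cid] != category
]
print(majority)
print(mismatches)
```

The only mismatch is position 8 of the slice, that is, the 9th document: a sci.space article placed in the cluster dominated by sci.electronics.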

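Word weight analysis

argsort():
Sorts a numpy array in ascending order and returns the indices of the sorted elements. The [::-1] slice turns the ascending order into descending order.

Since kmean.cluster_centers_ is a two-dimensional array, kmean.cluster_centers_.argsort()[:, ::-1] sorts the components of each cluster center from largest to smallest and stores the resulting indices in the two-dimensional array order_centroids.

vectorizer.get_feature_names() returns the words in our vocabulary (in scikit-learn 1.0 and later the method is named get_feature_names_out()), so these indices give us the highest-weighted words in each cluster.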
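The index lookup can be traced by hand on a hypothetical vocabulary of four words and two made-up cluster centers (all values below are assumptions for illustration):

```python
import numpy as np

# Hypothetical vocabulary and cluster centers (one row of term weights per cluster).
toy_terms = ["chip", "key", "moon", "orbit"]
toy_centers = np.array([
    [0.6, 0.9, 0.0, 0.1],
    [0.0, 0.1, 0.5, 0.8],
])

# Column indices of each row, ordered from the largest weight to the smallest.
order = toy_centers.argsort()[:, ::-1]

# Map the first two indices of every row back to words.
top_terms = [[toy_terms[j] for j in order[i, :2]] for i in range(len(toy_centers))]
print(top_terms)
```

The first row yields key and chip, the second orbit and moon, which mirrors how the top ten words per cluster are obtained from order_centroids.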

from __future__ import print_function

order_centroids = kmean.cluster_centers_.argsort()[:, ::-1]

# get_feature_names() was removed in scikit-learn 1.2; use it instead on versions before 1.0
terms = vectorizer.get_feature_names_out()
for i in range(n_clusters):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Cluster 0: key clipper chip encryption government will keys escrow we nsa
Cluster 1: my any me by your know some do has so
Cluster 2: space henry nasa toronto pat shuttle zoo moon we orbit
Cluster 3: geb pitt banks gordon shameful dsl n3jxp chastity cadre surrender

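Cluster 1 is not informative: its top words are too generic and could belong to any category.
Cluster 0 works well: it is clearly about network security and corresponds to sci.crypt.
Cluster 2 is sci.space, and Cluster 3 is sci.med.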

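Evaluating clustering performance

For a classification problem, we can simply count the misclassified samples and compute the accuracy of the classifier directly.

For a clustering problem, the cluster ids carry no inherent meaning (the algorithm never sees the labels), so this kind of absolute counting cannot be used for evaluation.

More typically, with k-means we may even choose a value of k that differs from the number of labeled categories.

Entropy is one of the most fundamental concepts in information theory. It measures how disordered a system is, and evaluating a clustering result amounts to comparing the degree of order in the data after clustering with the degree of order in the human-assigned categories.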

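1. Adjusted Rand Index

The Adjusted Rand Index (ARI) measures the similarity between two label assignments.

Advantages:
For any number of clusters and samples, the ARI of a random labeling is very close to 0;
The score is at most 1; negative values indicate a poor result, and the closer to 1 the better;
It can be used to compare clustering algorithms with one another.

Drawbacks:
ARI requires the ground-truth labels.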
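As a worked illustration, the ARI can be computed from pair counts in a few lines of plain Python. This is a sketch of the standard pair-counting formula, written here for clarity; the function name adjusted_rand_index is an assumption, and in practice metrics.adjusted_rand_score from scikit-learn does this job.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts; expects at least two samples."""
    n = len(labels_true)
    # How many samples share each (class, cluster) combination.
    cell_counts = Counter(zip(labels_true, labels_pred))
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)

    # Number of sample pairs that are together in both / in one labeling.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_classes = sum(comb(c, 2) for c in class_counts.values())
    sum_clusters = sum(comb(c, 2) for c in cluster_counts.values())

    expected = sum_classes * sum_clusters / comb(n, 2)  # agreement expected by chance
    maximum = (sum_classes + sum_clusters) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single group
    return (sum_cells - expected) / (maximum - expected)

# Same grouping with the ids swapped: still a perfect match.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every cluster mixes both classes evenly: worse than chance.
fully_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, fully_crossed)
```

The first call shows that renaming the cluster ids does not change the score, which is exactly why ARI is suitable for clustering; the second shows a negative score for a labeling that systematically disagrees with the classes.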

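2. Homogeneity, Completeness, V-measure

Homogeneity: each cluster contains only members of a single class;
Completeness: all members of a given class are assigned to the same cluster;
V-measure is the harmonic mean of homogeneity and completeness.

Advantages:
Bounded scores: 0 is the worst and 1 is the best;
Intuitive interpretation: a clustering with a poor V-measure can be analyzed qualitatively in terms of homogeneity and completeness;
No assumption about cluster structure: the scores can be used to compare algorithms such as k-means and spectral clustering.

Drawbacks:
These metrics are not normalized with respect to random labeling. Depending on the number of samples, the number of clusters, and the ground-truth classes, a completely random labeling does not always yield the same homogeneity and completeness, and therefore not the same V-measure. In particular, random labeling does not yield a score of zero, especially when the number of clusters is large.
When the number of samples is above one thousand and the number of clusters is below 10, this problem can safely be ignored. For smaller sample sizes or larger numbers of clusters, an adjusted index such as the Adjusted Rand Index is safer.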
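These three scores follow directly from entropies, which ties them to the entropy idea mentioned earlier. The sketch below uses the usual definitions, homogeneity = 1 - H(C|K)/H(C) and completeness = 1 - H(K|C)/H(K), with C the true classes and K the clusters; the function names are assumptions, and scikit-learn's metrics functions are what the chapter actually uses.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given): remaining uncertainty in labels once given is known."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())

def homogeneity_completeness_v(labels_true, labels_pred):
    h_classes = entropy(labels_true)
    h_clusters = entropy(labels_pred)
    homogeneity = (1.0 if h_classes == 0
                   else 1.0 - conditional_entropy(labels_true, labels_pred) / h_classes)
    completeness = (1.0 if h_clusters == 0
                    else 1.0 - conditional_entropy(labels_pred, labels_true) / h_clusters)
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    # V-measure: harmonic mean of the two scores.
    v_measure = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v_measure

# Class 1 is split across clusters 1 and 2: pure clusters, but incomplete.
hom, com, v = homogeneity_completeness_v([0, 0, 1, 1], [0, 0, 1, 2])
print(hom, com, v)
```

Here every cluster contains a single class, so homogeneity is 1, but one class is split over two clusters, so completeness drops to about 0.667 and the V-measure to 0.8.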

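3. Silhouette Coefficient

The silhouette coefficient applies when the true class information is unknown. For a single sample, let a be the mean distance to the other samples in its own cluster and b the mean distance to the samples in the nearest other cluster. Its silhouette coefficient is:
$s=\frac{b-a}{\max(a,b)}$
For a set of samples, the silhouette coefficient is the mean over all samples. It ranges over [-1, 1]: the closer samples are within a cluster and the farther apart different clusters are, the higher the score.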
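The formula can be checked on four hypothetical one-dimensional points. The function below is a direct, unoptimized transcription of the definition using Euclidean distance; its name and the convention of scoring a single-member cluster as 0 are assumptions of this sketch.

```python
import numpy as np

def silhouette_mean(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cluster_ids = set(labels.tolist())
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # a sample is not compared with itself
        if not same.any():
            scores.append(0.0)  # single-member cluster: scored as 0 here
            continue
        a = dist[i, same].mean()  # mean distance within its own cluster
        # Mean distance to each other cluster; keep the nearest one.
        b = min(dist[i, labels == other].mean()
                for other in cluster_ids if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters on a line.
toy_points = [[0.0], [1.0], [10.0], [11.0]]
toy_labels = [0, 0, 1, 1]
toy_score = silhouette_mean(toy_points, toy_labels)
print(toy_score)
```

For the point at 0, a = 1 and b = 10.5, so s = 9.5 / 10.5; for the point at 1, a = 1 and b = 9.5, so s = 8.5 / 9.5. The overall score is close to 0.9, far above the 0.004 reported for the document clusters, where the high-dimensional sparse vectors overlap heavily.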

from sklearn import metrics

labels = docs.target
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, kmean.labels_))
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, kmean.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, kmean.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, kmean.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, kmean.labels_, sample_size=1000))

Adjusted Rand-Index: 0.237
Homogeneity: 0.375
Completeness: 0.554
V-measure: 0.447
Silhouette Coefficient: 0.004

References:

argsort()

a = np.array([[20, 10, 30, 40], [100, 300, 200, 400], [1, 5, 3, 2]])
a.argsort()[:, ::-1]

# Result:
array([[3, 2, 0, 1],
       [3, 1, 2, 0],
       [1, 2, 3, 0]], dtype=int64)

Clustering Evaluation and Assessment (聚類效能評估), by 沙沙的兔子

Clustering Model Evaluation (聚類模型評估), by howhigh
