sklearn in Practice: Clustering Documents with the KMeans Algorithm

阿新 • Published: 2019-01-06

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

from time import time
from sklearn.datasets import load_files

print("loading documents ...")
t = time()
docs = load_files('datasets/clustering/data')
print("summary: {0} documents in {1} categories.".format(
    len(docs.data), len(docs.target_names)))
print("done in {0} seconds".format(time() - t))

loading documents ...
summary: 7898 documents in 4 categories.
done in 1.8740148544311523 seconds

from sklearn.feature_extraction.text import TfidfVectorizer

max_features = 20000
print("vectorizing documents ...")
t = time()
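# max_df=0.4 drops terms that occur in more than 40% of the documents;
# min_df=2 drops terms that occur in fewer than two documents.
vectorizer = TfidfVectorizer(max_df=0.4,
                             min_df=2,
                             max_features=max_features,
                             encoding='latin-1')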
X = vectorizer.fit_transform((d for d in docs.data))
print("n_samples: %d, n_features: %d" % X.shape)
print("number of non-zero features in sample [{0}]: {1}".format(
    docs.filenames[0], X[0].getnnz()))
print("done in {0} seconds".format(time() - t))

vectorizing documents ...
n_samples: 7898, n_features: 20000
number of non-zero features in sample [datasets/clustering/data\sci.electronics\._12249-54259]: 0
done in 1.135350227355957 seconds
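The sample printed above has zero non-zero features, and its file name starts with `._`. Names of that form are typically macOS AppleDouble metadata sidecars rather than real documents; that is an assumption about this dataset, not something the post states. If it holds, a minimal sketch for dropping such entries before vectorizing could look like this (`drop_sidecar_files` is a hypothetical helper, not part of scikit-learn):

```python
import re


def drop_sidecar_files(filenames, data, targets):
    """Hypothetical helper: keep entries whose base name does not start with '._'."""
    kept = [
        i for i, name in enumerate(filenames)
        # Split on both '/' and '\' because the paths in the output mix separators.
        if not re.split(r"[\\/]", name)[-1].startswith("._")
    ]
    return ([filenames[i] for i in kept],
            [data[i] for i in kept],
            [targets[i] for i in kept])


# Toy inputs shaped like docs.filenames, docs.data and docs.target.
names = [
    "datasets/clustering/data\\sci.crypt\\11475-15954",
    "datasets/clustering/data\\sci.med\\._13133-59218",
    "datasets/clustering/data\\sci.space\\14343-60918",
]
kept_names, kept_data, kept_targets = drop_sidecar_files(
    names, [b"doc a", b"sidecar", b"doc b"], [0, 1, 2])
print(kept_names)
```

On the real data the same call would take `docs.filenames`, `docs.data` and `docs.target`; whether that changes the clustering scores reported later would need to be checked by rerunning the notebook.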

from sklearn.cluster import KMeans

print("clustering documents ...")
t = time()
n_clusters = 4
kmean = KMeans(n_clusters=n_clusters, 
               max_iter=100,
               tol=0.001,
               verbose=1,
               n_init=3)
kmean.fit(X);
print("kmean: k={}, cost={}".format(n_clusters, int(kmean.inertia_)))
print("done in {0} seconds".format(time() - t))

clustering documents ...
Initialization complete
Iteration  0, inertia 3944.720
Iteration  1, inertia 3846.168
Converged at iteration 1: center shift 0.000000e+00 within tolerance 2.438758e-08
Initialization complete
Iteration  0, inertia 3943.466
Iteration  1, inertia 3845.153
Iteration  2, inertia 3842.399
Iteration  3, inertia 3840.321
Iteration  4, inertia 3839.155
Iteration  5, inertia 3832.527
Iteration  6, inertia 3798.844
Iteration  7, inertia 3773.636
Iteration  8, inertia 3758.090
Iteration  9, inertia 3749.455
Iteration 10, inertia 3745.879
Iteration 11, inertia 3744.561
Iteration 12, inertia 3744.153
Iteration 13, inertia 3744.027
Iteration 14, inertia 3743.978
Iteration 15, inertia 3743.961
Iteration 16, inertia 3743.952
Iteration 17, inertia 3743.950
Iteration 18, inertia 3743.949
Iteration 19, inertia 3743.948
Iteration 20, inertia 3743.947
Converged at iteration 20: center shift 0.000000e+00 within tolerance 2.438758e-08
Initialization complete
Iteration  0, inertia 3943.208
Iteration  1, inertia 3844.309
Iteration  2, inertia 3843.867
Converged at iteration 2: center shift 0.000000e+00 within tolerance 2.438758e-08
kmean: k=4, cost=3743
done in 8.91970944404602 seconds

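The log shows three separate KMeans runs, one per initialization (n_init=3). scikit-learn keeps the run with the lowest inertia, here the second one (3743.947), which matches the reported cost of 3743.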
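The three "Initialization complete" blocks in the log correspond to `n_init=3`. As a rough illustration of what that means, the sketch below runs three single-initialization fits by hand on toy data and picks the one with the smallest inertia. The blob data, the seeds and `init="random"` are assumptions made for the example; it does not reproduce the exact runs shown above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data standing in for the TF-IDF matrix (assumption: 4 blobs, 300 points).
X_toy, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Three independent fits, each with a single initialization and its own seed.
single_runs = [
    KMeans(n_clusters=4, n_init=1, init="random", random_state=seed).fit(X_toy)
    for seed in (0, 1, 2)
]
inertias = [run.inertia_ for run in single_runs]

# The run that would be kept is the one with the smallest inertia.
best_run = min(single_runs, key=lambda run: run.inertia_)
print("inertia per run:", [round(v, 3) for v in inertias])
print("best inertia:", round(best_run.inertia_, 3))
```

A larger `n_init` costs proportionally more time but lowers the chance of ending up in a poor local minimum.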

len(kmean.labels_)

kmean.labels_[1000:1010]  # cluster labels of the ten documents starting at index 1000

array([1, 0, 0, 1, 0, 1, 1, 0, 0, 0])

docs.filenames[1000:1010]

array(['datasets/clustering/data\\sci.crypt\\11475-15954',
       'datasets/clustering/data\\sci.med\\._13133-59218',
       'datasets/clustering/data\\sci.med\\._13072-59582',
       'datasets/clustering/data\\sci.crypt\\11228-15855',
       'datasets/clustering/data\\sci.med\\._13131-58806',
       'datasets/clustering/data\\sci.space\\14343-60918',
       'datasets/clustering/data\\sci.space\\14001-60226',
       'datasets/clustering/data\\sci.space\\._14348-61339',
       'datasets/clustering/data\\sci.space\\._14390-61342',
       'datasets/clustering/data\\sci.electronics\\._12203-54305'],
      dtype='<U54')

from __future__ import print_function

print("Top terms per cluster:")

order_centroids = kmean.cluster_centers_.argsort()[:, ::-1]

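# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 call get_feature_names() instead
terms = vectorizer.get_feature_names_out()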
for i in range(n_clusters):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()

Top terms per cluster:
Cluster 0: edu for re on thanks it anyone that this or
Cluster 1: flyback fix whine tv scott spray sony noise princeton repairman
Cluster 2: it that for you be edu this on are have
Cluster 3: ireland astronomy min 48p 0891 uk per 2888 mastercard mir

a = np.array([[20, 10, 30, 40], [100, 300, 200, 400], [1, 5, 3, 2]])
a.argsort()[:, ::-1]

array([[3, 2, 0, 1],
       [3, 1, 2, 0],
       [1, 2, 3, 0]], dtype=int64)

a = np.array([10, 30, 20, 40])
a.argsort()[::-1]

array([3, 1, 2, 0], dtype=int64)
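The two `argsort` demos explain the indexing used for the top terms: sort each centroid's weights in ascending order, reverse the columns to get descending order, then keep the first few indices. A small sketch on made-up data (the vocabulary and the weights are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical vocabulary and centroid weights, for illustration only.
toy_terms = np.array(["orbit", "key", "doctor", "circuit", "nasa"])
toy_centers = np.array([
    [0.9, 0.0, 0.1, 0.0, 0.7],   # a "space"-like centroid
    [0.0, 0.8, 0.0, 0.3, 0.1],   # a "crypto"-like centroid
])

# Ascending argsort per row, reversed for descending order, top 2 kept.
top2 = toy_centers.argsort()[:, ::-1][:, :2]
top_terms = [toy_terms[row].tolist() for row in top2]
print(top_terms)
```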

from sklearn import metrics

label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Adjusted Rand-Index for random sample: %.3f"
      % metrics.adjusted_rand_score(label_true, label_pred))
label_true = [1, 1, 3, 3, 2, 2]
label_pred = [3, 3, 2, 2, 1, 1]
print("Adjusted Rand-Index for same structure sample: %.3f"
      % metrics.adjusted_rand_score(label_true, label_pred))

Adjusted Rand-Index for random sample: 0.318
Adjusted Rand-Index for same structure sample: 1.000
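To make the score less of a black box, here is a sketch that computes the Adjusted Rand Index from pair counts and compares it with scikit-learn. `ari_by_hand` is a hypothetical helper written for this example, and it assumes a non-degenerate case (the denominator is not zero).

```python
from collections import Counter
from math import comb

from sklearn import metrics


def ari_by_hand(labels_true, labels_pred):
    """Adjusted Rand Index from pair counts (sketch, non-degenerate inputs only)."""
    n = len(labels_true)
    # Pairs that share both the true class and the predicted cluster.
    pairs_both = sum(comb(c, 2)
                     for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)   # chance-level agreement
    max_index = 0.5 * (pairs_true + pairs_pred)
    return (pairs_both - expected) / (max_index - expected)


same_true, same_pred = [1, 1, 3, 3, 2, 2], [3, 3, 2, 2, 1, 1]
mixed_true, mixed_pred = [0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]

same_structure = ari_by_hand(same_true, same_pred)
mixed = ari_by_hand(mixed_true, mixed_pred)
print("same structure: %.3f" % same_structure)
print("mixed: %.3f (sklearn: %.3f)"
      % (mixed, metrics.adjusted_rand_score(mixed_true, mixed_pred)))
```

Because only co-membership of pairs is counted, renaming the cluster ids leaves the score unchanged, which is why the relabeled sample above scores 1.000.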

from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("Homogeneity score for same structure sample: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [0, 1, 2, 3]
print("Homogeneity score for each cluster come from only one class: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("Homogeneity score for each cluster come from two class: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))
label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Homogeneity score for random sample: %.3f"
      % metrics.homogeneity_score(label_true, label_pred))

Homogeneity score for same structure sample: 1.000
Homogeneity score for each cluster come from only one class: 1.000
Homogeneity score for each cluster come from two class: 0.000
Homogeneity score for random sample: 0.667
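Homogeneity can be read as h = 1 - H(C|K) / H(C), where C are the true classes and K the predicted clusters. The sketch below computes it directly from that definition and checks it against scikit-learn; `entropy` and `homogeneity_by_hand` are hypothetical helpers, and the code assumes H(C) > 0.

```python
from collections import Counter

import numpy as np
from sklearn import metrics


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def homogeneity_by_hand(labels_true, labels_pred):
    """h = 1 - H(C|K) / H(C); sketch that assumes H(C) > 0."""
    n = len(labels_true)
    h_c_given_k = 0.0
    for cluster, size in Counter(labels_pred).items():
        # True classes of the samples that landed in this cluster.
        members = [c for c, k in zip(labels_true, labels_pred) if k == cluster]
        h_c_given_k += (size / n) * entropy(members)
    return 1.0 - h_c_given_k / entropy(labels_true)


pure = homogeneity_by_hand([1, 1, 2, 2], [0, 1, 2, 3])      # every cluster is pure
mixed_up = homogeneity_by_hand([1, 1, 2, 2], [1, 2, 1, 2])  # every cluster is half and half
print("pure: %.3f, mixed: %.3f" % (pure, mixed_up))
```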

from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("Completeness score for same structure sample: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]
print("Completeness score for each class assign to only one cluster: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("Completeness score for each class assign to two class: %.3f"
      % metrics.completeness_score(label_true, label_pred))
label_true = np.random.randint(1, 4, 6)
label_pred = np.random.randint(1, 4, 6)
print("Completeness score for random sample: %.3f"
      % metrics.completeness_score(label_true, label_pred))

Completeness score for same structure sample: 1.000
Completeness score for each class assign to only one cluster: 1.000
Completeness score for each class assign to two class: 0.000
Completeness score for random sample: 0.315
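Completeness is the mirror image of homogeneity: swapping the two label arguments turns one into the other. A short check on the second sample above:

```python
from sklearn import metrics

label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]

completeness = metrics.completeness_score(label_true, label_pred)
# Homogeneity with the arguments swapped gives the same number.
homogeneity_swapped = metrics.homogeneity_score(label_pred, label_true)
# Homogeneity in the original order is lower, since each cluster mixes two classes.
homogeneity = metrics.homogeneity_score(label_true, label_pred)

print("completeness: %.3f" % completeness)
print("homogeneity (swapped args): %.3f" % homogeneity_swapped)
print("homogeneity: %.3f" % homogeneity)
```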

from sklearn import metrics

label_true = [1, 1, 2, 2]
label_pred = [2, 2, 1, 1]
print("V-measure score for same structure sample: %.3f"
      % metrics.v_measure_score(label_true, label_pred))
label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]
print("V-measure score for each class assign to only one cluster: %.3f"
      % metrics.v_measure_score(label_true, label_pred))
# Same labels with the arguments swapped: V-measure is symmetric, so the score is unchanged
print("V-measure score for each class assign to only one cluster: %.3f"
      % metrics.v_measure_score(label_pred, label_true))
label_true = [1, 1, 2, 2]
label_pred = [1, 2, 1, 2]
print("V-measure score for each class assign to two class: %.3f"
      % metrics.v_measure_score(label_true, label_pred))

V-measure score for same structure sample: 1.000
V-measure score for each class assign to only one cluster: 0.667
V-measure score for each class assign to only one cluster: 0.667
V-measure score for each class assign to two class: 0.000
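With the default `beta=1`, V-measure is the harmonic mean of homogeneity and completeness. The sketch below recomputes the 0.667 above from its two components, using `metrics.homogeneity_completeness_v_measure` to get all three at once:

```python
from sklearn import metrics

label_true = [0, 1, 2, 3]
label_pred = [1, 1, 2, 2]

h, c, v = metrics.homogeneity_completeness_v_measure(label_true, label_pred)
harmonic_mean = 2 * h * c / (h + c)
print("h=%.3f c=%.3f v=%.3f harmonic mean=%.3f" % (h, c, v, harmonic_mean))
```

Here homogeneity is 0.5 (each cluster mixes two classes) and completeness is 1.0 (each class stays in one cluster), so the harmonic mean is 2/3.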

from sklearn import metrics

labels = docs.target
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, kmean.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, kmean.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, kmean.labels_))
print("Adjusted Rand-Index: %.3f"
      % metrics.adjusted_rand_score(labels, kmean.labels_))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, kmean.labels_, sample_size=1000))

Homogeneity: 0.002
Completeness: 0.004
V-measure: 0.003
Adjusted Rand-Index: 0.001
Silhouette Coefficient: 0.330
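Scores this close to zero suggest that the clusters found do not line up with the four newsgroup categories. One way to see where the documents went is a contingency table of true class against predicted cluster. The sketch below uses made-up labels; on the real data the inputs would be `docs.target` and `kmean.labels_`.

```python
import numpy as np
from sklearn.metrics.cluster import contingency_matrix

# Toy labels (assumption): three true classes, two predicted clusters.
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
toy_pred = np.array([1, 1, 1, 1, 1, 0, 1, 1])

# Rows are true classes, columns are predicted clusters.
table = contingency_matrix(toy_true, toy_pred)
print(table)
```

A table in which one column absorbs almost every row, as in this toy case, is the kind of pattern that produces low homogeneity and completeness.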
