Determining the number of clusters in a data set

阿新 • Published: 2019-01-09

This article discusses how to determine the number of clusters when clustering a data set.

1. The role of k

Intuitively then, the optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster.

2. Common methods

2.1 Rule of thumb

[1] $k \approx \sqrt{n/2}$
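As a quick illustration, the rule of thumb can be evaluated directly. This is a minimal sketch: the function name `rule_of_thumb_k` is made up for this example, n is taken to be the number of data points, and rounding to the nearest integer (with a floor of 1) is an assumption, since the formula only gives an approximate value.

```python
import math

def rule_of_thumb_k(n):
    """Approximate number of clusters for n data points: k ≈ sqrt(n / 2)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Rounding and the floor of 1 are choices made for this sketch.
    return max(1, round(math.sqrt(n / 2)))

print(rule_of_thumb_k(200))  # sqrt(200 / 2) = sqrt(100) = 10
```

The result is only a starting point; the methods in the following sections look at the data itself.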

2.2 The Elbow Method

[Figure: the percentage of variance explained vs. the number of clusters]
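The curve in such a plot can be computed as follows. This is a minimal sketch under stated assumptions: the toy data, the 1-D k-means helper `kmeans_1d`, and its deterministic starting centers are all invented for illustration. The elbow itself is still read off the printed curve by eye.

```python
def kmeans_1d(points, k, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns the within-cluster sum of squares."""
    pts = sorted(points)
    n = len(pts)
    # Deterministic start (an assumption of this sketch): evenly spaced order statistics.
    centers = [pts[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        updated = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return sum(min((p - c) ** 2 for c in centers) for p in pts)

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2, 8.8, 9.0, 9.2]  # toy data with three visible groups
total_ss = kmeans_1d(data, 1)  # with a single cluster this is the total sum of squares

explained = {}
for k in range(1, 7):
    explained[k] = 1.0 - kmeans_1d(data, k) / total_ss
    print(f"k={k}: variance explained = {explained[k]:.3f}")
```

On this toy data the explained variance rises steeply up to k = 3 and barely changes afterwards, which is the bend the method looks for.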

2.3 Information Criterion Approach

[2][3] If the clustering model can be written as a likelihood function, consider using the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the deviance information criterion (DIC).
[4] gives an example for k-means.
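To make the idea concrete, here is a minimal sketch that scores two candidate clusterings of a toy 1-D data set with AIC and BIC. Several things are assumptions made for the example and are not taken from [2], [3] or [4]: the model is a 1-D Gaussian mixture, the parameters are estimated from a fixed grouping of the points rather than fitted by EM, and all function names are hypothetical.

```python
import math

def mixture_log_likelihood(data, weights, means, variances):
    """Log-likelihood of 1-D data under a Gaussian mixture with the given parameters."""
    total = 0.0
    for x in data:
        density = sum(
            w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        total += math.log(density)
    return total

def fit_from_groups(groups):
    """Per-group weights, sample means and (biased) sample variances."""
    n = sum(len(g) for g in groups)
    weights = [len(g) / n for g in groups]
    means = [sum(g) / len(g) for g in groups]
    variances = [sum((x - m) ** 2 for x in g) / len(g) for g, m in zip(groups, means)]
    return weights, means, variances

def aic(log_lik, n_params):
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_points):
    return n_params * math.log(n_points) - 2 * log_lik

data = [-0.1, 0.0, 0.1, 9.9, 10.0, 10.1]  # toy data
candidates = {
    1: [data],                # everything in one cluster
    2: [data[:3], data[3:]],  # a hypothetical two-cluster result
}
scores = {}
for k, groups in candidates.items():
    log_lik = mixture_log_likelihood(data, *fit_from_groups(groups))
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free weights
    scores[k] = (aic(log_lik, n_params), bic(log_lik, n_params, len(data)))
    print(k, scores[k])
```

Lower values are better for both criteria, so one would keep the k with the smallest score. On this toy data the two-cluster model scores lower on both.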

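2.4 An Information Theoretic Approach

[5] Rate distortion theory is applied to the choice of k, using information-theoretic standards to minimize error while maximizing efficiency. The strategy runs a standard clustering algorithm on the input data for every k from 1 to n to produce a distortion curve, transforms the distortion curve by a negative power chosen according to the dimensionality of the data, and finally picks the k at which the transformed curve makes its largest jump.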
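The transformation and jump selection can be sketched as below. The distortion values are hypothetical numbers chosen for illustration, not measured results, and the function name `jump_method` is invented. The exponent -p/2 for p-dimensional data and the convention that the transformed value at k = 0 is 0 are stated here as assumptions, since the text above does not spell them out.

```python
def jump_method(distortions, dimension):
    """Pick k from a distortion curve, where distortions[i] belongs to k = i + 1."""
    power = dimension / 2.0  # the negative power applied is -dimension / 2
    # Transformed curve, with the value for k = 0 defined as 0.
    transformed = [0.0] + [d ** -power for d in distortions]
    jumps = [transformed[k] - transformed[k - 1] for k in range(1, len(transformed))]
    best_k = max(range(1, len(jumps) + 1), key=lambda k: jumps[k - 1])
    return best_k, jumps

# Hypothetical distortions for k = 1..6 on 1-D data (illustrative values only).
distortions = [10.69, 2.69, 0.0267, 0.020, 0.016, 0.013]
best_k, jumps = jump_method(distortions, dimension=1)
print(best_k)
```

All distortions must be strictly positive, so in practice the curve stops before k reaches the number of points, where the distortion drops to zero.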

2.5 Choosing k Using the Silhouette

[6][7]

The silhouette of a datum is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighbouring cluster.
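A minimal sketch of this measure for 1-D data follows. The function name `mean_silhouette`, the use of absolute difference as the distance, and the toy labelings are assumptions for illustration. For each point, a is the mean distance to the rest of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette is (b - a) / max(a, b).

```python
def mean_silhouette(points, labels):
    """Average silhouette width for 1-D points, using absolute difference as distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(abs(p - q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(widths) / len(widths)

points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
good = mean_silhouette(points, [0, 0, 0, 1, 1, 1])  # matches the visible groups
bad = mean_silhouette(points, [0, 0, 1, 1, 1, 1])   # one point put in the wrong group
print(good, bad)
```

To choose k, one would compute the average silhouette for the clustering obtained at each candidate k and prefer the k with the highest average.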

2.6 Cross-validation

[8]

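2.7 Finding Number of Clusters in Text Databases

[9] For a matrix $D \in \mathbb{R}^{n \times m}$, where m is the number of documents, n is the number of terms, and t is the number of nonzero entries in D (every row and every column of D must contain at least one nonzero entry), the number of clusters can be roughly estimated as $(m \times n)/t$.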
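The estimate is simple enough to compute directly. In this minimal sketch the function name `estimate_text_clusters` and the tiny example matrix are invented for illustration; the matrix is passed as a list of rows.

```python
def estimate_text_clusters(matrix):
    """Estimate the number of clusters as (rows * columns) / (number of nonzero entries)."""
    rows = len(matrix)
    cols = len(matrix[0])
    if any(all(v == 0 for v in row) for row in matrix):
        raise ValueError("every row must contain at least one nonzero entry")
    if any(all(row[j] == 0 for row in matrix) for j in range(cols)):
        raise ValueError("every column must contain at least one nonzero entry")
    t = sum(1 for row in matrix for v in row if v != 0)
    return rows * cols / t

# Toy matrix: two groups of documents that share no terms with each other.
D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(estimate_text_clusters(D))  # 16 / 8 = 2.0
```

The sparser the matrix, the larger the estimate, which matches the intuition that documents sharing few terms fall into more groups.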

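2.8 Analyzing the Kernel Matrix

Unlike the previous methods, which require clustering to be run beforehand, [10] obtains the number of clusters directly from the data itself.
1. Form the kernel matrix (mapping the data into a high-dimensional space where it is linearly separable)
2. Compute the eigendecomposition of the kernel matrix
3. Analyze the eigenvalues and eigenvectors
4. Plot them and look for the elbow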
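The steps above can be sketched as follows. This is not the exact procedure of [10]: the Gaussian (RBF) kernel, the value of `gamma`, the power-iteration eigenvalue routine, and the rule of taking the largest gap between consecutive eigenvalues in place of reading the elbow off a plot are all assumptions made for this example, and only the eigenvalues (not the eigenvectors) are examined.

```python
import math

def rbf_kernel_matrix(points, gamma=1.0):
    """Step 1: Gaussian (RBF) kernel matrix for 1-D points."""
    return [[math.exp(-gamma * (a - b) ** 2) for b in points] for a in points]

def eigenvalues_by_power_iteration(matrix, iters=500):
    """Step 2: eigenvalues of a symmetric positive semidefinite matrix, largest first."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    values = []
    for _ in range(n):
        v = [1.0 / (i + 1) for i in range(n)]  # arbitrary start vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        av = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
        vv = sum(x * x for x in v)
        lam = sum(x * y for x, y in zip(v, av)) / vv  # Rayleigh quotient
        values.append(lam)
        # Deflation: remove the component just found before the next pass.
        for i in range(n):
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j] / vv
    return values

def estimate_k_from_spectrum(values):
    """Steps 3 and 4: count the eigenvalues before the largest drop in the spectrum."""
    gaps = [values[i] - values[i + 1] for i in range(len(values) - 1)]
    return 1 + max(range(len(gaps)), key=lambda i: gaps[i])

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]  # toy data with two visible groups
spectrum = eigenvalues_by_power_iteration(rbf_kernel_matrix(points))
estimated_k = estimate_k_from_spectrum(spectrum)
print(estimated_k)
```

For two well-separated groups the kernel matrix is close to block diagonal, so it has two large eigenvalues followed by a sharp drop, and the largest gap falls after the second one.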

References:
[1] Kanti Mardia et al. (1979). Multivariate Analysis. Academic Press.
[2] David J. Ketchen, Jr. & Christopher L. Shook (1996). “The application of cluster analysis in Strategic Management Research: An analysis and critique”. Strategic Management Journal 17 (6): 441–458.
[3] Cyril Goutte, Peter Toft, Egill Rostrup, Finn Årup Nielsen, Lars Kai Hansen (March 1999). “On Clustering fMRI Time Series”. NeuroImage 9 (3): 298–310.
[4] Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot & Egill Rostrup (2001). “Feature-Space Clustering for fMRI Meta-Analysis”. Human Brain Mapping 13 (3): 165–183.
[5] Catherine A. Sugar and Gareth M. James (2003). “Finding the number of clusters in a data set: An information theoretic approach”. Journal of the American Statistical Association 98 (January): 750–763.
[6] Peter J. Rousseeuw (1987). “Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis”. Journal of Computational and Applied Mathematics 20: 53–65.
[7] R. Lleti, M.C. Ortiz, L.A. Sarabia, M.S. Sánchez (2004). “Selecting Variables for k-Means Cluster Analysis by Using a Genetic Algorithm that Optimises the Silhouettes”. Analytica Chimica Acta 515: 87–100.
[8] “Finding the Right Number of Clusters in kMeans and EM Clustering: v-Fold Cross-Validation”. Electronic Statistics Textbook. StatSoft. 2010. Retrieved 2010-05-03.
[9] Can, F.; Ozkarahan, E. A. (1990). “Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases”. ACM Transactions on Database Systems 15 (4): 483.
[10] Honarkhah, M. and Caers, J. (2010). “Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling”. Mathematical Geosciences 42 (5): 487–517.
