K-means Algorithm: Explanation and Implementation

阿新 • Published: 2019-01-13

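Today we introduce one of the most basic algorithms in data mining, the k-means algorithm, and walk through both how it works and how to implement it.

Clustering is a classic problem, and there are many kinds of clustering algorithms; k-means is one of the most basic.

Its main steps are: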

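(1) Randomly pick k points as the k initial centroids;

(2) Compute the distance from every other point to each of these k centroids;

(3) If a point p is closer to the n-th centroid than to any other centroid, it belongs to cluster n; label it by setting p.label = n, where 1 <= n <= k;

(4) Within each cluster, that is, among the points with the same label, compute the mean of the point vectors and use it as the new centroid;

(5) Repeat steps (2) to (4) until no centroid changes, at which point the algorithm terminates.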
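
The five steps above can be sketched in a few lines of Java. This is only a minimal illustration, not the implementation given later in this post: the class name LloydSketch, the method names, and the maxIterations safety cap are all assumptions made for the example, and step (1) is replaced by fixed initial centroids so that the run is repeatable.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not the post's Kmeans class.
final class LloydSketch {

    // Squared Euclidean distance; sufficient for nearest-centroid comparisons.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Steps (2) to (5). Returns labels in 1..k; centroids is updated in place.
    static int[] cluster(double[][] points, double[][] centroids, int maxIterations) {
        int k = centroids.length;
        int dim = points[0].length;
        int[] labels = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            // Steps (2) and (3): label each point with its nearest centroid.
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int n = 0; n < k; n++) {
                    double d = squaredDistance(points[p], centroids[n]);
                    if (d < best) {
                        best = d;
                        labels[p] = n + 1;
                    }
                }
            }
            // Step (4): recompute each centroid as the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                int n = labels[p] - 1;
                counts[n]++;
                for (int i = 0; i < dim; i++) {
                    sums[n][i] += points[p][i];
                }
            }
            boolean changed = false;
            for (int n = 0; n < k; n++) {
                if (counts[n] == 0) {
                    continue; // empty cluster: keep the old centroid
                }
                for (int i = 0; i < dim; i++) {
                    double mean = sums[n][i] / counts[n];
                    if (mean != centroids[n][i]) {
                        centroids[n][i] = mean;
                        changed = true;
                    }
                }
            }
            // Step (5): stop as soon as no centroid has moved.
            if (!changed) {
                break;
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {6, 5}, {7, 6}};
        double[][] centroids = {{1, 1}, {7, 6}}; // step (1), fixed here for repeatability
        System.out.println(Arrays.toString(cluster(points, centroids, 100))); // prints [1, 1, 2, 2]
    }
}
```

The sketch compares squared distances, because taking the square root does not change which centroid is nearest.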

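There are of course many ways to implement this. For example, the initial centroids can be k points chosen at random, or k points chosen so that they lie as far apart from one another as possible, and so on.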
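
As a rough illustration of these two options, the sketch below picks k initial centroids either uniformly at random or greedily, so that each new centroid is the point farthest from those already chosen. The class InitSketch and its method names are hypothetical, the greedy "farthest-first" rule is just one possible reading of "as far apart as possible", and both methods assume k is no larger than the number of points.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper class for illustration only.
final class InitSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Option 1: k distinct points chosen uniformly at random (seeded for repeatability).
    static double[][] randomCentroids(double[][] points, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        double[][] centroids = new double[k][];
        for (int n = 0; n < k; n++) {
            centroids[n] = points[indices.get(n)].clone();
        }
        return centroids;
    }

    // Option 2 (assumed greedy rule): start from points[0], then repeatedly add
    // the point whose distance to its nearest already-chosen centroid is largest.
    static double[][] farthestFirstCentroids(double[][] points, int k) {
        double[][] centroids = new double[k][];
        centroids[0] = points[0].clone();
        for (int n = 1; n < k; n++) {
            int bestIndex = 0;
            double bestGap = -1;
            for (int p = 0; p < points.length; p++) {
                double gap = Double.MAX_VALUE;
                for (int c = 0; c < n; c++) {
                    gap = Math.min(gap, squaredDistance(points[p], centroids[c]));
                }
                if (gap > bestGap) {
                    bestGap = gap;
                    bestIndex = p;
                }
            }
            centroids[n] = points[bestIndex].clone();
        }
        return centroids;
    }
}
```

Random selection is cheap but can place two centroids in the same cluster; the greedy rule spreads them out at the cost of being sensitive to outliers.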

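The value of k must be known in advance, which is one of the drawbacks of k-means. There are, however, many ways to estimate k. In this post we estimate k with the average-diameter method.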

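That is, first treat all the points as one large cluster and take the average of the distances between all pairs of points as the average diameter of that cluster. When choosing the initial centroids, first select the two points that are farthest apart. Then, starting from these two points, any point that is far from both of them is treated as a newly discovered centroid, where "far" means that its distance to each of the previously selected points is greater than the average diameter of the whole cluster; otherwise it is not treated as a centroid.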
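
The average-diameter heuristic can be sketched as follows. This is an illustrative reading of the description above, not the post's own code: the class DiameterSketch and its members are hypothetical names, the sketch uses the true Euclidean distance, a candidate is checked against every centroid chosen so far, and at least two input points are assumed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
final class DiameterSketch {

    // One pair of point indices together with the distance between them.
    static final class Pair {
        final int i;
        final int j;
        final double d;

        Pair(int i, int j, double d) {
            this.i = i;
            this.j = j;
            this.d = d;
        }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    static boolean isFarFromAll(double[] p, List<double[]> centroids, double threshold) {
        for (double[] c : centroids) {
            if (distance(p, c) < threshold) {
                return false;
            }
        }
        return true;
    }

    // Returns the initial centroids; their count is the estimate of k.
    static List<double[]> chooseCentroids(double[][] points) {
        List<Pair> pairs = new ArrayList<>();
        double total = 0;
        for (int i = 0; i < points.length - 1; i++) {
            for (int j = i + 1; j < points.length; j++) {
                double d = distance(points[i], points[j]);
                total += d;
                pairs.add(new Pair(i, j, d));
            }
        }
        double average = total / pairs.size(); // the "average diameter"
        pairs.sort((a, b) -> Double.compare(b.d, a.d)); // farthest pairs first
        List<double[]> centroids = new ArrayList<>();
        for (Pair pair : pairs) {
            if (pair.d < average) {
                break; // every remaining pair is closer than the average diameter
            }
            for (int index : new int[] {pair.i, pair.j}) {
                if (isFarFromAll(points[index], centroids, average)) {
                    centroids.add(points[index].clone());
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {
            {1, 1}, {2, 1}, {1, 2}, {2, 2}, {6, 1}, {6, 2}, {7, 1}, {7, 2},
            {1, 5}, {1, 6}, {2, 5}, {2, 6}, {6, 5}, {6, 6}, {7, 5}, {7, 6}};
        System.out.println(chooseCentroids(points).size()); // prints 4
    }
}
```

Because all pairwise distances are stored and sorted, the sketch needs memory and time that grow quadratically with the number of points, so it only suits small data sets.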

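In this way we can estimate k and obtain the k initial centroids at the same time. We then continue iterating according to the steps above until no centroid changes, which completes the algorithm.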

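The code in this post is the most basic kind of implementation. If the data is multi-dimensional, it may need preprocessing such as normalization, along with changes to the relevant functions in the code.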
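
As one possible form of the normalization mentioned here, the sketch below applies min-max scaling so that every attribute falls in the range [0, 1] and no single attribute dominates the distance. The class NormalizeSketch and its method are hypothetical; min-max scaling is an assumption, since the post does not say which normalization it has in mind.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only.
final class NormalizeSketch {

    // Rescales each column of points to [0, 1] in place.
    // A column whose values are all equal is set to 0.
    static void minMaxNormalize(double[][] points) {
        if (points.length == 0) {
            return;
        }
        int dim = points[0].length;
        for (int c = 0; c < dim; c++) {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] p : points) {
                min = Math.min(min, p[c]);
                max = Math.max(max, p[c]);
            }
            double range = max - min;
            for (double[] p : points) {
                p[c] = (range == 0) ? 0 : (p[c] - min) / range;
            }
        }
    }

    public static void main(String[] args) {
        double[][] points = {{1, 100}, {2, 300}, {3, 200}};
        minMaxNormalize(points);
        System.out.println(Arrays.deepToString(points)); // prints [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
    }
}
```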

The code is attached with comments at the key points; please leave a comment if you have any questions.

Here is a set of test data. Before running the code, copy the data to c:\kmeans.txt

Each line of the data below is the coordinates of one point:

1,1
2,1
1,2
2,2
6,1
6,2
7,1
7,2
1,5
1,6
2,5
2,6
6,5
6,6
7,5
7,6

The output is:

There are 4 clusters!
1.0 1.0 belongs to cluster 1
2.0 1.0 belongs to cluster 1
1.0 2.0 belongs to cluster 1
2.0 2.0 belongs to cluster 1
6.0 1.0 belongs to cluster 3
6.0 2.0 belongs to cluster 3
7.0 1.0 belongs to cluster 3
7.0 2.0 belongs to cluster 3
1.0 5.0 belongs to cluster 4
1.0 6.0 belongs to cluster 4
2.0 5.0 belongs to cluster 4
2.0 6.0 belongs to cluster 4
6.0 5.0 belongs to cluster 2
6.0 6.0 belongs to cluster 2
7.0 5.0 belongs to cluster 2
7.0 6.0 belongs to cluster 2

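Here is a k-means implementation that works on n-dimensional data sets. It is actually quite simple. As before, this code may perform differently on different data sets. Many factors matter, chiefly the estimate of k, the choice of the k initial centroids, data preprocessing, and the distance metric used between points. Of course, k-means is itself about the most naive method there is; for better results you can turn to other methods, such as neural networks.

Note: this code does not include data preprocessing. Choose a preprocessing method that suits your data set, such as normalization. The code below generally works well on data that has been preprocessed.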

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Comparator;

public class Kmeans {
	class Node {
		int label;// label records which cluster the point belongs to
		double[] attributes;

		public Node() {
			attributes = new double[100];// fixed capacity: at most 100 dimensions are supported
		}
	}

	class NodeComparator {
		Node nodeOne;
		Node nodeTwo;
		double distance;

		public void compute() {
			double val = 0;
			for (int i = 0; i < dimension; ++i) {
				val += (this.nodeOne.attributes[i] - this.nodeTwo.attributes[i])
						* (this.nodeOne.attributes[i] - this.nodeTwo.attributes[i]);
			}
			this.distance = val;// squared Euclidean distance (no square root is taken)
		}
	}

	ArrayList<Node> arraylist;
	ArrayList<Node> centroidList;
	double averageDis;
	int dimension;
	Queue<NodeComparator> FsQueue = new PriorityQueue<NodeComparator>(150,// orders the distances between all pairs of points, largest first
			new Comparator<NodeComparator>() {
				public int compare(NodeComparator one, NodeComparator two) {
					if (one.distance < two.distance)
						return 1;
					else if (one.distance > two.distance)
						return -1;
					else
						return 0;
				}
			});

	public Kmeans(String path) {// the constructor reads the input data
		arraylist = new ArrayList<Node>();
		// try-with-resources closes the reader even if parsing fails
		try (BufferedReader br = new BufferedReader(new FileReader(path))) {
			String str;
			while ((str = br.readLine()) != null) {
				if (str.trim().isEmpty())
					continue;// skip blank lines
				String[] strArray = str.split(",");
				dimension = strArray.length;
				Node node = new Node();
				for (int i = 0; i < dimension; ++i) {
					node.attributes[i] = Double.parseDouble(strArray[i].trim());
				}
				arraylist.add(node);
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	public void computeTheK() {
		int cntTuple = 0;
		for (int i = 0; i < arraylist.size() - 1; ++i) {
			for (int j = i + 1; j < arraylist.size(); ++j) {
				NodeComparator nodecomp = new NodeComparator();
				nodecomp.nodeOne = new Node();
				nodecomp.nodeTwo = new Node();
				for (int k = 0; k < dimension; ++k) {
					nodecomp.nodeOne.attributes[k] = arraylist.get(i).attributes[k];
					nodecomp.nodeTwo.attributes[k] = arraylist.get(j).attributes[k];
				}
				nodecomp.compute();
				averageDis += nodecomp.distance;
				FsQueue.add(nodecomp);
				cntTuple++;
			}
		}
		averageDis /= cntTuple;// average (squared) distance over all pairs of points
		chooseCentroid(FsQueue);
	}

	public double getDistance(Node one, Node two) {// squared Euclidean distance between two points (no square root, consistent with averageDis)
		double val = 0;
		for (int i = 0; i < dimension; ++i) {
			val += (one.attributes[i] - two.attributes[i])
					* (one.attributes[i] - two.attributes[i]);
		}
		return val;
	}

	public void chooseCentroid(Queue<NodeComparator> queue) {
		centroidList = new ArrayList<Node>();
		boolean flag = false;
		while (!queue.isEmpty()) {
			boolean judgeOne = false;
			boolean judgeTwo = false;
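			NodeComparator nc = queue.poll();
			if (nc.distance < averageDis)
				break;// stop once the remaining pairs are closer than the average distance
			if (!flag) {
				centroidList.add(nc.nodeOne);// start with the two points that are farthest apart
				centroidList.add(nc.nodeTwo);
				flag = true;
			} else {// then test each endpoint of the pair against every centroid chosen so far;
					// it becomes a new centroid only if it is at least the average distance from all of them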
				for (int i = 0; i < centroidList.size(); ++i) {
					Node testnode = centroidList.get(i);
					if (centroidList.contains(nc.nodeOne)
							|| getDistance(testnode, nc.nodeOne) < averageDis) {
						judgeOne = true;
					}
					if (centroidList.contains(nc.nodeTwo)
							|| getDistance(testnode, nc.nodeTwo) < averageDis) {
						judgeTwo = true;
					}
				}
				if (!judgeOne) {
					centroidList.add(nc.nodeOne);
				}
				if (!judgeTwo) {
					centroidList.add(nc.nodeTwo);
				}
			}
		}
	}

	public void doIteration(ArrayList<Node> centroid) {
		int numLabel = centroid.size();
		while (true) {// iterate until no centroid changes
			boolean changed = false;
			// assignment step: label every point with its nearest centroid (labels start at 1)
			for (int i = 0; i < arraylist.size(); ++i) {
				double dis = Double.MAX_VALUE;
				for (int j = 0; j < numLabel; ++j) {
					double d = getDistance(arraylist.get(i), centroid.get(j));
					if (d < dis) {
						dis = d;
						arraylist.get(i).label = j + 1;
					}
				}
			}
			// update step: move every centroid, including the last one, to the mean of its points
			for (int j = 0; j < numLabel; ++j) {
				int c = 0;
				Node node = new Node();
				for (int i = 0; i < arraylist.size(); ++i) {
					if (arraylist.get(i).label == j + 1) {
						for (int k = 0; k < dimension; ++k) {
							node.attributes[k] += arraylist.get(i).attributes[k];
						}
						c++;
					}
				}
				if (c == 0)
					continue;// empty cluster: keep the old centroid and avoid dividing by zero
				for (int i = 0; i < dimension; ++i) {
					// round to three decimal places so that convergence can be tested with !=
					double mean = Math.round(node.attributes[i] / c * 1000) / 1000.0;
					if (mean != centroid.get(j).attributes[i]) {
						centroid.get(j).attributes[i] = mean;
						changed = true;
					}
				}
			}
			if (!changed) {// no centroid moved in this pass, so the algorithm has converged
				System.out.println("do kmeans success");
				break;
			}
		}
	}

	public void getKmeansResults(String path) {
		try {
			PrintStream out = new PrintStream(path);
			computeTheK();
			doIteration(centroidList);
			out.println("There are " + centroidList.size() + " clusters!");
			for (int i = 0; i < arraylist.size(); ++i) {
				for (int j = 0; j < dimension; ++j) {
					out.print(arraylist.get(i).attributes[j] + " ");
				}
				out.println("belongs to cluster " + arraylist.get(i).label);
			}
			out.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	public static void main(String[] args) {
		Kmeans kmeans = new Kmeans("c:/kmeans.txt");
		kmeans.getKmeansResults("c:/kmeansResults.txt");

	}
}
