1. 程式人生 > >聚類分析層次聚類及k-means演算法

聚類分析層次聚類及k-means演算法

參考文獻:

[1]Jure Leskovec,Anand Rajaraman,Jeffrey David Ullman.大資料網際網路大規模資料探勘與分散式處理(第二版) [M]北京:人民郵電出版社,2015.,190-199;

[2]蔣盛益,李霞,鄭琪.資料探勘原理與實踐 [M]北京:電子工業出版社,2015.1,107-114,121;

目錄:

1、測試案例:

2、原理分析:

3、原始碼示例:

4、執行結果:

1、測試案例:

給定國際通用UCI資料庫中FISHERIRIS資料集,其meas集包含150個樣本資料,每個資料含有鶯尾屬植物的4個屬性,即萼片長度、萼片寬度、花瓣長度,單位為cm。上述資料分屬於species集的三種setosa、versicolor和virginica花朵類別。
要求在該資料集上執行:
(1)層次聚類演算法
(2)k-means聚類演算法
得到的聚類結果與species集的Label結果比較,統計這兩類演算法聚類的正確率和執行時間。


圖1.1 Excel測試案例部分內容截圖1


圖1.2 Excel測試案例部分內容截圖2

2、原理分析:

(1)聚類定義:

將資料集劃分為由若干相似物件組成的多個組(group)或簇(cluster)的過程,使得同一組中物件間的相似度最大化,不同組中物件間的相似度最小化。

聚類是一種無監督的機器學習方法,即事先對資料集的分佈沒有任何瞭解,是將物理或抽象物件的集合組成為由類似的物件組成的多個組的過程。

(2)聚類分析任務步驟:

①模式表示(包括特徵提取和選擇)

②適合於資料領域的模式相似性定義

③聚類或劃分演算法

④資料摘要

⑤輸出結果的評估

(3)k-means演算法:

首先,隨機選擇k個物件,每個物件代表一個簇的初始均值或中心;對剩餘的每個物件,根據其與各簇中心的距離,將它指派到最近或最相似的簇,然後計算每個簇的新均值,得到更新後的簇中心;不斷重複,直到準則函式收斂。

(4)自下而上聚合層次聚類方法(凝聚層次聚類):

最初將每個物件作為一個簇,然後將這些簇進行聚合以構造越來越大的簇,直到所有物件均聚合為一個簇,或滿足一定終止條件為止。

(5)k-means演算法的缺點:

①簇個數k需要預先給定。

②演算法對初始值的選取依賴性極大以及演算法常陷入區域性最優解。

③該演算法需要不斷地進行樣本分類調整,不斷地計算調整後的簇中心,因此當資料量非常大時,演算法的時間開銷是非常大的。

④由於將簇的質心(即均值)作為簇中心進行新一輪的聚類計算,遠離資料密集區的離群點和噪聲點會導致聚類中心偏離真正的資料密集區,所以k-means演算法對噪聲點和離群點很敏感。

⑤k-means演算法不能用於發現非凸形狀的簇,或具有各種不同大小或密度的簇,即很難檢測“自然的”簇。

⑥只能用於處理數值屬性的資料集,不能處理包含分類屬性的資料集。

3、原始碼示例:

(1)工程目錄:


圖3.1工程目錄截圖

(2)KMeans.java

package com.remoa.experiment4.service;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

/**
 * K-Means演算法
 * @author Remoa
 *
 */
public class KMeans {
	//定義最大歐式距離為5000
	public static final double MAXLENGTH = 5000.0;

	/**
	 * 計算新的簇中心
	 * @param dataVO DataVO實體類
	 * @return 更新後的DataVO實體
	 */
	public static DataVO countClusterCenter(DataVO dataVO){
		List<ClusterVO> clusterList = dataVO.getClusterList();
		List<ClusterVO> newClusterList = new ArrayList<ClusterVO>();
		int i, j, p;
		for(i = 0; i < clusterList.size(); i++){
			ClusterVO cluster = clusterList.get(i);
			List<PointVO> pointList = cluster.getPointList();
			Double[] countArray = new Double[clusterList.get(0).getPointList().get(0).getPoint().length];
			for(j = 0; j < countArray.length; j++){
				countArray[j] = 0.0;
			}
			for(j = 0; j < pointList.size(); j++){
				PointVO point = pointList.get(j);
				Double[] pointValue = point.getPoint();
				for(p = 0; p < pointValue.length; p++){
					countArray[p] = pointValue[p] + countArray[p];
				}
			}
			for(j = 0; j < countArray.length; j++){
				countArray[j] /= pointList.size();
			}
			cluster.setClusterCenter(countArray);
			newClusterList.add(cluster);
		}
		dataVO.setClusterList(newClusterList);
		return dataVO;
	}
	
	/**
	 * 將物件指派到與其距離最近的簇
	 * @param dataVO dataVO實體
	 * @param point 資料點
	 * @return 修改後的dataVO實體
	 */
	public static DataVO distributeIntoCluster(DataVO dataVO, PointVO point){
		double sum = 0.0, max = MAXLENGTH;
		//loca存放在原先簇中的位置,locaRecord存放是在哪個簇
		int locaRecord = 0, loca = 0;
		int i, j, count, n, m;
		List<ClusterVO> clusterList = dataVO.getClusterList();
		List<PointVO> pointList = dataVO.getPointList();
		List<PointVO> clusterPointList = null;
		Double[] distanceArray = new Double[clusterList.size()];
		//獲取資料點內容
		Double[] pointValueArray = point.getPoint();
		Double[] tempArray = new Double[pointValueArray.length];
		//遍歷每一個簇
		for(i = 0; i < clusterList.size(); i++){
			sum = 0.0;
			//得到該簇的中心點
			Double[] clusterCenter = clusterList.get(i).getClusterCenter();
			//將平方值儲存在一個temp陣列
			for(j = 0; j < pointValueArray.length; j++){
				tempArray[j] = Math.pow(clusterCenter[j] - pointValueArray[j], 2);
			}
			//求歐式距離
			for(j = 0; j < tempArray.length; j++){
				sum += tempArray[j];
			}
			//將結果儲存在距離陣列中
			distanceArray[i] = Math.sqrt(sum);
		}
		//遍歷距離陣列,找到要插入的簇
		for(i = 0; i < distanceArray.length; i++){
			if(distanceArray[i] < max){
				max = distanceArray[i];
				locaRecord = i;
			}
		}
		//獲得該簇
		ClusterVO cluster = clusterList.get(locaRecord);
		//找到簇中的該元素
		for(i = 0; i < pointList.size(); i++){
			if(pointList.get(i).equals(point)){
				loca = i;
				break;
			}
		}
		//在同一個簇,不做任何處理
		if(cluster.getClusterid().equals(point.getClusterid())){
			return dataVO;
		}
		//這個資料不在任何一個簇,加進來
		else if(point.getClusterid() == null){
			clusterPointList = cluster.getPointList();
		}
		//在不同的簇中
		else{
			clusterPointList = cluster.getPointList();
			//遍歷每個簇,找到該元素
			for(i = 0; i < clusterList.size(); i++){
				boolean flag = false;
				//遍歷每個簇中元素
				for(m = 0; m < clusterList.get(i).getPointList().size(); m++){
					PointVO everypoint = clusterList.get(i).getPointList().get(m);
					Double[] everypointValue = everypoint.getPoint();
					count = 0;
					for(n = 0; n < everypointValue.length; n++){
						if(pointValueArray[n].doubleValue() == everypointValue[n].doubleValue()){
							count++;
						}
					}
					if(count == everypointValue.length){
						clusterList.get(i).getPointList().remove(m);
						flag = true;
						break;
					}
				}
				if(flag){
					break;
				}
			}
		}
		//設定資料點的所在簇位置
		point.setClusterid(cluster.getClusterid());
		//更新dataVO中的資料點資訊
		pointList.set(loca, point);
		//將資料點加入到簇的資料點集中
		clusterPointList.add(point);
		//將資料點集加入到簇中
		cluster.setPointList(clusterPointList);
		//更新dataVO中的簇資訊
		clusterList.set(locaRecord, cluster);
		//將簇資訊放入dataVO中
		dataVO.setClusterList(clusterList);
		//將資料點集資訊放入到dataVO中
		dataVO.setPointList(pointList);
		return dataVO;
	}
	
	/**
	 * 初始化DataVO
	 * @param cellList 封裝了Excel表中一行行資料的list
	 * @param k k-means演算法中的k
	 * @return 修改後的DataVO實體
	 */
	public static DataVO initDataVO(List<Cell[]> cellList, int k){
		int i, j;
		DataVO dataVO = new DataVO();
		List<PointVO> pointList = new ArrayList<PointVO>();
		List<ClusterVO> clusterList = new ArrayList<ClusterVO>();
		List<ClusterVO> newClusterList = new ArrayList<ClusterVO>();
		Cell[] cell = new Cell[cellList.get(0).length];
		//將所有元素加入到DataVO中管理以及加入PointVO中
		for(i = 0; i < cellList.size(); i++){
			cell = cellList.get(i);
			Double[] point = new Double[cellList.get(0).length];
			for(j = 0; j < cell.length; j++){
				point[j] = Double.valueOf(cell[j].getContents());
			}
			PointVO pointVO = new PointVO();
			pointVO.setPoint(point);
			pointVO.setPointName(null);
			if(i < k){
				String clusterid = UUID.randomUUID().toString();
				pointVO.setClusterid(clusterid);
				ClusterVO cluster = new ClusterVO();
				cluster.setClusterid(clusterid);
				clusterList.add(cluster);
			}else{
				pointVO.setClusterid(null);
			}
			pointList.add(pointVO);
		}
		dataVO.setPointList(pointList);
		//將前k個點作為k個簇
		for(i = 0; i < k; i++){
			cell = cellList.get(i);
			Double[] point = new Double[cellList.get(0).length];
			for(j = 0; j < cell.length; j++){
				point[j] = Double.valueOf(cell[j].getContents());
			}
			ClusterVO cluster = clusterList.get(i);
			cluster.setClusterCenter(point);
			List<PointVO> clusterPointList = new ArrayList<PointVO>();
			PointVO pointVO = new PointVO();
			pointVO.setPoint(point);
			clusterPointList.add(pointVO);
			cluster.setPointList(clusterPointList);
			newClusterList.add(cluster);
		}
		dataVO.setClusterList(newClusterList);
		return dataVO;
	}
	
}
(3)HierarchicalAlgorithm.java
package com.remoa.experiment4.service;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

/**
 * 層次聚類演算法
 * @author Remoa
 *
 */
public class HierarchicalAlgorithm {
	//定義最大歐式距離為5000
	public static final double MAXLENGTH = 5000.0;
	
	/**
	 * 初始化層次聚類演算法的DataVO實體
	 * @param cellList 封裝了Excel中一行行資料的List
	 * @return
	 */
	public static DataVO initDataVO(List<Cell[]> cellList){
		int i, j;
		DataVO dataVO = new DataVO();
		List<ClusterVO> clusterList = new ArrayList<ClusterVO>();
		List<PointVO> pointList = new ArrayList<PointVO>();
		Cell[] cell = new Cell[cellList.get(0).length];
		for(i = 0; i < cellList.size(); i++){
			cell = cellList.get(i);
			Double[] point = new Double[cellList.get(0).length];
			for(j = 0; j < cell.length; j++){
				point[j] = Double.valueOf(cell[j].getContents());
			}
			List<PointVO> clusterPointList = new ArrayList<PointVO>();
			ClusterVO cluster = new ClusterVO();
			PointVO pointVO = new PointVO();
			pointVO.setPoint(point);
			pointVO.setPointName(null);
			String clusterId = UUID.randomUUID().toString();
			pointVO.setClusterid(clusterId);
			clusterPointList.add(pointVO);
			cluster.setClusterCenter(point);
			cluster.setClusterid(clusterId);
			cluster.setPointList(clusterPointList);
			clusterList.add(cluster);
			pointList.add(pointVO);
		}
		dataVO.setClusterList(clusterList);
		dataVO.setPointList(pointList);
		return dataVO;
	}
	
	/**
	 * 簇合並
	 * @param dataVO DataVO實體
	 * @return 修改後的DataVO實體
	 */
	public static DataVO mergeCluster(DataVO dataVO){
		double max = MAXLENGTH;
		//定義一個臨時陣列
		Double[] tempArray = new Double[dataVO.getClusterList().get(0).getClusterCenter().length];
		//定義要合併的兩個簇的下標
		int clusterLoca1 = 0, clusterLoca2 = 0;
		int j, m, count, n, p;
		double sum;
		//遍歷每個簇
		for(int i = 0; i < dataVO.getClusterList().size(); i++){
			//得到第一個簇的中心點
			Double[] clusterCenter1 = dataVO.getClusterList().get(i).getClusterCenter();
			for(int k = i + 1; k < dataVO.getClusterList().size(); k++){
				sum = 0.0;
				//得到第二個簇的中心點
				Double[] clusterCenter2 = dataVO.getClusterList().get(k).getClusterCenter();
				//將平方值儲存在一個temp陣列,求未開根號的歐式距離
				for(j = 0; j < tempArray.length; j++){
					tempArray[j] = Math.pow(clusterCenter1[j] - clusterCenter2[j], 2);
					sum += tempArray[j].doubleValue();
				}
				if(sum < max){
					max = sum;
					clusterLoca1 = i;//第一個簇的位置
					clusterLoca2 = k;//第二個簇的位置
				}
			}
		}
		//合併兩個簇
		String clusterid = UUID.randomUUID().toString();
		ClusterVO cluster1 = dataVO.getClusterList().get(clusterLoca1);
		//遍歷第一個簇的全集,更新其所在dataVO中的資料點的簇id
		for(m = 0; m < cluster1.getPointList().size(); m++){
			count = 0;
			Double[] pointValueArray = cluster1.getPointList().get(m).getPoint();
			List<PointVO> everypoint = dataVO.getPointList();
			for(n = 0; n < everypoint.size(); n++){
				Double[] everypointValue = everypoint.get(n).getPoint();
				for(p = 0; p < everypointValue.length; p++){
					if(pointValueArray[p].doubleValue() == everypointValue[p].doubleValue()){
						count++;
					}
				}
				if(count == everypointValue.length){
					PointVO newpoint1 = everypoint.get(n);
					newpoint1.setClusterid(clusterid);
					dataVO.getPointList().set(n, newpoint1);
					break;
				}
			}
		}
		//更新簇中的資料的簇id
		for(m = 0; m < cluster1.getPointList().size(); m++ ){
			PointVO point = cluster1.getPointList().get(m);
			point.setClusterid(clusterid);
			cluster1.getPointList().set(m, point);
		}
		ClusterVO cluster2 = dataVO.getClusterList().get(clusterLoca2);
		//遍歷第二個簇的全集,更新其所在dataVO中的簇id
		for(m = 0; m < cluster2.getPointList().size(); m++){
			count = 0;
			Double[] pointValueArray = cluster2.getPointList().get(m).getPoint();
			List<PointVO> everypoint = dataVO.getPointList();
			for(n = 0; n < everypoint.size(); n++){
				Double[] everypointValue = everypoint.get(n).getPoint();
				for(p = 0; p < everypointValue.length; p++){
					if(pointValueArray[p].doubleValue() == everypointValue[p].doubleValue()){
						count++;
					}
				}
				if(count == everypointValue.length){
					PointVO newpoint2 = everypoint.get(n);
					newpoint2.setClusterid(clusterid);
					dataVO.getPointList().set(n, newpoint2);
					break;
				}
			}
		}
		//更新簇中的資料的簇id
		for(m = 0; m < cluster2.getPointList().size(); m++ ){
			PointVO point = cluster2.getPointList().get(m);
			point.setClusterid(clusterid);
			cluster2.getPointList().set(m, point);
		}
		ClusterVO newCluster = new ClusterVO();
		List<PointVO> newPointList = new ArrayList<PointVO>();
		newPointList.addAll(cluster1.getPointList());
		newPointList.addAll(cluster2.getPointList());
		Double[] clusterCenter1 = cluster1.getClusterCenter();
		Double[] clusterCenter2 = cluster2.getClusterCenter();
		Double[] newCenter = new Double[clusterCenter1.length];
		for(int i = 0; i < clusterCenter1.length; i++){
			newCenter[i] = (clusterCenter1[i] * cluster1.getPointList().size() + clusterCenter2[i] * cluster2.getPointList().size()) / (cluster1.getPointList().size() +  cluster2.getPointList().size());
		}
		newCluster.setClusterCenter(newCenter);
		newCluster.setClusterid(clusterid);
		newCluster.setPointList(newPointList);
		dataVO.getClusterList().set(clusterLoca1, newCluster);
		dataVO.getClusterList().remove(clusterLoca2);
		return dataVO;
	}
	
}
(4)CorrectRate.java
package com.remoa.experiment4.service;

import java.text.DecimalFormat;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import com.remoa.experiment4.common.ImportData;
import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

import jxl.Cell;

public class CorrectRate {
	/**
	 * 獲得兩個Excel表之間的正確的對映
	 * @return 返回鍵值對map
	 */
	public static Map<Double[], String> getMap(){
		List<Cell[]> resultCellList = ImportData.importResultData();
		List<Cell[]> testCellList = ImportData.importData();
		Map<Double[], String> map = new HashMap<Double[], String>();
		for(int j = 0; j < testCellList.size(); j++){
			Cell[] testCell = testCellList.get(j);
			Cell[] resultCell = resultCellList.get(j);
			String name = resultCell[0].getContents();
			Double[] cellValue = new Double[testCell.length];
			for(int i = 0; i < testCell.length; i++){
				cellValue[i] = Double.valueOf(testCell[i].getContents());
			}
			map.put(cellValue, name);
		}
		return map;
	}
	
	/**
	 * 獲得正確率
	 * @param dataVO
	 */
	public static void getCorrectRate(DataVO dataVO){
		int maxLoca = 0;//最多項在陣列中出現的位置
		int maxSize = 0;//最多項出現的次數
		int sum = 0;//正確項的總和
		Map<Double[], String> map = getMap();
		//每個簇所獲得的簇名
		String[] clusterNameReal = new String[dataVO.getClusterList().size()];
		Set<String> set = new HashSet<String>();
		for(Iterator<Entry<Double[], String>> iter = map.entrySet().iterator(); iter.hasNext();){
			Map.Entry<Double[], String> entry = iter.next();
			String value = entry.getValue();
			set.add(value);
		}
		//封裝簇名
		String[] clusterNameArray = set.toArray(new String[dataVO.getClusterList().size()]);
		int[] countArray = new int[clusterNameArray.length];
		//每個簇正確項個數的陣列
		int[] correctArray = new int[clusterNameArray.length];
		for(int j = 0; j < countArray.length; j++){
			correctArray[j] = 0;
		}
		List<ClusterVO> clusterList = dataVO.getClusterList();
		//遍歷每個簇,根據簇中元素所屬於正確結果的最多值定一個初始的簇類
		for(int i = 0; i < clusterList.size(); i++){
			//計數器初始化
			for(int j = 0; j < countArray.length; j++){
				countArray[j] = 0;
			}
			//最多項出現的次數初始化
			maxSize = 0;
			//簇中元素的List
			List<PointVO> pointList = clusterList.get(i).getPointList();
			//遍歷簇內元素,得到該簇的真實名字
			for(int j = 0; j < pointList.size(); j++){
				String valueStr = "";
				Double[] testDoubleArray = dataVO.getClusterList().get(i).getPointList().get(j).getPoint();
				Set<Double[]> valueSet = map.keySet();
				int temp = 0;
				for(Iterator<Double[]> iter = valueSet.iterator(); iter.hasNext(); ){
					int countSame = 0;
					Double[] valueArray = iter.next();
					for(int m = 0; m < valueArray.length; m++){
						if(valueArray[m].doubleValue() == testDoubleArray[m].doubleValue()){
							countSame++;
						}
					}
					if(countSame == valueArray.length){
						valueStr = map.get(valueArray);
						dataVO.getPointList().get(temp).setPointName(valueStr);
						dataVO.getClusterList().get(i).getPointList().get(j).setPointName(valueStr);
						break;
					}
					temp++;
				}
				for(int m = 0; m < clusterNameArray.length; m++){
					if(clusterNameArray[m].equals(valueStr)){
						countArray[m]++;
					}
				}
			}
			for(int z = 0; z < countArray.length; z++){
				if(countArray[z] >= maxSize){
					maxSize = countArray[z];
					maxLoca = z;
				}
			}
			clusterNameReal[i] = clusterNameArray[maxLoca];
			correctArray[i] = maxSize;
		}
		System.out.println("###############################");
		for(int i = 0; i < correctArray.length; i++){
			sum += correctArray[i];
			System.out.println("簇" + clusterNameReal[i] + "共有" + dataVO.getClusterList().get(i).getPointList().size() + "項,其中正確項有" + correctArray[i] + "項;");
		}
		System.out.println("項的總數為:" + dataVO.getPointList().size() + "項");
		double result = sum * 1.0 / dataVO.getPointList().size() * 100;
		DecimalFormat df = new DecimalFormat("0.00");
		System.out.println("正確率為:" + df.format(result) + "%");
	}
	
}

(5)ImportData.java

package com.remoa.experiment4.common;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import java.util.Properties;

import com.remoa.experiment4.common.util.ExcelUtil;

import jxl.Cell;
import jxl.Workbook;

/**
 * 獲得需要的資料的工具類
 * @author Remoa
 *
 */
public class ImportData {
	/**
	 * 匯入測試資料
	 * @return 返回封裝了測試資料的list
	 */
	public static List<Cell[]> importData(){
		Properties prop = null;
		try {
        	prop = new Properties();
        	InputStream is = new FileInputStream("DataLoadIn.properties");
			prop.load(is);
			is.close();
		} catch (Exception e) {
			System.out.println("未能讀取到Excel檔案,修改配置檔案路徑後重試!");
			e.printStackTrace();
		}  
		String absolutePath = prop.getProperty("absolutePath");
		int sheetLoca = Integer.valueOf(prop.getProperty("sheetLoca"));
		int initRowLoca = Integer.valueOf(prop.getProperty("initRowLoca"));
		Workbook workbook = ExcelUtil.readExcel(absolutePath);
		List<Cell[]> list = ExcelUtil.sheetEncapsulation(workbook, sheetLoca, initRowLoca);
		return list;
	}
	
	/**
	 * 得到簇的數目
	 * @return 返回簇的數目
	 */
	public static int getclusterNumber(){
		Properties prop = null;
		try {
        	prop = new Properties();
        	InputStream is = new FileInputStream("DataLoadIn.properties");
			prop.load(is);
			is.close();
		} catch (Exception e) {
			System.out.println("未能讀取到Excel檔案,修改配置檔案路徑後重試!");
			e.printStackTrace();
		}  
		int clusterNumber = Integer.valueOf(prop.getProperty("clusterNumber"));
		return clusterNumber;
	}
	
	/**
	 * 匯入正確的分類結果資料
	 * @return 返回封裝該結果資料的list
	 */
	public static List<Cell[]> importResultData(){
		Properties prop = null;
		try {
        	prop = new Properties();
        	InputStream is = new FileInputStream("ResultLoadIn.properties");
			prop.load(is);
			is.close();
		} catch (Exception e) {
			System.out.println("未能讀取到Excel檔案,修改配置檔案路徑後重試!");
			e.printStackTrace();
		}  
		String absolutePath = prop.getProperty("absolutePath");
		int sheetLoca = Integer.valueOf(prop.getProperty("sheetLoca"));
		int initRowLoca = Integer.valueOf(prop.getProperty("initRowLoca"));
		Workbook workbook = ExcelUtil.readExcel(absolutePath);
		List<Cell[]> list = ExcelUtil.sheetEncapsulation(workbook, sheetLoca, initRowLoca);
		return list;
	}
	
}
(6)ExcelUtil.java
package com.remoa.experiment4.common.util;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;
import jxl.read.biff.BiffException;

/**
 * Excel工具類需要匯入jxl的jar包,其常用方法總結如下:
 * (1)Workbook為Excel檔案,Cell為單元格,Sheet為工作表物件
 * (2)sheet.getCell(x,y):獲得第x行第y列的單元格
 * (3)workbook.getWorkbook(File):獲得檔案
 * (4)workbook.getSheet(0):獲得0號(第一個)工作表物件
 * (5)cell.getContents():獲得單元格的內容
 * (6)Cell[] cells = sheet.getColumn(column):獲得某一列的值
 * (7)Cell[] cells = sheet.getRow(row):獲得某一行的值
 * @author Remoa
 *
 */
public class ExcelUtil {
	/**
	 * 讀取Excel檔案
	 * @param filePath Excel檔案的絕對路徑
	 * @return 返回Workbook
	 */
	public static Workbook readExcel(String filePath){
		File file = null;
		Workbook workbook = null;
		file = new File(filePath);
		try {
			workbook = Workbook.getWorkbook(file);
		} catch (BiffException e) {
			System.out.println("輸入流讀入為空,java讀取Excel異常");
			e.printStackTrace();
		} catch (IOException e) {
			System.out.println("IO異常");
			e.printStackTrace();
		}
		return workbook;
	}
	
	/**
	 * 對Excel檔案工作表的內容進行封裝
	 * @param workbook Excel檔案
	 * @param sheetLoca 工作表位置
	 * @param initRowLoca 初始行,即非表頭行的記錄開始的行數
	 * @return 返回一個封裝了一行行資料的List
	 */
	public static List<Cell[]> sheetEncapsulation(Workbook workbook, int sheetLoca, int initRowLoca){
		Sheet sheet = workbook.getSheet(sheetLoca);
		List<Cell[]> list = new ArrayList<Cell[]>();
		Cell[] cells = null;
		int i = initRowLoca - 1, length = sheet.getRows() - initRowLoca + 1;
		while(length-- != 0){
			cells = sheet.getRow(i);
			list.add(cells);
			i++;
		}
		return list;
	}
	
	/**
	 * 當表頭存在多行時,獲得某一特定所需表頭行,將該表頭行資訊儲存為一個Cell陣列
	 * @param workbook Excel檔案
	 * @param sheetLoca 工作表位置
	 * @param wantLoca 想獲得的特定表頭行位置
	 * @return 該表頭行資訊Cell[]陣列
	 */
	public static Cell[] getHeadInfo(Workbook workbook, int sheetLoca, int wantLoca){
		if(wantLoca == -1){
			return null;
		}else{
			Sheet sheet = workbook.getSheet(sheetLoca);
			Cell[] cells = sheet.getRow(wantLoca - 1);
			return cells;
		}
	}
	
}
(7)PrintUtil.java
package com.remoa.experiment4.common.util;

import java.util.Iterator;
import java.util.List;

import com.remoa.experiment4.domain.ClusterVO;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;

/**
 * 列印工具類
 * @author Remoa
 *
 */
public class PrintUtil {
	/**
	 * 列印每個簇中的具體內容
	 * @param dataVO DataVO實體
	 */
	public static void printClusterContents(DataVO dataVO){
		List<ClusterVO> clusterList = dataVO.getClusterList();
		//遍歷每個簇
		for(int i = 0; i < clusterList.size(); i++){
			System.out.println("第" + (i+1) + "個簇共有" + clusterList.get(i).getPointList().size() + "項,內容如下:");
			ClusterVO cluster = clusterList.get(i);//得到該簇
			List<PointVO> pointList = cluster.getPointList();//簇內元素的list
			//遍歷簇內元素
			for(Iterator<PointVO> iter = pointList.iterator(); iter.hasNext(); ){
				PointVO pointVO = iter.next();
				Double[] valueArray = pointVO.getPoint();
				for(int j = 0; j < valueArray.length - 1; j++){
					System.out.print(valueArray[j] + ", ");
				}
				System.out.println(valueArray[valueArray.length - 1]);
			}
		}
	}
	
}
(8)DataVO.java
package com.remoa.experiment4.domain;

import java.util.List;

/**
 * DataVO實體類,封裝了簇的list以及資料點集的list
 * @author Remoa
 *
 */
public class DataVO {
	private List<ClusterVO> clusterList;//簇
	private List<PointVO> pointList;//資料點集
	
	public List<ClusterVO> getClusterList() {
		return clusterList;
	}
	
	public void setClusterList(List<ClusterVO> clusterList) {
		this.clusterList = clusterList;
	}
	
	public List<PointVO> getPointList() {
		return pointList;
	}
	
	public void setPointList(List<PointVO> pointList) {
		this.pointList = pointList;
	}
	
	@Override
	public String toString() {
		return "DataVO [clusterList=" + clusterList + ", pointList=" + pointList + "]";
	}
	
}
(9)ClusterVO.java
package com.remoa.experiment4.domain;

import java.util.Arrays;
import java.util.List;

/**
 * 簇實體,封裝了簇心和該簇中的簇內元素集
 * @author Remoa
 *
 */
public class ClusterVO{
	private String clusterid;
	private Double[] clusterCenter;//簇心
	private List<PointVO> pointList;//簇內元素
	
	public String getClusterid() {
		return clusterid;
	}

	public void setClusterid(String clusterid) {
		this.clusterid = clusterid;
	}

	public Double[] getClusterCenter() {
		return clusterCenter;
	}

	public void setClusterCenter(Double[] clusterCenter) {
		this.clusterCenter = clusterCenter;
	}

	public List<PointVO> getPointList() {
		return pointList;
	}

	public void setPointList(List<PointVO> pointList) {
		this.pointList = pointList;
	}

	@Override
	public String toString() {
		return "ClusterVO [clusterid=" + clusterid + ", clusterCenter=" + Arrays.toString(clusterCenter)
				+ ", pointList=" + pointList + "]";
	}

}
(10)PointVO.java
package com.remoa.experiment4.domain;

import java.util.Arrays;
/**
 * 資料點實體,封裝了具體的Cell中每行的資料點的具體的double值,以及資料點所在的簇和該簇的簇名
 * @author Remoa
 *
 */
public class PointVO {
	private Double[] point;//資料點
	private String clusterid;//資料點所在的簇
	private String pointName;//給資料點所對應的簇名
	
	public Double[] getPoint() {
		return point;
	}
	
	public void setPoint(Double[] point) {
		this.point = point;
	}
	
	public String getClusterid() {
		return clusterid;
	}
	
	public void setClusterid(String clusterid) {
		this.clusterid = clusterid;
	}
	
	public String getPointName() {
		return pointName;
	}
	
	public void setPointName(String pointName) {
		this.pointName = pointName;
	}
	
	@Override
	public String toString() {
		return "PointVO [point=" + Arrays.toString(point) + ", clusterid=" + clusterid + ", pointName=" + pointName
				+ "]";
	}

}
(11)ChooseFactory.java
package com.remoa.experiment4.common.factory;

import com.remoa.experiment4.common.strategy.Clustering;
import com.remoa.experiment4.common.strategy.HierarchicalClustering;
import com.remoa.experiment4.common.strategy.KMeansClustering;

/**
 * 決策工廠類,用於使用者決策選擇
 * @author Remoa
 *
 */
public class ChooserFactory {
	/**
	 * 執行聚類演算法
	 * @param algorithmName
	 */
	public void runAlgorithm(String algorithmName){
		long startTime = System.currentTimeMillis();
		Clustering clustering = null;
		algorithmName = algorithmName.toLowerCase();
		switch(algorithmName){
			case "kmeans":
				clustering = new KMeansClustering();
				break;
			case "hierarchical clustering":
				clustering = new HierarchicalClustering();
				break;
		}
		clustering.clusterAlgorithm();
		long endTime = System.currentTimeMillis();
		System.out.println(algorithmName + "演算法共耗時:" + (endTime - startTime)*1.0 / 1000 + "s");
		System.out.println("------------------------------------------------");
	}
	
}
(12)Clustering.java
package com.remoa.experiment4.common.strategy;
/**
 * 策略介面,封裝聚類分析的不同演算法。
 * @author Remoa
 *
 */
public interface Clustering {
	/**
	 * 策略抽象介面
	 */
	public void clusterAlgorithm();
	
}
(13)KMeansClustering.java
package com.remoa.experiment4.common.strategy;

import java.util.ArrayList;
import java.util.List;

import com.remoa.experiment4.common.ImportData;
import com.remoa.experiment4.common.util.PrintUtil;
import com.remoa.experiment4.domain.DataVO;
import com.remoa.experiment4.domain.PointVO;
import com.remoa.experiment4.service.CorrectRate;
import com.remoa.experiment4.service.KMeans;

import jxl.Cell;

/**
 * 使用K-Means演算法
 * @author Remoa
 *
 */
public class KMeansClustering implements Clustering{
	/**
	 * 呼叫K-Means演算法
	 */
	@Override
	public void clusterAlgorithm() {
		List<Cell[]> cellList = ImportData.importData();
        int clusterNumber = ImportData.getclusterNumber();
        DataVO dataVO = KMeans.initDataVO(cellList, clusterNumber);
        List<PointVO> pointList = dataVO.getPointList();
        int count = clusterNumber;
        List<Double[]> centerValueList = new ArrayList<Double[]>();
        for(int i = 0; i < clusterNumber; i++){
        	Double[] center = dataVO.getClusterList().get(i).getClusterCenter();
        	centerValueList.add(center);
        }
        while(count != 0){
        	count = 0;
        	for(int i = 0; i < pointList.size(); i++){
	        	dataVO = KMeans.distributeIntoCluster(dataVO, dataVO.getPointList().get(i));
	        }
	        dataVO = KMeans.countClusterCenter(dataVO);
        	List<Double[]> newCenterValueList = new ArrayList<Double[]>();
        	for(int i = 0; i < clusterNumber; i++){
            	Double[] center = dataVO.getClusterList().get(i).getClusterCenter();
            	newCenterValueList.add(center);
            }
        	for(int i = 0; i < clusterNumber; i++){
        		Double[] oldCenter = centerValueList.get(i);
        		Double[] newCenter = newCenterValueList.get(i);
        		for(int j = 0; j < oldCenter.length; j++){
        			//控制誤差的精確度範圍在0.01%
        			if(Math.abs(oldCenter[j] - newCenter[j]) >= 0.0001){
        				count++;
        				break;
        			}
        		}
        	}
        	for(int i = 0; i < clusterNumber; i++){
        		centerValueList.remove(0);
        	}
        	centerValueList.addAll(newCenterValueList);
        }
        PrintUtil.printClusterContents(dataVO);
        CorrectRate.getCorrectRate(dataVO);
	}

}
(14)HierarchicalClustering.java