
Text Clustering with K-means

The previous two articles classified the newsgroup text collection with the Naive Bayes and KNN algorithms respectively; this article clusters the same texts with the K-means algorithm.

1. Text Preprocessing

Text preprocessing was covered in the previous two articles, so it is omitted here.

2. Text Vectorization
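For reference before the code: for each document d and feature word t, the vectorizer below assigns the standard TF-IDF weight

$$w_{t,d} = \frac{\mathrm{tf}_{t,d}}{|d|}\cdot\ln\frac{N}{\mathrm{df}_t}$$

where tf_{t,d} is the number of occurrences of t in d, |d| is the total word count of d, N is the number of documents, and df_t is the number of documents containing t. This matches the computation in computeTFMultiIDF below.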

package com.datamine.kmeans;

import java.io.*;
import java.util.*;
import java.util.Map.Entry;

/**
 * Compute the feature vector of each document, vectorizing the whole collection
 * @author Administrator
 */
public class ComputeWordsVector {

	/**
	 * Compute the TF-IDF feature vector of every document, returning Map<file name, <feature word, TF-IDF value>>
	 * @param testSampleDir directory of the preprocessed clustering test samples
	 * @return map of the feature vectors of all test samples
	 * @throws IOException
	 */
	public Map<String,Map<String,Double>> computeTFMultiIDF(String testSampleDir) throws IOException{
		
		String word;
		Map<String,Map<String,Double>> allTestSampleMap = new TreeMap<String, Map<String,Double>>();
		Map<String,Double> idfPerWordMap = computeIDF(testSampleDir);
		Map<String,Double> tfPerDocMap = new TreeMap<String, Double>();
		
		File[] samples = new File(testSampleDir).listFiles();
		System.out.println("the total number of test files is " + samples.length);
		for(int i = 0;i<samples.length;i++){
			
			tfPerDocMap.clear();
			FileReader samReader = new FileReader(samples[i]);
			BufferedReader samBR = new BufferedReader(samReader);
			Double wordSumPerDoc = 0.0; // total number of words in this document
			while((word = samBR.readLine()) != null){
				if(!word.isEmpty()){
					wordSumPerDoc++;
					if(tfPerDocMap.containsKey(word))
						tfPerDocMap.put(word, tfPerDocMap.get(word)+1.0);
					else
						tfPerDocMap.put(word, 1.0);
				}
			}
			samBR.close(); // close the reader so file handles are not leaked
			
			Double maxCount = 0.0,wordWeight; // highest term count in this document, intended for TF normalization (computed but never used below)
			Set<Map.Entry<String, Double>> tempTF = tfPerDocMap.entrySet();
			for(Iterator<Map.Entry<String, Double>> mt = tempTF.iterator();mt.hasNext();){
				Map.Entry<String, Double> me = mt.next();
				if(me.getValue() > maxCount)
					maxCount = me.getValue();
			}
			
			for(Iterator<Map.Entry<String, Double>> mt = tempTF.iterator();mt.hasNext();){
				Map.Entry<String, Double> me = mt.next();
				Double IDF = Math.log(samples.length / idfPerWordMap.get(me.getKey())); // IDF = ln(N / df)
				wordWeight = (me.getValue() / wordSumPerDoc) * IDF; // TF-IDF = (tf / |d|) * IDF
				tfPerDocMap.put(me.getKey(), wordWeight);
			}
			TreeMap<String,Double> tempMap = new TreeMap<String, Double>();
			tempMap.putAll(tfPerDocMap);
			allTestSampleMap.put(samples[i].getName(), tempMap);
		}
		printTestSampleMap(allTestSampleMap);
		return allTestSampleMap;
	}
	
	/**
	 * Write the contents of the test-sample map to a file, for debugging
	 * @param allTestSampleMap
	 * @throws IOException 
	 */
	private void printTestSampleMap(
			Map<String, Map<String, Double>> allTestSampleMap) throws IOException {
		File outPutFile = new File("E:/DataMiningSample/KmeansClusterResult/allTestSampleMap.txt");
		FileWriter outPutFileWriter = new FileWriter(outPutFile);
		Set<Map.Entry<String, Map<String,Double>>> allWords = allTestSampleMap.entrySet();
		
		for(Iterator<Entry<String, Map<String, Double>>> it = allWords.iterator();it.hasNext();){
			
			Map.Entry<String, Map<String,Double>> me = it.next();
			outPutFileWriter.append(me.getKey()+" ");
			
			Set<Map.Entry<String, Double>> vectorSet = me.getValue().entrySet();
			for(Iterator<Map.Entry<String, Double>> vt = vectorSet.iterator();vt.hasNext();){
				Map.Entry<String, Double> vme = vt.next();
				outPutFileWriter.append(vme.getKey()+" "+vme.getValue()+" ");
			}
			outPutFileWriter.append("\n");
			outPutFileWriter.flush();
		}
		outPutFileWriter.close();
		
	}

	/**
	 * Count the total occurrences of every word; words occurring more than n times form the final feature dictionary
	 * @param strDir absolute path of the preprocessed newsgroup directory
	 * @param wordMap dictionary recording every word encountered
	 * @return newWordMap the final feature dictionary of words occurring more than n times
	 * @throws IOException
	 */
	public SortedMap<String, Double> countWords(String strDir,
			Map<String, Double> wordMap) throws IOException {
		
		File sampleFile = new File(strDir);
		File[] sample = sampleFile.listFiles();
		String word;
		
		for(int i =0 ;i < sample.length;i++){
			
			if(!sample[i].isDirectory()){
				FileReader samReader = new FileReader(sample[i]);
				BufferedReader samBR = new BufferedReader(samReader);
				while((word = samBR.readLine()) != null){
					if(word.isEmpty()) // skip blank lines (the original condition wrongly counted them as a "" token)
						continue;
					if(wordMap.containsKey(word))
						wordMap.put(word, wordMap.get(word)+1);
					else
						wordMap.put(word, 1.0);
				}
				samBR.close();
			}else{
				countWords(sample[i].getCanonicalPath(),wordMap);
			}
		}
		
		/*
		 * After stop-word removal, select feature words by document frequency (DF) for now;
		 * a proper feature-selection algorithm can be added later
		 */
		SortedMap<String,Double> newWordMap = new TreeMap<String, Double>();
		Set<Map.Entry<String, Double>> allWords = wordMap.entrySet();
		for(Iterator<Map.Entry<String, Double>> it = allWords.iterator();it.hasNext();){
			Map.Entry<String, Double> me = it.next();
			if(me.getValue() > 100) // DF-based dimensionality reduction: keep words occurring more than 100 times
				newWordMap.put(me.getKey(), me.getValue());
		}
		
		return newWordMap;
	}
	
	/**
	 * Compute the document frequency used for IDF: for each word in the feature dictionary, the number of documents containing it
	 * @param testSampleDir directory of the clustering test samples
	 * @return map of <word, number of documents containing the word> (IDF itself is derived later as ln(N/df))
	 * @throws IOException
	 */
	public Map<String,Double> computeIDF(String testSampleDir) throws IOException{
		
		Map<String,Double> IDFPerWordMap = new TreeMap<String, Double>();
		// words already seen in the current document
		Set<String> alreadyCountWord = new HashSet<String>();
		String word;
		File[] samples = new File(testSampleDir).listFiles();
		for(int i = 0;i<samples.length;i++){
			
			alreadyCountWord.clear();
			FileReader tsReader = new FileReader(samples[i]);
			BufferedReader tsBR = new BufferedReader(tsReader);
			while((word = tsBR.readLine()) != null){
				
				if(!alreadyCountWord.contains(word)){
					if(IDFPerWordMap.containsKey(word))
						IDFPerWordMap.put(word, IDFPerWordMap.get(word)+1.0);
					else
						IDFPerWordMap.put(word, 1.0);
					alreadyCountWord.add(word);
				}
			}
			tsBR.close(); // close the reader for this document
		}
		return IDFPerWordMap;
	}

	/**
	 * Build the test-sample set for the clustering algorithm: filter each document down to feature words only and write it to a target directory
	 * @param srcDir source directory of preprocessed documents not yet filtered to feature words
	 * @param desDir destination directory for the clustering test samples
	 * @return array of the feature words of the test-sample set
	 * @throws IOException 
	 */
	public String[] createTestSamples(String srcDir, String desDir) throws IOException {
		
		SortedMap<String,Double> wordMap = new TreeMap<String, Double>();
		wordMap = countWords(srcDir,wordMap);
		System.out.println("special words map sizes:" + wordMap.size());
		String word,testSampleFile;
		
		File[] sampleDir = new File(srcDir).listFiles();
		for(int i =0;i<sampleDir.length;i++){
			
			File[] sample = sampleDir[i].listFiles();
			for(int j =0;j<sample.length;j++){
				
				testSampleFile = desDir + sampleDir[i].getName()+"_"+sample[j].getName();
				FileReader samReader = new FileReader(sample[j]);
				BufferedReader samBR = new BufferedReader(samReader);
				FileWriter tsWriter = new FileWriter(new File(testSampleFile));
				while((word = samBR.readLine()) != null){
					if(wordMap.containsKey(word))
						tsWriter.append(word + "\n");
				}
				samBR.close(); // close the reader before moving to the next file
				tsWriter.flush();
				tsWriter.close();
			}
		}
	
		// return the feature dictionary as an array
		String[] terms = new String[wordMap.size()];
		int i = 0;
		Set<Map.Entry<String, Double>> allWords = wordMap.entrySet();
		for(Iterator<Map.Entry<String, Double>> it = allWords.iterator();it.hasNext();){
			Map.Entry<String, Double> me = it.next();
			terms[i] = me.getKey();
			i++;
		}
		
		return terms;
		
	}
}

3. The K-means Algorithm

K-means is a classic clustering algorithm. Its main steps are as follows: first choose K initial points (for example at random) as the initial cluster centers, then compute the distance from every point to each of the K centers and assign each point to the nearest cluster. Once all points are assigned, recompute the center of each cluster; because the centers have moved, update them, recompute every point's distance to the centers, and reassign the points. The centers move again, and the algorithm iterates in this way until it converges.

Initial center selection strategies: random selection, uniform sampling, the max-min method, etc.

Distance measure: one minus the similarity, where the similarity is either (1) the cosine similarity or (2) the vector inner product.

Stopping criteria: evaluate a criterion function and cap the maximum number of iterations (see the formula after this list).

Handling empty clusters: beware of program bugs caused by clusters that become empty during iteration.
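For reference, the criterion function mentioned above is usually the sum of squared errors (SSE) over all clusters,

$$J = \sum_{j=1}^{K}\sum_{x_i \in C_j}\lVert x_i - \mu_j\rVert^2,$$

where \mu_j is the center of cluster C_j. The implementation below does not evaluate J explicitly: it stops as soon as every point's current assignment is already its nearest center, or after 10 iterations.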

package com.datamine.kmeans;

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.*;

/**
 * Implementation of the k-means clustering algorithm; clusters the newsgroup document set into 10, 20, or 30 clusters
 * Termination condition: the algorithm ends when every point's nearest cluster center is already the center it is assigned to
 * @author Administrator
 *
 */
public class KmeansCluster {

	/**
	 * Main k-means procedure
	 * @param allTestSampleMap vectorized test samples for clustering: <file name, <feature word, TF-IDF value>>
	 * @param k number of clusters
	 * @return clustering result <file name, assigned cluster index>
	 */
	private Map<String, Integer> doProcess(
			Map<String, Map<String, Double>> allTestSampleMap, int k) {
		
		// 0. Collect the file names of allTestSampleMap into an array, in map order
		String[] testSampleNames = new String[allTestSampleMap.size()];
		int count =0,tsLength = allTestSampleMap.size();
		Set<Map.Entry<String, Map<String,Double>>> allTestSampleMapSet = allTestSampleMap.entrySet();
		for(Iterator<Map.Entry<String, Map<String,Double>>> it = allTestSampleMapSet.iterator();it.hasNext();){
			Map.Entry<String, Map<String,Double>> me = it.next();
			testSampleNames[count++] = me.getKey();
		}
		
		// 1. Initial centers can be chosen at random or spaced evenly through the samples; the latter is used here
		Map<Integer,Map<String,Double>> meansMap = getInitPoint(allTestSampleMap,k);
		double [][] distance = new double[tsLength][k]; // distance[i][j] is the distance from point i to cluster center j
		
		// 2. Initialize the k clusters
		int[] assignMeans = new int[tsLength]; // cluster index assigned to each point, initially all 0
		Map<Integer,Vector<Integer>> clusterMember = new TreeMap<Integer, Vector<Integer>>(); // member point indices of each cluster
		int iterNum = 0; // iteration counter
		
		while(true){
			System.out.println("Iteration No." + (iterNum++) + "-------------------------");
			// 3. Compute the distance from every point to every cluster center
			for(int i = 0;i < tsLength;i++){
				for(int j = 0;j<k;j++)
					distance[i][j] = getDistance(allTestSampleMap.get(testSampleNames[i]),meansMap.get(j));
			}
			
			// 4. Find the nearest cluster center for each point
			int [] nearestMeans = new int[tsLength];
			for(int i = 0;i < tsLength;i++){
				nearestMeans[i] = findNearestMeans(distance,i);
			}
			
			// 5. If every point is already assigned to its nearest cluster, or the maximum number of iterations is reached, stop
			int okCount = 0;
			for(int i= 0;i<tsLength;i++){
				if(nearestMeans[i] == assignMeans[i])
					okCount ++;
			}
			System.out.println("okCount = " + okCount);
			if(okCount == tsLength || iterNum >= 10)
				break;
			
			// 6. Otherwise re-cluster for another iteration: update each cluster's membership and each point's assignment
			clusterMember.clear();
			for(int i = 0;i < tsLength;i++){
				assignMeans[i] = nearestMeans[i];
				if(clusterMember.containsKey(nearestMeans[i])){
					clusterMember.get(nearestMeans[i]).add(i);
				}
				else{
					Vector<Integer> tempMem = new Vector<Integer>();
					tempMem.add(i);
					clusterMember.put(nearestMeans[i], tempMem);
				}
			}
			
			// 7. Recompute each cluster's center
			for(int i = 0;i<k;i++){
				
				if(!clusterMember.containsKey(i)) // k-means can produce empty clusters; skip them and keep the old center
					continue;
				
				Map<String,Double> newMean = computeNewMean(clusterMember.get(i),allTestSampleMap,testSampleNames);
				Map<String,Double> tempMean = new TreeMap<String,Double>();
				tempMean.putAll(newMean);
				meansMap.put(i, tempMean);
			}
		
		}
		
		// 8. Build and return the clustering result
		Map<String,Integer> resMap = new TreeMap<String,Integer>();
		for(int i = 0;i<tsLength;i++){
			resMap.put(testSampleNames[i], assignMeans[i]);
		}
		
		return resMap;
	}
	
	/**
	 * Compute the new center of a cluster as the average of its member vectors
	 * @param clusterM indices of the member points of this cluster
	 * @param allTestSampleMap all test samples <file name, vector>
	 * @param testSampleNames array of all test sample names
	 * @return the new cluster center vector
	 */
	private Map<String, Double> computeNewMean(Vector<Integer> clusterM,
			Map<String, Map<String, Double>> allTestSampleMap,
			String[] testSampleNames) {
		
		double memberNum = (double)clusterM.size();
		Map<String,Double> newMeanMap = new TreeMap<String,Double>();
		Map<String,Double> currentMemMap = new TreeMap<String, Double>();
		
		for(Iterator<Integer> it = clusterM.iterator();it.hasNext();){
			int me = it.next();
			currentMemMap = allTestSampleMap.get(testSampleNames[me]);
			Set<Map.Entry<String, Double>> currentMemMapSet = currentMemMap.entrySet();
			for(Iterator<Map.Entry<String, Double>> jt = currentMemMapSet.iterator();jt.hasNext();){
				Map.Entry<String, Double> ne = jt.next();
				if(newMeanMap.containsKey(ne.getKey()))
					newMeanMap.put(ne.getKey(), newMeanMap.get(ne.getKey())+ne.getValue());
				else
					newMeanMap.put(ne.getKey(), ne.getValue());
			}
		}
		
		Set<Map.Entry<String, Double>> newMeanMapSet = newMeanMap.entrySet();
		for(Iterator<Map.Entry<String, Double>> it = newMeanMapSet.iterator();it.hasNext();){
			Map.Entry<String, Double> me = it.next();
			newMeanMap.put(me.getKey(), newMeanMap.get(me.getKey()) / memberNum);
		}
		
		return newMeanMap;
	}

	/**
	 * Find the cluster center nearest to the given point
	 * @param distance distances from every point to every cluster center
	 * @param m index of the point (document)
	 * @return index j of the nearest cluster center
	 */
	private int findNearestMeans(double[][] distance, int m) {
		
		double minDist = Double.MAX_VALUE; // distances here are at most 1, but MAX_VALUE is a safer initial bound
		int j = 0;
		for(int i = 0;i<distance[m].length;i++){
			if(distance[m][i] < minDist){
				minDist = distance[m][i];
				j = i;
			}
		}
		return j;
	}

	/**
	 * Compute the distance between two points
	 * @param map1 vector map of point 1
	 * @param map2 vector map of point 2
	 * @return the distance between the two points, defined as 1 - similarity (not the Euclidean distance)
	 */
	private double getDistance(Map<String, Double> map1, Map<String, Double> map2) {

		return 1 - computeSim(map1,map2);
	}

	/** Compute the similarity of two texts
	 * @param testWordTFMap <word, weight> vector of text 1
	 * @param trainWordTFMap <word, weight> vector of text 2
	 * @return Double similarity of the two vectors: the cosine of their angle (uncomment the commented-out code) or the plain inner product (as shipped; comparable quality and faster)
	 * @throws IOException 
	 */
	private double computeSim(Map<String, Double> testWordTFMap,
			Map<String, Double> trainWordTFMap) {
		double mul = 0;//, testAbs = 0, trainAbs = 0;
		Set<Map.Entry<String, Double>> testWordTFMapSet = testWordTFMap.entrySet();
		for(Iterator<Map.Entry<String, Double>> it = testWordTFMapSet.iterator(); it.hasNext();){
			Map.Entry<String, Double> me = it.next();
			if(trainWordTFMap.containsKey(me.getKey())){
				mul += me.getValue()*trainWordTFMap.get(me.getKey());
			}
			//testAbs += me.getValue() * me.getValue();
		}
		//testAbs = Math.sqrt(testAbs);
		
		/*Set<Map.Entry<String, Double>> trainWordTFMapSet = trainWordTFMap.entrySet();
		for(Iterator<Map.Entry<String, Double>> it = trainWordTFMapSet.iterator(); it.hasNext();){
			Map.Entry<String, Double> me = it.next();
			trainAbs += me.getValue()*me.getValue();
		}
		trainAbs = Math.sqrt(trainAbs);*/
		return mul ;/// (testAbs * trainAbs);
	}
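
	/*
	 * For reference: a minimal sketch of the cosine variant that the commented-out
	 * code above corresponds to, written as a self-contained helper. This method is
	 * an illustrative addition and is not called by the original implementation.
	 */
	private double computeCosineSim(Map<String, Double> map1, Map<String, Double> map2) {
		double dot = 0, norm1 = 0, norm2 = 0;
		for (Map.Entry<String, Double> me : map1.entrySet()) {
			Double w = map2.get(me.getKey());
			if (w != null)
				dot += me.getValue() * w; // only shared terms contribute to the dot product
			norm1 += me.getValue() * me.getValue();
		}
		for (Double v : map2.values())
			norm2 += v * v;
		if (norm1 == 0 || norm2 == 0)
			return 0; // guard against empty vectors
		return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
	}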

	/**
	 * Pick the initial centers for the k-means iteration
	 * @param allTestSampleMap <file name, <feature word, TF-IDF value>>
	 * @param k number of clusters
	 * @return meansMap center vectors of the k clusters
	 */
	private Map<Integer, Map<String, Double>> getInitPoint(
			Map<String, Map<String, Double>> allTestSampleMap, int k) {
		
		int count = 0, i = 0;
		// center vectors of the k clusters
		Map<Integer,Map<String,Double>> meansMap = new TreeMap<Integer, Map<String,Double>>();
		System.out.println("本次聚類的初始點對應的檔案為:");
		Set<Map.Entry<String, Map<String,Double>>> allTestSampleMapSet = allTestSampleMap.entrySet();
		for(Iterator<Map.Entry<String, Map<String,Double>>> it = allTestSampleMapSet.iterator();it.hasNext();){
			Map.Entry<String, Map<String,Double>> me = it.next();
			if(count == i*allTestSampleMapSet.size() / k){
				meansMap.put(i, me.getValue());
				System.out.println(me.getKey());
				i++;
			}
			count++ ;
		}
		
		return meansMap;
	}

	/**
	 * Write the clustering result to a file
	 * @param kmeansClusterResult the clustering result
	 * @param kmeansClusterResultFile path of the output file
	 * @throws IOException 
	 */
	private void printClusterResult(Map<String, Integer> kmeansClusterResult,
			String kmeansClusterResultFile) throws IOException {

		FileWriter resultWriter = new FileWriter(kmeansClusterResultFile);
		Set<Map.Entry<String, Integer>> kmeansClusterResultSet = kmeansClusterResult.entrySet();
		for(Iterator<Map.Entry<String, Integer>> it = kmeansClusterResultSet.iterator();it.hasNext();){
			Map.Entry<String, Integer> me = it.next();
			resultWriter.append(me.getKey()+" "+me.getValue()+"\n");
		}
		resultWriter.flush();
		resultWriter.close();
	}
	
	/**
	 * Evaluation: compute the entropy and the confusion matrix from the clustering result file
	 * @param kmeansClusterResultFile the clustering result file
	 * @param k number of clusters
	 * @return entropy of the clustering result
	 * @throws IOException 
	 */
	private double evaluateClusterResult(String kmeansClusterResultFile, int k) throws IOException {

		Map<String,String> rightCate = new TreeMap<String, String>();
		Map<String,String> resultCate = new TreeMap<String, String>();
		FileReader crReader = new FileReader(kmeansClusterResultFile);
		BufferedReader crBR  = new BufferedReader(crReader);
		String[] s;
		String line;
		while((line = crBR.readLine()) != null){
			s = line.split(" ");
			resultCate.put(s[0], s[1]);
			rightCate.put(s[0], s[0].split("_")[0]);
		}
		crBR.close();
		return computeEntropyAndConfuMatrix(rightCate,resultCate,k); // return the entropy
	}
	
	/**
	 * Compute and print the confusion matrix, and return the entropy
	 * @param rightCate map of the correct category of each file
	 * @param resultCate map of the clustered category of each file
	 * @param k number of clusters
	 * @return entropy of the clustering
	 */
	private double computeEntropyAndConfuMatrix(Map<String, String> rightCate,
			Map<String, String> resultCate, int k) {
		
		// k rows by 20 columns: [i][j] is the number of files in cluster i whose true category is j (the corpus has 20 categories)
		int[][] confusionMatrix = new int[k][20];
		
		// first map each category name to an array index
		SortedSet<String> cateNames = new TreeSet<String>();
		Set<Map.Entry<String, String>> rightCateSet = rightCate.entrySet();
		for(Iterator<Map.Entry<String, String>> it = rightCateSet.iterator();it.hasNext();){
			Map.Entry<String, String> me = it.next();
			cateNames.add(me.getValue());
		}
		
		String[] cateNamesArray = cateNames.toArray(new String[0]);
		Map<String,Integer> cateNamesToIndex = new TreeMap<String, Integer>();
		for(int i =0;i < cateNamesArray.length ;i++){
			cateNamesToIndex.put(cateNamesArray[i], i);
		}
		
		for(Iterator<Map.Entry<String, String>> it = rightCateSet.iterator();it.hasNext();){
			Map.Entry<String, String> me = it.next();
			confusionMatrix[Integer.parseInt(resultCate.get(me.getKey()))][cateNamesToIndex.get(me.getValue())]++;
		}
		
		// print the confusion matrix
		double [] clusterSum = new double[k]; // number of files in each cluster
		double [] everyClusterEntropy = new double[k]; // entropy of each cluster
		double clusterEntropy = 0;
		
		System.out.print("      ");
		
		for(int i=0;i<20;i++){
			System.out.printf("%-6d",i);
		}
		
		System.out.println();
		
		for(int i =0;i<k;i++){
			System.out.printf("%-6d",i);
			for(int j = 0;j<20;j++){
				clusterSum[i] += confusionMatrix[i][j];
				System.out.printf("%-6d",confusionMatrix[i][j]);
			}
			System.out.println();
		}
		System.out.println();
		
		// compute the overall entropy as the size-weighted sum of per-cluster entropies
		for(int i = 0;i<k;i++){
			if(clusterSum[i] != 0){
				for(int j = 0;j< 20 ;j++){
					double p = (double)confusionMatrix[i][j]/clusterSum[i];
					if(p!=0)
						everyClusterEntropy[i] += -p * Math.log(p); 
				}
				clusterEntropy += clusterSum[i]/(double)rightCate.size() * everyClusterEntropy[i];  
			}
		}
		return clusterEntropy;
	}

	public void KmeansClusterMain(String testSampleDir) throws IOException {
		
		// First compute the TF-IDF vector of every document, stored as Map<String,Map<String,Double>>, i.e. Map<file name, Map<feature word, TF-IDF value>>
		ComputeWordsVector computV = new ComputeWordsVector();
		
		//int k[] = {10,20,30}; // three alternative cluster counts
		int k[] = {20};
		
		Map<String,Map<String,Double>> allTestSampleMap = computV.computeTFMultiIDF(testSampleDir);
		
		for(int i =0;i<k.length;i++){
			System.out.println("開始聚類,聚成"+k[i]+"類");
			String KmeansClusterResultFile = "E:\\DataMiningSample\\KmeansClusterResult\\";
			Map<String,Integer> KmeansClusterResult = new TreeMap<String, Integer>();
			KmeansClusterResult = doProcess(allTestSampleMap,k[i]);
			KmeansClusterResultFile += k[i];
			printClusterResult(KmeansClusterResult,KmeansClusterResultFile);
			System.out.println("The Entropy for this Cluster is " + evaluateClusterResult(KmeansClusterResultFile,k[i]));
		}
		
	}
	
	
	// Standalone entry point: re-evaluates a previously written clustering result file
	public static void main(String[] args) throws IOException {
		
		KmeansCluster test = new KmeansCluster();
		
		String KmeansClusterResultFile = "E:\\DataMiningSample\\KmeansClusterResult\\20";
		System.out.println("The Entropy for this Cluster is " + test.evaluateClusterResult(KmeansClusterResultFile,20));
	}


	
}
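
For reference, the entropy that computeEntropyAndConfuMatrix prints is the standard cluster entropy. With n_{ij} the number of documents of true category j in cluster i, n_i = \sum_j n_{ij} the size of cluster i, and N the total number of documents,

$$H = \sum_{i=1}^{K}\frac{n_i}{N}\left(-\sum_{j} p_{ij}\ln p_{ij}\right),\qquad p_{ij}=\frac{n_{ij}}{n_i}.$$

Lower is better; H = 0 means every cluster contains documents of a single category only.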

4. Program Entry Point

package com.datamine.kmeans;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class ClusterMain {

	/**
	 * Main entry point of the K-means clustering program
	 * @param args
	 * @throws IOException 
	 */
	public static void main(String[] args) throws IOException {
		
		// Data preprocessing was implemented for the classification articles and is omitted here
		
		ComputeWordsVector computeV = new ComputeWordsVector();
		
		KmeansCluster kmeansCluster = new KmeansCluster();
		
		String srcDir = "E:\\DataMiningSample\\processedSample\\";
		String desDir = "E:\\DataMiningSample\\clusterTestSample\\";
		
		SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
		String beginTime = sdf.format(new Date());
		System.out.println("程式開始執行時間:"+beginTime);
		
		String[] terms = computeV.createTestSamples(srcDir,desDir);
		kmeansCluster.KmeansClusterMain(desDir);
		
		String endTime = sdf.format(new Date());
		System.out.println("程式結束執行時間:"+endTime);
		
	}
	
	
}

5. Clustering Results

Program start time: 2016-03-14 17:02:38
feature dictionary size: 3832
the total number of test files is 18828
Starting clustering into 20 clusters
Files chosen as the initial cluster centers:
alt.atheism_49960
comp.graphics_38307
comp.os.ms-windows.misc_10112
comp.sys.ibm.pc.hardware_58990
comp.sys.mac.hardware_50449
comp.windows.x_66402
comp.windows.x_68299
misc.forsale_76828
rec.autos_103685
rec.motorcycles_105046
rec.sport.baseball_104941
rec.sport.hockey_54126
sci.crypt_15819
sci.electronics_54016
sci.med_59222
sci.space_61185
soc.religion.christian_20966
talk.politics.guns_54517
talk.politics.mideast_76331
talk.politics.misc_178699
Iteration No.0-------------------------
okCount = 512
Iteration No.1-------------------------
okCount = 10372
Iteration No.2-------------------------
okCount = 15295
Iteration No.3-------------------------
okCount = 17033
Iteration No.4-------------------------
okCount = 17643
Iteration No.5-------------------------
okCount = 18052
Iteration No.6-------------------------
okCount = 18282
Iteration No.7-------------------------
okCount = 18404
Iteration No.8-------------------------
okCount = 18500
Iteration No.9-------------------------
okCount = 18627
      0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19    
0     482   0     3     3     1     1     0     5     2     1     0     0     2     27    11    53    4     6     15    176   
1     4     601   69    8     14    127   7     5     5     8     0     14    31    16    34    2     2     2     1     5     
2     1     64    661   96    18    257   26    9     3     0     0     13    25    13    6     2     3     2     6     2     
3     0     56    78    575   213   15    119   15    6     2     1     4     131   2     4     2     6     0     2     1     
4     1     25    13    151   563   11    50    3     3     1     2     14    125   4     8     1     0     3     0     0     
5     2     28    78    25    37    348   13    2     0     0     2     5     38    5     6     2     1     1     2     8     
6     20    80    24    21    23    166   38    45    45    26    10    37    87    34    27    22    15    8     35    12    
7     4     20    6     24    45    6     629   28    20    14    0     3     87    10    4     1     8     0     13    0     
8     0     2     1     10    8     4     25    781   40    1     1     0     70    5     10    2     8     4     2     3     
9     4     2     11    0     1     1     11    34    831   1     0     1     7     7     0     1     1     1     8     0     
10    10    7     6     2     4     1     7     7     4     633   4     5     11    18    9     5     13    8     10    3     
11    1     0     1     9     4     1     20    1     3     286   961   0     17    8     4     2     2     0     5     3     
12    3     14    0     6     1     2     2     0     1     1     0     858   51    1     1     2     16    8     69    4     
13    3     15    4     7     7     17    5     12    8     5     2     5     46    13    793   6     5     2     30    5     
14    2     4     0     1     0     2     4     6     3     4     4     2     14    746   3     1     2     3     55    11    
15    30    43    29    39    15    18    12    13    7     3     4     13    195   38    36    5     6     18    5     11    
16    195   1     0     2     0     1     1     0     4     1     4     1     4     16    6     846   3     6     16    274   
17    8     2     0     2     4     2     1     5     7     0     0     10    30    12    5     28    363   9     289   23    
18    19    1     0     0     2     0     0     6     0     1     1     3     1     3     2     9     8     843   48    18    
19    10    8     1     1     1     0     2     13    2     6     3     3     9     12    18    5     444   16    164   69    

The Entropy for this Cluster is 1.2444339205006887
Program end time: 2016-03-14 17:08:24