Counting the most frequent Chinese characters or English words in an article with Java, with occurrence counts

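The idea is to rely on the uniqueness of keys in a Map to store each Chinese character or word alongside its count. Words are extracted with a regular expression: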
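Before the full class, here is a minimal sketch of that idea on its own. The class name `WordTally`, the helper `tally`, and the simplified pattern `[A-Za-z']+` are illustrative assumptions, not part of the original code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordTally {
    // Hypothetical helper: counts every regex match found in the text.
    static Map<String, Integer> tally(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Simplified word pattern (an assumption): letters and apostrophes only
        Matcher m = Pattern.compile("[A-Za-z']+").matcher(text);
        while (m.find()) {
            // A key can appear only once in a Map, so merge() either stores 1
            // for a new word or adds 1 to the count that is already there.
            counts.merge(m.group(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Iteration order of a HashMap is unspecified, so the printed order may vary
        System.out.println(tally("the cat saw the dog"));
    }
}
```

`Map.merge` needs Java 8 or later; the class below achieves the same thing with an explicit `get`/`put` pair.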

The counting class:

import java.util.ArrayList;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

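/**
 * <pre>
 * Counts the most frequent Chinese characters or English words in an article.
 * 
 * 	Parameters:
 * 		String str  the article text to analyze
 * 		int index  how many of the most frequent entries to return
 *     return a String array; each element has the form "rank:count  key",
 *            where key is the character or word and count is its number of occurrences
 * </pre>
 * 
 * @author kaifang
 * 
 */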

public class CountsWordsMax {
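	// Count Chinese characters
	public static String[] countToZH(String str, int index) {
		// Strip spaces, commas (ASCII and full-width), and Chinese full stops.
		// Every other character is counted, including Latin letters and digits.
		str = str.replace(" ", "").replace(",", "").replace(",", "").replace("。", "");
		// Result array
		String[] re_str = new String[index];
		// Convert the text to a char array
		char[] chs = str.toCharArray();
		// ArrayList holding each character as a String
		ArrayList<String> array = new ArrayList<>();
		for (char ch : chs) {
			array.add(String.valueOf(ch));
		}

		// Map from character to count; keys are unique, values are the tallies
		TreeMap<String, Integer> map = new TreeMap<String, Integer>();
		// Walk the list one character at a time
		for (String tstr : array) {
			// Look the character up in the TreeMap
			Integer val = map.get(tstr);
			if (val == null) {
				// null means it has not been seen yet, so store 1
				map.put(tstr, 1);
			} else {
				// Otherwise add 1 to the count and store it again
				val++;
				map.put(tstr, val);
			}
		}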

		// Sort the entries by count, highest first. Concatenating count and key
		// into a TreeSet<String> would sort lexicographically ("9..." after
		// "10...") and would mangle keys that contain digits, so the counts are
		// compared as numbers instead.
		ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
		entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));

		// Take the top `index` entries, formatted as "rank:count  key".
		// If there are fewer distinct keys than `index`, the remaining slots stay null.
		for (int i = 0; i < index && i < entries.size(); i++) {
			Map.Entry<String, Integer> me = entries.get(i);
			re_str[i] = (i + 1) + ":" + me.getValue() + "  " + me.getKey();
		}
		return re_str;
	}

	// Count English words
	public static String[] countToEng(String str, int index) {
		// Result array
		String[] re_str = new String[index];
		// ArrayList holding the matched words
		ArrayList<String> array = new ArrayList<>();
		// Extract words with a regex: runs of word characters, hyphens, and
		// apostrophes. Matching is case-sensitive, so "The" and "the" are
		// counted as different words.
		Pattern pattern = Pattern.compile("\\b[\\w\\-']+\\b");
		Matcher matcher = pattern.matcher(str);
		while (matcher.find()) {
			array.add(matcher.group());
		}

		// Map from word to count; keys are unique, values are the tallies
		TreeMap<String, Integer> map = new TreeMap<String, Integer>();
		// Walk the list one word at a time
		for (String tstr : array) {
			// Look the word up in the TreeMap
			Integer val = map.get(tstr);
			if (val == null) {
				// null means it has not been seen yet, so store 1
				map.put(tstr, 1);
			} else {
				// Otherwise add 1 to the count and store it again
				val++;
				map.put(tstr, val);
			}
		}

		// Sort the entries by count, highest first. Concatenating count and key
		// into a TreeSet<String> would sort lexicographically ("9..." after
		// "10...") and would mangle words that contain digits, so the counts are
		// compared as numbers instead.
		ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
		entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));

		// Take the top `index` entries, formatted as "rank:count  word".
		// If there are fewer distinct words than `index`, the remaining slots stay null.
		for (int i = 0; i < index && i < entries.size(); i++) {
			Map.Entry<String, Integer> me = entries.get(i);
			re_str[i] = (i + 1) + ":" + me.getValue() + "  " + me.getKey();
		}
		return re_str;
	}
}
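One thing to be aware of: `countToZH` counts every character it does not strip, so Latin letters or brackets in the input are tallied as well. If only Chinese characters should be counted, one possible approach is to filter by Unicode script. This is a rough sketch under that assumption; the class `HanTally` and the helper `tallyHan` are made-up names for illustration:

```java
import java.util.Map;
import java.util.TreeMap;

class HanTally {
    // Hypothetical helper: counts only Han ideographs and skips everything else.
    static Map<String, Integer> tallyHan(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        // Iterate by code point so characters outside the BMP are not split
        text.codePoints()
            // Keep only code points whose Unicode script is HAN
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            // Convert each code point back to a String key and bump its count
            .forEach(cp -> counts.merge(new String(Character.toChars(cp)), 1, Integer::sum));
        return counts;
    }

    public static void main(String[] args) {
        // Letters and punctuation are ignored; only the Han characters are counted
        System.out.println(tallyHan("查詢mysql的查詢,oracle。"));
    }
}
```

With this kind of filter the explicit `replace` calls for spaces and punctuation would no longer be needed, since those characters are not in the Han script.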
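The test class. You could also load a text file through a stream, join it into a String, and call the methods above to count it; the exact usage is left for you to explore.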
public class Counts {

	/**
	 * Runs both counters on sample text and prints the top three results of each.
	 *
	 * @param args unused
	 */
	public static void main(String[] args) {
		String s = "Although it was autumn, the snow was already beginning to fall in Tibet. Our legs were so heavy and cold that they felt like blocks of ice. Have you ever seen snowmen ride bicycles? That's what we looked like! Along the way children dressed in long wool coats stopped to look at us. In the late afternoon we found it was so cold that our water bottles froze. However, the lakes shone like glass in the setting sun and looked wonderful. Wang Wei rode in front of me as usual. She is very reliable and I knew I didn't need to encourage her. To climb the mountains was hard work but as we looked around us, we were surprised by the view. We seemed to be able to see for miles. At one point we were so high that we found ourselves cycling through clouds. Then we began going down the hills. It was great fun especially as it gradually became much warmer. In the valleys colorful butterflies flew around us and we saw many yaks and sheep eating green grass. At this point we had to change our caps, coats, gloves and trousers for T-shirts and shorts.";
		String[] str = CountsWordsMax.countToEng(s, 3);
		for (String t : str) {
			System.out.println(t);
		}
		System.out.println("---------------------");
		String s1 = "頁的記錄;另一種是每分一頁就做一次查詢,每次只查出當頁需要的記錄。第一種做法不是很可取,如果一張表中有上百萬條記錄的話,這無疑將會很慢,而實際專案中海量的資料是 無處不在的!所以比較好的方式是採用後種方法,所以這裡就涉及到如何從資料庫中得到指 定的記錄?相比於mysql,sqlserver等資料庫,oracle的這種查詢語句相對複雜一點。 方法一(使用子查詢):rownum關鍵字在語法規定上不能使用大於一個數值的形式,所以 我們就要利用子查詢來巧妙地實現,具體語句如下: ";
		String[] str1 = CountsWordsMax.countToZH(s1, 3);
		for (String t : str1) {
			System.out.println(t);
		}
	}
}
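For the file-loading idea mentioned above, something along these lines might work. It is only a sketch: the class `ArticleLoader`, the helper `readAll`, and the assumption that the file is UTF-8 are mine, not from the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class ArticleLoader {
    // Hypothetical helper: reads everything from the reader into one String.
    static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader when the loop finishes
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() drops the line terminator, so add one back
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java ArticleLoader <path-to-text-file>");
            return;
        }
        // Assumes the file is encoded as UTF-8
        String article = readAll(Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8));
        // The String could now be passed to CountsWordsMax.countToEng(article, 3)
        System.out.println(article.length() + " characters loaded");
    }
}
```

Taking a `Reader` rather than a file path keeps the helper easy to try out with a `StringReader` before pointing it at a real file.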


Comments and technical discussion are welcome. Thanks!