Count word frequencies in a TXT file and print the 10 most frequent words

阿新 • Published: 2019-02-13

Experiment procedure

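The main idea is to first strip out punctuation and replace common function words (articles, pronouns, prepositions and the like) with a marker, then count the remaining words in a hash table, sort the entries by count, and print the ten most frequent words.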
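To make the cleaning step concrete, here is a minimal sketch that works on an in-memory string. The class name `TokenizeSketch`, the helper `tokenize`, and the five-word stop list are assumptions made for illustration only; the full program uses a much longer list and reads its text from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TokenizeSketch {
    // Illustrative stop-word list (an assumption; the real list is much longer).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    // Strips punctuation, lowercases, splits on whitespace, and drops stop words.
    static List<String> tokenize(String text) {
        String cleaned = text.replaceAll("[.,\"?!:;()]", "").toLowerCase();
        List<String> words = new ArrayList<>();
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [quick, fox, lazy, dog]
        System.out.println(tokenize("The quick fox, and the lazy dog!"));
    }
}
```

Keeping the stop words in a `HashSet` makes each check a single lookup instead of a scan over the whole list, and lowercasing before the check means "The" and "the" are treated alike.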

The complete code is as follows:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

public class test {
public static void main(String[] args) throws IOException {
long start = System.currentTimeMillis(); // program start time
File file = new File("E:/TEST.txt");

BufferedReader br = new BufferedReader(new FileReader(file));
StringBuilder sb = new StringBuilder();
String line = null;
while ((line = br.readLine()) != null) {
sb.append(line).append(' '); // readLine() drops the line break, so add a space to keep words on adjacent lines apart
}
br.close(); // close the stream

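String words = sb.toString(); // the whole text as one string
String targetString = words.replaceAll("[.,\"\\?!:;\\(\\)]", ""); // strip punctuation

// Split into words, and define English words that carry little meaning on their own,
// such as prepositions, pronouns and modal verbs
String[] singleWord = targetString.trim().split("\\s+"); // split on any run of whitespace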
String[] keys = { "you", "i", "he", "she", "me", "him", "her", "it",
    "they", "them", "we", "us", "your", "your", "our", "his",
    "her", "its", "my", "in", "into", "on", "for", "out", "up",
    "down", "at", "to","too", "with", "by", "about", "among", "between",
    "over", "from", "be", "been", "am", "is", "are", "was", "were",
    "without", "the", "of", "and", "a", "an", "that", "this", "be",
    "or", "as", "will", "would", "can", "could", "may", "might",
    "shall", "should", "must", "has", "have", "had", "than" };

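// Replace common stop words with the marker "#" so they can be skipped when the counts are printed
for (int i = 0; i < singleWord.length; i++) {
   singleWord[i] = singleWord[i].toLowerCase(); // the stop-word list is lowercase, so normalize case first
   for (String str : keys) {
    if (singleWord[i].equals(str))
     singleWord[i] = "#";
   }
}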

// Map each word to the number of times it occurs
for (int i = 0; i < singleWord.length; i++) {
   String word = singleWord[i].toLowerCase(); // lowercase first so lookup and insert use the same key
   if (word.isEmpty())
    continue; // skip empty tokens left by consecutive spaces
   count++; // count the words
   Integer value = wordMap.get(word);
   wordMap.put(word, value == null ? 1 : value + 1);
}

System.out.println("\t\t--File info--");
System.out.println(" Name: " + file.getName() + " Size: "
+ file.length()/ 1024 + "KB");
System.out.println("\t\t--File info--");
System.out.println();
System.out.println("■■■■ The 10 most frequent words out of " + count + " words ■■■■");

// Comparator: sort the entries by count, highest first
List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(
    wordMap.entrySet());
Collections.sort(list, new Comparator<Entry<String, Integer>>() {
   public int compare(Entry<String, Integer> e1,
     Entry<String, Integer> e2) {
    // compareTo returns 0 for equal counts, as the Comparator contract requires
    return e2.getValue().compareTo(e1.getValue());
   }
});

int wordCount = 1; // rank of the next word to print
for (Map.Entry<String, Integer> entry : list) {
   if (entry.getKey().equals("#")) // filter: skip the stop words (prepositions, pronouns, modal verbs, etc.)
    continue;
   System.out.printf("\t%2d. %8s \t %4d times\n", wordCount,
     entry.getKey(), entry.getValue());
   if (wordCount++ == 10) // print only 10 words
    break;
}
long end = System.currentTimeMillis(); // program end time
System.out.println("■■■■■■■■■■■■■■■ Elapsed " + (end - start)
  + " ms" + " ■■■■■■■■■■■■■■■■");
}

private static HashMap<String, Integer> wordMap = new HashMap<String, Integer>();
private static int count = 0;
}
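Sorting every entry is more work than strictly needed when only the top ten are printed. As an alternative, and not what the program above does, the sketch below keeps a bounded min-heap of the best `n` entries seen so far. The names `TopNSketch`, `topN` and `WORST_FIRST` are hypothetical, and the alphabetical tie-break is an assumption added so that the result is deterministic.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNSketch {
    // Orders entries from worst to best: lower count first; among equal counts,
    // the alphabetically later word counts as worse.
    static final Comparator<Map.Entry<String, Integer>> WORST_FIRST = (a, b) -> {
        int byCount = a.getValue().compareTo(b.getValue());
        return byCount != 0 ? byCount : b.getKey().compareTo(a.getKey());
    };

    // Returns the n entries with the highest counts, best first.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(WORST_FIRST);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > n) {
                heap.poll(); // evict the current worst entry
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(WORST_FIRST.reversed()); // best first
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : "cat dog cat bird dog cat".split(" ")) {
            counts.merge(word, 1, Integer::sum); // insert 1, or add 1 to the existing count
        }
        // prints "cat 3" then "dog 2"
        for (Map.Entry<String, Integer> e : topN(counts, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

The heap never holds more than `n + 1` entries, so the cost grows with the number of distinct words times log `n` rather than with a full sort. For a small text file the difference is unlikely to be noticeable.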

The output of a run is shown in the figure.


I also profiled the program with VisualVM, the tool bundled with the JDK; see the screenshots below.
