Forward Maximum Matching (Java Version)
阿新 • Published: 2019-02-02
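While looking online for approaches to Chinese word segmentation, I came across a post (http://blog.csdn.net/niuox/article/details/11248567) that implements forward maximum matching in Python.
It is well written and a good fit for beginners who are just getting started.
Later, because the data set I had to process was too large, I rewrote forward maximum matching in Java. With the same data, the same machine, and the same approach, the difference in speed was enormous. Here is the code: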
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.LinkedList;

public class MaxMatch {

    public static void main(String[] args) {
        String file = "F:\\WordSeg\\Test.csv";   // text to segment, one record per line
        String file1 = "F:\\WordSeg\\View.csv";  // dictionary, one word per line
        // try-with-resources closes every stream, even if opening a later one fails
        try (BufferedReader br = new BufferedReader(
                 new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
             BufferedReader br1 = new BufferedReader(
                 new InputStreamReader(new FileInputStream(file1), StandardCharsets.UTF_8));
             BufferedWriter bw = new BufferedWriter(
                 new OutputStreamWriter(
                     new FileOutputStream("F:\\WordSeg\\Outputjava9.csv", true), StandardCharsets.UTF_8))) {
            LinkedHashMap<String, Integer> dict = new LinkedHashMap<>();
            int len = 0;    // length of the longest dictionary word
            int count = 0;  // number of dictionary lines read so far
            String str;
            while ((str = br1.readLine()) != null) {
                count++;
                String tmp = str.trim();
                if (len < tmp.length()) {
                    len = tmp.length();
                    System.out.println("len count " + len + " " + count);
                }
                dict.merge(tmp, 1, Integer::sum);  // count how often each word appears
            }
            String str1;
            while ((str1 = br.readLine()) != null) {
                String temp = str1.trim();
                // output format: the original line, then its words separated by backslashes
                StringBuilder ss = new StringBuilder(temp).append("\r\n");
                for (String s : max_match(temp, dict, len)) {
                    ss.append(s).append('\\');
                }
                bw.write(ss.append("\r\n").toString());
            }
        } catch (FileNotFoundException e) {
            System.out.println("File not found");
        } catch (IOException e) {
            System.out.println("Failed to read file");
        }
    }

    public static LinkedList<String> max_match(String temp, LinkedHashMap<String, Integer> dict, int len) {
        LinkedList<String> word = new LinkedList<>();
        int idx = 0;
        while (idx < temp.length()) {
            boolean matched = false;
            // start with the longest window that still fits in the remaining text
            int i = Math.min(len, temp.length() - idx);
            for (; i > 0; i--) {
                String cand = temp.substring(idx, idx + i);
                if (dict.containsKey(cand)) {
                    word.add(cand);
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // no dictionary word starts here: skip one character (it is not added to the result)
                i = 1;
            }
            idx += i;
        }
        return word;
    }
}
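To see the greedy behavior without the CSV files, here is a minimal sketch of the same idea on a small in-memory dictionary. The class name `ForwardMaxMatchDemo`, the method `segment`, and the sample words are made up for illustration, and an unspaced English string stands in for Chinese text. Unlike the code above, this variant keeps a character that matches nothing as its own single-character token, which is an assumption on my part rather than what the original program does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatchDemo {

    // Greedy forward maximum matching: at each position, take the longest
    // dictionary word (at most maxLen characters) that starts there.
    public static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int idx = 0;
        while (idx < text.length()) {
            // Longest window that still fits in the remaining text.
            int window = Math.min(maxLen, text.length() - idx);
            String token = null;
            for (int i = window; i > 0; i--) {
                String cand = text.substring(idx, idx + i);
                if (dict.contains(cand)) {
                    token = cand;
                    break;
                }
            }
            if (token == null) {
                // Assumption: keep the unmatched character as a one-character token.
                token = text.substring(idx, idx + 1);
            }
            tokens.add(token);
            idx += token.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary; the longest word has 5 characters.
        Set<String> dict = new HashSet<>(Arrays.asList(
                "the", "theta", "table", "bled", "down", "own", "there"));
        int maxLen = 5;
        System.out.println(segment("thetabledownthere", dict, maxLen)); // [theta, bled, own, there]
        System.out.println(segment("thexown", dict, maxLen));           // [the, x, own]
    }
}
```

The first result shows the known weakness of the method: because the longest match always wins, "theta" is chosen over "the" followed by "table", and the rest of the line is segmented around that choice.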
In the end, the Python version took about an hour and a half to run, while the Java version finished in under 10 seconds!