
Forward Maximum Matching (Java Version)

While looking online for Chinese word segmentation methods, I came across a post (http://blog.csdn.net/niuox/article/details/11248567) that implements forward maximum matching in Python.

It is well written and a good starting point for beginners.

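Later, because my data set was too large for that program to handle in reasonable time, I rewrote forward maximum matching in Java. With the same data, the same machine, and the same approach, the difference in speed was huge. Here is the code: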

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.LinkedList;

// The class name is arbitrary; the original post shows only the two methods.
public class MaxMatch
{
    public static void main(String[] args)
    {
        String file = "F:\\WordSeg\\Test.csv";   // text to segment, one entry per line
        String file1 = "F:\\WordSeg\\View.csv";  // dictionary, one word per line

        LinkedHashMap<String,Integer> dict = new LinkedHashMap<String,Integer>();
        int len = 0;    // length of the longest dictionary entry
        int count = 0;  // number of dictionary lines read so far

        // try-with-resources closes every stream, including the dictionary reader
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8));
             BufferedReader br1 = new BufferedReader(new InputStreamReader(new FileInputStream(file1), StandardCharsets.UTF_8));
             BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("F:\\WordSeg\\Outputjava9.csv", true), StandardCharsets.UTF_8)))
        {
            // Load the dictionary and record the length of its longest entry
            String str;
            while ((str = br1.readLine()) != null)
            {
                count += 1;
                String tmp = str.trim();
                if (len < tmp.length())
                {
                    len = tmp.length();
                    System.out.println("len count " + len + " " + count);
                }

                // Count how many times each entry occurs
                Integer seen = dict.get(tmp);
                dict.put(tmp, seen == null ? 1 : seen + 1);
            }

            // Segment each input line and append the result to the output file
            String str1;
            while ((str1 = br.readLine()) != null)
            {
                String temp = str1.trim();
                StringBuilder ss = new StringBuilder();
                LinkedList<String> slist = max_match(temp, dict, len);
                for (String s : slist)
                {
                    ss.append(s).append('\\');  // matched words are separated by a backslash
                }
                bw.write(temp + "\r\n" + ss + "\r\n");
            }
        }
        catch (FileNotFoundException e)
        {
            System.out.println("The specified file was not found");
        }
        catch (IOException e)
        {
            System.out.println("Failed to read or write a file");
        }
    }

    // Scans temp from left to right; at each position the longest dictionary
    // entry of at most len characters is taken.
    public static LinkedList<String> max_match(String temp, LinkedHashMap<String, Integer> dict, int len)
    {
        LinkedList<String> word = new LinkedList<String>();
        int idx = 0;
        while (idx < temp.length())
        {
            boolean matched = false;
            // Start from the longest candidate that still fits in the remaining text
            int i = Math.min(len, temp.length() - idx);
            for (; i > 0; i--)
            {
                String cand = temp.substring(idx, idx + i);
                if (dict.containsKey(cand))
                {
                    word.add(cand);
                    matched = true;
                    break;
                }
            }
            if (!matched)
            {
                // No dictionary word starts here: skip one character without emitting it
                i = 1;
            }
            idx += i;
        }
        return word;
    }
}
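
To make the greedy behaviour concrete, here is a minimal sketch of the same longest-first scan on a tiny in-memory dictionary. The class name `FmmSketch`, the method `segment`, and the four-word dictionary are illustrative assumptions, not part of the original post; a `HashSet` is swapped in for the `LinkedHashMap` because only membership is tested. As in the code above, characters that start no dictionary word are skipped rather than emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; the names here are not from the original post.
final class FmmSketch {

    // Forward maximum matching: at each position, try the longest candidate first.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        int idx = 0;
        while (idx < text.length()) {
            // Never look past the end of the text.
            int longest = Math.min(maxLen, text.length() - idx);
            int step = 1; // advance one character when nothing matches
            for (int i = longest; i > 0; i--) {
                String cand = text.substring(idx, idx + i);
                if (dict.contains(cand)) {
                    words.add(cand);
                    step = i;
                    break;
                }
            }
            idx += step;
        }
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary (an assumption for illustration); its longest entry has 3 characters.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(segment("研究生命起源", dict, 3));
    }
}
```

With this dictionary the sketch prints `[研究生, 起源]`: the scan greedily takes 研究生, finds no entry starting with 命, and then matches 起源. A reader would probably prefer 研究 / 生命 / 起源, which shows the known limitation of purely greedy matching.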


In the end, the Python version took about an hour and a half to run, while the Java version finished in under 10 seconds!
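
The dictionary-loading step in `main` can also be separated from the hard-coded file paths, which makes it easier to check on small inputs. The sketch below is one possible shape, assuming a hypothetical `DictLoaderSketch` class that reads from any `Reader`. It uses try-with-resources so the reader is closed even when reading fails, and it skips blank lines, which the original code does not do.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; not part of the original post.
final class DictLoaderSketch {
    final Set<String> words = new HashSet<>();
    int maxLen = 0; // length of the longest entry, used as the matching window

    // Reads one entry per line, trims it, and tracks the longest entry length.
    static DictLoaderSketch load(Reader source) {
        DictLoaderSketch d = new DictLoaderSketch();
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                String w = line.trim();
                if (w.isEmpty()) {
                    continue; // assumption: blank lines are not dictionary entries
                }
                d.words.add(w);
                d.maxLen = Math.max(d.maxLen, w.length());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return d;
    }

    public static void main(String[] args) {
        DictLoaderSketch d = load(new StringReader("ab\nabc\n\nd\n"));
        System.out.println(d.words.size() + " entries, longest = " + d.maxLen);
    }
}
```

For a real file, a reader could be obtained with `Files.newBufferedReader(Paths.get("F:\\WordSeg\\View.csv"), StandardCharsets.UTF_8)` and passed to `load`.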