Words ignored by Lucene's English analyzer (StandardAnalyzer): stop words

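When indexing and searching with Lucene, I noticed that the analyzer silently drops some words. These dropped tokens are called stop words. In English they are typically function words such as articles, prepositions, and conjunctions, which carry little meaning on their own.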

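An analyzer that supports stop words is normally constructed with an explicit stop-word set. If none is given, the analyzer falls back to a default set. For StandardAnalyzer, StopAnalyzer, and ClassicAnalyzer, the default stop words are: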

      "a", "an", "and", "are", "as", "at", "be", "but", "by",
      "for", "if", "in", "into", "is", "it", "no", "not", "of",
      "on", "or", "such", "that", "the", "their", "then", "there",
      "these", "they", "this", "to", "was", "will", "with"
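
To make the effect of this list concrete, here is a minimal plain-Java sketch of stop-word filtering. It is not Lucene's implementation: the class `StopWordSketch`, the method `filter`, and the naive lowercase-and-split tokenization are all assumptions made for illustration. It only shows which tokens would survive if the words above were removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
final class StopWordSketch {
    // The default English stop words listed above.
    static final Set<String> DEFAULT_STOP_WORDS = Set.of(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it", "no", "not", "of",
        "on", "or", "such", "that", "the", "their", "then", "there",
        "these", "they", "this", "to", "was", "will", "with");

    // Lowercase the text, split on anything that is not a letter or digit,
    // and keep only the tokens that are not in the stop-word set.
    static List<String> filter(String text, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Prints [quick, brown, fox, box]
        System.out.println(filter("The quick brown fox is in the box", DEFAULT_STOP_WORDS));
    }
}
```

Passing a different set as the second argument changes which tokens are dropped, which is the same idea as handing a custom stop-word set to an analyzer.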

To use your own stop words, pass them to the StopAnalyzer that Lucene provides:

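      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.StopAnalyzer;

      // The default stop words plus one custom entry, "very".
      public static final String[] self_stop_words = {
          "a", "an", "and", "are", "as", "at", "be", "but", "by",
          "for", "if", "in", "into", "is", "it", "no", "not", "of",
          "on", "or", "such", "that", "the", "their", "then", "there",
          "these", "they", "this", "to", "was", "will", "with", "very" };

      // Analyzer analyzer = new StopAnalyzer();  // would use the default stop words
      // Note: the String[] constructor belongs to older Lucene releases;
      // newer releases take a CharArraySet instead.
      Analyzer analyzer = new StopAnalyzer(self_stop_words);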