
[Lucene 4.8 Tutorial, Part 4] Analysis

1. Basics

(1) Key concepts

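Analysis, in Lucene, is the process of converting the text of a field (Field) into the most basic unit of index representation, the term (Term). During a search, these terms determine which documents match the query conditions.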

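An analyzer encapsulates the analysis operation. It converts text into tokens by performing a series of operations; this process is also called tokenization, and the chunks of text extracted from the text stream are called tokens. A token combined with its field name forms a term.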
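To make the token/term distinction concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class `TermSketch` and the helper `toTerms` are hypothetical names for illustration, tokens are produced by splitting on whitespace only, and a term is represented simply as a `field:token` string.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: plain Java, not Lucene code.
class TermSketch {
    // Splits the text into tokens on whitespace, then pairs each token
    // with the field name to form a "field:token" term string.
    static List<String> toTerms(String field, String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(field + ":" + token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [contents:quick, contents:brown, contents:fox]
        System.out.println(toTerms("contents", "quick brown fox"));
    }
}
```

The same text indexed under a different field name would yield different terms, which is why a term is the pair of field and token rather than the token alone.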

(2) When analyzers are used

  • While building the index
		Directory returnIndexDir = FSDirectory.open(indexDir);

		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
				new StandardAnalyzer(Version.LUCENE_48));

		IndexWriter writer = new IndexWriter(returnIndexDir, iwc);
  • When searching with a QueryParser object
		QueryParser parser = new QueryParser(Version.LUCENE_48, "contents",
				new SimpleAnalyzer(Version.LUCENE_48));
  • When highlighting results in a search
(3) Four commonly used analyzers:
  • WhitespaceAnalyzer, as the name implies, simply splits text into tokens on whitespace characters and makes no other effort to normalize the tokens.
  • SimpleAnalyzer first splits tokens at non-letter characters, then lowercases each token. Be careful! This analyzer quietly discards numeric characters.
  • StopAnalyzer is the same as SimpleAnalyzer, except it removes common words (called stop words, described more in section XXX). By default it removes common words in the English language (the, a, etc.), though you can pass in your own set.
  • StandardAnalyzer is Lucene’s most sophisticated core analyzer. It has quite a bit of logic to identify certain kinds of tokens, such as company names,
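The differences between the first three analyzers can be approximated in a few lines of plain Java. This is only a rough sketch of the behaviour described in the list above, not Lucene's implementation: the class `AnalyzerSketch`, its methods, and the tiny stop-word list are all hypothetical and chosen for illustration (Lucene's default English stop set is larger). StandardAnalyzer is left out because its grammar-based tokenization is too involved to imitate here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough approximations in plain Java; these are NOT Lucene's implementations.
class AnalyzerSketch {
    // Hypothetical stop list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    // Splits on the given regex and drops empty pieces.
    static List<String> splitNonEmpty(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(regex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // WhitespaceAnalyzer-like: split on whitespace, no normalization.
    static List<String> whitespace(String text) {
        return splitNonEmpty(text, "\\s+");
    }

    // SimpleAnalyzer-like: split at non-letter characters, then lowercase.
    // Digits are non-letters, so numbers disappear.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : splitNonEmpty(text, "[^\\p{L}]+")) {
            tokens.add(piece.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: same as simple(), then remove stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The XY&Z Corporation - xyz@example.com 42";
        System.out.println("whitespace: " + whitespace(text));
        System.out.println("simple:     " + simple(text));
        System.out.println("stop:       " + stop(text));
    }
}
```

Running it on the sample string shows the whitespace variant keeping `XY&Z`, `-`, and `42` intact, the simple variant breaking `XY&Z` into `xy` and `z` and dropping `42`, and the stop variant additionally dropping `the`.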

4. Other notes

When creating an IndexWriter, you must specify an analyzer, for example:
		IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48,
				new StandardAnalyzer(Version.LUCENE_48));

		writer = new IndexWriter(returnIndexDir, iwc);
However, each time you add a document to the writer, you can specify an analyzer for that particular document, for example:
writer.addDocument(doc, new SimpleAnalyzer(Version.LUCENE_48));