Lucene6.5.0 下中文分詞IKAnalyzer編譯和使用
阿新 • • 發佈:2019-02-01
前言
lucene本身對中文分詞有支援,不過支援得不好,其分詞方式是機械地將中文詞一個字一個字切分後進行儲存,例如:成都資訊工程大學,最終分成為:成|都|信|息|工|程|大|學,顯然這種分詞方式是低效且浪費儲存空間的,IK分詞是林良益前輩自定義寫的一個專門針對中文分詞的分析器,最新版本為2012年的版本for4.0之後未做更新,後續版本lucene的介面改變使其不支援,所以需要進行修改。
修改和編譯IKAnalyzer
下載原始碼之後解壓並匯入到單獨的java project,然後再匯入lucene的jar包,如圖所示,是我的工程結構
匯入後修改四個檔案:IKAnalyzer和IKTokenizer以及SWMCQueryBuilder、IKQueryExpressionParser,至於demo中的兩個檔案可直接刪除或進行修改,我進行了修改。修改方式很簡單,這裡貼出修改的原文,以及修改後工程和原始碼下載。
IKAnalyzer
IKTokenizer
/**
 * IK 中文分詞  版本 6.5.0
 * IK Analyzer release 6.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.lucene;

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;

/**
 * IK analyzer - Lucene {@link Analyzer} implementation,
 * compatible with Lucene 6.5.0.
 */
public final class IKAnalyzer extends Analyzer {

	// When true, use smart (coarse-grained) segmentation;
	// when false, use fine-grained segmentation.
	private boolean useSmart;

	public boolean useSmart() {
		return useSmart;
	}

	public void setUseSmart(boolean useSmart) {
		this.useSmart = useSmart;
	}

	/**
	 * Default constructor: fine-grained segmentation.
	 */
	public IKAnalyzer() {
		this(false);
	}

	/**
	 * @param useSmart when true, the tokenizer performs smart
	 *                 (coarse-grained) segmentation
	 */
	public IKAnalyzer(boolean useSmart) {
		super();
		this.useSmart = useSmart;
	}

	@Override
	protected TokenStreamComponents createComponents(String fieldName) {
		// Since Lucene 5 this method receives only the field name; the
		// content Reader is supplied later via Tokenizer.setReader().
		// IKTokenizer ignores its Reader argument (it segments from the
		// inherited `input` field), so a placeholder reader is passed here.
		// BUGFIX: the original always built `new IKTokenizer(reader)`,
		// which hard-coded useSmart=false and silently ignored this
		// analyzer's useSmart setting; it also uselessly closed the
		// placeholder reader (dead code, removed).
		Reader placeholder = new StringReader(fieldName);
		IKTokenizer it = new IKTokenizer(placeholder, this.useSmart);
		return new Analyzer.TokenStreamComponents(it);
	}
}
IKQueryExpressionParser
/**
 * IK 中文分詞  版本 6.5.0
 * IK Analyzer release 6.5.0
 *
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 * provided by Linliangyi and copyright 2012 by Oolong studio
 */
package org.wltea.analyzer.lucene;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * IK tokenizer - adapter between the IK segmentation engine and the
 * Lucene {@link Tokenizer} API, compatible with Lucene 6.5.0.
 */
public final class IKTokenizer extends Tokenizer {

	// The IK segmentation engine.
	private IKSegmenter _IKImplement;
	// Term text attribute.
	private final CharTermAttribute termAtt;
	// Term offset attribute.
	private final OffsetAttribute offsetAtt;
	// Term type attribute (see the type constants in
	// org.wltea.analyzer.core.Lexeme).
	private final TypeAttribute typeAtt;
	// End position of the last emitted lexeme; used by end().
	private int endPosition;

	public IKTokenizer(Reader in) {
		this(in, false);
	}

	/**
	 * Lucene 6.5.0 Tokenizer adapter constructor.
	 *
	 * @param in       kept for API compatibility only; the tokenizer actually
	 *                 reads from the inherited {@code input} field, which
	 *                 Lucene sets via {@code setReader()}
	 * @param useSmart when true, use smart (coarse-grained) segmentation
	 */
	public IKTokenizer(Reader in, boolean useSmart) {
		offsetAtt = addAttribute(OffsetAttribute.class);
		termAtt = addAttribute(CharTermAttribute.class);
		typeAtt = addAttribute(TypeAttribute.class);
		_IKImplement = new IKSegmenter(input, useSmart);
	}

	/*
	 * (non-Javadoc)
	 *
	 * @see org.apache.lucene.analysis.TokenStream#incrementToken()
	 */
	@Override
	public boolean incrementToken() throws IOException {
		// Clear all attributes left over from the previous token.
		clearAttributes();
		Lexeme nextLexeme = _IKImplement.next();
		if (nextLexeme != null) {
			// Copy the Lexeme into the Lucene attributes.
			termAtt.append(nextLexeme.getLexemeText());
			termAtt.setLength(nextLexeme.getLength());
			offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
			// Remember the final position for end().
			endPosition = nextLexeme.getEndPosition();
			typeAtt.setType(nextLexeme.getLexemeTypeString());
			// true: another token is available.
			return true;
		}
		// false: token stream exhausted.
		return false;
	}

	/*
	 * (non-Javadoc)
	 *
	 * @see org.apache.lucene.analysis.Tokenizer#reset(java.io.Reader)
	 */
	@Override
	public void reset() throws IOException {
		super.reset();
		_IKImplement.reset(input);
	}

	@Override
	public final void end() throws IOException {
		// BUGFIX: the TokenStream contract requires overrides of end() to
		// call super.end() so default end-of-stream attribute state is set;
		// the original implementation omitted it.
		super.end();
		// Set the final offset.
		int finalOffset = correctOffset(this.endPosition);
		offsetAtt.setOffset(finalOffset, finalOffset);
	}
}
/**
* IK 中文分詞 版本 6.5.0
* IK Analyzer release 6.5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.query;
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Stack;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanQuery.Builder;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.util.BytesRef;
/**
* IK簡易查詢表示式解析
* 結合SWMCQuery演算法 暴走抹茶 2017.3.28
*
* 表示式例子 :
* (id='1231231' && title:'monkey') || (content:'你好嗎' || ulr='www.ik.com') - name:'helloword'
* @author linliangyi
*
*/
public class IKQueryExpressionParser {
//public static final String LUCENE_SPECIAL_CHAR = "&&||-()':={}[],";
private List<Element> elements = new ArrayList<Element>();
private Stack<Query> querys = new Stack<Query>();
private Stack<Element> operates = new Stack<Element>();
/**
* 解析查詢表示式,生成Lucene Query物件
*
* @param expression
* @param quickMode
* @return Lucene query
*/
public Query parseExp(String expression , boolean quickMode){
Query lucenceQuery = null;
if(expression != null && !"".equals(expression.trim())){
try{
//文法解析
this.splitElements(expression);
//語法解析
this.parseSyntax(quickMode);
if(this.querys.size() == 1){
lucenceQuery = this.querys.pop();
}else{
throw new IllegalStateException("表示式異常: 缺少邏輯操作符 或 括號缺失");
}
}finally{
elements.clear();
querys.clear();
operates.clear();
}
}
return lucenceQuery;
}
/**
* 表示式文法解析
* @param expression
*/
private void splitElements(String expression){
if(expression == null){
return;
}
Element curretElement = null;
char[] expChars = expression.toCharArray();
for(int i = 0 ; i < expChars.length ; i++){
switch(expChars[i]){
case '&' :
if(curretElement == null){
curretElement = new Element();
curretElement.type = '&';
curretElement.append(expChars[i]);
}else if(curretElement.type == '&'){
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
}else if(curretElement.type == '\''){
curretElement.append(expChars[i]);
}else {
this.elements.add(curretElement);
curretElement = new Element();
curretElement.type = '&';
curretElement.append(expChars[i]);
}
break;
case '|' :
if(curretElement == null){
curretElement = new Element();
curretElement.type = '|';
curretElement.append(expChars[i]);
}else if(curretElement.type == '|'){
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
}else if(curretElement.type == '\''){
curretElement.append(expChars[i]);
}else {
this.elements.add(curretElement);
curretElement = new Element();
curretElement.type = '|';
curretElement.append(expChars[i]);
}
break;
case '-' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '-';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case '(' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '(';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case ')' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = ')';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case ':' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = ':';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case '=' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '=';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case ' ' :
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
}else{
this.elements.add(curretElement);
curretElement = null;
}
}
break;
case '\'' :
if(curretElement == null){
curretElement = new Element();
curretElement.type = '\'';
}else if(curretElement.type == '\''){
this.elements.add(curretElement);
curretElement = null;
}else{
this.elements.add(curretElement);
curretElement = new Element();
curretElement.type = '\'';
}
break;
case '[':
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '[';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case ']':
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = ']';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case '{':
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '{';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case '}':
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = '}';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
case ',':
if(curretElement != null){
if(curretElement.type == '\''){
curretElement.append(expChars[i]);
continue;
}else{
this.elements.add(curretElement);
}
}
curretElement = new Element();
curretElement.type = ',';
curretElement.append(expChars[i]);
this.elements.add(curretElement);
curretElement = null;
break;
default :
if(curretElement == null){
curretElement = new Element();
curretElement.type = 'F';
curretElement.append(expChars[i]);
}else if(curretElement.type == 'F'){
curretElement.append(expChars[i]);
}else if(curretElement.type == '\''){
curretElement.append(expChars[i]);
}else{
this.elements.add(curretElement);
curretElement = new Element();
curretElement.type = 'F';
curretElement.append(expChars[i]);
}
}
}
if(curretElement != null){
this.elements.add(curretElement);
curretElement = null;
}
}
/**
* 語法解析
*
*/
private void parseSyntax(boolean quickMode){
for(int i = 0 ; i < this.elements.size() ; i++){
Element e = this.elements.get(i);
if('F' == e.type){
Element e2 = this.elements.get(i + 1);
if('=' != e2.type && ':' != e2.type){
throw new IllegalStateException("表示式異常: = 或 : 號丟失");
}
Element e3 = this.elements.get(i + 2);
//處理 = 和 : 運算
if('\'' == e3.type){
i+=2;
if('=' == e2.type){
TermQuery tQuery = new TermQuery(new Term(e.toString() , e3.toString()));
this.querys.push(tQuery);
}else if(':' == e2.type){
String keyword = e3.toString();
//SWMCQuery Here
Query _SWMCQuery = SWMCQueryBuilder.create(e.toString(), keyword , quickMode);
this.querys.push(_SWMCQuery);
}
}else if('[' == e3.type || '{' == e3.type){
i+=2;
//處理 [] 和 {}
LinkedList<Element> eQueue = new LinkedList<Element>();
eQueue.add(e3);
for( i++ ; i < this.elements.size() ; i++){
Element eN = this.elements.get(i);
eQueue.add(eN);
if(']' == eN.type || '}' == eN.type){
break;
}
}
//翻譯RangeQuery
Query rangeQuery = this.toTermRangeQuery(e , eQueue);
this.querys.push(rangeQuery);
}else{
throw new IllegalStateException("表示式異常:匹配值丟失");
}
}else if('(' == e.type){
this.operates.push(e);
}else if(')' == e.type){
boolean doPop = true;
while(doPop && !this.operates.empty()){
Element op = this.operates.pop();
if('(' == op.type){
doPop = false;
}else {
Query q = toBooleanQuery(op);
this.querys.push(q);
}
}
}else{
if(this.operates.isEmpty()){
this.operates.push(e);
}else{
boolean doPeek = true;
while(doPeek && !this.operates.isEmpty()){
Element eleOnTop = this.operates.peek();
if('(' == eleOnTop.type){
doPeek = false;
this.operates.push(e);
}else if(compare(e , eleOnTop) == 1){
this.operates.push(e);
doPeek = false;
}else if(compare(e , eleOnTop) == 0){
Query q = toBooleanQuery(eleOnTop);
this.operates.pop();
this.querys.push(q);
}else{
Query q = toBooleanQuery(eleOnTop);
this.operates.pop();
this.querys.push(q);
}
}
if(doPeek && this.operates.empty()){
this.operates.push(e);
}
}
}
}
while(!this.operates.isEmpty()){
Element eleOnTop = this.operates.pop();
Query q = toBooleanQuery(eleOnTop);
this.querys.push(q);
}
}
/**
* 根據邏輯操作符,生成BooleanQuery
* @param op
* @return
*/
private Query toBooleanQuery(Element op){
if(this.querys.size() == 0){
return null;
}
//BooleanQuery resultQuery = null;
Builder builder = new Builder();
if(this.querys.size() == 1){
return this.querys.get(0);
}
Query q2 = this.querys.pop();
Query q1 = this.querys.pop();
if('&' == op.type){
if(q1 != null){
if(q1 instanceof BooleanQuery){
List<BooleanClause> clauses = ((BooleanQuery)q1).clauses();
if(clauses.size() > 0
&& clauses.get(0).getOccur() == Occur.MUST){
for(BooleanClause c : clauses){
builder.add(c);
}
}else{
builder.add(q1,Occur.MUST);
}
}else{
//q1 instanceof TermQuery
//q1 instanceof TermRangeQuery
//q1 instanceof PhraseQuery
//others
builder.add(q1,Occur.MUST);
}
}
if(q2 != null){
if(q2 instanceof BooleanQuery){
List<BooleanClause> clauses = ((BooleanQuery)q2).clauses();
if(clauses.size() > 0
&& clauses.get(0).getOccur() == Occur.MUST){
for(BooleanClause c : clauses){
builder.add(c);
}
}else{
builder.add(q2,Occur.MUST);
}
}else{
//q1 instanceof TermQuery
//q1 instanceof TermRangeQuery
//q1 instanceof PhraseQuery
//others
builder.add(q2,Occur.MUST);
}
}
}else if('|' == op.type){
if(q1 != null){
if(q1 instanceof BooleanQuery){
List<BooleanClause> clauses = ((BooleanQuery)q1).clauses();
if(clauses.size() > 0
&& clauses.get(0).getOccur() == Occur.SHOULD){
for(BooleanClause c : clauses){
builder.add(c);
}
}else{
builder.add(q1,Occur.SHOULD);
}
}else{
//q1 instanceof TermQuery
//q1 instanceof TermRangeQuery
//q1 instanceof PhraseQuery
//others
builder.add(q1,Occur.SHOULD);
}
}
if(q2 != null){
if(q2 instanceof BooleanQuery){
List<BooleanClause> clauses = ((BooleanQuery)q2).clauses();
if(clauses.size() > 0
&& clauses.get(0).getOccur() == Occur.SHOULD){
for(BooleanClause c : clauses){
builder.add(c);
}
}else{
builder.add(q2,Occur.SHOULD);
}
}else{
//q2 instanceof TermQuery
//q2 instanceof TermRangeQuery
//q2 instanceof PhraseQuery
//others
builder.add(q2,Occur.SHOULD);
}
}
}else if('-' == op.type){
if(q1 == null || q2 == null){
throw new IllegalStateException("表示式異常:SubQuery 個數不匹配");
}
if(q1 instanceof BooleanQuery){
List<BooleanClause> clauses = ((BooleanQuery)q1).clauses();
if(clauses.size() > 0){
for(BooleanClause c : clauses){
builder.add(c);
}
}else{
builder.add(q1,Occur.MUST);
}
}else{
//q1 instanceof TermQuery
//q1 instanceof TermRangeQuery
//q1 instanceof PhraseQuery
//others
builder.add(q1,Occur.MUST);
}
builder.add(q2,Occur.MUST_NOT);
}
return builder.build();
}
/**
* 組裝TermRangeQuery
* @param elements
* @return
*/
private TermRangeQuery toTermRangeQuery(Element fieldNameEle , LinkedList<Element> elements){
boolean includeFirst = false;
boolean includeLast = false;
String firstValue = null;
String lastValue = null;
//檢查第一個元素是否是[或者{
Element first = elements.getFirst();
if('[' == first.type){
includeFirst = true;
}else if('{' == first.type){
includeFirst = false;
}else {
throw new IllegalStateException("表示式異常");
}
//檢查最後一個元素是否是]或者}
Element last = elements.getLast();
if(']' == last.type){
includeLast = true;
}else if('}' == last.type){
includeLast = false;
}else {
throw new IllegalStateException("表示式異常, RangeQuery缺少結束括號");
}
if(elements.size() < 4 || elements.size() > 5){
throw new IllegalStateException("表示式異常, RangeQuery 錯誤");
}
//讀出中間部分
Element e2 = elements.get(1);
if('\'' == e2.type){
firstValue = e2.toString();
//
Element e3 = elements.get(2);
if(',' != e3.type){
throw new IllegalStateException("表示式異常, RangeQuery缺少逗號分隔");
}
//
Element e4 = elements.get(3);
if('\'' == e4.type){
lastValue = e4.toString();
}else if(e4 != last){
throw new IllegalStateException("表示式異常,RangeQuery格式錯誤");
}
}else if(',' == e2.type){
firstValue = null;
//
Element e3 = elements.get(2);
if('\'' == e3.type){
lastValue = e3.toString();
}else{
throw new IllegalStateException("表示式異常,RangeQuery格式錯誤");
}
}else {
throw new IllegalStateException("表示式異常, RangeQuery格式錯誤");
}
return new TermRangeQuery(fieldNameEle.toString() , new BytesRef(firstValue) , new BytesRef(lastValue) , includeFirst , includeLast);
}
/**
* 比較操作符優先順序
* @param e1
* @param e2
* @return
*/
private int compare(Element e1 , Element e2){
if('&' == e1.type){
if('&' == e2.type){
return 0;
}else {
return 1;
}
}else if('|' == e1.type){
if('&' == e2.type){
return -1;
}else if('|' == e2.type){
return 0;
}else{
return 1;
}
}else{
if('-' == e2.type){
return 0;
}else{
return -1;
}
}
}
/**
* 表示式元素(操作符、FieldName、FieldValue)
* @author linliangyi
* May 20, 2010
*/
private class Element{
char type = 0;
StringBuffer eleTextBuff;
public Element(){
eleTextBuff = new StringBuffer();
}
public void append(char c){
this.eleTextBuff.append(c);
}
public String toString(){
return this.eleTextBuff.toString();
}
}
public static void main(String[] args){
IKQueryExpressionParser parser = new IKQueryExpressionParser();
//String ikQueryExp = "newsTitle:'的兩款《魔獸世界》外掛Bigfoot和月光寶盒'";
String ikQueryExp = "(id='ABcdRf' && date:{'20010101','20110101'} && keyword:'魔獸中國') || (content:'KSHT-KSH-A001-18' || ulr='www.ik.com') - name:'林良益'";
Query result = parser.parseExp(ikQueryExp , true);
System.out.println(result);
}
}
SWMCQueryBuilder
/**
* IK 中文分詞 版本 6.5.0
* IK Analyzer release 6.5.0
*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
* provided by Linliangyi and copyright 2012 by Oolong studio
*
*/
package org.wltea.analyzer.query;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
/**
* Single Word Multi Char Query Builder
* IK分詞演算法專用 暴走抹茶 2017.3.28
* @author linliangyi
*
*/
public class SWMCQueryBuilder {

	/**
	 * Builds a SWMC (Single Word Multi Char) query.
	 *
	 * @param fieldName target field name
	 * @param keywords  keyword text to be analyzed
	 * @param quickMode when true, use the shortened expression if multi-char
	 *                  lexemes cover more than half of the input
	 * @return Lucene Query, or null when parsing produced nothing
	 * @throws IllegalArgumentException when fieldName or keywords is null
	 */
	public static Query create(String fieldName, String keywords, boolean quickMode) {
		if (fieldName == null || keywords == null) {
			throw new IllegalArgumentException("引數 fieldName 、 keywords 不能為null.");
		}
		// 1. Segment the keywords with the IK analyzer.
		List<Lexeme> lexemes = doAnalyze(keywords);
		// 2. Build the SWMC query from the segmentation result.
		return getSWMCQuery(fieldName, lexemes, quickMode);
	}

	/**
	 * Segments the keywords (smart mode) and returns the lexeme list.
	 *
	 * @param keywords text to segment
	 * @return the lexemes produced so far (best-effort on I/O error)
	 */
	private static List<Lexeme> doAnalyze(String keywords) {
		List<Lexeme> lexemes = new ArrayList<Lexeme>();
		IKSegmenter ikSeg = new IKSegmenter(new StringReader(keywords), true);
		try {
			Lexeme l = null;
			while ((l = ikSeg.next()) != null) {
				lexemes.add(l);
			}
		} catch (IOException e) {
			// Best-effort: keep whatever was segmented before the failure.
			e.printStackTrace();
		}
		return lexemes;
	}

	/**
	 * Builds the SWMC query expression from the lexemes and parses it with
	 * Lucene's classic QueryParser.
	 *
	 * @param fieldName target field name
	 * @param lexemes   segmentation result
	 * @param quickMode when true, prefer the shortened (multi-char only)
	 *                  expression if it covers more than half of the input
	 * @return the parsed query, or null when nothing could be built
	 */
	private static Query getSWMCQuery(String fieldName, List<Lexeme> lexemes, boolean quickMode) {
		// Full SWMC query expression (StringBuilder: single-threaded use).
		StringBuilder keywordBuffer = new StringBuilder();
		// Shortened expression containing only multi-char lexemes.
		StringBuilder keywordBuffer_Short = new StringBuilder();
		// Length of the previous lexeme.
		int lastLexemeLength = 0;
		// End position of the previous lexeme.
		int lastLexemeEnd = -1;
		// Characters covered by multi-char lexemes / by all lexemes.
		int shortCount = 0;
		int totalCount = 0;
		for (Lexeme l : lexemes) {
			totalCount += l.getLength();
			// Shortened expression collects only multi-char lexemes.
			if (l.getLength() > 1) {
				keywordBuffer_Short.append(' ').append(l.getLexemeText());
				shortCount += l.getLength();
			}
			if (lastLexemeLength == 0) {
				keywordBuffer.append(l.getLexemeText());
			} else if (lastLexemeLength == 1 && l.getLength() == 1
					&& lastLexemeEnd == l.getBeginPosition()) {
				// Adjacent single-char lexemes are merged without a space.
				keywordBuffer.append(l.getLexemeText());
			} else {
				keywordBuffer.append(' ').append(l.getLexemeText());
			}
			lastLexemeLength = l.getLength();
			lastLexemeEnd = l.getEndPosition();
		}
		// Use Lucene's classic QueryParser to generate the SWMC Query.
		QueryParser qp = new QueryParser(fieldName, new StandardAnalyzer());
		qp.setDefaultOperator(QueryParser.AND_OPERATOR);
		qp.setAutoGeneratePhraseQueries(true);
		// Guard totalCount > 0 to avoid a 0/0 (NaN) division when no lexemes
		// were produced (behavior unchanged: NaN > 0.5f was already false).
		if (quickMode && totalCount > 0 && (shortCount * 1.0f / totalCount) > 0.5f) {
			try {
				return qp.parse(keywordBuffer_Short.toString());
			} catch (ParseException e) {
				e.printStackTrace();
			}
		} else if (keywordBuffer.length() > 0) {
			try {
				return qp.parse(keywordBuffer.toString());
			} catch (ParseException e) {
				e.printStackTrace();
			}
		}
		return null;
	}
}