Elasticsearch Search: Understanding Chinese Word Segmentation, SQL "LIKE"-Style Fuzzy Search, and Case-Insensitive Search

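Elasticsearch is a search engine that is typically brought in when the database can no longer absorb the search load coming from the front end, so that searches run against Elasticsearch instead. It is one option to consider when designing a high-concurrency architecture. The rules below cover part of how Elasticsearch search behaves; knowing them helps you get up to speed quickly and resolve some common problems in practice.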

01》Creating an index without tokenization

URL: es_index_test

{
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "es_index_type_test": {
      "properties": {
        "productId": {
          "type": "text"
        },
        "productName": {
          "type": "keyword",
          "index": "true"
        }
      }
    }
  }

}

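Note: when the index is created, the "productName" field is mapped as type "keyword", so its value is indexed as a single term without tokenization. A wildcard query on this field then behaves like LIKE in MySQL. For example, the document {"productId":10001,"productName":"山雞圖"} can be found with {"query":{"wildcard":{"productName":"*雞*"}}}.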
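To make the LIKE analogy concrete, here is a minimal sketch that converts a SQL LIKE pattern into wildcard-query syntax: "%" becomes "*", "_" becomes "?", and characters that are special in a wildcard pattern are escaped with a backslash. The class and method names are hypothetical, and the sketch assumes the LIKE pattern uses no SQL ESCAPE clause.

```java
/** Hypothetical helper: converts a SQL LIKE pattern into wildcard-query syntax. */
final class LikeToWildcard {
    private LikeToWildcard() {}

    /**
     * Maps '%' to '*' and '_' to '?'. Literal '*', '?' and '\' are escaped
     * with a backslash so they are not read as wildcard syntax.
     * Assumption: the LIKE pattern does not use a SQL ESCAPE clause.
     */
    static String convert(String likePattern) {
        StringBuilder out = new StringBuilder(likePattern.length() + 4);
        for (int i = 0; i < likePattern.length(); i++) {
            char c = likePattern.charAt(i);
            switch (c) {
                case '%':
                    out.append('*');
                    break;
                case '_':
                    out.append('?');
                    break;
                case '*':
                case '?':
                case '\\':
                    out.append('\\').append(c);
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+96DE is the character used in the article's example keyword.
        System.out.println(convert("%\u96de%"));
        System.out.println(convert("a_c"));
    }
}
```

The converted string could then be passed as the pattern of a wildcard query on the "productName" keyword field. Patterns that start with "*" force a scan over many terms, so they are best kept for fields with a modest number of distinct values.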

02》Creating an index with tokenization

URL: es_index_test

{
  "settings": {
    "index": {
      "number_of_shards": "4",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "es_index_type_test": {
      "properties": {
        "productId": {
        "type": "text"
        },
        "productName": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }

}

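Note: when the index is created, the "productName" field is mapped as an analyzed text field. By default, Elasticsearch tokenizes Chinese one character at a time, so each character becomes its own token. For example, the product name in the document {"productId":10001,"productName":"山雞圖"} is split into three tokens: "山", "雞" and "圖". Installing the ik analyzer plugin gives word-level segmentation: the same product name is split into two tokens, "山雞" and "圖". Which tokens a Chinese phrase is split into is determined by the ik analyzer's dictionary, which can be adjusted to suit your data.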

03》Creating a case-insensitive index

URL: es_index_test
{
  "settings": {
    "index": {
      "number_of_shards": "10",
      "number_of_replicas": "3"
    },
    "analysis": {
      "normalizer": {
        "es_normalizer": {
          "filter": [
            "lowercase",
            "asciifolding"
          ],
          "type": "custom"
        }
      }
    }
  },
  "mappings": {
    "es_index_type_test": {
      "properties": {
        "productId": {
          "type": "text"
        },
        "productName": {
          "type": "keyword",
          "normalizer": "es_normalizer",
          "index": "true"
        }
      }
    }
  }

}

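Note: when the index is created, the "productName" field is mapped as a keyword with the custom normalizer "es_normalizer". The normalizer lowercases the value and folds accented characters to ASCII, so matching on this field ignores case.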
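As a rough illustration of why this makes matching case-insensitive, the sketch below approximates in plain Java what the "lowercase" and "asciifolding" filters do to a value. It is not Elasticsearch's implementation: the class name is hypothetical, and stripping combining marks covers only part of what asciifolding handles.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical stand-in for the "es_normalizer" defined in the mapping above. */
final class NormalizerSketch {
    private NormalizerSketch() {}

    /**
     * Lowercases the value, then removes diacritics by decomposing to NFD
     * and dropping combining marks. This is only an approximation of the
     * asciifolding filter, which folds more characters than this does.
     */
    static String normalize(String value) {
        String lower = value.toLowerCase(Locale.ROOT);
        String decomposed = Normalizer.normalize(lower, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Both spellings collapse to the same term, so they match each other.
        System.out.println(normalize("iPhone"));
        System.out.println(normalize("IPHONE"));
    }
}
```

Because the stored term and the query term both pass through the same normalization, "iPhone" and "IPHONE" end up as the same term.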

04》Inspecting tokenization

URL: es_index_test/_analyze

  • Tokenizing with the ik analyzer in "ik_max_word" mode
{
  "analyzer":"ik_max_word",
  "text":"中華人民共和國"

}

  • Result

{
  "tokens": [
    {
      "end_offset": 7,
      "start_offset": 0,
      "position": 0,
      "type": "CN_WORD",
      "token": "中華人民共和國"
    },
    {
      "end_offset": 4,
      "start_offset": 0,
      "position": 1,
      "type": "CN_WORD",
      "token": "中華人民"
    },
    {
      "end_offset": 2,
      "start_offset": 0,
      "position": 2,
      "type": "CN_WORD",
      "token": "中華"
    },
    {
      "end_offset": 3,
      "start_offset": 1,
      "position": 3,
      "type": "CN_WORD",
      "token": "華人"
    },
    {
      "end_offset": 7,
      "start_offset": 2,
      "position": 4,
      "type": "CN_WORD",
      "token": "人民共和國"
    },
    {
      "end_offset": 4,
      "start_offset": 2,
      "position": 5,
      "type": "CN_WORD",
      "token": "人民"
    },
    {
      "end_offset": 7,
      "start_offset": 4,
      "position": 6,
      "type": "CN_WORD",
      "token": "共和國"
    },
    {
      "end_offset": 6,
      "start_offset": 4,
      "position": 7,
      "type": "CN_WORD",
      "token": "共和"
    },
    {
      "end_offset": 7,
      "start_offset": 6,
      "position": 8,
      "type": "CN_CHAR",
      "token": "國"
    }
  ]

}

  • Tokenizing with the ik analyzer in "ik_smart" mode

{
  "analyzer":"ik_smart",
  "text":"中華人民共和國"

}

  • Result

{
  "tokens": [
    {
      "end_offset": 7,
      "start_offset": 0,
      "position": 0,
      "type": "CN_WORD",
      "token": "中華人民共和國"
    }
  ]

}

  • Elasticsearch default

{
  "text":"中華人民共和國"

}

  • Result

{

  "tokens": [
    {
      "end_offset": 1,
      "start_offset": 0,
      "position": 0,
      "type": "<IDEOGRAPHIC>",
      "token": "中"
    },
    {
      "end_offset": 2,
      "start_offset": 1,
      "position": 1,
      "type": "<IDEOGRAPHIC>",
      "token": "華"
    },
    {
      "end_offset": 3,
      "start_offset": 2,
      "position": 2,
      "type": "<IDEOGRAPHIC>",
      "token": "人"
    },
    {
      "end_offset": 4,
      "start_offset": 3,
      "position": 3,
      "type": "<IDEOGRAPHIC>",
      "token": "民"
    },
    {
      "end_offset": 5,
      "start_offset": 4,
      "position": 4,
      "type": "<IDEOGRAPHIC>",
      "token": "共"
    },
    {
      "end_offset": 6,
      "start_offset": 5,
      "position": 5,
      "type": "<IDEOGRAPHIC>",
      "token": "和"
    },
    {
      "end_offset": 7,
      "start_offset": 6,
      "position": 6,
      "type": "<IDEOGRAPHIC>",
      "token": "國"
    }
  ]

}

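Note: the three modes above split the text differently, so the resulting tokens differ. "ik_max_word" emits every dictionary word it can find, "ik_smart" emits the coarsest segmentation, and the default analyzer emits one token per character.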

05》Data query - wildcard

URL: es_index_test/es_index_type_test/_search
{
  "query":{"wildcard":{"productName": "山雞圖" }}

}

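Note: a wildcard query only gives LIKE-style matching when the pattern contains wildcard characters, for example *雞*. Without them, as in the request above, the pattern matches only the exact value. In a Java program the pattern is usually assembled in code and passed to the wildcardQuery(String name, String query) method of the QueryBuilders class.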

06》Data query - match

URL: es_index_test/es_index_type_test/_search
{
  "query":{"match":{"productName": "山雞圖" }}

}

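Note: the query text is analyzed and matching is done per token. For example, Elasticsearch splits "山雞圖" into the two tokens "山雞" and "圖", looks each of them up in the index, and returns every document that matches. The results may therefore include "山雞湯" (contains the token "山雞") and "山虎圖" (contains the token "圖"). In Java, use the matchQuery(String name, Object text) method of the QueryBuilders class.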
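The behaviour described above can be imitated with a toy in-memory index. The sketch below is only an illustration: the class name, the document ids and the tokens stored for each document are hypothetical. The assumed tokenization is that the three product names are split into "山雞"/"圖", "山雞"/"湯" and "山虎"/"圖". The strings are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of an analyzed field; not how Elasticsearch is implemented. */
final class MatchVsTermSketch {
    // Hypothetical index: document id -> tokens stored for productName.
    private static final Map<String, List<String>> TOKENS = new LinkedHashMap<>();
    static {
        TOKENS.put("doc1", Arrays.asList("\u5c71\u96de", "\u5716")); // first product name
        TOKENS.put("doc2", Arrays.asList("\u5c71\u96de", "\u6e6f")); // second product name
        TOKENS.put("doc3", Arrays.asList("\u5c71\u864e", "\u5716")); // third product name
    }

    private MatchVsTermSketch() {}

    /** match-style lookup: a document qualifies if it shares any token with the query. */
    static List<String> match(List<String> queryTokens) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : TOKENS.entrySet()) {
            if (!Collections.disjoint(entry.getValue(), queryTokens)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    /** term-style lookup: the query text is not analyzed and must equal a stored token. */
    static List<String> term(String value) {
        return match(Collections.singletonList(value));
    }

    public static void main(String[] args) {
        // The analyzed query yields two tokens; any document holding one of them matches.
        System.out.println(match(Arrays.asList("\u5c71\u96de", "\u5716")));
        // The unanalyzed three-character string equals no stored token.
        System.out.println(term("\u5c71\u96de\u5716"));
    }
}
```

With these assumed tokens, the match-style lookup returns all three documents, while the term-style lookup for the full three-character name returns none. This is why exact lookups are better run against the keyword mapping from section 01.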

07》Data query - term

URL: es_index_test/es_index_type_test/_search
{
  "query":{
    "term":{
      "productName":"山雞圖"
    }
  }
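}

Note: a document is returned only if one of its indexed terms exactly equals the three characters "山雞圖"; the query text itself is not analyzed. On the analyzed field from section 02, "山雞圖" is indexed as the tokens "山雞" and "圖", so this query matches only with the keyword mapping from section 01. In Java, use the termQuery(String name, Object value) method of the QueryBuilders class.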

08》Data query - terms

URL: es_index_test/es_index_type_test/_search

{
  "query":{
    "terms":{
      "productName":["山雞圖","山虎圖"]
    }
  }

}

Note: returns the documents whose indexed term exactly equals either "山雞圖" or "山虎圖". In Java, use the termsQuery(String name, String... values) method of the QueryBuilders class.

09》Deleting the documents matched by a query

URL: es_index_test/es_index_type_test/_delete_by_query

{
  "query":{"wildcard":{"productName": "*雞*" }}

}

Note: deletes every document whose product name contains the character "雞".

10》Java examples for Elasticsearch

1、ElasticSearchProperties
package com.jd.ccc.sys.biz.yb.op.notice.config;

import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

import lombok.Data;

/**
 * Configuration parameters for the ElasticSearch search engine.
 * The actual values are set in the yml file.
 *
 * @create 2018-5-10
 * @author zhangqiang200<https://blog.csdn.net/zhangqiang_accp>
 *
 */
@Data
@Component
@ConfigurationProperties(prefix = "elasticsearch")
public class ElasticSearchProperties {
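    /**
     * Cluster name
     */
    private String clusterName;

    /**
     * Index name
     */
    private String indexName;

    /**
     * Type name
     */
    private String typeName;

    /**
     * Master node, in host:port form
     */
    private String masterNode;

    /**
     * Slave nodes, a comma-separated list of host:port entries
     */
    private String slaveNodes;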

}
2、ElasticSearchConfig
package com.jd.ccc.sys.biz.yb.op.notice.config;

import java.net.InetAddress;
import java.net.UnknownHostException;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.context.properties.EnableConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * 
 * Initializes the ES search engine client configuration.
 * 
 * @create 2018-5-10
 * @author zhangqiang200<https://blog.csdn.net/zhangqiang_accp>
 *
 */
@Configuration
@EnableConfigurationProperties(ElasticSearchProperties.class)
public class ElasticSearchConfig {
   
    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticSearchConfig.class);
   
   @Autowired
   private ElasticSearchProperties elasticSearchProperties;
   
   private static final String SYS_PROPERTY="es.set.netty.runtime.available.processors";
   
   private static final String CLUSTER_NAME="cluster.name";
   
   private static final String CLIENT_SNIFF="client.transport.sniff";

   @Bean(name="elasticSearchCluster")
   public Client getClient() {

      System.setProperty(SYS_PROPERTY, "false");
      Settings settings = Settings.builder().put(CLUSTER_NAME, elasticSearchProperties.getClusterName())
            .put(CLIENT_SNIFF, false).build();
      
      TransportClient transportClient = null;
      try {

         String[] masters = elasticSearchProperties.getMasterNode().split(":");
         transportClient = new PreBuiltTransportClient(settings).addTransportAddress(
               new InetSocketTransportAddress(InetAddress.getByName(masters[0]), Integer.parseInt(masters[1])));

         String[] slaveNodes = elasticSearchProperties.getSlaveNodes().split(","); // comma-separated list
         // Iterate over the slave nodes
         for (String node : slaveNodes) {
            String[] ipPort = node.split(":"); // host and port are separated by a colon
            if (ipPort.length == 2) {
               transportClient.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName(ipPort[0]),
                     Integer.parseInt(ipPort[1])));
            }
         }
         return transportClient;
      } catch (UnknownHostException e) {
         // Pass the exception as the last argument so the stack trace is logged
         LOGGER.error("ES client connection failed.", e);
         return null;
      }
   }

}
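The host:port parsing in the configuration class above silently skips malformed entries. As a hedged alternative, the sketch below parses the same comma-separated format with only the JDK and fails fast on bad input. The class name is hypothetical, and the addresses are left unresolved so the sketch needs no network access.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for parsing a "host:port,host:port" node list. */
final class NodeListParser {
    private NodeListParser() {}

    /**
     * Parses a comma-separated list of host:port entries.
     * Returns an empty list for null or blank input and throws
     * IllegalArgumentException for an entry that is not host:port.
     */
    static List<InetSocketAddress> parse(String nodes) {
        List<InetSocketAddress> result = new ArrayList<>();
        if (nodes == null || nodes.trim().isEmpty()) {
            return result;
        }
        for (String node : nodes.split(",")) {
            String[] hostPort = node.trim().split(":");
            if (hostPort.length != 2) {
                throw new IllegalArgumentException("Expected host:port but got: " + node);
            }
            // Integer.parseInt throws NumberFormatException, a subclass of
            // IllegalArgumentException, when the port is not numeric.
            int port = Integer.parseInt(hostPort[1].trim());
            // createUnresolved avoids a DNS lookup at parse time.
            result.add(InetSocketAddress.createUnresolved(hostPort[0].trim(), port));
        }
        return result;
    }

    public static void main(String[] args) {
        for (InetSocketAddress address : parse("10.0.0.1:9300, 10.0.0.2:9300")) {
            System.out.println(address.getHostString() + " -> " + address.getPort());
        }
    }
}
```

Failing fast is a design choice here: a typo in the yml file shows up at startup instead of as a client that quietly connects to fewer nodes than intended.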
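3、Service-layer operations
// The methods below belong to a service class (not shown) that holds the injected
// elasticSearchCluster client, elasticSearchProperties, LOGGER and the field-name constants.
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.StringUtils; // assumption: the original does not show which StringUtils is used
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchType;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;
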
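/**
 * Returns the total number of records matched by the fuzzy product search.
 *
 * @param likeProductName
 *            keyword for the fuzzy search on the product name
 * @param type
 *            product type
 * @return total number of records
 *
 * @create 2018-5-9
 * @author zhangqiang200<https://blog.csdn.net/zhangqiang_accp>
 */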
private Integer queryCount(String likeProductName, String type) {
    BoolQueryBuilder builder=this.builderQueryData(likeProductName, type);
    try {
        SearchResponse searchResponse = elasticSearchCluster.prepareSearch(elasticSearchProperties.getIndexName())
                .setTypes(elasticSearchProperties.getTypeName()).setQuery(builder)
                .setSearchType(SearchType.DEFAULT).get();
        SearchHits hits = searchResponse.getHits();
        return (int)hits.getTotalHits();
    }catch(Exception e) {
        LOGGER.error("Server access failure", e);
        return 0;
    }
}

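/**
 * Assembles the filter conditions for the fuzzy query.
 *
 * @param likeProductName
 *            keyword for the fuzzy search on the product name
 * @param type
 *            product type, or a comma-separated list of product types
 * @return the assembled filter conditions as a BoolQueryBuilder
 *
 * @create 2018-5-9
 * @author zhangqiang200<https://blog.csdn.net/zhangqiang_accp>
 */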
private BoolQueryBuilder builderQueryData(String likeProductName, String type) {
    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();
    boolQueryBuilder.must(QueryBuilders.matchQuery(PRODUCT_STATUS, "03"));
    if(StringUtils.isNotBlank(likeProductName)) {
        boolQueryBuilder.must(QueryBuilders.wildcardQuery(PRODUCT_NAME,"*"+likeProductName+"*"));
    }
    // Filter by type only when one is given
    if (StringUtils.isNotBlank(type)) {
        String[] types = type.split(",");
        if (types.length == 1) {
            boolQueryBuilder.must(QueryBuilders.matchQuery(INST_TYPE,type));
        } else {
            boolQueryBuilder.must(QueryBuilders.termsQuery(INST_TYPE, types));
        }
    }
    LOGGER.debug("wild card query-->{}",boolQueryBuilder.toString());
    return boolQueryBuilder;
}

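/**
 * Runs the fuzzy query and returns one page of product data.
 *
 * @param likeProductName keyword for the fuzzy search on the product name
 * @param type product type
 * @param startIndex index of the first record to return
 * @param pageSize number of records per page
 * @return the matching documents, each as its JSON source string
 *
 * @create 2018-5-9
 * @author zhangqiang200<https://blog.csdn.net/zhangqiang_accp>
 */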
private List<String> queryData(String likeProductName, String type, int startIndex, int pageSize) {
    List<String> resultList = new ArrayList<>();
    BoolQueryBuilder builder=this.builderQueryData(likeProductName, type);
    try {

        SearchResponse searchResponse = elasticSearchCluster.prepareSearch(elasticSearchProperties.getIndexName())
                .setTypes(elasticSearchProperties.getTypeName()).setQuery(builder)
                .setSearchType(SearchType.DEFAULT).setFrom(startIndex).setSize(pageSize).get();
        SearchHit[] hits = searchResponse.getHits().getHits();


        for (SearchHit hit : hits) {
            resultList.add(hit.getSourceAsString());
        }
    }catch(Exception e) {
        LOGGER.error("Server access failure", e);
    }

    return resultList;
}
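The startIndex and pageSize arguments of queryData map onto the "from" and "size" of the search request. The sketch below shows one way a caller might derive startIndex from a 1-based page number. The class and method names are hypothetical, and the 10,000 limit assumes the index keeps the default "index.max_result_window" setting.

```java
/** Hypothetical helper that turns a 1-based page number into a "from" offset. */
final class PageWindow {
    /** Default value of index.max_result_window; adjust if the index overrides it. */
    static final int MAX_RESULT_WINDOW = 10_000;

    private PageWindow() {}

    /**
     * Returns the offset of the first record on the given page.
     * Throws IllegalArgumentException if the arguments are not positive or
     * if from + size would exceed the result window.
     */
    static int startIndex(int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) {
            throw new IllegalArgumentException("pageNumber and pageSize must be >= 1");
        }
        // Use long arithmetic so very large page numbers cannot overflow.
        long from = (long) (pageNumber - 1) * pageSize;
        if (from + pageSize > MAX_RESULT_WINDOW) {
            throw new IllegalArgumentException("from + size exceeds " + MAX_RESULT_WINDOW);
        }
        return (int) from;
    }

    public static void main(String[] args) {
        System.out.println(startIndex(1, 20));
        System.out.println(startIndex(3, 20));
    }
}
```

Checking the window on the client side gives a clear error for deep pages instead of a failed search request.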