ElasticSearch IK Chinese Word Segmentation: A Detailed Usage Guide

阿新 • Published: 2019-02-20

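I. Preface

The IK analysis plugin for ES is very widely used for Chinese search, and I have been using it for quite a while. My knowledge of the details has stayed fragmented, though, because I never organized it properly. Whenever I come back to it after a break, I end up hunting for material all over again, which slows development down. So I took some free time to put the relevant material in order. I hope this article helps developers who use ElasticSearch with the IK plugin avoid a few detours.

This article has seven parts: preface, introduction to the IK analyzer, comparison of tokenization results, using a custom dictionary, index and field settings (creating an index and importing data with Python), query testing (querying with Python), and conclusion.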

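II. Introduction to the IK Analyzer

For installing and testing the IK analyzer plugin and for using custom dictionaries, refer directly to the material on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Three points to note:
1. Make sure the ElasticSearch version and the IK plugin version match.
2. Adding the parameter index.analysis.analyzer.default.type: ik as the last line of ElasticSearch's config file config/elasticsearch.yml makes ik the default analyzer for all indices. This is optional; you can instead select the ik analyzer through the mapping. Note that this applies to older ElasticSearch releases: from 5.0 onward, index-level settings are no longer accepted in elasticsearch.yml.
3. The IK analyzer has two tokenization modes, described next.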

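ik_max_word: splits the text at the finest granularity and exhausts every possible combination. For example, "中華人民共和國國歌" is split into "中華人民共和國,中華人民,中華,華人,人民共和國,人民,人,民,共和國,共和,和,國國,國歌".

ik_smart: splits the text at the coarsest granularity. For example, "中華人民共和國國歌" is split into "中華人民共和國,國歌".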

To verify that IK is installed correctly and to test the two modes:

http://localhost:9200/_analyze/?analyzer=ik_smart&text=中華人民共和國國歌


http://localhost:9200/_analyze/?analyzer=ik_max_word&text=中華人民共和國國歌
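The same check can be scripted. The following is a minimal sketch, assuming a node at localhost:9200 and an ElasticSearch version that accepts the query-string form of _analyze used in the URLs above (newer releases expect a JSON request body instead). The helper names build_analyze_url, extract_tokens, and fetch_tokens are hypothetical and only serve this illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ES_URL = "http://localhost:9200"  # assumption: local single-node setup


def build_analyze_url(analyzer, text, base_url=ES_URL):
    """Build the _analyze URL in the query-string form used above."""
    query = urlencode({"analyzer": analyzer, "text": text})
    return "{}/_analyze?{}".format(base_url, query)


def extract_tokens(response_body):
    """Pull the token strings out of an _analyze JSON response."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]


def fetch_tokens(analyzer, text):
    """Call the node and return the tokens (needs a running ES with IK)."""
    with urlopen(build_analyze_url(analyzer, text)) as resp:
        return extract_tokens(resp.read().decode("utf-8"))
```

Against a node with IK installed, fetch_tokens("ik_smart", "中華人民共和國國歌") should return the coarse-grained tokens described above, and the same call with "ik_max_word" the fine-grained ones.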


III. Comparing Tokenization Results

This comparison is based on the examples given on GitHub.

1. Create two indices, ik_test and ik_test_1

curl -XPUT http://localhost:9200/ik_test
curl -XPUT http://localhost:9200/ik_test_1

2. Set the mapping on the ik_test index (ik_test_1 is left without an explicit mapping, so it keeps the default analyzer)

curl -XPOST http://localhost:9200/ik_test/fulltext/_mapping -d'
{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}'

3. Insert data into both indices

curl -XPOST http://localhost:9200/ik_test/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
curl -XPOST http://localhost:9200/ik_test/fulltext/2 -d'
{"content":"公安部：各地校車將享最高路權"}
'
curl -XPOST http://localhost:9200/ik_test/fulltext/3 -d'
{"content":"中韓漁警衝突調查：韓警平均每天扣1艘中國漁船"}
'
curl -XPOST http://localhost:9200/ik_test/fulltext/4 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'

curl -XPOST http://localhost:9200/ik_test_1/fulltext/1 -d'
{"content":"美國留給伊拉克的是個爛攤子嗎"}
'
curl -XPOST http://localhost:9200/ik_test_1/fulltext/2 -d'
{"content":"公安部：各地校車將享最高路權"}
'
curl -XPOST http://localhost:9200/ik_test_1/fulltext/3 -d'
{"content":"中韓漁警衝突調查：韓警平均每天扣1艘中國漁船"}
'
curl -XPOST http://localhost:9200/ik_test_1/fulltext/4 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}
'

4. Search each of the two indices

curl -XPOST http://localhost:9200/ik_test/fulltext/_search?pretty  -d'{
    "query" : { "match" : { "content" : "洛杉磯領事館" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

The result is as follows: [screenshot not preserved]

curl -XPOST http://localhost:9200/ik_test_1/fulltext/_search?pretty  -d'{
    "query" : { "match" : { "content" : "洛杉磯領事館" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

The result is as follows: [screenshot not preserved]

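IV. Using a Custom Dictionary

Configure the dictionary as described on GitHub: add the word "洛杉磯領事館" to the custom/mydict.dic file, then restart ES. Custom dictionary reference: https://github.com/medcl/elasticsearch-analysis-ik
Then run the following search: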

curl -XPOST http://localhost:9200/ik_test/fulltext/_search?pretty  -d'{
    "query" : { "match" : { "content" : "洛杉磯領事館" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'

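The result is as follows: [screenshot not preserved]
Judging from the result, the custom dictionary seems to have had no effect. This puzzled me for a long time, and I kept assuming the feature was broken. After repeated testing I found that documents inserted after the dictionary change are tokenized correctly.

After modifying the custom dictionary, insert a fifth document whose content field is identical to that of document 4.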

curl -XPOST http://localhost:9200/ik_test/fulltext/5 -d'
{"content":"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"}'

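Then run the same query again.
The search returns two documents, _id = 4 and _id = 5. The _id = 5 hit is exactly what we want. By rights the _id = 4 hit should be too, but in that result "洛杉磯領事館" is split into two words.

My guess is that this has to do with how ES stores documents and with how match searches work.
In the _id = 4 document, "洛杉磯領事館" is stored as 7 terms: "洛杉磯", "領事館", "洛", "杉", "磯", "領事", "館".
In the _id = 5 document, "洛杉磯領事館" is stored as 8 terms: "洛杉磯領事館", "洛杉磯", "領事館", "洛", "杉", "磯", "領事", "館".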

The current analysis result can be checked with:

http://localhost:9200/_analyze/?analyzer=ik_max_word&text=洛杉磯領事館

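The guess about storage can be illustrated with a toy model. This is only a sketch under stated assumptions: tokenize is a crude stand-in for ik_max_word (it emits every dictionary word found in the text), the starting dictionary is hypothetical, and a plain dict plays the role of the inverted index. It shows that terms are fixed at index time, so a word added to the dictionary later only appears in documents indexed afterwards.

```python
def tokenize(text, dictionary):
    """Toy stand-in for ik_max_word: emit every dictionary word found in text."""
    return {text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if text[i:j] in dictionary}


def index_document(inverted_index, doc_id, text, dictionary):
    """Tokenize once, at index time, and record doc_id under each term."""
    for term in tokenize(text, dictionary):
        inverted_index.setdefault(term, set()).add(doc_id)


text = "中國駐洛杉磯領事館遭亞裔男子槍擊"
dictionary = {"洛杉磯", "領事館", "領事"}   # hypothetical starting dictionary
index = {}

index_document(index, 4, text, dictionary)   # indexed before the change
dictionary.add("洛杉磯領事館")               # custom word added afterwards
index_document(index, 5, text, dictionary)   # indexed after the change

print(sorted(index["洛杉磯領事館"]))  # [5]: only the later document has the new term
print(sorted(index["洛杉磯"]))        # [4, 5]: both documents have the old term
```

In this model, document 4 would only pick up the new term if it were indexed again, which matches the behavior observed above.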

Now search with a term query instead:

curl -XPOST http://localhost:9200/ik_test/fulltext/_search?pretty  -d'{
    "query" : { "term" : { "content" : "洛杉磯領事館" }},
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "content" : {}
        }
    }
}'


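Only the _id = 5 document is returned. This confirms my two guesses:
1. match and term search differently. Reference: http://www.cnblogs.com/yjf512/p/4897294.html
2. The way ES stores documents matters: otherwise a term search would return both _id = 4 and _id = 5. The _id = 4 document is not found because the terms stored for it in ES do not include "洛杉磯領事館" as a whole word.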
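The difference between the two query types can be sketched on a toy inverted index. Everything here is an illustrative assumption rather than ES internals: INDEX is hand-written data mirroring the experiment, and toy_analyze stands in for the real analyzer. The point is only that a term query looks up the query string as-is, while a match query analyzes it first and then combines the per-term lookups (OR by default).

```python
# Toy inverted index: term -> ids of documents containing it (hypothetical data
# mirroring the experiment: document 4 predates the dictionary change).
INDEX = {
    "洛杉磯": {4, 5},
    "領事館": {4, 5},
    "洛杉磯領事館": {5},
}


def term_query(index, term):
    """term: look up the query string as-is, with no analysis."""
    return set(index.get(term, set()))


def match_query(index, text, analyze):
    """match: analyze the query text first, then OR the per-term lookups."""
    hits = set()
    for term in analyze(text):
        hits |= term_query(index, term)
    return hits


def toy_analyze(text):
    """Stand-in analyzer: return every indexed term that occurs in the text."""
    return [term for term in INDEX if term in text]


print(sorted(term_query(INDEX, "洛杉磯領事館")))                # [5]
print(sorted(match_query(INDEX, "洛杉磯領事館", toy_analyze)))  # [4, 5]
```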

V. Index and Field Settings

This is mainly a matter of configuring the mapping. You can use the mapping format given by IK:

{
    "fulltext": {
        "_all": {
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "term_vector": "no",
            "store": "false"
        },
        "properties": {
            "content": {
                "type": "string",
                "store": "no",
                "term_vector": "with_positions_offsets",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word",
                "include_in_all": "true",
                "boost": 8
            }
        }
    }
}
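The mapping above uses the older "string" field type and the _all field. As a hedged sketch for readers on newer releases, assuming ElasticSearch 7 or later (where mappings are typeless, "string" has been replaced by "text", and _all is gone), an equivalent body could be built as follows. The helper name build_ik_mapping is hypothetical.

```python
import json


def build_ik_mapping(field_names, analyzer="ik_max_word",
                     search_analyzer="ik_max_word"):
    """Return a typeless mapping body that analyzes each field with IK."""
    return {
        "properties": {
            name: {
                "type": "text",  # "string" was split into text/keyword in 5.0
                "analyzer": analyzer,
                "search_analyzer": search_analyzer,
            }
            for name in field_names
        }
    }


print(json.dumps(build_ik_mapping(["content"]), indent=2))
```

Whether to keep ik_max_word on the search side or switch search_analyzer to ik_smart is a tuning choice; the defaults here simply mirror the mapping shown above.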

The Python script below creates the index, applies the mapping, and imports the sample data:

# -*- coding: utf-8 -*-

import elasticsearch


class ElasticSearchClient(object):
    @staticmethod
    def get_es_servers():
        es_servers = [{
            "host": "localhost",
            "port": "9200"
        }]
        es_client = elasticsearch.Elasticsearch(hosts=es_servers)
        return es_client


class LoadElasticSearch(object):
    def __init__(self):
        self.index = "hz"
        self.doc_type = "text"
        self.es_client = ElasticSearchClient.get_es_servers()
        self.set_mapping()

    def set_mapping(self):
        """
        Set the mapping.
        """
        chinese_field_config = {
            "type": "string",
            "store": "no",
            "term_vector": "with_positions_offsets",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_max_word",
            "include_in_all": "true",
            "boost": 8
        }

        mapping = {
            self.doc_type: {
                "_all": {"enabled": False},

                "properties": {
                    "document_id": {
                        "type": "integer"
                    },
                    "content": chinese_field_config
                }
            }
        }

        if not self.es_client.indices.exists(index=self.index):
            # Create the index and its mapping
            self.es_client.indices.create(index=self.index, ignore=400)
            self.es_client.indices.put_mapping(index=self.index, doc_type=self.doc_type, body=mapping)

    def add_data(self, row_obj):
        """
        Insert a single document into ES.
        """
        # pop() with a default reads and removes "_id" without raising KeyError
        _id = row_obj.pop("_id", 1)
        self.es_client.index(index=self.index, doc_type=self.doc_type, body=row_obj, id=_id)


if __name__ == '__main__':

    content_ls = [
        u"美國留給伊拉克的是個爛攤子嗎",
        u"公安部：各地校車將享最高路權",
        u"中韓漁警衝突調查：韓警平均每天扣1艘中國漁船",
        u"中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
    ]

    load_es = LoadElasticSearch()
    # Insert the documents one at a time as a test
    for index, content in enumerate(content_ls):
        write_obj = {
            "_id": index,
            "document_id": index,
            "content": content
        }
        load_es.add_data(write_obj)
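The script above inserts one document per request. For larger imports the _bulk endpoint is the usual alternative. The following is a minimal sketch of how its newline-delimited body could be assembled, assuming the same "hz" index and "text" doc type as above; the helper name build_bulk_body is hypothetical, and the "_type" entry only applies to versions that still use mapping types.

```python
import json


def build_bulk_body(index, doc_type, rows):
    """Serialize rows into the newline-delimited body the _bulk endpoint expects."""
    lines = []
    for row in rows:
        source = dict(row)  # copy so the caller's dict is left untouched
        action = {"index": {"_index": index, "_type": doc_type,
                            "_id": source.pop("_id")}}
        lines.append(json.dumps(action, ensure_ascii=False))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the body must end with a newline


rows = [
    {"_id": 0, "document_id": 0, "content": u"美國留給伊拉克的是個爛攤子嗎"},
    {"_id": 1, "document_id": 1, "content": u"公安部：各地校車將享最高路權"},
]
body = build_bulk_body("hz", "text", rows)
print(body)
```

The resulting string could then be passed as the body argument of the client's bulk method.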

VI. Query Testing

# -*- coding: utf-8 -*-

import elasticsearch


class ElasticSearchClient(object):
    @staticmethod
    def get_es_servers():
        es_servers = [{
            "host": "localhost",
            "port": "9200"
        }]
        es_client = elasticsearch.Elasticsearch(hosts=es_servers)
        return es_client


class SearchData(object):
    index = 'hz'
    doc_type = 'text'

    @classmethod
    def search(cls, field, query, search_offset, search_size):
        # Build the query conditions
        es_search_options = cls.set_search_optional(field, query)
        # Run the search
        es_result = cls.get_search_result(es_search_options, search_offset, search_size)
        # Wrap each hit to produce the final result
        final_result = cls.get_highlight_result_list(es_result, field)
        return final_result

    @classmethod
    def get_highlight_result_list(cls, es_result, field):
        result_items = es_result['hits']['hits']
        final_result = []
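        for item in result_items:
            # Keep the stored value when ES returns no highlight for this field
            highlighted = item.get('highlight', {}).get(field)
            if highlighted:
                item['_source'][field] = highlighted[0]
            final_result.append(item['_source'])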
        return final_result

    @classmethod
    def get_search_result(cls, es_search_options, search_offset, search_size):
        es_result = ElasticSearchClient.get_es_servers().search(
            index=cls.index,
            doc_type=cls.doc_type,
            body=es_search_options,
            from_=search_offset,
            size=search_size
        )
        return es_result

    @classmethod
    def set_search_optional(cls, field, query):
        es_search_options = {
            "query": {
                "match": {
                    field: {
                        "query": query,
                        # slop applies to phrase matching; newer ES versions
                        # reject it in a plain match query
                        "slop": 10
                    }
                }
            },
            "highlight": {
                "fields": {
                    "*": {
                        "require_field_match": True,
                    }
                }
            }
        }
        return es_search_options


if __name__ == '__main__':
    final_results = SearchData.search("content", u"中國", 0, 30)
    for obj in final_results:
        for k, v in obj.items():
            print(k, ":", v)
        print("=======")

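Output:

VII. Conclusion

This article also implemented some search functionality, but it does not go into further detail. The next article will explain and implement ElasticSearch search.

Thanks for reading!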
