ElasticSearch Study Notes, Part 33: IK Analyzer Extension Dictionaries and Terms Aggregation on Analyzed text Fields

阿新 • Published: 2018-12-11


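Domain-specific terms are segmented incorrectly
Extension dictionaries

Viewing the current dictionaries
Creating a custom dictionary
Updating the configuration

Checking the segmentation again
Terms aggregation on text fields

Creating the index
Inserting data
Aggregation query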

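Domain-specific terms are segmented incorrectly

Earlier notes showed that the IK analyzer segments Chinese text fields well. Some domain-specific vocabulary, however, such as personal names, book titles, and industry terms, is not segmented the way we want.

For example, analyze the following text: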

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": ["鬥破蒼穹真好看"]
}

The result is:

{
  "tokens": [
    {
      "token": "鬥",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "破",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "蒼穹",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "真",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "好看",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    }
  ]
}

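Clearly, the segmentation we want is 鬥破蒼穹, 真, 好看, but ik_smart splits 鬥破蒼穹 (a novel title) into 鬥, 破, and 蒼穹 because the title is not in its dictionary. This is where an extension dictionary comes in.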
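To see why a missing dictionary entry leads to single-character tokens, here is a toy sketch of dictionary-driven segmentation using greedy forward longest match. This is a simplification for illustration only: IK's real algorithm is more involved, and the function name and the two-word mini-dictionary are assumptions, not part of IK.

```python
def longest_match_segment(text, dictionary):
    """Greedy forward longest-match segmentation (a toy model, not IK's real algorithm)."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary standing in for main.dic.
base_dictionary = {"蒼穹", "好看"}
before = longest_match_segment("鬥破蒼穹真好看", base_dictionary)
after = longest_match_segment("鬥破蒼穹真好看", base_dictionary | {"鬥破蒼穹"})
print(before)  # ['鬥', '破', '蒼穹', '真', '好看']
print(after)   # ['鬥破蒼穹', '真', '好看']
```

Once the title is a dictionary entry, the longest match at position 0 covers all four characters, so the single-character fallback is never reached.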

Extension dictionaries

Viewing the current dictionaries

# ls  -l /usr/local/elasticsearch/config/analysis-ik/
總用量 8260
-rw-rw----. 1 elk  elk  5225922 10月 16 10:49 extra_main.dic
-rw-rw----. 1 elk  elk    63188 10月 16 10:49 extra_single_word.dic
-rw-rw----. 1 elk  elk    63188 10月 16 10:49 extra_single_word_full.dic
-rw-rw----. 1 elk  elk    10855 10月 16 10:49 extra_single_word_low_freq.dic
-rw-rw----. 1 elk  elk      156 10月 16 10:49 extra_stopword.dic
-rw-rw----. 1 elk  elk      644 12月  4 13:15 IKAnalyzer.cfg.xml
-rw-rw----. 1 elk  elk  3058510 10月 16 10:49 main.dic
-rw-rw----. 1 elk  elk      123 10月 16 10:49 preposition.dic
-rw-rw----. 1 elk  elk     1824 10月 16 10:49 quantifier.dic
-rw-rw----. 1 elk  elk      164 10月 16 10:49 stopword.dic
-rw-rw----. 1 elk  elk      192 10月 16 10:49 suffix.dic
-rw-rw----. 1 elk  elk      752 10月 16 10:49 surname.dic

Creating a custom dictionary

mkdir /usr/local/elasticsearch/config/analysis-ik/custom && cd /usr/local/elasticsearch/config/analysis-ik/custom

vim new_word.dic

Write 鬥破蒼穹 into new_word.dic. Put one word per line and save the file as UTF-8.
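If the dictionary grows beyond a few words, editing it by hand gets error-prone. The following is a minimal sketch of a helper that appends words to a dictionary file, one per line in UTF-8, and skips entries that are already present. The helper name add_words is hypothetical and not part of IK.

```python
from pathlib import Path


def add_words(dict_path, words):
    """Append words to a dictionary file, one per line, skipping duplicates."""
    path = Path(dict_path)
    existing = []
    if path.exists():
        existing = [line.strip() for line in path.read_text(encoding="utf-8").splitlines()]
    known = {word for word in existing if word}
    added = []
    for word in words:
        word = word.strip()
        if word and word not in known:
            known.add(word)
            added.append(word)
    if added:
        # Create the custom/ directory if it does not exist yet.
        path.parent.mkdir(parents=True, exist_ok=True)
        merged = [word for word in existing if word] + added
        path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return added
```

For the example in this note, the call would be add_words("/usr/local/elasticsearch/config/analysis-ik/custom/new_word.dic", ["鬥破蒼穹"]), run as a user with write access to that directory.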

Updating the configuration

vim /usr/local/elasticsearch/config/analysis-ik/IKAnalyzer.cfg.xml

Add the path of the custom dictionary, relative to the analysis-ik directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
        <comment>IK Analyzer extension configuration</comment>
        <!-- Configure your own extension dictionaries here -->
        <entry key="ext_dict">custom/new_word.dic</entry>
        <!-- Configure your own extension stopword dictionaries here -->
        <entry key="ext_stopwords"></entry>
        <!-- Configure remote extension dictionaries here -->
        <!-- <entry key="remote_ext_dict">words_location</entry> -->
        <!-- Configure remote extension stopword dictionaries here -->
        <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

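Then restart elasticsearch so that the new dictionary is loaded. Documents indexed before the change keep their old tokens until they are reindexed.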
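A wrong path in ext_dict is easy to miss, so it can help to check the configuration before restarting. The sketch below parses the XML shown above and reports configured dictionary files that are missing on disk. The function names are hypothetical, and it assumes that paths are resolved relative to the directory holding IKAnalyzer.cfg.xml and that several paths may be separated by semicolons.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def configured_dictionaries(cfg_xml, key="ext_dict"):
    """Return the dictionary paths listed under one <entry key="..."> element."""
    root = ET.fromstring(cfg_xml)
    paths = []
    for entry in root.iter("entry"):
        if entry.get("key") == key and entry.text:
            # Several paths may be given, separated by semicolons.
            paths.extend(p.strip() for p in entry.text.split(";") if p.strip())
    return paths


def missing_dictionaries(config_dir, cfg_xml):
    """List configured ext_dict files that do not exist under config_dir."""
    base = Path(config_dir)
    return [p for p in configured_dictionaries(cfg_xml) if not (base / p).is_file()]
```

Pass the file contents as bytes, for example Path(".../analysis-ik/IKAnalyzer.cfg.xml").read_bytes(). Commented-out entries such as remote_ext_dict are ignored by the parser, so only active entries are checked.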

Checking the segmentation again

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": ["鬥破蒼穹真好看"]
}

The result is:

{
  "tokens": [
    {
      "token": "鬥破蒼穹",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "真",
      "start_offset": 4,
      "end_offset": 5,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "好看",
      "start_offset": 5,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
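The same check can be scripted. This is a minimal sketch that posts to the _analyze endpoint and tests whether a word comes back as a single token. The base URL http://localhost:9200 and the helper names are assumptions; adjust them to your cluster.

```python
import json
import urllib.request


def analyze(text, analyzer="ik_smart", base_url="http://localhost:9200"):
    """POST to the _analyze endpoint and return the parsed JSON response."""
    body = json.dumps({"analyzer": analyzer, "text": [text]}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def token_strings(analyze_response):
    """Extract just the token text, in position order."""
    tokens = sorted(analyze_response["tokens"], key=lambda t: t["position"])
    return [t["token"] for t in tokens]


def dictionary_applied(analyze_response, word):
    """True if `word` came back as a single token."""
    return word in token_strings(analyze_response)
```

With a running node, dictionary_applied(analyze("鬥破蒼穹真好看"), "鬥破蒼穹") should be true once the extension dictionary has been loaded.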

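Terms aggregation on text fields

When a terms aggregation runs on a text field, elasticsearch aggregates on the analyzed tokens rather than on the original string, so each bucket reports how many documents contain that token. Aggregating on a text field requires "fielddata": true in the mapping, as set in the index below. Fielddata is built in heap memory, so enable it only on fields that need it.

Creating the index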

PUT message
{
  "mappings": {
    "message": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_smart",
          "fielddata": true
        },
        "content": {
          "type": "text",
          "analyzer": "ik_smart",
          "fielddata": true
        }
      }
    }
  }
}

Inserting data

PUT /message/message/1
{
  "title":"鬥破蒼穹好看",
  "content":"鬥破蒼穹真好看，就是更新慢 "
}

PUT /message/message/2
{
  "title":"鬥破蒼穹沒看過",
  "content":"鬥破蒼穹聽說好看沒看過"
}

PUT /message/message/3
{
  "title":"鬥破蒼穹是啥",
  "content":"不知道沒看過"
}

PUT /message/message/4
{
  "title":"鬥破蒼穹是啥",
  "content":"鬥破蒼穹是小說"
}

Aggregation query

GET message/message/_search
{
  "aggs": {
    "term_content": {
      "terms": {
        "field": "content",
        "size": 20
      }
    }
  },
  "size": 0
}

The result is:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "term_content": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "鬥破蒼穹",
          "doc_count": 3
        },
        {
          "key": "好看",
          "doc_count": 2
        },
        {
          "key": "沒",
          "doc_count": 2
        },
        {
          "key": "看過",
          "doc_count": 2
        },
        {
          "key": "不知道",
          "doc_count": 1
        },
        {
          "key": "聽說",
          "doc_count": 1
        },
        {
          "key": "小說",
          "doc_count": 1
        },
        {
          "key": "就是",
          "doc_count": 1
        },
        {
          "key": "慢",
          "doc_count": 1
        },
        {
          "key": "是",
          "doc_count": 1
        },
        {
          "key": "更新",
          "doc_count": 1
        },
        {
          "key": "真",
          "doc_count": 1
        }
      ]
    }
  }
}
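The bucket counts above can be cross-checked with a small simulation. A terms aggregation counts documents, not occurrences, so a token repeated inside one document still adds 1. The token lists below are inferred from the buckets in the response rather than taken from an actual _analyze call, and the function name is hypothetical.

```python
from collections import Counter


def terms_doc_counts(docs_tokens, size=20):
    """Count, per token, how many documents contain it (not total occurrences)."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(set(tokens))  # a token counts once per document
    # Order by doc_count descending; ties are broken by key here for determinism,
    # which is not necessarily the order elasticsearch returns.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Token lists for the four `content` values, inferred from the buckets above.
content_tokens = [
    ["鬥破蒼穹", "真", "好看", "就是", "更新", "慢"],
    ["鬥破蒼穹", "聽說", "好看", "沒", "看過"],
    ["不知道", "沒", "看過"],
    ["鬥破蒼穹", "是", "小說"],
]
buckets = terms_doc_counts(content_tokens)
print(buckets[0])    # ('鬥破蒼穹', 3)
print(len(buckets))  # 12
```

Document 3 does not mention 鬥破蒼穹 in its content field, which is why that bucket has a doc_count of 3 even though the index holds 4 documents.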
