Elasticsearch (12): Query string analysis, changing analyzer settings, and custom analyzers

A Xin • Published: 2018-12-22

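Query string analysis

A query string must be analyzed with the same analyzer that was used when the index was built. The query string is also handled differently for exact-value and full-text fields (covered in detail in part 10):

date: exact value
_all: full text

Suppose we have a document with a field whose value is "hello you and me", and an inverted index is built from it. To search the index that holds this document we enter the search text "hello me"; that search text is the query string.

By default, ES analyzes the query string with the same analyzer that was used to build the inverted index for the field it targets, applying the same tokenization and normalization. Only then does the search behave correctly. If indexing turned dogs --> dog but the search still used dogs, nothing would match, so at search time dogs must also become dog.

Key point: fields of different types are handled differently; some are full text and some are exact value.

post_date (date): exact value
_all: full text, so it is tokenized and normalized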
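The dogs --> dog point can be sketched in a few lines of Python. The `analyze` function here is a hypothetical stand-in for an analyzer (lowercasing plus a naive plural-stripping rule), not Elasticsearch's actual implementation, and the "I like dogs" document is made up for illustration. It only shows that a lookup succeeds when the query goes through the same analysis as the indexed text.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption for illustration): split on non-letters,
    lowercase, and strip a trailing 's' as a crude stand-in for stemming."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def build_inverted_index(docs):
    """Map each analyzed term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

# "I like dogs" is a hypothetical document; the second one is from the text.
docs = {"doc1": "I like dogs", "doc2": "hello you and me"}
index = build_inverted_index(docs)

# Looking up the raw query term misses, because the index holds "dog".
raw_hits = index.get("dogs", set())

# Running the query through the same analyzer finds the document.
analyzed_hits = set()
for term in analyze("dogs"):
    analyzed_hits |= index.get(term, set())

print(raw_hits)       # set()
print(analyzed_hits)  # {'doc1'}
```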

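How the analyzer is applied

GET /_search?q=2017

This searches the _all field. All fields of a document are concatenated into one long string, which is then analyzed, for example:

2017-01-02 my second article this is my second article in this website 11400

The date tokens in the inverted index for _all:

term   doc1  doc2  doc3
2017   *     *     *
01     *
02           *
03                 *

Searching _all for 2017 therefore matches all 3 documents.

GET /_search?q=2017-01-01

This also searches _all. The query string 2017-01-01 is analyzed with the same analyzer that built the inverted index, which yields the terms 2017, 01 and 01, so the term 2017 again matches all 3 documents.

GET /_search?q=post_date:2017-01-01

post_date is a date field, so it is indexed as an exact value:

term        doc1  doc2  doc3
2017-01-01  *
2017-01-02        *
2017-01-03              *

Searching post_date for 2017-01-01 looks up the whole value and returns a single document, doc1.

GET /_search?q=post_date:2017

This case is not covered here, because it relies on an optimization introduced after ES 5.2.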
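The difference between the two lookups can be imitated with a small Python sketch. It is only an approximation: `tokenize` is a rough stand-in for full-text analysis, a query is assumed to match when any of its tokens matches, and the titles of doc1 and doc3 are hypothetical.

```python
import re

# Titles for doc1 and doc3 are made up; only the dates matter here.
posts = {
    "doc1": {"post_date": "2017-01-01", "title": "my first article"},
    "doc2": {"post_date": "2017-01-02", "title": "my second article"},
    "doc3": {"post_date": "2017-01-03", "title": "my third article"},
}

def tokenize(text):
    """Rough stand-in for full-text analysis: lowercase, then split on
    anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

all_index = {}   # full text: every field concatenated, then tokenized
date_index = {}  # exact value: the whole date is one term

for doc_id, doc in posts.items():
    all_text = " ".join(str(value) for value in doc.values())
    for term in tokenize(all_text):
        all_index.setdefault(term, set()).add(doc_id)
    date_index.setdefault(doc["post_date"], set()).add(doc_id)

def search_all(query):
    """Tokenize the query; a document matches if it holds any token."""
    hits = set()
    for term in tokenize(query):
        hits |= all_index.get(term, set())
    return hits

def search_post_date(query):
    """Exact-value lookup: the query is not tokenized."""
    return date_index.get(query, set())

print(sorted(search_all("2017")))              # ['doc1', 'doc2', 'doc3']
print(sorted(search_all("2017-01-01")))        # ['doc1', 'doc2', 'doc3']
print(sorted(search_post_date("2017-01-01")))  # ['doc1']
```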

Testing an analyzer

GET /_analyze
{
  "analyzer": "standard",
  "text": "Text to analyze"
}

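(1) When you insert data directly into ES, it automatically creates the index, along with the type and its mapping.
(2) The mapping automatically defines the data type of every field.
(3) Depending on its data type (text versus date, for example), a field is either exact value or full text.
(4) When the inverted index is built, an exact value is indexed whole, as a single term. Full text goes through tokenization and normalization (tense conversion, synonym conversion, case conversion) before it enters the inverted index.
(5) Whether a field is exact value or full text also determines how a search against it behaves, and that behavior mirrors what happened at index time. A search on an exact-value field matches against the whole value; a query string against a full-text field is tokenized and normalized before it is looked up in the inverted index.
(6) You can rely on ES dynamic mapping to create the mapping automatically, including the data types. You can also create the mapping for an index and type manually in advance and configure each field yourself: data type, indexing behavior, analyzer, and so on.

A mapping is the metadata of a type within an index. Every type has its own mapping, which determines the data types, how the inverted index is built, and how searches behave.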

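Forward index

Searching relies on the inverted index. Sorting relies on the forward index, which lets ES read each field of each document and then sort. The forward index is simply doc values.

At index time ES builds the inverted index for searching and also builds the forward index (doc values) for sorting, aggregations, filtering and similar operations.

Doc values are stored on disk. If there is enough memory, the OS caches them in memory automatically and performance stays high; if memory is short, they are read from disk instead.

Inverted index example:

doc1: hello world you and me
doc2: hi, world, how are you

word   doc1  doc2
hello  *
world  *     *
you    *     *
and    *
me     *
hi           *
how          *
are          *

The query "hello you" is tokenized into hello and you:

hello --> doc1
you --> doc1, doc2

Forward index example (sort by age):

doc1: { "name": "jack", "age": 27 }
doc2: { "name": "tom", "age": 30 }

document  name  age
doc1      jack  27
doc2      tom   30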
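The two structures can be contrasted with a short Python sketch. It is a simplified in-memory model, not how Lucene stores data on disk: the inverted index maps a term to its documents, while the forward index (doc values) is modeled as one column per field, mapping a document to its value.

```python
import re

texts = {
    "doc1": "hello world you and me",
    "doc2": "hi, world, how are you",
}
people = {
    "doc1": {"name": "jack", "age": 27},
    "doc2": {"name": "tom", "age": 30},
}

# Inverted index: term -> documents containing it (used for searching).
inverted = {}
for doc_id, text in texts.items():
    for term in re.findall(r"\w+", text.lower()):
        inverted.setdefault(term, set()).add(doc_id)

# Forward index (doc values): field -> {document -> value}, a column per
# field (used for sorting and aggregations).
doc_values = {}
for doc_id, fields in people.items():
    for field, value in fields.items():
        doc_values.setdefault(field, {})[doc_id] = value

hello_hits = sorted(inverted["hello"])
you_hits = sorted(inverted["you"])

# Sorting reads the "age" column directly instead of scanning terms.
by_age_desc = sorted(doc_values["age"], key=doc_values["age"].get, reverse=True)

print(hello_hits)   # ['doc1']
print(you_hits)     # ['doc1', 'doc2']
print(by_age_desc)  # ['doc2', 'doc1']
```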

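The default analyzer

The default analyzer is standard, which is made up of:

standard tokenizer: splits text on word boundaries
standard token filter: does nothing
lowercase token filter: converts all letters to lowercase
stop token filter (disabled by default): removes stop words such as a, the, it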

Changing analyzer settings

Enable the English stop word token filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "analyzer": "standard", 
  "text": "a dog is in the house"
}

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text":"a dog is in the house"
}
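To see what the two _analyze requests should differ on, here is a rough Python imitation of both analyzers. It is a sketch, not Elasticsearch code: the regex tokenizer only approximates the standard tokenizer, and the stop word list is Lucene's default English set as best recalled here, so treat it as an assumption and check it against the Elasticsearch documentation.

```python
import re

# Assumed contents of the `_english_` stop word set (verify before relying on it).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def standard(text):
    """Approximation of the standard analyzer: split into words, lowercase,
    stop filter disabled."""
    return re.findall(r"\w+", text.lower())

def es_std(text):
    """Same pipeline with the English stop word filter switched on."""
    return [t for t in standard(text) if t not in ENGLISH_STOP_WORDS]

print(standard("a dog is in the house"))  # ['a', 'dog', 'is', 'in', 'the', 'house']
print(es_std("a dog is in the house"))    # ['dog', 'house']
```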

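Building your own custom analyzer

my_index was already created in the previous step, so delete it first (DELETE /my_index) or use a different index name before running the request below.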

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=> and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
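The order in which my_analyzer applies its stages (char filters, then the tokenizer, then token filters) can be traced with a small Python imitation. Every function below is a simplified stand-in written for this illustration, not the real Elasticsearch component; in particular the regex-based `html_strip` and `tokenize` only approximate the behavior of their namesakes.

```python
import re

def html_strip(text):
    """Crude stand-in for the html_strip char filter: replace tags with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def amp_to_and(text):
    """Stand-in for the "&_to_and" mapping char filter."""
    return text.replace("&", "and")

def tokenize(text):
    """Approximation of the standard tokenizer: keep runs of word characters."""
    return re.findall(r"\w+", text)

def my_analyzer(text):
    # 1. Char filters run on the raw string, in the configured order.
    for char_filter in (html_strip, amp_to_and):
        text = char_filter(text)
    # 2. The tokenizer splits the filtered string into tokens.
    tokens = tokenize(text)
    # 3. Token filters run on the token stream: lowercase, then my_stopwords.
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in {"the", "a"}]

tokens = my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!")
print(tokens)  # ['tomandjerry', 'are', 'friend', 'in', 'house', 'haha']
```

Because the custom stop list contains only "the" and "a", words such as "are" and "in" survive, unlike with the `_english_` set.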

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}
