
Search Engines (Elasticsearch Index Management, Part 2)


After completing this lesson, you should be able to:

Configure and test analyzers
Manage documents
Understand routing rules

Analyzers

Understanding analyzers

Analyzer

In ES, an Analyzer is built from three kinds of components:

character filter: filters the characters of the text, e.g. stripping HTML tag characters. The result is then handed to the tokenizer. An analyzer may contain zero or more character filters; multiple filters run in their configured order.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: processes the tokens produced by the tokenizer, e.g. lowercasing, stop-word removal, synonym handling. An analyzer may contain zero or more token filters, applied in their configured order.
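The three-stage pipeline can be illustrated with a small Python sketch (illustration only, not ES code; the filters and tokenizer here are crude stand-ins passed in as plain functions):

```python
import re

def analyze(text, char_filters, tokenizer, token_filters):
    """Sketch of the analyzer pipeline: char filters -> tokenizer -> token filters."""
    for cf in char_filters:              # 0..n character filters, applied in order
        text = cf(text)
    tokens = tokenizer(text)             # exactly one tokenizer
    for tf in token_filters:             # 0..n token filters, applied in order
        tokens = tf(tokens)
    return tokens

# A crude stand-in for html_strip + a whitespace tokenizer + a lowercase filter:
tokens = analyze("<b>The QUICK fox</b>",
                 char_filters=[lambda t: re.sub(r"<[^>]+>", "", t)],
                 tokenizer=str.split,
                 token_filters=[lambda ts: [t.lower() for t in ts]])
print(tokens)  # -> ['the', 'quick', 'fox']
```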

How to test an analyzer

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

Understanding position and offset

 {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }
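To make the two fields concrete: position is the token's index in the token stream, while start_offset/end_offset are its character span in the original text. A minimal Python sketch (not ES code) that reproduces this bookkeeping for a whitespace-style tokenizer:

```python
import re

def whitespace_analyze(text):
    """Mimic the _analyze output shape of a whitespace-style tokenizer:
    position = token index in the stream,
    start/end offsets = character span in the original text."""
    return [
        {
            "token": m.group(),
            "start_offset": m.start(),
            "end_offset": m.end(),
            "position": position,
        }
        for position, m in enumerate(re.finditer(r"\S+", text))
    ]

for t in whitespace_analyze("The quick brown fox."):
    print(t)
```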

Built-in character filters

    HTML Strip Character Filter
        html_strip: strips HTML tags and decodes HTML entities such as &amp;.
    Mapping Character Filter
        mapping: replaces given strings in the text with specified replacement strings.
    Pattern Replace Character Filter
        pattern_replace: performs regular-expression replacement.

HTML Strip Character Filter

Test:

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

Configuring it in an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Test:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

escaped_tags specifies tags that are exempt from stripping. If there are no tags to exempt, you do not need a custom definition here; just use html_strip directly in my_analyzer above.

Mapping character filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html

Pattern Replace Character Filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html

Built-in tokenizers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

The integrated Chinese analysis plugin Ikanalyzer provides two tokenizers: ik_smart and ik_max_word.

Testing a tokenizer

POST _analyze
{
  "tokenizer":      "standard", 
  "text": "張三說的確實在理"
}

POST _analyze
{
  "tokenizer":      "ik_smart", 
  "text": "張三說的確實在理"
}

Built-in token filters

ES has many built-in token filters; for details see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

   Lowercase Token Filter: lowercase (converts tokens to lower case)
   Stop Token Filter: stop (stop-word filter)
   Synonym Token Filter: synonym (synonym filter)
Note: the Chinese analyzer Ikanalyzer comes with its own stop-word filtering.

Synonym Token Filter (synonym filter)

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "my_ik_synonym" : {
                        "tokenizer" : "ik_smart",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}
synonyms_path: the synonym file location (relative to the config directory)

Synonym definition format

ES supports two synonym formats: Solr and WordNet.

Define the following synonyms in analysis/synonym.txt using the Solr format (the file must be UTF-8 encoded):

張三,李四
電飯煲,電飯鍋 => 電飯煲   
電腦 => 計算機,computer

One group of synonyms per line; => means "normalize to".
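The two rule shapes of the Solr format can be summarized with a small Python sketch (illustration only, not the actual Lucene parser):

```python
def parse_solr_synonyms(lines):
    """Parse Solr-format synonym lines into (inputs, outputs) rules.
    'a,b'      : a and b expand to each other (equivalence class)
    'a,b => c' : a and b are both normalized to c
    """
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        if "=>" in line:
            lhs, rhs = line.split("=>", 1)
            inputs = [t.strip() for t in lhs.split(",")]
            outputs = [t.strip() for t in rhs.split(",")]
        else:
            terms = [t.strip() for t in line.split(",")]
            inputs, outputs = terms, terms
        rules.append((inputs, outputs))
    return rules
```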

Test:

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "張三說的確實在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想買個電飯鍋和一個電腦"
}

Use the results of these examples to understand how synonyms are processed.

Built-in analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The integrated Chinese analysis plugin Ikanalyzer provides two analyzers: ik_smart and ik_max_word.

Built-in and integrated analyzers can be used directly. If they do not meet our needs, we can define our own custom analyzer by combining character filters, a tokenizer, and token filters.

Custom Analyzer

A custom analyzer is built from:

    zero or more character filters
    a tokenizer
    zero or more token filters

Configuration parameters:

PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
             "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

Specifying an analyzer for a field

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer"
    }
  }
}
PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "other_analyzer" 
    }
  }
}
PUT my_index8/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index8/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

Defining a default analyzer for an index

PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [
            "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
 "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}
PUT my_index10/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index10/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

Analyzer precedence

We can specify an analyzer per query, per field, and per index.

At index time, ES selects the analyzer in the following order:

   The analyzer specified in the field mapping
   If the field mapping specifies none, the analyzer named default in the index settings
   If the index settings define no default analyzer, the standard analyzer

At search time, ES selects the analyzer in the following order:

    The analyzer defined in a full-text query.
    The search_analyzer defined in the field mapping.
    The analyzer defined in the field mapping.
    An analyzer named default_search in the index settings.
    An analyzer named default in the index settings.
    The standard analyzer.
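The search-time fallback chain is easy to express as code. A minimal Python sketch (assumed dict shapes; this is not an ES API, just the selection logic):

```python
def select_search_analyzer(query_analyzer=None, field_mapping=None, index_settings=None):
    """Sketch of the search-time analyzer fallback chain listed above."""
    field_mapping = field_mapping or {}
    index_settings = index_settings or {}
    return (query_analyzer                            # 1. analyzer in the query itself
            or field_mapping.get("search_analyzer")   # 2. search_analyzer in the mapping
            or field_mapping.get("analyzer")          # 3. analyzer in the mapping
            or index_settings.get("default_search")   # 4. default_search in settings
            or index_settings.get("default")          # 5. default in settings
            or "standard")                            # 6. the standard analyzer
```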

Document Management

Creating documents

PUT twitter/_doc/1            create/replace with a specified document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
POST twitter/_doc/            create with an auto-generated document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
{
  "_index": "twitter",              //index the document belongs to
  "_type": "_doc",                  //mapping type it belongs to
  "_id": "p-D3ymMBl4RK_V6aWu_V",    //document id
  "_version": 1,                    //document version
  "result": "created",
  "_shards": {                      //write status across shard copies
    "total": 3,                     //the shard has 3 copies in total
    "successful": 1,                //successfully written on 1 copy
    "failed": 0                     //number of failed copies
  },
  "_seq_no": 0,                     //sequence number of this operation on the shard
  "_primary_term": 3                //primary shard term (increases on primary failover)
}

Getting a single document

HEAD twitter/_doc/11      check whether the document exists
GET twitter/_doc/1
GET twitter/_doc/1?_source=false
GET twitter/_doc/1/_source
{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch"
  }}
PUT twitter11
{                                   //retrieving stored fields
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

PUT twitter11/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

GET twitter11/_doc/1?stored_fields=tags,counter

Getting multiple documents: _mget

GET /_mget
{
    "docs" : [
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "2",
            "stored_fields" : ["field3", "field4"]
        }
    ]
}
GET /twitter/_mget
{
    "docs" : [
        {
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}
GET /twitter/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1"
        },
        {
            "_id" : "2"
        }
    ]
}
GET /twitter/_doc/_mget
{
    "ids" : ["1", "2"]
}

The request parameters _source and stored_fields can be given either in the URL or in the request JSON body.

Deleting documents

DELETE twitter/_doc/1    delete by document id

DELETE twitter/_doc/1?version=1    delete with version-based concurrency control

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 2,
    "_primary_term": 1,
    "_seq_no": 5,
    "result": "deleted"
}

Delete by query

POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}
POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
When documents have version conflicts, do not abort the delete operation (record the conflicting documents and continue deleting the other documents that match the query).

Inspecting delete-by-query tasks via the task API

GET _tasks?detailed=true&actions=*/delete/byquery
GET /_tasks/taskId:1    check the status of a specific task
POST _tasks/task_id:1/_cancel    cancel a task
{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}

Updating documents

PUT twitter/_doc/1      update by specifying the document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
PUT twitter/_doc/1?version=1     optimistic-locking concurrency control for updates
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 3,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 3
}

Scripted update: updating documents with scripts

PUT uptest/_doc/1       1. prepare a document
{
    "counter" : 1,
    "tags" : ["red"]
}
POST uptest/_doc/1/_update     2. add 4 to the counter of document 1
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}
POST uptest/_doc/1/_update     3. append an element to the array
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

Script notes: painless is a scripting language built into ES; ctx is the execution context object (through it you can also access _index, _type, _id, _version, _routing and _now (the current timestamp)); params is the parameter map.

How scripted updates work

Note: scripted updates require the index's _source field to be enabled. An update executes as follows:
1. Fetch the original document
2. Run the script against the raw data in the _source field
3. Delete the original indexed document
4. Index the modified document
This merely saves some network round trips and reduces the chance of version conflicts between the get and the index operations.
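The four-step flow can be mimicked with a tiny in-memory Python sketch (illustration only; a plain dict stands in for the ES index, and the "script" is an ordinary function mutating _source):

```python
def scripted_update(index, doc_id, script, params):
    """In-memory stand-in for the get -> script -> delete -> reindex flow."""
    source = index[doc_id]       # 1. fetch the original document
    script(source, params)       # 2. run the script against the _source data
    del index[doc_id]            # 3. delete the original document
    index[doc_id] = source       # 4. index the modified document

index = {1: {"counter": 1, "tags": ["red"]}}
scripted_update(index, 1,
                lambda src, p: src.update(counter=src["counter"] + p["count"]),
                {"count": 4})
print(index[1])  # -> {'counter': 5, 'tags': ['red']}
```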

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}
4. Add a new field
POST uptest/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}
5. Remove a field
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}
6. Conditionally delete the document or do nothing
POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
7. Merge the passed-in document fields into the stored document (partial update)
{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}
8. Running step 7 again: the content is unchanged, so nothing is done (noop)
POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}
9. Disable noop detection
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}
10. upsert: if the target document exists, run the script to update it; otherwise index the content of upsert as a new document.
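The upsert semantics amount to a simple branch. A minimal Python sketch (same in-memory dict stand-in as above, not ES code):

```python
def update_with_upsert(index, doc_id, script, params, upsert):
    """Sketch of upsert semantics: run the script if the document exists,
    otherwise index the upsert content as a new document."""
    if doc_id in index:
        script(index[doc_id], params)
    else:
        index[doc_id] = dict(upsert)

docs = {}
inc = lambda src, p: src.update(counter=src["counter"] + p["count"])
update_with_upsert(docs, 1, inc, {"count": 4}, {"counter": 1})   # missing -> insert upsert
update_with_upsert(docs, 1, inc, {"count": 4}, {"counter": 1})   # exists -> run script
print(docs[1])  # -> {'counter': 5}
```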

Update by query

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
Updates the documents matched by the query condition.