Learning ElasticSearch (1): Creating an Index and Configuring Index Mappings

阿新 • Published: 2018-12-21

I. Create an index with IK as the default analyzer

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex -d '
{
    "settings" : {
        "index" : {
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    }
}
'

The max_result_window setting caps how deep a paginated query can reach: by default ES requires from + size <= 10000.
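The limit described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not ES code: the function name `within_result_window` is made up for this example, and the default of 10000 is the value mentioned in the text.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # default window mentioned in the text


def within_result_window(from_, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True if a page starting at `from_` with `size` hits fits in the window."""
    return from_ + size <= max_result_window


# A page that ends exactly at the limit is still allowed.
print(within_result_window(9990, 10))
# One hit further and the request would exceed the window.
print(within_result_window(9991, 10))
# With the enlarged window from the settings above, deep pages fit.
print(within_result_window(50000, 10, max_result_window=100000000))
```

Raising the window makes deep pages possible, but each deep page still has to collect and sort from + size hits, so a very large value is a trade-off rather than a free win.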

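If the index already exists but holds no documents yet and you need to switch analyzers, close the index first, apply the new analyzer settings, and then reopen it. (If the index already contains documents, it is better to delete it and rebuild it from scratch: documents analyzed with the old analyzer no longer match expectations, so there is little point in keeping them.)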

Close the index:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/_close -d '
{}
'

Apply the new analyzer settings:

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex/_settings -d '
{
    "index" : {
        "max_result_window" : 100000000
    },
    "analysis": {
        "analyzer": {
            "default": {
                "type": "ik_max_word"
            }
        }
    }
}
'

Reopen the index:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/_open -d '
{}
'
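The close, update, reopen sequence is easy to get out of order, so here is a minimal Python sketch that only builds the three requests as data. The helper name `analyzer_change_requests` is hypothetical, and the sketch assumes the analyzer change goes through the index's `_settings` endpoint; nothing is sent to a server.

```python
import json


def analyzer_change_requests(index, analyzer):
    """Build (method, path, body) tuples for swapping the default analyzer.

    Assumption: analysis settings are changed via `_settings`
    while the index is closed.
    """
    new_settings = {"analysis": {"analyzer": {"default": {"type": analyzer}}}}
    return [
        ("POST", f"/{index}/_close", None),             # 1. close the index
        ("PUT", f"/{index}/_settings", new_settings),   # 2. apply the new analyzer
        ("POST", f"/{index}/_open", None),              # 3. reopen the index
    ]


for method, path, body in analyzer_change_requests("myindex", "ik_max_word"):
    print(method, path, json.dumps(body) if body is not None else "")
```

Keeping the steps in one list makes it harder to forget the final reopen, which would otherwise leave the index unavailable for searches.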

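Customizing how text is analyzed:

First, a quick explanation of a few concepts:

Analyzer: the component that turns text into index terms. ES ships with many built-in analyzers, such as standard, simple, and whitespace.

Tokenizer: splits the given text into individual terms.

Character Filter: before a string is handed to the Tokenizer, it can be passed through a Character Filter that processes its characters, for example replacing specified characters with others.

Token Filter: the terms produced by the Tokenizer can be post-processed by a filter, for example converting them to lowercase.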
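To make the order of these stages concrete, here is a toy Python pipeline. It is a sketch of the idea only, not how ES implements analysis: the function names (`analyze`, `replace_ampersand`, `whitespace_tokenizer`, `lowercase`) are invented for this example.

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through character filters, then a tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # operates on the raw string
    tokens = tokenizer(text)              # string -> list of terms
    for token_filter in token_filters:
        tokens = [token_filter(token) for token in tokens]  # operates per term
    return tokens


def replace_ampersand(text):
    """Toy character filter: rewrite '&' before tokenizing."""
    return text.replace("&", "and")


def whitespace_tokenizer(text):
    """Toy tokenizer: split on runs of whitespace."""
    return text.split()


def lowercase(token):
    """Toy token filter: lowercase each term."""
    return token.lower()


tokens = analyze("Tom & Jerry", [replace_ampersand], whitespace_tokenizer, [lowercase])
print(tokens)  # ['tom', 'and', 'jerry']
```

The point to take away is the ordering: character filters see the whole string, the tokenizer cuts it up, and token filters only ever see individual terms.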

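Now for an example. Suppose the input text is "張三吃飯#李四洗澡" and I want it split only on #, so that the result is exactly two terms, "張三吃飯" and "李四洗澡", and "張三吃飯" is not broken down further into "張三" and "吃飯".

curl -H "Content-Type: application/json" -XPUT http://localhost:9200/myindex -d '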
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_smart"
                },
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_tokenizer"
                }
            },
            "tokenizer": {
                "my_tokenizer": {
                    "type": "pattern",
                    "pattern": "#"
                }
            }
        }
    },
    "mappings": {
        "mytype": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
}
'

Next, I index the text "張三吃飯#李四洗澡" into myindex/mytype.

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/mytype/1 -d '
{
    "title": "張三吃飯#李四洗澡"
}
'

Then check how the text was tokenized:

curl -XGET "http://localhost:9200/myindex/mytype/1/_termvectors?fields=title"

The result:

{
    "_index": "myindex",
    "_type": "mytype",
    "_id": "1",
    "_version": 1,
    "found": true,
    "took": 46,
    "term_vectors": {
        "title": {
            "field_statistics": {
                "sum_doc_freq": 2,
                "doc_count": 1,
                "sum_ttf": 2
            },
            "terms": {
                "張三吃飯": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 0,
                            "end_offset": 4
                        }
                    ]
                },
                "李四洗澡": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 5,
                            "end_offset": 9
                        }
                    ]
                }
            }
        }
    }
}

The tokens match what we expected.
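The positions and offsets in that response can be reproduced with a small Python stand-in for a pattern tokenizer. This is a rough sketch of the behavior, not the ES implementation; `pattern_tokenize` is a hypothetical helper, and offsets are counted in characters, which lines up with the response for this text.

```python
import re


def pattern_tokenize(text, pattern="#"):
    """Split `text` on a regex separator, recording position and offsets per token."""
    tokens = []
    position = 0
    start = 0
    for match in re.finditer(pattern, text):
        if match.start() > start:  # skip empty tokens between adjacent separators
            tokens.append({
                "term": text[start:match.start()],
                "position": position,
                "start_offset": start,
                "end_offset": match.start(),
            })
            position += 1
        start = match.end()
    if start < len(text):  # trailing token after the last separator
        tokens.append({
            "term": text[start:],
            "position": position,
            "start_offset": start,
            "end_offset": len(text),
        })
    return tokens


for token in pattern_tokenize("張三吃飯#李四洗澡"):
    print(token)
```

The separator itself occupies offset 4, which is why the second term starts at offset 5 rather than 4.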

II. Configure the index mapping

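By default, even without a predefined mapping, ES builds one automatically from the JSON documents you submit. The automatically generated mapping is fairly basic, though: it only infers simple types from the JSON values, for example long for integers and text for strings.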
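As a rough picture of what automatic mapping does, the following Python sketch guesses a field type from each JSON value. It is a deliberate simplification under stated assumptions: `infer_field_type` and `infer_mapping` are made-up names, and real ES applies further rules (date detection on strings, for instance) that are left out here.

```python
import json


def infer_field_type(value):
    """Guess a field type from a JSON value (simplified; no date detection)."""
    if isinstance(value, bool):   # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "text"
    raise TypeError(f"unsupported value in this sketch: {value!r}")


def infer_mapping(document):
    """Build a properties-style mapping for a flat JSON document."""
    return {name: {"type": infer_field_type(value)} for name, value in document.items()}


document = json.loads('{"id": 1, "title": "hello", "price": 9.5, "active": true}')
print(json.dumps(infer_mapping(document), indent=4))
```

Because the guess depends only on the first values seen, an explicit mapping is the safer choice whenever a field needs a specific type such as integer or keyword.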

To create the mapping by hand:

curl -H "Content-Type: application/json" -XPOST http://localhost:9200/myindex/mytype/_mappings -d '
{
    "properties": {
        "id": {
            "type": "integer"
        },
        "title": {
            "type": "text"
        },
        "content": {
            "type": "text"
        }
    }
}
'

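A string field can be mapped as either text or keyword. Values of a text field are analyzed (split into terms), while values of a keyword field are indexed as-is, without analysis. So if you need exact-match lookups on a field, map it as keyword (available since version 5.0).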
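The difference between the two types can be illustrated with a toy Python model of what ends up in the index. This is a sketch, not ES behavior in detail: `index_terms` and `term_matches` are hypothetical helpers, and the text branch uses a simple lowercase-and-split-on-whitespace stand-in rather than IK or any real analyzer.

```python
def index_terms(value, field_type):
    """Return the set of terms a value would contribute to the index."""
    if field_type == "keyword":
        return {value}                      # whole value kept as a single term
    if field_type == "text":
        return set(value.lower().split())   # stand-in analyzer: lowercase + whitespace split
    raise ValueError(f"unknown field type: {field_type}")


def term_matches(query, value, field_type):
    """Exact term lookup: does the query equal one of the indexed terms?"""
    return query in index_terms(value, field_type)


title = "Quick Brown Fox"
print(term_matches("Quick Brown Fox", title, "keyword"))  # whole value is one term
print(term_matches("Quick Brown Fox", title, "text"))     # no single term equals the whole value
print(term_matches("quick", title, "text"))               # an individual term matches
```

This is why an exact lookup against a text field often returns nothing: the full original string is simply not among the indexed terms.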

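If an index already contains documents and you want to change its mapping, the best course is to delete the index, recreate it, set the mapping, and then add the data again.

Limitations of mappings:

As of version 5.0, fields with the same name inside one index share a single mapping, no matter which mapping type they belong to (and 7.0 plans to remove the concept of mapping types). For example, if student records have a name field and student grade records also have a name field, both name fields are stored in one and the same Lucene field. In practice this means that within an index, a field name cannot be reused across types with a different definition.

Early ES material compared an index to a database and a type to a table. That analogy does not hold for an index, and the latest documentation admits as much: in a database, columns with the same name in different tables do not affect each other, whereas in Lucene fields with the same name share a single index structure.

The upshot is that putting two related tables (say, linked by a foreign key) into two types of one index is likely to fail whenever they share field names. If you really need to index data from both tables, it is better to build a standalone data object that contains all the fields of both tables (with duplicates removed) and index that whole object.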

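III. Rebuilding settings or mappings and migrating production data seamlessly

Designs always have flaws. When an index's settings or mapping has to be rebuilt, the crudest option is to delete the index, recreate it under the new rules, and index everything again. That brute-force approach leaves your service unavailable for a while, and the more data you have, the longer the outage.

An index alias solves this. The basic idea: give the old index an alias, index-alias, and have your code access the index only through index-alias. Then create index2 under the new rules, reindex all data from index into index2, and finally point index-alias at index2. The switch is seamless: the service stays available throughout, although some search results may be off during the transition, which is still better than a full outage.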

1. Create an alias for the existing index

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_aliases -d '
{
    "actions": [
        { "add" : { "index" : "index", "alias" : "index-alias" } }
    ]
}
'

2. Create the new index2

curl -H "Content-Type: application/json" -X PUT http://localhost:9200/index2 -d '
{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
            "max_result_window" : 100000000
        },
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "ik_max_word"
                }
            }
        }
    },
    "mappings": {
        "user": {
            "properties": {
                "name": {
                    "type": "keyword"
                }
            }
        }
    }
}
'

3. Reindex all data from index into index2

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_reindex -d '
{
    "source": {
        "index": "index"
    },
    "dest": {
        "index": "index2"
    }
}
'

4. Switch the alias

curl -H "Content-Type: application/json" -X POST http://localhost:9200/_aliases -d '
{
    "actions": [
        { "remove" : { "index" : "index", "alias" : "index-alias" } },
        { "add" : { "index" : "index2", "alias" : "index-alias" } }
    ]
}
'

5. Delete the old index

curl -X DELETE http://localhost:9200/index

At this point the index rebuild is complete.
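The alias switch in step 4 is the heart of the procedure, so here is a small Python sketch of it. It builds the same kind of actions body and applies it to an in-memory dictionary standing in for the cluster's alias table. Both helpers, `build_alias_swap` and `apply_alias_actions`, are hypothetical and exist only for illustration; no request is sent.

```python
def build_alias_swap(alias, old_index, new_index):
    """Build an actions body that moves `alias` from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def apply_alias_actions(aliases, body):
    """Apply add/remove actions to an alias table {alias: set of indices}."""
    updated = {name: set(indices) for name, indices in aliases.items()}
    for action in body["actions"]:
        (kind, params), = action.items()   # each action has exactly one key
        targets = updated.setdefault(params["alias"], set())
        if kind == "add":
            targets.add(params["index"])
        elif kind == "remove":
            targets.discard(params["index"])
        else:
            raise ValueError(f"unsupported action in this sketch: {kind}")
    return updated


aliases = {"index-alias": {"index"}}
aliases = apply_alias_actions(aliases, build_alias_swap("index-alias", "index", "index2"))
print(aliases)  # {'index-alias': {'index2'}}
```

Putting the remove and the add into a single request is what keeps application code that reads through index-alias from ever seeing a moment where the alias points at nothing.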

If you are only adding a field to the mapping, things are simpler (since the field is new, ES should not yet hold any data for it):

curl -H "Content-Type: application/json" -X POST http://localhost:9200/index-alias/type1/_mapping -d '
{
	"properties": {
		"newPropertity": {
			"type": "nested"
		}
	}
}
'
