ES Mapping and Field Types Explained

 

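Field type overview
Category	Subcategory	Concrete types
Core types	String types	string, text, keyword
Core types	Integer types	integer, long, short, byte
Core types	Floating-point types	double, float, half_float, scaled_float
Core types	Boolean type	boolean
Core types	Date type	date
Core types	Range types	range
Core types	Binary type	binary
Complex types	Array type	array
Complex types	Object type	object
Complex types	Nested type	nested
Geo types	Geo-point type	geo_point
Geo types	Geo-shape type	geo_shape
Special types	IP type	ip
Special types	Completion type	completion
Special types	Token count type	token_count
Special types	Attachment type	attachment
Special types	Percolator type	percolator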

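I. Field datatypes

1.1 string type

Since Elasticsearch 5.x the string field type is deprecated in favor of text and keyword. A mapping that still uses string is accepted, but Elasticsearch returns a deprecation warning.

Test: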

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type":  "string"
        }
      }
    }
  }
}

Result:

#! Deprecation: The [string] field is deprecated, please use [text] or [keyword] instead on [title]
{
  "acknowledged": true,
  "shards_acknowledged": true
}

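1.2 text type

text replaces string for fields that should be full-text searched, such as the body of an email or a product description. Once a field is mapped as text, its content is analyzed: before the inverted index is built, an analyzer splits the string into individual terms. text fields are not used for sorting and are seldom used for aggregations (the significant terms aggregation being the notable exception).

The mapping that sets the full_name field to the text type looks like this: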

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

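1.3 keyword type

The keyword type is meant for indexing structured content such as email addresses, hostnames, status codes and tags. Use it when a field needs to be filtered on (for example, finding the blog posts whose status is published), sorted or aggregated. A keyword field can only be found by searching for its exact value.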

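1.4 Numeric types

Elasticsearch supports the following numeric types:

Type	Value range
long	-2^63 to 2^63-1
integer	-2^31 to 2^31-1
short	-32,768 to 32,767
byte	-128 to 127
double	64-bit double-precision IEEE 754 floating point
float	32-bit single-precision IEEE 754 floating point
half_float	16-bit half-precision IEEE 754 floating point
scaled_float	A floating-point number stored with a fixed scaling factor (for example, a price only needs cent precision, so with a scaling factor of 100 a price of 57.34 is stored as 5734)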
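The scaled_float row can be illustrated with a little arithmetic. The following is a minimal sketch, not Elasticsearch's actual implementation; the helper names encode_scaled_float and decode_scaled_float are hypothetical and only reproduce the price example from the table.

```python
def encode_scaled_float(value: float, scaling_factor: int) -> int:
    """Return the whole number kept for `value` (hypothetical helper)."""
    # Multiply by the scaling factor and round to the nearest integer.
    return round(value * scaling_factor)


def decode_scaled_float(stored: int, scaling_factor: int) -> float:
    """Turn the stored whole number back into a floating-point value."""
    return stored / scaling_factor


stored_price = encode_scaled_float(57.34, 100)
print(stored_price)                            # 5734
print(decode_scaled_float(stored_price, 100))  # 57.34
```

Anything finer than the scaling factor allows is lost: with a factor of 100, 57.349 would also be stored as 5735 or 5734 depending on rounding, never with three decimals.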

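The double, float and half_float types treat -0.0 and +0.0 as different values: a term query for -0.0 does not match +0.0, a range query with an upper bound of -0.0 does not match +0.0, and a range query with a lower bound of +0.0 does not match -0.0.

Points to keep in mind when choosing among the numeric types above:

  1. Pick the smallest type that satisfies the requirement. If the values of a field never exceed 100, byte is enough; for an age field, short is more than sufficient. The narrower the type, the more efficient indexing and searching become.
  2. For floating-point values, prefer the floating-point type with a scaling factor (scaled_float) when it fits the data.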

Example:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}

1.5 Object type

JSON is hierarchical by nature, so a document can contain nested objects:

PUT my_index/my_type/1
{ 
  "region": "US",
  "manager": { 
    "age":     30,
    "name": { 
      "first": "John",
      "last":  "Smith"
    }
  }
}

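The document above is a JSON object that contains a manager object, which in turn contains a name object. Internally, the document is indexed as a flat list of key-value pairs: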

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}
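The flattening shown above can be reproduced with a short recursive function. This is only an illustrative sketch of the idea (the function name flatten is an assumption, not an Elasticsearch API); it joins nested keys with dots.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into inner objects, carrying the path so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


document = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten(document))
```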

The mapping for the document structure above is:

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { 
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}

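1.6 date type

JSON has no date type, so in Elasticsearch a date can be any of the following:

  1. A string containing a formatted date, e.g. "2015-01-01" or "2015/01/01 12:10:30".
  2. A long number representing milliseconds-since-the-epoch.
  3. An integer representing seconds-since-the-epoch.

Date formats can be customized. If no format is specified, the default is:

"strict_date_optional_time||epoch_millis"

Example: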

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type": "date" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "date": "2015-01-01" } 

PUT my_index/my_type/2
{ "date": "2015-01-01T12:10:30Z" } 

PUT my_index/my_type/3
{ "date": 1420070400001 } 

GET my_index/_search
{
  "sort": { "date": "asc"} 
}

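A plain search (without the sort clause) returns the three documents with their dates exactly as they were submitted: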

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "date": "2015-01-01"
        }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 1,
        "_source": {
          "date": 1420070400001
        }
      }
    ]
  }
}

Result of the sorted search:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": null,
        "_source": {
          "date": "2015-01-01"
        },
        "sort": [
          1420070400000
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": null,
        "_source": {
          "date": 1420070400001
        },
        "sort": [
          1420070400001
        ]
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": null,
        "_source": {
          "date": "2015-01-01T12:10:30Z"
        },
        "sort": [
          1420114230000
        ]
      }
    ]
  }
}
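The sort values in the response above are milliseconds since the epoch, whatever form the date had in the source document. The sketch below reproduces them in Python; it is an illustration only, and the two strptime patterns are assumptions that cover just the formats used in this example, not the full strict_date_optional_time grammar.

```python
from datetime import datetime, timezone


def to_epoch_millis(value) -> int:
    """Convert an example date value to milliseconds since the epoch (UTC)."""
    if isinstance(value, int):
        return value  # already epoch milliseconds
    for pattern in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d"):
        try:
            parsed = datetime.strptime(value, pattern)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
    raise ValueError(f"unsupported date value: {value!r}")


docs = {"1": "2015-01-01", "2": "2015-01-01T12:10:30Z", "3": 1420070400001}
order = sorted(docs, key=lambda doc_id: to_epoch_millis(docs[doc_id]))
print(order)  # ['1', '3', '2']
```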

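1.7 Array type

Elasticsearch has no dedicated array type. By default, any field can contain one or more values, but all values in an array must be of the same type. For example:

  1. An array of strings: [ "one", "two" ]
  2. An array of integers: [ 1, 3 ]
  3. A nested array: [ 1, [ 2, 3 ] ], which is equivalent to [ 1, 2, 3 ]
  4. An array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 } ]

Notes:

  • When a field is added dynamically, the type of the first value in the array determines the type of the whole array.
  • Arrays of mixed types are not supported, for example [ 1, "abc" ].
  • An array may contain null values. An empty array [ ] is treated as a missing field.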

1.8 binary type

The binary type accepts a Base64-encoded string. By default the field is not stored and is not searchable.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" 
}
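The blob value in the request above is simply the Base64 encoding of the text "Some binary blob". A minimal sketch using Python's standard base64 module shows the round trip:

```python
import base64

raw = b"Some binary blob"

# Encode the raw bytes into the string form that a binary field accepts.
encoded = base64.b64encode(raw).decode("ascii")
print(encoded)  # U29tZSBiaW5hcnkgYmxvYg==

# Decoding gives back the original bytes.
decoded = base64.b64decode(encoded)
print(decoded)  # b'Some binary blob'
```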

Searching on the blob field:

GET my_index/_search
{
  "query": {
    "match": {
      "blob": "test" 
    }
  }
}

Response:
{
  "error": {
    "root_cause": [
      {
        "type": "query_shard_exception",
        "reason": "Binary fields do not support searching",
        "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
        "index": "my_index"
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "3dQd1RRVTMiKdTckM68nPQ",
        "reason": {
          "type": "query_shard_exception",
          "reason": "Binary fields do not support searching",
          "index_uuid": "fgA7UM5XSS-56JO4F4fYug",
          "index": "my_index"
        }
      }
    ]
  },
  "status": 400
}

Base64 encoding and decoding tool: http://www1.tc711.com/tool/BASE64.htm

1.9 ip type

An ip field stores IPv4 or IPv6 addresses.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
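The term query above uses CIDR notation, so it matches any address inside the 192.168.0.0/16 network. The containment check itself can be sketched with Python's standard ipaddress module; the helper name ip_matches is hypothetical and says nothing about how Elasticsearch evaluates the query internally.

```python
import ipaddress


def ip_matches(address: str, cidr: str) -> bool:
    """Return True if `address` falls inside the network given in CIDR form."""
    return ipaddress.ip_address(address) in ipaddress.ip_network(cidr)


print(ip_matches("192.168.1.1", "192.168.0.0/16"))  # True
print(ip_matches("10.0.0.1", "192.168.0.0/16"))     # False
```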

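1.10 range types

The following range types are supported:

Type	Range
integer_range	-2^31 to 2^31-1
float_range	32-bit IEEE 754
long_range	-2^63 to 2^63-1
double_range	64-bit IEEE 754
date_range	64-bit integer, in milliseconds

Typical uses for range types are front-end forms such as a time-range picker or an age-range selector.
Example: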

PUT range_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", 
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

PUT range_index/my_type/1
{
  "expected_attendees" : { 
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { 
    "gte" : "2015-10-31 12:00:00", 
    "lte" : "2015-11-01"
  }
}

The requests above create the range_index index and add a document whose expected_attendees range is 10 to 20 and whose time_frame runs from 2015-10-31 12:00:00 to 2015-11-01.

Query:

POST range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { 
        "gte" : "2015-08-01",
        "lte" : "2015-12-01",
        "relation" : "within" 
      }
    }
  }
}

Query result:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "range_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "_source": {
          "expected_attendees": {
            "gte": 10,
            "lte": 20
          },
          "time_frame": {
            "gte": "2015-10-31 12:00:00",
            "lte": "2015-11-01"
          }
        }
      }
    ]
  }
}
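The relation parameter in the query above decides how the stored range is compared with the query range. The following sketch spells out the three relations (within, contains, intersects) as plain interval comparisons. The function name range_matches is hypothetical, and the sketch ignores date-math rounding of bounds.

```python
from datetime import datetime


def range_matches(doc_gte, doc_lte, query_gte, query_lte, relation="intersects"):
    """Compare a stored inclusive range with a query range."""
    if relation == "within":
        # The stored range lies entirely inside the query range.
        return query_gte <= doc_gte and doc_lte <= query_lte
    if relation == "contains":
        # The stored range entirely covers the query range.
        return doc_gte <= query_gte and query_lte <= doc_lte
    if relation == "intersects":
        # The two ranges share at least one point.
        return doc_gte <= query_lte and query_gte <= doc_lte
    raise ValueError(f"unknown relation: {relation}")


time_frame = (datetime(2015, 10, 31, 12, 0, 0), datetime(2015, 11, 1))
query = (datetime(2015, 8, 1), datetime(2015, 12, 1))

print(range_matches(*time_frame, *query, relation="within"))    # True
print(range_matches(*time_frame, *query, relation="contains"))  # False
```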

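1.11 nested type

The nested type is a specialized version of the object type that lets the objects in an array be indexed and queried independently of each other. Using the plain object type can cause problems. Take a document my_index/my_type/1 with the following structure: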

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [ 
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

The user field is dynamically added as an object field.
Internally the document is transformed into the following flattened form:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

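user.first and user.last are flattened into multi-value fields, and the association between Alice and White is lost. As a result, the document incorrectly matches the following query (the search returns it even though no user named Alice Smith exists):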

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}
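The false match can be imitated in a few lines of Python. This is a rough sketch of the two matching behaviours, not of Elasticsearch internals: matches_flattened mimics the flattened multi-value fields shown above, while matches_per_object checks each user object on its own, which is the behaviour the nested type covered in this section provides. Both function names are hypothetical.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def matches_flattened(users, first, last):
    """Object-style matching: each field is one bag of values per document."""
    firsts = {user["first"].lower() for user in users}
    lasts = {user["last"].lower() for user in users}
    return first.lower() in firsts and last.lower() in lasts


def matches_per_object(users, first, last):
    """Nested-style matching: both conditions must hold on the same object."""
    return any(
        user["first"].lower() == first.lower()
        and user["last"].lower() == last.lower()
        for user in users
    )


print(matches_flattened(users, "Alice", "Smith"))   # True (false positive)
print(matches_per_object(users, "Alice", "Smith"))  # False
print(matches_per_object(users, "Alice", "White"))  # True
```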

The nested field type addresses this shortcoming of the object type:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "user": {
          "type": "nested" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} 
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} 
          ]
        }
      },
      "inner_hits": { 
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}

1.12 token_count type

A token_count field counts the number of tokens in a string:


PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "length": { 
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{ "name": "John Smith" }

PUT my_index/my_type/2
{ "name": "Rachel Alice Williams" }

GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 
    }
  }
}
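The term query above finds the documents whose name has exactly three tokens. A rough sketch of the counting follows; splitting on whitespace is an assumption that stands in for the standard analyzer and happens to give the same token counts for these two names.

```python
def token_count(text: str) -> int:
    """Count whitespace-separated tokens (a stand-in for a real analyzer)."""
    return len(text.split())


names = {"1": "John Smith", "2": "Rachel Alice Williams"}

# Keep the ids whose name is made of exactly three tokens.
matching_ids = [doc_id for doc_id, name in names.items() if token_count(name) == 3]
print(matching_ids)  # ['2']
```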

1.13 geo_point type

A geo_point field stores the latitude and longitude of a geographic location:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Geo-point as an object",
  "location": { 
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/my_type/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" 
}

PUT my_index/my_type/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" 
}

PUT my_index/my_type/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] 
}

GET my_index/_search
{
  "query": {
    "geo_bounding_box": { 
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}

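II. Meta-fields

2.1 _all

The _all field is a catch-all field that concatenates the values of all other fields into one string, using a space as the delimiter. It is analyzed and indexed, but not stored. It is useful when you want to find documents containing a keyword without knowing which field holds it.
Example: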

PUT my_index/blog/1 
{
  "title":    "Master Java",
  "content":     "learn java",
  "author": "Tom"
}

The _all field contains the terms: [ "master", "java", "learn", "tom" ]

Search:

GET my_index/_search
{
  "query": {
    "match": {
      "_all": "Java"
    }
  }
}

Response:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.39063013,
    "hits": [
      {
        "_index": "my_index",
        "_type": "blog",
        "_id": "1",
        "_score": 0.39063013,
        "_source": {
          "title": "Master Java",
          "content": "learn java",
          "author": "Tom"
        }
      }
    ]
  }
}

Using copy_to to build a custom _all field:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "title": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "content": {
          "type":    "text",
          "copy_to": "full_content" 
        },
        "full_content": {
          "type":    "text"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "title": "Master Java",
  "content": "learn Java"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_content": "java"
    }
  }
}

2.2 _field_names

The _field_names field indexes the names of all fields in a document that contain a non-null value. It is commonly used by the exists query. Example:

PUT my_index/my_type/1
{
  "title": "This is a document"
}

PUT my_index/my_type/2?refresh=true
{
  "title": "This is another document",
  "body": "This document has a body"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_field_names": [ "body" ] 
    }
  }
}

The search returns only the second document, because the first document has no body field.
The same can be done with the exists query:

GET my_index/_search
{
    "query": {
        "exists" : { "field" : "body" }
    }
}

2.3 _id

Every indexed document has a _type and an _id. The _id field can be used in term, terms, match, query_string and simple_query_string queries, but not in aggregations, scripts or sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] 
    }
  }
}

2.4 _index

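When querying across multiple indices, it is sometimes useful to restrict a clause to specific indices. The _index field makes this possible: the index name can be used in term and terms queries, aggregations, scripts and sorting.

_index is a virtual field; it is not added to the Lucene index as a real field. It supports term and terms queries (and the queries that are rewritten to them, such as match, query_string and simple_query_string), but not prefix, wildcard, regexp or fuzzy queries.

Example with two indices and two documents: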


PUT index_1/my_type/1
{
  "text": "Document in index 1"
}

PUT index_2/my_type/2
{
  "text": "Document in index 2"
}

Querying, aggregating and sorting on the index name, and using a script to return it as a field:

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] 
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { 
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "inline": "doc['_index']" 
      }
    }
  }
}

2.4 _meta

Not covered here.

2.5 _parent

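_parent establishes a parent-child relationship between documents in the same index. The example below first declares the relationship in the mapping, then indexes a parent document, indexes child documents with the parent ID specified, and finally finds parents by querying their children.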

PUT my_index
{
  "mappings": {
    "my_parent": {},
    "my_child": {
      "_parent": {
        "type": "my_parent" 
      }
    }
  }
}


PUT my_index/my_parent/1 
{
  "text": "This is a parent document"
}

PUT my_index/my_child/2?parent=1 
{
  "text": "This is a child document"
}

PUT my_index/my_child/3?parent=1&refresh=true 
{
  "text": "This is another child document"
}


GET my_index/my_parent/_search
{
  "query": {
    "has_child": { 
      "type": "my_child",
      "query": {
        "match": {
          "text": "child document"
        }
      }
    }
  }
}

2.6 _routing

Routing parameter. Elasticsearch uses the following formula to decide which shard a document is routed to:

shard_num = hash(_routing) % num_primary_shards
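The formula can be tried out with a small sketch. Note the assumptions: zlib.crc32 is used purely as a stand-in hash (Elasticsearch applies its own hash function internally), and shard_for is a hypothetical helper name. The point is only that a given routing value always lands on the same shard as long as the number of primary shards stays fixed.

```python
import zlib


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Pick a shard number from a routing value (stand-in hash, see above)."""
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


# The same routing value always maps to the same shard.
first = shard_for("user1", 5)
second = shard_for("user1", 5)
print(first == second)  # True
```

Because the result depends on num_primary_shards, changing the shard count would move existing routing values to different shards.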

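The default _routing value is the document's _id (or its _parent ID, if it has one). A custom value can be supplied through the routing parameter. For example, to keep all blog posts published by user1 on the same shard, specify the routing parameter when indexing, and supply the same value when retrieving the document: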

PUT my_index/my_type/1?routing=user1&refresh=true 
{
  "title": "This is a document"
}

GET my_index/my_type/1?routing=user1

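The _routing field can be queried, and a search can be restricted to specific routing values with the routing parameter: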

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] 
    }
  }
}

GET my_index/_search?routing=user1,user2 
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}

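Making routing mandatory in the mapping (the index request that follows omits the routing value and is therefore rejected):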

PUT my_index2
{
  "mappings": {
    "my_type": {
      "_routing": {
        "required": true 
      }
    }
  }
}

PUT my_index2/my_type/1 
{
  "text": "No routing value provided"
}

2.7 _source

The _source field holds the original document as it was submitted. It is enabled by default and can be disabled:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      }
    }
  }
}

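In general, do not disable it unless you are sure you will never need to:

  • Use the update, update_by_query or reindex APIs
  • Use highlighting
  • Back up data, change the mapping or upgrade the index
  • Debug queries or aggregations by looking at the original document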

2.8 _type

Every indexed document has a _type and an _id. The _type field can be used in queries, aggregations, scripts and sorting. Example:

PUT my_index/type_1/1
{
  "text": "Document with type 1"
}

PUT my_index/type_2/2?refresh=true
{
  "text": "Document with type 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_type": [ "type_1", "type_2" ] 
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "inline": "doc['_type']" 
      }
    }
  }
}

2.9 _uid

_uid is the combination of _type and _id, in the form type#id. Like _type, it can be used in queries, aggregations, scripts and sorting. Example:

PUT my_index/my_type/1
{
  "text": "Document with ID 1"
}

PUT my_index/my_type/2?refresh=true
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "my_type#1", "my_type#2" ] 
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", 
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { 
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "inline": "doc['_uid']" 
      }
    }
  }
}

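III. Mapping parameters

3.1 analyzer

Specifies the analyzer for a field (often loosely called a tokenizer); it applies at both index time and search time. The example below configures the IK analyzer, which is provided by the IK analysis plugin: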

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

3.2 normalizer

A normalizer standardizes keyword values before they are indexed, for example by converting all characters to lowercase. Example:

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

After passing through the normalizer, BÀR becomes bar, so the search finds documents 1 and 2.
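The effect of the lowercase and asciifolding filters on this example can be approximated in Python. This is a sketch under a stated assumption: stripping combining marks after Unicode decomposition is only an approximation of asciifolding, good enough for accented Latin letters such as À. The function name normalize_keyword is hypothetical.

```python
import unicodedata


def normalize_keyword(value: str) -> str:
    """Lowercase the value and strip accents (approximates the two filters)."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining accent marks left behind by the decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


docs = {"1": "BÀR", "2": "bar", "3": "baz"}
query = normalize_keyword("BAR")
hits = [doc_id for doc_id, foo in docs.items() if normalize_keyword(foo) == query]
print(hits)  # ['1', '2']
```

The query text goes through the same normalization as the indexed values, which is why the uppercase query BAR still finds both documents.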

3.3 boost

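The boost parameter sets the weight of a field. For example, to make a match in the title field count twice as much as a match in the content field, use the mapping below. The content field keeps the default boost of 1.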

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

Specifying the boost at query time has the same effect:

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

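Specifying boost at query time is recommended. A boost written into the mapping cannot be changed without reindexing the documents, whereas a query-time boost achieves the same effect and can be changed at any time.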

3.4 coerce

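The coerce parameter cleans up dirty data; its default value is true. The integer 5 might arrive as the string "5" or as the floating-point number 5.0. With coercion enabled:

  • Strings are coerced to integers
  • Floating-point values are coerced to integers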

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "number_one": "10" 
}

PUT my_index/my_type/2
{
  "number_two": "10" 
}

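The mapping declares number_one as an integer. Although the submitted value is a string, the document is indexed successfully. number_two has coerce disabled, so indexing the second document fails.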

3.5 copy_to

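The copy_to parameter builds custom _all-style fields. In other words, the values of several fields can be copied into one combined field. For example, first_name and last_name can be copied into a full_name field.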

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

3.6 doc_values

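doc_values speed up sorting and aggregations. When the inverted index is built, an additional column-oriented structure mapping documents to values is written as well, trading space for time. doc_values are enabled by default and can be disabled for fields that you are sure will never be sorted or aggregated on.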

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

Note: text fields do not support doc_values.

3.7 dynamic

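The dynamic parameter controls how newly detected fields are handled. It accepts three values:

  • true: newly detected fields are added to the mapping (default).
  • false: newly detected fields are ignored. New fields must be added explicitly.
  • strict: if a new field is detected, an exception is thrown and the document is rejected.

Example: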

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

Note: strict is not a boolean, so it must be written in quotes ("strict").

3.8 enabled

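Elasticsearch indexes all fields by default. For a field with enabled set to false, Elasticsearch skips parsing the field content entirely; the value can still be retrieved from _source but is not searchable. Such a field can hold a value of any type.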

PUT my_index
{
  "mappings": {
    "session": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/session/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/session/session_2
{
  "user_id": "jpountz",
  "session_data": "none", 
  "last_updated": "2015-12-06T18:22:13"
}

3.9 fielddata

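Search answers the question "which documents contain this term?". Aggregations do the opposite: they need to answer "which terms does this document contain?". Most fields generate doc_values at index time for this purpose, but text fields do not support doc_values.

Instead, text fields use a query-time data structure called fielddata. It is built the first time the field is used for aggregations, sorting or in a script: Elasticsearch reads the inverted index from disk, inverts the term-to-document relationship, and keeps the result in the Java heap.

fielddata is disabled on text fields by default, and enabling it can consume a lot of memory. Before enabling fielddata on a text field, consider why you want to aggregate or sort on a text field at all. In most cases it does not make sense.

"New York" is analyzed into "new" and "york", so an aggregation on the text field produces two buckets, new and york, when you probably wanted a single New York bucket. The usual solution is to add an unanalyzed keyword sub-field: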

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

With this mapping, my_field is used for full-text search, and my_field.keyword is used for aggregations, sorting and scripts.

3.10 format

The format parameter is mainly used to specify date formats:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

More built-in date formats: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html

3.11 ignore_above

ignore_above sets the maximum length of the strings that are indexed or stored; longer strings are ignored:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 15
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size": 0, 
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

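The mapping sets ignore_above to 15. The message of the first document is shorter than 15 characters, so it is indexed. The message of the second document is longer than 15 characters, so it is not indexed, and the aggregation is expected to return only "Syntax error". The response captured for this article is shown below; note that its buckets array is empty, which differs from that expectation: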

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

3.12 ignore_malformed

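ignore_malformed lets Elasticsearch tolerate malformed data. Take a login field: some documents might supply a date, others an email address. Normally, indexing a value of the wrong datatype into a field throws an exception and the whole document is rejected. With ignore_malformed set to true, the exception is ignored: the malformed field is not indexed, and the other fields of the document are indexed normally.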

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

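In the example above, number_one is an integer field with ignore_malformed set to true, so document 1 is indexed even though its number_one value is a non-numeric string (only that field is skipped). number_two is an integer field with the default ignore_malformed value of false, so indexing document 2 fails.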

3.13 include_in_all

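The include_in_all parameter controls whether a field is included in the _all field. It defaults to true, unless the field's index parameter is set to no.
In the example below, title and content are included in _all, while date is not.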

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 
          "type": "text"
        },
        "content": { 
          "type": "text"
        },
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

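include_in_all can also be set at the type level and on object fields. Below, all fields of my_type are excluded from _all by default, except author.first_name and author.last_name (through the author object) and editor.last_name, which are included: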

PUT my_index
{
  "mappings": {
    "my_type": {
      "include_in_all": false, 
      "properties": {
        "title":          { "type": "text" },
        "author": {
          "include_in_all": true, 
          "properties": {
            "first_name": { "type": "text" },
            "last_name":  { "type": "text" }
          }
        },
        "editor": {
          "properties": {
            "first_name": { "type": "text" }, 
            "last_name":  { "type": "text", "include_in_all": true } 
          }
        }
      }
    }
  }
}

3.14 index

The index parameter controls whether a field is indexed. A field that is not indexed is not searchable. It accepts true or false.

3.15 index_options

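index_options controls what information is added to the inverted index. It accepts the following values:

Value	Effect
docs	Only the document number is indexed.
freqs	Document numbers and term frequencies are indexed.
positions	Document numbers, term frequencies and term positions are indexed. Positions are used by proximity and phrase queries.
offsets	Document numbers, term frequencies, term positions, and the start and end character offsets of each term are indexed. Offsets are used by the postings highlighter.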

3.16 fields

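fields (multi-fields) let the same value be indexed in several different ways. For example, a string field can be mapped as text for full-text search and as keyword for sorting and aggregations.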

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

3.17 norms

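Norms store normalization factors that are used at query time to compute the relevance score of a document. Although useful for scoring, norms consume a fair amount of disk space. If a field never needs to contribute to scoring, it is best to disable norms on it.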

3.18 null_value

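A null value is neither indexed nor searchable. The null_value parameter replaces explicit null values with the specified value so that they can be indexed and searched. Example: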

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "status_code": null
}

PUT my_index/my_type/2
{
  "status_code": [] 
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}

Document 1 is found, because its status_code is an explicit null. Document 2 is not found, because its status_code is an empty array, which contains no explicit null.

3.19 position_increment_gap

To support proximity and phrase queries, analyzed text fields take term positions into account. Consider a field whose value is an array:

 "names": [ "John Abraham", "Lincoln Smith"]

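To keep the first value and the second value apart, a gap is inserted between Abraham and Lincoln in the index; the default gap is 100. In the example below, a phrase query for "Abraham Lincoln" therefore finds nothing: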

PUT my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" 
            }
        }
    }
}

With a slop larger than 100, the document is found:

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 
            }
        }
    }
}

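The gap can be changed in the mapping with the position_increment_gap parameter. With a gap of 0, as below, the phrase query for "Abraham Lincoln" matches across the two values: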

PUT my_index
{
  "mappings": {
    "groups": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 
        }
      }
    }
  }
}

3.20 properties

Object and nested fields contain sub-fields, which are declared with the properties parameter.

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

A document matching this mapping:

PUT my_index/my_type/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

Fields such as manager.name and manager.age can then be used in searches, aggregations and so on.

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White" 
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age", 
            "interval": 5
          }
        }
      }
    }
  }
}

3.21 search_analyzer

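In most cases the same analyzer should be used at index time and at search time, so that the terms in the query have the same form as the terms in the index. Sometimes, though, a different analyzer makes sense, for example when using the edge_ngram token filter for autocomplete.

By default, queries use the analyzer given by the analyzer parameter, but this can be overridden with search_analyzer. Example: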

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick Brown Fox" 
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", 
        "operator": "and"
      }
    }
  }
}
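Why the query "Quick Br" matches can be seen by generating the edge n-grams by hand. The sketch below is a simplification: whitespace splitting stands in for the standard tokenizer, and the function names are hypothetical. At index time each lowercased token is expanded into its prefixes; at search time the query is only tokenized and lowercased.

```python
def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 20) -> list:
    """Return the prefixes of `token` from min_gram to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(text: str) -> set:
    """Index-time analysis: lowercase, split, then expand into edge n-grams."""
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def search_terms(text: str) -> list:
    """Search-time analysis: lowercase and split only, no n-grams."""
    return text.lower().split()


indexed = index_terms("Quick Brown Fox")
query = search_terms("Quick Br")

# operator "and": every query term must be present in the index.
is_match = all(term in indexed for term in query)
print(is_match)  # True
```

If the n-gram filter were also applied at search time, the query would expand to single letters such as "q" and "b" and match far more documents than intended.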

3.22 similarity

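The similarity parameter selects the scoring model for a field. Three values are available:

  • BM25: the default scoring model in Elasticsearch and Lucene
  • classic: TF/IDF scoring
  • boolean: boolean model scoring

Example:
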
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "default_field": { 
          "type": "text"
        },
        "classic_field": {
          "type": "text",
          "similarity": "classic" 
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" 
        }
      }
    }
  }
}

default_field uses the BM25 model automatically, classic_field uses the classic TF/IDF model, and boolean_sim_field uses the boolean model.

3.23 store

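By default, field values are indexed and searchable but not stored. This is usually fine, because the _source field already keeps a copy of the original document. In some cases the store parameter makes sense: if a document has a title, a date and a very large content field, and you want to retrieve only the title and the date, you can do this: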

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] 
}

Query result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "fields": {
          "date": [
            "2015-01-01T00:00:00.000Z"
          ],
          "title": [
            "Some short title"
          ]
        }
      }
    ]
  }
}

Stored fields are always returned as arrays. To get a field in its original form, read it from _source instead.

3.24 term_vector

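Term vectors hold the following information produced by analyzing the text:

  • the list of terms
  • the position of each term
  • the start and end character offsets that map each term back to the original string

The term_vector parameter accepts the following values:

Value                     Meaning
no                        Default; term vectors are not stored
yes                       Stores only the terms
with_positions            Stores terms and their positions
with_offsets              Stores terms and their character offsets
with_positions_offsets    Stores terms, positions, and character offsets

Example: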

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}
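
The information recorded by with_positions_offsets can be pictured with a short Python sketch. It assumes a simplified analyzer (split on word characters, then lowercase), and the term_vector function is a hypothetical helper, not an Elasticsearch API.

```python
import re


def term_vector(text):
    """Return (term, position, start_offset, end_offset) for each token."""
    entries = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        # Offsets refer to the original string; the end offset is exclusive.
        entries.append((match.group().lower(), position, match.start(), match.end()))
    return entries


for entry in term_vector("Quick brown fox"):
    print(entry)
```

Having the offsets available means a highlighter can locate brown and fox in the original string without analyzing the text again.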

4. Dynamic Mapping

4.1 default mapping

When a mapping defines the _default_ type, all other mapping types automatically inherit its settings.

PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

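In the mapping above, _default_ disables the _all field. The user type inherits the settings from _default_, so its _all field is also disabled. The blogpost type enables _all, overriding the default.

When _default_ is updated after index creation, the new defaults apply only to mapping types created afterwards.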
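
The inheritance described above behaves like a recursive dictionary merge. The following Python sketch is only a model of that behavior under this assumption; deep_merge and effective_mappings are hypothetical helpers, not part of any Elasticsearch client.

```python
import copy


def deep_merge(base, override):
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def effective_mappings(mappings):
    """Apply the `_default_` settings to every other mapping type."""
    defaults = mappings.get("_default_", {})
    return {
        name: deep_merge(defaults, body)
        for name, body in mappings.items()
        if name != "_default_"
    }


mappings = {
    "_default_": {"_all": {"enabled": False}},
    "user": {},
    "blogpost": {"_all": {"enabled": True}},
}
resolved = effective_mappings(mappings)
print(resolved["user"])      # {'_all': {'enabled': False}}
print(resolved["blogpost"])  # {'_all': {'enabled': True}}
```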

4.2 Dynamic field mapping

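When a document containing a previously unseen field is indexed, Elasticsearch automatically adds the new field to the type mapping. This behavior is controlled by the dynamic setting: false ignores new fields, and strict throws an exception. When dynamic is true, Elasticsearch infers the field type from the JSON value and maps the field accordingly:

JSON data type           Inferred field type
null                     No field is added
true or false            boolean
floating point number    float
integer                  long
JSON object              object
array                    Determined by the first non-null value in the array
string                   date (if the value passes date detection), double or long (if it passes numeric detection), otherwise text with a keyword sub-field

By default, date detection checks whether a string matches one of the following date formats: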

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

Example:

PUT my_index/my_type/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping

The resulting mapping shows that create_date is mapped as a date:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "create_date": {
            "type": "date",
            "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis"
          }
        }
      }
    }
  }
}

Disable date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}

PUT my_index/my_type/1 
{
  "create": "2015/09/02"
}

Checking the mapping again shows that the create field is no longer a date:

GET my_index/_mapping

Response:

{
  "my_index": {
    "mappings": {
      "my_type": {
        "date_detection": false,
        "properties": {
          "create": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

Customize the formats used by date detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/my_type/1
{
  "create_date": "09/25/2015"
}

Enable numeric detection:

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}
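
The inference rules in this section can be summarized in a small Python sketch. It is a simplified model, not Elasticsearch's logic: date formats are expressed as Python strptime patterns standing in for the real dynamic_date_formats, numeric strings are reported as double or long following the table above (the exact floating-point type may depend on the Elasticsearch version), and infer_type and looks_like_date are hypothetical helpers.

```python
import re
from datetime import datetime

# Assumed stand-ins for the default date formats; the real date parser
# accepts more variants than these.
DEFAULT_DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%S", "%Y/%m/%d", "%Y/%m/%d %H:%M:%S")


def looks_like_date(value, formats):
    """Return True if the string parses with any of the given formats."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def infer_type(value, date_detection=True, numeric_detection=False,
               date_formats=DEFAULT_DATE_FORMATS):
    """Return the guessed field type, or None when no field is added."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take the type of their first non-null element.
        first = next((item for item in value if item is not None), None)
        return infer_type(first, date_detection, numeric_detection, date_formats)
    if isinstance(value, str):
        if date_detection and looks_like_date(value, date_formats):
            return "date"
        if numeric_detection:
            if re.fullmatch(r"[+-]?\d+", value):
                return "long"
            if re.fullmatch(r"[+-]?\d+\.\d+", value):
                return "double"
        return "text"  # in Elasticsearch 5.x this also gets a keyword sub-field
    raise TypeError(f"unsupported JSON value: {value!r}")


print(infer_type("2015/09/02"))
print(infer_type("2015/09/02", date_detection=False))
print(infer_type("09/25/2015", date_formats=("%m/%d/%Y",)))
print(infer_type("1.0", numeric_detection=True), infer_type("1", numeric_detection=True))
```

The four calls correspond to the four examples in this section: default date detection, disabled date detection, a custom date format, and numeric detection.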

4.3 Dynamic templates

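Dynamic templates assign mappings based on the detected type and the field name. The template below applies the following mapping to fields detected as string:

  "mapping": { "type": "long"}

but only to fields whose names match long_* and do not match *_text: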

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/my_type/1
{
  "long_num": "5", 
  "long_text": "foo" 
}

After the document is indexed, long_num is mapped as long, while long_text matches the unmatch pattern and keeps the default string mapping.
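
The matching rules of the template can be sketched in Python as follows. This is an approximation under stated assumptions: fnmatchcase stands in for Elasticsearch's simple wildcard matching (it also accepts ? and [] patterns), only match_mapping_type, match, and unmatch are modeled, and resolve_template is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def resolve_template(field_name, detected_type, templates):
    """Return the mapping of the first matching template, or None."""
    for template in templates:
        for _name, rule in template.items():
            # Skip rules written for a different detected type.
            if rule.get("match_mapping_type") not in (None, detected_type):
                continue
            if "match" in rule and not fnmatchcase(field_name, rule["match"]):
                continue
            if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
                continue
            return rule["mapping"]
    return None


templates = [
    {
        "longs_as_strings": {
            "match_mapping_type": "string",
            "match": "long_*",
            "unmatch": "*_text",
            "mapping": {"type": "long"},
        }
    }
]

print(resolve_template("long_num", "string", templates))   # {'type': 'long'}
print(resolve_template("long_text", "string", templates))  # None
```

A result of None means that no template applies and the field falls back to the default dynamic mapping.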

4.4 Override default template

The _default_ mapping can be set for all indices through an index template whose pattern matches every index name. Example:

PUT _template/disable_all_field
{
  "order": 0,
  "template": "*", 
  "mappings": {
    "_default_": { 
      "_all": { 
        "enabled": false
      }
    }
  }
}