
Elasticsearch Reference 2.3 Study Notes

1. Getting Started

1.1 Elasticsearch

Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze big volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications that have complex search features and requirements.

1.2 Sharding is important for two primary reasons:

It allows you to horizontally split/scale your content volume
It allows you to distribute and parallelize operations across shards (potentially on multiple nodes) thus increasing performance/throughput

1.3 Replication is important for two primary reasons:

It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
It allows you to scale out your search volume/throughput since searches can be executed on all replicas in parallel.

1.4 Cluster Health

curl 'localhost:9200/_cat/health?v'

curl 'localhost:9200/_cat/nodes?v'

1.5 List All Indices

curl 'localhost:9200/_cat/indices?v'

1.6 Create an Index

curl -XPUT 'localhost:9200/customer?pretty'

1.7 Index and Query a Document

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '
{
  "name": "John Doe"
}'

curl -XGET 'localhost:9200/customer/external/1?pretty'

1.8 Delete an Index

curl -XDELETE 'localhost:9200/customer?pretty'

1.9 Updating Documents

Note though that Elasticsearch does not actually do in-place updates under the hood. Whenever we do an update, Elasticsearch deletes the old document and then indexes a new document with the update applied to it in one shot.
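
A minimal sketch of a partial update via the _update endpoint, reusing the customer document indexed above (the doc form merges the given fields into the existing source):

curl -XPOST 'localhost:9200/customer/external/1/_update?pretty' -d '
{
  "doc": { "name": "Jane Doe" }
}'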

1.10 Deleting Documents

curl -XDELETE 'localhost:9200/customer/external/2?pretty'

1.11 Batch Processing

In addition to being able to index, update, and delete individual documents, Elasticsearch also provides the ability to perform any of the above operations in batches using the _bulk API. This functionality is important in that it provides a very efficient mechanism to do multiple operations as fast as possible with as few network roundtrips as possible.

The bulk API executes all the actions sequentially and in order. If a single action fails for whatever reason, it will continue to process the remainder of the actions after it. When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.
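
As a quick illustration in the spirit of the earlier examples, the following indexes two documents in a single bulk call:

curl -XPOST 'localhost:9200/customer/external/_bulk?pretty' -d '
{"index":{"_id":"1"}}
{"name": "John Doe" }
{"index":{"_id":"2"}}
{"name": "Jane Doe" }
'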

1.12 The Search API

It is important to understand that once you get your search results back, Elasticsearch is completely done with the request and does not maintain any kind of server-side resources or open cursors into your results. This is in stark contrast to many other platforms such as SQL wherein you may initially get a partial subset of your query results up-front and then you have to continuously go back to the server if you want to fetch (or page through) the rest of the results using some kind of stateful server-side cursor.
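
For example, a one-shot search over the customer index created above (q=* matches all documents; the complete page of results comes back in a single response):

curl 'localhost:9200/customer/_search?q=*&pretty'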

1.13 Executing Filters

In the previous section, we skipped over a little detail called the document score (_score field in the search results). The score is a numeric value that is a relative measure of how well the document matches the search query that we specified. The higher the score, the more relevant the document is, the lower the score, the less relevant the document is.

But queries do not always need to produce scores, in particular when they are only used for “filtering” the document set. Elasticsearch detects these situations and automatically optimizes query execution in order not to compute useless scores.
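
A minimal sketch of such a non-scoring query, using a bool query whose filter clause skips score computation (the bank index and balance field are assumed names):

curl -XPOST 'localhost:9200/bank/_search?pretty' -d '
{
  "query": {
    "bool": {
      "must": { "match_all": {} },
      "filter": {
        "range": { "balance": { "gte": 20000, "lte": 30000 } }
      }
    }
  }
}'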

1.14 Executing Aggregations

Aggregations provide the ability to group and extract statistics from your data. The easiest way to think about aggregations is by roughly equating it to the SQL GROUP BY and the SQL aggregate functions. In Elasticsearch, you have the ability to execute searches returning hits and at the same time return aggregated results separate from the hits all in one response. This is very powerful and efficient in the sense that you can run queries and multiple aggregations and get the results back of both (or either) operations in one shot avoiding network roundtrips using a concise and simplified API.

2. Setup

2.1 Environment Variables

Most times it is better to leave the default JAVA_OPTS as they are, and use the ES_JAVA_OPTS environment variable in order to set / change JVM settings or arguments.

The ES_HEAP_SIZE environment variable allows you to set the heap memory that will be allocated to the Elasticsearch Java process. It will allocate the same value to both min and max values, though those can be set explicitly (not recommended) by setting ES_MIN_MEM (defaults to 256m) and ES_MAX_MEM (defaults to 1g).

It is recommended to set the min and max memory to the same value, and enable mlockall.
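
For example (a sketch; the 8g heap size is a placeholder to adapt to your hardware):

export ES_HEAP_SIZE=8g

and, in elasticsearch.yml:

bootstrap.mlockall: true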

2.2 File Descriptors

Make sure to increase the number of open file descriptors on the machine (or for the user running Elasticsearch). Setting it to 32k or even 64k is recommended.
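
On Linux, the limit is typically raised in /etc/security/limits.conf. A minimal sketch, assuming Elasticsearch runs as an elasticsearch user:

elasticsearch  soft  nofile  65536
elasticsearch  hard  nofile  65536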

In order to test how many open files the process can open, start it with -Des.max-open-files set to true. This will print the number of open files the process can open on startup.

Alternatively, you can retrieve the max_file_descriptors for each node using the Nodes Info API, with:

curl localhost:9200/_nodes/stats/process?pretty

2.3 Virtual memory

Elasticsearch uses a hybrid mmapfs / niofs directory by default to store its indices. The default operating system limit on mmap counts is likely to be too low, which may result in out of memory exceptions. On Linux, you can increase the limit by running the following command as root:

sysctl -w vm.max_map_count=262144

To set this value permanently, update the vm.max_map_count setting in /etc/sysctl.conf.
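
That is, add the following line to /etc/sysctl.conf:

vm.max_map_count=262144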

2.4 Memory Settings

Most operating systems try to use as much memory as possible for file system caches and eagerly swap out unused application memory, possibly resulting in the elasticsearch process being swapped. Swapping is very bad for performance and for node stability, so it should be avoided at all costs.

There are three options: disable swap, configure swappiness, or use mlockall.

2.5 Elasticsearch Settings

Elasticsearch configuration files can be found in the ES_HOME/config folder. The folder comes with two files: elasticsearch.yml for configuring Elasticsearch's modules, and logging.yml for configuring Elasticsearch logging.

The configuration format is YAML.
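
A few commonly adjusted settings in elasticsearch.yml, as a sketch (the setting names are real; the values are placeholders):

cluster.name: my-cluster
node.name: node-1
network.host: 192.168.0.1
path.data: /var/data/elasticsearch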

2.6 Directory Layout

zip and tar.gz:

| Type    | Description | Location |
|:--------|:------------|:---------|
| home    | Home of the elasticsearch installation | {extract.path} |
| bin     | Binary scripts including elasticsearch to start a node | {extract.path}/bin |
| conf    | Configuration files elasticsearch.yml and logging.yml | {extract.path}/config |
| data    | The location of the data files of each index / shard allocated on the node | {extract.path}/data |
| logs    | Log files location | {extract.path}/logs |
| plugins | Plugin files location. Each plugin will be contained in a subdirectory | {extract.path}/plugins |
| repo    | Shared file system repository locations | Not configured |
| script  | Location of script files | {extract.path}/config/scripts |

3 Breaking changes (skipped)

4 API Conventions (skipped)

5 Document APIs

5.1 Index API

The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts the JSON document into the “twitter” index, under a type called “tweet” with an id of 1:

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.2 Automatic Index Creation

The index operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if one has not yet been created (check out the put mapping API for manually creating a type mapping).
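
Both behaviors can be disabled in elasticsearch.yml when stricter control is desired:

action.auto_create_index: false
index.mapper.dynamic: false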

5.3 Versioning

Each indexed document is given a version number. The associated version number is returned as part of the response to the index API request. The index API optionally allows for optimistic concurrency control when the version parameter is specified. This will control the version of the document the operation is intended to be executed against. A good example of a use case for versioning is performing a transactional read-then-update. Specifying a version from the document initially read ensures no changes have happened in the meantime (when reading in order to update, it is recommended to set preference to _primary). For example:

curl -XPUT 'localhost:9200/twitter/tweet/1?version=2' -d '{
    "message" : "elasticsearch now has versioning support, double cool!"
}'

5.4 Operation Type

The index operation also accepts an op_type that can be used to force a create operation, allowing for “put-if-absent” behavior. When create is used, the index operation will fail if a document by that id already exists in the index.

Here is an example of using the op_type parameter:

curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.5 Routing

By default, shard placement — or routing — is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter. For example:

curl -XPOST 'http://localhost:9200/twitter/tweet?routing=kimchy' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

5.6 Parents & Children (** What scenarios is this suited for?)

A child document can be indexed by specifying its parent when indexing. For example:

curl -XPUT localhost:9200/blogs/blog_tag/1122?parent=1111 -d '{
    "tag" : "something"
}'

When indexing a child document, the routing value is automatically set to be the same as its parent, unless the routing value is explicitly specified using the routing parameter.

5.7 Distributed

The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.

5.8 Write Consistency

To prevent writes from taking place on the “wrong” side of a network partition, by default, index operations only succeed if a quorum (>replicas/2+1) of active shards are available.

5.9 Write Consistency (** If an index is configured with one primary shard and two replica shards, will indexing on the primary succeed while both replicas are unavailable? And once the replicas come back, will the document be replicated to them?)

The index operation only returns after all active shards within the replication group have indexed the document (sync replication).
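
The quorum requirement described above can be overridden per request with the consistency parameter (one, quorum, or all). For example, to let the write proceed with only the primary shard active (a sketch reusing the tweet document from earlier):

curl -XPUT 'localhost:9200/twitter/tweet/1?consistency=one' -d '{
    "user" : "kimchy",
    "message" : "written while the replicas are unavailable"
}'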

5.10 Refresh

To refresh the shard (not the whole index) immediately after the operation occurs, so that the document appears in search results immediately, the refresh parameter can be set to true. Setting this option to true should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint. Note, getting a document using the get API is completely realtime and doesn’t require a refresh.
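
For example (a sketch; document id 3 is arbitrary):

curl -XPUT 'localhost:9200/twitter/tweet/3?refresh=true' -d '{
    "user" : "kimchy",
    "message" : "visible to search as soon as this call returns"
}'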

5.11 Get API

The get API allows to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called twitter, under a type called tweet, with id valued 1:

curl -XGET 'http://localhost:9200/twitter/tweet/1'

5.12 Preference

Controls a preference of which shard replicas to execute the get request on. By default, the operation is randomized between the shard replicas.

The preference can be set to:

  • _primary: The operation will go and be executed only on the primary shards.
  • _local: The operation will prefer to be executed on a local allocated shard if possible.
  • Custom (string) value: A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with “jumping values” when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
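
For example, forcing the get onto the primary shard:

curl -XGET 'http://localhost:9200/twitter/tweet/1?preference=_primary'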

5.13 Delete API

The delete API allows to delete a typed JSON document from a specific index based on its id. The following example deletes the JSON document from an index called twitter, under a type called tweet, with id valued 1:

curl -XDELETE 'http://localhost:9200/twitter/tweet/1'

The delete operation gets hashed into a specific shard id. It then gets redirected into the primary shard within that id group, and replicated (if needed) to shard replicas within that id group.

5.14 Update API

The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the “get” and “reindex”.

Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
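
A minimal sketch of both forms, reusing the tweet document from earlier (the counter and views fields are assumed to exist; the scripted form additionally requires scripting to be enabled, which is restricted by default in 2.x):

curl -XPOST 'localhost:9200/twitter/tweet/1/_update' -d '{
    "script" : {
        "inline": "ctx._source.counter += count",
        "params" : { "count" : 4 }
    }
}'

curl -XPOST 'localhost:9200/twitter/tweet/1/_update' -d '{
    "doc" : { "views" : 42 }
}'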

5.15 Update By Query API (new and should still be considered experimental)

The simplest usage of _update_by_query just performs an update on every document in the index without changing the source. This is useful to pick up a new property or some other online mapping change. Here is the API:

curl -XPOST 'localhost:9200/twitter/_update_by_query?conflicts=proceed'

All update and query failures cause the _update_by_query to abort and are returned in the failures of the response. The updates that have been performed still stick. In other words, the process is not rolled back, only aborted.

5.16 Multi Get API

Multi GET API allows to get multiple documents based on an index, type (optional) and id (and possibly routing). The response includes a docs array with all the fetched documents, each element similar in structure to a document provided by the get API. Here is an example:

curl 'localhost:9200/_mget' -d '{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
    ]
}'

5.17 Bulk API

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

The REST API endpoint is /_bulk, and it expects the following JSON structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

NOTE: the final line of data must end with a newline character \n.

The possible actions are index, create, delete and update. index and create expect a source on the next line, and have the same semantics as the op_type parameter to the standard index API (i.e. create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary). delete does not expect a source on the following line, and has the same semantics as the standard delete API. update expects that the partial doc, upsert and script and its options are specified on the next line.

Here is an example of a correct sequence of bulk commands:

{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "type1", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "type1", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "type1", "_index" : "index1"} }
{ "doc" : {"field2" : "value2"} }

The endpoints are /_bulk, /{index}/_bulk, and {index}/{type}/_bulk. When the index or the index/type are provided, they will be used by default on bulk items that don’t provide them explicitly.

A note on the format. The idea here is to make processing of this as fast as possible. As some of the actions will be redirected to other shards on other nodes, only the action_and_meta_data line is parsed on the receiving node side.

5.18 Reindex API (new and should still be considered experimental)

The most basic form of _reindex just copies documents from one index to another. This will copy documents from the twitter index into the new_twitter index:

POST /_reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

You can limit the documents by adding a type to the source or by adding a query. This will only copy tweets made by kimchy into new_twitter:

POST /_reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

5.19 Term Vectors

Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.

curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true'

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Term information
- term frequency in the field (always returned)
- term positions (positions : true)
- start and end offsets (offsets : true)
- term payloads (payloads : true), as base64 encoded bytes

Term statistics
- total term frequency (how often a term occurs in all documents)
- document frequency (the number of documents containing the current term)

Setting term_statistics to true (default is false) will return term statistics. By default these values are not returned since term statistics can have a serious performance impact.

Field statistics
- document count (how many documents contain this field)
- sum of document frequencies (the sum of document frequencies for all terms in this field)
- sum of total term frequencies (the sum of total term frequencies of each term in this field)

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in, unless dfs is set to true. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected.
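
For example, requesting positions, offsets, and term statistics explicitly (the message field comes from the twitter examples above):

curl -XGET 'http://localhost:9200/twitter/tweet/1/_termvectors?pretty=true' -d '{
    "fields" : ["message"],
    "positions" : true,
    "offsets" : true,
    "term_statistics" : true,
    "field_statistics" : true
}'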

6 Search APIs

The search API allows you to execute a search query and get back search hits that match the query. The query can either be provided using a simple query string as a parameter, or using a request body.

All search APIs can be applied across multiple types within an index, and across multiple indices with support for the multi index syntax.

A search request can be executed purely using a URI by providing request parameters. Not all search options are exposed when executing a search using this mode, but it can be handy for quick “curl tests”. Here is an example:

curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy'

The search request can be executed with a search DSL, which includes the Query DSL, within its body. Here is an example:

curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'

6.4 Query

The query element within the search request body allows to define a query using the Query DSL.

{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.5 From / Size

Pagination of results can be done by using the from and size parameters.

{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll API for more efficient ways to do deep scrolling.

6.6 Sort (** How does sorting on multiple fields work?)

Allows you to add one or more sorts on specific fields. Each sort can be reversed as well. The sort is defined on a per-field level, with the special field name _score to sort by score, and _doc to sort by index order. When multiple sort fields are given they are applied in order: documents are ordered by the first sort, and each subsequent sort only breaks ties left by the ones before it.

{
    "sort" : [
        { "post_date" : {"order" : "asc"}},
        "user",
        { "name" : "desc" },
        { "age" : "desc" },
        "_score"
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The sort values for each document returned are also returned as part of the response.
The order option can have the following values:

  • asc: Sort in ascending order
  • desc: Sort in descending order

The order defaults to desc when sorting on the _score, and defaults to asc when sorting on anything else.

6.7 Sort mode option

Elasticsearch supports sorting by array or multi-valued fields. The mode option controls what array value is picked for sorting the document it belongs to. The mode option can have the following values:

  • min: Pick the lowest value.
  • max: Pick the highest value.
  • sum: Use the sum of all values as sort value. Only applicable for number based array fields.
  • avg: Use the average of all values as sort value. Only applicable for number based array fields.
  • median: Use the median of all values as sort value. Only applicable for number based array fields.
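
For example, sorting by the average of a multi-valued field (a sketch; price is an assumed numeric array field):

{
    "sort" : [
        { "price" : { "order" : "asc", "mode" : "avg" } }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}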

6.8 Missing Values

The missing parameter specifies how docs which are missing the field should be treated: The missing value can be set to _last, _first, or a custom value (that will be used for missing docs as the sort value). For example:

{
    "sort" : [
        { "price" : {"missing" : "_last"} },
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.9 Script Based Sorting

Allow to sort based on custom scripts, here is an example:

{
    "query" : {
        ....
    },
    "sort" : {
        "_script" : {
            "type" : "number",
            "script" : {
                "inline": "doc['field_name'].value * factor",
                "params" : {
                    "factor" : 1.1
                }
            },
            "order" : "asc"
        }
    }
}

6.10 Memory Considerations

When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them. For string based types, the field sorted on should not be analyzed / tokenized. For numeric types, if possible, it is recommended to explicitly set the type to narrower types (like short, integer and float).

6.11 Source filtering

Allows to control how the _source field is returned with every hit.

By default operations return the contents of the _source field unless you have used the fields parameter or if the _source field is disabled.

  • To disable _source retrieval set to false.
  • The _source also accepts one or more wildcard patterns to control what parts of the _source should be returned.
  • Finally, for complete control, you can specify both include and exclude patterns.
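
For example, combining include and exclude patterns (a sketch; obj1, obj2 and the description fields are assumed names):

{
    "_source": {
        "include": [ "obj1.*", "obj2.*" ],
        "exclude": [ "*.description" ]
    },
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}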

6.12 Fields

The fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.

6.13 Script Fields

Allows to return a script evaluation (based on different fields) for each hit, for example:

{
    "query" : {
        ...
    },
    "script_fields" : {
        "test1" : {
            "script" : "_source.obj1.obj2"
        },
        "test2" : {
            "script" : {
                "inline": "doc['my_field_name'].value * factor",
                "params" : {
                    "factor"  : 2.0
                }
            }
        }
    }
}

Note the _source keyword here to navigate the json-like model.

It’s important to understand the difference between doc['my_field'].value and _source.my_field. The first, using the doc keyword, will cause the terms for that field to be loaded into memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can’t return a json object from it) and makes sense only on non-analyzed or single-term based fields.

The _source on the other hand causes the source to be loaded, parsed, and then only the relevant part of the json is returned.

6.14 Field Data Fields (** Need to understand the stored vs. fielddata concepts)

Allows to return the field data representation of a field for each hit, for example:

{
    "query" : {
        ...
    },
    "fielddata_fields" : ["test1", "test2"]
}

Field data fields can work on fields that are not stored.

It’s important to understand that using the fielddata_fields parameter will cause the terms for that field to be loaded to memory (cached), which will result in more memory consumption.

6.15 Post filter

The post_filter is applied to the search hits at the very end of a search request, after aggregations have already been calculated.

{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "brandName": "vans"
        }
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": {
        "field": "colorNames"
      }
    },
    "color_red": {
      "filter": {
        "term": {
          "colorNames": "紅色"
        }
      },
      "aggs": {
        "smallSorts": {
          "terms": {
            "field": "smallSort"
          }
        }
      }
    }
  },
  "post_filter": {
    "term": {
      "colorNames": "紅色"
    }
  }
}
  • The main query now finds all products by vans, regardless of color.
  • The colors agg returns popular colors by vans.
  • The color_red agg limits the small sort sub-aggregation to red vans products.
  • Finally, the post_filter removes colors other than red from the search hits.

6.16 Highlighting

Allows to highlight search results on one or more fields. The implementation uses either the lucene highlighter, fast-vector-highlighter or postings-highlighter. The following is an example of the search request body:

{
    "query" : {...},
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}

6.16.1 Plain highlighter

The default choice of highlighter is of type plain and uses the Lucene highlighter. It tries hard to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.

6.16.2 Postings highlighter

If index_options is set to offsets in the mapping the postings highlighter will be used instead of the plain highlighter. The postings highlighter:

  • Is faster since it doesn’t require reanalyzing the text to be highlighted: the larger the documents the better the performance gain should be
  • Requires less disk space than term_vectors, needed for the fast vector highlighter
  • Breaks the text into sentences and highlights them. Plays really well with natural languages, not as well with fields containing for instance html markup
  • Treats the document as the whole corpus, and scores individual sentences as if they were documents in this corpus, using the BM25 algorithm
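
A mapping sketch that enables the postings highlighter on a content field via the index_options setting named above (index and type names are placeholders):

curl -XPUT 'localhost:9200/blog' -d '{
    "mappings": {
        "post": {
            "properties": {
                "content": { "type": "string", "index_options": "offsets" }
            }
        }
    }
}'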

6.16.3 Fast vector highlighter

If term_vector information is provided by setting term_vector to with_positions_offsets in the mapping then the fast vector highlighter will be used instead of the plain highlighter. The fast vector highlighter:

  • Is faster especially for large fields (> 1MB)
  • Can be customized with boundary_chars, boundary_max_scan, and fragment_offset (see below)
  • Requires setting term_vector to with_positions_offsets which increases the size of the index
  • Can combine matches from multiple fields into one result. See matched_fields
  • Can assign different weights to matches at different positions allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches
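
And the corresponding mapping sketch for the fast vector highlighter:

curl -XPUT 'localhost:9200/blog' -d '{
    "mappings": {
        "post": {
            "properties": {
                "content": { "type": "string", "term_vector": "with_positions_offsets" }
            }
        }
    }
}'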

6.17 Rescoring

Rescoring can help to improve precision by reordering just the top (e.g. 100-500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index.

A rescore request is executed on each shard before it returns its results to be sorted by the node handling the overall search request.

Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.

By default the scores from the original query and the rescore query are combined linearly to produce the final _score for each document. The relative importance of the original query and of the rescore query can be controlled with the query_weight and rescore_query_weight respectively. Both default to 1.

{
  "query": {
    "match": {
      "productName.productName_ansj": {
        "operator": "or",
        "query": "連帽 套裝",
        "type": "boolean"
      }
    }
  },
  "_source": [
    "productName"
  ],
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match": {
          "productName.productName_ansj": {
            "query": "連帽 套裝",
            "type": "phrase",
            "slop": 2
          }
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2
    }
  }
}

Score Mode
- total:Add the original score and the rescore query score. The default.
- multiply: Multiply the original score by the rescore query score. Useful for function query rescores.
- avg: Average the original score and the rescore query score.
- max: Take the max of original score and the rescore query score.
- min: Take the min of the original score and the rescore query score.

6.18 Search Type

There are different execution paths that can be done when executing a distributed search. The distributed search operation needs to be scattered to all the relevant shards and then all the results are gathered back. When doing scatter/gather type execution, there are several ways to do that, specifically with search engines.

One of the questions when executing a distributed search is how many results to retrieve from each shard. For example, if we have 10 shards, the 1st shard might hold the most relevant results from 0 till 10, with other shards results ranking below it. For this reason, when executing a request, we will need to get results from 0 till 10 from all shards, sort them, and then return the results if we want to ensure correct results.

Another question, which relates to the search engine, is the fact that each shard stands on its own. When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first gather the term frequencies from all shards to calculate global term frequencies, then execute the query on each shard using these global frequencies.

Also, because of the need to sort the results, getting back a large document set, or even scrolling it, while maintaining the correct sorting behavior can be a very expensive operation. For large result set scrolling, it is best to sort by _doc if the order in which documents are returned is not important.

Elasticsearch is very flexible and allows to control the type of search to execute on a per search request basis. The type can be configured by setting the search_type parameter in the query string. The types are:

6.18.1 Query Then Fetch(query_then_fetch)

The request is processed in two phases. In the first phase, the query is forwarded to all involved shards. Each shard executes the search and generates a sorted list of results, local to that shard. Each shard returns just enough information to the coordinating node to allow it to merge and re-sort the shard level results into a globally sorted set of results, of maximum length size.

During the second phase, the coordinating node requests the document content (and highlighted snippets, if any) from only the relevant shards.

Note: This is the default setting, if you do not specify a search_type in your request.

6.18.2 Dfs, Query Then Fetch(dfs_query_then_fetch)

Same as “Query Then Fetch”, except for an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
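
For example, requesting DFS scoring on the earlier URI search (a minimal sketch):

curl -XGET 'http://localhost:9200/twitter/tweet/_search?search_type=dfs_query_then_fetch&q=user:kimchy'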

6.18.3 Count (Deprecated in 2.0.0-beta1)

6.18.4 Scan (Deprecated in 2.1.0)

6.19 Scroll

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
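
A minimal sketch of the two-step flow (the scroll_id below is a placeholder; use the _scroll_id returned by the first call):

curl -XGET 'localhost:9200/twitter/tweet/_search?scroll=1m' -d '{
    "size": 100,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'

curl -XGET 'localhost:9200/_search/scroll' -d '{
    "scroll" : "1m",
    "scroll_id" : "<_scroll_id from the previous response>"
}'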

6.20 Preference

Controls a preference of which shard replicas to execute the search request on. By default, the operation is randomized between the shard replicas.

The preference is a query string parameter which can be set to:

  • _primary: The operation will go and be executed only on the primary shards.
  • _primary_first: The operation will go and be executed on the primary shard, and if not available (failover), will execute on other shards.
  • _replica: The operation will go and be executed only on a replica shard.
  • _replica_first: The operation will go and be executed only on a replica shard, and if not available (failover), will execute on other shards.
  • _local: The operation will prefer to be executed on a local allocated shard if possible.
  • _only_node:xyz: Restricts the search to execute only on a node with the provided node id (xyz in this case).
  • _prefer_node:xyz: Prefers execution on the node with the provided node id (xyz in this case) if applicable.
  • _shards:2,3: Restricts the operation to the specified shards. (2 and 3 in this case). This preference can be combined with other preferences but it has to appear first: _shards:2,3;_primary
  • Custom (string) value: A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with “jumping values” when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.

6.21 Explain

Enables explanation for each hit on how its score was computed.

{
    "explain": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.22 Version

Returns a version for each search hit.

{
    "version": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

6.23 Index Boost

Allows to configure different boost levels per index when searching across more than one index. This is very handy when hits coming from one index matter more than hits coming from another index (think social graph where each user has an index).

{
    "indices_boost" : {
        "index1" : 1.4,
        "index2" : 1.3
    }
}

6.24 Inner hits (****** Skipped: related to parent/child; to be studied together later)

6.25 Search Template

The /_search/template endpoint allows you to use the mustache language to pre-render search requests, filling existing templates with template parameters before they are executed.

GET /_search/template
{
    "inline" : {
      "query": { "match" : { "{{my_field}}" : "{{my_value}}" } },
      "size" : "{{my_size}}"
    },
    "params" : {
        "my_field" : "foo",
        "my_value" : "bar",
        "my_size" : 5
    }
}

6.26 Search Shards API

The search shards api returns the indices and shards that a search request would be executed against. This can give useful feedback for working out issues or planning optimizations with routing and shard preferences.

curl -XGET 'localhost:9200/twitter/_search_shards'

6.27 Suggesters (Skipped)

The suggest feature suggests similar looking terms based on a provided text by using a suggester. Parts of the suggest feature are still under development.

6.28 Count API

The count API allows to easily execute a query and get the number of matches for that query. It can be executed across one or more indices and across one or more types. The query can either be provided using a simple query string as a parameter, or using the Query DSL defined within the request body. Here is an example:

curl -XGET 'http://localhost:9200/twitter/tweet/_count?q=user:kimchy'

curl -XGET 'http://localhost:9200/twitter/tweet/_count' -d '
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'

6.29 Validate API

The validate API allows a user to validate a potentially expensive query without executing it. The following example shows how it can be used:

curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}'

When the query is valid, the response contains valid:true:

curl -XGET 'http://localhost:9200/twitter/_validate/query?q=user:foo'

{"valid":true,"_shards":{"total":1,"successful":1,"failed":0}}

6.30 Explain API

The explain api computes a score explanation for a query and a specific document. This can give useful feedback whether a document matches or didn’t match a specific query.

curl -XGET 'localhost:9200/twitter/tweet/1/_explain' -d '{
      "query" : {
        "term" : { "message" : "search" }
      }
}'

6.31 Profile API (experimental and may be changed or removed)

The Profile API provides detailed timing information about the execution of individual components in a query. It gives the user insight into how queries are executed at a low level so that the user can understand why certain queries are slow, and take steps to improve their slow queries.

The output from the Profile API is very verbose, especially for complicated queries executed across many shards. Pretty-printing the response is recommended to help understand the output.

curl -XGET 'localhost:9200/_search' -d '{
  "profile": true,
  "query" : {
    "match" : { "message" : "search test" }
  }
}'

6.32 Field stats API (experimental and may be changed or removed)

The field stats api allows one to find statistical properties of a field without executing a search, but looking up measurements that are natively available in the Lucene index. This can be useful to explore a dataset which you don’t know much about. For example, this allows creating a histogram aggregation with meaningful intervals based on the min/max range of values.

The field stats api by default executes on all indices, but can execute on specific indices too.

All indices:

curl -XGET "http://localhost:9200/_field_stats?fields=rating"

Specific indices:

curl -XGET "http://localhost:9200/index1,index2/_field_stats?fields=rating"

7 Aggregations

7.1 Aggregations

The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.

An aggregation can be seen as a unit-of-work that builds analytic information over a set of documents. The context of the execution defines what this document set is (e.g. a top-level aggregation executes within the context of the executed query/filters of the search request).

There are many different types of aggregations, each with its own purpose and output. To better understand these types, it is often easier to break them into three main families:

  • Bucketing: A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to “fall in” the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets - each one with a set of documents that “belong” to it.
  • Metric: Aggregations that keep track and compute metrics over a set of documents.
  • Pipeline: Aggregations that aggregate the output of other aggregations and their associated metrics

The interesting part comes next. Since each bucket effectively defines a document set (all documents belonging to the bucket), one can potentially associate aggregations on the bucket level, and those will execute within the context of that bucket. This is where the real power of aggregations kicks in: aggregations can be nested!
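
A sketch of one level of nesting, reusing the product fields from the examples below: terms buckets per brandName, each carrying an average salesPrice.

{
  "size": 0,
  "aggs": {
    "by_brand": {
      "terms": { "field": "brandName" },
      "aggs": {
        "avg_price": { "avg": { "field": "salesPrice" } }
      }
    }
  }
}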

7.2 Structuring Aggregations

The following snippet captures the basic structure of aggregations:

"aggregations" : {
    "<aggregation_name>" : {
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}

7.3 Metrics Aggregations

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output a single numeric metric (e.g. avg) and are called single-value numeric metrics aggregation, others generate multiple metrics (e.g. stats) and are called multi-value numeric metrics aggregation. The distinction between single-value and multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket).

7.3.1 Avg Aggregation

A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "avg_price": {
      "avg": {
        "field": "salesPrice"
      }
    }
  }
}

"aggregations": {
    "avg_price": {
        "value": 428.51063644785825
    }
}

7.3.2 Cardinality Aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "brand_count": {
      "cardinality": {
        "field": "brandId"
      }
    }
  }
}

"aggregations": {
    "brand_count": {
        "value": 1186
    }
}

7.3.3 Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The stats that are returned consist of: min, max, sum, count and avg.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "price_stat": {
      "stats": {
        "field": "salesPrice"
      }
    }
  }
}

"aggregations": {
    "price_stat": {
        "count": 221275,
        "min": 0,
        "max": 131231,
        "avg": 428.51063644785825,
        "sum": 94818691.07999983
    }
}

7.3.4 Extended Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The extended_stats aggregations is an extended version of the stats aggregation, where additional metrics are added such as sum_of_squares, variance, std_deviation and std_deviation_bounds.

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "price_stat": {
      "extended_stats": {
        "field": "salesPrice"
      }
    }
  }
}

"aggregations": {
    "price_stat": {
        "count": 221275,
        "min": 0,
        "max": 131231,
        "avg": 428.51063644785825,
        "sum": 94818691.07999983,
        "sum_of_squares": 118950750156.63016,
        "variance": 353948.4012870255,
        "std_deviation": 594.9356278514723,
        "std_deviation_bounds": {
        "upper": 1618.3818921508027,
        "lower": -761.3606192550864
        }
    }
}

7.3.5 Geo Bounds Aggregation (Skipped)

7.3.6 Geo Centroid Aggregation (Skipped)

7.3.7 Max Aggregation

A single-value metrics aggregation that keeps track and returns the maximum value among the numeric values extracted from the aggregated documents.

"aggs": {
    "max_price": {
        "max": {
            "field": "salesPrice"
        }
    }
}

7.3.8 Min Aggregation

A single-value metrics aggregation that keeps track and returns the minimum value among numeric values extracted from the aggregated documents.

"aggs": {
    "min_price": {
      "min": {
        "field": "salesPrice"
      }
    }
  }

7.3.9 Percentiles Aggregation

A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents.

Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.

When a range of percentiles are retrieved, they can be used to estimate the data distribution and determine if the data is skewed, bimodal, etc.

"aggs": {
    "price_outlier": {
      "percentiles": {
        "field": "salesPrice"
      }
    }
  }

"aggregations": {
    "price_outlier": {
        "values": {
            "1.0": 19,
            "5.0": 49.049088235742005,
            "25.0": 148.8903318997934,
            "50.0": 288.33201291736634,
            "75.0": 521.2972145384141,
            "95.0": 1286.9096656603726,
            "99.0": 2497.931283641535
        }
    }
}

7.3.10 Percentile Ranks Aggregation

A multi-value metrics aggregation that calculates one or more percentile ranks over numeric values extracted from the aggregated documents.

Percentile ranks show the percentage of observed values which are below a certain value. For example, if a value is greater than or equal to 95% of the observed values it is said to be at the 95th percentile rank.

"aggs": {
    "price_outlier": {
      "percentile_ranks": {
        "field": "salesPrice",
        "values": [
          200,
          500
        ]
      }
    }
  }

"aggregations": {
    "price_outlier": {
        "values": {
            "200.0": 37.906112721751086,
            "500.0": 74.407593883831
        }
    }
}

7.3.11 Scripted Metric Aggregation (experimental and may be changed or removed)

A metric aggregation that executes using scripts to provide a metric output.

7.3.12 Sum Aggregation

A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents.

{
  "query": {
    "term": {
      "brandName": "vans"
    }
  },
  "aggs": {
    "salesNum_total": {
      "sum": {
        "field": "salesNum"
      }
    }
  }
}

"aggregations": {
    "salesNum_total": {
        "value": 253365
    }
}

7.3.13 Top hits Aggregation

A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determine the properties by which the result set is sliced.

Options:
- from: The offset from the first result you want to fetch.
- size: The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
- sort: How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
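
A sketch combining a terms bucket aggregation with top_hits, reusing the product fields from the earlier examples (best-selling product per brand):

{
  "size": 0,
  "aggs": {
    "by_brand": {
      "terms": { "field": "brandName", "size": 3 },
      "aggs": {
        "top_product": {
          "top_hits": {
            "size": 1,
            "_source": [ "productName" ],
            "sort": [ { "salesNum": { "order": "desc" } } ]
          }
        }
      }
    }
  }
}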

7.3.14 Value Count Aggregation

A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. Typically, this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg one might be interested in the number of values the average is computed over.
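
For example, counting how many salesPrice values back the average computed in 7.3.1 (same style as the min/max snippets above):

"aggs": {
    "price_count": {
      "value_count": {
        "field": "salesPrice"
      }
    }
  }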

7.4 Bucket Aggregations

Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do, but instead, they create buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context “falls” into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that “fell into” each bucket.

Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their “parent” bucket aggregation.

There are different bucket aggregators, each with a different “bucketing” strategy. Some define a single bucket, some define fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process.

7.4.1 Children Aggregation

A special single bucket aggregation that enables aggregating from buckets on parent document types to buckets on child documents.

7.4.2 Histogram Aggregation

A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document will be evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will “fall” into the bucket that is associated with the key 30.

From the rounding function above it can be seen that the intervals themselves must be integers.

{
  "size": 0,
  "query": {
    "term": {
      "brandName": "vans"
    }
  },
  "aggs": {
    "prices": {
      "histogram": {
        "field": "salesPrice",
        "interval": 200
      }
    }
  }
}

"aggregations": {
    "prices": {
    "buckets": [
        {
        "key": 0,
        "doc_count": 838
        }
        ,
        {
        "key": 200,
        "doc_count": 1123
        }
        ,
        {
        "key": 400,
        "doc_count": 804
        }
        ,
        {
        "key": 600,
        "doc_count": 283
        }
        ,
        {
        "key": 800,
        "doc_count": 64
        }
        ,
        {
        "key": 1000,
        "doc_count": 16
        }
        ,
        {
        "key": 1200,
        "doc_count": 18
        }
        ,
        {
        "key": 1400,
        "doc_count": 8
        }
        ,
        {
        "key": 1600,
        "doc_count": 7
        }
        ]
    }
}

7.4.3 Date Histogram Aggregation

A multi-bucket aggregation similar to the histogram except it can only be applied on date values. Since dates are represented in elasticsearch internally as long values, it is possible to use the normal histogram on dates as well, though accuracy will be compromised. The reason for this is in the fact that time based intervals are not fixed (think of leap years and the number of days in a month). For this reason, we need special support for time based data. From a functionality perspective, this histogram supports the same features as the normal histogram. The main difference is that the interval can be specified by date/time expressions.

Requesting bucket intervals of a month.

{
    "aggs" : {
        "articles_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            }
        }
    }
}

7.4.4 Range Aggregation

A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and “bucket” the relevant/matching document. Note that this aggregation includes the from value and excludes the to value for each range.

{
  "size": 0,
  "query": {
    "term": {
      "brandName": "vans"
    }
  },
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "salesPrice",
        "ranges": [
          {
            "to": 200
          },
          {
            "from": 200,
            "to": 500
          },
          {
            "from": 500
          }
        ]
      }
    }
  }
}

"aggregations": {
    "price_ranges": {
        "buckets": [
            {
                "key": "*-200.0",
                "to": 200,
                "to_as_string": "200.0",
                "doc_count": 838
            }
            ,
            {
                "key": "200.0-500.0",
                "from": 200,
                "from_as_string": "200.0",
                "to": 500,
                "to_as_string": "500.0",
                "doc_count": 1594
            }
            ,
            {
                "key": "500.0-*",
                "from": 500,
                "from_as_string": "500.0",
                "doc_count": 729
            }
        ]
    }
}

7.4.5 Date Range Aggregation

A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned. Note that this aggregation includes the from value and excludes the to value for each range.

{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyy",
                "ranges": [
                    { "to": "now-10M/M" }, 
                    { "from": "now-10M/M" } 
                ]
            }
        }
    }
}

7.4.6 Filter Aggregation

Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.

{
  "size": 0,
  "query": {
    "term": {
      "brandName": "vans"
    }
  },
  "aggs": {
    "red_products": {
      "filter": {
        "term": {
          "colorNames": "紅色"
        }
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "salesPrice"
          }
        }
      }
    }
  }
}

7.4.7 Filters Aggregation

Defines a multi bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.

{
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "term" : { "body" : "error"   }},
          "warnings" : { "term" : { "body" : "warning" }}
        }
      },
      "aggs" : {
        "monthly" : {
          "histogram" : {
            "field" : "timestamp",
            "interval" : "1M"
          }
        }
      }
    }
  }
}

In the above example, we analyze log messages. The aggregation will build two collections (buckets) of log messages: one for all those containing an error, and another for all those containing a warning. And for each of these buckets it will break them down by month. Response:

  "aggs" : {
    "messages" : {
      "buckets" : {
        "errors" : {
          "doc_count" : 34,
          "monthly" : {
            "buckets" : [
              ... // the histogram monthly breakdown
            ]
          }
        },
        "warnings" : {
          "doc_count" : 439,
          "monthly" : {
            "buckets" : [
               ... // the histogram monthly breakdown
            ]
          }
        }
      }
    }
  }