
Elasticsearch 6.x Basic Concepts Explained

Official URL

For the latest Elasticsearch basics, see the official introduction:
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-concepts.html

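Note: the text below follows the official documentation, which the original post paired with a Google translation (decent machine output, with a few manual adjustments).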

Basic Concepts

A few concepts are core to Elasticsearch. Understanding them from the outset makes the learning process much easier.

Near Realtime (NRT)

Elasticsearch is a near-realtime search platform. This means there is a slight latency (normally one second) between the time you index a document and the time it becomes searchable.
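
To make the near-realtime delay concrete, here is a minimal toy model. It assumes a simplified picture in which newly indexed documents sit in a buffer until a periodic refresh makes them searchable. The ToyIndex class and its methods are hypothetical illustrations, not Elasticsearch's actual internals or API.

```python
class ToyIndex:
    """Toy model of near-realtime visibility (not Elasticsearch internals)."""

    def __init__(self):
        self._buffer = []      # indexed, but not yet visible to search
        self._searchable = []  # visible to search

    def index(self, doc):
        # Newly indexed documents land in the buffer first.
        self._buffer.append(doc)

    def refresh(self):
        # Stand-in for the periodic refresh (normally about once a second).
        self._searchable.extend(self._buffer)
        self._buffer.clear()

    def search(self, term):
        # Only refreshed documents can match.
        return [d for d in self._searchable if term in d.get("name", "")]


idx = ToyIndex()
idx.index({"name": "John Doe"})
before = idx.search("John")  # searched before any refresh
idx.refresh()
after = idx.search("John")   # searched after the refresh
print(len(before), len(after))
```

The search issued before the refresh finds nothing, while the one issued afterwards finds the document. That gap is the latency the paragraph above describes.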

Cluster

A cluster is a collection of one or more nodes (servers) that together hold all of your data and provide federated indexing and search capabilities across all nodes. A cluster is identified by a unique name, which by default is “elasticsearch”. This name is important because a node can only be part of a cluster if the node is set up to join the cluster by its name.

Make sure that you don’t reuse the same cluster names in different environments, otherwise you might end up with nodes joining the wrong cluster. For instance, you could use logging-dev, logging-stage, and logging-prod for the development, staging, and production clusters.

Note that it is perfectly valid to have a cluster with only a single node in it. Furthermore, you may also have multiple independent clusters, each with its own unique cluster name.

Node

A node is a single server that is part of your cluster, stores your data, and participates in the cluster’s indexing and search capabilities. Just like a cluster, a node is identified by a name, which by default is a random Universally Unique IDentifier (UUID) that is assigned to the node at startup. You can define any node name you want if you do not want the default. This name is important for administration purposes, where you want to identify which servers in your network correspond to which nodes in your Elasticsearch cluster.

A node can be configured to join a specific cluster by the cluster name. By default, each node is set up to join a cluster named elasticsearch, which means that if you start several nodes on your network and they can discover each other, they will all automatically form and join a single cluster named elasticsearch.

In a single cluster, you can have as many nodes as you want. Furthermore, if there are no other Elasticsearch nodes currently running on your network, starting a single node will by default form a new single-node cluster named elasticsearch.
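
As a small sketch of the naming advice above, the helper below renders the two relevant lines of a node's elasticsearch.yml. The cluster.name and node.name settings are the real configuration keys. The render_node_config helper, the environment list, and the node name es-node-1 are assumptions made up for illustration.

```python
def render_node_config(environment, node_name):
    """Return elasticsearch.yml lines for one node (hypothetical helper)."""
    # One cluster name per environment keeps nodes from joining the wrong cluster.
    cluster_name = f"logging-{environment}"
    return f"cluster.name: {cluster_name}\nnode.name: {node_name}\n"


# Render one config per environment and confirm the cluster names differ.
configs = {env: render_node_config(env, "es-node-1") for env in ("dev", "stage", "prod")}
cluster_lines = {cfg.splitlines()[0] for cfg in configs.values()}
print(configs["prod"])
```

Deriving the cluster name from the environment, rather than typing it by hand on each server, is one way to avoid accidentally reusing a name across environments.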

Index

An index is a collection of documents that have somewhat similar characteristics. For example, you can have an index for customer data, another index for a product catalog, and yet another index for order data. An index is identified by a name (which must be all lowercase), and this name is used to refer to the index when performing indexing, search, update, and delete operations against the documents in it.

In a single cluster, you can define as many indexes as you want.
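
The lowercase requirement can be checked before a name is ever sent to the cluster. The sketch below is a hypothetical pre-flight check, not part of any Elasticsearch client. Only the lowercase rule comes from the text above. The other rules (forbidden characters, forbidden leading characters, "." and "..") are assumptions recalled from the create-index documentation and may be incomplete.

```python
# Characters assumed to be disallowed in index names (may be incomplete).
FORBIDDEN_CHARS = set('\\/*?"<>| ,#')


def is_valid_index_name(name):
    """Rough pre-flight check for an index name (hypothetical helper)."""
    if not name or name in (".", ".."):
        return False
    if name != name.lower():   # the rule stated in the text: all lowercase
        return False
    if name[0] in "-_+":       # assumed rule: no leading -, _ or +
        return False
    return not any(ch in FORBIDDEN_CHARS for ch in name)


for candidate in ("customer", "Customer", "_orders", "order data"):
    print(candidate, is_valid_index_name(candidate))
```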

Type

Types are deprecated as of 6.0.0.
The most visible change is that an index can contain only one type, for which the recommended name is _doc. Later releases phase types out entirely: the 7.x APIs deprecate them and 8.0 removes them.

A type used to be a logical category/partition of your index that let you store different types of documents in the same index, e.g. one type for users, another type for blog posts. It is no longer possible to create multiple types in an index, and the whole concept of types will be removed in a later version. See Removal of mapping types for more.
(Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Specifying types in requests is deprecated in Elasticsearch 7.x and no longer supported from 8.0.)

Document

A document is a basic unit of information that can be indexed. For example, you can have a document for a single customer, another document for a single product, and yet another for a single order. This document is expressed in JSON (JavaScript Object Notation), which is a ubiquitous internet data interchange format.
(Put simply, a document in ES might hold a news article, or the content of a DOC or PDF file.)

Within an index/type, you can store as many documents as you want. Note that although a document physically resides in an index, a document actually must be indexed/assigned to a type inside an index.
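
To show how an index, a type, and a JSON document fit together, the sketch below builds the pieces of a 6.x-style indexing request without sending it anywhere. The build_index_request helper, the customer index, and the sample document are hypothetical. The /index/type/id path shape and the _doc type name follow the 6.x conventions described above.

```python
import json


def build_index_request(index, doc_id, document, doc_type="_doc"):
    """Return (method, path, body) for indexing one document (hypothetical helper)."""
    # The document lives in an index and is assigned to that index's single type.
    path = f"/{index}/{doc_type}/{doc_id}"
    # The body is the document serialized as JSON.
    return "PUT", path, json.dumps(document, sort_keys=True)


method, path, body = build_index_request("customer", 1, {"name": "John Doe"})
print(method, path)
print(body)
```

The resulting method, path, and body could then be handed to any HTTP client. The helper deliberately stops short of making the call so that it runs without a cluster.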

Shards & Replicas

An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

To solve this problem, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. When you create an index, you can simply define the number of shards that you want. Each shard is in itself a fully functional and independent “index” that can be hosted on any node in the cluster.

Sharding is important for two primary reasons:

  • It allows you to horizontally split/scale your content volume.
  • It allows you to distribute and parallelize operations across shards (potentially on multiple nodes), thus increasing performance/throughput.

The mechanics of how a shard is distributed, and of how its documents are aggregated back into search requests, are completely managed by Elasticsearch and are transparent to you as the user.
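
Although shard distribution is handled for you, a rough sketch helps build intuition for it. The code below assumes the commonly documented scheme in which a document's routing value (its ID by default) is hashed and taken modulo the number of primary shards. The pick_shard helper is hypothetical, and zlib.crc32 is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value to a shard number (illustrative only)."""
    # crc32 is a stand-in; Elasticsearch uses a different hash internally.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


# Spread 1000 hypothetical document IDs over 5 primary shards.
placement = {}
for doc_id in (f"doc-{i}" for i in range(1000)):
    placement.setdefault(pick_shard(doc_id, 5), []).append(doc_id)

for shard in sorted(placement):
    print(shard, len(placement[shard]))
```

Because the shard number depends on the primary shard count, changing that count would send existing IDs to different shards. This is one intuition for why the number of primary shards is normally fixed when the index is created.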

In a network/cloud environment where failures can be expected at any time, it is very useful and highly recommended to have a failover mechanism in case a shard/node goes offline or disappears for whatever reason. To this end, Elasticsearch allows you to make one or more copies of your index’s shards into what are called replica shards, or replicas for short.

Replication is important for two primary reasons:

  • It provides high availability in case a shard/node fails. For this reason, it is important to note that a replica shard is never allocated on the same node as the original/primary shard that it was copied from.
  • It allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.

To summarize, each index can be split into multiple shards. An index can also be replicated zero (meaning no replicas) or more times. Once replicated, each index will have primary shards (the original shards that were replicated from) and replica shards (the copies of the primary shards).
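
The rule that a replica never shares a node with its primary can be illustrated with a toy allocator. This is a minimal sketch under simplifying assumptions (least-loaded placement, no awareness of disk space or other allocation settings). The allocate function and the node names are hypothetical, and this is not Elasticsearch's real allocation algorithm.

```python
def allocate(nodes, primaries, replicas):
    """Toy allocator; returns (assignments per node, unassigned copies)."""
    assignments = {node: [] for node in nodes}
    unassigned = []
    for shard in range(primaries):
        used = set()  # nodes already holding a copy of this shard
        for copy in range(1 + replicas):
            role = "primary" if copy == 0 else "replica"
            # A copy may only go to a node without another copy of the same shard.
            candidates = [n for n in nodes if n not in used]
            if not candidates:
                unassigned.append((shard, role))
                continue
            # Prefer the node currently holding the fewest shard copies.
            node = min(candidates, key=lambda n: len(assignments[n]))
            assignments[node].append((shard, role))
            used.add(node)
    return assignments, unassigned


one_node, pending = allocate(["node-1"], 5, 1)
two_nodes, pending2 = allocate(["node-a", "node-b"], 5, 1)
print(len(pending), len(pending2))
```

With a single node the five replicas have nowhere to go and stay unassigned. With two nodes every shard gets a primary on one node and a replica on the other.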

The number of shards and replicas can be defined per index at the time the index is created. After the index is created, you can still change the number of replicas dynamically at any time. You can change the number of shards for an existing index using the _shrink and _split APIs; however, this is not a trivial task, and planning for the correct number of shards in advance is the optimal approach.
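
As a sketch of what defining shards and replicas at creation time looks like, the code below builds a create-index settings body and checks candidate shard counts for _shrink and _split. The number_of_shards and number_of_replicas setting names are real. The helper functions are hypothetical, and the factor/multiple constraints are stated here as an assumption based on the documentation for those APIs. Other prerequisites of _shrink and _split are not modeled.

```python
import json


def create_index_body(number_of_shards, number_of_replicas):
    """Settings body for creating an index (hypothetical helper)."""
    return {"settings": {"index": {
        "number_of_shards": number_of_shards,
        "number_of_replicas": number_of_replicas,
    }}}


def shrink_targets(current_shards):
    """Assumed rule: _shrink can only target a proper factor of the current count."""
    return [n for n in range(1, current_shards) if current_shards % n == 0]


def is_valid_split_target(current_shards, target_shards):
    """Assumed rule: _split needs a larger multiple of the current count."""
    return target_shards > current_shards and target_shards % current_shards == 0


body = create_index_body(5, 1)
print(json.dumps(body, indent=2))
print(shrink_targets(8), is_valid_split_target(5, 10))
```

An index with a prime number of shards, such as 5, can only be shrunk to a single shard under this rule, which is one reason to plan the shard count up front.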

By default in 6.x, each index in Elasticsearch is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
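
The arithmetic behind that total is simply primaries multiplied by one plus the number of replicas. The total_shards function below is a hypothetical helper that spells it out.

```python
def total_shards(primaries, replicas):
    """Total shard copies for one index: each primary plus its replicas."""
    return primaries * (1 + replicas)


default_total = total_shards(5, 1)      # the 6.x defaults described above
no_replica_total = total_shards(5, 0)   # same index with replicas disabled
print(default_total, no_replica_total)
```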