solrCloud架構

阿新 • • 發佈：2019-01-21

本文為調研分散式檢索系統的筆記整理，之前調研sphinx和coreseek的時候，發現國內的部落格，還是講怎麼配置怎麼安裝多，原理性的東西並不多。本文為：官網文件閱讀筆記（有些會附帶上文件英文原文，如果讀者覺得我哪個地方說的不清楚，可以看英文）+各種原理部落格+各種原始碼分析部落格。

內容涉及：SolrCloud的基礎知識，架構，索引建立更新、查詢，故障恢復，負載均衡，leader選舉等的原理。文章比較長，可以根據標題選需要的看。

一、SolrCloud與Solr，lucene關係

1、 solr與luence的關係

Many people new to Lucene and Solr will ask the obvious question: Should I use Lucene or Solr?

The answer is simple: if you're asking yourself this question, in 99% of situations, what you want to use is Solr.

A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can't drive an engine, but you can drive a car. Similarly, Lucene is a programmatic library which you can't use as-is, whereas Solr is a complete application which you can use out-of-box.

網上有這樣的比喻:

(1) lucene是資料庫的話，solr就是jdbc

(2) lucene是jar，solr就是一個引用這些jar來寫的搜尋客戶端。Solr是一個可以直接用的應用，而lucene只是一些程式設計用的庫。

2、 Solr與SolrCloud

SolrCloud是Solr4.0版本開發出的具有開創意義的基於Solr和Zookeeper的分散式搜尋方案，或者可以說，SolrCloud是Solr的一種部署方式。Solr可以以多種方式部署，例如單機方式，多機Master-Slaver方式，這些方式部署的Solr不具有SolrCloud的特色功能。

二、SolrCloud配置

有兩種方式：

（1）把solr作為web應用部署到Tomcat容器，然後Tomcat與zookeeper關聯

（2）從solr 5開始， solr自身有jetty容器，不必部署到Tomcat，這樣會容易一些。

我用的第二種，推薦教程：

三、SolrCloud基礎知識

概念

· Collection：在SolrCloud叢集中邏輯意義上的完整的索引。它常常被劃分為一個或多個Shard，它們使用相同的Config Set。如果Shard數超過一個，它就是分散式索引，SolrCloud讓你通過Collection名稱引用它，而不需要關心分散式檢索時需要使用的和Shard相關引數。

· ConfigSet: Solr Core提供服務必須的一組配置檔案。每個config set有一個名字。最小需要包括solrconfig.xml(SolrConfigXml)和schema.xml (SchemaXml)，除此之外，依據這兩個檔案的配置內容，可能還需要包含其它檔案。它儲存在Zookeeper中。Config sets可以重新上傳或者使用upconfig命令更新，使用Solr的啟動引數bootstrap_confdir指定可以初始化或更新它。

· Core:也就是Solr Core，一個Solr中包含一個或者多個Solr Core，每個Solr Core可以獨立提供索引和查詢功能，每個Solr Core對應一個索引或者Collection的Shard，Solr Core的提出是為了增加管理靈活性和共用資源。在SolrCloud中有個不同點是它使用的配置是在Zookeeper中的，傳統的Solr core的配置檔案是在磁碟上的配置目錄中。

· Leader:贏得選舉的Shard replicas。每個Shard有多個Replicas，這幾個Replicas需要選舉來確定一個Leader。選舉可以發生在任何時間，但是通常他們僅在某個Solr例項發生故障時才會觸發。當索引documents時，SolrCloud會傳遞它們到此Shard對應的leader，leader再分發它們到全部Shard的replicas。

· Replica: Shard的一個拷貝。每個Replica存在於Solr的一個Core中。一個命名為“test”的collection以numShards=1建立，並且指定replicationFactor設定為2，這會產生2個replicas，也就是對應會有2個Core，每個在不同的機器或者Solr例項。一個會被命名為test_shard1_replica1，另一個命名為test_shard1_replica2。它們中的一個會被選舉為Leader。

· Shard: Collection的邏輯分片。每個Shard被化成一個或者多個replicas，通過選舉確定哪個是Leader。

· Zookeeper: Zookeeper提供分散式鎖功能，對SolrCloud是必須的。它處理Leader選舉。Solr可以以內嵌的Zookeeper執行，但是建議用獨立的，並且最好有3個以上的主機。

· SolrCor：單機執行時，單獨的索引叫做SolrCor，如果建立多個索引，可以建立的多個SolrCore。

· 索引：一個索引可以在不同的Solr服務上，也就是一個索引可以由不同機器上的SolrCore組成。不同機器上的SolrCore組成邏輯上的索引，這樣的一個索引叫做Collection，組成Collection的SolrCore包括資料索引和備份。

· SolrCloud collection shard關係：一個SolrCloud包含多個collection，collection可以被分為多個shards，每個shard可以有多個備份（replicas），這些備份通過選舉產生一個leader，

· Optimization：是一個程序，壓縮索引和歸併segment，Optimization只會在master node執行，

· Leader 和 replica

（1） leader負責保證replica和leader是一樣的最新的資訊

（2）replica 被分到 shard 按照這樣的順序：在叢集中他們第一次啟動的次序輪詢加入，

除非新的結點被人工地使用shardId引數分派到shard。

人工指定replica屬於哪個shard的方法：

· This parameter is used as a system property, as in -DshardId=1, the value of which is the ID number of theshard the new node should be attached to.

以後重啟的時候，每個node加入之前它第一次被指派的shard。 Node如果以前是replica，當以前的leader不存在的時候，會成為leader。

架構

· 索引（collection）的邏輯圖

· 索引和Solr實體對照圖

一個經典的例子：

SolrCloud是基於Solr和Zookeeper的分散式搜尋方案，是正在開發中的Solr4.0的核心元件之一，它的主要思想是使用Zookeeper作為叢集的配置資訊中心。它有幾個特色功能：1）集中式的配置資訊 2）自動容錯 3）近實時搜尋 4）查詢時自動負載均衡

基本可以用上面這幅圖來概述，這是一個擁有4個Solr節點的叢集，索引分佈在兩個Shard裡面，每個Shard包含兩個Solr節點，一個是Leader節點，一個是Replica節點，此外叢集中有一個負責維護叢集狀態資訊的Overseer節點，它是一個總控制器。叢集的所有狀態資訊都放在Zookeeper叢集中統一維護。從圖中還可以看到，任何一個節點都可以接收索引更新的請求，然後再將這個請求轉發到文件所應該屬於的那個Shard的Leader節點，Leader節點更新結束完成，最後將版本號和文件轉發給同屬於一個Shard的replicas節點。

幾種角色

（1）zookeeper：下面的資訊是zookeeper上儲存的，zookeeper目錄稱為znode

solr在zookeeper中的結點

1、aliases.json 對colletion別名，另有妙用（solrcloud的build search分離），以後再寫部落格說明。

2、clusterstate.json 重要資訊檔案。包含了colletion ,shard replica的具體描述資訊。

3、live_nodes ，下面都是瞬時的zk結點，代表當前存活的solrcloud中的節點。

4、overseer， solrcloud中的重要角色。下面存有三個重要的分散式佇列，代表待執行solrcloud相關的zookeeper操作的任務佇列。collection-queue-work是存放與collection相關的特辦操作，如createcollection ，reloadcollection，createalias，deletealias ，splitshard 等。

5、queue則存放了所有與collection無關的操作，例如deletecore，removecollection，removeshard，leader，createshard，updateshardstate，還有包括節點的狀態（down、active、recovering）的變化。

6、queue-work是一個臨時佇列，指正在處理中的訊息。操作會先儲存到/overseer/queue，在overseser進行處理時，被移到/overseer/queue-work中，處理完後訊息之後在從/overseer/queue-work中刪除。如果overseer中途掛了，新選舉的overseer會選將/overseer/queue-work中的操作執行完，再去處理/overseer/queue中的操作。

注意：以上佇列中存放的所有子結點，都是PERSISTENT_SEQUENTIAL型別的。

7、overseer_elect ,用於overseer的選舉工作

8、colletcion，存放當前collection一些簡單資訊（主要資訊都在clusterstate.json中）。下面的leader_elect自然是用於collection中shard中副本集的leader選舉的。

clusterstate.json

（2）overseer： overseer是經常被忽略的角色，實際上，我測試過，每次加入一臺新的機器的時候，一方面，SolrCloud會多一個Solr，另一方面，會多一個oveseer（當然可能不會起到作用）。整個SolrCloud只有一個overseer會起到作用，所有的overseer經過選舉產生overseer。 Overseer和shard的leader選舉方式一樣，詳見後面leader選舉部分。

Overseer 的zk寫流程

在看solrcloud的官方文件的時候，幾乎也很少有overseer的這個角色的說明介紹。相信不少成功配置solrcloud的開發者，也沒有意識到這個角色的存在。

Overseer，顧名思義，是一個照看全域性的角色，做總控工作。體現在程式碼與zk的相關操作中，就是zookeeper中大多的寫操作，是由overseer去處理的，並且維護好clusterstate.josn與aliases.json這兩個zk結點的內容。與我們“誰建立，誰修改”做法不同。由各個solr node發起的操作，都會publish到/overseer結點下面相應的queue中去，再由overseer去些分散式佇列中去取這些操作資訊，做相應的zk修改，並將整個solrcloud中相關的具體狀態資訊，更新到cluseterstate.json中去，最終會將個操作，從queue中刪除，表示完成操作。

以一個solr node將自身狀態標記為down為例。該node會將這種“state”operation的相關資訊，publish到/overseer/queue中。由Overseer去從中取得這個操作，然後將node state為down的資訊寫入clusterstate.json。最後刪除queue中的這個結點。

當然overseer這個角色，是利用zookeeper在solrcloud中內部選舉出來的。

一般的zk讀操作

Solr將最重要且資訊最全面的內容都放在了cluseterstate.json中。這樣做減少了，普通solr node需要關注的zk 結點數。除了clusterstate.json，普通的solr node在需要當前collection整體狀態的時候，還會獲取zk的/live_nodes中的資訊，根據live_nodes中的資訊，得知collection存活的node, 再從clusterstate.json獲得這些node的資訊。

這種處理，其實也好理解。假如一個solr node非正常下線，clusterstate.json中不一定會有變化，但/live_nodes中這個node對應的zk結點就消失了（因為是瞬時的）。

二、建立索引的過程（索引更新）

具體過程：

1、細節問題：

（1）下面所說的SolrJ是客戶端，CloudSolrServcer是SolrCloud的機器

（2）文件的ＩＤ

　根據我看原始碼，文件ＩＤ生成，一是可以自己配置，二是可以使用預設配置，這時候文件ＩＤ是使用ｊａｖａ的ＵＵＩＤ生成器（這個ＩＤ生成器可以生成全球唯一的ＩＤ）

（3） watch (可以看Zookeeper相關的文件

http://www.cnblogs.com/clara/archive/2013/06/10/3130922.html)

客戶端（watcher）在zookeeper的資料上設定watch， watch是一次性的trigger，當資料改變的時候的時候，會觸發，watch被髮送資訊到設定這個Watch的客戶端。

ZooKeeper Watches

Here is ZooKeeper's definition of a watch: a watch event is one-time trigger, sent to the client that set the watch, which occurs when the data for which the watch was set changes. There are three key points to consider in this definition of a watch:

· One-time trigger

One watch event will be sent to the client when the data has changed. For example, if a client does a getData("/znode1", true) and later the data for /znode1 is changed or deleted, the client will get a watch event for /znode1. If /znode1 changes again, no watch event will be sent unless the client has done another read that sets a new watch.

· Sent to the client

This implies that an event is on the way to the client, but may not reach the client before the successful return code to the change operation reaches the client that initiated the change. Watches are sent asynchronously to watchers. ZooKeeper provides an ordering guarantee: a client will never see a change for which it has set a watch until it first sees the watch event.

· The data for which the watch was set

This refers to the different ways a node can change. It helps to think of ZooKeeper as maintaining two lists of watches: data watches and child watches. getData() and exists() set data watches. getChildren() sets child watches. Alternatively, it may help to think of watches being set according to the kind of data returned. getData() and exists() return information about the data of the node, whereas getChildren() returns a list of children. Thus, setData() will trigger data watches for the znode being set (assuming the set is successful). A successful create() will trigger a data watch for the znode being created and a child watch for the parent znode. A successful delete() will trigger both a data watch and a child watch (since there can be no more children) for a znode being deleted as well as a child watch for the parent znode.

Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected.

(4) lucene index是一個目錄，索引的insert或者刪除保持不變，文件總是被插入新建立的file，被刪除的文件不會真的從file中刪除，而是隻是打上tag，知道index被優化。update就是增加和刪除組合。

（5）文件路由

Solr 4.1 added the ability to co-locate documents toimprove query performance.

Solr4.1添加了文件聚類（譯註：此處翻譯準確性需要權衡,意思是將文件歸類在一起的意思）的功能來提升查詢效能。

Solr 4.5 has added the ability to specify the routerimplementation with the router.name parameter. If you use the"compositeId" router, you can send documents with a prefix in thedocument ID which will be used to calculate the hash Solr uses to determine theshard a document is sent to for indexing. The prefix can be anything you'd likeit to be (it doesn't have to be the shard name, for example), but it must beconsistent so Solr behaves consistently. For example, if you wanted toco-locate documents for a customer, you could use the customer name or ID asthe prefix. If your customer is "IBM", for example, with a documentwith the ID "12345", you would insert the prefix into the document idfield: "IBM!12345". The exclamation mark ('!') is critical here, asit defines the shard to direct the document to.

Solr4.5添加了通過一個router.name引數來指定一個特定的路由器實現的功能。如果你使用“compositeId”路由器，你可以在要傳送到Solr進行索引的文件的ID前面新增一個字首，這個字首將會用來計算一個hash值，Solr使用這個hash值來確定文件傳送到哪個shard來進行索引。這個字首的值沒有任何限制（比如沒有必要是shard的名稱），但是它必須總是保持一致來保證Solr的執行結果一致。例如，你需要為不同的顧客聚類文件，你可能會使用顧客的名字或者是ID作為一個字首。比如你的顧客是“IBM”，如果你有一個文件的ID是“12345”，把字首插入到文件的id欄位中變成：“IBM!12345”，在這裡感嘆號是一個分割符號，這裡的“IBM”定義了這個文件會指向一個特定的shard。

Then at query time, you include the prefix(es) into yourquery with the _route_ parameter (i.e., q=solr&_route_=IBM!)to direct queries to specific shards. In some situations, this may improvequery performance because it overcomes network latency when querying all theshards.

然後在查詢的時候，你需要把這個字首包含到你的_route_引數裡面(比如：q=solr&_route_=IBM!)使查詢指向指定的shard。在某些情況下，這樣操作能提升查詢的效能，因為它省掉了需要在所有shard上查詢耗費的網路傳輸用時。

The _route_ parameterreplaces shard.keys, which has been deprecated and will be removed in afuture Solr release.

使用_route_代替shard.keys引數。shard.keys引數已經過時了，在Solr的未來版本中這個引數會被移除掉。

If you do not want to influence how documents are stored,you don't need to specify a prefix in your document ID.

如果你不想變動文件的儲存過程，那就不需要在文件的ID前面新增字首。

If you created the collection and defined the"implicit" router at the time of creation, you can additionallydefine a router.field parameter to use a field from each document toidentify a shard where the document belongs. If the field specified is missingin the document, however, the document will be rejected. You could also usethe _route_ parameter to name a specific shard.

如果你建立了collection並且在建立的時候指定了一個“implicit”路由器，你可以另外定義一個router.field引數，這個引數定義了通過使用文件中的一個欄位來確定文件是屬於哪個shard的。但是，如果在一個文件中你指定的欄位沒有值得話，這個文件Solr會拒絕處理。同時你也可以使用_route_引數來指定一個特定的shard。

我的理解：添加了這樣聚類的好處：查詢的時候，聲明瞭route=IBM，那麼就可以減少訪問的shard

router.name的值可以是 implicit 或者compositeId， 'implicit' 不自動路由文件到不同的shard，而是會按照你在indexingrequest中暗示的那樣。也就是建索引的時候，如果是implicit是需要自己指定文件到哪個shard的。比如：

curl "http://10.1.30.220:8081/solr/admin/collections?action=CREATE&name=paper&collection.configName=paper_conf&router.name=implicit&shards=shard1,shard2&createNodeSet=10.1.30.220:8081_solr,10.1.30.220:8084_solr"

２具體過程

新增文件的過程：

（1）當SolrJ傳送update請求給CloudSolrServer ，CloudSolrServer會連線至Zookeeper獲取當前SolrCloud的叢集狀態，並會在/clusterstate.json 和/live_nodes 註冊watcher，便於監視Zookeeper和SolrCloud，這樣做的好處有以下幾點：

　　CloudSolrServer獲取到SolrCloud的狀態後，它能跟直接將document發往SolrCloud的leader，降低網路轉發消耗。

　　註冊watcher有利於建索引時候的負載均衡，比如如果有個節點leader下線了，那麼CloudSolrServer會立馬得知，那它就會停止往下線leader傳送document。

　　（2）路由document至正確的shard。CloudSolrServer 在傳送document時候需要知道發往哪個shard，但是這裡需要注意，單個document的路由非常簡單，但是SolrCloud支援批量add，也就是說正常情況下N個document同時進行路由。這個時候SolrCloud就會根據document路由的去向分開存放document即進行分類（這裡應該是有聚類，官方文件的說法見前面說的），然後進行併發傳送至相應的shard，這就需要較高的併發能力。

　　（3）Leader接受到update請求後，先將update資訊存放到本地的update log，同時Leader還會給documrnt分配新的version，對於已存在的document，Leader就會驗證分配的新version與已有的version，如果新的版本高就會拋棄舊版本，最後傳送至replica。

　　（4）當只有一個Replica的時候，replica會進入recoveing狀態並持續一段時間等待leader重新上線，如果在這段時間內，leader沒有上線，replica會轉成leader並有一些文件損失。（我的理解：如果leader掛了，請求還會發送成功？？？？）

　　（5）最後的步驟就是commit了，commit有兩種，一種是softcommit，即在記憶體中生成segment，document是可見的(可查詢到)但是沒有寫入磁碟，斷電後資料丟失。另一種是hardcommit，直接將資料寫入磁碟且資料可見。前一種消耗較少，後一種消耗較大。

每commit一次，就會重新生成一個ulog更新日誌，當伺服器掛掉，記憶體資料丟失，就可以從ulog中恢復

三、查詢

NRT 近實時搜尋

SolrCloud支援近實時搜尋，所謂的近實時搜尋即在較短的時間內使得add的document可見可查，這主要基於softcommit機制(Lucene是沒有softcommit的，只有hardcommit)。

當進行SoftCommit時候，Solr會開啟新的Searcher從而使得新的document可見，同時Solr還會進行預熱快取以及查詢以使得快取的資料也是可見的。所以必須保證預熱快取以及預熱查詢的執行時間必須短於commit的頻率，否則就會由於開啟太多的searcher而造成commit失敗。

最後說說在工作中近實時搜尋的感受吧，近實時搜尋是相對的，對於有些客戶1分鐘就是近實時了，有些3分鐘就是近實時了。而對於Solr來說，softcommit越頻繁實時性更高，而softcommit越頻繁則Solr的負荷越大（commit越頻率越會生成小且多的segment，於是merge出現的更頻繁）。目前我們公司的softcommit頻率是3分鐘，之前設定過1分鐘而使得Solr在Index所佔資源過多大大影響了查詢。所以近實時蠻困擾著我們的，因為客戶會不停的要求你更加實時，目前公司採用加入快取機制來彌補這個實時性。

四、ShardSplit

有以下幾個配置引數：

· path，path是指core0索引最後切分好後存放的路徑，它支援多個，比如cores?action=SPLIT&core=core0&path=path1&path=path2。

· targetCore，就是將core0索引切分好後放入targetCore中(targetCore必須已經建立)，它同樣支援多個，請注意path和targetCore兩個引數必須至少存在一個。

· split.key,根據該key進行切分，預設為unique_id.

· ranges,雜湊區間，預設按切分個數進行均分。

· 由此可見Core的Split api是較底層的藉口，它可以實現將一個core分成任意數量的索引(或者core)

五、負載均衡

查詢的負載均衡還是要自己做的。至於文件放到哪個shard，就是按照id做的，如果是配置route.name=implicit，那麼自己指定去哪個shard

六、故障恢復

1、故障恢復的情況

有幾種情況下回進行recovering :

（1）有下線的replica

當索引更新的時候，不會顧及下線的replica，當上線的時候會有recovery程序對他們進行回覆，如果轉發的replica出於recovering狀態，那麼這個replica會把update放入update transaction日誌。

（2）如果shard（我覺得）只有一個replica

當只有一個Replica的時候，replica會進入recoveing狀態並持續一段時間等待leader重新上線，如果在這段時間內，leader沒有上線，replica會轉成leader並有一些文件損失。（我的理解：如果leader掛了，請求還會發送成功？？？？）

（3）SolrCloud在進行update時候，由於某種原因leader轉發update至replica沒有成功，會迫使replica進行recoverying進行資料同步。

2、Recovery策略

就上面的第三種，講Recovery的策略：

（1）Peer sync，如果中斷的時間較短，recovering node只是丟失少量update請求，那麼它可以從leader的update log中獲取。這個臨界值是100個update請求，如果大於100，就會從leader進行完整的索引快照恢復。

（2）Snapshot replication，如果節點下線太久以至於不能從leader那進行同步，它就會使用solr的基於http進行索引的快照恢復

當你加入新的replica到shard中，它就會進行一個完整的index Snapshot。

3、兩種策略的具體過程

（1）整體的過程

solr 向所有的Replica傳送getVersion的請求，獲取最新的nupdate個version（預設100個），並排序。獲取本分片的100個version，

對比replica和replica的version，看是不是有交集：

a)有交集，就部分更新Peer sync (按document為單位)

b)沒有交集，說明差距太大，那麼就replication (以檔案為最小單位)

（2）replication的具體過程

（a）開始Replication的時候，首先進行一次commitOnLeader操作，即傳送commit命令到leader。它的作用就是將leader的update中的資料刷入到索引檔案中，使得快照snap完整。

（b）各種判斷之後，下載索引資料，進行replication

（c）replication的時候，shard狀態時recoverying，分片可以建索引但是不能查詢，同步的時候，新來的資料會進入ulog，但是這些資料從原始碼看不會進入索引檔案（原因是:

1. 因為一旦有新資料寫入舊索引檔案中，索引檔案傳送變化了，那麼下載好後的資料(索引檔案)就不好替換舊的索引檔案。

2. 在同步過程中，如果isFullCopyNeeded是false，那麼就會close indexwriter，既然關閉了indexwriter就無法寫入新的資料。而如果isFullCopyNeeded是true的話，因為整個index目錄替換，所以是否關閉indexeriter也沒啥意義。

3. 在recovery過程中，當同步replication結束後，會進行replay過程，該過程就是將ulog中的請求重新進行一遍。這樣就可以把之前錯過的都在寫入。

）。

3、容錯的其他方面

容錯

（1）讀

每個搜尋的請求都被一個collection的所有的shards執行，如果有些shard沒有返回結果，那麼查詢是失敗的。這時候根據配置 shards.tolerant 引數，如果是true，那麼部分結果會被返回。

（2）寫

每個節點的組織和內容的改變都會寫入Transaction log，日誌被用來決定節點中的哪些內容應該被包含在replica，當一個新的replica被建立的時候，通過leader和Transaction log去判斷哪些內容應該被包含。同時，這個log也可以用來恢復。TransactionLog由一個儲存了一系列的更新操作的記錄構成，它能增加索引操作的健壯性，因為只要某個節點在索引操作過程中意外中斷了，它可以重做所有未提交的更新操作。

假如一個leader節點宕機了，可能它已經把請求傳送到了一些replica節點但是卻沒有傳送到另一些卻沒有傳送，所以在一個新的leader節點在被選舉出來之前，它會依靠其他replica節點來執行一個同步處理操作。如果這個操作成功了的話，所有節點的資料就都保持一致了，然後leader節點把自己註冊為活動節點，普通的操作就會被處理。如果一個replica節點的資料脫離整體同步太多了的話，系統會請求執行一個全量的基於普通的replication同步恢復。

叢集的overseer會監測各個shard的leader節點，如果leader節點掛了，則會啟動自動的容錯機制，會從同一個shard中的其他replica節點集中重新選舉出一個leader節點，甚至如果overseer節點自己也掛了，同樣會自動在其他節點上啟用新的overseer節點，這樣就確保了叢集的高可用性。

七、選舉的策略

SolrCloud沒有master或slave. Leader被自動選舉，最初按照first-come-first-serve

zk的選舉方式：zk給每個伺服器一個id，新來的機器的id大於之前的。如果leader宕機，所有的應用看著當前最小的編號，然後看

每一個 follower 對 follower 叢集中對應的比自己節點序號小一號的節點（也就是所有序號比自己小的節點中的序號最大的節點）設定一個 watch 。只有當follower 所設定的 watch 被觸發的時候，它才進行 Leader 選舉操作，一般情況下它將成為叢集中的下一個 Leader。很明顯，此 Leader 選舉操作的速度是很快的。因為，每一次 Leader 選舉幾乎只涉及單個 follower 的操作。

（3）如果有新的node啟動，會被分配到replicas最少的shard，如果有tie，被指派到最小的shard ID 那個shard裡面。

（4）我自己試驗過，如果結點關閉在開啟，那麼leader的id會增大而不是用原來的。

原文：

Leader Election

A simple way of doing leader election with ZooKeeper is to use the SEQUENCE|EPHEMERAL flags when creating znodes that represent "proposals" of clients. The idea is to have a znode, say "/election", such that each znode creates a child znode "/election/guid-n_" with both flags SEQUENCE|EPHEMERAL. With the sequence flag, ZooKeeper automatically appends a sequence number that is greater than any one previously appended to a child of "/election". The process that created the znode with the smallest appended sequence number is the leader.

That's not all, though. It is important to watch for failures of the leader, so that a new client arises as the new leader in the case the current leader fails. A trivial solution is to have all application processes watching upon the current smallest znode, and checking if they are the new leader when the smallest znode goes away (note that the smallest znode will go away if the leader fails because the node is ephemeral). But this causes a herd effect: upon of failure of the current leader, all other processes receive a notification, and execute getChildren on "/election" to obtain the current list of children of "/election". If the number of clients is large, it causes a spike on the number of operations that ZooKeeper servers have to process. To avoid the herd effect, it is sufficient to watch for the next znode down on the sequence of znodes. If a client receives a notification that the znode it is watching is gone, then it becomes the new leader in the case that there is no smaller znode. Note that this avoids the herd effect by not having all clients watching the same znode.

Here's the pseudo code:

Let ELECTION be a path of choice of the application. To volunteer to be a leader:

1. Create znode z with path "ELECTION/guid-n_" with both SEQUENCE and EPHEMERAL flags;

2. Let C be the children of "ELECTION", and i be the sequence number of z;

3. Watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Upon receiving a notification of znode deletion:

1. Let C be the new set of children of ELECTION;

2. If z is the smallest node in C, then execute leader procedure;

3. Otherwise, watch for changes on "ELECTION/guid-n_j", where j is the largest sequence number such that j < i and n_j is a znode in C;

Notes:

· Note that the znode having no preceding znode on the list of children does not imply that the creator of this znode is aware that it is the current leader. Applications may consider creating a separate znode to acknowledge that the leader has executed the leader procedure.

· See the note for Locks on how to use the guid in the node.

參考文獻（可能不太全）

四篇講原理的部落格

一大坨原始碼分析的部落格

10、 SolrCloud的一個翻譯

solrCloud架構

一、SolrCloud與Solr，lucene關係

二、SolrCloud配置

概念

架構

二、建立索引的過程（索引更新）

ZooKeeper Watches

（5）文件路由

三、查詢

NRT 近實時搜尋

四、ShardSplit

五、負載均衡

六、故障恢復

七、選舉的策略

Leader Election

solrCloud架構

iOS分層架構設計

數據驅動安全架構升級---“花瓶”模型迎來V5.0(一)

《大型網站技術架構：核心原理與案例分析》-- 讀書筆記 (5) ：網購秒殺系統

JEESZ分布式架構平臺介紹

說說大型高並發高負載網站的系統架構(轉載)

高級系統架構師培訓要點：減少資源消耗，靠虛擬代理方案解決了！

免費架構之ADF12C essentials+MYSQL5.5.40+GLASSFISH4.1

SpringMVC流程架構基礎理論

MySQL性能管理及架構設計 --- 理論篇

微服務架構與實踐及雲原生等相關概念

jQuery源碼解析（架構與依賴模塊）

[轉]畢設- 深入HBase架構解析（一）

[轉]畢設- 深入HBase架構解析（二）

系統架構培訓：矩陣，封裝，一個案例教你激發客戶潛藏的需求！

MySQL 高可用集群架構 MHA

Apache Shiro 使用手冊（一）Shiro架構介紹

搭建JEESZ分布式架構3--CentOs下安裝MySQL（環境準備）

MySQL之高可用架構—MHA

PowerBI更新 - 解決方案架構（一圖勝萬字！）

solrCloud架構

一、SolrCloud與Solr，lucene關係

二、SolrCloud配置

概念

架構

二、 建立索引的過程（索引更新）

ZooKeeper Watches

（5）文件路由

三、查詢

NRT 近實時搜尋

四、ShardSplit

五、負載均衡

六、故障恢復

七、選舉的策略

Leader Election

相關推薦

二、建立索引的過程（索引更新）