Reposted from: http://www.cnblogs.com/sunddenly/articles/4073157.html

The Zab Protocol

 

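1. ZooKeeper Overview

  ZooKeeper keeps an in-memory database internally, organized as a tree. Each node of the tree is called a znode (the code lives in DataTree.java and DataNode.java).

  A client can connect to any server in the ZooKeeper ensemble.

  A read request is answered directly from the local znode data. A write is converted into a transaction and forwarded to the ensemble's Leader. By committing transactions, ZooKeeper guarantees that every write (update) is applied consistently on all machines in the ensemble.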
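
  The read/write split described above can be sketched in a few lines. This is only a toy model: the `Server` class, its fields, and the flat `znodes` dictionary are illustrative assumptions, not ZooKeeper's real classes, and the leader here simply copies the write to every replica instead of running Zab.

```python
class Server:
    """Toy ZooKeeper server: serves reads locally, forwards writes to the leader."""

    def __init__(self, name, leader=None):
        self.name = name
        # Hypothetical convention: a server created without a leader is the leader.
        self.leader = leader if leader is not None else self
        self.followers = []
        self.znodes = {"/": b""}  # path -> data, a flat stand-in for the znode tree

    def read(self, path):
        # Reads never leave the local replica.
        return self.znodes.get(path)

    def write(self, path, data):
        if self.leader is not self:
            # A follower does not apply the write itself; it hands it to the leader.
            return self.leader.write(path, data)
        # Leader: apply the transaction on every replica (the real system runs Zab here).
        for server in [self] + self.followers:
            server.znodes[path] = data
        return True


leader = Server("s1")
follower = Server("s2", leader=leader)
leader.followers.append(follower)

follower.write("/config", b"v1")  # forwarded to s1, then replicated to s2
```

  After the call, both `leader.read("/config")` and `follower.read("/config")` return the same bytes, which is the consistency the paragraph above is describing.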

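2. Introduction to the Zab Protocol

  ZooKeeper uses a protocol called Zab (ZooKeeper Atomic Broadcast) as the core of its consistent replication. According to its authors it is a new algorithm, designed around Yahoo's specific requirements: high throughput, low latency, robustness, and simplicity, without placing heavy demands on scalability.
  (1) The ZooKeeper implementation consists of a Client and a Server
    ① The Server side provides a consistent replication and storage service;
    ② The Client side provides higher-level semantics such as distributed locks, election algorithms, and distributed mutual exclusion;
  (2) In terms of what is stored, the Server mostly keeps the state of data rather than the data content itself, so ZooKeeper can be used as a small file system. The amount of state is relatively small and can be loaded into memory in full, which removes most of the communication latency.
  (3) A Server can crash and restart. For fault tolerance it must "remember" its previous data state, so the data has to be persisted. Under high throughput, however, disk I/O becomes the system bottleneck; the remedy is to use a cache and turn random writes into sequential writes. ☆☆☆
  (4) Safety properties
  Because ZooKeeper mainly operates on the state of data, it defines two safety properties to keep that state consistent:
    ① Total order: if message a is sent before message b, all Servers should see the same result
    ② Causal order: if message a happens before message b (a causes b) and both are sent, a is always executed before b. ☆☆
  (5) Safety guarantees
  To guarantee these two properties, ZooKeeper relies on TCP and a Leader.
    ① TCP guarantees the total order of messages (first sent, first received)
    ② The Leader solves the causal-order problem: whatever reaches the Leader first is executed first.
  Having a Leader turns ZooKeeper's architecture into a Master-Slave model. In this model the Master (Leader) can crash, so ZooKeeper adds a Leader election algorithm to keep the system robust.
  (6) ZooKeeper's work as a whole has two phases
    ① Atomic Broadcast
    ② Leader election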
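  (7) Zab properties

  The protocol ZooKeeper uses to commit transactions is not Paxos but ZAB, which is adapted from two-phase commit. Zab satisfies the following properties:

    ① Reliable delivery: if a message m is delivered by one server, m will eventually be delivered by all servers.
    ② Total order: if a server delivers a before b, then every server that delivers both a and b delivers a before b.
    ③ Causal order: for two delivered messages a and b, if a causally precedes b, then a is delivered before b.

  Causal precedence in the third property means either that the same sender sent a before b, or that a was sent by the previous leader and b by the current leader.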
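
  The total order property can be stated as a check over per-server delivery logs. The sketch below is an illustration only; the function name `violates_total_order` and the representation of a log as a list of message names are assumptions made for this example.

```python
from itertools import combinations


def violates_total_order(logs):
    """Return True if two servers delivered some pair of messages in opposite orders.

    `logs` is a list of delivery logs, one per server; each log is a list of
    distinct message names in the order that server delivered them.
    """
    for log_a, log_b in combinations(logs, 2):
        pos_b = {msg: i for i, msg in enumerate(log_b)}
        # Messages delivered by both servers, kept in log_a's delivery order.
        common = [msg for msg in log_a if msg in pos_b]
        for x, y in combinations(common, 2):
            # x comes before y in log_a by construction; log_b must agree.
            if pos_b[x] > pos_b[y]:
                return True
    return False
```

  A server that has delivered only a prefix, or that has not delivered some message at all, does not violate the property; only two servers that disagree on the relative order of messages they have both delivered do.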

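2.1 Zab Working Modes

In the Zab protocol a Server is in one of two modes: broadcast mode or recovery mode.

(1) Recovery mode

  Before a Leader starts broadcasting, it must have a quorum of followers that have been synchronized and are up to date.
  A Server that comes back online while a Leader is serving enters recovery mode and synchronizes with that Leader.

(2) Broadcast mode

  Broadcast mode uses two-phase commit in a simplified form that needs no abort. A follower either ACKs or abandons the Leader, because ZooKeeper guarantees that there is only one Leader at a time. The Leader also does not have to wait for an ACK from every Server; a reply from a quorum is enough.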


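 When a Follower receives a proposal, it writes it to disk (batching where possible) and returns an ACK.

 Once the Leader has received ACKs from a majority, it broadcasts a COMMIT message and delivers the message itself.

 When a Follower receives the COMMIT, it delivers the message.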
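
 The propose / ACK / COMMIT exchange can be simulated in memory. This is a minimal sketch under stated assumptions: the `Follower` class, the `broadcast` function, and the `reachable` parameter (the indices of followers that receive the proposal) are invented for the example, the leader counts its own vote toward the quorum, and disk writes and the network are replaced by plain list operations.

```python
class Follower:
    def __init__(self):
        self.log = []        # proposals "written to disk", in arrival order
        self.delivered = []  # transactions delivered to the application

    def on_proposal(self, zxid, txn):
        self.log.append((zxid, txn))  # a real server would batch and sync to disk
        return True                   # the ACK

    def on_commit(self, zxid):
        for logged_zxid, txn in self.log:
            if logged_zxid == zxid:
                self.delivered.append(txn)


def broadcast(leader_delivered, followers, zxid, txn, reachable):
    """Propose txn and commit it once a quorum of the ensemble has acknowledged."""
    ensemble_size = len(followers) + 1
    acks = 1  # assumption: the leader counts as one acknowledgement
    acked = []
    for i, follower in enumerate(followers):
        if i in reachable and follower.on_proposal(zxid, txn):
            acks += 1
            acked.append(follower)
    if acks <= ensemble_size // 2:
        return False  # no quorum, so nothing is committed (and nothing is aborted)
    leader_delivered.append(txn)  # the leader delivers the message itself
    for follower in acked:
        follower.on_commit(zxid)
    return True
```

 With four followers (an ensemble of five), a proposal that reaches two followers gathers three acknowledgements and commits, while one that reaches a single follower does not.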

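(4) Problems

This simplified two-phase commit cannot cope with Leader failure, which is why recovery mode was added. When the Leader changes, two problems have to be solved.
  ① Never forget delivered messages
      The Leader crashes before the COMMIT reaches any follower, so only the Leader itself has committed. The new Leader must make sure this transaction is committed as well.
  ② Let go of messages that are skipped
      The Leader generates a proposal but crashes before any follower sees it. When that server recovers, it must discard the proposal.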

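(5) Solution
  ① Before proposing any new message, the new Leader must make sure that every message in its transaction log has been proposed and committed. So that a follower sees all proposals and all delivered messages, the Leader sends it the PROPOSALs it has not seen, together with COMMITs for everything up to the ID of the last committed message.
    Because proposals are stored in the follower's transaction log in a guaranteed order, the order of the COMMITs is determined too. This solves the first problem.
  ② When the old Leader that never got its proposal out restarts, the new Leader tells it to truncate its transaction log, back to the last commit position of the follower's epoch.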
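
  Both recovery rules amount to making a returning server's log equal to the new Leader's log. The sketch below shows one simple way to express that, assuming logs are lists of `(zxid, txn)` pairs and a zxid is an `(epoch, counter)` tuple; the function name `sync_follower` is hypothetical and this is not ZooKeeper's actual synchronization code.

```python
def sync_follower(leader_log, follower_log):
    """Make follower_log identical to leader_log, in place.

    Returns (truncated, missing): the entries dropped from the follower
    (skipped proposals) and the entries the leader had to send it.
    """
    # Length of the common prefix shared by both logs.
    common = 0
    while (common < len(leader_log) and common < len(follower_log)
           and leader_log[common] == follower_log[common]):
        common += 1
    truncated = follower_log[common:]  # proposals nobody else ever saw
    missing = leader_log[common:]      # proposals the follower has not seen
    follower_log[common:] = missing
    return truncated, missing


new_leader_log = [((1, 1), "a"), ((1, 2), "b"), ((2, 1), "c")]
old_leader_log = [((1, 1), "a"), ((1, 2), "b"), ((1, 3), "skipped")]
truncated, missing = sync_follower(new_leader_log, old_leader_log)
```

  In this run the restarted old leader drops its unseen proposal `((1, 3), "skipped")` and receives `((2, 1), "c")` from the new leader's epoch.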

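3. Zab and Paxos

3.1 The Paxos Bottleneck

  ZooKeeper's main job is to maintain a highly available, consistent database whose contents are replicated across several nodes. Out of 2f+1 nodes in total, the system stays available as long as no more than f fail. The core of this is ZAB, an atomic broadcast protocol. Put simply, an atomic broadcast protocol guarantees that every replica receives messages in the same order.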
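
  The 2f+1 arithmetic is easy to check directly. The helper names below are made up for this illustration.

```python
def quorum_size(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """Largest f such that the n - f surviving nodes still form a majority."""
    return (n - 1) // 2


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
```

  Note that 4 nodes tolerate no more failures than 3 do, which is why ensembles are usually sized as an odd number 2f+1.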
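  Paxos is so well known that my first question when reading about ZAB was why ZAB was needed at all, and what it offers over Paxos. The main point is that the consistency Paxos provides does not meet ZooKeeper's requirements. Here is an example.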

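  Suppose the initial leader of a Paxos system is P1. It crashes while initiating two transactions, <t1, v1> (the transaction with sequence number t1 writes the value v1) and <t2, v2>. A new leader P2 takes over and initiates the transaction <t1, v1'>. Then another new leader, P3, collects what has happened and arrives at the final execution sequence <t1, v1'> followed by <t2, v2>, that is, P2's t1 first and P1's t2 second.

Note: for the transaction with sequence number t1, Leader 2's value has overwritten Leader 1's.

  Why does such a sequence not meet ZooKeeper's needs? ZooKeeper is a tree, and many operations have to check a precondition before they can run. For example, P1's transaction t1 might create the node "/a" and t2 might create the node "/a/aa"; the child "/a/aa" can only be created after the parent "/a" exists. P2's transaction t1, on the other hand, might create "/b". The sequence P3 ends up with then creates "/b" first and "/a/aa" second, and because "/a" has not been created, creating "/a/aa" fails.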
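
  The failure in this example can be reproduced with a toy tree that enforces the parent-must-exist rule. The functions `create` and `replay` and the use of a set of paths as the tree are assumptions made for illustration.

```python
def create(tree, path):
    """Create `path` in `tree` (a set of existing paths); the parent must exist."""
    parent = path.rsplit("/", 1)[0] or "/"
    if parent not in tree:
        raise KeyError(f"parent {parent} does not exist")
    tree.add(path)


def replay(sequence):
    """Apply a sequence of create operations to an empty tree and return it."""
    tree = {"/"}
    for path in sequence:
        create(tree, path)
    return tree


replay(["/a", "/a/aa"])      # P1's intended order: succeeds
try:
    replay(["/b", "/a/aa"])  # the merged order: the second create fails
except KeyError as err:
    print("rejected:", err)
```

  The second replay is rejected because "/a/aa" is applied against a tree that contains only "/" and "/b".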

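3.2 Solution

  To guarantee this, ZAB must ensure that transactions initiated by the same leader are applied in order, and that a newly elected leader only starts initiating transactions after all transactions of the previous leader have been applied.

  Put simply, the core idea of ZAB is that at any moment exactly one node is the leader, and every update is initiated by the leader, which pushes it to all replicas (called followers). The update uses two-phase commit: as soon as a majority of nodes have prepared successfully, the leader tells them to commit. Each follower applies transactions in the order in which the leader asked it to prepare them. Because transactions handled by ZAB are never rolled back, ZAB's 2PC includes a small optimization: for several transactions the leader only needs to announce the commit of the one with the largest zxid, and each follower commits everything before it as well. ☆☆☆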
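
  The cumulative-commit optimization, as this article describes it, can be sketched as follows. The class `FollowerLog` and its method names are hypothetical, and zxids are plain integers here; this illustrates the idea rather than ZooKeeper's actual commit handling.

```python
class FollowerLog:
    def __init__(self):
        self.pending = []  # (zxid, txn) in the order the leader proposed them
        self.applied = []  # transactions applied so far, in order

    def propose(self, zxid, txn):
        self.pending.append((zxid, txn))

    def commit_up_to(self, zxid):
        """Apply every pending transaction whose zxid is <= the committed one."""
        while self.pending and self.pending[0][0] <= zxid:
            _, txn = self.pending.pop(0)
            self.applied.append(txn)


log = FollowerLog()
for zxid, txn in [(1, "t1"), (2, "t2"), (3, "t3")]:
    log.propose(zxid, txn)
log.commit_up_to(2)  # a single COMMIT covers both t1 and t2
```

  This is only safe because proposals are never rolled back and arrive in zxid order, so committing zxid 2 implies that zxid 1 was committed too.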
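  If no node ever failed, that would be all there is to ZAB. The difficulty lies in what to do when the leader fails or loses the support of a majority. The key points are:
    ① the leader and the followers detect failures through heartbeats;
    ② a node that detects a failure and tries to become the new leader must first win the support of a majority of nodes and then synchronize transactions from the most up-to-date node; only then does it formally become leader and start initiating transactions;
    ③ old and new leaders are told apart by an ever-increasing epoch. There are many more details, which I will not go into because I have not fully understood them myself; for the full story see the paper "Zab: High-performance broadcast for primary-backup systems".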
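
  One way an ever-increasing epoch can order leaders is to pack it into the transaction ID. To my understanding ZooKeeper's zxid is a 64-bit number with the epoch in the high 32 bits and a per-epoch counter in the low 32 bits; the article itself does not spell this out, so treat the layout below as an assumption, and the helper names are made up for the example.

```python
def make_zxid(epoch, counter):
    """Pack an epoch and a per-epoch counter into one comparable integer."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


def epoch_of(zxid):
    return zxid >> 32


def counter_of(zxid):
    return zxid & 0xFFFFFFFF


old = make_zxid(1, 500)  # a late transaction from the old leader
new = make_zxid(2, 1)    # the first transaction from the new leader
```

  With this layout a plain integer comparison is enough: any transaction from a newer epoch compares greater than every transaction from an older one, whatever their counters are.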
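  Besides guaranteeing order, ZAB also performs well: in tests on a gigabit network, a typical 5-node deployment reaches about 25,000 TPS with a response time of only about 6 ms.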

3.3 Zab and Paxos

  The authors of Zab consider Zab to be different from Paxos. The reason they did not adopt Paxos is that Paxos cannot guarantee total ordering:
  Because multiple leaders can propose a value for a given instance two problems arise.
  First, proposals can conflict. Paxos uses ballots to detect and resolve conflicting proposals. 
  Second, it is not enough to know that a given instance number has been committed, processes must also be able to figure out which value has been committed.
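  It is true that Paxos does not care about the logical order between requests and only considers a total order over the data, but few people use Paxos directly; it is almost always simplified and optimized first.
Paxos has several common simplified forms. One of them is that, when a Leader exists, the algorithm can be reduced to a single phase (Phase 2). A single-phase setup needs a robust Leader, so the focus shifts to Leader election. Taking Learners into account, a "learning" phase is also required, so Paxos can be reduced to two phases:
  • the former Phase 2
  • Learn
  If we further require that a majority must Learn successfully, the result is essentially the Zab protocol. Paxos puts its emphasis on controlling the election process and pays little attention to how decisions are learned; Zab fills exactly that gap.
  Someone once said that all distributed algorithms are simplified forms of Paxos. That is not absolutely true, but it does hold in many cases. I wonder whether the authors of Zab would agree.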