1. 程式人生 > >TAO: Facebook's Distributed Data Store for the Social Graph論文閱讀筆記

TAO: Facebook's Distributed Data Store for the Social Graph論文閱讀筆記

Several fundamental problems

在TAO之前,Facebook用的主要的快取系統就是Memcache,但是像Memcache這一類的lookaside cache(旁路快取系統)存在著一些問題:

  • Inefficient edge lists
    • 像Memcache這樣的key-value快取系統並不適合儲存edge lists,因為在Facebook龐大的社交網路圖中,對某個節點的所有edge lists查詢操作很常見,而往往改變其中的某一條邊,就會要求整個edge lists失效(update操作),這樣當用戶在進行查詢時,這個之前改變的edge lists就會要求重新被載入到快取中,那麼可想而知,如果這樣的edge lists很龐大的話,對後端資料庫的負載就會增大。
  • Distributed control logic
    • 根據上一篇文章中提到的,Memcache的分散式控制邏輯是執行在客戶端方面的,這種控制邏輯很難避免”thundering heads”這種情況發生,而TAO中是通過一些簡單的API把控制邏輯移到了快取中。
  • Expensive read-after-write consistency
    • 在Memcache中,如果本地region發起write操作的話, 需要先跨區域向Master region寫,同時在sql語句中嵌入key和遠端標記rk,然後在本地cache中設定遠端標記rk,當本地發生read miss時,查詢本地cache中是否有遠端標記rk,如果有,則向Master資料庫查詢,否則定位到slave region中(因為如果本地cache還快取著rk的話,說明本地資料庫的資料是髒資料,Master的資料還沒同步過來)。由於需要進行跨區域通訊,往往會帶來很大的延時,那麼是否能夠在保證一致性的前提下,能夠直接往本地cache寫呢?當然,TAO實現了這一點。

Goals for TAO

  • Provide a data store with a graph abstraction (vertexes and edges), not keys+values.
  • Optimize heavily for reads(99.8% read requests).
  • Explicitly favor efficiency and availability over consistency.
    • Slightly stale data is often okay (for Facebook).
    • Communication between data centers in different regions is expensive.

TAO Data Model

主要分為兩種,Objects and Associations

  • Objects(Nodes):Object id(64-bit integer), Object type(otype), data, in the form of key-value pairs.
  • Associations:Source Object id(id1), Association type(atype), Destination Object id(id2), 32-bit timestamp, data, in the form of key-value pairs.
  • 資料模型
  • Example:Encoding in TAO
    • Alice used her mobile phone to record her visit to a famous landmark with Bob. She ‘checked in’ to the Golden Gate Bridge and ‘tagged’ Bob to indicate that he is with her. Cathy added a comment that David has ‘liked’.
    • Example

Association queries in TAO

在Facebook中,一種很常見的查詢操作就是給定某個節點(Object)以及邊的型別(atype),返回與給定資訊相關聯的一個邊列表(edge list),比如上面的’check in’行為,為此,Facebook設計了Association List這種結構。

Association List的結構: (id1, atype) → [anew …aold],其中列表中的每個元素都按時間先後排序。

TAO’s queries on associations lists

  • Assoc_get(id1, atype, id2set, high?, low?)
  • Assoc_count(id1, atype)
  • Assoc_range(id1, atype, pos, limit)
  • Assoc_time_range(id1, atype, high, low, limit)

TAO Architecture

  • Storage Layer
    • Objects and Associations儲存在MySQL中,但是面對龐大的facebook社交網路資料,單臺MySQL伺服器難以儲存,因此需要進行資料分片。
      • Divide data into logical shards, every shard is contained in a logical database.
      • Each object id contains an embedded shard_id that identifies its hosting shard.
  • Caching Layer

    • 多臺cache server組成在一起,我們把它稱為一個tier,一個tier能夠服務facebook的所有請求。
    • cache server中主要快取三種資料:objects, association lists, and association counts。

    由於隨著這種業務的增長,需要不斷擴充伺服器的數量,但是一味地增加伺服器數量往往會出現例如hot spots這種問題,於是可以把cache層分為兩層,稱為leader和followers。

    • Leader and Followers
      • Leaders reading from and writing to MySQL, serialize concurrent writes that arrive from followers.
      • Followers forward read misses and writes to a leader.
      • Clients can only interact with followers.

Caching Consistency

由於cache層被分為兩層,follower的功能是處理讀請求以及轉發寫請求,當出現read miss時,才把請求轉發給leader,leader查詢MySQL資料庫,更新快取。但是leader對待不同的update請求操作時,做法是不一樣的,這裡分為兩種:

  • An object update
    • The leader forwards invalidation messages to each corresponding follower.
    • The follower issued the write is updated synchronously on reply from the leader
  • An association update
    • 對於association update,之前討論過,如果在Memcache中,失效一條邊往往需要使整個邊列表重新載入,這樣會大大增加延遲以及資料庫的負載,在這裡TAO是這麼做的:
      • the leader sends a refill message to notify followers。
      • if a follower has cached the association,Asks the leader to update the follower’s now-stale association list。

Scaling Geographically

  • Master-Slave
    A. The master leader sends read misses, writes, and consistency messages to the master database.
    B. Messages are delivered to the slave leader as the replication stream updates the slave database.
    C. Slave leader sends writes to the master leader.
    D. Read misses to the replica DB.

Consistency

  • Changeset
    • TAO synchronously updates the cache with locally written values by having the master leader return a changeset when the write is successful.
    • This changeset is propagated through the slave leader (if any) to the follower tier that originated the write query.
  • Version number(In the persistent store and cache)
    • The version number is incremented during each update.
    • The follower can safely invalidate its local copy of the data if the changeset indicates that its pre-update value was stale.

論文下載