
A Brief Look at the Messaging System Design of Kafka, a Real-Time Streaming Platform


What problems does Kafka solve?

The English excerpts below are from http://kafka.apache.org.
Kafka was designed to handle real-time throughput of large volumes of data.
We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.
It would have to have high-throughput to support high volume event streams such as real-time log aggregation.
It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.
Kafka introduced a new messaging mechanism (built around partitions) and a new consumption model.
We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
It also has to provide fault tolerance; compared with traditional messaging systems, Kafka behaves more like a database log.

Stream processing in Kafka

Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
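As a rough illustration (not part of the original article), here is a minimal Kafka Streams topology that simply copies records from one topic to another. The application id, broker address, and topic names are placeholders, and the StreamsBuilder API shown is the one from recent Kafka releases rather than the original 0.10.0.0 API.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PassthroughStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-passthrough");   // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read the "input-feed" topic and write every record unchanged to "derived-feed".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-feed");
        source.to("derived-feed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```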

How does Kafka implement load balancing?

The producer sends data directly to the broker that leads the target partition; any Kafka node can answer metadata questions such as which servers are alive and which broker is currently the leader for a given partition of a topic.
Load balancing can be done at random, or through a user-specified partitioning function.
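As one way to picture the "user-specified partitioning function", here is a sketch of a custom Partitioner for the Java producer; the class name is made up for this example, and Kafka's own default partitioner uses murmur2 hashing rather than the simple hash shown here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Sketch of a semantic partitioning function: records with the same key always map to the same partition.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // No key: spread records randomly across partitions (random load balancing).
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // Keyed record: hash the key bytes so all records for one key share a partition.
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}
}
```

A class like this would be plugged in through the producer's partitioner.class configuration property.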

What is Kafka's unit of data, and how is it transmitted?

Kafka's unit of data is the message, which is simply an array of bytes. A message may carry an optional piece of metadata, the key, which is used to route it to a specific partition. For efficiency, messages are written in batches; all messages in one batch go to the same topic and partition.
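A minimal sketch of a keyed send with the Java producer (the broker address, topic name, key, and value are placeholders): records with the same key are routed to the same partition, and the producer groups records for the same topic-partition into batches behind the scenes.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") decides the partition; the value is the message payload (a byte array on the wire).
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/home\"}"));
        }
    }
}
```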

How do consumers interpret the producer's messages?

Choose a schema that fits your business scenario, such as JSON or XML.
Serialization frameworks: Avro provides a compact serialization format and keeps the message schema separate from the message payload.
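As one illustration of keeping serialization pluggable, here is a sketch of a custom value serializer that writes JSON bytes with Jackson. Using Jackson is an assumption for this example, not something the article prescribes; an Avro setup would instead plug in an Avro serializer and keep the schema outside the payload.

```java
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Sketch: serialize any POJO to UTF-8 JSON bytes for the message payload.
public class JsonSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {}

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new SerializationException("JSON serialization failed", e);
        }
    }

    @Override
    public void close() {}
}
```

It would be registered through the producer's value.serializer property, just like StringSerializer in the earlier sketch.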

How is it guaranteed that a partition is consumed by only one consumer?

A consumer group guarantees that each partition is consumed by only one member of the group.
A consumer group can contain multiple consumers; the members read from different partitions and do not interfere with one another.
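A minimal consumer sketch (group id, topic, and broker address are placeholders): all consumers created with the same group.id form one consumer group, and Kafka assigns each partition of the subscribed topic to exactly one of them.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder broker
        props.put("group.id", "page-view-processors");         // members of the same group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {                                      // keep polling the partitions assigned to this member
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```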

How is asynchronous send implemented?

The Kafka producer buffers data in memory and sends it in batches: it packs no more than a bounded amount of data into one request, and waits no longer than a bounded amount of time before sending the batch.
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
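The "no more than a fixed size / no longer than a fixed latency" trade-off quoted above maps to producer configuration. The values below mirror the quote's "say 64k or 10 ms" and are illustrative, not recommendations; broker address and topic are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // accumulate up to ~64 KB per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);     // wait up to 10 ms for more records before sending

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: it buffers the record and returns immediately;
            // the callback fires once the batch containing this record has been acknowledged.
            producer.send(new ProducerRecord<>("metrics", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                    });
            producer.flush();
        }
    }
}
```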

Message consumption: push vs. pull?

Messages can be delivered in two ways: consumers pull data from the broker, or the broker pushes data to consumers. In Kafka, producers push data to the broker and consumers pull data from it; this avoids overwhelming a consumer whose processing rate falls behind the producer's.
The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a “long poll” waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
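On the consumer side, the "long poll" behaviour described in the quote corresponds to the fetch settings fetch.min.bytes and fetch.max.wait.ms. A sketch with illustrative values follows; broker, group, and topic names are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LongPollConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "log-readers");             // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Ask the broker to hold the fetch until at least 64 KB is available...
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);
        // ...but never longer than 500 ms, so latency stays bounded even on quiet topics.
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("app-logs"));
            consumer.poll(Duration.ofSeconds(1)).forEach(r -> System.out.println(r.value()));
        }
    }
}
```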

How do we track which data has been consumed?

Many messaging systems keep metadata on the broker to track which messages have been consumed. If the broker marks a message as consumed as soon as it is sent out and the consumer then fails to process it, the message is lost. Many systems therefore add an acknowledgement mechanism: messages are only marked as sent, not consumed, when they are sent out, and the broker waits for a specific acknowledgement from the consumer before recording the message as consumed. This creates new problems: if the consumer processes the message successfully but the acknowledgement never arrives, the message is consumed twice; and the broker now has to track multiple states for every message.
Kafka instead partitions the data, and each partition is consumed by exactly one consumer within a consumer group. The consumption position in a partition is therefore just a single integer, the offset. A consumer can also deliberately rewind to an older offset and re-consume the data.
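Because the position is just that single offset integer, rewinding to re-consume old data is only a seek. A hedged sketch follows; the topic, partition number, target offset, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "replay-demo");                // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");            // commit offsets explicitly

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 1000L);                         // rewind to an older offset and re-consume from there
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            consumer.commitSync();                            // record the new position as a single integer
        }
    }
}
```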