
A Brief Look at the Messaging System Design of Kafka, a Real-Time Streaming Platform


What problems does Kafka solve?

The English excerpts below are from http://kafka.apache.org.
Kafka was designed to handle real-time throughput of large volumes of data.
We designed Kafka to be able to act as a unified platform for handling all the real-time data feeds a large company might have. To do this we had to think through a fairly broad set of use cases.
It would have to have high-throughput to support high volume event streams such as real-time log aggregation.
It would need to deal gracefully with large data backlogs to be able to support periodic data loads from offline systems.
It also meant the system would have to handle low-latency delivery to handle more traditional messaging use-cases.
Kafka introduced a new messaging mechanism (built around partitions) and a new consumption model.
We wanted to support partitioned, distributed, real-time processing of these feeds to create new, derived feeds. This motivated our partitioning and consumer model.
It also has to provide fault tolerance; compared with traditional messaging systems, Kafka behaves more like a database log.

Stream processing in Kafka

Starting in 0.10.0.0, a light-weight but powerful stream processing library called Kafka Streams is available in Apache Kafka. Apart from Kafka Streams, alternative open source stream processing tools include Apache Storm and Apache Samza.
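As a rough illustration (not part of the original article), here is a minimal Kafka Streams topology that simply copies records from one topic to another. The application id, broker address, and topic names are placeholders, and the StreamsBuilder API shown is the one from recent Kafka releases rather than the original 0.10.0.0 API.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PassthroughStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-passthrough");   // placeholder application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read the "input-feed" topic and write every record unchanged to "derived-feed".
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-feed");
        source.to("derived-feed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```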

How does Kafka implement load balancing?

The producer sends data directly to the broker that leads the target partition; any Kafka node can answer metadata questions such as which servers are alive and which broker is currently the leader for a given partition of a topic.
Load balancing can be done at random, or through a user-specified partitioning function.
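As one way to picture the "user-specified partitioning function", here is a sketch of a custom Partitioner for the Java producer; the class name is made up for this example, and Kafka's own default partitioner uses murmur2 hashing rather than the simple hash shown here.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

// Sketch of a semantic partitioning function: records with the same key always map to the same partition.
public class KeyHashPartitioner implements Partitioner {
    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // No key: spread records randomly across partitions (random load balancing).
            return ThreadLocalRandom.current().nextInt(numPartitions);
        }
        // Keyed record: hash the key bytes so all records for one key share a partition.
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void close() {}
}
```

A class like this would be plugged in through the producer's partitioner.class configuration property.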

What is Kafka's unit of data, and how is it transmitted?

Kafka's unit of data is the message, which is simply an array of bytes. A message may carry an optional piece of metadata, the key, which is used to route it to a specific partition. For efficiency, messages are written in batches; all messages in one batch go to the same topic and partition.
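A minimal sketch of a keyed send with the Java producer (the broker address, topic name, key, and value are placeholders): records with the same key are routed to the same partition, and the producer groups records for the same topic-partition into batches behind the scenes.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") decides the partition; the value is the message payload (a byte array on the wire).
            producer.send(new ProducerRecord<>("page-views", "user-42", "{\"page\":\"/home\"}"));
        }
    }
}
```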

How do consumers interpret the producer's messages?

Choose a schema that fits your business scenario, such as JSON or XML.
Serialization frameworks: Avro provides a compact serialization format and keeps the message schema separate from the message payload.
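As one illustration of keeping serialization pluggable, here is a sketch of a custom value serializer that writes JSON bytes with Jackson. Using Jackson is an assumption for this example, not something the article prescribes; an Avro setup would instead plug in an Avro serializer and keep the schema outside the payload.

```java
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Sketch: serialize any POJO to UTF-8 JSON bytes for the message payload.
public class JsonSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {}

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new SerializationException("JSON serialization failed", e);
        }
    }

    @Override
    public void close() {}
}
```

It would be registered through the producer's value.serializer property, just like StringSerializer in the earlier sketch.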

How is it guaranteed that a partition is consumed by only one consumer?

A consumer group guarantees that each partition is consumed by only one member of the group.
A consumer group can contain multiple consumers; the members read from different partitions and do not interfere with one another.
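A minimal consumer sketch (group id, topic, and broker address are placeholders): all consumers created with the same group.id form one consumer group, and Kafka assigns each partition of the subscribed topic to exactly one of them.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // placeholder broker
        props.put("group.id", "page-view-processors");         // members of the same group share the partitions
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {                                      // keep polling the partitions assigned to this member
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```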

How is asynchronous send implemented?

The Kafka producer buffers data in memory and sends it in batches: it packs no more than a bounded amount of data into one request, and waits no longer than a bounded amount of time before sending the batch.
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
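The "no more than a fixed size / no longer than a fixed latency" trade-off quoted above maps to producer configuration. The values below mirror the quote's "say 64k or 10 ms" and are illustrative, not recommendations; broker address and topic are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536); // accumulate up to ~64 KB per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);     // wait up to 10 ms for more records before sending

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // send() is asynchronous: it buffers the record and returns immediately;
            // the callback fires once the batch containing this record has been acknowledged.
            producer.send(new ProducerRecord<>("metrics", "hello"),
                    (metadata, exception) -> {
                        if (exception != null) exception.printStackTrace();
                    });
            producer.flush();
        }
    }
}
```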

Message consumption: push vs. pull?

Messages can be delivered in two ways: consumers pull data from the broker, or the broker pushes data to consumers. In Kafka, producers push data to the broker and consumers pull data from it; this avoids overwhelming a consumer whose processing rate falls behind the producer's.
The deficiency of a naive pull-based system is that if the broker has no data the consumer may end up polling in a tight loop, effectively busy-waiting for data to arrive. To avoid this we have parameters in our pull request that allow the consumer request to block in a “long poll” waiting until data arrives (and optionally waiting until a given number of bytes is available to ensure large transfer sizes).
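On the consumer side, the "long poll" behaviour described in the quote corresponds to the fetch settings fetch.min.bytes and fetch.max.wait.ms. A sketch with illustrative values follows; broker, group, and topic names are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LongPollConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "log-readers");             // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Ask the broker to hold the fetch until at least 64 KB is available...
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65536);
        // ...but never longer than 500 ms, so latency stays bounded even on quiet topics.
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("app-logs"));
            consumer.poll(Duration.ofSeconds(1)).forEach(r -> System.out.println(r.value()));
        }
    }
}
```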

How do we track which data has been consumed?

Many messaging systems keep metadata on the broker to track which messages have been consumed. If the broker marks a message as consumed as soon as it is sent out and the consumer then fails to process it, the message is lost. Many systems therefore add an acknowledgement mechanism: messages are only marked as sent, not consumed, when they are sent out, and the broker waits for a specific acknowledgement from the consumer before recording the message as consumed. This creates new problems: if the consumer processes the message successfully but the acknowledgement never arrives, the message is consumed twice; and the broker now has to track multiple states for every message.
Kafka instead partitions the data, and each partition is consumed by exactly one consumer within a consumer group. The consumption position in a partition is therefore just a single integer, the offset. A consumer can also deliberately rewind to an older offset and re-consume the data.
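Because the position is just that single offset integer, rewinding to re-consume old data is only a seek. A hedged sketch follows; the topic, partition number, target offset, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayFromOffsetExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("group.id", "replay-demo");                // placeholder group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");            // commit offsets explicitly

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 1000L);                         // rewind to an older offset and re-consume from there
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.printf("offset=%d value=%s%n", r.offset(), r.value()));
            consumer.commitSync();                            // record the new position as a single integer
        }
    }
}
```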