1. 程式人生 > >Kafka Introduction 官方文件學習筆記

Kafka Introduction 官方文件學習筆記

補充:例如途Partition 0分割槽的0,1,2....10,11,12...都是分割槽中訊息的偏移。

Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

每一個分割槽都是有序的,不可以改變的記錄序列,該序列被持續追加新記錄 -- 一個帶有結構的提交日誌。每一個分割槽的記錄都被指定一個序列號,稱為偏移,該偏移在該分割槽中是唯一標識。

The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

Kafka叢集可以儲存所有的記錄(無論該記錄是否被消費)。可以配置一個記錄儲存時間。例如:儲存策略是2天的話,對於記錄被髮布的兩天內,該記錄都是可以被消費的,然而過了兩天將會被刪除釋放空間。Kakfa效能是高效的,可以長時間儲存資料。

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".

實際上儲存在每一個消費者上的元資料是消費的偏移或者消費的位置,偏移是有消費者控制的,所以我們可以自行定義消費者的消費位置。

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

這些特徵以為這消費者是非常廉價的,消費者的建立和消費對叢集或者其他消費者幾乎沒影響。例如:使用tail命令列檢視topic的訊息內容,他會不該改變其他已經存在的消費者資訊。

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.

日誌中的設定分割槽有如下幾個目的:首先,允許日誌拓展儲存到超過單個節點伺服器大小的其他節點上,每一個有效的分割槽都存在對應主機的伺服器上,但是一個topic可以有多個分割槽,因此可以解決因單個伺服器因儲存空間大小無法處理大量資料的情況。其次,分割槽充當並行度的單元。

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.

日誌分割槽分散式在kafka叢集中,每一個分割槽都可以處理資料和接受請求,分割槽可以配置副本在多個伺服器上以達到容錯目的。

每個partition都有一個server為"leader"(一般為了容錯每個分割槽有多個副本,其中之一為Leader,其餘follower);leader負責所有的讀寫操作,follower負責被動的複製保持副本數目。如果leader失敗的話,會重新再foller中選擇一個新的leader,每一個伺服器充當起分割槽的leader,該leader的foller一般在其他機器上,這樣在叢集內會有較好的平衡。

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!

生產者釋出資料到指定的topic中,生產者負責選擇那條記錄寫到哪一個topic分割槽中。這個使用round-robin演算法完成的,根據一些語義分割槽功能維持的負載均衡,更多分割槽的使用是在下面消費者。

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

消費者以分組名字給自己定義標籤,每一個釋出到topic的記錄將會被髮到訂閱該topic的組中一個消費者例項中,消費者例項是獨立的程式設計或者分佈在多臺機器上。

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

如果所有的消費者各自的組都不相同的話,每一條記錄將會被髮布到所有的消費中(廣播)

A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

通常,topic有少量的消費者組,每一個組是一個邏輯訂閱。為了拓展好容錯每一個組是由大量消費者組成,這個事發布訂閱語義,通常訂閱者都是消費者叢集,而不是單個程序

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka實現消費的方式是將分區劃分給消費者例項,每一個消費者例項在任何時候可以公平佔用該分割槽,組中的消費者可以有Kafka協議動態調整,新的消費者例項假如組中的話,他們可以接管該組中其他消費者的分割槽,如果一個消費者失望的話,該分割槽會分配到其他存活的消費者去

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Kafka只有在分割槽內記錄有序,分割槽之間的記錄是不保證順序的,如果要保證topic全域性有序的話,可以讓該topic只有一個分割槽,同時這也意味著每一個消費者組只有一個消費者

At a high-level Kafka gives the following guarantees:

在一個高層,Kafka提供如下保證:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

訊息追加到分割槽的順序是他們傳送的順序。

消費者尋找記錄的順序是他們儲存在日誌中的順序

如果一個topic的副本數是N,我們將會容忍N-1個伺服器宕機,保證不會丟失資料

更多保證的細節將在設計文件處給出。

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Kafka的流概念和傳統的企業訊息系統比較?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

訊息系統模型有兩種:訊息佇列和釋出訂閱。佇列模式,一個消費池可能會在一個伺服器讀取每一條訊息(不是每一個consumer都能獲取到每一條訊息)。在釋出訂閱系統中,記錄將會廣播到所有的消費者,這兩個模型有各自的優缺點。佇列的優點是允許你可劃分訊息給多個執行緒,可以擴充套件處理。

不幸的是,確定沒有多個訂閱者,一旦資料被某一個進行讀取,資料將會消失。釋出訂閱西戎可以讓你廣播資料給多個執行緒,但是沒有辦法擴充套件處理,因為每一條記錄都會發到每一個訂閱者。

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

消費者概念在Kafka中意味著兩層含義:在佇列模式,允許你劃分處理給多個程序。但是在釋出訂閱中,你將廣播訊息給多個使用者組。

The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.

Kafka模型的優勢是每一個topic都有這些屬性,他可以擴充套件處理和給多個訂閱者,不需要選擇一個或者多個訂閱者。

Kafka has stronger ordering guarantees than a traditional messaging system, too.

Kafka有強烈的順序保證比傳統的訊息系統

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

傳統佇列按順序儲存記錄, 如果多個消費者從佇列消費,順序是他們儲存的順序。但是儘管伺服器處理資料有順序,但是記錄在非同步傳輸到消費者過程中可能並不是先發先到,因此順序會錯亂。這意味著並行處理會失去順序,訊息系統常常可以使用“獨佔消費”只允許一個執行緒從一個佇列中消費,當然這意味著失去了並行處理。

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka在這一點處理的比較好。使用topic的分割槽實現並行,Kafka可以保證順序和負載均衡。這個實現是使用topic的分割槽,每一個分割槽只有分組中的一個消費者消費。這麼做我們就可以確保沒一個分割槽消費資料有序。因此這要要注意:每一個分組總的消費者數目不可以大於分組數目。

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

任何訊息系統都允許生產訊息和消費訊息解耦,使其充當一個動態訊息儲存系統。Kafka作為儲存系統有什麼不同呢?

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

資料寫到Kafka之後在寫到磁碟並且為了容錯複製副本。kafka允許生產者等待訊息寫入操作結果的確認,Kafka使用的磁碟結構拓展較好,在50KK和50TB持久化效能相同

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

由於Kafka花費大量儲存,允許客戶端控制讀取位置(佔用儲存大,offset就大)。你可以認為Kafka是一種特定的分散式檔案系統,高效能,低延遲,有副本和可以傳播。

Kafka for Stream Processing(Kafka流處理器)

It isn't enough to just read, write, and store streams of data, the purpose is to enable real-time processing of streams.

Kafka不僅僅可以讀寫儲存流資料,還是以進行實時處理。

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

Kafka處理器可以處理topic的輸入資料,進行執行操作之後在輸出到topic

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

例如零售應用可以輸入銷售和出貨量的資料,然後進行排序和價格調整

It is possible to do simple processing directly using the producer and consumer APIs. However for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

簡單的處理可以直接使用生產者和消費者API,但是複雜的操作可以使用Streams API,這允許你侯建有意義的處理,進行聚集或者連線

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Putting the Pieces Together(總結)

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.

By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is a single application can process historical, stored data but rather than ending when it reaches the last record it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise for streaming data pipelines the combination of subscription to real-time events make it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably make it possible to use it for critical data where the delivery of data must be guaranteed or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information on the guarantees, apis, and capabilities Kafka provides see the rest of the documentation.