
A Pitfall to Watch for When Storm Consumes Kafka

The problem:
  The Kafka cluster had been set up long before, and a newly built Storm cluster needed to consume one of its topics. Because Kafka already held a large backlog of messages, Storm started consuming from the very beginning of the topic.

The fix:

  The following excerpt is from the official documentation:

How KafkaSpout stores offsets of a Kafka topic and recovers in case of failures
As shown in the above KafkaConfig properties, you can control from where in the Kafka topic the spout begins to read by setting KafkaConfig.startOffsetTime as follows:

kafka.api.OffsetRequest.EarliestTime(): read from the beginning of the topic (i.e. from the oldest messages onwards)
kafka.api.OffsetRequest.LatestTime(): read from the end of the topic (i.e. any new messages that are being written to the topic)
A Unix timestamp aka seconds since the epoch (e.g. via System.currentTimeMillis()): see How do I accurately get offsets of messages for a certain timestamp using OffsetRequest? in the Kafka FAQ
As the topology runs the Kafka spout keeps track of the offsets it has read and emitted by storing state information under the ZooKeeper path SpoutConfig.zkRoot+ "/" + SpoutConfig.id. In the case of failures it recovers from the last written offset in ZooKeeper.

Important: When re-deploying a topology make sure that the settings for SpoutConfig.zkRoot and SpoutConfig.id were not modified, otherwise the spout will not be able to read its previous consumer state information (i.e. the offsets) from ZooKeeper -- which may lead to unexpected behavior and/or to data loss, depending on your use case.

This means that when a topology has run once the setting KafkaConfig.startOffsetTime will not have an effect for subsequent runs of the topology because now the topology will rely on the consumer state information (offsets) in ZooKeeper to determine from where it should begin (more precisely: resume) reading. If you want to force the spout to ignore any consumer state information stored in ZooKeeper, then you should set the parameter KafkaConfig.ignoreZkOffsets to true. If true, the spout will always begin reading from the offset defined by KafkaConfig.startOffsetTime as described above.

  The gist of this passage: the startOffsetTime field of the SpoutConfig object controls where consumption starts. Its default value is kafka.api.OffsetRequest.EarliestTime(), i.e. consume from the oldest messages; if you want to start from the newest messages, you must set it to kafka.api.OffsetRequest.LatestTime() explicitly. There is a further catch: this field only takes effect the first time the topology consumes. On subsequent runs, consumption resumes from the offset recorded in ZooKeeper (the place where the consumer state is stored is derived from the SpoutConfig object's zkRoot field; not verified here).
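A minimal sketch of that setting, assuming the classic storm-kafka API (package names match the 1.x module; older releases used the storm.kafka package instead). The ZooKeeper address, topic name, zkRoot, and spout id below are placeholders:

```java
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;

// Placeholder ZooKeeper address, topic, zkRoot, and consumer id.
BrokerHosts hosts = new ZkHosts("zk1:2181");
SpoutConfig spoutConfig =
        new SpoutConfig(hosts, "my-topic", "/kafka-spout", "my-spout-id");

// Start from the newest messages instead of the default (earliest).
// Remember: this only matters on the first run, or when no consumer
// state for this zkRoot + id exists in ZooKeeper.
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
```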

  If you want the current topology to pick up where the previous topology's consumption left off, do not change the SpoutConfig object's id. Conversely, if the first run already started from the oldest messages, then keeping the same id means the spout will keep consuming all the way from the oldest messages up to the newest. If at that point you want to skip the backlog and jump straight to the newest messages, simply changing the SpoutConfig object's id is enough.
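The resume-vs-restart decision described above can be sketched in plain Java. This is a hypothetical simplification for illustration only; the real logic lives inside storm-kafka's PartitionManager:

```java
// Sketch of how the KafkaSpout decides where to begin reading a partition.
// Hypothetical helper; not part of the storm-kafka API.
public class OffsetChooser {
    /**
     * @param ignoreZkOffsets  value of KafkaConfig.ignoreZkOffsets
     * @param zkOffset         offset recorded under zkRoot + "/" + id in
     *                         ZooKeeper, or null if no state exists
     *                         (first run, or the id was changed)
     * @param startOffset      offset resolved from KafkaConfig.startOffsetTime
     * @return the offset the spout starts reading from
     */
    public static long chooseStartOffset(boolean ignoreZkOffsets,
                                         Long zkOffset,
                                         long startOffset) {
        if (ignoreZkOffsets) {
            return startOffset;   // always honor startOffsetTime
        }
        if (zkOffset == null) {
            return startOffset;   // no consumer state: first run or new id
        }
        return zkOffset;          // resume from the last recorded offset
    }
}
```

With an unchanged id the ZooKeeper offset wins, which is exactly why changing the id (or setting KafkaConfig.ignoreZkOffsets to true) makes the spout honor startOffsetTime again.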

  Below are the meanings of some SpoutConfig fields; they are actually inherited from KafkaConfig, as the source code shows:

 public int fetchSizeBytes = 1024 * 1024;   // target total message size of the response to each FetchRequest sent to Kafka
 public int socketTimeoutMs = 10000;        // socket timeout for connections to the Kafka brokers
 public int fetchMaxWait = 10000;           // how long the consumer waits when the server has no new messages
 public int bufferSizeBytes = 1024 * 1024;  // read-buffer size of the SocketChannel used by SimpleConsumer
 public MultiScheme scheme = new RawMultiScheme();   // how to deserialize the byte[] fetched from Kafka
 public boolean forceFromStart = false;     // whether to force reading from the smallest offset in Kafka
 public long startOffsetTime = kafka.api.OffsetRequest.EarliestTime();   // which offset time to start reading from; the default is the oldest offset
 public long maxOffsetBehind = Long.MAX_VALUE;   // how far the KafkaSpout's progress may lag behind the target offset; if it falls too far behind, the spout discards the messages in between
 public boolean useStartOffsetTimeIfOffsetOutOfRange = true;   // whether to fall back to startOffsetTime when the requested offset no longer exists in Kafka
 public int metricsTimeBucketSizeInSecs = 60;    // how often metrics are reported
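Putting the fields above together, wiring the spout into a topology might look like the following sketch against the classic storm-kafka 1.x API. The ZooKeeper address, topic, zkRoot, ids, and the commented-out bolt are all placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class KafkaTopology {
    public static void main(String[] args) throws Exception {
        BrokerHosts hosts = new ZkHosts("zk1:2181");           // placeholder ZooKeeper address
        SpoutConfig conf = new SpoutConfig(hosts,
                "my-topic",                                    // placeholder topic
                "/kafka-spout",                                // zkRoot: where offset state is kept
                "my-spout-id");                                // id: keep stable to resume consumption

        conf.scheme = new SchemeAsMultiScheme(new StringScheme()); // deserialize byte[] as Strings
        conf.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
        conf.ignoreZkOffsets = false;                          // resume from ZooKeeper state if present

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(conf), 1);
        // builder.setBolt("my-bolt", new MyBolt()).shuffleGrouping("kafka-spout");
        StormSubmitter.submitTopology("kafka-demo", new Config(), builder.createTopology());
    }
}
```

Submitting requires a running Storm and Kafka cluster, so this is a configuration sketch rather than something runnable standalone; the key point is that zkRoot and id together locate the consumer state in ZooKeeper.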