
Apache Kafka series: how to resolve kafka.common.ConsumerRebalanceFailedException

```
kafka.common.ConsumerRebalanceFailedException: log-push-record-consumer-group_mobile-pushremind02.lf.xxx.com-1399456594831-99f15e63 can't rebalance after 3 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(Unknown Source)
    at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(Unknown Source)
    at kafka.consumer.ZookeeperConsumerConnector.consume(Unknown Source)
    at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreams(Unknown Source)
    at com.xxx.mafka.client.consumer.DefaultConsumerProcessor.getKafkaStreams(DefaultConsumerProcessor.java:149)
    at com.xxx.mafka.client.consumer.DefaultConsumerProcessor.recvMessage(DefaultConsumerProcessor.java:63)
    at com.xxx.service.mobile.push.kafka.MafkaPushRecordConsumer.main(MafkaPushRecordConsumer.java:22)
    at com.xxx.service.mobile.push.Bootstrap.main(Bootstrap.java:34)
```

Why this happens:

Multiple consumers in the same consumer group were started one after another, so several consumers within one group are simultaneously sharing the consumption of multiple partitions.

Solution:

1. Adjust the ZooKeeper-related settings in the Kafka consumer configuration:

```
zookeeper.session.timeout.ms=5000
zookeeper.connection.timeout.ms=10000
rebalance.backoff.ms=2000
rebalance.max.retries=10
```
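As a sketch, these settings can be assembled into the `java.util.Properties` object that the old high-level consumer is configured from; the `zookeeper.connect` address and `group.id` here are placeholders, not values from the article:

```java
import java.util.Properties;

public class ConsumerConfigSketch {
    /** Builds the consumer configuration with the settings recommended above. */
    public static Properties buildConfig() {
        Properties props = new Properties();
        // Placeholder connection settings -- replace with your own.
        props.put("zookeeper.connect", "zkhost:2181");
        props.put("group.id", "log-push-record-consumer-group");
        // Timeout and rebalance settings from this article.
        props.put("zookeeper.session.timeout.ms", "5000");
        props.put("zookeeper.connection.timeout.ms", "10000");
        props.put("rebalance.backoff.ms", "2000");
        props.put("rebalance.max.retries", "10");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        // In a real consumer this would be passed to
        // Consumer.createJavaConsumerConnector(new ConsumerConfig(props)).
        System.out.println(props.getProperty("rebalance.backoff.ms"));
    }
}
```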


When using the high-level consumer API, this problem is usually caused by too short a retry interval: zookeeper.sync.time.ms is set too low (note that rebalance.backoff.ms falls back to the value of zookeeper.sync.time.ms when it is not set explicitly). Other causes cannot be ruled out, but in my experience this is the common one.

To explain the cause: in a consumer group where the number of consumers is less than the number of partitions, any change in consumer membership triggers a rebalance. The first step is to release the consumer's current resources, if it holds any: the ConsumerFetcherThread is shut down, all connections to the Kafka brokers are closed, and the currently consumed partitions are released, which in practice means deleting their ephemeral ZooKeeper nodes (/consumers/&lt;group&gt;/owners/&lt;topic&gt;/&lt;partition&gt;). Then every consumer in the group computes which partitions it should consume and registers the corresponding ephemeral nodes to claim them, in effect declaring "I own consumption of this partition; no other consumer may use it."

With that understood, the rest is straightforward. When a consumer runs a rebalance, it follows a retry-on-failure policy governed by a retry interval and a maximum retry count: each time it fails to claim its partitions, it waits and tries again. By analogy: department B books a meeting room for a time slot, but when the time comes and B arrives, department A is still using the room. B can only wait, checking back every so often. If B checks too frequently, the room will always appear occupied; with a longer interval between checks, A may well have left by the second visit.

Likewise, when a new consumer joins and triggers a rebalance, the existing (old) consumers recompute their assignments and release the partitions they hold, but that takes some processing time; a new consumer that tries to claim those partitions during that window is quite likely to fail. If we allow enough time for the old consumers to release their resources, the problem does not occur.
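The retry behaviour described above can be sketched as a simple loop. This is a hypothetical illustration, not the actual code in ZookeeperConsumerConnector.syncedRebalance: the consumer attempts the rebalance up to rebalance.max.retries times, sleeping rebalance.backoff.ms between attempts, and gives up with an exception if every attempt finds the partitions still owned:

```java
import java.util.function.BooleanSupplier;

public class RebalanceRetrySketch {
    /**
     * Retries tryRebalance up to maxRetries times, waiting backoffMs between
     * attempts. Returns the attempt number that succeeded, or throws if all
     * attempts fail -- mirroring how the real rebalance gives up with
     * ConsumerRebalanceFailedException ("can't rebalance after N retries").
     */
    public static int rebalanceWithRetries(BooleanSupplier tryRebalance,
                                           int maxRetries, long backoffMs)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (tryRebalance.getAsBoolean()) {
                return attempt; // successfully claimed our partitions
            }
            Thread.sleep(backoffMs); // give the old consumer time to release
        }
        throw new IllegalStateException(
                "can't rebalance after " + maxRetries + " retries");
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate an old consumer that releases its partitions after ~30 ms;
        // with a 20 ms backoff the claim succeeds within a few attempts.
        long releaseAt = System.currentTimeMillis() + 30;
        int attempts = rebalanceWithRetries(
                () -> System.currentTimeMillis() >= releaseAt, 10, 20);
        System.out.println("succeeded on attempt " + attempts);
    }
}
```

If the backoff is too small relative to how long the old consumer takes to release, all retries burn out while the partitions are still owned, which is exactly the failure mode in the stack trace above.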

The official explanation (from the Kafka FAQ):

consumer rebalancing fails (you will see ConsumerRebalanceFailedException): This is due to conflicts when two consumers are trying to own the same topic partition. The log will show you what caused the conflict (search for "conflict in ").

  • If your consumer subscribes to many topics and your ZK server is busy, this could be caused by consumers not having enough time to see a consistent view of all consumers in the same group. If this is the case, try increasing rebalance.max.retries and rebalance.backoff.ms.
  • Another reason could be that one of the consumers is hard killed. Other consumers during rebalancing won't realize that consumer is gone until zookeeper.session.timeout.ms has elapsed. In that case, make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.

If rebalance.backoff.ms is set too short, the old consumer has not yet finished releasing its resources when the new consumer exhausts its retry limit and gives up.

Make sure that rebalance.max.retries * rebalance.backoff.ms > zookeeper.session.timeout.ms.
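With the values suggested earlier this inequality holds comfortably: 10 × 2000 ms gives a 20000 ms total retry window against a 5000 ms session timeout. A tiny sanity check, using the values from this article:

```java
public class RebalanceSanityCheck {
    /**
     * Returns true when the total retry window outlasts the ZooKeeper session
     * timeout, so a hard-killed consumer's ephemeral nodes expire before the
     * surviving consumers run out of rebalance retries.
     */
    public static boolean retryWindowCoversSession(int maxRetries, long backoffMs,
                                                   long sessionTimeoutMs) {
        return maxRetries * backoffMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the configuration above: 10 * 2000 = 20000 > 5000.
        System.out.println(retryWindowCoversSession(10, 2000, 5000)); // prints "true"
    }
}
```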

Reference: https://cwiki.apache.org/confluence/display/KAFKA/FAQ