Sample Code for Manually Managing Kafka Offsets in Spark Streaming Direct Mode

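In big-data scenarios, stream processing usually relies on Kafka as the message-ingestion middleware. Because of the advantages of the Direct approach in Spark Streaming, it is fair to say that nearly everyone now uses Direct mode to read data from Kafka.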

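The Direct approach uses Kafka's low-level API to fetch data, which means we have to manage the offsets ourselves.
Spark Streaming can manage offsets for us automatically through the StreamingContext checkpoint method, but this has some drawbacks: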

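  • The checkpoint is committed automatically after each batch has been processed, so it cannot satisfy at-most-once semantics if that is what we need
  • After a Spark version upgrade, the new version does not recognize checkpoint data written by the old version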

So we can manage the offsets manually to meet different delivery-semantics requirements. Below is sample code that stores the offsets in ZooKeeper:

Main class:

import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import sql.StreamingExamples

object OffsetTest extends App {
  StreamingExamples.setStreamingLogLevels()

  val topic = "iso8583-r3p3"
  val brokers = "ido001.gzcb.com:9092,ido002.gzcb.com:9092,ido003.gzcb.com:9092"
  val sparkConf = new SparkConf()
    .setAppName("Iso8583_KafkaDirect")
    .setIfMissing("spark.master", "local[*]")
  val ssc = new StreamingContext(sparkConf, Seconds(3))

  // Starting offset of every partition, read from ZooKeeper
  val fromOffSets = ZkUtil.getOffset(topic)
  val messageHandler = (mmd: MessageAndMetadata[String, String]) => mmd.message()
  val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers, "group.id" -> "lwj")
  val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
    ssc, kafkaParams, fromOffSets, messageHandler)

  // Holds the offset ranges of the current batch
  var offsetRanges = Array[OffsetRange]()
  messages.transform(rdd => {
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }).foreachRDD(rdd => {
    // Offset management
    val offsets = scala.collection.mutable.ArrayBuffer[String]()
    for (o <- offsetRanges) {
      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      offsets += s"${o.topic},${o.partition},${o.untilOffset}"
    }
    // TODO: when to save the offsets depends on your requirements
    ZkUtil.setOffset(offsets.toArray)
    // TODO: business logic
    println("#################")
    // rdd.foreach(println)
    println(rdd.count())
  })

  ssc.start()
  ssc.awaitTermination()
}
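The point at which the offsets are saved, relative to the business logic, is what decides the delivery semantics. The following is only a minimal sketch of that idea in plain Scala, with no Spark or ZooKeeper involved. `InMemoryOffsetStore`, `atMostOnce`, and `atLeastOnce` are hypothetical names made up for illustration; the store stands in for something like `ZkUtil.setOffset`.

```scala
import scala.util.Try

object CommitOrderSketch {
  // Hypothetical stand-in for a real offset store such as ZooKeeper.
  final class InMemoryOffsetStore {
    private var committed: Long = 0L
    def commit(untilOffset: Long): Unit = committed = untilOffset
    def current: Long = committed
  }

  // At-most-once: save the offset first, then process.
  // If processing fails, a restart resumes after this batch, so it is never retried.
  def atMostOnce(store: InMemoryOffsetStore, untilOffset: Long)(process: => Unit): Unit = {
    store.commit(untilOffset)
    process
  }

  // At-least-once: process first, then save the offset.
  // If processing fails, the offset is unchanged and the batch is read again.
  def atLeastOnce(store: InMemoryOffsetStore, untilOffset: Long)(process: => Unit): Unit = {
    process
    store.commit(untilOffset)
  }

  def main(args: Array[String]): Unit = {
    val first = new InMemoryOffsetStore
    Try(atMostOnce(first, 100L)(throw new RuntimeException("processing failed")))
    println(s"at-most-once, offset after failure: ${first.current}")   // 100

    val second = new InMemoryOffsetStore
    Try(atLeastOnce(second, 100L)(throw new RuntimeException("processing failed")))
    println(s"at-least-once, offset after failure: ${second.current}") // 0
  }
}
```

In the main class above, the offsets are saved before the business logic runs, which corresponds to the at-most-once ordering in this sketch; moving the save after the business logic would give the at-least-once ordering.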

ZkUtil utility class:

import java.util
import java.util.concurrent.CountDownLatch
import kafka.common.TopicAndPartition
import org.apache.zookeeper.Watcher.Event
import org.apache.zookeeper._

/**
  * ZooKeeper utility class
  *
  * @author lwj
  * @date 2018/04/25
  */
object ZkUtil extends Watcher with Serializable{

  // Counted down once the session is connected; note that this example never awaits it
  protected var countDownLatch: CountDownLatch = new CountDownLatch(1)
  override def process(event: WatchedEvent): Unit = {
    if (event.getState eq Event.KeeperState.SyncConnected) {
      countDownLatch.countDown
    }
  }

  val zk = new ZooKeeper("181.137.128.151:2181,181.137.128.152:2181,181.137.128.153:2181", 5000, ZkUtil)
  val parentPath = "/lwj"
  // Default number of partitions
  val initPartitions = 3
  // Default offset value
  val initOffset = "0"
  // Although never called explicitly, this runs when the object is initialized
  if (zk.exists(parentPath, false) == null){
    zk.create(parentPath, "0".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
  }

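  /**
    * Get the partitions of a topic and their stored offsets
    *
    * @param topic Kafka topic name
    * @return map from each partition of the topic to its stored offset
    */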
  def getOffset(topic:String): Map[TopicAndPartition, Long] ={
    val zkPath = parentPath + "/" + topic
    val map = scala.collection.mutable.Map[TopicAndPartition, Long]()

    /**
      * If the topic node does not exist, create it
      * and initialize the partition nodes right away, each with the value initOffset
      */
    if (zk.exists(zkPath, false) == null){
      zk.create(zkPath, "0".getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
      for (i <- 0 until initPartitions) {
        zk.create(zkPath + "/" + i, initOffset.getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
      }
    }
    /**
      * Return the offsets
      */
    val children = zk.getChildren(zkPath, false)
    val iterator: util.Iterator[String] = children.iterator()
    while (iterator.hasNext){
        val child: String = iterator.next()
        val offset = new String(zk.getData(zkPath +"/"+ child, false, null))
        val tp = new TopicAndPartition(topic, child.toInt)
        map += (tp -> offset.toLong)

    }
    map.toMap
  }

  /**
    * Save the offsets
    *
    * @param offsets entries in the form "topic,partition,offset"
    */
  def setOffset(offsets : Array[String]): Unit ={
    offsets.foreach(off =>{
      val splits: Array[String] = off.split(",")
      val partitionPath = parentPath + "/" + splits(0) + "/" + splits(1)
      if (zk.exists(partitionPath, false) == null){
        // Node does not exist yet: create it with the given offset
        zk.create(partitionPath, splits(2).getBytes, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
      }else{
        zk.setData(partitionPath, splits(2).getBytes, -1)
      }
    })
  }
}
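The `"topic,partition,offset"` strings passed to `setOffset` can also be modelled as a small typed record, which keeps the formatting and parsing in one place. This is only an illustrative sketch: `OffsetRecord`, `encode`, `zkPath`, and `parse` are hypothetical names and are not part of the code above.

```scala
import scala.util.Try

object OffsetRecordSketch {
  // One saved offset: the topic, the partition, and the offset to resume from.
  final case class OffsetRecord(topic: String, partition: Int, untilOffset: Long) {
    // Same "topic,partition,offset" layout that setOffset expects.
    def encode: String = s"$topic,$partition,$untilOffset"

    // Node path used for this partition, e.g. /lwj/<topic>/<partition>.
    def zkPath(parentPath: String): String = s"$parentPath/$topic/$partition"
  }

  // Parse an encoded entry; returns None for malformed input instead of throwing.
  def parse(entry: String): Option[OffsetRecord] = entry.split(",") match {
    case Array(topic, partition, offset) =>
      Try(OffsetRecord(topic, partition.toInt, offset.toLong)).toOption
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val record = OffsetRecord("iso8583-r3p3", 0, 42L)
    println(record.encode)          // iso8583-r3p3,0,42
    println(record.zkPath("/lwj"))  // /lwj/iso8583-r3p3/0
    println(parse(record.encode))   // Some(OffsetRecord(iso8583-r3p3,0,42))
    println(parse("not-an-offset")) // None
  }
}
```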

The code above is for reference only. If you have questions or better ideas, feel free to leave a comment to discuss.