
Spark Streaming: consuming Kafka data from a specified offset


2017-06-13 15:19 · 770 reads · 2 comments
Category: spark

Original article: http://blog.csdn.net/high2011/article/details/53706446

First of all, many thanks to the original author; this article saved me a lot of detours. I am reposting it to keep a copy for review. Please support the original author by visiting the link above. Thank you.


I. Scenario: When a Spark Streaming application exits unexpectedly, data keeps being pushed into Kafka. Because consumption defaults to the latest offset, everything produced during the outage is skipped on restart, i.e. lost. To avoid this, we need to record the offset of each batch we consume, so that on the next start we can check it and resume reading from that recorded offset.
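The bookkeeping described above can be sketched without any Spark or Kafka dependency: persist the consumed position per (topic, partition), and restore it on startup. This is a minimal illustration only; the one-line-per-entry file format and the helper names (`OffsetStoreSketch`, `save`, `load`) are my own assumptions, not part of any Kafka API.

```scala
// Minimal sketch of offset bookkeeping, independent of Spark/Kafka.
// File format ("topic,partition,offset" per line) is an illustrative assumption.
import java.io.{File, PrintWriter}
import scala.io.Source

object OffsetStoreSketch {
  // Persist one line per (topic, partition): the next offset to read.
  def save(file: File, offsets: Map[(String, Int), Long]): Unit = {
    val pw = new PrintWriter(file)
    try offsets.foreach { case ((t, p), o) => pw.println(s"$t,$p,$o") }
    finally pw.close()
  }

  // Restore the map; a missing file means "no saved position yet".
  def load(file: File): Map[(String, Int), Long] =
    if (!file.exists) Map.empty
    else Source.fromFile(file).getLines().map { line =>
      val Array(t, p, o) = line.split(",")
      (t, p.toInt) -> o.toLong
    }.toMap
}
```

In the real job below, the same idea is applied with HDFS checkpointing plus the `fromOffsets` argument of `createDirectStream`; a production store would more typically be ZooKeeper, HBase, or Kafka's own offset topic.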
II. Environment: Kafka 0.9.0, Spark 1.6.0, JDK 1.7, Scala 2.10.5, IntelliJ IDEA 16

III. Implementation:

1. Add the Spark and Kafka dependencies:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns="http://maven.apache.org/POM/4.0.0"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.ngaa</groupId>
  <artifactId>test-my</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <!-- Maven compiler level -->
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <!-- Scala version -->
    <scala.version>2.10.5</scala.version>
    <!-- Scala version on the test machine -->
    <test.scala.version>2.11.7</test.scala.version>
    <jackson.version>2.3.0</jackson.version>
    <!-- slf4j version -->
    <slf4j-version>1.7.20</slf4j-version>
    <!-- CDH Spark -->
    <spark.cdh.version>1.6.0-cdh5.8.0</spark.cdh.version>
    <spark.streaming.cdh.version>1.6.0-cdh5.8.0</spark.streaming.cdh.version>
    <kafka.spark.cdh.version>1.6.0-cdh5.8.0</kafka.spark.cdh.version>
    <!-- CDH Hadoop -->
    <hadoop.cdh.version>2.6.0-cdh5.8.0</hadoop.cdh.version>
    <!-- httpclient must be compatible with the Hadoop version shipped with CDH
         (see /opt/cloudera/parcels/CDH/lib/hadoop/lib) -->
    <httpclient.version>4.2.5</httpclient.version>
    <!-- httpcore -->
    <httpcore.version>4.2.5</httpcore.version>
    <!-- fastjson -->
    <fastjson.version>1.1.39</fastjson.version>
  </properties>
  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <!-- Repository for the CDH dependency jars -->
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>
  <dependencies>
    <!-- fastjson -->
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>${fastjson.version}</version>
    </dependency>
    <!-- httpclient -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>${httpclient.version}</version>
    </dependency>
    <!-- httpcore -->
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpcore</artifactId>
      <version>${httpcore.version}</version>
    </dependency>
    <!-- slf4j -->
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-log4j12</artifactId>
      <version>${slf4j-version}</version>
    </dependency>
    <!-- Hadoop -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoop.cdh.version}</version>
      <exclusions>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <!-- Scala -->
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>${jackson.version}</version>
    </dependency>
    <!-- Spark Streaming and its Kafka integration -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${spark.streaming.cdh.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>${kafka.spark.cdh.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <!-- Spark assembly from the local Windows filesystem -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>${spark.cdh.version}</version>
      <scope>system</scope>
      <systemPath>D:/crt_send_document/spark-assembly-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar</systemPath>
    </dependency>
    <!-- Spark assembly from the local filesystem of the Linux test environment -->
    <!--<dependency>-->
    <!--  <groupId>org.apache.spark</groupId>-->
    <!--  <artifactId>spark-assembly_2.10</artifactId>-->
    <!--  <version>${spark.cdh.version}</version>-->
    <!--  <scope>system</scope>-->
    <!--  <systemPath>/opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.6.0-cdh5.8.0-hadoop2.6.0-cdh5.8.0.jar</systemPath>-->
    <!--</dependency>-->
    <!-- Spark assembly from Maven Central -->
    <!--<dependency>-->
    <!--  <groupId>org.apache.spark</groupId>-->
    <!--  <artifactId>spark-assembly_2.10</artifactId>-->
    <!--  <version>${spark.cdh.version}</version>-->
    <!--</dependency>-->
    <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-yarn-server-web-proxy -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-server-web-proxy</artifactId>
      <version>2.6.0-cdh5.8.0</version>
    </dependency>
  </dependencies>
  <!-- Maven packaging -->
  <build>
    <finalName>test-my</finalName>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.7</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass></mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>
```

2. Create the test class:

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.slf4j.LoggerFactory

/**
 * Created by yangjf on 2016/12/18
 * Description: read Kafka data starting from a specified offset
 * Email: [email protected]
 */
object ReadBySureOffsetTest {
  val logger = LoggerFactory.getLogger(ReadBySureOffsetTest.getClass)

  def main(args: Array[String]) {
    // Set log levels
    Logger.getLogger("org.apache.kafka").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.zookeeper").setLevel(Level.ERROR)
    Logger.getLogger("org.apache.spark").setLevel(Level.ERROR)
    logger.info("Starting the consume-Kafka-from-a-specified-offset test program")
    if (args.length < 1) {
      System.err.println("Your arguments were " + args.mkString(","))
      logger.info("Main program exiting: missing arguments")
      System.exit(1) // log before exiting; nothing after System.exit is reached
    }
    // e.g. hdfs://hadoop1:8020/user/root/spark/checkpoint
    val Array(checkpointDirectory) = args
    logger.info("Checkpoint directory: " + checkpointDirectory)
    val ssc = StreamingContext.getOrCreate(checkpointDirectory, () => {
      createContext(checkpointDirectory)
    })
    logger.info("Starting the streaming context")
    ssc.start()
    ssc.awaitTermination()
  }

  def createContext(checkpointDirectory: String): StreamingContext = {
    // Configuration
    val brokers = "hadoop3:9092,hadoop4:9092"
    val topics = "20161218a"
    // Batch interval in seconds (the default in many examples is 5)
    val split_rdd_time = 8
    // Create the context
    val sparkConf = new SparkConf()
      .setAppName("SendSampleKafkaDataToApple").setMaster("local[2]")
      .set("spark.app.id", "streaming_kafka")
    val ssc = new StreamingContext(sparkConf, Seconds(split_rdd_time))
    ssc.checkpoint(checkpointDirectory)
    // Create a direct Kafka stream from the brokers and topics
    val topicsSet: Set[String] = topics.split(",").toSet
    // Kafka parameters
    val kafkaParams: Map[String, String] = Map[String, String](
      "metadata.broker.list" -> brokers,
      "group.id" -> "apple_sample",
      "serializer.class" -> "kafka.serializer.StringEncoder"
      // "auto.offset.reset" -> "largest"  // reset to the latest offset (default)
      // "auto.offset.reset" -> "earliest" // reset to the earliest offset
      // "auto.offset.reset" -> "none"     // throw if no previous offset exists for the consumer group
    )
    /**
     * Read Kafka data starting from the specified positions.
     * Note: thanks to the exactly-once semantics of the direct approach, each record is consumed only once.
     * With starting offsets supplied, the stream resumes from where the previous run stopped.
     */
    val offsetList = List((topics, 0, 22753623L), (topics, 1, 327041L)) // (topic, partition, offset)
    val fromOffsets = setFromOffsets(offsetList) // build the offset map
    val messageHandler = (mam: MessageAndMetadata[String, String]) => (mam.topic, mam.message()) // extract (topic, message) from MessageAndMetadata
    // Consume from the specified offsets with KafkaUtils.createDirectStream; for details, see
    // "http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$"
    val messages: InputDStream[(String, String)] =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
        ssc, kafkaParams, fromOffsets, messageHandler)
    // Process the data
    messages.foreachRDD(mess => {
      // Grab this batch's offset ranges
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        lines.foreach(line => {
          val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
          logger.info("+++++++++++++++ record the offset here +++++++++++++++")
          logger.info(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
          logger.info("+++++++++++++++ process the data here +++++++++++++++")
          logger.info("The kafka line is " + line)
        })
      })
    })
    ssc
  }

  // Build the map of starting offsets
  def setFromOffsets(list: List[(String, Int, Long)]): Map[TopicAndPartition, Long] = {
    var fromOffsets: Map[TopicAndPartition, Long] = Map()
    for (offset <- list) {
      val tp = TopicAndPartition(offset._1, offset._2) // topic and partition
      fromOffsets += (tp -> offset._3)                 // offset position
    }
    fromOffsets
  }
}
```

IV. References:

1. Spark API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
2. Kafka configuration reference: http://kafka.apache.org/documentation.html#configuration
3. Kafka SimpleConsumer example: https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example
4. Spark Streaming Kafka integration guide (iterating over offsets): http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html
5. Kafka API docs: http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html

Note: the above passed testing and can be adapted as needed. If you have questions, please leave a comment!
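The job above only logs each batch's offset ranges; to actually survive a restart, the saved position must be advanced to each range's `untilOffset` after the batch succeeds. The core of that update is sketched below. The `OffsetRange` case class here mirrors the fields of Spark's `org.apache.spark.streaming.kafka.OffsetRange` but is redefined locally so the sketch runs without Spark on the classpath; the `advance` helper name is my own.

```scala
// Sketch: advance saved (topic, partition) -> offset positions to each
// batch's untilOffset, so a restart resumes where this batch ended.
// Local stand-in for Spark's OffsetRange (same field names, no Spark needed).
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

def advance(saved: Map[(String, Int), Long],
            ranges: Seq[OffsetRange]): Map[(String, Int), Long] =
  ranges.foldLeft(saved) { (acc, r) =>
    acc + ((r.topic, r.partition) -> r.untilOffset)
  }
```

In the real job, you would call something like this inside `foreachRDD` (after processing succeeds) and persist the result, e.g. to ZooKeeper or HDFS, then feed it back in as `fromOffsets` on the next start.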
