
Spark Streaming, Part 3

The updateStateByKey operator

Requirement: count the cumulative number of occurrences of each word so far (this requires keeping state across batches).

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    val newCount = ...  // add the new values with the previous running count to get the new count
    Some(newCount)
}
  1. If you use a stateful operator, you must specify a checkpoint directory so that the old state can be combined with the new values.
  2. In production, it is recommended to point the checkpoint at a directory on HDFS (see the sketch after this list).
  3. The argument passed to updateStateByKey is the update function you define; an implicit conversion is involved.
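
For reference, a minimal sketch of note 2 in code; the HDFS path below is a hypothetical example and must be adapted to your cluster:

// In production, checkpoint to an HDFS directory instead of the local "." used in the code below
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint/stateful-wordcount")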

The official documentation explains it as follows:

The update function will be called for each word, with newValues having a sequence of 1’s (from the (word, 1) pairs) and the runningCount having the previous count.

Note that using updateStateByKey requires the checkpoint directory to be configured, which is discussed in detail in the checkpointing section.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StatefulWordCount")
      .setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf,Seconds(5))
    val lines = ssc.socketTextStream("192.168.1.6",1111)
    // In production, point the checkpoint at a directory on HDFS.
    // A stateful operator requires a checkpoint directory so that old and new values can be combined.
    ssc.checkpoint(".")
    val result = lines.flatMap(_.split(" ")).map((_,1))
    // The argument passed in is the update function defined below; an implicit conversion is involved.
    val state = result.updateStateByKey[Int](updateFunction _)

    state.print()
    ssc.start()
    ssc.awaitTermination()
  }

  /**
    * Update the old (previously accumulated) state with the values from the current batch.
    * @param newValues the values for this key in the current batch
    * @param preValues the previously accumulated count, if any
    * @return the updated cumulative count
    */
  def updateFunction(newValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    val newCount = newValues.sum
    val pre = preValues.getOrElse(0)
    Some(newCount+pre)
  }
}

Writing the results to a MySQL database

First, delete the checkpoint files produced by the previous program.

This uses the foreachRDD operator.

The official documentation describes it as follows:

The most generic output operator that applies a function, func, 
to each RDD generated from the stream. 
This function should push the data in each RDD to an external system,
such as saving the RDD to files, or writing it over the network to a database. 
Note that the function func is executed in the driver process running the streaming application, 
and will usually have RDD actions in it that will force the computation of the streaming RDDs.
dstream.foreachRDD is a powerful primitive that allows data to be sent out to external systems. 
However, it is important to understand how to use this primitive correctly and efficiently.

Common mistakes when using the foreachRDD operator include the following:

Serialization exception: the connection is created on the driver but used on the workers, which would require the connection object to be serialized and shipped to them.

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed at the driver
  rdd.foreach { record =>
    connection.send(record) // executed at the worker
  }
}

High cost: a new connection is created and closed for every single record.

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    val connection = createNewConnection()
    connection.send(record)
    connection.close()
  }
}

Optimized version 1: create one connection per partition.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

Final optimization: reuse connections through a connection pool.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}
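
The official snippet above assumes a ConnectionPool helper that it does not show. Below is a minimal, illustrative sketch of such a pool, reusing the JDBC URL and credentials from the MySQL example later in this post; the pool size is an arbitrary assumption, and in real projects a pooling library (HikariCP, DBCP, etc.) is the better choice:

import java.sql.{Connection, DriverManager}
import scala.collection.mutable

// Minimal, illustrative connection pool (one instance per executor JVM); not production grade.
object ConnectionPool {
  private val maxPoolSize = 10
  private val pool = mutable.Queue[Connection]()

  // Load the MySQL JDBC driver once when the object is first used
  Class.forName("com.mysql.jdbc.Driver")

  def getConnection(): Connection = synchronized {
    if (pool.nonEmpty) pool.dequeue()
    else DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "root")
  }

  def returnConnection(connection: Connection): Unit = synchronized {
    if (pool.size < maxPoolSize) pool.enqueue(connection)
    else connection.close()
  }
}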

Writing the word-count results to MySQL

Create the table in the database:

create table wordcount(
word varchar(50) default null,
wordcount int(10) default null
);

Creating the MySQL connection (note that this is a plain connection per call, not yet a real pool):

import java.sql.{Connection, DriverManager}

  def createConnection(): Connection = {
    // Load the MySQL JDBC driver and open a new connection
    Class.forName("com.mysql.jdbc.Driver")
    DriverManager.getConnection("jdbc:mysql://localhost:3306/imooc_spark", "root", "root")
  }

Writing the data into MySQL:

    result.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createConnection()
        partitionOfRecords.foreach { record =>
          val sql = "insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"
          connection.createStatement().execute(sql)
        }
        connection.close()
      }
    }

Finally, verify the results...

Summary:

The word counts are written to MySQL with this SQL statement:

insert into wordcount(word, wordcount) values('" + record._1 + "'," + record._2 + ")"

Remaining problems:

1) Existing rows are never updated; every record is written with a plain insert.

Possible improvements:

a) Before inserting, check whether the word already exists: update it if it does, insert it otherwise (see the upsert sketch after this list).

b) In practice, use a store such as HBase or Redis instead.

2) A connection is created for every RDD partition; switch to a connection pool.
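
A sketch of improvement a), assuming a unique index exists on the word column (e.g. ALTER TABLE wordcount ADD UNIQUE KEY uk_word(word)), which is not part of the table definition above. It uses MySQL's INSERT ... ON DUPLICATE KEY UPDATE with a PreparedStatement, which also avoids building SQL strings by hand:

    result.foreachRDD { rdd =>
      rdd.foreachPartition { partitionOfRecords =>
        val connection = createConnection()
        // Upsert: insert a new word, or add the per-batch count to the existing row
        val pstmt = connection.prepareStatement(
          "insert into wordcount(word, wordcount) values(?, ?) " +
            "on duplicate key update wordcount = wordcount + values(wordcount)")
        partitionOfRecords.foreach { record =>
          pstmt.setString(1, record._1)
          pstmt.setInt(2, record._2)
          pstmt.addBatch()
        }
        pstmt.executeBatch()
        pstmt.close()
        connection.close()
      }
    }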

Using window operations

window: process the data from a given time range at a regular interval.

window length: the duration of the window.

sliding interval: the interval at which the window operation is performed.

Both parameters must be multiples of the batch interval.

How often do we compute over what range? For example, every 10 seconds compute the word count of the previous 10 minutes ==> every sliding interval, aggregate the data of the previous window length.

The example from the official documentation:

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(30), Seconds(10))
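
A minimal sketch that puts this example into a complete job, assuming the socket source on localhost:6789 used elsewhere in this post and a 5-second batch interval; note that the window length (30s) and the sliding interval (10s) are both multiples of the batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("WindowWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)
    val pairs = lines.flatMap(_.split(" ")).map((_, 1))

    // Every 10 seconds, count the words seen in the last 30 seconds
    val windowedWordCounts =
      pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    windowedWordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}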

Blacklist filtering:

  • Convert the access log into a DStream
  • Convert the blacklist into an RDD
  • Left-join the DStream with the RDD and keep only the log entries whose user is not on the blacklist
Access log   ==> DStream
20180808,zs
20180808,ls
20180808,ww
   ==>  (zs: 20180808,zs)(ls: 20180808,ls)(ww: 20180808,ww)

Blacklist  ==> RDD
zs
ls
   ==> (zs: true)(ls: true)

Expected result ==> 20180808,ww

left join results:
(zs: [<20180808,zs>, <true>])   x  (filtered out)
(ls: [<20180808,ls>, <true>])   x  (filtered out)
(ww: [<20180808,ww>, <false>])  ==> kept; emit the first element of the value tuple, i.e. 20180808,ww

The code is as follows:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Blacklist filtering
  */
object TransformApp {


  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

    /**
      * Creating a StreamingContext requires two parameters: a SparkConf and the batch interval
      */
    val ssc = new StreamingContext(sparkConf, Seconds(5))


    /**
      * Build the blacklist RDD
      */
    val blacks = List("zs", "ls")
    val blacksRDD = ssc.sparkContext.parallelize(blacks).map(x => (x, true))

    val lines = ssc.socketTextStream("localhost", 6789)
    val clicklog = lines.map(x => (x.split(",")(1), x)).transform(rdd => {
      rdd.leftOuterJoin(blacksRDD)
        .filter(x => !x._2._2.getOrElse(false)) // drop lines whose user appears in the blacklist
        .map(x => x._2._1)                      // keep only the original log line
    })

    clicklog.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Integrating Spark SQL with Spark Streaming

Add the Spark SQL dependency to pom.xml:

    <!--SparkSQL-->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>${spark.version}</version>
      <!--
      <scope>provided</scope>
      -->
    </dependency>

SparkSession is used as a singleton:

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

/**
  * Spark Streaming整合Spark SQL完成詞頻統計操作
  */
object SqlNetworkWordCount {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ForeachRDDApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 6789)
    val words = lines.flatMap(_.split(" "))

    // Convert RDDs of the words DStream to DataFrame and run SQL query
    words.foreachRDD { (rdd: RDD[String], time: Time) =>
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to RDD[case class] to DataFrame
      val wordsDataFrame = rdd.map(w => Record(w)).toDF()

      // Creates a temporary view using the DataFrame
      wordsDataFrame.createOrReplaceTempView("words")

      // Do word count on table using SQL and print it
      val wordCountsDataFrame =
        spark.sql("select word, count(*) as total from words group by word")
      println(s"========= $time =========")
      wordCountsDataFrame.show()
    }


    ssc.start()
    ssc.awaitTermination()
  }


  /** Case class for converting RDD to DataFrame */
  case class Record(word: String)


  /** Lazily instantiated singleton instance of SparkSession */
  object SparkSessionSingleton {

    @transient  private var instance: SparkSession = _

    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession
          .builder
          .config(sparkConf)
          .getOrCreate()
      }
      instance
    }
  }
}

Verify the results.