Spark學習（捌）- Spark Streaming入門

阿新 • • 發佈：2018-12-09

文章目錄

spark概念
Spark Streaming應用場景
Spark Streaming整合Spark生態系統的使用
Spark Streaming發展史
從詞頻統計功能著手入門Spark Streaming

spark-submit提交
spark-shell提交

Spark Streaming工作原理(粗粒度)
Spark Streaming工作原理(細粒度)

spark概念

Spark流是核心Spark API的擴充套件，它支援對實時資料流進行可伸縮、高吞吐量、容錯的流處理。資料可以從Kafka、Flume、Kinesis或TCP sockets等許多來源獲取，也可以使用map、reduce、join和window等高階函式表示的複雜演算法進行處理。最後，可以將處理後的資料推送到檔案系統、資料庫和實時儀表板。事實上，您可以將Spark的機器學習和圖形處理演算法應用於資料流。
在這裡插入圖片描述

Spark Streaming個人的定義：
將不同的資料來源的資料經過Spark Streaming處理之後將結果輸出到外部檔案系統

特點
低延時
能從錯誤中高效的恢復：fault-tolerant
能夠執行在成百上千的節點
能夠將批處理、機器學習、圖計算等子框架和Spark Streaming綜合起來使用

Spark Streaming是否需要獨立安裝？
不需要；因為spark是一棧式服務框架
One stack to rule them all ：一棧式

Spark Streaming應用場景

在這裡插入圖片描述
上半圖是實時交易欺詐的應用
下半圖是實時電子感測器監控

現實生產中應用更廣

Spark Streaming整合Spark生態系統的使用

在這裡插入圖片描述
將批處理與流處理相結合

上圖中；後續文章會有講解實現

離線學習模型可以接入sparkstreaming，線上應用它們
在這裡插入圖片描述
使用SQL互動式地查詢流資料

上圖中；後續文章會有講解實現

Spark Streaming發展史

在這裡插入圖片描述
Spark Streaming從0.9版本畢業；開始進入生產環境。

從詞頻統計功能著手入門Spark Streaming

spark原始碼地址 GitHub
https://github.com/apache/spark
在裡面有很多examples供學習。

NetworkWordCount測試

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.examples.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Counts words in UTF8 encoded, '\n' delimited text received from the network every second.
 *
 * Usage: NetworkWordCount <hostname> <port>
 * <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example org.apache.spark.examples.streaming.NetworkWordCount localhost 9999`
 */
object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    StreamingExamples.setStreamingLogLevels()

    // Create the context with a 1 second batch size
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a socket stream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    // Note that no duplication in storage level only for running locally.
    // Replication necessary in distributed scenario for fault tolerance.
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
// scalastyle:on println

spark-submit提交

example中提示開啟9999埠
在這裡插入圖片描述
使用spark-submit來提交我們的spark應用程式執行的指令碼(生產)

./spark-submit --master local[2] \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--name NetworkWordCount \
/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.2.0.jar hadoop000 9999

開啟另一個client端
在這裡插入圖片描述
測試：輸入

檢視spark-submit提交的介面

輸入

檢視spark-submit提交的介面

spark-shell提交

如何使用spark-shell來提交(測試)

./spark-shell --master local[2]

只需要在spark-shell啟動介面貼上以下程式碼即可

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("hadoop000", 9999)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

測試步驟和spark-submit一樣；都是在一個client輸入測試資料；spark-shell介面檢視結果。

Spark Streaming工作原理(粗粒度)

在這裡插入圖片描述
工作原理：粗粒度
Spark Streaming接收到實時資料流，把資料按照指定的時間段切成一片片小的資料塊，然後把小的資料塊傳給Spark Engine處理。

Spark Streaming工作原理(細粒度)

在這裡插入圖片描述
1、在Driver端會構建context來準備處理Application；SparkContext是StreamingContext的底層
2、Dirver端啟動一些Receiver來接受資料（處理資料的互動）
3、把receiver作為一個任務來執行
4、資料input進來；receiver把資料拆分為多個block放入記憶體中。如果設定副本就會拷貝到其他Executor上
5、receiver反饋給StreamingContext的blocks資訊；StreamingContext提交jobs給SparkContext
6、SparkContext將jobs分發給各個Executor處理作業。

Spark學習（捌）- Spark Streaming入門

文章目錄

spark概念

Spark Streaming應用場景

Spark Streaming整合Spark生態系統的使用

Spark Streaming發展史

從詞頻統計功能著手入門Spark Streaming

spark-submit提交

spark-shell提交

Spark Streaming工作原理(粗粒度)

Spark Streaming工作原理(細粒度)

Spark學習（捌）- Spark Streaming入門

Spark學習（拾）- Spark Streaming進階與案例實戰

Spark學習（玖）- Spark Streaming核心概念與程式設計

Spark學習（一）--Spark入門介紹和安裝

Spark學習（八）---Spark streaming原理

Spark學習（柒）- Spark SQL擴充套件和總結

Spark學習（陸）- Spark操作外部資料來源

Spark學習（一）Spark介紹

spark中flatMap函數用法--spark學習（基礎）

spark學習（1）--ubuntu14.04集群搭建、配置（jdk）

Spark學習（二）——RDD的設計與運行原理

Spark學習（伍）- DateFrame&Dataset

Spark學習（肆）- 從Hive平滑過渡到Spark SQL

Spark學習（叄）- 環境搭建

Spark 學習（6）

Spark學習（五）---RDD原理解析和spark執行架構

Spark學習（二）

Spark學習（一）——Scala基礎學習

Mysql學習（三）Spark（Scala）寫入Mysql的兩種方式

Spark學習（一）—— 論文翻譯

Spark學習（捌）- Spark Streaming入門

文章目錄

spark概念

Spark Streaming應用場景

Spark Streaming整合Spark生態系統的使用

Spark Streaming發展史

從詞頻統計功能著手入門Spark Streaming

spark-submit提交

spark-shell提交

Spark Streaming工作原理(粗粒度)

Spark Streaming工作原理(細粒度)

相關推薦