
Understanding Spark in Depth: Spark Streaming Overview 1 (a translation of the official documentation)

I have been studying English recently and wanted to put it to use, so I tried reading the official Spark documentation and translated part of it. My English is limited, so if anything is missing or wrong, corrections are welcome.

* Spark Streaming Overview
* Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
* fault-tolerant stream processing of live data streams.
* Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets,
* and can be processed using complex algorithms expressed with high-level functions
* like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,
* databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data
* streams.
*
* Spark Streaming Overview
* Spark Streaming is an extension of the core Spark API: a scalable, high-throughput, fault-tolerant framework for processing live data streams.
* It can ingest data from many kinds of sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets,
* and process it with complex algorithms expressed through high-level functions such as map, reduce, join and window.
* The processed results can then be pushed out to external file systems, databases, or live dashboards (think of the big screen showing Tmall's real-time transaction volume every Double 11).
* In fact, you can even apply Spark's machine learning and graph processing algorithms to Spark Streaming.

* Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches,
* which are then processed by the Spark engine to generate the final stream of results in batches
* Internally, it works as follows:
* Spark Streaming receives live input data streams and divides the data into batches, which the Spark engine then processes to produce the final stream of results, also in batches.
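* To make the batching concrete, here is a minimal sketch (the host, port and one-second interval are only illustrative values): each batch surfaces as a single RDD, which is what the Spark engine actually processes.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchingSketch {
  def main(args: Array[String]): Unit = {
    // A one-second batch interval: every second, the data received in that
    // window is handed to the Spark engine as one RDD.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchingSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Illustrative source; any input DStream is batched in the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the per-batch RDDs directly.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}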

* Spark Streaming provides a high-level abstraction called discretized stream or DStream,
* which represents a continuous stream of data.
* DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis,
* or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
* This guide shows you how to start writing Spark Streaming programs with DStreams.
* You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2),
* all of which are presented in this guide.
* You will find tabs throughout this guide that let you choose between code snippets of different languages.
* Note: There are a few APIs that are either different or not available in Python.
* Throughout this guide, you will find the tag Python API highlighting these differences.
*
*
* Spark Streaming provides a high-level abstraction called a discretized stream, or DStream for short,
* which represents a continuous stream of data.
* DStreams can be created from data ingested from sources such as Kafka, Flume, and Kinesis,
* or by applying high-level operations to other DStreams. Internally, a DStream is represented as a sequence of RDDs.
* This guide shows you how to quickly write Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (Python support was introduced in Spark 1.2).
* Throughout the guide you will find tabs that let you switch between code snippets in the different languages.
* Note: a few APIs are different, or not available, in Python; the tag "Python API" highlights these differences throughout the guide.

A Quick Example

* Before we go into the details of how to write your own Spark Streaming program,
* let’s take a quick look at what a simple Spark Streaming program looks like.
* Let’s say we want to count the number of words in text data received from a data server listening on a TCP socket.
* All you need to do is as follows
*
* A quick example
* Before going into the details of how to write your own Spark Streaming program, let's take a quick look at the following example.
* Suppose we want to count the number of words in text data received from a data server listening on a TCP socket. All you need to do is the following:
 import org.apache.spark._
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Create a local StreamingContext with two working threads and a batch interval of 1 second.

// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
* Using this context, we can create a DStream that represents streaming data from a TCP source,
* specified as hostname (e.g. localhost) and port (e.g. 9999).
*
* Using this context object we can create a DStream that represents the streaming data coming from a TCP source,
* specified by a hostname (e.g. localhost) and a port (e.g. 9999).
* 
*
* This lines DStream represents the stream of data that will be received from the data server.
* Each record in this DStream is a line of text. Next, we want to split the lines by space characters into words.
*
* This lines DStream represents the stream of data that will be received from the data server.
* Each record in the DStream is a line of text; next, we want to split each line into words on whitespace.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
* flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from
* each record in the source DStream. In this case, each line will be split into multiple words and the
* stream of words is represented as the words DStream. Next, we want to count these words
*
* flatMap is a one-to-many DStream operation: it creates a new DStream by generating multiple new records from each record of the source DStream.
* In this case, each line is split into multiple words, and the resulting stream of words is represented by the words DStream. Next, we want to count these words.
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
* The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs,
* which is then reduced to get the frequency of words in each batch of data. Finally,
* wordCounts.print() will print a few of the counts generated every second.
*
* The words DStream is further mapped (a one-to-one transformation) into a DStream of (word, 1) pairs,
* which is then reduced to obtain the frequency of each word in every batch of data. Finally, wordCounts.print() prints a few of the counts generated every second.
*
*
* Note that when these lines are executed, Spark Streaming only sets up the computation it will perform
* when it is started, and no real processing has started yet.
* To start the processing after all the transformations have been set up, we finally call
*
* Note that when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has happened yet.
* Transformations are evaluated lazily, so to start the processing after all of them have been set up, we finally call:
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
* The complete code can be found in the Spark Streaming example NetworkWordCount.
*
* The complete code can be found in the Spark Streaming example named NetworkWordCount.
*
*
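* For reference, the snippets above assemble into the following self-contained sketch (the object name NetworkWordCountSketch is just a placeholder; the example that ships with Spark is called NetworkWordCount):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local StreamingContext with two threads and a 1-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines received from the TCP source.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words, map to (word, 1) pairs and reduce per batch.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Print the first ten elements of each batch's result to the console.
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}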
* If you have already downloaded and built Spark, you can run this example as follows.
* You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using
*
* If you have already downloaded and built Spark, you can run this example as follows.
* First you need to run Netcat (a small utility found on most Unix-like systems) as the data server:
*
* $ nc -lk 9999
* Then, in a different terminal, you can start the example by using
*
* Then, in a different terminal, you can start the example program by running:
*
*
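* With the build that ships with Spark, the command looks like this (the exact path to the run-example script may differ depending on your installation):
*
* $ ./bin/run-example streaming.NetworkWordCount localhost 9999
*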
* Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second
* Then, any lines typed into the terminal running the netcat server will be counted and printed on the screen every second.
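* For example, typing "hello world" into the netcat terminal should produce output in the example's terminal similar to the following (the timestamp is, of course, just an illustration):
*
* -------------------------------------------
* Time: 1357008430000 ms
* -------------------------------------------
* (hello,1)
* (world,1)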