
Understanding Spark in Depth: Spark Streaming Overview 1 (a translation of the official documentation)

I have been studying English recently and wanted to put it to use, so I tried reading the official Spark documentation and translated part of it. My English is limited, so if anything is missing or wrong, corrections are welcome.

* Spark Streaming Overview
* Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput,
* fault-tolerant stream processing of live data streams.
* Data can be ingested from many sources like Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets,
* and can be processed using complex algorithms expressed with high-level functions
* like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,
* databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data
* streams.
*
* Spark Streaming Overview
* Spark Streaming is an extension of the core Spark API: a scalable, high-throughput, fault-tolerant framework for processing live data streams.
* It can ingest data from many kinds of sources, such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets,
* and process it with complex algorithms expressed through high-level functions such as map, reduce, join and window.
* The processed results can then be pushed out to external file systems, databases, or live dashboards (think of the big screen showing Tmall's real-time transaction volume every Double 11).
* In fact, you can even apply Spark's machine learning and graph processing algorithms to Spark Streaming.

* Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into batches,
* which are then processed by the Spark engine to generate the final stream of results in batches
* Internally, it works as follows:
* Spark Streaming receives live input data streams and divides the data into batches, which the Spark engine then processes to produce the final stream of results, also in batches.
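* To make the batching concrete, here is a minimal sketch (the host, port and one-second interval are only illustrative values): each batch surfaces as a single RDD, which is what the Spark engine actually processes.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchingSketch {
  def main(args: Array[String]): Unit = {
    // A one-second batch interval: every second, the data received in that
    // window is handed to the Spark engine as one RDD.
    val conf = new SparkConf().setMaster("local[2]").setAppName("BatchingSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Illustrative source; any input DStream is batched in the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the per-batch RDDs directly.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch at $time contains ${rdd.count()} records")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}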

* Spark Streaming provides a high-level abstraction called discretized stream or DStream,
* which represents a continuous stream of data.
* DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis,
* or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
* This guide shows you how to start writing Spark Streaming programs with DStreams.
* You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2),
* all of which are presented in this guide.
* You will find tabs throughout this guide that let you choose between code snippets of different languages.
* Note: There are a few APIs that are either different or not available in Python.
* Throughout this guide, you will find the tag Python API highlighting these differences.
*
*
* Spark Streaming provides a high-level abstraction called a discretized stream, or DStream for short,
* which represents a continuous stream of data.
* DStreams can be created from data ingested from sources such as Kafka, Flume, and Kinesis,
* or by applying high-level operations to other DStreams. Internally, a DStream is represented as a sequence of RDDs.
* This guide shows you how to quickly write Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (Python support was introduced in Spark 1.2).
* Throughout the guide you will find tabs that let you switch between code snippets in the different languages.
* Note: a few APIs are different, or not available, in Python; the tag "Python API" highlights these differences throughout the guide.

A Quick Example

* Before we go into the details of how to write your own Spark Streaming program,
* let’s take a quick look at what a simple Spark Streaming program looks like.
* Let’s say we want to count the number of words in text data received from a data server listening on a TCP socket.
* All you need to do is as follows
*
* A quick example
* Before going into the details of how to write your own Spark Streaming program, let's take a quick look at the following example.
* Suppose we want to count the number of words in text data received from a data server listening on a TCP socket. All you need to do is the following:
 import org.apache.spark._
 import org.apache.spark.streaming._
 import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3
// Create a local StreamingContext with two working threads and a batch interval of 1 second.

// The master requires 2 cores to prevent a starvation scenario.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
* Using this context, we can create a DStream that represents streaming data from a TCP source,
* specified as hostname (e.g. localhost) and port (e.g. 9999).
*
* Using this context object we can create a DStream that represents the streaming data coming from a TCP source,
* specified by a hostname (e.g. localhost) and a port (e.g. 9999).
* 
*
* This lines DStream represents the stream of data that will be received from the data server.
* Each record in this DStream is a line of text. Next, we want to split the lines by space characters into words.
*
* This lines DStream represents the stream of data that will be received from the data server.
* Each record in the DStream is a line of text; next, we want to split each line into words on whitespace.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
* flatMap is a one-to-many DStream operation that creates a new DStream by generating multiple new records from
* each record in the source DStream. In this case, each line will be split into multiple words and the
* stream of words is represented as the words DStream. Next, we want to count these words
*
* flatMap is a one-to-many DStream operation: it creates a new DStream by generating multiple new records from each record of the source DStream.
* In this case, each line is split into multiple words, and the resulting stream of words is represented by the words DStream. Next, we want to count these words.
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
// Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.print()
* The words DStream is further mapped (one-to-one transformation) to a DStream of (word, 1) pairs,
* which is then reduced to get the frequency of words in each batch of data. Finally,
* wordCounts.print() will print a few of the counts generated every second.
*
* The words DStream is further mapped (a one-to-one transformation) into a DStream of (word, 1) pairs,
* which is then reduced to obtain the frequency of each word in every batch of data. Finally, wordCounts.print() prints a few of the counts generated every second.
*
*
* Note that when these lines are executed, Spark Streaming only sets up the computation it will perform
* when it is started, and no real processing has started yet.
* To start the processing after all the transformations have been set up, we finally call
*
* Note that when these lines are executed, Spark Streaming only sets up the computation it will perform once it is started; no real processing has happened yet.
* Transformations are evaluated lazily, so to start the processing after all of them have been set up, we finally call:
ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate
* The complete code can be found in the Spark Streaming example NetworkWordCount.
*
* The complete code can be found in the Spark Streaming example named NetworkWordCount.
*
*
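* For reference, the snippets above assemble into the following self-contained sketch (the object name NetworkWordCountSketch is just a placeholder; the example that ships with Spark is called NetworkWordCount):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local StreamingContext with two threads and a 1-second batch interval.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines received from the TCP source.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words, map to (word, 1) pairs and reduce per batch.
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // Print the first ten elements of each batch's result to the console.
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate
  }
}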
* If you have already downloaded and built Spark, you can run this example as follows.
* You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using
*
* If you have already downloaded and built Spark, you can run this example as follows.
* First you need to run Netcat (a small utility found on most Unix-like systems) as the data server:
*
* $ nc -lk 9999
* Then, in a different terminal, you can start the example by using
*
* Then, in a different terminal, you can start the example program by running:
*
*
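* With the build that ships with Spark, the command looks like this (the exact path to the run-example script may differ depending on your installation):
*
* $ ./bin/run-example streaming.NetworkWordCount localhost 9999
*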
* Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second
* Then, any lines typed into the terminal running the netcat server will be counted and printed on the screen every second.
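* For example, typing "hello world" into the netcat terminal should produce output in the example's terminal similar to the following (the timestamp is, of course, just an illustration):
*
* -------------------------------------------
* Time: 1357008430000 ms
* -------------------------------------------
* (hello,1)
* (world,1)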