
[Practice] Spark RDD API in Action

  • map

Applies a transformation function on each item of the RDD and returns the result as a new RDD.

// 3 means three partitions
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
// build a new RDD from the lengths of the elements of a
val b = a.map(_.length)
// zip the two RDDs into a new RDD of pairs
val c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))
  • zip

Joins two RDDs by combining the i-th elements of each partition with each other. The resulting RDD will consist of two-component tuples which are interpreted as key-value pairs by the methods provided by the PairRDDFunctions extension.

val a1 = sc.parallelize(1 to 10, 3)
val b1 = sc.parallelize(11 to 20, 3)
a1.zip(b1).collect
res1: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15), (6,16), (7,17), (8,18), (9,19), (10,20))

val a2 = sc.parallelize(1 to 10, 3)
val b2 = sc.parallelize(11 to 20, 3)
val c2 = sc.parallelize(21 to 30, 3)
a2.zip(b2).zip(c2).collect
res3: Array[((Int, Int), Int)] = Array(((1,11),21), ((2,12),22), ((3,13),23), ((4,14),24), ((5,15),25), ((6,16),26), ((7,17),27), ((8,18),28), ((9,19),29), ((10,20),30))

a2.zip(b2).zip(c2).map((x) => (x._1._1, x._1._2, x._2)).collect
res2: Array[(Int, Int, Int)] = Array((1,11,21), (2,12,22), (3,13,23), (4,14,24), (5,15,25), (6,16,26), (7,17,27), (8,18,28), (9,19,29), (10,20,30))
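
Because zip produces an RDD of pairs, the PairRDDFunctions methods become available on the result. A minimal sketch (the names words and lens are just for illustration):

val words = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val lens = words.map(_.length)
// the zipped RDD[(String, Int)] picks up PairRDDFunctions such as reduceByKey
words.zip(lens).reduceByKey(_ + _).collect
// expect (salmon,12) among the results, since both salmon entries are summed; ordering may vary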
  • filter

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res4: Array[Int] = Array(2, 4, 6, 8, 10)
  • flatMap
    Similar to map, but allows emitting more than one item in the map function. map turns one element into exactly one element; flatMap turns one element into one or more elements.
var a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res8: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4,
5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8,
1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res9: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

var x  = sc.parallelize(1 to 5, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(5))(_)).collect
res10: Array[Int] = Array(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
  • mapPartitions
    This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose. In short: each partition's elements are processed with the given function, producing a new RDD.
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res ::= (pre, cur) // prepend the (previous, current) pair
    pre = cur
  }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
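
Another minimal sketch of the once-per-partition behavior: emitting one sum per partition. With 1 to 9 split across 3 partitions, the expected partition sums are 6, 15 and 24.

val s = sc.parallelize(1 to 9, 3)
// the function runs once per partition and sees all of its elements
s.mapPartitions(iter => Iterator(iter.sum)).collect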
  • mapPartitionsWithIndex
    Similar to mapPartitions, but takes two parameters. The first parameter is the index of the partition and the second is an iterator through all the items within this partition. The output is an iterator containing the list of items after applying whatever transformation the function encodes.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]) : Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
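
The partition index is handy when one partition needs special treatment, for example dropping a header line known to sit in partition 0 (a hypothetical scenario for illustration):

val lines = sc.parallelize(List("header", "row1", "row2", "row3"), 2)
// drop the first element of partition 0 only; other partitions pass through unchanged
lines.mapPartitionsWithIndex((idx, it) => if (idx == 0) it.drop(1) else it).collect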
  • sample
    Randomly selects a fraction of the items of a RDD and returns them in a new RDD.
val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
res24: Long = 960

a.sample(true, 0.3, 0).count
res25: Long = 2888

a.sample(true, 0.3, 13).count
res26: Long = 2985
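
The arguments are withReplacement, fraction and seed. With a fixed seed the sample is deterministic, so repeating a call yields the same count:

// same RDD, same seed: returns the same count as res24 above
a.sample(false, 0.1, 0).count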
  • union, ++
    Performs the standard set operation: A union B. union/++ put the elements of both RDDs directly into the new RDD, whereas zip combines elements of the two RDDs into tuples and uses those tuples as the elements of the new RDD.
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)
  • intersection
    Returns the elements in the two RDDs which are the same.
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)
  • distinct
    Returns a new RDD that contains each unique value only once.
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2

a.distinct(3).partitions.length
res17: Int = 3
  • groupBy
    Groups the elements of the RDD by the key returned by the supplied function.
val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), 
(odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
  a % 2
}
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), 
(1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
  a % 2
}
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), 
(1,ArrayBuffer(1, 3, 5, 7, 9)))
  • keyBy
    Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), 
(8,elephant))
  • groupByKey
    Very similar to groupBy, but instead of supplying a function, the key-component of each pair will automatically be presented to the partitioner.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), 
(6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), 
(5,ArrayBuffer(tiger, eagle)))
  • reduceByKey
    This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be commutative and associative in order to generate reproducible results.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
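
The classic use is counting. A minimal word-count sketch (names are just for illustration):

val words = sc.parallelize(List("dog", "cat", "dog", "owl", "cat", "dog"), 2)
words.map((_, 1)).reduceByKey(_ + _).collect
// expect (dog,3), (cat,2), (owl,1) in some order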
  • aggregate
    Aggregates the elements of each partition with the first function (seqOp), starting from the zero value, then combines the per-partition results with the second function (combOp). Note that the zero value is applied once in every partition and once more when the partition results are combined, and that the order in which partition results are combined is not deterministic.
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
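
To see why the result is 9: seqOp computes max(0,1,2,3) = 3 in the first partition and max(0,4,5,6) = 6 in the second, then combOp sums 0 + 3 + 6 = 9. glom, which gathers each partition into an array, lets you inspect the split:

z.glom().collect
// expect Array(Array(1, 2, 3), Array(4, 5, 6))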

val z = sc.parallelize(List("a","b","c","d","e","f"),2)
z.aggregate("")(_ + _, _+_)
res115: String = abcdef

z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

  • sortByKey
This function sorts the input RDD’s data and stores it in a new RDD. The output RDD is a shuffled RDD because it stores data that is output by a reducer which has been shuffled. The implementation of this function is actually very clever. First, it uses a range partitioner to partition the data in ranges within the shuffled RDD. Then it sorts these ranges individually with mapPartitions using standard sort mechanisms.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))

  • cogroup
cogroup groups the data of the two RDDs by key, keeping each RDD's values in a separate group.
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),
(2,(ArrayBuffer(banana),ArrayBuffer())),
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))

  • pipe
Applies the given shell command to the data of each partition: the partition's elements are written to the command's stdin, and each line of its stdout becomes an element of the resulting RDD:
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)
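
A sketch assuming a Unix-like environment where wc is available; each three-element partition produces its own line count:

a.pipe("wc -l").collect
// expect one count per partition, i.e. "3" three times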

  • coalesce,repartition
Adjusts the number of partitions of an RDD, producing a new RDD. repartition always performs a shuffle; coalesce lets you specify whether to shuffle.
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
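
Without a shuffle, coalesce can only reduce the partition count; to grow it, pass shuffle = true, which is exactly what repartition does. A minimal sketch (y2 is just an illustrative name):

val y2 = sc.parallelize(1 to 10, 2)
y2.repartition(5).partitions.length
// expect 5: repartition always shuffles and can increase partitions
y2.coalesce(5, true).partitions.length
// expect 5 as well; without shuffle = true the count could not grow past 2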