spark rdd 詳解 - 程式人生

aggregate

The aggregate function allows the user to apply two different reduce functions to the RDD. The first reduce function is applied within each partition to reduce the data within each partition into a single result. The second reduce function is used to combine the different reduced results of all partitions together to arrive at one final result. The ability to have two separate reduce functions for intra partition versus across partition reducing adds a lot of flexibility. For example the first reduce function can be the max function and the second one can be the sum function. The user also specifies an initial value. Here are some important facts.
The initial value is applied at both levels of reduce. So both at the intra partition reduction and across partition reduction.

Both reduce functions have to be commutative and associative.
Do not assume any execution order for either partition computations or combining partitions.
Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp (first reducer) could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp (second reducer) will receive these highlights as partial maps and combine them into a single final output map.
Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

// lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
  iter.toList 
.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res28: Array[String] = Array([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])

z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

// This example returns 16 since the initial value is 5
// reduce of partition 0 will be max(5, 1, 2, 3) = 5
// reduce of partition 1 will be max(5, 4, 5, 6) = 6
// final reduce across partitions will be 5 + 5 + 6 = 16
// note the final reduce include the initial value
z.aggregate(5)(math.max(_, _), _ + _)
res29: Int = 16


val z = sc.parallelize(List("a","b","c","d","e","f"),2)

//lets first print out the contents of the RDD with partition labels
def myfunc(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}

z.mapPartitionsWithIndex(myfunc).collect
res31: Array[String] = Array([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])

z.aggregate("")(_ + _, _+_)
res115: String = abcdef

// See here how the initial value "x" is applied three times.
//  - once for each partition
//  - once when combining all the partitions in the second reduce function.
z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

// Below are some more advanced examples. Some are quite tricky to work out.

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1.
The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.

Examples 2

val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in length of zero being input to the second reduce which then upgrades it a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)

aggregateByKey [Pair]

Works like the aggregate function except the aggregation is applied to the values with the same key. Also unlike the aggregate function the initial value is not applied to the second reduce.

Listing Variants

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, numPartitions: Int)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
def aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

Example

val pairRDD = sc.parallelize(List( ("cat",2), ("cat", 5), ("mouse", 4),("cat", 12), ("dog", 12), ("mouse", 2)), 2)

// lets have a look at what is in the partitions
def myfunc(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
  iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
}
pairRDD.mapPartitionsWithIndex(myfunc).collect

res2: Array[String] = Array([partID:0, val: (cat,2)], [partID:0, val: (cat,5)], [partID:0, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)], [partID:1, val: (mouse,2)])

pairRDD.aggregateByKey(0)(math.max(_, _), _ + _).collect
res3: Array[(String, Int)] = Array((dog,12), (cat,17), (mouse,6))

pairRDD.aggregateByKey(100)(math.max(_, _), _ + _).collect
res4: Array[(String, Int)] = Array((dog,100), (cat,200), (mouse,200))

cartesian

Computes the cartesian product between two RDDs (i.e. Each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function.! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))
val y = sc.parallelize(List(6,7,8,9,10))
x.cartesian(y).collect
res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8), (1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9), (3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory “my_directory_name” should exist in all slaves. As an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")
val a = sc.parallelize(1 to 4)
a.checkpoint
a.count
14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15
...
14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)
14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12
res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean = false ): RDD [T]
def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Iterable[V], Iterable[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Iterable[V], IterableW1], Iterable[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c))),
(3,(ArrayBuffer(b),ArrayBuffer(c))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))
)

val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array(
(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),
(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d)))
)

val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array(
(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))), 
(2,(ArrayBuffer(banana),ArrayBuffer())), 
(3,(ArrayBuffer(orange),ArrayBuffer())),
(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),
(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray

Converts the RDD into a Scala array and returns it. If you provide a standard map-function (i.e. f = T -> U) it will be applied before inserting the values into the result array.

Listing Variants

def collect(): Array[T]
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U]
def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.collect
res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu, Rat)

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into Scala maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.zip(a)
b.collectAsMap
res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1 -> 1, 3 -> 3)

combineByKey[Pair]

Very efficient implementation that combines the values of a RDD consisting of two-component tuples by applying multiple aggregators one after another.

Listing Variants

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null): RDD[(K, C)]

Example

val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
val c = b.zip(a)
val d = c.combineByKey(List(_), (x:List[String], y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)
d.collect
res16: Array[(Int, List[String])] = Array((1,List(cat, dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

compute

Executes dependencies and computes the actual representation of the RDD. This function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext): Iterator[T]

context, sparkContext

Returns the SparkContext that was used to create the RDD.

Listing Variants

def compute(split: Partition, context: TaskContext): Iterator[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.context
res8: org.apache.spark.SparkContext = org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def (timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countApproxDistinct

Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05): Long

Example

val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224

b.countApproxDistinct(0.05)
res15: Long = 9750

b.countApproxDistinct(0.01)
res16: Long = 9947

b.countApproxDistinct(0.001)
res0: Long = 10000

countApproxDistinctByKey [Pair]

Similar to countApproxDistinct, but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting of two-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence: Double = 0.95): PartialResult[Map[T, BoundedDouble]]

dependencies

Returns the RDD on which this RDD depends.

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[32] at parallelize at <console>:12
b.dependencies.length
Int = 0

b.map(a => a).dependencies.length
res40: Int = 1

b.cartesian(a).dependencies.length
res41: Int = 2

b.cartesian(a).dependencies
res42: Seq[org.apache.spark.Dependency[_]] = List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]
def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2

a.distinct(3).partitions.length
res17: Int = 3

first

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.first
res1: String = Gnu

filter

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)
b.filter(_ < 4).collect
res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.filter(_ < 4).collect
<console>:15: error: value < is not a member of Any

This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat", "horse", 4.0, 3.5, 2, "dog"))
a.collect({case a: Int    => "is integer" |
           case b: String => "is string" }).collect
res17: Array[String] = Array(is string, is string, is integer, is string)

val myfunc: PartialFunction[Any, Any] = {
  case a: Int    => "is integer" |
  case b: String => "is string" }
myfunc.isDefinedAt("")
res21: Boolean = true

myfunc.isDefinedAt(1)
res22: Boolean = true

myfunc.isDefinedAt(1.5)
res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of any. Otherwise the compiler does (apparently) not know what bytecode it should produce:

val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}
<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}
myfunc2: PartialFunction[Int,Any] = <function1>

filterByRange [Ordered]

Returns an RDD containing only the items in the key range specified. From our testing, it appears this only works if your data is in key value pairs and it has already been sorted by key.

Listing Variants

def filterByRange(lower: K, upper: K): RDD[P]

Example

val randRDD = sc.parallelize(List( (2,"cat"), (6, "mouse"),(7, "cup"), (3, "book"), (4, "tv"), (1, "screen"), (5, "heater")), 3)
val sortedRDD = randRDD.sortByKey()

sortedRDD.filterByRange(1, 3).collect
res66: Array[(Int, String)] = Array((1,screen), (2,cat), (3,book))

filterWith (deprecated)

This is an extended version of filter. It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. Apply the filter).

Listing Variants

def filterWith[A: ClassTag](constructA: Int => A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)
val b = a.filterWith(i => i)((x,i) => x % 2 == 0 || i % 2 == 0)
b.collect
res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)
a.filterWith(x=> x)((a, b) =>  b == 0).collect
res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect
res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect
res34: Array[Int] = Array(5, 6)

flatMap

Similar to map, but allows emitting more than one item in the map function.

Listing Variants

def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U]

Example

val a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number of copies (up to 10) of the items in the list.
val x  = sc.parallelize(1 to 10, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues

Very similar to mapValues, but collapses the inherent structure of the values during mapping.

Listing Variants

def flatMapValues[U](f: V => TraversableOnce[U]): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.flatMapValues("x" + _ + "x").collect
res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o), (3,g), (3,x), (5,x), (5,t), (5,i), (5,g), ( 
 
              
           
              
              
            
            相關推薦
			   
            
            
            
 

    

    
    Spark RDD詳解
      Spark   RDD   大數據   大數據開發   1、RDD是什麽RDD：Spark的核心概念是RDD (resilientdistributed dataset)，指的是一個只讀的，可分區的分布式數據集，這個數據集的全部或部分可以緩存在內存中，在多次計算間重用。
為什麽會產生RDD？
（1）傳統的Ma 

  
 

    

    
    Spark 入門之二：Spark RDD詳解
      RDD的容錯機制實現分散式資料集容錯方法有兩種：資料檢查點和記錄更新RDD採用記錄更新的方式：記錄所有更新點的成本很高。所以，RDD只支援粗顆粒變換，即只記錄單個塊上執行的單個操作，然後建立某個RDD的變換序列（血統）儲存下來；變換序列指，每個RDD都包含了他是如何由其他RDD變換過來的以及如何重建某一塊資料 

  
 

    

    
    spark rdd詳解一(rdd入門)
      
							
							
							1.rdd簡介

與許多專有的大資料處理平臺不同，Spark建立在統一抽象的RDD之上，使得它可以以基本一致的方式應對不同的大資料處理場景，包括MapReduce，Streaming，SQL，Machine Learning以及Graph等。這即Matei Za 

  
 

    

    
    spark rdd 詳解
      
							
							
							

aggregate

The aggregate function allows the user to apply two different reduce functions to the RDD. The first reduce function i 

  
 

    

    
    Spark RDD詳解 | RDD特性、lineage、快取、checkpoint、依賴關係
      RDD（Resilient Distributed Datasets）彈性的分散式資料集，又稱Spark core，它代表一個只讀的、不可變、可分割槽，裡面的元素可分散式平行計算的資料集。
RDD是一個很抽象的概念，不易於理解，但是要想學好Spark，必須要掌握RDD，熟悉它的程式設計模型，這是學習Spark 

  
 

    

    
    spark的rdd詳解1
      操作   spa   img   cal   選擇   分享圖片   分區   並行   方式    
1，rdd的轉換和行動操作
 

 
 
2，創建rdd的2種方式
1，通過hdfs支持的文件系統，沒有真正把數據放rdd，只記錄了一下元數據
2，通過scala的集合或者數組並行化的創建rdd

 
3， 

  
 

    

    
    深入理解spark－rdd詳解
      彈性   gem   exc   .com   drive   image   都是   spa   ima    
1.我們在使用spark計算的時候，操作數據集的感覺很方便是因為spark幫我們封裝了一個rdd（彈性分布式數據集Resilient Distributed Dataset）；
那麽rdd 

  
 

    

    
    Spark函式詳解系列之RDD基本轉換
      
                摘要：

 RDD：彈性分散式資料集，是一種特殊集合 ‚ 支援多種來源 ‚ 有容錯機制 ‚ 可以被快取 ‚ 支援並行操作，一個RDD代表一個分割槽裡的資料集

 RDD有兩種操作運算元：

     Transformation（轉換）：Transformation屬於延遲計 

  
 

    

    
    Spark函式詳解系列之RDD基本轉換+例項
       
 
 
  RDD：彈性分散式資料集，是一種特殊集合 ‚ 支援多種來源 ‚ 有容錯機制 ‚ 可以被快取 ‚ 支援並行操作，一個RDD代表一個分割槽裡的資料集 
   RDD有兩種操作運算元： 
       &nbs 

  
 

    

    
    Spark彈性分散式資料集RDD詳解
      
							
							
							【Spark】彈性分散式資料集RDD概述



彈性分佈資料集RDD

RDD(Resilient Distributed Dataset)是Spark的最基本抽象，是對分散式記憶體的抽象使用，實現了以操作本地集合的方式來操作分散式資料集的抽象實現。RDD是Sp 

  
 

    

    
    spark運算元詳解
       
  
  
 combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner)

定義：



def combineByKey[C](
      createCombiner: V => C,
      mergeVal 

  
 

    

    
    Spark入門詳解（一）-Spark簡介
       
  
  
 個人部落格原文連結 
 簡介 
 Spark是基於記憶體計算的大資料分散式計算框架。Spark基於記憶體計算，提供可互動查詢方式，提供近實時處理方式，同時保證了高容錯性和高可伸縮性，允許使用者將Spark部署在大量廉價硬體之上，形成叢集。 
 Spark使用Scala語言進行實現，它是一種面 

  
 

    

    
    Spark中的Spark Shuffle詳解(多看幾遍)
       
 
 Shuffle簡介 
 Shuffle描述著資料從map task輸出到reduce task輸入的這段過程。shuffle是連線Map和Reduce之間的橋樑，Map的輸出要用到Reduce中必須經過shuffle這個環節，shuffle的效能高低直接影響了整個程式的效能和吞吐量。因為在分散式情況 

  
 

    

    
    sparkCore-RDD詳解
      1.1 什麼是RDD
1.1.1 產生背景
當初設計RDD主要是為了解決三個問題：

Fast: Spark之前的Hadoop用的是MapReduce的程式設計模型，沒有很好的利用分散式記憶體系統，中間結果都需要儲存到external disk，執行效率很低。RDD模型是in-memory computing 

  
 

    

    
    Spark SQL詳解
       
 
 轉自：https://mp.weixin.qq.com/s/SGhYBxGd5qCVfeM70DRFTw 
 發家史 
 熟悉spark sql的都知道，spark sql是從shark發展而來。Shark為了實現Hive相容，在HQL方面重用了Hive中HQL的解析、邏輯執行計劃翻譯、執行計劃優化 

  
 

    

    
    Spark2.0學習（二）--------RDD詳解
      新增針對scala檔案的編譯外掛 
------------------------------ 
<?xml version="1.0" encoding="UTF-8"?> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns: 

  
 

    

    
    spark運算元詳解------Action運算元介紹
       
 
 本文首發自個人部落格：https://blog.smile13.com/articles/2018/11/30/1543589289882.html 
 一、無輸出的運算元 
 1.foreach運算元 
  
  功能：對 RDD 中的每個元素都應用 f 函式操作，無返回值。
 
  原始碼：

 

  
 

    

    
    spark運算元詳解------Transformation運算元介紹
       
 
 本文首發自個人部落格：https://blog.smile13.com/articles/2018/12/02/1543738193387.html 
 一、Value資料型別的Transformation運算元　 
 1.輸入分割槽與輸出分割槽一對一型別的運算元 
 1.1.map運算元 
  
 

  
 

    

    
    spark運算元詳解------spark運算元分類
       
 
 本文首發自個人部落格：https://blog.smile13.com/articles/2018/12/02/1543738098914.html 
 1.spark運算元分類 
 1.1Transformation運算元 
  
  Transformation運算元不觸發提交作業,完成作業中間 

  
 

    

    
    Spark API 詳解/大白話解釋 之 map、mapPartitions、mapValues、mapWith、flatMap、flatMapWith、flatMapValues
      
							
							
							map(function) 
map是對RDD中的每個元素都執行一個指定的函式來產生一個新的RDD。任何原RDD中的元素在新RDD中都有且只有一個元素與之對應。

舉例：



val a = sc.parallelize(1 to 9, 3)
val b =