1. 程式人生 > >Spark中常見join操作

Spark中常見join操作

spark中的連線操作

(1)join

如果熟悉sql的同學應該很熟悉join,這裡的join和sql中的inner join操作很相似,返回結果是前面一個集合和後面一個集合中匹配成功的,過濾掉關聯不上的。

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is
in this and (k, v2) is in other. Performs a hash join across the cluster.

具體實際操作如下:

    val a =sc.parallelize(Array(("1",4.0),("2",8.0),("3",9.0)))
    val b=sc.parallelize(Array(("1",2.0),("2",8.0)))

    val c=a.join(b)
    c.foreach(println)

     //列印結果出來如下:
     //(2,(8.0,8.0))
     //(1,(4.0,2.0))
     //這裡返回的結果很顯然是3匹配不到過濾掉,合併匹配到。

(2)leftOuterJoin

leftOuterJoin類似於SQL中的左外關聯left outer join,返回結果以第一個RDD為主,關聯不上的記錄為空。

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Perform a left outer join of this and other. For each element (k, v) in this, the resulting RDD will either contain all pairs (k, (v, Some(w))) for
w in other, or the pair (k, (v, None)) if no elements in other have key k. Hash-partitions the output using the existing partitioner/parallelism level.

具體實際操作如下:

    val a =sc.parallelize(Array(("1",4.0),("2",8.0),("3",9.0)))
    val b=sc.parallelize(Array(("1",2.0),("2",8.0)))

    val c=a.leftOuterJoin(b)
    c.foreach(println)

    //列印結果出來如下:
    //(2,(8.0,Some(8.0)))
    //(3,(9.0,None))
    //(1,(4.0,Some(2.0)))

(3)rightOuterJoin

rightOuterJoin類似於SQL中的有外關聯right outer join,返回結果以引數也就是第二個RDD為主,關聯不上的記錄為空

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (Option[V], W))]
Perform a right outer join of this and other. For each element (k, w) in other, the resulting RDD will either contain all pairs (k, (Some(v), w)) for v in this, or the pair (k, (None, w)) if no elements in this have key k. Hash-partitions the resulting RDD using the existing partitioner/parallelism level.

具體實際操作如下:

    val a =sc.parallelize(Array(("1",4.0),("2",8.0),("3",9.0)))
    val b=sc.parallelize(Array(("1",2.0),("2",8.0)))

    val c=a.rightOuterJoin(b)

    c.foreach(println)

    //列印結果出來如下:
    //(2,(Some(8.0),8.0))
    //(1,(Some(4.0),2.0))