
A look at ON vs. WHERE under Hive joins

This article was written by the cnblogs author 一寸HUI. Personal blog: https://www.cnblogs.com/zsql/

If you came to Hive after learning MySQL's logical execution order (from -> on -> join -> where -> group by -> having -> select -> distinct -> order by -> limit), you may have some wrong ideas about how ON and WHERE behave in Hive. There are blog posts about this online, but I wanted to verify it myself. This article compares ON and WHERE under inner join, left/right join and full join, with and without partition predicates. Figuring this out really helps when writing HQL: queries can be written more simply, or run better. Let's start the experiments.

Environment: the tests in this article were run on CDH 6.3.2, with Hive 2.1.1+cdh6.3.2.

1. Test tables and data

1.1 Creating the tables

create table `user`(  -- user table, partitioned
department_id int,
age int,
sex string,
name string
)
PARTITIONED BY (`date` string)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;


create table department(  -- department table
id int,
name string,
count int
)
row format delimited
fields terminated by ','
STORED AS TEXTFILE;

1.2 Data

-- /data/hive/user1.txt
1,34,male,zhangsan
1,31,female,lili
3,14,female,liushen
3,24,female,sasa
4,54,male,liubei
4,36,female,yuji
4,25,male,zhaoyun
8,18,male,zhangfei

-- /data/hive/user2.txt
3,37,male,wangwu
4,38,female,lisi
3,19,female,caocao
2,22,female,guanyu
1,51,male,wzj
6,31,female,zhenji
6,25,male,sunwukong
6,17,male,tangsz

-- /data/hive/department.txt
1,dashuju,8
2,kaifa,9
3,ui,10
4,hr,3
5,shouxiao,12
6,zongjian,3

1.3 Loading the data

load data local inpath '/data/hive/user1.txt'  into table `user` partition (`date`='2020-12-24');
load data local inpath '/data/hive/user2.txt'  into table `user` partition (`date`='2020-12-25');
load data local inpath '/data/hive/department.txt'  into table `department`;

1.4 Querying the data

SELECT * from `user`;

SELECT * from department;

 

1.5 Analyzing the tables

ANALYZE TABLE `user` partition(`date`='2020-12-24') COMPUTE STATISTICS;
ANALYZE TABLE `user` partition(`date`='2020-12-25') COMPUTE STATISTICS;
ANALYZE TABLE department COMPUTE STATISTICS;

Without the step above, the experiments below run into trouble: the statistics are not real and the effect cannot be seen. So run ANALYZE on the tables before EXPLAIN to get accurate numbers (I just fell into this pit: every EXPLAIN showed only 1 row because the statistics were wrong).
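If you want to double-check that the statistics were actually collected, Hive keeps them in the table/partition parameters. A quick way to look (standard Hive commands; the exact parameter names such as numRows, rawDataSize and totalSize may vary slightly by version) is:

DESCRIBE FORMATTED `user` partition(`date`='2020-12-24');
-- look for numRows / rawDataSize / totalSize in the "Partition Parameters" section
DESCRIBE FORMATTED department;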

2. ON and WHERE under inner join

2.1 Without partition filtering

1. First, the result of an inner join with no extra conditions

SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;

View the execution plan:

explain SELECT  * from `user` u  inner join  department d  on d.id=u.department_id;
STAGE DEPENDENCIES:
  Stage-4 is a root stage  
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: department_id is not null (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 16 Data size: 2944 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 17 Data size: 3238 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The statement above does no condition filtering at all and uses no partition:

  • First, table d (department) is scanned in full: 6 rows. A default filter id is not null is then applied to those 6 rows (an automatic optimization; if rows with a NULL join key need to be kept, they must be handled specially before the join, otherwise they are optimized away by default and data is lost; see the sketch right after these bullets). After filtering, 6 rows remain and take part in the inner join.
  • Then table u (user) is scanned in full: 16 rows. The same default filter department_id is not null is added, and 16 rows end up taking part in the inner join.
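In other words, under an inner join the optimizer implicitly filters NULL join keys on both sides, so the query above should behave the same as writing those filters out by hand (this rewrite is mine, for illustration only):

SELECT * from `user` u inner join department d on d.id=u.department_id
where u.department_id is not null and d.id is not null;  -- rows with a NULL join key can never match anyway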

 

2. Next, the result with both WHERE and ON conditions

SELECT  * from `user` u inner join  department d on d.id=u.department_id
and d.count > 9
and u.age > 20
where u.age < 30;

Now look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The result is shown above:

  • First, table d (department) is scanned: all 6 rows, then filtered with ((count > 9) and id is not null), leaving 2 rows to take part in the inner join.
  • Then table u (user) is scanned: all 16 rows, then filtered with ((age > 20) and department_id is not null and (age < 30)), leaving 1 row for the inner join (this estimate is off; there should actually be 4 rows left. Hive's explain output is a rough estimate of the execution, not an exact count).

Mini-summary: without partition filtering, an inner join scans both tables in full, automatically adds an is not null filter on the join key (on d.id=u.department_id), and then applies the ON and WHERE conditions together. Under an inner join, WHERE and ON filter at the same time.
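Since ON and WHERE filter at the same time under an inner join, it should make no difference which clause the extra predicates go in; the two forms below are expected to produce the same plan and the same result (both queries are mine, written only to illustrate the point):

-- predicates in ON
SELECT * from `user` u inner join department d
on d.id=u.department_id and d.count > 9 and u.age > 20;

-- predicates in WHERE
SELECT * from `user` u inner join department d
on d.id=u.department_id
where d.count > 9 and u.age > 20;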

2.2 With partition filtering

1. First, a query with only a partition filter

SELECT  * from `user` u inner join  department d
on d.id=u.department_id
where u.`date`='2020-12-25';

View the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: id is not null (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: id is not null (type: boolean)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (department_id is not null and (date = '2020-12-25')) (type: boolean)
            Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE # only 8 rows here, because a single partition is scanned
            Filter Operator
              predicate: department_id is not null (type: boolean)
              Statistics: Num rows: 8 Data size: 134 Basic stats: COMPLETE Column stats: NONE 
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-25' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 8 Data size: 147 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above, compared with the case without a partition predicate, the difference is that table u (user) is no longer scanned in full, which improves efficiency: each partition is stored as its own directory, so after partition pruning only the specified partition is scanned instead of the whole table. In other words, a partitioned table is pruned by partition first, the selected partition is then scanned in full, and only after that are the ON and WHERE conditions and the automatically added is not null condition applied; the rows that survive take part in the inner join.
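Because each partition is just a directory under the table's storage location, pruning simply limits which directories get read. You can list the partitions created by the load statements above with SHOW PARTITIONS (the directory paths in the comments assume the default warehouse location and are only illustrative):

SHOW PARTITIONS `user`;
-- date=2020-12-24   e.g. .../warehouse/user/date=2020-12-24
-- date=2020-12-25   e.g. .../warehouse/user/date=2020-12-25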

2. Now a query with both ordinary conditions and a partition filter

SELECT  * from `user` u inner join  department d
on d.id=u.department_id
and d.count > 9
and u.age > 20
where u.age < 30  and u.`date`='2020-12-24';

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: ((count > 9) and id is not null) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((count > 9) and id is not null) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: ((age > 20) and department_id is not null and (age < 30)) (type: boolean)
              Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above, comparing with the earlier case (no partition filter, but ON and WHERE conditions), the only difference partition filtering makes is whether the table scan reads the whole table or just the specified partition: without a partition predicate the whole table is scanned, otherwise only the given partition.

2.3 Summary of ON and WHERE under inner join

Under an inner join, if the WHERE clause contains a partition predicate, only that partition is scanned; the WHERE and ON conditions are then applied, along with the automatically added is not null filter on the join keys (on d.id=u.department_id). Note that if rows whose join key is NULL must be kept, the join key needs special handling, otherwise those rows are filtered out and data is lost. ON and WHERE filter at the same time here; there is no ordering between them.
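If user rows whose department_id is NULL must not be lost, one option (my own sketch, not taken from the plans above; the dept_* aliases are made up for the example) is to union them back in after the inner join:

-- keep the users an inner join would silently drop because department_id is NULL
SELECT u.*, d.id as dept_id, d.name as dept_name, d.count as dept_count
from `user` u inner join department d on d.id = u.department_id
UNION ALL
SELECT u.*, cast(null as int) as dept_id, cast(null as string) as dept_name, cast(null as int) as dept_count
from `user` u
where u.department_id is null;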

3. ON and WHERE under left/right join

Since left join and right join behave the same way, this article only experiments with left join.

3.1 Filtering the secondary table in ON vs. WHERE

First, the result of a statement that filters the secondary table in the ON clause (convention: u (main table) left join d (secondary table))

SELECT * from `user` u
left join department d
on d.id=u.department_id
and d.count > 9

View the result:

Then look at the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            filterExpr: (count > 9) (type: boolean)
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (count > 9) (type: boolean)
              Statistics: Num rows: 2 Data size: 19 Basic stats: COMPLETE Column stats: NONE
              HashTable Sink Operator
                keys:
                  0 department_id (type: int)
                  1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above, under a left join (convention: u (main table) left join d (secondary table)), the condition d.count > 9 on the secondary table in the ON clause does take effect: all 6 rows of d are scanned, the filter leaves 2 rows, and those take part in the left join, while the main table is scanned in full and joined. Note that under a left join neither join key (on d.id=u.department_id) gets the automatic is not null filter, so be careful about NULL join keys when doing a left join; the NULL-key case can be optimized.

 

Now the result of a statement that filters the secondary table in the WHERE clause (convention: u (main table) left join d (secondary table))

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where d.count > 9

The result is as follows (it differs from putting the condition on the secondary table in the ON clause):

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Filter Operator
                predicate: (_col10 > 9) (type: boolean)
                Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 5 Data size: 1039 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the execution plan above, tables u (user) and d (department) are both scanned in full with no filtering at all, the left join is performed over the whole tables, and only after the left join result is obtained does the WHERE condition on the secondary table, d.count > 9, filter the result.

Mini-summary: (left join) filtering the secondary table in ON vs. in WHERE gives different results. A condition in the ON clause filters that table's rows first and then performs the left join; a condition in the WHERE clause performs the left join first and then filters the joined result. A left join does not add a default is not null filter on the join keys.
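Put differently, a condition on the secondary table in ON behaves like pre-filtering that table in a subquery, while the same condition in WHERE filters after the join (and, since NULLs fail the comparison, effectively turns the left join back into an inner join on that column). A sketch of the ON form and its subquery equivalent (the rewrite is mine):

-- condition in ON: every user row is kept; only departments with count > 9 can match
SELECT * from `user` u
left join department d
on d.id = u.department_id and d.count > 9;

-- should be equivalent: pre-filter the secondary table in a subquery
SELECT * from `user` u
left join (select * from department where count > 9) d
on d.id = u.department_id;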

3.2 Filtering the main table in ON vs. WHERE

First, the result of a statement that filters the main table in the ON clause (convention: u (main table) left join d (secondary table))

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
and u.age > 20

Looking at the result above, all 16 rows (the entire user table) are still kept, but the rows with age <= 20 behave as if they did not take part in the left join: all the columns coming from d are NULL.

Now the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Map Join Operator
              condition map:
                   Left Outer Join0 to 1
              filter predicates:
                0 {(age > 20)}
                1 
              keys:
                0 department_id (type: int)
                1 id (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above: while processing d (the department table), all 6 rows are scanned and the condition age > 20 is attached to the join as a filter predicate; u (user) is then scanned in full and the whole table is left-joined. During the join, the condition age > 20 decides whether the columns coming from d are kept for a row: if a user row has age <= 20, every column coming from d is set to NULL. (This is my own interpretation and may not be entirely accurate. In short: the condition is recorded as a marker first, the left join runs over the full table, and rows that fail the condition keep the main-table data while the secondary-table data is discarded and set to NULL.) A condition on the main table in the ON clause therefore does not change the number of rows coming from the main table; the filtering happens during the left join itself. It is a bit confusing, so compare it against the result yourself.

 

Now the result of a statement that filters the main table in the WHERE clause (convention: u (main table) left join d (secondary table))

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where u.age > 20

Next, the execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: (age > 20) (type: boolean)
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 5 Data size: 920 Basic stats: COMPLETE Column stats: PARTIAL
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As shown above, it is clear that when the main table is filtered in WHERE, the main table's rows are filtered first and the left join happens afterwards: 16 rows are scanned from the main table, 5 remain after the filter, and the left join then produces the final result.

Mini-summary: (left join) filtering the main table in ON vs. in WHERE gives different results. With WHERE, the main table is filtered first and then left-joined. With ON, the condition is first marked as a filter predicate on the secondary table's side, and then, when the full table is left-joined, rows that fail the condition have the data coming from the secondary table set to NULL.
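So when the intention is to drop main-table rows, the condition belongs in WHERE (or in a subquery that filters the main table first); the two forms below are expected to match (the subquery rewrite is mine):

-- WHERE on the main table: user rows are filtered before the left join
SELECT * from `user` u
left join department d on d.id = u.department_id
where u.age > 20;

-- should be equivalent: filter the main table in a subquery first
SELECT * from (select * from `user` where age > 20) u
left join department d on d.id = u.department_id;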

3.3 left/right join with partition filtering

Consider the following statement:

SELECT  * from `user` u
left join  department d
on d.id=u.department_id
where u.age > 20 and u.`date` = '2020-12-24';

Result:

The execution plan:

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        d 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        d 
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            HashTable Sink Operator
              keys:
                0 department_id (type: int)
                1 id (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            filterExpr: ((age > 20) and (date = '2020-12-24')) (type: boolean)
            Statistics: Num rows: 8 Data size: 136 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (age > 20) (type: boolean)
              Statistics: Num rows: 2 Data size: 34 Basic stats: COMPLETE Column stats: NONE
              Map Join Operator
                condition map:
                     Left Outer Join0 to 1
                keys:
                  0 department_id (type: int)
                  1 id (type: int)
                outputColumnNames: _col0, _col1, _col2, _col3, _col8, _col9, _col10
                Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), '2020-12-24' (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
                  Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 6 Data size: 63 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the execution plan above, partition filtering prevents a full table scan: only one partition is scanned here, so there are only 8 rows.

3.4 Summary of ON and WHERE under left/right join

  1. Filtering on the partition condition prevents a full table scan and is applied before everything else.
  2. Filtering the main table in ON vs. WHERE gives different results: with WHERE, the main table is filtered first and then left-joined; with ON, the condition is only marked, and when the full table is left-joined, rows that fail it have the columns coming from the secondary table set to NULL.
  3. Filtering the secondary table in ON vs. WHERE gives different results: a condition in ON filters that table first and then left-joins; a condition in WHERE left-joins first and then filters the joined result.
  4. left/right join does not automatically add an is not null filter on the join keys, so watch out for NULL join keys when doing a left/right join; this is something that can be optimized (see the sketch after this list).
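One common optimization when the left table contains many NULL join keys: they can never match, but in an ordinary reduce-side join they all hash to the same reducer and can cause skew, so scatter them with a value that is guaranteed not to match. The sketch below is mine, written purely as an illustration of the idea:

-- spread NULL department_id values across reducers; a negative random key never
-- matches department.id, so the join result is unchanged
SELECT *
from `user` u
left join department d
on d.id = case when u.department_id is null
               then cast(-(rand() * 100000) as int)
               else u.department_id end;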

4. ON and WHERE under full join

4.1 full join with no filter conditions

Start with a full join that has no conditions at all

SELECT  * from `user` u
full join  department d
on d.id=u.department_id

View the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

As the execution plan above shows, each table is sorted ascending on its join key, no automatic optimization is applied (no NULL filter is added), and the full join runs over both whole tables.

4.2 full join with a WHERE condition

 SELECT  * from `user` u
full join  department d
on d.id=u.department_id
where u.age > 20 and  d.count > 9

The result is as follows

View the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: ((_col10 > 9) and (_col1 > 20)) (type: boolean)
            Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 1 Data size: 207 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

The execution plan shows that with a WHERE filter under a full join, both tables are scanned in full first, the full join is performed, and only after the full join result is obtained are the WHERE conditions applied.

4.3 full join with ON conditions (an open question)

SELECT  * from `user` u
full join  department d
on d.id=u.department_id
and u.age > 20 and  d.count > 9

The result above may be a bit surprising (my ability is limited; readers who understand it better are welcome to explain). My understanding is that, just like ON conditions on the main table of a left join, the filtering happens during the full join itself: all rows of both tables are kept, and only the rows for which both conditions hold come out without NULL values. (If you are not sure how ON behaves under a full join, prefer putting the condition in WHERE.)

View the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          filter predicates:
            0 {(VALUE._col0 > 20)}
            1 {(VALUE._col1 > 9)}
          keys:
            0 department_id (type: int)
            1 id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col0 (type: int), _col1 (type: int), _col2 (type: string), _col3 (type: string), _col4 (type: string), _col8 (type: int), _col9 (type: string), _col10 (type: int)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
            Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

This one is still an open question. If any reader has figured it out, please leave a comment below.

4.4 full join with partition filtering

SELECT  * from department d full join `user` u 
on d.id=u.department_id 
where u.`date`= '2020-12-24';

 

View the execution plan:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: d
            Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: id (type: int)
              sort order: +
              Map-reduce partition columns: id (type: int)
              Statistics: Num rows: 6 Data size: 58 Basic stats: COMPLETE Column stats: NONE
              value expressions: name (type: string), count (type: int)
          TableScan
            alias: u
            Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
            Reduce Output Operator
              key expressions: department_id (type: int)
              sort order: +
              Map-reduce partition columns: department_id (type: int)
              Statistics: Num rows: 16 Data size: 3214 Basic stats: COMPLETE Column stats: PARTIAL
              value expressions: age (type: int), sex (type: string), name (type: string), date (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
          keys:
            0 id (type: int)
            1 department_id (type: int)
          outputColumnNames: _col0, _col1, _col2, _col6, _col7, _col8, _col9, _col10
          Statistics: Num rows: 17 Data size: 3535 Basic stats: COMPLETE Column stats: NONE
          Filter Operator
            predicate: (_col10 = '2020-12-24') (type: boolean)
            Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: _col0 (type: int), _col1 (type: string), _col2 (type: int), _col6 (type: int), _col7 (type: int), _col8 (type: string), _col9 (type: string), '2020-12-24' (type: string)
              outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7
              Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                Statistics: Num rows: 8 Data size: 1663 Basic stats: COMPLETE Column stats: NONE
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

From the execution plan: under a full join, even with a partition predicate, the full join is performed first and the WHERE condition filters afterwards, so it is recommended to filter in a subquery first and then do the full join.
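A sketch of that rewrite (mine, following the recommendation above): push the partition predicate into a subquery so the partition is pruned before the full join. Note the semantics are not identical: the WHERE form also drops department rows that have no matching user, while the subquery form keeps them.

SELECT *
from department d
full join (select * from `user` where `date` = '2020-12-24') u
on d.id = u.department_id;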

4.5 Summary of ON and WHERE under full join

  1. How conditions in the ON clause behave here is still an open question.
  2. With WHERE, whether or not a partition predicate is used, the full join happens first and the filter runs afterwards, so it is better to filter in a subquery first and then full join.
  3. A full join does not automatically add is not null conditions on the join keys.

5. Overall summary

1. inner join

  • With a partition predicate, an inner join can be filtered by partition first, avoiding a full table scan.
  • An inner join automatically adds an is not null filter on the join keys (on d.id=u.department_id).
  • Under an inner join, the conditions after ON and after WHERE filter at the same time; there is no difference in order.

2. left/right join

  • With a partition predicate, a left/right join avoids a full table scan; the partition filter is applied before everything else.
  • Filtering the main table in ON vs. WHERE gives different results: with WHERE, the main table is filtered first and then left-joined; with ON, the condition is first marked on the secondary table's side, and during the left join over the full table the rows that fail it have the columns coming from the secondary table set to NULL.
  • Filtering the secondary table in ON vs. WHERE gives different results: a condition in ON filters that table first and then left-joins; a condition in WHERE left-joins first and then filters the joined result. So to filter the secondary table, putting the condition in ON saves writing a subquery.
  • A left/right join does not automatically add an is not null filter on the join keys, so watch out for NULL join keys; this is a place where optimization is possible.

3. full join

  • How conditions in the ON clause filter under a full join is still an open question, to be explored.
  • A WHERE condition under a full join, partition predicate or not, filters only after the full join; filtering in a subquery first and then full joining is recommended.
  • No is not null filter is automatically added on the join keys.