1. 程式人生 > >Hive初識(四)

Hive初識(四)

Hive本質上是一個數據倉庫,但不儲存資料(只儲存元資料(metadata),Hive中的元資料包括表的名字,表的列和分割槽及分割槽及其屬性,表的屬性(是否為外部表等),表的資料所在目錄等),使用者可以藉助Hive使用sql對儲存在分散式檔案系統中的大資料集進行讀寫

Hive查詢語言(HiveQL)是一種查詢語言,Hive處理在Metastore(元資料儲存)分析結構化資料。

SELECT語句用來從表中檢索的資料。WHERE子句中的工作原理類似於一個條件。它使用這個條件過濾資料,並返回給出一個有限的結果。

語法:下面給出的SELECT查詢的語法

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference 
[WHERE where_condition] 
[GROUP BY col_list] 
[HAVING having_condition] 
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]] 
[LIMIT number];

示例

舉個例子SELECT...WHERE子句。假設employee表有如下Id,Name,Salary,Designation和Dept等欄位,生成一個查詢檢索超過30000薪水的員工詳細資訊。

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  | 
+------+--------------+-------------+-------------------+--------+

下面的查詢檢索使用上述業務情景的員工詳細資訊:

SELECT * FROM employee WHERE salary>30000;

成功查詢後,能看到以下回應:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
+------+--------------+-------------+-------------------+--------+

下面介紹使用SELECT語句的ORDER BY子句。

示例:假設需要生成一個查詢用於檢索員工的詳細資訊。

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+

下面是使用上述業務情景查詢檢索員工詳細資訊:

SELECT * FROM employee ORDER BY DEPT;

成功查詢後能得到以下回應:

+------+--------------+-------------+-------------------+--------+
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
|1204  | Krian        | 40000       | Hr Admin          | HR     |
|1202  | Manisha      | 45000       | Proofreader       | PR     |
|1201  | Gopal        | 45000       | Technical manager | TP     |
|1203  | Masthanvali  | 40000       | Technical writer  | TP     |
+------+--------------+-------------+-------------------+--------+

GROUP BY子句用於分類所有記錄結果的特定集合列。它被用來查詢一組激勵。

如果用來產生一個查詢以檢索每個部門的員工數量。

+------+--------------+-------------+-------------------+--------+ 
| ID   | Name         | Salary      | Designation       | Dept   |
+------+--------------+-------------+-------------------+--------+ 
|1201  | Gopal        | 45000       | Technical manager | TP     | 
|1202  | Manisha      | 45000       | Proofreader       | PR     | 
|1203  | Masthanvali  | 40000       | Technical writer  | TP     | 
|1204  | Krian        | 45000       | Proofreader       | PR     | 
|1205  | Kranthi      | 30000       | Op Admin          | Admin  |
+------+--------------+-------------+-------------------+--------+ 

下面使用上述業務情景查詢檢索員工的詳細資訊。

SELECT Dept,count(*) FROM employee GROUP BY DEPT;

返回結果為:

+------+--------------+ 
| Dept | Count(*)     | 
+------+--------------+ 
|Admin |    1         | 
|PR    |    2         | 
|TP    |    3         | 
+------+--------------+


JOIN是子句用於通過使用共同值組合來自兩個表特定欄位。它是用來從資料庫中兩個或更多的表組合的記錄。它或多或少類似於SQL JOIN。

示例:

我們將使用下面兩個表,CUSTOMERS表

+----+----------+-----+-----------+----------+ 
| ID | NAME     | AGE | ADDRESS   | SALARY   | 
+----+----------+-----+-----------+----------+ 
| 1  | Ramesh   | 32  | Ahmedabad | 2000.00  |  
| 2  | Khilan   | 25  | Delhi     | 1500.00  |  
| 3  | kaushik  | 23  | Kota      | 2000.00  | 
| 4  | Chaitali | 25  | Mumbai    | 6500.00  | 
| 5  | Hardik   | 27  | Bhopal    | 8500.00  | 
| 6  | Komal    | 22  | MP        | 4500.00  | 
| 7  | Muffy    | 24  | Indore    | 10000.00 | 
+----+----------+-----+-----------+----------+

ORDERS表

+-----+---------------------+-------------+--------+ 
|OID  | DATE                | CUSTOMER_ID | AMOUNT | 
+-----+---------------------+-------------+--------+ 
| 102 | 2009-10-08 00:00:00 |           3 | 3000   | 
| 100 | 2009-10-08 00:00:00 |           3 | 1500   | 
| 101 | 2009-11-20 00:00:00 |           2 | 1560   | 
| 103 | 2008-05-20 00:00:00 |           4 | 2060   | 
+-----+---------------------+-------------+--------+

有不同型別的聯接給出如下:

JOIN

LEFT OUTER JOIN

RIGHT OUTER JOIN

FULL OUTER JOIN


JOIN子句用於合併和檢索來自多個表中的記錄。JOIN和SQL OUTER JOIN類似。連線條件是使用主鍵和表的外來鍵。

下面的查詢執行JOIN的CUSTOMER和ORDERS表。並檢索記錄。

hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

成功執行查詢後,能看到以下回應:

+----+----------+-----+--------+ 
| ID | NAME     | AGE | AMOUNT | 
+----+----------+-----+--------+ 
| 3  | kaushik  | 23  | 3000   | 
| 3  | kaushik  | 23  | 1500   | 
| 2  | Khilan   | 25  | 1560   | 
| 4  | Chaitali | 25  | 2060   | 
+----+----------+-----+--------+

LEFT OUTER JOIN

HiveQL LEFT OUTER JOIN返回所有行左表,即使是在正確的表中沒有匹配。這意味著,如果ON子句匹配的右表零記錄,JOIN還是返回結果行,但在右表中的每一行為NULL。

LEFT JOIN返回左表中的所有的值,加上右表,或JOIN子句沒有匹配的情況下返回NULL。

下面的查詢演示了CUSTOMER和ORDERS表之間的LEFT OUTER JOIN用法:

hive > SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROME CUSTOMERS c LEFT JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

成功執行查詢後,能看到以下回應:

+----+----------+--------+---------------------+ 
| ID | NAME     | AMOUNT | DATE                | 
+----+----------+--------+---------------------+ 
| 1  | Ramesh   | NULL   | NULL                | 
| 2  | Khilan   | 1560   | 2009-11-20 00:00:00 | 
| 3  | kaushik  | 3000   | 2009-10-08 00:00:00 | 
| 3  | kaushik  | 1500   | 2009-10-08 00:00:00 | 
| 4  | Chaitali | 2060   | 2008-05-20 00:00:00 | 
| 5  | Hardik   | NULL   | NULL                | 
| 6  | Komal    | NULL   | NULL                | 
| 7  | Muffy    | NULL   | NULL                | 
+----+----------+--------+---------------------+

RIGHT OUTER JOIN

HiveQL RIGHT OUTER JOIN返回右邊表的所有行,即使在左表中沒有匹配。如果ON子句的左表匹配零記錄,JOIN結果返回一行,但在左表中的每一行為NULL。

RIGHT JOIN返回右表中的所有值,加上左表,或者沒有匹配的情況下返回NULL。

下面的查詢演示了CUSTOMERS和ODERS表之間使用RIGHT OUTER JOIN。

hive > SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

成功執行查詢後,能看到以下回應:

+------+----------+--------+---------------------+ 
| ID   | NAME     | AMOUNT | DATE                | 
+------+----------+--------+---------------------+ 
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 | 
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 | 
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 | 
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 | 
+------+----------+--------+---------------------+


FULL OUTER JOIN 

HiveQL FULL OUTER JOIN結合了左邊,並且滿足JOIN條件合適外部表的記錄。連線表包含兩個表的所有記錄,或兩側缺少匹配結果那麼使用NULL值填補。

下面的查詢演示了CUSTOMERS和ORDERS表之間的FULL OUTER JOIN:

hive > SELCE c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL OUTER JOIN ODERS o ON (c.ID = o.CUSTOMER_ID);

成功執行查詢後,能看到以下回應:

+------+----------+--------+---------------------+ 
| ID   | NAME     | AMOUNT | DATE                | 
+------+----------+--------+---------------------+ 
| 1    | Ramesh   | NULL   | NULL                | 
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 | 
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 | 
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 | 
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 | 
| 5    | Hardik   | NULL   | NULL                | 
| 6    | Komal    | NULL   | NULL                |
| 7    | Muffy    | NULL   | NULL                |  
| 3    | kaushik  | 3000   | 2009-10-08 00:00:00 | 
| 3    | kaushik  | 1500   | 2009-10-08 00:00:00 | 
| 2    | Khilan   | 1560   | 2009-11-20 00:00:00 | 
| 4    | Chaitali | 2060   | 2008-05-20 00:00:00 | 
+------+----------+--------+---------------------+