Big Data 11: Hive Execution Mechanism and Usage

阿新 • Published: 2019-01-12

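Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing a dedicated MapReduce application, which makes it well suited to statistical analysis in a data warehouse.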

How Hive executes queries

[Diagram]

Suppose I use the Hive command-line client to create a database named myhive, and then create a table emp in that database.

create database myhive;
use myhive;
create table emp(id int,name string);

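Hive then stores the metadata in a database. Hive metadata includes table names, each table's columns, partitions and their properties, table properties (such as whether the table is external), and the directory where the table's data lives.
Because Hive is built on Hadoop, databases and tables both appear as directories on HDFS, and the data itself is of course also stored on HDFS.
For the database and table above, Hive creates the directory /user/hive/warehouse/myhive.db/emp on HDFS. The table's contents can be supplied by uploading a file, such as emp.data in the diagram, into that table directory. After that you can query it with SQL. (Note: the query names the table emp, not the file emp.data, for example select * from emp, but the rows returned come from emp.data; together the two correspond to a single table in a traditional database.) The metadata is stored in an external database such as MySQL. The embedded Derby database also works, but it is not recommended because, being embedded, it cannot be shared between clients.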
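The mapping from database and table names to HDFS directories can be sketched in a few lines of Python. This is only an illustration of the naming convention described above, not Hive code: the function names are hypothetical, and the warehouse root is assumed to be the /user/hive/warehouse value used in this article.

```python
import posixpath

# Assumption: the warehouse root used in this article
# (the value of hive.metastore.warehouse.dir).
WAREHOUSE_DIR = "/user/hive/warehouse"


def database_dir(db: str) -> str:
    """HDFS directory of a database (hypothetical helper).

    The built-in 'default' database lives directly in the warehouse root;
    every other database gets a '<name>.db' folder.
    """
    if db == "default":
        return WAREHOUSE_DIR
    return posixpath.join(WAREHOUSE_DIR, db + ".db")


def table_dir(db: str, table: str) -> str:
    """HDFS directory that holds the data files of a managed table."""
    return posixpath.join(database_dir(db), table)


print(database_dir("myhive"))
print(table_dir("myhive", "emp"))
```

Any file placed in the directory returned by table_dir is what a select on that table reads.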
Next, I write a query:

select id,name from emp where id>2 order by id desc;

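So how is it executed? The query is handed to Hive, which uses its parser, optimizer and related components (shown as Compiler in the diagram) to build an execution plan from MapReduce templates. The generated plan is stored in HDFS, then picked up by a MapReduce program and submitted as a job that runs on YARN.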
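To make the translation concrete, here is a rough Python simulation of how the query above could be split into a map phase and a reduce phase. It is a sketch under stated assumptions, not the plan Hive actually generates: the function names are hypothetical, the five rows are the sample data used later in this article, and a single reducer is assumed (which matches the job log shown further down).

```python
# Sample rows (id, name), taken from the emp data used later in the article.
ROWS = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu"), (4, "furong"), (5, "fengjie")]


def map_phase(rows):
    """Apply 'where id > 2', project id and name, and emit (sort key, row) pairs."""
    for emp_id, name in rows:
        if emp_id > 2:
            yield emp_id, (emp_id, name)


def shuffle(pairs):
    """Stand-in for the shuffle: hand the pairs to the single reducer,
    sorted by key in descending order ('order by id desc')."""
    return sorted(pairs, key=lambda kv: kv[0], reverse=True)


def reduce_phase(sorted_pairs):
    """With one reducer the arrival order is already the global order,
    so the reducer only strips the keys and emits the rows."""
    return [row for _, row in sorted_pairs]


def run_query(rows):
    return reduce_phase(shuffle(map_phase(rows)))


print(run_query(ROWS))  # [(5, 'fengjie'), (4, 'furong'), (3, 'wangwu')]
```

The filter and projection need no coordination, so they sit in the map side; the global ordering is what forces a reduce step.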

The relationship between Hive and MapReduce

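Hive data storage
1. All Hive data is stored in HDFS. Hive has no dedicated storage format of its own (it supports Text, SequenceFile, ParquetFile, RCFILE and others).
2. You only need to tell Hive the column delimiter and row delimiter when creating a table, and Hive can then parse the data.
3. Hive has the following data models: DB, Table, External Table, Partition, Bucket.
db: appears in HDFS as a folder under the ${hive.metastore.warehouse.dir} directory
table: appears in HDFS as a folder under the directory of the db it belongs to
external table: similar to table, except that its data can be stored at any specified path
Ordinary (managed) table: dropping the table also deletes its files on HDFS
External table: dropping the table leaves its files on HDFS in place; only the metadata is deleted
partition: appears in HDFS as a subdirectory of the table directory
bucket: appears in HDFS as multiple files under the same table directory, produced by hashing; each row is placed into one of the files according to the hash of its bucketing column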
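The bucket rule in the last item can be sketched as follows. This is a simplified illustration, assuming an int bucketing column whose hash is the value itself and the classic (hash & Integer.MAX_VALUE) % numBuckets rule; other column types hash differently, and newer Hive versions can use a different hash function. The function names are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def bucket_for(int_value: int, num_buckets: int) -> int:
    """Bucket index for an int column value (assumed rule, see the text above).

    Masking with INT_MAX keeps the result non-negative even for negative ids.
    """
    return (int_value & INT_MAX) % num_buckets


def assign_buckets(ids, num_buckets):
    """Group ids by the bucket (that is, the file) they would land in."""
    buckets = {b: [] for b in range(num_buckets)}
    for emp_id in ids:
        buckets[bucket_for(emp_id, num_buckets)].append(emp_id)
    return buckets


print(assign_buckets([1, 2, 3, 4, 5], 2))  # {0: [2, 4], 1: [1, 3, 5]}
```

Rows with the same bucketing value always land in the same file, which is what makes bucketed sampling and bucketed joins possible.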

Theory alone can make your head spin. The basic usage below should make the points above clear.

Using Hive
You can start Hive in its interactive shell mode:

# cd apps/hive/bin
# ll
total 32
-rwxr-xr-x. 1 root root 1031 Apr 30  2015 beeline
drwxr-xr-x. 3 root root 4096 Oct 17 12:38 ext
-rwxr-xr-x. 1 root root 7844 May  8  2015 hive
-rwxr-xr-x. 1 root root 1900 Apr 30  2015 hive-config.sh
-rwxr-xr-x. 1 root root  885 Apr 30  2015 hiveserver2
-rwxr-xr-x. 1 root root  832 Apr 30  2015 metatool
-rwxr-xr-x. 1 root root  884 Apr 30  2015 schematool
# ./hive
hive>

However, that interface is not very pleasant to use. Hive can also be run as a service (the Hive Thrift service), which you can then connect to with beeline, the client bundled with Hive, as follows.

Window 1: start the service

# cd apps/hive/bin
# ll
total 32
-rwxr-xr-x. 1 root root 1031 Apr 30  2015 beeline
drwxr-xr-x. 3 root root 4096 Oct 17 12:38 ext
-rwxr-xr-x. 1 root root 7844 May  8  2015 hive
-rwxr-xr-x. 1 root root 1900 Apr 30  2015 hive-config.sh
-rwxr-xr-x. 1 root root  885 Apr 30  2015 hiveserver2
-rwxr-xr-x. 1 root root  832 Apr 30  2015 metatool
-rwxr-xr-x. 1 root root  884 Apr 30  2015 schematool
# ./hiveserver2

Window 2: connect as a client

# ./beeline
Beeline version 1.2.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000: root
Enter password for jdbc:hive2://localhost:10000: ******
Connected to: Apache Hive (version 1.2.1)
Driver: Hive JDBC (version 1.2.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

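You may hit the following error when connecting:

Error: Failed to open new session: java.lang.RuntimeException: java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=EXECUTE, inode="/tmp":hadoop3:supergroup:drwx------

The user root lacks EXECUTE permission on /tmp in HDFS. Opening up the permissions on /tmp resolves it (hadoop dfs still works on Hadoop 2.x but is deprecated in favor of hdfs dfs):

./hadoop dfs -chmod -R 777 /tmp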
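The error message already contains everything needed to see why access is denied: the inode is owned by hadoop3, belongs to group supergroup, and has mode drwx------. A minimal Python sketch of the POSIX-style check follows. It is an illustration only: the function is hypothetical, it assumes root is not a member of supergroup, and it does not model the HDFS superuser, who bypasses these checks.

```python
def can_access(mode: str, owner: str, group: str,
               user: str, user_groups: set, want: str) -> bool:
    """Check one permission ('r', 'w' or 'x') against a mode string
    such as 'drwx------' (hypothetical helper, POSIX-style rules)."""
    perms = mode[1:]              # drop the leading file-type character
    if user == owner:
        triplet = perms[0:3]      # owner bits
    elif group in user_groups:
        triplet = perms[3:6]      # group bits
    else:
        triplet = perms[6:9]      # bits for everyone else
    return want in triplet


# Before the fix: root falls into the "other" class, which has no bits set.
print(can_access("drwx------", "hadoop3", "supergroup", "root", set(), "x"))  # False
# After chmod 777: every class has rwx.
print(can_access("drwxrwxrwx", "hadoop3", "supergroup", "root", set(), "x"))  # True
```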

1. List databases

0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (1.456 seconds)

2. Create and use a database, then list tables

0: jdbc:hive2://localhost:10000> create database myhive;
No rows affected (0.576 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+----------------+--+
| database_name  |
+----------------+--+
| default        |
| myhive         |
+----------------+--+
0: jdbc:hive2://localhost:10000> use myhive;
No rows affected (0.265 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+

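3. View table contents

Every column in the result comes back as NULL, because the table was created without specifying ',' as the field delimiter, while the fields in the file are separated by commas. So drop the table, create it again with the statement below, and upload the file again.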

0: jdbc:hive2://localhost:10000> drop table emp;
No rows affected (1.122 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
+-----------+--+
0: jdbc:hive2://localhost:10000> create table emp(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited
0: jdbc:hive2://localhost:10000> fields terminated by ',';
No rows affected (0.265 seconds)
0: jdbc:hive2://localhost:10000> 

# hadoop fs -put sz.data /user/hive/warehouse/myhive.db/emp
0: jdbc:hive2://localhost:10000> select * from emp;
+---------+-----------+--+
| emp.id  | emp.name  |
+---------+-----------+--+
| 1       | zhangsan  |
| 2       | lisi      |
| 3       | wangwu    |
| 4       | furong    |
| 5       | fengjie   |
+---------+-----------+--+
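Why the first attempt returned only NULLs while this one works can be reproduced with a small Python sketch. It assumes Hive's default field delimiter for text tables, the Ctrl-A character (\x01), and mimics two behaviors: a missing field becomes NULL, and a value that cannot be parsed as int becomes NULL. The helper names are hypothetical and the emp(id int, name string) schema is hard-coded.

```python
DEFAULT_FIELD_DELIM = "\x01"  # Hive's default field delimiter (Ctrl-A)


def to_int(text):
    """Return the text as int, or None (NULL) when it is missing or not a number."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_row(line: str, delim: str):
    """Split one line into the (id int, name string) columns of emp."""
    fields = line.rstrip("\n").split(delim)
    fields += [None] * (2 - len(fields))  # missing trailing fields become NULL
    return to_int(fields[0]), fields[1]


# Table created without a delimiter: the whole line is treated as the id column.
print(parse_row("1,zhangsan", DEFAULT_FIELD_DELIM))  # (None, None)
# Table created with fields terminated by ',':
print(parse_row("1,zhangsan", ","))                  # (1, 'zhangsan')
```

With the default delimiter the whole line "1,zhangsan" is one field, which is not a valid int, and there is no second field left for name.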

6. Conditional query

0: jdbc:hive2://localhost:10000> select id,name from emp where id>2 order by id desc;
INFO  : Number of reduce tasks determined at compile time: 1
INFO  : In order to change the average load for a reducer (in bytes):
INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
INFO  : In order to limit the maximum number of reducers:
INFO  :   set hive.exec.reducers.max=<number>
INFO  : In order to set a constant number of reducers:
INFO  :   set mapreduce.job.reduces=<number>
INFO  : number of splits:1
INFO  : Submitting tokens for job: job_1508216103995_0004
INFO  : The url to track the job: http://mini1:8088/proxy/application_1508216103995_0004/
INFO  : Starting Job = job_1508216103995_0004, Tracking URL = http://mini1:8088/proxy/application_1508216103995_0004/
INFO  : Kill Command = /root/apps/hadoop-2.6.4/bin/hadoop job  -kill job_1508216103995_0004
INFO  : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO  : 2017-10-18 00:35:39,865 Stage-1 map = 0%,  reduce = 0%
INFO  : 2017-10-18 00:35:46,275 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.33 sec
INFO  : 2017-10-18 00:35:51,487 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.34 sec
INFO  : MapReduce Total cumulative CPU time: 2 seconds 340 msec
INFO  : Ended Job = job_1508216103995_0004
+-----+----------+--+
| id  |   name   |
+-----+----------+--+
| 5   | fengjie  |
| 4   | furong   |
| 3   | wangwu   |
+-----+----------+--+
3 rows selected (18.96 seconds)

This makes it clear: the SQL you write is ultimately translated into a MapReduce program that runs on YARN. In effect, Hive provides a large set of MapReduce templates.

7. Create an external table

The statement below splits fields on commas, stores the data as text files, and keeps the data under the /company directory.

0: jdbc:hive2://localhost:10000> create external table emp2(id int,name string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ','
0: jdbc:hive2://localhost:10000> stored as textfile
0: jdbc:hive2://localhost:10000> location '/company';
No rows affected (0.101 seconds)
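The practical difference between an ordinary table and an external table shows up when the table is dropped. The Python sketch below imitates that behavior with a local temporary directory standing in for HDFS and a dictionary standing in for the metastore. Everything here (TinyMetastore, demo, the paths) is hypothetical and only mirrors the rule described earlier: dropping a managed table removes its files, dropping an external table removes only the metadata.

```python
import os
import shutil
import tempfile


class TinyMetastore:
    """Toy metastore: maps a table name to (data directory, is_external)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, location, external=False):
        os.makedirs(location, exist_ok=True)
        self.tables[name] = (location, external)

    def drop_table(self, name):
        location, external = self.tables.pop(name)  # metadata is always removed
        if not external:
            shutil.rmtree(location)                 # data is removed only for managed tables


def demo():
    root = tempfile.mkdtemp()  # stands in for the HDFS root
    try:
        store = TinyMetastore()
        managed = os.path.join(root, "warehouse", "myhive.db", "emp")
        external = os.path.join(root, "company")
        store.create_table("emp", managed)
        store.create_table("emp2", external, external=True)
        for directory in (managed, external):
            with open(os.path.join(directory, "sz.data"), "w") as handle:
                handle.write("1,zhangsan\n")
        store.drop_table("emp")
        store.drop_table("emp2")
        # (managed directory still there?, external data file still there?)
        return (os.path.exists(managed),
                os.path.exists(os.path.join(external, "sz.data")))
    finally:
        shutil.rmtree(root, ignore_errors=True)


print(demo())  # (False, True)
```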

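8. Load a file into a table
Earlier, a hadoop command was used to upload the file into the table's directory, but you can also import a file directly from the Hive command line (uploading with hadoop still works too):

0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table emp2;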
INFO  : Loading data to table myhive.emp2 from file:/root/sz.data
INFO  : Table myhive.emp2 stats: [numFiles=0, totalSize=0]
No rows affected (0.414 seconds)
0: jdbc:hive2://localhost:10000> select * from emp2;
+----------+------------+--+
| emp2.id  | emp2.name  |
+----------+------------+--+
| 1        | zhangsan   |
| 2        | lisi       |
| 3        | wangwu     |
| 4        | furong     |
| 5        | fengjie    |
+----------+------------+--+
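What load data does to the source file depends on the local keyword: with local, the file is copied from the local filesystem into the table's directory; without it, the path refers to HDFS and the file is moved. The following Python sketch imitates both cases with a temporary directory. The helper names and file names are hypothetical, and ordinary local directories stand in for HDFS.

```python
import os
import shutil
import tempfile


def load_data(src: str, table_dir: str, local: bool) -> str:
    """Place src into table_dir: copy when local=True, move otherwise."""
    os.makedirs(table_dir, exist_ok=True)
    dest = os.path.join(table_dir, os.path.basename(src))
    if local:
        shutil.copy(src, dest)   # 'load data local inpath': the source file stays
    else:
        shutil.move(src, dest)   # 'load data inpath': the source file is moved away
    return dest


def demo():
    root = tempfile.mkdtemp()
    try:
        local_file = os.path.join(root, "local_sz.data")
        hdfs_file = os.path.join(root, "hdfs_sz.data")
        for path in (local_file, hdfs_file):
            with open(path, "w") as handle:
                handle.write("1,zhangsan\n")
        table_dir = os.path.join(root, "company")
        load_data(local_file, table_dir, local=True)
        load_data(hdfs_file, table_dir, local=False)
        return (os.path.exists(local_file),
                os.path.exists(hdfs_file),
                sorted(os.listdir(table_dir)))
    finally:
        shutil.rmtree(root, ignore_errors=True)


print(demo())  # (True, False, ['hdfs_sz.data', 'local_sz.data'])
```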

9. Partitioned tables: create a table partitioned by school and load data into two different partitions

0: jdbc:hive2://localhost:10000> create table stu(id int,name string)
0: jdbc:hive2://localhost:10000> partitioned by(school string)
0: jdbc:hive2://localhost:10000> row format delimited fields terminated by ',';
No rows affected (0.319 seconds)
0: jdbc:hive2://localhost:10000> show tables;
+-----------+--+
| tab_name  |
+-----------+--+
| emp       |
| emp2      |
| stu       |
| t_sz_ext  |
+-----------+--+
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz.data' into table stu partition(school='scu');
INFO  : Loading data to table myhive.stu partition (school=scu) from file:/root/sz.data
INFO  : Partition myhive.stu{school=scu} stats: [numFiles=1, numRows=0, totalSize=46, rawDataSize=0]
No rows affected (0.607 seconds)
0: jdbc:hive2://localhost:10000> select * from stu;
+---------+-----------+-------------+--+
| stu.id  | stu.name  | stu.school  |
+---------+-----------+-------------+--+
| 1       | zhangsan  | scu         |
| 2       | lisi      | scu         |
| 3       | wangwu    | scu         |
| 4       | furong    | scu         |
| 5       | fengjie   | scu         |
+---------+-----------+-------------+--+
5 rows selected (0.286 seconds)
0: jdbc:hive2://localhost:10000> load data local inpath '/root/sz2.data' into table stu partition(school='hfut');
INFO  : Loading data to table myhive.stu partition (school=hfut) from file:/root/sz2.data
INFO  : Partition myhive.stu{school=hfut} stats: [numFiles=1, numRows=0, totalSize=46, rawDataSize=0]
No rows affected (0.671 seconds)
0: jdbc:hive2://localhost:10000> select * from stu;
+---------+-----------+-------------+--+
| stu.id  | stu.name  | stu.school  |
+---------+-----------+-------------+--+
| 1       | Tom       | hfut        |
| 2       | Jack      | hfut        |
| 3       | Lucy      | hfut        |
| 4       | Kitty     | hfut        |
| 5       | Lucene    | hfut        |
| 6       | Sakura    | hfut        |
| 1       | zhangsan  | scu         |
| 2       | lisi      | scu         |
| 3       | wangwu    | scu         |
| 4       | furong    | scu         |
| 5       | fengjie   | scu         |
+---------+-----------+-------------+--+
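Note that the school column never appears inside sz.data or sz2.data. Its value comes from the partition each file was loaded into, which on HDFS is a subdirectory named column=value under the table directory. A small Python sketch of that idea follows; the function names are hypothetical, and the table path assumes the warehouse location used earlier in this article.

```python
import posixpath


def partition_dir(table_dir: str, column: str, value: str) -> str:
    """Directory of one partition: '<table dir>/<column>=<value>'."""
    return posixpath.join(table_dir, f"{column}={value}")


def read_partitioned(files_by_partition, delim=","):
    """Read every partition's lines and append the partition value as an extra column.

    files_by_partition maps a partition value (here: the school) to the
    lines of the data file stored in that partition's directory.
    """
    rows = []
    for school, lines in files_by_partition.items():
        for line in lines:
            stu_id, name = line.split(delim)
            rows.append((int(stu_id), name, school))
    return rows


stu_dir = "/user/hive/warehouse/myhive.db/stu"
print(partition_dir(stu_dir, "school", "scu"))
print(read_partitioned({"hfut": ["1,Tom"], "scu": ["1,zhangsan"]}))
```

A query that filters on school only has to read the matching subdirectory, which is the main reason to partition a table.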

Note: Hive does not follow the three normal forms, so don't bother thinking about primary keys.

10. Add a partition

0: jdbc:hive2://localhost:10000> alter table stu add partition (school='Tokyo');

For a more direct view, check the result in the web UI.
