Big Data Study Notes: Hive Framework Basics 1 - Architecture and Environment Deployment

阿新 • Published: 2018-12-07

1. Introduction to Hive and its background
   -》Sample access log record:
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /course/view.php?id=27 HTTP/1.1" "303" "440" - "http://www.micro.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.micro.com"

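-》Requirement: count page views (PV) per hour
   -》Data collection
       -》Upload the logs to HDFS
   -》Data cleaning
       -》Tasks
           -》Extract the needed fields
               "27.38.5.159"   "31/Aug/2015:00:04:37 +0800" /course/view.php
           -》Enrich with additional fields (usually stored in an RDBMS)
               -》User information, order information
           -》Normalize field formats
               -》Strip the double quotes from each field
               -》31/Aug/2015:00:04:37 +0800 -》 2015-08-31 00:04:37 or 20150831 00:04:37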
   -》Data analysis (MapReduce)
       -》input
       -》map
           key: hour
           value: URL
       -》shuffle
           key,{url1,url2……}
       -》reduce
           key,count(url)
       -》output
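
The map, shuffle, and reduce steps listed above can be imitated locally with a shell pipeline, which is a quick way to check the expected counts on a small sample before writing a real MapReduce job. This is only a sketch: the function names pv_map, pv_reduce, and hourly_pv are made up for illustration, and it assumes the leading fields of every record are wrapped in double quotes, as in the sample record.

```shell
#!/usr/bin/env bash
# Local imitation of the hourly-PV job: map -> shuffle (sort) -> reduce.
# The function names are illustrative; they are not part of Hadoop or Hive.

# map: emit "hour<TAB>url" for each log line read from stdin.
pv_map() {
  awk -F'"' '{
    split($6, t, ":")        # $6 = 31/Aug/2015:00:04:37 +0800, so t[2] is the hour
    split($8, req, " ")      # $8 = GET /course/view.php?id=27 HTTP/1.1
    printf "%s\t%s\n", t[2], req[2]
  }'
}

# reduce: the input is sorted by key, so equal hours are adjacent.
# uniq -c counts each run; awk reorders the result to "hour<TAB>pv".
pv_reduce() {
  cut -f1 | uniq -c | awk '{ printf "%s\t%d\n", $2, $1 }'
}

# The sort between the two stages plays the role of the shuffle.
hourly_pv() {
  pv_map | sort -k1,1 | pv_reduce
}
```

Usage, assuming a local log file named access.log (a hypothetical name): hourly_pv < access.log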


   -》The same analysis in SQL
       -》Create the table to analyze
           hpsk_log{
           ip,
           time,
           url
           ……
           }
       -》Extract the hour
           time: 20150831000437
           select ip,substring(time,9,2) hour,url from hpsk_log;
       -》Count PV per hour
           Expected result:
               hour   pv
               00       100
               01       200
           select hour ,count(url) pv from
           (select ip,substring(time,9,2) hour,url from hpsk_log) t
           group by hour;
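
The queries above assume that hpsk_log already holds cleaned rows whose time column looks like 20150831000437. A minimal sketch of such a cleaning step is shown here; clean_log is an illustrative name, the script assumes the quoted layout of the sample record, and it writes tab-separated ip, time, url rows so the result could be loaded into a table declared with fields terminated by '\t'.

```shell
#!/usr/bin/env bash
# clean_log: read raw access-log lines on stdin and print
# "ip<TAB>yyyyMMddHHmmss<TAB>url-path" for each line.
# Illustrative helper; the field positions follow the sample record.
clean_log() {
  awk -F'"' '
    BEGIN {
      # Map month abbreviations to two-digit numbers.
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
      for (i = 1; i <= n; i++) month[names[i]] = sprintf("%02d", i)
    }
    {
      ip = $2
      # $6 = 31/Aug/2015:00:04:37 +0800 -> drop the zone, then split the rest
      split($6, ts, " ")
      split(ts[1], p, "[/:]")           # p = 31, Aug, 2015, 00, 04, 37
      compact = p[3] month[p[2]] p[1] p[4] p[5] p[6]
      # $8 = GET /course/view.php?id=27 HTTP/1.1 -> keep the path only
      split($8, req, " ")
      url = req[2]
      sub(/\?.*/, "", url)
      printf "%s\t%s\t%s\n", ip, compact, url
    }
  '
}
```

With the time stored in this form, substring(time,9,2) picks characters 9 and 10, which are the hour.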

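   -》The essence of Hive analysis:
       -》Map data files to tables and analyze them with SQL
   -》Drawbacks of analyzing with raw MapReduce
       -》Rigid programming model
       -》No schema and no SQL-like query language
       -》High development cost
   -》SQL on Hadoop: frameworks that provide a SQL language for analysis on Hadoop
       -》Hive
       -》Presto: open-sourced by Facebook, in-memory processing, widely used at JD.com
       -》Impala: an in-memory SQL processing framework
       -》Spark SQL
   -》Characteristics of Hive: a data warehouse
       -》Built on top of Hadoop
       -》Processes structured data
       -》Storage relies on HDFS: the data of Hive tables is stored on HDFS
       -》SQL execution relies on MapReduce
       -》What Hive does: gives Hadoop a SQL interface; in practice it translates SQL statements into MapReduce programs
       -》In essence, Hive is a client of Hadoop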


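2. Installing and deploying Hive, and its architecture
   -》Official site: http://hive.apache.org/
   -》Commonly used versions
       -》0.13.1: provides more SQL interfaces, more stable
       -》1.2.1: provides more SQL interfaces, with performance optimizations
   -》Installation:
       -》Download and extract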
           tar -zxvf apache-hive-0.13.1-bin.tar.gz -C /opt/modules/
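       -》Edit the configuration files (under conf/; the paths below assume the extracted directory was renamed to hive-0.13.1-bin)
           -》Edit hive-env.sh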
               mv hive-env.sh.template hive-env.sh
               HADOOP_HOME=/opt/modules/hadoop-2.5.0
               export HIVE_CONF_DIR=/opt/modules/hive-0.13.1-bin/conf
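           -》Purpose of the environment variables
               -》Global access
               -》Access when integrating with other frameworks
       -》Create the data warehouse directories (make them group-writable)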
              bin/hdfs dfs -mkdir       /tmp
              bin/hdfs dfs -mkdir   -p /user/hive/warehouse
              bin/hdfs dfs -chmod g+w   /tmp
              bin/hdfs dfs -chmod g+w   /user/hive/warehouse
       -》Start the Hive client
           bin/hive

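   -》Hive architecture
       -》metastore
           -》Purpose: stores the mapping between Hive databases, tables, and data files
           -》Storage location: a database
               -》By default, the bundled Derby database: metastore_db
               -》In production it is usually switched to MySQL or Oracle
       -》client
           -》Client
           -》Driver
           -》SQL parser
           -》Query optimizer
           -》Physical plan
           -》Execution
       -》Hadoop
           -》HDFS: stores the data of Hive tables
               Default Hive storage path: /user/hive/warehouse
           -》MapReduce
               Runs the analysis: the SQL is parsed and executed as MapReduce jobs
       -》Compute engines supported by Hive
           -》MapReduce
           -》Tez
           -》Spark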

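3. Configuring MySQL to store the metastore
   -》Derby is the default store
       -》Drawback: Derby stores its data in local files, and only one database instance can be started at a time, so only one Hive session can use it
       -》Weak security
   -》Switching to MySQL
       -》Install MySQL
           -》Check whether MySQL is already installed
               sudo rpm -aq |grep mysql
           -》Install
               sudo yum install -y mysql-server
           -》Start the MySQL service
               sudo service mysqld start
           -》Enable start on boot
               sudo chkconfig mysqld on
           -》Set the administrator password
               mysqladmin -u root password '123456'
           -》Log in to MySQL
               mysql -u root -p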
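       -》Configure user access privileges
           -》Inspect the current accounts (in the mysql system database)
               use mysql;
               select User,Host,Password from user;
           -》Grant access
               grant all privileges on *.* to 'root'@'%' identified by '123456' with grant option;
           -》Delete the other user entries
               delete from user where host='127.0.0.1';
               delete from user where host='localhost';
               delete from user where host='bigdata-training01.hpsk.com';
           -》Reload the privileges
               flush privileges;
           -》Restart MySQL
               sudo service mysqld restart
   -》Configure Hive to store its metadata in MySQL
       -》Create hive-site.xml in the conf directory
       -》Edit it and add the following properties inside the <configuration> element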
           <property>
              <name>javax.jdo.option.ConnectionURL</name>
              <value>jdbc:mysql://bigdata-training01.hpsk.com:3306/metastore?createDatabaseIfNotExist=true</value>
              <description>JDBC connect string for a JDBC metastore</description>
           </property>
           <property>
              <name>javax.jdo.option.ConnectionDriverName</name>
              <value>com.mysql.jdbc.Driver</value>
              <description>Driver class name for a JDBC metastore</description>
           </property>
           <property>
              <name>javax.jdo.option.ConnectionUserName</name>
              <value>root</value>
              <description>username to use against metastore database</description>
           </property>
           <property>
              <name>javax.jdo.option.ConnectionPassword</name>
              <value>123456</value>
              <description>password to use against metastore database</description>
           </property>
       -》Copy the JDBC driver into Hive's lib directory
           cp /opt/tools/mysql-connector-java-5.1.27-bin.jar lib/
       -》Start Hive
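
The four metastore properties above have to sit inside a <configuration> root element for Hive to read them. A minimal sketch that writes a complete hive-site.xml follows; write_hive_site is a made-up helper name, the host, user, and password are passed in as arguments, and the values are written as-is, so a password containing XML special characters such as & or < would need escaping first.

```shell
#!/usr/bin/env bash
# write_hive_site CONF_DIR HOST USER PASSWORD
# Writes CONF_DIR/hive-site.xml with the four metastore connection properties.
# Illustrative helper, not a Hive tool; it overwrites an existing file.
write_hive_site() {
  local conf_dir="$1" host="$2" user="$3" password="$4"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/hive-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://${host}:3306/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>${user}</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>${password}</value>
  </property>
</configuration>
EOF
}
```

Usage with the values from these notes: write_hive_site /opt/modules/hive-0.13.1-bin/conf bigdata-training01.hpsk.com root 123456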

4. Basic commands and common properties
   -》Basic commands
       -》Databases
           create database if not exists student;
           show databases;
           use student;
       -》Tables
           -》Map the log file to a table
           -》The table structure must match the structure of the log file
create table if not exists stu_tmp(
number string,
name string
) row format delimited fields terminated by '\t';
           -》Load data
               load data local inpath '/opt/modules/hive-0.13.1-bin/student.txt' into table stu_tmp;
           -》List tables
               show tables;
       -》Functions
           -》show functions;
           -》desc function substring;
           -》desc function extended substring;
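
Because the table structure has to match the file, it can help to check the file before loading it: as far as I know Hive does not validate the file against the table definition at load time, and fields that do not line up tend to surface later as NULL values. The sketch below uses two made-up helpers, check_columns and make_sample_students, and invented sample rows for the two-column stu_tmp(number, name) table.

```shell
#!/usr/bin/env bash
# check_columns FILE N: succeed only if every line of FILE has exactly
# N tab-separated fields. Illustrative helper, not part of Hive.
check_columns() {
  awk -F'\t' -v n="$2" '
    NF != n { bad = 1; printf "line %d has %d fields\n", NR, NF }
    END { exit bad }
  ' "$1"
}

# make_sample_students FILE: write three made-up rows in the
# "number<TAB>name" layout expected by stu_tmp.
make_sample_students() {
  printf '%s\t%s\n' 20150001 zhangsan 20150002 lisi 20150003 wangwu > "$1"
}
```

Usage: run make_sample_students student.txt, then check_columns student.txt 2, and only run load data once the check passes.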
   -》How Hive works
       -》Queries are executed as MapReduce jobs
       -》Each Hive database and table is a directory on HDFS
   -》Common property settings
       -》Data warehouse directory
           -》Default value: /user/hive/warehouse
               <property>
                  <name>hive.metastore.warehouse.dir</name>
                  <value>/user/hive/warehouse</value>
                  <description>location of default database for the warehouse</description>
               </property>
           -》The default database is stored directly in this directory: /user/hive/warehouse
       -》Enable a custom logging configuration
           -》Rename hive-log4j.properties.template in the conf directory
               mv hive-log4j.properties.template hive-log4j.properties
           -》Change the log location
               hive.log.threshold=ALL
               hive.root.logger=INFO,DRFA
               hive.log.dir=/opt/modules/hive-0.13.1-bin/logs
               hive.log.file=hive.log
       -》Show the current database in the prompt
           <property>
              <name>hive.cli.print.current.db</name>
              <value>true</value>
              <description>Whether to include the current database in the Hive prompt.</description>
           </property>
       -》Print column headers in query output
           <property>
              <name>hive.cli.print.header</name>
              <value>true</value>
              <description>Whether to print the names of the columns in query output.</description>
           </property>
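
The hive-log4j.properties change described earlier in this list (pointing hive.log.dir at a logs directory) can also be scripted instead of edited by hand. The sketch below uses a hypothetical helper, set_property, that replaces a key if it is present and appends it otherwise; it assumes plain key=value lines without leading spaces and values without backslashes.

```shell
#!/usr/bin/env bash
# set_property FILE KEY VALUE: set KEY=VALUE in a Java-style .properties file,
# replacing an existing KEY line or appending a new one at the end.
# Illustrative helper; comments and other lines are copied unchanged.
set_property() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -F'=' -v k="$key" -v v="$value" '
    $1 == k { print k "=" v; found = 1; next }
    { print }
    END { if (!found) print k "=" v }
  ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Write back in place so the original file keeps its permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Usage, run from the conf directory: set_property hive-log4j.properties hive.log.dir /opt/modules/hive-0.13.1-bin/logs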
   -》Command-line help available when starting Hive
       -》bin/hive -help
           usage: hive
           -d,--define <key=value>          Variable subsitution to apply to hive
                                              commands. e.g. -d A=B or --define A=B
               --database <databasename>     Specify the database to use
           -e <quoted-query-string>         SQL from command line
           -f <filename>                    SQL from files
           -H,--help                        Print help information
           -h <hostname>                    connecting to Hive Server on remote host
               --hiveconf <property=value>   Use value for given property
               --hivevar <key=value>         Variable subsitution to apply to hive
                                              commands. e.g. --hivevar A=B
           -i <filename>                    Initialization SQL file
           -p <port>                        connecting to Hive Server on port number
           -S,--silent                      Silent mode in interactive shell
           -v,--verbose                     Verbose mode (echo executed SQL to the
                                              console)
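       -》Enter a specific database at startup
           bin/hive --database student
       -》Run SQL from the command line: a single statement
           bin/hive -e "show databases"
           bin/hive -e "show databases" >> /opt/datas/hive.exec.log
       -》Run a file of SQL statements from the command line: multiple statements
           bin/hive -f /opt/datas/hivetest.sql >> /opt/datas/hive.exec.log
       -》Pass variables when starting Hive (custom variables or configuration properties)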
           bin/hive --hiveconf hive.cli.print.header=false
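
The -i, -f, and --hivevar options from the help output can be combined: an init file applies session settings at startup, and a script file can refer to a variable as ${hivevar:name}. The sketch below only illustrates that combination; the helper names, the file names init.sql and hivetest.sql, and the HIVE_BIN variable are assumptions made for this example.

```shell
#!/usr/bin/env bash
# write_hive_scripts DIR: create an init file for `hive -i` and a
# parameterized script for `hive -f`. Names are illustrative.
write_hive_scripts() {
  local dir="$1"
  mkdir -p "$dir"
  # Session settings applied at startup through -i.
  cat > "$dir/init.sql" <<'EOF'
set hive.cli.print.header=true;
set hive.cli.print.current.db=true;
EOF
  # ${hivevar:db} is replaced by the value given with --hivevar db=...
  cat > "$dir/hivetest.sql" <<'EOF'
use ${hivevar:db};
show tables;
EOF
}

# run_hive_scripts DIR DB: run the script against database DB.
# Does nothing but report an error if no launcher exists at $HIVE_BIN
# (default: bin/hive, relative to the Hive install directory).
run_hive_scripts() {
  local dir="$1" db="$2" hive_bin="${HIVE_BIN:-bin/hive}"
  [ -x "$hive_bin" ] || { echo "hive not found at $hive_bin" >&2; return 1; }
  "$hive_bin" -i "$dir/init.sql" --hivevar db="$db" -f "$dir/hivetest.sql"
}
```

Usage from the Hive install directory: write_hive_scripts /opt/datas, then run_hive_scripts /opt/datas student.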
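   -》Common interactive operations
       -》Exit
           exit; or quit;
       -》The set command
           -》Show the value of a property
               set hive.cli.print.header;
               hive.cli.print.header=false
           -》Change the value of a property, effective immediately (for the current session only)
               set hive.cli.print.header=true;
       -》! runs a Linux command from the Hive shell
           !ls /;
       -》dfs runs an HDFS command from the Hive shell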
           dfs -ls /;
           dfs -ls /user/hive/warehouse;

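5. Managing databases and tables
   -》Preparation: a test table reused by the table examples below
create table if not exists tmp3_table(
number string,
name string
) row format delimited fields terminated by '\t';
load data local inpath '/opt/modules/hive-0.13.1-bin/student.txt' into table tmp3_table;
           -》Local load (load data local inpath): the local file is copied into the table's directory on HDFS
           -》HDFS load (load data inpath): the file is moved into the table's directory
   -》Databases
       -》Create
           -》Option 1
               create database if not exists tmp1;
           -》Option 2: [LOCATION hdfs_path] sets the database's directory on HDFS
               create database if not exists tmp2 location '/hive/tmp2';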
       -》Drop
           drop database tmp1;
           Drop a non-empty database:
           drop database tmp1 cascade;
           -》Dropping removes both the metadata and the HDFS directory
       -》Show details
           desc database EXTENDED tmp2;
   -》Tables
       -》Create
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
   [ROW FORMAT row_format]
   [STORED AS file_format]
[LOCATION hdfs_path]
[AS select_statement];

       -》Option 1: plain definition
       create table if not exists tmp3(
       col1 type,
       col2 type……
       )
       row format delimited fields terminated by '\t'
       stored as textfile
       location 'hdfs_path';

       -》Option 2: as (from a subquery)
       create table tmp3_as as select name from tmp3_table;
       -》Option 3: like
       CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
          LIKE existing_table_or_view_name
[LOCATION hdfs_path];
       create table tmp3_like like tmp3_table;
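       -》Difference between as and like
           -》as: puts the subquery result, both data and table structure, into the new table
           -》like: copies only the table structure
       -》Drop
           drop table tmp3_like;
           For a managed table this removes both the metadata and the HDFS storage directory
       -》Truncate
           TRUNCATE table tmp3_as;
       -》Describe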
           desc tmp3_table;
           desc extended tmp3_table;
           desc formatted tmp3_table;
   -》Create the employee and department tables
       create database emp_test;
       use emp_test;
create table emp(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp;

create table dept(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/dept.txt' overwrite into table dept;

-》Table types in Hive
   -》Managed table:
       Table Type:             MANAGED_TABLE


create table emp_m(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp_m;

drop table emp_m;

   -》External table: EXTERNAL
create EXTERNAL table emp_e(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp_e;

drop table emp_e;

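   -》Comparison
       -》Managed table:
           -》Dropping it deletes the metadata
           -》Dropping it also deletes the data files on HDFS
       -》External table:
           -》Dropping it deletes the metadata
           -》Dropping it does not delete the data files on HDFS, which keeps the data safe

-》In production
   -》Create every table as an external table; several tables can be created to analyze different business needs
   -》Point them at the same data source with location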
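
The practice above, several external tables over one copy of the data, can be sketched as a HiveQL script generated from the shell and run with bin/hive -f. The table names emp_ext_a and emp_ext_b, the location /user/hive/external/emp, and the helper name are made up for illustration; the columns are the ones used for emp in these notes.

```shell
#!/usr/bin/env bash
# write_shared_location_sql FILE: write a HiveQL script in which two
# external tables share one HDFS location. All names are illustrative.
write_shared_location_sql() {
  cat > "$1" <<'EOF'
create external table if not exists emp_ext_a(
empno int, ename string, job string, mgr int,
hiredate string, sal double, comm double, deptno int
)
row format delimited fields terminated by '\t'
location '/user/hive/external/emp';

create external table if not exists emp_ext_b(
empno int, ename string, job string, mgr int,
hiredate string, sal double, comm double, deptno int
)
row format delimited fields terminated by '\t'
location '/user/hive/external/emp';

-- Dropping one external table removes only its metadata.
drop table emp_ext_a;
-- The files under the shared location remain, so this still reads them.
select count(*) from emp_ext_b;
EOF
}
```

Usage: write_shared_location_sql /opt/datas/shared_location.sql, then bin/hive -f /opt/datas/shared_location.sql once emp.txt has been placed under the shared location.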
