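Hive: Basic Introduction, Installation, Configuration, and Usage (Part 1)

What is Hive?

Open-sourced by Facebook to handle statistics over massive volumes of structured log data;

A data warehouse tool built on Hadoop. It stores data in HDFS, maps structured data files onto tables, and offers SQL-like queries; underneath, it uses MapReduce (MR) for computation;

In essence, it translates HQL into MR programs.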
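
To make the HQL-to-MR idea concrete, here is a minimal sketch in plain Java (no Hadoop or Hive classes) of what a query such as select ip, count(*) ... group by ip boils down to: a map step that emits the grouping key and a reduce step that counts rows per key. The class name, method names, and sample rows are hypothetical and only illustrate the idea; they are not the code Hive generates.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByAsMapReduce {
    // Map phase: read one tab-separated row and emit its grouping key (column 0, the ip).
    public static String mapKey(String row) {
        return row.split("\t")[0];
    }

    // Shuffle and reduce phases, collapsed into one loop:
    // rows with the same key end up in the same group and are counted.
    public static Map<String, Long> countByKey(List<String> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (String row : rows) {
            counts.merge(mapKey(row), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical (ip, user, requrl) rows.
        List<String> rows = Arrays.asList(
            "10.0.0.1\talice\t/index",
            "10.0.0.2\tbob\t/login",
            "10.0.0.1\talice\t/cart");
        System.out.println(countByKey(rows)); // {10.0.0.1=2, 10.0.0.2=1}
    }
}
```

In a real job the map and reduce steps run on different nodes and the framework performs the grouping between them; the sketch only mirrors the data flow.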

Hive architecture diagram

Prerequisites

Java 1.7 (preferred)
Hadoop 2.x (preferred), 1.x (not supported by Hive 2.0.0 onward).

A simple Hive installation (using version 0.13.1 as an example)

# 1. Download and extract the package

wget http://archive.apache.org/dist/hive/hive-0.13.1/apache-hive-0.13.1-bin.tar.gz
tar -zxvf apache-hive-0.13.1-bin.tar.gz -C /opt/modules

# 2. Configuration files
# 2.1. conf/hive-env.sh
cp hive-env.sh.template  hive-env.sh
vim hive-env.sh
    HADOOP_HOME=/opt/modules/hadoop-2.5.0
    export HIVE_CONF_DIR=/opt/modules/apache-hive-0.13.1-bin/conf

# 3. Create the default storage paths on HDFS (/tmp and /user/hive/warehouse) and grant group write permission (chmod g+w)
# As hive-default.xml.template shows, the warehouse path can be changed through the hive.metastore.warehouse.dir parameter
$HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir -p    /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

# 4. Start the Hive CLI client and run an MR test job

bin/hive
> show databases;
> use default;
> show tables;
> create table test_log(ip string, user string, requrl string);
> desc test_log;
> select count(*) from test_log;

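Storing Hive metadata in MySQL

By default, Hive stores its metadata in an embedded Derby database, which allows only one client at a time. If a second client is started (bin/hive) while the first is still running, it fails with the following error: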

Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

Caused by: javax.jdo.JDOFatalDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@404b9385, see the next exception for details.

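In production, MySQL is usually recommended for storing Hive metadata. Hive is integrated with MySQL as follows:

# 1. Copy the MySQL driver jar into Hive's lib directory
cp mysql-connector-java-5.1.38.jar $HIVE_HOME/lib/

# 2. Create a conf/hive-site.xml file to configure Hive's environment parameters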
cp hive-default.xml.template hive-site.xml
vim hive-site.xml
---------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <!--hive metastore with mysql-->
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://dbserver:3306/hivemetastore?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>23wesdxc</value>
        <description>password to use against metastore database</description>
    </property>

    <!--cli header message-->
    <property>
        <name>hive.cli.print.header</name>
        <value>true</value>
        <description>Whether to print the names of the columns in query output.</description>
    </property>
    <property>
        <name>hive.cli.print.current.db</name>
        <value>true</value>
        <description>Whether to include the current database in the Hive prompt.</description>
    </property>
</configuration>

# 3. After the Hive client runs, a database holding the metastore data (e.g. hivemetastore) is created in MySQL
bin/hive


# [Note 1]
When MySQL holds the metadata, run the following at the MySQL command line:
alter database hivemetastore character set latin1;
Otherwise the "max key length is 767 bytes" exception is raised:
com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes
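
The arithmetic behind this note can be sketched as follows. The 767-byte limit comes from the error message itself; MySQL's utf8 needs up to 3 bytes per character and latin1 needs 1. The 256-character indexed column is an assumption for illustration, not a value taken from the metastore schema.

```java
public class IndexKeyLength {
    // Limit quoted in the MySQL error message.
    public static final int MAX_KEY_BYTES = 767;

    // Worst-case number of bytes an index over a VARCHAR(chars) column needs.
    public static int keyBytes(int chars, int maxBytesPerChar) {
        return chars * maxBytesPerChar;
    }

    // True when the worst-case key size stays within the limit.
    public static boolean fits(int chars, int maxBytesPerChar) {
        return keyBytes(chars, maxBytesPerChar) <= MAX_KEY_BYTES;
    }

    public static void main(String[] args) {
        int chars = 256; // assumed width of an indexed column, for illustration only
        System.out.println("utf8:   " + keyBytes(chars, 3) + " bytes, fits=" + fits(chars, 3));
        System.out.println("latin1: " + keyBytes(chars, 1) + " bytes, fits=" + fits(chars, 1));
    }
}
```

With these assumed numbers, the utf8 key needs 768 bytes and exceeds the limit by one byte, while the latin1 key needs 256 bytes.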

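# [Note 2]
In Hive, commands such as show tables and create run normally, but dropping a table (drop table x) hangs.
In MySQL, running
show variables like 'char%'
shows settings that look correct.
The cause turned out to be the order of operations: after the Hive database was created, character_set_database was not changed from utf8 to latin1 right away. A table was created in Hive first, and only afterwards was character_set_database changed from utf8 to latin1.

Fix:
In MySQL, drop the hive database,
re-create the hive database,
change the encoding: alter database hive character set latin1
enter the Hive shell,
create a table, then drop it: both work normally!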

Simple Hive logging configuration

# 1. Create conf/hive-log4j.properties from the template (logging parameters)
cp hive-log4j.properties.template hive-log4j.properties

# 2.1 Use the logging configuration in hive-log4j.properties
bin/hive
Logging initialized using configuration in file:/opt/modules/apache-hive-0.13.1-bin/conf/hive-log4j.properties

# 2.2 Use custom logging parameters
> bin/hive -help
usage: hive
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -h <hostname>                    connecting to Hive Server on remote host
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

> bin/hive --hiveconf hive.root.logger=INFO,console

# 2.3 Log file location
cat hive-log4j.properties
...
hive.log.dir=${java.io.tmpdir}/${user.name}  
hive.log.file=hive.log # i.e. the file /tmp/ubuntu/hive.log
...

bin/hive
> set;
> set system:user.name; # view the value in the current session
> set system:user.name=ubuntu; # set the value in the current session
...
system:java.io.tmpdir=/tmp
system:user.name=ubuntu
...

Common Hive CLI interactive commands

# Hive CLI help
bin/hive -help


# Commands for inspecting Hive functions:
show functions;
desc function upper;
desc function extended upper;


# Run HDFS commands from the Hive CLI:
hive (default)> dfs -rm -R /user/ubuntu/xxx.log;


# Run local Linux commands from the Hive CLI:
hive (default)> !ls /opt/datas;


# Five ways to run Hive SQL:
# Method 1:
bin/hive -e "select * from db_hive.student";
# Method 2:
bin/hive -f xxx/xxx.sql
bin/hive -f xxx/xxx.sql > xxx/result.txt
# Method 3: (initialization file, typically used to register UDFs)
bin/hive -i <filename> 
# Method 4:
hive (default)> source xxx/xxx.sql;
# Method 5:
hive (default)> show databases;
hive (default)> create database db_hive;
hive (default)> use default;
hive (default)> show tables;
Hive CLI DDL/DML

Database operations:

-- List databases (the default database is named default)
show databases;
show databases like "db*";
desc database db_hive_01;

-- Create and use a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
          [COMMENT database_comment]
          [LOCATION hdfs_path]
          [WITH DBPROPERTIES (property_name=property_value, ...)];

create database db_hive_01;
create database if not exists db_hive_01;
create database if not exists db_hive_01 comment 'Hive database 01';
create database if not exists db_hive_01 comment 'Hive database 01' location '/user/ubuntu/hive/warehouse/db_hive_01.db';
use db_hive_01;


-- Alter a database
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)


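-- Drop a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
drop database if exists db_hive_01 cascade; -- if the database still contains tables and data, dropping it requires the cascade keyword; the database directory and its subdirectories are deleted as well

Table operations:

-- Create a table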
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]



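-- Example: create a regular (internal) table (location points at the data files, so the table has its data as soon as it is created)
create table if not exists db_hive.student(
  id int comment 'student id', 
  name string comment 'student name'
) comment 'student table'
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/ubuntu/hive/warehouse/student'; 

-- Example: create an external table (most production scenarios use external tables; an external table is normally created with an explicit location)
create external table if not exists db_hive.student(
  id int comment 'student id', 
  name string comment 'student name'
) comment 'student table'
ROW FORMAT DELIMITED 
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE
    LOCATION '/user/ubuntu/datas/student';

-- Differences between a regular (internal) table and an external table:
1) At creation: for an internal table, the data is moved to the path that the data warehouse points to; for an external table, only the path of the data is recorded and the data itself is not moved.
2) At deletion: dropping an internal table deletes both its metadata and its data, whereas dropping an external table deletes only the metadata and leaves the data in place. External tables are therefore somewhat safer, organize data more flexibly, and make it easier to share source data.
3) When n external tables point to the same HDFS data directory, changing the data in that directory affects all n tables.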



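-- Partitioned tables
A partition corresponds to a separate directory on HDFS that holds all of that partition's data files. Partitioning in Hive simply means splitting by directory: a large data set is divided into smaller ones according to business needs. A query selects the partitions it needs through expressions in the where clause, which makes it considerably more efficient.
Notes on partitioned tables: for a regular table, uploading the data files to the corresponding HDFS directory first (hdfs put) and then creating the table has the same effect as creating the table first and then loading the data files (hive load). For a partitioned table, if the data files are uploaded to the corresponding HDFS directory (hdfs put) and the table is created afterwards, a select on the table does not return the uploaded data. Besides the information it keeps for a regular table, the MySQL metastore also keeps partition information for a partitioned table (it can be inspected in the MySQL metastore with select * from PARTITIONS), so the table has to be repaired on the Hive side to add the new partitions' metadata to the metastore.
Repair method 1: (one-off repair; msck: metastore check)
msck repair table db_hive_01.dept_partition;
Repair method 2: (commonly used in production)
alter table db_hive_01.dept_partition add partition(month='20170204',day='01');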
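
Since a partition is just a directory, the HDFS path of a partition can be derived from the table location and the partition spec: each partition column contributes one key=value directory level. The following is a minimal Java sketch of that layout; the class name, method name, and table location are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class PartitionPath {
    // Build the directory of one partition:
    // <tableLocation>/<key1>=<value1>/<key2>=<value2>/...
    public static String pathFor(String tableLocation, Map<String, String> partitionSpec) {
        StringJoiner path = new StringJoiner("/");
        path.add(tableLocation);
        for (Map.Entry<String, String> column : partitionSpec.entrySet()) {
            path.add(column.getKey() + "=" + column.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the partition columns in declaration order.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("month", "201708");
        spec.put("day", "01");
        // Prints /user/ubuntu/datas/dept/month=201708/day=01
        System.out.println(pathFor("/user/ubuntu/datas/dept", spec));
    }
}
```

Files uploaded by hand into such a directory stay invisible to queries until the partition is registered in the metastore, which is what the two repair methods do.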



-- Create a partitioned table
-- Create the partitioned table first, then load data into it
create external table if not exists dept(deptno int, dname string, loc string)
partitioned by(month string, day string) -- two levels of partitioning
row format delimited fields terminated by '\t'
location '/user/ubuntu/datas/dept';

load data local inpath '/opt/datas/dept01.txt' into table db_hive_01.dept partition(month='201708', day='01');
load data local inpath '/opt/datas/dept02.txt' into table db_hive_01.dept partition(month='201708', day='02');     

-- The data files already exist; create the partitioned table afterwards
create external table if not exists dept(deptno int, dname string, loc string)
partitioned by(month string, day string) -- two levels of partitioning
row format delimited fields terminated by '\t'
location '/user/ubuntu/datas/dept';

alter table dept add partition(month='201708', day='01');
alter table dept add partition(month='201708', day='02');
-- or
msck repair table db_hive_01.dept;


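-- Create a table with as (CTAS, create table as select)
create table if not exists db_hive_01.student_ctas
as select id,name from db_hive_01.student_01_ext; -- this produces the internal table's data file student_ctas/000000_0, which holds the rows returned by the select; dropping the internal table deletes only the internal table's own data files


-- Create a table with like (when the data volume is very large, each day's data can go into a newly created table)
create external table if not exists db_hive_01.student_ext_03
like db_hive_01.student_ext_02 location '/user/ubuntu/datas/student02';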



-- View table information
desc db_hive.student;
desc extended db_hive.student;
desc formatted db_hive.student;


-- Query the data of one partition:
select * from db_hive_01.dept where month='201708';
-- List the partitions of a partitioned table
show partitions db_hive_01.dept_partition;     
-- Combine data from several partitions:
$vi xxx.sql
    select dname from db_hive_01.dept_partition where month='201702' and day='01'                 
    union all
    select dname from db_hive_01.dept_partition where month='201703' and day='01';
$bin/hive -f xxx.sql > xxx/result.txt;


-- Rename a table (for an internal table, the name of the data storage directory changes as well; note that the table name must not be prefixed with the database name when renaming)
alter table student rename to student01;


-- Truncate a table (only the data of internal tables can be truncated)
truncate table db_hive_01.student;


-- Aggregate queries with group by
-- Example: average salary of each department:
select e.deptno, avg(e.sal) from db_hive_01.emp e group by e.deptno;  
-- Example: highest salary for each job within each department:
select e.deptno,e.job,max(e.sal) maxsal from db_hive_01.emp e group by e.deptno,e.job;


-- Filtering with having
-- Difference between having and where:
-- where filters individual records; having filters the grouped results
-- Example: departments whose average salary is greater than 2000:
select e.deptno,avg(sal) avgsal from db_hive_01.emp e group by e.deptno having avgsal>2000; 
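
The difference between where and having comes down to when the filter runs. The plain-Java sketch below applies the where condition to each row before grouping and the having condition to each group after aggregation. The Emp class, the method name, and the sample salaries are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WhereVsHaving {
    // One employee row: department number and salary.
    public static final class Emp {
        final int deptno;
        final double sal;
        public Emp(int deptno, double sal) { this.deptno = deptno; this.sal = sal; }
    }

    // Mirrors: select deptno, avg(sal) from emp
    //          where sal > minSal group by deptno having avg(sal) > minAvg
    public static Map<Integer, Double> avgByDept(List<Emp> rows, double minSal, double minAvg) {
        Map<Integer, double[]> groups = new TreeMap<>(); // deptno -> {sum, count}
        for (Emp e : rows) {
            if (e.sal <= minSal) {
                continue; // WHERE: tested per row, before grouping
            }
            double[] acc = groups.computeIfAbsent(e.deptno, k -> new double[2]);
            acc[0] += e.sal;
            acc[1] += 1;
        }
        Map<Integer, Double> result = new TreeMap<>();
        for (Map.Entry<Integer, double[]> g : groups.entrySet()) {
            double avg = g.getValue()[0] / g.getValue()[1];
            if (avg > minAvg) { // HAVING: tested per group, after aggregation
                result.put(g.getKey(), avg);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Emp> emp = Arrays.asList(
            new Emp(10, 1500), new Emp(10, 3500), new Emp(20, 1000), new Emp(20, 1800));
        System.out.println(avgByDept(emp, 0, 2000)); // {10=2500.0}
    }
}
```

Department 20 averages 1400 and is dropped by the having condition, even though every one of its rows passes the where condition.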


-- Multi-table queries with join...on:
select e.empno,e.ename,d.deptno,d.dname from db_hive_01.emp e join db_hive_01.dept d on e.deptno=d.deptno;



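-- Sorting query results in Hive
-- 1. order by: sorts the whole data set globally; it runs on a single node with exactly one reducer, so use it with care!
select * from db_hive_01.emp order by empno;

-- 2. sort by: sorts the data inside each reducer (that is, within each MR partition; by default a row's partition is the hash of its key modulo the number of reducers, so set the number of reducers before running). This is not a global sort: the overall result set is not ordered
set mapreduce.job.reduces = 3; -- the default is -1, which lets Hive decide the number of reducers
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.emp sort by empno asc; -- produces mapreduce.job.reduces result files

-- 3. distribute by: sets the MR partitioning of the data; used together with sort by
-- Note: distribute by must come before sort by!
-- In an example like the one below, one reducer is typically used per department number (deptno), so that each result file holds one partition's data
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.emp e distribute by e.deptno sort by e.empno;

-- 4. cluster by
-- Use this form when the distribute by and sort by columns are the same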
insert overwrite local directory '/opt/datas/test/results/' 
select * from db_hive_01.dept d cluster by d.deptno;
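
How distribute by and sort by interact can be simulated in a few lines of plain Java: each row is routed to a reducer by hashing the distribute key modulo the number of reducers, and each reducer then sorts only its own rows. The hash-modulo routing follows the description above; the class name, method names, and sample rows are assumptions, and Hive's actual hashing is internal to the engine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DistributeSortSketch {
    // Hash partitioning: non-negative hash of the key modulo the number of reducers.
    public static int reducerFor(int deptno, int numReducers) {
        return (Integer.hashCode(deptno) & Integer.MAX_VALUE) % numReducers;
    }

    // Each row is {deptno, empno}. Returns one list of empno values per reducer,
    // each list sorted independently of the others.
    public static List<List<Integer>> distributeAndSort(int[][] rows, int numReducers) {
        List<List<Integer>> outputs = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            outputs.add(new ArrayList<>());
        }
        for (int[] row : rows) {
            outputs.get(reducerFor(row[0], numReducers)).add(row[1]); // distribute by deptno
        }
        for (List<Integer> output : outputs) {
            Collections.sort(output); // sort by empno, within one reducer only
        }
        return outputs;
    }

    public static void main(String[] args) {
        int[][] emp = { {10, 7934}, {20, 7369}, {30, 7499}, {10, 7782}, {20, 7566}, {30, 7521} };
        // Prints [[7499, 7521], [7782, 7934], [7369, 7566]]
        System.out.println(distributeAndSort(emp, 3));
    }
}
```

Each inner list is ordered, but reading the three lists one after another does not give a globally ordered result, which is why only order by guarantees a total order.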

Importing and exporting table data:

-- Nine common ways to get source data into Hive:
-- 1. Load a local file into a Hive table:
load data local inpath 'local_path' into table db_hive_01.dept;

-- 2. Load an HDFS file into a Hive table (the file is moved into the table's directory, so it disappears from its original HDFS path)
load data inpath 'dfs_path' into table db_hive_01.dept;

-- 3. Overwrite the data already in the table
load data inpath 'dfs_path' overwrite into table db_hive_01.dept;

-- 4. Load data into a partitioned table
load data [local] inpath 'filepath' [overwrite] into table db_hive_01.dept partition(month='201708', day='02',..);

-- 5. Populate the table from a select when creating it (CTAS is not supported for external tables):
create table if not exists db_hive_01.dept_ctas
as select * from db_hive_01.dept;

-- 6. Create the table, then populate it with insert:
create [external] table if not exists db_hive_01.dept_like
like db_hive_01.dept;
insert into table db_hive_01.dept_like select * from db_hive_01.dept;

-- 7. Point the table at existing data with location when creating it:
create [external] table if not exists db_hive_01.dept(
    deptno int,
    dname string,
    loc string
) 
row format delimited fields terminated by '\t'
location 'pathxxx';

-- 8. Pick up data by repairing the table:
msck repair table db_hive_01.dept_partition;
alter table db_hive_01.dept_partition add partition(month='201702', day='01'); 

-- 9. import: bring externally exported data into a Hive table:
import [external] table db_hive_01.dept_partition [partition(month='201702', day='01')] from 'xxx/db_hive_01_dept_export' [location 'import_target_path'];



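-- Four common ways to export result data from Hive:
-- 1. Write the results to a directory (on the local disk when local is given):
hive(db_hive_01)> insert overwrite [local] directory 'filepath'
                            row format delimited
                                fields terminated by '\t'
                                lines terminated by '\n'
                            select * from db_hive_01.dept;

-- 2. Redirect the query output to a file on the local machine
bin/hive -e "select * from db_hive_01.dept;" > xxx/result.txt

-- 3. Use export to copy a Hive table's data out
-- The target is a path on HDFS. The export copies the data directories and files to the target path and also creates a _metadata file there that holds the table's metadata
export table db_hive_01.dept_partition [partition(month='201702',day='01')] to 'distpath';

-- 4. Use Sqoop to move Hive data to and from traditional relational databases or HBase
(omitted)

Other Hive commands:

-- The command history of the Hive interactive shell is kept in
~/.hivehistory

-- View functions
show functions;
desc function upper;
desc function extended upper;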

Using Hive in ETL

1. Create a table and load the raw data (E: extract)
        create table keyllo_log_201703(
            content string
        );
        load xxx;

2. Preprocess the data (T: transform)
        create table xxx AS select xxx
        python 
            input >> content
            |
            |regex
            |
            output << ip,req_url,http_ref


3. Load the data into the sub-tables (L: load)
        load
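
The transform step above takes the raw content column and uses a regular expression to extract ip, req_url, and http_ref. The outline runs this step in Python; the sketch below shows the same idea in Java to stay consistent with the other listings. The log layout (an Apache-style combined access log) and the pattern are assumptions, since the actual log format is not shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Assumed layout: ip - - [time] "METHOD url PROTOCOL" status bytes "referer" ...
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"\\S+ (\\S+) [^\"]*\" \\d{3} \\S+ \"([^\"]*)\"");

    // Returns {ip, req_url, http_ref}, or null when the line does not match the assumed layout.
    public static String[] parse(String content) {
        Matcher m = LINE.matcher(content);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical sample line.
        String content = "10.0.0.1 - - [10/Mar/2017:13:55:36 +0800] "
            + "\"GET /index.html HTTP/1.1\" 200 2326 \"http://example.com/start\" \"Mozilla/5.0\"";
        // Prints the three fields separated by tabs, ready to be loaded into a sub-table.
        System.out.println(String.join("\t", parse(content)));
    }
}
```

Lines that do not match return null here so the caller can decide whether to skip or log them; a production job would need a policy for such rows.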

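User-defined functions (UDF)

UDFs let users extend the functionality of HiveQL. There are three kinds (UDF, UDAF, UDTF):

UDF: one row in, one row out
UDAF: User Defined Aggregation Function, many rows in, one row out, similar to count/min/max
UDTF: User Defined Table-Generating Function, one row in, many rows out, e.g. lateral view explode()

The following walks through a custom lowercase-conversion function:

//1. Required jars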
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>${hive.version}</version>
</dependency>


//2. Write your own UDF
package com.keyllo.hive.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
/**
 * A custom Hive function
 *   [Notes]  
 *      a. A UDF must declare a return type; it may return null, but the return type cannot be void 
 *      b. UDFs usually work with MapReduce types such as Text/LongWritable; plain Java types are not recommended
 */
public class MyLowerUDF extends UDF {
    // Extend the UDF class and implement evaluate (overloading is supported)
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        return new Text(str.toString().toLowerCase());
    }
}


//3. Package the custom function and upload it to the local file system or HDFS, then register and use it in Hive
//Method 1: (temporary function; common in production, usually run from a script)
hive(db_hive_01)> add jar /opt/datas/test/lib/hive-udf.jar;
hive(db_hive_01)> create temporary function my_lower1 as 'com.keyllo.hive.udf.MyLowerUDF';
hive(db_hive_01)> select ename,my_lower1(ename) lower_ename from emp limit 5;

//Method 2: (permanent function; the jar must be uploaded to HDFS)
hive(db_hive_01)> create function my_lower2 as 'com.keyllo.hive.udf.MyLowerUDF' using jar 'hdfs://ns1/user/ubuntu/lib/hive-udf.jar';
hive(db_hive_01)> select ename,my_lower2(ename) lower_ename from emp limit 5;

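Introduction to HiveServer2 and Beeline (less commonly used)

Reading guide

1. Which programming languages can remote clients use with Hive?

2. Why is HiveServer2 needed when HiveServer already exists?

3. What are the advantages of HiveServer2?

4. hive.server2.thrift.min.worker.threads, the minimum number of worker threads: what is its default?

5. What are the two ways to start HiveServer2?

So far, Hive has been used only through the CLI or hive -e. That approach only allows queries, updates, and similar operations through HiveQL, and it is rather clumsy and limited. Fortunately, Hive provides a thin-client implementation: through HiveServer or HiveServer2, clients can operate on the data in Hive without starting the CLI. Both allow remote clients written in various programming languages, such as Java and Python, to submit requests to Hive and fetch the results.

HiveServer and HiveServer2 are both based on Thrift, but only HiveServer is sometimes called the Thrift server. Why is HiveServer2 needed when HiveServer already exists? Because HiveServer cannot handle concurrent requests from more than one client. The limitation comes from the Thrift interface that HiveServer uses and cannot be fixed by changing HiveServer's code. HiveServer was therefore rewritten in Hive 0.11.0 as HiveServer2, which solves the problem. HiveServer2 supports multi-client concurrency and authentication, and offers better support for open API clients such as JDBC and ODBC.

Since HiveServer2 is the more capable of the two, it is the focus here, with only a brief look at how HiveServer is used. Running hive --service help prints the following.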

$ bin/hive --service help
Usage ./hive <parameters> --service serviceName <service parameters>
Service List: beeline cli help hiveserver2 hiveserver hwi jar lineage metastore metatool orcfiledump rcfilecat schemaTool version 
Parameters parsed:
  --auxpath : Auxillary jars 
  --config : Hive configuration directory
  --service : Starts specific service/component. cli is default
Parameters used:
  HADOOP_HOME or HADOOP_PREFIX : Hadoop install directory
  HIVE_OPT : Hive options
For help on a particular service:
  ./hive --service serviceName --help
Debug help:  ./hive --debug --help

$ bin/hive --service hiveserver -help
Starting Hive Thrift Server
usage: hiveserver
 -h,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --maxWorkerThreads <arg>      maximum number of worker threads,
                                  default:2147483647
    --minWorkerThreads <arg>      minimum number of worker threads,
                                  default:100
 -p <port>                        Hive Server port number, default:10000
 -v,--verbose                     Verbose mode

# Start the hiveserver service. By default it listens on port 10000 with a minimum of 100 and a maximum of 2147483647 worker threads.
$ bin/hive --service hiveserver -v
Starting Hive Thrift Server
Starting hive server on port 10000 with 100 min worker threads and 2147483647 max worker threads

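Next comes the more capable hiveserver2. HiveServer2 is configured in the hive-site.xml configuration file through the following parameters:

hive.server2.thrift.min.worker.threads: minimum number of worker threads, default 5.
hive.server2.thrift.max.worker.threads: maximum number of worker threads, default 500.
hive.server2.thrift.port: TCP listening port, default 10000.
hive.server2.thrift.bind.host: host that the TCP interface binds to, default localhost.

The environment variables HIVE_SERVER2_THRIFT_BIND_HOST and HIVE_SERVER2_THRIFT_PORT can also be set to override the host and port configured in hive-site.xml. Starting with Hive 0.13.0, HiveServer2 supports sending messages over HTTP, which is particularly useful when a proxy sits between the client and the server. The HTTP-related parameters are:

hive.server2.transport.mode: default binary (TCP); the alternative value is http.
hive.server2.thrift.http.port: HTTP listening port, default 10001.
hive.server2.thrift.http.path: name of the service endpoint, default cliservice.
hive.server2.thrift.http.min.worker.threads: minimum number of worker threads in the server pool, default 5.
hive.server2.thrift.http.max.worker.threads: maximum number of worker threads in the server pool, default 500.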
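
On the client side, the transport mode shows up in the JDBC URL. The sketch below builds the URL for both modes using the default ports and endpoint listed above. The ;transportMode=http;httpPath=... suffix is the form described in the HiveServer2 client documentation for later releases; treat it as an assumption for 0.13.x and check the documentation for the version in use. The class and method names are hypothetical.

```java
public class HiveJdbcUrl {
    // Binary (TCP) transport, the default mode.
    public static String binary(String host, int port, String db) {
        return "jdbc:hive2://" + host + ":" + port + "/" + db;
    }

    // HTTP transport: the mode and the endpoint path are appended to the URL.
    // Assumed URL form; verify against the Hive version in use.
    public static String http(String host, int port, String db, String httpPath) {
        return binary(host, port, db) + ";transportMode=http;httpPath=" + httpPath;
    }

    public static void main(String[] args) {
        // Default binary port 10000, default HTTP port 10001, default endpoint cliservice.
        System.out.println(binary("localhost", 10000, "db_hive_01"));
        System.out.println(http("localhost", 10001, "db_hive_01", "cliservice"));
    }
}
```

Keeping the URL construction in one place makes it easy to switch a client between the two modes when a proxy is introduced.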

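HiveServer2 can be started in two ways:

One is the form already shown above: bin/hive --service hiveserver2
The other is shorter: bin/hiveserver2

By default, HiveServer2 runs each query as the user who submitted it (hive.server2.enable.doAs is true). If hive.server2.enable.doAs is set to false, queries run as the user that runs the hiveserver2 process. To prevent memory leaks in non-secure mode, file system caching can be disabled by setting the following parameters to true:

fs.hdfs.impl.disable.cache: disables the HDFS file system cache, default false.
fs.file.impl.disable.cache: disables the local file system cache, default false.

Basic HiveServer2 usage:

# Start the hiveserver2 server as a background process
$ bin/hiveserver2 &


# View the Beeline help
$ bin/beeline --help

# Connect to the hiveserver2 service with Beeline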
$ bin/beeline
> !connect jdbc:hive2://localhost:10000 ubuntu pwdxxx
> ...
0: jdbc:hive2://localhost:10000> show databases;
+----------------+
| database_name  |
+----------------+
| db_hive_01     |
| default        |
+----------------+

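# Alternatively, connect to the hiveserver2 service with Beeline in a single command
$ bin/beeline -u jdbc:hive2://localhost:10000/db_hive_01 -n ubuntu -p pwdxxx

Using the HiveServer2 JDBC driver; it is typically used for quick select * queries against Hive.

package com.keyllo.hive.jdbc;
import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;
/**
 * What the hiveserver2 JDBC interface is used for:
 *      analysis results are stored in Hive result tables (small data volumes), and the front end queries them through DAO code 
 * Drawback of hiveserver2:
 *      poor concurrency; not commonly used in production
 */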
public class HiveJdbcClient {
    private static String DRIVERNAME = "org.apache.hive.jdbc.HiveDriver";

    public static void main(String[] args) throws SQLException {
        try {
            Class.forName(DRIVERNAME);
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // get connection
        Connection conn = DriverManager.
          getConnection("jdbc:hive2://hadoop01:10000/db_hive_01", "ubuntu", "xxxxxx");


        /*
        //create table
        Statement stmt = conn.createStatement();
        String tableName = "dept";
        stmt.execute("drop table if exists " + tableName);
        stmt.execute("create table " + tableName + " (deptno int, dname string, loc string)");

        // show tables
        String sql = "show tables '" + tableName + "'";
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        if (res.next()) {
            System.out.println(res.getString(1));
        }

        // describe table
        sql = "describe " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1) + "\t" + res.getString(2));
        }

        // load data into table
        // NOTE: filepath has to be local to the hive server
        // NOTE: '/opt/datas/test/datatap/db_hive_01_dept.txt' is a ctrl-A separated file with two fields per line
        String filepath = "/opt/datas/test/datatap/db_hive_01_dept.txt";
        sql = "load data local inpath '" + filepath + "' into table " + tableName;
        System.out.println("Running: " + sql);
        stmt.execute(sql);
        */

        // select * query
        Statement stmt = conn.createStatement();
        String tableName = "dept";      
        String sql = "select * from " + tableName;
        System.out.println("Running: " + sql);
        ResultSet res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(String.valueOf(res.getInt(1)) + "\t" + res.getString(2));
        }

        // regular hive query
        sql = "select count(1) from " + tableName;
        System.out.println("Running: " + sql);
        res = stmt.executeQuery(sql);
        while (res.next()) {
            System.out.println(res.getString(1));
        }
    }
}

Common data compression techniques in Hive

Common compression formats:

              
           
              
              
            