
Hive (24): Example: Auto-Loading Data into Hive with a Shell Script

I. Goal

Log files must be uploaded to HDFS/Hive on a schedule before any downstream ETL can run, so loading the logs reliably at a fixed time every day is important.
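The scheduling itself is usually delegated to cron. A hypothetical crontab entry (the script path and log path below are assumptions, not from the original article) that runs the loader shortly after midnight, once yesterday's logs are complete, might look like:

```
# m  h  dom mon dow  command
30 0 * * * /bin/sh /opt/datas/hive_shell/load_to_hive_file.sh >> /tmp/load_to_hive.log 2>&1
```

Redirecting stdout and stderr to a log file makes failed nightly runs easy to diagnose.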

II. Implementation

1. Create the source table in Hive

create database load_hive;

create table load_hive.load_tb(
id              string,
url             string,
referer         string,
keyword         string,
type            string,
guid            string,
pageId          string,
moduleId        string,
linkId          string,
attachedInfo    string,
sessionId       string,
trackerU        string,
trackerType     string,
ip              string,
trackerSrc      string,
cookie          string,
orderCode       string,
trackTime       string,
endUserId       string,
firstLink       string,
sessionViewNo   string,
productId       string,
curMerchantId   string,
provinceId      string,
cityId          string,
fee             string,
edmActivity     string,
edmEmail        string,
edmJobId        string,
ieVersion       string,
platform        string,
internalKeyword string,
resultSum       string,
currentPage     string,
linkPosition    string,
buttonPosition  string
)partitioned by(`date` string,hour string)
row format delimited fields terminated by "\t";
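As a quick sanity check (an illustration, not part of the original walkthrough): the DDL above declares 36 columns, so each log line must split into exactly 36 tab-separated fields, otherwise trailing columns load as NULL.

```shell
# Build a placeholder line with 36 tab-separated fields and count them,
# mimicking how Hive splits on the "\t" delimiter declared in the DDL.
line=$(seq -s "$(printf '\t')" 1 36)
echo "$line" | awk -F'\t' '{print NF}'   # → 36
```

Running the same `awk` field count over a real log file quickly flags malformed lines before loading.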

2. Using hive -e

(1) Create a script, load_to_hive.sql (note: despite the .sql extension, this is a shell script)

#!/bin/sh
# Yesterday's date, e.g. 20180526
YESTERDAY=`date -d '-1 days' +%Y%m%d`

ACCESS_LOG_DIR=/opt/access_logs/$YESTERDAY

#HIVE_HOME=/opt/modules/class22/apache-hive-1.2.1-bin
HIVE_HOME=/opt/modules/hive-1.2.1

# Each file name starts with the date (8 chars) followed by the hour (2 chars);
# slice them out to select the target partition
for file in `ls $ACCESS_LOG_DIR`
do
    DAY=${file:0:8}
    HOUR=${file:8:2}
    echo "${DAY}${HOUR}"
    $HIVE_HOME/bin/hive -e "load data local inpath '$ACCESS_LOG_DIR/$file' into table load_hive.load_tb partition(date='${DAY}',hour='${HOUR}')"
done
$HIVE_HOME/bin/hive -e "show partitions load_hive.load_tb"
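The partition values come from bash's `${var:offset:length}` substring expansion, shown here in isolation. Note this is a bashism: with `#!/bin/sh` on systems where sh is dash, it fails, so running the script with bash is safer.

```shell
# Given the file naming convention YYYYMMDDHH.log used in this example:
file=2018052601.log
DAY=${file:0:8}     # characters 0-7  → 20180526
HOUR=${file:8:2}    # characters 8-9  → 01
echo "$DAY $HOUR"   # → 20180526 01
```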
	

(2) Create the directory

/opt/access_logs/20180526

(3) Copy the log files over and rename them

cd /opt/access_logs/20180526
cp /opt/datas/2015082818 ./
cp /opt/datas/2015082819 ./
mv 2015082818 2018052601.log
mv 2015082819 2018052602.log
cp /opt/datas/2015082819 ./
mv 2015082819 2018052603.log
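The manual copy-and-rename steps above can be sketched as a loop (the raw-name to `YYYYMMDDHH.log` mapping below is hard-coded from the example, purely for illustration); here the commands are only printed, not executed:

```shell
# Print the cp/mv staging commands for each raw-log -> dated-name pair.
for pair in 2015082818:2018052601.log 2015082819:2018052602.log 2015082819:2018052603.log
do
  src=${pair%%:*}    # raw file name, before the colon
  dst=${pair##*:}    # target name, after the colon
  echo "cp /opt/datas/$src /opt/access_logs/20180526/$dst"
done
```

Dropping the `echo` (and copying to a temp name before the final rename) would perform the staging for real.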

(4) Add the following to hive-site.xml to disable the reserved-keyword check (needed because the partition column is named `date`)


        <property>
          <name>hive.support.sql11.reserved.keywords</name>
          <value>false</value>
        </property>
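An alternative to turning the keyword check off (a sketch, not from the original article) is to leave it on and backtick-quote the reserved column name `date` wherever it appears, as the DDL already does. When the statement is embedded in a double-quoted `hive -e` string, each backtick must be escaped so the shell does not treat it as command substitution:

```shell
# The \` escapes keep the backticks literal inside double quotes.
query="select \`date\`, hour from load_hive.load_tb limit 10"
echo "$query"   # → select `date`, hour from load_hive.load_tb limit 10
```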

(5) Run the script from the Hive directory (the -x flag traces each command as it executes)

sh -x load_to_hive.sql

(6) Restart hiveserver2 and beeline

(7) Run the script again

sh -x load_to_hive.sql

Result:
partition
date=20180526/hour=01
date=20180526/hour=02
date=20180526/hour=03
Time taken: 1.994 seconds, Fetched: 3 row(s)

(8) Check the table's partitions:

show partitions load_hive.load_tb;

3. Using hive -f

Parameters can be passed with --hiveconf.

(1) Log directory

The directory /opt/access_logs/20181110 contains three log files, renamed as follows:
mv 2017120901.log 2018052601.log
mv 2017120902.log 2018052602.log
mv 2017120903.log 2018052603.log

(2) Create a file: /opt/datas/hive_shell/load.sql

vi load.sql

Add:

load data local inpath '${hiveconf:log_dir}/${hiveconf:file_path}' into table load_hive.load_tb partition (date='${hiveconf:DAY}',hour='${hiveconf:HOUR}')
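To see what `hive -f` will actually run, the substitution can be previewed outside Hive. The `sed` emulation below is for illustration only (it is not Hive's own mechanism) and uses sample values:

```shell
# The template is load.sql's statement; \$ keeps ${hiveconf:...} literal here.
template="load data local inpath '\${hiveconf:log_dir}/\${hiveconf:file_path}' into table load_hive.load_tb partition (date='\${hiveconf:DAY}',hour='\${hiveconf:HOUR}')"
# Sample values, matching what the loop in the driver script would pass.
log_dir=/opt/access_logs/20180526
file_path=2018052601.log
DAY=20180526
HOUR=01
# Substitute each ${hiveconf:...} placeholder, as hive -f --hiveconf would.
stmt=$(echo "$template" | sed -e "s|\${hiveconf:log_dir}|$log_dir|" \
                              -e "s|\${hiveconf:file_path}|$file_path|" \
                              -e "s|\${hiveconf:DAY}|$DAY|" \
                              -e "s|\${hiveconf:HOUR}|$HOUR|")
echo "$stmt"
```

Printing the expanded statement before a real run is a cheap way to catch quoting or path mistakes.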

(4) Write load_to_hive_file.sh

#!/bin/bash

# Yesterday's date
YESTERDAY=`date -d '-1 days' +%Y%m%d`

# Data directory
ACCESS_LOG_DIR=/opt/access_logs/$YESTERDAY

# HIVE_HOME
HIVE_HOME=/opt/modules/hive-1.2.1

# Loop over the files in the directory; derive the date and hour for the
# target partition from each file name
for FILE in `ls $ACCESS_LOG_DIR`
do
    DAY=${FILE:0:8}
    HOUR=${FILE:8:2}
    #echo "${DAY}${HOUR}"
    $HIVE_HOME/bin/hive --hiveconf log_dir=$ACCESS_LOG_DIR --hiveconf file_path=$FILE --hiveconf DAY=$DAY --hiveconf HOUR=$HOUR -f '/opt/datas/hive_shell/load.sql'
done
$HIVE_HOME/bin/hive -e "show partitions load_hive.load_tb"

(5) Truncate the table (truncate removes the data files but leaves the partition metadata in place, which is why the partitions are dropped explicitly in the next step)

truncate table load_hive.load_tb;

(6) View the partitions and drop them all

show partitions load_hive.load_tb;

+----------------------------+--+
|         partition          |
+----------------------------+--+
| date=20180526/hour=01      |
| date=20180526/hour=02      |
| date=20180526/hour=03      |
+----------------------------+--+

Drop the partitions:

alter table load_hive.load_tb drop partition(date='20180526',hour='01');
alter table load_hive.load_tb drop partition(date='20180526',hour='02');
alter table load_hive.load_tb drop partition(date='20180526',hour='03');
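When the hours are contiguous, the three statements above can be generated in a loop (a sketch; the statements are only printed here — piping the output into `hive` would execute them):

```shell
# Generate one DROP PARTITION statement per hour.
stmts=$(for H in 01 02 03; do
  echo "alter table load_hive.load_tb drop partition(date='20180526',hour='${H}');"
done)
echo "$stmts"
```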

(7) Load the data

sh -x load_to_hive_file.sh 

(8) Result (printed by the final show partitions in the script)

date=20180526/hour=01
date=20180526/hour=02
date=20180526/hour=03