Hive log processing: counting site PV and UV, with a Python data-cleaning example
阿新 • Published: 2018-04-12
Tags: big data, hadoop, hive, data cleaning
- Part 1: cleaning logs with Hive and counting PV/UV traffic
- Part 2: cleaning Hive data with Python
Part 1: log processing
Count the site's traffic for each hour of the day.
1.1 Create the table structure in Hive:
The raw log lines cannot be loaded into an ordinary delimited table, so the source table is declared with a RegexSerDe; each parenthesized capture group in input.regex maps, in order, to one table column, so the pattern carries exactly as many groups (eleven) as the table has columns:
create table db_bflog.bf_log_src (
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;
1.2 Load the data into the Hive table:
load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src;
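If the regex matched, the fields come back parsed; for any line that does not match, RegexSerDe returns NULL in every column. A quick sanity check (a suggested query, not in the original):
select remote_addr, time_local, request from db_bflog.bf_log_src limit 5;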
1.3 Custom UDFs
1.3.1 A UDF that strips the surrounding quotes
package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Removes the double quotes wrapped around a log field.
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class RemoveQuotesUDF extends UDF {

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }
        // validate: skip empty or blank fields
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        // strip all double quotes from the field
        return new Text(str.toString().replaceAll("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF()
                .evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}
1.3.2 A UDF that converts the time format
package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Converts the access-log timestamp into a standard datetime string.
 * New UDF classes need to inherit from this UDF class.
 *
 * @author zhangyy
 */
public class DateTransformUDF extends UDF {

    // "yyyy" parses the four-digit year explicitly (the original "yy" relied on lenient parsing)
    private final SimpleDateFormat inputFormat =
            new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat =
            new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /*
     * 1. Implement one or more methods named "evaluate" which will be called by Hive.
     * 2. "evaluate" should never be a void method. However it can return "null" if needed.
     */
    /**
     * input:  31/Aug/2015:00:04:37 +0800
     * output: 2015-08-31 00:04:37
     */
    public Text evaluate(Text str) {
        Text output = new Text();
        if (null == str) {
            return null;
        }
        // validate: skip empty or blank fields
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        try {
            // 1) parse (the trailing " +0800" zone is ignored by lenient parsing)
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform to the output format
            String outputDate = outputFormat.format(parseDate);
            // 3) set the result
            output.set(outputDate);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF()
                .evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
Export RemoveQuotesUDF and DateTransformUDF as jar files and place them in the /home/hadoop/jars directory.
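If you are not exporting the jars from an IDE, a command-line build sketch looks roughly like this; the locations and versions of the hive-exec and commons-lang jars are assumptions and depend on your installation:
mkdir -p classes
javac -cp "$(hadoop classpath):/opt/hive/lib/hive-exec.jar:/opt/hive/lib/commons-lang-2.6.jar" -d classes RemoveQuotesUDF.java DateTransformUDF.java
jar cvf RemoveQuotesUDF.jar -C classes org/apache/hadoop/udf/RemoveQuotesUDF.class
jar cvf DateTransformUDF.jar -C classes org/apache/hadoop/udf/DateTransformUDF.class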
1.4 Register the UDFs in Hive
Register RemoveQuotesUDF as a temporary function:
add jar /home/hadoop/jars/RemoveQuotesUDF.jar;
create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF";
Register DateTransformUDF as a temporary function:
add jar /home/hadoop/jars/DateTransformUDF.jar;
create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF";
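With both functions registered, they can be smoke-tested against the raw table before the cleaned table is built (a suggested sanity check, not part of the original walkthrough):
select my_removequotes(time_local), my_datetransform(my_removequotes(time_local)) from db_bflog.bf_log_src limit 3;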
1.5 Create the target table to hold the cleaned data:
create table db_bflog.bf_log_comm(
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
Extract the relevant fields from the source table into it:
insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;
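Row counts on the two tables should match, since the insert copies every row (a suggested check, not in the original):
select count(*) from db_bflog.bf_log_src;
select count(*) from db_bflog.bf_log_comm;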
Run a query that counts the PV traffic for each hour:
select t.hour,count(*) cnt
from
(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) t
group by t.hour order by cnt desc ;
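The query above counts PV (page views). The title also promises UV (unique visitors); a sketch along the same lines, assuming remote_addr is an acceptable visitor key:
select t.hour, count(distinct t.remote_addr) uv
from
(select remote_addr, substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) t
group by t.hour order by uv desc;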
Part 2: cleaning Hive data with Python
Count how many people watch movies on each day of the week, using a foreign (MovieLens) ratings dataset.
Download the test data:
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
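u.data is tab-separated with four columns: user id, movie id, rating, and a unix timestamp. A quick look at the file before loading it (optional):
head -3 ml-100k/u.data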
2.1 Create the Hive table:
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
2.2 Load the data:
LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;
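The ml-100k set contains 100,000 ratings, so a simple count verifies the load:
SELECT COUNT(*) FROM u_data;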
2.3 Create the weekday_mapper.py script:
import sys
import datetime

for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # isoweekday(): Monday = 1 ... Sunday = 7
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))
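The script can be exercised locally before Hive calls it, by piping a sample tab-separated row through it (the row layout matches u.data; the values here are illustrative):
echo -e "196\t242\t3\t881250949" | python weekday_mapper.py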
2.4 Create a temporary Hive table to receive the transformed data:
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Add the Python script to Hive:
add FILE /home/hadoop/weekday_mapper.py;
2.5 Extract and transform the data from the old table:
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
2.6 Query the result:
SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
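To show the busiest weekday first, the same query can be sorted by the count (a small variation, not in the original):
SELECT weekday, COUNT(*) cnt
FROM u_data_new
GROUP BY weekday
ORDER BY cnt DESC;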