
Hive log processing: counting a site's PV and UV, plus a data-cleaning example combining Hive with Python

big data · hadoop · hive · data cleaning

  • Part 1: Clean the access log with Hive and count PV/UV
  • Part 2: Clean Hive data with a Python script

Part 1: Log processing

Goal: count the site's traffic (PV) for each hour of the day.

1.1 Create the table structure in Hive:

A plain delimited table cannot split these raw log lines, so the table is declared with a RegexSerDe; each capture group in input.regex maps, in order, to one of the eleven columns below:
create table db_bflog.bf_log_src (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;


1.2 Load the data into the Hive table:

load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src ;
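To spot-check that the regex splits the lines as intended, a quick verification query helps (a hypothetical check, not in the original post):

select remote_addr, time_local, request from db_bflog.bf_log_src limit 3 ;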


1.3 Custom UDFs

1.3.1: A UDF that strips the surrounding double quotes

package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 * 
 * @author zhangyy
 *
 */
public class RemoveQuotesUDF extends UDF {

    /*
    1. Implement one or more methods named "evaluate" which will be called by Hive.
    2."evaluate" should never be a void method. However it can return "null" if needed.
    */
    public Text evaluate(Text str){
        if(null == str){
            return null;
        }

        // validate 
        if(StringUtils.isBlank(str.toString())){
            return null ;
        }

        // strip all double quotes from the value
        return new Text(str.toString().replaceAll("\"", ""));
    }

    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}

1.3.2: A UDF that converts the log's time format

package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * New UDF classes need to inherit from this UDF class.
 * 
 * @author zhangyy
 *
 */
public class DateTransformUDF extends UDF {

    // parses timestamps like "31/Aug/2015:00:04:37 +0800"; the trailing timezone is ignored by parse()
    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /*
    1. Implement one or more methods named "evaluate" which will be called by Hive.
    2."evaluate" should never be a void method. However it can return "null" if needed.
    */
    /**
     * input:
     *      31/Aug/2015:00:04:37 +0800
     * output:
     *      2015-08-31 00:04:37
     */
    public Text evaluate(Text str){
        Text output = new Text() ;

        if(null == str){
            return null;
        }

        // validate 
        if(StringUtils.isBlank(str.toString())){
            return null ;
        }

        try{
            // 1) parse 
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform
            String outputDate = outputFormat.format(parseDate) ;
            // 3) set
            output.set(outputDate);
        }catch(Exception e){
            e.printStackTrace();
        }

        // return the reformatted date (empty if parsing failed)
        return output;
    }

    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
Export RemoveQuotesUDF and DateTransformUDF as jar files and place them under /home/hadoop/jars:


1.4 Register the UDFs in Hive

  Register RemoveQuotesUDF as a temporary function:

  add jar /home/hadoop/jars/RemoveQuotesUDF.jar ;

  create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF" ;

  Register DateTransformUDF as a temporary function:

  add jar /home/hadoop/jars/DateTransformUDF.jar ;

  create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF" ;
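
  A quick smoke test of the two functions chained together (a minimal sketch; it assumes your Hive version allows SELECT without a FROM clause):

  select my_datetransform(my_removequotes('"31/Aug/2015:00:04:37 +0800"')) ;
  -- expected: 2015-08-31 00:04:37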

1.5 Create the target table:

create table db_bflog.bf_log_comm(
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
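
ORC is a columnar format, so queries that touch only some of the four columns read less data; SNAPPY trades compression ratio for fast decompression, which suits repeated ad-hoc queries.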


Extract the relevant columns from the source table:

insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;
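
A row count on both tables confirms the copy; note that lines the regex failed to match still load, only with NULL columns (a hypothetical check, not in the original post):

select count(*) from db_bflog.bf_log_src ;
select count(*) from db_bflog.bf_log_comm ;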


Run the SQL that counts PV per hour:

select t.hour,count(*) cnt
from
(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) t
group by t.hour order by cnt desc ;
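
The query above counts every request (PV). For UV, a common approximation is to count distinct client addresses per hour; a sketch along the same lines (assuming remote_addr is an acceptable stand-in for a visitor, which proxies and NAT can distort):

select t.hour, count(distinct t.ip) uv
from
(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour,
my_removequotes(remote_addr) ip
from bf_log_comm) t
group by t.hour order by uv desc ;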


Part 2: Cleaning Hive data with Python

  Goal: count how many films are watched on each day of the week, using the MovieLens ratings dataset.
  Test data download:

 wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
 unzip ml-100k.zip
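
Each line of u.data holds four tab-separated fields: user id, movie id, rating, and a unix timestamp, which is exactly what the table below declares.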

2.1 Create the Hive table

 CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;


2.2 Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;


2.3 Create the weekday_mapper.py script

import sys
import datetime

# read tab-separated rows from stdin and replace the unix timestamp
# with the ISO weekday (1 = Monday ... 7 = Sunday)
for line in sys.stdin:
  line = line.strip()
  userid, movieid, rating, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print('\t'.join([userid, movieid, rating, str(weekday)]))
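
The script follows the contract Hive's TRANSFORM expects: tab-separated records on stdin, tab-separated records on stdout, one row per line.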

2.4 Create a staging Hive table for the transformed data:

 CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Add the Python script to Hive:

add FILE /home/hadoop/weekday_mapper.py;


2.5 Extract and transform the data from the old table

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;
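
TRANSFORM serializes the selected columns to tab-separated text, pipes it through the script's stdin, and parses its stdout back into the AS (...) columns; the earlier add FILE step is what ships weekday_mapper.py to the task nodes.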


2.6 Query the result:

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
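
Since weekday holds the ISO number, a more readable variant could map it to names first (a sketch using a derived column, not part of the original post):

SELECT d.day_of_week, COUNT(*) cnt
FROM (
  SELECT CASE weekday
    WHEN 1 THEN 'Mon' WHEN 2 THEN 'Tue' WHEN 3 THEN 'Wed' WHEN 4 THEN 'Thu'
    WHEN 5 THEN 'Fri' WHEN 6 THEN 'Sat' ELSE 'Sun' END day_of_week
  FROM u_data_new
) d
GROUP BY d.day_of_week;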

