Big Data Hive 15: A Hive Log Analysis Case Study

阿新 • Published: 2019-01-12

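1.1 Project Background

The goal of this exercise is to analyze the Tomcat access log of a technical forum website and compute a few key metrics for the forum, which its operators can use as a reference when making decisions.

PS: The system is built to obtain business-specific metrics that third-party tools cannot provide.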

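1.2 Data Overview

The forum data comes in two parts:

(1) About 56 GB of historical data, covering up to 2012-05-29. In other words, before that date all log entries were appended to a single file.

(2) From 2013-05-30 onward, one data file of roughly 150 MB is generated per day. In other words, from that date the log is no longer kept in a single file.

Figure 2 shows the record format of the log data. Each line consists of five parts: visitor IP, access time, requested resource, access status (HTTP status code), and the traffic for that request.

Figure 2: Log record format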
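Since the figure itself is not reproduced here, the five-part layout can be illustrated with a small sketch. It splits one access-log line with a regular expression, using a sample line of the kind shown later in this article. The class name AccessLogLine and the exact pattern are assumptions made for illustration; this is not the parser used in the article.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: splits one access-log line into its five fields.
class AccessLogLine {
    // Layout: ip, two ignored fields, [time], "request", status, traffic
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    /** Returns {ip, time, request, status, traffic}, or null if the line does not match. */
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String sample = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        String[] fields = split(sample);
        // Print the five fields separated by " | "
        System.out.println(String.join(" | ", fields));
    }
}
```

Returning null for a non-matching line lets a caller skip malformed records instead of failing on them.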

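II. Key Performance Indicators (KPIs)

2.1 Page Views (PV)

(1) Definition: Page views (PV) is the total number of pages viewed by all users; each page a user opens is counted once.

(2) Analysis: Total page views gauge how interested users are in the site, much as ratings do for a TV series.

Formula: count the records, i.e. take the number of requests from the log.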

2.2 Registered Users

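The forum's user registration page is member.php, and when a user clicks Register the request goes to a URL containing member.php?mod=register.

Formula: count the requests whose URL contains member.php?mod=register.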
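As a rough illustration of this formula, the sketch below counts the URLs that contain member.php?mod=register. The class name RegisterCounter and the sample URLs in main are made up for the example; only the matched substring comes from the text above.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: counts registration requests in a list of URLs.
class RegisterCounter {
    static final String REGISTER_MARK = "member.php?mod=register";

    /** Returns how many of the given URLs contain the registration marker. */
    static long countRegistrations(List<String> urls) {
        return urls.stream()
                .filter(url -> url.contains(REGISTER_MARK))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical URLs, used only to exercise the method
        List<String> urls = Arrays.asList(
                "member.php?mod=register",
                "forum.php",
                "member.php?mod=register&inajax=1");
        System.out.println("registrations = " + countRegistrations(urls));
    }
}
```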

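2.3 IP Count

(1) Definition: The number of distinct IP addresses that visit the site within one day. An IP counts once no matter how many pages it visits.

(2) Analysis: This is the most familiar concept. Regardless of how many computers or users sit behind a single IP, the number of distinct IPs is, to some extent, the most direct measure of how well a site's promotion is working.

Formula: count the distinct visitor IPs.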

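2.4 Bounce Rate

(1) Definition: The percentage of visits in which the visitor viewed only one page before leaving, i.e. single-page visits / total visits.

(2) Analysis: Bounce rate is an important measure of visitor stickiness and shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality and the more interested visitors are in the content, and the more likely they are to be effective, loyal users.

PS: This metric also measures the effect of online marketing. It shows how many visitors were drawn to a product page or the site by a campaign and then lost, the proverbial cooked duck that flew away. For example, if the site advertises on some media outlet, the bounce rate of visitors arriving from that source indicates whether the outlet was a good choice, whether the ad copy was well written, and whether the landing page offers a good user experience.

Formula: ① count the IPs that appear in exactly one record during the day; this is the bounce count. ② Bounce rate = bounce count / PV.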
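To make the formulas for PV, IP count and bounce rate concrete, here is a minimal in-memory sketch. It assumes each record is a tab-separated line of the form ip, time, url for a single day; the class name KpiSketch and the sample records are invented for illustration, and the real computation in this article is done with Hive.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: computes PV, distinct IPs and bounce count in memory.
class KpiSketch {
    /** Returns {pv, distinctIps, bounceCount} for one day's records ("ip\ttime\turl"). */
    static long[] compute(List<String> records) {
        Map<String, Integer> hitsPerIp = new HashMap<>();
        for (String record : records) {
            String ip = record.split("\t")[0];
            hitsPerIp.merge(ip, 1, Integer::sum);
        }
        long pv = records.size();            // one record = one page view
        long distinctIps = hitsPerIp.size(); // each IP counted once
        long bounces = hitsPerIp.values().stream()
                .filter(hits -> hits == 1)   // IPs with exactly one record
                .count();
        return new long[] { pv, distinctIps, bounces };
    }

    /** Bounce rate = bounce count / PV; defined as 0 when there are no records. */
    static double bounceRate(long bounces, long pv) {
        return pv == 0 ? 0.0 : (double) bounces / pv;
    }

    public static void main(String[] args) {
        // Hypothetical records, used only to exercise the methods
        List<String> records = Arrays.asList(
                "110.52.250.126\t20130530173820\tforum.php",
                "110.52.250.126\t20130530173825\tforum.php?mod=viewthread",
                "110.52.250.126\t20130530173830\thome.php",
                "27.19.74.143\t20130530173820\tforum.php",
                "10.0.0.1\t20130530173821\tmember.php?mod=register");
        long[] kpi = compute(records);
        System.out.println("PV=" + kpi[0] + ", IPs=" + kpi[1] + ", bounces=" + kpi[2]
                + ", bounce rate=" + bounceRate(kpi[2], kpi[0]));
    }
}
```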

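Processing Steps

1 Data cleaning
    Use MapReduce to clean the raw data in HDFS in preparation for the statistical analysis.

2 Statistical analysis
    Use Hive to run the statistical analysis on the cleaned data.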

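Data Cleaning

I. Analyzing the Data

1.1 Data Overview Recap

The forum data comes in two parts:

(1) About 56 GB of historical data, covering up to 2012-05-29. In other words, before that date all log entries were appended to a single file.

(2) From 2013-05-30 onward, one data file of roughly 150 MB is generated per day. In other words, from that date the log is no longer kept in a single file.

Figure 1 shows the record format of the log data. Each line consists of five parts: visitor IP, access time, requested resource, access status (HTTP status code), and the traffic for that request.

Figure 1: Log record format

The data used here comes from two 2013 log files, access_2013_05_30.log and access_2013_05_31.log, available for download at http://pan.baidu.com/s/1pJE7XR9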

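1.2 What to Clean

(1) Based on the key metrics analyzed earlier, none of the statistics involve the access status (HTTP status code) or the traffic of the request, so these two fields can be dropped first.

(2) The timestamp in the log needs to be converted into a plain numeric form such as 20130530173820 (yyyyMMddHHmmss), so we write a class that converts the date of each log record.

(3) Requests for static resources are meaningless for this analysis, so records starting with "GET /static/" are filtered out. The GET and POST strings are likewise of no use and are stripped as well.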
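The three cleaning rules above can be sketched in a few lines of plain Java. This is a simplified variant written with java.time, not the MapReduce code of this article; the class name CleanSketch and its method names are assumptions for illustration, and the request is split on spaces rather than stripped by prefix.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Illustrative sketch only: the cleaning rules applied to single fields.
class CleanSketch {
    // Input looks like "30/May/2013:17:38:20 +0800"; month names are English.
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Converts the log timestamp to the compact yyyyMMddHHmmss form. */
    static String convertTime(String raw) {
        return OffsetDateTime.parse(raw, IN).format(OUT);
    }

    /** True for static-resource requests, which are dropped. */
    static boolean isStatic(String request) {
        return request.startsWith("GET /static/");
    }

    /** Drops the HTTP method, the leading slash and the protocol suffix. */
    static String stripRequest(String request) {
        String[] parts = request.split(" ");
        if (parts.length < 2) {
            return request; // unexpected shape: leave unchanged
        }
        String path = parts[1];
        return path.startsWith("/") ? path.substring(1) : path;
    }

    public static void main(String[] args) {
        System.out.println(convertTime("30/May/2013:17:38:20 +0800"));
        System.out.println(isStatic("GET /static/image/common/faq.gif HTTP/1.1"));
        System.out.println(stripRequest("GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1"));
    }
}
```

The status and traffic fields are simply never emitted, which is all that rule (1) requires.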

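II. The Data Cleaning Process

2.1 Uploading Logs to HDFS Periodically

First, the log data must be uploaded to HDFS for processing. There are two typical cases:

(1) If the log server holds little data and is under light load, the data can be uploaded to HDFS directly with shell commands.

(2) If there are many log servers and a large volume of data, use Flume to collect the data.

Our experimental data files are small, so we use the first approach, shell commands.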
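For the periodic part, one possible arrangement is a small scheduled program that works out the previous day's file name and builds the matching hdfs dfs -put command. The sketch below assumes the access_yyyy_MM_dd.log naming used by the two sample files; the class name UploadCommandSketch and both directory paths are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: builds the upload command for one day's log file.
class UploadCommandSketch {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyy_MM_dd");

    /** Returns the argument list for "hdfs dfs -put <local file> <hdfs dir>". */
    static List<String> putCommand(LocalDate day, String localDir, String hdfsDir) {
        String file = "access_" + day.format(DAY) + ".log";
        return Arrays.asList("hdfs", "dfs", "-put", localDir + "/" + file, hdfsDir);
    }

    public static void main(String[] args) {
        // Hypothetical directories; adjust to the actual log and HDFS layout.
        List<String> cmd = putCommand(LocalDate.now().minusDays(1),
                "/usr/local/logs", "/project/techbbs/data");
        System.out.println(String.join(" ", cmd));
        // To actually run it on a machine with the hdfs client installed:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The same effect can be had from a cron job running the shell command directly; the Java form is used here only to stay in the language of the rest of the article.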

Before cleaning

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682

After cleaning

110.52.250.126  20130530173820  data/cache/style_1_widthauto.css?y7a
110.52.250.126  20130530173820  source/plugin/wsh_wx/img/wsh_zk.css
110.52.250.126  20130530173820  data/cache/style_1_forum_index.css?y7a
110.52.250.126  20130530173820  source/plugin/wsh_wx/img/wx_jqr.gif
27.19.74.143    20130530173820  data/attachment/common/c8/common_2_verify_icon.png
27.19.74.143    20130530173820  data/cache/common_smilies_var.js?y7a

Note: the mapper listed in section 2.2 also drops every request whose URL contains .css, .js, .jpg, .png, .gif, or .jpeg, so its actual output would not include the sample lines above.

2.2 Writing a MapReduce Program to Clean the Logs

LogParser

package com.neusoft;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LogParser {
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
            "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
            "yyyyMMddHHmmss");

    public static void main(String[] args) throws ParseException {
        final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        LogParser parser = new LogParser();
        final String[] array = parser.parse(S1);
        System.out.println("Sample record: " + S1);
        System.out.format(
                "Parsed result:  ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                array[0], array[1], array[2], array[3], array[4]);
    }

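    /**
     * Parses an English-locale timestamp string such as "30/May/2013:17:38:20".
     *
     * @param string the timestamp text taken from the log line
     * @return the parsed Date, or null if the text cannot be parsed
     */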
    private Date parseDateFormat(String string) {
        Date parse = null;
        try {
            parse = FORMAT.parse(string);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return parse;
    }
    /**
     * Parses one line of the log.
     *
     * @param line one raw log line
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time = parseTime(line);
        String url = parseURL(line);
        String status = parseStatus(line);
        String traffic = parseTraffic(line);

        return new String[] { ip, time, url, status, traffic };
    }

    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        String traffic = trim.split(" ")[1];
        return traffic;
    }

    private String parseStatus(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        String status = trim.split(" ")[0];
        return status;
    }

    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        String url = line.substring(first + 1, last);
        return url;
    }

    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        Date date = parseDateFormat(time);
        return dateformat1.format(date);
    }

    private String parseIP(String line) {
        String ip = line.split("- -")[0].trim();
        return ip;
    }
}

LogCleanDriver

package com.neusoft;

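import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogCleanDriver {

    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "root") ;
        System.setProperty("hadoop.home.dir", "e:/hadoop-2.8.3");
        // Both an input path (args[0]) and an output path (args[1]) are required.
        if (args == null || args.length < 2) {
            System.err.println("Usage: LogCleanDriver <input path> <output path>");
            System.exit(2);
        }

        Configuration configuration = new Configuration();
        // Delete a stale output directory, since the job fails if the output path already exists.
        Path outputPath = new Path(args[1]);
        FileSystem fs = outputPath.getFileSystem(configuration);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }
        Job job = Job.getInstance(configuration);
        // Locate the job jar through this class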
        job.setJarByClass(LogCleanDriver.class);


        job.setMapperClass(LogCleanMapper.class);
        job.setReducerClass(LogCleanReducer.class);

        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileInputFormat.setMaxInputSplitSize(job, 1024*1024);
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        boolean bResult = job.waitForCompletion(true);
        System.out.println("--------------------------------");
        System.exit(bResult ? 0 : 1);

    }


}

LogCleanMapper

package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class LogCleanMapper extends Mapper<LongWritable,Text,LongWritable,Text>{
    LogParser logParser = new LogParser();
    Text outputValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        final String[] parsed = logParser.parse(value.toString());

        // step 1: drop requests for static resources
        if (parsed[2].startsWith("GET /static/")
                || parsed[2].startsWith("GET /uc_server")
                || parsed[2].endsWith(".css")
                || parsed[2].endsWith(".js")) {
            return;
        }
        // step 2: strip the leading HTTP method
        if (parsed[2].startsWith("GET /")) {
            parsed[2] = parsed[2].substring("GET /".length());
        } else if (parsed[2].startsWith("POST /")) {
            parsed[2] = parsed[2].substring("POST /".length());
        }
        // step 3: strip the trailing protocol string
        if (parsed[2].endsWith(" HTTP/1.1")) {
            parsed[2] = parsed[2].substring(0, parsed[2].length()
                    - " HTTP/1.1".length());
        }

        // In step 1 the request still ended with " HTTP/1.1", so stylesheet,
        // script and image requests are dropped here, by file extension.
        if (parsed[2].contains(".css")
                || parsed[2].contains(".js")
                || parsed[2].contains(".jpg")
                || parsed[2].contains(".png")
                || parsed[2].contains(".gif")
                || parsed[2].contains(".jpeg")) {
            return;
        }
        // step 4: write only the first three fields (ip, time, url)
        outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
        context.write(key, outputValue);

    }
}

LogCleanReducer

package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class LogCleanReducer extends Reducer<LongWritable,Text,Text,NullWritable> {
    @Override
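    protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // The key is a byte offset, so lines from different input files can share a key.
        // Write every value, not only the first, so that no record is lost.
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }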
}

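I. Statistics with Hive

1.1 Preparation: Creating a Partitioned Table

To analyze the data with Hive, the cleaned data must first be stored in Hive, which means creating a table. We use a partitioned table with the date as the partition key. The key point is to fix the HDFS location the table maps to; here it is /project/techbbs/cleaned, where the cleaned data is stored. The statements are as follows: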

hive> dfs -mkdir -p /project/techbbs/cleaned

hive>CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';

With the partitioned table created, the next step is to add a partition. The statement below adds the partition for the log of 2015-04-25:

hive>ALTER TABLE techbbs ADD PARTITION(logdate='2015_04_25') LOCATION '/project/techbbs/cleaned/2015_04_25';

hive> load data local inpath '/root/cleaned' into table techbbs partition(logdate='2015_04_25');

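1.2 Computing the Key Metrics with HQL

(1) Key metric 1: PV

Page views (PV) is the total number of pages viewed by all users; each page a user opens is counted once. Here we only need to count the records in the log. The HQL is: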

hive>CREATE TABLE techbbs_pv_2015_04_25 AS SELECT COUNT(1) AS PV FROM techbbs WHERE logdate='2015_04_25';

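(3) Key metric 3: Distinct IP count

This is the number of distinct IP addresses that visit the site within one day; an IP counts once no matter how many pages it visits. So we only need to count the distinct IPs in the log. SQL does this with the DISTINCT keyword, and HQL uses the same keyword: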

hive>CREATE TABLE techbbs_ip_2015_04_25 AS SELECT COUNT(DISTINCT ip) AS IP FROM techbbs WHERE logdate='2015_04_25';

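(4) Key metric 4: Bounce users

This is the number of visits in which only one page was viewed before leaving the site. We group the records by visitor IP; if a group contains a single record, that visitor is a bounce user. Counting these users gives the number of bounce users. The HQL is: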

hive>select count(*) from (select ip,count(ip) as num from techbbs where logdate='2015_04_25' group by ip) as tmpTable where tmpTable.num = 1;

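PS: Bounce rate is the percentage of visits in which only one page was viewed before leaving, i.e. single-page visits / total visits. Dividing the number of bounce users obtained here by the PV count gives the bounce rate.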