Hadoop Study Notes 20: Website Log Analysis Project Case (Part 2): Data Cleaning

阿新 • Published: 2019-01-11

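1. Data Overview

1.1 Data Recap

The forum data comes in two parts:

(1) Historical data of about 56 GB, collected up to 2012-05-29. This indicates that before 2012-05-29 all log records were appended to a single file.

(2) Starting from 2013-05-30, one data file of about 150 MB is generated per day. This indicates that from 2013-05-30 onward the logs are no longer kept in a single file.

Figure 1 shows the record format of the log data. Each record consists of five parts: visitor IP, access time, requested resource, access status (HTTP status code), and the traffic of the request.

Figure 1. Log record format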
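
To make the five-part layout concrete, here is a minimal sketch that splits one record with a single regular expression. The sample line is the one used in the test method of the full listing in section 2.2; the class name AccessLogLine and the pattern itself are assumptions made for illustration, not part of the original project.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not in the original project): splits one access-log
// record into its five fields with a single regular expression.
class AccessLogLine {
    // Assumed layout: ip - - [time zone] "request" status traffic
    private static final Pattern RECORD = Pattern.compile(
            "^(\\S+) - - \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    final String ip, time, request, status, traffic;

    private AccessLogLine(Matcher m) {
        ip = m.group(1);
        time = m.group(2);
        request = m.group(3);
        status = m.group(4);
        traffic = m.group(5);
    }

    // Returns Optional.empty() for lines that do not match the assumed layout.
    static Optional<AccessLogLine> parse(String line) {
        Matcher m = RECORD.matcher(line.trim());
        return m.matches() ? Optional.of(new AccessLogLine(m)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        AccessLogLine.parse(sample).ifPresent(r -> System.out.println(
                r.ip + " | " + r.time + " | " + r.request + " | "
                        + r.status + " | " + r.traffic));
    }
}
```

Returning an empty Optional for a non-matching line is one possible way to skip malformed records instead of failing on them.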

The data used here comes from two log files from 2013, access_2013_05_30.log and access_2013_05_31.log. Download: https://pan.baidu.com/s/1cxNigzxLY9nFXuNrs3dcIA

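1.2 Data to Be Cleaned

(1) Based on the key metrics analyzed in the previous post, none of the statistics we need involve the access status (HTTP status code) or the traffic of the request, so we can first drop these two fields.

(2) The date in each log record needs to be converted into a plain, commonly used format such as 20150426, so we can write a class that converts the date of each record.

(3) Requests for static resources are meaningless for our analysis, so we can filter out records that start with "GET /static/". The strings GET and POST are also meaningless to us, so they can be stripped as well.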
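
As an illustration of rule (2), the following is a minimal sketch of the date conversion using java.time. It is an alternative to the SimpleDateFormat approach used in section 2.2, not the author's code, and the class name LogDateConverter is hypothetical. The output pattern yyyyMMddHHmmss matches the one used by the parser in section 2.2, which keeps the time of day as well as the date.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper illustrating rule (2); the class name is an assumption.
class LogDateConverter {
    // Input pattern of the access log, e.g. 30/May/2013:17:38:20
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    // Compact output pattern, e.g. 20130530173820
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Converts a log timestamp (without the zone offset) to the compact form.
    static String convert(String logTime) {
        return LocalDateTime.parse(logTime, IN).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(convert("30/May/2013:17:38:20"));
    }
}
```

DateTimeFormatter instances are immutable and thread-safe, which is one reason to prefer them over shared static SimpleDateFormat fields when code may run on several threads.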

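2. Data Cleaning Process

2.1 Uploading Logs to HDFS on a Schedule

First, the log data has to be uploaded to HDFS for processing. There are several cases:

(1) If the log server holds little data and is under light load, the data can be uploaded to HDFS directly with shell commands.

(2) If the log server holds a lot of data and is under heavy load, use NFS to upload the data from another server.

(3) If there are many log servers and the data volume is large, use Flume to handle the data.

Our experimental data files are small, so we use the first approach, the shell command. Because a log file is produced every day, we need a scheduled task that automatically uploads the previous day's log file to the designated HDFS directory at 1 a.m. the next day. So we combine a shell script with crontab to create the scheduled task techbbs_core.sh, with the following content: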

    #!/bin/sh

    #step1.get yesterday format string
    yesterday=$(date --date='1 days ago' +%Y_%m_%d)
    #step2.upload logs to hdfs
    hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data

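Register the script with crontab -e so that it runs once a day at 01:00 (the minute field must be 0; a * there would run the script every minute of that hour):

    0 1 * * * techbbs_core.sh

To verify, run crontab -l to list the scheduled tasks that have been set up.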
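
Step 1 of the script derives the file name from yesterday's date. The same naming rule can be sketched in Java, which is handy for checking month boundaries. The class name YesterdayLog is hypothetical; the directory and the access_yyyy_MM_dd.log pattern are the ones used in techbbs_core.sh.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper mirroring step 1 of techbbs_core.sh: builds the local
// path of the previous day's log file from a given "today".
class YesterdayLog {
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyy_MM_dd");

    static String localPath(LocalDate today) {
        String yesterday = today.minusDays(1).format(STAMP);
        return "/usr/local/files/apache_logs/access_" + yesterday + ".log";
    }

    public static void main(String[] args) {
        // Path the cron job would upload if it ran today
        System.out.println(localPath(LocalDate.now()));
        // Path it would upload when running on 2013-05-31
        System.out.println(localPath(LocalDate.of(2013, 5, 31)));
    }
}
```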

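2.2 Writing a MapReduce Program to Clean the Logs

(1) Write a log parser class that parses each of the five parts of a record separately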

    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
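        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        /**
         * Parses the English-locale timestamp string.
         * 
         * @param string timestamp such as 30/May/2013:17:38:20
         * @return the parsed Date, or null if the string cannot be parsed
         */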
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parses one log record.
         * 
         * @param line one line of the access log
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);

            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }

(2) Write a MapReduce program that filters all records of the given log file

Mapper class:

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // step1. drop requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step2. strip the leading request-method string
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step3. strip the trailing protocol string
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step4. write only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }
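
The filtering rules in the mapper can be tried out without a cluster. The following is a minimal stand-alone sketch of steps 1 to 3 as a plain function; the class name RequestCleaner and the forum.php and member.php sample requests are hypothetical, while the prefixes and suffix are the ones checked in MyMapper.

```java
// Hypothetical stand-alone version of the mapper's URL rules, so they can be
// checked without Hadoop. Returns null when the record should be dropped.
class RequestCleaner {
    static String clean(String request) {
        // Rule 1: drop static-resource requests entirely
        if (request.startsWith("GET /static/")
                || request.startsWith("GET /uc_server")) {
            return null;
        }
        // Rule 2: strip the leading method and slash
        if (request.startsWith("GET /")) {
            request = request.substring("GET /".length());
        } else if (request.startsWith("POST /")) {
            request = request.substring("POST /".length());
        }
        // Rule 3: strip the trailing protocol marker
        if (request.endsWith(" HTTP/1.1")) {
            request = request.substring(0,
                    request.length() - " HTTP/1.1".length());
        }
        return request;
    }

    public static void main(String[] args) {
        // Sample requests below are made up for illustration
        System.out.println(clean("GET /static/image/common/faq.gif HTTP/1.1"));
        System.out.println(clean("GET /forum.php?mod=viewthread HTTP/1.1"));
        System.out.println(clean("POST /member.php HTTP/1.1"));
    }
}
```

Note that only the " HTTP/1.1" suffix is stripped, so a request made with another protocol version would keep its suffix under these rules.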

Reducer class:

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

(3) Complete sample code of LogCleanJob.java

package techbbs;

import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogCleanJob extends Configured implements Tool {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogCleanJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
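        // Use the configuration prepared by ToolRunner so command-line options are honored
        final Job job = new Job(getConf(),
                LogCleanJob.class.getSimpleName());
        // Allow the job to run from a packaged jar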
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Remove the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }
        
        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        } else {
            System.out.println("Clean process failed!");
        }
        // Report failure to the caller through the exit code
        return success ? 0 : 1;
    }

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());

            // step1. drop requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step2. strip the leading request-method string
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step3. strip the trailing protocol string
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step4. write only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }

    /*
     * Log parser class
     */
    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample record: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

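        /**
         * Parses the English-locale timestamp string.
         * 
         * @param string timestamp such as 30/May/2013:17:38:20
         * @return the parsed Date, or null if the string cannot be parsed
         */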
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parses one log record.
         * 
         * @param line one line of the access log
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);

            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}

(4) Export the jar and upload it to the designated directory on the Linux server

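2.3 Cleaning the Logs into HDFS on a Schedule

Here we modify the scheduled script from before and add the automatic run of the cleaning MapReduce program. The content is as follows: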

    #!/bin/sh

    #step1.get yesterday format string
    yesterday=$(date --date='1 days ago' +%Y_%m_%d)
    #step2.upload logs to hdfs
    hadoop fs -put /usr/local/files/apache_logs/access_${yesterday}.log /project/techbbs/data
    #step3.clean log data
    hadoop jar /usr/local/files/apache_logs/mycleaner.jar /project/techbbs/data/access_${yesterday}.log /project/techbbs/cleaned/${yesterday}

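This script uploads the log file to HDFS at 1 a.m. every day, then runs the cleaning program to filter the log file already stored in HDFS, and writes the filtered data to the cleaned directory.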

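2.4 Testing the Scheduled Task

(1) Because the two log files are from 2013, they are renamed here to the current day and the previous day in 2015 so that the test can pass.

(2) Run the command: techbbs_core.sh 2015_04_26

The console output is shown below. The number of records drops considerably after filtering: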

15/04/26 04:27:20 INFO input.FileInputFormat: Total input paths to process : 1
15/04/26 04:27:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/04/26 04:27:20 WARN snappy.LoadSnappy: Snappy native library not loaded
15/04/26 04:27:22 INFO mapred.JobClient: Running job: job_201504260249_0002
15/04/26 04:27:23 INFO mapred.JobClient: map 0% reduce 0%
15/04/26 04:28:01 INFO mapred.JobClient: map 29% reduce 0%
15/04/26 04:28:07 INFO mapred.JobClient: map 42% reduce 0%
15/04/26 04:28:10 INFO mapred.JobClient: map 57% reduce 0%
15/04/26 04:28:13 INFO mapred.JobClient: map 74% reduce 0%
15/04/26 04:28:16 INFO mapred.JobClient: map 89% reduce 0%
15/04/26 04:28:19 INFO mapred.JobClient: map 100% reduce 0%
15/04/26 04:28:49 INFO mapred.JobClient: map 100% reduce 100%
15/04/26 04:28:50 INFO mapred.JobClient: Job complete: job_201504260249_0002
15/04/26 04:28:50 INFO mapred.JobClient: Counters: 29
15/04/26 04:28:50 INFO mapred.JobClient: Job Counters
15/04/26 04:28:50 INFO mapred.JobClient: Launched reduce tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=58296
15/04/26 04:28:50 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
15/04/26 04:28:50 INFO mapred.JobClient: Launched map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: Data-local map tasks=1
15/04/26 04:28:50 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=25238
15/04/26 04:28:50 INFO mapred.JobClient: File Output Format Counters
15/04/26 04:28:50 INFO mapred.JobClient: Bytes Written=12794925
15/04/26 04:28:50 INFO mapred.JobClient: FileSystemCounters
15/04/26 04:28:50 INFO mapred.JobClient: FILE_BYTES_READ=14503530
15/04/26 04:28:50 INFO mapred.JobClient: HDFS_BYTES_READ=61084325
15/04/26 04:28:50 INFO mapred.JobClient: FILE_BYTES_WRITTEN=29111500
15/04/26 04:28:50 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=12794925
15/04/26 04:28:50 INFO mapred.JobClient: File Input Format Counters
15/04/26 04:28:50 INFO mapred.JobClient: Bytes Read=61084192
15/04/26 04:28:50 INFO mapred.JobClient: Map-Reduce Framework
15/04/26 04:28:50 INFO mapred.JobClient: Map output materialized bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient: Map input records=548160
15/04/26 04:28:50 INFO mapred.JobClient: Reduce shuffle bytes=14503530
15/04/26 04:28:50 INFO mapred.JobClient: Spilled Records=339714
15/04/26 04:28:50 INFO mapred.JobClient: Map output bytes=14158741
15/04/26 04:28:50 INFO mapred.JobClient: CPU time spent (ms)=21200
15/04/26 04:28:50 INFO mapred.JobClient: Total committed heap usage (bytes)=229003264
15/04/26 04:28:50 INFO mapred.JobClient: Combine input records=0
15/04/26 04:28:50 INFO mapred.JobClient: SPLIT_RAW_BYTES=133
15/04/26 04:28:50 INFO mapred.JobClient: Reduce input records=169857
15/04/26 04:28:50 INFO mapred.JobClient: Reduce input groups=169857
15/04/26 04:28:50 INFO mapred.JobClient: Combine output records=0
15/04/26 04:28:50 INFO mapred.JobClient: Physical memory (bytes) snapshot=154001408
15/04/26 04:28:50 INFO mapred.JobClient: Reduce output records=169857
15/04/26 04:28:50 INFO mapred.JobClient: Virtual memory (bytes) snapshot=689442816
15/04/26 04:28:50 INFO mapred.JobClient: Map output records=169857
Clean process success!
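
The size of the reduction can be read off the counters: Map input records=548160 against Map output records=169857, and Bytes Read=61084192 against Bytes Written=12794925. A small sketch of the arithmetic follows; the class name CleanRatio is hypothetical and the figures are copied from the log above.

```java
// Hypothetical helper: computes how much data survived the cleaning job,
// using the counter values printed in the console output.
class CleanRatio {
    static double percent(long part, long whole) {
        return 100.0 * part / whole;
    }

    static long dropped(long input, long output) {
        return input - output;
    }

    public static void main(String[] args) {
        long mapInput = 548_160L;        // Map input records
        long mapOutput = 169_857L;       // Map output records
        long bytesRead = 61_084_192L;    // File Input Format Counters: Bytes Read
        long bytesWritten = 12_794_925L; // File Output Format Counters: Bytes Written

        System.out.printf("records dropped: %d%n", dropped(mapInput, mapOutput));
        System.out.printf("records kept: %.2f%%%n", percent(mapOutput, mapInput));
        System.out.printf("bytes kept: %.2f%%%n", percent(bytesWritten, bytesRead));
    }
}
```

In this run about 31% of the records and about 21% of the bytes remain after cleaning.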

(3) Check the log data in HDFS through the web interface:

Raw, unfiltered log data: /project/techbbs/data/

Filtered log data: /project/techbbs/cleaned/
