
Heima Forum Log Analysis Project (Hive, Sqoop, Flume, MySQL)

I. Preparation

1. Project Description
By analyzing the Apache common logs of the Heima tech forum, we compute the forum's key metrics to support the operators' decision-making.

2. The Data
Each record consists of 5 parts: client IP, access time, requested resource (the quoted request line, which carries the URL), response status, and bytes transferred.
An excerpt of the data:

27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shy.gif HTTP/1.1" 200 2663
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/nv_a.png HTTP/1.1" 200 2076
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/titter.gif HTTP/1.1" 200 1398
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sweat.gif HTTP/1.1" 200 1879
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/mad.gif HTTP/1.1" 200 2423
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/hug.gif HTTP/1.1" 200 1054
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/lol.gif HTTP/1.1" 200 1443
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/victory.gif HTTP/1.1" 200 1275
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/time.gif HTTP/1.1" 200 687
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/kiss.gif HTTP/1.1" 200 987
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/handshake.gif HTTP/1.1" 200 1322
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/loveliness.gif HTTP/1.1" 200 1579
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/call.gif HTTP/1.1" 200 603
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/funk.gif HTTP/1.1" 200 2928
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/curse.gif HTTP/1.1" 200 1543
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/dizzy.gif HTTP/1.1" 200 1859
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/shutup.gif HTTP/1.1" 200 2500
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sleepy.gif HTTP/1.1" 200 2375
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=56212&size=middle HTTP/1.1" 301 -
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/uploadbutton_small.png HTTP/1.1" 200 690
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/fastreply.gif HTTP/1.1" 200 608
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=21212&size=middle HTTP/1.1" 301 -
8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=28823&size=middle HTTP/1.1" 301 -
8.35.201.161 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/taobao.gif HTTP/1.1" 200 1021
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/02/93/31_avatar_middle.jpg HTTP/1.1" 200 6519
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/security.png HTTP/1.1" 200 2203
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=36174&size=middle HTTP/1.1" 301 -
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pn_post.png HTTP/1.1" 200 3309
8.35.201.164 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/05/72/32_avatar_middle.jpg HTTP/1.1" 200 5333
8.35.201.144 - - [30/May/2013:17:38:22 +0800] "GET /static/image/common/icon_quote_e.gif HTTP/1.1" 200 287
8.35.201.161 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/avatar.php?uid=27067&size=small HTTP/1.1" 301 -
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/36/35_avatar_middle.jpg HTTP/1.1" 200 10087
8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /data/attachment/common/c5/common_13_usergroup_icon.jpg HTTP/1.1" 200 3462
8.35.201.160 - - [30/May/2013:17:38:22 +0800] "GET /static/image/magic/bump.small.gif HTTP/1.1" 200 1052
8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /static/image/common/arw.gif HTTP/1.1" 200 940
220.181.89.156 - - [30/May/2013:17:38:20 +0800] "GET /thread-24727-1-1.html HTTP/1.1" 200 79499
8.35.201.164 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/05/62/12_avatar_middle.jpg HTTP/1.1" 200 6415
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/logging.js?y7a HTTP/1.1" 200 603
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/forum.js?y7a HTTP/1.1" 200 15256
8.35.201.165 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/02/88/23_avatar_middle.jpg HTTP/1.1" 200 6733
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /static/js/md5.js?y7a HTTP/1.1" 200 5734
8.35.201.160 - - [30/May/2013:17:38:22 +0800] "GET /uc_server/data/avatar/000/02/12/12_avatar_middle.jpg HTTP/1.1" 200 6606
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/study_nge/css/nge.css HTTP/1.1" 200 2521
211.97.15.179 - - [30/May/2013:17:38:21 +0800] "GET /forum.php HTTP/1.1" 200 71064
211.97.15.179 - - [30/May/2013:17:38:22 +0800] "GET /source/plugin/study_nge/js/HoverLi.js HTTP/1.1" 200 324

3. Key Metrics
⊙ Page views (PV)

Definition: a page view (PV) is counted each time any user opens a page; the total across all users is the site's PV.

Analysis: total page views gauge users' interest in the site, much as ratings do for a TV series. For site operators, though, the page views of each individual board matter even more.

Formula: count the records.

⊙ Registered users

Formula: count the requests whose URL is member.php?mod=register.

⊙ Unique IPs

Definition: the number of distinct IP addresses that visit the site within one day; the same IP counts as one no matter how many pages it views.

Analysis: this is the most familiar metric of all. However many machines or users sit behind a single IP, the unique-IP count is, to some extent, the most direct measure of how well a promotion campaign performs.

Formula: count the distinct IPs.

⊙ Bounce rate

Definition: the percentage of visits that view only one page before leaving the site, i.e. the number of single-page visits divided by the total number of visits.

Analysis: bounce rate is a key visitor-stickiness metric; it shows how interested visitors are in the site. The lower the bounce rate, the better the traffic quality and the more likely those visitors are to become effective, loyal users.

The metric also gauges the effect of online marketing: it shows how many visitors were attracted to a promoted page or to the site and then slipped away, a sure thing lost, so to speak. For example, if the site advertises on some media outlet, the bounce rate of visitors arriving from that source reflects whether the outlet was well chosen, whether the ad copy was well written, and whether the landing page offers a good user experience.

Formula: (1) count the IPs that appear in only one record within the day; these are the bounces. (2) bounces / PV. (All four metrics are made concrete in the sketch below.)
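
Taken together, these metrics reduce to simple aggregate queries. Below is a minimal HiveQL sketch run through the hive CLI, assuming the hmbbs table created later in this article (columns ip, logtime, url; string partition column logdate); the date 20130530 is a placeholder, and the bounce rate itself is the last count divided by PV:

# PV: every record is one page view
hive -e "select count(*) from hmbbs where logdate='20130530';"
# Registered users: requests for the register URL
hive -e "select count(*) from hmbbs where logdate='20130530' and instr(url, 'member.php?mod=register') > 0;"
# Unique IPs
hive -e "select count(distinct ip) from hmbbs where logdate='20130530';"
# Bounces: IPs with exactly one record that day
hive -e "select count(*) from (select ip from hmbbs where logdate='20130530' group by ip having count(*) = 1) t;"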

4. Project Development Steps

  1. Collect the logs into HDFS with Flume
  2. Clean the raw data
  3. Analyze the data along multiple dimensions with Hive
  4. Export the Hive results to MySQL with Sqoop
  5. Provide a viewing tool for end users

That covers the project's background; the sections below walk through the development process in detail:

II. Development Process

1. Importing the logs into HDFS with Flume

Configuration of the a4.conf file:

# Name the agent's source, channel, and sink
a4.sources = r1
a4.channels = c1
a4.sinks = k1

# Configure the source: watch a spooling directory
a4.sources.r1.type = spooldir
a4.sources.r1.spoolDir = /home/hadoop/logs

# Configure the channel
a4.channels.c1.type = memory
a4.channels.c1.capacity = 10000
a4.channels.c1.transactionCapacity = 100

# Define an interceptor that stamps each event with a timestamp
# (required by the %Y%m%d escape in the sink path below)
a4.sources.r1.interceptors = i1
a4.sources.r1.interceptors.i1.type = org.apache.flume.interceptor.TimestampInterceptor$Builder

# Configure the sink
a4.sinks.k1.type = hdfs
a4.sinks.k1.hdfs.path = hdfs://ns1/flume/%Y%m%d
a4.sinks.k1.hdfs.filePrefix = events-
a4.sinks.k1.hdfs.fileType = DataStream
# Do not roll files by event count
a4.sinks.k1.hdfs.rollCount = 0
# Roll a new HDFS file once the current one reaches 128 MB
a4.sinks.k1.hdfs.rollSize = 134217728
# Roll a new HDFS file every 60 seconds
a4.sinks.k1.hdfs.rollInterval = 60

# Wire the source and sink to the channel
a4.sources.r1.channels = c1
a4.sinks.k1.channel = c1

Run:

bin/flume-ng agent -n a4 -c conf -f conf/a4.conf -Dflume.root.logger=INFO,console
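
The spooldir source watches /home/hadoop/logs and ingests any file dropped into it, renaming the file with a .COMPLETED suffix once it has been consumed. Delivering a day's log is therefore just a copy; the source path and file name below are only illustrations:

cp /path/to/access_2013_05_30.log /home/hadoop/logs/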


2. Initialization

  1. Create an external table in Hive

 create external table hmbbs (ip string, logtime string, url string) partitioned by (logdate string) row format delimited fields terminated by '\t' location '/flume';
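
To sanity-check the definition, a quick sketch (the partition list will stay empty until the daily script below registers partitions):

hive -e "describe hmbbs;"
hive -e "show partitions hmbbs;"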

  2. Create a shell script

  touch daily.sh

  # add execute permission

  chmod +x daily.sh


3. Writing the shell script (test step by step; explanatory comments inline)

# Assign the date to a variable using backticks (command substitution);
# use %Y%m%d (4-digit year) so it matches the directories Flume creates on HDFS
CURRENT=`date +%Y%m%d`
# Clean and filter the raw logs with the MapReduce job in cleaner.jar (source attached at the end of this article)
/hadoop/bin/hadoop jar /root/cleaner.jar /flume/$CURRENT /cleaned/$CURRENT
# hmbbs is a partitioned table; data added directly on HDFS must be registered as a partition
/hive/bin/hive -e "alter table hmbbs add partition (logdate='$CURRENT') location '/cleaned/$CURRENT'";


# Count the day's records and store the result in a new table: pv
/hive/bin/hive -e "create table pv_$CURRENT row format delimited fields terminated by '\t' as select count(*) from hmbbs where logdate='$CURRENT'";
# Group by ip, keep IPs with more than 20 hits, sort descending, take the top 20, store in a new table: vip
/hive/bin/hive -e "create table vip_$CURRENT row format delimited fields terminated by '\t' as select $CURRENT, ip, count(*) as hits from hmbbs where logdate='$CURRENT' group by ip having hits > 20 order by hits desc limit 20";
# Count the distinct IPs and store the result in a new table: uv
/hive/bin/hive -e "create table uv_$CURRENT row format delimited fields terminated by '\t' as select $CURRENT, count(distinct ip) from hmbbs where logdate='$CURRENT'";

# instr is a Hive builtin that returns the position of the substring within url (0 if absent)
/hive/bin/hive -e "select count(*) from hmbbs where logdate='$CURRENT' and instr(url,'member.php?mod=register')>0";
# Export the PV result to MySQL
/sqoop-1.4.3/bin/sqoop export --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --export-dir "/user/hive/warehouse/pv_$CURRENT" --table pv --fields-terminated-by '\t'
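
The same pattern exports the remaining result tables; a sketch, assuming matching uv and vip tables have already been created in the itcast database in MySQL:

/sqoop-1.4.3/bin/sqoop export --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --export-dir "/user/hive/warehouse/uv_$CURRENT" --table uv --fields-terminated-by '\t'
/sqoop-1.4.3/bin/sqoop export --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --export-dir "/user/hive/warehouse/vip_$CURRENT" --table vip --fields-terminated-by '\t'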

Add the script to crontab so it runs on a schedule.
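
For example, a crontab entry (a sketch assuming daily.sh lives in /root) that runs the script at 01:00 every day and appends its output to a log:

0 1 * * * /root/daily.sh >> /root/daily.log 2>&1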

Appendix: Java code for the data cleaning

package cn.itcast.hadoop;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Cleaner extends Configured implements Tool{
    @Override
    public int run(String[] args) throws Exception {
        final String inputPath = args[0];
        final String outPath = args[1];
        
        final Configuration conf = new Configuration();
        final Job job = new Job(conf, Cleaner.class.getSimpleName());
        job.setJarByClass(Cleaner.class);
        
        FileInputFormat.setInputPaths(job, inputPath);
        
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(outPath));
        
        return job.waitForCompletion(true) ? 0 : 1;
    }
    public static void main(String[] args)  throws Exception{
        ToolRunner.run(new Cleaner(), args);
    }
    
    
    static class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser parser = new LogParser();
        
        Text v2 = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            final String line = value.toString();
            final String[] parsed = parser.parse(line);
            final String ip = parsed[0];
            final String logtime = parsed[1];
            String url = parsed[2];
            
            // drop all requests for static resources
            if(url.startsWith("GET /static")||url.startsWith("GET /uc_server")){
                return;
            }
            
            // strip the leading "GET "/"POST " (plus the slash) and the trailing " HTTP/1.1"
            if(url.startsWith("GET")){
                url = url.substring("GET ".length()+1, url.length()-" HTTP/1.1".length());
            }
            if(url.startsWith("POST")){
                url = url.substring("POST ".length()+1, url.length()-" HTTP/1.1".length());
            }
            
            v2.set(ip+"\t"+logtime +"\t"+url);
            context.write(key, v2);
        }
    }
    
    static class MyReducer extends Reducer<LongWritable, Text, Text, NullWritable>{
        @Override
        protected void reduce(LongWritable k2, Iterable<Text> v2s, Context context) throws IOException, InterruptedException {
            // emit each cleaned record once; the key (the file offset) is dropped
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        }
    }
}
class LogParser {
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat("d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1=new SimpleDateFormat("yyyyMMddHHmmss");
//    public static void main(String[] args) throws ParseException {
//        final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
//        LogParser parser = new LogParser();
//        final String[] array = parser.parse(S1);
//        System.out.println("Sample record: "+S1);
//        System.out.format("Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s", array[0], array[1], array[2], array[3], array[4]);
//    }
    /**
     * Parse one line of the log.
     * @param line the raw log line
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line){
        String ip = parseIP(line);
        String time;
        try {
            time = parseTime(line);
        } catch (Exception e1) {
            time = "null";
        }
        String url;
        try {
            url = parseURL(line);
        } catch (Exception e) {
            url = "null";
        }
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        
        return new String[]{ip, time ,url, status, traffic};
    }
    
    // traffic is the second token after the closing quote ("-" when no body was sent)
    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"")+1).trim();
        String traffic = trim.split(" ")[1];
        return traffic;
    }
    // status is the first token after the closing quote
    private String parseStatus(String line) {
        String trim;
        try {
            trim = line.substring(line.lastIndexOf("\"")+1).trim();
        } catch (Exception e) {
            trim = "null";
        }
        String status = trim.split(" ")[0];
        return status;
    }
    // the full request line sits between the first and last double quotes
    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        String url = line.substring(first+1, last);
        return url;
    }
    // reformat the bracketed timestamp from d/MMM/yyyy:HH:mm:ss to yyyyMMddHHmmss
    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first+1,last).trim();
        try {
            return dateformat1.format(FORMAT.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return "";
    }
    // the IP is everything before the "- -" separator
    private String parseIP(String line) {
        String ip = line.split("- -")[0].trim();
        return ip;
    }
    
}
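
To build and run the cleaner by hand, a sketch assuming the source sits under src/ and the hadoop command is on the PATH; if the jar's manifest names cn.itcast.hadoop.Cleaner as its main class (as daily.sh assumes), the class-name argument can be dropped:

mkdir -p classes
javac -classpath `hadoop classpath` -d classes `find src -name '*.java'`
jar cf cleaner.jar -C classes .
hadoop jar cleaner.jar cn.itcast.hadoop.Cleaner /flume/20130530 /cleaned/20130530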