Counting Visits in Log Files: A Beefed-Up WordCount in Spark

 

Preface

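Learning the basic syntax of Scala and Spark can be dry, and working through a small practical exercise is an effective way to make the fundamentals stick. An earlier post covered the most basic WordCount (http://blog.csdn.net/whzhaochao/article/details/72358215). This post builds a simple feature modeled on a real production need: analyzing CDN or Nginx log files to compute PV, UV, IP addresses, referrers, and related figures. It is only meant as a practice exercise; real-world use would probably need something more elaborate.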

Counting file requests

Below is a sample of the Qiniu CDN request log:

223.93.159.226 HIT 203 [15/Feb/2017:11:14:35 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 5444007 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 62 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4866645 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 15 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 4854183 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
223.93.159.226 HIT 91 [15/Feb/2017:11:14:36 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 4751957 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko Core/1.53.2141.400 QQBrowser/9.5.10219.400"
61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 200 5173432 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.215.115.229 HIT 1 [15/Feb/2017:11:17:53 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 5.1; M578CA Build/LMY47D; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
115.236.173.95 HIT 1 [15/Feb/2017:11:17:49 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "http://v.abc.com.cn/video/iframe/player.html?id=139067&auto=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 QQ/6.6.9.412 V1_IPH_SQ_6.6.9_1_APP_A Pixel/1080 Core/UIWebView NetType/WIFI"
183.129.251.218 HIT 486 [15/Feb/2017:11:18:40 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4845881 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 34 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4976817 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 27 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 37 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141032.mp4 HTTP/1.1" 206 3859028 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 43 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.236.161.52 HIT 19 [15/Feb/2017:11:17:13 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5304429 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
115.228.161.136 HIT 1 [15/Feb/2017:11:16:51 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140994&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 10_2_1 like Mac OS X) AppleWebKit/602.4.6 (KHTML, like Gecko) Mobile/14D27 MicroMessenger/6.5.4 NetType/WIFI Language/zh_CN"
202.107.208.102 HIT 1226 [15/Feb/2017:11:19:10 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko"
115.231.248.162 HIT 34 [15/Feb/2017:11:17:56 +0800] "GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1" 206 1208743 "http://www.abc.com.cn/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; GWX:DOWNLOADED; GWX:RESERVED)"
221.234.216.142 HIT 744 [15/Feb/2017:11:17:09 +0800] "GET http://v-cdn.abc.com.cn/140995.mp4 HTTP/1.1" 206 4194896 "https://v.abc.com.cn/video/iframe/player.html?id=140995&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12B411 MicroMessenger/6.3.31 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video-js.css HTTP/1.1" 200 14382 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
183.132.170.2 HIT 1 [15/Feb/2017:11:15:22 +0800] "GET http://v-cdn.abc.com.cn/videojs/video.js HTTP/1.1" 200 173397 "https://v.abc.com.cn/video/iframe/player.html?id=140976&autoPlay=1" "Mozilla/5.0 (Linux; Android 6.0; HUAWEI MT7-CL00 Build/HuaweiMT7-CL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/53.0.2785.49 Mobile MQQBrowser/6.2 TBS/043024 Safari/537.36 MicroMessenger/6.5.4.1000 NetType/WIFI Language/zh_CN"
112.17.240.97 HIT 1440 [15/Feb/2017:11:20:31 +0800] "GET http://v-cdn.abc.com.cn/140941.mp4 HTTP/1.1" 206 6284261 "https://v.abc.com.cn/video/iframe/player.html?id=140941&autoPlay=1" "Mozilla/5.0 (iPhone; CPU iPhone OS 8_2 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Mobile/12D508 (zjxw;3.5.1;iPhone6,2;8.2;zh;bianfeng;b541b2039c2c00c66c14c7fb7e26df19fccd9cf4)"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 1637949 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 31 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5042489 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 32 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4517997 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 40 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4911485 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
125.118.106.43 HIT 30 [15/Feb/2017:11:20:57 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 4583601 "http://www.abc.com.cn/" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
60.190.59.200 HIT 1741 [15/Feb/2017:11:20:05 +0800] "GET http://v-cdn.abc.com.cn/141027.mp4 HTTP/1.1" 206 5173425 "http://www.abc.com.cn/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

The log format is:

IP  cache-hit-status  response-time  request-time  request-method  request-URL  protocol  status-code  response-size  referer  user-agent
ClientIP Hit/Miss ResponseTime [Time Zone] Method URL Protocol StatusCode TrafficSize Referer UserAgent
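
To make the field layout concrete, here is a minimal sketch that splits one line into the fields listed above. It uses only the Scala standard library. The names `CdnLog`, `Entry`, `parse`, and `sample`, and the regex itself, are illustrative assumptions and not part of the original code; the pattern has only been checked against the sample lines shown here.

```scala
object CdnLog {
  // One parsed log line; field names follow the format description above.
  final case class Entry(ip: String, cache: String, responseTime: Int, time: String,
                         method: String, url: String, protocol: String,
                         status: Int, size: Long, referer: String, userAgent: String)

  // Assumed layout: ip cache responseTime [time zone] "method url protocol" status size "referer" "userAgent"
  private val Line =
    """^(\S+) (\S+) (\d+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for any line that does not fit the assumed layout.
  def parse(line: String): Option[Entry] = line match {
    case Line(ip, cache, rt, time, method, url, proto, status, size, referer, ua) =>
      Some(Entry(ip, cache, rt.toInt, time, method, url, proto, status.toInt, size.toLong, referer, ua))
    case _ => None
  }

  // One of the sample lines from the log excerpt above.
  val sample: String =
    "61.164.41.226 HIT 2537 [15/Feb/2017:11:13:54 +0800] " +
    "\"GET http://v-cdn.abc.com.cn/141035.mp4 HTTP/1.1\" 200 5173432 " +
    "\"http://www.abc.com.cn/\" \"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)\""
}
```

For example, `CdnLog.parse(CdnLog.sample).map(_.status)` yields `Some(200)`. Returning an `Option` keeps malformed lines from aborting a job; whether that is preferable to failing fast depends on the data.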

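Counting unique IPs

Approach

Counting unique IPs takes two steps:
1. Extract the IP address from each log line.
2. Remove duplicate IPs to get the number of unique IPs.

Code for counting unique IPs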

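  // Regex that matches an IPv4 address
  val IPPattern = "((?:(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d)))\\.){3}(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d))))".r

  // `input` is assumed to be an RDD[String] of log lines, e.g. loaded with sc.textFile(...)
  // 1. Count requests per IP, sorted by request count in descending order
  val ipNums = input.flatMap(x => IPPattern.findFirstIn(x)).map(x => (x, 1)).reduceByKey((x, y) => x + y).sortBy(_._2, false)
  // Print the 10 IPs with the most requests
  ipNums.take(10).foreach(println)
  println("Unique IPs: " + ipNums.count())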

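How it works

  • flatMap(x=>IPPattern findFirstIn(x)) uses the regex to pull the IP address out of each log line
  • map(x=>(x,1)) maps each IP to (IP, 1), producing a pair RDD
  • reduceByKey((x,y)=>x+y) merges identical IPs, giving (IP, count)
  • sortBy(_._2,false) sorts by request count in descending order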
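
The same extract, count, and sort logic can be tried on a handful of lines without a Spark cluster. The sketch below uses plain Scala collections, with `groupBy` standing in for `reduceByKey`; the names `IpCount` and `countByIp` are made up for illustration, and the regex is copied from the article.

```scala
object IpCount {
  // Same IPv4 regex as in the article.
  val IPPattern =
    "((?:(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d)))\\.){3}(?:25[0-5]|2[0-4]\\d|((1\\d{2})|([1-9]?\\d))))".r

  // Returns (IP, request count) pairs, most frequent first; ties are broken by IP string.
  def countByIp(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => IPPattern.findFirstIn(line))   // lines without an IP are dropped
      .groupBy(identity)                              // local stand-in for reduceByKey
      .map { case (ip, hits) => (ip, hits.size) }
      .toSeq
      .sortBy { case (ip, n) => (-n, ip) }
}
```

The length of the returned sequence is the number of unique IPs, which corresponds to `ipNums.count()` in the Spark version.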

Results

(114.55.227.102,9348)
(220.191.255.197,2640)
(115.236.173.94,2476)
(183.129.221.102,2187)
(112.53.73.66,1794)
(115.236.173.95,1650)
(220.191.254.129,1278)
(218.88.25.200,751)
(183.129.221.104,569)
(115.236.173.93,529)
Unique IPs: 43649

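Counting unique IPs per video

Sometimes the site-wide number of unique IPs is not enough, and we also want the number of unique IPs for each video.

Approach

The computation has three main steps:
1. Keep only video requests and turn each log line into (file name, IP address).
2. Group by file name, like a database GROUP BY. The RDD now has the shape (file name, [IP1, IP1, IP2, …]), and the IPs still contain duplicates.
3. Deduplicate the IP addresses under each file name. The RDD now has the shape (file name, [IP1, IP2, …]), with no duplicate IPs.

Code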


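  // Regex that matches a video file name such as 141035.mp4 (the dot is escaped so it only matches a literal ".")
  val fileNamePattern = "([0-9]+)\\.mp4".r
  // Turn a log line into (file name, IP)
  def getFileNameAndIp(line: String) = {
    (fileNamePattern.findFirstIn(line).mkString, IPPattern.findFirstIn(line).mkString)
  }
  // 2. Count unique IPs per video and print the top 10
  input.filter(x => x.matches(".*([0-9]+)\\.mp4.*")).map(x => getFileNameAndIp(x)).groupByKey().map(x => (x._1, x._2.toList.distinct))
    .sortBy(_._2.size, false).take(10).foreach(x => println("Video: " + x._1 + " unique IPs: " + x._2.size))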

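How it works

  • filter(x=>x.matches(".*([0-9]+)\\.mp4.*")) keeps only the video requests in the log
  • map(x=>getFileNameAndIp(x)) turns each log line into the form (file name, IP)
  • groupByKey() groups by file name; the RDD now has the shape (file name, [IP1, IP1, IP2, …]), with duplicate IPs
  • map(x=>(x._1,x._2.toList.distinct)) removes duplicate IP addresses from each value
  • sortBy(_._2.size,false) sorts by number of unique IPs in descending order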
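
As a local check of the grouping and deduplication steps, the sketch below runs the same idea on plain Scala collections and deduplicates with a `Set` instead of `toList.distinct`. The names `VideoIps`, `fileAndIp`, and `uniqueIpsPerVideo` are hypothetical, and it assumes the client IP is the first space-delimited field of each line, as in the sample log.

```scala
object VideoIps {
  // A video file name such as 141035.mp4 (the dot is escaped).
  private val FileName = "[0-9]+\\.mp4".r

  // (file name, IP) for video requests, None for everything else.
  // Assumption: the IP is the first space-delimited field.
  def fileAndIp(line: String): Option[(String, String)] = {
    val ip = line.takeWhile(_ != ' ')
    FileName.findFirstIn(line).map(file => (file, ip))
  }

  // (file name, number of unique IPs), largest first; ties are broken by file name.
  def uniqueIpsPerVideo(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(line => fileAndIp(line))
      .groupBy { case (file, _) => file }
      .map { case (file, pairs) => (file, pairs.map(_._2).toSet.size) }
      .toSeq
      .sortBy { case (file, n) => (-n, file) }
}
```

On an RDD, one alternative to `groupByKey()` followed by an in-memory `distinct` would be to call `distinct()` on the (file name, IP) pairs and then count per key with `reduceByKey`, which avoids building one large list per video; whether that matters depends on the data volume.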

Results

Video: 141081.mp4 unique IPs: 2393
Video: 140995.mp4 unique IPs: 2050
Video: 141027.mp4 unique IPs: 1784
Video: 141090.mp4 unique IPs: 1702
Video: 141032.mp4 unique IPs: 1528
Video: 89973.mp4 unique IPs: 1523
Video: 141080.mp4 unique IPs: 1425
Video: 141035.mp4 unique IPs: 1321
Video: 141082.mp4 unique IPs: 1272
Video: 140938.mp4 unique IPs: 816

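Hourly traffic over one day

Sometimes I want to know the site's video traffic for each hour, to see what time of day users prefer to watch.

Approach

  1. Extract the request time and response size from each log line to form an RDD of (request time, size); requests with error statuses such as 404 are dropped here
  2. Group by request time to form an RDD of (request time, [size1, size2, …])
  3. Sum the sizes for each request time to get (request time, total size)

Code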

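  import scala.util.matching.Regex

  // For [15/Feb/2017:11:17:13 +0800], capture the year and the hour (2017 and 11) so traffic can be aggregated by hour
  val timePattern = ".*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*".r
  // Match the HTTP status code and the response size
  val httpSizePattern = ".*\\s(200|206|304)\\s([0-9]+)\\s.*".r

  // True if the whole string matches the pattern
  def isMatch(pattern: Regex, str: String) = {
    str match {
      case pattern(_*) => true
      case _ => false
    }
  }

  // 3. Sum the traffic for each hour of the day
  // getTimeAndSize is not shown in this post; it maps a log line to (request hour, response size)
  // collect() brings the sorted result to the driver so the hours print in order
  input.filter(x => isMatch(httpSizePattern, x)).filter(x => isMatch(timePattern, x)).map(x => getTimeAndSize(x)).groupByKey()
    .map(x => (x._1, x._2.sum)).sortByKey().collect().foreach(x => println(x._1 + "h CDN traffic=" + x._2 / (1024 * 1024 * 1024) + "G"))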

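How it works

  • filter(x=>isMatch(httpSizePattern,x)).filter(x=>isMatch(timePattern,x)) filters out invalid requests
  • map(x=>getTimeAndSize(x)) turns each log line into an RDD of (request hour, response size)
  • groupByKey() groups by request hour, forming an RDD of (request hour, [size1, size2, …])
  • map(x=>(x._1,x._2.sum)) sums the sizes for each hour, forming an RDD of (request hour, total size)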
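
The post does not include the body of `getTimeAndSize`, so the following is only a guess at what such a helper could look like, built from the two regexes in the article. The object name `HourlyTraffic` and the helpers `isValid` and `bytesPerHour` are assumptions added for illustration. The sketch uses plain Scala collections so the filter, map, group, and sum steps can be checked locally.

```scala
object HourlyTraffic {
  // The two patterns from the article.
  val timePattern = ".*(2017):([0-9]{2}):[0-9]{2}:[0-9]{2}.*".r
  val httpSizePattern = ".*\\s(200|206|304)\\s([0-9]+)\\s.*".r

  // Hypothetical helper: (hour, response size in bytes).
  // Throws MatchError on lines that do not match, so callers filter first.
  def getTimeAndSize(line: String): (String, Long) = {
    val hour = line match { case timePattern(_, h) => h }
    val size = line match { case httpSizePattern(_, s) => s.toLong }
    (hour, size)
  }

  // True only if the line has both a timestamp and an accepted status code.
  def isValid(line: String): Boolean =
    timePattern.pattern.matcher(line).matches() &&
      httpSizePattern.pattern.matcher(line).matches()

  // (hour, total bytes), sorted by hour.
  def bytesPerHour(lines: Seq[String]): Seq[(String, Long)] =
    lines
      .filter(isValid)
      .map(getTimeAndSize)
      .groupBy { case (hour, _) => hour }
      .map { case (hour, pairs) => (hour, pairs.map(_._2).sum) }
      .toSeq
      .sortBy { case (hour, _) => hour }
}
```

On an RDD, `reduceByKey(_ + _)` would give the same per-hour totals as `groupByKey()` followed by `sum` while combining values before the shuffle; that is a possible variation, not what the article's code does.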

Results

00h CDN traffic=14G
01h CDN traffic=3G
02h CDN traffic=5G
03h CDN traffic=3G
04h CDN traffic=3G
05h CDN traffic=4G
06h CDN traffic=11G
07h CDN traffic=22G
08h CDN traffic=43G
09h CDN traffic=52G
10h CDN traffic=61G
11h CDN traffic=45G
12h CDN traffic=46G
13h CDN traffic=51G
14h CDN traffic=55G
15h CDN traffic=45G
16h CDN traffic=45G
17h CDN traffic=44G
18h CDN traffic=45G
19h CDN traffic=51G
20h CDN traffic=55G
21h CDN traffic=53G
22h CDN traffic=42G
23h CDN traffic=25G
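
The job divides each hourly total by 1024*1024*1024 using integer arithmetic, so every figure above is truncated to a whole number of gigabytes. If fractional values are wanted, a small conversion helper along these lines could be used; `Traffic`, `toGiB`, and `formatGiB` are hypothetical names, not part of the original source.

```scala
object Traffic {
  private val BytesPerGiB: Double = 1024.0 * 1024.0 * 1024.0

  // Bytes to gibibytes as a Double, so the fraction is kept.
  def toGiB(bytes: Long): Double = bytes / BytesPerGiB

  // One decimal place; Locale.ROOT keeps the decimal separator a dot on every machine.
  def formatGiB(bytes: Long): String =
    "%.1f".formatLocal(java.util.Locale.ROOT, toGiB(bytes)) + "G"
}
```

For instance, `Traffic.formatGiB(1610612736L)` returns `"1.5G"`, where the integer division would print `1G`.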

Learning materials and source code

http://git.oschina.net/whzhaochao/spark-learning

Original article: http://blog.csdn.net/whzhaochao/article/details/72416956