
An example of using Flume to monitor logs, clean the log data with regular expressions, and centralize it into HBase in real time

Today I learned the basics of Flume and, along the way, thought about how to clean data in a standard log format in real time and store it centrally.

This post shows how to use regular expressions to clean data in real time and write it into HBase. First comes the simple case that stores each event without splitting it into columns; the configuration is pasted directly below.

1. Using Flume's AsyncHBaseSink with SimpleAsyncHbaseEventSerializer

The agent configuration is as follows:


###define agent
a5_hbase.sources = r5
a5_hbase.channels = c5
a5_hbase.sinks = k5


#define sources
a5_hbase.sources.r5.type = exec
a5_hbase.sources.r5.command = tail -f /opt/module/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log
# note: checkperiodic is not a documented exec-source property in Flume 1.5;
# unknown keys are ignored, so this line has no effect
a5_hbase.sources.r5.checkperiodic = 50


#define channels
a5_hbase.channels.c5.type = file
a5_hbase.channels.c5.checkpointDir = /opt/module/cdh/flume-1.5.0-cdh5.3.6/flume_file/checkpoint
a5_hbase.channels.c5.dataDirs = /opt/module/cdh/flume-1.5.0-cdh5.3.6/flume_file/data

#define sinks
a5_hbase.sinks.k5.type = org.apache.flume.sink.hbase.AsyncHBaseSink
a5_hbase.sinks.k5.table  =  flume_table5
a5_hbase.sinks.k5.columnFamily  = hivelog_info
a5_hbase.sinks.k5.serializer = org.apache.flume.sink.hbase.SimpleAsyncHbaseEventSerializer
a5_hbase.sinks.k5.serializer.payloadColumn = hiveinfo
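# the serializer writes each event body verbatim into the payloadColumn
# under the column family above; row keys are generated automatically by the serializer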

#bind 
a5_hbase.sources.r5.channels = c5
a5_hbase.sinks.k5.channel = c5

This agent picks up every new line appended to hive.log and stores it in the hiveinfo column. To see exactly what gets stored, run a few HQL statements so that hive.log grows, then scan the HBase table and check how each value is keyed and formatted.
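Note that the sink does not create the HBase table; it must exist before the agent starts. A minimal sketch of the full round trip, using the table and column-family names from the config above (the conf file name a5_hbase.conf is an assumption):

# create the target table in the hbase shell first
hbase shell
> create 'flume_table5', 'hivelog_info'

# then start the agent (conf file name assumed)
bin/flume-ng agent \
--name a5_hbase \
--conf conf \
--conf-file conf/a5_hbase.conf \
-Dflume.root.logger=INFO,console

# verify: scan a few rows from the hbase shell
> scan 'flume_table5', {LIMIT => 5}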

2. Using the RegexHbaseEventSerializer to parse the log data into columns

The log is a quoted, space-separated access-log format. Here are two sample lines for the experiment:

"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /course/view.php?id=27 HTTP/1.1" "303" "440" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"
"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /login/index.php HTTP/1.1" "303" "465" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"

For this log data, the regular expression we use appears in the serializer.regex line of the following agent configuration:


###define agent
a6_hive.sources = r6
a6_hive.channels = c6
a6_hive.sinks = k6

#define sources
a6_hive.sources.r6.type = exec
a6_hive.sources.r6.command = tail -f /opt/module/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log
# checkperiodic again has no effect (not a documented exec-source property)
a6_hive.sources.r6.checkperiodic = 50

#define channels
a6_hive.channels.c6.type = file
a6_hive.channels.c6.checkpointDir = /opt/module/cdh/flume-1.5.0-cdh5.3.6/flume_file/checkpoint
a6_hive.channels.c6.dataDirs = /opt/module/cdh/flume-1.5.0-cdh5.3.6/flume_file/data



#define sinks
a6_hive.sinks.k6.type = org.apache.flume.sink.hbase.HBaseSink
#a6_hive.sinks.k6.type = hbase
a6_hive.sinks.k6.table  =  flume_table_regx2
a6_hive.sinks.k6.columnFamily  = log_info
a6_hive.sinks.k6.serializer = org.apache.flume.sink.hbase.RegexHbaseEventSerializer
a6_hive.sinks.k6.serializer.regex = \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ (.*?)\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"\\ \\"(.*?)\\"
a6_hive.sinks.k6.serializer.colNames  = ip,x1,date_now,web,statu1,statu2,user,web2,type,user2,web3
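# each of the 11 capture groups maps to one colName, in order; judging from the
# sample lines these are: ip, remote user, timestamp, request line, status,
# bytes, an unquoted "-" field, referer, user agent, another "-" field, host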
#bind 
a6_hive.sources.r6.channels = c6
a6_hive.sinks.k6.channel = c6
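As before, the target table must exist in HBase before the agent starts; a minimal sketch using the names from this config:

hbase shell
> create 'flume_table_regx2', 'log_info'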

Start this agent first and watch the console output:

bin/flume-ng agent \
--name a6_hive \
--conf conf \
--conf-file conf/a6_hive.conf \
-Dflume.root.logger=DEBUG,console

Then append a test record to hive.log:

echo '"27.38.5.159" "-" "31/Aug/2015:00:04:37 +0800" "GET /login/index.php HTTP/1.1" "303" "465" - "http://www.ibeifeng.com/user.php?act=mycourse" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "-" "learn.ibeifeng.com"' >> /opt/module/cdh/hive-0.13.1-cdh5.3.6/logs/hive.log

In the console output we can see the event being picked up and parsed.
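To verify, scan the table from the hbase shell (a minimal check):

> scan 'flume_table_regx2', {LIMIT => 2}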


The scan shows each field stored in its own column, which is exactly what we want. From here, some per-column analysis with functions yields the final results; that is naturally done in Hive, as described here:

https://blog.csdn.net/maketubu7/article/details/80513072
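As a minimal sketch of that last step, the HBase table can be mapped into Hive as an external table via the HBaseStorageHandler (the Hive table name access_log and the column subset are assumptions, not from the original post):

CREATE EXTERNAL TABLE access_log (
  rowkey   string,
  ip       string,
  date_now string,
  web      string,
  statu1   string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,log_info:ip,log_info:date_now,log_info:web,log_info:statu1"
)
TBLPROPERTIES ("hbase.table.name" = "flume_table_regx2");

-- e.g. count requests per ip
SELECT ip, COUNT(*) FROM access_log GROUP BY ip;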


That's all.