Flume Data Collection: Common Cluster Configuration Examples

阿新 • Published: 2018-04-07

Big Data, Flume

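Non-cluster configuration

The non-cluster setup is fairly simple; see my notes, 《Flume筆記整理》 (Flume notes), for details. Its basic structure is shown below:

[Figure: basic structure of the non-cluster setup]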

Flume cluster: multiple agents, one source

Structure

The structure is shown below:

[Figure: structure with two agents, foo and bar]

Explanation:

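Agents can be deployed on different nodes; the figure above shows the case of two agents. Agent foo can be deployed
on the node that produces the logs, for example a web server node running tomcat or nginx. Its source can be
configured to watch a log file for new data, its channel can be memory-based or file-based, and its sink (where the
logs land) can be configured as avro, which sends the events on to the next agent.

Agent bar can be deployed on another node. Running it on the same node as foo also works, since several Flume
instances can run on one node. bar's main job is to collect, through its avro source, the log data sent from other
nodes. In practice, if the web tier is clustered, several web server nodes produce logs and an agent has to be
deployed on each of them, so bar receives data from several upstream agents. The later example does exactly that;
this section only covers multiple agents with one source. Agent bar can also sink its data in several ways (see the
official documentation for details); here the sink is HDFS.

Note that agent foo has only one source here. The later example configures several sources in a single agent so that
it can collect different log files. The "multiple sources" discussed later therefore means multiple log files feeding
foo, that is, multiple sources inside foo, for example data-access.log, data-ugctail.log and data-ugchead.log.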
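
The two-hop flow described above can be pictured with a small toy model. This is only a sketch in Python, not Flume code: the `Agent` class, the `deque` standing in for a channel and the `hdfs_files` list standing in for HDFS are hypothetical names made up for illustration.

```python
from collections import deque


class Agent:
    """Toy agent: a source puts events on a channel, a sink drains it."""

    def __init__(self, name, sink):
        self.name = name
        self.channel = deque()  # stands in for a memory or file channel
        self.sink = sink        # callable that delivers one event downstream

    def receive(self, event):
        """Source side: put an incoming event on the channel."""
        self.channel.append(event)

    def drain(self):
        """Sink side: take events off the channel and deliver them."""
        while self.channel:
            self.sink(self.channel.popleft())


hdfs_files = []                               # stands in for HDFS
bar = Agent("bar", sink=hdfs_files.append)    # bar's sink writes to storage
foo = Agent("foo", sink=bar.receive)          # foo's sink feeds bar's source

for line in ["GET /top", "POST /stat"]:       # made-up log lines
    foo.receive(line)
foo.drain()
bar.drain()
print(hdfs_files)
```

Each hop buffers events in its own channel until its sink delivers them downstream, so nothing reaches storage before both agents have drained.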

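Configuration example

Environment

As follows:

[Figure: environment with nodes uplooking01 and uplooking03]

There are two nodes:

uplooking01:
The log file /home/uplooking/data/data-clean/data-access.log
is the user access log generated by the web server, and a new log file is produced every day.
A Flume agent is deployed on this node, with that log file as its source and avro as its sink.

uplooking03:
This node collects the log data sent by the different Flume agents, such as the one above, and writes it to HDFS.

Note: my environment has three nodes, uplooking01, uplooking02 and uplooking03, which together form a Hadoop cluster.

Configuration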

uplooking01

#########################################################
##
## Watches a file for newly appended data and forwards what it collects over avro.
##    Note: running a Flume agent mainly comes down to configuring the source, channel and sink.
##  Below, a1 is the agent name; the source is r1, the channel is c1 and the sink is k1.
#########################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: watch the file for newly appended data (exec)
a1.sources.r1.type = exec
a1.sources.r1.command  = tail -F /home/uplooking/data/data-clean/data-access.log

# sink: forward events to the next agent over avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = uplooking03
a1.sinks.k1.port = 44444

# channel: buffer events in files, which is safer than buffering in memory
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/uplooking/data/flume/checkpoint
a1.channels.c1.dataDirs = /home/uplooking/data/flume/data

# bind source r1 and sink k1 together through channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

uplooking03

#########################################################
##
## Listens on avro and writes the collected data to HDFS.
##    Note: running a Flume agent mainly comes down to configuring the source, channel and sink.
##  Below, a1 is the agent name; the source is r1, the channel is c1 and the sink is k1.
#########################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: listen on avro
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /input/data-clean/access/%y/%m/%d
a1.sinks.k1.hdfs.filePrefix = flume
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.inUsePrefix = tmpFlume
a1.sinks.k1.hdfs.inUseSuffix = .tmp
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = second
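# With fileType = DataStream the data is stored in HDFS as plain text.
# Otherwise the sink writes its default SequenceFile format, and hdfs dfs -text shows the records as hexadecimal.
# The serializer key takes no hdfs. prefix, and TEXT is already its default value.
a1.sinks.k1.serializer = TEXT
a1.sinks.k1.hdfs.fileType = DataStream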

# channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# bind source r1 and sink k1 together through channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
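
To see what `hdfs.path` with `%y/%m/%d` and the `hdfs.round*` settings amount to, here is a rough Python sketch. It is not Flume's implementation: `round_down` and `resolve_path` are hypothetical helpers, the timestamp is an example value, and the UTC+8 time zone is an assumption about the cluster's local time. It relies on `%y`, `%m`, `%d`, `%H`, `%M` and `%S` meaning the same thing in Python's `strftime`.

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # assumption: cluster local time is UTC+8


def round_down(timestamp_ms, round_value, unit_seconds=1):
    """Round a millisecond timestamp down to a multiple of round_value units."""
    step_ms = round_value * unit_seconds * 1000
    return timestamp_ms - timestamp_ms % step_ms


def resolve_path(template, timestamp_ms, tz=LOCAL_TZ):
    """Expand strftime-style time escapes in a path template."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz).strftime(template)


event_ms = 1523062248369               # example timestamp in milliseconds
rounded_ms = round_down(event_ms, 10)  # roundValue = 10, roundUnit = second

print(resolve_path("/input/data-clean/access/%y/%m/%d", rounded_ms))
print(resolve_path("%H:%M:%S", event_ms), resolve_path("%H:%M:%S", rounded_ms))
```

In this sketch, rounding to 10 seconds turns 08:50:48 into 08:50:40 but leaves the day-level directory unchanged, so with a path that only goes down to the day the rounding settings do not change where events land.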

Testing

First make sure logs are being generated; they are written to /home/uplooking/data/data-clean/data-access.log.

Start the Flume agent on uplooking03:

[uplooking@uplooking03 flume]$ flume-ng agent -n a1 -c conf --conf-file conf/flume-source-avro.conf -Dflume.root.logger=INFO,console

Start the Flume agent on uplooking01:

flume-ng agent -n a1 -c conf --conf-file conf/flume-sink-avro.conf -Dflume.root.logger=INFO,console

After a while, the written log files appear in HDFS:

[uplooking@uplooking02 ~]$ hdfs dfs -ls /input/data-clean/access/18/04/07
18/04/07 08:52:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 26 items
-rw-r--r--   3 uplooking supergroup       1131 2018-04-07 08:50 /input/data-clean/access/18/04/07/flume.1523062248369.log
-rw-r--r--   3 uplooking supergroup       1183 2018-04-07 08:50 /input/data-clean/access/18/04/07/flume.1523062248370.log
-rw-r--r--   3 uplooking supergroup       1176 2018-04-07 08:50 /input/data-clean/access/18/04/07/flume.1523062248371.log
......

View the data in one of the files:

[uplooking@uplooking02 ~]$ hdfs dfs -text /input/data-clean/access/18/04/07/flume.1523062248369.log
18/04/07 08:55:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1000    220.194.55.244  null    40604   0       POST /check/init HTTP/1.1       500     null    Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.3      1523062236368
1002    221.8.9.6 80    886a1533-38ca-466c-86e1-0b84022f781b    20201   1       GET /top HTTP/1.0       500     null      Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.3      1523062236869
1002    61.172.249.96   99fb19c4-ec59-4abd-899c-4059dea39ead    0       0       POST /updateById?id=21 HTTP/1.1 408       null    Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko    1523062237370
1003    61.172.249.96   886a1533-38ca-466c-86e1-0b84022f781b    10022   1       GET /tologin HTTP/1.1   null    /update/pass      Mozilla/5.0 (Windows; U; Windows NT 5.1)Gecko/20070309 Firefox/2.0.0.3  1523062237871
1003    125.39.129.67   6839fff8-7b3a-48f5-90cd-0f45c7be1aeb    10022   1       GET /tologin HTTP/1.0   408     null      Mozilla/5.0 (Windows; U; Windows NT 5.1)Gecko/20070309 Firefox/2.0.0.3  1523062238372
1000    61.172.249.96   89019ae0-6140-4e5a-9061-e3af74f3e4a8    10022   1       POST /stat HTTP/1.1     null    /passpword/getById?id=11  Mozilla/4.0 (compatible; MSIE 5.0; WindowsNT)   1523062238873

If hdfs.fileType=DataStream is not configured on the Flume agent on uplooking03, the sink falls back to its default SequenceFile format, and the data viewed above shows up as hexadecimal.

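Flume cluster: multiple agents, multiple sources

Structure

As follows:

[Figure: structure with multiple agents and multiple sources]

Configuration example

Environment

Our environment is as follows:
[Figure: environment with uplooking01 and uplooking02 sending to uplooking03]

There are three log sources in our environment: data-access.log, data-ugchead.log and data-ugctail.log.
In the actual configuration below, only two nodes run log-source agents, uplooking01 and uplooking02, and
their sinks both send to the source on uplooking03.

Configuration

uplooking01 and uplooking02 use the same configuration: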

#########################################################
##
## Watches files for newly appended data and forwards what it collects over avro.
##    Note: running a Flume agent mainly comes down to configuring the source, channel and sink.
##  Below, a1 is the agent name; the sources are r1, r2 and r3, the channel is c1 and the sink is k1.
#########################################################
a1.sources = r1 r2 r3
a1.sinks = k1
a1.channels = c1

# source r1: watch the file for newly appended data (exec)
a1.sources.r1.type = exec
a1.sources.r1.command  = tail -F /home/uplooking/data/data-clean/data-access.log
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = static
## static adds a fixed key/value pair to the event header; two interceptors, i1 and i2, are configured here
a1.sources.r1.interceptors.i1.key = type
a1.sources.r1.interceptors.i1.value = access
a1.sources.r1.interceptors.i2.type = timestamp
## timestamp: with this interceptor in place, the agent that centrally collects the logs can resolve
## %y/%m/%d without setting a1.sinks.k1.hdfs.useLocalTimeStamp = true.
## This lightens the load on the collecting agent, because the time information comes directly from the source.

# source r2: watch the file for newly appended data (exec)
a1.sources.r2.type = exec
a1.sources.r2.command  = tail -F /home/uplooking/data/data-clean/data-ugchead.log
a1.sources.r2.interceptors = i1 i2
a1.sources.r2.interceptors.i1.type = static
## static adds a fixed key/value pair to the event header; two interceptors, i1 and i2, are configured here
a1.sources.r2.interceptors.i1.key = type
a1.sources.r2.interceptors.i1.value = ugchead
a1.sources.r2.interceptors.i2.type = timestamp

# source r3: watch the file for newly appended data (exec)
a1.sources.r3.type = exec
a1.sources.r3.command  = tail -F /home/uplooking/data/data-clean/data-ugctail.log
a1.sources.r3.interceptors = i1 i2
a1.sources.r3.interceptors.i1.type = static
## static adds a fixed key/value pair to the event header; two interceptors, i1 and i2, are configured here
a1.sources.r3.interceptors.i1.key = type
a1.sources.r3.interceptors.i1.value = ugctail
a1.sources.r3.interceptors.i2.type = timestamp

# sink: forward events to the next agent over avro
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = uplooking03
a1.sinks.k1.port = 44444

# channel: buffer events in files, which is safer than buffering in memory
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/uplooking/data/flume/checkpoint
a1.channels.c1.dataDirs = /home/uplooking/data/flume/data

# bind sources r1, r2, r3 and sink k1 together through channel c1
a1.sources.r1.channels = c1
a1.sources.r2.channels = c1
a1.sources.r3.channels = c1
a1.sinks.k1.channel = c1

The configuration for uplooking03:

#########################################################
##
## Listens on avro and writes the collected data to HDFS.
##    Note: running a Flume agent mainly comes down to configuring the source, channel and sink.
##  Below, a1 is the agent name; the source is r1, the channel is c1 and the sink is k1.
#########################################################
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source: listen on avro
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# sink: write events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /input/data-clean/%{type}/%Y/%m/%d
a1.sinks.k1.hdfs.filePrefix = %{type}
a1.sinks.k1.hdfs.fileSuffix = .log
a1.sinks.k1.hdfs.inUseSuffix = .tmp
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollSize = 10485760
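# For the roll policy above to take effect, the following setting is also needed;
# otherwise the sink may roll files early when it sees under-replicated blocks.
a1.sinks.k1.hdfs.minBlockReplicas = 1
# With fileType = DataStream the data is stored in HDFS as plain text.
# Otherwise the sink writes its default SequenceFile format, and hdfs dfs -text shows the records as hexadecimal.
# The serializer key takes no hdfs. prefix, and TEXT is already its default value.
a1.sinks.k1.serializer = TEXT
a1.sinks.k1.hdfs.fileType = DataStream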

# channel: buffer events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# bind source r1 and sink k1 together through channel c1
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
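
How the `type` and `timestamp` headers set on uplooking01 and uplooking02 end up in the HDFS path can be sketched in Python. This only mimics the behaviour and is not Flume code: the helper names, the example timestamp and the UTC+8 time zone are assumptions.

```python
import re
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=8))  # assumption: cluster local time is UTC+8


def static_interceptor(headers, key, value):
    """Add a fixed key/value pair, keeping any value that is already set."""
    headers.setdefault(key, value)
    return headers


def timestamp_interceptor(headers, now_ms):
    """Stamp the event with the time (in ms) at which the source saw it."""
    headers["timestamp"] = str(now_ms)
    return headers


def resolve_path(template, headers, tz=LOCAL_TZ):
    """Fill %{name} from the headers, then time escapes from the timestamp header."""
    filled = re.sub(r"%\{(\w+)\}", lambda m: headers[m.group(1)], template)
    seconds = int(headers["timestamp"]) / 1000
    return datetime.fromtimestamp(seconds, tz).strftime(filled)


headers = static_interceptor({}, "type", "access")        # i1 on source r1
headers = timestamp_interceptor(headers, 1523116801502)   # i2, example value
path = resolve_path("/input/data-clean/%{type}/%Y/%m/%d", headers)
print(path)
```

Because each source sets a different `type`, events from the three log files land in separate directories even though they travel through the same channel and sink.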

Testing

First make sure that logs are being generated normally on both uplooking01 and uplooking02.

Start the agent on uplooking03:

[uplooking@uplooking03 flume]$ flume-ng agent -n a1 -c conf --conf-file conf/flume-source-avro.conf -Dflume.root.logger=INFO,console

Start the agents on uplooking01 and uplooking02:

flume-ng agent -n a1 -c conf --conf-file conf/flume-sink-avro.conf -Dflume.root.logger=INFO,console

After a while, the corresponding log files appear in HDFS:

$ hdfs dfs -ls /input/data-clean
18/04/08 01:34:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
drwxr-xr-x   - uplooking supergroup          0 2018-04-07 22:00 /input/data-clean/access
drwxr-xr-x   - uplooking supergroup          0 2018-04-07 22:00 /input/data-clean/ugchead
drwxr-xr-x   - uplooking supergroup          0 2018-04-07 22:00 /input/data-clean/ugctail

List the log files under one of the log directories:

[uplooking@uplooking02 data-clean]$ hdfs dfs -ls /input/data-clean/access/2018/04/08
18/04/08 01:35:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r--   3 uplooking supergroup    2447752 2018-04-08 01:02 /input/data-clean/access/2018/04/08/access.1523116801502.log
-rw-r--r--   3 uplooking supergroup       5804 2018-04-08 01:02 /input/data-clean/access/2018/04/08/access.1523120538070.log.tmp

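Very few log files are produced this time. That is because the agent on uplooking03 was configured to roll a file only once it reaches 10 MB (hdfs.rollSize = 10485760 bytes).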
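
The effect of the roll settings on the number of files can be illustrated with a simplified Python model. It is a sketch under stated assumptions, not Flume's code: `count_files` is a hypothetical helper, it ignores the time-based `hdfs.rollInterval`, and the 190-byte event size is made up. The default values used (`rollSize` of 1024 bytes, `rollCount` of 10 events) are the HDFS sink defaults, which applied in the first example because no roll settings were given there.

```python
def count_files(event_sizes, roll_size=1024, roll_count=10):
    """Count the files written when a file is closed once its size or its
    event count has reached a threshold. A threshold of 0 disables that check."""
    files = 0
    cur_bytes = 0
    cur_events = 0
    for size in event_sizes:
        size_hit = roll_size > 0 and cur_bytes >= roll_size
        count_hit = roll_count > 0 and cur_events >= roll_count
        if cur_events > 0 and (size_hit or count_hit):
            files += 1          # close the current file, start a new one
            cur_bytes = 0
            cur_events = 0
        cur_bytes += size
        cur_events += 1
    if cur_events > 0:
        files += 1              # the file still open at the end
    return files


events = [190] * 600            # made-up workload: 600 events of 190 bytes

print(count_files(events))                                       # default thresholds
print(count_files(events, roll_size=10485760, roll_count=0))     # settings used above
```

In this model the default thresholds spread the 600 small events over 100 files, while a 10 MB size limit with the count limit disabled keeps them in a single file.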
