Distributed Data Collection: Flume Principles and Applications

阿新 • Published: 2019-02-07

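Part 1: Background

Common open-source data collection systems

Unstructured log (data) collection
- Flume
Structured log (data) collection
- Sqoop: full import
- canal (Alibaba): incremental import
- Databus (LinkedIn): incremental import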

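Part 2: Introduction to Flume (NG)

- Event

Flume transports data in units called events.
An event consists of a header and a byte array that carries the payload.
The header is a dictionary (key-value) structure that can be extended and used for contextual routing.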
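
To make the event structure concrete, here is a minimal Python sketch. It is not Flume's Java API: the `Event` class, the `route` helper, the `type` header key, and the channel names are all illustrative assumptions. It only shows the idea that the body is opaque bytes while routing decisions are made from the header dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Illustrative stand-in for an event: an opaque byte body plus headers."""
    body: bytes
    headers: Dict[str, str] = field(default_factory=dict)


def route(event: Event, mapping: Dict[str, str], default: str) -> str:
    """Pick a channel name from the (hypothetical) 'type' header value."""
    return mapping.get(event.headers.get("type", ""), default)


if __name__ == "__main__":
    e = Event(body=b"user=42 action=login", headers={"type": "audit"})
    # The body is never inspected; only the header drives the decision.
    print(route(e, {"audit": "auditChannel"}, default="defaultChannel"))
```

Keeping routing metadata in the header means intermediate hops can direct an event without parsing its payload.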

- Client

A client is an entity that wraps raw log data into events and sends them to one or more agents.
A client is optional.

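- Agent

source
1. Receives or generates events and sends them in batches to one or more channels
2. Types of source
  - Sources that integrate with system facilities: Syslog, NetCat
  - Sources that generate events automatically: Exec
  - Sources that watch a directory for file changes: Spooling Directory Source, Taildir Source
  - IPC sources for agent-to-agent communication: Avro, Thrift
channel
1. Sits between the source and the sink and buffers events
2. Supports transactions
3. Types of channel
  - Memory Channel: volatile; buffered events are lost if the agent process dies
  - File Channel: durable; events are persisted to local disk
  - JDBC Channel: durable; events are persisted to a database
sink
1. Delivers events to the next hop or the final destination, and removes them from the channel once delivery succeeds
2. Types of sink
  - Sinks that store events in a final destination: HDFS, HBase
  - Sink that simply discards events: Null Sink
  - IPC sinks for agent-to-agent communication: Avro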
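
The source, channel, and sink roles described above can be sketched as a toy pipeline in Python. This is a simplified illustration under stated assumptions, not how Flume is implemented: `MemoryChannel`, `drain_once`, and the rollback-by-requeue behavior are hypothetical names and mechanics chosen to show why an event leaves the channel only after the sink delivers it successfully.

```python
from collections import deque
from typing import Callable, Deque, List


class MemoryChannel:
    """Toy in-memory channel that buffers events between a source and a sink."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()

    def put_batch(self, events: List[bytes]) -> None:
        # Source side: a batch is appended as a whole.
        self._queue.extend(events)

    def take_batch(self, size: int) -> List[bytes]:
        # Sink side: remove up to `size` events for delivery.
        n = min(size, len(self._queue))
        return [self._queue.popleft() for _ in range(n)]

    def rollback(self, events: List[bytes]) -> None:
        # Failed delivery: put the events back at the head in original order.
        self._queue.extendleft(reversed(events))

    def __len__(self) -> int:
        return len(self._queue)


def drain_once(channel: MemoryChannel,
               deliver: Callable[[List[bytes]], None],
               batch_size: int = 2) -> bool:
    """One sink cycle: take a batch, deliver it, roll back if delivery fails."""
    batch = channel.take_batch(batch_size)
    if not batch:
        return False
    try:
        deliver(batch)
    except Exception:
        channel.rollback(batch)
        return False
    return True


if __name__ == "__main__":
    ch = MemoryChannel()
    ch.put_batch([b"a", b"b", b"c"])  # what a source would do
    delivered: List[bytes] = []
    drain_once(ch, delivered.extend)   # what a sink would do
    print(delivered, len(ch))
```

Because a failed delivery restores the batch, the sink can retry later without losing events, which is the practical benefit of a transactional channel.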

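Part 3: Introduction to Sqoop

Sqoop is a bridge between traditional relational databases and Hadoop:
- It imports data from relational databases into Hadoop
- It exports data from Hadoop into relational databases
It uses MapReduce to speed up data transfer.
It transfers data in batch mode.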
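
One way to picture how MapReduce speeds up a transfer is to divide the key range of a table into contiguous slices, one per parallel map task. The Python sketch below illustrates only that idea; `split_ranges` is a hypothetical helper, integer keys are an assumption, and Sqoop's real splitting logic is written in Java and handles more cases.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive key range [lo, hi] into contiguous inclusive slices."""
    if num_mappers < 1 or hi < lo:
        raise ValueError("need num_mappers >= 1 and hi >= lo")
    total = hi - lo + 1
    parts = min(num_mappers, total)      # never create empty slices
    base, extra = divmod(total, parts)   # the first `extra` slices get one more key
    ranges: List[Tuple[int, int]] = []
    start = lo
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


if __name__ == "__main__":
    # Each tuple would become one task's "WHERE id BETWEEN a AND b" condition.
    for a, b in split_ranges(1, 10, 4):
        print(f"id BETWEEN {a} AND {b}")
```

Each slice can then be read by an independent task, so throughput scales with the number of tasks until the database or the network becomes the bottleneck.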

Part 4: Introduction to CDC (Change Data Capture)

- canal
- Databus

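Part 5: Hands-on Project

1. Project description

Collect the content that is continuously appended to a file named record.list into the Hadoop cluster. The source is an exec source, the channel is a file channel, and the sink is an HDFS sink.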

2. Install Flume

[hadoop@hadoopa ~]$ tar -zxvf apache-flume-1.7.0-bin.tar.gz

3. Configure Flume

Configure flume-conf-logAnalysis.properties

[hadoop@hadoopa conf]$ pwd
/home/hadoop/apache-flume-1.7.0-bin/conf
[hadoop@hadoopa conf]$ vi flume-conf-logAnalysis.properties

[hadoop@hadoopa conf]$ cat flume-conf-logAnalysis.properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#  http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.


# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'logAgent'

logAgent.sources = logSource
logAgent.channels = fileChannel
logAgent.sinks = hdfsSink

# For each one of the sources, the type is defined
logAgent.sources.logSource.type = exec
logAgent.sources.logSource.command = tail -F /home/hadoop/hadooptraining/datasource/record.list

# The channel can be defined as follows.
logAgent.sources.logSource.channels = fileChannel

# Each sink's type must be defined
logAgent.sinks.hdfsSink.type = hdfs
logAgent.sinks.hdfsSink.hdfs.path = hdfs://hadoopA:8020/flume/record/%Y-%m-%d/%H%M
logAgent.sinks.hdfsSink.hdfs.filePrefix= transaction_log
logAgent.sinks.hdfsSink.hdfs.rollInterval= 600
logAgent.sinks.hdfsSink.hdfs.rollCount= 10000
logAgent.sinks.hdfsSink.hdfs.rollSize= 0
logAgent.sinks.hdfsSink.hdfs.round = true
logAgent.sinks.hdfsSink.hdfs.roundValue = 10
logAgent.sinks.hdfsSink.hdfs.roundUnit = minute
logAgent.sinks.hdfsSink.hdfs.fileType = DataStream
logAgent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
#Specify the channel the sink should use
logAgent.sinks.hdfsSink.channel = fileChannel

# Each channel's type is defined.
logAgent.channels.fileChannel.type = file
logAgent.channels.fileChannel.checkpointDir= /home/hadoop/apache-flume-1.7.0-bin/dataCheckpointDir
logAgent.channels.fileChannel.dataDirs= /home/hadoop/apache-flume-1.7.0-bin/dataDir

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
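
The `hdfs.round`, `hdfs.roundValue`, and `hdfs.roundUnit` settings above control how the `%H%M` part of `hdfs.path` is filled in: the timestamp is rounded down to a 10-minute boundary before the escape sequences are expanded, so events land in one directory per 10-minute window. The Python sketch below mimics that behavior for illustration only; `bucket_path` is a hypothetical helper, and the path omits the `hdfs://hadoopA:8020` prefix.

```python
from datetime import datetime


def bucket_path(ts: datetime, round_minutes: int = 10,
                pattern: str = "/flume/record/%Y-%m-%d/%H%M") -> str:
    """Round `ts` down to a multiple of `round_minutes`, then expand the pattern."""
    rounded = ts.replace(minute=ts.minute - ts.minute % round_minutes,
                         second=0, microsecond=0)
    return rounded.strftime(pattern)


if __name__ == "__main__":
    # With useLocalTimeStamp = true the agent's local clock supplies the timestamp.
    print(bucket_path(datetime(2019, 2, 7, 14, 27, 31)))
```

For an event stamped 14:27:31 on 2019-02-07 the sketch yields `/flume/record/2019-02-07/1420`; without rounding, the same pattern would create a new directory every minute.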

Configure flume-env.sh

[hadoop@hadoopa conf]$ pwd
/home/hadoop/apache-flume-1.7.0-bin/conf
[hadoop@hadoopa conf]$ vi flume-env.sh

[hadoop@hadoopa conf]$ cat flume-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
export JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"

# Let Flume write raw event data and configuration information to its log files for debugging
# purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# $JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "

# Foll. classpath will be included in Flume's classpath.
# Note that the Flume conf directory is always included in the classpath.
FLUME_CLASSPATH="$HADOOP_HOME/share/hadoop/common/hadoop-common-2.7.3.jar"   # Example:  "path1;path2;path3"

4. Run Flume

[hadoop@hadoopa conf]$ flume-ng agent --conf /home/hadoop/apache-flume-1.7.0-bin/conf --conf-file /home/hadoop/apache-flume-1.7.0-bin/conf/flume-conf-logAnalysis.properties --name logAgent -Dflume.root.logger=DEBUG,console -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
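
The two `-Dflume.monitoring.*` options in the command above enable Flume's HTTP monitoring, which serves the agent's counters as a JSON document at `/metrics` on the given port. The Python sketch below shows one way to read it and pull out how full each channel is. The helper names are hypothetical, and the host `hadoopa` in the comment is an assumption based on the shell prompt; adjust both to your setup.

```python
import json
from typing import Dict
from urllib.request import urlopen


def fetch_metrics(url: str, timeout: float = 5.0) -> Dict[str, Dict[str, str]]:
    """Download and decode the metrics JSON document from a running agent."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def channel_fill(metrics: Dict[str, Dict[str, str]]) -> Dict[str, float]:
    """Map each CHANNEL.* component to its ChannelFillPercentage as a float."""
    return {
        name: float(values["ChannelFillPercentage"])
        for name, values in metrics.items()
        if name.startswith("CHANNEL.") and "ChannelFillPercentage" in values
    }


# Example call (needs a running agent; the host name is an assumption):
#   channel_fill(fetch_metrics("http://hadoopa:34545/metrics"))
```

A fill percentage that keeps climbing suggests the sink is draining the channel more slowly than the source is filling it.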

5. Verify the results

Browse HDFS in a web browser:

http://192.168.1.201:50070/explorer.html#/flume/record/
