
Spark Streaming + Flume for In-Memory Computation (Small Data Volumes)

Architecture analysis: Spark Streaming is usually paired with Kafka, but when the data volume is small you can skip standing up a Kafka cluster. Flume offers two ways to hand data to Spark Streaming: push and pull. Pull, where Spark Streaming pulls the data from Flume, generally works better, because push can only deliver data to a single Spark receiver, while pull lets Spark Streaming pull from multiple Flume agents. The sketch after this paragraph contrasts the two API entry points.
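A minimal sketch of the two entry points (the hosts and ports match the configurations below; the second pull address, 192.168.1.104, is hypothetical and only illustrates polling several agents):

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PushVsPullSketch {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("PushVsPullSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    //Push: Spark runs an Avro receiver on one fixed address and Flume's avro
    //sink sends to it, so only a single receiver can consume the data
    val pushStream = FlumeUtils.createStream(ssc, "192.168.31.172", 8888)
    //Pull: Spark polls Flume's SparkSink; the Seq can list several Flume
    //agents (the second address here is hypothetical)
    val pullStream = FlumeUtils.createPollingStream(ssc,
      Seq(new InetSocketAddress("192.168.1.103", 8888),
          new InetSocketAddress("192.168.1.104", 8888)),
      StorageLevel.MEMORY_AND_DISK)
    //...transformations and ssc.start() would follow, as in the full program in step 7
  }
}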
1. Install Flume on the server
Upload apache-flume-1.6.0-bin.tar.gz to the server.
Unpack it:
tar -zxf apache-flume-1.6.0-bin.tar.gz
Rename the directory:
mv apache-flume-1.6.0-bin flume
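A quick way to verify the unpacked installation:
flume/bin/flume-ng version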
2. Install the JDK
See my other blog post:
https://blog.csdn.net/qq_16563637/article/details/81738113
3. Edit the configuration file (important)
cd flume/conf/
vi flume-pull.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = org.apache.spark.streaming.flume.sink.SparkSink
# This must be the address of the machine Flume itself runs on
a1.sinks.k1.hostname = 192.168.1.103
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save the file.
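The spooldir source requires the monitored directory to exist before the agent starts, so create it first:
mkdir -p /export/data/flume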
Alternatively, use the push approach:
vi flume-push.conf

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /export/data/flume
a1.sources.r1.fileHeader = true

# Describe the sink
a1.sinks.k1.type = avro
# This is the receiving side: use the IP address of the machine the Spark worker runs on
a1.sinks.k1.hostname = 192.168.31.172
a1.sinks.k1.port = 8888

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Save the file.
4. Copy dependencies into Flume (important)
The pull-mode SparkSink needs the following jars in Flume's lib directory (example copy commands below):
spark-streaming-flume-sink_2.10-1.6.1.jar
commons-lang3-3.3.2.jar
scala-library-2.10.5.jar
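Assuming the jars have been downloaded to the current directory and Flume was unpacked to /usr/local/flume (adjust both paths to your setup):

cp spark-streaming-flume-sink_2.10-1.6.1.jar /usr/local/flume/lib/
cp commons-lang3-3.3.2.jar /usr/local/flume/lib/
cp scala-library-2.10.5.jar /usr/local/flume/lib/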
5. Set Flume's JAVA_HOME
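If flume-env.sh does not exist in the conf directory yet, Flume ships a template you can copy first:
cp flume-env.sh.template flume-env.sh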
vi flume-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_161
Save the file.
6. Start Flume first
bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
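If you chose the push configuration instead, the command is identical apart from the config file; note that in push mode the Spark program should already be running, since the avro sink needs a receiver to connect to:

bin/flume-ng agent --conf conf --conf-file conf/flume-push.conf --name a1 -Dflume.root.logger=INFO,console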
7. Start the program locally

package cn.itcast.spark.day5

import java.net.InetSocketAddress

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume.FlumeUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
//Flume launch command:
//bin/flume-ng agent --conf conf --conf-file conf/flume-pull.conf --name a1 -Dflume.root.logger=INFO,console
object FlumePollWordCount {
  def main(args: Array[String]) {
    //Set the log level (LoggerLevels is a helper class in the same package
    //that quiets Spark's verbose console logging; its source is not shown here)
    LoggerLevels.setStreamingLogLevels()

    val conf = new SparkConf().setAppName("FlumePollWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    //Pull data from Flume (the address and port of the Flume agent's SparkSink)
    val address = Seq(new InetSocketAddress("192.168.1.103", 8888))
    val flumeStream = FlumeUtils.createPollingStream(ssc, address, StorageLevel.MEMORY_AND_DISK)
    //Each event body is a line of bytes: decode it, split into words, pair each with 1
    val words = flumeStream.flatMap(x => new String(x.event.getBody().array()).split(" ")).map((_, 1))
    val results = words.reduceByKey(_+_)
    results.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
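To compile and run this locally, the project also needs the spark-streaming-flume integration on the classpath. A minimal build.sbt sketch, assuming the same Scala 2.10 / Spark 1.6.1 versions as the jars copied in step 4:

name := "flume-wordcount"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.6.1",
  "org.apache.spark" %% "spark-streaming-flume" % "1.6.1"
)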

8. Create a file test.txt on the server
vi test.txt
zhangsan is hello
lisi is hello
Save the file.
cp test.txt /export/data/flume/test.txt
Check the console output of the Spark program.
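For the sample file above, a batch that picks up the data should print counts roughly like this (the timestamp and tuple order will vary):

-------------------------------------------
Time: ... ms
-------------------------------------------
(hello,2)
(is,2)
(zhangsan,1)
(lisi,1)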
Done.