Learning Spark (13): Integrating Spark Streaming with Flume and Kafka
阿新 • Published: 2018-12-09
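Processing flow walkthrough
The pipeline built in this article: log4j output from LoggerGenerator -> Flume -> Kafka -> Spark Streaming.
Building a log generator and emitting logs through log4j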
import org.apache.log4j.Logger;

/**
 * Simulates log generation: writes one log line per second.
 */
public class LoggerGenerator {

    private static Logger logger = Logger.getLogger(LoggerGenerator.class.getName());

    public static void main(String[] args) throws Exception {
        int index = 0;
        while (true) {
            Thread.sleep(1000);
            logger.info("value : " + index++);
        }
    }
}
Create a resources folder
Mark it as a resources directory in the IDE
Create the file log4j.properties in this folder
log4j.rootLogger=INFO,stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
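To get a feel for what a line in this layout looks like, here is a minimal sketch in plain Java that imitates the pattern `%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n` by hand. It does not use log4j at all. The class `LogLinePreview` and its helper `formatLine` are made up for illustration. The sketch relies on log4j 1.x accepting `SimpleDateFormat`-style patterns inside `%d{...}`, and it pins the time zone to UTC so the result is reproducible (log4j itself would use the JVM default zone).

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

class LogLinePreview {

    // Hand-built imitation of: %d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
    static String formatLine(Date when, String thread, String category,
                             String level, String message) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss,SSS");
        // Assumption for this sketch only: a fixed zone keeps the output stable.
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(when)              // %d{...}
                + " [" + thread + "]"        // [%t] thread name
                + " [" + category + "]"      // [%c] logger name
                + " [" + level + "]"         // [%p] level
                + " - " + message            // %m  message
                + System.lineSeparator();    // %n  platform line separator
    }

    public static void main(String[] args) {
        // Prints: 1970-01-01 00:00:00,000 [main] [LoggerGenerator] [INFO] - value : 0
        System.out.print(formatLine(new Date(0L), "main", "LoggerGenerator", "INFO", "value : 0"));
    }
}
```

Each bracketed field maps to one conversion character, which makes it easier to split these lines again further down the pipeline.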
Collecting the log4j output with Flume
streaming.conf
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=log-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
#define channel
agent1.channels.logger-channel.type=memory
#define sink
agent1.sinks.log-sink.type=logger
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.log-sink.channel=logger-channel
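The configuration above wires three parts together: a source that receives events, a channel that buffers them in memory, and a sink that takes them out of the channel. As a rough mental model only, the following toy sketch mimics that hand-off with a bounded queue. It is not Flume's API: the class `FlumeFlowSketch`, its methods, and the capacity of 100 are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class FlumeFlowSketch {

    // Stand-in for the memory channel: a bounded in-memory buffer.
    private final BlockingQueue<String> channel = new ArrayBlockingQueue<>(100);

    // Stand-in for what the logger sink would write to the console.
    private final List<String> sinkOutput = new ArrayList<>();

    // Source side: accept an event and buffer it; returns false if the channel is full.
    boolean receive(String event) {
        return channel.offer(event);
    }

    // Sink side: take everything currently buffered and record it.
    int drain() {
        List<String> batch = new ArrayList<>();
        channel.drainTo(batch);
        sinkOutput.addAll(batch);
        return batch.size();
    }

    List<String> output() {
        return sinkOutput;
    }

    public static void main(String[] args) {
        FlumeFlowSketch agent = new FlumeFlowSketch();
        agent.receive("value : 0");
        agent.receive("value : 1");
        agent.drain();
        agent.output().forEach(System.out::println);
    }
}
```

The point of the model is that the source and the sink never talk to each other directly; both are bound to the same channel, which is what the last two lines of the configuration express.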
Start Flume
flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
Connecting log4j to Flume
log4j.rootLogger=INFO,stdout,flume
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.target = System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS} [%t] [%c] [%p] - %m%n
# Add the following configuration to log4j.properties
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = hadoop000
log4j.appender.flume.Port = 41414
log4j.appender.flume.UnsafeMode = true
You may see this error:
java.lang.ClassNotFoundException: org.apache.flume.clients.log4jappender.Log4jAppender
To fix it, add the following dependency:
<dependency>
<groupId>org.apache.flume.flume-ng-clients</groupId>
<artifactId>flume-ng-log4jappender</artifactId>
<version>1.6.0</version>
</dependency>
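At this point, the log output is visible in both the IDEA console and the console of the Flume agent.
Using KafkaSink to send the data collected by Flume to Kafka
1. Start the ZooKeeper and Kafka processes
2. Create a topic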
./kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic
3. Change the Flume sink configuration
streaming2.conf
agent1.sources=avro-source
agent1.channels=logger-channel
agent1.sinks=kafka-sink
#define source
agent1.sources.avro-source.type=avro
agent1.sources.avro-source.bind=0.0.0.0
agent1.sources.avro-source.port=41414
#define channel
agent1.channels.logger-channel.type=memory
#define sink
agent1.sinks.kafka-sink.type=org.apache.flume.sink.kafka.KafkaSink
agent1.sinks.kafka-sink.topic = streamingtopic
agent1.sinks.kafka-sink.brokerList = hadoop000:9092
agent1.sinks.kafka-sink.requiredAcks = 1
agent1.sinks.kafka-sink.batchSize = 20
agent1.sources.avro-source.channels=logger-channel
agent1.sinks.kafka-sink.channel=logger-channel
4. Start Flume
flume-ng agent \
--conf $FLUME_HOME/conf \
--conf-file $FLUME_HOME/conf/streaming2.conf \
--name agent1 \
-Dflume.root.logger=INFO,console
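5. With the log4j output in IDEA working normally, watch a Kafka console consumer for this topic; if the log lines show up there, the setup is correct.
Consuming the Kafka data with Spark Streaming and computing statistics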
package com.imooc.spark

import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Spark Streaming connected to Kafka (receiver-based approach)
 */
object KafkaStreamingApp {

  def main(args: Array[String]): Unit = {

    if (args.length != 4) {
      System.err.println("Usage: KafkaStreamingApp <zkQuorum> <group> <topics> <numThreads>")
      // Exit here; otherwise the pattern match below fails with a MatchError
      System.exit(1)
    }

    val Array(zkQuorum, group, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("KafkaReceiverWordCount")
      .setMaster("local[2]")

    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Map each topic name to the number of consumer threads
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    // Connect Spark Streaming to Kafka
    val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)

    // Each record is a (key, message) pair; _2 is the message. Count the messages per batch.
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
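The topic map passed to `createStream` is easy to misread, so here is a minimal sketch of the same transformation in plain Java: a comma-separated topic list plus a thread count becomes a map from topic name to the number of consumer threads. The class `TopicMapSketch` and the method `topicMap` are made up for illustration and are not part of Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TopicMapSketch {

    // Mirrors: topics.split(",").map((_, numThreads.toInt)).toMap
    static Map<String, Integer> topicMap(String topics, String numThreads) {
        int threads = Integer.parseInt(numThreads);
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String topic : topics.split(",")) {
            result.put(topic, threads); // every topic gets the same thread count
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: {streamingtopic=1}
        System.out.println(topicMap("streamingtopic", "1"));
    }
}
```

With the single topic created earlier and one thread, the map has exactly one entry, which is all this walkthrough needs.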
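Run the Spark Streaming application in IDEA, passing the four program arguments in the order zkQuorum, group, topics, numThreads.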
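From local testing to production
So far everything has been tested locally: LoggerGenerator runs in IDEA,
and Flume, Kafka, and Spark Streaming process its output.
Production is certainly not run this way. So how is it done?
- Package LoggerGenerator into a jar and run the class from there
- Flume and Kafka are set up the same way as in our test
- The Spark Streaming code also has to be packaged as a jar, then submitted with spark-submit to run on the target environment
Choose the run mode that fits your situation: local/yarn/standalone/mesos
In production the overall stream-processing flow is the same; what differs is the complexity of the business logic.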