Spark Streaming vs. Flink: A Word Count Comparison

阿新 • Published: 2019-01-12

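Preparation:

Both examples connect to a TCP socket, which netcat (available for Windows and Linux) can serve, and build a Spark DStream or a Flink DataStream from the incoming text before processing it. The intended time window is 10 s; note that the code as listed uses a 1 s batch interval in the Spark job and a 5 s window sliding every 1 s in the Flink job.
Download netcat first, since the examples need it to generate the stream input.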

Code:

Spark Streaming

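package cn.kee.spark;

import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.StorageLevels;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
// StreamingExamples ships with the spark-examples module
import org.apache.spark.examples.streaming.StreamingExamples;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public final class JavaNetworkWordCount {
	private static final Pattern SPACE = Pattern.compile(" ");

	public static void main(String[] args) throws Exception {
		if (args.length < 2) {
			System.err.println("Usage: JavaNetworkWordCount <hostname> <port>");
			System.exit(1);
		}
		StreamingExamples.setStreamingLogLevels();
		SparkConf sparkConf = new SparkConf().setAppName("JavaNetworkWordCount");
		// create the streaming context with a 1 second batch interval
		JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
		// read lines of text from the socket at <hostname>:<port>
		JavaReceiverInputDStream<String> lines = ssc.socketTextStream(
				args[0], Integer.parseInt(args[1]), StorageLevels.MEMORY_AND_DISK_SER);
		// split each line into words
		JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
			@Override
			public Iterator<String> call(String x) {
				return Arrays.asList(SPACE.split(x)).iterator();
			}
		});
		// map each word to (word, 1), then sum the counts per word within each batch
		JavaPairDStream<String, Integer> wordCounts = words.mapToPair(
				new PairFunction<String, String, Integer>() {
					@Override
					public Tuple2<String, Integer> call(String s) {
						return new Tuple2<>(s, 1);
					}
				}).reduceByKey(new Function2<Integer, Integer, Integer>() {
					@Override
					public Integer call(Integer i1, Integer i2) {
						return i1 + i2;
					}
				});
		// print the first elements of each batch, then run until terminated
		wordCounts.print();
		ssc.start();
		ssc.awaitTermination();
	}
}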
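
To make the per-batch logic concrete, here is a minimal plain-Java sketch of what the flatMap, mapToPair, and reduceByKey steps compute for the lines of a single batch. It does not use Spark at all; the class name BatchWordCount and the method countBatch are hypothetical names chosen for illustration, and the sketch assumes the whole batch fits in one in-memory list.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration class: not part of Spark.
final class BatchWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Counts the words in the lines that arrived during one batch interval.
    static Map<String, Integer> countBatch(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : SPACE.split(line)) {   // flatMap: line -> words
                counts.merge(word, 1, Integer::sum);  // mapToPair + reduceByKey
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(countBatch(Arrays.asList("a b a", "b c")));
    }
}
```

Each batch is counted independently, so a word seen in two different batches is reported twice with separate counts, which matches what the Spark job prints once per batch interval.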

Flink DataStream

package cn.kee.flink;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
 * Example :SocketWindowWordCount
 * @author keehang
 *
 */
public class SocketWindowWordCount {

	public static void main(String[] args) throws Exception {

		// the port to connect to
		final int port = 9999;
		/*try {
			final ParameterTool params = ParameterTool.fromArgs(args);
			port = params.getInt("port");
		} catch (Exception e) {
			System.err.println("No port specified. Please run 'SocketWindowWordCount --port <port>'");
			return;
		}*/
	
		// get the execution environment
		final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		// get input data by connecting to the socket
		DataStream<String> text = env.socketTextStream("localhost", port, "\n");

		// parse the data, group it, window it, and aggregate the counts
		DataStream<WordWithCount> windowCounts = text
				.flatMap(new FlatMapFunction<String, WordWithCount>() {
					@Override
					public void flatMap(String value, Collector<WordWithCount> out) {
						for (String word : value.split("\\s")) {
							out.collect(new WordWithCount(word, 1L));
						}
					}
				})
				.keyBy("word")
				.timeWindow(Time.seconds(5), Time.seconds(1))
				.reduce(new ReduceFunction<WordWithCount>() {
					@Override
					public WordWithCount reduce(WordWithCount a, WordWithCount b) {
						return new WordWithCount(a.word, a.count + b.count);
					}
				});

		// print the results with a single thread, rather than in parallel
		windowCounts.print().setParallelism(1);

		env.execute("Socket Window WordCount");
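	}

	// Data type for words with count; a POJO so that keyBy("word") can resolve the field
	public static class WordWithCount {

		public String word;
		public long count;

		public WordWithCount() {}

		public WordWithCount(String word, long count) {
			this.word = word;
			this.count = count;
		}

		@Override
		public String toString() {
			return word + " : " + count;
		}
	}
}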
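
The call timeWindow(Time.seconds(5), Time.seconds(1)) above creates sliding windows, so each record is counted in several overlapping windows. The following plain-Java sketch shows one way the window assignment can be computed. It is an illustration under simplifying assumptions (non-negative millisecond timestamps, no window offset), not Flink's actual implementation; SlidingWindowSketch and windowStarts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class: not part of Flink.
final class SlidingWindowSketch {

    // Returns the start time (ms) of every window [start, start + sizeMs)
    // that contains timestampMs, newest window first.
    static List<Long> windowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // the most recent window start at or before the timestamp
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 5 s windows sliding every 1 s: prints [7000, 6000, 5000, 4000, 3000]
        System.out.println(windowStarts(7_300L, 5_000L, 1_000L));
    }
}
```

With a 5 s size and a 1 s slide, every record lands in five windows, which is why the Flink job emits an updated count for a word every second for five seconds after it arrives.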

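Results:

Spark is a fast, general-purpose cluster computing system. Its central abstraction is the resilient distributed dataset (RDD): a collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. Users can also ask Spark to keep an RDD in memory so that it can be reused efficiently across parallel operations.

Flink is a scalable data processing platform for both batch and stream processing. Its design draws mainly on Hadoop, MPP databases, and stream processing systems, and it supports incremental iteration.

Summary: both Spark and Flink can run on Hadoop YARN. The performance ranking given here is Flink > Spark > Hadoop (MR), and the gap widens as the number of iterations grows. The main reason Flink comes out ahead of Spark and Hadoop is that it supports incremental iteration and optimizes iterative jobs automatically.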

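Stream processing comparison

Both support stream processing. Flink processes records one at a time, whereas Spark Streaming processes small batches of RDDs (micro-batches), so it unavoidably adds some latency. Flink's streaming performance is comparable to Storm's, with millisecond-level latency, while Spark Streaming is limited to roughly second-level latency.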
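
The latency difference can be illustrated with a back-of-the-envelope model in plain Java. The sketch assumes that a micro-batch engine can only start processing a record once the batch interval containing it has closed, and it ignores processing and scheduling time entirely; a record-at-a-time engine adds no waiting in this model. LatencySketch, batchEmitTime, and batchingDelay are hypothetical names, not Spark or Flink APIs.

```java
// Hypothetical illustration class: not part of Spark or Flink.
final class LatencySketch {

    // Earliest time (ms) a micro-batch engine can process a record arriving at
    // arrivalMs: the end of the batch interval the record falls into.
    static long batchEmitTime(long arrivalMs, long batchIntervalMs) {
        return (arrivalMs / batchIntervalMs + 1) * batchIntervalMs;
    }

    // Waiting time added purely by batching.
    static long batchingDelay(long arrivalMs, long batchIntervalMs) {
        return batchEmitTime(arrivalMs, batchIntervalMs) - arrivalMs;
    }

    public static void main(String[] args) {
        long[] arrivals = {100L, 950L, 1_400L};
        for (long t : arrivals) {
            // with a 1 s batch interval the delays are 900, 50 and 600 ms
            System.out.println("arrival=" + t + "ms, batching delay="
                    + batchingDelay(t, 1_000L) + "ms");
        }
    }
}
```

In this model the added delay ranges from just above zero up to the full batch interval, which is consistent with the second-level latency attributed to Spark Streaming when the batch interval is 1 s.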

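SQL support

Both support SQL. Spark's SQL coverage is somewhat broader than Flink's. In addition, Spark optimizes at the SQL level, while Flink's optimization is mainly at the API level.

Since 2.x, Spark appears to be building its advantage mainly around Spark SQL, with faster joins and continued expansion of SQL support. Flink, for its part, is expected to keep strengthening its support for stream processing and iterative computation.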
