This article was first published on the Java大資料與資料倉庫 (Java Big Data and Data Warehouse) account as "Several ways to compute pv and uv in real time with Flink".

Computing pv and uv in real time is about as common as big-data statistics requirements get. An earlier post covered real-time pv/uv with Spark Streaming; here we do the same with Flink.

We need daily pv and uv figures per data type, with the following requirements.

  • The latest results must be emitted every second;
  • The job runs forever, so stale data has to be cleaned out of memory periodically;
  • The time field in incoming messages is not strictly increasing, so some tolerance for out-of-order events is needed;
  • uv does not necessarily change every second, and re-emitting unchanged results is a huge waste of IO, so when uv changes the new result should be emitted within one second, and nothing should be emitted while it stays unchanged (a minimal sketch of this idea follows the list).
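
A minimal, self-contained sketch of that last point, assuming the same Flink 1.7 Scala API used throughout this post: the last emitted value is kept in keyed state with flatMapWithState, and an update identical to it is swallowed. The (date, count) element shape and all values are illustrative only, not the job's real types.

import org.apache.flink.streaming.api.scala._

object EmitOnChangeSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // pretend these are successive (date, uv) updates for the same key
    val updates: DataStream[(String, Int)] =
      env.fromElements(("2019-03-18", 1), ("2019-03-18", 1), ("2019-03-18", 2))

    val changesOnly = updates
      .keyBy(_._1)
      .flatMapWithState[(String, Int), Int] {
        // unchanged since the last emission: emit nothing, keep the old state
        case ((key, count), Some(last)) if last == count => (Seq.empty, Some(last))
        // first value or a changed value: emit it and remember it
        case ((key, count), _) => (Seq((key, count)), Some(count))
      }

    changesOnly.print() // the duplicate ("2019-03-18", 1) is suppressed
    env.execute("EmitOnChangeSketch")
  }
}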

Types and operations on Flink data streams

DataStream is the core data structure of Flink's streaming API; every other stream type can be converted to and from it, directly or indirectly. The conversion relationships between the commonly used stream types are shown in the figure:

As the figure shows, a DataStream can be converted to a KeyedStream and back, a KeyedStream can be converted to a WindowedStream, a DataStream cannot be converted to a WindowedStream directly, and a WindowedStream can be converted straight back to a DataStream. Even where two stream types cannot be converted into each other directly, you can always go through a DataStream first and then convert to the target type.
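
As a toy illustration of those conversions (not part of the pv/uv job itself): keyBy turns a DataStream into a KeyedStream, timeWindow turns the KeyedStream into a WindowedStream, and an aggregation such as reduce brings it back to a DataStream.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamTypeConversions {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // DataStream of (timestampMillis, key, value)
    val events = env.fromElements((1000L, "a", 1), (2000L, "a", 2), (3000L, "b", 3))
      .assignAscendingTimestamps(_._1)

    // DataStream -> KeyedStream
    val keyed = events.keyBy(_._2)

    // KeyedStream -> WindowedStream (one-day tumbling event-time window)
    val windowed = keyed.timeWindow(Time.days(1))

    // WindowedStream -> DataStream (aggregate the window contents)
    val summed: DataStream[(Long, String, Int)] =
      windowed.reduce((a, b) => (a._1, a._2, a._3 + b._3))

    summed.print() // (1000,a,3) and (3000,b,3) once the daily window fires
    env.execute("StreamTypeConversions")
  }
}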

For this pv/uv job we mainly use the DataStream, KeyedStream and WindowedStream structures.

Windows and watermarks are both needed here: a window splits the data by day, and the watermark's rising "water level" periodically evicts late data that has fallen outside its window, which is what keeps memory under control.
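
A minimal sketch of that combination, with a made-up event type instead of the real message format: BoundedOutOfOrdernessTimestampExtractor keeps the watermark a fixed bound behind the newest timestamp seen (so moderately late events still land in the right daily window, and window state can be dropped once the watermark passes the window end), while ContinuousProcessingTimeTrigger makes the window emit its current result every second instead of only once when the day closes.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger

object DailyWindowSketch {
  // hypothetical event: epoch-millis timestamp plus a user id
  case class Event(ts: Long, guid: String)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val events: DataStream[Event] = env.fromElements(Event(1000L, "u1"), Event(500L, "u2"))

    val pvPerUser = events
      // watermark = max timestamp seen so far minus 10 seconds of allowed disorder
      .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor[Event](Time.seconds(10)) {
          override def extractTimestamp(e: Event): Long = e.ts
        })
      .map(e => (e.guid, 1))
      .keyBy(_._1)
      // one tumbling window per day; state is cleaned up after the watermark passes the window end
      .timeWindow(Time.days(1))
      // fire every second of processing time so downstream always sees the latest counts
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .sum(1)

    pvPerUser.print() // with an unbounded source (e.g. Kafka) the latest counts are re-emitted every second
    env.execute("DailyWindowSketch")
  }
}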

Business code

The data is JSON containing the three fields we care about: date, helperversion and guid. For real-time pv/uv the other fields can simply be dropped; in the offline data warehouse, of course, every meaningful business field is kept in Hive.
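
For orientation, here is a rough guess at what one Kafka record looks like, reverse-engineered from the parsing code below. The nesting (a "message" JSON string whose "decrypted_data" field is itself a JSON string) matches how the job reads it; the concrete values are made up.

import com.alibaba.fastjson.{JSON, JSONObject}

object SampleRecord {
  def main(args: Array[String]): Unit = {
    // innermost payload: the fields the job actually keeps (illustrative values)
    val decrypted = new JSONObject()
    decrypted.put("guid", "u-001")
    decrypted.put("helperversion", "1.2.3")
    decrypted.put("time", "1552901640000") // epoch millis, used as the event timestamp

    // "message" wraps the decrypted payload as a JSON string and carries a readable datetime
    val message = new JSONObject()
    message.put("time", "2019-03-18 17:34:00")
    message.put("decrypted_data", decrypted.toJSONString)

    // the Kafka record wraps "message" as a JSON string once more
    val record = new JSONObject()
    record.put("message", message.toJSONString)

    // parse it back the same way the Flink job does
    val msg = JSON.parseObject(JSON.parseObject(record.toJSONString).getString("message"))
    val data = JSON.parseObject(msg.getString("decrypted_data"))
    val date = msg.getString("time").split(" ")(0)
    println((date, data.getString("helperversion"), data.getString("guid"))) // (2019-03-18,1.2.3,u-001)
  }
}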

Other related concepts will get their own write-up; let's go straight to the code, starting with the Maven pom.xml.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ddxygq</groupId>
    <artifactId>bigdata</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <scala.version>2.11.8</scala.version>
        <flink.version>1.7.0</flink.version>
        <pkg.name>bigdata</pkg.name>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-connector-kafka-0.8 -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>

    <build>
        <!-- test code and resources -->
        <!--<testSourceDirectory>${basedir}/src/test</testSourceDirectory>-->
        <finalName>${pkg.name}</finalName>
        <sourceDirectory>src/main/java</sourceDirectory>
        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>*.properties</include>
                    <include>*.xml</include>
                </includes>
                <filtering>false</filtering>
            </resource>
        </resources>
        <plugins>
            <!-- skip tests -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <skip>true</skip>
                </configuration>
            </plugin>
            <!-- Scala compiler plugin -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <version>2.15.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

The main job, written mostly in Scala:

package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._

/**
  * @ Author: keguang
  * @ Date: 2019/3/18 17:34
  * @ version: v1.0.0
  * @ description:
  */
object PvUvCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance: checkpoint every 5 seconds with exactly-once semantics
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/flink/tagApp"))

    // Kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // watermark: maximum allowed event-time lateness (one day)
    val MaxOutOfOrderness = 86400 * 1000L

    // consume the Kafka topic
    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })
    // everything above assigns watermarks and parses the JSON

    // aggregate the window contents; see the applyWith method and the OnWindowedStream class
    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", List.empty[Int], Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, list, set, _, 0), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            (date + "_" + helperversion, guid.hashCode +: list, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, list, set, _, _) => {
                (leixing, list.size, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
        case ((key, numpv, numuv, begin, end), curr) =>
          curr match {
            case Some(numCurr) if numCurr == (numuv, numpv) =>
              (Seq.empty, Some((numuv, numpv))) // same counts as last time, so emit nothing
            case _ =>
              (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
          }
      }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })

    resultedStream.print()
    env.execute("PvUvCount")
  }
}

The size of a List holds the pv count and the size of a Set holds the uv count, which is how the real-time pv and uv figures are obtained.

A few functions are key here:

applyWith: its parameters are the initial state (accumulator) value, a foldFunction that folds each incoming element into that accumulator, and a windowFunction that turns the folded accumulator into the emitted result;

flatMapWithState: keeps the last emitted (uv, pv) pair in keyed state and only emits a new record when the values differ, which is what suppresses unchanged output.

Problems with this approach

Obviously, once the data volume gets large, this List and Set grow very large as well. The pv count in particular does not need a List at all: it can live in a single state value that is simply incremented for each element, so the per-element operation becomes a state update.

Improved version

This version stores the pv value in a counter instead of a List.

package com.ddxygq.bigdata.flink.streaming.pvuv

import java.util.Properties

import com.alibaba.fastjson.JSON
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.extensions._
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem

object PvUv2 {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // fault tolerance: checkpoint every 5 seconds with exactly-once semantics
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.setStateBackend(new FsStateBackend("file:///D:/space/IJ/bigdata/src/main/scala/com/ddxygq/bigdata/flink/checkpoint/streaming/counter"))

    // Kafka configuration
    val ZOOKEEPER_HOST = "hadoop01:2181,hadoop02:2181,hadoop03:2181"
    val KAFKA_BROKERS = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val TRANSACTION_GROUP = "flink-count"
    val TOPIC_NAME = "flink"
    val kafkaProps = new Properties()
    kafkaProps.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
    kafkaProps.setProperty("bootstrap.servers", KAFKA_BROKERS)
    kafkaProps.setProperty("group.id", TRANSACTION_GROUP)

    // watermark: maximum allowed event-time lateness (one day)
    val MaxOutOfOrderness = 86400 * 1000L

    val streamData: DataStream[(String, String, String)] = env.addSource(
      new FlinkKafkaConsumer010[String](TOPIC_NAME, new SimpleStringSchema(), kafkaProps)
    ).assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[String](Time.milliseconds(MaxOutOfOrderness)) {
      override def extractTimestamp(element: String): Long = {
        val t = JSON.parseObject(element)
        val time = JSON.parseObject(JSON.parseObject(t.getString("message")).getString("decrypted_data")).getString("time")
        time.toLong
      }
    }).map(x => {
      var date = "error"
      var guid = "error"
      var helperversion = "error"
      try {
        val messageJsonObject = JSON.parseObject(JSON.parseObject(x).getString("message"))
        val datetime = messageJsonObject.getString("time")
        date = datetime.split(" ")(0)
        // hour = datetime.split(" ")(1).substring(0, 2)
        val decrypted_data_string = messageJsonObject.getString("decrypted_data")
        if (!"".equals(decrypted_data_string)) {
          val decrypted_data = JSON.parseObject(decrypted_data_string)
          guid = decrypted_data.getString("guid").trim
          helperversion = decrypted_data.getString("helperversion")
        }
      } catch {
        case e: Exception => {
          println(e)
        }
      }
      (date, helperversion, guid)
    })

    val resultStream = streamData.keyBy(x => {
      x._1 + x._2
    }).timeWindow(Time.days(1))
      .trigger(ContinuousProcessingTimeTrigger.of(Time.seconds(1)))
      .applyWith(("", new IntCounter(), Set.empty[Int], 0L, 0L))(
        foldFunction = {
          case ((_, cou, set, _, 0), item) => {
            val date = item._1
            val helperversion = item._2
            val guid = item._3
            cou.add(1)
            (date + "_" + helperversion, cou, set + guid.hashCode, 0L, 0L)
          }
        }
        , windowFunction = {
          case (key, window, result) => {
            result.map {
              case (leixing, cou, set, _, _) => {
                (leixing, cou.getLocalValue, set.size, window.getStart, window.getEnd)
              }
            }
          }
        }
      ).keyBy(0)
      .flatMapWithState[(String, Int, Int, Long, Long), (Int, Int)] {
        case ((key, numpv, numuv, begin, end), curr) =>
          curr match {
            case Some(numCurr) if numCurr == (numuv, numpv) =>
              (Seq.empty, Some((numuv, numpv))) // same counts as last time, so emit nothing
            case _ =>
              (Seq((key, numpv, numuv, begin, end)), Some((numuv, numpv)))
          }
      }

    // final result
    val resultedStream = resultStream.map(x => {
      val keys = x._1.split("_")
      val date = keys(0)
      val helperversion = keys(1)
      (date, helperversion, x._2, x._3)
    })

    val resultPath = "D:\\space\\IJ\\bigdata\\src\\main\\scala\\com\\ddxygq\\bigdata\\flink\\streaming\\pvuv\\result"
    resultedStream.writeAsText(resultPath, FileSystem.WriteMode.OVERWRITE)
    env.execute("PvUvCount")
  }
}

References

https://flink.sojb.cn/dev/event_time.html

http://wuchong.me/blog/2016/05/20/flink-internals-streams-and-operations-on-streams

https://segmentfault.com/a/1190000006235690