
Table Analysis and Statistics with Spark SQL

Background

Our data-mining platform has a pressing need for data statistics, and Spark itself already does some of this work. This post surveys the statistics features Spark already supports, so that we can extend them later.

Preparing the Data

Download the iris data from reference 6. The file ships as iris.data; rename the .data suffix to .csv (this does not affect the content, it just keeps the later steps unchanged).

The data looks like this:

SepalLength SepalWidth PetalLength PetalWidth Name
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa

The data is described in the appendix under "The Iris Dataset".

First, load the data into the Spark SQL warehouse:

CREATE TABLE IF NOT EXISTS iris (
  SepalLength FLOAT,
  SepalWidth  FLOAT,
  PetalLength FLOAT,
  PetalWidth  FLOAT,
  Species     VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/mnt/disk1/starqiu/iris';

Table Analysis and Statistics

The Analyze Table syntax is:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [analyze_option]

Collect statistics about the table that can be used by the query optimizer to find a better plan.

As the documentation says, analyzing a table lets Spark SQL's optimizer choose a better plan and thus improves query performance. These statistics feed Spark SQL's cost-based optimizer (CBO), which is especially useful for multi-table join queries.

The analyze_option argument falls into two categories: table statistics and column statistics.

Table Statistics

Basic table statistics generally cover the total record count and the space occupied.

Table statistics usage is as follows:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS [NOSCAN]

Collect only basic statistics for the table (number of rows, size in bytes).

NOSCAN
Collect only statistics that do not require scanning the whole table (that is, size in bytes).

Running the command ANALYZE TABLE iris COMPUTE STATISTICS; yields both the table's record count and its size in bytes. If you want to avoid a full table scan, add the NOSCAN keyword; the table is not scanned, but you only get the size in bytes.
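As a rough mental model (this is an illustration, not Spark's actual implementation), the two numbers can be reproduced with plain Python over the raw text file. The sketch below uses a tiny in-memory stand-in for the CSV so it is self-contained:

```python
import io

# A small invented stand-in for the iris CSV (the real file lives on HDFS).
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# NOSCAN-style statistic: size in bytes, obtainable without parsing rows.
size_in_bytes = len(sample_csv.encode("utf-8"))

# Full-scan statistic: the row count requires reading every line.
row_count = sum(1 for line in io.StringIO(sample_csv) if line.strip())

print(size_in_bytes, row_count)  # prints: 84 3
```

This also shows why NOSCAN can skip the scan: the file size comes from filesystem metadata, whereas the row count does not.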

The command for describing table statistics is:

DESCRIBE [EXTENDED] [db_name.]table_name

Return the metadata of an existing table (column names, data types, and comments). If the table does not exist, an exception is thrown.

EXTENDED
Display detailed information about the table, including parent database, table type, storage information, and properties.

Running DESCRIBE EXTENDED iris; gives the following:

spark-sql> DESCRIBE EXTENDED iris;
SepalLength float   NULL
SepalWidth  float   NULL
PetalLength float   NULL
PetalWidth  float   NULL
Species string  NULL
        
# Detailed Table Information    CatalogTable(
    Table: `default`.`iris`
    Owner: root
    Created: Sat Feb 16 17:24:32 CST 2019
    Last Access: Thu Jan 01 08:00:00 CST 1970
    Type: EXTERNAL
    Schema: [StructField(SepalLength,FloatType,true), StructField(SepalWidth,FloatType,true), StructField(PetalLength,FloatType,true), StructField(PetalWidth,FloatType,true), StructField(Species,StringType,true)]
    Provider: hive
    Properties: [rawDataSize=-1, numFiles=0, transient_lastDdlTime=1550311815, totalSize=0, COLUMN_STATS_ACCURATE=false, numRows=-1]
    Statistics: sizeInBytes=3808, rowCount=150, isBroadcastable=false
    Storage(Location: hdfs://data126:8020/mnt/disk1/starqiu/iris, InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties: [field.delim=,, serialization.format=,])
    Partition Provider: Catalog)    
Time taken: 0.112 seconds, Fetched 7 row(s)

The Statistics: line shows that the table holds 150 records and occupies 3808 bytes, about 4 KB.

Column Statistics

Column statistics usage is as follows:

ANALYZE TABLE [db_name.]table_name COMPUTE STATISTICS FOR COLUMNS col1 [, col2, ...]

Collect column statistics for the specified columns in addition to table statistics.

Tip

Use this command whenever possible because it collects more statistics so the optimizer can find better plans. Make sure to collect statistics for all columns used by the query.

The command for describing column statistics is:

DESCRIBE [EXTENDED] [db_name.]table_name column_name

New in version runtime-3.3.

EXTENDED
Display detailed information about the specified columns, including the column statistics collected by the command ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name [column_name, ...].

Note that this feature first appeared in Databricks Runtime 3.3, and Runtime 3.3 wraps Spark 2.2; see the appendix at the end for the mapping between Databricks Runtime versions and Spark versions.

Run the command ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species; to compute statistics for the specified columns.

Then run DESCRIBE EXTENDED iris PetalWidth; to fetch the statistics for a single column. The results are:

spark-sql> ANALYZE TABLE iris COMPUTE STATISTICS FOR COLUMNS SepalLength, SepalWidth, PetalLength, PetalWidth, Species;
Time taken: 4.45 seconds
spark-sql> DESCRIBE EXTENDED iris PetalWidth;
col_name    PetalWidth
data_type   float
comment NULL
min 0.10000000149011612
max 2.5
num_nulls   0
distinct_count  21
avg_col_len 4
max_col_len 4
histogram   NULL
Time taken: 0.104 seconds, Fetched 10 row(s)
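All of the per-column values above (min, max, null count, distinct count, average/maximum column length) are simple aggregates. As an illustration only, not Spark's actual code path, the following self-contained Python sketch computes the same statistics over a small invented sample of PetalWidth values:

```python
# Hypothetical sample of PetalWidth values; None stands in for a SQL NULL.
values = [0.2, 0.2, 1.3, 1.8, 2.5, None, 0.1]

non_null = [v for v in values if v is not None]

stats = {
    "min": min(non_null),
    "max": max(non_null),
    "num_nulls": sum(1 for v in values if v is None),
    "distinct_count": len(set(non_null)),
    # Spark stores a FLOAT column in 4 bytes, hence avg_col_len and
    # max_col_len are both the fixed storage width; for strings these
    # would be the average and maximum string lengths.
    "avg_col_len": 4,
    "max_col_len": 4,
}

print(stats)
```

For a fixed-width FLOAT column both length statistics are constant, which matches the avg_col_len = max_col_len = 4 in the DESCRIBE output above.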

In my tests, Spark 2.2.2 does not support this statement, but Spark 2.4.0 does. Where it is unsupported, the same information can be obtained by querying the Hive metastore database directly, with SQL such as:

select param_key, param_value 
from TABLE_PARAMS tp, TBLS t 
where tp.tbl_id=t.tbl_id and tbl_name = 'iris' 
and param_key like 'spark.sql.stat%';

Below are the stored statistics for the PetalWidth column. They include the distinct count, null count, maximum, minimum, average length, and maximum length:

param_key param_value
spark.sql.statistics.colStats.PetalWidth.avgLen 4
spark.sql.statistics.colStats.PetalWidth.distinctCount 21
spark.sql.statistics.colStats.PetalWidth.max 2.5
spark.sql.statistics.colStats.PetalWidth.maxLen 4
spark.sql.statistics.colStats.PetalWidth.min 0.10000000149011612
spark.sql.statistics.colStats.PetalWidth.nullCount 0
spark.sql.statistics.colStats.PetalWidth.version 1
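Since the metastore returns these statistics as flat key/value rows, it can be handy to fold them back into a per-column dictionary. A minimal sketch, with the rows hard-coded from the output above:

```python
# Rows as returned by the metastore query (param_key, param_value).
rows = [
    ("spark.sql.statistics.colStats.PetalWidth.avgLen", "4"),
    ("spark.sql.statistics.colStats.PetalWidth.distinctCount", "21"),
    ("spark.sql.statistics.colStats.PetalWidth.max", "2.5"),
    ("spark.sql.statistics.colStats.PetalWidth.min", "0.10000000149011612"),
    ("spark.sql.statistics.colStats.PetalWidth.nullCount", "0"),
]

PREFIX = "spark.sql.statistics.colStats."

# Group by column name: the segment after the prefix is "<column>.<stat>".
col_stats = {}
for key, value in rows:
    column, stat = key[len(PREFIX):].rsplit(".", 1)
    col_stats.setdefault(column, {})[stat] = value

print(col_stats["PetalWidth"])
```

The same loop works unchanged when the query returns several columns at once, since the column name is recovered from each key.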

Summary

These statistics are valuable not only for understanding data quality; Spark SQL queries can also be optimized with them, further improving speed. A follow-up post will look at how the CBO uses this information.

It is not yet clear whether Spark as packaged in Databricks Runtime differs from the open-source releases, but open-source Spark 2.4 supports these table analysis statements, so upgrading the platform's later projects to Spark 2.4 is recommended.

Appendix

The Iris Dataset

The Iris dataset, also known as the iris flower dataset, is a classic classification benchmark collected and organized by Fisher in 1936, and a standard dataset for multivariate analysis. It contains 150 records in 3 classes of 50 records each, with 4 attributes per record, all drawn from measurements of iris flowers; it is commonly used in classification tasks. The records cover 3 species of iris, 50 samples per species. One species is linearly separable from the other two, while those two are not linearly separable from each other.

The four attributes:

Sepal.Length (sepal length), in cm;

Sepal.Width (sepal width), in cm;

Petal.Length (petal length), in cm;

Petal.Width (petal width), in cm.

The three species:

Iris Setosa;

Iris Versicolour;

Iris Virginica.

Databricks Runtime

Databricks Runtime is part of the Databricks Unified Analytics Platform, which the official site describes as follows:

Accelerate innovation by unifying data science, engineering and business, with the Databricks Unified Analytics Platform, from the original creators of Apache Spark™.

Runtime itself is described as follows:

Simplify operations and get up to 50x better performance with cloud-optimized Apache Spark™.

In short, it is cloud-optimized Apache Spark that simplifies operations and claims up to 50x better performance.

Mapping Databricks Runtime Versions to Spark Versions

Current Releases

Version Spark Version Release Date Deprecation Announcement Deprecation Date
5.2 Spark 2.4 Jan 24, 2019 May 27, 2019 Sep 30, 2019
5.1 Spark 2.4 Dec 18, 2018 Apr 18, 2019 Aug 19, 2019
5.0 Spark 2.4 Nov 08, 2018 Mar 08, 2019 Jul 08, 2019
4.3 Spark 2.3 Aug 10, 2018 Dec 09, 2018 Apr 09, 2019
4.2 Spark 2.3 Jul 09, 2018 Nov 05, 2018 Mar 05, 2019
3.5-LTS Spark 2.2 Dec 21, 2017 Jan 02, 2019 Jan 02, 2020

Marked for Deprecation

Version Spark Version Release Date Deprecation Announcement Deprecation Date
4.3 Spark 2.3 Aug 10, 2018 Dec 09, 2018 Apr 09, 2019
4.2 Spark 2.3 Jul 09, 2018 Nov 05, 2018 Mar 05, 2019
3.5-LTS Spark 2.2 Dec 21, 2017 Jan 02, 2019 Jan 02, 2020

Deprecated Releases

Version Spark Version Release Date Deprecation Announcement Deprecation Date
4.1 Spark 2.3 May 17, 2018 Sep 17, 2018 Jan 17, 2019
4.0 Spark 2.3 Mar 01, 2018 Jul 01, 2018 Nov 01, 2018
3.4 Spark 2.2 Nov 20, 2017 Mar 31, 2018 Jul 30, 2018
3.3 Spark 2.2 Oct 04, 2017 Mar 31, 2018 Jul 30, 2018
3.2 Spark 2.2 Sep 05, 2017 Jan 30, 2018 Apr 30, 2018
3.1 Spark 2.2 Aug 04, 2017 Oct 30, 2017
3.0 Spark 2.2 Jul 11, 2017 Sep 05, 2017
Spark 2.1 (Auto Updating) Spark 2.1 Dec 22, 2016 Mar 31, 2018 Jul 30, 2018
Spark 2.1.1-db6 Spark 2.1 Aug 03, 2017 Mar 31, 2018 Jul 30, 2018
Spark 2.1.1-db5 Spark 2.1 May 31, 2017 Aug 03, 2017
Spark 2.1.1-db4 Spark 2.1 Apr 25, 2017 Mar 31, 2018 Jul 30, 2018
Spark 2.0 (Auto Updating) Spark 2.0 Jul 26, 2016 Jan 30, 2018 Apr 30, 2018
Spark 2.0.2-db4 Spark 2.0 Mar 24, 2017 Jan 30, 2018 Apr 30, 2018
Spark 1.6.3-db2 Spark 1.6 Mar 24, 2017 Jan 30, 2018 Jun 30, 2018

References

  1. https://docs.databricks.com/spark/latest/spark-sql/language-manual/analyze-table.html
  2. https://docs.databricks.com/spark/latest/spark-sql/language-manual/describe-table.html
  3. https://docs.databricks.com/spark/latest/spark-sql/cbo.html
  4. https://docs.databricks.com/release-notes/runtime/databricks-runtime-ver.html#versioning
  5. https://blog.csdn.net/Albert201605/article/details/82313139
  6. https://archive.ics.uci.edu/ml/datasets/Iris
