
Hive on Spark Series 1: Configuring CDH 5.5 to Support Hive on Spark



As of this writing, CDH 5.7 and later fully support Hive on Spark; see the official documentation for the specific configuration.

We are currently running CDH 5.5.1, so I wanted to try out Hive on Spark; if it works well, we may upgrade our CDH version later. The rest of this article uses CDH 5.5 as the example.

Important:

CDH 5.4 introduced Hive on Spark, but it is not recommended for use in CDH 5.5.x. If you want to try this feature, use a test environment until Cloudera has resolved the current issues and limitations and confirmed it is ready for production.

Note:

Using HiveServer2's Beeline is recommended, although the Hive CLI also works.

Contents:

Installation Notes

Enabling Hive on Spark

Configuration Properties

Configuring Hive

Configuring Executor Memory Size

Installation Notes:

For Hive to run on Spark, you must deploy a Spark gateway role on the host where HiveServer2 runs. Otherwise, Hive on Spark cannot read Spark's configuration and cannot submit Spark jobs.

When using it, you must run the following command manually so that subsequent queries in the session use the Spark engine.

set hive.execution.engine=spark;
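
For example, from a Beeline session connected to HiveServer2 (my_table below is a hypothetical placeholder; any query after the set command runs on Spark):

-- session-scoped: applies only to the current connection
set hive.execution.engine=spark;
-- my_table is a placeholder table; this count now runs as a Spark job
select count(*) from my_table;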

Enabling Hive on Spark

Hive on Spark is disabled by default and must be enabled in Cloudera Manager.

1. Log in to the Cloudera Manager web UI and open the Hive service.

2. Click the Configuration tab and search for the Enable Hive on Spark property.

3. Check Enable Hive on Spark (Unsupported) and save the change.

4. Search for the Spark On YARN Service setting, select it, and save.

5. After saving, redeploy the client configuration so the change takes effect.
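
After the redeploy, you can verify from a Hive or Beeline session that the engine can be switched; this is a minimal check, and the output formatting differs between the CLI and Beeline:

-- print the current engine (mr by default in CDH 5.5)
set hive.execution.engine;
-- switch the session to Spark
set hive.execution.engine=spark;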

Configuration Properties

Note: the official documentation lists two property configurations. I could not find the first one in CM; both can simply be left at their default values.

Property: hive.stats.collect.rawdatasize
Description: Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. There are two kinds of statistics about data size:
  • totalSize: the approximate size of the data on disk
  • rawDataSize: the approximate size of the data in memory
When both metrics are available, Hive prefers rawDataSize.
Default: true

Property: hive.auto.convert.join.noconditionaltask.size
Description: The threshold for the sum of the sizes of all the small tables (by default, their rawDataSize) below which a common join is converted to a map join. You can increase this value to convert more common joins to map joins for better performance. If you set it too high, however, tasks may fail because the data from the small tables uses too much memory.
Default: 20MB
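
As an illustration, both properties can also be inspected or overridden per session. Note that at the HiveQL level, hive.auto.convert.join.noconditionaltask.size takes a value in bytes; the 64 MB figure below is an arbitrary example, not a recommendation:

-- check the current values
set hive.stats.collect.rawdatasize;
set hive.auto.convert.join.noconditionaltask.size;
-- raise the map join threshold to 64 MB (67108864 bytes) for this session
set hive.auto.convert.join.noconditionaltask.size=67108864;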

Configuring Hive

To improve performance, Cloudera recommends the following additional Hive properties. In Cloudera Manager, set them on the HiveServer2 service:

hive.stats.fetch.column.stats=true

Whether to fetch column statistics from the metastore.

hive.optimize.index.filter=true

Whether to use indexes automatically; this is disabled (false) by default.
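
If you want to try these settings before changing the HiveServer2 configuration in Cloudera Manager, a minimal session-level sketch follows (these last only for the current connection; the CM setting is what persists across sessions):

-- fetch column statistics from the metastore for better plans
set hive.stats.fetch.column.stats=true;
-- allow the optimizer to use filter indexes automatically
set hive.optimize.index.filter=true;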

Configuring Executor Memory Size

Note: see the official documentation for the specific way to configure the settings below.

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable map join optimization. However, too much executor memory makes garbage collection take longer. Also, some experiments show that HDFS doesn't handle concurrent writers well, so it may face race conditions if there are too many executor cores.

Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on what the host's core count is divisible by. For example, if yarn.nodemanager.resource.cpu-vcores is 19, you would set the value to 6. Executors must all have the same number of cores: if you set the value to 5, each host can run only three executors and four cores go unused; if you set it to 7, only two executors fit and five cores go unused. If the number of cores is 20, set the value to 5 so that each host runs four executors with no cores unused.

Cloudera also recommends the following (see the worked example after this list):
  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
  • Set spark.yarn.executor.memoryOverhead to 15-20% of that total memory size.
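
As a worked example under assumed numbers (a hypothetical host with yarn.nodemanager.resource.cpu-vcores=20 and yarn.nodemanager.resource.memory-mb=102400, i.e. 100 GB): total memory per executor is 102400 * (5 / 20) = 25600 MB, split roughly 80/20 between heap and overhead. Expressed as session-level properties (spark.yarn.executor.memoryOverhead is specified in MB):

set spark.executor.cores=5;
-- 102400 MB * (5 / 20) = 25600 MB per executor, split ~80/20
set spark.executor.memory=20g;
-- 5120 MB = 20% of 25600 MB, at the top of Cloudera's 15-20% guidance
set spark.yarn.executor.memoryOverhead=5120;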

Summary:

With the configuration above, you can use Hive on Spark directly from the Hive CLI or through HiveServer2. Usage is no different from Hive on MapReduce; just run set hive.execution.engine=spark; before your queries and Hive will execute on the Spark engine.