
Hive on Spark Series 1: Configuring CDH 5.5 to Support Hive on Spark



As of this writing, CDH 5.7 and later fully support Hive on Spark; see the official documentation for the specific configuration.

We are currently running CDH 5.5.1, so I wanted to try out Hive on Spark; if it works well, we may upgrade our CDH version later. The rest of this article uses CDH 5.5 as the example.

Important:

CDH 5.4 introduced Hive on Spark, but it is not recommended for use in CDH 5.5.x. If you want to try this feature, use a test environment until Cloudera has resolved the current issues and limitations and confirmed it is ready for production.

Note:

Using HiveServer2's Beeline is recommended, although the Hive CLI also works.

Contents:

Installation Notes

Enabling Hive on Spark

Configuration Properties

Configuring Hive

Configuring Executor Memory Size

Installation Notes:

For Hive to run on Spark, you must deploy a Spark gateway role on the host where HiveServer2 runs. Otherwise, Hive on Spark cannot read Spark's configuration and cannot submit Spark jobs.

When using it, you must run the following command manually so that subsequent queries in the session use the Spark engine.

set hive.execution.engine=spark;
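
For example, from a Beeline session connected to HiveServer2 (my_table below is a hypothetical placeholder; any query after the set command runs on Spark):

-- session-scoped: applies only to the current connection
set hive.execution.engine=spark;
-- my_table is a placeholder table; this count now runs as a Spark job
select count(*) from my_table;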

Enabling Hive on Spark

Hive on Spark is disabled by default and must be enabled in Cloudera Manager.

1. Log in to the Cloudera Manager web UI and open the Hive service.

2. Click the Configuration tab and search for the Enable Hive on Spark property.

3. Check Enable Hive on Spark (Unsupported) and save the change.

4. Search for the Spark On YARN Service setting, select it, and save.

5. After saving, redeploy the client configuration so the change takes effect.
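
After the redeploy, you can verify from a Hive or Beeline session that the engine can be switched; this is a minimal check, and the output formatting differs between the CLI and Beeline:

-- print the current engine (mr by default in CDH 5.5)
set hive.execution.engine;
-- switch the session to Spark
set hive.execution.engine=spark;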

Configuration Properties

Note: the official documentation lists two property configurations. I could not find the first one in CM; both can simply be left at their default values.

Property: hive.stats.collect.rawdatasize
Description: Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. There are two kinds of statistics about data size:
  • totalSize: the approximate size of the data on disk
  • rawDataSize: the approximate size of the data in memory
When both metrics are available, Hive prefers rawDataSize.
Default: true

Property: hive.auto.convert.join.noconditionaltask.size
Description: The threshold for the sum of the sizes of all the small tables (by default, their rawDataSize) below which a common join is converted to a map join. You can increase this value to convert more common joins to map joins for better performance. If you set it too high, however, tasks may fail because the data from the small tables uses too much memory.
Default: 20MB
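
As an illustration, both properties can also be inspected or overridden per session. Note that at the HiveQL level, hive.auto.convert.join.noconditionaltask.size takes a value in bytes; the 64 MB figure below is an arbitrary example, not a recommendation:

-- check the current values
set hive.stats.collect.rawdatasize;
set hive.auto.convert.join.noconditionaltask.size;
-- raise the map join threshold to 64 MB (67108864 bytes) for this session
set hive.auto.convert.join.noconditionaltask.size=67108864;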

Configuring Hive

To improve performance, Cloudera recommends the following additional Hive properties. In Cloudera Manager, set them on the HiveServer2 service:

hive.stats.fetch.column.stats=true

Whether to fetch column statistics from the metastore.

hive.optimize.index.filter=true

Whether to use indexes automatically; this is disabled (false) by default.
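
If you want to try these settings before changing the HiveServer2 configuration in Cloudera Manager, a minimal session-level sketch follows (these last only for the current connection; the CM setting is what persists across sessions):

-- fetch column statistics from the metastore for better plans
set hive.stats.fetch.column.stats=true;
-- allow the optimizer to use filter indexes automatically
set hive.optimize.index.filter=true;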

Configuring Executor Memory Size

Note: see the official documentation for the specific way to configure the settings below.

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable map join optimization. However, too much executor memory makes garbage collection take longer. Also, some experiments show that HDFS doesn't handle concurrent writers well, so it may face race conditions if there are too many executor cores.

Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on what the host's core count is divisible by. For example, if yarn.nodemanager.resource.cpu-vcores is 19, you would set the value to 6. Executors must all have the same number of cores: if you set the value to 5, each host can run only three executors and four cores go unused; if you set it to 7, only two executors fit and five cores go unused. If the number of cores is 20, set the value to 5 so that each host runs four executors with no cores unused.

Cloudera also recommends the following (see the worked example after this list):
  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores) and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
  • Set spark.yarn.executor.memoryOverhead to 15-20% of that total memory size.
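
As a worked example under assumed numbers (a hypothetical host with yarn.nodemanager.resource.cpu-vcores=20 and yarn.nodemanager.resource.memory-mb=102400, i.e. 100 GB): total memory per executor is 102400 * (5 / 20) = 25600 MB, split roughly 80/20 between heap and overhead. Expressed as session-level properties (spark.yarn.executor.memoryOverhead is specified in MB):

set spark.executor.cores=5;
-- 102400 MB * (5 / 20) = 25600 MB per executor, split ~80/20
set spark.executor.memory=20g;
-- 5120 MB = 20% of 25600 MB, at the top of Cloudera's 15-20% guidance
set spark.yarn.executor.memoryOverhead=5120;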

Summary:

With the configuration above, you can use Hive on Spark directly from the Hive CLI or through HiveServer2. Usage is no different from Hive on MapReduce; just run set hive.execution.engine=spark; before your queries and Hive will execute on the Spark engine.