
Hive on Spark Environment Setup

Building Spark from Source and Setting Up the Environment

Note that you must have a version of Spark which does not include the Hive jars;

Building Spark:

git clone https://github.com/apache/spark.git spark_src
cd spark_src
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
./make-distribution.sh --name "spark-without-hive" --tgz -Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.3.1 -Pyarn -DskipTests package

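If the build succeeds, make-distribution.sh leaves a tarball in the source root. A minimal sketch of unpacking it into the installation path used in this post; the exact tarball file name is an assumption based on the --name/--tgz flags above, so adjust it if yours differs:

# Assumption: the tarball name follows the --name/--tgz flags used above
ls spark-*-bin-spark-without-hive.tgz
tar -zxvf spark-*-bin-spark-without-hive.tgz -C /home/spark/app/
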
Spark setup: see the Spark Environment Setup chapter

Building Hive from Source and Setting Up the Environment

Building Hive

git clone https://github.com/apache/hive.git hive_on_spark
cd hive_on_spark
git checkout spark
mvn clean install -Phadoop-2,dist -DskipTests

After the build completes, the Hive package is located at /packaging/target/apache-hive-1.2.0-SNAPSHOT-bin.tar.gz

Note that spark.version in pom.xml must match the Spark version number:

<spark.version>1.3.0</spark.version>

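A quick way to confirm (or bump) that version before building; a hedged sketch, assuming you are in the hive_on_spark source root and that 1.3.0 is just this post's example version:

# Print the Spark version Hive will be built against
grep -m1 '<spark.version>' pom.xml
# If it does not match your Spark build, change it before compiling
sed -i 's|<spark.version>.*</spark.version>|<spark.version>1.3.0</spark.version>|' pom.xml
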
Hive installation: see the Hive Environment Setup chapter

In this example, Spark and Hive are installed at the following paths:

Spark installation directory: /home/spark/app/spark-1.3.0-bin-spark-without-hive

Hive installation directory: /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin

Ways to add the Spark dependency to Hive

Option 1: Set the property 'spark.home' to point to the Spark installation:

hive> set spark.home=/home/spark/app/spark-1.3.0-bin-spark-without-hive;

Option 2: Define the SPARK_HOME environment variable before starting Hive CLI/HiveServer2:

export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive
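
To make this survive new shells, one possibility (a sketch of a local convention, not something required by Hive) is to persist the export in the login profile:

# Sketch: persist SPARK_HOME so every new shell, and hence Hive session, picks it up
echo 'export SPARK_HOME=/home/spark/app/spark-1.3.0-bin-spark-without-hive' >> ~/.bashrc
source ~/.bashrc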

Option 3: Set the spark-assembly jar on the Hive auxpath:

hive --auxpath /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar

Option 4: Add the spark-assembly jar for the current user session:

hive> add jar /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar;

Option 5: Link the spark-assembly jar to $HIVE_HOME/lib, as in the sketch below.
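
A minimal sketch of option 5, assuming the installation paths used in this post; the exact assembly jar file name depends on how Spark was built:

# Symlink the spark-assembly jar into Hive's lib directory
ln -s /home/spark/app/spark-1.3.0-bin-spark-without-hive/lib/spark-assembly-*.jar \
      /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin/lib/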

An error that may occur while starting Hive:

[ERROR] Terminal initialization failed; falling back to unsupported
java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected
        at jline.TerminalFactory.create(TerminalFactory.java:101)
        at jline.TerminalFactory.get(TerminalFactory.java:158)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:229)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:221)
        at jline.console.ConsoleReader.<init>(ConsoleReader.java:209)
        at org.apache.hadoop.hive.cli.CliDriver.getConsoleReader(CliDriver.java:773)
        at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:715)
        at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
        at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

Fix: export HADOOP_USER_CLASSPATH_FIRST=true

For fixes to errors in other scenarios, see: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Another pitfall: you must set the spark.eventLog.dir parameter, for example:

set spark.eventLog.dir=hdfs://hadoop000:8020/directory;

Otherwise queries keep failing with an error that a folder like /tmp/spark-event does not exist. This pitfall took a long time to track down.
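
The HDFS directory also has to exist before queries run. A minimal sketch, where the NameNode address and path are just this post's example values:

# Create the event log directory up front
hdfs dfs -mkdir -p hdfs://hadoop000:8020/directory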

After starting Hive, set the execution engine to Spark:

hive> set hive.execution.engine=spark;

Set the Spark master (run mode):

hive> set spark.master=spark://hadoop000:7077;

Or for YARN: spark.master=yarn
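
Instead of typing the set commands in every session, both properties can also be passed when starting the CLI; a hedged sketch using the standard --hiveconf flag, with the same example master URL as above:

# Start the Hive CLI with the engine and master preset for this session
hive --hiveconf hive.execution.engine=spark \
     --hiveconf spark.master=spark://hadoop000:7077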

Configure Spark-application configs for Hive

These can be configured in spark-defaults.conf or in hive-site.xml:

spark.master=<Spark Master URL>
spark.eventLog.enabled=true
spark.executor.memory=512m
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=...  # Amount of memory to use per executor process.
spark.executor.cores=...  # Number of cores per executor.
spark.yarn.executor.memoryOverhead=...
spark.executor.instances=...  # The number of executors assigned to each application.
spark.driver.memory=...  # The amount of memory assigned to the Remote Spark Context (RSC). We recommend 4GB.
spark.yarn.driver.memoryOverhead=...  # We recommend 400 (MB).
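
A minimal sketch of the spark-defaults.conf route, assuming Hive picks the file up from its conf directory; the values mirror this post's examples and are not tuning recommendations:

# Write a spark-defaults.conf next to hive-site.xml
cat > /home/spark/app/apache-hive-1.2.0-SNAPSHOT-bin/conf/spark-defaults.conf <<'EOF'
spark.master                  spark://hadoop000:7077
spark.eventLog.enabled        true
spark.eventLog.dir            hdfs://hadoop000:8020/directory
spark.executor.memory         512m
spark.serializer              org.apache.spark.serializer.KryoSerializer
EOF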

After running a SQL statement, you can view job/stage information on the monitoring page:

hive (default)> select city_id, count(*) c from page_views group by city_id order by c desc limit 5;
Query ID = spark_20150309173838_444cb5b1-b72e-4fc3-87db-4162e364cb1e
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
state = SENT
state = STARTED
state = STARTED
state = STARTED
state = STARTED
Query Hive on Spark job[0] stages:
0
1
2
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId: SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount [StageCost]
2015-03-09 17:38:11,822 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
state = STARTED
2015-03-09 17:38:14,845 Stage-0_0: 0(+1)/1      Stage-1_0: 0/1  Stage-2_0: 0/1
state = STARTED
state = STARTED
2015-03-09 17:38:16,861 Stage-0_0: 1/1 Finished Stage-1_0: 0(+1)/1      Stage-2_0: 0/1
state = SUCCEEDED
2015-03-09 17:38:17,867 Stage-0_0: 1/1 Finished Stage-1_0: 1/1 Finished Stage-2_0: 1/1 Finished
Status: Finished successfully in 10.07 seconds
OK
city_id c
-1000   22826
-10     17294
-20     10608
-1      6186
237     4158
Time taken: 18.417 seconds, Fetched: 5 row(s)

 
