
Hive query error: java.io.IOException: org.apache.parquet.io.ParquetDecodingException

Preface

This post resolves the Hive query exception described in the title. The full error message is:

Failed with exception java.io.IOException:org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file hdfs://192.168.44.128:8888/user/hive/warehouse/test.db/test/part-00000-9596e4bd-f511-4f76-9030-33e426d0369c-c000.snappy.parquet

This exception occurred after using Spark SQL to read a table from Oracle (I have not tested whether the same problem exists with MySQL; you can try it yourself) and write the data into a Hive table; the error was then thrown when querying the table from the Hive CLI. Let's first look at how to reproduce it.

1. Create the database and tables

1.1 Create the Hive test database

Run the following statement in Hive:

create database test;

1.2 Create the Oracle test table

CREATE TABLE TEST
(   "ID" VARCHAR2(100), 
    "NUM" NUMBER(10,2)
)

1.3 Insert a record into the Oracle table

INSERT INTO TEST (ID, NUM) VALUES('1', 1);

2. Spark SQL code

Running the following code imports the test table we created in Oracle into Hive; the Hive table is created automatically. For the details of connecting Spark to Hive and to relational databases, see my other two blog posts:

Connecting Spark to Hive (via spark-shell and Eclipse)
Spark SQL connecting to MySQL

package com.dkl.leanring.spark.sql

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

object Oracle2HiveTest {
  def main(args: Array[String]): Unit = {

    // Initialize Spark
    val spark = SparkSession
      .builder()
      .appName("Oracle2HiveTest")
      .master("local")
      // .config("spark.sql.parquet.writeLegacyFormat", true)
      .enableHiveSupport()
      .getOrCreate()

    // The table name is the test table we created earlier
    val tableName = "test"

    // Connect Spark to the Oracle database
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@192.168.44.128:1521:orcl")
      .option("dbtable", tableName)
      .option("user", "bigdata")
      .option("password", "bigdata")
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .load()

    // Import Spark's sql function for convenience
    import spark.sql
    // Switch to the test database
    sql("use test")
    // Save the DataFrame into the Hive table (the table is created automatically)
    df.write.mode(SaveMode.Overwrite).saveAsTable(tableName)
    // Stop Spark
    spark.stop()
  }
}

3. Query in Hive

hive
use test;
select * from test;

Running these statements reproduces the exception described in the title.

4. Solution

Uncomment the .config("spark.sql.parquet.writeLegacyFormat", true) line in the Spark code in section 2 and run it again; the exception no longer occurs. The default value of this option is false; when it is set to true, Spark writes Parquet data using the same conventions as Hive.
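For reference, here is what the fixed SparkSession setup from section 2 looks like with that line enabled (only the builder changes; the read-from-Oracle and saveAsTable steps stay exactly as shown above):

import org.apache.spark.sql.SparkSession

// Only the builder changes; the rest of the job in section 2 stays as-is.
val spark = SparkSession
  .builder()
  .appName("Oracle2HiveTest")
  .master("local")
  // Write Parquet with the same conventions Hive uses, so Hive can read the table back
  .config("spark.sql.parquet.writeLegacyFormat", true)
  .enableHiveSupport()
  .getOrCreate()

The same property can also be passed on the command line, e.g. spark-submit --conf spark.sql.parquet.writeLegacyFormat=true, instead of hard-coding it in the builder.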

5. Root cause

Root Cause:
This issue is caused by the different Parquet conventions used by Hive and Spark. In Hive, the decimal data type is represented as fixed bytes (INT 32). In Spark 1.4 and later, the default convention is to use the standard Parquet representation for the decimal data type. Under the standard Parquet representation, the underlying physical type changes based on the precision of the column.
e.g. DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.

Solution: The convention used by Spark to write Parquet data is configurable, and is determined by the property spark.sql.parquet.writeLegacyFormat. The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data, which resolves the issue.
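To make the two conventions concrete, below is a minimal sketch (the object name and the /tmp output paths are made up for illustration) that writes the same DECIMAL(10,2) column once with the default standard convention and once with the legacy convention; the comments restate the representations quoted above:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

object WriteLegacyVsStandard {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WriteLegacyVsStandard")
      .master("local")
      .getOrCreate()

    // One DECIMAL(10,2) value, mirroring the NUM column of the Oracle test table
    val df = spark.range(1).select(col("id").cast(DecimalType(10, 2)).as("NUM"))

    // Standard convention (writeLegacyFormat = false, the default):
    // per the ranges quoted above, a precision-10 decimal is annotated on an integer type,
    // which the Hive Parquet reader in this setup does not expect.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "false")
    df.write.mode("overwrite").parquet("/tmp/decimal_standard")

    // Legacy convention (writeLegacyFormat = true):
    // the decimal is written as fixed-length bytes, the representation Hive expects.
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
    df.write.mode("overwrite").parquet("/tmp/decimal_legacy")

    spark.stop()
  }
}

Comparing the schemas of the two output directories (for example with a Parquet schema inspection tool) should show different physical types for the same logical DECIMAL(10,2) column.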

6. Note

The precision (10,2) in NUMBER(10,2) in the table definition in 1.2 is essential: if the column is declared simply as NUMBER, the exception does not occur. I have not tested whether other precisions trigger the problem; you can try it yourself, for example with the sketch below.
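A rough test harness for checking other precisions might look like this, assuming the same local Hive setup as section 2 (the test_p<precision> table names are hypothetical):

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

object DecimalPrecisionTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DecimalPrecisionTest")
      .master("local")
      .enableHiveSupport()
      .getOrCreate()
    import spark.sql

    sql("use test")

    // Write one Hive table per precision using the default (standard) Parquet convention,
    // then run select * from test_p<precision> in the Hive CLI to see which precisions fail.
    Seq(5, 9, 10, 18, 20, 38).foreach { p =>
      val df = spark.range(1).select(col("id").cast(DecimalType(p, 2)).as("NUM"))
      df.write.mode(SaveMode.Overwrite).saveAsTable(s"test_p$p")
    }

    spark.stop()
  }
}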