Troubleshooting Notes (2): Spark job intermittently fails with java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
When submitting a Spark job in yarn-cluster mode, roughly 40% of runs fail with the following error:
18/03/15 21:50:36 116 ERROR ApplicationMaster91: User class threw exception: org.apache.spark.sql.AnalysisException: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT;
org.apache.spark.sql.AnalysisException: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
at org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at scala.util.control.Breaks.breakable(Breaks.scala:38)
at app.package.APPClass$.main(APPClass.scala:177)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
Caused by: java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT
at org.apache.hadoop.hive.ql.metadata.Hive.trashFilesUnderDir(Hive.java:1389)
at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:2873)
at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1621)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:728)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:676)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:676)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:676)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
at org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:675)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:768)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:766)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:766)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 25 more
The rough flow: when Spark SQL executes InsertIntoHiveTable it calls loadTable, which ultimately invokes Hive's loadTable method via reflection:
1 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
2 org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
3 org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:728)
4 java.lang.reflect.Method.invoke(Method.java:497)
5 org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1621)
6 org.apache.hadoop.hive.ql.metadata.Hive.trashFilesUnderDir(Hive.java:1389)
The error java.lang.NoSuchFieldError: HIVE_MOVE_FILES_THREAD_COUNT is thrown at step 6.
This error is commonly attributed to a missing setting in hive-site.xml:
<property>
  <name>hive.mv.files.thread</name>
  <value>15</value>
</property>
However, reading the code shows that Spark 2.1.1 depends on Hive 1.2.1, and Hive 1.2.1 has no hive.mv.files.thread setting at all; it first appeared in Hive 2. Moreover, the relevant code of the failing class org.apache.hadoop.hive.ql.metadata.Hive is completely different between Hive 1.2.1 and Hive 2.
In Hive 1.2.1 the code is (trashFilesUnderDir is a method of the FileUtils class):
if (FileUtils.isSubDir(oldPath, destf, fs2)) {
  FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
}
In Hive 2 the code is (trashFilesUnderDir is a method of the Hive class):
private boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf)
    throws IOException {
  FileStatus[] statuses = fs.listStatus(f, FileUtils.HIDDEN_FILES_PATH_FILTER);
  boolean result = true;
  final List<Future<Boolean>> futures = new LinkedList<>();
  final ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ?
      Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25),
          new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Delete-Thread-%d").build()) : null;
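A NoSuchFieldError is raised at link time when a class references a static field that does not exist in the class actually loaded at runtime. The same kind of check can be sketched with reflection; since Hive is not assumed to be on the classpath here, the sketch uses java.lang.Integer as a stand-in:

```java
public class FieldCheck {
    // Returns true if the class loaded at runtime declares a public field with
    // the given name; this is the reflective analogue of the link-time lookup
    // that fails with NoSuchFieldError in the stack trace above.
    static boolean hasField(Class<?> cls, String name) {
        try {
            cls.getField(name);
            return true;
        } catch (NoSuchFieldException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Integer really declares MAX_VALUE, just as Hive 2's ConfVars
        // declares HIVE_MOVE_FILES_THREAD_COUNT:
        System.out.println(hasField(Integer.class, "MAX_VALUE"));               // true
        // ...and lacks the field below, just as Hive 1.2.1's ConfVars does:
        System.out.println(hasField(Integer.class, "HIVE_MOVE_FILES_THREAD_COUNT")); // false
    }
}
```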
So the code executed at step 6 must be Hive 2 code, which suggests two hypotheses:
1) Jar pollution: the JVM classpath contains both Hive 1 and Hive 2 jars, so the class is sometimes loaded from the Hive 1 jar (runs fine) and sometimes from the Hive 2 jar (fails);
2) Environment differences across the cluster: some servers have no Hive 2 jar on the classpath (run fine), while others do (may fail).
Comparing the environment configuration and launch commands of healthy and failing servers showed they were identical, with no Hive 2 jar to be found.
Launching the job with -verbose:class showed that in both the healthy and the failing case, the Hive class was loaded from the Hive 1.2.1 jar:
[Loaded org.apache.hadoop.hive.ql.metadata.Hive from file:/export/Data/tmp/hadoop-tmp/nm-local-dir/filecache/98/hive-exec-1.2.1.spark2.jar]
This ruled out both hypotheses.
Inspecting the submit command showed that spark.yarn.jars was set to avoid re-uploading Spark's jars on every submission; these jars are cached on each NodeManager as a filecache under yarn.nodemanager.local-dirs.
Decompiling hive-exec-1.2.1.spark2.jar from the filecache on a healthy and on a failing server finally exposed the problem.
On a healthy server the Hive class reads:
if (FileUtils.isSubDir(oldPath, destf, fs2))
  FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
On a failing server the Hive class reads:
private static boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf) throws IOException {
  FileStatus[] statuses = fs.listStatus(f, FileUtils.HIDDEN_FILES_PATH_FILTER);
  boolean result = true;
  List<Future<Boolean>> futures = new LinkedList();
  ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ?
      Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25),
          (new ThreadFactoryBuilder()).setDaemon(true).setNameFormat("Delete-Thread-%d").build()) : null;
The Hive class on the failing server references ConfVars.HIVE_MOVE_FILES_THREAD_COUNT, but the ConfVars class in hive-common-1.2.1.jar has no such field, hence the java.lang.NoSuchFieldError.
So the likely sequence of events: the hive-exec-1.2.1.spark2.jar on HDFS was originally correct, and every NodeManager downloaded it into its local filecache. The jar was later replaced with a broken build (Spark compiled against Hive 2), so NodeManagers added afterwards downloaded the broken jar into their filecache. The net result: some servers run the job fine while others fail.
YARN's filecache cleanup is governed by two settings:
yarn.nodemanager.localizer.cache.cleanup.interval-ms: 600000 (interval in between cache cleanups)
yarn.nodemanager.localizer.cache.target-size-mb: 10240 (target size of localizer cache in MB, per local directory)
Every cleanup.interval-ms the NodeManager checks whether the local filecache exceeds target-size-mb; it cleans up only when the size is over the target, otherwise the cached files stay in use indefinitely, which is why the stale jar was never refreshed.
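The policy above can be sketched as follows (assumed behavior distilled from the two settings, not the actual NodeManager source):

```java
public class FileCachePolicy {
    // yarn.nodemanager.localizer.cache.target-size-mb (default 10240)
    static final long TARGET_SIZE_MB = 10240;
    // yarn.nodemanager.localizer.cache.cleanup.interval-ms (default 600000)
    static final long CLEANUP_INTERVAL_MS = 600000;

    // Evaluated every CLEANUP_INTERVAL_MS: cleanup only happens when the
    // cache is over the target, so an under-threshold (possibly stale)
    // cached jar is reused indefinitely.
    static boolean shouldClean(long cacheSizeMb) {
        return cacheSizeMb > TARGET_SIZE_MB;
    }

    public static void main(String[] args) {
        System.out.println(shouldClean(8000));  // under target: cached jars stay
        System.out.println(shouldClean(12000)); // over target: cleanup runs
    }
}
```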