[Original] Troubleshooting Notes (17): Spark occasionally throws NullPointerException when querying ORC data

When Spark queries ORC-format data, it sometimes fails with this error:

Caused by: java.lang.NullPointerException
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560)
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010)
... 47 more

Tracing into the code:

org.apache.hadoop.hive.ql.io.orc.OrcInputFormat

  static enum SplitStrategyKind {
    HYBRID,
    BI,
    ETL
  }
...

    Context(Configuration conf) {
      this.conf = conf;
      minSize = conf.getLong(MIN_SPLIT_SIZE, DEFAULT_MIN_SPLIT_SIZE);
      maxSize = conf.getLong(MAX_SPLIT_SIZE, DEFAULT_MAX_SPLIT_SIZE);
      String ss = conf.get(ConfVars.HIVE_ORC_SPLIT_STRATEGY.varname);
      
      if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
        splitStrategyKind = SplitStrategyKind.HYBRID;
      } else {
        LOG.info("Enforcing " + ss + " ORC split strategy");
        splitStrategyKind = SplitStrategyKind.valueOf(ss);
      }
...

        switch(context.splitStrategyKind) {
          case BI:
            // BI strategy requested through config
            splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          case ETL:
            // ETL strategy requested through config
            splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                deltas, covered);
            break;
          default:
            // HYBRID strategy
            if (avgFileSize > context.maxSize) {
              splitStrategy = new ETLSplitStrategy(context, fs, dir, children, isOriginal,
                  deltas, covered);
            } else {
              splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal,
                  deltas, covered);
            }
            break;
        }

 

org.apache.hadoop.hive.conf.HiveConf.ConfVars

    HIVE_ORC_SPLIT_STRATEGY("hive.exec.orc.split.strategy", "HYBRID", new StringSet("HYBRID", "BI", "ETL"),
        "This is not a user level config. BI strategy is used when the requirement is to spend less time in split generation" +
        " as opposed to query execution (split generation does not read or cache file footers)." +
        " ETL strategy is used when spending little more time in split generation is acceptable" +
        " (split generation reads and caches file footers). HYBRID chooses between the above strategies" +
        " based on heuristics."),

 

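So hive.exec.orc.split.strategy defaults to HYBRID. Under HYBRID, when the following condition is not met

if (avgFileSize > context.maxSize) {

the else branch runs and the BI strategy is chosen:

splitStrategy = new BISplitStrategy(context, fs, dir, children, isOriginal, deltas,
covered);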
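To make the selection rule easier to follow, here is a minimal standalone sketch of the logic quoted above. It is not Hive code: the class SplitStrategySketch and the methods configuredKind and choose are hypothetical names introduced only for illustration, and the maxSize value in main is an arbitrary example rather than Hive's actual default.

```java
// Simplified, standalone model of the strategy selection quoted above.
// SplitStrategySketch, configuredKind and choose are hypothetical names, not Hive APIs.
class SplitStrategySketch {

    enum SplitStrategyKind { HYBRID, BI, ETL }

    // Mirrors the Context constructor: a missing value or "HYBRID" means HYBRID;
    // any other value goes through valueOf, which throws on unknown names.
    static SplitStrategyKind configuredKind(String ss) {
        if (ss == null || ss.equals(SplitStrategyKind.HYBRID.name())) {
            return SplitStrategyKind.HYBRID;
        }
        return SplitStrategyKind.valueOf(ss);
    }

    // Mirrors the switch: returns the strategy that would actually be instantiated.
    // The result is always BI or ETL; HYBRID is resolved by comparing sizes.
    static SplitStrategyKind choose(String ss, long avgFileSize, long maxSize) {
        switch (configuredKind(ss)) {
            case BI:
                return SplitStrategyKind.BI;
            case ETL:
                return SplitStrategyKind.ETL;
            default:
                return avgFileSize > maxSize ? SplitStrategyKind.ETL : SplitStrategyKind.BI;
        }
    }

    public static void main(String[] args) {
        long maxSize = 256L * 1024 * 1024; // illustrative value only (assumption)
        System.out.println(choose(null, 10L * 1024 * 1024, maxSize));  // small files, HYBRID -> BI
        System.out.println(choose(null, 512L * 1024 * 1024, maxSize)); // large files, HYBRID -> ETL
        System.out.println(choose("ETL", 10L * 1024 * 1024, maxSize)); // forced ETL
    }
}
```

In this model, a table made up of small files always lands on the BI branch unless the strategy is forced through configuration. That would be consistent with the error appearing only for some queries.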

The class that throws is BISplitStrategy. I have not yet looked closely at why it throws, but the problem can be avoided by changing a setting:

set hive.exec.orc.split.strategy=ETL

The problem is resolved for now; to be continued.