
[Original] Troubleshooting Notes (21): Spark insert overwrite is very slow, even slower than Hive

We recently moved some SQL jobs from Hive to Spark and found they actually ran slower. The SQL is mostly insert overwrite statements, and the physical plan shows they go through InsertIntoHiveTable:

spark-sql> explain insert overwrite table test2 select * from test1;
== Physical Plan ==
InsertIntoHiveTable MetastoreRelation temp, test2, true, false
+- HiveTableScan [buyer_id#20, member_reg_gender#21, reg_birthday#22, reg_birthday1#23, age#24, age_range#25], MetastoreRelation temp, test1
Time taken: 0.404 seconds, Fetched 1 row(s)

Tracing into the code:
org.apache.spark.sql.hive.execution.InsertIntoHiveTable

  protected override def doExecute(): RDD[InternalRow] = {
    sqlContext.sparkContext.parallelize(sideEffectResult.asInstanceOf[Seq[InternalRow]], 1)
  }

  /**
   * Inserts all the rows in the table into Hive.  Row objects are properly serialized with the
   * `org.apache.hadoop.hive.serde2.SerDe` and the
   * `org.apache.hadoop.mapred.OutputFormat` provided by the table definition.
   *
   * Note: this is run once and then kept to avoid double insertions.
   */
  protected[sql] lazy val sideEffectResult: Seq[InternalRow] = {
    // Have to pass the TableDesc object to RDD.mapPartitions and then instantiate new serializer
    // instances within the closure, since Serializer is not serializable while TableDesc is.
    val tableDesc = table.tableDesc
    val tableLocation = table.hiveQlTable.getDataLocation
    val tmpLocation = getExternalTmpPath(tableLocation)
    val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, false)
    val isCompressed = hadoopConf.get("hive.exec.compress.output", "false").toBoolean

    if (isCompressed) {
      // Please note that isCompressed, "mapred.output.compress", "mapred.output.compression.codec",
      // and "mapred.output.compression.type" have no impact on ORC because it uses table properties
      // to store compression information.
      hadoopConf.set("mapred.output.compress", "true")
      fileSinkConf.setCompressed(true)
      fileSinkConf.setCompressCodec(hadoopConf.get("mapred.output.compression.codec"))
      fileSinkConf.setCompressType(hadoopConf.get("mapred.output.compression.type"))
    }

    val numDynamicPartitions = partition.values.count(_.isEmpty)
    val numStaticPartitions = partition.values.count(_.nonEmpty)
    val partitionSpec = partition.map {
      case (key, Some(value)) => key -> value
      case (key, None) => key -> ""
    }

    // All partition column names in the format of "<column name 1>/<column name 2>/..."
    val partitionColumns = fileSinkConf.getTableInfo.getProperties.getProperty("partition_columns")
    val partitionColumnNames = Option(partitionColumns).map(_.split("/")).getOrElse(Array.empty)

    // By this time, the partition map must match the table's partition columns
    if (partitionColumnNames.toSet != partition.keySet) {
      throw new SparkException(
        s"""Requested partitioning does not match the ${table.tableName} table:
           |Requested partitions: ${partition.keys.mkString(",")}
           |Table partitions: ${table.partitionKeys.map(_.name).mkString(",")}""".stripMargin)
    }

    // Validate partition spec if there exist any dynamic partitions
    if (numDynamicPartitions > 0) {
      // Report error if dynamic partitioning is not enabled
      if (!hadoopConf.get("hive.exec.dynamic.partition", "true").toBoolean) {
        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_DISABLED.getMsg)
      }

      // Report error if dynamic partition strict mode is on but no static partition is found
      if (numStaticPartitions == 0 &&
        hadoopConf.get("hive.exec.dynamic.partition.mode", "strict").equalsIgnoreCase("strict")) {
        throw new SparkException(ErrorMsg.DYNAMIC_PARTITION_STRICT_MODE.getMsg)
      }

      // Report error if any static partition appears after a dynamic partition
      val isDynamic = partitionColumnNames.map(partitionSpec(_).isEmpty)
      if (isDynamic.init.zip(isDynamic.tail).contains((true, false))) {
        throw new AnalysisException(ErrorMsg.PARTITION_DYN_STA_ORDER.getMsg)
      }
    }

    val jobConf = new JobConf(hadoopConf)
    val jobConfSer = new SerializableJobConf(jobConf)

    // When speculation is on and output committer class name contains "Direct", we should warn
    // users that they may loss data if they are using a direct output committer.
    val speculationEnabled = sqlContext.sparkContext.conf.getBoolean("spark.speculation", false)
    val outputCommitterClass = jobConf.get("mapred.output.committer.class", "")
    if (speculationEnabled && outputCommitterClass.contains("Direct")) {
      val warningMessage =
        s"$outputCommitterClass may be an output committer that writes data directly to " +
          "the final location. Because speculation is enabled, this output committer may " +
          "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
          "committer that does not have this behavior (e.g. FileOutputCommitter)."
      logWarning(warningMessage)
    }

    val writerContainer = if (numDynamicPartitions > 0) {
      val dynamicPartColNames = partitionColumnNames.takeRight(numDynamicPartitions)
      new SparkHiveDynamicPartitionWriterContainer(
        jobConf,
        fileSinkConf,
        dynamicPartColNames,
        child.output)
    } else {
      new SparkHiveWriterContainer(
        jobConf,
        fileSinkConf,
        child.output)
    }

    @transient val outputClass = writerContainer.newSerializer(table.tableDesc).getSerializedClass
    saveAsHiveFile(child.execute(), outputClass, fileSinkConf, jobConfSer, writerContainer)

    val outputPath = FileOutputFormat.getOutputPath(jobConf)
    // TODO: Correctly set holdDDLTime.
    // In most of the time, we should have holdDDLTime = false.
    // holdDDLTime will be true when TOK_HOLD_DDLTIME presents in the query as a hint.
    val holdDDLTime = false
    if (partition.nonEmpty) {
      if (numDynamicPartitions > 0) {
        externalCatalog.loadDynamicPartitions(
          db = table.catalogTable.database,
          table = table.catalogTable.identifier.table,
          outputPath.toString,
          partitionSpec,
          overwrite,
          numDynamicPartitions,
          holdDDLTime = holdDDLTime)
      } else {
        // scalastyle:off
        // ifNotExists is only valid with static partition, refer to
        // https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
        // scalastyle:on
        val oldPart =
          externalCatalog.getPartitionOption(
            table.catalogTable.database,
            table.catalogTable.identifier.table,
            partitionSpec)

        var doHiveOverwrite = overwrite

        if (oldPart.isEmpty || !ifNotExists) {
          // SPARK-18107: Insert overwrite runs much slower than hive-client.
          // Newer Hive largely improves insert overwrite performance. As Spark uses older Hive
          // version and we may not want to catch up new Hive version every time. We delete the
          // Hive partition first and then load data file into the Hive partition.
          if (oldPart.nonEmpty && overwrite) {
            oldPart.get.storage.locationUri.foreach { uri =>
              val partitionPath = new Path(uri)
              val fs = partitionPath.getFileSystem(hadoopConf)
              if (fs.exists(partitionPath)) {
                if (!fs.delete(partitionPath, true)) {
                  throw new RuntimeException(
                    "Cannot remove partition directory '" + partitionPath.toString)
                }
                // Don't let Hive do overwrite operation since it is slower.
                doHiveOverwrite = false
              }
            }
          }

          // inheritTableSpecs is set to true. It should be set to false for an IMPORT query
          // which is currently considered as a Hive native command.
          val inheritTableSpecs = true
          externalCatalog.loadPartition(
            table.catalogTable.database,
            table.catalogTable.identifier.table,
            outputPath.toString,
            partitionSpec,
            isOverwrite = doHiveOverwrite,
            holdDDLTime = holdDDLTime,
            inheritTableSpecs = inheritTableSpecs)
        }
      }
    } else {
      externalCatalog.loadTable(
        table.catalogTable.database,
        table.catalogTable.identifier.table,
        outputPath.toString, // TODO: URI
        overwrite,
        holdDDLTime)
    }

    // Attempt to delete the staging directory and the inclusive files. If failed, the files are
    // expected to be dropped at the normal termination of VM since deleteOnExit is used.
    try {
      createdTempDir.foreach { path => path.getFileSystem(hadoopConf).delete(path, true) }
    } catch {
      case NonFatal(e) =>
        logWarning(s"Unable to delete staging directory: $stagingDir.\n" + e)
    }

    // un-cache this table.
    sqlContext.sparkSession.catalog.uncacheTable(table.catalogTable.identifier.quotedString)
    sqlContext.sessionState.catalog.refreshTable(table.catalogTable.identifier)

    // It would be nice to just return the childRdd unchanged so insert operations could be chained,
    // however for now we return an empty list to simplify compatibility checks with hive, which
    // does not return anything for insert operations.
    // TODO: implement hive compatibility as rules.
    Seq.empty[InternalRow]
  }

An insert overwrite runs in three steps: select, write, and load. The first two are not the problem; the slowness is in the final load step. Taking loadPartition as an example, here is the execution path:

org.apache.spark.sql.hive.HiveExternalCatalog

  override def loadPartition(
      db: String,
      table: String,
      loadPath: String,
      partition: TablePartitionSpec,
      isOverwrite: Boolean,
      holdDDLTime: Boolean,
      inheritTableSpecs: Boolean): Unit = withClient {
    requireTableExists(db, table)

    val orderedPartitionSpec = new util.LinkedHashMap[String, String]()
    getTable(db, table).partitionColumnNames.foreach { colName =>
      // Hive metastore is not case preserving and keeps partition columns with lower cased names,
      // and Hive will validate the column names in partition spec to make sure they are partition
      // columns. Here we Lowercase the column names before passing the partition spec to Hive
      // client, to satisfy Hive.
      orderedPartitionSpec.put(colName.toLowerCase, partition(colName))
    }

    client.loadPartition(
      loadPath,
      db,
      table,
      orderedPartitionSpec,
      isOverwrite,
      holdDDLTime,
      inheritTableSpecs)
  }

This calls HiveClientImpl.loadPartition:

org.apache.spark.sql.hive.client.HiveClientImpl

  def loadPartition(
      loadPath: String,
      dbName: String,
      tableName: String,
      partSpec: java.util.LinkedHashMap[String, String],
      replace: Boolean,
      holdDDLTime: Boolean,
      inheritTableSpecs: Boolean): Unit = withHiveState {
    val hiveTable = client.getTable(dbName, tableName, true /* throw exception */)
    shim.loadPartition(
      client,
      new Path(loadPath), // TODO: Use URI
      s"$dbName.$tableName",
      partSpec,
      replace,
      holdDDLTime,
      inheritTableSpecs,
      isSkewedStoreAsSubdir = hiveTable.isStoredAsSubDirectories)
  }

which in turn calls Shim_v0_12.loadPartition:

org.apache.spark.sql.hive.client.Shim_v0_12

  override def loadPartition(
      hive: Hive,
      loadPath: Path,
      tableName: String,
      partSpec: JMap[String, String],
      replace: Boolean,
      holdDDLTime: Boolean,
      inheritTableSpecs: Boolean,
      isSkewedStoreAsSubdir: Boolean): Unit = {
    loadPartitionMethod.invoke(hive, loadPath, tableName, partSpec, replace: JBoolean,
      holdDDLTime: JBoolean, inheritTableSpecs: JBoolean, isSkewedStoreAsSubdir: JBoolean)
  }

  private lazy val loadPartitionMethod =
    findMethod(
      classOf[Hive],
      "loadPartition",
      classOf[Path],
      classOf[String],
      classOf[JMap[String, String]],
      JBoolean.TYPE,
      JBoolean.TYPE,
      JBoolean.TYPE,
      JBoolean.TYPE)

which finally invokes Hive's Hive.loadPartition via reflection:

org.apache.hadoop.hive.ql.metadata.Hive (version 1.2)

    public void loadPartition(Path loadPath, String tableName, Map<String, String> partSpec, boolean replace, boolean holdDDLTime, boolean inheritTableSpecs, boolean isSkewedStoreAsSubdir, boolean isSrcLocal, boolean isAcid) throws HiveException {
        Table tbl = this.getTable(tableName);
        this.loadPartition(loadPath, tbl, partSpec, replace, holdDDLTime, inheritTableSpecs, isSkewedStoreAsSubdir, isSrcLocal, isAcid);
    }

    public Partition loadPartition(Path loadPath, Table tbl, Map<String, String> partSpec, boolean replace, boolean holdDDLTime, boolean inheritTableSpecs, boolean isSkewedStoreAsSubdir, boolean isSrcLocal, boolean isAcid) throws HiveException {
        Path tblDataLocationPath = tbl.getDataLocation();
        Partition newTPart = null;

        try {
            Partition oldPart = this.getPartition(tbl, partSpec, false);
            Path oldPartPath = null;
            if (oldPart != null) {
                oldPartPath = oldPart.getDataLocation();
            }

            Path newPartPath = null;
            FileSystem oldPartPathFS;
            if (inheritTableSpecs) {
                Path partPath = new Path(tbl.getDataLocation(), Warehouse.makePartPath(partSpec));
                newPartPath = new Path(tblDataLocationPath.toUri().getScheme(), tblDataLocationPath.toUri().getAuthority(), partPath.toUri().getPath());
                if (oldPart != null) {
                    oldPartPathFS = oldPartPath.getFileSystem(this.getConf());
                    FileSystem loadPathFS = loadPath.getFileSystem(this.getConf());
                    if (FileUtils.equalsFileSystem(oldPartPathFS, loadPathFS)) {
                        newPartPath = oldPartPath;
                    }
                }
            } else {
                newPartPath = oldPartPath;
            }

            List<Path> newFiles = null;
            if (replace) {
                replaceFiles(tbl.getPath(), loadPath, newPartPath, oldPartPath, this.getConf(), isSrcLocal);
            } else {
                newFiles = new ArrayList();
                oldPartPathFS = tbl.getDataLocation().getFileSystem(this.conf);
                copyFiles(this.conf, loadPath, newPartPath, oldPartPathFS, isSrcLocal, isAcid, newFiles);
            }

            boolean forceCreate = !holdDDLTime;
            newTPart = this.getPartition(tbl, partSpec, forceCreate, newPartPath.toString(), inheritTableSpecs, newFiles);
            if (!holdDDLTime && isSkewedStoreAsSubdir) {
                org.apache.hadoop.hive.metastore.api.Partition newCreatedTpart = newTPart.getTPartition();
                SkewedInfo skewedInfo = newCreatedTpart.getSd().getSkewedInfo();
                Map<List<String>, String> skewedColValueLocationMaps = this.constructListBucketingLocationMap(newPartPath, skewedInfo);
                skewedInfo.setSkewedColValueLocationMaps(skewedColValueLocationMaps);
                newCreatedTpart.getSd().setSkewedInfo(skewedInfo);
                this.alterPartition(tbl.getDbName(), tbl.getTableName(), new Partition(tbl, newCreatedTpart));
                this.getPartition(tbl, partSpec, true, newPartPath.toString(), inheritTableSpecs, newFiles);
                return new Partition(tbl, newCreatedTpart);
            } else {
                return newTPart;
            }
        } catch (IOException var20) {
            LOG.error(StringUtils.stringifyException(var20));
            throw new HiveException(var20);
        } catch (MetaException var21) {
            LOG.error(StringUtils.stringifyException(var21));
            throw new HiveException(var21);
        } catch (InvalidOperationException var22) {
            LOG.error(StringUtils.stringifyException(var22));
            throw new HiveException(var22);
        }
    }

    protected static void replaceFiles(Path tablePath, Path srcf, Path destf, Path oldPath, HiveConf conf, boolean isSrcLocal) throws HiveException {
        try {
            FileSystem destFs = destf.getFileSystem(conf);
            boolean inheritPerms = HiveConf.getBoolVar(conf, ConfVars.HIVE_WAREHOUSE_SUBDIR_INHERIT_PERMS);

            FileSystem srcFs;
            FileStatus[] srcs;
            try {
                srcFs = srcf.getFileSystem(conf);
                srcs = srcFs.globStatus(srcf);
            } catch (IOException var20) {
                throw new HiveException("Getting globStatus " + srcf.toString(), var20);
            }

            if (srcs == null) {
                LOG.info("No sources specified to move: " + srcf);
            } else {
                List<List<Path[]>> result = checkPaths(conf, destFs, srcs, srcFs, destf, true);
                if (oldPath != null) {
                    try {
                        FileSystem fs2 = oldPath.getFileSystem(conf);
                        if (fs2.exists(oldPath)) {
                            if (FileUtils.isSubDir(oldPath, destf, fs2)) {
                                FileUtils.trashFilesUnderDir(fs2, oldPath, conf);
                            }

                            if (inheritPerms) {
                                inheritFromTable(tablePath, destf, conf, destFs);
                            }
                        }
                    } catch (Exception var19) {
                        LOG.warn("Directory " + oldPath.toString() + " cannot be removed: " + var19, var19);
                    }
                }

                if (srcs.length == 1 && srcs[0].isDir()) {
                    Path destfp = destf.getParent();
                    if (!destFs.exists(destfp)) {
                        boolean success = destFs.mkdirs(destfp);
                        if (!success) {
                            LOG.warn("Error creating directory " + destf.toString());
                        }

                        if (inheritPerms && success) {
                            inheritFromTable(tablePath, destfp, conf, destFs);
                        }
                    }

                    for (List<Path[]> sdpairs : result) {
                        for (Path[] sdpair : sdpairs) {
                            Path destParent = sdpair[1].getParent();
                            FileSystem destParentFs = destParent.getFileSystem(conf);
                            if (!destParentFs.isDirectory(destParent)) {
                                boolean success = destFs.mkdirs(destParent);
                                if (!success) {
                                    LOG.warn("Error creating directory " + destParent);
                                }

                                if (inheritPerms && success) {
                                    inheritFromTable(tablePath, destParent, conf, destFs);
                                }
                            }

                            if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true, isSrcLocal)) {
                                throw new IOException("Unable to move file/directory from " + sdpair[0] + " to " + sdpair[1]);
                            }
                        }
                    }
                } else {
                    if (!destFs.exists(destf)) {
                        boolean success = destFs.mkdirs(destf);
                        if (!success) {
                            LOG.warn("Error creating directory " + destf.toString());
                        }

                        if (inheritPerms && success) {
                            inheritFromTable(tablePath, destf, conf, destFs);
                        }
                    }

                    for (List<Path[]> sdpairs : result) {
                        for (Path[] sdpair : sdpairs) {
                            if (!moveFile(conf, sdpair[0], sdpair[1], destFs, true, isSrcLocal)) {
                                throw new IOException("Error moving: " + sdpair[0] + " into: " + sdpair[1]);
                            }
                        }
                    }
                }

            }
        } catch (IOException var21) {
            throw new HiveException(var21.getMessage(), var21);
        }
    }

    public static boolean trashFilesUnderDir(FileSystem fs, Path f, Configuration conf) throws FileNotFoundException, IOException {
        FileStatus[] statuses = fs.listStatus(f, HIDDEN_FILES_PATH_FILTER);
        boolean result = true;
        // Each file is moved to the trash one at a time, in a single thread.
        for (FileStatus status : statuses) {
            result &= moveToTrash(fs, status.getPath(), conf);
        }
        return result;
    }

When Hive runs loadPartition and the partition directory already exists, it calls replaceFiles, which calls trashFilesUnderDir, and that method moves the old files to the trash one by one.
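For intuition on why that serial loop hurts, here is a minimal, hypothetical sketch (not from the original investigation) that does roughly what the Hive 1.2 loop above boils down to when trash is enabled: move every file under a directory to the HDFS trash one at a time via Hadoop's Trash.moveToAppropriateTrash. Each call is a synchronous operation against the NameNode, so a partition with tens of thousands of small files pays tens of thousands of sequential round trips before any new data is loaded. The directory path below is made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class SerialTrashTiming {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical partition directory containing many small files.
        Path dir = new Path("/user/hive/warehouse/temp.db/test2/dt=2018-12-01");
        FileSystem fs = dir.getFileSystem(conf);

        long start = System.currentTimeMillis();
        int moved = 0;
        for (FileStatus status : fs.listStatus(dir)) {
            // One synchronous NameNode operation per file, just like the serial loop above.
            Trash.moveToAppropriateTrash(fs, status.getPath(), conf);
            moved++;
        }
        System.out.println("Trashed " + moved + " files serially in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}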

Spark's loadPartition simply invokes Hive's own logic via reflection, so why is it so much slower than Hive itself?

At this point we noticed the Hive versions: the cluster runs Hive 2.1, while Spark 2.1.1 depends on Hive 1.2. Comparing the code between Hive 1.2 and Hive 2.1 shows a real difference. Here is the Hive 2.1 code:

org.apache.hadoop.hive.ql.metadata.Hive (version 2.1)

  /**
   * Trashes or deletes all files under a directory. Leaves the directory as is.
   * @param fs FileSystem to use
   * @param f path of directory
   * @param conf hive configuration
   * @param forceDelete whether to force delete files if trashing does not succeed
   * @return true if deletion successful
   * @throws IOException
   */
  private boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf)
      throws IOException {
    FileStatus[] statuses = fs.listStatus(f, FileUtils.HIDDEN_FILES_PATH_FILTER);
    boolean result = true;
    final List<Future<Boolean>> futures = new LinkedList<>();
    final ExecutorService pool = conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25) > 0 ?
        Executors.newFixedThreadPool(conf.getInt(ConfVars.HIVE_MOVE_FILES_THREAD_COUNT.varname, 25),
        new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Delete-Thread-%d").build()) : null;
    final SessionState parentSession = SessionState.get();
    for (final FileStatus status : statuses) {
      if (null == pool) {
        result &= FileUtils.moveToTrash(fs, status.getPath(), conf);
      } else {
        futures.add(pool.submit(new Callable<Boolean>() {
          @Override
          public Boolean call() throws Exception {
            SessionState.setCurrentSessionState(parentSession);
            return FileUtils.moveToTrash(fs, status.getPath(), conf);
          }
        }));
      }
    }
    if (null != pool) {
      pool.shutdown();
      for (Future<Boolean> future : futures) {
        try {
          result &= future.get();
        } catch (InterruptedException | ExecutionException e) {
          LOG.error("Failed to delete: ",e);
          pool.shutdownNow();
          throw new IOException(e);
        }
      }
    }
    return result;
  }

As the code shows, Hive 2.1 trashes the files through a thread pool, whereas Hive 1.2 does it serially inside a for loop. With a large number of files, Hive 2.1 is therefore much faster than Hive 1.2 (and hence than Spark 2.1.1): tens of thousands of files mean tens of thousands of sequential NameNode round trips in the serial case, which a fixed thread pool can cut dramatically.

Spark depends on Hive through direct reflection calls, and many class and method signatures changed between Hive 1.2 and 2.1, so upgrading the bundled Hive is not practical. The workaround was to patch the trashFilesUnderDir code in the Hive 1.2 fork that Spark uses so that it, too, deletes files through a thread pool, the same way Hive 2.1 does. That resolved the problem.
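A minimal sketch of that kind of patch, assuming you rebuild the Hive 1.2 code that Spark depends on and replace the serial loop in the trashFilesUnderDir shown above (in the 1.2 code it is the static method called from replaceFiles as FileUtils.trashFilesUnderDir): fan the per-file moveToTrash calls out to a fixed-size daemon thread pool, mirroring the Hive 2.1 logic. The hard-coded pool size of 25 simply mirrors the default seen in the 2.1 code above; treat this as a sketch of the idea, not the exact diff that was applied.

// Additional imports needed in the patched file:
// import java.util.ArrayList;
// import java.util.List;
// import java.util.concurrent.Callable;
// import java.util.concurrent.ExecutionException;
// import java.util.concurrent.ExecutorService;
// import java.util.concurrent.Executors;
// import java.util.concurrent.Future;
// import com.google.common.util.concurrent.ThreadFactoryBuilder;

// Parallelized replacement for the serial trashFilesUnderDir above.
// HIDDEN_FILES_PATH_FILTER and moveToTrash are the existing members of the enclosing class.
public static boolean trashFilesUnderDir(final FileSystem fs, Path f, final Configuration conf)
        throws FileNotFoundException, IOException {
    FileStatus[] statuses = fs.listStatus(f, HIDDEN_FILES_PATH_FILTER);
    ExecutorService pool = Executors.newFixedThreadPool(25,
            new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Delete-Thread-%d").build());
    List<Future<Boolean>> futures = new ArrayList<Future<Boolean>>(statuses.length);
    for (final FileStatus status : statuses) {
        futures.add(pool.submit(new Callable<Boolean>() {
            @Override
            public Boolean call() throws Exception {
                // Same per-file operation as before, just issued from the pool.
                return moveToTrash(fs, status.getPath(), conf);
            }
        }));
    }
    pool.shutdown();
    boolean result = true;
    for (Future<Boolean> future : futures) {
        try {
            result &= future.get();
        } catch (InterruptedException | ExecutionException e) {
            pool.shutdownNow();
            throw new IOException(e);
        }
    }
    return result;
}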