1. 程式人生 > >網路原因造成 spark task 卡住

網路原因造成 spark task 卡住

主機名映射出錯

背景

Yarn叢集新加入了一批Spark機器後發現執行Spark任務時,一些task會無限卡住且driver端沒有任何提示。

這裡寫圖片描述

解決

進入task卡住的節點檢視container stderr日誌,發現在獲取其他節點block資訊時,連線不上其他的機器節點,不停重試。
這裡寫圖片描述
懷疑部分舊節點的/etc/hosts檔案被運維更新漏了,檢視/etc/hosts,發現沒有加入新節點的地址,加入後問題解決。

在叢集節點不斷增多的情況下,可以使用dns避免某些節點忘記修改/etc/hosts檔案導致的錯誤。

產生原因

在讀取shuffle資料時,本地的block會從本地的BlockManager讀取資料塊,遠端的block則通過 BlockTransferService 讀取,其中包含了hostname作為地址資訊,如果沒有hostname和ip的對映資訊,則會獲取失敗,將呼叫
RetryingBlockFetcher進行重試,如果繼續失敗則會丟擲異常: Exception while beginning fetch …

NettyBlockTransferService

override def fetchBlocks(
    host: String,
    port: Int,
    execId: String,
    blockIds: Array[String],
    listener: BlockFetchingListener): Unit = {
  logTrace(s"Fetch blocks from $host:$port (executor id $execId)")
  try {
    val blockFetchStarter = new RetryingBlockFetcher.BlockFetchStarter {
      override def createAndStart(blockIds: Array[String], listener: BlockFetchingListener) {
        val client = clientFactory.createClient(host, port)
        new OneForOneBlockFetcher(client, appId, execId, blockIds.toArray, listener).start()
      }
    }

    val maxRetries = transportConf.maxIORetries()
    if (maxRetries > 0) {
      // Note this Fetcher will correctly handle maxRetries == 0; we avoid it just in case there's
      // a bug in this code. We should remove the if statement once we're sure of the stability.
      new RetryingBlockFetcher(transportConf, blockFetchStarter, blockIds, listener).start()
    } else {
      blockFetchStarter.createAndStart(blockIds, listener)
    }
  } catch {
    case e: Exception =>
      logError("Exception while beginning fetchBlocks", e)
      blockIds.foreach(listener.onBlockFetchFailure(_, e))
  }
}

RetryingBlockFetcher

private void fetchAllOutstanding() {
  // Start by retrieving our shared state within a synchronized block.
  String[] blockIdsToFetch;
  int numRetries;
  RetryingBlockFetchListener myListener;
  synchronized (this) {
    blockIdsToFetch = outstandingBlocksIds.toArray(new String[outstandingBlocksIds.size()]);
    numRetries = retryCount;
    myListener = currentListener;
  }

  // Now initiate the fetch on all outstanding blocks, possibly initiating a retry if that fails.
  try {
    fetchStarter.createAndStart(blockIdsToFetch, myListener);
  } catch (Exception e) {
    logger.error(String.format("Exception while beginning fetch of %s outstanding blocks %s",
      blockIdsToFetch.length, numRetries > 0 ? "(after " + numRetries + " retries)" : ""), e);

    if (shouldRetry(e)) {
      initiateRetry();
    } else {
      for (String bid : blockIdsToFetch) {
        listener.onBlockFetchFailure(bid, e);
      }
    }
  }
}

讀取kafka失敗,不斷重試

這裡寫圖片描述

原因

防火牆原因連不上9092埠
這裡寫圖片描述

總結

遇到類似問題一般在executor的日誌下都能找到結果