HBase cannot find a coprocessor, bringing down every RegionServer

阿新 • Published: 2019-01-23

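I. Background:

We share a big-data cluster with a sibling team, with data access controlled through Dataspace combined with Kerberos. Our production environment uses the OLAP tool Kylin. While upgrading Kylin we deleted the old coprocessor; existing tables kept looking for it, and when it could not be found every RegionServer exited. I could not make sense of how HBase handles coprocessors, so I looked up the material below, which cleared things up.

The following content is reposted. Original address: http://blog.itpub.net/12129601/viewspace-1690668/ It is kept here mainly for personal reference. Please credit the original author when reposting.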

II. Using coprocessors

1 Loading a coprocessor
1.1 Upload the coprocessor to HDFS:
hadoop fs -mkdir /hbasenew/usercoprocesser
hadoop fs -ls /hbasenew/usercoprocesser
hadoop fs -rm /hbasenew/usercoprocesser/coprocessor.jar
hadoop fs -copyFromLocal /home/hbase/coprocessor.jar /hbasenew/usercoprocesser/
1.2 Attach the coprocessor to the table:
1) First unload the existing coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att_unset',NAME =>'coprocessor$1'
enable 'ns_bigdata:tb_test_coprocesser'
2) Then load the coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att','coprocessor' => '/hbasenew/usercoprocesser/coprocessor.jar|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|'
enable 'ns_bigdata:tb_test_coprocesser'
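Note: when loading the coprocessor I deliberately dropped the final letter t from the class name, in order to reproduce both the regionserver crash and the inconsistent table state.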
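Since a one-letter typo in the class name is enough to trigger the failure, it may be worth checking the attribute value before running alter. The following is a minimal Java sketch under stated assumptions: CoprocessorPreflight and its methods are hypothetical helpers, not part of HBase, and the jar is assumed to be readable on the local filesystem (for example /home/hbase/coprocessor.jar before it is uploaded). It splits the "jar path|class name|priority|arguments" value used above and checks whether the jar actually contains the named class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.JarFile;

/** Hypothetical pre-flight helper for a table coprocessor attribute (not part of HBase). */
class CoprocessorPreflight {

    /** Splits "jarPath|className|priority|args" into four fields; missing trailing fields become "". */
    static String[] parseSpec(String spec) {
        String[] parts = spec.split("\\|", -1);
        if (parts.length < 2 || parts.length > 4) {
            throw new IllegalArgumentException("expected path|class|priority|args, got: " + spec);
        }
        String[] fields = {"", "", "", ""};
        System.arraycopy(parts, 0, fields, 0, parts.length);
        if (fields[1].trim().isEmpty()) {
            throw new IllegalArgumentException("class name is empty in: " + spec);
        }
        return fields;
    }

    /** Maps a fully qualified class name to the entry name it would have inside a jar. */
    static String classEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the local jar contains a .class entry for the given class name. */
    static boolean jarContainsClass(String localJarPath, String className) {
        try (JarFile jar = new JarFile(localJarPath)) {
            return jar.getEntry(classEntryName(className)) != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The attribute value from the alter command above, typo included.
        String spec = "/hbasenew/usercoprocesser/coprocessor.jar"
            + "|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|";
        String[] fields = parseSpec(spec);
        System.out.println("jar entry to look for: " + classEntryName(fields[1]));
        if (args.length == 1) {
            // args[0] is assumed to be a local copy of the jar.
            System.out.println("present in jar: " + jarContainsClass(args[0], fields[1]));
        }
    }
}
```

Run against a local copy of the jar, the check would report false for the misspelt HelloWorldEndPoin if the jar only contains HelloWorldEndPoint, before the table is ever altered.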

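2 Problems observed
The operations above cause two problems:
2.1 The cluster's region servers go down.

2.2 The table the coprocessor was attached to is left in an inconsistent state, stuck in ENABLING.

Neither disable nor enable can be carried out on the table.

The regionserver hosting the table also reports errors.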

3 Root cause analysis
3.1 Why a coprocessor load error brings down the regionserver
In the HBase source code, the parameter hbase.coprocessor.abortonerror defaults to true:
public static final String ABORT_ON_ERROR_KEY = "hbase.coprocessor.abortonerror";
public static final boolean DEFAULT_ABORT_ON_ERROR = true;
The description of this parameter reads:

      <property>
        <name>hbase.coprocessor.abortonerror</name>
        <value>true</value>
        <description>Set to true to cause the hosting server (master or regionserver)
        to abort if a coprocessor fails to load, fails to initialize, or throws an
        unexpected Throwable object. Setting this to false will allow the server to
        continue execution but the system wide state of the coprocessor in question
        will become inconsistent as it will be properly executing in only a subset
        of servers, so this is most useful for debugging only.</description>
      </property>
Consequently, once a faulty coprocessor is loaded, the regionserver goes down.
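The effect of the flag can be illustrated with a toy model. This is not HBase source code: AbortOnErrorModel and Server are hypothetical names, and a failing Class.forName stands in for a coprocessor that cannot be loaded.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the abort-on-error decision; a sketch, not how HBase implements it. */
class AbortOnErrorModel {

    /** A simulated server that hosts coprocessors. */
    static class Server {
        final boolean abortOnError;                     // models hbase.coprocessor.abortonerror
        boolean aborted = false;                        // true once the server has gone down
        final List<String> loaded = new ArrayList<>();  // coprocessors loaded successfully

        Server(boolean abortOnError) {
            this.abortOnError = abortOnError;
        }

        /** Tries to load a coprocessor class by name and applies the flag on failure. */
        void loadCoprocessor(String className) {
            try {
                Class.forName(className);
                loaded.add(className);
            } catch (ClassNotFoundException e) {
                if (abortOnError) {
                    aborted = true;   // the whole server stops
                }
                // Otherwise the coprocessor is skipped and the server keeps running.
            }
        }
    }

    public static void main(String[] args) {
        String misspelt = "com.suning.hbase.coprocessor.service.HelloWorldEndPoin";

        Server strict = new Server(true);
        strict.loadCoprocessor(misspelt);
        System.out.println("abortonerror=true  -> aborted: " + strict.aborted);

        Server lenient = new Server(false);
        lenient.loadCoprocessor(misspelt);
        System.out.println("abortonerror=false -> aborted: " + lenient.aborted);
    }
}
```

In the lenient case the server survives, but the coprocessor is simply absent on it, which is the inconsistency the property description warns about.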

3.2 Why the table carrying the coprocessor ends up in an inconsistent state:

The relevant enable source code:
public void enableTable(final TableName tableName)
throws IOException {
    enableTableAsync(tableName);

    // Wait until all regions are enabled
    waitUntilTableIsEnabled(tableName);

    LOG.info("Enabled table " + tableName);
}
private void waitUntilTableIsEnabled(final TableName tableName) throws IOException {
    boolean enabled = false;
    long start = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0; tries < (this.numRetries * this.retryLongerMultiplier); tries++) {
      try {
        enabled = isTableEnabled(tableName);
      } catch (TableNotFoundException tnfe) {
        // wait for table to be created
        enabled = false;
      }
      enabled = enabled && isTableAvailable(tableName);
      if (enabled) {
        break;
      }
      long sleep = getPauseTime(tries);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Sleeping= " + sleep + "ms, waiting for all regions to be " +
          "enabled in " + tableName);
      }
      try {
        Thread.sleep(sleep);
      } catch (InterruptedException e) {
        // Do this conversion rather than let it out because do not want to
        // change the method signature.
        throw (InterruptedIOException)new InterruptedIOException("Interrupted").initCause(e);
      }
    }
    if (!enabled) {
      long msec = EnvironmentEdgeManager.currentTimeMillis() - start;
      throw new IOException("Table '" + tableName +
        "' not yet enabled, after " + msec + "ms.");
    }
}

===========================================================================
/**
   * Brings a table on-line (enables it). Method returns immediately though
   * enable of table may take some time to complete, especially if the table
   * is large (All regions are opened as part of enabling process). Check
   * {@link #isTableEnabled(byte[])} to learn when table is fully online. If
   * table is taking too long to online, check server logs.
   * @param tableName
   * @throws IOException
   * @since 0.90.0
   */
public void enableTableAsync(final TableName tableName)
throws IOException {
    TableName.isLegalFullyQualifiedTableName(tableName.getName());
    executeCallable(new MasterCallable<Void>(getConnection()) {
      @Override
      public Void call() throws ServiceException {
        LOG.info("Started enable of " + tableName);
        EnableTableRequest req = RequestConverter.buildEnableTableRequest(tableName);
        master.enableTable(null,req);
        return null;
      }
    });
}
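The enable path therefore first issues the enable request and then waits for the regionservers to report the state of every region. Because the regionservers are already down at this point, the client keeps retrying and waiting, and the table stays in the ENABLING state.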
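The waiting logic boils down to a bounded polling loop. A stripped-down sketch follows; the names (WaitUntil, poll) are hypothetical, and a linearly growing pause is assumed in place of HBase's getPauseTime.

```java
import java.util.function.BooleanSupplier;

/** Bounded polling loop, modelled loosely on waitUntilTableIsEnabled (illustrative only). */
class WaitUntil {

    /**
     * Checks the condition up to maxTries times, sleeping pauseMs * (attempt number)
     * after each failed check. Returns true as soon as the condition holds, and
     * false if the tries run out or the thread is interrupted.
     */
    static boolean poll(BooleanSupplier condition, int maxTries, long pauseMs) {
        for (int tries = 0; tries < maxTries; tries++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pauseMs * (tries + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // preserve the interrupt status
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A condition that never becomes true stands in for a table whose
        // regions cannot come online because no regionserver is left.
        boolean enabled = poll(() -> false, 3, 10L);
        System.out.println("enabled: " + enabled);
    }
}
```

If the condition can never become true, as with a table whose regionservers are gone, the loop ends only when the retries are exhausted, and nothing in it moves the table out of ENABLING.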

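4 Fixing the problems
4.1 Fixing the regionserver crash:
Set the following parameter in hbase-site.xml:

    <property>
      <name>hbase.coprocessor.abortonerror</name>
      <value>false</value>
    </property>

and then start the region servers. Coprocessor errors are then ignored and the cluster stays available.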
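4.2 Fixing the inconsistent state of the table carrying the coprocessor (disable and enable both fail):
This can be resolved by failing over the master. Stop the active master; a backup master takes over its duties, and during the failover it brings tables stuck in an inconsistent state back to a consistent one.

The following method is called during the failover: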
/**
   * Recover the tables that are not fully moved to ENABLED state. These tables
   * are in ENABLING state when the master restarted/switched
   *
   * @throws KeeperException
   * @throws org.apache.hadoop.hbase.TableNotFoundException
   * @throws IOException
   */
private void recoverTableInEnablingState()
      throws KeeperException, TableNotFoundException, IOException {
    Set<TableName> enablingTables = ZKTable.getEnablingTables(watcher);
    if (enablingTables.size() != 0) {
      for (TableName tableName : enablingTables) {
        // Recover by calling EnableTableHandler
        LOG.info("The table " + tableName
            + " is in ENABLING state. Hence recovering by moving the table"
            + " to ENABLED state.");
        // enableTable in sync way during master startup,
        // no need to invoke coprocessor
        EnableTableHandler eth = new EnableTableHandler(this.server, tableName,
          catalogTracker, this, tableLockManager, true);
        try {
          eth.prepare();
        } catch (TableNotFoundException e) {
          LOG.warn("Table " + tableName + " not found in hbase:meta to recover.");
          continue;
        }
        eth.process();
      }
    }
}
During the failover, I followed the background logs of the master and of the corresponding regionserver.
Master log (excerpt):
2015-05-20 10:00:01,398 INFO [master:nim-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state. Hence recovering by moving the table to ENABLED state.
2015-05-20 10:00:01,421 DEBUG [master:nim-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000002
2015-05-20 10:00:01,436 INFO [master:nim-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 10:00:01,465 INFO [master:nim-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 10:00:01,466 INFO [master:nim-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 10:00:01,466 INFO [master:nim-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true
Log given for the corresponding regionserver (the entries shown come from the master thread on sup02-pre):
2015-05-20 14:39:56,175 INFO [master:sup02-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state. Hence recovering by moving the table to ENABLED state.
2015-05-20 14:39:56,211 DEBUG [master:sup02-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000031
2015-05-20 14:39:56,235 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 14:39:56,269 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 14:39:56,270 INFO [master:sup02-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 14:39:56,270 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true

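Conclusions:
1. To improve cluster availability, set hbase.coprocessor.abortonerror to false. A faulty coprocessor then no longer brings down the cluster's regionservers or leaves tables unable to be enabled or disabled. Keep in mind the caveat in the property description: with false, the coprocessor may end up running on only a subset of servers.
2. Even if a table does get stuck so that it can be neither enabled nor disabled, a master failover resolves it, so a cluster should always be built with at least one or two backup masters.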

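5 Read/write test with all master nodes down
1. With the cluster healthy, insert 2000000 rows through a client. The inserts succeed.
2. Stop all masters in the cluster.

3. Monitor the client: inserts continue normally. The client was left to insert a further 20000000 rows, all of which succeeded.
4. Batch-read data from the client: reads succeed.
Conclusion: after all master nodes of the HBase cluster have gone down, client reads and writes continue to work normally, at least for a limited period (half an hour in this test).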
