A Study of Problems Caused by HBase Coprocessors

阿新 • Published: 2019-01-20

1 Loading the coprocessor
1.1 Upload the coprocessor to HDFS:
hadoop fs -mkdir /hbasenew/usercoprocesser
hadoop fs -ls /hbasenew/usercoprocesser
hadoop fs -rm /hbasenew/usercoprocesser/coprocessor.jar
hadoop fs -copyFromLocal /home/hbase/coprocessor.jar /hbasenew/usercoprocesser
1.2 Attach the coprocessor to the table:
1) First unload the existing coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att_unset',NAME =>'coprocessor$1'
enable 'ns_bigdata:tb_test_coprocesser'
2) Then load the coprocessor:
disable 'ns_bigdata:tb_test_coprocesser'
alter 'ns_bigdata:tb_test_coprocesser',METHOD => 'table_att','coprocessor' => '/hbasenew/usercoprocesser/coprocessor.jar|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|'
enable 'ns_bigdata:tb_test_coprocesser'
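Note: when loading the coprocessor I deliberately left the final letter t off the class name (HelloWorldEndPoin instead of HelloWorldEndPoint), in order to reproduce both the region server crash and the inconsistent table state.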
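A typo like this one can be caught before running alter. The following is only a minimal sketch, assuming the attribute value follows the path|class|priority|args layout used in the command above and that a local copy of the jar is available. CoprocessorSpecCheck and its methods are hypothetical names introduced for illustration; they are not part of HBase.

```java
import java.io.IOException;
import java.util.jar.JarFile;

/** Hypothetical helper (not part of HBase): sanity-checks a coprocessor table attribute. */
final class CoprocessorSpecCheck {
    final String jarPath;
    final String className;
    final int priority;

    private CoprocessorSpecCheck(String jarPath, String className, int priority) {
        this.jarPath = jarPath;
        this.className = className;
        this.priority = priority;
    }

    /** Splits "jarPath|className|priority|args" into its first three fields. */
    static CoprocessorSpecCheck parse(String spec) {
        String[] parts = spec.split("\\|", -1);
        if (parts.length < 3) {
            throw new IllegalArgumentException("expected path|class|priority[|args]: " + spec);
        }
        return new CoprocessorSpecCheck(
            parts[0].trim(), parts[1].trim(), Integer.parseInt(parts[2].trim()));
    }

    /** Name of the jar entry that must exist for the class to be loadable. */
    String classEntryName() {
        return className.replace('.', '/') + ".class";
    }

    /** Returns true if the given local jar file contains the class file. */
    boolean existsIn(String localJar) throws IOException {
        try (JarFile jar = new JarFile(localJar)) {
            return jar.getEntry(classEntryName()) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        CoprocessorSpecCheck spec = parse(
            "/hbasenew/usercoprocesser/coprocessor.jar"
                + "|com.suning.hbase.coprocessor.service.HelloWorldEndPoin|1001|");
        System.out.println("class entry: " + spec.classEntryName());
        if (args.length == 1) {
            // args[0] is assumed to be a local copy of coprocessor.jar
            System.out.println("found in jar: " + spec.existsIn(args[0]));
        }
    }
}
```

Run against the local /home/hbase/coprocessor.jar, a check like this would report the misspelled class as missing before the table is altered.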

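2 Problems observed
The steps above cause two problems:
2.1 The cluster's region servers crash

2.2 The table carrying the coprocessor is left in an inconsistent state, stuck in ENABLING

Neither disable nor enable can be run against the table:

At the same time, the following error appears on the region server hosting this table: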

3 Root-cause analysis
3.1 Why a coprocessor load failure crashes the region server
In the HBase source code, the parameter hbase.coprocessor.abortonerror defaults to true:
public static final String ABORT_ON_ERROR_KEY = "hbase.coprocessor.abortonerror";
public static final boolean DEFAULT_ABORT_ON_ERROR = true;
The parameter is documented as follows:
<property>
      <name>hbase.coprocessor.abortonerror</name>
      <value>true</value>
      <description>Set to true to cause the hosting server (master or regionserver)
      to abort if a coprocessor fails to load, fails to initialize, or throws an
      unexpected Throwable object. Setting this to false will allow the server to
      continue execution but the system wide state of the coprocessor in question
      will become inconsistent as it will be properly executing in only a subset
      of servers, so this is most useful for debugging only.</description>
</property>
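So with the default setting, a coprocessor that fails to load brings the region server down.

3.2 Why the table carrying the coprocessor ends up in an inconsistent state:
Relevant error log:

The relevant source code for enable: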
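The effect of the flag can be illustrated with a small model. This is a simplified sketch of the behaviour described in the property text above, not HBase's actual loading code; AbortOnErrorModel and Outcome are hypothetical names, and plain Class.forName stands in for the region server's coprocessor class loading.

```java
/** Simplified model of hbase.coprocessor.abortonerror; not HBase code. */
final class AbortOnErrorModel {
    enum Outcome { LOADED, SKIPPED, SERVER_ABORTED }

    /**
     * Tries to load the class by name. On failure the outcome depends on
     * abortOnError: true models the hosting server aborting, false models
     * the server continuing without the coprocessor.
     */
    static Outcome load(String className, boolean abortOnError) {
        try {
            Class.forName(className);
            return Outcome.LOADED;
        } catch (ClassNotFoundException | LinkageError e) {
            return abortOnError ? Outcome.SERVER_ABORTED : Outcome.SKIPPED;
        }
    }

    public static void main(String[] args) {
        // The misspelled class name from section 1.2 is not on the classpath.
        String broken = "com.suning.hbase.coprocessor.service.HelloWorldEndPoin";
        System.out.println("abortonerror=true:  " + load(broken, true));
        System.out.println("abortonerror=false: " + load(broken, false));
    }
}
```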
public void enableTable(final TableName tableName)
throws IOException {
    enableTableAsync(tableName);

    // Wait until all regions are enabled
    waitUntilTableIsEnabled(tableName);

    LOG.info("Enabled table " + tableName);
}
private void waitUntilTableIsEnabled(final TableName tableName) throws IOException {
    boolean enabled = false;
    long start = EnvironmentEdgeManager.currentTimeMillis();
    for (int tries = 0; tries < (this.numRetries * this.retryLongerMultiplier); tries++) {
      try {
        enabled = isTableEnabled(tableName);
      } catch (TableNotFoundException tnfe) {
        // wait for table to be created
        enabled = false;
      }
      enabled = enabled && isTableAvailable(tableName);
      if (enabled) {
        break;
      }
      long sleep = getPauseTime(tries);
      if (LOG.isDebugEnabled()) {
        LOG.debug("Sleeping= " + sleep + "ms, waiting for all regions to be " +
          "enabled in " + tableName);
      }
      try {
        Thread.sleep(sleep);
      } catch (InterruptedException e) {
        // Do this conversion rather than let it out because do not want to
        // change the method signature.
        throw (InterruptedIOException)new InterruptedIOException("Interrupted").initCause(e);
      }
    }
    if (!enabled) {
      long msec = EnvironmentEdgeManager.currentTimeMillis() - start;
      throw new IOException("Table '" + tableName +
        "' not yet enabled, after " + msec + "ms.");
    }
}

===========================================================================
/**
   * Brings a table on-line (enables it). Method returns immediately though
   * enable of table may take some time to complete, especially if the table
   * is large (All regions are opened as part of enabling process). Check
   * {@link #isTableEnabled(byte[])} to learn when table is fully online. If
   * table is taking too long to online, check server logs.
   * @param tableName
   * @throws IOException
   * @since 0.90.0
   */
public void enableTableAsync(final TableName tableName)
throws IOException {
    TableName.isLegalFullyQualifiedTableName(tableName.getName());
    executeCallable(new MasterCallable<Void>(getConnection()) {
      @Override
      public Void call() throws ServiceException {
        LOG.info("Started enable of " + tableName);
        EnableTableRequest req = RequestConverter.buildEnableTableRequest(tableName);
        master.enableTable(null,req);
        return null;
      }
    });
}
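The code shows that enable first issues the enable request (enableTableAsync) and then waits in waitUntilTableIsEnabled for the region servers to report that every region is online. Because the region server has already crashed by then, the client keeps retrying and waiting, and the table stays in the ENABLING state the whole time.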
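The waiting pattern can be reduced to a bounded polling loop. The sketch below is a simplified model of waitUntilTableIsEnabled, not the HBase implementation: EnablePollModel and waitUntil are hypothetical names, the linear back-off is an assumption standing in for getPauseTime, and a boolean return replaces the IOException thrown by the original.

```java
import java.util.function.BooleanSupplier;

/** Simplified model of the enable wait loop; not HBase code. */
final class EnablePollModel {
    /**
     * Polls check up to maxTries times, sleeping longer after each failed try.
     * Returns true as soon as check succeeds, false if every try failed
     * or the thread was interrupted while sleeping.
     */
    static boolean waitUntil(BooleanSupplier check, int maxTries, long pauseMs) {
        for (int tries = 0; tries < maxTries; tries++) {
            if (check.getAsBoolean()) {
                return true;
            }
            try {
                // Assumed linear back-off: pauseMs, 2 * pauseMs, 3 * pauseMs, ...
                Thread.sleep(pauseMs * (tries + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A crashed region server never reports its regions online,
        // so the check never succeeds and the caller eventually gives up.
        boolean enabled = waitUntil(() -> false, 3, 10L);
        System.out.println("enabled: " + enabled);
    }
}
```

In this model the caller merely gives up; nothing resets the table state, which matches the table remaining in ENABLING after the client stops waiting.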

4 Resolving the problems
4.1 Fixing the region server crash:
Set the following parameter in hbase-site.xml:
    <property>
    <name>hbase.coprocessor.abortonerror</name>
    <value>false</value>
    </property>
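Then start the region servers. Coprocessor errors are now ignored instead of aborting the server, which keeps the cluster available.
4.2 Fixing the inconsistent state of the table with the coprocessor, which can be neither disabled nor enabled:
This can be resolved by failing over the master. Stop the active master and a backup master takes over its role; during the failover, tables in an inconsistent state are brought back to a consistent one:

Master information after the failover:

The following method is called during the failover: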
/**
   * Recover the tables that are not fully moved to ENABLED state. These tables
   * are in ENABLING state when the master restarted/switched
   *
   * @throws KeeperException
   * @throws org.apache.hadoop.hbase.TableNotFoundException
   * @throws IOException
   */
private void recoverTableInEnablingState()
      throws KeeperException, TableNotFoundException, IOException {
    Set<TableName> enablingTables = ZKTable.getEnablingTables(watcher);
    if (enablingTables.size() != 0) {
      for (TableName tableName : enablingTables) {
        // Recover by calling EnableTableHandler
        LOG.info("The table " + tableName
            + " is in ENABLING state. Hence recovering by moving the table"
            + " to ENABLED state.");
        // enableTable in sync way during master startup,
        // no need to invoke coprocessor
        EnableTableHandler eth = new EnableTableHandler(this.server, tableName,
          catalogTracker, this, tableLockManager, true);
        try {
          eth.prepare();
        } catch (TableNotFoundException e) {
          LOG.warn("Table " + tableName + " not found in hbase:meta to recover.");
          continue;
        }
        eth.process();
      }
    }
}
During the failover, I followed the background logs of the master and of the corresponding region server:
Master log:
Part of the log reads:
2015-05-20 10:00:01,398 INFO [master:nim-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state. Hence recovering by moving the table to ENABLED state.
2015-05-20 10:00:01,421 DEBUG [master:nim-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000002
2015-05-20 10:00:01,436 INFO [master:nim-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 10:00:01,465 INFO [master:nim-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 10:00:01,466 INFO [master:nim-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 10:00:01,466 INFO [master:nim-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true
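The log for the corresponding region server follows (note that these entries are also written by master.AssignmentManager, on host sup02-pre):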
2015-05-20 14:39:56,175 INFO [master:sup02-pre:60000] master.AssignmentManager: The table ns_bigdata:tb_test_coprocesser is in ENABLING state. Hence recovering by moving the table to ENABLED state.
2015-05-20 14:39:56,211 DEBUG [master:sup02-pre:60000] lock.ZKInterProcessLockBase: Acquired a lock for /hbasen/table-lock/ns_bigdata:tb_test_coprocesser/write-master:600000000000031
2015-05-20 14:39:56,235 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Attempting to enable the table ns_bigdata:tb_test_coprocesser
2015-05-20 14:39:56,269 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Table 'ns_bigdata:tb_test_coprocesser' has 1 regions, of which 1 are offline.
2015-05-20 14:39:56,270 INFO [master:sup02-pre:60000] balancer.BaseLoadBalancer: Reassigned 1 regions. 1 retained the pre-restart assignment.
2015-05-20 14:39:56,270 INFO [master:sup02-pre:60000] handler.EnableTableHandler: Bulk assigning 1 region(s) across 3 server(s), retainAssignment=true

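Conclusions:
1. To improve cluster availability, set hbase.coprocessor.abortonerror to false. Then even a faulty coprocessor will not crash the region servers, and the table will not end up unable to be enabled or disabled.
2. Even if a table can no longer be enabled or disabled, a master failover resolves it, so always provision at least one or two backup masters when building a cluster.

5 Read/write test with all master nodes down
1. With the whole cluster healthy, insert 2,000,000 rows from a client. The inserts succeed.
2. Stop all masters in the cluster:

3. Monitor the client's inserts: they continue normally. Keep the client inserting another 20,000,000 rows; the inserts still succeed.
4. Batch-read the data from the client. The reads succeed.
Conclusion: after all master nodes of an HBase cluster go down, client reads and writes keep working for some time (so far tested for up to half an hour).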

Reposted from: http://blog.csdn.net/dcswinner/article/details/46293041
