Performance Comparison of Batch Put in HBase: Single Column vs. Multiple Columns

阿新 • Published: 2019-04-05


Author: 大圓那些事 | This article may be reposted, provided the original source and author are credited with a hyperlink.

URL: http://www.cnblogs.com/panfeng412/archive/2013/11/28/hbase-batch-put-performance-analysis-of-single-column-and-multiple-columns.html

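This post compares batch Put write performance in HBase for two table layouts: a single column family with a single column qualifier, and a single column family with multiple column qualifiers. It then uses the HBase source code to explain the difference.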

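1. Test results
When the client writes in batches, TPS and the number of RPCs differ greatly between the single-family single-column layout and the single-family multi-column layout. Taking the runs with 10 client threads and WAL enabled as an example:

Single-family single-column layout: TPS reaches 12403.87, with 53 actual RPCs;
Single-family multi-column layout: TPS is only 1730.68, with 478 actual RPCs.
TPS differs by a factor of about 7 and the RPC count by a factor of about 9. The test environment is not detailed here; what matters is the performance gap between the two setups.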

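2. Rough analysis
First, a rough analysis, from the perspective of HBase's storage model, of why this happens:

Each HBase KeyValue carries roughly 50~60 bytes of its own fields (see org/apache/hadoop/hbase/KeyValue.java in the HBase source). For a client Put of one row (53 fields, a 64-byte row key, and 751 bytes of values in total):

1) WAL on, single column family, single column qualifier, batch Put: (50~60) + 64 + 751 = 865~875 bytes;

2) WAL on, single column family, multiple column qualifiers, batch Put: ((50~60) + 64) * 53 + 751 = 6793~7323 bytes.

So overall the latter transfers (6793~7323 bytes) / (865~875 bytes) = 7.85~8.37 times as much data as the former, which roughly matches the measured 478 / 53 = 9.0 times (because the client write buffer size is the same in both cases, the ratio of request counts reflects the ratio of data actually transferred).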
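The estimate above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the class and method names are hypothetical, and the 50~60 byte per-KeyValue overhead is the rough figure quoted above, not a value read from HBase.

```java
// Rough per-row size estimate for the two layouts discussed above.
// All names here are illustrative; nothing in this class calls HBase.
final class RowSizeEstimate {

    // One KeyValue carrying the whole value: overhead + row key + value.
    static long singleColumnBytes(long kvOverhead, long rowKeyLen, long valueLen) {
        return kvOverhead + rowKeyLen + valueLen;
    }

    // One KeyValue per column: overhead and row key are repeated for every column,
    // while the value bytes are only split across columns, not repeated.
    static long multiColumnBytes(long kvOverhead, long rowKeyLen, long valueLen, int columns) {
        return (kvOverhead + rowKeyLen) * columns + valueLen;
    }

    public static void main(String[] args) {
        long rowKeyLen = 64;
        long valueLen = 751;
        int columns = 53;
        // Evaluate both ends of the assumed 50~60 byte overhead range.
        for (long overhead : new long[] {50, 60}) {
            long single = singleColumnBytes(overhead, rowKeyLen, valueLen);
            long multi = multiColumnBytes(overhead, rowKeyLen, valueLen, columns);
            System.out.printf("overhead=%d: single=%d, multi=%d, ratio=%.2f%n",
                    overhead, single, multi, (double) multi / single);
        }
    }
}
```

Separating the per-column cost (overhead plus row key) from the value bytes makes it clear which part of the row is multiplied by the column count.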

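3. Source code analysis
Estimates alone prove little, so let us check the figures above against the HBase source:

After the client calls put, HBase adds put.heapSize() to the running size of the client-side buffer, and calls flushCommits() to send the buffered data to the server when any of the following holds:

1) The put call may be given a List<Put>; in that case, every DOPUT_WB_CHECK Puts (10 by default) the client checks whether the buffered data exceeds writeBufferSize (set to 5MB in the test) and, if so, forces a flush;

2) autoFlush is true, in which case a flush is performed after the put call;

3) autoFlush is false but the buffered data already exceeds writeBufferSize, in which case a flush is performed.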

private void doPut(final List<Put> puts) throws IOException {
  int n = 0;
  for (Put put : puts) {
    validatePut(put);
    writeBuffer.add(put);
    currentWriteBufferSize += put.heapSize();
    // we need to periodically see if the writebuffer is full instead
    // of waiting until the end of the List
    n++;
    if (n % DOPUT_WB_CHECK == 0
        && currentWriteBufferSize > writeBufferSize) {
      flushCommits();
    }
  }
  if (autoFlush || currentWriteBufferSize > writeBufferSize) {
    flushCommits();
  }
}
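To see how the per-Put heap size drives the number of flushes, the buffering logic can be modelled without any HBase dependency. The following is a minimal sketch, not HBase code: the class WriteBufferModel, the Put sizes of 1000 and 8000 bytes, and the count of 100,000 Puts are all hypothetical, and 5MB is assumed to mean 5 * 1024 * 1024 bytes.

```java
import java.util.Collections;
import java.util.List;

// Standalone model of the buffering logic in doPut().
// It tracks only byte counts; no HBase classes or RPCs are involved.
final class WriteBufferModel {
    static final int DOPUT_WB_CHECK = 10; // same check interval as in the excerpt

    private final long writeBufferSize;
    private final boolean autoFlush;
    private long currentWriteBufferSize = 0;
    private int flushCount = 0;

    WriteBufferModel(long writeBufferSize, boolean autoFlush) {
        this.writeBufferSize = writeBufferSize;
        this.autoFlush = autoFlush;
    }

    // Mirrors doPut(): each list element is the heap size of one hypothetical Put.
    void doPut(List<Long> putHeapSizes) {
        int n = 0;
        for (long heapSize : putHeapSizes) {
            currentWriteBufferSize += heapSize;
            n++;
            // Periodic check inside the loop, every DOPUT_WB_CHECK Puts.
            if (n % DOPUT_WB_CHECK == 0 && currentWriteBufferSize > writeBufferSize) {
                flushCommits();
            }
        }
        // Final check once the whole list has been buffered.
        if (autoFlush || currentWriteBufferSize > writeBufferSize) {
            flushCommits();
        }
    }

    // Stand-in for the real flush: count it and empty the buffer.
    void flushCommits() {
        flushCount++;
        currentWriteBufferSize = 0;
    }

    int flushCount() {
        return flushCount;
    }

    public static void main(String[] args) {
        long fiveMb = 5L * 1024 * 1024; // assumed meaning of "5MB"

        WriteBufferModel smallPuts = new WriteBufferModel(fiveMb, false);
        smallPuts.doPut(Collections.nCopies(100_000, 1_000L));

        WriteBufferModel largePuts = new WriteBufferModel(fiveMb, false);
        largePuts.doPut(Collections.nCopies(100_000, 8_000L));

        System.out.println("flushes with 1000-byte Puts: " + smallPuts.flushCount());
        System.out.println("flushes with 8000-byte Puts: " + largePuts.flushCount());
    }
}
```

Under these assumptions the model counts 19 flushes for the smaller Puts and 151 for the larger ones, so the flush count grows roughly in proportion to the per-Put size. Data still sitting in the buffer at the end is not counted here.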

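As the doPut() code shows, the client accumulates put.heapSize() and uses the total as the flush criterion. We can therefore build Put objects that match the test data and read off how much client buffer one test row (53 fields, 751 bytes in total) actually occupies: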

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutHeapSize {
    public static void main(String[] args) {
        // Single-column Put: one KeyValue holding the whole 751-byte value
        byte[] rowKey = new byte[64];
        byte[] value = new byte[751];
        Put singleColumnPut = new Put(rowKey);
        // Put.add(family, qualifier, value) is the API of the HBase version used here;
        // later releases name this method addColumn
        singleColumnPut.add(Bytes.toBytes("t"), Bytes.toBytes("col"), value);
        System.out.println("single column Put size: " + singleColumnPut.heapSize());

        // Multi-column Put: 53 KeyValues added with a null value, so heapSize()
        // covers only the per-column overhead; the 751 value bytes are added once below
        value = null;
        Put multipleColumnsPut = new Put(rowKey);
        for (int i = 0; i < 53; i++) {
            multipleColumnsPut.add(Bytes.toBytes("t"), Bytes.toBytes("col" + i), value);
        }
        System.out.println("multiple columns Put size: " + (multipleColumnsPut.heapSize() + 751));
    }
}

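The program prints:

single column Put size: 1208
multiple columns Put size: 10575

From this output, 10575/1208 = 8.75, which is very close to the theoretical estimate above (7.85~8.37 times) and to the measured value (9.0 times), so it largely confirms the test results.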
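Another way to read the two heapSize() values is as the number of rows that fit into the client write buffer before a flush. The sketch below is illustrative only: the class name is hypothetical, it assumes 5MB means 5 * 1024 * 1024 bytes, and it ignores the DOPUT_WB_CHECK check interval.

```java
// Rows that fit into one client write buffer, using the heapSize() values
// printed by PutHeapSize. Illustrative arithmetic only; no HBase calls.
final class RowsPerFlush {

    // Whole rows that fit before the buffer limit is reached.
    static long rowsPerBuffer(long writeBufferSize, long putHeapSize) {
        return writeBufferSize / putHeapSize;
    }

    public static void main(String[] args) {
        long writeBufferSize = 5L * 1024 * 1024; // assumed meaning of "5MB"
        long singleColumnPut = 1208;             // from the program output
        long multiColumnPut = 10575;             // from the program output

        long singleRows = rowsPerBuffer(writeBufferSize, singleColumnPut);
        long multiRows = rowsPerBuffer(writeBufferSize, multiColumnPut);

        System.out.println("single column rows per buffer: " + singleRows);
        System.out.println("multiple columns rows per buffer: " + multiRows);
        System.out.printf("ratio: %.2f%n", (double) singleRows / multiRows);
    }
}
```

With these inputs a buffer holds about 4340 single-column rows but only about 495 multi-column rows, so the multi-column layout needs roughly 8.8 times as many flushes for the same number of rows.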

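If you are curious about put.heapSize(), read its implementation: for a Put object, the sizes of its KeyValue objects dominate the overall heapSize. To confirm this with a concrete example, the following code computes the heapSize of one row's KeyValue objects in the single-column and multi-column cases: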

import org.apache.hadoop.hbase.KeyValue;

public class KeyValueHeapSize {
    public static void main(String[] args) {
        // Single-column case: one KeyValue holding the whole 751-byte value
        byte[] row = new byte[64];       // test row length
        byte[] family = new byte[1];     // test family length
        byte[] qualifier = new byte[4];  // test qualifier length
        long timestamp = 123456L;        // ts
        byte[] value = new byte[751];    // test value length
        KeyValue singleColumnKv = new KeyValue(row, family, qualifier, timestamp, value);
        System.out.println("single column KeyValue size: " + singleColumnKv.heapSize());

        // Multi-column case: measure one KeyValue without a value, multiply by the
        // 53 columns, then add the 751 value bytes once
        value = null;
        KeyValue multipleColumnsWithoutValueKv = new KeyValue(row, family, qualifier, timestamp, value);
        System.out.println("multiple columns KeyValue size: " + (multipleColumnsWithoutValueKv.heapSize() * 53 + 751));
    }
}

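The program prints:

single column KeyValue size: 920
multiple columns KeyValue size: 10079

Compared with the output of PutHeapSize above, the KeyValue objects do account for most of each Put's heapSize. At the KeyValue level, the ratio of data transferred between the two cases is 10079/920 ≈ 10.96, which is also fairly close to the measured value.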
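The printed totals also let us back out what a single value-less KeyValue costs on the heap. The following sketch only rearranges the numbers above; the class and method names are hypothetical.

```java
// Derives the per-KeyValue heap cost implied by the KeyValueHeapSize output.
// Pure arithmetic on the printed numbers; no HBase classes are used.
final class KeyValueOverhead {

    // Inverts total = perKeyValue * columns + valueLen for perKeyValue.
    static long perEmptyKeyValue(long multiColumnTotal, long valueLen, int columns) {
        return (multiColumnTotal - valueLen) / columns;
    }

    public static void main(String[] args) {
        long multiColumnTotal = 10079; // from the program output
        long valueLen = 751;
        int columns = 53;

        long perKv = perEmptyKeyValue(multiColumnTotal, valueLen, columns);
        System.out.println("heap bytes per value-less KeyValue: " + perKv);
        // Every column beyond the first repeats this cost.
        System.out.println("extra bytes for the other " + (columns - 1)
                + " columns: " + perKv * (columns - 1));
    }
}
```

Each value-less KeyValue accounts for 176 bytes here, so the 52 additional columns add about 9 KB to a row whose actual values total only 751 bytes.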

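4. Conclusions
The analysis above leads to the following conclusions:

In practice, when comparing the single column qualifier and multiple column qualifier layouts, the longer the value, the shorter the row key, and the fewer the fields (column qualifiers), the smaller the gap in data actually transferred; conversely, the gap grows.
If data is stored with multiple column qualifiers and the client writes in batches, consider increasing the client write buffer size as appropriate to improve client write throughput.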
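The first conclusion can be checked numerically with the rough size model from section 2. This is a sketch under stated assumptions: the class name is hypothetical, and the per-KeyValue overhead is fixed at 55 bytes, the midpoint of the 50~60 byte range, purely for illustration.

```java
// How the multi-column / single-column size ratio reacts to value length,
// row key length, and column count, using the rough model from section 2.
final class LayoutRatio {
    static final long KV_OVERHEAD = 55; // assumed midpoint of the 50~60 byte range

    static double ratio(long rowKeyLen, long valueLen, int columns) {
        double single = KV_OVERHEAD + rowKeyLen + valueLen;
        double multi = (double) (KV_OVERHEAD + rowKeyLen) * columns + valueLen;
        return multi / single;
    }

    public static void main(String[] args) {
        System.out.printf("baseline (64-byte key, 751-byte value, 53 columns): %.2f%n",
                ratio(64, 751, 53));
        System.out.printf("10x longer value: %.2f%n", ratio(64, 7510, 53));
        System.out.printf("16-byte row key:  %.2f%n", ratio(16, 751, 53));
        System.out.printf("10 columns:       %.2f%n", ratio(64, 751, 10));
    }
}
```

Each variation moves the ratio toward 1, in line with the conclusion that long values, short row keys, and few columns narrow the gap between the two layouts.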
