1. 程式人生 > >thrift — 記關於"Read a frame size of "的痛苦經歷

thrift — 記關於"Read a frame size of "的痛苦經歷

thrift開源出來好多年,得到了大家認可。我本人也覺得thrift非常優秀。CTO說要公司定製的RPC框架,起初基於netty開發了一套RPC框架,協議為json,因為考慮通用性和實用性,最終選擇了輕量級的RPC框架thrift為通訊框架。選定thrift之後,面臨著優化,擴充套件等等一系列問題。今天要講的就是因為一次“優化”產生的BUG及排查、解決的痛苦經歷。

異常資訊:

Read a frame size of 16842752, which is bigger than the maximum allowable buffer size for ALL connections

此異常是因為一次調整引數導致,為了限制客戶端傳送資料大小,MAX_READ_BUFFER_BYTES設定成1M,對於大多數介面足夠用了,程式碼如下:

protected static final int DEFAULT_MAX_LENGTH = 1024 * 1024;//1 m

TThreadedSelectorServer.Args tArgs = new TThreadedSelectorServer.Args(serverTransport); 
tArgs.maxReadBufferBytes = DEFAULT_MAX_LENGTH;

起初未太在意,因為不影響業務,就忙其他事情了。最近閒下來,檢視線上日誌時,發現此異常不少反增(業務量增加),就開始了痛苦的排查過程。

1.仔細讀了thrift原始碼,瞭解了異常產生原因。thrift協議不是太複雜,在接受資料時會呼叫FrameBuffer.read()方法,首先會讀取4個位元組,

ByteBuffer buffer_ = ByteBuffer.allocate(4);

private boolean internalRead() {
       try {
            if (trans_.read(buffer_) < 0) {
                return false;
            }
            return true;
        } catch (IOException e) {
            LOGGER.warn("Got an IOException in internalRead!"
, e); return false; } }

再更改狀態機state_為FrameBufferState.READING_FRAME

state_ = FrameBufferState.READING_FRAME;

讀取資料。異常正是此時產生,

// if this frame will always be too large for this server,
// log the
// error and close the connection.
if (frameSize > MAX_READ_BUFFER_BYTES) {
     列印異常
     return false;
}

資料長度超過了限制,此處MAX_READ_BUFFER_BYTES為1024 × 1024。

知道原因後開始排查

  1. 增加列印

    首先需要知道哪些連線會報此異常。所以在錯誤處增加了socket channel資訊的列印。

 LOGGER.error("[" + Thread.currentThread().getName() + "] Read a frame size of " + frameSize + ", which is bigger than the maximum allowable buffer size for ALL connections. client[" + socketChannel.socket().getRemoteSocketAddress() + "] will be closed in future. server[" + socketChannel.socket().getLocalSocketAddress() + "]");
打印出來的資訊為另一個idc機房的連接出現次數比較多,所以想是不是因為迴圈呼叫或者網路問題導致的該錯誤,所以進行了下一步。

 2. 單元測試

寫了個客戶端demo,把demo搬到idc執行,迴圈呼叫上千次,未發現此異常。增加測試複雜度,在多執行緒情況下呼叫,還是未產生此異常。這時感覺到事情沒想的那麼簡單。

 3. 增加詳細列印

 在客戶端傳送資料處和服務端接受處均增加了列印。客戶端程式碼在TServiceClient的sendBase()方法中,如下:
private void sendBase(String methodName, TBase<?, ?> args, byte type) throws TException {
        oprot_.writeMessageBegin(new TMessage(methodName, type, ++seqid_));
        args.write(oprot_);
        oprot_.writeMessageEnd();

        try {
            TTransport trans_ = oprot_.getTransport();
            if(trans_ instanceof TFramedTransport) {
                TFramedTransport framedTrans_ = (TFramedTransport)trans_;
                byte[] data_ = framedTrans_.getWriteBuffer();

                if(data_ != null && data_.length > 0) {

                    int len = framedTrans_.getWriteBufferLength();
                    StringBuffer sb = new StringBuffer("method name: " + methodName + ", data len: " + len + ",");
                    byte[] buf = new byte[4];

                    buf[0] = (byte)(0xff & (len >> 24));
                    buf[1] = (byte)(0xff & (len >> 16));
                    buf[2] = (byte)(0xff & (len >> 8));
                    buf[3] = (byte)(0xff & (len));

                    for(byte lbyte : buf) {
                        sb.append(lbyte).append(",");
                    }

                    sb.append(",data:");
                    for(int i = 0; i < data_.length; i++) {
                        sb.append(data_[i]).append(",");
                    }
                    LOGGER.error(sb.toString());
                }
            }
        } catch (Exception e) {
            LOGGER.error(e.getMessage(), e);
        }
        oprot_.getTransport().flush();
    }
    //為了跟thrift的flush方法效果一樣,我在TFramedTransport類裡增加了getWriteBufferLength()方法,獲取資料長度。
    int len = framedTrans_.getWriteBufferLength();

服務端在異常處增加了接受位元組的列印,程式碼如下:

if (frameSize > MAX_READ_BUFFER_BYTES) {

    SocketChannel socketChannel = (SocketChannel) selectionKey_.channel();
    try {
    ByteBuffer dst = ByteBuffer.allocate(100);
    buffer_.flip();
    dst.put(buffer_);
    int numberOfBytes = socketChannel.read(dst);
    dst.flip();
    LOGGER.error("number of bytes read: " + numberOfBytes);
    byte[] bytes = dst.array();
    if (bytes != null && bytes.length > 0) {
       StringBuffer sb = new StringBuffer();
       for (int i = 0; i < bytes.length; i++) {
           sb.append(bytes[i] + ",");
       }
       LOGGER.error(sb.toString() + ", byteString: " + new String(bytes, "utf8"));
    } else {
        LOGGER.error("bytes is empty.");
    }
    } catch (IOException e) {
        LOGGER.error(e.getMessage(), e);
    }
LOGGER.error("[" + Thread.currentThread().getName() + "] Read a frame size of " + frameSize + ", which is bigger than the maximum allowable buffer size for ALL connections. client["+ socketChannel.socket().getRemoteSocketAddress() + "] will be closed in future. server[" + socketChannel.socket().getLocalSocketAddress() + "]");
    return false;
}

客戶端列印結果如下:

2017-07-21 01:56:31,993 ERROR [org.apache.thrift.TServiceClient]:107 - method name: xxx, data len: 47,0,0,0,47,,data:-128,1,0,1,0,0,0,27,113,117,101,114,121,65,100,1
18,101,114,116,105,115,101,66,121,83,101,99,111,110,100,68,101,112,73,100,0,0,4,-41,8,0,1,0,0,0,57,0,0,3,0,0,0,18,-27,-116,-105,-28,-70,-84,-27,-115,-113,-27,-110,-116,-27,-116,-69,-23,-103
,-94,0,0,-94,-24,-126,-65,-25,-104,-92,-27,-116,-69,-23,-103,-94,0,0,-69,-23,-103,-94,0,0,0,-114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,

服務端列印如下:

2017-07-21 01:56:31,893 ERROR [org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer]:376 - 1,0,0,0,47,-128,1,0,1,0,0,0,27,113,117,101,114,121,65,100,118,101,114,116,105,115,101,66
,121,83,101,99,111,110,100,68,101,112,73,100,0,0,43,-21,8,0,1,0,0,0,57,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,, byteString: ^A^@^@
^@/�^A^@^A^@^@^@ESCxxx^@^@+^@^A^@^@^@9^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
2017-07-21 01:56:31,893 ERROR [org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer]:384 - [Thread-15] Read a frame size of 16777216, which is bigger than the maximum allowable bu
ffer size for ALL connections. client[/20.1.41.18:43196] will be closed in future. server[/192.168.11.150:10083]
通過打印發現服務端在讀取frame size時讀錯了一個位元組,導致協議錯亂。 正常情況下服務端應該讀取0,0,0,47,4個位元組,然後轉成int,即frame size,然而服務端讀取的位元組是1,0,0,0。前面多出的1是哪裡來的?
  1. 分析

    發現此問題後瞬間崩潰,哪裡來一個位元組,這麼多人使用thrift,難道只有我一個人遇到了?在bing,百度四處搜尋,在thrift官網檢視是否有我同樣遭遇的人,始終未找到類似帖子。我開始變得急躁,無從下手。吃飯睡覺都在思考這個問題。

  2. 用tcpdump抓包

    使用tcpdump抓包分別在客戶端和服務端抓包,發現客戶端有很多cksum incorrect的包,開始懷疑是否是cksum導致的問題,搜尋關於cksum的資料,說是網絡卡設定會導致cksum incorrect的問題。不是太懂硬體,特意諮詢運維工程師,運維給出的答案是

You see the "incorrect" checksums due to a feature called TCP checksum offloading. The checksum fields for outgoing TCP packets are not pre-calculated by the operating system but set to 0 and left for calculation by the NIC processor. The Wireshark FAQ has a more detailed explanation.

You should only see incorrect checksums on packets sent by your machine. This is not specific to Wireshark, any packet capture software working at this level will produce similar results.

If you see checksum fails on received packets, there might be a problem indeed - these packets are discarded by the TCP stack and trigger a retransmission as well as congestion control algorithms - which will have a direct impact on data throughput.

大概意思是客戶端有cksum incorrect不嚴重,服務端出現的話會直接拋棄資料。我抓包結果是incorrect出現在客戶端,服務端並未出現。跟我遇見的現象描述不符。

運維說可能網絡卡造成的原因,伺服器配置了多個網絡卡,也幫忙調了一下,發現沒什麼改變,異常還在繼續。所以排除了硬體的可能性。畢竟運維不能一直陪著你,換另一個思路思考。到這裡我已基本崩潰,能想的可能性都想過了,也都驗證過,都沒啥鳥用。

  1. 最後一搏,還是tcpdump

除了以上幾個思路,還試過是否是緩衝區的資料未清空,讀取了上一個請求的末尾位元組等等等等,發現都沒啥鳥用。

實在沒有辦法了,這個問題困擾了我2天了,實在熬不過了,索性用最笨的辦法吧,在新建立的連線上抓客戶端和服務端的包,我就不信抓不到不一樣的資訊。確定好客戶端埠,開始抓包,

服務端:

tcpdump -i bond0 -vv -nn port 10083 and src port 43196 > srcport.log

客戶端:

tcpdump -i eth0 -nn -vv port 43196 and dst port 10083 > client_dump.log

服務端開啟另一個終端:

netstat -atnp | grep 10083 | grep 43196 

就盼著43196快點斷掉,洗個澡,刷個牙,回來還沒斷,抽了3根菸,斷開了。已經半夜3點了,不解決實在睡不著,接著弄吧。

從3次握手開始檢視,一條一條比對兩邊的資料包,沒發現什麼問題。突然靈機一動,問題就應該在最後的請求上,索性去看了看,不看不知道,一看嚇一跳,真發現有個地方有不同處。

服務端:

01:56:31.882846 IP (tos 0x0, ttl  62, id 7895, offset 0, flags [DF], proto: TCP (6), length: 53) 20.1.41.18.43196 > 192.168.11.150.10083: P, cksum 0x9877 (correct), 97260:97261(1) ack 10313
55 win 501 urg 2 <nop,nop,timestamp 1308631990 636616015>
01:56:31.896903 IP (tos 0x0, ttl  62, id 7896, offset 0, flags [DF], proto: TCP (6), length: 53) 20.1.41.18.43196 > 192.168.11.150.10083: P, cksum 0x9759 (correct), 97313:97314(1) ack 10313
56 win 501 urg 1 <nop,nop,timestamp 1308632003 636616235>

客戶端:

   20.1.41.18.43196 > 20.1.41.61.10083: Flags [P.U], cksum 0x7a78 (incorrect -> 0x2778), seq 92506:92507, ack 975450, win 501, urg 2, options [nop,nop,TS val 1308631990 ecr 636616015], length 1
01:56:32.004995 IP (tos 0x0, ttl 64, id 7896, offset 0, flags [DF], proto TCP (6), length 53)
    20.1.41.18.43196 > 20.1.41.61.10083: Flags [P.U], cksum 0x7a78 (incorrect -> 0x265a), seq 92559:92560, ack 975451, win 501, urg 1, options [nop,nop,TS val 1308632003 ecr 636616235], length 1

對比發現斷開的前一次請求的urg不同,urg是啥東東?bing了一下,是緊急欄位。到這裡全都清楚了,奶奶個腿兒的,不就是我“優化”客戶端的時候加的方法嗎?

剛開始弄thfit的時候

thrift客戶端是基於原生socket實現,主要類為TSocket,此類封裝了socket例項,也就是套接字。以往的經驗告訴我,套接字在服務端關閉連線時客戶端無感知,導致執行緒卡在read()方法上,浪費資源。大概讀了一遍thrift的原始碼,發現TSocket.isOpen()方法不靠譜,只調用了socket_.isConnected() 所以我首先想到了sendUrgentData(int data) 方法,此方法可以很好解決socket是否可用的問題。

我還自以為是的加了sendUrgentData()方法,去判斷連線完好,萬萬沒想到就是因為此方法折騰了2天,非常痛苦的2天啊。

總結

別自以為是的修改開原始碼
解決問題要多維度思考,實在不行,用最笨的方法。

好了,可以安心睡覺了。

相關推薦

thrift關於"Read a frame size of "的痛苦經歷

thrift開源出來好多年,得到了大家認可。我本人也覺得thrift非常優秀。CTO說要公司定製的RPC框架,起初基於netty開發了一套RPC框架,協議為json,因為考慮通用性和實用性,最終選擇了輕量級的RPC框架thrift為通訊框架。選定thrift之後

the frame size of 1096 bytes is larger than 1024 bytes

I'm guessing there's some large buffer in that routine that is stack-allocated; this is likely causing the stack frame of that functio

安卓驅動編譯:warning: the frame size of 1072 bytes is larger than 1024 bytes

  在編譯核心模組驅動時,如果出現如下警告資訊:warning: the frame size of 1040 bytes is larger than 1024 bytes。主要是因為核心中設定了堆疊報警大小,其預設為1024bytes。我們主要將其修改為4096既可以消除

Xamarin.Android 使用 SQLite 出現 Index -1 requested, with a size of 10 異常

查詢 else bubuko local 獲取 roi sum 圖片 next() 異常: Android.Database.CursorIndexOutOfBoundsException: Index -1 requested, with a size of 10 此錯

#748 – 獲得按下時對應位置點的大小(Getting the Size of a Contact Point during Raw Touch)

ini bounds event view hup main.c .cn hand phoenix 原文:#748 – 獲得按下時對應位置點的大小(Getting the Size of a Contact Point during Raw Touch)

android.database.CursorIndexOutOfBoundsException: Index 0 requested, with a size of 0

android.database.CursorIndexOutOfBoundsException: Index 0 requested, with a size of 0 在查資料庫的時候出現上述異常 檢視sql語句: //獲取所選圖片的faceFeature

[LeetCode] Search in a Sorted Array of Unknown Size 在未知大小的有序陣列中搜索

  Given an integer array sorted in ascending order, write a function to search target in nums.  If target exists,

collatz number,迴文數,love,map函式,漢若塔,佛祖鎮樓,read a word by random in 5s,the star of sand glass

1 collatz number def collatz(num): '''collatz number ''' if num % 2 == 0: n = num / 2 return n elif num % 2 == 1:

READ–IT: Assessing Readability of Italian Texts with a View to Text Simplification

https://aclanthology.info/pdf/W/W11/W11-2308.pdf   2 background2000年以前 ----傳統可讀性準則侷限於表面的文字特徵,例如the Flesch-Kincaid measure(現在還在用的最普遍的)是每個單詞的平均音節數和每個句

Android問題:報錯Index -1 requested, with a size of 1

使用Cursor使,讀取裡面的資料用到getColumnIndex()時報錯:      Index -1 requested, with a size of 1       仔細閱讀過C

702. Search in a Sorted Array of Unknown Size - Medium

Given an integer array sorted in ascending order, write a function to search target in nums.  If target exists, then ret

How a brain the size of a sesame seed could change AI forever

When a honey bee leaves its hive for the first time in its life, it will typically perform one to five flights around the surrounding area to get its beari

npm報錯:npm ERR! Maximum call stack size exceeded npm ERR! A complete log of this run can be found in

在npm下載包的時候出現一個錯誤: npm ERR! Maximum call stack size exceeded npm ERR! A complete log of this run can be found in: npm ERR! C:\Users\安寧的小獅獅\

The Auto-extending innodb_system data file './ibdata1' is of a different size 6400 pages (rounded do

Mark一個解決的問題: ibadat1 是 共享表空間檔案,在使用資料庫時,會自動增大 包含: Data dictionary   Double write buffer  Insert buffer Rollback segments UNDO space

RuntimeError: The size of tensor a (96) must match the size of tensor b (95) at non-singleton dimens

復現PSMNet時候出現的bug。 出錯在output = model(imgL,imgR) 一開始以為左右兩圖維度不一致。換成output = model(imgR,imgR)之後依舊報錯 翻牆查閱pytorch資料 If two tensors x, y are

bad geometry: block count 65536 exceeds size of device (53248 blocks)

lvm 邏輯卷 卷組 在虛擬機裏做LVM的縮減的時候,未卸載邏輯卷的情況下進行了縮減導致報錯:bad geometry: block count 65536 exceeds size of device (53248 blocks)猜測大概的意思就是我分區表裏是64M(PE為4M),而實際邏輯卷已

consider increasing the maximum size of the cache

inf unable text trap red ima 安裝 web-inf ffi 下午打了一個小盹,等醒來的時候,啟動Tomcat,Tomcat報了滿屏的警告。。。 [2017-06-20 07:53:20,948] Artifact cms:war explode

Tomcat上傳文件報錯:returned a response status of 403 Forbidden

XML ons value web resp 文件 配置 pan param 出現這樣的錯誤是沒有權限對服務器進行寫操作。需要在這個項目所在的tomcat中配置可寫操作即可: 在tomcat的web.xml添加下面代碼: <init-param>

Deprecated: Methods with the same name as their class will not be constructors in a future version of PHP

統一 apple col convert rstp stp line methods gree <?php class Car { var $color = "add"; function Car($color="green") {

Angular4 - Can't bind to 'ngModel' since it isn't a known property of 'input'.

mod 導入 -1 9.png smo bind 解決辦法 文件 blog 用[(ngModel)]="xxx"雙向綁定,如:控制臺報錯:Can‘t bind to ‘ngModel‘ since it isn‘t a known property of ‘input‘.