
系統技術非業餘研究 » The incarnation problem caused by an Erlang node restart

This evening, mingchaoyan asked me the following question online:

=ERROR REPORT==== 2013-06-28 19:57:53 ===
Discarding message {send,<<19 bytes>>} from <0.86.1> to <0.6743.0> in an old incarnation (1) of this node (2)

=ERROR REPORT==== 2013-06-28 19:57:55 ===
Discarding message {send,<<22 bytes>>} from <0.1623.1> to <0.6743.0> in an old incarnation (1) of this node (2)

After we updated our servers at noon, the log filled screen after screen with these errors. Have you run into anything similar? Any ideas on how to locate and solve the problem? Thanks.

This is an interesting problem. Starting from the log message and cross-checking the source code, we can immediately find where this message is printed:

/*bif.c*/
Sint
do_send(Process *p, Eterm to, Eterm msg, int suspend) {
    Eterm portid;
...
} else if (is_external_pid(to)) {
        dep = external_pid_dist_entry(to);
        if(dep == erts_this_dist_entry) {
            erts_dsprintf_buf_t *dsbufp = erts_create_logger_dsbuf();
            erts_dsprintf(dsbufp,
                          "Discarding message %T from %T to %T in an old "
                          "incarnation (%d) of this node (%d)\n",
                          msg,
                          p->id,
                          to,
                          external_pid_creation(to),
                          erts_this_node->creation);
            erts_send_error_to_logger(p->group_leader, dsbufp);
            return 0;
        }
..
}

This warning fires only when both of the following hold:
1. The target pid is an external_pid.
2. The dist_entry of the external node that the pid belongs to is the same as the current node's dist_entry.

Searching Google, I found a problem that closely matches this description (see here). The author describes and reproduces the phenomenon nicely, but does not explain the underlying cause.

OK, let's follow his lead and replay the problem.
Before the demo, let's brush up on the basics. First we need to understand the pid format; see this article. The key points about pids are excerpted below:

Printed process ids < A.B.C > are composed of [6]:
A, the node number (0 is the local node, an arbitrary number for a remote node)
B, the first 15 bits of the process number (an index into the process table) [7]
C, bits 16-18 of the process number (the same process number as B) [7]

Next, section 9.10 of the Erlang External Term Format documentation
describes the layout of PID_EXT:

1     N      4    4       1
103   Node   ID   Serial  Creation

Table 9.16: PID_EXT
Encode a process identifier object (obtained from spawn/3 or friends). The ID and Creation fields works just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest should be 0.


Note the Creation field. Why haven't we come across it before?
Consulting the Erlang documentation, we learn:

creation
Returns the creation of the local node as an integer. The creation is changed when a node is restarted. The creation of a node is stored in process identifiers, port identifiers, and references. This makes it (to some extent) possible to distinguish between identifiers from different incarnations of a node. Currently valid creations are integers in the range 1..3, but this may (probably will) change in the future. If the node is not alive, 0 is returned.

Tracing the origin of this creation value, we find it comes from epmd. Concretely: every time a node registers its name with epmd, epmd hands back the creation, and net_kernel records it into the node's erts_this_dist_entry->creation via the set_node BIF:

/* erl_node_tables.c */
void
erts_set_this_node(Eterm sysname, Uint creation)
{
...
    erts_this_dist_entry->sysname = sysname;
    erts_this_dist_entry->creation = creation;
...
}

/*epmd_srv.c  */
...
        /* When reusing we change the "creation" number 1..3 */

        node->creation = node->creation % 3 + 1;
...

The code above shows that creation takes values 1..3, cycling by +1 on each re-registration. A node that is not alive has creation 0.

Now that we know creation's whole story, let's look at the DistEntry data structure, which essentially represents how a distributed node interacts with the outside world.

typedef struct dist_entry_ {
    Eterm sysname;    /* name@host atom for efficiency */
    Uint32 creation;  /* creation of connected node */
    Eterm cid;        /* connection handler (pid or port), NIL == free */
} DistEntry;

The three fields above are the most important; cid is the port (the TCP channel between the nodes).

External pids are constructed via binary_to_term; the code is in external.c, in the dec_pid function.

static byte*
dec_pid(ErtsDistExternal *edep, Eterm** hpp, byte* ep, ErlOffHeap* off_heap, Eterm* objp)
{
 ...
    /*                                                                                                                    
     * We are careful to create the node entry only after all                                                             
     * validity tests are done.                                                                                           
     */
    node = dec_get_node(sysname, cre);

    if(node == erts_this_node) {
        *objp = make_internal_pid(data);
    } else {
        ExternalThing *etp = (ExternalThing *) *hpp;
        *hpp += EXTERNAL_THING_HEAD_SIZE + 1;

        etp->header = make_external_pid_header(1);
        etp->next = off_heap->first;
        etp->node = node;
        etp->data.ui[0] = data;

        off_heap->first = (struct erl_off_heap_header*) etp;
        *objp = make_external_pid(etp);
    }
...
}
static ERTS_INLINE ErlNode* dec_get_node(Eterm sysname, Uint creation)
{
    switch (creation) {
    case INTERNAL_CREATION:
        return erts_this_node;
    case ORIG_CREATION:
        if (sysname == erts_this_node->sysname) {
            creation = erts_this_node->creation;
        }
    }
    return erts_find_or_insert_node(sysname,creation);
}

If creation is 0, the pid definitely refers to the local node; otherwise a matching node is looked up by sysname and creation.
More code:

typedef struct erl_node_ {
  HashBucket hash_bucket;       /* Hash bucket */
  erts_refc_t refc;             /* Reference count */
  Eterm sysname;                /* name@host atom for efficiency */
  Uint32 creation;              /* Creation */
  DistEntry *dist_entry;        /* Corresponding dist entry */
} ErlNode;

/* erl_node_tables.c */
ErlNode *erts_find_or_insert_node(Eterm sysname, Uint creation)
{    
    ErlNode *res;
    ErlNode ne;
    ne.sysname = sysname;
    ne.creation = creation;

    erts_smp_rwmtx_rlock(&erts_node_table_rwmtx);
    res = hash_get(&erts_node_table, (void *) &ne);
    if (res && res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_runlock(&erts_node_table_rwmtx);
    if (res)
        return res;

    erts_smp_rwmtx_rwlock(&erts_node_table_rwmtx);
    res = hash_put(&erts_node_table, (void *) &ne);
    ASSERT(res);
    if (res != erts_this_node) {
        erts_aint_t refc = erts_refc_inctest(&res->refc, 0);
        if (refc < 2) /* New or pending delete */
            erts_refc_inc(&res->refc, 1);
    }
    erts_smp_rwmtx_rwunlock(&erts_node_table_rwmtx);
    return res;  
}

static int
node_table_cmp(void *venp1, void *venp2)
{
    return ((((ErlNode *) venp1)->sysname == ((ErlNode *) venp2)->sysname
             && ((ErlNode *) venp1)->creation == ((ErlNode *) venp2)->creation)
            ? 0
            : 1);
}

static void*
node_table_alloc(void *venp_tmpl)
{
    ErlNode *enp;

    if(((ErlNode *) venp_tmpl) == erts_this_node)
        return venp_tmpl;

    enp = (ErlNode *) erts_alloc(ERTS_ALC_T_NODE_ENTRY, sizeof(ErlNode));

    node_entries++;

    erts_refc_init(&enp->refc, -1);
    enp->creation = ((ErlNode *) venp_tmpl)->creation;
    enp->sysname = ((ErlNode *) venp_tmpl)->sysname;
    enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);

    return (void *) enp;
}

erts_find_or_insert_node looks a node up by the combination of sysname and creation; if nothing is found, it creates a new node and inserts it into the ErlNode table erts_node_table. An ErlNode carries three key pieces of information: 1. sysname, 2. creation, 3. dist_entry. So when a new node is created, what goes into its dist_entry?

The key line is this one:

enp->dist_entry = erts_find_or_insert_dist_entry(((ErlNode *) venp_tmpl)->sysname);

The dist_entry is looked up by sysname alone, not by the (sysname, creation) pair.

And here the problem arises. Look closely at the dec_pid code again:

node = dec_get_node(sysname, cre);
if(node == erts_this_node) {
    *objp = make_internal_pid(data);
} else {
    ...
    etp->node = node;
    ...
    *objp = make_external_pid(etp);
}

Because the creation differs, the same sysname no longer matches the current node, so dec_get_node creates a fresh ErlNode, and yet that node's dist_entry is the dist_entry of the current node.
The external pid object built from it carries this fresh node.

So in the three lines of do_send that emit the warning:

} else if (is_external_pid(to)) {
    dep = external_pid_dist_entry(to);
    if(dep == erts_this_dist_entry) {

the external_pid_dist_entry macro pulls the node out of the external pid and the dist_entry out of the node. That dist_entry is, unfortunately, the same as erts_this_dist_entry, and hence the tragedy above.

After all that analysis we finally have a clear picture. Time for a sip of water!
With this background in place, we can now demonstrate:

$ erl -sname a
Erlang R15B03 (erts-5.9.3.1)  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(a@rds064076)1> term_to_binary(self()).
<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,
  0,0,37,0,0,0,0,2>>
(a@rds064076)2> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>).
<0.37.0>
(a@rds064076)3> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>).
<0.37.0>
(a@rds064076)4> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>).
<0.37.0>
(a@rds064076)5> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,1>>)==self(). 
false
(a@rds064076)6> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,2>>)==self().
true
(a@rds064076)7> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)==self().
false
(a@rds064076)8> binary_to_term(<<131,103,100,0,11,97,64,114,100,115,48,54,52,48,55,54,0,0,0,37,0,0,0,0,3>>)!ok.     
ok
(a@rds064076)9> erlang:system_info(creation).
2
=ERROR REPORT==== 28-Jun-2013::23:10:58 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (3) of this node (2)

The demo above shows that creation really does cycle by +1 each time, and that although the pids print identically, creation makes them different pids under the hood.
At this point we roughly understand cause and effect, but we still haven't answered the original question:
his cluster merely restarted one node, and then got a screenful of warnings.
A whole screen of them!!!

I designed another test case to dissect the problem more deeply.
First, we need the following module:

$ cat test.erl
-module(test).
-export([start/0]).

start() ->
    register(test, self()),
    loop(undefined).

loop(State) ->
    loop(receive
             {set, Msg} -> Msg;
             {get, From} -> From ! State
         end).

What this code does:
once test:start runs, the process registers itself under the name test on the target node and accepts two kinds of messages, set and get. set stores whatever the caller sends; get retrieves it.

Our test case goes like this:
Start nodes a and b, then from node b spawn test:start on node a, so that it holds our data, namely the pid of b's shell process.
Then simulate node b crashing and restarting, and retrieve the previously saved pid from the test process on node a. That pid prints the same as the newly started shell's pid, but the two should not be fully identical, because the creation differs.
With that laid out, let's show it:

$ erl -name a@127.0.0.1
Erlang R15B03 (erts-5.9.3.1)  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(a@127.0.0.1)1> 

Good, node a is ready. Next, start node b and save its shell's pid over on node a.

$ erl -name b@127.0.0.1
Erlang R15B03 (erts-5.9.3.1)  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1  (abort with ^G)
(b@127.0.0.1)1> R=spawn('a@127.0.0.1', test, start,[]).
<6002.42.0>
(b@127.0.0.1)2> self().
<0.37.0>
(b@127.0.0.1)3> R!{set, self()}.   
{set,<0.37.0>}
(b@127.0.0.1)4> R!{get, self()}.
{get,<0.37.0>}
(b@127.0.0.1)5> flush().
Shell got <0.37.0>
ok
(b@127.0.0.1)6> 
BREAK: (a)bort (c)ontinue (p)roc info (i)nfo (l)oaded
       (v)ersion (k)ill (D)b-tables (d)istribution
^C

Now exit node b to simulate a crash, then restart it, fetch the pid saved earlier, and compare it with the current shell pid: they are not fully identical.

$ erl -name b@127.0.0.1
Erlang R15B03 (erts-5.9.3.1) [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1 (abort with ^G)
(b@127.0.0.1)1> {test, 'a@127.0.0.1'}!{get, self()}.
{get,<0.37.0>}
(b@127.0.0.1)2> flush().
Shell got <0.37.0>
ok
(b@127.0.0.1)3> {test, 'a@127.0.0.1'}!{get, self()}, receive X->X end.
<0.37.0>
(b@127.0.0.1)4> T=v(-1).
<0.37.0>
(b@127.0.0.1)5> T==self().
false
(b@127.0.0.1)6> T!ok.
ok
(b@127.0.0.1)7>
=ERROR REPORT==== 28-Jun-2013::23:24:00 ===
Discarding message ok from <0.37.0> to <0.37.0> in an old incarnation (2) of this node (3)
Sending a message to the retrieved stale pid triggers the warning.

This scenario is very common in distributed systems: cooperating processes keep pids from other nodes in their state, and when some of those processes die and their node restarts, any attempt to use the saved pids finds that they are already invalid.

This should now fully answer the original question.

So what is the fix?
Our systems should monitor_node the relevant peer nodes and catch nodedown messages; when a node goes down, promptly remove the processes associated with it, since those processes have essentially lost their meaning.

Takeaway: even the most innocent-looking warning can hide a serious problem.

Have fun!
