系統技術非業餘研究 » Erlang coredump problems

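This morning 成立濤 asked:

":) We have had several outages recently. Nodes simply vanish for no apparent reason. There is no crash dump, and we have no clue at all."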

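When the Erlang VM is running normally and it detects an error in an Erlang program, or it runs out of a VM resource such as memory, it writes an erl_crash.dump file. That file describes the cause and context of the crash very clearly, so the problem is easy to locate. The VM itself, however, is implemented in C. If the VM implementation has a bug, or the system uses a NIF you wrote yourself, it is easy to take the whole VM down. Once the VM is dead, it no longer has any chance to write erl_crash.dump.
What should be produced in that case is an operating-system core dump. If core dumps happen to be disabled on the system, the node appears to vanish for no reason.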
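
Whether a node can leave a core file at all depends on the core size limit of the shell that starts it. The following is a minimal sketch for checking that, assuming a Linux host and a bash- or dash-like shell; `core_dump_status` is a name made up for this example, not an existing tool.

```shell
# core_dump_status is a hypothetical helper (not part of Erlang/OTP).
# It reports whether processes started from this shell may write core files.
core_dump_status() {
    soft=$(ulimit -S -c)   # soft limit: the value that is actually enforced
    hard=$(ulimit -H -c)   # hard limit: ceiling up to which the soft limit may be raised
    if [ "$soft" = "0" ]; then
        echo "core dumps disabled (soft=$soft, hard=$hard)"
    else
        echo "core dumps enabled (soft=$soft, hard=$hard)"
    fi
    # On Linux, core_pattern controls the name and location of the core file.
    if [ -r /proc/sys/kernel/core_pattern ]; then
        echo "core_pattern: $(cat /proc/sys/kernel/core_pattern)"
    fi
}

core_dump_status
```

A soft limit of 0 is the situation described above: the VM dies, nothing is written, and the node seems to disappear without a trace.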

Here is one of our own cases. Our Erlang system used a NIF that was not thread-safe, so it ran into trouble at run time and brought down beam:

*** glibc detected *** …/ump_proxy/erts-5.9.2/bin/beam.smp: double free or corruption (fasttop): 0x00002aaad8006780 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3998a7245f]
/lib64/libc.so.6(cfree+0x4b)[0x3998a728bb]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(lru_remove_and_destroy+0x19)[0x2aaab846d849]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(cherly_remove+0x75)[0x2aaab846b0c5]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so(cherly_put+0x193)[0x2aaab846b3e3]
/home/admin/rds2tae/clusters/ump_proxy/lib/cherly-0.12.8/priv/cherly.so[0x2aaab846b73d]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp(process_main+0x6774)[0x53b104]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp[0x4a62e3]
/home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp[0x5b4fc9]
/lib64/libpthread.so.0[0x399920673d]
/lib64/libc.so.6(clone+0x6d)[0x3998ad44bd]
======= Memory map: ========
00400000-00603000 r-xp 00000000 68:09 179078506 /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp
00803000-00857000 rw-p 00203000 68:09 179078506 /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/beam.smp
00857000-0086e000 rw-p 00857000 00:00 0
1d843000-1dc54000 rw-p 1d843000 00:00 0 [heap]
406fb000-406fc000 ---p 406fb000 00:00 0

399e403000-399e404000 rw-p 00003000 68:02 192528 /lib64/libgthread-2.0.so.0.1200.3
2aaaaaaac000-2aaaabff1000 rw-p 2aaaaaaac000 00:00 0
2aaaac1e5000-2aaaac2e6000 rw-p 2aaaac1e5000 00:00 0
2aaaac3e6000-2aaaac4e7000 rw-p 2aaaac3e6000 00:00 0
2aaaac6[os_mon] cpu supervisor port (cpu_sup): Erlang has closed
[os_mon] memory supervisor port (memsup): Erlang has closed
heart: Sun Jun 23 07:41:32 2013: Erlang has closed.
heart: Sun Jun 23 07:41:34 2013: Executed "/home/admin/rds2tae/clusters/ump_proxy/bin/ump_proxy start". Terminating.

=====
===== LOGGING STARTED Sun Jun 23 07:41:34 CST 2013
=====
Exec: /home/admin/rds2tae/clusters/ump_proxy/erts-5.9.2/bin/erlexec -boot /home/admin/rds2tae/clusters/ump_proxy/releases/2.3.6/ump_proxy -mode embedded -config /home/admin/rds2tae/clusters/ump_proxy/etc/sys.config -args_file /home/admin/rds2tae/clusters/ump_proxy/etc/vm.args -- console
Root: /home/admin/rds2tae/clusters/ump_proxy
heart_beat_kill_pid = 17047
Erlang R15B02 (erts-5.9.2) [64-bit] [smp:16:16] [async-threads:5] [hipe] [kernel-poll:true]

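The log shows that our VM crashed, and it also shows why. The heart program brought the system back up within a few seconds.
Because the service itself was not affected, the monitoring system only showed that the VM had crashed once, and the system gave us too few clues to go on.
With OS core dumps turned on, events like this become observable.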

First, let's verify what difference turning core dumps on makes:

$ cat x.c
int main(int argc, char* argv[])
{
  *(char*)0x000  =0;
  return 0;
}
$ gcc -g x.c
$ ./a.out 
Segmentation fault
$ ulimit  -c 999999999
$ ls -al core.*
-rw------- 1 chuba users 184320 Jun 27 11:45 core.23021
$ gdb ./a.out core.23021 
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/chuba/a.out...done.
[New Thread 23021]
Missing separate debuginfo for 
Try: yum --disablerepo='*' --enablerepo='*-debuginfo' install /usr/lib/debug/.build-id/a6/816913e0668c79e9ac0c257a1d28cdffe82e4a
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000400484 in main (argc=1, argv=0x7ffff9931d48) at x.c:3
3         *(char*)0x000  =0;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64
(gdb) 

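As you can see, once the limit is raised with ulimit -c and the program is run again, we get a core dump file named core.23021, and gdb tells us from it why the program crashed.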
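
Resource limits are inherited by child processes, so the limit has to be raised in whatever environment actually starts the program, including any script that restarts it, and not only in an interactive shell. A minimal sketch, assuming a POSIX-style shell that supports `ulimit -c`; `run_with_core` is a hypothetical helper written for this example:

```shell
# run_with_core is a hypothetical helper (not part of Erlang/OTP).
# It raises the soft core limit to the hard limit inside a subshell and
# runs the given command there, leaving the calling shell untouched.
run_with_core() {
    (
        # Raising the soft limit up to the hard limit needs no extra privileges.
        ulimit -S -c "$(ulimit -H -c)"
        "$@"
    )
}

# Example: show the soft limit in effect inside the wrapper.
run_with_core ulimit -S -c
```

Used as `run_with_core ./a.out`, this gives the crashing program the largest core limit the current user is allowed. If the hard limit itself is 0, the wrapper cannot help and an administrator has to raise the limit first.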

Erlang can also be made to produce a core dump on demand, which lets us verify that the setup works. A demonstration:

$ erl
Erlang R15B03 (erts-5.9.3.1)  [64-bit] [smp:16:16] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.9.3.1 (abort with ^G)
2> os:getpid().
"2294"
3> erlang:halt(abort).
Aborted (core dumped)
$ ls -al core.*
-rw------- 1 chuba users 290553856 Jun 27 11:22 core.2294

erlang:halt(abort) forces the VM to fail. This simulates a production fault and gives us the chance to design the system so that it catches such failures.
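
One simple way to catch such a failure from the outside is to look at the exit status of the process: a POSIX shell reports a child that was killed by a signal as 128 plus the signal number. The sketch below is only an illustration of that idea; `watch_exit` is a hypothetical wrapper, and an exit status above 128 can in principle also come from a program that exits with that code on purpose.

```shell
# watch_exit is a hypothetical wrapper (not part of Erlang/OTP).
# It runs a command and reports whether it appears to have been killed by a signal.
watch_exit() {
    "$@"
    rc=$?
    if [ "$rc" -gt 128 ]; then
        sig=$((rc - 128))
        # kill -l N translates a signal number into its name.
        echo "killed by signal $sig ($(kill -l "$sig" 2>/dev/null))"
    else
        echo "exited with status $rc"
    fi
    return "$rc"
}

# Harmless demonstration with an ordinary non-zero exit.
watch_exit sh -c 'exit 3' || true
```

A VM stopped with erlang:halt(abort) is terminated by SIGABRT, so a wrapper like this would report it as a signal death rather than a normal exit, and a start script could log that fact or raise an alert.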

Erlang also provides a way to debug such failures: the powerful cerl, which comes with a range of gdb commands that help you investigate the problem. A demonstration:

$ bin/cerl -break main
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-50.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/x86_64-unknown-linux-gnu/beam.smp...done.
%---------------------------------------------------------------------------
% Use etp-help for a command overview and general help.
%
% To use the Erlang support module, the environment variable ROOTDIR
% must be set to the toplevel installation directory of Erlang/OTP,
% so the etp-commands file becomes:
%     $ROOTDIR/erts/etc/unix/etp-commands
% Also, erl and erlc must be in the path.
%---------------------------------------------------------------------------
etp-set-max-depth 20
etp-set-max-string-length 100
--------------- System Information ---------------
OTP release: R16B01
ERTS version: 5.10.2
Compile date: Sat Jun 15 13:49:06 2013
Arch: x86_64-unknown-linux-gnu
Endianess: Little
Word size: 64-bit
Halfword: no
HiPE support: yes
SMP support: yes
Thread support: yes
Kernel poll: Supported
Debug compiled: no
Lock checking: no
Lock counting: no
System not initialized
--------------------------------------------------
(gdb) r
Starting program: /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/x86_64-unknown-linux-gnu/beam.smp -- -root /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git -progname /home/chuba/.kerl/builds/r16b01_rc1/otp_src_git/bin/cerl -- -home /home/chuba --
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff4cff700 (LWP 10751)]
...
[New Thread 0x7fffe8cf0700 (LWP 10779)]
[New Thread 0x7fffe82ef700 (LWP 10780)]
Erlang R16B01 (erts-5.10.2)  [64-bit] [smp:16:16] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.2  (abort with ^G)
1> erlang:halt(abort).

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffee6f9700 (LWP 10770)]
0x000000322aa32885 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.47.el6.x86_64 ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) bt
#0  0x000000322aa32885 in raise () from /lib64/libc.so.6
#1  0x000000322aa34065 in abort () from /lib64/libc.so.6
#2  0x0000000000450f9b in erl_exit_vv (n=-2147483647, flush_async=<value optimized out>, fmt=0x5d05d4 "", 
    args1=0x7fffee6f8c00, args2=0x7fffee6f8be0) at beam/erl_init.c:1788
#3  0x0000000000451197 in erl_exit (n=10745, fmt=0x6 <Address 0x6 out of bounds>) at beam/erl_init.c:1798
#4  0x000000000047fdca in halt_1 (A__p=0x7ffff4f40390, BIF__ARGS=0x7ffff6218480) at beam/bif.c:3909
#5  0x00000000005391b7 in process_main () at beam/beam_emu.c:3364
#6  0x00000000004a55e3 in sched_thread_func (vesdp=0x7ffff4201cc0) at beam/erl_process.c:5738
#7  0x00000000005b67a6 in thr_wrapper (vtwd=0x7fffffffdd70) at pthread/ethread.c:106
#8  0x000000322ae077f1 in start_thread () from /lib64/libpthread.so.0
#9  0x000000322aae570d in clone () from /lib64/libc.so.6
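
The same kind of backtrace can be pulled out of a core file after the fact, without an interactive session. The following is a rough sketch, assuming gdb is installed; `core_backtrace` is a hypothetical helper, and the paths passed to it are up to the caller.

```shell
# core_backtrace is a hypothetical helper (not part of Erlang/OTP).
# It prints a backtrace of every thread recorded in a core file.
core_backtrace() {
    exe=$1
    core=$2
    if [ ! -r "$exe" ] || [ ! -r "$core" ]; then
        echo "usage: core_backtrace <executable> <core-file>" >&2
        return 2
    fi
    # -batch makes gdb exit after running its commands; -ex runs one gdb command.
    gdb -batch -ex "thread apply all bt" "$exe" "$core"
}

# Usage (paths are placeholders):
#   core_backtrace /path/to/erts/bin/beam.smp core.2294
```

The executable has to be the beam binary that actually crashed (beam.smp in the sessions above), not the erl launcher, since erl is only a shell script that starts it.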

Summary: a heartbeat and a logging system are essential, and they help improve the stability of the system.

Have fun.