
Hadoop Operations Troubleshooting Notes

Yesterday a colleague hit a Hadoop failure and stared at it for half a day without finding the cause, so it landed on me; it took a little while to sort out. This is probably the last failure I'll fix on Baofeng's cluster. From here on, who knows whose problems I'll be solving.

I only managed to capture the NameNode's error log; the DataNode's output had already scrolled off the screen, but it said much the same thing.

2013-09-03 18:11:44,021 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.blockReceived: blk_8094241928859719036_2147969 is received from dead or unregistered node 192.168.1.99:50010
2013-09-03 18:11:44,022 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hdfs cause:java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
2013-09-03 18:11:44,022 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9000, call blockReceived(DatanodeRegistration(192.168.1.99:50010, storageID=DS-1925877777-192.168.1.99-50010-1372745739682, infoPort=50075, ipcPort=50020), [Lorg.apache.hadoop.hdfs.protocol.Block;@4ec371c, [Ljava.lang.String;@301611ca) from 192.168.1.99:18853: error: java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
java.io.IOException: Got blockReceived message from unregistered or dead node blk_8094241928859719036_2147969
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.blockReceived(FSNamesystem.java:4188)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.blockReceived(NameNode.java:1069)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Unknown Source)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

At first glance it's an IPC error. Reading from the bottom up: what looks like a permission error (the PriviledgedActionException), then a DataNode that can't register, and then an unregistered or dead DataNode reporting that a block has been received. That's what stumped my colleague: how can a node that's already dead still be reporting blocks?

And after each restart, the DataNode would run for only a short while before dying again.

I logged in to the DataNode and first checked the permissions on the dfs data directories: all correct. Then I ran df -h and found that /var was full. Ops had been stingy with space and allotted only 20 GB to /var, so Hadoop could no longer write its logs and the daemon naturally died. After deleting the historical logs under /var/log/hadoop/hdfs, the DataNode started up normally. Going forward there are only two real fixes: either set up a scheduled script that deletes old logs every day, or symlink the /var/log/hadoop/hdfs directory onto a larger disk. A rough sketch of both is below.
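Here is a minimal shell sketch of the two fixes, assuming a Hadoop 1.x-era layout. The 7-day retention window, the 3 a.m. schedule, the /data01 mount point, and hadoop-daemon.sh as the stop/start mechanism are all my assumptions; substitute whatever your distribution actually uses.

# Fix 1: daily cleanup via cron. Add this line to root's crontab
# (crontab -e). It deletes anything under the log directory that
# hasn't been touched for more than 7 days (assumed retention).
0 3 * * * find /var/log/hadoop/hdfs -type f -mtime +7 -delete

# Fix 2: move the log directory to a larger disk and symlink it back.
# Stop the DataNode first so log4j reopens its files on the new path.
# /data01 is a placeholder for whatever big mount you have.
hadoop-daemon.sh stop datanode
mv /var/log/hadoop/hdfs /data01/hadoop-hdfs-logs
ln -s /data01/hadoop-hdfs-logs /var/log/hadoop/hdfs
hadoop-daemon.sh start datanode

Note that the symlink only postpones the problem on a bigger disk, while the cron job (or, alternatively, capping the backup count of log4j's RollingFileAppender in log4j.properties) actually bounds disk usage, so in practice you may want both.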