1. 程式人生 > >記一次Alluxio HA master啟動失敗

記一次Alluxio HA master啟動失敗

大數據 Hadoop

1. 今天遇到一個情況,就是alluxio不能正常訪問,經過日誌查看,發現下面錯誤。

2018-05-14 03:35:58,680 ERROR logger.type (HdfsUnderFileSystem.java:open) - 4 try to open hdfs://sandy-bridge/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 : Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]}
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[10.16.13.189:1019, 10.16.13.84:1019, 10.16.13.128:1019]; storageIDs=[DS-30126b4d-afdf-449a-8de1-e479c1abf33d, DS-ed2e905e-fa43-4f51-801f-3305da180d2a, DS-0e1946c8-dccb-4143-8d74-c11d8d429d02]; storageTypes=[DISK, DISK, DISK]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:400)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:305)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:242)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:235)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1487)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:302)
at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:298)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:298)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at alluxio.underfs.hdfs.HdfsUnderFileSystem.open(HdfsUnderFileSystem.java:387)
at alluxio.underfs.BaseUnderFileSystem.open(BaseUnderFileSystem.java:124)
at alluxio.master.journal.JournalReader.getNextInputStream(JournalReader.java:114)
at alluxio.master.journal.JournalTailer.processNextJournalLogFiles(JournalTailer.java:118)
at alluxio.master.AbstractMaster.start(AbstractMaster.java:140)
at alluxio.master.file.FileSystemMaster.start(FileSystemMaster.java:419)
at alluxio.master.DefaultAlluxioMaster.startMasters(DefaultAlluxioMaster.java:263)
at alluxio.master.FaultTolerantAlluxioMaster.start(FaultTolerantAlluxioMaster.java:91)
at alluxio.ServerUtils.run(ServerUtils.java:38)

2. 首先是懷疑文件log.00000000000000000001損壞,經過hfs fsck的檢查,並沒有發現corruption,但是Total size: 0,這是個問題

[hdfs@hdfs-namenode hdfs]$ hdfs fsck /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001
Connecting to namenode via http://hdfs-namenode.eu-central-1.compute.internal:50070/fsck?ugi=hdfs&path=%2Fuser%2Falluxio%2Fjournal%2FFileSystemMaster%2Fcompleted%2Flog.00000000000000000001
FSCK started by hdfs (auth:KERBEROS_SSL) from /10.16.13.73 for path /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 at Mon May 14 03:53:11 UTC 2018
Status: HEALTHY
Total size: 0 B (Total open files size: 254 B)
Total dirs: 0
Total files: 0
Total symlinks: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 1)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 41
Number of racks: 1
FSCK ended at Mon May 14 03:53:11 UTC 2018 in 1 milliseconds
The filesystem under path '/user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001' is HEALTHY

3. 將這個問題件mv走,再啟動alluxio HA master,啟動成功。

[hdfs@hdfs-namenode hdfs]$ hdfs dfs -mv /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
[hdfs@hdfs-namenode hdfs]$ hdfs dfs -ls /user/alluxio/journal/FileSystemMaster/completed/
Found 2 items
-rw-r--r-- 3 alluxio alluxio 254 2018-01-29 09:32 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
-rw-r--r-- 3 alluxio alluxio 397 2018-05-14 03:03 /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002

4. 其中嘗試過,將文件再mv回來,但是alluxio依然啟動失敗,還是最開始的錯誤。


5. 直接cat這個文件,發現也不能訪問。

hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000001.bak
cat: Cannot obtain block length for LocatedBlock{BP-1941630157-10.16.13.73-1486732586674:blk_1322900685_252817168; getBlockSize()=254; corrupt=false; offset=0; locs=[DatanodeInfoWithStorage[10.16.13.189:1019,DS-30126b4d-afdf-449a-8de1-e479c1abf33d,DISK], DatanodeInfoWithStorage[10.16.13.128:1019,DS-0e1946c8-dccb-4143-8d74-c11d8d429d02,DISK], DatanodeInfoWithStorage[10.16.13.84:1019,DS-ed2e905e-fa43-4f51-801f-3305da180d2a,DISK]]}

6. 而正常的文件,輸出如下:

[hdfs@hdfs-namenode hdfs]$ hdfs dfs -cat /user/alluxio/journal/FileSystemMaster/completed/log.00000000000000000002
NOT_PERSISTED(0,@ HPXhdatadownloadz_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadzdatadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"
NOT_PERSISTED(0,@ HPXhdatadownloadz datadownload Z 6Perrier_%3F%3F_20180101_20180104_20180510130731077.zip"

7. Alluxio master是啟動成功了,但是丟了一部分數據。

這個問題,有時間,還要繼續研究一下,看是否能將數據找回。

記一次Alluxio HA master啟動失敗