According to the Hadoop command manual, fsck -move will "move corrupted files to /lost+found". It sounds safe at first glance, but moving is not the only thing this command does!
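For reference, fsck takes the path to check as an argument; a typical -move run against a (purely illustrative) data directory looks like this:
$ hadoop fsck /path/to/data -move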
Suppose a few datanodes crash and some files end up with missing blocks. No big deal; the action plan simply looks like this (spelled out as commands right after the list):
- Run fsck -move to move the corrupted files out of the processing paths, so that jobs stop failing on them
- Recover the blocks from the crashed datanodes, potentially by hand if the datanodes do not come back to life
- Move the files from /lost+found back to their original locations.
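In command form, this naive plan would be something like the sketch below; the paths are purely illustrative, and step 3 assumes the files come back from /lost+found unchanged (spoiler: they do not).
$ hadoop fsck /data -move                                  # 1. park the corrupted files in /lost+found (illustrative path)
# 2. restart / repair the crashed datanodes so the missing blocks reappear
$ hadoop fs -mv /lost+found/data/my/file /data/my/file     # 3. move the files back to their original location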
Here is the hidden behavior: in /lost+found, each corrupted file shows up as a directory named after the original file, containing the salvageable parts of the file as separate numbered pieces:
$ hadoop fs -ls -R /lost+found
drwx------ /lost+found/path/to/my/corrupted/file
-rw-r--r-- /lost+found/path/to/my/corrupted/file/0
-rw-r--r-- /lost+found/path/to/my/corrupted/file/1
As a side effect, the missing blocks are no longer referenced by any file: the original files are gone, and the pieces left in /lost+found only contain the blocks that were still available. Even if the missing blocks are recovered later, they will be deleted anyway, since they no longer belong to any file. fsck -move is an irreversible process, or at least one that is painful to recover from.
I learned about fsck -move's hidden behavior the hard way. If I had to rewrite the action plan for corrupted files, it would be the following (a command sketch follows the list):
- Get the list of corrupted files (hadoop fsck / | grep -v -E "^\.+$", filtering out the lines of progress dots)
- Manually move the files listed in step 1 out of the processing path (hadoop fs -mv /path/to/my/corrupted/file /quarantine/path/to/my/corrupted/file)
- Try to recover the missing blocks.
- If the missing blocks can be recovered, move the files back to their initial directory
- If not, run hadoop fsck -move to limit the loss to the missing blocks only, and do some additional post-processing to turn the salvaged portions of the files into a usable format, including moving the files' splits back to the initial directory.
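As a rough sketch, with hypothetical paths and assuming /lost+found mirrors the original layout as in the listing above, the revised plan looks like this:
$ hadoop fsck / | grep -v -E "^\.+$"                       # 1. list the corrupted files
$ hadoop fs -mv /path/to/my/corrupted/file /quarantine/path/to/my/corrupted/file   # 2. quarantine them
# 3./4. if the missing blocks come back (e.g. the datanodes restart), simply undo the move:
$ hadoop fs -mv /quarantine/path/to/my/corrupted/file /path/to/my/corrupted/file
# 5. otherwise, salvage what is left and keep the pieces for post-processing:
$ hadoop fsck /quarantine -move
$ hadoop fs -ls -R /lost+found/quarantine                  # one numbered piece per salvaged segment
$ hadoop fs -mv /lost+found/quarantine/path/to/my/corrupted/file /path/to/my/corrupted/file.parts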