Hadoop merge large part files with ‘getmerge’ command

I recently ran into a problem while merging a large number of part files into a single file. I was using the ‘cat’ command for the merge, and my command looked like this:


   hadoop fs -cat ${INPUT_PATH}/part* > ${OUTPUT_FOLDER}/output.xml

Assume that INPUT_PATH is the location of the part files and OUTPUT_FOLDER is the local output location of the merged file. Note that the part files contain XML data and are very large.

When I ran the above command, the merge failed partway through with a “cat unable to write output stream” message.

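I decided to switch to the getmerge command, which got rid of the above error and ran without any issues. Unlike cat, getmerge writes the merged result straight to a local file given as its last argument, so no shell redirection is needed. Check the below command.


   hadoop fs -getmerge ${INPUT_PATH}/part* ${OUTPUT_FOLDER}/output.xml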
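
If the merge runs as part of a script, it can help to check that it actually succeeded. Below is a minimal sketch of one way to do that. The function name merge_parts is my own hypothetical helper, not part of Hadoop, and the sketch assumes the Hadoop client expands the quoted part* glob itself, as it does for other fs commands.

```shell
# Sketch: merge HDFS part files into one local file and fail loudly on error.
# merge_parts is a hypothetical helper name, not a Hadoop command.
merge_parts() {
    input_path="$1"     # location of the part files
    output_file="$2"    # local destination file for the merged output

    # Make sure the local output folder exists before merging.
    mkdir -p "$(dirname "$output_file")" || return 1

    # Quote the glob so it is passed to Hadoop as-is instead of being
    # expanded against the local filesystem by the shell.
    if ! hadoop fs -getmerge "${input_path}/part*" "$output_file"; then
        echo "getmerge failed for ${input_path}" >&2
        return 1
    fi

    # Treat a missing or empty merged file as a failure.
    [ -s "$output_file" ]
}
```

It would be called as merge_parts "${INPUT_PATH}" "${OUTPUT_FOLDER}/output.xml", and it returns a non-zero status if the merge fails or produces an empty file.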