Hadoop merge large part files with ‘getmerge’ command

I recently faced an issue while trying to merge a large number of part files into a single file. I had used the ‘cat’ command for the merge; my command looked like this:

   hadoop fs -cat ${INPUT_PATH}/part* > ${OUTPUT_FOLDER}/output.xml

Here, INPUT_PATH is the HDFS location of the part files and OUTPUT_FOLDER is the local destination for the merged file. The part files contain XML data and are very large.

When I ran the above command, it failed partway through the merge with a “cat: unable to write output stream” error.

To get rid of the above error, I decided to use the ‘getmerge’ command instead. It works fine without any issues. Check the below command.

   hadoop fs -getmerge ${INPUT_PATH}/part* ${OUTPUT_FOLDER}/output.xml
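Note that, unlike ‘cat’, ‘getmerge’ does not use shell redirection: it takes the HDFS source glob and a local destination path as two arguments and writes the concatenated result straight to the local filesystem. A minimal sketch of the workflow (the paths are hypothetical placeholders, and the optional -nl flag is my addition, not part of the original command) might look like:

```shell
#!/bin/sh
# Hypothetical example paths -- substitute your own locations.
INPUT_PATH=/data/input
OUTPUT_FOLDER=/data/output

# getmerge concatenates every matching HDFS file into one LOCAL file.
# The optional -nl flag appends a newline after each part file so that
# records at file boundaries never run together.
CMD="hadoop fs -getmerge -nl ${INPUT_PATH}/part* ${OUTPUT_FOLDER}/output.xml"

# Echoed here as a dry run; on a real cluster, run the command itself.
echo "$CMD"
```

The key difference from the ‘cat’ approach is that the merge is handled by the Hadoop client itself rather than being streamed through the shell, which is why the output-stream error goes away.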
