Hadoop: merge large part files with the ‘getmerge’ command

I recently faced an issue while trying to merge a large number of part files into a single file. I had used the ‘cat’ command for the merge. My command looked like this:


   hadoop fs -cat ${INPUT_PATH}/part* > ${OUTPUT_FOLDER}/output.xml

Assume that INPUT_PATH is the HDFS location of the part files and OUTPUT_FOLDER is the local output location for the merged file. Note that the part files contain XML data and are very large.
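
For context, here is a minimal sketch of the setup. The paths below are hypothetical placeholders, not the actual ones from my job:


   # Hypothetical example paths; substitute your own
   INPUT_PATH=/user/hadoop/job-output     # HDFS directory holding the part files
   OUTPUT_FOLDER=/tmp/merged              # local directory for the merged file

   # List the part files that will be merged
   hadoop fs -ls ${INPUT_PATH}/part*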

When I ran the above command, the merge failed midway with a “cat: unable to write output stream” error.

I decided to use the ‘getmerge’ command to avoid this error, and it worked without any issues. The command looks like this:


   hadoop fs -getmerge ${INPUT_PATH} ${OUTPUT_FOLDER}/output.xml
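
Note that unlike ‘cat’, ‘getmerge’ takes the local destination file as a second argument instead of using shell redirection, and it writes the merged result directly to the local filesystem. As a small extension of the command above (using the same hypothetical paths), the optional -nl flag appends a newline after each part file, and the merged file can be copied back to HDFS afterwards if needed:


   # Add a newline after each part file, useful if they do not end with one
   hadoop fs -getmerge -nl ${INPUT_PATH} ${OUTPUT_FOLDER}/output.xml

   # Optionally copy the merged file back to HDFS (destination path is hypothetical)
   hadoop fs -put ${OUTPUT_FOLDER}/output.xml /user/hadoop/merged/output.xml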
