Hadoop DistCp can be used to copy data between Hadoop clusters (and also within a Hadoop cluster). DistCp uses MapReduce to implement its distribution, error handling and reporting. It expands a list of files and directories into map tasks, each of which copies a partition of the files specified in the source list.
The most common use of DistCp is an inter-cluster copy:
hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination
Specify multiple source directories:
hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination
Specify multiple source directories from a file with the -f option:
hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination
Note: srclist contains:
hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b
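The source list passed to -f is an ordinary text file with one fully qualified URI per line. A minimal sketch of preparing it, assuming the placeholder hosts nn1 and nn2 from the examples and a machine with a Hadoop client for the final two steps (the local temp file is an illustration choice, not something DistCp requires):

```shell
# Write the source list locally, one fully qualified URI per line.
SRCLIST="$(mktemp)"
printf '%s\n' \
  "hdfs://nn1:8020/source/a" \
  "hdfs://nn1:8020/source/b" > "$SRCLIST"

# The two commands below are printed rather than executed, so this sketch
# runs without a cluster. Run the printed lines where a Hadoop client exists.
echo "hadoop fs -put $SRCLIST hdfs://nn1:8020/srclist"
echo "hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination"
```

Keeping the list in a file is convenient when the set of source directories is long or is generated by another job.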
DistCp from HDP-1.3.x to HDP-2.x (HFTP is a read-only file system, so run the command on the destination HDP 2.x cluster):
hadoop distcp hftp://<hdp 1.3.x namenode host>:50070/<folder path of source> hdfs://<hdp 2.x namenode host>/<folder path of target>
For example, to copy from HDP 1.3.0 to HDP 2.0:
hadoop distcp hftp://namenodehdp130.test.com:50070/apps/hive/warehouse/db/ hdfs://namenodehdp20.test.com/data/raw/
Update & Overwrite:
- The DistCp -update option copies only those source files that do not exist at the target or whose contents differ from the target copy.
- The DistCp -overwrite option overwrites files at the target even if they already exist there with the same contents.
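The two options can be sketched as command lines like this. The distcp_cmd helper is hypothetical and only prints the command for review; the source and target URIs reuse the placeholder values from the earlier examples.

```shell
# Sketch only: placeholder source and target from the examples above.
SRC="hdfs://nn1:8020/source"
DST="hdfs://nn2:8020/destination"

# Hypothetical helper: build the DistCp command line for a given mode.
distcp_cmd() {
  case "$1" in
    update)    printf 'hadoop distcp -update %s %s\n' "$SRC" "$DST" ;;
    overwrite) printf 'hadoop distcp -overwrite %s %s\n' "$SRC" "$DST" ;;
    *)         printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
  esac
}

# Print both variants; run the chosen line yourself on a cluster.
distcp_cmd update
distcp_cmd overwrite
```

Note that with either option DistCp copies the contents of the source directory into the target, rather than creating the source directory itself under the target, so check the resulting layout on a small test directory first.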