Wednesday, January 31, 2018

DistCp (Distributed Copy)

Hadoop DistCp can be used to copy data between Hadoop clusters (and also within a Hadoop cluster). DistCp uses MapReduce to implement its distribution, error handling and reporting. It expands a list of files and directories into map tasks, each of which copies a partition of the files specified in the source list.

The most common use of DistCp is an inter-cluster copy:


hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination 

Specify multiple source directories:


hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://nn2:8020/destination

Specify multiple source directories from a file with the -f option:


hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination

Note: srclist contains:
hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b
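
The source list is a plain text file with one fully qualified source URI per line, stored at the path passed to -f. One possible way to create it is sketched below: write the file locally, then upload it with hdfs dfs -put. The local file name srclist and the command -v guard are assumptions made for illustration; the guard only lets the sketch run on a machine without a Hadoop client.

```shell
#!/bin/sh
# Write the source list locally: one fully qualified URI per line.
cat > srclist <<'EOF'
hdfs://nn1:8020/source/a
hdfs://nn1:8020/source/b
EOF

# Upload the list to the location used with -f above (-f here tells
# "hdfs dfs -put" to replace an existing file). The guard is an assumption
# of this sketch so that it also runs where no Hadoop client is installed.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -put -f srclist hdfs://nn1:8020/srclist
fi
```

Once the list is in place, the hadoop distcp -f command shown above reads its sources from it.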

DistCp from HDP-1.3.x to HDP-2.x (HFTP is a read-only file system, so run the command on the destination cluster):


hadoop distcp hftp://<hdp 1.3.x namenode host>:50070/<folder path of source> hdfs://<hdp 2.x namenode host>/<folder path of target>

Example: DistCp from HDP-1.3.0 to HDP-2.0:


hadoop distcp hftp://namenodehdp130.test.com:50070/apps/hive/warehouse/db/ hdfs://namenodehdp20.test.com/data/raw/

Update & Overwrite:


  • The DistCp -update option copies only those source files that do not exist at the target, or whose contents differ from the target copy.
  • The DistCp -overwrite option overwrites files at the target even if they already exist there and have the same contents as the source.
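
A minimal sketch of the two options, reusing the nn1/nn2 URIs from the earlier examples. The run helper and the DRY_RUN variable are hypothetical additions for illustration and are not part of DistCp; with the default DRY_RUN=1 the commands are only printed, so the sketch can be tried without a cluster.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): print the command when
# DRY_RUN=1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

SRC=hdfs://nn1:8020/source
DST=hdfs://nn2:8020/destination

# Copy only files that are missing at the target or differ from the source.
run hadoop distcp -update "$SRC" "$DST"

# Copy every file, replacing whatever already exists at the target.
run hadoop distcp -overwrite "$SRC" "$DST"
```

Set DRY_RUN=0 to actually launch the jobs. Note that, according to the Hadoop DistCp documentation, both options also change how source paths map to the target: the contents of the source directory are copied into the target, rather than the source directory itself.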

