Monday, January 22, 2018

Snapshots


HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.

The implementation of HDFS Snapshots is efficient:
  • Snapshot creation is instantaneous: the cost is O(1) excluding the inode lookup time.
  • Additional memory is used only when modifications are made relative to a snapshot: memory usage is O(M), where M is the number of modified files/directories.
  • Blocks in datanodes are not copied: the snapshot files record the block list and the file size. There is no data copying.
  • Snapshots do not adversely affect regular HDFS operations: modifications are recorded in reverse chronological order so that the current data can be accessed directly. The snapshot data is computed by subtracting the modifications from the current data.

Snapshottable Directories

Snapshots can be taken on any directory once the directory has been set as snapshottable. A snapshottable directory is able to accommodate 65,536 simultaneous snapshots. There is no limit on the number of snapshottable directories. Administrators may set any directory to be snapshottable. If there are snapshots in a snapshottable directory, the directory can be neither deleted nor renamed before all the snapshots are deleted.

Nested snapshottable directories are currently not allowed. In other words, a directory cannot be set to snapshottable if one of its ancestors/descendants is a snapshottable directory.

Snapshot Paths

For a snapshottable directory, the path component “.snapshot” is used for accessing its snapshots. Suppose /foo is a snapshottable directory, /foo/bar is a file/directory in /foo, and /foo has a snapshot s0. Then the path /foo/.snapshot/s0/bar refers to the snapshot copy of /foo/bar. The usual API and CLI work with “.snapshot” paths; for example, hdfs dfs -ls /foo/.snapshot lists all snapshots of /foo.
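A minimal sketch of working with “.snapshot” paths from the shell. The helper names (snapshot_path, list_snapshots, restore_from_snapshot) are hypothetical and exist only for illustration; the only HDFS commands used are the standard hdfs dfs -ls and hdfs dfs -cp.

```shell
# Build the path of a snapshot copy: <dir>/.snapshot/<snapshot>/<relative path>.
# (Hypothetical helper; a trailing slash on <dir> is dropped.)
snapshot_path() {
  printf '%s/.snapshot/%s/%s\n' "${1%/}" "$2" "$3"
}

# List all snapshots under a snapshottable directory.
list_snapshots() {
  hdfs dfs -ls "${1%/}/.snapshot"
}

# Copy a file or directory out of a snapshot, e.g. to undo an accidental delete.
# Arguments: <dir> <snapshot> <relative path> <destination>
restore_from_snapshot() {
  hdfs dfs -cp "$(snapshot_path "$1" "$2" "$3")" "$4"
}
```

With the names from the paragraph above, snapshot_path /foo s0 bar prints /foo/.snapshot/s0/bar, and restore_from_snapshot /foo s0 bar /tmp copies that snapshot copy into /tmp.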

Allow Snapshots

Allow snapshots of a directory to be created. If the operation completes successfully, the directory becomes snapshottable.

Command:
hdfs dfsadmin -allowSnapshot <path>

Arguments:
path: The path of the snapshottable directory.

Disallow Snapshots

Disallow snapshots of a directory from being created. All snapshots of the directory must be deleted before disallowing snapshots.

Command:
hdfs dfsadmin -disallowSnapshot <path>

Arguments:
path: The path of the snapshottable directory.
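As a small illustration, the two administrator commands above could be wrapped in one toggle. The function name set_snapshottable and its on/off argument are hypothetical conveniences; the underlying commands are the two hdfs dfsadmin calls shown above.

```shell
# Enable or disable snapshots on a directory (run as an HDFS administrator).
# Usage: set_snapshottable on|off <path>
set_snapshottable() {
  case "$1" in
    on)  hdfs dfsadmin -allowSnapshot "$2" ;;
    off) hdfs dfsadmin -disallowSnapshot "$2" ;;
    *)   echo "usage: set_snapshottable on|off <path>" >&2; return 2 ;;
  esac
}
```

For example, set_snapshottable on /foo makes /foo snapshottable; set_snapshottable off /foo is expected to fail while /foo still has snapshots.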

Create Snapshots

Create a snapshot of a snapshottable directory. This operation requires owner privilege of the snapshottable directory.

Command:
hdfs dfs -createSnapshot <path> [<snapshotName>]

Arguments:
path: The path of the snapshottable directory.
snapshotName: The snapshot name, which is an optional argument. When it is omitted, a default name is generated using a timestamp with the format "'s'yyyyMMdd-HHmmss.SSS", e.g. "s20130412-151029.033".
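Instead of relying on the generated default name, a script may prefer an explicit, predictable one. The sketch below assumes a hypothetical naming scheme of <prefix>-YYYYMMDD; the function name and the "daily" prefix are illustrative only.

```shell
# Create a snapshot named <prefix>-YYYYMMDD (hypothetical naming scheme).
# Usage: create_dated_snapshot <path> [<prefix>]
create_dated_snapshot() {
  dir=$1
  prefix=${2:-daily}   # default prefix is an assumption, not an HDFS convention
  hdfs dfs -createSnapshot "$dir" "$prefix-$(date +%Y%m%d)"
}
```

Calling create_dated_snapshot /foo creates a snapshot of /foo whose name carries today's date, which then becomes reachable under /foo/.snapshot/.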

Delete Snapshots

Delete a snapshot from a snapshottable directory. This operation requires owner privilege of the snapshottable directory.

Command:
hdfs dfs -deleteSnapshot <path> <snapshotName>

Arguments:
path: The path of the snapshottable directory.
snapshotName: The snapshot name.


Rename Snapshots

Rename a snapshot. This operation requires owner privilege of the snapshottable directory.

Command:
hdfs dfs -renameSnapshot <path> <oldName> <newName>

Arguments:
path: The path of the snapshottable directory.
oldName: The old snapshot name.
newName: The new snapshot name.
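Delete and rename are often combined when only a fixed number of snapshots is kept. The following is a sketch of a two-slot rotation; the snapshot names "current" and "previous" and the function name are hypothetical, and the sketch assumes both snapshots already exist.

```shell
# Rotate two snapshots (hypothetical names "current" and "previous"):
# drop the old "previous", then demote "current" to "previous".
# The rename runs only if the delete succeeded.
rotate_snapshots() {
  dir=$1
  hdfs dfs -deleteSnapshot "$dir" previous &&
    hdfs dfs -renameSnapshot "$dir" current previous
}
```

After rotate_snapshots /foo, a new snapshot named "current" can be taken with hdfs dfs -createSnapshot /foo current.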

Get Snapshottable Directory Listing

Get all the snapshottable directories where the current user has permission to take snapshots.

Command:
hdfs lsSnapshottableDir
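The listing can be used in scripts to check a directory before trying to snapshot it. This sketch assumes that the directory path is the last whitespace-separated column of each output line, so it does not handle paths containing spaces; the function name is hypothetical.

```shell
# Succeed if <path> appears in the snapshottable-directory listing.
# Assumption: the path is the last whitespace-separated column of each line.
is_snapshottable() {
  hdfs lsSnapshottableDir |
    awk -v p="$1" '$NF == p { found = 1 } END { exit !found }'
}
```

A caller might write: is_snapshottable /foo && hdfs dfs -createSnapshot /foo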

Get Snapshots Difference Report

Get the differences between two snapshots. This operation requires read access privilege for all files/directories in both snapshots.

Command:
hdfs snapshotDiff <path> <fromSnapshot> <toSnapshot>
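In the difference report, each entry line starts with a marker: + for created, - for deleted, M for modified, and R for renamed. As a sketch, the report can be filtered on that marker; the function name deleted_between is hypothetical.

```shell
# Print only the entries deleted between two snapshots of <path>.
# Report lines start with + (created), - (deleted), M (modified) or R (renamed).
# Usage: deleted_between <path> <fromSnapshot> <toSnapshot>
deleted_between() {
  hdfs snapshotDiff "$1" "$2" "$3" | grep '^-'
}
```

For example, deleted_between /foo s0 s1 shows what was removed from /foo between snapshots s0 and s1; those entries remain readable under /foo/.snapshot/s0/.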
