What’s New in HDFS 2.0
As you learned in Hour 2, “Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings,” HDFS in Hadoop 1.0 had some limitations and lacked support for a highly available distributed storage system. Consider the following limitations of Hadoop 1.0 related to HDFS; Hadoop 2.0 and HDFS 2.0 have resolved them.
- Single point of failure—Although you can have a secondary name node in Hadoop 1.0, it is not a standby node, so it does not provide failover capabilities. The name node is still a single point of failure.
- Horizontal scaling performance issue—As the number of data nodes grows beyond 4,000, the performance of the name node degrades. This sets a practical upper limit on the number of nodes in a cluster.
In Hadoop 2.0, HDFS has undergone an overhaul. Three important features, as discussed next, have overcome the limitations of Hadoop 1.0. The new version is referred to as HDFS 2.0 in Hadoop 2.0.
HDFS High Availability
In the Hadoop 1.0 cluster, the name node was a single point of failure. Name node failure gravely impacted the complete cluster availability. Taking down the name node for maintenance or upgrades meant that the entire cluster was unavailable during that time. The HDFS High Availability (HA) feature introduced with Hadoop 2.0 addresses this problem.
Now you can have two name nodes in a cluster in an active-passive configuration: One node is active at a time, and the other node is in standby mode. The active and standby name nodes remain synchronized. If the active name node fails, the standby name node takes over and promotes itself to the active state.
In other words, the active name node is responsible for serving all client requests, whereas the standby name node simply acts as a passive name node—it maintains enough state to provide a fast failover to act as the active name node if the current active name node fails.
This allows a fast failover to the standby name node if an active name node crashes, or a graceful failing over to the standby name node by the Hadoop administrator for any planned maintenance.
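With HA configured, the two name nodes are defined under a logical nameservice in hdfs-site.xml. The following is a minimal sketch, not the chapter's own configuration; the nameservice ID mycluster and the hosts nn1.example.com and nn2.example.com are hypothetical placeholders.

```xml
<!-- Hypothetical nameservice "mycluster" with two name nodes, nn1 and nn2 -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- RPC addresses of the two name node machines (placeholder hostnames) -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
```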
HDFS 2.0 uses two different methods to implement high availability for name nodes. The next sections discuss them.
Shared Storage Using NFS
In this implementation method, the file system namespace and edit log are maintained on a shared storage device (for example, a Network File System [NFS] mount from a NAS [Network Attached Storage]). Both the active name node and the passive or standby name node have access to this shared storage, but only the active name node can write to it; the standby name node can only read from it, to synchronize its own copy of file system namespace (see Figure 3.20).
FIGURE 3.20 HDFS high availability with shared storage.
When the active name node performs any changes to the file system namespace, it persists the changes to the edit log available on the shared storage device; the standby name node constantly applies changes logged by the active name node in the edit log from the shared storage device to its own copy of the file system namespace. When a failover happens, the standby name node ensures that it has fully synchronized its file system namespace from the changes logged in the edit log before it can promote itself to the role of active name node.
The possibility of a “split-brain” scenario exists, in which both name nodes take control of writing to the edit log on the same shared storage device at the same time. This results in data loss, corruption, or other unexpected behavior. To avoid this scenario, when configuring high availability with shared storage, you configure a fencing method for the shared storage device. Only one name node is then able to write to the edit log on the shared storage device at a time. During failover, the shared storage device gives write access to the new active name node (earlier, the standby name node) and revokes write access from the old active name node (now the standby name node), allowing it only to read the edit log to synchronize its copy of the file system namespace.
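In an NFS-based setup, the shared edits directory and a fencing method are configured in hdfs-site.xml. The following is a hedged sketch; the NFS mount path and SSH key location are hypothetical:

```xml
<!-- Shared edits directory on the NFS mount (hypothetical path) -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>
<!-- Fence the old active name node over SSH during failover -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>
```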
Quorum-based Storage Using the Quorum Journal Manager
This is one of the preferred methods. In this high-availability implementation, which leverages the Quorum Journal Manager (QJM), the active name node writes edit log modifications to a group of separate daemons called Journal nodes (JournalNodes). These daemons are accessible to both the active name node (for writes) and the standby name node (for reads). In this implementation, the Journal nodes act as the shared edit log storage.
When the active name node performs any changes to the file system namespace, it persists a record of each change to the Journal nodes. The active name node commits a transaction only if it has successfully written it to a quorum (a majority) of the Journal nodes. The standby or passive name node keeps its state in sync with the active name node by consuming the changes logged by the active name node and applying them to its own copy of the file system namespace.
When a failover happens, the standby name node ensures that it has fully synchronized its file system namespace with the changes from all the Journal nodes before it can promote itself to the role of active name node.
As with shared storage, a split-brain scenario must be prevented. To that end, the Journal nodes allow only one name node to be a writer, ensuring that only one name node can write to the edit log at a time. During failover, the Journal nodes give write access to the new active name node (earlier, the standby name node) and revoke write access from the old active name node (now the standby name node). This new standby name node can now only read the edit log to synchronize its copy of the file system namespace. Figure 3.21 shows how HDFS high availability with Quorum-based storage works.
FIGURE 3.21 HDFS high availability with Quorum-based storage.
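With QJM, the shared edits location in hdfs-site.xml points at the group of Journal nodes instead of an NFS mount. A sketch, assuming three hypothetical Journal node hosts and a nameservice named mycluster:

```xml
<!-- Quorum of three Journal nodes (placeholder hostnames);
     "mycluster" identifies this nameservice's journal -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<!-- Local directory where each Journal node daemon stores its edits (hypothetical path) -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/hdfs/journal</value>
</property>
```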
Failover Detection and Automatic Failover
Automatic failover is controlled by the dfs.ha.automatic-failover.enabled configuration property in the hdfs-site.xml configuration file. Its default value is false, but you can set it to true to enable automatic failover. To set up automatic failover to the standby name node when the active name node fails, you must add these two components to your cluster:
- ZooKeeper quorum—Set up a ZooKeeper quorum by using the ha.zookeeper.quorum configuration property in the core-site.xml configuration file and running an odd number of ZooKeeper services, typically three or five. ZooKeeper has light resource requirements, so it can be co-located with the active and standby name nodes. The recommended practice is to configure the ZooKeeper nodes to store their data on disk drives separate from the HDFS metadata (file system namespace and edit log), for isolation and better performance.
- ZooKeeper Failover Controller (ZKFC)—ZKFC is a ZooKeeper client that monitors and manages the state of the name nodes; it runs on both the active and standby name node machines. ZKFC uses the ZooKeeper service for coordination in determining a failure (unavailability of the active name node) and the need to fail over to the standby name node (see Figure 3.22). Because ZKFC runs on both the active and standby name nodes, a split-brain scenario can arise in which both nodes try to achieve active state at the same time. To prevent this, each ZKFC tries to obtain an exclusive lock on the ZooKeeper service. The ZKFC that successfully obtains the lock is responsible for failing over to its respective name node and promoting it to active state.
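The two settings involved can be sketched as follows; the ZooKeeper hostnames are hypothetical placeholders:

```xml
<!-- hdfs-site.xml: turn on automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper quorum used by ZKFC (placeholder hosts) -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
```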
FIGURE 3.22 HDFS high availability and automatic failover.
While implementing automatic failover, you leverage ZooKeeper for fault detection and for electing a new active name node when the earlier name node fails. Each name node in the cluster maintains a persistent session in ZooKeeper, and the active name node also holds a special “lock” znode. If the active name node crashes, its ZooKeeper session expires and the lock znode is deleted, notifying the standby name node that a failover should be triggered.
ZKFC periodically pings its local name node to see whether it is healthy. It then checks whether any other node currently holds the special lock znode. If not, ZKFC tries to acquire the lock; when it succeeds, it has “won the election” and is responsible for running a failover to make its local name node active.
GO TO Refer to Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning,” and Hour 7, “Exploring Typical Components of HDFS Cluster,” for more details about high availability in HDInsight.
Client Redirection on Failover
As you can see in Figure 3.23, you must specify the name of the Java class that the HDFS client uses to determine which name node is currently active, so that client requests are served by, and connections made to, the active name node.
FIGURE 3.23 Client redirection on failover.
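This Java class is configured per nameservice in hdfs-site.xml. A sketch, assuming a hypothetical nameservice ID of mycluster:

```xml
<!-- Proxy provider the HDFS client uses to locate the currently active name node -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```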
The session timeout, or the amount of time required to detect a failure and trigger a failover, is controlled by the ha.zookeeper.session-timeout.ms configuration property. Its default value is 5000 milliseconds (5 seconds). A smaller value for this property ensures quick detection of the unavailability of the active name node, but with a potential risk of triggering failover too often in case of a transient error or network blip.
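For example, to raise the timeout to 15 seconds in core-site.xml (a sketch; tune the value to your network’s characteristics):

```xml
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>15000</value>
</property>
```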
HDFS Federation
Horizontal scaling of HDFS is rare but needed in very large implementations such as Yahoo! or Facebook. This was a challenge with the single name node in Hadoop 1.0: It had to take care of the entire namespace and the block storage on all the data nodes of the cluster. In Hadoop 1.0, the number of files that could be managed in HDFS was limited by the amount of RAM on a single machine, and the throughput of read/write operations was also limited by the power of that specific machine.
HDFS has two layers (see Figure 3.24—for Hadoop 1.0, look at the top section):
- Namespace (managed by a name node)—The file system namespace consists of a hierarchy of files, directories, and blocks. It supports namespace-related operations such as creating, deleting, modifying, listing, opening, closing, and renaming files and directories.
- Block storage—Block storage has two parts. The first is Block Management, managed by a name node. The second, called Storage, is managed by the data nodes of the cluster.
- Block Management (managed by a name node)—As the data nodes send periodic block-reports, the Block Management part manages the data node cluster membership by handling registrations, processing these block-reports received from data nodes, maintaining the locations of the blocks, and ensuring that blocks are properly replicated (deleting overly replicated blocks or creating more replicas for under-replicated blocks). This part also serves the requests for block-related operations such as create, delete, modify, and get block location.
- Storage (managed by data nodes)—Blocks are stored on the local file system (each block as a separate file), with read/write access to each.
FIGURE 3.24 HDFS Federation and how it works.
In terms of layers, both Hadoop 1.0 and Hadoop 2.0 remain the same, with two layers: Namespace and Block Storage. However, the work these layers encompass has changed dramatically.
In Figure 3.24, you can see that, in Hadoop 2.0 (bottom part), HDFS Federation is a way of partitioning the file system namespace over multiple separate name nodes. Each name node manages only an independent slice of the overall file system namespace. Although these name nodes are part of a single cluster, they don’t talk to each other; they are federated and do not require any coordination with each other. A small cluster can use HDFS Federation for file system namespace isolation (for multiple tenants), and a large cluster can use it for horizontal scalability.
HDFS Federation horizontally scales the name node or name service using multiple federated name nodes (or namespaces). Each data node from the cluster registers itself with all the name nodes in the cluster, sends periodic heartbeat signals and block-reports, stores blocks managed by any of these name nodes, and handles commands from these name nodes. The collection of all the data nodes is used as common storage for blocks by all the name nodes. HDFS Federation also adds client-side namespaces to provide a unified view of the file system.
In Hadoop 2.0, the block management layer is divided into multiple block pools (see Figure 3.24). Each one belongs to a single namespace. The combination of each block pool with its associated namespace is called the namespace volume and is a self-contained unit of management. Whenever any name node or namespace is deleted, its related block pool is deleted, too.
Using HDFS Federation offers these advantages:
- Horizontal scaling for name nodes or namespaces.
- Performance improvement by breaking the limitation of a single name node. More name nodes mean improved throughput, with more read/write operations.
- Isolation of namespaces for a multitenant environment.
- Separation of services (namespace and Block Management), for flexibility in design.
- Block pool as a generic storage service (the namespace is one application that uses this service).
Horizontal scalability and isolation are fine from the cluster-level implementation perspective, but what about clients accessing all these namespaces? Wouldn’t it be difficult for a client to work with so many namespaces?
HDFS Federation also adds a client-side mount table to provide an application-centric unified view of the multiple file system namespaces transparent to the client. Clients may use View File System (ViewFs), analogous to client-side mount tables in some Unix/Linux systems, to create personalized namespace views or per cluster common or global namespace views.
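A client-side mount table for ViewFs is defined in core-site.xml. The following is a hedged sketch; the mount table name ClusterX, the mount point /data, and the name node host are hypothetical placeholders:

```xml
<!-- Make ViewFs the default file system, backed by the "ClusterX" mount table -->
<property>
  <name>fs.defaultFS</name>
  <value>viewfs://ClusterX</value>
</property>
<!-- Map the /data mount point to a namespace served by one federated name node -->
<property>
  <name>fs.viewfs.mounttable.ClusterX.link./data</name>
  <value>hdfs://nn1.example.com:8020/data</value>
</property>
```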
HDFS Snapshot
Hadoop 2.0 added a new capability to take a snapshot (a read-only, copy-on-write copy) of the file system (data blocks) stored on the data nodes. The best part of this new feature is that it has been efficiently implemented as a name node-only feature with low memory overhead, instantaneous snapshot creation, and no adverse impact on regular HDFS operations. It does not require additional storage space because no extra copies of data are required (or created) when creating a snapshot; the snapshot just records the block list and the file size, without copying any data upon creation.
Additional memory is needed only when changes are made to files that belong to a snapshot; the changes are captured in reverse chronological order so that the current data can be accessed directly.
For example, imagine that you have two directories, d1 (it has files f1 and f2) and d2 (it has files f3 and f4), in the root folder at time T1. At time T2, a snapshot (S1) is created for the d2 directory, which includes f3 and f4 files (see Figure 3.25).
FIGURE 3.25 How HDFS Snapshot works.
In Figure 3.26, at time T3, file f3 is deleted and file f5 is added. At this point, space is needed to store the previous version of the deleted file. At time T4, if you look at the files under the d2 directory, you will find only f4 and f5 (the current data). On the other hand, if you look at the snapshot S1, you will still find files f3 and f4 under the d2 directory. This implies that a snapshot is calculated by subtracting all the changes made after snapshot creation from the current data (that is, Snapshot data = Current data – modifications made after snapshot creation).
FIGURE 3.26 How changes are applied in HDFS Snapshot.
Consider these important characteristics of the HDFS snapshot feature:
- You can take a snapshot of the complete HDFS file system or of any directory or subdirectory in any subtree of the HDFS file system. Before you can take a snapshot of a directory, you must first make that directory snapshottable. Note that as long as any snapshot exists for a directory, the directory can be neither deleted nor renamed until all its snapshots are deleted.
- Snapshot creation is instantaneous and doesn’t interfere with other regular HDFS operations. Any changes to a file are recorded in reverse chronological order so that the current data can be accessed directly. The snapshot data is computed by subtracting the changes (made since the snapshot) from the current data of the file system.
- Although there is no limit to the number of snapshottable directories you can have in the HDFS file system, for a given directory, you can keep up to 65,536 simultaneous snapshots.
- After the first snapshot you take, subsequent snapshots of a directory consume storage space only for delta data (changes between the data already in a snapshot and the current state of the file system).
- As of this writing, you cannot enable snapshots (make a directory snapshottable) on nested directories. For example, if any ancestor or descendant of a specific directory has already been enabled for snapshots, you cannot enable snapshots on that directory.
- The .snapshot directory contains all the snapshots for a given directory (if it is enabled for snapshots) and is itself a hidden directory. Hence, you need to refer to it explicitly when accessing snapshots for that directory. Also, .snapshot is now a reserved word, so you cannot have a directory or a file with this name.
Consider an example in which you have configured a daily snapshot of your data. Suppose that a user accidentally deletes a file named sample.txt on Wednesday. This file is no longer available in the current data (see Figure 3.27). Because this file is important (maybe it was deleted accidentally), you must recover it. To do that, you can refer to the last snapshot, TueSS, taken on Tuesday (assuming that the file sample.txt was already available in the file system when this snapshot was taken) to recover this file.
FIGURE 3.27 An example scenario for HDFS Snapshot.
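This scenario can be sketched end to end with the snapshot commands; the directory /data, the snapshot name TueSS, and sample.txt are hypothetical:

```shell
# One-time setup: make the directory snapshottable (requires admin privileges)
hdfs dfsadmin -allowSnapshot /data

# Tuesday: take the daily snapshot
hdfs dfs -createSnapshot /data TueSS

# Wednesday: the accidental deletion
hdfs dfs -rm /data/sample.txt

# Recovery: copy the file back from Tuesday's snapshot
hdfs dfs -cp /data/.snapshot/TueSS/sample.txt /data/
```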
HDFS Snapshot Commands
You can use either of these commands to enable snapshot for a path (the path of the directory you want to make snapshottable):
hdfs dfsadmin -allowSnapshot <path>
hadoop dfsadmin -allowSnapshot <path>
To disallow snapshots for a directory (make it no longer snapshottable), you can use either of these commands:
hdfs dfsadmin -disallowSnapshot <path>
hadoop dfsadmin -disallowSnapshot <path>
When a directory has been enabled for snapshots, you can run either of these commands to take one. You must specify the directory path and, optionally, a name for the snapshot; if a name is not specified, a default name is generated using a time stamp with the format 's'yyyyMMdd-HHmmss.SSS. You must have owner privilege on the snapshottable directory to execute the command successfully.
hdfs dfs -createSnapshot <path> <snapshotName>
hadoop dfs -createSnapshot <path> <snapshotName>
If there is a snapshot on the directory and you want to delete it, you can use either of these commands. Again, you must have owner privilege on the snapshottable directory to execute the command successfully.
hdfs dfs -deleteSnapshot <path> <snapshotName>
hadoop dfs -deleteSnapshot <path> <snapshotName>
You can even rename an already created snapshot, if needed, with either of these commands. You need owner privilege on the snapshottable directory to execute the command successfully.
hdfs dfs -renameSnapshot <path> <oldSnapshotName> <newSnapshotName>
hadoop dfs -renameSnapshot <path> <oldSnapshotName> <newSnapshotName>
If you need to compare two snapshots and identify the differences between them, you can use either of these commands. This requires you to have read access to all files and directories in both snapshots.
hdfs snapshotDiff <path> <startingSnapshot> <endingSnapshot>
hadoop snapshotDiff <path> <startingSnapshot> <endingSnapshot>
You can use either of these commands to get a list of all the snapshottable directories where you have permission to take snapshots:
hdfs lsSnapshottableDir
hadoop lsSnapshottableDir
You can use this command to list all the snapshots for a specified directory. For a directory enabled for snapshots, the path component .snapshot is used for accessing its snapshots:
hdfs dfs -ls /<path>/.snapshot
You can use this command to list all the files in the snapshot s5 for the testdir directory:
hdfs dfs -ls /testdir/.snapshot/s5
You can use all the regular commands (or native Java APIs) against snapshots. For example, you can use this command to list all the files available in the bar subdirectory under the foo directory from the s5 snapshot of the testdir directory:
hdfs dfs -ls /testdir/.snapshot/s5/foo/bar
Likewise, you can use this command to copy the file sample.txt, which is available in the bar subdirectory under the foo directory from the s5 snapshot of the testdir directory to the tmp directory:
hdfs dfs -cp /testdir/.snapshot/s5/foo/bar/sample.txt /tmp