Home > Articles > Programming > Java

This chapter is from the book

What’s New in HDFS 2.0

As you learned in Hour 2,“Introduction to Hadoop, Its Architecture, Ecosystem, and Microsoft Offerings,” HDFS in Hadoop 1.0 had some limitations and lacked support for providing a highly available distributed storage system. Consider the following limitations of Hadoop 1.0, related to HDFS; Hadoop 2.0 and HDFS 2.0 have resolved them.

  • Single point of failure—Although you can have a secondary name node in Hadoop 1.0, it’s not a standby node, so it does not provide failover capabilities. The name node still is a single point of failure.
  • Horizontal scaling performance issue—As the number of data nodes grows beyond 4,000 the performance of the name node degrades. This sets a kind of upper limit to the number of nodes in a cluster.

In Hadoop 2.0, HDFS has undergone an overhaul. Three important features, as discussed next, have overcome the limitations of Hadoop 1.0. The new version is referred to as HDFS 2.0 in Hadoop 2.0.

HDFS High Availability

In the Hadoop 1.0 cluster, the name node was a single point of failure. Name node failure gravely impacted the complete cluster availability. Taking down the name node for maintenance or upgrades meant that the entire cluster was unavailable during that time. The HDFS High Availability (HA) feature introduced with Hadoop 2.0 addresses this problem.

Now you can have two name nodes in a cluster in an active-passive configuration: One node is active at a time, and the other node is in standby mode. The active and standby name nodes remain synchronized. If the active name node fails, the standby name node takes over and promotes itself to the active state.

In other words, the active name node is responsible for serving all client requests, whereas the standby name node simply acts as a passive name node—it maintains enough state to provide a fast failover to act as the active name node if the current active name node fails.

This allows a fast failover to the standby name node if an active name node crashes, or a graceful failing over to the standby name node by the Hadoop administrator for any planned maintenance.

HDFS 2.0 uses these different methods to implement high availability for name nodes. The next sections discuss them.

Shared Storage Using NFS

In this implementation method, the file system namespace and edit log are maintained on a shared storage device (for example, a Network File System [NFS] mount from a NAS [Network Attached Storage]). Both the active name node and the passive or standby name node have access to this shared storage, but only the active name node can write to it; the standby name node can only read from it, to synchronize its own copy of file system namespace (see Figure 3.20).

FIGURE 3.20

FIGURE 3.20 HDFS high availability with shared storage.

When the active name node performs any changes to the file system namespace, it persists the changes to the edit log available on the shared storage device; the standby name node constantly applies changes logged by the active name node in the edit log from the shared storage device to its own copy of the file system namespace. When a failover happens, the standby name node ensures that it has fully synchronized its file system namespace from the changes logged in the edit log before it can promote itself to the role of active name node.

The possibility of a “split-brain” scenario exists, in which both the name nodes take control of writing to the edit log on the same shared storage device at the same time. This results in data loss or corruption, or some other unexpected result. To avoid this scenario, while configuring high availability with shared storage, you can configure a fencing method for a shared storage device. Only one name node is then able to write to the edit log on the shared storage device at one time. During failover, the shared storage devices gives write access to the new active name node (earlier, the standby name node) and revokes the write access from the old active name node (now the standby name node), allowing it to only read the edit log to synchronize its copy of the file system namespace.

Quorum-based Storage Using the Quorum Journal Manager

This is one of the preferred methods. In this high-availability implementation, which leverages the Quorum Journal Manager (QJM), the active name node writes edit log modifications to a group of separate daemons. These daemons, called Journal Machines or nodes, are accessible to both the active name node (for writes) and the standby name node (for reads). In this high-availability implementation, Journal nodes act as shared edit log storage.

When the active name node performs any changes to the file system namespace, it persists the change log to the majority of the Journal nodes. The active name node commits a transaction only if it has successfully written it to a quorum of the Journal nodes. The standby or passive name node keeps its state in synch with the active name node by consuming the changes logged by the active name node and applying those changes to its own copy of the file system namespace.

When a failover happens, the standby name node ensures that it has fully synchronized its file system namespace with the changes from all the Journal nodes before it can promote itself to the role of active name node.

To avoid this scenario, Journal nodes allow only one name node to be a writer, to ensure that only one name node can write to the edit log at one time. During failover, the Journal node gives write access to the new active name node (earlier, the standby name node) and revokes the write access from the old active name node (now the standby name node). This new standby name node now can only read the edit log to synchronize its copy of file system namespace. Figure 3.21 shows HDFS high availability with Quorum-based storage works.

FIGURE 3.21

FIGURE 3.21 HDFS high availability with Quorum-based storage.

Failover Detection and Automatic Failover

Automatic failover is controlled by the dfs.ha.automatic-failover.enabled configuration property in the hdfs-site.xml configuration file. Its default value is false, but you can set it to true to enable automatic failover. When setting up automatic failover to the standby name node if the active name node fails, you must add these two components to your cluster:

  • ZooKeeper quorum—Set up ZooKeeper quorum using the ha.zookeeper.quorum configuration property in the core-site.xml configuration file and, accordingly, three or five ZooKeeper services. ZooKeeper has light resource requirements, so it can be co-located on the active and standby name node. The recommended course is to configure the ZooKeeper nodes to store their data on separate disk drives from the HDFS metadata (file system namespace and edit log), for isolation and better performance.
  • ZooKeeper Failover Controller (ZKFC)—ZKFC is a ZooKeeper client that monitors and manages the state of the name nodes for both the active and standby name node machines. ZKFC uses the ZooKeeper Service for coordination in determining a failure (unavailability of the active name node) and the need to fail over to the standby name node (see Figure 3.22). Because ZKFC runs on both the active and standby name nodes, a split-brain scenario can arise in which both nodes try to achieve active state at the same time. To prevent this, ZKFC tries to obtain an exclusive lock on the ZooKeeper service. The service that successfully obtains a lock is responsible for failing over to its respective name node and promoting it to active state.

    FIGURE 3.22

    FIGURE 3.22 HDFS high availability and automatic failover.

While implementing automatic failover, you leverage ZooKeeper for fault detection and to elect a new active name node in case of earlier name node failure. Each name node in the cluster maintains a persistent session in ZooKeeper and holds the special “lock” called Znode (an active name node holds Znode). If the active name node crashes, the ZooKeeper session then expires and the lock is deleted, notifying the other standby name node that a failover should be triggered.

ZKFC periodically pings its local name node to see if the local name node is healthy. Then it checks whether any other node currently holds the special “lock” called Znode. If not, ZKFC tries to acquire the lock. When it succeeds, it has “won the election” and is responsible for running a failover to make its local name node active.

GO TO arrow.jpg Refer to Hour 6, “Getting Started with HDInsight, Provisioning Your HDInsight Service Cluster, and Automating HDInsight Cluster Provisioning,” and Hour 7, “Exploring Typical Components of HDFS Cluster”, for more details about high availability in HDInsight.

Client Redirection on Failover

As you can see in Figure 3.23, you must specify the Java class name that the HDFS client will use to determine which name node is active currently, to serve requests and connect to the active name node.

FIGURE 3.23

FIGURE 3.23 Client redirection on failover.

The session timeout, or the amount of time required to detect a failure and trigger a failover, is controlled by the ha.zookeeper.session-timeout.ms configuration property. Its default value is 5000 milliseconds (5 seconds). A smaller value for this property ensures quick detection of the unavailability of the active name node, but with a potential risk of triggering failover too often in case of a transient error or network blip.

HDFS Federation

Horizontal scaling of HDFS is rare but needed in very large implementations such as Yahoo! or Facebook. This was a challenge with single name nodes in Hadoop 1.0. The single name node had to take care of entire namespace and block storage on all the data nodes of the cluster, which doesn’t require coordination. In Hadoop 1.0, the number of files that can be managed in HDFS was limited to the amount of RAM on a machine. The throughput of read/write operations is also limited by the power of that specific machine.

HDFS has two layers (see Figure 3.24—for Hadoop 1.0, look at the top section):

  • Namespace (managed by a name node)—The file system namespace consists of a hierarchy of files, directories, and blocks. It supports namespace-related operations such as creating, deleting, modifying, listing, opening, closing, and renaming files and directories.
  • Block storage—Block storage has two parts. The first is Block Management, managed by a name node. The second is called Storage and is managed by data nodes of the cluster.
  • Block Management (managed by a name node)—As the data nodes send periodic block-reports, the Block Management part manages the data node cluster membership by handling registrations, processing these block-reports received from data nodes, maintaining the locations of the blocks, and ensuring that blocks are properly replicated (deleting overly replicated blocks or creating more replicas for under-replicated blocks). This part also serves the requests for block-related operations such as create, delete, modify, and get block location.
  • Storage (managed by data nodes)—Blocks can be stored on the local file system (each block as separate file), with read/write access to each.
FIGURE 3.24

FIGURE 3.24 HDFS Federation and how it works.

In terms of layers, both Hadoop 1.0 and Hadoop 2.0 remain the same, with two layers: Namespace and Block Storage. However, the works these layers encompass have changed dramatically.

In Figure 3.24, you can see that, in Hadoop 2.0 (bottom part), HDFS Federation is a way of partitioning the file system namespace over multiple separated name nodes. Each name node manages only an independent slice of the overall file system namespace. Although these name nodes are part of a single cluster, they don’t talk to each other; they are federated and do not require any coordination with each other. A small cluster can use HDFS Federation for file system namespace isolation (for multitenants) or for use in large cluster, for horizontal scalability.

HDFS Federation horizontally scales the name node or name service using multiple federated name nodes (or namespaces). Each data node from the cluster registers itself with all the name nodes in the cluster, sends periodic heartbeat signals and block-reports, stores blocks managed by any of these name nodes, and handles commands from these name nodes. The collection of all the data nodes is used as common storage for blocks by all the name nodes. HDFS Federation also adds client-side namespaces to provide a unified view of the file system.

In Hadoop 2.0, the block management layer is divided into multiple block pools (see Figure 3.24). Each one belongs to a single namespace. The combination of each block pool with its associated namespace is called the namespace volume and is a self-contained unit of management. Whenever any name node or namespace is deleted, its related block pool is deleted, too.

Using HDFS Federation offers these advantages:

  • Horizontal scaling for name nodes or namespaces.
  • Performance improvement by breaking the limitation of a single name node. More name nodes mean improved throughput, with more read/write operations.
  • Isolation of namespaces for a multitenant environment.
  • Separation of services (namespace and Block Management), for flexibility in design.
  • Block pool as a generic storage service (namespace is one application to use this service.)

Horizontal scalability and isolation is fine from the cluster level implementation perspective, but what about client accessing these many namespaces? Wouldn’t it be difficult for the client to access these many namespaces at first?

HDFS Federation also adds a client-side mount table to provide an application-centric unified view of the multiple file system namespaces transparent to the client. Clients may use View File System (ViewFs), analogous to client-side mount tables in some Unix/Linux systems, to create personalized namespace views or per cluster common or global namespace views.

HDFS Snapshot

Hadoop 2.0 added a new capability to take a snapshot (read-only copy, and copy-on-write) of the file system (data blocks) stored on the data nodes. The best part of this new feature is that it has been efficiently implemented as a name node-only feature with low memory overhead, instantaneous snapshot creation, and no adverse impact on regular HDFS operations. It does not require additional storage space because no extra copies of data are required (or created) when creating a snapshot; the snapshot just records the block list and the file size without doing any data copying upon creation.

Additional memory is needed only when changes are done for the files (that belong to a snapshot), relative to a snapshot, to capture the changes in the reverse chronological order so that the current data can be accessed directly.

For example, imagine that you have two directories, d1 (it has files f1 and f2) and d2 (it has files f3 and f4), in the root folder at time T1. At time T2, a snapshot (S1) is created for the d2 directory, which includes f3 and f4 files (see Figure 3.25).

FIGURE 3.25

FIGURE 3.25 How HDFS Snapshot works.

In Figure 3.26, at time T3, file f3 was deleted and file f5 was added. At this time, space is needed to store the previous version or deleted file. At time T4, you have to look at the files under the d2 directory; you will find only f4 and f5 (current data). On the other hand, if you look at the snapshot S1, you will still find files f3 and f4 under the d2 directory; it implies that a snapshot is calculated by subtracting all the changes made after snapshot creation from the current data (for example, Snapshot data = Current Data – any modifications done after snapshot creation).

FIGURE 3.26

FIGURE 3.26 How changes are applied in HDFS Snapshot.

Consider these important characteristics of the HDFS snapshot feature:

  • You can take a snapshot of the complete HDFS file system or of any directory or subdirectory in any subtree of the HDFS file system. To take a snapshot, you first must check that you can take a snapshot for a directory or make a directory snapshottable. Note that if any snapshot is available for the directory, the directory can be neither deleted nor renamed before all the snapshots are deleted.
  • Snapshot creation is instantaneous and doesn’t require interference with other regular HDFS operations. Any changes to the file are recorded in reverse chronological order so that the current data can be accessed directly. The snapshot data is computed by subtracting the changes (which occurred since the snapshot) from the current data from the file system.
  • Although there is no limit to the number of snapshottable directories you can have in the HDFS file system, for a given directory, you can keep up to 65,536 simultaneous snapshots.
  • Subsequent snapshots for a directory consume storage space for only delta data (changes between data already in snapshot and the current state of the file system) after the first snapshot you take.
  • As of this writing, you cannot enable snapshot (snapshottable) on nested directories. As an example, if any ancestors or descendants of a specific directory have already enabled for snapshot, you cannot enable snapshot on that specific directory.
  • The .snapshot directory contains all the snapshots for a given directory (if enabled for snapshot) and is actually a hidden directory itself. Hence, you need to explicitly refer to it when referring to snapshots for that specific directory. Also, .snapshot is now a reserved word, and you cannot have a directory or a file with this name.

Consider an example in which you have configured to take a daily snapshot of your data. Suppose that a user accidently deletes a file named sample.txt on Wednesday. This file will no longer be available in the current data (see Figure 3.27). Because this file is important (maybe it got deleted accidentally), you must recover it. To do that, you can refer to the last snapshot TueSS, taken on Tuesday (assuming that the file sample.txt was already available in the file system when this snapshot was taken) to recover this file.

FIGURE 3.27

FIGURE 3.27 An example scenario for HDFS Snapshot.

HDFS Snapshot Commands

You can use either of these commands to enable snapshot for a path (the path of the directory you want to make snapshottable):

hdfs dfsadmin -allowSnapshot <path>
hadoop dfsadmin -allowSnapshot <path>

To disable to take a snapshot for a directory, you can use either of these commands:

hdfs dfsadmin -disallowSnapshot <path>
hadoop dfsadmin -disallowSnapshot <path>

When a directory has been enabled to take a snapshot, you can run either of these commands to do so. You must specify the directory path and name of the snapshot (an optional argument—if it is not specified, a default name is generated using a time stamp with the format 's'yyyyMMdd-HHmmss.SSS) you are creating. You must have owner privilege on the snapshottable directory to execute the command successfully.

hdfs dfs -createSnapshot <path> <snapshotName>
hadoop dfs -createSnapshot <path> <snapshotName>

If there is a snapshot on the directory and you want to delete it, you can use either of these commands. Again, you must have owner privilege on the snapshottable directory to execute the command successfully.

hdfs dfs -deleteSnapshot <path> <snapshotName>
hadoop dfs -deleteSnapshot <path> <snapshotName>

You can even rename the already created snapshot to another name, if needed, with the command discussed next. You need owner privilege on the snapshottable directory to execute the command successfully.

hdfs dfs -renameSnapshot <path> <oldSnapshotName> <newSnapshotName>
hadoop dfs -renameSnapshot <path> <oldSnapshotName> <newSnapshotName>

If you need to compare two snapshots and identify the differences between them, you can use either of these commands. This requires you to have read access for all files and directories in both snapshots.

hdfs snapshotDiff <path> <startingSnapshot> <endingSnapshot>
hadoop snapshotDiff <path> <startingSnapshot> <endingSnapshot>

You can use either of these commands to get a list of all the snapshottable directories where you have permission to take snapshots:

hdfs lsSnapshottableDir
hadoop lsSnapshottableDir

You can use this command to list all the snapshots for the directory specified—for example, for a directory enabled for snapshot, the path component .snapshot is used for accessing its snapshots:

hdfs dfs -ls /<path>/.snapshot

You can use this command to list all the files in the snapshot s5 for the testdir directory:

hdfs dfs -ls /testdir/.snapshot/s5

You can use all the regular commands (or native Java APIs) against snapshot. For example, you can use this command to list all the files available in the bar subdirectory under the foo directory from the s5 snapshot of the testdir directory:

hdfs dfs –ls /testdir/.snapshot/s5/foo/bar

Likewise, you can use this command to copy the file sample.txt, which is available in the bar subdirectory under the foo directory from the s5 snapshot of the testdir directory to the tmp directory:

hdfs dfs -cp /testdir/.snapshot/s5/foo/bar/sample.txt /tmp

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Overview


Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

Collection and Use of Information


To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

Newsletters

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

Other Collection and Use of Information


Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

Cookies and Related Technologies

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

Do Not Track

This site currently does not respond to Do Not Track signals.

Security


Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

Children


This site is not directed to children under the age of 13.

Marketing


Pearson may send or direct marketing communications to users, provided that

  • Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
  • Such marketing is consistent with applicable law and Pearson's legal obligations.
  • Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
  • Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

Correcting/Updating Personal Information


If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

Choice/Opt-out


Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

Sale of Personal Information


Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

Supplemental Privacy Statement for California Residents


California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

Sharing and Disclosure


Pearson may disclose personal information, as follows:

  • As required by law.
  • With the consent of the individual (or their parent, if the individual is a minor)
  • In response to a subpoena, court order or legal process, to the extent permitted or required by law
  • To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
  • In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
  • To investigate or address actual or suspected fraud or other illegal activities
  • To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
  • To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
  • To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

Links


This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.

Requests and Contact


Please contact us about this Privacy Notice or if you have any requests or questions relating to the privacy of your personal information.

Changes to this Privacy Notice


We may revise this Privacy Notice through an updated posting. We will identify the effective date of the revision in the posting. Often, updates are made to provide greater clarity or to comply with changes in regulatory requirements. If the updates involve material changes to the collection, protection, use or disclosure of Personal Information, Pearson will provide notice of the change through a conspicuous notice on this site or other appropriate way. Continued use of the site after the effective date of a posted revision evidences acceptance. Please contact us if you have questions or concerns about the Privacy Notice or any objection to any revisions.

Last Update: November 17, 2020