
HDFS Architecture

HDFS has a master and slaves architecture in which the master is called the name node and the slaves are called data nodes (see Figure 3.1). An HDFS cluster consists of a single name node, which manages the file system namespace (or metadata) and controls access to files by client applications, and multiple data nodes (often hundreds or thousands), each of which manages the file storage and the storage devices attached to it.

FIGURE 3.1

FIGURE 3.1 How a client reads and writes to and from HDFS.

While storing a file, HDFS internally splits it into one or more blocks (chunks of 64MB by default; this is configurable and can be changed at the cluster level or when each file is created). These blocks are stored on a set of slaves, called data nodes, so that reads and writes can proceed in parallel even on a single file. Multiple copies of each block are stored, per the replication factor (which is configurable and can be changed at the cluster level, at file creation, or even later for an already stored file), to make the platform fault tolerant.
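
The arithmetic behind the split is simple enough to sketch. The following plain Python (an illustration, not HDFS code) shows how a 200MB file breaks into 64MB blocks, with the last block holding the remainder:

```python
# Illustrative sketch (not HDFS source): how a file is split into
# fixed-size blocks, with the last block holding the remainder.

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the Hadoop 1.x default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block for a file of file_size bytes."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

sizes = split_into_blocks(200 * 1024 * 1024)  # a 200MB file
print(len(sizes))   # 4 blocks
print(sizes[-1])    # the last block holds only the remaining 8MB
```

Each of these blocks can then be written to (and later read from) a different data node in parallel.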

The name node is also responsible for managing file system namespace operations, including opening, closing, and renaming files and directories, and it records any changes to the file system namespace or its properties. The name node holds the replication factor of each file, along with a map from the blocks of each individual file to the data nodes where those blocks exist. Data nodes serve read and write requests from HDFS clients and perform operations such as block creation, deletion, and replication when the name node tells them to. Data nodes store and retrieve blocks when they are told to (by client applications or by the name node), and they periodically report back to the name node with lists of the blocks they are storing, to keep the name node up to date on the current status.

A client application talks to the name node to get metadata about the file system. It then connects to data nodes directly, so that data can be transferred back and forth between the client and the data nodes.

The name node and data node are pieces of software called daemons in the Hadoop world. A typical deployment has a dedicated high-end machine that runs only the name node daemon; the other machines in the cluster run one instance of the data node daemon apiece on commodity hardware. Next are some reasons you should run a name node on a high-end machine:

  • The name node is a single point of failure. Make sure it has enough processing power and storage capabilities to handle loads. You need a scaled-up machine for a name node.
  • The name node keeps metadata related to the file system namespace in memory, for quicker response time. Hence, more memory is needed.
  • The name node coordinates with hundreds or thousands of data nodes and serves the requests coming from client applications.

As discussed earlier, HDFS is based on a traditional hierarchical file organization. A user or application can create directories or subdirectories and store files inside. This means that you can create a file, delete a file, rename a file, or move a file from one directory to another.

All this information, along with information about the data nodes and the blocks stored on each of them, is recorded in the file system namespace, called the fsimage, which is stored as a file in the local host OS file system on the name node. This fsimage file is not updated with every addition or removal of a block in the file system. Instead, the name node logs and maintains these add/remove operations in a separate edit log file, which exists as another file in the local host OS file system. Appending updates to a separate edit log achieves faster I/O.
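
The fsimage/edit log split can be sketched in a few lines (illustrative Python, not HDFS source; the paths and block names are made up). Changes are cheap O(1) appends to the log, and the expensive merge into the image happens only periodically:

```python
# Illustrative sketch: a checkpointed namespace image plus an append-only
# edit log, merged periodically rather than on every change.

fsimage = {"/data/file1": ["blk_1", "blk_2"]}   # checkpointed namespace
edit_log = []                                    # operations since last checkpoint

def log_add(path, block):
    edit_log.append(("add", path, block))        # cheap append, no image rewrite

def merge_checkpoint():
    """Replay the edit log into the fsimage, as the Checkpoint process does."""
    for op, path, block in edit_log:
        if op == "add":
            fsimage.setdefault(path, []).append(block)
    edit_log.clear()

log_add("/data/file1", "blk_3")
log_add("/data/file2", "blk_4")
merge_checkpoint()
print(fsimage["/data/file1"])   # ['blk_1', 'blk_2', 'blk_3']
```

In real HDFS, this merge is exactly the job the secondary name node performs, as described next.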

A secondary name node is another daemon. Contrary to its name, the secondary name node is not a standby name node, so it is not meant as a backup in case of name node failure. The primary purpose of the secondary name node is to periodically download the fsimage and the edit log file from the name node, create a new fsimage by merging the older fsimage with the edit log, and upload the new fsimage back to the name node. By periodically merging the namespace fsimage with the edit log, the secondary name node prevents the edit log from becoming too large.

The process of generating a new fsimage from a merge operation is called the Checkpoint process (see Figure 3.2). Usually the secondary name node runs on a separate physical machine from the name node; it also requires plenty of CPU and as much memory as the name node to perform the Checkpoint operation.

FIGURE 3.2

FIGURE 3.2 Checkpoint process.

The core-site.xml configuration file for Hadoop 1.0 contains some configuration settings related to the Checkpoint process; you can change these settings to change the Hadoop behavior. Table 3.1 shows the Checkpoint-related configuration in Hadoop 1.0.

TABLE 3.1 Checkpoint-Related Configuration in Hadoop 1.0

Setting: fs.checkpoint.dir
Value: c:\hadoop\HDFS\2nn
Description: Determines where on the local file system the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories, the image is replicated in all the directories for redundancy.

Setting: fs.checkpoint.edits.dir
Value: c:\hadoop\HDFS\2nn
Description: Determines where on the local file system the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories, the edits are replicated in all the directories for redundancy. The default value is the same as fs.checkpoint.dir.

Setting: fs.checkpoint.period
Value: 86400
Description: The number of seconds between two periodic Checkpoints.

Setting: fs.checkpoint.size
Value: 2048000000
Description: The size of the current edit log (in bytes) that triggers a periodic Checkpoint, even if fs.checkpoint.period hasn't expired.
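
The two Hadoop 1.0 triggers combine as a simple "whichever comes first" condition, sketched below (illustrative Python using the default values above, not HDFS source):

```python
# Illustrative sketch: a Checkpoint fires when either the time limit
# (fs.checkpoint.period) or the edit log size limit (fs.checkpoint.size)
# is reached, whichever comes first.

FS_CHECKPOINT_PERIOD = 86400        # seconds (one day, the default)
FS_CHECKPOINT_SIZE = 2048000000     # bytes (the default)

def should_checkpoint(seconds_since_last, edit_log_bytes):
    return (seconds_since_last >= FS_CHECKPOINT_PERIOD
            or edit_log_bytes >= FS_CHECKPOINT_SIZE)

print(should_checkpoint(3600, 1000))             # False: neither limit reached
print(should_checkpoint(90000, 1000))            # True: period expired
print(should_checkpoint(3600, 3_000_000_000))    # True: edit log too large
```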

Table 3.2 shows some configuration settings related to the Checkpoint process that are available in the core-site.xml configuration file for Hadoop 2.0.

TABLE 3.2 Checkpoint-Related Configuration in Hadoop 2.0

Setting: dfs.namenode.checkpoint.dir
Description: Determines where on the local file system the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories, the image is replicated in all the directories for redundancy.

Setting: dfs.namenode.checkpoint.edits.dir
Description: Determines where on the local file system the DFS secondary name node should store the temporary edits to merge. If this is a comma-delimited list of directories, the edits are replicated in all the directories for redundancy. The default value is the same as dfs.namenode.checkpoint.dir.

Setting: dfs.namenode.checkpoint.period
Description: The number of seconds between two periodic checkpoints.

Setting: dfs.namenode.checkpoint.txns
Description: The secondary name node or checkpoint node creates a checkpoint of the namespace every dfs.namenode.checkpoint.txns transactions, regardless of whether dfs.namenode.checkpoint.period has expired.

Setting: dfs.namenode.checkpoint.check.period
Description: The secondary name node and checkpoint node poll the name node every dfs.namenode.checkpoint.check.period seconds to query the number of uncheckpointed transactions.

Setting: dfs.namenode.checkpoint.max-retries
Description: The secondary name node retries failed checkpointing. If the failure occurs while loading the fsimage or replaying the edit log, the number of retries is limited by this variable.

Setting: dfs.namenode.num.checkpoints.retained
Description: The number of image checkpoint files retained by the name node and secondary name node in their storage directories. All edit logs necessary to recover an up-to-date namespace from the oldest retained checkpoint are also retained.

With the secondary name node performing this task periodically, the name node can restart relatively quickly. Otherwise, the name node would need to perform this merge operation itself when it restarted.

The secondary name node is also responsible for backing up the name node fsimage (a copy of the merged fsimage), which is used if the primary name node fails. However, the state of the secondary name node lags that of the primary, so if the primary name node fails, data loss might occur.

File Split in HDFS

As discussed earlier, HDFS works best with a small number of very large files when storing the large data sets that applications need. As you can see in Figure 3.3, while storing a file, HDFS internally splits its content into one or more data blocks (chunks of 64MB by default; this is configurable and can be changed at the cluster level for all file writes or when each specific file is created). These data blocks are stored on a set of slaves called data nodes, to enable parallel data reads and writes.

FIGURE 3.3

FIGURE 3.3 File split process when writing to HDFS.

All blocks of a file are the same size except the last block, which can be either the same size or smaller. HDFS stores each file as a sequence of blocks, with each block stored as a separate file in the local file system (such as NTFS).

Cluster-wide block size is controlled by the dfs.blocksize configuration property in the hdfs-site.xml file. The dfs.blocksize property applies to files that are created without a block size specification. It has a default value of 64MB in Hadoop 1.0 and usually ranges from 64MB to 128MB in practice; in Hadoop 2.0, the default block size is 128MB, which many installations now use (see Table 3.3). The block size can continue to grow as transfer speeds grow with new generations of disk drives.

TABLE 3.3 Block Size Configuration

Name: dfs.blocksize
Value: 134217728
Description: The default block size for new files, in bytes. You can use the following suffixes (case insensitive) to specify the size: k (kilo), m (mega), g (giga), t (tera), p (peta), or e (exa) — for example, 128k, 512m, or 1g. Alternatively, provide the complete size in bytes, such as 134217728 for 128MB.
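
The suffix notation in Table 3.3 maps onto powers of 1024, which can be sketched as a small parser (illustrative Python, not Hadoop's own configuration parser):

```python
# Illustrative sketch: converting the size suffixes accepted by
# dfs.blocksize (k, m, g, t, p, e -- case insensitive) into bytes.

SUFFIXES = {"k": 1, "m": 2, "g": 3, "t": 4, "p": 5, "e": 6}

def parse_size(value):
    """Parse a size like '128m' or '134217728' into a byte count."""
    value = value.strip().lower()
    if value[-1] in SUFFIXES:
        return int(value[:-1]) * 1024 ** SUFFIXES[value[-1]]
    return int(value)

print(parse_size("128m"))        # 134217728, the Hadoop 2.0 default
print(parse_size("64M"))         # 67108864, the Hadoop 1.0 default
print(parse_size("134217728"))   # 134217728, already in bytes
```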

Block Placement and Replication in HDFS

You have already seen that each file is broken in multiple data blocks. Now you can explore how these data blocks get stored. By default, each block of a file is stored three times on three different data nodes: The replication factor configuration property has a default value of 3 (see Table 3.4).

TABLE 3.4 Block Replication Configuration

Name: dfs.replication
Value: 3
Description: Default block replication. The actual number of replicas can be specified when the file is created. The default is used if replication is not specified at create time.

Name: dfs.replication.max
Value: 512
Description: Maximum block replication.

Name: dfs.namenode.replication.min
Value: 1
Description: Minimal block replication.

When a file is created, an application can specify the number of replicas of each block of the file that HDFS must maintain. Multiple copies or replicas of each block make the file fault tolerant: If one copy is not accessible or gets corrupted, the data can be read from another copy. The number of copies of each block is called the replication factor for a file, and it applies to all blocks of the file.

An application can override the default replication factor configuration and specify another replication factor for a file. The replication factor can be specified at file creation time and can even be changed later, when needed, for an already stored file.

The name node has the responsibility of ensuring that the number of copies or replicas of each block is maintained according to the applicable replication factor for each file. If necessary, it instructs the appropriate data nodes to maintain the defined replication factor for each block of a file.

Each data node in the cluster periodically sends a heartbeat signal and a block-report to the name node. When the name node receives the heartbeat signal, it implies that the data node is active and functioning properly. A block-report from a data node contains a list of all blocks on that specific data node.

A typical Hadoop installation spans hundreds or thousands of nodes. Data nodes are physically grouped together into racks, so you effectively have a few dozen racks. For example, imagine that you have 100 nodes in a cluster and that each rack can hold 5 nodes. You then need 20 racks to accommodate all 100 nodes, each rack containing 5 nodes.

The simplest block placement solution is to place each copy or replica of a block in a separate rack. Although this ensures that data is not lost even in case of multiple rack failures and delivers an enhanced read operation by utilizing bandwidth from all the racks, it incurs a huge performance penalty when writing data to HDFS because a write operation must transfer blocks to multiple racks. Remember also that communication between data nodes across racks is much more expensive than communication across nodes in a single rack.

The other solution is to place all the replicas on different data nodes within a single rack. This improves write performance, but a rack failure would result in total data loss.

To balance these trade-offs, HDFS has a sensible default block placement policy. Its objective is a properly load-balanced, fast-access, fault-tolerant file system:

  • The first replica is written to the data node creating the file, to improve the write performance because of the write affinity.
  • The second replica is written to another data node within the same rack, to minimize the cross-rack network traffic.
  • The third replica is written to a data node in a different rack, ensuring that even if a switch or rack fails, the data is not lost. (This applies only if you have configured your cluster for rack awareness, as discussed in the section “Rack Awareness” later in this hour.)
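
The three rules above can be sketched as a small placement function (illustrative Python, not HDFS source; it assumes rack awareness is configured and that the node and rack names are hypothetical):

```python
# Illustrative sketch of the default block placement policy described above,
# for a replication factor of 3.

def place_replicas(writer_node, nodes_by_rack):
    """Pick three data nodes for a block written from writer_node.

    nodes_by_rack maps rack name -> list of data node names.
    """
    local_rack = next(rack for rack, nodes in nodes_by_rack.items()
                      if writer_node in nodes)
    # 1st replica: the node writing the file (write affinity)
    first = writer_node
    # 2nd replica: another node in the same rack (minimal cross-rack traffic)
    second = next(n for n in nodes_by_rack[local_rack] if n != first)
    # 3rd replica: a node in a different rack (survives a rack failure)
    remote_rack = next(r for r in nodes_by_rack if r != local_rack)
    third = nodes_by_rack[remote_rack][0]
    return [first, second, third]

cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", cluster))   # ['dn1', 'dn2', 'dn3']
```

Note that a real name node also weighs node load and free space when choosing among candidate nodes; this sketch only encodes the rack rules.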

You can see in Figure 3.4 that this default block placement policy cuts the cross-rack write traffic. It generally improves write performance without compromising on data reliability or availability, while still maintaining read performance.

FIGURE 3.4

FIGURE 3.4 Data block placement on data nodes.

The replication factor is an important configuration to consider. The default replication factor of 3 provides a good balance of write and read performance while also ensuring reliability and availability. However, sometimes you need to change the replication factor configuration property or the replication factor setting for a file. For example, you need to change the replication factor configuration to 1 if you have a single-node cluster.

For other cases, consider an example. Suppose you have some large files whose loss would be acceptable (for example, a file contains data older than 5 years, and you usually analyze only the last 5 years of data), and suppose you could re-create these files in case of data loss. You can set their replication factor to 1 to minimize storage requirements and, of course, to minimize the time taken to write them.

You can instead set the replication factor to 2, which requires double the storage space but ensures availability if a data node fails (although it might not help in case of a rack failure). Or you can raise the replication factor to 4 or higher, which improves read performance at the cost of a more expensive write operation and the extra storage space needed for the additional copies.

Writing to HDFS

As discussed earlier, when a client or application wants to write a file to HDFS, it reaches out to the name node with details of the file. The name node responds based on the actual size of the file and the block and replication configuration; its response contains the number of blocks of the file, the replication factor, and the data nodes where each block will be stored (see Figure 3.5).

FIGURE 3.5

FIGURE 3.5 The client talks to the name node for metadata to specify where to place the data blocks.

Based on the information received from the name node, the client or application splits the file into multiple blocks and starts sending them to the first data node. Normally, the first replica is written to the data node creating the file, to improve write performance because of the write affinity.

As you see in Figure 3.6, Block A is transferred to data node 1 along with details of the two other data nodes where this block needs to be stored. When it receives Block A from the client (assuming a replication factor of 3), data node 1 copies the same block to the second data node (in this case, data node 2 in the same rack). This involves a block transfer via the rack switch because both of these data nodes are in the same rack. When it receives Block A from data node 1, data node 2 copies the same block to the third data node (in this case, data node 3 in another rack). This involves a block transfer via an out-of-rack switch along with a rack switch because the two data nodes are in separate racks.
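
This forwarding chain can be sketched as a tiny simulation (illustrative Python, not HDFS code): the block hops from the client to the first data node, then from each data node to the next one in the list supplied by the name node:

```python
# Illustrative sketch of the write pipeline: each node forwards the block
# to the next node in the ordered target list from the name node.

def pipeline_write(block, targets):
    """Return the sequence of (sender, receiver, block) transfers."""
    transfers = []
    sender = "client"
    for target in targets:
        transfers.append((sender, target, block))  # block hops one node at a time
        sender = target
    return transfers

for hop in pipeline_write("blk_A", ["dn1", "dn2", "dn3"]):
    print(hop)
# ('client', 'dn1', 'blk_A')
# ('dn1', 'dn2', 'blk_A')
# ('dn2', 'dn3', 'blk_A')
```

The pipeline means the client uploads each block only once; the data nodes themselves carry the cost of the second and third copies.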

FIGURE 3.6

FIGURE 3.6 The client sends data blocks to identified data nodes.

When all the instructed data nodes receive a block, each one sends a write confirmation to the name node (see Figure 3.7).

FIGURE 3.7

FIGURE 3.7 Data nodes update the name node about receipt of the data blocks.

Finally, the first data node in the flow sends the confirmation of the Block A write to the client (after all the data nodes send confirmation to the name node) (see Figure 3.8).

FIGURE 3.8

FIGURE 3.8 The first data node sends an acknowledgment back to the client.

For example, Figure 3.9 shows how the data block layout should look after Blocks A, B, and C have been transferred to the different data nodes of the cluster, based on file system namespace metadata from the name node. This process continues for all the other blocks of the file.

FIGURE 3.9

FIGURE 3.9 All data blocks are placed in a similar way.

HDFS uses several optimization techniques. One is client-side caching, performed by the HDFS client, to improve the performance of the block write operation and to minimize network congestion. The HDFS client transparently caches the file into a temporary local file. When it has accumulated enough data for a full block, the client reaches out to the name node.

At this time, the name node responds by inserting the filename into the file system hierarchy and allocating data nodes for its storage. The client flushes the block of data from the local temporary file to the closest data node, and that data node copies the block to other data nodes to maintain the replication factor (as instructed by the name node, based on the replication factor of the file).
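
A minimal sketch of this buffering behavior follows (plain Python, not the actual HDFS client; the class name and use of a temporary file mirror the description above but are otherwise assumptions):

```python
# Illustrative sketch: client-side caching into a local temporary file,
# flushed once a full 64MB block has accumulated.

import tempfile

BLOCK_SIZE = 64 * 1024 * 1024

class HdfsClientSketch:
    def __init__(self):
        self.buffer = tempfile.TemporaryFile()  # local temporary cache file
        self.buffered = 0
        self.flushed_blocks = 0

    def write(self, data):
        self.buffer.write(data)
        self.buffered += len(data)
        while self.buffered >= BLOCK_SIZE:
            self._flush_block()

    def _flush_block(self):
        # In real HDFS, the client would now contact the name node for
        # target data nodes and stream the block to the closest one.
        self.buffered -= BLOCK_SIZE
        self.flushed_blocks += 1

client = HdfsClientSketch()
client.write(b"x" * (70 * 1024 * 1024))   # a 70MB write
print(client.flushed_blocks)              # 1 full block flushed
print(client.buffered)                    # 6MB still cached locally
```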

When all the blocks of a file are transferred to the respective data nodes, the client tells the name node that the file is closed. The name node then commits the file creation operation to a persistent store.

Reading from HDFS

To read a file from HDFS, the client or application reaches out to the name node with the name and location of the file. The name node responds with the number of blocks of the file and the data nodes where each block has been stored (see Figure 3.10).

FIGURE 3.10

FIGURE 3.10 The client talks to the name node to get metadata about the file it wants to read.

Now the client or application reaches out to the data nodes directly (without involving the name node for actual data transfer—data blocks don’t pass through the name node) to read the blocks of the files in parallel, based on information received from the name node. When the client or application receives all the blocks of the file, it combines these blocks into the form of the original file (see Figure 3.11).

FIGURE 3.11

FIGURE 3.11 The client starts reading data blocks of the file from the identified data nodes.

To improve read performance, HDFS tries to reduce bandwidth consumption by satisfying a read request from the replica that is closest to the reader. It looks for the block on the same node, then on another node in the same rack, and finally on a data node in another rack. If the HDFS cluster spans multiple data centers, a replica that resides in the local data center (the closest one) is preferred over any replica in a remote data center.
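
This same-node, same-rack, same-data-center preference amounts to a distance metric, sketched below (illustrative Python; the tuple-based location encoding is an assumption, not how HDFS represents network topology):

```python
# Illustrative sketch: pick the closest replica, preferring same node,
# then same rack, then same data center, then anything remote.

def distance(reader, replica):
    """Smaller is closer. Each location is (data_center, rack, node)."""
    if reader == replica:
        return 0                   # same node
    if reader[:2] == replica[:2]:
        return 1                   # same rack
    if reader[0] == replica[0]:
        return 2                   # same data center, different rack
    return 3                       # remote data center

def closest_replica(reader, replicas):
    return min(replicas, key=lambda r: distance(reader, r))

reader = ("dc1", "rack1", "dn1")
replicas = [("dc2", "rack9", "dn42"),
            ("dc1", "rack2", "dn7"),
            ("dc1", "rack1", "dn2")]
print(closest_replica(reader, replicas))   # ('dc1', 'rack1', 'dn2') -- same rack
```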

Handling Failures

On cluster startup, the name node enters into a special state called safe mode. During this time, the name node receives a heartbeat signal (implying that the data node is active and functioning properly) and a block-report from each data node (containing a list of all blocks on that specific data node) in the cluster. Figure 3.12 shows how all the data nodes of the cluster send a periodic heartbeat signal and block-report to the name node.

FIGURE 3.12

FIGURE 3.12 All data nodes periodically send heartbeat signals to the name node.

Based on the replication factor setting, each block has a specified minimum number of replicas to be maintained. A block is considered safely replicated when the required number of replicas (based on the replication factor) of that block has checked in with the name node. If the name node identifies blocks with fewer than the minimum number of replicas, it prepares a list of them.

After this process, plus an additional few seconds, the name node exits the safe mode state. It then replicates the listed blocks (those with fewer than the specified number of replicas) to other data nodes.
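
The under-replication check can be sketched by tallying block-reports (illustrative Python; real HDFS tracks this in the name node's in-memory block map):

```python
# Illustrative sketch: after collecting block-reports in safe mode, list
# the blocks that have fewer replicas than the replication factor requires.

def under_replicated(block_reports, replication_factor=3):
    """block_reports maps data node name -> list of block IDs it holds."""
    counts = {}
    for blocks in block_reports.values():
        for block in blocks:
            counts[block] = counts.get(block, 0) + 1
    return sorted(b for b, n in counts.items() if n < replication_factor)

reports = {"dn1": ["blk_A", "blk_B"],
           "dn2": ["blk_A", "blk_B"],
           "dn3": ["blk_A"]}
print(under_replicated(reports))   # ['blk_B'] -- only two replicas checked in
```

The name node would then instruct data nodes holding blk_B to copy it to a third node.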

Now let’s examine how the name node handles a data node failure. In Figure 3.13, you can see four data nodes (two data nodes in each rack) in the cluster. These data nodes periodically send heartbeat signals (implying that a particular data node is active and functioning properly) and a block-report (containing a list of all blocks on that specific data node) to the name node.

FIGURE 3.13

FIGURE 3.13 The name node updates its metadata based on information it receives from the data nodes.

The name node thus is aware of all the active or functioning data nodes of the cluster and what block each one of them contains. You can see that the file system namespace contains the information about all the blocks from each data node (see Figure 3.13).

Now imagine that data node 4 has stopped working. In this case, data node 4 stops sending heartbeat signals to the name node. After a certain period of time without heartbeats, the name node concludes that data node 4 has died and is no longer part of the cluster, and that whatever data node 4 contained should be replicated or load-balanced to the remaining data nodes.

As you can see in Figure 3.14, the dead data node 4 contained Blocks B and C, so the name node instructs the other data nodes in the cluster that contain Blocks B and C to replicate them in such a way that the load is balanced and the replication factor is maintained for each block. The name node then updates its file system namespace with the latest information about the blocks and where they now reside.

FIGURE 3.14

FIGURE 3.14 Handling a data node failure transparently.

Deleting Files from HDFS and Decreasing the Replication Factor

By default, when you delete a file or a set of files from HDFS, the files are deleted permanently, with no way to recover them. But don't worry: HDFS has a feature called Trash that you can enable to recover accidentally deleted files. As you can see in Table 3.5, this feature is controlled by two configuration properties in the core-site.xml configuration file: fs.trash.interval and fs.trash.checkpoint.interval.

TABLE 3.5 Trash-Related Configuration

Name: fs.trash.interval
Description: The number of minutes after which a trash checkpoint gets deleted. If zero, the trash feature is disabled. This option can be configured on both the server and the client. If trash is disabled on the server side, the client-side configuration is checked. If trash is enabled on the server side, the value configured on the server is used and the client configuration value is ignored.

Name: fs.trash.checkpoint.interval
Description: The number of minutes between trash checkpoints. Should be smaller than or equal to fs.trash.interval. If zero, the value is set to the value of fs.trash.interval. Each time the checkpoint process runs, it creates a new checkpoint from the current trash contents and removes checkpoints created more than fs.trash.interval minutes ago.

By default, the value of fs.trash.interval is 0, which means trash is disabled. To enable it, set it to any numeric value greater than 0, expressed in minutes. This instructs HDFS to keep deleted files in the Trash folder for that many minutes before permanently deleting them from the system. In other words, it is the interval during which a deleted file remains in the Trash folder and can be recovered from there, until it crosses fs.trash.interval at a subsequent trash checkpoint.

By default, the value of fs.trash.checkpoint.interval is also 0. You can set it to any numeric value, but it must be smaller than or equal to the value of fs.trash.interval. It indicates how often the trash checkpoint operation runs. During a trash checkpoint, HDFS checks for all files older than the specified fs.trash.interval and deletes them. For example, if you set fs.trash.interval to 120 and fs.trash.checkpoint.interval to 60, the trash checkpoint operation kicks in every 60 minutes to see whether any files are older than 120 minutes; if so, it deletes those files permanently from the Trash folder.
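
The worked example above (fs.trash.interval=120, checkpoints every 60 minutes) can be sketched as follows (illustrative Python, not the HDFS Trash implementation; the file paths are hypothetical):

```python
# Illustrative sketch: each trash checkpoint permanently removes trashed
# files older than fs.trash.interval (here 120 minutes).

FS_TRASH_INTERVAL = 120   # minutes

def run_trash_checkpoint(trash, now_minutes):
    """trash maps file path -> deletion time in minutes; returns what expired."""
    expired = [path for path, deleted_at in trash.items()
               if now_minutes - deleted_at >= FS_TRASH_INTERVAL]
    for path in expired:
        del trash[path]
    return expired

trash = {"/user/alice/old.csv": 0, "/user/alice/new.csv": 100}
print(run_trash_checkpoint(trash, now_minutes=130))  # ['/user/alice/old.csv']
print(sorted(trash))   # ['/user/alice/new.csv'] is still recoverable
```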

When you decrease the replication factor for a file already stored in HDFS, the name node determines, on the next heartbeat, the excess replicas of the blocks of that file to be removed. It passes this information to the appropriate data nodes, which remove the corresponding blocks and free up the occupied storage space.
