Talk to most people about Apache™ Hadoop®, and the conversation quickly turns to the MapReduce programming model. MapReduce works quite well as a processing model for many types of problems. In particular, when multiple mapping processes span terabytes of data, the power of a scalable Hadoop cluster becomes evident. In Hadoop version 1, MapReduce was one of two core components; the other was the Hadoop Distributed File System (HDFS). Once data was stored and replicated in HDFS, the MapReduce engine could move computation to the servers on which specific data resided. The result was a very fast, parallel approach to problems involving large amounts of data.
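To make the model concrete, the map-shuffle-reduce data flow can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`) are invented for the sketch:

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped}

docs = ["the quick fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# → {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster, each phase runs as many parallel tasks spread across nodes, with HDFS providing the data locality described above.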
The Hadoop version 1 ecosystem is illustrated in Figure 1. The core components, HDFS and MapReduce, are shown in the gray boxes. Other tools and services are built on top of these components. Notice that the only form of distributed processing is in the MapReduce component. Part of this design includes a JobTracker service that manages and assigns cluster-computing resources.
Figure 1 Hadoop version 1 ecosystem.
This model works well provided that your problem fits cleanly into a MapReduce framework. If that's not the case, MapReduce usually won't be the most efficient method for your application, if it works at all. To address this and other limitations, Hadoop version 2 "demoted" the MapReduce process to an application framework that runs under the new Yet Another Resource Negotiator (YARN) resource manager. Thus, Hadoop version 2 no longer offers a single MapReduce engine, but rather a more general cluster-resource manager.
Figure 2 shows the Hadoop version 2 ecosystem with YARN. Notice that the ecosystem is virtually identical to that shown in Figure 1, except that the MapReduce processing component has been split into YARN and a MapReduce framework. The relationships and functionality of the projects built on Hadoop version 1 MapReduce remain the same; for example, Hive still works on top of MapReduce. Also, the version 1 JobTracker has been replaced by the ResourceManager, which lives entirely within YARN.
Figure 2 Hadoop version 2 ecosystem.
The migration from a monolithic MapReduce engine in Hadoop version 1 to the YARN general-purpose parallel resource-management system in Hadoop version 2 is changing the Hadoop landscape. Organizational data stored in HDFS is now amenable not just to MapReduce but to any type of processing. Developers are no longer forced to fit solutions into a MapReduce framework, and instead can use other analysis frameworks—or create their own.
Limitations of Hadoop Version 1
Hadoop version 1 had some well-known scalability limitations. Due to the monolithic design of the MapReduce engine, in which a single service tracked both resource management and job progress, the maximum cluster size was limited to about 4,000 nodes, and the number of concurrent tasks to about 40,000. Another issue was the reliance on a single JobTracker process, which was a single point of failure: if the JobTracker failed, all queued and running jobs were killed.
In terms of efficiency, the biggest limitation was the hard partitioning of resources into map and reduce slots, with no way to redistribute those resources so the cluster could be used more efficiently. Finally, many people requested alternate paradigms and services, such as graph processing. To address these issues, a major redesign was needed. Essentially, the monolithic MapReduce process was broken into two parts: a resource scheduler (YARN) and a generalized processing framework that includes MapReduce.
Hadoop Version 2 Features
In creating the new version, the developers preserved all the MapReduce capability found in version 1. Thus, there was no penalty for upgrading to version 2, because all version 1 MapReduce applications would work. Most applications are binary-compatible; at worst, in some rare cases, applications might need to be recompiled.
The YARN scheduler in version 2 also brought a more dynamic approach to resource containers. (A resource container is a number of compute cores, usually one, and a certain amount of memory.) In Hadoop version 1, these resource containers were either "mapper slots" or "reducer slots." In version 2, they're generalized slots that are under dynamic control. Thus, most clusters see an immediate boost in efficiency when switching to Hadoop version 2.
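The efficiency gain from generalized slots can be shown with a toy calculation. The slot counts and workload below are invented for illustration, but the arithmetic captures why fixed map/reduce partitioning wastes capacity on skewed workloads:

```python
def tasks_running_v1(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Concurrent tasks when slots are hard-partitioned by type (version 1 style).
    Map tasks can only occupy map slots, reduce tasks only reduce slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)

def tasks_running_yarn(map_tasks, reduce_tasks, total_slots):
    """Concurrent tasks when any container can run any task (version 2 style)."""
    return min(map_tasks + reduce_tasks, total_slots)

# A map-heavy workload (9 maps, 1 reduce) on a 10-slot cluster
# that version 1 has statically split into 6 map and 4 reduce slots:
print(tasks_running_v1(9, 1, map_slots=6, reduce_slots=4))   # → 7 (3 reduce slots sit idle)
print(tasks_running_yarn(9, 1, total_slots=10))              # → 10 (fully utilized)
```

Under dynamic containers, the same hardware runs the whole workload at once instead of leaving reduce slots idle while map tasks queue.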
The scalability issue has been resolved through the use of a dedicated resource scheduler—YARN. The resource scheduler has no responsibility for running or monitoring job progress. Indeed, the YARN ResourceManager component doesn't care what type of job the user is running; it simply assigns resources and steps out of the way. This design also allows for a failover ResourceManager component, so that it's no longer a single point of failure.
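The "assign resources and step out of the way" design can be sketched as a toy scheduler. This is a conceptual model only, not the real YARN ResourceManager API; the class and method names are invented, and real containers are negotiated through an ApplicationMaster:

```python
class ToyResourceManager:
    """Toy scheduler in the spirit of YARN: it hands out containers
    (cores plus memory) without inspecting what the job will run,
    and it does not track job progress at all."""

    def __init__(self, total_cores, total_mem_gb):
        self.free_cores = total_cores
        self.free_mem = total_mem_gb

    def request_container(self, cores, mem_gb):
        """Grant a container if resources allow; otherwise the caller
        must wait and retry. Monitoring the work inside the container
        is the application's own responsibility."""
        if cores <= self.free_cores and mem_gb <= self.free_mem:
            self.free_cores -= cores
            self.free_mem -= mem_gb
            return {"cores": cores, "mem_gb": mem_gb}
        return None

    def release(self, container):
        """Return a container's resources to the free pool."""
        self.free_cores += container["cores"]
        self.free_mem += container["mem_gb"]

rm = ToyResourceManager(total_cores=8, total_mem_gb=32)
c1 = rm.request_container(4, 16)   # granted
c2 = rm.request_container(8, 8)    # denied: only 4 cores remain free
print(c1, c2)
```

Because the scheduler holds no job state, it is also much easier to make highly available, which is what enables the failover ResourceManager mentioned above.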
Because YARN is a general scheduler, support of non-MapReduce jobs is now available to Hadoop clusters. Applications must work with and request resources from the ResourceManager, but the actual computation can be determined by the user's needs, and not the MapReduce data flow. One such application is Apache Spark, which can be thought of as a memory-resident MapReduce. Typically, Hadoop MapReduce moves computation to the node where the data resides on disk, and writes intermediate results to the node disks. Spark bypasses these disk writes and reads, keeping everything in memory. In other words, Spark moves computation to where the data lives in memory, thereby creating very fast and scalable applications.
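The disk-versus-memory distinction can be illustrated with two tiny pipeline sketches. This is a deliberate simplification (real MapReduce spills and shuffles over local disk and HDFS; Spark builds a lazy DAG over RDDs), but it shows where the intermediate results live in each model:

```python
import json
import os
import tempfile

def disk_stage(data, fn):
    """MapReduce-style stage: results are written to disk, then read
    back by the next stage (a simplification of the real shuffle)."""
    tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    json.dump([fn(x) for x in data], tmp)
    tmp.close()
    with open(tmp.name) as f:
        out = json.load(f)
    os.remove(tmp.name)
    return out

def memory_stage(data, fn):
    """Spark-style stage: intermediate results stay in memory."""
    return [fn(x) for x in data]

nums = [1, 2, 3]
# Two chained stages: square each value, then add one.
print(disk_stage(disk_stage(nums, lambda x: x * x), lambda x: x + 1))
print(memory_stage(memory_stage(nums, lambda x: x * x), lambda x: x + 1))
# both → [2, 5, 10]
```

Both pipelines compute the same answer; the in-memory version simply avoids a disk round trip between every pair of stages, which is where Spark's speedup for iterative workloads comes from.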
New application frameworks are not limited to MapReduce-like applications. Other frameworks include applications like Apache Giraph for graph processing. There is even support for the Message Passing Interface (MPI) API within a YARN application.
Another aspect of the new Hadoop design is application agility. In version 1, any changes to the MapReduce process required an upgrade of the entire cluster. Even testing new MapReduce versions required a separate cluster. In version 2, multiple MapReduce versions can run at the same time. This agility also applies to other applications that are developed to run under YARN. New versions can be tested on the same cluster—and with the same data—that's used for the production version. Finally, YARN offers the user the ability to move away from Java, as YARN applications aren't required to be written in Java.
Now, Why You Should Care
Several important points about Apache Hadoop YARN are worth noting:
- Hadoop is no longer a single MapReduce application engine.
- Hadoop is a data-processing ecosystem that provides a framework for processing any type of data.
- The concept of a "data lake" is now possible, where practically unlimited amounts of raw unstructured data can be stored for later or real-time analysis.
- The ability of Hadoop to perform Extract, Transform, and Load (ETL) at runtime enables the developer to use a growing array of possible analysis engines, including MapReduce, graph processing, in-memory, custom, high-performance computing, and others.
The addition of YARN to the Hadoop ecosystem offers a flexible and powerful platform for data analysis and growth. Hadoop is no longer a "one-trick pony"; it now provides a very robust and open computing environment that can scale into the future.
You can learn more about Apache Hadoop YARN from Apache Hadoop YARN LiveLessons (Video Training) and Apache Hadoop YARN: Moving Beyond MapReduce and Batch Processing with Apache Hadoop 2.