Learn how to install and deploy Hadoop using commercial distributions and Amazon Web Services (AWS)
Now that you have been introduced to Hadoop and learned about its core components, HDFS and YARN, their related processes, and the different deployment modes for Hadoop, let’s look at the options for getting a functioning Hadoop cluster up and running. By the end of this hour you will have set up a working Hadoop cluster that we will use throughout the remainder of the book.
Installation Platforms and Prerequisites
Before you install Hadoop there are a few installation requirements, prerequisites, and recommendations of which you should be aware.
Operating System Requirements
The vast majority of Hadoop implementations are deployed on Linux hosts, for a number of reasons:
The Hadoop project, although cross-platform in principle, was originally targeted at Linux. It was several years after the initial release that a Windows-compatible distribution was introduced.
Many of the commercial vendors only support Linux.
Many other projects in the open source and Hadoop ecosystem have compatibility issues with non-Linux platforms.
That said, there are options for installing Hadoop on Windows, should that be your platform of choice. We will use Linux for all of the exercises and examples in this book, but consult the documentation for your preferred Hadoop distribution for Windows installation and support information if required.
If you are using Linux, choose a distribution you are comfortable with. All major distributions are supported (Red Hat, CentOS, Ubuntu, SLES, and so on). You can even mix distributions where appropriate; for instance, master nodes running Red Hat and slave nodes running Ubuntu.
Hardware Requirements
Although there are no hard-and-fast requirements, there are some general heuristics used to size instances, or hosts, appropriately for roles within a Hadoop cluster. First, you need to distinguish between master and slave node instances and their respective requirements.
A Hadoop cluster relies on its master nodes, which host the NameNode and ResourceManager, to operate, although you can implement high availability for each subsystem, as I discussed in the last hour. Failure or failover of these components is undesirable. Furthermore, the master node processes, particularly the NameNode, require a large amount of memory to operate efficiently, as you will appreciate in the next hour when we dive into the internals of HDFS. Therefore, the following guidelines can be used when specifying master node hardware for medium- to large-scale production Hadoop implementations:
16 or more CPU cores (preferably 32)
128GB or more RAM (preferably 256GB)
RAID Hard Drive Configuration (preferably with hot-swappable drives)
Redundant power supplies
Bonded Gigabit Ethernet or 10 Gigabit Ethernet
This is only a guide, and as technology moves on quickly, these recommendations will change as well. The bottom line is that master nodes need carrier-class hardware with as much CPU and memory capacity as you can get!
Slave nodes do the actual work in Hadoop, both processing and storage, so they benefit from more CPU and memory (physical memory, not virtual memory). That said, slave nodes are designed with the expectation of failure, which is one of the reasons blocks are replicated in HDFS. Slave nodes also scale out linearly: you can simply add more nodes to add aggregate storage or processing capacity to the cluster, which you cannot do with master nodes. With this in mind, economical scalability is the objective when it comes to slave nodes. The following is a guide for slave nodes in a well-subscribed, computationally intensive Hadoop cluster; for instance, a cluster hosting machine learning and in-memory processing using Spark:
16-32 CPU cores
64-512 GB of RAM
12-24 hard disks of 1-4TB each, in a JBOD (Just a Bunch Of Disks) configuration
Slave nodes are designed to be deployed on commodity-class hardware. Although they still need ample processing power in the form of CPU cores and memory, because they execute computational and data transformation tasks, they do not require the same degree of fault tolerance that master nodes do.
Networking Considerations
Fully distributed Hadoop clusters are very chatty. Control messages, status updates and heartbeats, block reports, data shuffling, and block replication often produce heavy network utilization between nodes of the cluster. If you are deploying Hadoop on-premise, you should always deploy Hadoop clusters on private subnets with dedicated switches. If you are using multiple racks for your Hadoop cluster (you will learn more about this in Hour 21, “Understanding Advanced HDFS”), you should consider redundant core and top-of-rack switches.
Hostname resolution is essential between nodes of a Hadoop cluster, so both forward and reverse DNS lookups must work correctly between each node (master-slave and slave-slave) for Hadoop to function. Either DNS or hosts files can be used for resolution. IPv6 should also be disabled on all hosts in the cluster.
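As a quick sanity check before installing Hadoop, the following sketch verifies forward and reverse resolution for a single node using getent, which consults both DNS and the hosts file according to the host's resolver configuration. Here localhost stands in for a real cluster hostname such as a master node; substitute your own node names.

```shell
#!/bin/sh
# Sanity-check name resolution for one cluster node.
# NODE is a placeholder -- substitute each of your cluster hostnames in turn.
NODE=localhost

# Forward lookup (getent consults /etc/hosts and DNS, per nsswitch.conf)
IP=$(getent ahostsv4 "$NODE" | awk 'NR==1 {print $1}')
echo "forward: $NODE -> $IP"

# Reverse lookup: the IP should map back to the same node
getent hosts "$IP"
```

Run a check like this on every host, against every other host, to confirm that master-slave and slave-slave resolution both work in each direction.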
Time synchronization between nodes of the cluster is essential as well, as some components, such as Kerberos (discussed in Hour 22, “Securing Hadoop”), rely on it. It is recommended that you use NTP (Network Time Protocol) to keep clocks synchronized between all nodes.
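As an illustration, a minimal /etc/ntp.conf for cluster nodes might look like the following sketch. The pool server names are placeholders; many sites instead point every cluster node at an internal time server so all nodes drift together rather than independently.

```
# /etc/ntp.conf (sketch) -- keep all cluster nodes synchronized
# Public pool servers shown as placeholders; substitute your site's
# internal NTP servers if you have them.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# Fall back to the local clock only if upstream servers are unreachable
server 127.127.1.0
fudge  127.127.1.0 stratum 10

driftfile /var/lib/ntp/drift
```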
Software Requirements
As discussed, Hadoop is almost entirely written in Java and compiled to run in a Java Runtime Environment (JRE); therefore, Java is a prerequisite for installing Hadoop. Current prerequisites include:
Java Runtime Environment (JRE) 1.7 or above
Java Development Kit (JDK) 1.7 or above—required if you will be compiling Java classes such as MapReduce applications
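To confirm that a host meets the Java prerequisite, you can compare the version reported by java -version against the 1.7 minimum. The following sketch shows the comparison logic as a small shell function; the version strings at the bottom are sample inputs, and in practice you would feed it the first line of java -version output on each host.

```shell
#!/bin/sh
# Sketch: check that a Java version string meets Hadoop's 1.7 minimum.
# Handles both legacy "1.x" strings (e.g. 1.8.0_181) and newer "9+" strings.
java_ok() {
  ver=$(echo "$1" | cut -d. -f1,2)   # e.g. "1.8.0_181" -> "1.8"
  major=${ver%%.*}
  minor=${ver##*.}
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 7 ]; }
}

# Sample checks; on a real host, extract the string from `java -version 2>&1`
java_ok "1.8.0_181" && echo "1.8.0_181: OK"
java_ok "1.6.0_45"  || echo "1.6.0_45: too old"
```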
Other ecosystem projects have their own specific prerequisites; for instance, Apache Spark requires Scala and Python as well, so you should always refer to the documentation for those specific projects.