You have numerous options for installing Hadoop and setting up Hadoop clusters. As Hadoop is a top-level Apache Software Foundation (ASF) open source project, one method is to install directly from the Apache builds on http://hadoop.apache.org/. To do this you first need one or more hosts, depending upon the mode you wish to use, with appropriate hardware specifications, an appropriate operating system, and a Java runtime environment available (all of the prerequisites and considerations discussed in the previous section).
Once you have this, it is simply a matter of downloading and unpacking the desired release. There may be some additional configuration to be done afterwards, but then you simply start the relevant services (master and slave node daemons) on their designated hosts and you are up and running.
Let’s deploy a Hadoop cluster using the latest Apache release now.
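The download-and-unpack procedure just described can be sketched as the following shell session. The release version (2.7.3) and the mirror URL are illustrative assumptions; check http://hadoop.apache.org/ for the current stable release, and adjust JAVA_HOME for your system:

```shell
# Download and unpack an Apache Hadoop release (version and mirror are
# examples; see http://hadoop.apache.org/ for the current stable release)
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
cd hadoop-2.7.3

# Hadoop needs to know where your Java runtime lives (path is an example)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$(pwd)
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

# Verify the installation in local (standalone) mode
hadoop version

# For pseudo-distributed or fully distributed modes, edit
# etc/hadoop/core-site.xml and etc/hadoop/hdfs-site.xml, format the
# NameNode, then start the daemons on their designated hosts:
hdfs namenode -format
start-dfs.sh    # starts the NameNode and DataNode daemons
start-yarn.sh   # starts the ResourceManager and NodeManager daemons
```

In standalone mode no daemons are required at all, which is why `hadoop version` works immediately after unpacking; the formatting and daemon startup steps apply only to the distributed modes.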
Using a Commercial Hadoop Distribution
As discussed in Hour 1, “Introducing Hadoop,” the commercial Hadoop landscape is well established. With the advent of the Open Data Platform initiative (ODPi), a once-numerous array of vendors and derivative distributions has consolidated into a much simpler landscape that includes three primary pure-play Hadoop vendors:
Cloudera
Hortonworks
MapR
Importantly, enterprise support agreements and subscriptions can be purchased from each of these vendors for their distributions. Each vendor also supplies a suite of management utilities to help you deploy and manage Hadoop clusters. Let’s look at each of the three major pure-play Hadoop vendors and their respective distributions.
Cloudera was the first mover in the commercial Hadoop space, establishing its first commercial release in 2008. Cloudera provides a Hadoop distribution called CDH (Cloudera’s Distribution including Apache Hadoop), which includes the Hadoop core and many ecosystem projects. CDH is entirely open source.
Cloudera also provides a management utility called Cloudera Manager (which is not open source). Cloudera Manager provides a management console and framework enabling the deployment, management, and monitoring of Hadoop clusters, and which makes many administrative tasks such as setting up high availability or security much easier. The mix of open source and proprietary software is often referred to as open core. A screenshot showing Cloudera Manager is pictured in Figure 3.2.
Figure 3.2 Cloudera Manager.
As mentioned, Cloudera Manager can be used to deploy Hadoop clusters, including master nodes, slave nodes, and ecosystem technologies. Cloudera Manager distributes installation packages for Hadoop components through a mechanism called parcels. Because Hadoop clusters are typically isolated from public networks, Cloudera Manager, which runs outside the cluster and usually does have Internet access, downloads parcels and distributes them to new hosts nominated for roles in the cluster, or to existing hosts when upgrading components.
Deploying a fully distributed CDH cluster using Cloudera Manager would involve the following steps at a high level:
Install Cloudera Manager on a host that has access to other hosts targeted for roles in the cluster.
Specify target hosts for the cluster using Cloudera Manager.
Use Cloudera Manager to provision Hadoop services, including master and slave nodes.
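The first of these steps can be sketched from the command line. The installer URL follows Cloudera's published download convention for Cloudera Manager 5 but may change between releases, so treat it as an assumption and get the current link from http://www.cloudera.com/downloads.html:

```shell
# Download and run the Cloudera Manager installer on the designated
# management host (URL is illustrative; confirm the current one at
# http://www.cloudera.com/downloads.html)
wget https://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin

# When the installer finishes, browse to the Cloudera Manager console
# (port 7180 by default) to specify target hosts and provision services:
#   http://<cm-host>:7180
```

The remaining steps, specifying target hosts and provisioning master and slave roles, are performed in the Cloudera Manager web console rather than at the command line.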
Cloudera also provides a Quickstart virtual machine, which is a pre-configured pseudo-distributed Hadoop cluster with the entire CDH stack, including core and ecosystem components, as well as a Cloudera Manager installation. This virtual machine is available for VirtualBox, VMware, and KVM, and works with the free editions of each of the virtualization platforms. The Cloudera Quickstart VM is pictured in Figure 3.3.
Figure 3.3 Cloudera Quickstart VM.
The Quickstart VM is a great way to get started with the Cloudera Hadoop offering. To find out more, go to http://www.cloudera.com/downloads.html.
More information about Cloudera is available at http://www.cloudera.com/.
Hortonworks provides a pure open source Hadoop distribution and is a founding member of the Open Data Platform initiative (ODPi) discussed in Hour 1. Hortonworks delivers a distribution called HDP (Hortonworks Data Platform), a complete Hadoop stack including the Hadoop core and selected ecosystem components. Hortonworks uses the Apache Ambari project to provide its deployment, configuration management, and cluster monitoring facilities. A screenshot of Ambari is shown in Figure 3.4.
Figure 3.4 Ambari console.
The simplest method to deploy a Hortonworks Hadoop cluster would involve the following steps:
Install Ambari using the Hortonworks installer on a selected host.
Add hosts to the cluster using Ambari.
Deploy Hadoop services (such as HDFS and YARN) using Ambari.
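The first step above can be sketched as follows for a RHEL/CentOS host. The repository URL is an assumption based on Hortonworks' published Ambari repositories and varies by Ambari version and operating system, so consult the Hortonworks documentation for the correct one:

```shell
# Install the Ambari server on a selected host (RHEL/CentOS 7 shown;
# the repo URL is illustrative and version-specific)
sudo wget -O /etc/yum.repos.d/ambari.repo \
  http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.4.2.0/ambari.repo
sudo yum install -y ambari-server

# Configure and start Ambari (setup prompts for JDK and database options)
sudo ambari-server setup
sudo ambari-server start

# Then browse to the Ambari console (port 8080 by default) to add hosts
# and deploy HDFS, YARN, and other services:
#   http://<ambari-host>:8080
```

As with Cloudera Manager, adding hosts and deploying services are carried out in the Ambari web console once the server is running.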
Hortonworks provides a fully functional, pseudo-distributed HDP cluster with the complete Hortonworks application stack in a virtual machine called the Hortonworks Sandbox. The Hortonworks Sandbox is available for VirtualBox, VMware, and KVM. The Sandbox virtual machine includes several demo applications and learning tools for exploring Hadoop and its various projects and components. The Hortonworks Sandbox welcome screen is shown in Figure 3.5.
Figure 3.5 Hortonworks Sandbox.
MapR delivers a “Hadoop-derived” software platform that implements MapRFS (the MapR distributed filesystem), a filesystem that is API-compatible with HDFS. MapRFS is designed to maximize performance and to provide full read-write capabilities not offered by native HDFS. MapR delivers three editions of its offering, called the “Converged Data Platform.” These include:
M3 or “Converged Community Edition” (free version)
M5 or “Converged Enterprise Edition” (supported version)
M7 (M5 version that includes MapR’s custom HBase-derived data store)
Like the other distributions, MapR has a demo offering called the “MapR Sandbox,” which is available for VirtualBox or VMware. It is pictured in Figure 3.6.
Figure 3.6 MapR Sandbox VM.
The MapR Sandbox can be downloaded from https://www.mapr.com/products/mapr-sandbox-hadoop/download.
MapR’s management offering is the MapR Control System (MCS), which provides a central console to configure, monitor, and manage MapR clusters. It is shown in Figure 3.7.
Figure 3.7 MapR Control System (MCS).
Much more information about MapR and the Converged Data Platform is available at https://www.mapr.com/.