Deploying Hadoop in the Cloud
The rise and proliferation of cloud computing and virtualization technologies have changed the way organizations think about and deploy technology, and Hadoop is no exception. The availability and maturity of IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS (Software-as-a-Service) solutions make deploying Hadoop in the cloud not only viable but, in some cases, desirable.
Many public cloud providers can be used to deploy Hadoop, including Google, IBM, Rackspace, and others. Perhaps the most pervasive cloud platform to date has been AWS (Amazon Web Services), which I will use as the basis for our discussions.
Before you learn about deployment options for Hadoop in AWS, let’s go through a quick primer and background on some of the key AWS components. If you are familiar with AWS, feel free to jump straight to the Try it Yourself exercise on deploying Hadoop using AWS EMR.
Elastic Compute Cloud (EC2)
EC2 is Amazon’s web service-enabled virtual computing platform. EC2 enables users to create virtual servers and networks in the cloud. The virtual servers are called instances. EC2 instances can be created with a variety of different instance permutations. The Instance Type property determines the number of virtual CPUs and the amount of memory and storage an EC2 instance has available to it. An example instance type is m4.large. A complete list of the different EC2 instance types available can be found at https://aws.amazon.com/ec2/instance-types/.
EC2 instances can be optimized for compute, memory, storage and mixed purposes and can even include GPUs (Graphics Processing Units), a popular option for machine learning and deep analytics.
There are numerous options for operating systems with EC2 instances. All of the popular Linux distributions are supported, including Red Hat, Ubuntu, and SLES, as well as various Microsoft Windows options.
EC2 instances are created in security groups. Security groups govern network permissions and Access Control Lists (ACLs). Instances can also be created in a Virtual Private Cloud (VPC). A VPC is a private network, not exposed directly to the Internet. This is a popular option for organizations looking to minimize exposure of EC2 instances to the public Internet.
EC2 instances can be provisioned with various storage options, including instance storage or ephemeral storage, which are terms for volatile storage that is lost when an instance is stopped, and Elastic Block Store (EBS), which is persistent, fault-tolerant storage. There are different options with each, such as SSD (solid state) for instance storage, or provisioned IOPS with EBS.
Additionally, AWS offers Spot instances, which enable you to bid on spare Amazon EC2 computing capacity, often available at a discount compared to normal on-demand EC2 instance pricing.
EC2, like all other AWS services, is located in an AWS region. At the time of writing there are nine regions, which include the following:
US East (N. Virginia)
US West (Oregon)
US West (N. California)
Asia Pacific (Singapore)
Asia Pacific (Sydney)
Asia Pacific (Tokyo)
South America (Sao Paulo)
Simple Storage Service (S3)
Simple Storage Service (S3) is Amazon’s cloud-based object store. An object store manages data (such as files) as objects. These objects exist in buckets. Buckets are logical, user-created containers with properties and permissions. S3 provides APIs for users to create and manage buckets as well as to create, read, and delete objects from buckets.
The S3 bucket namespace is global, meaning that any buckets created must have a globally unique name. The AWS console or APIs will let you know if you are trying to create a bucket with a name that already exists.
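Because bucket names are globally unique, a bucket name plus an object name (its key) is enough to identify any object in S3. Tools in the Hadoop ecosystem commonly write this pair as a URI of the form s3://bucket/key. The following is a minimal sketch of splitting such a URI, using only the Python standard library; the function name split_s3_uri and the bucket and key shown are hypothetical examples, not part of any AWS API.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into a (bucket, key) tuple.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the authority part of the URI; the key is the path
    # with its single leading slash removed.
    return parsed.netloc, parsed.path[1:]


# Example bucket and key are made up for this sketch.
bucket, key = split_s3_uri("s3://example-bucket/logs/2016/01/access.log")
print(bucket)
print(key)
```

Note that the "directories" in the key are only a naming convention. S3 stores the whole key as a single object name within the bucket.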
S3 objects, like files in HDFS, are immutable, meaning they are write once, read many. When an S3 object is created and uploaded, an ETag is created, which is effectively a signature for the object. This can be used to ensure integrity when the object is accessed (downloaded) in the future.
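As a rough illustration of such an integrity check, the sketch below compares the MD5 digest of some local data with an ETag value. It assumes the object was uploaded in a single part without KMS-based encryption, in which case the ETag is the hexadecimal MD5 digest of the object's contents; ETags for multipart uploads have a different form and cannot be checked this way. The helper names and the sample payload are hypothetical, and the ETag here is computed locally as a stand-in for the value S3 would return.

```python
import hashlib
import io


def md5_hex(stream, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_etag(stream, etag):
    """Compare a stream's MD5 digest with an ETag.

    ETags are usually returned wrapped in double quotes, so those are stripped.
    Assumes a single-part upload (see the note above).
    """
    return md5_hex(stream) == etag.strip('"').lower()


payload = b"example object contents"
# Stand-in for the ETag that S3 would return for this payload.
etag = '"' + hashlib.md5(payload).hexdigest() + '"'

print(matches_etag(io.BytesIO(payload), etag))
print(matches_etag(io.BytesIO(b"tampered contents"), etag))
```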
There are also public buckets in S3 containing public datasets. These datasets are provided for informational or educational purposes, but they can also be used for data operations such as processing with Hadoop. Many of them are in the tens or hundreds of terabytes, and topics range from historical weather data to census data, and from astronomical data to genetic data.
More information about S3 is available at https://aws.amazon.com/s3/.
Elastic MapReduce (EMR)
Elastic MapReduce (EMR) is Amazon’s Hadoop-as-a-Service platform. EMR clusters can be provisioned using the AWS Management Console or via the AWS APIs. Options for creating EMR clusters include number of nodes, node instance types, Hadoop distribution, and additional ecosystem applications to install.
EMR clusters can read input data directly from S3 and write results directly to S3. They are intended to be provisioned on demand, run a discrete workflow (a job flow), and then terminate. They do have local storage, but they are not intended to run in perpetuity, so you should use this local storage only for transient data.
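To make the idea of a transient cluster more concrete, the following sketch builds the kind of request that could be passed to the run_job_flow operation of the EMR API (for example, through the boto3 SDK as boto3.client("emr").run_job_flow(**request)). The sketch only constructs and prints the request; it does not call AWS. The cluster name, log location, release label, instance type, and node count are placeholder assumptions, and the role names are the defaults that AWS typically creates, so substitute values that suit your own account.

```python
import json


def build_transient_cluster_request(name, log_uri, node_count=3,
                                    instance_type="m4.large"):
    """Build a request dictionary for a cluster that terminates when its steps finish.

    Hypothetical helper; all default values are placeholders for illustration.
    """
    return {
        "Name": name,
        "LogUri": log_uri,                  # S3 location for cluster logs
        "ReleaseLabel": "emr-5.0.0",        # placeholder release label
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": node_count,
            # False asks EMR to shut the cluster down once no steps remain,
            # which matches the provision, run, terminate pattern.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


request = build_transient_cluster_request(
    "example-cluster", "s3://example-bucket/emr-logs/")
print(json.dumps(request, indent=2))
```

Keeping input, output, and logs in S3 rather than on the cluster is what makes it safe for the cluster itself to be short-lived.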
EMR is a quick and scalable deployment method for Hadoop. More information about EMR can be found at https://aws.amazon.com/elasticmapreduce/.
AWS Pricing and Getting Started
AWS products, including EC2, S3, and EMR, are charged based on usage. Each EC2 instance type within each region has a per-hour cost associated with it. The hourly costs are usually relatively low, and the medium- to long-term costs are quite reasonable, but the more resources you use and the longer you use them, the more you are charged.
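A back-of-the-envelope estimate of instance charges is simply the number of instances multiplied by the hourly rate and the number of hours. The sketch below illustrates that arithmetic only. The hourly rate is a made-up figure, not an actual AWS price, and the estimate ignores storage, data transfer, and any service-specific charges.

```python
from decimal import Decimal


def estimate_instance_cost(instance_count, hourly_rate, hours):
    """Estimate instance charges as count x hourly rate x hours.

    Uses Decimal so that currency amounts are not distorted by binary
    floating-point rounding.
    """
    return Decimal(instance_count) * Decimal(hourly_rate) * Decimal(hours)


# Hypothetical figures: a 4-node cluster at $0.10 per instance-hour for 3 hours.
cost = estimate_instance_cost(4, Decimal("0.10"), 3)
print(f"Estimated cost: ${cost}")
```

The same calculation makes it easy to see why a cluster left running for a month costs far more than one that is terminated after each job.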
If you have not already signed up with AWS, you’re in luck! AWS has a free tier available for new accounts that enables you to use certain instance types and services for free for the first year. You can find out more at https://aws.amazon.com/free/. This page walks you through setting up an account with no ongoing obligations.
Once you are up and running with AWS, you can create an EMR cluster by navigating to the EMR link in the AWS console as shown in Figure 3.8.
Figure 3.8 AWS console—EMR option.
Then click Create Cluster on the EMR welcome page as shown in Figure 3.9, and simply follow the dialog prompts.
Figure 3.9 AWS EMR welcome screen.
You can use an EMR cluster for many of our exercises. However, be aware that leaving the cluster up and running will incur usage charges.