
Big Data Analysis with MapReduce and Hadoop

As the amount of captured data increases over the years, so do our storage needs. Companies are realizing that “data is king,” but how do we analyze it? Through Hadoop. In this article, the first of a three-part series, Steven Haines presents an overview of Hadoop’s architecture and demonstrates, at a high level, how to build a MapReduce application.

In the evolution of data processing, we moved from flat files to relational databases and from relational databases to NoSQL databases. Essentially, as the amount of captured data increased, so did our needs, and traditional patterns no longer sufficed. The databases of old worked well with data that measured in megabytes and gigabytes, but now that companies realize “data is king,” the amount of captured data is measured in terabytes and petabytes. Even with NoSQL data stores, the question remains: How do we analyze that amount of data?

The most popular answer to this is: Hadoop. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. Hadoop is meant to run on large clusters of commodity machines, which can be machines in your data center that you’re not using or even Amazon EC2 images. The danger, of course, in running on commodity machines is how to handle failure. Hadoop is architected with the assumption that hardware will fail and as such, it can gracefully handle most failures. Furthermore, its architecture allows it to scale nearly linearly, so as processing capacity demands increase, the only constraint is the amount of budget you have to add more machines to your cluster.

This article presents an overview of Hadoop’s architecture to show how it can achieve these bold claims, and it demonstrates, at a high level, how to build a MapReduce application.

Hadoop Architecture

At a high level, Hadoop operates on the philosophy of pushing analysis code close to the data it is intended to analyze, rather than requiring code to read data across a network. As such, Hadoop provides its own file system, aptly named the Hadoop Distributed File System, or HDFS. When you upload your data to HDFS, Hadoop will partition your data across the cluster (keeping multiple copies of it in case your hardware fails), and then it can deploy your code to the machine that contains the data upon which it is intended to operate.

Like many NoSQL databases, HDFS organizes data by keys and values rather than relationally. In other words, each piece of data has a unique key and a value associated with that key. Relationships between keys, if they exist, are defined in the application, not by HDFS. And in practice, you’re going to have to think about your problem domain a bit differently in order to realize the full power of Hadoop (see the next section on MapReduce).

The components that make up Hadoop are:

  •   HDFS: The Hadoop Distributed File System is a distributed file system designed to hold huge amounts of data across multiple nodes in a cluster (where huge can be defined as files that are 100+ terabytes in size!). Hadoop provides both an API and a command-line interface for interacting with HDFS (see the sketch following this list).
  •   MapReduce Application: The next section reviews the details of MapReduce, but in short, MapReduce is a functional programming paradigm for analyzing individual records in your HDFS and assembling the results into a consumable solution. The Mapper is responsible for the data processing step, while the Reducer receives the output from the Mappers and groups together the data that applies to the same key.
  •   Partitioner: The partitioner is responsible for dividing a particular analysis problem into workable chunks of data for use by the various Mappers. The HashPartitioner is a partitioner that divides work up by “rows” of data in the HDFS, but you are free to create your own custom partitioner if you need to divide your data up differently.
  •   Combiner: If, for some reason, you want to perform a local reduce that combines data before sending it back to Hadoop, then you’ll need to create a combiner. A combiner performs the reduce step, which groups values together with their keys, but on a single node before returning the key/value pairs to Hadoop for proper reduction.
  •   InputFormat: Most of the time the default readers will work fine, but if your data is not formatted in a standard way, such as “key, value” or “key [tab] value”, then you will need to create a custom InputFormat implementation.
  •   OutputFormat: Your MapReduce applications will read data in some InputFormat and then write data out through an OutputFormat. Standard formats, such as “key [tab] value”, are supported out of the box, but if you want to do something else, then you need to create your own OutputFormat implementation.
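
As a quick illustration of the HDFS API mentioned above, here is a minimal sketch of copying a file into HDFS and reading it back through Hadoop’s FileSystem class. The file paths are made up for this example, and the Configuration object assumes that your cluster’s configuration files are on the classpath:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main( String[] args ) throws Exception {
    // Reads core-site.xml/hdfs-site.xml from the classpath to locate the NameNode
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get( conf );

    // Copy a local file into HDFS; roughly equivalent to the command line:
    //   hadoop fs -put /tmp/book.txt /user/hadoop/book.txt
    fs.copyFromLocalFile( new Path( "/tmp/book.txt" ), new Path( "/user/hadoop/book.txt" ) );

    // Open the file back up from HDFS and print its first line
    try( BufferedReader in = new BufferedReader(
        new InputStreamReader( fs.open( new Path( "/user/hadoop/book.txt" ) ) ) ) ) {
      System.out.println( in.readLine() );
    }
  }
}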

Additionally, Hadoop applications are deployed to an infrastructure that supports their high level of scalability and resilience. These components include:

  •   NameNode: The NameNode is the master of the HDFS that controls slave DataNode daemons; it understands where all of your data is stored, how the data is broken into blocks, what nodes those blocks are deployed to, and the overall health of the distributed filesystem. In short, it is the most important node in the entire Hadoop cluster. Each cluster has one NameNode, and the NameNode is a single point of failure in a Hadoop cluster.
  •   Secondary NameNode: The Secondary NameNode monitors the state of the HDFS cluster and takes “snapshots” of the data contained in the NameNode. If the NameNode fails, then the Secondary NameNode can be used in place of the NameNode. This does require human intervention, however, so there is no automatic failover from the NameNode to the Secondary NameNode, but having the Secondary NameNode will help ensure that data loss is minimal. Like the NameNode, each cluster has a single Secondary NameNode.
  •   DataNode: Each slave node in your Hadoop cluster will host a DataNode. The DataNode is responsible for performing data management: It reads and writes HDFS blocks on the physical node where it runs, manages the data stored on that node, and reports back to the NameNode with data management status.
  •   JobTracker: The JobTracker daemon is your liaison between your application and Hadoop itself. There is one JobTracker configured per Hadoop cluster and, when you submit your code to be executed on the Hadoop cluster, it is the JobTracker’s responsibility to build an execution plan. This execution plan includes determining the nodes that contain data to operate on, arranging nodes to correspond with data, monitoring running tasks, and relaunching tasks if they fail.
  •   TaskTracker: Similar to how data storage follows the master/slave architecture, code execution also follows the master/slave architecture. Each slave node will have a TaskTracker daemon that is responsible for executing the tasks sent to it by the JobTracker and communicating the status of the job (and a heartbeat) with the JobTracker.
Figure 1 tries to put all of these components together in one pretty crazy diagram.

Figure 1 Hadoop application and infrastructure interactions

Figure 1 shows the relationships between the master node and the slave nodes. The master node contains two important components: the NameNode, which manages the cluster and is in charge of all data, and the JobTracker, which manages the code to be executed and all of the TaskTracker daemons. Each slave node has both a TaskTracker daemon as well as a DataNode: the TaskTracker receives its instructions from the JobTracker and executes map and reduce processes, while the DataNode receives its data from the NameNode and manages the data contained on the slave node. And of course there is a Secondary NameNode listening to updates from the NameNode.

MapReduce

MapReduce is a functional programming paradigm that is well suited to handling parallel processing of huge data sets distributed across a large number of computers; in other words, MapReduce is the application paradigm supported by Hadoop and the infrastructure presented in this article. MapReduce, as its name implies, works in two steps:

1. Map: The map step essentially solves a small problem: Hadoop’s partitioner divides the problem into small workable subsets and assigns those to map processes to solve.
2. Reduce: The reducer combines the results of the mapping processes and forms the output of the MapReduce operation.

My Map definition purposely used the word “essentially” because one of the things that gives the Map step its name is its implementation. While it does solve small workable problems, the way that it does so is by mapping specific keys to specific values. For example, if we were to count the number of times each word appears in a book, our MapReduce application would output each word as a key and the number of times it is seen as the value. Or more specifically, the book would probably be broken up into sentences or paragraphs, and the Map step would return each word mapped either to the number of times it appears in the sentence or to “1” for each occurrence of every word, and then the reducer would combine the keys by adding their values together.

Listing 1 shows a Java/pseudocode example of how the map and reduce functions might work to solve this problem.

Listing 1 - Java/Pseudocode for MapReduce

public void map( String name, String sentence, OutputCollector output ) {
  // Emit each word in the sentence with a count of 1
  for( String word : sentence.split( " " ) ) {
    output.collect( word, 1 );
  }
}

public void reduce( String word, Iterator values, OutputCollector output ) {
  // Sum all of the counts emitted for this word and output the total
  int sum = 0;
  while( values.hasNext() ) {
    sum += values.next().get();
  }
  output.collect( word, sum );
}

Listing 1 does not contain code that actually works, but it does illustrate at a high level how such a task would be implemented in a handful of lines of code. Prior to submitting your job to Hadoop, you would first load your data into Hadoop. It would then distribute your data, in blocks, to the various slave nodes in its cluster. Then when you did submit your job to Hadoop, it would distribute your code to the slave nodes and have each map and reduce task process data on that slave node. Your map task would iterate over every word in the data block passed to it (assuming a sentence in this example) and output the word as the key and “1” as the value. The reduce task would then receive all instances of values mapped to a particular key; for example, it may have 1,000 values of “1” mapped to the word “apple”, which would mean that there are 1,000 occurrences of “apple” in the text. The reduce task sums up all of the values and outputs that as its result. Then your Hadoop job would be set up to handle all of the output from the various reduce tasks.
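
To make that flow a bit more concrete, here is a minimal sketch of the driver class that submits such a job, using Hadoop’s classic org.apache.hadoop.mapred API. The WordCountMapper and WordCountReducer classes are hypothetical implementations of Listing 1’s map and reduce functions, not classes that ship with Hadoop:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCount {
  public static void main( String[] args ) throws Exception {
    JobConf conf = new JobConf( WordCount.class );
    conf.setJobName( "wordcount" );

    // The key/value types that the job outputs: a word and its count
    conf.setOutputKeyClass( Text.class );
    conf.setOutputValueClass( IntWritable.class );

    // Hypothetical implementations of the map and reduce functions in Listing 1
    conf.setMapperClass( WordCountMapper.class );
    conf.setReducerClass( WordCountReducer.class );
    // A combiner (see the earlier list) could be added with conf.setCombinerClass(...)

    // Where to read the input text and write the word counts in HDFS
    FileInputFormat.setInputPaths( conf, new Path( args[0] ) );
    FileOutputFormat.setOutputPath( conf, new Path( args[1] ) );

    // Submit the job to the JobTracker and wait for it to complete
    JobClient.runJob( conf );
  }
}

When you run such a driver (for example, with the hadoop jar command), the JobTracker builds the execution plan described earlier and farms the map and reduce tasks out to the TaskTrackers on the slave nodes.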

This way of thinking is quite a bit different from how you might have approached the problem without using MapReduce, but it will become clearer in the next article on writing MapReduce applications, in which we build several working examples.

Summary

This article described what Hadoop is and presented an overview of its architecture. Hadoop is an open-source framework for developing and executing distributed applications that process very large amounts of data. It provides the infrastructure that distributes data across a multitude of machines in a cluster and that pushes analysis code to nodes closest to the data being analyzed. Your job is to write MapReduce applications that leverage this infrastructure to analyze your data.

The next article in this series, Building a MapReduce Application with Hadoop, will demonstrate how to set up a development environment and build MapReduce applications, which should give you a good feel for how this new paradigm works. And then the final installment in this series will walk you through setting up and managing a Hadoop production environment.
