Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS


  • Your Price: $39.99
  • List Price: $49.99



  • The comprehensive, up-to-date Apache Hadoop 2 administration handbook and reference
  • The only Hadoop 2 administration book written by a working Hadoop administrator!
  • Practical examples show how to perform key day-to-day administration tasks and rapidly troubleshoot Hadoop clusters
  • Demystifies complex Hadoop environments and management concepts, offering expert advice and best-practice recommendations


  • Copyright 2017
  • Dimensions: 7" x 9-1/8"
  • Pages: 848
  • Edition: 1st
  • Book
  • ISBN-10: 0-13-459719-2
  • ISBN-13: 978-0-13-459719-5

The Comprehensive, Up-to-Date Apache Hadoop Administration Handbook and Reference

“Sam Alapati has worked with production Hadoop clusters for six years. His unique depth of experience has enabled him to write the go-to resource for all administrators looking to spec, size, expand, and secure production Hadoop clusters of any size.”

–Paul Dix, Series Editor

In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.

Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.

  • Understand Hadoop’s architecture from an administrator’s standpoint
  • Create simple and fully distributed clusters
  • Run MapReduce and Spark applications in a Hadoop cluster
  • Manage and protect Hadoop data and high availability
  • Work with HDFS commands, file permissions, and storage management
  • Move data, and use YARN to allocate resources and schedule jobs
  • Manage job workflows with Oozie and Hue
  • Secure, monitor, log, and optimize Hadoop
  • Benchmark and troubleshoot Hadoop

Sample Content

Online Sample Chapter

HDFS Commands, HDFS Permissions and HDFS Storage

Sample Pages

Download the sample pages (includes Chapter 9 and Index).

Table of Contents

Foreword xxvii

Preface xxix

Acknowledgments xxxv

About the Author xxxvii

Part I: Introduction to Hadoop—Architecture and Hadoop Clusters 1

Chapter 1: Introduction to Hadoop and Its Environment 3

Hadoop—An Introduction 4

Cluster Computing and Hadoop Clusters 12

Hadoop Components and the Hadoop Ecosphere 15

What Do Hadoop Administrators Do? 18

Key Differences between Hadoop 1 and Hadoop 2 21

Distributed Data Processing: MapReduce and Spark, Hive and Pig 24

Data Integration: Apache Sqoop, Apache Flume and Apache Kafka 27

Key Areas of Hadoop Administration 28

Summary 31

Chapter 2: An Introduction to the Architecture of Hadoop 33

Distributed Computing and Hadoop 33

Hadoop Architecture 34

Data Storage—The Hadoop Distributed File System 37

Data Processing with YARN, the Hadoop Operating System 48

Summary 57

Chapter 3: Creating and Configuring a Simple Hadoop Cluster 59

Hadoop Distributions and Installation Types 60

Setting Up a Pseudo-Distributed Hadoop Cluster 62

Performing the Initial Hadoop Configuration 71

Operating the New Hadoop Cluster 86

Summary 90

Chapter 4: Planning for and Creating a Fully Distributed Cluster 91

Planning Your Hadoop Cluster 92

Going from a Single Rack to Multiple Racks 95

Creating a Multinode Cluster 102

Modifying the Hadoop Configuration 106

Starting Up the Cluster 114

Configuring Hadoop Services, Web Interfaces and Ports 119

Summary 126

Part II: Hadoop Application Frameworks 127

Chapter 5: Running Applications in a Cluster—The MapReduce Framework (and Hive and Pig) 129

The MapReduce Framework 129

Apache Hive 141

Apache Pig 144

Summary 145

Chapter 6: Running Applications in a Cluster—The Spark Framework 147

What Is Spark? 148

Why Spark? 149

The Spark Stack 153

Installing Spark 155

Spark Run Modes 158

Understanding the Cluster Managers 159

Spark and Data Access 164

Summary 167

Chapter 7: Running Spark Applications 169

The Spark Programming Model 169

Spark Applications 173

Architecture of a Spark Application 179

Running Spark Applications Interactively 181

Creating and Submitting Spark Applications 185

Configuring Spark Applications 192

Monitoring Spark Applications 194

Handling Streaming Data with Spark Streaming 194

Using Spark SQL for Handling Structured Data 198

Summary 201

Part III: Managing and Protecting Hadoop Data and High Availability 203

Chapter 8: The Role of the NameNode and How HDFS Works 205

HDFS—The Interaction between the NameNode and the DataNodes 205

Rack Awareness and Topology 209

HDFS Data Replication 212

How Clients Read and Write HDFS Data 218

Understanding HDFS Recovery Processes 224

Centralized Cache Management in HDFS 227

Hadoop Archival Storage, SSD and Memory (Heterogeneous Storage) 232

Summary 241

Chapter 9: HDFS Commands, HDFS Permissions and HDFS Storage 243

Managing HDFS through the HDFS Shell Commands 243

Using the dfsadmin Utility to Perform HDFS Operations 251

Managing HDFS Permissions and Users 255

Managing HDFS Storage 260

Rebalancing HDFS Data 267

Reclaiming HDFS Space 274

Summary 276

Chapter 10: Data Protection, File Formats and Accessing HDFS 277

Safeguarding Data 278

Data Compression 289

Hadoop File Formats 295

Using Hadoop WebHDFS and HttpFS 308

Summary 315

Chapter 11: NameNode Operations, High Availability and Federation 317

Understanding NameNode Operations 318

The Checkpointing Process 323

NameNode Safe Mode Operations 329

Configuring HDFS High Availability 334

HDFS Federation 349

Summary 351

Part IV: Moving Data, Allocating Resources, Scheduling Jobs and Security 353

Chapter 12: Moving Data Into and Out of Hadoop 355

Introduction to Hadoop Data Transfer Tools 355

Loading Data into HDFS from the Command Line 356

Copying HDFS Data between Clusters with DistCp 361

Ingesting Data from Relational Databases with Sqoop 365

Ingesting Data from External Sources with Flume 388

Ingesting Data with Kafka 398

Summary 406

Chapter 13: Resource Allocation in a Hadoop Cluster 407

Resource Allocation in Hadoop 407

The FIFO Scheduler 410

The Capacity Scheduler 411

The Fair Scheduler 426

Comparing the Capacity Scheduler and the Fair Scheduler 435

Summary 436

Chapter 14: Working with Oozie to Manage Job Workflows 437

Using Apache Oozie to Schedule Jobs 437

Oozie Architecture 439

Deploying Oozie in Your Cluster 441

Understanding Oozie Workflows 446

How Oozie Runs an Action 449

Creating an Oozie Workflow 454

Running an Oozie Workflow Job 461

Oozie Coordinators 464

Managing and Administering Oozie 470

Summary 475

Chapter 15: Securing Hadoop 477

Hadoop Security—An Overview 478

Hadoop Authentication with Kerberos 481

Hadoop Authorization 505

Auditing Hadoop 518

Securing Hadoop Data 520

Other Hadoop-Related Security Initiatives 524

Summary 525

Part V: Monitoring, Optimization and Troubleshooting 527

Chapter 16: Managing Jobs, Using Hue and Performing Routine Tasks 529

Using the YARN Commands to Manage Hadoop Jobs 530

Decommissioning and Recommissioning Nodes 535

ResourceManager High Availability 541

Performing Common Management Tasks 545

Managing the MySQL Database 548

Backing Up Important Cluster Data 551

Using Hue to Administer Your Cluster 553

Implementing Specialized HDFS Features 562

Summary 567

Chapter 17: Monitoring, Metrics and Hadoop Logging 569

Monitoring Linux Servers 570

Hadoop Metrics 576

Using Ganglia for Monitoring 579

Understanding Hadoop Logging 582

Using Hadoop’s Web UIs for Monitoring 599

Monitoring Other Hadoop Components 609

Summary 610

Chapter 18: Tuning the Cluster Resources, Optimizing MapReduce Jobs and Benchmarking 611

How to Allocate YARN Memory and CPU 612

Configuring Efficient Performance 621

Tuning Map and Reduce Tasks—What the Administrator Can Do 625

Optimizing Pig and Hive Jobs 635

Benchmarking Your Cluster 638

Hadoop Counters 647

Optimizing MapReduce 652

Summary 658

Chapter 19: Configuring and Tuning Apache Spark on YARN 659

Configuring Resource Allocation for Spark on YARN 659

Dynamic Resource Allocation when Running Spark on YARN 676

Storage Formats and Compressing Data 678

Monitoring Spark Applications 681

Tuning Garbage Collection 686

Tuning Spark Streaming Applications 688

Summary 689

Chapter 20: Optimizing Spark Applications 691

Revisiting the Spark Execution Model 692

Shuffle Operations and How to Minimize Them 694

Partitioning and Parallelism (Number of Tasks) 703

Optimizing Data Serialization and Compression 710

Understanding Spark’s SQL Query Optimizer 712

Caching Data 717

Summary 723

Chapter 21: Troubleshooting Hadoop—A Sampler 725

Space-Related Issues 725

Handling YARN Jobs That Are Stuck 731

JVM Memory-Allocation and Garbage-Collection Strategies 732

Handling Different Types of Failures 737

Troubleshooting Spark Jobs 739

Debugging Spark Applications 740

Summary 742

Chapter 22: Installing VirtualBox and Linux and Cloning the Virtual Machines 743

Installing Oracle VirtualBox 744

Installing Oracle Enterprise Linux 745

Cloning the Linux Server 745

Index 747

