
Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture




  • Sorry, this book is no longer in print.
Not for Sale

eBook (Watermarked)

  • Your Price: $38.39
  • List Price: $47.99
  • Includes EPUB and PDF
  • This eBook includes the following formats, accessible from your Account page after purchase:

    EPUB: The open industry format known for its reflowable content and usability on supported mobile devices.

    PDF: The popular standard, used most often with the free Acrobat® Reader® software.

    This eBook requires no passwords or activation to read. We customize your eBook by discreetly watermarking it with your name, making it uniquely yours.



  • Shows how data works and moves in Hadoop clusters, and how to integrate Hadoop into enterprise data architecture
  • Gives "golden image templates" for deploying Hadoop smoothly, quickly, and consistently
  • Teaches how to avoid pitfalls and mitigate risks associated with virtualizing Hadoop
  • By pioneering Big Data and virtualization experts George Trujillo (Hortonworks), Charles Kim (Oracle ACE Director, VMware vExpert), and Steven Jones (VMware)


  • Copyright 2016
  • Pages: 480
  • Edition: 1st
  • Book
  • ISBN-10: 0-13-381102-6
  • ISBN-13: 978-0-13-381102-5

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility

Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.

First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.

Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.

Coverage includes the following:

  • Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop
  • Understanding YARN resource management, HDFS storage, and I/O
  • Designing data ingestion, movement, and organization for modern enterprise data platforms
  • Defining SQL engine strategies to meet strict SLAs
  • Considering security, data isolation, and scheduling for multitenant environments
  • Deploying Hadoop as a service in the cloud
  • Reviewing the essential concepts, capabilities, and terminology of virtualization
  • Applying current best practices, guidelines, and key metrics for Hadoop virtualization
  • Managing multiple Hadoop frameworks and products as one unified system
  • Virtualizing master and worker nodes to maximize availability and performance
  • Installing and configuring Linux for a Hadoop environment

Sample Content

Online Sample Chapter

Understanding the Big Data World

Table of Contents

Foreword

Preface

Part I: Introduction to Hadoop

Chapter 1 Understanding the Big Data World

The Data Revolution

Traditional Data Systems

    Semi-Structured and Unstructured Data

    Causation and Correlation

    Data Challenges

The Modern Data Architecture

Organizational Transformations

Industry Transformation

Summary

Chapter 2 Hadoop Fundamental Concepts

Types of Data in Hadoop

Use Cases

What Is Hadoop?

Hadoop Distributions

Hadoop Frameworks

NoSQL Databases

    What Is NoSQL?

A Hadoop Cluster

Hadoop Software Processes

    Hadoop Hardware Profiles

Roles in the Hadoop Environment

Summary

Chapter 3 YARN and HDFS

A Hadoop Cluster Is Distributed

Hadoop Directory Layouts

    Hadoop Operating System Users

The Hadoop Distributed File System

    YARN Logging

    The NameNode

    The DataNode

    Block Placement

    NameNode Configurations and Managing Metadata

Rack Awareness

    Block Management

    The Balancer

    Maintaining Data Integrity in the Cluster

Quotas and Trash

YARN and the YARN Processing Model

    Running Applications on YARN

    Resource Schedulers

    Benchmarking

    TeraSort Benchmarking Suite

Summary

Chapter 4 The Modern Data Platform

Designing a Hadoop Cluster

    Enterprise Data Movement

Summary

Chapter 5 Data Ingestion

Extraction, Loading, and Transformation (ELT)

    Sqoop: Data Movement with SQL Sources

    Flume: Streaming Data

    Oozie: Scheduling and Workflow

    Falcon: Data Lifecycle Management

    Kafka: Real-time Data Streaming

Summary

Chapter 6 Hadoop SQL Engines

Where SQL Was Born

SQL in Hadoop

Hadoop SQL Engines

    Selecting the SQL Tool For Hadoop

Now Getting Groovy with Hive and Pig

    Hive

    HCatalog

    Pig

Summary

Chapter 7 Multitenancy in Hadoop

Securing the Access

    Authentication

    Auditing

    Authorization

    Data Protection

    Isolating the Data

    Isolating the Process

Summary

Part II: Introduction to Virtualization

Chapter 8 Virtualization Fundamentals

Why Virtualize Hadoop?

    Introduction to Virtualization

Summary

References

Chapter 9 Best Practices for Virtualizing Hadoop

Running Virtualized Hadoop with Purpose and Discipline

    The Discipline of Purpose Starts with a Clear Target

    Virtualizing Different Tiers of Hadoop

    Industry Best Practices

Summary

Part III: Virtualizing Hadoop

Chapter 10 Virtualizing Hadoop

How Are Hadoop Ecosystems Going to Be Managed?

    Building an Enterprise Hadoop Platform That Is Agile and Flexible

    Clarification of Terms

    The Journey from Bare-Metal to Virtualization

Why Consider Virtualizing Hadoop?

    Benefits of Virtualizing Hadoop

    Virtualized Hadoop Can Run as Fast or Faster Than Native

    Coordination and Cross-Purpose Specialization Is the Future

    Barriers Can Be Organizational

    Virtualization Is Not an All or Nothing Option

    Rapid Provisioning and Improving Quality of Development and Test Environments

    Improve High Availability with Virtualization

    Use Virtualization to Leverage Hadoop Workloads

    Hadoop in the Cloud

    Big Data Extensions

    The Path to Virtualization

    The Software-Defined Data Center

    Virtualizing the Network

    vRealize Suite

Summary

References

Chapter 11 Virtualizing Hadoop Master Servers

Virtualizing Servers in a Hadoop Cluster

    Virtualizing the Environment Around Hadoop

    Virtualizing the Master Hadoop Servers

    Virtualizing Without the SAN

Summary

Chapter 12 Virtualizing the Hadoop Worker Nodes

A Brief Introduction to the Worker Nodes in Hadoop

Deployment Models for Hadoop Clusters

    The Combined Model

    The Separated Model

    Network Effects of the Data-Compute Separation

    The Shared-Storage Approach to the Data-Compute Separated Model

    Local Disks for the Application’s Temporary Data

    The Shared Storage Architecture Model Using Network-Attached Storage (NAS)

    Deployment Model Summary

Best Practices for Virtualizing Hadoop Workers

    Disk I/O

The Hadoop Virtualization Extensions (HVE)

Summary

References

Resources

Chapter 13 Deploying Hadoop as a Service in the Private Cloud

The Cloud Context

    Stakeholders for Hadoop

    Overview of the Solution Architecture

Summary

References

Chapter 14 Understanding the Installation of Hadoop

Map the Right Solutions to the Right Use Case

    Thoughts About Installing Hadoop

Configuring Repositories

    Installing HDP 2.2

    Environment Preparation

Setting Up the Hadoop Configuration

Starting HDFS and YARN

    Start YARN

    Verifying MapReduce Functionality

Installing and Configuring Hive

Installing and Configuring MySQL Database

Installing and Configuring Hive and HCatalog

Summary

Chapter 15 Configuring Linux for Hadoop

Supported Linux Platforms

Different Deployment Models

Linux Golden Templates

    Building a Linux Enterprise Hadoop Platform

    Selecting the Linux Distribution

Optimal Linux Kernel Parameters and System Settings

    epoll

    Disable Swap Space

    Disable Security During Install

    IO Scheduler Tuning

    Check Transparent Huge Pages Configuration

    Limits.conf

    Partition Alignment for RDMs

    File System Considerations

    Lazy Count Parameter for XFS

    Mount Options

    I/O Scheduler

    Disk Read and Write Options

    Storage Benchmarking

    Java Version

    Set Up NTP

    Enable Jumbo Frames

    Additional Network Considerations

Summary

Appendix A Hadoop Cluster Creation: A Prerequisite Checklist

Appendix B Big Data/Hadoop on VMware vSphere Reference Materials

Deployment Guides

Reference Architectures

Customer Case Studies

Performance

vSphere Big Data Extensions (BDE)

Other vSphere Features and Big Data




