
Hadoop and Spark Fundamentals LiveLessons

By Douglas Eadline
Published Feb 22, 2018 by Addison-Wesley Professional. Part of the LiveLessons series.

Online Video

Sorry, this book is no longer in print.
About this video

Video accessible from your Account page after purchase.

Not for Sale

Description

Copyright 2018
Edition: 1st

ISBN-10: 0-13-477086-2
ISBN-13: 978-0-13-477086-4

9+ Hours of Video Instruction

The perfect (and fast) way to get started with Hadoop and Spark

Hadoop and Spark Fundamentals LiveLessons provides 9+ hours of video introduction to the Apache Hadoop Big Data ecosystem. The tutorial includes background information and explains the core components of Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, the YARN resource manager, and YARN Frameworks. In addition, it demonstrates how to use Hadoop at several levels, including the native Java interface, C++ pipes, and the universal streaming program interface. Examples include how to use benchmarks and high-level tools, including the Apache Pig scripting language, the Apache Hive "SQL-like" interface, Apache Flume for streaming input, Apache Sqoop for import and export of relational data, and Apache Oozie for Hadoop workflow management. In addition, there is comprehensive coverage of Spark, PySpark, and the Zeppelin web-GUI. The steps for easily installing a working Hadoop/Spark system on a desktop/laptop and on a local stand-alone cluster using the powerful Ambari GUI are also included. All software used in these LiveLessons is open source and freely available for your use and experimentation. A bonus lesson includes a quick primer on the Linux command line as used with Hadoop and Spark.

Skill Level

Beginner
Intermediate

Learn How To

Understand Hadoop design and key components
Understand how the MapReduce process works in Hadoop
Understand the relationship of Spark and Hadoop
Understand key aspects of the new YARN design and Frameworks
Use, administer, and program HDFS
Run and administer Hadoop/Spark programs
Write basic MapReduce/Spark programs
Install Hadoop/Spark on a laptop/desktop
Run Apache Pig, Hive, Flume, Sqoop, Oozie, Spark applications
Perform basic data ingest with Hive and Spark
Use the Zeppelin web-GUI for Spark/Hive programming
Install and administer Hadoop with the Apache Ambari GUI tool

Who Should Take This Course

Users, developers, and administrators interested in learning the fundamental aspects and operations of the open source Hadoop and Spark ecosystems

Course Requirements

Basic understanding of programming and development
A working knowledge of Linux systems and tools
Familiarity with Bash, Python, Java, and C++

Lesson 1: Background Concepts
This lesson introduces Hadoop and Spark along with the many aspects and features that enable the analysis of large unstructured data sets. Many discussions about Hadoop ignore the fundamental change Hadoop brings to data management. Doug explains this key point using the data lake metaphor, and then provides background on how the Hadoop data platform, MapReduce, and Spark fit into the data analytics landscape. A bonus lesson for new Linux users covers the basics of the command line interface used throughout these lessons.

Lesson 2: Running Hadoop on a Desktop or Laptop
A real Hadoop installation, whether it be a local cluster or in the cloud, can be difficult to configure and possibly an expensive proposition. To make the examples in this tutorial more accessible, you learn how to install the Hortonworks HDP Sandbox on a desktop or laptop. The "Sandbox" is a freely available Hadoop virtual machine that provides a full Hadoop environment (including Spark). You can use this environment to try most of the examples in this tutorial. If you would rather learn about Hadoop and Spark installation details, Doug also walks through a direct install on a single Linux machine using the latest Hadoop and Spark binary code.

Lesson 3: The Hadoop Distributed File System
The backbone of Hadoop is the Hadoop Distributed File System or HDFS. In this lesson you learn the basics of HDFS and how it is different from many standard file systems used today. In particular, Doug explains why various design trade-offs provide HDFS with a performance edge in big data applications. You also learn how to navigate HDFS using the Hadoop tools and how to use HDFS in user programs. Finally, Doug presents some of the new features available in HDFS including high availability, federation, snapshots, and NFS access.

Lesson 4: Hadoop MapReduce
If the Hadoop Distributed File System is the backbone of Hadoop, then MapReduce is the muscle that operates on big data. In this lesson, Doug shows you how MapReduce compares to a traditional search approach. From there, he shows you how to compile and run a Java MapReduce application. Deeper background on how MapReduce works is presented along with how to use MapReduce with other languages and how to do simple debugging of a MapReduce program.
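
As a rough illustration of the MapReduce paradigm named above, the following is a minimal single-process sketch in plain Python. It is not taken from the course materials and does not use Hadoop; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical and chosen only for this example. It shows the three conceptual stages of a word count: map emits key/value pairs, shuffle groups them by key, and reduce combines each group.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three stages in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    print(word_count(["big data", "Big deal"]))
```

In a real Hadoop job the map and reduce steps run in parallel across many machines and the shuffle is handled by the framework; this sketch only mirrors the data flow on one machine.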

Lesson 5: Hadoop MapReduce Examples
This lesson continues with MapReduce examples. Doug first shows you a multifile word count program, and then moves on to a more practical log file analysis. From there, he demonstrates how to use a really large text file, like Wikipedia. The lesson concludes with some examples of running MapReduce benchmarks and using the YARN job browser.

Lesson 6: Higher Level Tools
While Hadoop is very effective at presenting a basic scalable MapReduce model, some higher-level approaches have been developed. In this lesson, Doug teaches you how to use Apache Pig, a Hadoop scripting language that simplifies using MapReduce. In addition, he shows you how to use Apache Hive QL, an SQL-like language that enables higher-level "ad hoc" queries using MapReduce and HDFS. Finally, the Oozie workflow manager is presented.

Lesson 7: Using the Spark Language
Spark has become a popular tool for data analytics. In this lesson, Doug provides some of the basic aspects of the Spark language and demonstrates the Python-Spark interface, PySpark, with a simple command line example. Additional aspects of the Spark language are also used in the next two lessons.

Lesson 8: Getting Data into Hadoop HDFS
The first, and often overlooked, step in data analytics is "data ingest." As was demonstrated in Lesson 3, files can be simply copied into HDFS. However, there are methods that can preserve and import structure that could be lost with simple copying. In this lesson, Doug demonstrates how to import data into Hive tables and use Spark to import data into HDFS. He also demonstrates importing log and other streaming data directly into HDFS using Apache Flume. Finally, a complete example of using Apache Sqoop to import and export a relational database into and out of HDFS is presented.

Lesson 9: Using the Zeppelin Web Interface
Although many of the early Hadoop applications were developed using the command line interface, new web-based GUI tools such as Apache Zeppelin offer a more user-friendly approach to application development. In this lesson, Doug walks through the Zeppelin interface and shows how to create an interactive Zeppelin notebook for a simple Spark application.

Lesson 10: Learning Basic Hadoop Installation and Administration
One of the challenges facing Hadoop users and administrators is setting up a real cluster for production use. In this lesson, Doug teaches you how to use the Ambari web GUI to install, monitor, and administer a full Hadoop installation. He also provides a few important command line tools that will help with basic administration. Finally, some additional HDFS features such as snapshots and NFSv3 mounts are demonstrated.

About Pearson Video Training

Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include: IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.



Downloads

The companion materials for this LiveLesson can be downloaded from https://www.clustermonkey.net/download/LiveLessons/Hadoop_Fundamentals/.



Sample Content

Introduction

Lesson 1: Background Concepts
Learning objectives
Lesson 1.1 Understand Big Data and analytics
Lesson 1.2 Understand Hadoop as a data platform
Lesson 1.3 Understand Hadoop MapReduce basics
Lesson 1.4 Understand Spark language basics
Lesson 1.5 Learn the Linux command line features

Lesson 2: Running Hadoop on a Desktop or Laptop
Learning objectives
Lesson 2.1 Install Hortonworks Hadoop and Spark HDP Sandbox
Lesson 2.2 Install from Hadoop sources
Lesson 2.3 Install Spark from sources

Lesson 3: The Hadoop Distributed File System
Learning objectives
Lesson 3.1 Understand HDFS basics
Lesson 3.2 Use HDFS tools and do administration
Lesson 3.3 Use HDFS in programs
Lesson 3.4 Utilize additional features of HDFS

Lesson 4: Hadoop MapReduce
Learning objectives
Lesson 4.1 Understand the MapReduce paradigm
Lesson 4.2 Develop and run a Java MapReduce application
Lesson 4.3 Understand how MapReduce works

Lesson 5: Hadoop MapReduce Examples
Learning objectives
Lesson 5.1 Use the Streaming Interface
Lesson 5.2 Use the Pipes interface
Lesson 5.3 Run the Hadoop grep example
Lesson 5.4 Debugging MapReduce
Lesson 5.5 Understand Hadoop Version 2 MapReduce
Lesson 5.6 Use Hadoop Version 2 features Part 1
Lesson 5.6 Use Hadoop Version 2 features Part 2

Lesson 6: Higher Level Tools
Learning objectives
Lesson 6.1 Demonstrate a Pig example
Lesson 6.2 Demonstrate a Hive example
Lesson 6.3 Demonstrate an Oozie example Part 1
Lesson 6.3 Demonstrate an Oozie example Part 2

Lesson 7: Using the Spark Language
Lesson 7.1 Learn Spark language basics
Lesson 7.2 Demonstrate a PySpark command line example

Lesson 8: Getting Data into Hadoop HDFS
Learning objectives
Lesson 8.1 Import data into Hive tables
Lesson 8.2 Use Spark to import data into HDFS
Lesson 8.3 Demonstrate a Flume Example Part 1
Lesson 8.3 Demonstrate a Flume Example Part 2
Lesson 8.4 Demonstrate a Sqoop Example Part 1
Lesson 8.4 Demonstrate a Sqoop Example Part 2

Lesson 9: Using the Zeppelin Web Interface
Learning objectives
Lesson 9.1 Understand Zeppelin features
Lesson 9.2 Create a PySpark example in Zeppelin

Lesson 10: Learning Basic Hadoop Installation and Administration
Learning objectives
Lesson 10.1 Install and configure Hadoop using Ambari
Lesson 10.2 Perform simple administration and monitoring with Ambari
Lesson 10.3 Perform simple administration and monitoring
Lesson 10.4 Utilize additional features of HDFS

