
Data Engineering Foundations LiveLessons Part 1: Using Spark, Hive, and Hadoop Scalable Tools


Online Video


Description

  • Copyright 2022
  • Edition: 1st
  • Online Video
  • ISBN-10: 0-13-744066-9
  • ISBN-13: 978-0-13-744066-5

6+ Hours of Video Instruction


The perfect way to get started with scalable data engineering tools. All tools and examples are presented using a practical, hands-on approach that can be reproduced on a freely provided virtual machine. By the end of these LiveLessons, participants will have gained the understanding and experience to begin working within the big data engineering ecosystem.


Overview


Data Engineering Foundations Part 1: Using Spark, Hive, and Hadoop Scalable Tools LiveLessons provides more than six hours of video introducing you to the Apache Hadoop big data ecosystem. The tutorial includes background information and demonstrates the core components of data engineering and scalability, including Apache PySpark, Hadoop, the Hadoop Distributed File System (HDFS), MapReduce, Hive, and the Zeppelin web notebook. It also covers the use of basic Linux command line analytic tools. All lesson examples and open-source software used in these LiveLessons are freely available on a companion virtual machine that enables continued exploration of the lesson examples.


Skill Level

  • Beginner
  • Intermediate

Learn How To
  • Understand basic data engineering concepts
  • Understand Apache Hadoop, MapReduce, and Spark operation
  • Understand scalable systems
  • Use Linux command line analytic tools
  • Use Apache Zeppelin web notebooks with different tools
  • Use Apache Hadoop and the Hadoop Distributed File System
  • Use Apache Hadoop MapReduce with Python
  • Use the Apache Hive Scalable Database
  • Use Apache PySpark with MapReduce
  • Use Apache PySpark with DataFrames and Hive tables


Who Should Take This Course
  • Users, developers, and administrators interested in learning the fundamental aspects and operations of data engineering and scalable systems

Course Requirements
  • Basic understanding of programming and application development
  • A working knowledge of Linux systems, command line, and tools
  • Familiarity with Python, SQL, and the Bash shell

Lesson Descriptions

Lesson 1: Background Concepts

In Lesson 1, Doug introduces you to the important concepts you need to know to understand big data and the Hadoop and Spark ecosystem. He begins with a description of big data and big data analytic concepts and then presents Hadoop as a big data platform. He finishes the lesson with Hadoop MapReduce basics and Spark language basics.


Lesson 2: Working with Scalable Systems
In Lesson 2, Doug introduces you to working with scalable systems. The lesson starts with scalable computing concepts and then turns to a freely available Linux-based virtual machine that runs on most laptop and desktop systems. Using this virtual machine, you can run most of the examples in the lessons. Doug also uses the virtual machine to demonstrate some of the Linux command line analytic tools and to introduce the Zeppelin web notebook.
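
To give a flavor of this style of quick command-line analytics, here is a minimal Python sketch that mirrors a classic shell pipeline (grep piped through sort and uniq -c) for tallying matching lines in a text file. The file name and search pattern are illustrative placeholders, not files from the course:

from collections import Counter

# Tally lines containing a pattern, similar to the shell pipeline:
#   grep ERROR app.log | sort | uniq -c | sort -rn
# "app.log" and "ERROR" are hypothetical placeholders.
counts = Counter()
with open("app.log") as f:
    for line in f:
        if "ERROR" in line:
            counts[line.strip()] += 1

# Print the ten most frequent matching lines with their counts.
for line, n in counts.most_common(10):
    print(n, line)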


Lesson 3: Using the Hadoop HDFS File System

Doug explains the Hadoop Distributed File System (HDFS) in Lesson 3. He also presents a quick start on using the HDFS command line tools, and he finishes the lesson by explaining how to use the HDFS web interface.
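
The HDFS command line tools follow the pattern hdfs dfs -<command>. As a minimal sketch (assuming a configured Hadoop client on the path; the local file and HDFS paths are illustrative), the same operations can also be scripted from Python:

import subprocess

def hdfs(*args):
    # Run an "hdfs dfs" subcommand and return its text output.
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Illustrative paths; substitute your own local file and HDFS directory.
hdfs("-mkdir", "-p", "/user/demo/data")        # create an HDFS directory
hdfs("-put", "sample.txt", "/user/demo/data")  # copy a local file into HDFS
print(hdfs("-ls", "/user/demo/data"))          # list the HDFS directory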


Lesson 4: Using Hadoop MapReduce

In this lesson, Doug explains and demonstrates how to use Hadoop MapReduce. He begins with an explanation of the MapReduce algorithm and how it operates in a clustered parallel environment. Doug then demonstrates how to run MapReduce examples and use the Hadoop streaming interface on your local machine. He concludes the lesson by demonstrating Hadoop performance using a four-node Hadoop cluster and the web-based MapReduce jobs interface.
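
To make the streaming interface concrete, here is a minimal sketch of the classic streaming word count in Python (not the exact scripts used in the lesson). Hadoop streaming pipes input splits to the mapper on stdin and delivers the mapper's output to the reducer sorted by key:

# mapper.py: emit a tab-separated (word, 1) pair for every word
import sys
for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# reducer.py: input arrives sorted by key, so counts for each word
# are adjacent and can be summed in a single pass
import sys
current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")

Because both scripts simply read stdin and write stdout, you can test them locally before submitting a streaming job, for example with: cat input.txt | python3 mapper.py | sort | python3 reducer.py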


Lesson 5: Using the Hive Scalable Database

In Lesson 5, Doug introduces the Hive scalable database. Based on Hadoop MapReduce, Hive is used to derive a new feature from an existing dataset. This important data engineering process is demonstrated from both the command line and the Zeppelin web notebook.
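
As a sketch of this feature-derivation pattern (the table and column names are hypothetical, and the hive command line client is assumed to be installed), a new Boolean feature can be derived from an existing column with a single CREATE TABLE AS SELECT statement:

import subprocess

# Derive a new feature column (is_late) from an existing one (dep_delay).
# Table and column names are hypothetical; "hive -e" runs a HiveQL string.
query = """
CREATE TABLE flights_features AS
SELECT *,
       CASE WHEN dep_delay > 15 THEN 1 ELSE 0 END AS is_late
FROM flights;
"""
subprocess.run(["hive", "-e", query], check=True)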


Lesson 6: Using Apache PySpark

In the final lesson of Part 1, Doug introduces PySpark. Built on the underlying Spark engine, PySpark enables Python programmers to learn scalable data engineering. Before the hands-on examples, Doug provides a solid introduction to Spark and PySpark operations. This background includes the Spark web interface and how to manage a SparkSession and a SparkContext for distributed operation. Examples of MapReduce programming and DataFrame operations are presented from both the command line and a Zeppelin notebook. The lesson concludes with the operations needed to transfer data between PySpark and Hive database tables.
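
As a compact sketch of this workflow (the input path and table name are hypothetical, and a Hive-enabled Spark installation is assumed), the following shows a SparkSession, a MapReduce-style word count on an RDD, and a DataFrame round trip through a Hive table:

from pyspark.sql import SparkSession

# Start a SparkSession with Hive support; the SparkContext used for
# RDD (MapReduce-style) work is reached through the session.
spark = (SparkSession.builder
         .appName("pyspark-demo")   # application name is illustrative
         .enableHiveSupport()
         .getOrCreate())
sc = spark.sparkContext

# MapReduce-style word count on an RDD; the input path is hypothetical.
counts = (sc.textFile("sample.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(5))

# DataFrame round trip to and from a Hive table; the table name is hypothetical.
df = counts.toDF(["word", "total"])
df.write.mode("overwrite").saveAsTable("word_counts")
spark.sql("SELECT * FROM word_counts ORDER BY total DESC LIMIT 5").show()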


About Pearson Video Training


Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Sams, and Que. Topics include IT certification, network security, programming, web development, mobile development, data analytics, and more. Learn more about Pearson video training at http://www.informit.com/video.


Video Lessons are available for download for offline viewing within the streaming format. Look for the green arrow in each lesson.

Sample Content

Table of Contents

Introduction

Lesson 1: Background Concepts

Learning objectives

1.1 Understand big data and data analytics concepts

1.2 Understand Hadoop as a big data platform

1.3 Understand Hadoop MapReduce basics

1.4 Understand Spark language basics

Lesson 2: Working with Scalable Systems

Learning objectives

2.1 Understand scalable concepts

2.2 Emulate scalable systems

2.3 Use Linux command line analytics tools

2.4 Use the Zeppelin web notebook

Lesson 3: Using the Hadoop HDFS File System

Learning objectives

3.1 Understand HDFS basics

3.2 Use HDFS command line tools

3.3 Use the HDFS web interface

Lesson 4: Using Hadoop MapReduce

Learning objectives

4.1 Understand the MapReduce paradigm and platform

4.2 Understand parallel MapReduce

4.3 Run MapReduce examples

4.4 Use the streaming interface

4.5 Use the MapReduce (YARN) web interface

Lesson 5: Using the Hive Scalable Database

Learning objectives

5.1 Run a Hive "SQL" example using the command line

5.2 Run a Hive example using a Zeppelin notebook

Lesson 6: Using Apache PySpark

Learning objectives

6.1 Understand Spark language basics

6.2 Understand SparkSession and SparkContext

6.3 Use PySpark for MapReduce programming

6.4 Run a PySpark example using a Zeppelin notebook

Summary
