
Spark, Ray, and Python for Scalable Data Science (Video Training)


Description

  • Copyright 2021
  • Edition: 1st
  • Online Video
  • ISBN-10: 0-13-680594-9
  • ISBN-13: 978-0-13-680594-6

7.5 Hours of Video Instruction

Conceptual overviews and code-along sessions get you scaling up your data science projects using Spark, Ray, and Python.

Overview
Machine learning is moving out of futuristic AI projects and into the everyday data analysis on your desk. That means going beyond following along in discussions to actually coding machine learning tasks. Spark, Ray, and Python for Scalable Data Science LiveLessons shows you how to scale machine learning and artificial intelligence projects using Python, Spark, and Ray.

Skill Level

  • Beginner to Intermediate

Learn How To
  • Integrate Python and distributed computing
  • Scale data processing with Spark
  • Conduct exploratory data analysis with PySpark
  • Utilize parallel computing with Ray
  • Scale machine learning and artificial intelligence applications with Ray


Who Should Take This Course
  • This course is a good fit for anyone who needs to improve their fundamental understanding of scalable data processing integrated with Python for use in machine learning or artificial intelligence applications.


Course Requirements
  • A basic understanding of programming in Python (variables, basic control flow, simple scripts). 
  • Familiarity with the vocabulary of data processing at scale, machine learning (dataset, training set, test set, model), and AI.

Lesson Descriptions 

Lesson 1: Introduction to Distributed Computing in Python
Lesson 1 starts with an introduction to the data science process and workflow. It then turns to a bit of history on why frameworks like Spark and Ray are necessary, followed by a short primer on distributed systems theory and a survey of Python-based distributed computing frameworks. Finally, Jonathan introduces the Spark ecosystem and explains how Spark compares to Ray.
Lesson 2: Scaling Data Processing with Spark
Lesson 2 goes into detail on the Spark framework, beginning with a "Hello World" example of programming with Spark. Jonathan then turns to the Spark APIs, and you get hands-on experience with one of Spark's primary data structures, the resilient distributed dataset (RDD). Next come key-value pairs and the MapReduce-style operations Spark performs on them. The lesson finishes up with a look at Spark internals and the overall Spark application lifecycle.
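
For a flavor of the RDD API this lesson covers, here is a minimal word-count sketch in PySpark; the local master setting and the tiny in-memory corpus are illustrative stand-ins, not examples taken from the course.

    from pyspark import SparkContext

    # Run Spark locally for illustration; the course environment may differ.
    sc = SparkContext("local[*]", "wordcount-sketch")

    lines = sc.parallelize(["hello spark", "hello ray", "hello python"])

    # MapReduce-style pipeline over key-value pairs.
    counts = (lines.flatMap(lambda line: line.split())   # transformation: split into words
                   .map(lambda word: (word, 1))          # transformation: make (word, 1) pairs
                   .reduceByKey(lambda a, b: a + b))     # transformation: sum counts per key

    print(counts.collect())  # action: triggers execution
    sc.stop()

Nothing executes until collect() is called, which is the transformations-versus-actions distinction the lesson explores.
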
Lesson 3: Exploratory Data Analysis with PySpark
In Lesson 3, Jonathan continues using Spark, now in the context of a larger data science workflow centered on natural language processing (NLP). He starts with a general introduction to exploratory data analysis (EDA), followed by a quick tour of Jupyter notebooks. Next he discusses how to do EDA with Spark at scale, and then he shows you how to create summary statistics and data visualizations to make sense of datasets. Finally, he tackles the NLP example, showing you how to transform a large corpus of text into a numerical representation suitable for machine learning.
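
As a rough illustration of the tokenization and vectorization step described above, the sketch below applies Spark MLlib's Tokenizer and CountVectorizer to a toy DataFrame; the column names and sample sentences are placeholders, not material from the course.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Tokenizer, CountVectorizer

    spark = SparkSession.builder.appName("nlp-sketch").getOrCreate()

    # Toy corpus standing in for a large body of text.
    df = spark.createDataFrame(
        [("spark scales data processing",), ("ray scales python workloads",)],
        ["text"],
    )

    # Split raw text into words, then turn word lists into sparse count vectors.
    tokens = Tokenizer(inputCol="text", outputCol="words").transform(df)
    vectors = CountVectorizer(inputCol="words", outputCol="features").fit(tokens).transform(tokens)
    vectors.show(truncate=False)

    spark.stop()
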
Lesson 4: Parallel Computing with Ray
Lesson 4 introduces the Ray programming API, with Jonathan examining the similarities and differences between the Ray and Spark APIs. You learn how to distribute functions with Ray tasks, as well as how to perform operations on distributed classes and objects with Ray actors. Finally, Jonathan finishes up with a large-scale simulation that highlights the strengths of the Ray framework.
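
To convey the flavor of the Ray API the lesson describes, here is a minimal sketch of a remote function (a Ray task) and a remote class (a Ray actor); the function and class names are illustrative only.

    import ray

    ray.init()  # start Ray locally; on a cluster you would pass a cluster address

    @ray.remote
    def square(x):
        # A task: calls run in parallel on worker processes.
        return x * x

    @ray.remote
    class Counter:
        # An actor: a stateful object living in its own worker process.
        def __init__(self):
            self.count = 0

        def increment(self):
            self.count += 1
            return self.count

    print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

    counter = Counter.remote()
    print(ray.get(counter.increment.remote()))            # 1

    ray.shutdown()
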
Lesson 5: Scaling AI Applications with Ray
Lesson 5 discusses how Ray enables you to scale up machine learning and artificial intelligence applications with Python. The lesson starts with the general model training and evaluation process in Python. It then turns to how Ray scales both the evaluation and tuning of your models, making highly efficient hyperparameter tuning possible. You also see how, once you have a trained model, Ray can serve predictions from it. Finally, the lesson closes with an introduction to how Ray enables you to deploy machine learning models to production and monitor them once they are there.
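
As a hint of what distributed hyperparameter tuning looks like, here is a minimal Ray Tune sketch using the 1.x-era API that matches the course's 2021 time frame (newer Ray releases expose a Tuner interface instead); the toy objective and search space are stand-ins for the scikit-learn models used in the lesson.

    from ray import tune

    def objective(config):
        # Toy "training" step: report a loss derived from the hyperparameters.
        loss = (config["lr"] - 0.1) ** 2 + config["momentum"]
        tune.report(loss=loss)

    analysis = tune.run(
        objective,
        config={
            "lr": tune.grid_search([0.01, 0.1, 1.0]),
            "momentum": tune.uniform(0.0, 1.0),
        },
    )
    print(analysis.get_best_config(metric="loss", mode="min"))

Ray Tune runs each configuration as its own trial, which is what makes the resource-efficient, early-stopping-aware search strategies in this lesson possible.
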

About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.
Video Lessons are available for download for offline viewing within the streaming format. Look for the green arrow in each lesson.



Sample Content

Table of Contents

Introduction

Lesson 1: Introduction to Distributed Computing in Python
Topics
1.1 Introduction and Materials
1.2 The Data Science Process
1.3 A Brief Historical Diversion
1.4 Distributed Systems Primer
1.5 Python Distributed Computing Frameworks
1.6 The What and Why of Spark
1.7 The Spark Platform
1.8 Spark versus Ray

Lesson 2: Scaling Data Processing with Spark
Topics
2.1 Course Coding Setup
2.2 Your First PySpark Job
2.3 Introduction to RDDs
2.4 Transformations versus Actions
2.5 RDD Deep Dive
2.6 The Spark Execution Context
2.7 Spark versus Hadoop
2.8 Spark Application Lifecycle

Lesson 3: Exploratory Data Analysis with PySpark
Topics
3.1 Introduction to Exploratory Data Analysis
3.2 A Quick Tour of Jupyter Notebooks
3.3 Parsing Data at Scale
3.4 Spark DataFrames: Integration into Existing Workflows
3.5 Scaling Exploratory Data Analysis with Spark
3.6 Making Sense of Data: Summary Statistics and Data Visualization
3.7 Working with Text: Introduction to NLP
3.8 Tokenization and Vectorization with MLlib

Lesson 4: Parallel Computing with Ray
Topics
4.1 The What and Why of Ray
4.2 The Ray Programming Model
4.3 Parallelizing Functions with Ray Tasks
4.4 Asynchronous Programming with Actors
4.5 Cellular Automata and the Game of Life
4.6 Distributed Agent-Based Models with Ray

Lesson 5: Scaling AI Applications with Ray
Topics
5.1 Introduction to Model Evaluation
5.2 Serializing Data for Machine Learning Applications
5.3 Cross Validation with scikit-learn
5.4 Strategies for Tuning Machine Learning Models
5.5 Grid Search in Python
5.6 Distributed Hyperparameter Optimization with Ray Tune
5.7 Resource Efficient Search with Principled Early Stopping
5.8 Diving Deeper into Ray's Internals
5.9 Serving Machine Learning Models
5.10 Deploying AI Applications with Ray Serve
5.11 Monitoring Model Performance in Production

Summary
