Four-Project Series

End-to-End Batch Data Pipeline with Spark

prerequisites: basic Python • basics of Jupyter Notebook • basic distributed computing • basic SQL • basic data processing in Python • basic understanding of distributed data lakes • basic common data visualization types
skills learned: use Apache Spark to read, transform, and write data • use Apache Superset to create interactive dashboards • create and update Delta Lake tables • perform custom data manipulation according to specific user needs • design data pipelines that enable end users to consume data efficiently

Mahdi Karabiben

4 weeks · 4-6 hours per week average · BEGINNER

Included with a Manning Online subscription

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

whole series

$49.99 (regular price $69.99)

you save $20.00 (29%)

Put on your data engineer hat! In this series of liveProjects, you’ll build a modern, cloud-based, three-layer data Lakehouse. First, you’ll set up your workspace on the Databricks platform, leveraging important Databricks features, before pushing the data into the first two layers of the data lake. Next, using Apache Spark, you’ll build the third layer, used to serve insights to different end-users. Then, you’ll use Delta Lake to turn your existing data lake into a Lakehouse. Finally, you’ll deliver an infrastructure that allows your end-users to perform specific queries, using Apache Superset, and build dashboards on top of the existing data. When you’re done with the projects in this series, you’ll have a complete big data pipeline for a cloud-based data lake—and you’ll understand why the three-layer architecture is so popular.

These projects are designed for learning purposes and are not complete, production-ready applications or solutions.

here's what's included

Project 1 Data Ingestion and Cleaning

Imagine you’re a data engineer working at an enterprise. In this liveProject, you’ll set up a Databricks platform, creating clusters and notebooks, interacting with the Databricks File System (DBFS), and leveraging important Databricks features. You’ll also gain first-hand experience with Apache Spark—the world’s most widely used distributed processing framework—on tasks like reading the input data in CSV and JSON format, filtering, and writing the data to the data lake’s curated layer on DBFS.
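The read, filter, and write flow described above can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only), not Spark and not the project's actual code; the sample records, the column names (`order_id`, `country`, `amount`), and the cleaning rules are all assumptions made up for the example. An in-memory JSON Lines string stands in for the curated layer on DBFS.

```python
import csv
import io
import json

# Hypothetical raw inputs; the project's real datasets are not shown here.
RAW_CSV = "order_id,country,amount\n1,FR,10.5\n2,,7.0\n3,DE,not_a_number\n"
RAW_JSON = (
    '[{"order_id": 4, "country": "US", "amount": 3.25},'
    ' {"order_id": 5, "country": "FR", "amount": null}]'
)


def read_raw(csv_text, json_text):
    """Read both raw formats into one list of dicts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.extend(json.loads(json_text))
    return rows


def clean(rows):
    """Keep rows with a country and a numeric amount; normalize types."""
    curated = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # drop rows whose amount is missing or malformed
        if not row.get("country"):
            continue  # drop rows without a country
        curated.append(
            {"order_id": int(row["order_id"]), "country": row["country"], "amount": amount}
        )
    return curated


curated_rows = clean(read_raw(RAW_CSV, RAW_JSON))
# One JSON record per line, as a stand-in for files in the curated layer.
curated_jsonl = "\n".join(json.dumps(r) for r in curated_rows)
print(curated_jsonl)
```

Only the two well-formed records survive the cleaning step. In the project, the same three stages run on a Databricks cluster with Spark instead of on in-memory strings.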

$19.99 (regular price $29.99)

Project 2 Data Manipulation

Step into the role of a data engineer working at an enterprise. Your task is to build a data lake’s serving layer and ensure that business queries run on it in a matter of seconds. You’ll start with reading cleansed data that’s already sitting in the curated layer. Then you’ll transform it, enrich it, aggregate it, and denormalize it using Apache Spark. When you’re finished, you’ll have multiple output tables that make up the serving layer of the data lake.
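To make the idea of a serving layer concrete, here is a small sketch that joins (denormalizes) and aggregates two curated tables once, so that later business queries read a small precomputed table. It uses Python's built-in `sqlite3` module as a stand-in for Spark; the table names (`orders`, `customers`, `revenue_by_country`) and the sample rows are hypothetical and are not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical curated-layer tables.
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, country TEXT);
    INSERT INTO orders VALUES (1, 10, 10.5), (2, 10, 4.5), (3, 11, 3.0);
    INSERT INTO customers VALUES (10, 'FR'), (11, 'US');

    -- Denormalize (join) and aggregate once into a serving-layer table.
    CREATE TABLE revenue_by_country AS
    SELECT c.country AS country, COUNT(*) AS order_count, SUM(o.amount) AS revenue
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country;
""")

# A business query now reads the small precomputed table directly.
serving_rows = conn.execute(
    "SELECT country, order_count, revenue FROM revenue_by_country ORDER BY country"
).fetchall()
print(serving_rows)
```

The design choice illustrated here is to pay the cost of joins and aggregations at write time, which is one common way to keep read-time queries fast.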

$19.99 (regular price $29.99)

Project 3 From Data Lake to Lakehouse

Turn an existing data lake into a Lakehouse, using Delta Lake, an open table format (and the cornerstone of Databricks’ Lakehouse design). For data processing and interacting with the Lakehouse, you’ll use Apache Spark. As you transform the existing tables into Delta tables, you’ll explore Delta Lake’s rich features, see firsthand how it handles potential problems, and appreciate the sophistication of the Lakehouse design.
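Delta Lake keeps a transaction log alongside a table's data files, and that log is what makes it possible to read earlier versions of the table. The toy class below imitates the idea in plain Python by replaying an append-only list of commits. It is only an illustration: the class name, the `("add" | "remove", row)` action shape, and the sample rows are invented for this sketch and do not reflect Delta Lake's actual file format or API.

```python
class ToyVersionedTable:
    """Toy model: table state is derived by replaying an append-only commit log."""

    def __init__(self):
        self._log = []  # each commit is a list of ("add" | "remove", row) actions

    def commit(self, actions):
        """Append one commit and return its version number."""
        self._log.append(list(actions))
        return len(self._log) - 1

    def snapshot(self, version=None):
        """Rebuild the rows visible at a version (latest by default)."""
        if version is None:
            version = len(self._log) - 1
        rows = []
        for actions in self._log[: version + 1]:
            for op, row in actions:
                if op == "add":
                    rows.append(row)
                elif op == "remove":
                    rows.remove(row)
        return rows


table = ToyVersionedTable()
v0 = table.commit([("add", {"id": 1, "status": "new"})])
v1 = table.commit(
    [("remove", {"id": 1, "status": "new"}), ("add", {"id": 1, "status": "shipped"})]
)

print(table.snapshot())    # latest state of the table
print(table.snapshot(v0))  # the earlier version is still readable
```

Because commits are only ever appended, an update never destroys the previous state; any version can be rebuilt by replaying the log up to that point.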

$19.99 (regular price $29.99)

Project 4 Interactive Superset Dashboard

Give your end-users what they want! In this liveProject, your challenge is to deliver an infrastructure that allows your end-users to query an existing, fully functioning modern Lakehouse for two different use cases: analyzing data aggregates over a long period of time to identify trends, and analyzing recently ingested data for monitoring purposes. You’ll use Preset, a SaaS platform that offers a managed version of Apache Superset, to run queries on the Lakehouse and build two interactive dashboards for your distinct use cases.
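The two use cases lead to two different query shapes, which the following sketch contrasts. It runs against an in-memory SQLite database through Python's built-in `sqlite3` module rather than against a Lakehouse through Superset; the `events` table, its columns, the sample rows, and the cutoff date are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table standing in for Lakehouse data.
    CREATE TABLE events (event_date TEXT, amount REAL);
    INSERT INTO events VALUES
        ('2023-01-15', 10.0), ('2023-01-20', 5.0),
        ('2023-02-03', 8.0), ('2023-02-27', 2.0), ('2023-02-28', 4.0);
""")

# Use case 1: aggregates over a long period, grouped by month, to spot trends.
monthly_trend = conn.execute(
    "SELECT substr(event_date, 1, 7) AS month, SUM(amount) "
    "FROM events GROUP BY month ORDER BY month"
).fetchall()

# Use case 2: only recently ingested rows, for monitoring.
# The cutoff is passed in explicitly so the example stays deterministic.
recent = conn.execute(
    "SELECT event_date, amount FROM events WHERE event_date >= ? ORDER BY event_date",
    ("2023-02-27",),
).fetchall()

print(monthly_trend)
print(recent)
```

The trend query scans the whole history but returns a few aggregated rows, while the monitoring query touches only a narrow, recent slice at full detail. That difference is one reason the two dashboards may be best served by differently shaped tables.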

$19.99 (regular price $29.99)

choose your plan

pro

monthly: $24.99
annual: $249.99 (only $20.83 per month)

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose another free product every time you renew
choose twelve free products per year
exclusive 50% discount on all purchases
End-to-End Batch Data Pipeline with Spark project for free

team

monthly: $49.99
annual: $399.99 (only $33.33 per month)

five seats for your team
access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose another free product every time you renew
choose twelve free products per year
exclusive 50% discount on all purchases
End-to-End Batch Data Pipeline with Spark project for free

project author

Mahdi Karabiben

Mahdi is a senior data engineer at Zendesk. With four years of experience in data engineering, he has worked on multiple large-scale projects within the AdTech and financial sectors. He's a Cloudera-certified Apache Spark developer and works with Big Data technologies on a daily basis, designing and building data pipelines, data lakes, and data services that rely on petabytes of data. Thanks to his degree in software engineering (with a minor in big data), he is comfortable with a wide range of technologies and concepts. He additionally writes for major Medium publications (Towards Data Science, The Startup) and technology websites (TheNextWeb, Software Engineering Daily, freeCodeCamp).

Prerequisites

This liveProject series is for software engineers and data professionals who want to build big data processing skills, including processing large amounts of data and building cloud-based data lakes. To begin these liveProjects, you’ll need to be familiar with the following:

TOOLS

Basic Python
Basics of Jupyter Notebook
Basic distributed computing
Basic SQL

TECHNIQUES

Basic data processing in Python
Basic understanding of distributed data lakes
Basic common data visualization types

features

Self-paced: You choose the schedule and decide how much time to invest as you build your project.
Project roadmap: Each project is divided into several achievable steps.
Get help: While within the liveProject platform, get help from fellow participants, with further help available through paid sessions with our expert mentors.
Compare with others: For each step, compare your deliverable to the solutions by the author and other participants.
Book resources: Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.