Four-Project Series

End-to-End Batch Data Pipeline with Spark

prerequisites
basic Python • basics of Jupyter Notebook • basic distributed computing • basic SQL • basic data processing in Python • basic understanding of distributed data lakes • basic common data visualization types
skills learned
use Apache Spark to read, transform, and write data • use Apache Superset to create interactive dashboards • create and update Delta Lake tables • perform custom data manipulation according to specific user needs • design data pipelines that enable end users to consume data efficiently
Mahdi Karabiben
4 weeks · 4-6 hours per week average · BEGINNER

Put on your data engineer hat! In this series of liveProjects, you’ll build a modern, cloud-based, three-layer data Lakehouse. First, you’ll set up your workspace on the Databricks platform, leveraging important Databricks features, before pushing the data into the first two layers of the data lake. Next, using Apache Spark, you’ll build the third layer, which serves insights to different end-users. Then, you’ll use Delta Lake to turn your existing data lake into a Lakehouse. Finally, you’ll use Apache Superset to deliver an infrastructure that allows your end-users to run specific queries and build dashboards on top of the existing data. When you’re done with the projects in this series, you’ll have a complete big data pipeline for a cloud-based data lake, and you’ll understand why the three-layer architecture is so popular.

These projects are designed for learning purposes and are not complete, production-ready applications or solutions.

here's what's included

Project 1 Data Ingestion and Cleaning

Imagine you’re a data engineer working at an enterprise. In this liveProject, you’ll set up the Databricks platform, creating clusters and notebooks, interacting with the Databricks File System (DBFS), and leveraging important Databricks features. You’ll also gain first-hand experience with Apache Spark, the world’s most widely used distributed processing framework, on tasks like reading input data in CSV and JSON format, filtering it, and writing it to the data lake’s curated layer on DBFS.
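
As a rough sketch of that read-filter-write flow (the paths, column names, and filter condition below are hypothetical, not the project’s actual dataset; `spark` is the session that Databricks notebooks provide automatically):

```python
from pyspark.sql import functions as F

# Read raw inputs from hypothetical locations on DBFS
orders_raw = spark.read.option("header", True).csv("dbfs:/raw/orders.csv")
events_raw = spark.read.json("dbfs:/raw/events.json")

# Basic cleaning: drop rows missing key fields and keep only completed orders
orders_clean = (
    orders_raw
    .dropna(subset=["order_id", "order_date"])
    .filter(F.col("status") == "COMPLETED")
)

# Write the cleansed data to the data lake's curated layer
orders_clean.write.mode("overwrite").parquet("dbfs:/curated/orders/")
```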

Project 2 Data Manipulation

Step into the role of a data engineer working at an enterprise. Your task is to build the data lake’s serving layer and ensure that business queries run on it in a matter of seconds. You’ll start by reading the cleansed data that’s already sitting in the curated layer. Then you’ll transform it, enrich it, aggregate it, and denormalize it using Apache Spark. When you’re finished, you’ll have multiple output tables that make up the serving layer of the data lake.
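
As an illustration of that transform-enrich-aggregate flow, here’s a minimal PySpark sketch; the table paths, join key, and columns are hypothetical rather than the project’s actual schema:

```python
from pyspark.sql import functions as F

# Read cleansed tables from the curated layer (hypothetical paths)
orders = spark.read.parquet("dbfs:/curated/orders/")
customers = spark.read.parquet("dbfs:/curated/customers/")

# Enrich and denormalize: attach customer attributes to each order
enriched = orders.join(customers, on="customer_id", how="left")

# Aggregate into a serving-layer table sized for fast business queries
daily_revenue = (
    enriched
    .groupBy("order_date", "customer_country")
    .agg(
        F.sum("amount").alias("total_revenue"),
        F.countDistinct("order_id").alias("order_count"),
    )
)

daily_revenue.write.mode("overwrite").parquet("dbfs:/serving/daily_revenue/")
```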

Project 3 From Data Lake to Lakehouse

Turn an existing data lake into a Lakehouse, using Delta Lake, an open table format (and the cornerstone of Databricks’ Lakehouse design). For data processing and interacting with the Lakehouse, you’ll use Apache Spark. As you transform the existing tables into Delta tables, you’ll explore Delta Lake’s rich features, see firsthand how it handles potential problems, and appreciate the sophistication of the Lakehouse design.
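
For a sense of what the migration involves, here’s a minimal sketch (paths are hypothetical); an existing Parquet table can be converted in place, or rewritten using the Delta format:

```python
from delta.tables import DeltaTable

# Convert an existing Parquet table in place into a Delta table
spark.sql("CONVERT TO DELTA parquet.`dbfs:/serving/daily_revenue/`")

# Delta features such as table history and time travel then become available
delta_table = DeltaTable.forPath(spark, "dbfs:/serving/daily_revenue/")
delta_table.history().show()

# Time travel: read the table as it looked at an earlier version
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("dbfs:/serving/daily_revenue/")
)
```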

Project 4 Interactive Superset Dashboard

Give your end-users what they want! In this liveProject, your challenge is to deliver an infrastructure that allows your end-users to query an existing, fully functioning modern Lakehouse for two different use cases: analyzing data aggregates over a long period of time to identify trends, and analyzing recently ingested data for monitoring purposes. You’ll use Preset, a SaaS platform that offers a managed version of Apache Superset, to run queries on the Lakehouse and build two interactive dashboards for your distinct use cases.
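
The dashboards themselves are assembled in Superset’s UI, but the charts behind them boil down to SQL against the serving layer. A hypothetical trends query like the one below (table and column names are illustrative) can be prototyped in a notebook before being recreated in Superset:

```python
# Prototype of the kind of aggregate query a trends dashboard might chart
monthly_trend = spark.sql("""
    SELECT date_trunc('month', order_date) AS month,
           customer_country,
           SUM(total_revenue) AS revenue
    FROM serving.daily_revenue
    GROUP BY 1, 2
    ORDER BY 1
""")
monthly_trend.show()
```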

project author

Mahdi Karabiben

Mahdi is a senior data engineer at Zendesk. With four years of experience in data engineering, he has worked on multiple large-scale projects within the AdTech and financial sectors. He's a Cloudera-certified Apache Spark developer and works with Big Data technologies on a daily basis, designing and building data pipelines, data lakes, and data services that rely on petabytes of data. Thanks to his degree in software engineering (with a minor in big data), he is comfortable with a wide range of technologies and concepts. He additionally writes for major Medium publications (Towards Data Science, The Startup) and technology websites (TheNextWeb, Software Engineering Daily, freeCodeCamp).

Prerequisites

This liveProject series is for software engineers and data professionals interested in building big data processing skills, including processing large amounts of data and building cloud-based data lakes. To begin these liveProjects, you’ll need to be familiar with the following:


TOOLS
  • Basic Python
  • Basics of Jupyter Notebook
  • Basic distributed computing
  • Basic SQL
TECHNIQUES
  • Basic data processing in Python
  • Basic understanding of distributed data lakes
  • Basic familiarity with common data visualization types

you will learn

In this liveProject series, you’ll learn to build a complete big data pipeline for a cloud-based data lake. Data is a high-value commodity in today’s world, and so are the skills you’ll learn here:


  • Use Apache Spark to read, transform, and write data
  • Use Apache Superset to create interactive dashboards
  • Create, work with, and update Delta Lake tables
  • Perform custom data manipulation according to specific user needs
  • Design and implement data pipelines and lakes to enable end-users to consume data efficiently
  • Leverage the different components of Databricks
  • Use Preset to create and manage Apache Superset workspaces

features

Self-paced
You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get help
Within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
Book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.