Put on your data engineer hat! In this series of liveProjects, you’ll build a modern, cloud-based, three-layer data Lakehouse. First, you’ll set up your workspace on the Databricks platform, leveraging important Databricks features, before pushing the data into the first two layers of the data lake. Next, using Apache Spark, you’ll build the third layer, used to serve insights to different end-users. Then, you’ll use Delta Lake to turn your existing data lake into a Lakehouse. Finally, you’ll deliver an infrastructure that allows your end-users to perform specific queries, using Apache Superset, and build dashboards on top of the existing data. When you’re done with the projects in this series, you’ll have a complete big data pipeline for a cloud-based data lake—and you’ll understand why the three-layer architecture is so popular.
Imagine you’re a data engineer working at an enterprise. In this liveProject, you’ll set up a Databricks platform, creating clusters and notebooks, interacting with the Databricks File System (DBFS), and leveraging important Databricks features. You’ll also gain first-hand experience with Apache Spark—the world’s most widely used distributed processing framework—on tasks like reading the input data in CSV and JSON format, filtering, and writing the data to the data lake’s curated layer on DBFS.
Step into the role of a data engineer working at an enterprise. Your task is to build a data lake’s serving layer and ensure that business queries run on it in a matter of seconds. You’ll start with reading cleansed data that’s already sitting in the curated layer. Then you’ll transform it, enrich it, aggregate it, and denormalize it using Apache Spark. When you’re finished, you’ll have multiple output tables that make up the serving layer of the data lake.
Turn an existing data lake into a Lakehouse, using Delta Lake, an open table format (and the cornerstone of Databricks’ Lakehouse design). For data processing and interacting with the Lakehouse, you’ll use Apache Spark. As you transform the existing tables into Delta tables, you’ll explore Delta Lake’s rich features, see firsthand how it handles potential problems, and appreciate the sophistication of the Lakehouse design.
Give your end-users what they want! In this liveProject, your challenge is to deliver an infrastructure that allows your end-users to query an existing, fully functioning modern Lakehouse for two different use cases: analyzing data aggregates over a long period of time to identify trends, and analyzing recently ingested data for monitoring purposes. You’ll use Preset, an SaaS platform that offers a managed version of Apache Superset, to run queries on the Lakehouse and build two interactive dashboards for your distinct use cases.
This liveProject series is for software engineers and data professionals interested in onboarding big data processing skills including processing large amounts of data and building cloud-based data lakes. To begin these liveProjects you’ll need to be familiar with the following:
In this liveProject series, you’ll learn to build a complete big data pipeline for a cloud-based data lake. In a world where data is a high-value commodity, so are the skills you’ll learn here:
geekle is based on a wordle clone.