Six-Project Series

A Storage Layer for Big Data in AWS

prerequisites
Basic understanding of data storage and management concepts • familiarity with AWS services and the AWS console • knowledge of Python • basic networking concepts and configurations • basic understanding of security measures and best practices • understanding of data quality concepts and metrics • understanding of data pipeline and workflow concepts and tools.
skills learned
Moving data to the cloud using AWS DataSync or AWS Database Migration Service • transforming data using AWS Glue and PySpark • developing data quality checks using PyDeequ • orchestrating data pipelines using AWS Step Functions and AWS Glue workflows • automating infrastructure with CloudFormation • managing and organizing a data lake • managing data lifecycle policies.
Gianluigi Mucciolo
6 weeks · 5-7 hours per week average · INTERMEDIATE


Nextstellar Corp is a streaming media company that generates huge amounts of data from its customers. They want to move their data infrastructure from on-prem solutions that analyze only a sample of data to a modern cloud solution that gives them full access to all the data they produce. That’s where you come in! In this series of liveProjects, you’ll build an Extract, Transform, and Load (ETL) solution that can transfer data from numerous existing sources to the AWS cloud. You’ll learn how to code raw data transformation logic; use AWS Glue jobs to normalize, transform, and validate data quality rules; coordinate jobs into seamless workflows with AWS Step Functions; and more.

These projects are designed for learning purposes and are not complete, production-ready applications or solutions.

The content vastly exceeded my expectations and I think it’s truly excellent.

Nick Miller, Independent computational linguist

here's what's included

Project 1 Migrate Files to the Cloud

The Nextstellar Corp media service has a lot of data—too much data to handle on prem! In order to properly analyze all the data generated by their streaming customers, they’re migrating to the cloud. In this liveProject, you’ll be helping them. You’ll tackle the common challenge of transferring on-prem data to AWS using the handy AWS DataSync tool. You’ll use Infrastructure-as-Code to create Landing Zone Amazon S3 buckets, automate data migration, and finally prepare a summary of likely infrastructure costs for your boss to review.
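As a rough illustration of the Infrastructure-as-Code step, the sketch below builds a CloudFormation template for a single Landing Zone bucket as a Python dictionary and serializes it to JSON. The logical ID `LandingZoneBucket`, the `Environment` parameter, the bucket naming scheme, and the 30-day lifecycle transition are all assumptions made for this example; they are not taken from the project materials.

```python
import json


def landing_zone_template(bucket_suffix: str = "landing-zone") -> dict:
    """Build a CloudFormation template (as a dict) for one landing-zone bucket.

    All names and values here are hypothetical placeholders.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical landing-zone bucket for migrated files",
        "Parameters": {
            "Environment": {
                "Type": "String",
                "AllowedValues": ["dev", "prod"],
                "Default": "dev",
            }
        },
        "Resources": {
            "LandingZoneBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Assumed naming scheme: account ID + environment + suffix.
                    "BucketName": {
                        "Fn::Sub": "${AWS::AccountId}-${Environment}-" + bucket_suffix
                    },
                    "VersioningConfiguration": {"Status": "Enabled"},
                    # Block every form of public access to the raw data.
                    "PublicAccessBlockConfiguration": {
                        "BlockPublicAcls": True,
                        "BlockPublicPolicy": True,
                        "IgnorePublicAcls": True,
                        "RestrictPublicBuckets": True,
                    },
                    # Assumed lifecycle rule: move objects to a cheaper
                    # storage class after 30 days.
                    "LifecycleConfiguration": {
                        "Rules": [
                            {
                                "Id": "MoveToInfrequentAccess",
                                "Status": "Enabled",
                                "Transitions": [
                                    {
                                        "StorageClass": "STANDARD_IA",
                                        "TransitionInDays": 30,
                                    }
                                ],
                            }
                        ]
                    },
                },
            }
        },
        "Outputs": {
            "LandingZoneBucketArn": {
                "Value": {"Fn::GetAtt": ["LandingZoneBucket", "Arn"]}
            }
        },
    }


template = landing_zone_template()
template_json = json.dumps(template, indent=2)
print(template_json)
```

The printed JSON could be saved to a file and deployed with the CloudFormation tooling of your choice. Generating the template from code is only one option; writing the YAML or JSON by hand is just as valid.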

Project 2 Input Transactional Data

Nextstellar Corp is a media company with huge amounts of data to analyze. Some of that data is sitting in a PostgreSQL database, which is used for authentication management, decision-making, and maintaining user preferences and feedback. Your boss doesn’t want that data sitting in the database; he wants it in the cloud! Moving it is exactly what you’ll be doing in this liveProject. You’ll use AWS Database Migration Service to enrich Nextstellar’s data lake with data from the PostgreSQL database so it can take full advantage of both modern data architecture and the AWS ecosystem.

Project 3 Integrate Data with AWS Glue

Media company Nextstellar Corp has completed the migration of their data to the cloud, and now they need to analyze it! That’s where you come in. You’ll take on the challenge of processing and integrating file and transactional data into a new cloud database solution that uses modern data architecture. You’ll use AWS Glue to automate the whole process: creating your crawlers and database; building your repository, Glue jobs, and triggers; and establishing monitoring, troubleshooting, and scaling.

Project 4 Serverless Transformation

Nextstellar Corp is very excited: they’re taking all their data to the cloud! They’ve recently migrated and have an early-stage data lake solution. The CEO has approached you to deliver the next step of their cloud data process: using AWS Glue to apply the transformation logic and store the curated data in Amazon S3. You’ll use Jupyter Notebooks to curate and standardize your datasets, crafting, organizing, and managing them so they are easily accessible and usable. Once that is complete, you’ll design a CI/CD pipeline that tests and deploys code with a single push.

Project 5 Data Quality Check

Nextstellar Corp has recently migrated to the cloud, and for the first time, they can analyze 100% of their company’s data. But there’s a problem: your CEO isn’t confident in your data’s quality. He wants to add more data sources and collect more user behavior information, and he wants these new sources validated with Deequ, a Scala-based data quality library that is also usable from Python through PyDeequ. Your task is to use Jupyter Notebooks with AWS Glue Studio to experiment with PyDeequ for data quality assessment. Next, you’ll enhance Nextstellar’s assessment capabilities by employing AWS Glue jobs that react to data quality issues. Finally, you’ll monitor data quality using CloudWatch so you can respond promptly and keep the data reliable.
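To give a feel for what a data quality check measures, here is a minimal plain-Python sketch of two common metrics, completeness and uniqueness, evaluated against thresholds. It does not use PyDeequ or Spark and does not reproduce the PyDeequ API; the function names, sample records, and thresholds are all hypothetical and only loosely modeled on the kind of constraints such a library evaluates.

```python
from collections import Counter


def completeness(rows, column):
    """Fraction of rows in which `column` is present and not None."""
    if not rows:
        return 1.0  # assumption: an empty dataset is treated as complete
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)


def uniqueness(rows, column):
    """Fraction of non-null values in `column` that occur exactly once."""
    values = [r[column] for r in rows if r.get(column) is not None]
    if not values:
        return 1.0  # assumption: no values means nothing is duplicated
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


# Hypothetical sample of user feedback records.
rows = [
    {"user_id": 1, "title": "A", "rating": 5},
    {"user_id": 2, "title": "B", "rating": None},
    {"user_id": 2, "title": "C", "rating": 3},
    {"user_id": 4, "title": None, "rating": 4},
]

# Each check pairs a readable name with a pass/fail result.
results = {
    "user_id is complete": completeness(rows, "user_id") == 1.0,
    "user_id is unique": uniqueness(rows, "user_id") == 1.0,
    "rating completeness >= 0.9": completeness(rows, "rating") >= 0.9,
}
failed = [name for name, ok in results.items() if not ok]
print("failed checks:", failed)
```

In a real pipeline, the list of failed checks would be what a job reacts to, for example by stopping the load or publishing a metric for monitoring.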

Project 6 Orchestrate an ETL Pipeline

Nextstellar Corp needs you to tackle a big challenge: completing their cloud migration by rebuilding their historical data lake as a data layer in the cloud. You’ll implement an effective, automated data orchestration framework for the ingestion, transformation, and curation layers, using Infrastructure-as-Code best practices to automate your data layer. Finally, you’ll establish a monitoring system that will automatically alert you to any issues or problems that might crop up.
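One way to picture the orchestration is as an AWS Step Functions state machine that runs one Glue job per layer and sends an alert when any step fails. The sketch below builds such a definition in Amazon States Language as a Python dictionary and checks that every transition points at a state that exists. The job names, state names, and SNS topic ARN are hypothetical placeholders, and the three-step layout is an assumption rather than the project's actual design.

```python
import json

# Hypothetical placeholder; replace with a real topic ARN before deploying.
ALERT_TOPIC_ARN = "arn:aws:sns:REGION:ACCOUNT_ID:etl-alerts"


def glue_task(job_name, next_state=None):
    """Return a Task state that runs a Glue job and waits for it to finish."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
        # Route any error to the alerting state, keeping the error details.
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],
                "ResultPath": "$.error",
                "Next": "NotifyFailure",
            }
        ],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Hypothetical ETL pipeline: ingest, transform, curate",
    "StartAt": "Ingest",
    "States": {
        "Ingest": glue_task("hypothetical-ingest-job", "Transform"),
        "Transform": glue_task("hypothetical-transform-job", "Curate"),
        "Curate": glue_task("hypothetical-curate-job"),
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": ALERT_TOPIC_ARN,
                "Message.$": "$.error.Cause",
            },
            "Next": "Failed",
        },
        "Failed": {"Type": "Fail", "Error": "EtlStepFailed"},
    },
}


def dangling_targets(defn):
    """List transition targets (StartAt, Next, Catch) that are not defined."""
    states = defn["States"]
    targets = {defn["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


print(json.dumps(definition, indent=2))
print("dangling targets:", dangling_targets(definition))
```

The small `dangling_targets` helper is a local sanity check only; it does not replace validation by the Step Functions service itself.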

book resources

When you start each of the projects in this series, you'll get full access to the following book for 90 days.

The project series has a great structure, starting from the data sources and encompassing all relevant AWS services and Python libraries.

Ninoslav Cerkez, Senior ML Engineer, Rimac Technology

project author

Gianluigi Mucciolo
Gianluigi Mucciolo specializes in AWS technologies and Agile methodologies. As an AWS Authorized Instructor and Cloud Technical Principal, he is dedicated to advancing cloud professionals' knowledge and participates in community-building initiatives. With a strong background in Artificial Intelligence and Big Data, Gianluigi constantly seeks growth opportunities.

Prerequisites

This liveProject is for engineers who want to build a data lake with a Lambda architecture using fully managed AWS services. You will need to know:


TOOLS
  • Basics of the Linux console
  • Basics of AWS
  • Basics of the AWS console
  • Basics of the AWS CLI

TECHNIQUES
  • Basics of infrastructure automation
  • Basics of big data Lambda architecture

features

Self-paced
You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get Help
While within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.