Designing Cloud Data Platforms
Danil Zburivsky and Lynda Partner
  • MEAP began December 2019
  • Publication in Early 2021 (estimated)
  • ISBN 9781617296444
  • 400 pages (estimated)
  • printed in black & white

Here's a great book about the different parts of a cloud-based data platform and how you can build one using what's on offer from the different major cloud vendors.

George Thomas
Centralized data warehouses, the long-time de facto standard for housing data for analytics, are rapidly giving way to multi-faceted cloud data platforms. Companies that embrace modern cloud data platforms benefit from an integrated view of their business using all of their data and can take advantage of advanced analytic practices to drive predictions and as yet unimagined data services. Designing Cloud Data Platforms is a hands-on guide to envisioning and designing a modern scalable data platform that takes full advantage of the flexibility of the cloud. As you read, you’ll learn the core components of a cloud data platform design, along with the role of key technologies like Spark and Kafka Streams. You’ll also explore how to set up processes to manage cloud-based data, keep it secure, and use advanced analytic and BI tools to analyze it.

About the Technology

Access to affordable, dependable, serverless cloud services has revolutionized the way organizations can approach data management, and companies both big and small are raring to migrate to the cloud. But without a properly designed data platform, data in the cloud can remain just as siloed and inaccessible as it is today for most organizations. Designing Cloud Data Platforms lays out the principles of a well-designed platform that uses the scalable resources of the public cloud to manage all of an organization's data and present it as useful business insights.

About the book

In Designing Cloud Data Platforms, you’ll learn how to integrate data from multiple sources into a single, cloud-based, modern data platform. Drawing on their real-world experiences designing cloud data platforms for dozens of organizations, cloud data experts Danil Zburivsky and Lynda Partner take you through a six-layer approach to creating cloud data platforms that maximizes flexibility and manageability and reduces costs. Starting with foundational principles, you’ll learn how to get data into your platform from different databases, files, and APIs, the essential practices for organizing and processing that raw data, and how to best take advantage of the services offered by major cloud vendors. As you progress past the basics you’ll take a deep dive into advanced topics to get the most out of your data platform, including real-time data management, machine learning analytics, schema management, and more.
Table of Contents

1 Introducing the Data Platform

1.1 The backstory

1.2 Data warehouses struggle with data Variety, Volume and Velocity

1.2.1 Variety

1.2.2 Volume

1.2.3 Velocity

1.2.4 All the V’s at once

1.3 Data Lakes to the rescue?

1.4 Along came the Cloud

1.5 Cloud, data lakes and data warehouses belong together - the emergence of cloud data platforms

1.6 Building blocks of a cloud data platform

1.6.1 Ingestion layer

1.6.2 Storage layer

1.6.3 Processing layer

1.6.4 Serving layer

1.7 How the Cloud Data Platform deals with the 3 Vs

1.7.1 Variety

1.7.2 Volume

1.7.3 Velocity

1.7.4 Two More Vs

1.8 Common Use Cases

1.9 Summary

2 Why a Data Platform and not just a Data Warehouse

2.1 Cloud Data Platforms and Cloud Warehouses. The practical aspects

2.1.1 A closer look at the data sources

2.1.2 An example cloud data warehouse-only architecture

2.1.3 An example cloud data platform architecture

2.2 Ingesting data

2.2.1 Ingesting data directly into Azure Synapse

2.2.2 Ingesting data into an Azure data platform

2.2.3 Managing changes in upstream data sources

2.3 Processing data

2.3.1 Processing data in the warehouse

2.3.2 Processing data in the data platform

2.4 Accessing data

2.5 Cloud costs considerations

2.6 Summary

2.7 Exercise Answers

3 Getting bigger and leveraging the Big 3 — Google, Amazon and Microsoft

3.1 Cloud data platform layered architecture

3.1.1 Data ingestion layer

3.1.2 Fast and slow storage

3.1.3 Processing layer

3.1.4 Technical Metadata layer

3.1.5 The Serving Layer and data consumers

3.1.6 Orchestration and ETL overlay layers

3.2 The importance of layers in a data platform architecture

3.3 Mapping cloud data platform layers to specific tools

3.3.1 AWS

3.3.2 Google Cloud

3.3.3 Azure

3.4 Open Source and commercial alternatives

3.4.1 Batch data ingestion

3.4.2 Streaming data ingestion and real time analytics

3.4.3 Orchestration layer

3.5 Summary

3.6 Exercise Answers

4 Getting data into the platform

4.1 Databases, files, APIs and streams

4.1.1 Relational databases

4.1.2 Files

4.1.3 SaaS data via API

4.1.4 Streams

4.2 Ingesting data from relational databases

4.2.1 Ingesting data from RDBMS using an SQL interface

4.2.2 Full table ingestion

4.2.3 Incremental table ingestion

4.2.4 Change Data Capture (CDC)

4.2.5 CDC Vendors Overview

4.2.6 Data Types Conversion

4.2.7 Ingesting data from NoSQL databases

4.2.8 Capturing important metadata for RDBMS or NoSQL ingestion pipeline

4.3 Ingesting data from files

4.3.1 Tracking ingested files

4.3.2 Capturing file ingestion metadata

4.4 Ingesting data from streams

4.4.1 Differences between batch and streaming ingestion

4.4.2 Capturing streaming pipeline metadata

4.5 Ingesting data from SaaS applications

4.6 Network and security considerations for data ingestion into the cloud

4.6.1 Connecting other networks to your cloud data platform

4.7 Summary

4.8 Exercise Answers

5 Organizing and processing data

5.1 Processing as a separate layer in the data platform

5.2 Data processing stages

5.3 Organizing your cloud storage

5.3.1 Cloud storage containers and folders

5.4 Common data processing steps

5.4.1 File Format conversion

5.4.2 Data deduplication

5.4.3 Data Quality Checks

5.5 Configurable Pipelines

5.6 Summary

5.7 Exercise Answers

6 Real-time data processing and analytics

6.1 Real-time ingestion vs real-time processing

6.2 Use cases for real time data processing

6.2.1 Retail use case - real-time ingestion

6.2.2 Online gaming use case - real-time ingestion and real-time processing

6.2.3 Real-time ingestion vs real-time processing summary

6.3 When should you use real-time ingestion and/or real-time processing?

6.4 Organizing data for real-time use

6.4.1 The anatomy of fast storage

6.4.2 How does fast storage scale?

6.4.3 Organizing data in the real time storage

6.5 Common data transformations in real time

6.5.1 Causes of duplicates in real-time systems

6.5.2 Deduplicating data in the real-time systems

6.5.3 Converting message formats in real-time pipelines

6.5.4 Real time data quality checks

6.5.5 Combining batch and real-time data

6.6 Cloud services for real-time data processing

6.6.1 AWS real-time processing services

6.6.2 Google Cloud real-time processing services

6.6.3 Azure real-time processing services

6.7 Summary


7 Metadata layer architecture

7.1 What we mean by metadata

7.1.1 Business metadata

7.1.2 Data Platform internal metadata or “pipeline metadata”

7.2 Taking advantage of pipeline metadata

7.3 Metadata model

7.3.1 Metadata domains

7.4 Metadata layer implementation options

7.4.1 Metadata layer as a collection of configuration files

7.4.2 Metadata database

7.4.3 Metadata API

7.5 Overview of existing solutions

7.5.1 Cloud metadata services

7.5.2 Open source metadata layer implementations

7.6 Summary

7.7 Exercise Answers

8 Schema management

8.1 Why Schema Management

8.1.1 Schema changes in a traditional data warehouse architecture

8.1.2 Schema-on-read approach

8.2 Schema Management Approaches

8.2.1 Schema as a contract

8.2.2 Schema management in the data platform

8.2.3 Monitoring schema changes

8.3 Schema Registry Implementation

8.3.1 Apache Avro schemas

8.3.2 Existing Schema Registry implementations

8.3.3 Schema Registry as a part of a Metadata layer

8.4 Schema Evolution Scenarios

8.4.1 Schema compatibility rules

8.4.2 Schema evolution and data transformation pipelines

8.5 Schema Evolution and Data Warehouses

8.5.1 Schema management features of cloud data warehouses

8.6 Summary

9 Data Access

10 Cloud cost optimizations

11 Data Platforms in the real world

What's inside

  • The tools different public clouds offer for implementing data platforms
  • Best practices for managing structured and unstructured data sets
  • Machine learning tools that can be used on top of the cloud
  • Cost optimization techniques

About the reader

For data professionals familiar with the basics of cloud computing and distributed data processing systems like Hadoop and Spark.

About the authors

Danil Zburivsky has over 10 years of experience designing and supporting large-scale data infrastructure for enterprises across the globe. Lynda Partner is the VP of Analytics-as-a-Service at Pythian and has been on the business side of data for over 20 years.
