Azure Data Engineering
Real-time, streaming, and batch analytics
Richard L. Nuckolls
  • MEAP began April 2019
  • Publication in Spring 2020 (estimated)
  • ISBN 9781617296307
  • 400 pages (estimated)
  • printed in black & white

It was a great read! The writer knows Azure deeply, and it shows throughout. Very easy to get into.

Taylor Dolezal

The Microsoft Azure cloud is an ideal platform for data-intensive applications. Designed for productivity, Azure provides pre-built services that make collection, storage, and analysis much easier to implement and manage. Azure Data Engineering teaches you how to design a reliable, performant, and cost-effective data infrastructure in Azure by progressively building a complete working analytics system.
Table of Contents

1 What is data engineering?

1.1 What is data engineering?

1.2 What do data engineers do?

1.3 How does Microsoft define data engineering?

1.3.1 Data acquisition

1.3.2 Data storage

1.3.3 Data processing

1.3.4 Data queries

1.3.5 Orchestration

1.3.6 Data retrieval

1.4 What tools does Azure provide for data engineering?

1.5 Azure Data Engineers

1.6 Example Application

1.7 Summary

2 Building an analytics system in Azure

2.1 Lambda architecture

2.2 Azure cloud services

2.2.1 Event Hubs

2.2.2 Stream Analytics

2.2.3 Data Lake Store

2.2.4 Data Lake Analytics

2.2.5 SQL Data Warehouse

2.2.6 Data Factory

2.2.7 Azure PowerShell

2.3 Azure analytics system architecture

2.4 Walkthrough of processing a series of event data records

2.4.1 Hot Path

2.4.2 Cold Path

2.4.3 Choosing abstract Azure services

2.5 Calculating cloud hosting costs

2.5.1 Event Hubs

2.5.2 Stream Analytics

2.5.3 Data Lake Storage

2.5.4 Data Lake Analytics

2.5.5 SQL Data Warehouse

2.5.6 Data Factory

2.6 Summary

3 Azure Storage Blob service

3.1 Azure naming conventions

3.1.1 Resource group

3.2 Searching for services

3.3 Cloud storage services

3.3.1 Problem definition: backup IIS logs

3.3.2 Create an Azure Storage account

3.3.3 Selecting a Storage account container

3.3.4 Create a Storage account container

3.3.5 Copy tools for Blob service

3.3.6 Blob tiering

3.4 Storage access

3.4.1 Problem definition: Backup files from two departments to common cloud storage. Maintain separate security access.

3.4.2 Designing Storage account access

3.5 Summary

4 Azure Data Lake storage

4.1 Storage services compared

4.1.1 Problem definition: backup IIS logs

4.1.2 Create an Azure Data Lake store

4.1.3 Copy tools for Data Lake store

4.2 Storage access

4.2.1 Access schemes

4.2.2 Problem definition: Backup files from two departments to common cloud storage. Maintain separate security access.

4.2.3 Configuring ADL access

4.2.4 Hierarchy structure in Data Lake store

4.3 Storage folder structure and data drift

4.3.1 Problem definition: IIS logging configuration is adding fields

4.3.2 Data drift

4.3.3 Hierarchy structure revisited

4.4 Summary

5 Queueing with Event Hubs

5.1 What is a queue?

5.1.1 Complete a task at a set rate

5.1.2 When input rate exceeds output rate

5.1.3 Problem definition: Increase the submission rate of box sorter

5.1.4 Queue-based load leveling

5.2 Create an Azure Storage Queue

5.2.1 Storage queue usage

5.2.2 Problem definition: Save pitching statistics to cloud storage

5.2.3 Azure Storage queue creation

5.3 Create an Azure Event Hub

5.3.1 Azure Event Hub as queue

5.3.2 Problem definition: Save pitcher biometric stats to cloud storage

5.3.3 Event Hubs Namespace creation

5.3.4 Event Hub creation

5.3.5 Event Hub options

5.3.6 Configure Capture

5.4 Queue Parallelization

5.4.1 When input rate exceeds output rate

5.4.2 Problem definition: Increase the sort rate of box sorter

5.4.3 Multiple queue processors

5.5 Secure access to Event Hub

5.5.1 Firewall and Virtual Networks

5.5.2 Shared access signature policies

5.6 Summary

6 Real-time queries with Azure Stream Analytics

6.1 Elements of a Stream Analytics job

6.1.1 Create an ASA job using the Azure Portal

6.1.2 Create an ASA job using Azure PowerShell

6.2 Configuring Inputs and Outputs

6.2.1 Easy access to raw data

6.2.2 Create an Event Hub input

6.2.3 Create a Data Lake store output

6.2.4 Create a SQL Database output

6.3 Updating the ASA job Query

6.3.1 Update transformations

6.3.2 Starting the ASA job

6.3.3 Output exceptions

6.4 Writing queries

6.4.1 Window functions

6.4.2 Machine learning functions

6.5 Scaling Stream Analytics jobs

6.5.1 Streaming units

6.5.2 Event ordering

6.6 Exercises

6.6.1 Determine if each ASA job Query can use more than six SUs.

6.6.2 You want 100 hopping window calculations each hour. Which of these options will give you that count?

6.7 Summary

7 Batch queries with Azure Data Lake Analytics

7.1 Elements of a Data Lake Analytics job

7.1.1 Extractors

7.1.2 Outputters

7.1.3 File selectors

7.1.4 Expressions

7.2 Your first job

7.2.1 Reading files

7.2.2 Passthrough query

7.2.3 Writing files

7.3 Create a Data Lake Analytics service

7.3.1 Create a Data Lake Analytics service with the Azure Portal

7.3.2 Creating the U-SQL job with Azure Portal

7.3.3 Create a Data Lake Analytics service with Azure PowerShell

7.3.4 Creating the U-SQL job with Azure PowerShell

7.4 Analytics units

7.4.1 Vertexes

7.4.2 Scaling the initial job execution

7.5 Reading from Blob storage

7.5.1 Reading files

7.5.2 Adding alternate data storage

7.5.3 Filter query

7.5.4 Writing files

7.6 Exercises

7.6.1 Exercise 1

7.6.2 Exercise 2

7.7 Summary

8 Integrating with Azure Data Lake Analytics

9 U-SQL for complex analytics

10 Service integration with Azure Data Factory

11 Distributed SQL with Azure SQL Data Warehouse

12 Data movement in Azure SQL Data Warehouse

Appendixes

Appendix A: Setup of Azure resources through PowerShell

A.1 Set up Azure PowerShell

A.2 Create a subscription

A.3 Azure naming conventions

A.4 Set up common Azure resources using PowerShell

A.4.1 Create a new resource group

A.4.2 Create new Azure Active Directory user

A.4.3 Create new Azure Active Directory group

A.5 Set up Azure services using PowerShell

A.5.1 Create new Storage account

A.5.2 Create new Data Lake store

A.5.3 Create new Event Hub

A.5.4 Create new Stream Analytics job

A.5.5 Create new Data Lake Analytics account

A.6 Summary

About the Technology

The Microsoft Azure cloud platform can host virtually any sort of computing task, from simple web applications to full-scale enterprise systems. With many pre-built services for everything from data storage to advanced machine learning, Azure offers all the building blocks for scalable big data analysis systems, including ingestion, processing, querying, and migration.

About the book

Azure Data Engineering teaches you to build high-capacity data analytics systems using Azure cloud services for storing, collecting, and analyzing data. In it, seasoned IT professional and author Richard Nuckolls starts you off with an overview of core data engineering tasks and the Azure tools that support them. Then, you’ll dive right into building your analytics system, starting with Data Lake Store for data retention, Azure Event Hubs for high-throughput ingestion, and Stream Analytics for real-time query processing.

For batch scheduling and aggregate data movement, you’ll add Data Factory and Data Lake Analytics, along with SQL Data Warehouse for interactive queries. With Azure Active Directory, you’ll manage security by applying permissions and access roles. And because your design is based on the Lambda architecture, you can be sure it will handle large volumes of data beautifully and with lightning speed!
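
To give a sense of how the build-out begins, here is a minimal sketch of provisioning the first few services with Azure PowerShell, the scripting language used throughout the book. It assumes the Az module is installed and you are signed in to a subscription; the resource names, region, and partition count are placeholder values rather than the book's own scripts (appendix A walks through the full setup).

# Sign in and select a subscription interactively
Connect-AzAccount

# Resource group to hold the analytics services (placeholder name and region)
New-AzResourceGroup -Name "ade-dev-eastus2" -Location "East US 2"

# Event Hubs namespace and hub for high-throughput ingestion
New-AzEventHubNamespace -ResourceGroupName "ade-dev-eastus2" `
    -Name "ade-dev-eastus2-hubs" -Location "East US 2" -SkuName "Standard"
New-AzEventHub -ResourceGroupName "ade-dev-eastus2" `
    -NamespaceName "ade-dev-eastus2-hubs" -Name "sensordata" -PartitionCount 4

# Data Lake Store account for long-term retention of raw and processed data
New-AzDataLakeStoreAccount -ResourceGroupName "ade-dev-eastus2" `
    -Name "adedeveastus2" -Location "East US 2"

The Stream Analytics job, Data Factory pipelines, and SQL Data Warehouse that complete the design are layered on in later chapters.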

What's inside

  • Azure cloud services architecture
  • Building a data warehouse in Azure
  • How to choose the right Azure technology for your task
  • Calculating fixed and variable costs
  • Hot and cold path analytics
  • Stream processing with Azure Stream Analytics and Event Hub integration
  • Giving structure to distributed storage
  • Practical examples leading up to a fully functioning analytics system

About the reader

Readers should be comfortable with relational databases like SQL Server and with scripting in a language like PowerShell, Bash, or Python. Book examples use PowerShell and C#.

About the author

Richard Nuckolls is a senior developer building a big data analytics and reporting system in Azure. Over nearly 20 years of experience, he has done server and database administration and desktop and web development, and more recently he has led teams building a production content management system in Azure.

Manning Early Access Program (MEAP)

Read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.
  • MEAP combo $49.99 (pBook + eBook + liveBook)
  • MEAP eBook $39.99 (pdf + ePub + kindle + liveBook)