Azure Storage, Streaming, and Batch Analytics
A guide for data engineers
Richard L. Nuckolls
  • MEAP began April 2019
  • Publication in Fall 2020 (estimated)
  • ISBN 9781617296307
  • 470 pages (estimated)
  • printed in black & white

"It was a great read! The writer knows Azure deeply, and it's evident when reading. Very easy to get into."

Taylor Dolezal

The Microsoft Azure cloud is an ideal platform for data-intensive applications. Designed for productivity, Azure provides pre-built services that make collection, storage, and analysis much easier to implement and manage. Azure Storage, Streaming, and Batch Analytics teaches you how to design a reliable, performant, and cost-effective data infrastructure in Azure by progressively building a complete working analytics system.

About the Technology

The Microsoft Azure cloud platform can host virtually any sort of computing task, from simple web applications to full-scale enterprise systems. With many pre-built services for everything from data storage to advanced machine learning, Azure offers all the building blocks for scalable big data analysis systems, including ingestion, processing, querying, and migration.

About the book

Azure Storage, Streaming, and Batch Analytics teaches you to build high-capacity data analytics systems using Azure cloud services for storing, collecting, and analyzing data. In it, seasoned IT professional and author Richard Nuckolls starts you off with an overview of core data engineering tasks and the Azure tools that support them. Then, you’ll dive right into building your analytics system, starting with Data Lake Store for data retention, Azure Event Hubs for high-throughput ingestion, and Stream Analytics for real-time query processing.
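
To make that concrete, here is a minimal PowerShell sketch, not taken from the book itself, of provisioning the ingestion and retention layer with the Az module. All names ("ade-dev-eastus2", "ade-hubs", "biometricstats") and the East US 2 location are illustrative assumptions.

    # Minimal sketch, assuming the Az PowerShell module is installed
    # and Connect-AzAccount has already been run.

    # Resource group to hold the analytics services (name is hypothetical)
    New-AzResourceGroup -Name "ade-dev-eastus2" -Location "East US 2"

    # Data Lake Store account for long-term data retention
    New-AzDataLakeStoreAccount -ResourceGroupName "ade-dev-eastus2" `
        -Name "adedevdls" -Location "East US 2"

    # Event Hubs namespace and a hub for high-throughput ingestion
    New-AzEventHubNamespace -ResourceGroupName "ade-dev-eastus2" `
        -Name "ade-hubs" -Location "East US 2" -SkuName "Standard"
    New-AzEventHub -ResourceGroupName "ade-dev-eastus2" -NamespaceName "ade-hubs" `
        -Name "biometricstats" -PartitionCount 4

The book builds each of these services step by step in the chapters below, using both the Azure Portal and Azure PowerShell.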

For batch scheduling and aggregate data movement, you'll add Data Factory and Data Lake Analytics, along with Azure SQL Database for interactive queries. With Azure Active Directory, you'll manage security by applying permissions and access roles. And because your design is based on the Lambda architecture, you can be sure it will handle large volumes of data, serving fast results on the hot path and accurate aggregates on the cold path.
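
A similarly hedged sketch, again with hypothetical names rather than the book's own scripts, shows how the batch and serving services might be provisioned with the same Az module:

    # Data Factory (V2) for batch scheduling and data movement
    New-AzDataFactoryV2 -ResourceGroupName "ade-dev-eastus2" `
        -Name "ade-dev-adf" -Location "East US 2"

    # Data Lake Analytics account bound to the Data Lake Store created above
    New-AzDataLakeAnalyticsAccount -ResourceGroupName "ade-dev-eastus2" `
        -Name "adedevadla" -Location "East US 2" -DefaultDataLakeStore "adedevdls"

    # Logical SQL server and a SQL Database for interactive queries
    $cred = Get-Credential   # prompts for the SQL admin login and password
    New-AzSqlServer -ResourceGroupName "ade-dev-eastus2" -ServerName "ade-dev-sql" `
        -Location "East US 2" -SqlAdministratorCredentials $cred
    New-AzSqlDatabase -ResourceGroupName "ade-dev-eastus2" -ServerName "ade-dev-sql" `
        -DatabaseName "analytics" -Edition "Standard"

Chapters 10 through 12 cover Data Factory and SQL Database in depth, including securing access with Azure Active Directory and Key Vault.
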
Table of Contents

1 What is data engineering?

1.1 What is data engineering?

1.2 What do data engineers do?

1.3 How does Microsoft define data engineering?

1.3.1 Data acquisition

1.3.2 Data storage

1.3.3 Data processing

1.3.4 Data queries

1.3.5 Orchestration

1.3.6 Data retrieval

1.4 What tools does Azure provide for data engineering?

1.5 Azure Data Engineers

1.6 Example Application

1.7 Summary

2 Building an analytics system in Azure

2.1 Fundamentals of Azure architecture

2.1.1 Azure subscriptions

2.1.2 Azure regions

2.1.3 Azure naming conventions

2.1.4 Resource groups

2.1.5 Finding resources

2.2 Lambda architecture

2.3 Azure cloud services

2.3.1 Azure analytics system architecture

2.3.2 Event Hubs

2.3.3 Stream Analytics

2.3.4 Data Lake Store

2.3.5 Data Lake Analytics

2.3.6 SQL Database

2.3.7 Data Factory

2.3.8 Azure PowerShell

2.4 Walk-through of processing a series of event data records

2.4.1 Hot Path

2.4.2 Cold Path

2.4.3 Choosing abstract Azure services

2.5 Calculating cloud hosting costs

2.5.1 Event Hubs

2.5.2 Stream Analytics

2.5.3 Data Lake Storage

2.5.4 Data Lake Analytics

2.5.5 SQL Database

2.5.6 Data Factory

2.6 Summary

3 Azure Storage Blob service

3.1 Cloud storage services

3.1.1 Before you begin

3.2 Creating an Azure Storage account

3.2.1 Using Azure Portal

3.2.2 Using Azure PowerShell

3.2.3 Azure Storage replication

3.3 Storage account services

3.3.1 Blob storage

3.3.2 Creating a Blob service container

3.3.3 Blob tiering

3.3.4 Copy tools

3.3.5 Queues

3.3.6 Creating a Queue

3.3.7 Storage queue options

3.4 Storage account access

3.4.1 Blob container security

3.4.2 Designing Storage account access

3.5 Exercises

3.5.1 Exercise 1

3.5.2 Exercise 2

3.6 Summary

4 Azure Data Lake storage

4.1 Create an Azure Data Lake store

4.1.1 Using Azure Portal

4.1.2 Using Azure PowerShell

4.2 Data Lake store access

4.2.1 Access schemes

4.2.2 Configuring ADL access

4.2.3 Hierarchy structure in Data Lake store

4.3 Storage folder structure and data drift

4.3.1 Hierarchy structure revisited

4.3.2 Data drift

4.4 Copy tools for Data Lake store

4.4.1 Data explorer

4.4.2 ADLCopy tool

4.4.3 Azure Storage Explorer tool

4.5 Exercises

4.5.1 Exercise 1

4.5.2 Exercise 2

4.6 Summary

5 Message handling with Event Hubs

5.1 How does an Event Hub work?

5.2 Collecting data in Azure

5.3 Create an Event Hubs Namespace

5.3.1 Using Azure PowerShell

5.3.2 Throughput units

5.3.3 Event Hub Geo-Disaster Recovery

5.3.4 Fail over with Geo-Disaster Recovery

5.4 Create an Event Hub

5.4.1 Using Azure Portal

5.4.2 Using Azure PowerShell

5.4.3 Shared access policy

5.5 Event Hub partitions

5.5.1 Multiple consumers

5.5.2 Why specify a partition?

5.5.3 Why not specify a partition?

5.5.4 Event Hubs message journal

5.5.5 Partitions and throughput units

5.6 Configure Capture

5.6.1 File name formats

5.6.2 Secure access for Capture

5.6.3 Enabling Capture

5.6.4 The importance of time

5.7 Securing access to Event Hubs

5.7.1 Shared access signature policies

5.7.2 Writing to Event Hubs

5.8 Exercises

5.8.1 Exercise 1

5.8.2 Exercise 2

5.8.3 Exercise 3

5.9 Summary

6 Real-time queries with Azure Stream Analytics

6.1 Creating a Stream Analytics service

6.1.1 Elements of a Stream Analytics job

6.1.2 Create an ASA job using the Azure Portal

6.1.3 Create an ASA job using Azure PowerShell

6.2 Configuring inputs and outputs

6.2.1 Event Hub job input

6.2.2 ASA job outputs

6.3 Creating a job query

6.3.1 Starting the ASA job

6.3.2 Failure to start

6.3.3 Output exceptions

6.4 Writing job queries

6.4.1 Window functions

6.4.2 Machine learning functions

6.5 Managing performance

6.5.1 Streaming units

6.5.2 Event ordering

6.6 Exercises

6.6.1 Exercise 1

6.6.2 Exercise 2

6.7 Summary

7 Batch queries with Azure Data Lake Analytics

7.1 U-SQL language

7.1.1 Extractors

7.1.2 Outputters

7.1.3 File selectors

7.1.4 Expressions

7.2 U-SQL jobs

7.2.1 Selecting the biometric data files

7.2.2 Schema extraction

7.2.3 Aggregation

7.2.4 Writing files

7.3 Creating a Data Lake Analytics service

7.3.1 Using Azure Portal

7.3.2 Using Azure PowerShell

7.4 Submitting jobs to ADLA

7.4.1 Using Azure Portal

7.4.2 Using Azure PowerShell

7.5 Efficient U-SQL job executions

7.5.1 Monitoring a U-SQL job

7.5.2 Analytics units

7.5.3 Vertexes

7.5.4 Scaling the job execution

7.6 Using Blob storage

7.6.1 Constructing Blob file selectors

7.6.2 Adding a new data source

7.6.3 Filtering rowsets

7.7 Exercises

7.7.1 Exercise 1

7.7.2 Exercise 2

7.8 Summary

8 U-SQL for complex analytics

8.1 Data Lake Analytics Catalog

8.1.1 Simplifying U-SQL queries

8.1.2 Simplifying data access

8.1.3 Loading data for reuse

8.2 Window functions

8.3 Local C# functions

8.4 Exercises

8.4.1 Exercise 1

8.4.2 Exercise 2

8.5 Summary

9 Integrating with Azure Data Lake Analytics

9.1 Processing unstructured data

9.1.1 Azure Cognitive services

9.1.2 Managing assemblies in the Data Lake

9.1.3 Image data extraction with Advanced Analytics

9.2 Reading different file types

9.2.1 Adding custom libraries with a Catalog

9.2.2 Creating a catalog database

9.2.3 Building the U-SQL DataFormats solution

9.2.4 Code folders

9.2.5 Using custom assemblies

9.3 Connecting to remote sources

9.3.1 External databases

9.3.2 Credentials

9.3.3 Data Source

9.3.4 Tables and views

9.4 Exercises

9.4.1 Exercise 1

9.4.2 Exercise 2

9.5 Summary

10 Service integration with Azure Data Factory

10.1 Creating an Azure Data Factory

10.2 Secure authentication

10.2.1 Azure Active Directory integration

10.2.2 Azure Key Vault

10.3 Copying files with ADF

10.3.1 Creating a Files storage container

10.3.2 Add secret to AKV

10.3.3 Creating a Files storage linked service

10.3.4 Creating an ADL linked service

10.3.5 Creating a pipeline and activity

10.3.6 Creating a scheduled trigger

10.4 Running an ADLA job

10.4.1 Creating an ADLA linked service

10.4.2 Creating a pipeline and activity

10.5 Exercises

10.5.1 Exercise 1

10.5.2 Exercise 2

10.6 Summary

11 Managed SQL with Azure SQL Database

11.1 Creating an Azure SQL Database

11.1.1 Create a SQL Server and SQLDB

11.2 Securing SQLDB

11.3 Availability and recovery

11.3.1 Restoring and moving SQLDB

11.3.2 Database safeguards

11.3.3 Creating alerts for SQLDB

11.4 Optimizing cost for SQLDB

11.4.1 Pricing structure

11.4.2 Scaling SQLDB

11.4.3 Serverless

11.4.4 Elastic Pools

11.5 Exercises

11.5.1 Exercise 1

11.5.2 Exercise 2

11.5.3 Exercise 3

11.5.4 Exercise 4

11.6 Summary

12 Integrating Data Factory with SQL Database

12.1 Before you begin

12.2 Importing data with external data sources

12.2.1 Creating a database scoped credential

12.2.2 Creating an external data source

12.2.3 Creating an external table

12.2.4 Importing blob files

12.3 Importing file data with ADF

12.3.1 Authenticating between ADF and SQLDB

12.3.2 Creating SQL Database linked service

12.3.3 Creating datasets

12.3.4 Creating Copy activity and pipeline

12.4 Version control of ADF configuration files

12.4.1 Git version control

12.4.2 Using ADF with Git version control

12.5 Exercises

12.5.1 Exercise 1

12.5.2 Exercise 2

12.5.3 Exercise 3

12.6 Summary

13 Where to go next

13.1 Data catalog

13.1.1 Data Catalog as a service

13.1.2 Data locations

13.1.3 Data definitions

13.1.4 Data frequency

13.1.5 Business drivers

13.2 Version control and backups

13.2.1 Storage account Blob service

13.2.2 Data Lake store

13.2.3 Stream Analytics

13.2.4 Data Lake Analytics

13.2.5 Data Factory configuration files

13.2.6 SQL Database

13.3 Microsoft certifications

13.4 Signing off

13.5 Summary

Appendixes

Appendix A: Setting up Azure resources with PowerShell

A.1 Set up Azure PowerShell

A.2 Create a subscription

A.3 Azure naming conventions

A.4 Set up common Azure resources using PowerShell

A.4.1 Create a new resource group

A.4.2 Create new Azure Active Directory user

A.4.3 Create new Azure Active Directory group

A.5 Set up Azure services using PowerShell

A.5.1 Create new Storage account

A.5.2 Create new Data Lake store

A.5.3 Create new Event Hub

A.5.4 Create new Stream Analytics job

A.5.5 Create new Data Lake Analytics account

A.5.6 Create new SQL Server and Database

A.5.7 Create new Data Factory

A.5.8 Create new App registration

A.5.9 Create new Key Vault

A.5.10 Create new SQL Server and Database with lookup data

A.6 Summary

What's inside

  • Azure cloud services architecture
  • Building a data warehouse in Azure
  • How to choose the right Azure technology for your task
  • Calculating fixed and variable costs
  • Hot and cold path analytics
  • Stream processing with Azure Stream Analytics and Event Hubs integration
  • Giving structure to distributed storage
  • Practical examples leading up to a fully functioning analytics system

About the reader

Readers should be comfortable with relational databases like SQL Server, and with scripting in a language like PowerShell, Bash, or Python. Book examples use PowerShell and C#.

About the author

Richard Nuckolls is a senior developer building a big data analytics and reporting system in Azure. In nearly 20 years of experience, he has done server and database administration and desktop and web development, and most recently led teams building a production content management system in Azure.

Manning Early Access Program (MEAP): read chapters as they are written, get the finished eBook as soon as it's ready, and receive the pBook long before it's in bookstores.
print book $24.99 (was $49.99): pBook + eBook + liveBook
eBook $19.99 (was $39.99): 3 formats + liveBook
