7 AI and Advanced Data Transformations
This chapter moves from data quality to advanced data transformations that are common in production pipelines, arguing that AI can serve as a unifying, conversational interface to reduce tool sprawl and context switching. It surveys four transformation domains—complex text parsing, hierarchical/nested data handling, entity resolution, and time series/date-time work—showing each first with traditional code and then with AI-driven, schema-enforced workflows. Across examples, the pattern is consistent: define the target structure (for example with Pydantic), provide clear task instructions, and let the model generate structured outputs, while maintaining rigor through validation and testing.
For text processing, the chapter demonstrates how regex can reliably extract fields from messy logs but demands specialized syntax knowledge and breaks under shifting formats; an AI alternative can generate or bypass regex altogether to return structured fields, provided you enforce schemas and guard against pitfalls like hallucinated fields, incorrect patterns, inconsistent outputs, and overfitting to examples. Handling nested JSON likewise contrasts manual flattening logic with a model that maps directly to a declared class, making schema evolution simpler: change the class, not the parsing code. The overarching guidance is to pair AI’s flexibility with strong output constraints and validation, treating the model as a junior engineer whose work must be checked.
Entity resolution highlights the limits of traditional fingerprinting and fuzzy matching—useful but opaque—versus an AI approach that produces a best match with confidence and explicit reasoning, contingent on well-designed prompts that encode business rules and feature weighting. Time series transformations show how timezone conversion, custom fiscal calendars, business-day due dates, and contribution percentages are achievable with pandas but require domain-specific functions and care; the AI pattern streamlines this with a consistent loop and response schema, plus minimal precomputation (such as account totals). A capstone lab combines these skills on messy CRM and transactions data to parse text, flatten nested structures, resolve duplicates, compute time-based metrics, and assemble golden records for downstream B2B analysis.
Figure: A single user appears under two separate accounts—one personal, one professional. Despite differences in name and email domain, shared signals like device ID and activity reveal an underlying connection. Entity resolution helps unify these records to build a complete view of the customer.
Lab Answers
Refer to the Chapter 7 Lab Jupyter Notebook for full answers.
1. Flattening Nested JSON with AI
The AI approach uses structured response parsing to extract nested data reliably:
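A minimal sketch of the pattern, assuming the OpenAI Python SDK's structured-output parse helper and Pydantic; the LibraryBook fields and the model name are illustrative stand-ins for the notebook's actual listing:

```python
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical flat target schema for one nested record.
class LibraryBook(BaseModel):
    title: str
    author: str
    year: Optional[int] = None
    branch: Optional[str] = None

def flatten_record(nested_json: str) -> LibraryBook:
    """Ask the model to map one nested JSON record onto the flat schema."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Map the nested JSON record to the target schema. "
                        "Only extract fields that are present; use null otherwise."},
            {"role": "user", "content": nested_json},
        ],
        response_format=LibraryBook,  # the Pydantic class enforces the output shape
    )
    return completion.choices[0].message.parsed
```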
This approach handles schema changes gracefully—if new nested fields are added, you simply update the prompt and data class without rewriting complex extraction logic.
2. Complex Text Processing with AI
AI excels at parsing messy text formats that would require complex regex patterns:
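A sketch of the same pattern applied to messy subscription codes; the Subscription fields are assumptions about the lab's format (tier, seat count, modality), not the notebook's exact schema:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Hypothetical structure for a code such as "ENTERPRISE3xHYBRID".
class Subscription(BaseModel):
    tier: str      # e.g. "ENTERPRISE"
    seats: int     # e.g. 3
    modality: str  # e.g. "HYBRID"

def parse_subscription(raw: str) -> Subscription:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Split the subscription code into tier, seat count, and modality."},
            {"role": "user", "content": raw},
        ],
        response_format=Subscription,
    )
    return completion.choices[0].message.parsed
```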
The AI approach adapts to format variations automatically and can handle new subscription types like ENTERPRISE3xHYBRID without code changes.
3. AI-Powered Entity Resolution
Instead of rule-based matching, AI performs sophisticated entity resolution by considering multiple factors:
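A hedged sketch of the resolution call; the MatchResult fields echo the chapter's confidence-plus-reasoning output, while the weighting rules in the prompt are illustrative:

```python
from typing import Optional

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Response schema: best match plus confidence and auditable reasoning.
class MatchResult(BaseModel):
    matched_customer_id: Optional[str]  # None when no plausible match exists
    confidence: float                   # 0.0-1.0
    reasoning: str                      # short explanation of the decision

def resolve_entity(target: dict, candidates: list[dict]) -> MatchResult:
    prompt = (
        "Decide whether the target customer matches any candidate. "
        "Weigh shared device_id and email domain heavily; treat IP address "
        "as a weak signal. Return a null match if nothing is plausible.\n"
        f"Target: {target}\nCandidates: {candidates}"
    )
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format=MatchResult,
    )
    return completion.choices[0].message.parsed
```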
The AI evaluates each customer against all others, providing confidence scores and detailed reasoning for matches. This handles edge cases like nickname variations and email format differences that rule-based systems often miss.
4. Time Series Transformations with AI
AI handles complex date-time transformations and business logic:
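A sketch of the consistent loop-plus-schema pattern, with hypothetical field names; the fiscal-calendar and business-day rules are spelled out in the prompt so the model works from unambiguous definitions:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Hypothetical response schema for one transformed transaction.
class TimeFields(BaseModel):
    local_timestamp: str  # converted from UTC to the account's timezone
    fiscal_quarter: str   # custom calendar where Q1 = Feb-Apr
    due_date: str         # 30 business days after the transaction

def transform_row(utc_timestamp: str, timezone: str) -> TimeFields:
    prompt = (
        f"Transaction timestamp (UTC): {utc_timestamp}. "
        f"Convert it to {timezone}, assign the fiscal quarter using a "
        "calendar where Q1 runs February-April, and compute a due date "
        "30 business days (Mon-Fri, no holidays) after the transaction."
    )
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format=TimeFields,
    )
    return completion.choices[0].message.parsed

# The same loop handles every row; only the prompt encodes the business rules.
for ts, tz in [("2024-03-15 14:02:09", "America/New_York")]:
    print(transform_row(ts, tz))
```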
The AI correctly handles timezone conversions, business day calculations, and custom fiscal calendars without requiring complex date manipulation libraries.
5. Extra Credit: Build a Golden Account List
AI creates comprehensive B2B account profiles by analyzing combined customer and transaction data:
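A sketch combining minimal pandas precomputation (account totals, as the chapter suggests) with a schema-enforced profiling call; AccountProfile and the column names are assumptions about the lab data:

```python
import pandas as pd
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Hypothetical golden-record schema for one B2B account.
class AccountProfile(BaseModel):
    account_name: str
    total_spend: float
    priority: str   # e.g. "high" / "medium" / "low"
    rationale: str  # why the account earned its priority

def build_profile(account_name: str, customers: pd.DataFrame,
                  transactions: pd.DataFrame) -> AccountProfile:
    # Precompute the total in pandas; arithmetic is cheaper and more
    # reliable here than inside a prompt.
    total = transactions["amount"].sum()
    prompt = (
        f"Account: {account_name}. Precomputed total spend: {total:.2f}.\n"
        f"Customer records: {customers.to_dict('records')}\n"
        "Assign a priority (high/medium/low) from spend and engagement, "
        "with a one-sentence rationale."
    )
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        response_format=AccountProfile,
    )
    return completion.choices[0].message.parsed
```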
This produces actionable B2B intelligence by synthesizing customer demographics, spending patterns, and engagement metrics into prioritized account lists.

The key advantage of the AI approach throughout this lab is adaptability—as data formats change, new subscription types are added, or business rules evolve, you update prompts rather than rewriting complex parsing and transformation logic. This makes AI-driven data engineering pipelines more maintainable and scalable for production environments.
FAQ
What is the main focus of Chapter 7 and why use AI for advanced transformations?
Chapter 7 tackles real-world transformations that go beyond basic cleaning: complex text parsing (logs with regex), nested/hierarchical data (JSON), entity resolution, and time-series/date-time work. Traditionally this requires many tools and bespoke code. AI offers a conversational, schema-driven alternative that reduces context switching and adapts as formats change—often with less code and added explainability.
When should I use traditional regex vs. an AI-driven approach for log parsing?
- Use regex when formats are stable, performance must be predictable, and you can precisely define patterns (e.g., r"(ERROR|INFO|WARNING)\s(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})"); see the sketch after this list.
- Use AI when formats vary across sources, edge cases abound, or you need to evolve fields rapidly without maintaining many patterns. AI shines when combined with a schema to enforce consistent outputs and when you want natural-language tweaks instead of regex rewrites.
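For the stable-format case, a minimal sketch using the pattern above (the sample log line is invented):

```python
import re

# One anchored pattern per known log layout.
LOG_PATTERN = re.compile(
    r"(ERROR|INFO|WARNING)\s(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})"
)

line = "ERROR 2024-03-15 14:02:09 connection refused"
if (match := LOG_PATTERN.search(line)):
    log_type, date, time = match.groups()
    print(log_type, date, time)  # ERROR 2024-03-15 14:02:09
```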
How do I extend the AI log extraction (Listing 7.2) to include the message text?
1) Add message: Optional[str] to the LogExtraction model.
2) Update the prompt to instruct the model to extract the text after the timestamp as message.
3) Rerun the loop; verify structured outputs contain log_type, date, time, and message. This leverages schema enforcement to keep results consistent across lines.
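Putting the steps together, the updated model might look like this; the original fields are assumed from the chapter's log examples:

```python
from typing import Optional
from pydantic import BaseModel

class LogExtraction(BaseModel):
    log_type: str
    date: str
    time: str
    message: Optional[str] = None  # new: text after the timestamp, null if absent
```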
How does schema-enforced structured output reduce AI variability?
By defining a Pydantic model and passing it as the response format, the model is steered to return a strict shape (fields, types, and nullability). This mitigates inconsistent formats, prevents extra fields, and makes parsing deterministic. If the model deviates, parsing raises an error you can catch, log, and retry, improving reliability in pipelines.
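A sketch of the catch-log-retry loop, using a JSON-mode completion plus explicit Pydantic validation so the failure path is visible; the model name is a placeholder:

```python
from openai import OpenAI
from pydantic import BaseModel, ValidationError

client = OpenAI()

class LogExtraction(BaseModel):
    log_type: str
    date: str
    time: str

def parse_with_retry(line: str, attempts: int = 3) -> LogExtraction | None:
    for attempt in range(attempts):
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[{
                "role": "user",
                "content": "Return JSON with keys log_type, date, and time "
                           f"extracted from this log line: {line}",
            }],
            response_format={"type": "json_object"},  # force JSON output
        )
        try:
            # Raises if fields are missing or mistyped.
            return LogExtraction.model_validate_json(
                response.choices[0].message.content)
        except ValidationError as err:
            print(f"attempt {attempt + 1} failed validation: {err}")
    return None  # caller decides whether to dead-letter the line
```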
What pitfalls should I watch for with AI-based extraction and how do I mitigate them?
- Hallucinated fields: Enforce a strict schema, instruct "only extract fields present," and validate post-hoc.
- Incorrect patterns/assumptions: Provide examples for all log levels and date formats; include negative examples when possible.
- Inconsistent outputs: Use response schemas, keep prompts explicit and concise, set temperature low.
- Overfitting to examples: Include diverse samples and add instructions about variability. Always validate against a test set.
What’s the trade-off between flattening nested JSON with pandas vs. using an AI schema mapping?
- pandas/json_normalize or custom loops: Fast and explicit, but requires intimate knowledge of the structure, manual edge-case handling, and code rewrites as schemas evolve (see the sketch after this list).
- AI with a data class (e.g., LibraryBook): Less code for traversal/mapping, easier to evolve—update the class and prompt. You still attach shared attributes (e.g., library_name) and validate outputs via the model/schema.
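A minimal pandas sketch of the traditional side, with an invented nested record standing in for the lab data:

```python
import pandas as pd

# Hypothetical nested input; the real data lives in the Chapter 7 notebook.
data = {
    "library_name": "Central",
    "books": [
        {"title": "Dune", "author": {"first": "Frank", "last": "Herbert"}},
        {"title": "Emma", "author": {"first": "Jane", "last": "Austen"}},
    ],
}

# Flatten the nested list, then attach the shared attribute to every row.
df = pd.json_normalize(data["books"], sep="_")
df["library_name"] = data["library_name"]
print(df)  # columns: title, author_first, author_last, library_name
```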
How should I design prompts for entity resolution so results are reliable and explainable?
Specify: the task, all fields to consider (names, emails, device_id, IP, language), how to weigh them (e.g., "salesforce_id is canonical; prioritize email domain over IP"), how to treat ambiguity, and require an output schema with confidence and a short reasoning string. This yields traceable decisions and tunable behavior without rewriting code.
Can I combine fuzzy matching and AI to improve entity resolution?
Yes. A practical pattern is a two-stage workflow: 1) Candidate generation with fast heuristics/fuzzy scores (e.g., RapidFuzz on normalized fingerprints) to shortlist records; 2) Send the shortlist plus instructions to the AI for final selection with reasoning and confidence. This improves speed, reduces cost, and increases accuracy and explainability.
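A sketch of stage 1 with RapidFuzz; the records and fingerprints are invented, and stage 2 would reuse the schema-enforced resolution call from lab answer 3:

```python
from rapidfuzz import fuzz, process

# Stage 1: cheap candidate generation against normalized name fingerprints.
records = {
    "c1": "john smith acme",
    "c2": "jon smyth acme corp",
    "c3": "maria garcia globex",
}
shortlist = process.extract(
    "john smith acme corporation",
    records,
    scorer=fuzz.token_set_ratio,
    limit=2,  # only the top candidates go to the model
)
print(shortlist)  # [(fingerprint, score, record_id), ...]

# Stage 2 (not shown): send the shortlisted records plus matching
# instructions to the AI call for final selection with reasoning.
```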
What are common gotchas in time series transformations and how does AI help?
- Time zones and DST: Convert from UTC to required zones carefully; verify offsets around DST changes.
- Business-day math: Use BusinessDay calendars; document whether holidays apply.
- Fiscal calendars: Custom quarters must be encoded consistently (e.g., Q1=Feb–Apr).
- AI can compute these from instructions when provided with clear definitions, but you should include examples, enforce schemas, and cross-check a sample against a canonical Python implementation.
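One possible spot-check implementation in pandas, assuming the chapter's Q1 = Feb–Apr fiscal calendar and a plain Mon–Fri business-day rule:

```python
import pandas as pd
from pandas.tseries.offsets import BusinessDay

ts = pd.Timestamp("2024-03-15 14:02:09", tz="UTC")

# Time zones: explicit conversion; spot-check offsets near DST boundaries.
local = ts.tz_convert("America/New_York")

# Business-day math: Mon-Fri only; swap in a holiday calendar if rules require.
due = ts + BusinessDay(30)

# Custom fiscal calendar: 'Q-JAN' ends the fiscal year in January,
# so Q1 = Feb-Apr (2024-03-15 falls in fiscal 2025Q1).
fiscal_quarter = local.tz_localize(None).to_period("Q-JAN")

print(local, due, fiscal_quarter, sep="\n")
```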
How should I operationalize this chapter’s workflows (env, dependencies, testing)?
- Dependencies: Use ch07/requirements.txt; pin versions for reproducibility.
- Secrets: Load API keys via .env and environment variables; never hardcode.
- Testing/validation: Create golden test sets; validate AI outputs against Pydantic models; add unit tests for regex and date logic (see the sketch after this list).
- Throughput/costs: Batch requests, reuse context sparingly, set rate limits/retries. Log prompts/outputs for observability and drift detection.
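A minimal pytest sketch of golden-set validation (the payloads are invented; real golden records would come from the lab data):

```python
import pytest
from pydantic import BaseModel, ValidationError

class LogExtraction(BaseModel):
    log_type: str
    date: str
    time: str

# Golden test set: known inputs with expected values.
GOLDEN = [
    ('{"log_type": "ERROR", "date": "2024-03-15", "time": "14:02:09"}', "ERROR"),
]

@pytest.mark.parametrize("raw,expected_type", GOLDEN)
def test_extraction_matches_golden(raw, expected_type):
    parsed = LogExtraction.model_validate_json(raw)
    assert parsed.log_type == expected_type

def test_invalid_payload_is_rejected():
    # Missing fields must fail loudly, not pass silently downstream.
    with pytest.raises(ValidationError):
        LogExtraction.model_validate_json('{"log_type": "ERROR"}')
```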