2 Advantages & Disadvantages of Using a Coding Companion
Large language models have evolved from conversational tools into capable coding companions that can accelerate many parts of a data engineering workflow. This chapter presents a balanced view of how these assistants fit into real projects: they can boost productivity, compress iteration cycles, and reduce manual effort, yet they must be guided by clear prompts and paired with careful human oversight. A key mental model is to keep the companion and the execution environment distinct, so you can observe how prompts are interpreted, evaluate generated code safely, and learn the model’s reasoning patterns before moving toward tighter integrations.
On the benefits side, LLMs shine when complexity rises—rapidly producing draft code for tasks like multi-step transformations, API integrations, and nested JSON normalization. Their conversational interface supports iterative refinement, enabling you to steer solutions toward correctness and maintainability without starting from scratch. While modern IDEs increasingly blend AI into the editing experience, the chapter emphasizes starting with a separated workflow to build foundational skills: crafting effective prompts, validating results, and understanding what the model does and does not infer.
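For instance, a prompt like "flatten these order records into one row per line item" might yield a first draft along these lines. The payload shape and field names here are hypothetical, invented purely for illustration; your own data will differ:

```python
import pandas as pd

# Hypothetical nested API payload; a real structure will differ.
orders = [
    {
        "order_id": 1001,
        "customer": {"id": 7, "name": "Ada"},
        "items": [
            {"sku": "A-1", "qty": 2, "price": 9.99},
            {"sku": "B-2", "qty": 1, "price": 4.50},
        ],
    },
]

# Flatten to one row per line item, carrying order- and customer-level
# fields down onto each item with pandas.json_normalize.
df = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "id"], ["customer", "name"]],
)
print(df)
```

A draft like this is a useful starting point, but conversational refinement (renaming columns, handling missing keys, typing the price field) is where the iteration the chapter describes pays off.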
The drawbacks are equally important. LLMs can hallucinate—confidently inventing functions, columns, or logic that don’t exist—and they lack awareness of your specific schemas, business rules, and system constraints. Their finite context windows can cause truncated reasoning or forgotten details in long prompts. To mitigate these risks, the chapter stresses treating AI output as a first draft, validating against real schemas and requirements, and using focused, incremental prompts. It grounds practice in a well-known sample database to share context with the model, illustrates how seemingly correct SQL can miss intent without precise criteria, and closes with hands-on exercises that reinforce prompt iteration, code execution, and rigorous human review as non-negotiable parts of AI-assisted development.
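One lightweight guardrail in that spirit is to check, before running a generated transformation, that every column it references actually exists in your data. Here is a minimal sketch of the idea using pandas; the DataFrame and the missing "status" column are hypothetical, standing in for a column an LLM might invent:

```python
import pandas as pd

def assert_columns_exist(df: pd.DataFrame, required: list[str]) -> None:
    """Fail fast if AI-generated code references columns the data lacks."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"Generated code references missing columns: {missing}")

# Hypothetical scenario: the model's draft assumed a 'status' column
# that was never in the dataset.
df = pd.DataFrame({"order_id": [1, 2], "amount": [10.0, 20.0]})
assert_columns_exist(df, ["order_id", "amount", "status"])  # raises KeyError
```

Failing fast like this surfaces a hallucinated column immediately, instead of letting it propagate into downstream logic.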
The AI coding companion mental model: You, the data engineer, interact with a coding companion via prompts and responses, then test and execute code within a separate coding environment. This separation reinforces your understanding of how AI generates code and supports learning prompt engineering.
Reading from bottom to top, this timeline compares the traditional manual steps (left) versus an AI-assisted workflow (right) for API data extraction. The "Without AI" column illustrates the cumulative time and effort required to manually code each part of the task. In contrast, the "With AI" column shows how a well-structured prompt and a few refinements can significantly compress the workflow.
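As a rough illustration of what the compressed "With AI" path might produce, here is a sketch of paginated API extraction with simple retry logic. The endpoint, parameter names, and pagination scheme are assumptions for illustration; a real API will differ:

```python
import time
import requests

def fetch_all_pages(base_url: str, page_size: int = 100, max_retries: int = 3):
    """Pull every page from a paginated JSON API, retrying transient failures."""
    page, rows = 1, []
    while True:
        for attempt in range(max_retries):
            try:
                resp = requests.get(
                    base_url,
                    params={"page": page, "per_page": page_size},
                    timeout=10,
                )
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff between retries
        batch = resp.json()
        if not batch:
            return rows  # an empty page signals the end of the data
        rows.extend(batch)
        page += 1

# Hypothetical endpoint; substitute your real API and authentication.
# data = fetch_all_pages("https://api.example.com/v1/orders")
```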
The Pagila entity relationship diagram, illustrating the relationships between the different objects in the Pagila DVD store dataset.
Lab Answers
Note: Due to the probabilistic nature of language models, your AI-generated answers might differ slightly from the code shown here. As long as the results are correct and the logic aligns with the prompt, your solution is valid.
1.
2.
3.
4.
5.
6.
FAQ
What is an AI/LLM coding companion and how does it fit into a data engineering workflow?
An AI coding companion (e.g., ChatGPT) is a conversational tool that generates code and suggestions from your prompts. You craft prompts and review responses in the companion, then paste and run the code in your separate coding environment (e.g., Jupyter, VS Code, pgAdmin). This separation helps you observe how prompts are interpreted and validate outputs against real systems.

What are the main advantages of using an AI coding companion for data engineering?
They accelerate code generation, reduce boilerplate, automate repetitive tasks, and support rapid, iterative prototyping. By refining prompts based on output, you can converge on accurate, elegant solutions faster, especially as complexity increases.

Which kinds of tasks benefit most from LLM assistance?
Complex, multi-step tasks such as nested JSON normalization, API integrations, conditional data transformations, and non-trivial SQL/ETL logic benefit greatly. For example, extracting structured columns from product descriptions and applying status-based tax rules can be generated quickly and refined conversationally.

What is a hallucination in this context, and what might it look like?
Hallucinations are confident but incorrect outputs. Examples include suggesting non-existent library methods, referencing columns that aren't in your dataset, or producing SQL that looks valid but fails due to incorrect joins or misunderstood relationships.

How can I handle or prevent hallucinations in AI-generated code?
Always validate code against your actual schema and business rules, treat AI output as a first draft (not production-ready), and use feedback loops to correct and refine results. Human review is essential for accuracy and safety.

Why keep the AI companion separate from the IDE, at least initially?
Separation makes the generation process transparent, helping you learn how prompts influence reasoning, spot misunderstandings, and build prompt-engineering skills. IDE-integrated tools are powerful but can hide the steps you need to understand early on.

What are token limits and context windows, and why do they matter?
LLMs process only a finite amount of text at once. Long schemas, prompts, or conversations can exceed this window, causing the model to forget earlier details or fabricate missing pieces, which leads to incomplete or incorrect reasoning.

How can I work around token/context window constraints?
Provide concise, relevant schema snippets (DDL, ERDs), break requests into smaller steps, restate key relationships in your prompt, validate outputs against known definitions, and explore a schema iteratively with targeted questions.

Why does the chapter use the Pagila dataset?
Pagila is a well-known, well-documented sample schema that fits within a typical context window and is likely familiar to LLMs. It minimizes missing context so you can focus on prompt strategy, validation, and iterative refinement rather than lengthy schema ingestion.

Why can a "simple" SQL question still go wrong, and how do I improve results?
Ambiguity in terms like "most popular" can cause misalignment (e.g., missing date ranges, store filters, or handling of edge cases). Make prompts explicit about metrics, filters, and business rules, then iterate based on results and validation against your data, as in the sketch that follows this FAQ.
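To make that last point concrete, here is a sketch of an explicit version of a "most popular film" query against Pagila, run from Python, with popularity defined as rental count over a stated date range. The connection string and dates are assumptions; adjust them to your own Pagila load, since rental dates vary between Pagila versions:

```python
import psycopg2  # assumes a local Pagila database; connection details are hypothetical

# An explicit version of "most popular film": popularity is defined as
# rental count, scoped to a specific date range, with ties broken by title.
QUERY = """
SELECT f.title, COUNT(r.rental_id) AS rentals
FROM film AS f
JOIN inventory AS i ON i.film_id = f.film_id
JOIN rental AS r ON r.inventory_id = i.inventory_id
WHERE r.rental_date >= %(start)s AND r.rental_date < %(end)s
GROUP BY f.title
ORDER BY rentals DESC, f.title
LIMIT 5;
"""

with psycopg2.connect("dbname=pagila user=postgres") as conn:
    with conn.cursor() as cur:
        # Adjust the range to match the rental dates in your Pagila load.
        cur.execute(QUERY, {"start": "2022-01-01", "end": "2023-01-01"})
        for title, rentals in cur.fetchall():
            print(title, rentals)
```

Spelling out the metric, the date bounds, and the tie-breaking rule in the prompt (and in the SQL) is exactly what removes the ambiguity that lets a syntactically valid query miss the intent.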