Learn AI Data Engineering in a Month of Lunches

David Melillo

MEAP began September 2025
Last updated October 2025
Publication in Summer 2026 (estimated)

ISBN 9781633435728
225 pages (estimated)

Included with a Manning Online subscription

printed in black & white

Table of contents

Part 1: Core Concepts of Data Engineering with AI

1 Before You Begin

1.1 Why AI Matters to Data Engineering

1.2 Is This Book for You?

1.2.1 The Many Uses for AI

1.2.2 The Many Flavors of AI

1.3 How to Use This Book

1.3.1 The Main Chapters

1.3.2 Hands-on Labs

1.3.3 Chapter Setup Files

1.4 Setting Up Your Environment

1.4.1 Installing PostgreSQL and pgAdmin

1.4.2 Installing Jupyter Lab for Python Work

1.4.3 Creating an OpenAI Account

1.5 Being Immediately Effective with AI and Data Engineering

2 Advantages & Disadvantages of Using a Coding Companion

2.1 Mental Model: The Data Engineer and the Coding Companion

2.2 Advantages of Using an AI/LLM Coding Companion

2.2.1 Rapid Code Generation for Data Engineering Tasks

2.3 Disadvantages of Using a Coding Companion

2.3.1 Introduction to the Pagila Dataset

2.3.2 Example: Asking a Simple Question

2.4 Lab

2.5 Lab Answers

3 Using a Coding Companion with SQL

3.1 Zero-Shot Prompting

3.2 Few-Shot Prompting

3.3 Chain-of-Thought Prompting

3.4 Self-Consistency Prompting

3.5 Tree-of-Thought Prompting

3.6 Role-Playing, Domain Priming, Prompt Chaining and Beyond

3.7 Lab

3.8 Lab Answers

4 Using a Coding Companion with Python

4.1 Interacting with APIs Using AI Coding Companions & Python

4.1.1 Fetching Data from an API

4.1.2 Enhancing API Calls with AI Coding Companions and API Documentation

4.2 Unnesting Complex JSON Objects with AI Companions & Python

4.2.1 Simple Example: Flattening a Single Nested Field

4.2.2 Complex Example: Extracting Deeply Nested & Combined Fields

4.3 Using AI to Implement Regex Patterns

4.3.1 Extracting Phone Numbers from Text

4.3.2 Normalizing Phone Numbers with Regex and AI

4.3.3 Extracting Number Components into a DataFrame

4.4 Lab

4.5 Lab Answers

5 Using the OpenAI API in Data Workflows

5.1 Initial Setup and Data Extraction

5.2 Preprocessing Articles

5.3 Using ChatGPT for Sentiment Analysis

5.3.1 Understanding the ChatGPT API and Chat Completions Endpoint

5.3.2 Raw API Response Processing

5.4 Iteration - Normalizing Sentiment Output, Logging & Consolidation

5.4.1 Normalizing Sentiment Output

5.4.2 Logging & Consolidation

5.5 Lab

5.6 Lab Answers

Part 2: Data Cleaning & Transformation Pipelines with AI

6 AI & Data Quality

6.1 Identifying Data Quality Issues

6.2 Fixing Data Quality Issues

6.2.1 Understanding Data Classes

6.2.2 Using response_format

6.2.3 Working with Multiple Messages

6.3 Fixing Structural and Format Issues

6.4 Lab

6.5 Lab Answers

7 AI & Advanced Data Transformations

8 AI & The Data Lifecycle

9 Data Cleaning and Transformation Pipelines in Practice

Part 3: Generating Data with AI

10 Introduction to Web Scraping

11 Identifying Opportunities for AI-Generated Data

12 Handling Unstructured Data with AI

13 Data Scraping & AI

Part 4: Data Cleaning & Transformation Pipelines with AI

14 Introduction to Agentic Workflows for Data Engineers

15 Generating Subject Matter Expertise with AI

16 SME and Agentic Workflows: Decision Paths and Data Activation

17 Practical Application: AI-Driven Outreach for Marketing and Sales

Appendices

Appendix A: Setting Up Your Environment

Appendix B: Prompt Engineering Reference

Appendix C: Using the OpenAI API

Appendix D: Dataset Index

Appendix E: Troubleshooting Common Errors

Overview

4 Using a Coding Companion with Python

This chapter shifts from SQL to Python to show how an AI coding companion can accelerate common data engineering tasks. By iterating with clear, targeted prompts, you can generate templates, refine logic, and quickly converge on working code for real-world problems. The emphasis is on step-by-step, few-shot workflows that let you “speak the AI’s language” while keeping control of the outcome. These workflows suit chores such as calling APIs, parsing nested JSON, and crafting regex, where the tools are flexible but the work is often tedious and error-prone.

Through hands-on examples, the chapter demonstrates fetching and processing data from public APIs, starting with a zero-shot NumbersAPI script that builds requests, parses JSON, populates pandas DataFrames, and adds basic error handling. It then layers in retries, console feedback, and logging, and shows how uploading API documentation helps the model use optional parameters accurately. It also warns about pitfalls such as misread specs, hallucinated parameters, and model-to-model variability. For nested data, it uses JSONPlaceholder to flatten structures safely with chained .get() calls, combine fields (like lat/lng), and build resilient pipelines that tolerate missing fields or a changing schema.

Regex work illustrates how AI can quickly draft patterns to extract, normalize, and structure phone numbers, including splitting components for DataFrames or JSON documents, while highlighting the brittleness of regex and the need for test cases and validation. A final lab with the Open Brewery DB ties everything together: pulling records, cleaning phone numbers, and extracting domains to practice prompt clarity, incremental refinement, and verification. The core habits reinforced throughout are to be explicit about execution context, treat outputs as drafts, test against real samples, and iterate deliberately for production-ready reliability.
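
As a small illustration of the domain-extraction step mentioned for the lab, the sketch below pulls a bare domain out of a website URL. It is not the book's lab answer: the function name `extract_domain`, the sample URLs, and the choice to strip a leading "www." are all assumptions made for this example, and it uses only the Python standard library.

```python
from urllib.parse import urlsplit


def extract_domain(url):
    """Return the host part of a URL without a leading 'www.', or None."""
    if not url:
        return None
    candidate = url.strip()
    # urlsplit only recognizes a host when a scheme or '//' precedes it,
    # so add '//' for bare values such as 'www.example.org'.
    if "://" not in candidate:
        candidate = "//" + candidate
    host = urlsplit(candidate).hostname  # lowercased, port removed
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


# Made-up sample values in the shape a website field might take.
samples = ["http://www.Example-Brewery.com/beers", "www.example.org", "", None]
domains = [extract_domain(value) for value in samples]
print(domains)
```

Treat the result as a draft: real records may contain typos, tracking paths, or non-URL text, so check the output against actual samples before relying on it.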

Interacting with an API: This diagram shows how a client constructs a full request URL by combining a base endpoint with query parameters that define the data request (e.g., filters, limits, and format). The server receives the request, handles tasks like authentication and data lookup, and returns a structured JSON response. This client-server pattern is central to data engineering workflows, where APIs often serve as the primary source of external or cloud-based data. Knowing how to shape requests and interpret responses is essential for working with modern data pipelines.

Result of the zero-shot prompt response against the NumbersAPI

Result of the one-shot prompt response against the NumbersAPI

Result of the API documentation enhanced response against the NumbersAPI

Result of the simple JSON unpacking response against the JSONPlaceholder API

Result of the complex JSON unpacking response against the JSONPlaceholder API

Structured breakdown of phone numbers using regex and pandas

Results for Lab Answers 1

Results for Lab Answers 2

Results for Lab Answers 3

FAQ

How is collaborating with an AI coding companion in Python different from using it for SQL?

Python workflows are iterative and stateful—you build scripts step by step, run cells, and refine logic. This makes few-shot prompting effective: ask, run, review, and iterate. SQL prompts often aim for one-shot queries, while Python benefits from progressive prompting for structure, retries, error handling, and data shaping.

What should I include in my prompt to get notebook-friendly code instead of overengineered scripts?

Be explicit about context and intent. Say “Python code to run in a Jupyter notebook,” ask for print statements for real-time feedback, and avoid phrases like “script” if you don’t want files, main blocks, or production logging. Specify libraries, inputs, and desired outputs (e.g., “return a pandas DataFrame”).

How can an AI help me interact with APIs like NumbersAPI using Python?

It can scaffold requests with the requests library, construct URLs with query parameters, parse JSON, handle errors, and load results into a pandas DataFrame. With clear prompts, it can also add timeouts, retries, and concise console messages for interactive runs.
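
The scaffold described above might look roughly like the following. The chapter works with the requests library and pandas; this sketch sticks to the standard library so it runs without extra installs. The base URL, the `/{number}/{type}` path shape, the `json` query parameter, and the response field names are assumptions to confirm against the NumbersAPI documentation, and the sample response body is hand-written rather than a real API result.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://numbersapi.com"  # assumed base endpoint; confirm in the docs


def build_url(number, kind="trivia", **params):
    """Combine the base endpoint, path segments, and query parameters."""
    url = f"{BASE_URL}/{number}/{kind}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def fetch_json(url, timeout=10):
    """Fetch a URL and decode its JSON body (makes a real network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_row(payload):
    """Keep only the fields wanted as table columns, with safe defaults."""
    return {
        "number": payload.get("number"),
        "type": payload.get("type"),
        "text": payload.get("text", ""),
        "found": payload.get("found", False),
    }


# Hand-written stand-in for a live response, so the example runs offline.
sample_body = '{"text": "example fact text", "number": 42, "found": true, "type": "trivia"}'
rows = [to_row(json.loads(sample_body))]
url = build_url(42, "trivia", json="true")
print(url)
print(rows)
```

In a notebook, `fetch_json(url)` would replace the hand-written sample, and the list of row dictionaries could be passed to `pandas.DataFrame(rows)` to get a table.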

How do I add robust retry logic and timeouts to API requests?

Ask the AI to implement retries for transient errors (like timeouts), with a limited number of attempts, a delay between attempts, and a per-request timeout. Also request clear console prints for progress and outcomes, plus optional logging for post-run diagnostics.
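
One way such a prompt could turn out is a small retry helper like the sketch below. The helper name `call_with_retries`, its parameters, and the simulated flaky call are all hypothetical and exist only for this illustration. The per-request timeout belongs inside the function being retried (for example, the `timeout` argument of whichever HTTP client is in use).

```python
import time


def call_with_retries(func, attempts=3, delay=1.0,
                      retry_on=(TimeoutError, ConnectionError),
                      sleep=time.sleep):
    """Call func(); on a transient error, wait and try again."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = func()
            print(f"attempt {attempt}: success")
            return result
        except retry_on as error:
            last_error = error
            print(f"attempt {attempt}: failed ({error!r})")
            if attempt < attempts:
                sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Simulated flaky call: fails twice with a timeout, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"status": "ok"}


result = call_with_retries(flaky_request, attempts=3, delay=0,
                           sleep=lambda seconds: None)
print(result)
```

The `retry_on` tuple should match the exceptions your HTTP client actually raises for transient failures; anything else propagates immediately, which keeps genuine bugs from being retried silently.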

What are common pitfalls when asking AI to use API documentation, and how do I prevent them?

LLMs may misread docs, invent parameters, skip required headers, or mix up path and query parameters. To mitigate this, upload the docs, ask the model to list required versus optional parameters and to explain each part of the generated URL, and verify the request with curl or Postman against sample responses before coding further.
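
Part of that verification can be automated. The sketch below splits a generated URL into its parts and compares its query parameters with lists copied by hand from the documentation, which flags both missing required parameters and ones the model may have invented. The endpoint `api.example.com`, the parameter names, and the helper names are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lists, transcribed by hand from an API's documentation.
REQUIRED_PARAMS = {"category"}
OPTIONAL_PARAMS = {"limit", "offset"}


def explain_url(url):
    """Break a URL into labeled parts so each one can be reviewed."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parse_qs(parts.query, keep_blank_values=True),
    }


def check_params(url, required, optional):
    """Report required parameters that are absent and undocumented ones."""
    sent = set(explain_url(url)["query"])
    return {
        "missing": sorted(required - sent),
        "unknown": sorted(sent - required - optional),
    }


# A URL as a model might generate it, with one undocumented parameter.
generated = "https://api.example.com/v1/items?limit=5&colour=red"
report = check_params(generated, REQUIRED_PARAMS, OPTIONAL_PARAMS)
print(explain_url(generated))
print(report)
```

A check like this only catches naming problems. It does not confirm that values are valid or that the endpoint behaves as documented, so a manual request is still worth making.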

How do I flatten nested JSON from APIs (e.g., JSONPlaceholder) into a DataFrame?

Prompt the AI to safely access nested keys with chained .get(), select the fields you need (like address.city, company.name, company.catchPhrase), and combine fields when useful (e.g., “(lat, lng)”). Have it return a clean pandas DataFrame with clear column names.
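
A minimal sketch of that flattening step follows. The records are made up for the example but mirror the nested layout named above (address.city, an address.geo pair, company.name, company.catchPhrase); the function name and output column names are assumptions.

```python
def flatten_user(user):
    """Flatten one nested user record into a single-level dictionary."""
    address = user.get("address", {})
    geo = address.get("geo", {})
    company = user.get("company", {})
    lat, lng = geo.get("lat"), geo.get("lng")
    has_coordinates = lat is not None and lng is not None
    return {
        "id": user.get("id"),
        "name": user.get("name"),
        "city": address.get("city"),
        # Combine the two coordinate fields into one "(lat, lng)" string.
        "coordinates": f"({lat}, {lng})" if has_coordinates else None,
        "company_name": company.get("name"),
        "company_catch_phrase": company.get("catchPhrase"),
    }


# Made-up records; the second one is missing several nested fields.
users = [
    {
        "id": 1,
        "name": "Example User",
        "address": {"city": "Exampleville", "geo": {"lat": "-10.0", "lng": "20.0"}},
        "company": {"name": "Example Co", "catchPhrase": "Sample phrase"},
    },
    {"id": 2, "name": "Second Example", "address": {"city": "Sampletown"}},
]

rows = [flatten_user(user) for user in users]
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` would produce one column per key, with missing values for the fields the second record lacks.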

How can I make JSON parsing resilient to missing or inconsistent fields?

Use chained .get() with default fallbacks, guard parent objects that may be None, test against multiple records, and print or log a sample of the raw JSON before parsing. Ask the AI to include error handling and defaults for optional fields so the code doesn’t crash when the schema shifts.
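
One subtlety is worth a short example: `record.get("address", {})` still returns None when the key exists with a null value, so the next `.get()` in the chain raises an error. A small helper, here called `dig` (a hypothetical name, not a library function), can guard every level. The sample records are invented.

```python
def dig(record, *keys, default=None):
    """Walk nested dictionaries, returning default if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            return default  # parent was None or an unexpected type
        current = current.get(key)
    return default if current is None else current


records = [
    {"id": 1, "address": {"city": "Exampleville"}},
    {"id": 2, "address": None},      # parent present but null
    {"id": 3},                       # parent missing entirely
    {"id": 4, "address": "n/a"},     # parent has the wrong type
]

cities = [dig(record, "address", "city", default="unknown") for record in records]
print(cities)
```

Choosing an explicit default such as "unknown" makes gaps visible downstream instead of hiding them behind silent None values.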

When should I use print versus logging in notebooks?

Use print for real-time, in-notebook progress and status messages; it’s simple and visible. Use logging when you need structured, persistent records (e.g., files) for debugging or production. You can combine them—console prints during development and logging for deeper diagnostics.
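
The combination can be sketched as follows. To keep the example self-contained, log records go to an in-memory buffer; in practice a `logging.FileHandler` pointed at a file would take its place. The logger name, message wording, and sample records are assumptions.

```python
import io
import logging

# In-memory stand-in for a log file, so the example leaves nothing on disk.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("pipeline_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep these records out of the notebook output


def process(records):
    """Print live progress; log details worth keeping after the run."""
    cleaned = []
    for index, record in enumerate(records, start=1):
        print(f"processing record {index} of {len(records)}")
        if record.get("value") is None:
            logger.warning("record %s has no value; skipped", record.get("id"))
            continue
        cleaned.append(record)
    logger.info("kept %d of %d records", len(cleaned), len(records))
    return cleaned


cleaned = process([{"id": 1, "value": 10}, {"id": 2, "value": None}])
print(log_buffer.getvalue())
```

The prints give immediate feedback while the cell runs, and the log keeps a structured record of what was skipped and why.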

How can an AI help me write and refine regex for phone numbers, and what are the limits?

The AI can draft regex to match multiple formats, use capture groups to extract parts, and propose normalization steps (e.g., +1-XXX-XXX-XXXX). Limits: regex is brittle on messy, real-world input. Provide examples and edge cases in your prompt, validate on real data, or switch to specialized libraries (like phonenumbers) when patterns get complex.
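
A draft of the kind an AI might produce is sketched below. It assumes ten-digit North American numbers with an optional leading country code of 1, and the pattern, function name, and sample text are made up for this illustration. It is deliberately simple and will miss extensions, international formats, and many messy inputs.

```python
import re

PHONE_PATTERN = re.compile(
    r"""
    (?<!\d)                 # do not start in the middle of a longer number
    (?:\+?1[\s.-]?)?        # optional country code 1
    \(?(\d{3})\)?           # area code, parentheses optional
    [\s.-]?
    (\d{3})                 # exchange
    [\s.-]?
    (\d{4})                 # line number
    (?!\d)                  # do not stop in the middle of a longer number
    """,
    re.VERBOSE,
)


def extract_phones(text):
    """Return each match split into components plus a normalized form."""
    return [
        {
            "area_code": area,
            "exchange": exchange,
            "line": line,
            "normalized": f"+1-{area}-{exchange}-{line}",
        }
        for area, exchange, line in PHONE_PATTERN.findall(text)
    ]


sample_text = (
    "Call (555) 123-4567 or 555.987.6543, or +1 555 222 3333. "
    "Order 12345678901234 is not a phone."
)
phones = extract_phones(sample_text)
for phone in phones:
    print(phone)
```

The list of dictionaries maps directly onto DataFrame columns or JSON documents. Keeping a set of test strings like `sample_text`, including cases that must not match, is a cheap way to catch regressions when the pattern is revised.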

Should I output to a DataFrame or JSON for downstream systems?

Choose DataFrame for tabular analysis and pandas-driven workflows; choose JSON for document-oriented systems (e.g., MongoDB, REST payloads). You can convert between them (DataFrame to JSON via df.to_json), but starting in the target format often simplifies integration.

pro $24.99 per month

access to all Manning books, MEAPs, liveVideos, liveProjects, and audiobooks!
choose one free eBook per month to keep
exclusive 50% discount on all purchases
renews monthly, pause or cancel renewal anytime

lite $19.99 per month

access to all Manning books, including MEAPs!

team

5, 10 or 20 seats+ for your team - learn more

eBook

pdf, ePub, online

$55.99 $41.99

you save $14.00 (25%)
