This chapter pivots from treating AI as an external coding helper to embedding it directly inside data pipelines. Instead of manually passing prompts to a chat interface, the model becomes a programmable step that sits alongside familiar libraries like Requests and pandas. Sentiment analysis serves as the anchor use case: it’s simple to visualize, broadly useful, and a clear example of AI as a drop-in enrichment stage that feeds downstream tools such as scikit-learn, databases, and BI dashboards. The emphasis is on integration and operational hygiene—securely loading API keys via environment variables and designing prompts and parameters that behave predictably in production.
The walkthrough builds a small but realistic pipeline: fetch recent news with a flexible articles endpoint, preprocess titles and body text into clean content blocks, then call the OpenAI Chat Completions API to classify sentiment. It adopts a crawl-walk-run approach—starting with plain prompts, inspecting raw JSON responses, and then iterating for reliability. Practical concerns are front and center: control costs with token discipline, cap outputs, and monitor usage; extract only the fields you need from responses; and expect variability in LLM judgments (especially with sarcasm or mixed tone). To make results pipeline-ready, the chapter normalizes outputs to compact labels (Positive, Neutral, Negative) or numeric scores, and adds logging and consolidation steps that enrich a DataFrame with consistent sentiment signals.
By the end, the pipeline loops through articles, applies sentiment, logs progress, and returns a clean, enriched dataset ready for modeling, storage, or visualization. The lab reinforces these skills: swapping the query topic, returning numeric scores from −1 to 1, and scaling to larger batches with simple aggregations. The key takeaway is that AI is no longer just a conversational assistant—it’s a composable building block for extraction, enrichment, and delivery. This foundation sets up future chapters where natural language can steer data transformations, making AI a first-class element of modern data engineering workflows.
In earlier chapters, the data engineer acted as a go-between—manually passing instructions between an AI coding companion and the coding environment. Now, the AI component is embedded directly within the workflow, alongside libraries like Requests and pandas, enabling natural language tasks (like sentiment analysis) to execute as native steps within the Jupyter notebook.
This diagram illustrates the full mental model for the use case. Articles are fetched from the NewsAPI using the requests library, then preprocessed into clean content blocks. The OpenAI API is embedded within the Python environment to analyze sentiment, returning structured labels. Finally, results are stored in a pandas DataFrame and logged—ready for downstream use in tools like scikit-learn, PostgreSQL, or Tableau.
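The fetch-and-preprocess stages of this flow can be expressed in a few lines of Python. The following is a minimal sketch, not the chapter's exact code: it assumes a NEWS_API_KEY loaded from a .env file (as covered in the FAQ below), and the "Tesla" query, yesterday-to-today date range, and five-article test limit follow the chapter's example, while the exact field handling is illustrative.

```python
import logging
import os
from datetime import date, timedelta

import pandas as pd
import requests
from dotenv import load_dotenv

logging.basicConfig(level=logging.INFO)

# Load the key from a .env file rather than hardcoding it.
load_dotenv()
NEWS_API_KEY = os.getenv("NEWS_API_KEY")

# NewsAPI's /v2/everything endpoint; the free tier only covers recent history,
# so the date range is computed dynamically (yesterday through today).
params = {
    "q": "Tesla",
    "from": (date.today() - timedelta(days=1)).isoformat(),
    "to": date.today().isoformat(),
    "language": "en",
    "sortBy": "publishedAt",
    "apiKey": NEWS_API_KEY,
}
response = requests.get("https://newsapi.org/v2/everything", params=params)
logging.info("NewsAPI returned status %s", response.status_code)
response.raise_for_status()  # surface HTTP errors instead of failing silently
articles = response.json()["articles"]

# Preprocess: merge title, description, and content into one clean text block
# per article, then load the result into a pandas DataFrame.
rows = []
for article in articles[:5]:  # keep the batch small while testing
    combined = " ".join(
        (article.get(field) or "") for field in ("title", "description", "content")
    )
    rows.append({"title": article.get("title"),
                 "content": " ".join(combined.split())})  # strip newlines/extra spaces

df = pd.DataFrame(rows)
```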
This figure shows the results of the full AI-powered sentiment pipeline. The top portion displays logging output for each step, including API requests and processed sentiment labels. The bottom shows the final pandas DataFrame, where each article title is paired with a normalized sentiment category. Results will vary over time, as the input depends on live Tesla news retrieved from NewsAPI.
Example of what the output should look like when the model returns sentiment scores.
Example of what aggregation results may look like.
FAQ
What does it mean to embed the OpenAI API inside a data workflow instead of using it as a separate coding companion?
It means the model becomes a native step in your pipeline—right alongside libraries like Requests and pandas—so tasks like sentiment analysis run programmatically inside your notebook or scripts. Outputs flow directly into DataFrames, ML workflows, databases, or dashboards, making AI an integrated enrichment component rather than a separate assistant.

Why use sentiment analysis as the first AI task, and where can the results go downstream?
Sentiment analysis is clear, widely applicable, and easy to visualize. It demonstrates how AI can act as a drop-in enrichment step. Results can feed into pandas for analysis, scikit-learn models, PostgreSQL tables, or BI tools like Tableau—just another column you can aggregate, join, or visualize.

How should I manage API keys for NewsAPI and OpenAI securely?
Store keys in environment variables and load them via dotenv. Example: import os; from dotenv import load_dotenv; load_dotenv(); OPENAI_API_KEY = os.getenv("OPENAI_API_KEY"); NEWS_API_KEY = os.getenv("NEWS_API_KEY"). Avoid hardcoding keys in code or notebooks. See env_setup.md in the repo for a walkthrough.

Which NewsAPI endpoint and parameters are used in this chapter?
Use /v2/everything with parameters like q (keywords), from and to (date range), language (e.g., en), and sortBy (e.g., publishedAt). Free tier access covers the last 30 days, so use recent dates. The example fetches articles about "Tesla" from yesterday through today using dynamic date calculation.

How do I extract and preprocess articles before sending them to the model?
Extraction: call the NewsAPI URL with requests, handle status codes, and log results; collect articles from response.json()["articles"]. Preprocessing: combine title, description, and content into a single clean text field (replace newlines, strip whitespace) and load into a pandas DataFrame. Limit to a few records (e.g., 5) while testing.

How do I call the OpenAI Chat Completions API for sentiment analysis in Python?
Install openai (v1+), load OPENAI_API_KEY from the environment, then call openai.chat.completions.create with model="gpt-4o", messages=[system, user], and parameters like max_tokens and temperature. Extract the result with response.choices[0].message.content.strip(). (See the first sketch following this FAQ.)

How do I parse the raw API response and get structured, pipeline-ready outputs?
From the raw JSON, use choices[0].message.content to get the text. Normalize via prompting: ask the model to return only "Positive", "Neutral", or "Negative". Alternatively, ask for a numerical score between -1 and 1 and return only the number. For stricter structure, you can later use response_format with JSON schemas.

What does this cost and how can I control it?
Costs are token-based (input + output). As of writing, gpt-4o is around $0.005 per 1,000 input tokens and $0.015 per 1,000 output tokens. Control costs by sampling small batches first, setting max_tokens, keeping prompts concise, and monitoring usage in the OpenAI dashboard. Always check the current pricing page before scaling.

What can go wrong with LLM sentiment analysis, and how do I improve reliability?
Models can misread sarcasm, overreact to charged words, and vary across runs.
Improve reliability by normalizing outputs (fixed labels or numeric scores), using explicit prompts ("Return only Positive/Neutral/Negative"), optionally specifying the method (e.g., VADER or hybrid), and treating AI sentiment as one signal among many—validate before acting.

How do I apply sentiment analysis across the DataFrame and prepare results for downstream use?
Iterate over df["content"], call your perform_sentiment_analysis function, log progress, and add a new df["sentiment"] column. For categorical labels, use value_counts() to summarize. For numeric scores, convert with pd.to_numeric(..., errors="coerce") and compute aggregates like mean. Store or serve the enriched DataFrame to ML pipelines, databases, or dashboards. (See the second sketch below.)
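The Chat Completions call described in the FAQ can be wrapped in a small function. This is a minimal sketch rather than the chapter's exact code: the prompt wording is illustrative, and it assumes OPENAI_API_KEY is already set in the environment, which the openai v1+ client picks up automatically.

```python
import openai  # openai v1+; reads OPENAI_API_KEY from the environment


def perform_sentiment_analysis(text: str) -> str:
    """Classify one article's sentiment as Positive, Neutral, or Negative."""
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a sentiment classifier. "
                    'Return only "Positive", "Neutral", or "Negative".'
                ),
            },
            {"role": "user", "content": text},
        ],
        max_tokens=5,   # a single label needs very few output tokens
        temperature=0,  # reduce run-to-run variability
    )
    return response.choices[0].message.content.strip()
```

Pinning temperature to 0 and capping max_tokens keeps the label format stable and the per-call cost near the minimum, in line with the token-discipline advice above.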
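And a sketch of the consolidation loop from the last answer, assuming the df and perform_sentiment_analysis sketches above; the commented-out lines show the numeric-score variant, which only applies if the prompt asked for a number between -1 and 1 instead of a label.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)

# Apply sentiment to every article, logging progress as we go.
sentiments = []
for i, content in enumerate(df["content"], start=1):
    label = perform_sentiment_analysis(content)
    logging.info("Article %d/%d -> %s", i, len(df), label)
    sentiments.append(label)
df["sentiment"] = sentiments

# Categorical labels: summarize with value_counts().
print(df["sentiment"].value_counts())

# Numeric-score variant: coerce model output to floats and aggregate.
# df["score"] = pd.to_numeric(df["sentiment"], errors="coerce")
# print(df["score"].mean())
```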