Overview

1 Analyzing data with large language models

Large language models are presented as versatile neural networks that can tackle a wide range of data analysis tasks through natural-language instructions rather than task-specific training. The chapter introduces their multimodal capabilities across text, images, audio, video, tables, and graphs, and frames data analysis as either directly processing raw content with the model or indirectly using the model to generate code for specialized tools. It also sets expectations around background knowledge, practical challenges, and the importance of scalability and cost-awareness when moving from toy prompts to real pipelines.

The core usage pattern is prompting: describing the task, providing necessary context and data, specifying the desired output format, and optionally including examples (few-shot) versus relying on instructions alone (zero-shot). For repeatable workflows, prompt templates with placeholders enable programmatic generation of consistent prompts at scale. The chapter contrasts direct analysis (e.g., classifying reviews or answering questions about images by embedding the data in the prompt) with tool-mediated analysis for structured data (e.g., translating natural-language questions into SQL for relational databases or into graph queries). It covers practical interfaces—from web UIs for ad hoc trials to Python libraries for automation and integration with other data processing components.
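
As a minimal sketch of this pattern (the template wording and example reviews are invented for illustration), a prompt template in Python can simply be a string with a placeholder that is substituted per data item:

    # A reusable prompt template for zero-shot sentiment classification.
    # The template text and the [ReviewText] placeholder are illustrative.
    TEMPLATE = (
        "Classify the sentiment of the following product review "
        "as 'positive' or 'negative'. Answer with a single word.\n\n"
        "Review: [ReviewText]"
    )

    reviews = [
        "The blender broke after two days.",
        "Great sound quality and battery life!",
    ]

    # Substituting the placeholder yields one concrete prompt per data item.
    prompts = [TEMPLATE.replace("[ReviewText]", review) for review in reviews]

    for prompt in prompts:
        print(prompt)
        print("---")
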

To control costs and improve reliability, the chapter outlines strategies for model selection, configuration, and prompt engineering. It emphasizes choosing appropriately sized models, understanding token-driven pricing, constraining outputs, and using fine-tuning when it can replace more expensive models for a specific task. It advocates empirically benchmarking models and prompts, and introduces higher-level frameworks that simplify building complex applications, including agent-based approaches that plan multi-step analyses and invoke external tools on demand. Together, these practices enable building robust, efficient, and flexible data analysis systems powered by large language models.

Illustration by GPT-4o, connecting the topics “data analysis” and “large language models”.

Figure: Using language models directly for data analysis: a prompt template describes the analysis task. It contains placeholders that are substituted by data to analyze. After substituting placeholders, the resulting prompt is submitted to the language model to produce output.

Figure: Using language models indirectly to build a natural language interface for tabular data: the prompt template contains placeholders for questions about data. After substituting placeholders, the resulting prompt is used as input for the language model. The model translates the question into an SQL query that is executed via a relational database management system.

Figure: Holistic Evaluation of Language Models (HELM): comparing language models offered by different providers according to various metrics.

Figure: PromptBase: a Web site for selling and buying templates for prompts, short text documents used to describe tasks to a language model.

Summary

  • Language models can solve novel tasks without specialized training.
  • The prompt is the input to the language model.
  • Prompts may combine text with other types of data such as images.
  • A prompt contains a task description, context, and optionally examples.
  • Language models can analyze certain types of data directly.
  • When analyzing data directly, the data must appear in the prompt.
  • Prompt templates contain placeholders, e.g., to represent data items.
  • By substituting placeholders in a prompt template, we obtain a prompt.
  • Language models can also help to analyze data via external tools.
  • E.g., language models can instruct other tools on how to process data.
  • Models are available in many different sizes with significant cost differences.
  • Models can be configured using various configuration parameters.
  • LangChain and LlamaIndex help to develop complex applications.
  • Agents exploit language models to solve complex problems.

FAQ

What are large language models and why are they useful for data analysis?
Large language models (LLMs) are powerful neural networks, pre-trained on vast amounts of data, that can understand and generate natural language (and often other modalities such as images). For data analysis, they can extract information from text, answer questions about images, write and explain code, translate questions into database queries, and help build end-to-end analysis pipelines across diverse data types.

What does GPT stand for, and what does each part mean?
GPT stands for Generative Pre-Trained Transformer. Generative: it produces content (text, code, etc.). Pre-Trained: it is trained on large generic datasets before being adapted to specific tasks. Transformer: the neural architecture that handles variable-length inputs and outputs and underpins most modern generative AI.

Which data types does this book cover, and what is the difference between structured and unstructured data?
This book covers text, images, videos, audio, tables, and graphs. Structured data (tables, graphs) has an explicit schema and relationships and benefits from specialized tools (e.g., SQL databases, graph systems). Unstructured data (text, images, video, audio) lacks easily exploitable structure and is typically analyzed directly with an LLM.

What is prompting, and what should a good prompt include?
Prompting is how you instruct an LLM. A good prompt includes clear task instructions, all necessary context and data (e.g., the text or image to analyze), the desired output format, and, optionally, examples illustrating inputs and correct outputs to improve accuracy.

What’s the difference between zero-shot and few-shot prompting?
Zero-shot means the model gets only the task description (no examples). Few-shot includes a handful of input–output examples in the prompt to guide the model’s behavior. Few-shot often improves accuracy on nuanced tasks.

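A small sketch contrasting the two prompt styles for the same task (the review texts are invented for illustration):

    # Zero-shot: the prompt contains only the task description and the input.
    zero_shot = (
        "Classify the sentiment of this review as 'positive' or 'negative'.\n"
        "Review: The packaging was damaged and parts were missing.\n"
        "Sentiment:"
    )

    # Few-shot: the same task, preceded by a handful of input-output examples.
    few_shot = (
        "Classify the sentiment of each review as 'positive' or 'negative'.\n"
        "Review: Works exactly as described, highly recommend.\n"
        "Sentiment: positive\n"
        "Review: Stopped charging after a week.\n"
        "Sentiment: negative\n"
        "Review: The packaging was damaged and parts were missing.\n"
        "Sentiment:"
    )
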
What is a prompt template and how do placeholders work?
A prompt template is a reusable prompt for a specific task with placeholders for parts that change per input (e.g., [ReviewText]). At runtime, you substitute placeholders with the current data item to create an instance of the prompt, enabling efficient batch processing.

How can I interact with LLMs: web UI or Python libraries?
Providers offer web interfaces for trying out single prompts ad hoc. For automation and scale (e.g., processing many documents or integrating post-processing), use Python libraries (e.g., from OpenAI, Google, Anthropic, Cohere) to programmatically send prompts, handle responses, and integrate with other tools.

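A minimal sketch of the programmatic route, assuming the OpenAI Python library (openai >= 1.0) and an API key in the environment; other providers' libraries follow a similar request/response pattern, and the model name is only an example:

    from openai import OpenAI

    client = OpenAI()  # expects the OPENAI_API_KEY environment variable

    prompt = (
        "Classify the sentiment of this review as 'positive' or 'negative': "
        "Great value for the price."
    )

    # Send one prompt and read the model's reply.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
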
When should I analyze data directly with an LLM versus via external tools?
Use direct analysis when the data fits in the prompt and the task suits LLM strengths (e.g., classifying a review, describing an image). Use external tools when dealing with large or structured data (tables/graphs): have the LLM generate code (e.g., SQL, Cypher) and execute it in databases/graph systems for efficiency, scalability, and lower cost.

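The sketch below illustrates the tool-mediated pattern using Python's built-in sqlite3 module together with the OpenAI library from the previous example; the schema and rows are toy data, and a production pipeline would also validate the generated SQL, since models sometimes wrap it in extra formatting:

    import sqlite3
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Toy in-memory table so the generated query has data to run against.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "Alice", 20.0), (2, "Bob", 15.5), (3, "Alice", 7.25)],
    )

    question = "What is the total order amount per customer?"
    prompt = (
        "Translate the question into a single SQLite query. Return only the SQL.\n"
        "Schema: CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)\n"
        f"Question: {question}"
    )

    # The model writes the query ...
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    sql = response.choices[0].message.content.strip()

    # ... and the database, not the model, processes the actual data.
    for row in db.execute(sql):
        print(row)
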
How can I minimize costs while maintaining quality?
  • Choose the right model size and provider; bigger isn’t always better for simple tasks.
  • Mind tokens: costs scale with the number of tokens read and written (roughly four characters per token).
  • Configure outputs: constrain output choices and limit the maximum output length when possible.
  • Consider fine-tuning for a specific task so that smaller, cheaper models become effective.
  • Invest in prompt engineering and benchmark alternatives (e.g., via HELM) to find the best cost–quality tradeoff.

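A small sketch of two of these levers, assuming the OpenAI Python library; parameter names such as max_tokens vary across providers and API versions:

    from openai import OpenAI

    client = OpenAI()

    prompt = (
        "Is this review positive or negative? Answer with one word: "
        "'Arrived late and the case was scratched.'"
    )

    # Rough pre-flight estimate using the four-characters-per-token rule of thumb.
    print(f"~{len(prompt) / 4:.0f} input tokens")

    # A smaller model plus a hard cap on output length keeps each call cheap.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example of a smaller, cheaper model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
        temperature=0,
    )
    print(response.choices[0].message.content)
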
What are agents, and how do frameworks like LangChain and LlamaIndex help?
Agents use LLMs to plan multi-step workflows, decompose problems, and invoke external tools (e.g., run SQL queries) based on intermediate results. Frameworks like LangChain and LlamaIndex make it easier to build such applications, especially when steps, data sources, or tools vary per query and can’t be hard-coded in advance.

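To make the idea concrete, here is a conceptual, hand-rolled agent loop in plain Python; it is not the LangChain or LlamaIndex API, and ask_llm and run_sql are hypothetical stand-ins for a model call and a database tool:

    # Conceptual plan-act-observe loop: the model decides which tool to call
    # next, sees the result, and repeats until it can answer.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in: would call a language model

    def run_sql(query: str) -> str:
        raise NotImplementedError  # stand-in: would execute SQL on a database

    TOOLS = {"run_sql": run_sql}

    def agent(task: str, max_steps: int = 5) -> str:
        history = f"Task: {task}\n"
        for _ in range(max_steps):
            decision = ask_llm(
                history + "Reply with 'CALL run_sql: <query>' or 'FINAL: <answer>'."
            )
            if decision.startswith("FINAL:"):
                return decision.removeprefix("FINAL:").strip()
            tool_name, _, argument = decision.removeprefix("CALL ").partition(":")
            observation = TOOLS[tool_name.strip()](argument.strip())
            history += f"{decision}\nObservation: {observation}\n"
        return "No answer within the step budget."

Frameworks such as LangChain and LlamaIndex package a tested version of this loop together with ready-made tool and data-source integrations, so in practice you rarely write it by hand.
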
