Overview

1 Analyzing data with large language models

Large language models are presented as versatile neural networks that can tackle a wide range of data analysis tasks through natural-language instructions rather than task-specific training. The chapter introduces their multimodal capabilities across text, images, audio, video, tables, and graphs, and frames data analysis as either directly processing raw content with the model or indirectly using the model to generate code for specialized tools. It also sets expectations around background knowledge, practical challenges, and the importance of scalability and cost-awareness when moving from toy prompts to real pipelines.

The core usage pattern is prompting: describing the task, providing necessary context and data, specifying the desired output format, and optionally including examples (few-shot) versus relying on instructions alone (zero-shot). For repeatable workflows, prompt templates with placeholders enable programmatic generation of consistent prompts at scale. The chapter contrasts direct analysis (e.g., classifying reviews or answering questions about images by embedding the data in the prompt) with tool-mediated analysis for structured data (e.g., translating natural-language questions into SQL for relational databases or into graph queries). It covers practical interfaces—from web UIs for ad hoc trials to Python libraries for automation and integration with other data processing components.
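
As a minimal sketch of this pattern (the template wording and example reviews are invented for illustration), a prompt template in Python can simply be a string with a placeholder that is substituted per data item:

    # A reusable prompt template for zero-shot sentiment classification.
    # The template text and the [ReviewText] placeholder are illustrative.
    TEMPLATE = (
        "Classify the sentiment of the following product review "
        "as 'positive' or 'negative'. Answer with a single word.\n\n"
        "Review: [ReviewText]"
    )

    reviews = [
        "The blender broke after two days.",
        "Great sound quality and battery life!",
    ]

    # Substituting the placeholder yields one concrete prompt per data item.
    prompts = [TEMPLATE.replace("[ReviewText]", review) for review in reviews]

    for prompt in prompts:
        print(prompt)
        print("---")
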

To control costs and improve reliability, the chapter outlines strategies for model selection, configuration, and prompt engineering. It emphasizes choosing appropriately sized models, understanding token-driven pricing, constraining outputs, and using fine-tuning when it can replace more expensive models for a specific task. It advocates empirically benchmarking models and prompts, and introduces higher-level frameworks that simplify building complex applications, including agent-based approaches that plan multi-step analyses and invoke external tools on demand. Together, these practices enable building robust, efficient, and flexible data analysis systems powered by large language models.

Illustration by GPT-4o, connecting the topics “data analysis” and “large language models”.

Figure: Using language models directly for data analysis: a prompt template describes the analysis task. It contains placeholders that are substituted by data to analyze. After substituting placeholders, the resulting prompt is submitted to the language model to produce output.

Figure: Using language models indirectly to build a natural language interface for tabular data: the prompt template contains placeholders for questions about data. After substituting placeholders, the resulting prompt is used as input for the language model. The model translates the question into an SQL query that is executed via a relational database management system.

Figure: Holistic Evaluation of Language Models (HELM): comparing language models offered by different providers according to various metrics.

Figure: PromptBase: a Web site for selling and buying templates for prompts, short text documents used to describe tasks to a language model.

Summary

  • Language models can solve novel tasks without specialized training.
  • The prompt is the input to the language model.
  • Prompts may combine text with other types of data such as images.
  • A prompt contains a task description, context, and optionally examples.
  • Language models can analyze certain types of data directly.
  • When analyzing data directly, the data must appear in the prompt.
  • Prompt templates contain placeholders, e.g., to represent data items.
  • By substituting placeholders in a prompt template, we obtain a prompt.
  • Language models can also help to analyze data via external tools.
  • E.g., language models can instruct other tools on how to process data.
  • Models are available in many different sizes with significant cost differences.
  • Models can be configured using various configuration parameters.
  • LangChain and LlamaIndex help to develop complex applications.
  • Agents exploit language models to solve complex problems.

FAQ

What are large language models and why are they useful for data analysis?
Large language models (LLMs) are powerful neural networks, pre-trained on vast amounts of data, that can understand and generate natural language (and often other modalities such as images). For data analysis, they can extract information from text, answer questions about images, write and explain code, translate questions into database queries, and help build end-to-end analysis pipelines across diverse data types.

What does GPT stand for, and what does each part mean?
GPT stands for Generative Pre-Trained Transformer. Generative: it produces content (text, code, etc.). Pre-Trained: it is trained on large generic datasets before being adapted to specific tasks. Transformer: the neural architecture that handles variable-length inputs and outputs and underpins most modern generative AI.

Which data types does this book cover, and what is the difference between structured and unstructured data?
This book covers text, images, videos, audio, tables, and graphs. Structured data (tables, graphs) has an explicit schema and relationships and benefits from specialized tools (e.g., SQL databases, graph systems). Unstructured data (text, images, video, audio) lacks easily exploitable structure and is typically analyzed directly with an LLM.

What is prompting, and what should a good prompt include?
Prompting is how you instruct an LLM. A good prompt includes clear task instructions, all necessary context and data (e.g., the text or image to analyze), the desired output format, and, optionally, examples illustrating inputs and correct outputs to improve accuracy.

What’s the difference between zero-shot and few-shot prompting?
Zero-shot means the model gets only the task description (no examples). Few-shot includes a handful of input–output examples in the prompt to guide the model’s behavior. Few-shot often improves accuracy on nuanced tasks.

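A small sketch contrasting the two prompt styles for the same task (the review texts are invented for illustration):

    # Zero-shot: the prompt contains only the task description and the input.
    zero_shot = (
        "Classify the sentiment of this review as 'positive' or 'negative'.\n"
        "Review: The packaging was damaged and parts were missing.\n"
        "Sentiment:"
    )

    # Few-shot: the same task, preceded by a handful of input-output examples.
    few_shot = (
        "Classify the sentiment of each review as 'positive' or 'negative'.\n"
        "Review: Works exactly as described, highly recommend.\n"
        "Sentiment: positive\n"
        "Review: Stopped charging after a week.\n"
        "Sentiment: negative\n"
        "Review: The packaging was damaged and parts were missing.\n"
        "Sentiment:"
    )
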
What is a prompt template and how do placeholders work?
A prompt template is a reusable prompt for a specific task with placeholders for parts that change per input (e.g., [ReviewText]). At runtime, you substitute placeholders with the current data item to create an instance of the prompt, enabling efficient batch processing.

How can I interact with LLMs: web UI or Python libraries?
Providers offer web interfaces for trying out single prompts ad hoc. For automation and scale (e.g., processing many documents or integrating post-processing), use Python libraries (e.g., from OpenAI, Google, Anthropic, Cohere) to programmatically send prompts, handle responses, and integrate with other tools.

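A minimal sketch of the programmatic route, assuming the OpenAI Python library (openai >= 1.0) and an API key in the environment; other providers' libraries follow a similar request/response pattern, and the model name is only an example:

    from openai import OpenAI

    client = OpenAI()  # expects the OPENAI_API_KEY environment variable

    prompt = (
        "Classify the sentiment of this review as 'positive' or 'negative': "
        "Great value for the price."
    )

    # Send one prompt and read the model's reply.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)
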
When should I analyze data directly with an LLM versus via external tools?
Use direct analysis when the data fits in the prompt and the task suits LLM strengths (e.g., classifying a review, describing an image). Use external tools when dealing with large or structured data (tables/graphs): have the LLM generate code (e.g., SQL, Cypher) and execute it in databases/graph systems for efficiency, scalability, and lower cost.

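The sketch below illustrates the tool-mediated pattern using Python's built-in sqlite3 module together with the OpenAI library from the previous example; the schema and rows are toy data, and a production pipeline would also validate the generated SQL, since models sometimes wrap it in extra formatting:

    import sqlite3
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Toy in-memory table so the generated query has data to run against.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
    db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "Alice", 20.0), (2, "Bob", 15.5), (3, "Alice", 7.25)],
    )

    question = "What is the total order amount per customer?"
    prompt = (
        "Translate the question into a single SQLite query. Return only the SQL.\n"
        "Schema: CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)\n"
        f"Question: {question}"
    )

    # The model writes the query ...
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": prompt}],
    )
    sql = response.choices[0].message.content.strip()

    # ... and the database, not the model, processes the actual data.
    for row in db.execute(sql):
        print(row)
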
How can I minimize costs while maintaining quality?
  • Choose the right model size and provider; bigger isn’t always better for simple tasks.
  • Mind tokens: costs scale with the number of tokens read and written (roughly four characters per token).
  • Configure outputs: constrain output choices and limit the maximum output length when possible.
  • Consider fine-tuning for a specific task so that smaller, cheaper models become effective.
  • Invest in prompt engineering and benchmark alternatives (e.g., via HELM) to find the best cost–quality tradeoff.

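A small sketch of two of these levers, assuming the OpenAI Python library; parameter names such as max_tokens vary across providers and API versions:

    from openai import OpenAI

    client = OpenAI()

    prompt = (
        "Is this review positive or negative? Answer with one word: "
        "'Arrived late and the case was scratched.'"
    )

    # Rough pre-flight estimate using the four-characters-per-token rule of thumb.
    print(f"~{len(prompt) / 4:.0f} input tokens")

    # A smaller model plus a hard cap on output length keeps each call cheap.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example of a smaller, cheaper model
        messages=[{"role": "user", "content": prompt}],
        max_tokens=3,
        temperature=0,
    )
    print(response.choices[0].message.content)
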
What are agents, and how do frameworks like LangChain and LlamaIndex help?
Agents use LLMs to plan multi-step workflows, decompose problems, and invoke external tools (e.g., run SQL queries) based on intermediate results. Frameworks like LangChain and LlamaIndex make it easier to build such applications, especially when steps, data sources, or tools vary per query and can’t be hard-coded in advance.

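To make the idea concrete, here is a conceptual, hand-rolled agent loop in plain Python; it is not the LangChain or LlamaIndex API, and ask_llm and run_sql are hypothetical stand-ins for a model call and a database tool:

    # Conceptual plan-act-observe loop: the model decides which tool to call
    # next, sees the result, and repeats until it can answer.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in: would call a language model

    def run_sql(query: str) -> str:
        raise NotImplementedError  # stand-in: would execute SQL on a database

    TOOLS = {"run_sql": run_sql}

    def agent(task: str, max_steps: int = 5) -> str:
        history = f"Task: {task}\n"
        for _ in range(max_steps):
            decision = ask_llm(
                history + "Reply with 'CALL run_sql: <query>' or 'FINAL: <answer>'."
            )
            if decision.startswith("FINAL:"):
                return decision.removeprefix("FINAL:").strip()
            tool_name, _, argument = decision.removeprefix("CALL ").partition(":")
            observation = TOOLS[tool_name.strip()](argument.strip())
            history += f"{decision}\nObservation: {observation}\n"
        return "No answer within the step budget."

Frameworks such as LangChain and LlamaIndex package a tested version of this loop together with ready-made tool and data-source integrations, so in practice you rarely write it by hand.
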
