1 Large Language Models
Large Language Models have been swept up in a familiar hype cycle, obscuring a clear view of what they can and cannot do. This chapter cuts through that noise by tracing today’s LLMs back to the 2017 breakthrough of the Transformer architecture and explaining the shift from costly, label-hungry supervised learning toward self-supervised training at scale. By predicting the next token across vast corpora, LLMs internalize syntax and semantics and exhibit emergent abilities, from summarization and translation to code completion and basic reasoning, marking the transition from earlier RNN-era NLP to broadly capable generative models. ChatGPT’s 2022 launch accelerated commercialization, but the foundations predate it.
The chapter contrasts Transformers with RNNs—highlighting self-attention, parallelism, and embeddings—and outlines the evolution into encoder-only (BERT-like) and decoder-only (GPT-like) families suited to discrimination versus generation. It also introduces RLHF as a technique to refine interactive behavior. Beyond language-centric tasks, the models generalize to symbolic formats and diverse applications such as classification, Q&A, document summarization, and code generation. A major theme is the open-source surge: organizations can reduce costs and time-to-value by starting from pretrained models rather than training from scratch. While development and training burdens drop with fine-tuning, deployment and inference remain nontrivial, motivating techniques to optimize models for performance and efficiency in constrained environments.
Relying on closed generalist LLMs entails risks—data leaving organizational boundaries, potential leakage, opacity about model internals and training data, and hallucinations that are hard to audit—making them ill-suited for sensitive or highly regulated uses. Domain-specific LLMs, adapted via transfer learning on high-quality proprietary and public domain data and run privately, can deliver higher accuracy, compliance, and control, often at smaller scales that are more cost- and energy-efficient. The chapter closes by emphasizing that alignment should prioritize helpful, truthful, and harmless intent within domain constraints, and by listing practical prerequisites for readers: working knowledge of Python and PyTorch, familiarity with Transformer fundamentals and training basics, and comfort with common tooling for experimentation.
Figure: Some examples of diverse content an LLM can generate
Figure: The timeline of LLMs since 2019 (image taken from paper [2])
Figure: Order of magnitude of costs for each phase of LLM implementation from scratch
Figure: Order of magnitude of costs for each phase of LLM implementation when starting from a pretrained model
Figure: Ratios of data source types used to train some popular existing LLMs
Figure: Generic model specialization to a given domain
Figure: An LLM trained for tasks on molecule structures (generation and captioning)
Summary
In this chapter you have learned:
- What a Transformer is.
- Which risks and challenges to consider with generalist and/or closed-source LLMs.
- How to decide when to prefer a domain-specific LLM.
- That techniques to train and optimize an open-source LLM on your own data exist and are the core topic of this book.
The next chapter introduces techniques to specialize pretrained models on your own data using the Hugging Face libraries.
References
FAQ
What is the purpose of Chapter 1 and how does it address the LLM hype?
This chapter cuts through the hype around Large Language Models by explaining what LLMs can and cannot do, outlining their pros, risks, and challenges, and contrasting generalist models with domain-specific ones to help readers decide what fits their business needs.
How did Transformers spark the LLM revolution, and how are they different from RNNs?
Transformers, introduced in 2017, replaced sequential processing with self-attention and removed recurrence, enabling full input parallelism and faster training. Unlike RNNs (including LSTM/GRU), which are slow and limited by sequential data flow, Transformers scale better and capture long-range dependencies more effectively.
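To make the parallelism concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. It is a simplified illustration, not the full multi-head layer: the projection matrices, tensor shapes, and toy inputs are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention over a whole sequence at once.

    x:   (batch, seq_len, d_model) token embeddings
    w_*: (d_model, d_model) projection matrices (single head, for illustration)
    """
    q = x @ w_q                          # queries for every position, computed in parallel
    k = x @ w_k                          # keys
    v = x @ w_v                          # values
    scores = q @ k.transpose(-2, -1)     # (batch, seq_len, seq_len) pairwise scores
    scores = scores / (q.size(-1) ** 0.5)
    weights = F.softmax(scores, dim=-1)  # each position attends to every other position
    return weights @ v                   # context-aware representations

# Toy usage: 2 sequences of 5 tokens with 16-dimensional embeddings.
x = torch.randn(2, 5, 16)
w_q, w_k, w_v = (torch.randn(16, 16) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 5, 16])
```

Note that no loop over positions is needed, which is exactly what lets Transformers train in parallel where RNNs must step through the sequence one token at a time.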
Why do LLMs use self-supervised learning, and how does next-word prediction training work?
Self-supervised learning avoids costly human labeling by generating labels from raw text. A common approach removes the next word from a sequence and trains the model to predict it. Comparing predictions with the original word provides the learning signal, enabling training on massive unlabeled corpora.
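The following sketch shows the idea with the Hugging Face transformers library, assuming it is installed; the "gpt2" checkpoint and the sample sentence are chosen purely for illustration. The labels are simply the input tokens themselves, shifted so that each position must predict the token that comes next.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# The labels come from the text itself: the target at each position is the next token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models learn by predicting the next token."
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model shift them internally and compute
# cross-entropy between each prediction and the actual next token.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # the self-supervised training signal
```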
What role do word embeddings play in Transformers?
Word embeddings map tokens to high-dimensional vectors that capture syntactic and semantic relationships. Operating in this numerical space lets models reason about meaning and context, treating words as interrelated rather than isolated items.
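As a rough sketch of that numerical space, the PyTorch snippet below maps token ids to vectors and compares them with cosine similarity. The vocabulary size, embedding dimension, and token ids are hypothetical, and the vectors are untrained here; only after training do related words end up close together.

```python
import torch
import torch.nn.functional as F

# A toy embedding table: 10,000-word vocabulary, 256-dimensional vectors.
embedding = torch.nn.Embedding(num_embeddings=10_000, embedding_dim=256)

# Hypothetical token ids standing in for "king", "queen" and "banana".
king, queen, banana = torch.tensor([17]), torch.tensor([42]), torch.tensor([99])
v_king, v_queen, v_banana = embedding(king), embedding(queen), embedding(banana)

# Cosine similarity is a common way to measure closeness in embedding space.
print(F.cosine_similarity(v_king, v_queen))
print(F.cosine_similarity(v_king, v_banana))
```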
What’s the difference between encoder-only and decoder-only Transformers (BERT vs. GPT)?
The original Transformer used an encoder-decoder architecture. Later, encoder-only models like BERT came to excel at understanding tasks (e.g., classification/prediction), while decoder-only models like GPT excel at generative tasks (e.g., text/code generation). The choice depends on the target task.
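A quick way to feel the difference is through the Hugging Face pipelines, assuming the transformers library is installed; the checkpoints "bert-base-uncased" and "gpt2" are just common examples of each family.

```python
from transformers import pipeline

# Encoder-only (BERT-like): good at understanding, e.g. filling in a masked word.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The Transformer architecture was introduced in [MASK].")[0]["token_str"])

# Decoder-only (GPT-like): good at generation, e.g. continuing a prompt.
generate = pipeline("text-generation", model="gpt2")
print(generate("Large language models can", max_new_tokens=20)[0]["generated_text"])
```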
What is RLHF and why is it used in LLMs like ChatGPT?
RLHF (Reinforcement Learning from Human Feedback) fine-tunes a model to maximize a reward aligned with human preferences. Applied to the GPT-based models behind ChatGPT, it improves helpfulness, safety, and overall performance beyond pretraining alone.
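The sketch below shows only one piece of a typical RLHF recipe, the reward-modeling step, using a pairwise ranking loss that pushes the score of the human-preferred response above the rejected one. It is a simplified illustration with made-up scores, not the full pipeline, which also involves a policy-optimization stage such as PPO.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss commonly used to train the reward model in RLHF:
    widen the margin between the human-preferred and the rejected response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might assign to (chosen, rejected) response pairs.
chosen = torch.tensor([1.2, 0.7, 2.1])
rejected = torch.tensor([0.3, 0.9, 1.0])
print(reward_model_loss(chosen, rejected))
```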
What can LLMs do beyond natural language text?
LLMs handle a wide range of tasks, including language understanding, classification, generation, question answering, summarization, semantic parsing, pattern recognition, basic math, code generation, dialogue, general knowledge, and logical reasoning. They can also work with symbolic formats (e.g., programming code or specialized domain notations).
Why consider open-source LLMs, and how do costs compare to building from scratch?
Open-source models offer choice, control, and cost savings. Starting from a pretrained model reduces development and training costs significantly compared to building a custom LLM from scratch. However, deployment and inference still carry substantial costs and operational challenges. Always review licenses before commercial use.
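As a minimal sketch of what "starting from a pretrained model" means in practice, the snippet below downloads an existing open checkpoint instead of pretraining one, assuming the Hugging Face transformers library is installed. The "gpt2" name is only an example; check the model card and license of whatever checkpoint you actually choose.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reuse the expensive pretraining done by others: download weights rather than
# training from scratch. The checkpoint name is illustrative only.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Open-source LLMs let organizations"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```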
What are the main risks of using closed-source generalist LLMs?
- Data leaves your network (prompted inputs may be exposed or reused per provider terms)
- Potential data leakage due to provider-side security incidents
- Lack of transparency (implementation details, training data, version changes)
- Unknown training data can introduce bias and legal/copyright concerns
- Hallucinations (intrinsic and extrinsic) with limited mitigation
- Code-generation misuse risks and guardrail bypasses