Overview

5 Fundamentals of machine learning

This chapter builds a practical and conceptual foundation for machine learning by centering on the tension between optimization and generalization. It explains how models learn to reduce training loss yet must be evaluated by how well they perform on unseen data, and it frames overfitting as the universal failure mode when optimization goes too far. The material emphasizes reliable evaluation and disciplined experimentation as the backbone of effective model development, setting expectations for how to read learning curves, recognize underfitting versus overfitting, and steer training toward robust, real-world performance.

It then explores why overfitting happens and what enables generalization. Overfitting is amplified by noisy or mislabeled examples, inherently ambiguous targets, and rare features that invite spurious correlations. The chapter introduces the manifold hypothesis: natural data occupies low-dimensional, structured subspaces within high-dimensional input spaces, which lets deep networks generalize primarily by interpolating between training samples along these learned manifolds. Because deep models are smooth, highly expressive function approximators, they can memorize; but with good data and gradual training they also learn manifold structure well enough to interpolate. Consequently, data quality and coverage are paramount—dense, informative sampling of the input space is the most powerful lever for generalization; when data is limited, regularization is used to constrain memorization.

On measurement and practice, the chapter formalizes evaluation protocols: separate training, validation, and test splits; avoid information leaks; and use hold-out, K-fold, or iterated K-fold validation when data is scarce. It recommends establishing common‑sense baselines and guarding against pitfalls such as unrepresentative splits, temporal leakage, and duplicate samples across splits. To improve model fit, start by getting optimization to work (tune learning rate and batch size), select architectures with the right inductive biases for the modality, and increase capacity until overfitting is possible. To improve generalization, prioritize dataset curation and feature engineering, apply early stopping, and regularize via capacity control, weight penalties (L1/L2), and dropout. The overarching message: measure carefully, fit confidently, and regularize deliberately so that interpolation on well-structured data yields reliable performance.
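To make the common-sense baseline concrete, here is a minimal sketch of the kind of check this implies; the label arrays and the binary-classification setup are placeholder assumptions rather than anything taken from the chapter.

```python
import numpy as np

# Hypothetical binary-classification labels; substitute your own splits.
train_labels = np.array([0, 1, 1, 0, 1, 1, 0, 1])
val_labels = np.array([1, 0, 1, 1, 0, 1])

# Common-sense baseline: always predict the training set's majority class.
majority_class = np.bincount(train_labels).argmax()
baseline_accuracy = np.mean(val_labels == majority_class)
print(f"Validation accuracy to beat: {baseline_accuracy:.3f}")

# A model is only worth tuning further once it clearly beats this number
# on the validation split.
```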

Figures

  • Canonical overfitting behavior
  • Some pretty weird MNIST training samples
  • Mislabeled MNIST training samples
  • Dealing with outliers: robust fit vs. overfitting
  • Robust fit vs. overfitting given an ambiguous area of the feature space
  • Effect of noise channels on validation accuracy
  • Different MNIST digits gradually morphing into one another, showing that the space of handwritten digits forms a “manifold”. This image was generated using code from Chapter 17.
  • Difference between linear interpolation and interpolation on the latent manifold. Every point on the latent manifold of digits is a valid digit, but the average of two digits usually isn’t.
  • Uncrumpling a complicated manifold of data
  • Going from a random model to an overfit model, and achieving a robust fit as an intermediate state
  • A dense sampling of the input space is necessary in order to learn a model capable of accurate generalization
  • Simple hold-out validation split
  • Three-fold validation
  • Effect of insufficient model capacity on loss curves
  • Validation loss for a model with appropriate capacity
  • Effect of excessive model capacity on validation loss
  • Feature engineering for reading the time on a clock
  • Original model vs. smaller model on IMDB review classification
  • Original model vs. much larger model on IMDB review classification
  • Effect of L2 weight regularization on validation loss
  • Dropout applied to an activation matrix at training time, with rescaling happening during training. At test time, the activation matrix is unchanged.
  • Effect of dropout on validation loss

Chapter summary

  • The purpose of a machine learning model is to generalize: to perform accurately on never-seen-before inputs. It’s harder than it seems.
  • A deep neural network achieves generalization by learning a parametric model that can successfully interpolate between training samples – such a model can be said to have learned the latent manifold of the training data. This is why deep learning models can only make sense of inputs that are very close to what they’ve seen during training.
  • The fundamental problem in machine learning is the tension between optimization and generalization: to attain generalization, you must first achieve a good fit to the training data, but improving your model’s fit to the training data will inevitably start hurting generalization after a while. Every single deep learning best practice deals with managing this tension.
  • The ability of deep learning models to generalize comes from the fact that they manage to learn to approximate the latent manifold of their data, and can thus make sense of new inputs via interpolation.
  • It’s essential to be able to accurately evaluate the generalization power of your model while you’re developing it. You have at your disposal an array of evaluation methods, from simple hold-out validation, to K-fold cross-validation and iterated K-fold cross-validation with shuffling. Remember to always keep a completely separate test set for final model evaluation, since information leaks from your validation data to your model may have occurred.
  • When you start working on a model, your goal is first to achieve a model that has some generalization power and that can overfit. Best practices to do this include tuning your learning rate and batch size, leveraging better architecture priors, increasing model capacity, or simply training longer (see the capacity sketch after this list).
  • As your model starts overfitting, your goal switches to improving generalization through model regularization. You can reduce your model’s capacity, add dropout or weight regularization, and use early stopping. And naturally, a larger or better dataset is always the number one way to help a model generalize.
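As a rough illustration of the capacity lever mentioned above, here is a minimal sketch contrasting a low-capacity model with a higher-capacity one on an IMDB-style binary classification task. The layer widths, optimizer, and loss are illustrative assumptions, not the chapter's exact configuration.

```python
import keras
from keras import layers

def build_model(units):
    # Same architecture, different capacity: more units per layer makes it
    # easier for the model to fit the training data (and, eventually, to overfit it).
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(units, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

small_model = build_model(4)    # may never overfit: validation loss plateaus high
larger_model = build_model(64)  # enough capacity to overfit, which is the cue to start regularizing
```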


FAQ

What is the tension between optimization and generalization in machine learning?
Optimization is the process of minimizing training loss by fitting the model to the training data; generalization is how well the trained model performs on new, unseen data. If you optimize “too well,” the model starts to memorize specifics of the training set (overfitting), causing validation/test performance to degrade. The goal is good generalization, but you can only directly control optimization, so you must manage training to avoid overfitting.

How can I tell whether my model is underfitting or overfitting?
Early in training, both training and validation losses improve together; the model is underfit and still learning useful patterns. After a point, validation loss plateaus and then rises while training loss keeps dropping—this is overfitting. If you can’t get validation metrics to budge or you can’t make the model overfit at all, you’re likely underfitting due to insufficient capacity or poor priors.

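One practical way to read this off is to plot the training and validation losses recorded by Keras during fitting. A minimal, self-contained sketch with synthetic placeholder data (substitute your own model and arrays):

```python
import numpy as np
import keras
from keras import layers
import matplotlib.pyplot as plt

# Placeholder data and model; substitute your own.
train_data = np.random.random((2000, 20))
train_labels = np.random.randint(0, 2, size=(2000, 1))
model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy")

history = model.fit(train_data, train_labels, epochs=30, batch_size=128,
                    validation_split=0.2, verbose=0)

# Underfitting: both curves still falling together.
# Overfitting: training loss keeps dropping while validation loss turns back up.
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(loss) + 1)
plt.plot(epochs, loss, "b-", label="Training loss")
plt.plot(epochs, val_loss, "r--", label="Validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
```
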
What data issues commonly cause overfitting?
Noisy or mislabeled samples, inherently ambiguous regions of the feature space, and rare features that induce spurious correlations all encourage memorization. For example, adding random noise features can reduce validation accuracy via spurious patterns the model learns. Mitigations include better data curation, label proofreading, and feature selection to remove uninformative inputs.

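The noise-channel effect can be reproduced with a sketch along the following lines; it loads MNIST through Keras, but the training setup and exact shapes here are assumptions rather than the chapter's code.

```python
import numpy as np
from keras.datasets import mnist

(train_images, train_labels), _ = mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255

# Append 784 channels of white noise to every sample...
noise = np.random.random((len(train_images), 784)).astype("float32")
train_images_with_noise = np.concatenate([train_images, noise], axis=1)

# ...and, for comparison, 784 all-zero channels (uninformative but harmless).
zeros = np.zeros((len(train_images), 784), dtype="float32")
train_images_with_zeros = np.concatenate([train_images, zeros], axis=1)

# Training the same classifier on both versions shows the noise version doing
# measurably worse on validation accuracy: the model wastes capacity on
# spurious patterns it finds in the noise. Feature selection removes such inputs.
```
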
What is the manifold hypothesis and why does it matter for deep learning?
The manifold hypothesis states that natural data (images, speech, text, etc.) lies on low-dimensional, structured subspaces within the high-dimensional input space. Deep networks are smooth, differentiable mappings that can approximate these latent manifolds during training. This structure enables models to generalize by interpolating between nearby training points on the manifold.

How does interpolation explain generalization, and what are its limits?
Within a learned approximation of the data manifold, the model can infer new cases by relating them to nearby training samples—interpolating along the manifold (not simple linear averaging in pixel/input space). This yields local generalization. However, it doesn’t handle extreme novelty; humans rely on additional mechanisms (reasoning, abstraction, priors) beyond interpolation.

Why are training data quality and coverage paramount?
Deep learning is essentially curve fitting; it works best when the training set densely covers the input manifold, especially near decision boundaries. More and better data yields a denser sampling of that manifold and improves generalization. When more data isn’t available, you must constrain the model (regularization) or improve the information content of inputs (feature engineering).

How should I split my data, and why isn’t a train/test split enough?
Use three sets: train, validation, and test. You tune hyperparameters using validation performance, which leaks information into the model; the test set must remain untouched for a final, unbiased estimate. Beware pitfalls: ensure splits are representative, avoid temporal leakage in time series, and eliminate duplicates between splits.

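A minimal hold-out split might look like the sketch below; the arrays are synthetic placeholders, and the shuffling step assumes the data is not a time series (for temporal data, the validation split must come after the training data).

```python
import numpy as np

# Placeholder data; substitute your own arrays.
data = np.random.random((50000, 64))
labels = np.random.randint(0, 2, size=(50000,))

num_validation_samples = 10000

# Shuffle first so the validation split is representative.
indices = np.random.permutation(len(data))
data, labels = data[indices], labels[indices]

val_data, val_labels = data[:num_validation_samples], labels[:num_validation_samples]
train_data, train_labels = data[num_validation_samples:], labels[num_validation_samples:]

# Tune hyperparameters against the validation split only; keep a separate,
# untouched test set for a single final evaluation.
```
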
Which evaluation protocols help when data is limited?
  • Simple hold-out validation: fast but can be high-variance with small datasets.
  • K-fold cross-validation: average performance across K folds for a more reliable estimate.
  • Iterated K-fold with shuffling: repeat K-fold multiple times (more precise but computationally expensive).
Even with these, keep a distinct test set for the final check.

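K-fold cross-validation can be sketched as follows. The placeholder data and the `get_model()` helper are assumptions; the helper stands in for any function that returns a freshly compiled Keras model with an accuracy metric.

```python
import numpy as np
import keras
from keras import layers

# Placeholder data; substitute your own.
data = np.random.random((600, 32))
labels = np.random.randint(0, 2, size=(600,))

def get_model():
    # Hypothetical helper: a fresh, compiled model for each fold.
    model = keras.Sequential([
        layers.Dense(16, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

k = 3
fold_size = len(data) // k
validation_scores = []
for fold in range(k):
    start, stop = fold * fold_size, (fold + 1) * fold_size
    validation_data, validation_labels = data[start:stop], labels[start:stop]
    training_data = np.concatenate([data[:start], data[stop:]])
    training_labels = np.concatenate([labels[:start], labels[stop:]])
    model = get_model()
    model.fit(training_data, training_labels, epochs=5, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(validation_data, validation_labels, verbose=0)
    validation_scores.append(accuracy)

# The average across folds is a more stable estimate than a single hold-out split.
print("Mean validation accuracy:", np.mean(validation_scores))
```
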
Training is stuck or unstable—what should I tune first?
Start with gradient descent settings: learning rate and batch size. Too-high learning rates overshoot and stall; too-low rates make progress look flat. Larger batches reduce gradient noise. If issues persist, review optimizer choice and initialization, and ensure the architecture matches the data modality (appropriate priors).

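A minimal sketch of those first knobs in Keras; the layer sizes, learning-rate range, and batch size below are illustrative assumptions.

```python
import numpy as np
import keras
from keras import layers

# Placeholder data; substitute your own.
train_images = np.random.random((5000, 784)).astype("float32")
train_labels = np.random.randint(0, 10, size=(5000,))

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

# Adjust the learning rate first (too high overshoots, too low looks flat),
# then try a larger batch size for less noisy gradient estimates.
model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),  # try values between 1e-2 and 1e-4
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
model.fit(train_images, train_labels, epochs=10, batch_size=256)
```
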
What are the most effective ways to improve generalization and reduce overfitting?
  • Get more or better-curated data; perform feature selection/engineering.
  • Use early stopping to capture the best validation epoch.
  • Reduce capacity (fewer or narrower layers) to limit memorization.
  • Add weight regularization (L1/L2) for smaller models.
  • Add dropout, especially for large models.
Choose architectures with priors that match your data (e.g., convnets for images, sequence models for text/time series).

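Several of these levers can be combined in one model. Here is a hedged sketch of an IMDB-style classifier with L2 weight regularization, dropout, and early stopping; the layer widths, regularization factor, and placeholder data are assumptions.

```python
import numpy as np
import keras
from keras import layers, regularizers

# Placeholder data standing in for multi-hot encoded reviews; substitute your own.
train_data = np.random.randint(0, 2, size=(4000, 1000)).astype("float32")
train_labels = np.random.randint(0, 2, size=(4000,))

model = keras.Sequential([
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.002)),
    layers.Dropout(0.5),
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.002)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="rmsprop", loss="binary_crossentropy",
              metrics=["accuracy"])

# Early stopping halts training once validation loss stops improving
# and restores the weights from the best epoch.
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                  restore_best_weights=True),
]
model.fit(train_data, train_labels, epochs=30, batch_size=512,
          validation_split=0.2, callbacks=callbacks)
```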
