Overview

1 Understanding foundation models

Foundation models mark a shift from crafting many task-specific forecasting models to reusing a single, broadly trained model across diverse problems. Trained on massive and varied datasets, these models can generate useful forecasts out of the box and transfer across domains. In time series, that means one model can handle different frequencies, trends, and seasonalities, and even perform related tasks such as anomaly detection and classification. The chapter motivates this shift, highlights real-world momentum around foundation models, and frames their application to forecasting use cases ranging from retail demand to weather.

The chapter defines a foundation model and distinguishes algorithms (the recipe) from models (the trained result). It identifies four pillars: very large and diverse training data, large parameter counts, broad task applicability, and adaptability via fine-tuning. Most such systems rely on Transformers: time series are tokenized and embedded; positional encodings preserve temporal order; an encoder with self-attention learns rich dependencies; and a decoder with masked attention generates forecasts step by step, using multi-head attention to capture complementary patterns like trend and seasonality. Fine-tuning selectively adjusts parameters to better match a specific scenario without retraining from scratch.
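To make the fine-tuning idea concrete, here is a minimal PyTorch sketch of my own (not code from the chapter), using a toy stand-in for a pretrained forecaster: the pretrained backbone is frozen and only a small output head is trained on the new data.

```python
import torch
from torch import nn

# Toy stand-in for a pretrained forecasting model: a "backbone" plus a small
# output head. Real foundation models are far larger, but the selective
# fine-tuning pattern is the same.
class TinyForecaster(nn.Module):
    def __init__(self, input_len=24, hidden=32, horizon=12):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(input_len, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyForecaster()                 # imagine these weights were pretrained

for p in model.backbone.parameters():    # freeze the pretrained backbone
    p.requires_grad = False

optimizer = torch.optim.Adam(model.head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(64, 24)                  # 64 history windows of length 24
y = torch.randn(64, 12)                  # their 12-step-ahead targets
for _ in range(5):                       # a few adaptation steps
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```

Only the head's parameters change during these steps, which is what lets fine-tuning adapt the model to a new scenario without retraining it from scratch.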

Practically, foundation models simplify pipelines, reduce the expertise needed to get started, work surprisingly well with limited data, and can be reused across tasks. Trade-offs include privacy and data-governance concerns (especially with hosted, proprietary systems), limited control over built-in capabilities or horizons, the possibility that a bespoke model outperforms them for a given niche, and substantial compute and storage demands—though API access can offset infrastructure costs. The chapter emphasizes rigorous evaluation to decide when these models are the right fit and previews the rest of the book: hands-on experiments with leading time-series foundation models, adaptations of large language models to forecasting, and a capstone comparing foundation models to statistical baselines.

Figure: Result of performing linear regression on two different datasets. While the algorithm used to build the linear model stays the same, the resulting model is very different depending on the dataset used.
Figure: Simplified Transformer architecture from a time series perspective. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before going into the decoder. The output then comes from the decoder one value at a time until the entire horizon is predicted.
Figure: Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is the model's abstract representation of the input.
Figure: Visualizing positional encoding. The positional encoding matrix must be the same size as the embedding. Sine is used in even positions, while cosine is used in odd positions. The length of the input sequence runs vertically in this figure.
Figure: The encoder is actually made of many stacked encoders that all share the same architecture. Each encoder consists of a self-attention mechanism and a feed-forward layer.
Figure: Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same embedding. Here, the model assigns more importance (depicted by thicker connecting lines) to closer data points than to those farther away.
Figure: Visualizing the decoder. Like the encoder, the decoder is actually a stack of many decoders. Each is composed of a masked multi-head attention layer, followed by a normalization layer, a multi-head attention layer, another normalization layer, a feed-forward layer, and a final normalization layer. The normalization layers keep the model stable during training.
Figure: Visualizing the decoder in detail. The output of the encoder is fed to the second attention layer inside the decoder; this is how the decoder can generate predictions using information learned by the encoder.
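As a rough sketch of the tokenization and embedding step described in the figures above (one common approach, assumed here for illustration rather than taken from the book), a series can be cut into fixed-length patches and each patch mapped to an embedding vector by a learned linear layer.

```python
import torch
from torch import nn

series = torch.arange(48, dtype=torch.float32)   # a toy series of 48 points

patch_len, d_model = 8, 16                       # illustrative sizes
patches = series.reshape(-1, patch_len)          # "tokens": 6 patches of 8 values

embed = nn.Linear(patch_len, d_model)            # learned embedding of each patch
tokens = embed(patches)
print(tokens.shape)                              # torch.Size([6, 16])
```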

Summary

  • A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
  • Derivatives of the Transformer architecture power most foundation models.
  • Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the ability to forecast even when few data points are available.
  • Drawbacks of using foundation models include privacy concerns and the fact that we do not control the model’s capabilities. A foundation model might also not be the best solution to our problem.
  • Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What is a foundation model?
A foundation model is a very large machine learning model trained on massive, diverse datasets so it can be reused across many tasks. It typically has millions to billions of parameters, generalizes to new scenarios, and can often be adapted (fine-tuned) to a specific use case for better performance.
How is a model different from an algorithm?
An algorithm is the procedure or recipe for learning (the steps a program follows). A model is the learned result of applying that algorithm to data. Using the recipe analogy: the algorithm is the recipe, the data are the ingredients, and the model is the cake.
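A quick way to see the distinction in code, using scikit-learn's LinearRegression as an illustrative stand-in: the same algorithm fit on two different datasets yields two clearly different models.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(20).reshape(-1, 1)

y1 = 2.0 * x.ravel() + np.random.normal(0, 1, 20)    # dataset 1: slope near 2
y2 = -0.5 * x.ravel() + np.random.normal(0, 1, 20)   # dataset 2: slope near -0.5

model1 = LinearRegression().fit(x, y1)   # same algorithm (the recipe)...
model2 = LinearRegression().fit(x, y2)

print(model1.coef_, model2.coef_)        # ...two very different models (the cakes)
```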
What are the defining characteristics of a foundation model?
  • Trained on very large and diverse datasets
  • Large parameter count (often millions to billions)
  • Reusable across different tasks and scenarios
  • Adaptable via fine-tuning to specific domains
Why are foundation models useful for time series forecasting?
A single foundation model can forecast series with varied frequencies, trends, seasonality, and holiday effects, and may also handle related tasks like anomaly detection or time series classification. Such models can produce reasonable predictions even with limited task-specific data and streamline workflows by reducing the need to build many bespoke models.
What is the Transformer architecture and why does it matter here?
The Transformer, introduced in 2017, underpins many top-performing foundation models, including those for time series. It uses attention mechanisms to capture complex, long-range dependencies efficiently. For forecasting, it provides a flexible encoder–decoder framework well-suited to learning rich representations and generating sequence predictions.
How are time series prepared for a Transformer (tokens, embeddings, positional encoding)?
  • The series is split into tokens (e.g., individual values or windows).
  • An embedding layer learns a dense vector representation of these tokens.
  • Positional encoding (sinusoidal patterns) is added so the model knows the order of time steps and can distinguish identical values occurring at different positions (see the sketch after this list).
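The sketch below shows the classic sinusoidal positional encoding, with sine on even embedding dimensions and cosine on odd ones; the sizes are illustrative and the helper is my own, not the book's implementation.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

pe = positional_encoding(seq_len=6, d_model=16)
print(pe.shape)   # (6, 16) -- same shape as the embeddings, so they can be added
```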
How do the encoder and decoder work together to produce forecasts?
  • Encoder: stacks of self-attention and feed-forward layers learn deep representations and dependencies from historical data.
  • Decoder: uses masked attention (to avoid peeking into the future) and attends to the encoder’s output to generate predictions.
  • Generation: a linear projection maps the decoder’s representation to forecast values, which are fed back autoregressively until the horizon is covered (a minimal sketch follows this list).
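Here is a compact sketch of that flow using PyTorch's nn.Transformer; the layer sizes, the random toy data, and the omission of positional encoding are simplifications of my own rather than the book's implementation.

```python
import torch
from torch import nn

d_model, horizon = 16, 5

embed = nn.Linear(1, d_model)                    # embed one value per time step
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
project = nn.Linear(d_model, 1)                  # map back to a forecast value

history = torch.randn(1, 24, 1)                  # one series, 24 past values
src = embed(history)                             # encoder input
                                                 # (positional encoding omitted for brevity)

decoded = history[:, -1:, :]                     # start from the last observed value
for _ in range(horizon):
    tgt = embed(decoded)
    mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
    out = transformer(src, tgt, tgt_mask=mask)   # masked self-attn + cross-attn to encoder
    next_value = project(out[:, -1:, :])         # linear projection of the last step
    decoded = torch.cat([decoded, next_value], dim=1)

forecast = decoded[:, 1:, :]                     # the five generated values
print(forecast.shape)                            # torch.Size([1, 5, 1])
```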
What is self-attention and multi-head attention in simple terms?
Self-attention lets the model weigh how much each past token should influence the current step, learning dependencies across the sequence. Multi-head attention runs several attention mechanisms in parallel so different heads can focus on different patterns (e.g., trends vs. seasonality), enriching the learned representation.
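A small NumPy sketch of scaled dot-product self-attention with toy sizes (my own illustration): each token's query is compared against every token's key, the scores are normalized into weights, and the weights mix the value vectors. Multi-head attention would run several copies of this in parallel with different weight matrices and concatenate the results.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a sequence of embeddings x."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])            # pairwise similarity
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ v, weights                        # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                            # 6 tokens, 8-dim embeddings
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))

output, weights = self_attention(x, wq, wk, wv)
print(weights[-1].round(2))   # how strongly the last token attends to each token
```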
What are the main advantages and drawbacks of foundation models for forecasting?
  • Advantages: out-of-the-box pipelines; useful with limited data; lower expertise barrier; reusable across tasks.
  • Drawbacks: privacy concerns (especially with hosted proprietary models); limited control over capabilities and constraints (e.g., horizon limits, multivariate/exogenous support); may underperform custom models in some domains; high storage and compute needs (mitigable via APIs or private/self-hosted deployments).
What comes next after this chapter?
The book builds a small “foundation-like” model to expose design challenges, then explores purpose-built time series foundation models (e.g., TimeGPT, Lag-Llama, Chronos, Moirai, TimesFM) with real data. It also examines adapting large language models (e.g., PromptCast, Time-LLM) for forecasting, and concludes with a capstone comparing foundation models to classical statistical methods.
