Overview

1 Understanding foundation models

Foundation models mark a shift from building task-specific forecasters to reusing a single, large, pre-trained model across many scenarios. Trained on massive, diverse datasets and equipped with millions to billions of parameters, they can be adapted to new problems via fine-tuning or used directly in a zero-shot manner. In time series, this enables one model to forecast across heterogeneous frequencies and patterns (trends, seasonality, holidays) and to perform related tasks such as anomaly detection and classification. The chapter motivates this paradigm through its growing presence in everyday applications and frames the book’s hands-on focus: defining what foundation models are, clarifying where they excel or struggle, and applying them to practical forecasting problems.

At the architectural core of most foundation models is the Transformer. Time series inputs are tokenized and mapped into embeddings, with positional encodings added to preserve temporal order. The encoder’s self-attention (often multi-headed) learns rich dependencies across time, while the decoder uses masked attention to prevent peeking into the future and cross-attends to encoder outputs to generate forecasts autoregressively across the horizon. A final projection layer returns predictions to the target scale. Understanding this flow—and the associated hyperparameters—helps practitioners fine-tune models effectively and diagnose when certain designs (for example, handling exogenous variables or multivariate targets) may or may not fit a use case.
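To make this flow concrete, here is a minimal sketch that wires those pieces together with PyTorch's built-in nn.Transformer. The single-value tokens, learned positional embeddings, layer sizes, and horizon are illustrative assumptions for this sketch, not the book's design.

```python
import torch
import torch.nn as nn

class TinyTimeSeriesTransformer(nn.Module):
    """Illustrative encoder-decoder Transformer for forecasting (not the book's model)."""
    def __init__(self, d_model=64, nhead=4, num_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Linear(1, d_model)  # each observation becomes one token embedding
        # Learned positional embeddings for brevity; the chapter describes fixed sinusoidal ones.
        self.pos = nn.Parameter(torch.randn(max_len, d_model) * 0.02)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.project = nn.Linear(d_model, 1)  # map the deep representation back to the target scale

    def forward(self, context, targets):
        # context: (batch, context_len); targets: (batch, horizon), used for teacher forcing.
        src = self.embed(context.unsqueeze(-1)) + self.pos[: context.size(1)]
        tgt = self.embed(targets.unsqueeze(-1)) + self.pos[: targets.size(1)]
        # Causal mask: each decoder position may only attend to earlier positions.
        mask = self.transformer.generate_square_subsequent_mask(targets.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.project(out).squeeze(-1)

model = TinyTimeSeriesTransformer()
context = torch.randn(8, 48)  # 8 series, 48 past observations each
targets = torch.randn(8, 12)  # a 12-step horizon, fed to the decoder during training
print(model(context, targets).shape)  # torch.Size([8, 12])
```

At inference time there are no known targets; values are generated one step at a time and fed back into the decoder until the horizon is filled, as the captions below describe.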

The chapter closes by weighing benefits against trade-offs. Benefits include simpler, out-of-the-box pipelines, usefulness with limited data, lower expertise barriers, and reuse across tasks and datasets. Trade-offs involve privacy and governance concerns when using proprietary services, limited control over a model’s built-in capabilities and horizons, the possibility that a specialized model outperforms a general one, and substantial compute and storage needs (mitigable via APIs). Ultimately, adopting a foundation model is an empirical decision guided by performance and cost. The book proceeds to build intuition with a small model, then evaluates leading time series foundation models and LLM-based approaches on real data, culminating in a comparative capstone against classical statistical methods.

Result of performing linear regression on two different datasets. While the algorithm that builds the linear model stays the same, the fitted model differs substantially depending on the dataset used.
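To make the point in this caption concrete, the snippet below applies the same least-squares fit to two synthetic datasets (the data and coefficients are invented for illustration) and recovers very different models.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)

# Two synthetic datasets generated from different underlying trends.
y_a = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)   # upward trend
y_b = -0.5 * x + 7.0 + rng.normal(scale=1.0, size=x.size)  # downward trend

# The fitting procedure is identical; only the data changes.
slope_a, intercept_a = np.polyfit(x, y_a, deg=1)
slope_b, intercept_b = np.polyfit(x, y_b, deg=1)

print(f"dataset A: y ~ {slope_a:.2f}x + {intercept_a:.2f}")
print(f"dataset B: y ~ {slope_b:.2f}x + {intercept_b:.2f}")
```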
Simplified Transformer architecture from a time series perspective. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before entering the encoder. The output then comes from the decoder one value at a time until the entire horizon is predicted.
Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is an abstract representation of the input built by the model.
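One common tokenization scheme, assumed here purely for illustration, cuts the series into fixed-length patches and projects each patch with a learned linear layer; the patch length and embedding size below are arbitrary.

```python
import torch
import torch.nn as nn

series = torch.arange(32, dtype=torch.float32)  # one univariate series of length 32

patch_len, d_model = 8, 16
# Tokenize: cut the series into 4 non-overlapping patches of 8 values each.
patches = series.unfold(dimension=0, size=patch_len, step=patch_len)  # shape (4, 8)
# Embed: a linear layer (learned during training) maps each patch to a d_model-dimensional vector.
embedding = nn.Linear(patch_len, d_model)
tokens = embedding(patches)  # shape (4, 16): the model's abstract representation of the input

print(patches.shape, tokens.shape)  # torch.Size([4, 8]) torch.Size([4, 16])
```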
Visualizing positional encoding. Note that the positional encoding matrix must be the same size as the embedding. Also note that sine is used for even-indexed dimensions, while cosine is used for odd-indexed dimensions. The length of the input sequence runs along the vertical axis of the figure.
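The fixed sine/cosine encoding can be computed directly. A minimal sketch, assuming the standard sinusoidal formulation and arbitrary sizes:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal encoding: sine in even-indexed dimensions, cosine in odd ones."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1), one row per time step
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates             # (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])        # even-indexed dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])        # odd-indexed dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16): same size as the embedding, added to it element-wise
```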
We can see that the encoder is actually a stack of encoder blocks that all share the same architecture. Each block is made of a self-attention mechanism followed by a feed-forward layer.
Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same sequence. In this case, the model assigns more importance (depicted by thicker connecting lines) to closer data points than to those farther away.
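The importance weights in this figure correspond to scaled dot-product attention. A minimal single-head sketch, with random matrices standing in for the learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 16, 8

tokens = rng.normal(size=(seq_len, d_model))          # embedded, position-encoded inputs
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
scores = Q @ K.T / np.sqrt(d_k)       # similarity between every pair of tokens
weights = softmax(scores, axis=-1)    # each row sums to 1: the "thickness" of the connections
output = weights @ V                  # each token becomes a weighted mix of the value vectors

print(weights.shape, output.shape)    # (6, 6) (6, 8)
```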
Visualizing the decoder. Like the encoder, the decoder is actually a stack of decoder blocks. Each is composed of a masked multi-headed attention layer, followed by a normalization layer, a multi-headed attention layer, another normalization layer, a feed-forward layer, and a final normalization layer. The normalization layers keep the model stable during training.
Visualizing the decoder in detail. We see that the output of the encoder is fed to the second attention layer inside the decoder. This is how the decoder can generate predictions using information learned by the encoder.
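The look-ahead mask and the cross-attention described in these two captions can be sketched in a few lines; the learned projections are omitted and the vectors are random, purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d_k, enc_len, dec_len = 8, 6, 4

# Masked self-attention: a look-ahead mask stops each position from seeing the future.
Q = rng.normal(size=(dec_len, d_k))
K = rng.normal(size=(dec_len, d_k))
scores = Q @ K.T / np.sqrt(d_k)
mask = np.triu(np.ones((dec_len, dec_len), dtype=bool), k=1)  # True above the diagonal
scores[mask] = -np.inf                                        # future positions get zero weight
self_attn_weights = softmax(scores, axis=-1)

# Cross-attention: queries come from the decoder, keys/values from the encoder's output.
encoder_output = rng.normal(size=(enc_len, d_k))
cross_scores = Q @ encoder_output.T / np.sqrt(d_k)
cross_weights = softmax(cross_scores, axis=-1)

print(np.round(self_attn_weights, 2))  # upper triangle is 0: no peeking into the future
print(cross_weights.shape)             # (4, 6): each decoder step attends over the encoder output
```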

Summary

  • A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
  • Derivatives of the Transformer architecture power most foundation models.
  • Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the ability to forecast even when few data points are available.
  • Drawbacks include privacy concerns, limited control over the model’s capabilities, and the possibility that a foundation model is not the best solution to a given problem.
  • Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What is a foundation model?
A foundation model is a large machine learning model trained on very large and diverse datasets so it can be applied to a wide variety of tasks. It typically has millions or billions of parameters and can be adapted to a specific use case through fine-tuning.

How do foundation models differ from traditional, task-specific forecasting models?
Traditional approaches train a new model for each dataset and use case. Foundation models are trained once on massive, diverse data and can be reused across many scenarios, often producing reasonable forecasts zero-shot and optionally improved via fine-tuning on your data.

What are the four key characteristics of a foundation model?
1) Trained on very large, diverse datasets. 2) Very large parameter count. 3) Applicable to multiple tasks. 4) Adaptable via fine-tuning.

How are foundation models used for time series forecasting?
A single model can forecast series with different frequencies, trends, seasonalities, and holiday effects. More advanced models can also perform related tasks such as anomaly detection and time series classification.

Why is the Transformer architecture central to foundation models for time series?
Transformers efficiently capture complex dependencies. Many foundation time series models use the Transformer (full, decoder-only, or adapted versions), making it important for understanding capabilities, hyperparameters, and limitations when fine-tuning.

What are tokens and embeddings in this context?
Tokens are pieces of the time series (e.g., individual values or windows). The embedding layer learns a numerical vector representation that encodes features of the series in a form the model can process.

Why is positional encoding needed and how does it work?
Time order matters in forecasting. Fixed positional encoding (using sine and cosine at multiple frequencies) is added to embeddings so the model knows where each token occurs in the sequence, avoiding confusion between identical values at different times.

What does the Transformer encoder learn?
The encoder uses self-attention (often multi-headed) to learn relationships among tokens, e.g., nearby or seasonal dependencies. Multiple attention heads capture different patterns (such as trend vs. seasonality), producing a deep representation for the decoder.

How does the Transformer decoder generate forecasts?
The decoder uses masked multi-headed attention to prevent peeking into the future, then cross-attends to the encoder’s output. A linear projection maps the deep representation to predictions. Values are generated one step at a time and fed back in until the full horizon is produced.

What are the main advantages and drawbacks of using foundation models for forecasting?
Advantages: out-of-the-box pipelines, less expertise required, workable with few data points, reusable across tasks. Drawbacks: privacy concerns (especially with hosted/proprietary models), limited control over capabilities, may not be optimal for specific cases, and high resource requirements. Choose based on experiments that balance accuracy, costs, and constraints (e.g., self-hosted vs. API access).
