Overview

1 Understanding foundation models

Foundation models mark a shift from building narrow, data-specific models toward training a single, large model on vast and diverse datasets so it can be reused across many tasks. The chapter clarifies the difference between an algorithm (the recipe) and a model (the learned result on data) and defines key traits of foundation models: they are big (many parameters), trained on broad data, useful for multiple tasks, and adaptable via fine-tuning. In time series, this enables one model to forecast series with varied frequencies, trends, and seasonal patterns, and to tackle related tasks such as anomaly detection and classification, often working zero-shot and improving further with targeted fine-tuning.
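To make the algorithm-versus-model distinction concrete, here is a minimal sketch (with made-up synthetic data) showing that the same linear regression algorithm, applied to two different datasets, yields two different models:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
x = np.arange(20).reshape(-1, 1)
y_a = 2.0 * x.ravel() + rng.normal(0, 1, 20)    # dataset A: roughly y = 2x
y_b = -0.5 * x.ravel() + rng.normal(0, 1, 20)   # dataset B: roughly y = -0.5x

# Same algorithm (ordinary least squares) run on two different datasets
model_a = LinearRegression().fit(x, y_a)
model_b = LinearRegression().fit(x, y_b)

# Two different models: the learned slope and intercept differ
print(model_a.coef_, model_a.intercept_)
print(model_b.coef_, model_b.intercept_)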

The chapter introduces the Transformer, the backbone of most foundation models, from a time series perspective. Raw sequences are tokenized and passed through embeddings, then enriched with positional encoding so temporal order is preserved. The encoder’s multi-head self-attention learns dependencies (e.g., trends, seasonality) across the history, while the decoder uses masked attention to generate forecasts step by step without peeking into the future, guided by the encoder’s learned representation. A final projection maps deep representations back to forecast values, and predictions are fed back iteratively until the full horizon is produced. Many time series foundation models adopt this architecture wholesale, use decoder-only variants, or introduce task-specific adaptations.
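To make this pipeline concrete, here is a minimal, untrained sketch using PyTorch's built-in nn.Transformer. The layer sizes, history length, and 12-step horizon are arbitrary assumptions, and positional encoding is left out for brevity; it only illustrates the embed, encode, decode-step-by-step, and project loop described above, not any particular foundation model.

import torch
import torch.nn as nn

d_model, horizon = 64, 12

embed = nn.Linear(1, d_model)                  # each value becomes a d_model-dim token embedding
transformer = nn.Transformer(d_model=d_model, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
project = nn.Linear(d_model, 1)                # map deep representations back to forecast values

history = torch.randn(1, 48, 1)                # (batch, time steps, 1): the observed past
src = embed(history)                           # positional encoding omitted for brevity

# Start from the last observed value and generate the horizon step by step
decoded = history[:, -1:, :]
for _ in range(horizon):
    tgt = embed(decoded)
    mask = transformer.generate_square_subsequent_mask(decoded.size(1))
    out = transformer(src, tgt, tgt_mask=mask)  # masked self-attention + cross-attention to the encoder
    next_value = project(out[:, -1:, :])        # project the latest position to a value
    decoded = torch.cat([decoded, next_value], dim=1)

forecast = decoded[:, 1:, :]                    # shape (1, 12, 1): the predicted horizon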

Using foundation models offers practical benefits: quick, out-of-the-box pipelines; strong performance even with few local data points; lower expertise barriers; and reusability across datasets and tasks. The trade-offs include privacy considerations for hosted solutions, limited control over a model’s built-in capabilities, the possibility that a custom model may outperform in niche settings, and higher compute and storage requirements. The chapter sets expectations and boundaries for when these models excel, and previews hands-on work with several time series foundation models, applying zero-shot and fine-tuned forecasting to real datasets, exploring anomaly detection, extending language models to time series, and rigorously evaluating against traditional statistical baselines.

Result of performing linear regression on two different datasets. While the algorithm used to build the linear model stays the same, the resulting model is very different depending on the dataset used.
Simplified Transformer architecture from a time series perspective. The raw series enters at the bottom left of the figure and flows through an embedding layer and positional encoding before going into the decoder. The output then comes from the decoder one value at a time until the entire horizon is predicted.
Visualizing the result of feeding a time series through an embedding layer. The input is first tokenized, and an embedding is learned. The result is the model's abstract representation of the input.
Visualizing positional encoding. Note that the positional encoding matrix must be the same size as the embedding. Also note that sine is used in the even dimensions of the encoding, while cosine is used in the odd dimensions. The length of the input sequence runs vertically in this figure.
We can see that the encoder is actually a stack of many encoders, which all share the same architecture. Each encoder is made of a self-attention mechanism and a feed forward layer.
Visualizing the self-attention mechanism. This is where the model learns relationships between the current token (dark circle) and past tokens (light circles) in the same embedding. In this case, the model assigns more importance (depicted by thicker connecting lines) to closer data points than those farther away.
Visualizing the decoder. Like the encoder, the decoder is actually a stack of many decoders. Each is composed of a masked multi-headed attention layer, followed by a normalization layer, a multi-headed attention layer, another normalization layer, a feed forward layer, and a final normalization layer. The normalization layers are there to keep the model stable during training.
Visualizing the decoder in detail. We see that the output of the encoder is fed to the second attention layer inside the decoder. This is how the decoder can generate predictions using information learned by the encoder.

Summary

  • A foundation model is a very large machine learning model trained on massive amounts of data that can be applied to a wide variety of tasks.
  • Derivatives of the Transformer architecture power most foundation models.
  • Advantages of using foundation models include simpler forecasting pipelines, a lower entry barrier to forecasting, and the ability to forecast even when few data points are available.
  • Drawbacks of using foundation models include privacy concerns and the fact that we do not control the model’s capabilities. A foundation model may also not be the best solution to a given problem.
  • Some forecasting foundation models were designed with time series in mind, while others repurpose available large language models for time series tasks.

FAQ

What is a foundation model and how does it differ from traditional ML models?
A foundation model is trained on very large, diverse datasets and is designed to work across many tasks. Unlike traditional, data-specific models built for a single use case, the same foundation model can be reused for multiple scenarios and adapted via fine-tuning. These models are typically large (millions to billions of parameters) and capture broad patterns that transfer to new problems.
What’s the difference between an algorithm and a model?
An algorithm is the procedure or recipe for learning (for example, linear regression’s optimization steps). A model is the outcome of running that algorithm on a particular dataset—the learned parameters that make predictions. Same algorithm + different data = different model.
What criteria make a model a “foundation model”?
Four elements typically apply: (1) training data that is very large and diverse, (2) a large-capacity model with many parameters, (3) the ability to perform multiple tasks, and (4) adaptability via fine-tuning to a specific scenario.
Why are foundation models useful for time series forecasting?
A single model can forecast series with different frequencies, trends, and seasonal patterns without retraining from scratch. Advanced models can also handle related tasks such as anomaly detection and time series classification. This reusability simplifies workflows and speeds up experimentation across domains.
What’s the difference between zero-shot forecasting and fine-tuning?
Zero-shot forecasting uses the foundation model as-is to produce predictions on new data based on patterns it has already learned. Fine-tuning adapts the model to your specific data by updating a subset of parameters, often improving accuracy. Use zero-shot when you have little data or need speed; fine-tune when you have representative data and need higher performance.
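As a rough illustration of the difference in code, here is a generic PyTorch sketch; the small model below is only a stand-in for a pretrained foundation model, not a specific library's API:

import torch
import torch.nn as nn

# Stand-in for a pretrained foundation model: a body plus an output head.
# Real foundation models are far larger, but the usage pattern is the same.
pretrained_model = nn.Sequential(nn.Linear(48, 256), nn.ReLU(), nn.Linear(256, 12))
history = torch.randn(8, 48)    # 8 series, each with 48 past values

# Zero-shot: use the model as-is to predict the next 12 values
with torch.no_grad():
    zero_shot_forecast = pretrained_model(history)

# Fine-tuning: freeze the body, unfreeze only the final layer, then train on local data
for param in pretrained_model.parameters():
    param.requires_grad = False
for param in pretrained_model[-1].parameters():
    param.requires_grad = True

optimizer = torch.optim.Adam(
    [p for p in pretrained_model.parameters() if p.requires_grad], lr=1e-4)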
What are tokens and embeddings in time series Transformers?
Tokens are units the model processes—individual values or short windows from the time series. The embedding layer converts these tokens into learned numerical vectors that capture relevant features in a form the model can use. These embeddings are the input to subsequent Transformer layers.
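A minimal sketch of one possible tokenization scheme, assuming fixed-length windows ("patches") as tokens and a learned linear embedding; actual foundation models differ in how they tokenize:

import torch
import torch.nn as nn

series = torch.randn(96)                          # a univariate series with 96 observations
patch_len, d_model = 8, 64

# Tokenize: split the series into non-overlapping fixed-length windows
tokens = series.unfold(0, patch_len, patch_len)   # shape (12, 8): 12 tokens of 8 values each

# Embed: a learned linear map turns each token into a d_model-dimensional vector
embedding = nn.Linear(patch_len, d_model)
embedded = embedding(tokens)                      # shape (12, 64)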
What is positional encoding and why is it important?
Positional encoding injects information about the ordering of time steps into the embeddings, which is crucial for sequences. Commonly, fixed sinusoidal encodings (sine and cosine at multiple frequencies) are added so the model knows where each token occurs and can distinguish identical values at different times.
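A sketch of the standard fixed sinusoidal encoding (the formula from the original Transformer paper; a given foundation model may use a different variant), which is simply added to the token embeddings:

import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-torch.log(torch.tensor(10000.0)) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # sine in the even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # cosine in the odd dimensions
    return pe

# Same shape as the token embeddings, so the two can simply be added together
pe = sinusoidal_positional_encoding(seq_len=12, d_model=64)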
How do the encoder and decoder of a Transformer forecast a time series?
The encoder uses self-attention (often multi-headed) to learn dependencies within the historical sequence and produce a rich representation. The decoder generates forecasts step by step using masked attention (to avoid peeking into the future) and cross-attention to the encoder’s output. A final projection layer maps the decoder’s representation to actual forecast values, which are then fed back to predict the next step.
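A sketch of single-head scaled dot-product attention with the causal mask the decoder uses; the projections below are randomly initialized purely for illustration (real models use multiple heads and weights learned during training):

import torch
import torch.nn as nn

d_model, seq_len = 64, 12
x = torch.randn(seq_len, d_model)                  # 12 embedded tokens

# Learned projections to queries, keys, and values (a single head shown)
w_q, w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
q, k, v = w_q(x), w_k(x), w_v(x)

# Causal mask: -inf above the diagonal blocks attention to future time steps
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = q @ k.transpose(0, 1) / d_model ** 0.5    # (12, 12) raw attention scores
weights = torch.softmax(scores + mask, dim=-1)     # each row sums to 1 over past positions only
output = weights @ v                               # each output mixes only current and past values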
What are the main advantages and drawbacks of foundation models for forecasting?
Advantages: fast, out-of-the-box pipelines; lower expertise required to get started; viable with limited data; reusable across tasks. Drawbacks: potential privacy concerns with hosted models; limited control over capabilities and constraints (e.g., horizon, multivariate support, exogenous features); may be suboptimal for some use cases; high storage and compute needs.
What practical considerations matter when deploying a foundation forecasting model?
Plan for resource requirements (storage, memory, and often GPUs) or consider API-based access to reduce infrastructure overhead. Address data privacy and governance—use private or self-hosted deployments if needed. Understand model limits (supported horizons, features, and modalities) and rigorously evaluate against domain-specific baselines before adoption.
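For that last point, here is a minimal sketch of comparing a model's forecast against a seasonal naive baseline using mean absolute error; all values below are made up for illustration:

import numpy as np

# Hypothetical hold-out split for a monthly series (values are made up)
y_train = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118], dtype=float)
y_test = np.array([115, 126, 141, 135, 125, 149], dtype=float)
model_forecast = np.array([117, 124, 138, 137, 128, 145], dtype=float)  # stand-in predictions

# Seasonal naive baseline: repeat the value from one season (12 steps) earlier
season = 12
naive_forecast = y_train[-season:][:len(y_test)]

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

print("model MAE:", mae(y_test, model_forecast))
print("seasonal naive MAE:", mae(y_test, naive_forecast))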
