Overview

2 Building a foundation model

This chapter provides a hands-on walkthrough of building a small-scale time series foundation model to illustrate the core ideas behind pretraining, transfer learning, and fine-tuning. Instead of starting with complex Transformers, it adopts the lightweight and generalizable N-BEATS architecture to make the process accessible and fast while still showcasing how foundation models learn broad patterns and adapt to new tasks. Along the way, it highlights practical considerations—like data frequency, forecast horizons, evaluation metrics, and compute/data constraints—that shape how robust foundation models are developed and deployed.

The chapter first demystifies N-BEATS, emphasizing neural basis expansion and a simple, fully connected design organized into blocks and stacks. Each block generates a forecast and a backcast, enabling sequential residual learning across stacks to capture information missed earlier. With this foundation, the model is pretrained on the monthly M3 dataset (learning from diverse series and domains), then saved and reused via transfer learning. Applied to a new monthly series, three approaches are compared—zero-shot forecasting with the pretrained model, fine-tuning that model on the target series, and training a model from scratch. Evaluated with MAE and sMAPE, the zero-shot pretrained model performs best in this monthly setting, underscoring the power of generalization from broad pretraining.

The chapter then tests robustness across frequencies by applying the monthly-pretrained model to a daily temperature series. Here, zero-shot performance degrades relative to a model trained specifically on the daily data, demonstrating that frequency shifts introduce different temporal patterns the pretrained model did not learn. This motivates practical lessons: data frequency matters, horizon limits can be restrictive (suggesting generative approaches for longer forecasts), fine-tuning depth and steps must be chosen carefully, and real foundation models demand substantial, high-variety datasets and significant compute. The chapter closes by reflecting on these challenges and setting up the exploration of larger, purpose-built foundation models in subsequent chapters.

Figures

  • Visualizing the impact of basis expansion. On the left, without basis expansion, we are stuck with a linear model. On the right, after a second-degree polynomial basis expansion, we can fit a quadratic curve that follows the data more closely.
  • Architecture of N-BEATS. N-BEATS is made of stacks, which are made of blocks; each block outputs a forecast and a backcast.
  • The first four series of the monthly M3 dataset. Each plot's title is the label of the corresponding series in the M3 dataset.
  • Monthly volume of antidiabetic drug prescriptions in Australia, from 1991 to 2008.
  • Predictions plotted against the actual values. The predictions mostly overlap with the actual values, but those from the pretrained model with zero-shot forecasting are the closest.
  • Predictions on daily data using the pretrained model and a data-specific model. The pretrained model fails to make accurate predictions: the dashed line (zero-shot forecasts) never overlaps the solid line (actual values), while the forecasts from the trained model (dotted line) are closer to the actual values.

Summary

  • A pretrained model is trained on large amounts of data so that it can be reused in other scenarios.
  • Transfer learning is the use of a pretrained model on a dataset that the model has never seen.
  • Pretrained models can generate zero-shot predictions or be fine-tuned. With zero-shot predictions, the model never trains on the new dataset, while fine-tuning lets the model specialize in the task at hand by training for a few steps.
  • We can easily build a small pretrained model, but it will not perform well on frequencies it was not trained on, and we are limited to a fixed forecast horizon.
  • Building foundation models is hard. They require massive amounts of data and expensive resources that are not available to the vast majority of practitioners.

FAQ

What is N-BEATS, and why is it a good choice for a tiny foundation model?

N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series forecasting) is a simple, fully connected deep learning architecture that avoids handcrafted time series components (like explicit trend and seasonality terms). It learns complex patterns via neural basis expansion, generalizes well across series, and trains quickly, even on CPU, making it ideal for pretraining and transfer learning at small scale.
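
As a minimal sketch, instantiating such a compact model with the neuralforecast library might look like this; the horizon, input window, and step count are the chapter's values, and everything else is left at the library defaults.

    from neuralforecast.models import NBEATS

    # A compact N-BEATS model: forecast 12 steps ahead from a
    # 24-step lookback window, trained for 1,000 gradient steps
    model = NBEATS(h=12, input_size=24, max_steps=1000)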

What is basis expansion, and how does N-BEATS use it?

Basis expansion maps inputs into a richer feature space (e.g., polynomial or logarithmic features) to capture nonlinear relationships. In N-BEATS, the model learns the expansion coefficients (Θ) and the basis functions (g) directly, hence "neural basis expansion". Each block outputs a forecast (future values) and a backcast (a reconstruction of its input) using these learned bases.
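
A minimal NumPy illustration of classical (non-neural) basis expansion: a linear model fit on the raw input underfits a quadratic signal, while the same linear fit in a polynomially expanded feature space captures it. The data here is synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(42)
    x = np.linspace(-3, 3, 100)
    y = 1.5 * x**2 - x + rng.normal(scale=0.5, size=x.size)  # quadratic signal plus noise

    # Linear model on the raw input: design matrix [1, x]
    X_raw = np.column_stack([np.ones_like(x), x])
    w_raw, *_ = np.linalg.lstsq(X_raw, y, rcond=None)  # underfits the curvature

    # Second-degree polynomial basis expansion: design matrix [1, x, x^2]
    X_poly = np.column_stack([np.ones_like(x), x, x**2])
    w_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
    # A linear fit in the expanded space is a quadratic fit in the original space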

How do blocks, stacks, forecasts, backcasts, and residuals work in N-BEATS?

Each block has fully connected layers that produce a forecast and a backcast. The backcast (what the block learned from the input) is subtracted from the input to form residuals, which are passed to the next block so it learns what was missed. A stack sums its blocks' forecasts into a partial prediction; the full model sums the stack predictions to produce the final forecast.
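
To make the residual flow concrete, here is a stripped-down PyTorch sketch of a generic block and stack. It mirrors the forecast/backcast mechanism described above but omits N-BEATS specifics such as basis layers, stack types, and weight sharing, so treat it as an illustration rather than the actual architecture.

    import torch
    import torch.nn as nn

    class Block(nn.Module):
        """A generic block: an MLP with a backcast head and a forecast head."""
        def __init__(self, input_size, horizon, hidden=256):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(input_size, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.backcast_head = nn.Linear(hidden, input_size)
            self.forecast_head = nn.Linear(hidden, horizon)

        def forward(self, x):
            h = self.mlp(x)
            return self.backcast_head(h), self.forecast_head(h)

    class Stack(nn.Module):
        """Chains blocks through residuals and sums their forecasts."""
        def __init__(self, n_blocks, input_size, horizon):
            super().__init__()
            self.blocks = nn.ModuleList(
                [Block(input_size, horizon) for _ in range(n_blocks)]
            )

        def forward(self, x):
            residual = x
            forecast = 0.0
            for block in self.blocks:
                backcast, block_forecast = block(residual)
                residual = residual - backcast        # what this block missed
                forecast = forecast + block_forecast  # accumulate partial forecasts
            return residual, forecast

The full model chains several such stacks, feeding each stack the previous one's residual and summing all the stack forecasts.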

What does pretraining mean in this context, and how was it done?

Pretraining means training on a large, diverse dataset to learn general patterns. Here, N-BEATS was pretrained on the monthly portion of the M3 dataset (1,428 series) with horizon h=12 and input_size=24, for 1,000 training steps. The model was then saved and later loaded for transfer learning. Data for NeuralForecast must include the columns unique_id (series ID), ds (timestamp), and y (value).
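
A hedged sketch of that pretraining flow. The checkpoint path is hypothetical, and loading M3 via the datasetsforecast package is one convenient option; the chapter's exact loading code may differ.

    from datasetsforecast.m3 import M3
    from neuralforecast import NeuralForecast
    from neuralforecast.models import NBEATS

    # Load the 1,428 monthly M3 series; columns are unique_id, ds, y
    Y_df, *_ = M3.load(directory='./data', group='Monthly')

    nf = NeuralForecast(
        models=[NBEATS(h=12, input_size=24, max_steps=1000)],
        freq='M',  # monthly frequency
    )
    nf.fit(df=Y_df)  # pretraining: one model learns across all series

    # Persist the pretrained model for later transfer learning
    nf.save(path='./pretrained_nbeats', overwrite=True, save_dataset=False)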

Why train by "steps" instead of "epochs," and how many steps should I use?

A step is one gradient update; an epoch is a full pass over the dataset. Steps are more consistent across datasets of different sizes. Too few steps may undertrain the model. In the chapter's setup, 1,000 steps was a good default for pretraining; fine-tuning used far fewer (e.g., 10) to avoid overfitting and long runtimes.

What are transfer learning and zero-shot forecasting, and when should I use them?

Transfer learning applies a pretrained model to new data. Zero-shot forecasting uses the pretrained model directly, with no additional training, which is ideal when you have limited target data and need fast predictions. In the monthly antidiabetic prescription example, zero-shot forecasts from the pretrained model performed best among the tested approaches.
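
Zero-shot forecasting with a saved NeuralForecast model amounts to loading it and predicting on the new series, roughly as below; df_new is a stand-in for the target dataset and must use the same unique_id/ds/y layout.

    from neuralforecast import NeuralForecast

    # Load the pretrained model (path from the earlier sketch)
    nf = NeuralForecast.load(path='./pretrained_nbeats')

    # Zero-shot: forecast the next 12 months of the new series
    # without any further training
    zero_shot_preds = nf.predict(df=df_new)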

What is fine-tuning, and how was it applied to N-BEATS here?

Fine-tuning trains the pretrained model on the target dataset to specialize it. Typically, you train only the last layers to save time and reduce overfitting, but here the entire small model was fine-tuned briefly (e.g., for 10 steps) using NeuralForecast's fit after reducing max_steps. Fine-tuning may help, but its benefit depends on the data, and it can risk overfitting if overdone.
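
One way to express that brief fine-tuning pass. How max_steps is reduced on a loaded model depends on the library version; the attribute assignments below are assumptions about the internals, not the library's documented API.

    from neuralforecast import NeuralForecast

    nf = NeuralForecast.load(path='./pretrained_nbeats')

    # Assumption: cap further training at 10 gradient steps by
    # overwriting the loaded model's settings before refitting
    nf.models[0].max_steps = 10
    nf.models[0].trainer_kwargs['max_steps'] = 10

    nf.fit(df=df_new)         # brief fine-tuning on the target series
    finetuned_preds = nf.predict()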

How were the approaches evaluated on monthly data, and what were the results?

Using MAE and sMAPE on a 12-step test set, the pretrained model in zero-shot mode achieved the best scores (MAE ≈ 1.59, sMAPE ≈ 3.56%), outperforming both the fine-tuned model and a model trained from scratch on the target series.
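
For reference, both metrics are easy to compute with NumPy. This sMAPE uses the common symmetric definition reported as a percentage; libraries differ on the exact denominator, so results can vary slightly across implementations.

    import numpy as np

    def mae(y, y_hat):
        """Mean absolute error."""
        return np.mean(np.abs(y - y_hat))

    def smape(y, y_hat):
        """Symmetric mean absolute percentage error, in percent."""
        return 100 * np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))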

Can a model pretrained on monthly data forecast daily data well?

Generally, no. Frequency mismatch degrades performance because daily series exhibit different, more granular patterns (e.g., weekly seasonality). In the experiment, the pretrained monthly model performed poorly on daily data (MAE ≈ 2.59, sMAPE ≈ 8.69%), while an N-BEATS model trained from scratch on the daily data performed much better (MAE ≈ 1.34, sMAPE ≈ 4.87%).

What are the key challenges in building robust foundation forecasting models?

  • Frequency generalization: models must handle diverse sampling rates (daily, monthly, and so on).
  • Horizon flexibility: fixed horizons limit use cases; generative approaches can extend horizons.
  • Compute and scale: large models require substantial GPU time and engineering.
  • Data scarcity and aggregation: compared to text, time series data is less abundant and harder to unify across domains, making broad, diverse pretraining datasets difficult to assemble.
