Oct. 14, 2023
Abhimanyu Das, Weihao Kong, Rajat Sen, Yichen Zhou
ICML 2024
https://arxiv.org/pdf/2310.10688
1. Motivation
- Inspired by the success of large language models (LLMs) in NLP, the authors aim to build a foundation model for time-series forecasting.
2. Contribution
- Unlike previous approaches that train a separate model for each dataset, this work shows that a single pretrained foundation model can achieve strong zero-shot forecasting performance across diverse datasets.
3. Model
- The model uses a patched decoder-style attention architecture.
- Despite its relatively small size (200M parameters), it performs robustly across different forecast horizons, context lengths, and temporal granularities.
1) Training Data
- A large pretraining corpus of roughly 100B time points was constructed, drawn mainly from Google Trends and Wikipedia pageview data, together with synthetic time series.
2) Input Data Preprocessing
- The time series is split into non-overlapping patches of fixed length (e.g., 32 points).
- Each patch is mapped through a residual MLP block to a token embedding, and positional encodings are added to preserve patch order.
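A minimal PyTorch sketch of this patching/embedding step is below. Module names and all sizes are assumptions for illustration (not the paper's code), and a single linear layer stands in for the paper's residual MLP block.

```python
import torch
import torch.nn as nn

PATCH_LEN = 32   # assumed input patch length
D_MODEL = 256    # assumed model dimension

class PatchEmbedding(nn.Module):
    def __init__(self, patch_len: int = PATCH_LEN, d_model: int = D_MODEL,
                 max_patches: int = 512):
        super().__init__()
        self.patch_len = patch_len
        self.proj = nn.Linear(patch_len, d_model)       # patch of values -> token
        self.pos = nn.Embedding(max_patches, d_model)   # learned positional encoding

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, context_len) with context_len divisible by patch_len
        b, t = series.shape
        patches = series.reshape(b, t // self.patch_len, self.patch_len)
        tokens = self.proj(patches)                     # (batch, n_patches, d_model)
        positions = torch.arange(tokens.shape[1], device=series.device)
        return tokens + self.pos(positions)             # add patch-order information
```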
3) Model Architecture
- Follows a standard decoder-only transformer architecture: stacked causal multi-head self-attention and feed-forward layers.
- The input is the sequence of past patch tokens; each output token is decoded into a patch of future values, and the output patch can be longer than the input patch so that fewer autoregressive steps are needed.
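A hedged sketch of such a decoder-only stack, using PyTorch's `TransformerEncoderLayer` with a causal mask as a stand-in for the stacked self-attention/feed-forward layers; layer counts, widths, and the `output_patch_len` head size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchedDecoder(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 4,
                 output_patch_len: int = 128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, output_patch_len)  # token -> future patch

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_patches, d_model); the causal mask lets each
        # position attend only to itself and earlier patches.
        n = tokens.shape[1]
        causal = torch.triu(torch.full((n, n), float("-inf"),
                                       device=tokens.device), diagonal=1)
        hidden = self.backbone(tokens, mask=causal)
        return self.head(hidden)        # (batch, n_patches, output_patch_len)
```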
4) Forecasting Process
- The decoder takes the embeddings of past patches as input and generates the next future patch.
- The generated patch is appended to the context and fed back into the decoder to predict the following patch.
- This process is repeated for the desired forecast length.
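This rollout can be sketched as a simple loop that reuses the hypothetical `PatchEmbedding` and `PatchedDecoder` modules from the snippets above (again an assumption-laden illustration, not the paper's implementation):

```python
import torch

@torch.no_grad()
def forecast(embed: "PatchEmbedding", model: "PatchedDecoder",
             context: torch.Tensor, horizon: int) -> torch.Tensor:
    # context: (batch, context_len); horizon: number of future points to predict
    series, preds, predicted = context, [], 0
    while predicted < horizon:
        tokens = embed(series)                    # embed all patches seen so far
        next_patch = model(tokens)[:, -1, :]      # last token predicts the next patch
        preds.append(next_patch)
        predicted += next_patch.shape[-1]
        series = torch.cat([series, next_patch], dim=-1)   # feed prediction back in
    return torch.cat(preds, dim=-1)[:, :horizon]           # trim to requested horizon
```

For example, with the assumed output patch length of 128, a 256-point horizon is covered in two decoding steps.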
5) Key Advantages of the patched decoder architecture
(1) Patching shortens the effective sequence length, so long histories can be processed efficiently with reduced memory and compute overhead.
(2) It maintains sequence order via positional encoding.
(3) The patch-as-token design is borrowed from the Vision Transformer, so time series can be handled with the same token-based transformer machinery used for text and images.
4. Evaluation
- Zero-shot performance was evaluated on various public datasets, achieving accuracy comparable to fully supervised models trained on each dataset individually.