Tiny Time Mixers (TTMs): Fast Pre-trained Models for Enhanced Zero/Few-Shot
Forecasting of Multivariate Time Series
Vijay Ekambaram, Arindam Jati, Nam H. Nguyen,
Pankaj Dayama, Chandra Reddy, Wesley M. Gifford and Jayant Kalagnanam
IBM Research
vijaye12@in.ibm.com, arindam.jati@ibm.com, nnguyen@us.ibm.com,
pankajdayama@in.ibm.com, {creddy,wmgifford,jayant}@us.ibm.com
Abstract
Large pre-trained models for zero/few-shot learn-
ing excel in language and vision domains but en-
counter challenges in multivariate time series (TS)
due to the diverse nature and scarcity of publicly
available pre-training data. Consequently, there
has been a recent surge in utilizing pre-trained
large language models (LLMs) with token adap-
tations for TS forecasting. These approaches em-
ploy cross-domain transfer learning and surpris-
ingly yield impressive results. However, these
models are typically very slow and large (billion
parameters) and do not consider cross-channel cor-
relations. To address this, we present Tiny Time
Mixers (TTM), a significantly smaller model based
on the lightweight TSMixer architecture. TTM
marks the first success in developing fast and tiny
general pre-trained models (1M parameters), ex-
clusively trained on public TS datasets, with effec-
tive transfer learning capabilities for forecasting.
To tackle the complexity of pre-training on multiple
datasets with varied temporal resolutions, we intro-
duce several novel enhancements such as adaptive
patching, dataset augmentation via downsampling,
and resolution prefix tuning. Moreover, we em-
ploy a multi-level modeling strategy to effectively
model channel correlations and infuse exogenous
signals during fine-tuning, a crucial capability lack-
ing in existing benchmarks. TTM shows signifi-
cant accuracy gains (12-38%) over popular bench-
marks in few/zero-shot forecasting. It also dras-
tically reduces the compute needs as compared to
LLM-TS methods, with a 14X cut in learnable pa-
rameters, 106X fewer total parameters, and substan-
tial reductions in fine-tuning (65X) and inference
time (54X). In fact, TTM's zero-shot results often sur-
pass few-shot results in many popular bench-
marks, highlighting the efficacy of our approach.
Code and pre-trained models will be open-sourced.
1 Introduction
Multivariate time series (TS) forecasting entails predicting fu-
ture values for multiple interrelated time series based on their
historical data. This field has advanced significantly, apply-
ing statistical and machine learning (ML) methods [Hyndman
and Athanasopoulos, 2021] across domains like weather, traf-
fic, retail, and energy. In general, each time series represents a
variable or channel1. In certain applications, non-forecasting
variables, categorized as controllable and uncontrollable ex-
ternal factors, impact the variables to forecast. We term these
non-forecasting variables as exogenous, and the variables re-
quiring forecast as target variables.
Related Work: Recent advances in multivariate fore-
casting have been marked by the advent of transformer-
based [Vaswani et al., 2017] approaches, exemplified by
models like PatchTST [Nie et al., 2023], Autoformer [Wu
et al., 2021], Informer [Zhou et al., 2021], and FED-
Former [Zhou et al., 2022]. These models have demon-
strated notable improvements over traditional statistical and
ML methods. Furthermore, architectures based on MLP-
Mixer [Tolstikhin et al., 2021], such as TSMixer [Ekambaram
et al., 2023], have emerged as efficient transformer alterna-
tives, boasting 2-3X reduced compute and memory require-
ments with no accuracy compromise compared to their trans-
former counterparts. However, none of these advanced ap-
proaches have demonstrated the ability to create general pre-trained models that can successfully transfer their learning to unseen target TS datasets, as popularly witnessed in NLP and vision tasks. This is very
challenging in the TS domain due to the diverse nature of the
datasets across applications and the limited public availability
of TS data for pre-training. There are existing self-supervised
pre-training TS approaches using masked modeling and con-
trastive learning techniques such as SimMTM [Dong et al.,
2023] and TF-C [Zhang et al., 2022] that offer transfer learn-
ing between two datasets when carefully selected based on
the dataset properties. However, they fail to provide universal
transfer learning capabilities across datasets. Consequently,
there has been a recent growing trend to employ pre-trained
large language models (LLMs) for TS forecasting, treating
it as a cross-domain transfer learning task. These universal
cross-transfer approaches, specifically recent works such as
LLMTime [Gruver et al., 2023] and GPT4TS [Zhou et al.,
2023] yield promising results in few/zero-shot forecasting ap-
proaches. These models are bootstrapped from GPT-2/3 or LLAMA-2 with suitable tokenization strategies to adapt to time-series domains.
1“Channel” refers to the individual time series in multivariate data (i.e., a multivariate TS is a multi-channel signal).
However, these LLM-based TS approaches do not explic-
itly handle channel correlations and exogenous support in the
context of multivariate forecasting. Moreover, these large
models, with billions of parameters, demand significant com-
putational resources and runtime. Hence, in this paper, we
focus on building pre-trained models from scratch solely us-
ing TS data. Unlike language, which has abundant public
pre-training data in terabytes, time-series data is relatively
scarce, very diverse and publicly limited. Its scarcity leads to
overfitting when pre-training “large” models solely on time-
series data. This prompts a question: Can smaller models
pre-trained purely on limited public diverse TS datasets give
better zero/few-shot forecasting accuracy? Surprisingly, the
answer is yes! Toward this, we propose Multi-level Tiny Time
Mixers (TTM), a significantly smaller model (1M parame-
ters) based on the lightweight TSMixer architecture, exclu-
sively trained on diverse TS corpora for effective zero/few-
shot multivariate TS forecasting via transfer learning.
In particular, TTM is pre-trained using multiple public
datasets (∼244M samples) from the Monash data repository2 [Godahewa et al., 2021]. Note that the datasets ex-
hibit considerable diversity in terms of characteristics, such
as the different domains, temporal resolution3 (spanning from
seconds to daily), lengths, and number of channels. Pre-
training on such heterogeneous datasets cannot be handled
directly by TSMixer or existing state-of-the-art (SOTA) mod-
els. Hence, TTM proposes the following enhancements to
the TSMixer architecture: (i) Adaptive Patching across lay-
ers, considering the varied suitability of patch lengths for
different datasets, (ii) Dataset Augmentation via Down-
sampling to increase coverage and samples across differ-
ent resolutions, (iii) Resolution Prefix Tuning to explic-
itly embed resolution information in the first patch, facili-
tating resolution-conditioned modeling, particularly benefi-
cial in scenarios with short history lengths. Moreover, our
approach leverages multi-level modeling, where TTMs are
first pre-trained in a channel-independent way and then seam-
lessly integrate channel mixing during fine-tuning to model
target data-specific channel-correlations and exogenous infu-
sion.
Below, we outline the paper’s key contributions:
• Amidst the prevalence of large pre-trained models de-
manding significant compute and training time (in
weeks), our work is the first to showcase the efficacy of
building Fast and Tiny Pre-trained models (1M pa-
rameters) exclusively trained on Public TS datasets in just a few hours (4-8 hours on 6 A100 GPUs). TTM
successfully demonstrates transfer learning to diverse,
unseen target datasets for zero/few-shot forecasting, ad-
dressing the data scarcity issues prevalent in time series.
• Pre-training on heterogeneous multi-resolution datasets
cannot be handled effectively by TSMixer or other SOTA models. Hence, we propose various architectural and training enhancements, such as adaptive patching, data augmentation via downsampling, and (an optional) resolution prefix tuning for robust pre-training.
2Accessible at https://forecastingdata.org/
3Resolution refers to the sampling rate of the input time series (e.g., hourly, 10 minutes, 15 minutes, etc.)
• TTM employs a multi-level modeling strategy to ex-
plicitly model channel-correlations, and incorporates ex-
ogenous signals – a crucial capability lacking in LLMs-
based TS approaches.
• With extensive evaluation on 11 datasets, TTM shows
significant accuracy gains over popular benchmarks (12-
38% in few/zero-shot forecasting). It also drastically re-
duces the compute needs as compared to LLM-TS meth-
ods, with a 14X cut in learnable parameters, 106X fewer
total parameters, and substantial reductions in fine-tuning
(65X), inference time (54X), and memory usage (27X).
• The zero-shot results of TTM often surpass the few-shot
results of many SOTA approaches, highlighting the ef-
fectiveness of our approach.
2 TTM Components
Let $X \in \mathbb{R}^{c \times sl}$ be a multivariate time series of length $sl$ with $c$ channels. The forecasting task can be formally defined as predicting the future values $Y \in \mathbb{R}^{c' \times fl}$ given the history $X$. Here, $fl$ denotes the forecast horizon and $c'$ denotes the number of forecast channels, where $c' \le c$. The predictions from the model are denoted by $\hat{Y} \in \mathbb{R}^{c' \times fl}$. In a general
multivariate forecasting task, each channel or variable falls
into one of the following categories: (a) Target variables
(mandatory): corresponding to the channels for which fore-
casts are required, (b) Exogenous variables (optional): en-
compassing (i) uncontrolled variables that may influence the
forecasts and are assumed to be known or estimated for the forecast period (e.g., weather), and (ii) control variables whose future values during the forecast horizon can be manipulated to govern the behavior of the target variables (e.g., discount
in sales forecasting, operator controls in industrial applica-
tions). In TTM, uncontrolled and control variables are treated
similarly, as both are considered available during forecasting.
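To make the notation concrete, here is a minimal shape-only sketch (our own illustration, not code from the paper; the particular values of $c$, $sl$, $c'$, and $fl$ are arbitrary):

```python
import numpy as np

# Hypothetical example: c = 5 observed channels over a history of sl = 512
# steps; forecasts are needed for c' = 3 target channels over fl = 96 steps,
# while the remaining 2 channels are exogenous with known future values.
c, sl = 5, 512
c_prime, fl = 3, 96

X = np.random.randn(c, sl)          # history X in R^{c x sl}
Y = np.random.randn(c_prime, fl)    # ground-truth future Y in R^{c' x fl}
Y_hat = np.zeros_like(Y)            # a model's prediction Y_hat in R^{c' x fl}

print(X.shape, Y.shape, Y_hat.shape)  # (5, 512) (3, 96) (3, 96)
```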
2.1 Multi-level Modeling
TTM follows a multi-level architecture consisting of four key
components (see Figure 1(a)): (1) The TTM Backbone is
assembled using building blocks derived from the efficient
TSMixer architecture [Ekambaram et al., 2023]. TSMixer is
based on simple MLP blocks that enable mixing of features
within patches, across patches and channels, surpassing exist-
ing transformer-based TS approaches with minimal computa-
tional requirements. Since TSMixer is not targeted to handle
multi-resolution data, we introduce various novel enhance-
ments to it as explained later. (2) TTM Decoder follows
the same backbone architecture but is considerably smaller
in size, approximately 10-20% of the size of the backbone,
(3) Forecast Head consists of a linear head designed to pro-
duce the forecast output, and (4) Optional Exogenous Mixer
serves to fuse exogenous data into the model’s forecasting
process. This multi-level model refactoring is required to dy-
namically change the working behavior of various compo-
Figure 1: Overview of Multilevel Tiny Time Mixers (TTM): (a) Refer to Section 2 and 3, (b) Refer to Section 3.1, (c) Refer to Section 3.2
nents based on the workflow type, as explained in Section 3.
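To make the composition of these four components concrete, the following is a minimal structural sketch under our own naming; the internals (plain linear layers) are placeholders for the TSMixer-based blocks, and only the wiring from backbone to decoder to head, plus the optional exogenous mixer, is meant to be illustrative:

```python
import torch
import torch.nn as nn

class TTMSkeleton(nn.Module):
    """Structural sketch of the four TTM components (names are ours).

    The backbone and decoder internals are placeholder MLPs standing in for
    the TSMixer-based blocks described in the paper; only the overall wiring
    (backbone, smaller decoder, forecast head, optional exogenous mixer) is
    meant to be illustrative.
    """

    def __init__(self, num_patches: int, patch_len: int, hf: int, fl: int,
                 use_exog_mixer: bool = False):
        super().__init__()
        # (1) Backbone: operates per channel on patched input (c, n, pl).
        self.backbone = nn.Sequential(
            nn.Linear(patch_len, hf), nn.GELU(), nn.Linear(hf, hf))
        # (2) Decoder: same style of blocks but much smaller (~10-20% of the backbone).
        self.decoder = nn.Linear(hf, hf // 4)
        # (3) Forecast head: a linear layer mapping flattened features to the horizon.
        self.head = nn.Linear(num_patches * (hf // 4), fl)
        # (4) Optional exogenous mixer, applied to the forecasts (see Section 3.2).
        self.exog_mixer = nn.Linear(fl, fl) if use_exog_mixer else nn.Identity()

    def forward(self, x_patched: torch.Tensor) -> torch.Tensor:
        # x_patched: (batch, c, n, pl) -> forecasts (batch, c, fl)
        z = self.backbone(x_patched)            # (batch, c, n, hf)
        z = self.decoder(z)                     # (batch, c, n, hf // 4)
        z = z.flatten(start_dim=-2)             # (batch, c, n * hf // 4)
        y_hat = self.head(z)                    # (batch, c, fl)
        return self.exog_mixer(y_hat)

# Example: 64 patches of length 8, hidden size 64, forecast horizon 96.
model = TTMSkeleton(num_patches=64, patch_len=8, hf=64, fl=96)
print(model(torch.randn(2, 5, 64, 8)).shape)  # torch.Size([2, 5, 96])
```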
In addition to the above primary components, we also have a
preprocessing component as explained next.
2.2 Pre-processing
As shown in Figure 1(a) with colorless blocks, the historical
time series X is first normalized per instance to have zero
mean and unit standard deviation for each channel dimension,
to tackle any possible distribution shifts [Nie et al., 2023;
Ekambaram et al., 2023]. This process is reversed at the end
before computing the loss. The normalized data X is sub-
sequently patched into $X_p \in \mathbb{R}^{c \times n \times pl}$, i.e., $n$ non-overlapping windows, each of length $pl$, and then passed to the TTM
backbone. Patching, as introduced in [Nie et al., 2023], has
proven to be highly valuable for forecasting. Its effectiveness
lies in preserving local semantic information, accommodat-
ing longer history, and reducing computation.
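A minimal sketch of this pre-processing step (our own code, assuming $sl$ is a multiple of $pl$; a batch dimension is added for convenience):

```python
import torch

def preprocess(x: torch.Tensor, patch_len: int, eps: float = 1e-5):
    """Per-instance normalization + non-overlapping patching (illustrative sketch).

    x: (batch, c, sl) history. Returns the patched tensor (batch, c, n, pl) plus
    the per-channel statistics needed to reverse the normalization later.
    """
    # Zero mean / unit std per channel of each instance (tackles distribution shift).
    mean = x.mean(dim=-1, keepdim=True)              # (batch, c, 1)
    std = x.std(dim=-1, keepdim=True) + eps          # (batch, c, 1)
    x_norm = (x - mean) / std

    # Split the normalized series into n = sl // pl non-overlapping patches.
    batch, c, sl = x_norm.shape
    n = sl // patch_len
    x_patched = x_norm[..., : n * patch_len].reshape(batch, c, n, patch_len)
    return x_patched, mean, std

def reverse_normalize(y_hat_norm: torch.Tensor, mean, std):
    """Bring forecasts back to the original scale before computing the loss."""
    return y_hat_norm * std + mean

xp, mu, sigma = preprocess(torch.randn(2, 5, 512), patch_len=64)
print(xp.shape)  # torch.Size([2, 5, 8, 64])
```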
3 TTM Workflows
TTM works in 2 stages: pre-train and fine-tune (Figure 1(a)).
3.1 Pre-training Workflow
In the pre-training stage, we train the model on a large collec-
tion of public datasets from the Monash data repository [Go-
dahewa et al., 2021]. Since the primary focus of TTM is
forecasting, pre-training is modeled with a direct forecast-
ing objective. TTM is first pre-trained in a univariate fashion
with independent channels on all the existing datasets. Due
to varied channel counts in pre-training datasets, modeling
multivariate correlations is not feasible here; it is addressed
later during fine-tuning. Multivariate pre-training datasets
are initially transformed into independent univariate time se-
ries $(X_1, \cdots, X_N) \in \mathbb{R}^{(c=1) \times sl}$. These are pre-processed
(Section 2.2), and subsequently fed into the TTM backbone
for multi-resolution pre-training. The output of the back-
bone, $X_h^L \in \mathbb{R}^{(c=1) \times n \times hf}$, is passed through the decoder and forecast head to produce the forecast $\hat{Y} \in \mathbb{R}^{(c=1) \times fl}$, which is then reverse-normalized to bring it back to the original scale. We pre-train the TTM with mean squared er-
ror (MSE) loss function calculated over the forecast horizon:
$\mathcal{L} = ||Y - \hat{Y}||_2^2$. Thus, for a given input context length $sl$ and forecast length $fl$, we get a pre-trained model capturing
the common temporal forecasting dynamics and seasonal pat-
terns as observed in the overall pre-training data.
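A sketch of one channel-independent pre-training step under these conventions (the channel flattening and loop structure are our assumptions; `model` stands for the combined backbone, decoder, and forecast head, assumed to map (b*c, 1, sl) histories to (b*c, 1, fl) forecasts):

```python
import torch
import torch.nn.functional as F

def to_univariate(batch: torch.Tensor) -> torch.Tensor:
    """Flatten a multivariate batch (b, c, sl) into independent univariate
    series (b * c, 1, sl) so that channels are modeled independently."""
    b, c, sl = batch.shape
    return batch.reshape(b * c, 1, sl)

def pretrain_step(model, optimizer, x, y):
    """One pre-training step with the direct-forecasting MSE objective.

    x: (b, c, sl) history, y: (b, c, fl) future values.
    """
    x_uni, y_uni = to_univariate(x), to_univariate(y)
    y_hat = model(x_uni)                 # (b * c, 1, fl)
    loss = F.mse_loss(y_hat, y_uni)      # L = ||Y - Y_hat||_2^2 (mean-reduced)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```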
Multi-Resolution Pre-training via TTM Backbone
The majority of the pre-training happens in the TTM back-
bone. The primary challenge with the proposed pre-training
technique is that the pre-training data is diverse and has multi-
ple resolutions. There are two main options for pre-training:
conducting separate pre-training for each resolution type or
pre-training using all resolution data collectively. While it’s
common to train a model per resolution type to overcome
challenges in learning diverse seasonal patterns, this leads
to diminished training data for each resolution due to lim-
ited data availability. Consequently, this motivated the ex-
ploration of pre-training a single model using datasets from
all resolutions. To achieve this, we propose the following 3
enhancements.
Adaptive Patching: The TTM backbone is crafted with
an adaptive patching architecture where different layers of
the backbone operate at varying patch lengths and numbers
of patches. Since each dataset in the pre-training corpora
may perform optimally at a specific patch length, this ap-
proach greatly aids in generalization when diverse datasets
with different resolutions are introduced. As shown in Fig-
ure 1(b), the patched data $X_p \in \mathbb{R}^{c \times n \times pl}$ is passed through an embedding layer to project it to the patch hidden dimension, yielding $X_h \in \mathbb{R}^{c \times n \times hf}$. Optionally, if the resolution prefix tuning
module is activated (as explained later), the resolution pre-
fix is concatenated with Xh. For notational simplicity, we
denote the concatenated tensor with Xh as well.
The TTM backbone consists of L levels, each compris-
ing M TTM blocks with identical patch configurations. A
TTM block in the $i$-th level, $i = 1, \ldots, L$, receives the processed data $X_h^{(i-1)} \in \mathbb{R}^{c \times n \times hf}$ from the earlier block. Each TTM block further comprises a patch partition block, a vanilla TSMixer block, and a patch merging block. The patch partition block at every level $i$ increases the number of patches by a factor of $K_i$ and reduces the patch dimension by the same factor, reshaping $X_h^{(i-1)} \in \mathbb{R}^{c \times n \times hf}$ to $X_h^i \in \mathbb{R}^{c \times (n \cdot K_i) \times (hf / K_i)}$, where $K_i = 2^{(L-i)}$. Note that we set $hf = 2^m$ for some integer $m \ge L$. Then, TSMixer is applied to the adapted data $X_h^i$. Finally, the output from TSMixer is reshaped back to its original shape (i.e., $\mathbb{R}^{c \times n \times hf}$) in the
patch merging block. Note that, as we go deeper into the net-
work, the number of patches decreases while the patch di-
mension size increases leading to adaptive patching which
helps in better generalization as we pre-train with multiple
datasets together. This idea of adaptive patching is popular
and very successful in the vision domain (e.g., Swin transformers [Liu et al., 2021]), and we are the first to port it suc-
cessfully to the time-series domain to resolve multi-resolution
issues in pre-training with diverse TS datasets. Figure 1(b)
shows the TTM backbone for L = 3 and M = 2. Please note
that adaptive patching is enabled only in the backbone and
not in the decoder, which is designed to be very lightweight.
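The reshaping performed at each level can be sketched as follows (our own code; the TSMixer block itself is omitted, and only the patch partition and patch merging reshapes with $K_i = 2^{(L-i)}$ are shown):

```python
import torch

def adaptive_patch_level(x_h: torch.Tensor, level: int, num_levels: int) -> torch.Tensor:
    """Sketch of the reshaping at level i of adaptive patching.

    x_h: (c, n, hf) with hf a power of two. At level i (1-indexed) the number
    of patches grows by K_i = 2^(L - i) while the patch feature dimension
    shrinks by the same factor; a TSMixer block (omitted here) would operate
    on the reshaped tensor before it is merged back.
    """
    c, n, hf = x_h.shape
    k_i = 2 ** (num_levels - level)
    assert hf % k_i == 0, "hf must be divisible by K_i (hf = 2^m, m >= L)"

    # Patch partition: (c, n, hf) -> (c, n * K_i, hf / K_i)
    x_part = x_h.reshape(c, n * k_i, hf // k_i)

    # ... vanilla TSMixer block with intra-patch/inter-patch mixing would run here ...

    # Patch merging: back to the original shape (c, n, hf)
    return x_part.reshape(c, n, hf)

x = torch.randn(7, 16, 64)                                    # c=7, n=16, hf=64
print(adaptive_patch_level(x, level=1, num_levels=3).shape)   # K_1 = 4
```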
Data Augmentation via Downsampling: A significant
challenge in TS pre-training datasets is the scarcity of public
datasets at specific resolutions. To overcome this, we employ
a downsampling technique for high-resolution datasets, gen-
erating multiple datasets at lower resolutions. For example,
from a one-second resolution dataset, we derive datasets at
minute and hour resolutions. Note that, the original high-
resolution dataset remains within the pool of pre-training
datasets. This methodology significantly augments the num-
ber of datasets for each resolution which greatly improves the
model performance (Section 4.5).
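A minimal sketch of this augmentation; the paper does not specify the aggregation rule, so mean pooling over non-overlapping windows is our assumption:

```python
import numpy as np

def downsample(series: np.ndarray, factor: int) -> np.ndarray:
    """Downsample a 1-D series by averaging consecutive windows of `factor`
    points (mean aggregation is our assumption; other rules are possible)."""
    usable = (len(series) // factor) * factor
    return series[:usable].reshape(-1, factor).mean(axis=1)

# Example: derive minute- and hour-resolution variants of a 1-second series.
one_second = np.random.randn(3 * 24 * 3600)     # three days at 1 s resolution
one_minute = downsample(one_second, 60)
one_hour = downsample(one_second, 3600)

# The original high-resolution series stays in the pre-training pool as well.
pretraining_pool = {"1s": one_second, "1min": one_minute, "1h": one_hour}
print({k: v.shape for k, v in pretraining_pool.items()})
```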
Resolution Prefix Tuning: This technique explicitly
learns and incorporates a new patch embedding as a prefix
into the input data based on the input resolution type (see
Figure 1(b)). Similar to the concept of prefix tuning [Li and
Liang, 2021], this approach provides an explicit signal to the
model about the resolution type for resolution-conditioned
modeling. First, we map every resolution to a unique integer,
which is then passed through an embedding layer to project
it to the hidden dimension, $hf$. Subsequently, we expand the embedding across all channels to have a representation of shape $c \times 1 \times hf$. This module is optional for the TTM back-
bone, particularly beneficial when the context length (sl) is
short. In these scenarios, automatically detecting the resolu-
tion becomes a challenge for the model. Hence, by explicitly
fusing the resolution information as a prefix, we can enhance
the model’s ability to learn effectively across resolutions.
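A sketch of resolution prefix tuning under our own naming; the resolution-to-integer mapping shown is hypothetical, and the learned prefix is prepended as one extra patch along the patch dimension:

```python
import torch
import torch.nn as nn

class ResolutionPrefix(nn.Module):
    """Sketch of resolution prefix tuning (module and mapping names are ours).

    Every resolution type is mapped to a unique integer id, embedded into the
    patch hidden dimension hf, expanded across channels, and prepended to the
    patch embeddings as an extra (prefix) patch.
    """

    # Hypothetical resolution-to-id mapping; the paper does not prescribe one.
    RESOLUTION_IDS = {"1s": 0, "1min": 1, "10min": 2, "15min": 3, "1h": 4, "1d": 5}

    def __init__(self, hf: int):
        super().__init__()
        self.embed = nn.Embedding(len(self.RESOLUTION_IDS), hf)

    def forward(self, x_h: torch.Tensor, resolution: str) -> torch.Tensor:
        # x_h: (batch, c, n, hf) patch embeddings
        batch, c, _, hf = x_h.shape
        res_id = torch.tensor(self.RESOLUTION_IDS[resolution], device=x_h.device)
        prefix = self.embed(res_id)                           # (hf,)
        prefix = prefix.view(1, 1, 1, hf).expand(batch, c, 1, hf)
        return torch.cat([prefix, x_h], dim=2)                # (batch, c, n + 1, hf)

prefix_tuner = ResolutionPrefix(hf=64)
print(prefix_tuner(torch.randn(2, 5, 16, 64), resolution="1h").shape)  # (2, 5, 17, 64)
```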
3.2 Fine-tuning Workflow
In the fine-tuning workflow, we deal with data from the tar-
get domain that has no overlap with the pre-training datasets.
We have three options here: (a) In Zero-shot forecasting,
we directly use the pre-trained model to evaluate on the test
part of the target data, (b) In Few-shot forecasting, we utilize
only a tiny portion (5-10%) of the train part of the target data
to quickly update the pre-trained weights of the decoder and
head, and subsequently, evaluate it on the test part, (c) In Ful-
l-shot forecasting, we fine-tune the pre-trained weights of the
decoder and head on the entire train part of the target data,
and then, evaluate on the test part.
The backbone is completely frozen during fine-tuning, and
still operates in a channel-independent univariate fashion.
However, the TTM decoder can be fine-tuned via channel-
mixing (for multivariate) or a channel-independent (for uni-
variate) way based on the nature of the target data. If
pure multivariate modeling is needed, then the channel-mixer
block in all the TSMixer components (see Figure 1(b)) in the
decoder gets enabled to explicitly capture the channel cor-
relation between the channels. The forecast head and re-
verse normalization perform similar operations as in the pre-
training stage. The fine-tuning also optimizes the forecasting
objective with MSE loss. This thoughtful multi-level design
choice ensures our backbone excels in channel-independent
pre-training, enabling effective temporal correlation model-
ing across diverse datasets. Simultaneously, the decoder han-
dles target-data-specific tasks like channel-correlation model-
ing and fine-tuning. In addition, if the target data has exoge-
nous variables, then an exogenous mixer block is applied to
the actual forecasts as explained next.
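Before turning to the exogenous mixer, here is a minimal sketch of the fine-tuning setup (attribute names such as `backbone`, `decoder`, and the `channel_mixing` flag are our assumptions; only the decoder and head receive gradient updates):

```python
import torch

def prepare_for_finetuning(model, enable_channel_mixing: bool = True, lr: float = 1e-4):
    """Freeze the pre-trained backbone; fine-tune only the decoder and head.

    For few-shot forecasting, the returned optimizer would be run on just
    5-10% of the target train split; for full-shot, on the entire train split.
    """
    for p in model.backbone.parameters():
        p.requires_grad = False              # backbone stays frozen and channel-independent

    # Hypothetical flag: enable the channel-mixer blocks in the decoder's
    # TSMixer components when the target data is multivariate.
    if enable_channel_mixing and hasattr(model.decoder, "channel_mixing"):
        model.decoder.channel_mixing = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```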
Exogenous Mixer Block: As described in Section 2, the
future values of the exogenous channels are known in ad-
vance. Let the forecast from the forecast head be $\hat{Y} \in \mathbb{R}^{c \times fl}$. Let the channels $x_0, \cdots, x_{c'}$ denote the target variables and $x_{c'+1}, \cdots, x_c$ denote all exogenous variables with their future values known. First, we replace the forecast values for the exogenous channels with the true future values ($Y$) and transpose it: $\hat{Y}_e = [\hat{y}_0, \cdots, \hat{y}_{c'}, y_{c'+1}, \cdots, y_c] \in \mathbb{R}^{fl \times c}$. Next, to learn inter-channel lagged correlations, we patch $\hat{Y}_e$ into a series of overlapped windows (i.e., patching with stride $= 1$) to create a new tensor $\hat{Y}_{e,p} \in \mathbb{R}^{fl \times w \times c}$, where the window width $w = 2 \cdot l + 1$, with $l$ being the context length to incorporate on either side of a time point4. Subsequently, we pass $\hat{Y}_{e,p}$ through a vanilla TSMixer block with channel mixing enabled. Thus, the lagged dependency of the forecasts for the target channels on the exogenous channels is seamlessly learned. Finally, we attach a linear head to produce the forecasts for the target channels, which are then reshaped as $\hat{Y} \in \mathbb{R}^{c' \times fl}$. Figure 1(c) depicts this procedure.
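A sketch of the exogenous mixer under our own naming (a small channel-mixing MLP stands in for the vanilla TSMixer block, and the zero-padding of length $l$ on both sides follows footnote 4):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExogenousMixerSketch(nn.Module):
    """Illustrative exogenous mixer: replace exogenous forecasts with known
    futures, window with stride 1, mix channels, and predict targets only."""

    def __init__(self, num_channels: int, num_targets: int, context: int, hidden: int = 32):
        super().__init__()
        self.num_targets = num_targets
        self.context = context                        # l: points used on each side
        window = 2 * context + 1                      # w = 2 * l + 1
        # Stand-in for the channel-mixing TSMixer block, applied per time step.
        self.mixer = nn.Sequential(
            nn.Linear(num_channels * window, hidden), nn.GELU(),
            nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, num_targets)    # forecasts for target channels only

    def forward(self, y_hat: torch.Tensor, y_exog_true: torch.Tensor) -> torch.Tensor:
        # y_hat: (fl, c) forecasts for all channels (already transposed);
        # y_exog_true: (fl, c - c') known future values of the exogenous channels.
        y_e = torch.cat([y_hat[:, :self.num_targets], y_exog_true], dim=1)    # (fl, c)
        # Overlapping windows with stride 1, zero-padded by l on both sides (footnote 4).
        y_pad = F.pad(y_e.T, (self.context, self.context))                    # (c, fl + 2l)
        y_win = y_pad.unfold(dimension=1, size=2 * self.context + 1, step=1)  # (c, fl, w)
        y_win = y_win.permute(1, 0, 2).reshape(y_e.shape[0], -1)              # (fl, c * w)
        out = self.head(self.mixer(y_win))                                    # (fl, c')
        return out.T                                                          # (c', fl)

fl, c, c_prime, l = 96, 5, 3, 2
mixer = ExogenousMixerSketch(num_channels=c, num_targets=c_prime, context=l)
print(mixer(torch.randn(fl, c), torch.randn(fl, c - c_prime)).shape)  # torch.Size([3, 96])
```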
4 Experiments and Results
4.1 Experimental Setting
Datasets: Pre-training employs a subset of the Monash data
hub [Godahewa et al., 2021] with ∼244M samples.
4This requires padding $\hat{Y}_e$ with zeros of length $l$ on both sides.