Five Architectures for Time Series Forecasting with Large Language Models

Large Language Models are increasingly being applied to time series forecasting: not as chatbots, but as prediction engines that leverage the pattern-recognition capabilities of transformer architectures to forecast numerical sequences. Over the past two years, several distinct approaches have emerged, each with a fundamentally different strategy for bridging the gap between language and numbers.

This post compares five architectures that represent the current landscape: LLMTIME, TIME-LLM, Chronos, TimesFM, and Moirai. Each takes a different approach to the same core question: how do you make a model designed for text understand temporal patterns?

The five strategies

At a high level, the five models can be grouped by how they handle the modality gap between text and time series:

| Model    | Strategy                      | Core idea                                       |
|----------|-------------------------------|-------------------------------------------------|
| LLMTIME  | Direct text encoding          | Treat numbers as literal text strings           |
| TIME-LLM | Model reprogramming           | Translate time series into text prototypes      |
| Chronos  | Discrete tokenization         | Quantize values into a fixed vocabulary         |
| TimesFM  | Decoder-only foundation model | Pre-train a dedicated model on 100B data points |
| Moirai   | Universal masked encoder      | Handle any-variate series via flattening        |

LLMTIME: numbers as text

The simplest approach. LLMTIME (Gruver et al., NeurIPS 2023) serializes time series values directly as text strings and feeds them to GPT-3 or LLaMA-2. The LLM then predicts the next tokens, which are parsed back into numbers. No fine-tuning, no special architecture. Pure zero-shot prediction.

The elegance is in its simplicity: if an LLM can predict the next word in a sentence, maybe it can predict the next number in a sequence. And for simple patterns (autoregressive processes, basic trends), it works surprisingly well.

The catch: a replication study by Cao and Wang showed that LLMTIME “lacks ability for zero-shot time series forecasting” on complex patterns with combined trend and seasonal components. The approach works on toy examples but breaks down on real-world data with multiple overlapping patterns.
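To make the serialization concrete, here is a minimal sketch of LLMTIME-style encoding: digits are space-separated so the tokenizer treats each digit as its own token, and values are comma-separated. This is a simplified version of the paper's scheme (the original also handles rescaling and model-specific tokenizer quirks), not its exact implementation.

```python
def serialize(values, decimals=1):
    """Encode a numeric series as a digit-separated string (LLMTIME-style sketch).

    Space-separating digits keeps the tokenizer from merging multi-digit
    numbers into unpredictable tokens; commas delimit values.
    """
    out = []
    for v in values:
        s = f"{v:.{decimals}f}".replace(".", "")  # drop the decimal point
        out.append(" ".join(s))
    return " , ".join(out)


def deserialize(text, decimals=1):
    """Parse the LLM's text completion back into floats."""
    vals = []
    for chunk in text.split(","):
        digits = chunk.replace(" ", "")
        vals.append(int(digits) / 10**decimals)
    return vals


# "1 5 , 2 3" round-trips back to [1.5, 2.3]
encoded = serialize([1.5, 2.3])
decoded = deserialize(encoded)
```

The prompt sent to GPT-3 or LLaMA-2 is then just this string; the sampled continuation is parsed with the inverse mapping.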

TIME-LLM: teaching an LLM to read time series

TIME-LLM (Jin et al., ICLR 2024) takes a more sophisticated approach. Instead of converting numbers to text, it reprograms the time series into a representation that the LLM already understands.

The process works in three steps:

  1. Patch embedding: The time series is split into patches (segments) and embedded into vectors.
  2. Reprogramming: A cross-attention layer maps these patches onto “text prototypes” from the LLM’s vocabulary. The model learns which word-like representations best capture the temporal patterns. The LLM backbone stays frozen.
  3. Prompt-as-Prefix: Statistical features of the time series (min, max, median, lags, trends) are encoded as a natural language prefix that gives the LLM domain context.
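The first two steps can be sketched in a few lines of NumPy. The weights and prototype bank below are random stand-ins purely for illustration (in TIME-LLM they are learned, and the prototypes are derived from the LLM's word embeddings), but the data flow matches the description above: patch, embed, then cross-attend onto prototype vectors so the output lives in a space the frozen LLM already understands.

```python
import numpy as np


def reprogram(series, patch_len=4, d=8, n_protos=5, seed=0):
    """Illustrative sketch of TIME-LLM's patching + reprogramming steps.

    All weights here are random placeholders, not trained parameters.
    """
    rng = np.random.default_rng(seed)

    # Step 1: patch embedding -- split the series into patches, project to d dims
    n = len(series) // patch_len
    patches = np.asarray(series[: n * patch_len], dtype=float).reshape(n, patch_len)
    W_embed = rng.normal(size=(patch_len, d))
    q = patches @ W_embed                       # queries, shape (n, d)

    # Step 2: cross-attention onto "text prototypes" (keys == values here)
    protos = rng.normal(size=(n_protos, d))     # stand-in prototype bank
    scores = q @ protos.T / np.sqrt(d)          # (n, n_protos)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # softmax over prototypes
    return attn @ protos                        # (n, d): LLM-ready embeddings
```

Each patch ends up expressed as a convex combination of prototype vectors, which is what lets the frozen LLM backbone process it without any weight updates.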

The results are strong: 57-92% lower MSE than AutoARIMA across standard benchmark datasets. But the model is large (6.6B parameters for the LLaMA-7B backbone) and the reprogramming layer adds complexity.

Chronos: a vocabulary for numbers

Chronos (Ansari et al., Amazon Science 2024) builds its own language for time series. It discretizes continuous values into 4096 bins through scaling and quantization, creating a fixed vocabulary of “number tokens”. Then it trains a T5 transformer with standard cross-entropy loss, exactly like a language model.

The key innovation is making time series forecasting a sequence-to-sequence problem in the model’s native format. The model learns to “speak numbers” the same way a language model learns to speak English.
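The scaling-and-quantization step can be sketched as follows. This is a simplified illustration (uniform bins, no special tokens), not the exact Chronos scheme, but it shows the core idea: mean-scale the series, map each value to one of a fixed set of bin ids, and invert by mapping token ids back to bin centers.

```python
import numpy as np


def tokenize(series, n_bins=4096, low=-15.0, high=15.0):
    """Sketch of Chronos-style tokenization: mean-scale, then quantize
    into a fixed vocabulary of n_bins token ids (uniform bins here;
    the paper's exact scheme and special tokens differ in detail)."""
    scale = float(np.mean(np.abs(series)))
    scale = scale if scale > 0 else 1.0
    scaled = np.asarray(series, dtype=float) / scale
    edges = np.linspace(low, high, n_bins - 1)
    tokens = np.digitize(scaled, edges)          # integer ids in [0, n_bins - 1]
    return tokens, scale


def detokenize(tokens, scale, n_bins=4096, low=-15.0, high=15.0):
    """Map token ids back to bin centers and undo the scaling."""
    centers = np.linspace(low, high, n_bins)
    return centers[tokens] * scale
```

Once values are token ids, training is literally next-token prediction with cross-entropy, and sampling multiple continuations yields probabilistic forecasts for free.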

Performance: 35-40% better than AutoARIMA on complex datasets. Zero-shot performance is comparable to or better than dataset-specific models on in-domain benchmarks. Model sizes range from Mini (20M parameters, $252 training cost) to Large (710M parameters, $2,066 training cost on AWS).

TimesFM: the GPT of time series

TimesFM (Das et al., Google Research, ICML 2024) is a decoder-only foundation model built from scratch for time series. No language model repurposing. It’s a 200M parameter transformer pre-trained on 100 billion real-world time points from Google Trends, Wiki Pageviews, and synthetic data.

The architecture uses 32-point input patches and 128-point output patches, allowing it to generate longer forecasts efficiently. Like GPT for text, it predicts “what comes next” by learning temporal patterns at scale.
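The efficiency gain from long output patches is easy to quantify: each autoregressive step emits one full output patch, so the number of decode steps shrinks by the patch length. A back-of-the-envelope sketch (illustrating the design trade-off, not TimesFM's exact inference loop):

```python
import math


def decode_steps(horizon, output_patch_len=128):
    """Autoregressive decode steps needed to cover a forecast horizon
    when each step emits one output patch of output_patch_len points."""
    return math.ceil(horizon / output_patch_len)


# A 512-point forecast needs 4 decode steps with 128-point output patches,
# versus 512 steps for a model emitting one point at a time.
long_patch = decode_steps(512)       # 4
pointwise = decode_steps(512, 1)     # 512
```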

The advantage of building a dedicated model rather than repurposing an LLM is efficiency: 200M parameters is tiny compared to TIME-LLM’s 6.6B. The disadvantage is cost: pre-training on 100B data points requires significant compute (estimated $1-5M).

Moirai: one model for any time series

Moirai (Woo et al., Salesforce AI Research 2024) tackles a different problem: universality. Most forecasting models assume a fixed number of variables. Moirai handles any-variate time series (univariate, multivariate, or mixed) by flattening all variables into a single sequence and using multi-patch-size projection.
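The flattening idea can be sketched simply: concatenate every variate into one sequence and tag each value with its variate id, so the same encoder sees univariate and multivariate inputs in an identical format. This is a simplified illustration of the any-variate input layout (the actual model also uses multi-patch-size projections and rotary-style variate encodings):

```python
import numpy as np


def flatten_any_variate(variates):
    """Sketch of Moirai-style any-variate flattening: all variates become
    one flat value sequence plus a parallel variate-id sequence."""
    values, variate_ids = [], []
    for vid, series in enumerate(variates):
        values.extend(series)
        variate_ids.extend([vid] * len(series))
    return np.array(values, dtype=float), np.array(variate_ids)


# A univariate series and a 3-variate series go through the same code path:
vals, ids = flatten_any_variate([[1.0, 2.0], [3.0, 4.0, 5.0]])
```

Because the variate count is absorbed into the id channel rather than the model shape, one set of weights serves any input dimensionality.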

It uses a masked encoder architecture (similar to BERT) rather than autoregressive generation. The model was trained on the LOTSA dataset: 27 billion observations from 9 different domains.

Results show a 70% win rate in zero-shot scenarios across benchmarks. However, ARIMA still outperforms on specific well-structured datasets (like M4 Monthly), and inference is 3-10x slower than statistical models. The break-even point is roughly 10+ different forecasting tasks, where the cost of fitting individual models exceeds using a single universal one.

Comparison

| Dimension       | LLMTIME               | TIME-LLM             | Chronos        | TimesFM            | Moirai            |
|-----------------|-----------------------|----------------------|----------------|--------------------|-------------------|
| Architecture    | Existing LLM          | Frozen LLM + reprog. | T5 (enc-dec)   | Decoder-only       | Masked encoder    |
| Parameters      | 7-175B (GPT)          | 6.6B                 | 20M-710M       | 200M               | 14M-311M          |
| Pre-training    | None (zero-shot)      | Minimal              | Moderate       | Massive (100B pts) | Massive (27B obs) |
| Variate support | Univariate            | Univariate           | Univariate     | Univariate         | Any-variate       |
| Output          | Point forecast        | Point forecast       | Probabilistic  | Point + quantiles  | Probabilistic     |
| Key strength    | Zero setup            | Strong accuracy      | Own vocabulary | Efficiency         | Universality      |
| Key weakness    | Fails on complex data | Very large           | Fixed bin count| Training cost      | Slower inference  |

What this means in practice

An interesting finding from the literature is that a simple architecture called PAttn (patching + attention, just 0.245M parameters) achieves comparable results to TIME-LLM while being over 1000x faster. This suggests that for many practical forecasting tasks, the LLM component may be unnecessary overhead.

The practical takeaway depends on your situation:

  • Quick prototype, no training data: Chronos (zero-shot, well-tested, free to use)
  • Many different forecasting tasks: Moirai or TimesFM (amortize model cost over tasks)
  • Best accuracy on a single dataset: Fine-tune a dedicated model (ARIMA, Prophet, or LightGBM often still win)
  • Probabilistic forecasts needed: Chronos or Moirai (native uncertainty quantification)

ARIMA is not dead. For well-structured, single-domain time series with clear seasonal patterns, statistical models remain competitive or better. Foundation models shine when you have many heterogeneous series, limited historical data, or need zero-shot predictions on new domains.

References

  • Gruver et al. (2023). Large Language Models Are Zero-Shot Time Series Forecasters. NeurIPS 2023. arXiv:2310.07820
  • Jin et al. (2024). TIME-LLM: Time Series Forecasting by Reprogramming Large Language Models. ICLR 2024. arXiv:2310.01728
  • Ansari et al. (2024). Chronos: Learning the Language of Time Series. Amazon Science. arXiv:2403.07815
  • Das et al. (2024). A Decoder-Only Foundation Model for Time-Series Forecasting. ICML 2024. arXiv:2310.10688
  • Woo et al. (2024). Unified Training of Universal Time Series Forecasting Transformers. Salesforce AI Research. arXiv:2402.02592
  • Cao & Wang (2024). An Evaluation of Standard Statistical Models and LLMs on Time Series Forecasting.
