fusionlab.nn.transformers.TimeSeriesTransformer

class fusionlab.nn.transformers.TimeSeriesTransformer[source]

Bases: Model, NNLearner

A standard Transformer model for multi-horizon time series forecasting.

This class implements the classic encoder-decoder Transformer architecture introduced by Vaswani et al., tailored to multivariate, multi-horizon time series forecasting. It uses self-attention and cross-attention to capture long-range dependencies in sequential data.

The model is “pure” in the sense that it does not use any recurrent (LSTM/GRU) or convolutional layers, relying solely on attention to process temporal information. It is designed to handle three distinct types of input features: static, dynamic past-observed, and known future covariates.

Parameters:
  • static_input_dim (int) – The number of features in the static input tensor. These are time-invariant features like sensor ID or location. Can be 0 if no static features are used.

  • dynamic_input_dim (int) – The number of features in the dynamic input tensor, which contains past-observed, time-varying data.

  • future_input_dim (int) – The number of features in the future input tensor, containing covariates with known values in the forecast horizon, such as day of the week or scheduled events.

  • embed_dim (int, default 64) – The core dimensionality of the model, \(d_{model}\). This is the size of all embedding vectors and the internal dimension of the attention layers.

  • num_heads (int, default 4) – The number of attention heads in each multi-head attention layer. embed_dim must be divisible by num_heads.

  • ffn_dim (int, default 128) – The dimensionality of the inner layer of the feed-forward network (FFN) that follows the attention mechanism in each encoder and decoder block.

  • num_encoder_layers (int, default 3) – The number of identical encoder layers to stack.

  • num_decoder_layers (int, default 3) – The number of identical decoder layers to stack.

  • forecast_horizon (int, default 1) – The number of future time steps to predict (\(H\)). This defines the length of the output sequence.

  • output_dim (int, default 1) – The number of target variables to forecast at each time step.

  • dropout_rate (float, default 0.1) – The dropout rate applied within the attention mechanisms and feed-forward networks for regularization.

  • input_dropout_rate (float, default 0.1) – The dropout rate applied to the sum of the input embeddings and positional encodings.

  • max_seq_len_encoder (int, default 100) – The maximum expected sequence length for the encoder’s input. Used to pre-compute positional encodings.

  • max_seq_len_decoder (int, default 50) – The maximum expected sequence length for the decoder’s input (typically forecast_horizon). Used for positional encodings.

  • quantiles (list of float, optional) – A list of quantiles (e.g., [0.1, 0.5, 0.9]) for probabilistic forecasting. If None, the model produces deterministic point forecasts.

  • use_grn_for_static (bool, default False) – If True, processes the static features through a GatedResidualNetwork (GRN). If False, uses a standard Dense layer.

  • static_integration_mode ({'add_to_encoder_input', 'add_to_decoder_input', 'none'}, default 'add_to_decoder_input') – Defines how the processed static context vector is integrated into the model:

    • 'add_to_encoder_input': Adds it to the encoder’s input embeddings.

    • 'add_to_decoder_input': Adds it to the decoder’s input embeddings.

    • 'none': The static context is not explicitly injected.

  • activation (str or callable, default 'relu') – The activation function for the feed-forward networks.

  • layer_norm_epsilon (float, default 1e-6) – The epsilon value for the Layer Normalization layers to prevent division by zero.

  • name (str, optional) – The name of the Keras model.

  • **kwargs – Additional keyword arguments passed to the tf.keras.Model constructor.
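The constraints stated in the parameter list can be checked before building a model. The sketch below is a hypothetical pre-flight helper (check_transformer_config is not part of fusionlab). It assumes that sequences longer than max_seq_len_encoder or max_seq_len_decoder are unsupported because positional encodings are pre-computed up to those lengths; the documentation above does not say how the library itself reacts in that case.

```python
def check_transformer_config(embed_dim, num_heads, past_steps, horizon,
                             max_seq_len_encoder=100, max_seq_len_decoder=50):
    """Hypothetical helper (not a fusionlab API): validate basic constraints.

    Returns the per-head dimension, embed_dim // num_heads.
    """
    # Documented constraint: embed_dim must be divisible by num_heads.
    if embed_dim % num_heads != 0:
        raise ValueError(
            f"embed_dim={embed_dim} is not divisible by num_heads={num_heads}"
        )
    # Assumption: inputs longer than the pre-computed positional
    # encodings are treated as a configuration error.
    if past_steps > max_seq_len_encoder:
        raise ValueError("past_steps exceeds max_seq_len_encoder")
    if horizon > max_seq_len_decoder:
        raise ValueError("horizon exceeds max_seq_len_decoder")
    return embed_dim // num_heads


head_dim = check_transformer_config(
    embed_dim=32, num_heads=4, past_steps=24, horizon=12
)
print(head_dim)  # 8
```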

Notes

This model adheres to the standard Transformer architecture, which consists of an encoder-decoder stack.

Encoder

The encoder is composed of a stack of num_encoder_layers. Each layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network. It processes the entire sequence of past dynamic features, allowing each position to attend to all other positions to build a rich contextual representation.
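To make "each position attends to all other positions" concrete, here is a minimal, dependency-free sketch of scaled dot-product self-attention. It is an illustration under simplifying assumptions, not the library's implementation: it uses a single head and identity projections (queries, keys and values are the inputs themselves), whereas a real layer learns separate projection matrices per head.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with identity projections.

    x is a list of T vectors, each of length d.
    Returns (context, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    context, weights = [], []
    for query in x:
        # Similarity of this position with every position (itself included).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in x]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all value vectors.
        context.append(
            [sum(w_j * value[c] for w_j, value in zip(w, x)) for c in range(d)]
        )
    return context, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, attn = self_attention(x)
```

Every row of attn sums to 1 and has no zero entries, which is the unmasked, all-to-all pattern the encoder relies on.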

Decoder

The decoder is similarly composed of a stack of num_decoder_layers. Each decoder layer has three sub-layers:

  1. Masked Multi-Head Self-Attention: This is the key to autoregressive generation. It applies a causal mask to the decoder’s inputs to ensure that the prediction for a time step \(i\) can only depend on known outputs at steps less than \(i\), preventing the model from looking ahead.

  2. Multi-Head Cross-Attention: This layer allows the decoder to attend to the output of the encoder. It acts as the bridge between the processed past information and the future forecast, allowing the decoder to focus on the most relevant parts of the historical context.

  3. Feed-Forward Network: The same type of FFN as in the encoder.
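The causal mask used by the first decoder sub-layer can be sketched as a lower-triangular boolean matrix. This is a generic illustration rather than fusionlab code; the helper names are hypothetical. It assumes the common convention in which a position may attend to itself and to earlier positions of the decoder input, and in which disallowed scores are replaced by a large negative number before the softmax.

```python
def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def apply_mask(scores, mask, neg=-1e9):
    """Replace disallowed attention scores so softmax gives them ~0 weight."""
    return [
        [s if keep else neg for s, keep in zip(score_row, mask_row)]
        for score_row, mask_row in zip(scores, mask)
    ]


mask = causal_mask(4)
for row in mask:
    print("".join("1" if keep else "0" for keep in row))
# 1000
# 1100
# 1110
# 1111
```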

Residual connections and layer normalization are applied around each sub-layer to ensure stable training.
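The "residual connection plus layer normalization" wrapper can be sketched for a single feature vector as follows. Two assumptions are made here that the text above does not confirm: the post-norm ordering LayerNorm(x + sublayer(x)) from the original Transformer, and the omission of the learned gain and bias that a real layer normalization carries. The epsilon argument plays the role described for layer_norm_epsilon.

```python
import math


def layer_norm(vec, epsilon=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    # epsilon keeps the denominator non-zero when all entries are equal.
    return [(v - mean) / math.sqrt(var + epsilon) for v in vec]


def residual_block(x, sublayer, epsilon=1e-6):
    """Post-norm wrapper (assumed ordering): LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], epsilon)


# Toy sub-layer standing in for attention or the feed-forward network.
out = residual_block([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
```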

See also

fusionlab.nn.components.TransformerEncoderLayer

The core encoder block.

fusionlab.nn.components.TransformerDecoderLayer

The core decoder block.

fusionlab.nn.models.BaseAttentive

A more complex hybrid model foundation.

Examples

>>> import tensorflow as tf
>>> from fusionlab.nn.transformers import TimeSeriesTransformer
>>> # 1. Model Configuration
>>> BATCH_SIZE = 32
>>> PAST_STEPS = 24
>>> HORIZON = 12
>>> STATIC_DIM, DYNAMIC_DIM, FUTURE_DIM = 5, 6, 4
>>> model = TimeSeriesTransformer(
...     static_input_dim=STATIC_DIM,
...     dynamic_input_dim=DYNAMIC_DIM,
...     future_input_dim=FUTURE_DIM,
...     embed_dim=32,
...     num_heads=4,
...     ffn_dim=64,
...     num_encoder_layers=2,
...     num_decoder_layers=2,
...     forecast_horizon=HORIZON,
...     output_dim=1,
...     quantiles=[0.1, 0.5, 0.9]
... )
>>> # 2. Prepare Dummy Input Data
>>> static_input = tf.random.normal([BATCH_SIZE, STATIC_DIM])
>>> dynamic_input = tf.random.normal([BATCH_SIZE, PAST_STEPS, DYNAMIC_DIM])
>>> future_input = tf.random.normal([BATCH_SIZE, HORIZON, FUTURE_DIM])
>>> # 3. Get Model Output
>>> # Inputs are passed as a list: [static, dynamic, future]
>>> predictions = model([static_input, dynamic_input, future_input])
>>> # 4. Check Output Shape
>>> # Shape is (Batch, Horizon, Quantiles) since output_dim=1
>>> print(f"Output prediction shape: {predictions.shape}")
Output prediction shape: (32, 12, 3)
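The example above produces one value per quantile. This page does not state which loss should be used to train such outputs; a common choice for quantile forecasts is the pinball (quantile) loss, sketched below in plain Python for a single target value. The function names are illustrative and are not fusionlab APIs.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss for one prediction at quantile level q (0 < q < 1)."""
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(y_true, y_preds, quantiles):
    """Average the pinball loss over all quantile levels for one time step."""
    losses = [pinball_loss(y_true, p, q) for p, q in zip(y_preds, quantiles)]
    return sum(losses) / len(losses)


quantiles = [0.1, 0.5, 0.9]
loss = mean_pinball_loss(10.0, [8.0, 10.0, 12.0], quantiles)
print(f"{loss:.4f}")  # 0.1333
```

At q = 0.9, predicting 8 for a true value of 10 costs about 1.8, while predicting 12 costs about 0.2, which is what pushes the 0.9 output toward the upper end of the distribution.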
__init__(static_input_dim, dynamic_input_dim, future_input_dim, embed_dim=64, num_heads=4, ffn_dim=128, num_encoder_layers=3, num_decoder_layers=3, forecast_horizon=1, output_dim=1, dropout_rate=0.1, input_dropout_rate=0.1, max_seq_len_encoder=100, max_seq_len_decoder=50, quantiles=None, use_grn_for_static=False, static_integration_mode='add_to_decoder_input', activation='relu', layer_norm_epsilon=1e-06, name='TimeSeriesTransformer', **kwargs)[source]
Parameters:
  • static_input_dim (int)

  • dynamic_input_dim (int)

  • future_input_dim (int)

  • embed_dim (int)

  • num_heads (int)

  • ffn_dim (int)

  • num_encoder_layers (int)

  • num_decoder_layers (int)

  • forecast_horizon (int)

  • output_dim (int)

  • dropout_rate (float)

  • input_dropout_rate (float)

  • max_seq_len_encoder (int)

  • max_seq_len_decoder (int)

  • quantiles (List[float] | None)

  • use_grn_for_static (bool)

  • static_integration_mode (str)

  • activation (str)

  • layer_norm_epsilon (float)

  • name (str | None)

Methods

__init__(static_input_dim, ...[, embed_dim, ...])

add_loss(loss)

Can be called inside of the call() method to add a scalar loss.

add_metric(*args, **kwargs)

add_variable(shape, initializer[, dtype, ...])

Add a weight variable to the layer.

add_weight([shape, initializer, dtype, ...])

Add a weight variable to the layer.

build(input_shape)

build_from_config(config)

Builds the layer's states with the supplied config dict.

call(inputs[, training])

Forward pass for the TimeSeriesTransformer.

compile([optimizer, loss, loss_weights, ...])

Configures the model for training.

compile_from_config(config)

Compiles the model with the information given in config.

compiled_loss(y, y_pred[, sample_weight, ...])

compute_loss([x, y, y_pred, sample_weight, ...])

Compute the total loss, validate it, and return it.

compute_mask(inputs, previous_mask)

compute_metrics(x, y, y_pred[, sample_weight])

Update metric states and collect all metrics to be returned.

compute_output_shape(*args, **kwargs)

compute_output_spec(*args, **kwargs)

count_params()

Count the total number of scalars composing the weights.

evaluate([x, y, batch_size, verbose, ...])

Returns the loss value & metrics values for the model in test mode.

export(filepath[, format, verbose, ...])

Export the model as an artifact for inference.

fit([x, y, batch_size, epochs, verbose, ...])

Trains the model for a fixed number of epochs (dataset iterations).

from_config(config[, custom_objects])

Creates an operation from its config.

get_build_config()

Returns a dictionary with the layer's input shape.

get_compile_config()

Returns a serialized config with information for compiling the model.

get_config()

Returns the config of the object.

get_layer([name, index])

Retrieves a layer based on either its name (unique) or index.

get_metrics_result()

Returns the model's metrics values as a dict.

get_params([deep])

Get the parameters for this learner.

get_state_tree([value_format])

Retrieves tree-like structure of model variables.

get_weights()

Return the values of layer.weights as a list of NumPy arrays.

help(**kwargs)

load(file_path[, format])

Load the learner's state from a specified file in the desired format.

load_own_variables(store)

Loads the state of the layer.

load_weights(filepath[, skip_mismatch])

Load the weights from a single file or sharded files.

loss(y, y_pred[, sample_weight])

make_predict_function([force])

make_test_function([force])

make_train_function([force])

predict(x[, batch_size, verbose, steps, ...])

Generates output predictions for the input samples.

predict_on_batch(x)

Returns predictions for a single batch of samples.

predict_step(data)

quantize(mode[, config])

Quantize the weights of the model.

quantized_build(input_shape, mode)

quantized_call(*args, **kwargs)

rematerialized_call(layer_call, *args, **kwargs)

Enable rematerialization dynamically for layer's call method.

reset_metrics()

save(filepath[, overwrite, zipped])

Saves a model as a .keras file.

save_own_variables(store)

Saves the state of the layer.

save_weights(filepath[, overwrite, ...])

Saves all weights to a single file or sharded files.

set_params(**params)

Set the parameters of this learner.

set_state_tree(state_tree)

Assigns values to variables of the model.

set_weights(weights)

Sets the values of layer.weights from a list of NumPy arrays.

stateless_call(trainable_variables, ...[, ...])

Call the layer without any side effects.

stateless_compute_loss(trainable_variables, ...)

summary([line_length, positions, print_fn, ...])

Prints a string summary of the network.

symbolic_call(*args, **kwargs)

test_on_batch(x[, y, sample_weight, return_dict])

Test the model on a single batch of samples.

test_step(data)

to_json(**kwargs)

Returns a JSON string containing the network configuration.

train_on_batch(x[, y, sample_weight, ...])

Runs a single gradient update on a single batch of data.

train_step(data)

Attributes

compiled_metrics

compute_dtype

The dtype of the computations performed by the layer.

distribute_reduction_method

distribute_strategy

dtype

Alias of layer.variable_dtype.

dtype_policy

input

Retrieves the input tensor(s) of a symbolic operation.

input_dtype

The dtype layer inputs should be converted to.

input_spec

jit_compile

layers

losses

List of scalar losses from add_loss, regularizers and sublayers.

metrics

List of all metrics.

metrics_names

metrics_variables

List of all metric variables.

my_params

non_trainable_variables

List of all non-trainable layer state.

non_trainable_weights

List of all non-trainable weight variables of the layer.

output

Retrieves the output tensor(s) of a layer.

path

The path of the layer.

quantization_mode

The quantization mode of this layer, None if not quantized.

run_eagerly

supports_masking

Whether this layer supports computing a mask using compute_mask.

trainable

Settable boolean, whether this layer should be trainable or not.

trainable_variables

List of all trainable layer state.

trainable_weights

List of all trainable weight variables of the layer.

variable_dtype

The dtype of the state (weights) of the layer.

variables

List of all layer state, including random seeds.

weights

List of all weight variables of the layer.

call(inputs, training=False)[source]

Forward pass for the TimeSeriesTransformer.

Parameters:
  • inputs (list or tuple of tensors) – The elements, in order, are:

    1. static_input, shape (Batch, static_input_dim). Can be None if self.static_input_dim is 0.

    2. dynamic_input, shape (Batch, T_past, dynamic_input_dim).

    3. future_input, shape (Batch, T_decode_seq, future_input_dim), where T_decode_seq is typically self.forecast_horizon. Can be None if self.future_input_dim is 0.

    The order must be preserved even when some elements are None: the caller should keep None in the corresponding position rather than omit the element. The list or tuple is passed on to prepare_model_inputs.

  • training (bool, default False) – Whether the model is in training mode.

Returns:

A tensor with forecast predictions.
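Because the positions in inputs are significant, it can help to build the list through one small function. The helper below is hypothetical (it is not part of fusionlab) and uses placeholder strings where real tensors would go. It assumes dynamic_input is always required, which the description above implies but does not state outright.

```python
def assemble_inputs(dynamic, static=None, future=None):
    """Hypothetical helper: return [static, dynamic, future] in fixed order.

    Missing static or future inputs stay as None placeholders so that the
    remaining elements keep their expected positions.
    """
    if dynamic is None:
        # Assumption: the past-observed dynamic input cannot be omitted.
        raise ValueError("dynamic input is required")
    return [static, dynamic, future]


# Placeholders stand in for tensors of the documented shapes.
inputs = assemble_inputs(dynamic="dynamic_tensor", future="future_tensor")
print(inputs)  # [None, 'dynamic_tensor', 'future_tensor']
```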

get_config()[source]

Returns the config of the object.

An object config is a Python dictionary (serializable) containing the information needed to re-instantiate it.

help(**kwargs)

my_params = TimeSeriesTransformer(
    static_input_dim,
    dynamic_input_dim,
    future_input_dim,
    embed_dim=64,
    num_heads=4,
    ffn_dim=128,
    num_encoder_layers=3,
    num_decoder_layers=3,
    forecast_horizon=1,
    output_dim=1,
    dropout_rate=0.1,
    input_dropout_rate=0.1,
    max_seq_len_encoder=100,
    max_seq_len_decoder=50,
    quantiles=None,
    use_grn_for_static=False,
    static_integration_mode='add_to_decoder_input',
    activation='relu',
    layer_norm_epsilon=1e-06,
    name='TimeSeriesTransformer'
)

classmethod from_config(config, custom_objects=None)[source]

Creates an operation from its config.

This method is the reverse of get_config, capable of instantiating the same operation from the config dictionary.

Note: If you override this method, you might receive a serialized dtype config, which is a dict. You can deserialize it as follows:

```python
if "dtype" in config and isinstance(config["dtype"], dict):
    policy = dtype_policies.deserialize(config["dtype"])
```

Parameters:

config – A Python dictionary, typically the output of get_config.

Returns:

An operation instance.
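The get_config / from_config contract can be illustrated without any framework. TinyConfigurable below is a hypothetical stand-in with three of the constructor arguments documented above; it only shows the round-trip pattern and says nothing about which keys the real TimeSeriesTransformer config contains.

```python
class TinyConfigurable:
    """Hypothetical stand-in that follows the get_config/from_config pattern."""

    def __init__(self, embed_dim=64, num_heads=4, quantiles=None):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.quantiles = quantiles

    def get_config(self):
        # A plain, serializable dict holding the constructor arguments.
        return {
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "quantiles": self.quantiles,
        }

    @classmethod
    def from_config(cls, config, custom_objects=None):
        # Reverse of get_config: rebuild an equivalent instance.
        return cls(**config)


original = TinyConfigurable(embed_dim=32, quantiles=[0.1, 0.5, 0.9])
clone = TinyConfigurable.from_config(original.get_config())
print(clone.get_config() == original.get_config())  # True
```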