fusionlab.nn.transformers.DummyTFT

class fusionlab.nn.transformers.DummyTFT

Bases: Model, NNLearner

DummyTFT: Simplified TFT variant using only Static and Dynamic inputs.

Notes

The Temporal Fusion Transformer (TFT) model combines the strengths of sequence-to-sequence models and attention mechanisms to handle complex temporal dynamics. It provides interpretability by allowing examination of variable importance and temporal attention weights.

Variable Selection Networks (VSNs):

VSNs select relevant variables by applying Gated Residual Networks (GRNs) to each variable and computing variable importance weights via a softmax function. This allows the model to focus on the most informative features.

Gated Residual Networks (GRNs):

GRNs allow the model to capture complex nonlinear relationships while controlling information flow via gating mechanisms. They consist of a nonlinear layer followed by gating and residual connections.
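
For reference, the GRN as defined in the original TFT paper can be written as follows. This is a sketch of the published formulation, not a statement about this implementation's internals: the weight names \(\mathbf{W}_k\), \(\mathbf{b}_k\), the optional context vector \(\mathbf{c}\), and the use of ELU are taken from the paper and are assumptions here (the activation parameter documented below may replace ELU).

\[\boldsymbol{\eta}_2 = \text{ELU}(\mathbf{W}_2 \mathbf{a} + \mathbf{W}_3 \mathbf{c} + \mathbf{b}_2)\]

\[\boldsymbol{\eta}_1 = \mathbf{W}_1 \boldsymbol{\eta}_2 + \mathbf{b}_1\]

\[\text{GLU}(\boldsymbol{\gamma}) = \sigma(\mathbf{W}_4 \boldsymbol{\gamma} + \mathbf{b}_4) \odot (\mathbf{W}_5 \boldsymbol{\gamma} + \mathbf{b}_5)\]

\[\text{GRN}(\mathbf{a}, \mathbf{c}) = \text{LayerNorm}(\mathbf{a} + \text{GLU}(\boldsymbol{\eta}_1))\]

Here \(\mathbf{a}\) is the primary input, \(\sigma\) is the sigmoid function, and \(\odot\) is the element-wise product. The gate \(\sigma(\cdot)\) can shrink the nonlinear branch toward zero, in which case the residual connection passes \(\mathbf{a}\) through largely unchanged.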

Static Enrichment Layer:

Enriches temporal features with static context, enabling the model to adjust temporal dynamics based on static information. This layer combines static embeddings with temporal representations.

Temporal Attention Layer:

Applies multi-head attention over the temporal dimension to focus on important time steps. This mechanism allows the model to weigh different time steps differently when making predictions.

Mathematical Formulation:

Let:

\(\mathbf{x}_{\text{static}} \in \mathbb{R}^{n_s \times d_s}\) be the static inputs,
\(\mathbf{x}_{\text{dynamic}} \in \mathbb{R}^{T \times n_d \times d_d}\) be the dynamic inputs,
\(n_s\) and \(n_d\) be the numbers of static and dynamic variables,
\(d_s\) and \(d_d\) be their respective input dimensions,
\(T\) be the number of time steps.

Variable Selection Networks (VSNs):

For static variables:

\[\mathbf{e}_{\text{static}} = \sum_{i=1}^{n_s} \alpha_i \cdot \text{GRN}(\mathbf{x}_{\text{static}, i})\]

For dynamic variables:

\[\mathbf{E}_{\text{dynamic}} = \sum_{j=1}^{n_d} \beta_j \cdot \text{GRN}(\mathbf{x}_{\text{dynamic}, :, j})\]

where \(\alpha_i\) and \(\beta_j\) are variable importance weights computed via softmax.
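
The weighted sum above can be illustrated with a minimal pure-Python sketch. It is not the library's VariableSelectionNetwork: the helper names `softmax` and `select_variables` are hypothetical, the per-variable embeddings stand in for the GRN outputs, and the importance logits are supplied by hand rather than learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_variables(var_embeddings, logits):
    """Combine per-variable embeddings with softmax importance weights.

    var_embeddings: list of n vectors (lists of d floats), standing in
        for GRN(x_i) of each variable (an assumption of this sketch).
    logits: list of n unnormalized importance scores.
    Returns (combined_vector, weights).
    """
    weights = softmax(logits)
    dim = len(var_embeddings[0])
    combined = [
        sum(w * emb[k] for w, emb in zip(weights, var_embeddings))
        for k in range(dim)
    ]
    return combined, weights


# Three static variables, each already mapped to a 2-d embedding.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [0.0, 0.0, 0.0]  # equal logits give equal weights of 1/3
combined, weights = select_variables(embeddings, logits)
```

Because the weights sum to one, they can be read directly as relative variable importances, which is the source of the interpretability mentioned earlier.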

LSTM Encoder:

Processes dynamic embeddings to capture sequential dependencies:

\[\mathbf{H} = \text{LSTM}(\mathbf{E}_{\text{dynamic}})\]

Static Enrichment Layer:

Combines static context with temporal features:

\[\mathbf{H}_{\text{enriched}} = \text{StaticEnrichment}(\mathbf{e}_{\text{static}}, \mathbf{H})\]

Temporal Attention Layer:

Applies attention over time steps:

\[\mathbf{Z} = \text{TemporalAttention}(\mathbf{H}_{\text{enriched}})\]
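
To make the idea of weighting time steps concrete, here is a simplified single-head, scaled dot-product self-attention in pure Python. It is an illustration only: the function name `self_attention` is hypothetical, queries, keys, and values all use the raw inputs (identity projections), and the multi-head structure and any masking used by the actual TemporalAttentionLayer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(h):
    """Single-head scaled dot-product self-attention over time steps.

    h: list of T vectors (lists of d floats), one per time step.
    Returns (z, weights): z has the same shape as h, and weights[t][s]
    is how much time step t attends to time step s.
    """
    d = len(h[0])
    scale = math.sqrt(d)
    weights = []
    for query in h:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in h]
        weights.append(softmax(scores))
    z = [
        [sum(weights[t][s] * h[s][k] for s in range(len(h))) for k in range(d)]
        for t in range(len(h))
    ]
    return z, weights


# Three time steps with 2-d features.
h_enriched = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, attn = self_attention(h_enriched)
```

Each row of `attn` is a probability distribution over time steps, so inspecting it shows which steps contributed most to a given output position.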

Position-wise Feedforward Layer:

Refines the output:

\[\mathbf{F} = \text{GRN}(\mathbf{Z})\]

Final Output:

For point forecasting:

\[\hat{y} = \text{OutputLayer}(\mathbf{F}_{T})\]

For quantile forecasting (if quantiles are specified):

\[\hat{y}_q = \text{OutputLayer}_q(\mathbf{F}_{T}), \quad q \in \text{quantiles}\]

where \(\mathbf{F}_{T}\) is the feature vector at the last time step.

Examples

>>> from fusionlab.nn.transformers import DummyTFT
>>> # Define model parameters
>>> model = DummyTFT(
...     static_input_dim=1,
...     dynamic_input_dim=1,
...     hidden_units=64,
...     num_heads=4,
...     dropout_rate=0.1,
...     forecast_horizon=1,
...     quantiles=[0.1, 0.5, 0.9],
...     activation='relu',
...     use_batch_norm=True,
...     num_lstm_layers=2,
...     lstm_units=[64, 32]
... )
>>> model.compile(optimizer='adam', loss='mse')
>>> # Assume `static_inputs`, `dynamic_inputs`, and `labels` are prepared
>>> model.fit(
...     [static_inputs, dynamic_inputs],
...     labels,
...     epochs=10,
...     batch_size=32
... )

Notes

When using quantile regression by specifying the quantiles parameter, ensure that your loss function is compatible with quantile prediction, such as the quantile loss function. Additionally, the model output will have multiple predictions per time step, corresponding to each quantile.
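
For readers unfamiliar with quantile loss, the standard pinball loss can be sketched in plain Python as below. The function names `pinball_loss` and `multi_quantile_loss` are hypothetical and this is not the loss shipped with the library; it only shows the asymmetry that makes a prediction converge toward the requested quantile.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for one quantile level q in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = t - p
        # Under-prediction (err > 0) costs q * err;
        # over-prediction (err < 0) costs (1 - q) * |err|.
        total += max(q * err, (q - 1.0) * err)
    return total / len(y_true)


def multi_quantile_loss(y_true, y_pred_by_q, quantiles):
    """Average the pinball loss over several quantile levels.

    y_pred_by_q[i] holds the predictions for quantiles[i].
    """
    losses = [
        pinball_loss(y_true, preds, q)
        for preds, q in zip(y_pred_by_q, quantiles)
    ]
    return sum(losses) / len(losses)


y_true = [10.0, 12.0]
under = pinball_loss(y_true, [8.0, 10.0], q=0.9)   # predictions too low
over = pinball_loss(y_true, [12.0, 14.0], q=0.9)   # predictions too high
```

For q = 0.9, under-predicting by 2 is penalized nine times more heavily than over-predicting by 2, which pushes the 0.9-quantile output above most observed values.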

See also

VariableSelectionNetwork: Selects relevant variables.
GatedResidualNetwork: Processes inputs with gating mechanisms.
StaticEnrichmentLayer: Enriches temporal features with static context.
TemporalAttentionLayer: Applies attention over time steps.

References

The DummyTFT combines high-performance multi-horizon forecasting with interpretable insights into temporal dynamics [1]. It integrates several advanced mechanisms, including:

Variable Selection Networks (VSNs) for static and dynamic features.
Gated Residual Networks (GRNs) for processing inputs.
Static Enrichment Layer to incorporate static features into temporal processing.
LSTM Encoder for capturing sequential dependencies.
Temporal Attention Layer for focusing on important time steps.
Position-wise Feedforward Layer.
Final Output Layer for prediction.

Parameters:

static_input_dim (int) – The input dimension per static variable. Typically 1 for scalar features or higher for embeddings. This defines the number of features for each static variable. For example, if static variables are represented using embeddings of size 16, then static_input_dim would be 16.
dynamic_input_dim (int) – The input dimension per dynamic variable. This defines the number of features for each dynamic variable at each time step. For instance, if dynamic variables are represented using embeddings or multiple features, specify the appropriate dimension.

hidden_units: int

The number of hidden units in the model’s layers. This parameter defines the size of the hidden layers throughout the model, including Gated Residual Networks (GRNs), Long Short-Term Memory (LSTM) layers, and fully connected layers. Increasing hidden_units raises the model’s capacity to capture complex relationships in the data, but also increases the number of parameters and therefore the computational cost. Choose a value that balances capacity against available resources and the complexity of the problem.

num_heads: int

The number of attention heads in the multi-head attention mechanism. Each head attends to the input in parallel and can focus on a different aspect of it, so additional heads let the model capture more diverse relationships. For example, in natural language processing, different heads can attend to different semantic aspects of a text. More heads also require more memory and computation.

dropout_rate: float, optional

The dropout rate applied during training to prevent overfitting. Dropout is a regularization technique where a fraction of input units is randomly set to zero at each training step to prevent the model from relying too heavily on any one feature. This helps improve generalization and can make the model more robust. Dropout is particularly effective in deep learning models where overfitting is a common issue. The value should be between 0.0 and 1.0, where a value of 0.0 means no dropout is applied and a value of 1.0 means that all units are dropped. A typical value for dropout_rate ranges from 0.1 to 0.5.

forecast_horizon: int, optional

The number of time steps to forecast. Default is 1. This parameter defines the number of future time steps the model will predict. For multi-step forecasting, set forecast_horizon to the desired number of future steps.

quantiles: list of float or None, optional

A list of quantiles to predict for each time step. For example, specifying [0.1, 0.5, 0.9] would result in the model predicting the 10th, 50th, and 90th percentiles of the target variable at each time step. This is useful for estimating prediction intervals and capturing uncertainty in forecasting tasks. If set to None, the model performs point forecasting and predicts a single value (e.g., the mean or most likely value) for each time step. Quantile forecasting is commonly used for applications where it is important to predict not just the most likely outcome, but also the range of possible outcomes.

activation: str, optional

The activation function to use in the Gated Residual Networks (GRNs). The activation function defines how the model’s internal representations are transformed before being passed to the next layer. Supported values include:

'elu': Exponential Linear Unit (ELU), a variant of ReLU that improves training performance by preventing dying neurons. ELU provides a smooth output for negative values, which can help mitigate the issue of vanishing gradients. The mathematical formulation for ELU is:

\[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha (\exp(x) - 1) & \text{if } x \leq 0 \end{cases}\]

where \(\alpha\) is a constant (usually 1.0).
'relu': Rectified Linear Unit (ReLU), a widely used activation function that outputs zero for negative input and the input itself for positive values. It is computationally efficient and reduces the risk of vanishing gradients. The mathematical formulation for ReLU is:

\[f(x) = \max(0, x)\]

where \(x\) is the input value.
'tanh': Hyperbolic Tangent, which squashes the outputs into a range between -1 and 1. It is useful when preserving the sign of the input is important, but can suffer from vanishing gradients for large inputs. The mathematical formulation for tanh is:

\[f(x) = \frac{2}{1 + \exp(-2x)} - 1\]

'sigmoid': Sigmoid function, commonly used for binary classification tasks, maps outputs between 0 and 1, making it suitable for probabilistic outputs. The mathematical formulation for sigmoid is:

\[f(x) = \frac{1}{1 + \exp(-x)}\]

'linear': No activation (identity function), often used in regression tasks where no non-linearity is needed. The output is simply the input value:

\[f(x) = x\]

The default activation function is 'elu'.
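
The five formulas above translate directly into scalar Python functions. This is a sketch for checking the formulas by hand, not the implementation the model uses internally; the `ACTIVATIONS` lookup table is a hypothetical convenience.

```python
import math


def elu(x, alpha=1.0):
    """x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def tanh(x):
    """2 / (1 + exp(-2x)) - 1; may overflow for very negative x."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0


def sigmoid(x):
    """1 / (1 + exp(-x)); may overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def linear(x):
    """Identity."""
    return x


# Hypothetical lookup keyed by the strings accepted by `activation`.
ACTIVATIONS = {
    "elu": elu,
    "relu": relu,
    "tanh": tanh,
    "sigmoid": sigmoid,
    "linear": linear,
}

values = {name: fn(-1.0) for name, fn in ACTIVATIONS.items()}
```

Evaluating each function at the same negative input makes the differences visible: ReLU clips to zero, ELU and tanh return negative values, sigmoid stays positive, and linear passes the input through.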

use_batch_norm: bool, optional: Whether to use batch normalization in the Gated Residual Networks (GRNs). Batch normalization normalizes the input to each layer, stabilizing and accelerating the training process. When set to True, it normalizes the activations by scaling and shifting them to maintain a stable distribution during training. This technique can help mitigate issues like vanishing and exploding gradients, making it easier to train deep networks. Batch normalization also acts as a form of regularization, reducing the need for other techniques like dropout. By default, batch normalization is turned off (False).
num_lstm_layers: int, optional: Number of LSTM layers in the encoder. Default is 1. Adding more layers can help the model capture more complex sequential patterns. Each additional layer processes the output of the previous LSTM layer.
lstm_units: list of int or None, optional: List containing the number of units for each LSTM layer. If None, all LSTM layers have hidden_units units. Default is None. This parameter allows customizing the size of each LSTM layer. For example, to specify different units for each layer, provide a list like [64, 32].
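
The fallback rule for lstm_units can be sketched as a small helper. The function `resolve_lstm_units` is hypothetical and not part of the library. Two behaviors are assumptions of this sketch rather than documented facts: broadcasting a single int to every layer, and raising an error when the list length differs from num_lstm_layers.

```python
def resolve_lstm_units(hidden_units, num_lstm_layers=1, lstm_units=None):
    """Return one unit count per LSTM layer.

    Documented rule: if lstm_units is None, every layer uses hidden_units.
    Assumed here: an int is repeated for every layer, and a list must
    contain exactly num_lstm_layers entries.
    """
    if lstm_units is None:
        return [hidden_units] * num_lstm_layers
    if isinstance(lstm_units, int):
        return [lstm_units] * num_lstm_layers
    units = list(lstm_units)
    if len(units) != num_lstm_layers:
        raise ValueError(
            f"expected {num_lstm_layers} entries in lstm_units, got {len(units)}"
        )
    return units


default_units = resolve_lstm_units(hidden_units=64, num_lstm_layers=2)
custom_units = resolve_lstm_units(64, num_lstm_layers=2, lstm_units=[64, 32])
```

With these inputs, `default_units` falls back to hidden_units for both layers, while `custom_units` keeps the per-layer sizes from the list.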

call(inputs, training=False)

Forward pass of the model.

Parameters:

inputs (tuple of tensors) –
A tuple containing (static_inputs, dynamic_inputs).
- static_inputs: Tensor of shape (batch_size, num_static_vars, static_input_dim) representing the static features.
- dynamic_inputs: Tensor of shape (batch_size, time_steps, num_dynamic_vars, dynamic_input_dim) representing the dynamic features.
training (bool, optional) – Whether the model is in training mode. Default is False.

Returns:

The output predictions of the model. The shape depends on the forecast_horizon and whether quantiles are used.

Return type:

Tensor
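
The input shape contract described above can be checked before calling the model. The validator below is a hypothetical helper, not a library function; it works on plain shape tuples so it runs without any tensor library, and it assumes that both inputs must share the same batch size.

```python
def check_input_shapes(static_shape, dynamic_shape,
                       static_input_dim, dynamic_input_dim):
    """Validate shapes against the documented layout.

    static_shape:  (batch_size, num_static_vars, static_input_dim)
    dynamic_shape: (batch_size, time_steps, num_dynamic_vars, dynamic_input_dim)
    Returns a dict with the inferred sizes, or raises ValueError.
    """
    if len(static_shape) != 3:
        raise ValueError(f"static inputs must be 3-D, got {static_shape}")
    if len(dynamic_shape) != 4:
        raise ValueError(f"dynamic inputs must be 4-D, got {dynamic_shape}")
    if static_shape[0] != dynamic_shape[0]:
        raise ValueError("static and dynamic batch sizes differ")
    if static_shape[2] != static_input_dim:
        raise ValueError("last static axis must equal static_input_dim")
    if dynamic_shape[3] != dynamic_input_dim:
        raise ValueError("last dynamic axis must equal dynamic_input_dim")
    return {
        "batch_size": static_shape[0],
        "num_static_vars": static_shape[1],
        "time_steps": dynamic_shape[1],
        "num_dynamic_vars": dynamic_shape[2],
    }


# 32 samples, 2 static variables, 10 time steps, 3 dynamic variables.
info = check_input_shapes(
    (32, 2, 1), (32, 10, 3, 1), static_input_dim=1, dynamic_input_dim=1
)
```

In practice the tuples would come from the `.shape` attribute of the prepared arrays or tensors.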

get_config(): Returns the configuration of the model for serialization.

from_config(config): Instantiates the model from a configuration dictionary.

__init__(dynamic_input_dim, static_input_dim, future_input_dim=None, hidden_units=32, num_heads=4, dropout_rate=0.1, forecast_horizon=1, quantiles=None, activation='elu', use_batch_norm=False, num_lstm_layers=1, lstm_units=None, output_dim=1, name=None, **kwargs)

Parameters:

dynamic_input_dim (int)
static_input_dim (int)
future_input_dim (Any)
hidden_units (int)
num_heads (int)
dropout_rate (float)
forecast_horizon (int)
quantiles (List[float] | None)
activation (str)
use_batch_norm (bool)
num_lstm_layers (int)
lstm_units (int | List[int] | None)
output_dim (int)
name (str | None)

Methods

`__init__`(dynamic_input_dim, static_input_dim)
`add_loss`(loss)	Can be called inside of the call() method to add a scalar loss.
`add_metric`(*args, **kwargs)
`add_variable`(shape, initializer[, dtype, ...])	Add a weight variable to the layer.
`add_weight`([shape, initializer, dtype, ...])	Add a weight variable to the layer.
`build`(input_shape)
`build_from_config`(config)	Builds the layer's states with the supplied config dict.
`call`(inputs[, training])	Forward pass for DummyTFT (Static and Dynamic inputs only).
`compile`([optimizer, loss, loss_weights, ...])	Configures the model for training.
`compile_from_config`(config)	Compiles the model with the information given in config.
`compiled_loss`(y, y_pred[, sample_weight, ...])
`compute_loss`([x, y, y_pred, sample_weight, ...])	Compute the total loss, validate it, and return it.
`compute_mask`(inputs, previous_mask)
`compute_metrics`(x, y, y_pred[, sample_weight])	Update metric states and collect all metrics to be returned.
`compute_output_shape`(*args, **kwargs)
`compute_output_spec`(*args, **kwargs)
`count_params`()	Count the total number of scalars composing the weights.
`evaluate`([x, y, batch_size, verbose, ...])	Returns the loss value & metrics values for the model in test mode.
`export`(filepath[, format, verbose, ...])	Export the model as an artifact for inference.
`fit`([x, y, batch_size, epochs, verbose, ...])	Trains the model for a fixed number of epochs (dataset iterations).
`from_config`(config)	Creates an operation from its config.
`get_build_config`()	Returns a dictionary with the layer's input shape.
`get_compile_config`()	Returns a serialized config with information for compiling the model.
`get_config`()	Returns the config of the object.
`get_layer`([name, index])	Retrieves a layer based on either its name (unique) or index.
`get_metrics_result`()	Returns the model's metrics values as a dict.
`get_params`([deep])	Get the parameters for this learner.
`get_state_tree`([value_format])	Retrieves tree-like structure of model variables.
`get_weights`()	Return the values of layer.weights as a list of NumPy arrays.
`help`(**kwargs)
`load`(file_path[, format])	Load the learner's state from a specified file in the desired format.
`load_own_variables`(store)	Loads the state of the layer.
`load_weights`(filepath[, skip_mismatch])	Load the weights from a single file or sharded files.
`loss`(y, y_pred[, sample_weight])
`make_predict_function`([force])
`make_test_function`([force])
`make_train_function`([force])
`predict`(x[, batch_size, verbose, steps, ...])	Generates output predictions for the input samples.
`predict_on_batch`(x)	Returns predictions for a single batch of samples.
`predict_step`(data)
`quantize`(mode[, config])	Quantize the weights of the model.
`quantized_build`(input_shape, mode)
`quantized_call`(*args, **kwargs)
`rematerialized_call`(layer_call, *args, **kwargs)	Enable rematerialization dynamically for layer's call method.
`reset_metrics`()
`save`(filepath[, overwrite, zipped])	Saves a model as a .keras file.
`save_own_variables`(store)	Saves the state of the layer.
`save_weights`(filepath[, overwrite, ...])	Saves all weights to a single file or sharded files.
`set_params`(**params)	Set the parameters of this learner.
`set_state_tree`(state_tree)	Assigns values to variables of the model.
`set_weights`(weights)	Sets the values of layer.weights from a list of NumPy arrays.
`stateless_call`(trainable_variables, ...[, ...])	Call the layer without any side effects.
`stateless_compute_loss`(trainable_variables, ...)
`summary`([line_length, positions, print_fn, ...])	Prints a string summary of the network.
`symbolic_call`(*args, **kwargs)
`test_on_batch`(x[, y, sample_weight, return_dict])	Test the model on a single batch of samples.
`test_step`(data)
`to_json`(**kwargs)	Returns a JSON string containing the network configuration.
`train_on_batch`(x[, y, sample_weight, ...])	Runs a single gradient update on a single batch of data.
`train_step`(data)

Attributes

`compiled_metrics`
`compute_dtype`	The dtype of the computations performed by the layer.
`distribute_reduction_method`
`distribute_strategy`
`dtype`	Alias of layer.variable_dtype.
`dtype_policy`
`input`	Retrieves the input tensor(s) of a symbolic operation.
`input_dtype`	The dtype layer inputs should be converted to.
`input_spec`
`jit_compile`
`layers`
`losses`	List of scalar losses from add_loss, regularizers and sublayers.
`metrics`	List of all metrics.
`metrics_names`
`metrics_variables`	List of all metric variables.
`my_params`
`non_trainable_variables`	List of all non-trainable layer state.
`non_trainable_weights`	List of all non-trainable weight variables of the layer.
`output`	Retrieves the output tensor(s) of a layer.
`path`	The path of the layer.
`quantization_mode`	The quantization mode of this layer, None if not quantized.
`run_eagerly`
`supports_masking`	Whether this layer supports computing a mask using compute_mask.
`trainable`	Settable boolean, whether this layer should be trainable or not.
`trainable_variables`	List of all trainable layer state.
`trainable_weights`	List of all trainable weight variables of the layer.
`variable_dtype`	The dtype of the state (weights) of the layer.
`variables`	List of all layer state, including random seeds.
`weights`	List of all weight variables of the layer.

call(inputs, training=False, **kwargs): Forward pass for DummyTFT (Static and Dynamic inputs only).

get_config()

Returns the config of the object.

An object config is a Python dictionary (serializable) containing the information needed to re-instantiate it.

classmethod from_config(config)

Creates an operation from its config.

This method is the reverse of get_config, capable of instantiating the same operation from the config dictionary.

Note: If you override this method, you might receive a serialized dtype config, which is a dict. You can deserialize it as follows:

```python
if "dtype" in config and isinstance(config["dtype"], dict):
    policy = dtype_policies.deserialize(config["dtype"])
```

Parameters: config – A Python dictionary, typically the output of get_config.
Returns: An operation instance.
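
The get_config / from_config round trip can be illustrated with a plain Python class. `TinyForecaster` is a hypothetical stand-in, not DummyTFT and not a Keras object; it only mirrors the pattern of returning constructor arguments as a serializable dictionary and rebuilding an equivalent object from it.

```python
import json


class TinyForecaster:
    """Stand-in object that follows the get_config/from_config pattern."""

    def __init__(self, hidden_units=32, quantiles=None):
        self.hidden_units = hidden_units
        self.quantiles = quantiles

    def get_config(self):
        # Only plain, JSON-serializable constructor arguments go here.
        return {"hidden_units": self.hidden_units, "quantiles": self.quantiles}

    @classmethod
    def from_config(cls, config):
        # Reverse of get_config: feed the dictionary back to __init__.
        return cls(**config)


original = TinyForecaster(hidden_units=64, quantiles=[0.1, 0.5, 0.9])
serialized = json.dumps(original.get_config())          # e.g. written to disk
restored = TinyForecaster.from_config(json.loads(serialized))
```

Passing the config through JSON in the example confirms that the dictionary contains only serializable values, which is the property the method description above relies on.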

help(**kwargs)

my_params = DummyTFT( dynamic_input_dim, static_input_dim, future_input_dim=None, hidden_units=32, num_heads=4, dropout_rate=0.1, forecast_horizon=1, quantiles=None, activation='elu', use_batch_norm=False, num_lstm_layers=1, lstm_units=None, output_dim=1, name=None )