fusionlab.nn.utils.prepare_pinn_data_sequences

fusionlab.nn.utils.prepare_pinn_data_sequences(df, time_col, subsidence_col, gwl_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, lon_col=None, lat_col=None, group_id_cols=None, time_steps=12, forecast_horizon=3, output_subsidence_dim=1, output_gwl_dim=1, datetime_format=None, normalize_coords=True, cols_to_scale=None, return_coord_scaler=False, mode=None, savefile=None, verbose=0, _logger=None, **kws)

Reshapes and prepares time-series data into sequences for PINN models.

This function transforms a Pandas DataFrame into structured NumPy arrays suitable for training Physics-Informed Neural Networks like PIHALNet. It creates sequences of dynamic features, future known features, static features, target variables (subsidence and GWL), and importantly, spatio-temporal coordinates corresponding to the forecast horizon.
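The windowing described above can be pictured with a small NumPy sketch. It is not the library's implementation: `make_windows` is a hypothetical helper, and it assumes a single group whose rows are already sorted by time and a stride of one step between consecutive windows.

```python
import numpy as np

def make_windows(values, time_steps, forecast_horizon):
    """Split a (T, D) array into look-back and horizon windows (hypothetical helper)."""
    values = np.asarray(values)
    window = time_steps + forecast_horizon
    n_seq = len(values) - window + 1
    if n_seq <= 0:
        raise ValueError("not enough rows for a single window")
    # Look-back part: rows [i, i + time_steps)
    past = np.stack([values[i:i + time_steps] for i in range(n_seq)])
    # Horizon part: rows [i + time_steps, i + time_steps + forecast_horizon)
    future = np.stack([values[i + time_steps:i + window] for i in range(n_seq)])
    return past, future

series = np.arange(10, dtype=float).reshape(-1, 1)  # 10 time steps, 1 feature
past, future = make_windows(series, time_steps=4, forecast_horizon=2)
print(past.shape, future.shape)  # (5, 4, 1) (5, 2, 1)
```

Each look-back window is immediately followed by its horizon window, which is the layout the dynamic features and targets share.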

Parameters:

df (pandas.DataFrame) – Tidy time‑series table in long format. Each row must correspond to exactly one time stamp and (optionally) one spatial/entity identifier. All feature and target columns listed below must already exist in df.
time_col (str) – Column holding the observation time. May be numeric (year‑fraction, ordinal day, seconds since epoch, …) or any pandas‑recognised datetime dtype; non‑numeric values are converted to a numeric “year + fraction‑of‑year” scale.
subsidence_col (str) – Name of the target column holding vertical land movement (subsidence). Values are sliced to produce the horizon target with shape (N, forecast_horizon, output_subsidence_dim).
gwl_col (str) – Name of the target column holding groundwater level (GWL). Values are sliced to produce the horizon target with shape (N, forecast_horizon, output_gwl_dim).
dynamic_cols (list[str]) – Names of past-covariate columns sampled over the look-back window of length time_steps. All columns are stacked in the order supplied to form dynamic_features → shape (N, time_steps, len(dynamic_cols)).
static_cols (list[str] or None, default None) – Columns that are constant within a group (e.g. soil type, aquifer class). Only the first value in each group is stored, resulting in static_features of shape (N, len(static_cols)). Pass None if the data set has no static covariates.
future_cols (list[str] or None, default None) –
Known-future covariates such as weather forecasts. Their temporal span depends on mode:
- 'pihal_like' → length = forecast_horizon (decoder only)
- 'tft_like' → length = time_steps + forecast_horizon (encoder + decoder)
When None an empty future_features tensor is returned.
spatial_cols ((str, str) or None, default None) – Tuple (lon, lat) naming the longitude and latitude columns. Overrides lon_col / lat_col. Required unless lon_col and lat_col are both supplied.
lon_col (str or None, default None) – Name of the longitude column, used as a fallback when spatial_cols is None. Must be provided together with lat_col.
lat_col (str or None, default None) – Name of the latitude column, used as a fallback when spatial_cols is None. Must be provided together with lon_col.
group_id_cols (list[str] or None, default None) – Keys that identify independent trajectories (e.g. ['site_id'] or ['site', 'layer']). Each group is treated as an isolated time‑series for rolling‑window extraction. If None the entire DataFrame is processed as one group.
time_steps (int, default 12) – Length of the look-back window \(T_\text{past}\) supplied to the encoder. Must be ≥ 1.
forecast_horizon (int, default 3) – Prediction length \(H\). Determines the time dimension of the decoder outputs and of coords.
output_subsidence_dim (int, default 1) – Size of the last dimension of the subsidence target tensor. Use > 1 when predicting several layers simultaneously.
output_gwl_dim (int, default 1) – Size of the last dimension of the GWL target tensor. Use > 1 when predicting several layers simultaneously.
datetime_format (str or None, default None) – Custom parsing string forwarded to pandas.to_datetime when time_col is a string column.
normalize_coords (bool, default True) – If True, normalizes the ‘t’, ‘x’, ‘y’ coordinate values (derived from time_col, lon_col, lat_col) to the [0, 1] range based on the min/max within each group. This is often beneficial for neural network training.
cols_to_scale (list of str or "auto" or None, default "auto") –
- If a list of column names: scale exactly those columns.
- If "auto": select all numeric columns, then:
  * Exclude time_col, lon_col, lat_col if scale_coords=False.
  * Exclude any columns whose values are only {0, 1} (assumed one-hot).
- If None: no extra columns are scaled.
return_coord_scaler (bool, default False) – When enabled the function returns a three‑tuple (inputs_dict, targets_dict, coord_scaler) where coord_scaler is the fitted scaler instance or None when scaling was disabled.
mode ({'pihal_like', 'tft_like'} or None, default None) –
Selects the temporal schema used to construct the future_features window and consequently how the downstream model will consume it.
- 'pihal_like' – builds future tensors with length equal to the forecast horizon \(H\) only. These tensors are intended for models that feed known-future covariates exclusively to the decoder (PIHALNet convention).
- 'tft_like' – builds future tensors whose time dimension is \(T_\text{past} + H\), i.e. the look-back window (time_steps) followed by the horizon. This matches the Temporal Fusion Transformer, where the encoder ingests the overlapping first segment and the decoder receives the horizon segment.
- None – treated as 'pihal_like' for backward compatibility.
The setting affects the allocated size of future_features_arr and the slice indices used during rolling window extraction; make sure it matches the mode expected by the target model.
savefile (str or None, default None) – Path to a .joblib file. When provided, all produced arrays and meta‑data are serialised for reproducibility and faster reloads.
verbose (int, default 0) – Verbosity level for logging (0-10).
- 0: Silent.
- 1: Basic info.
- 2: More details on shapes and processing.
- 5: Per-group processing info.
- 7: Per-sequence sample data (use with caution for large data).
_logger (Logger | Callable[[str], None] | None)
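How mode changes the future-covariate window can be sketched with plain index arithmetic. The helper `future_slice` below is hypothetical and only illustrates the two lengths described for 'pihal_like' and 'tft_like', assuming window `start` begins at that row of a time-sorted group; the library's actual indexing may differ.

```python
def future_slice(start, time_steps, forecast_horizon, mode=None):
    """Row slice covered by future covariates for one window (illustrative only)."""
    mode = mode or "pihal_like"  # None falls back to 'pihal_like'
    horizon_start = start + time_steps
    horizon_end = horizon_start + forecast_horizon
    if mode == "pihal_like":
        return slice(horizon_start, horizon_end)  # horizon only
    if mode == "tft_like":
        return slice(start, horizon_end)          # look-back + horizon
    raise ValueError(f"unknown mode: {mode!r}")

rows = list(range(20))
pihal_rows = rows[future_slice(0, 12, 3, "pihal_like")]
tft_rows = rows[future_slice(0, 12, 3, "tft_like")]
print(len(pihal_rows), len(tft_rows))  # 3 15
```

With the defaults time_steps=12 and forecast_horizon=3, the two modes therefore produce future windows of 3 and 15 steps respectively.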

Returns:

Union[Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]], Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[MinMaxScaler]]] – A tuple containing two dictionaries:
1. inputs_dict:
  - 'coords': np.ndarray of shape (N, H, 3) for [t, x, y] coordinates over the forecast horizon.
  - 'static_features': np.ndarray of shape (N, D_s).
  - 'dynamic_features': np.ndarray of shape (N, T_past, D_d).
  - 'future_features': np.ndarray of shape (N, H, D_f) in 'pihal_like' mode, or (N, T_past + H, D_f) in 'tft_like' mode.
2. targets_dict:
  - 'subsidence': np.ndarray of shape (N, H, O_subs).
  - 'gwl': np.ndarray of shape (N, H, O_gwl).
Where N is the total number of sequences, H is forecast_horizon, T_past is time_steps, D_s/d/f are feature dimensions, and O_subs/gwl are output dimensions.

If return_coord_scaler is False (default): A tuple containing two dictionaries: inputs_dict and targets_dict.
If return_coord_scaler is True: A tuple containing three elements: inputs_dict, targets_dict, and coord_scaler. coord_scaler is the MinMaxScaler instance used if normalize_coords was True (and thus coordinates were normalized), otherwise it is None.
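When coordinates were normalized, quantities expressed in normalized units usually need to be mapped back to physical units. The sketch below reproduces min-max scaling with NumPy only, assuming one minimum and maximum per coordinate channel and made-up coordinate values. With a returned coord_scaler, the equivalent step would be scikit-learn's `MinMaxScaler.inverse_transform` applied to the coordinates reshaped to (N * H, 3).

```python
import numpy as np

# Toy coords tensor of shape (N, H, 3) holding [t, x, y]; values are made up.
coords = np.array([
    [[2020.0, 113.10, 22.60], [2021.0, 113.10, 22.60]],
    [[2021.0, 113.25, 22.75], [2022.0, 113.25, 22.75]],
])

flat = coords.reshape(-1, 3)                  # (N * H, 3)
lo, hi = flat.min(axis=0), flat.max(axis=0)
span = np.where(hi > lo, hi - lo, 1.0)        # guard against constant channels

# Forward transform: every channel ends up in [0, 1].
scaled = ((flat - lo) / span).reshape(coords.shape)
# Inverse transform: recover the original units.
restored = (scaled.reshape(-1, 3) * span + lo).reshape(coords.shape)

print(scaled[..., 0])                  # normalized time channel
print(np.allclose(restored, coords))   # True
```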

Raises:

ValueError – If required columns are missing, or if data is insufficient to create any sequences.
TypeError – If df is not a Pandas DataFrame.

Return type:

Tuple[Dict[str, ndarray], Dict[str, ndarray]] | Tuple[Dict[str, ndarray], Dict[str, ndarray], MinMaxScaler | None]

Notes

A sequence is generated only if a group contains at least

\[T_\text{past} + H\]

consecutive observations without gaps.
The function is stateless; call it each time the configuration changes.
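Before calling the function, it can help to estimate how many sequences each group is able to contribute. The sketch below assumes a stride of one step and gap-free groups, so a group with n rows yields n − (time_steps + forecast_horizon) + 1 windows, or none if n is too small. `count_windows` and the `site` column are hypothetical.

```python
import pandas as pd

def count_windows(df, group_cols, time_steps, forecast_horizon):
    """Windows per group, assuming stride 1 and gap-free groups (hypothetical helper)."""
    needed = time_steps + forecast_horizon
    sizes = df.groupby(group_cols).size()
    return (sizes - needed + 1).clip(lower=0)

df = pd.DataFrame({"site": ["A"] * 12 + ["B"] * 8, "value": range(20)})
counts = count_windows(df, ["site"], time_steps=6, forecast_horizon=3)
result = {site: int(n) for site, n in counts.items()}
print(result)  # {'A': 4, 'B': 0}
```

Groups that report zero windows, such as site B here, would contribute no sequences.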

Examples

Generate rolling windows for two sites with a 6-step look-back (time_steps=6) and a 3-step horizon (forecast_horizon=3). Future covariates are prepared TFT-style (length = 6 + 3).

>>> import numpy as np, pandas as pd
>>> from fusionlab.nn.utils import prepare_pinn_data_sequences
>>>
>>> rng = np.random.default_rng(42)
>>> dates = pd.date_range("2022-01-31", periods=18, freq="M")
>>>
>>> df = pd.DataFrame({
...     "date":        np.tile(dates, 2),
...     "site":        ["A"] * 18 + ["B"] * 18,
...     "lon":         np.repeat([113.10, 113.25], 18),
...     "lat":         np.repeat([22.60,  22.75], 18),
...     "subs":        rng.normal(size=36),
...     "gwl":         rng.normal(size=36),
...     "rain":        rng.random(36),
...     "evap":        rng.random(36),
...     "forecast_rain": rng.random(36),
...     "soil_type":   ["sand"] * 18 + ["clay"] * 18,     # static covariate
... })
>>>
>>> inputs, targets = prepare_pinn_data_sequences(
...     df,
...     time_col="date",
...     subsidence_col="subs",
...     gwl_col="gwl",
...     dynamic_cols=["rain", "evap"],
...     static_cols=["soil_type"],
...     future_cols=["forecast_rain"],
...     spatial_cols=("lon", "lat"),
...     group_id_cols=["site"],
...     time_steps=6,
...     forecast_horizon=3,
...     mode="tft_like",
...     verbose=0,
... )
>>>
>>> # Inspect a few key tensors
>>> inputs["dynamic_features"].shape       # N × 6 × 2
(20, 6, 2)
>>> inputs["future_features"].shape        # N × (6+3) × 1
(20, 9, 1)
>>> targets["subsidence"].shape            # N × 3 × 1
(20, 3, 1)
>>> len(inputs["coords"]) == len(targets["gwl"])
True
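Because time_col in this example is a datetime column, it is converted to a numeric "year + fraction-of-year" scale before the t coordinate is built. The exact formula the library uses is not stated here; the sketch below shows one common convention (zero-based day of year divided by the number of days in that year) purely as an assumption, with `to_year_fraction` as a hypothetical helper.

```python
import pandas as pd

def to_year_fraction(dates):
    """Convert datetimes to year + fraction of year (one possible convention)."""
    dates = pd.to_datetime(pd.Series(dates))
    days_in_year = dates.dt.is_leap_year.map({True: 366, False: 365})
    return dates.dt.year + (dates.dt.dayofyear - 1) / days_in_year

t = to_year_fraction(["2022-01-01", "2022-07-02", "2024-12-31"])
print(t.round(4).tolist())  # [2022.0, 2022.4986, 2024.9973]
```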