fusionlab.nn.utils.prepare_pinn_data_sequences
- fusionlab.nn.utils.prepare_pinn_data_sequences(df, time_col, subsidence_col, gwl_col, dynamic_cols, static_cols=None, future_cols=None, spatial_cols=None, lon_col=None, lat_col=None, group_id_cols=None, time_steps=12, forecast_horizon=3, output_subsidence_dim=1, output_gwl_dim=1, datetime_format=None, normalize_coords=True, cols_to_scale=None, return_coord_scaler=False, mode=None, savefile=None, verbose=0, _logger=None, **kws)
Reshapes and prepares time-series data into sequences for PINN models.
This function transforms a Pandas DataFrame into structured NumPy arrays suitable for training Physics-Informed Neural Networks like PIHALNet. It creates sequences of dynamic features, future known features, static features, target variables (subsidence and GWL), and importantly, spatio-temporal coordinates corresponding to the forecast horizon.
- Parameters:
df (pandas.DataFrame) – Tidy time-series table in long format. Each row must correspond to exactly one time stamp and (optionally) one spatial/entity identifier. All feature and target columns listed below must already exist in df.
time_col (str) – Column holding the observation time. May be numeric (year-fraction, ordinal day, seconds since epoch, …) or any pandas-recognised datetime dtype; non-numeric values are converted to a numeric "year + fraction-of-year" scale.
subsidence_col (str) – Name of the target column for vertical land movement (subsidence). Values are sliced to produce the horizon target of shape (N, forecast_horizon, output_subsidence_dim).
gwl_col (str) – Name of the target column for groundwater level (GWL). Values are sliced to produce the horizon target of shape (N, forecast_horizon, output_gwl_dim).
dynamic_cols (list[str]) – Names of past-covariate columns sampled over the look-back window of length time_steps. All columns are stacked in the order supplied to form dynamic_features, of shape (N, time_steps, len(dynamic_cols)).
static_cols (list[str] or None, default None) – Columns that are constant within a group (e.g. soil type, aquifer class). Only the first value in each group is stored, resulting in static_features of shape (N, len(static_cols)). Pass None if the data set has no static covariates.
future_cols (list[str] or None, default None) – Known-future covariates such as weather forecasts. Their temporal span depends on mode:
  - 'pihal_like' → length = forecast_horizon (decoder only)
  - 'tft_like' → length = time_steps + forecast_horizon (encoder + decoder)
  When None, an empty future_features tensor is returned.
spatial_cols ((str, str) or None, default None) – Tuple (lon, lat) naming the longitude and latitude columns explicitly. Overrides lon_col / lat_col. Required unless the spatial columns are supplied separately through lon_col and lat_col.
lon_col (str or None, default None) – Fallback column name for longitude when spatial_cols is None. Must be provided together with lat_col.
lat_col (str or None, default None) – Fallback column name for latitude when spatial_cols is None. Must be provided together with lon_col.
group_id_cols (list[str] or None, default None) – Keys that identify independent trajectories (e.g. ['site_id'] or ['site', 'layer']). Each group is treated as an isolated time series for rolling-window extraction. If None, the entire DataFrame is processed as one group.
time_steps (int, default 12) – Length of the look-back window \(T_\text{past}\) supplied to the encoder. Must be ≥ 1.
forecast_horizon (int, default 3) – Prediction length \(H\). Determines the time dimension of the decoder outputs and of coords.
output_subsidence_dim (int, default 1) – Final dimension of the subsidence target tensor. Use > 1 when predicting several layers simultaneously.
output_gwl_dim (int, default 1) – Final dimension of the GWL target tensor. Use > 1 when predicting several layers simultaneously.
datetime_format (str or None, default None) – Custom parsing string forwarded to pandas.to_datetime when time_col is a string column.
normalize_coords (bool, default True) – If True, normalizes the 't', 'x', 'y' coordinate values (derived from time_col, lon_col, lat_col) to the [0, 1] range based on the min/max within each group. This is often beneficial for neural network training.
cols_to_scale (list of str or "auto" or None, default None) – Selects the additional columns to scale:
  - If a list of column names: scale exactly those columns.
  - If "auto": select all numeric columns, then exclude time_col, lon_col, lat_col if scale_coords=False, and exclude any columns whose values are only {0, 1} (assumed one-hot).
  - If None: no extra columns are scaled.
return_coord_scaler (bool, default False) – When enabled, the function returns a three-tuple (inputs_dict, targets_dict, coord_scaler), where coord_scaler is the fitted scaler instance, or None when scaling was disabled.
mode ({'pihal_like', 'tft_like'} or None, default None) – Selects the temporal schema used to construct the future_features window and, consequently, how the downstream model will consume it.
  - 'pihal_like' – builds future tensors with length equal to the forecast horizon \(H\) only. These tensors are intended for models that feed known-future covariates exclusively to the decoder (PIHALNet convention).
  - 'tft_like' – builds future tensors whose time dimension is \(T_\text{past} + H\), i.e. the look-back window (time_steps) followed by the horizon. This matches the Temporal Fusion Transformer, where the encoder ingests the overlapping first segment and the decoder receives the horizon segment.
  - None – treated as 'pihal_like' for backward compatibility.
  The setting affects the allocated size of future_features_arr and the slice indices used during rolling-window extraction; make sure it matches the mode expected by the target model.
savefile (str or None, default None) – Path to a .joblib file. When provided, all produced arrays and metadata are serialised for reproducibility and faster reloads.
verbose (int, default 0) – Verbosity level for logging (0-10):
  - 0: Silent.
  - 1: Basic info.
  - 2: More details on shapes and processing.
  - 5: Per-group processing info.
  - 7: Per-sequence sample data (use with caution for large data).
_logger (Logger | Callable[[str], None] | None)
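The difference between the two mode settings can be sketched with plain NumPy slicing. This is only an illustration of the documented window lengths: the helper future_window and its index arithmetic are hypothetical and are not taken from fusionlab's implementation.

```python
import numpy as np


def future_window(values, start, time_steps, horizon, mode="pihal_like"):
    """Slice one future-covariate window from a (T_total, D_f) array.

    Hypothetical helper: the slice bounds are an assumption derived from
    the documented window lengths, not fusionlab's actual code.
    """
    if mode in (None, "pihal_like"):
        lo = start + time_steps      # decoder only: the horizon rows
    elif mode == "tft_like":
        lo = start                   # encoder + decoder: look-back and horizon
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    hi = start + time_steps + horizon
    return values[lo:hi]


# 18 time steps of a single known-future feature (values 0..17).
series = np.arange(18, dtype=float).reshape(-1, 1)

pihal = future_window(series, start=0, time_steps=6, horizon=3, mode="pihal_like")
tft = future_window(series, start=0, time_steps=6, horizon=3, mode="tft_like")

print(pihal.shape)  # (3, 1)
print(tft.shape)    # (9, 1)
```

Under this assumption the 'pihal_like' window is simply the tail of the 'tft_like' window, which is why a model built for one mode cannot consume tensors prepared for the other without re-slicing.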
- Returns:
Union[Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]], Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray], Optional[MinMaxScaler]]] – The shape of the result depends on return_coord_scaler:
  - If return_coord_scaler is False (default): a tuple containing two dictionaries, inputs_dict and targets_dict.
  - If return_coord_scaler is True: a tuple containing three elements, inputs_dict, targets_dict, and coord_scaler. coord_scaler is the MinMaxScaler instance used if normalize_coords was True (and thus coordinates were normalized); otherwise it is None.
inputs_dict contains:
  - 'coords': np.ndarray of shape (N, H, 3) for [t, x, y] coordinates over the forecast horizon.
  - 'static_features': np.ndarray of shape (N, D_s).
  - 'dynamic_features': np.ndarray of shape (N, T_past, D_d).
  - 'future_features': np.ndarray of shape (N, H, D_f) in 'pihal_like' mode, or (N, T_past + H, D_f) in 'tft_like' mode.
targets_dict contains:
  - 'subsidence': np.ndarray of shape (N, H, O_subs).
  - 'gwl': np.ndarray of shape (N, H, O_gwl).
Here N is the total number of sequences, H is forecast_horizon, T_past is time_steps, D_s, D_d and D_f are the static, dynamic and future feature dimensions, and O_subs and O_gwl are the output dimensions.
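To make the role of the coordinate scaler concrete, the following sketch applies min-max scaling to a toy coords tensor and then undoes it. It uses plain NumPy with made-up values and a single global min/max; how the library fits its scaler internally is not shown here. When the function returns a scikit-learn MinMaxScaler, its inverse_transform method should play the role of the last step, assuming the scaler was fitted on the three coordinate columns in [t, x, y] order.

```python
import numpy as np

# Toy coords tensor of shape (N, H, 3) holding [t, x, y]; values are made up.
coords = np.array([
    [[2022.0, 113.10, 22.60], [2022.5, 113.10, 22.60]],
    [[2023.0, 113.25, 22.75], [2023.5, 113.25, 22.75]],
])

# Scalers work on 2-D data, so flatten the sequence and horizon axes.
flat = coords.reshape(-1, 3)                      # (N * H, 3)
c_min = flat.min(axis=0)
c_max = flat.max(axis=0)
# Guard against constant columns to avoid division by zero.
span = np.where(c_max > c_min, c_max - c_min, 1.0)

# Forward transform to the [0, 1] range, then restore the tensor shape.
normalized = ((flat - c_min) / span).reshape(coords.shape)

# Inverse transform back to physical units (year, degrees).
restored = (normalized.reshape(-1, 3) * span + c_min).reshape(coords.shape)
```

Keeping the scaler (or the min and span values) is what allows physics residuals or predictions expressed in normalized coordinates to be mapped back to real time and location.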
- Raises:
ValueError – If required columns are missing, or if data is insufficient to create any sequences.
TypeError – If df is not a Pandas DataFrame.
- Return type:
Tuple[Dict[str, ndarray], Dict[str, ndarray]] | Tuple[Dict[str, ndarray], Dict[str, ndarray], MinMaxScaler | None]
Notes
A sequence is generated only if a group contains at least \(T_\text{past} + H\) consecutive observations without gaps.
The function is stateless; call it each time the configuration changes.
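The minimum-length rule above implies a simple count of windows per group. The sketch below assumes windows slide one step at a time over gap-free groups; the stride is an assumption and count_windows is a hypothetical helper, not part of fusionlab.

```python
def count_windows(n_rows, time_steps, forecast_horizon, stride=1):
    """Number of full (look-back + horizon) windows in a gap-free group.

    Assumes a sliding window with the given stride; stride=1 is an
    assumption, not a documented property of the library.
    """
    window = time_steps + forecast_horizon
    if n_rows < window:
        return 0  # group too short: it contributes no sequences
    return (n_rows - window) // stride + 1


# Hypothetical group sizes (rows per site).
group_lengths = {"A": 20, "B": 8, "C": 12}
counts = {
    site: count_windows(n, time_steps=6, forecast_horizon=3)
    for site, n in group_lengths.items()
}

print(counts)                # {'A': 12, 'B': 0, 'C': 4}
print(sum(counts.values()))  # 16
```

A quick count like this helps anticipate the ValueError raised when no group is long enough to yield a single sequence.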
Examples
Generate rolling windows for two sites with a 6-step look-back (time_steps=6) and a 3-step horizon (forecast_horizon=3). Future covariates are prepared TFT-style (length = 6 + 3).
>>> import numpy as np, pandas as pd
>>> from fusionlab.nn.utils import prepare_pinn_data_sequences
>>>
>>> rng = np.random.default_rng(42)
>>> dates = pd.date_range("2022-01-31", periods=18, freq="M")
>>>
>>> df = pd.DataFrame({
...     "date": np.tile(dates, 2),
...     "site": ["A"] * 18 + ["B"] * 18,
...     "lon": np.repeat([113.10, 113.25], 18),
...     "lat": np.repeat([22.60, 22.75], 18),
...     "subs": rng.normal(size=36),
...     "gwl": rng.normal(size=36),
...     "rain": rng.random(36),
...     "evap": rng.random(36),
...     "forecast_rain": rng.random(36),
...     "soil_type": ["sand"] * 18 + ["clay"] * 18,  # static covariate
... })
>>>
>>> inputs, targets = prepare_pinn_data_sequences(
...     df,
...     time_col="date",
...     subsidence_col="subs",
...     gwl_col="gwl",
...     dynamic_cols=["rain", "evap"],
...     static_cols=["soil_type"],
...     future_cols=["forecast_rain"],
...     spatial_cols=("lon", "lat"),
...     group_id_cols=["site"],
...     time_steps=6,
...     forecast_horizon=3,
...     mode="tft_like",
...     verbose=0,
... )
>>>
>>> # Inspect a few key tensors (the leading axis is the number of sequences N)
>>> inputs["dynamic_features"].shape[1:]  # 6 × 2
(6, 2)
>>> inputs["future_features"].shape[1:]  # (6 + 3) × 1
(9, 1)
>>> targets["subsidence"].shape[1:]  # 3 × 1
(3, 1)
>>> len(inputs["coords"]) == len(targets["gwl"])
True
See also
fusionlab.nn.utils.reshape_xtft_data – Reshapes arrays for TFT-style input pipelines.
fusionlab.nn.utils.create_sequences – Generic sliding-window generator used internally.