Exercise: Physics-Informed Forecasting with PIHALNet & TransFlowSubsNet

Welcome to this comprehensive exercise on using the PIHALNet and TransFlowSubsNet models. This guide will walk you through a complete, real-world workflow for performing a multi-horizon, probabilistic, physics-informed forecast for land subsidence and groundwater level.

This is a deep-dive tutorial that covers every step from raw data ingestion to final visualization, demonstrating how to handle the complex data requirements of these advanced hybrid models.

Learning Objectives:

  • Load and preprocess a complex, real-world style dataset.

  • Perform feature engineering, including one-hot encoding and numerical scaling.

  • Use the prepare_pinn_data_sequences utility to transform a flat DataFrame into the specialized sequence format required by PINNs.

  • Configure and train a hybrid model (PIHALNet or TransFlowSubsNet) with both a data-driven loss and a physics-informed loss component.

  • Set up callbacks for model checkpointing and early stopping.

  • Generate predictions on new data and format them into an interpretable DataFrame.

  • Visualize the multi-target, multi-horizon forecast results using the library’s specialized plotting functions.

Let’s get started!

Prerequisites

Ensure you have fusionlab-learn and its dependencies installed.

pip install fusionlab-learn "keras-tuner>=1.4.0" matplotlib scikit-learn

Step 0: Preamble & Configuration

First, we import all necessary libraries and define the main configuration parameters for our experiment. This central config block allows you to easily change the model, forecast period, or key hyperparameters for your own tests.

 1import os
 2import pandas as pd
 3import numpy as np
 4import tensorflow as tf
 5from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
 6from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
 7
 8# --- FusionLab Imports ---
 9from fusionlab.datasets import fetch_zhongshan_data
10from fusionlab.nn.pinn.models import PIHALNet, TransFlowSubsNet
11from fusionlab.nn.pinn.utils import prepare_pinn_data_sequences
12from fusionlab.nn.losses import combined_quantile_loss
13from fusionlab.nn.models.utils import plot_history_in
14from fusionlab.nn.pinn.utils import format_pihalnet_predictions
15from fusionlab.plot.forecast import plot_forecasts
16
17# --- Main Configuration ---
18MODEL_NAME = 'TransFlowSubsNet'  # or 'PIHALNet'
19TRAIN_END_YEAR = 2022
20FORECAST_START_YEAR = 2023
21FORECAST_HORIZON_YEARS = 2
22TIME_STEPS = 4  # Lookback window
23QUANTILES = [0.1, 0.5, 0.9]
24EPOCHS = 20 # Use more epochs (e.g., 100+) for real results
25BATCH_SIZE = 256
26
27# --- PINN Configuration ---
28PDE_MODE_CONFIG = 'both' if MODEL_NAME == 'TransFlowSubsNet' else 'consolidation'
29LAMBDA_PDE_CONS = 1.0 # Weight for consolidation loss
30LAMBDA_PDE_GW = 1.0   # Weight for groundwater flow loss
31
32# --- Output Directory ---
33RUN_OUTPUT_PATH = f"./pihalnet_exercise_outputs/{MODEL_NAME}_run"
34os.makedirs(RUN_OUTPUT_PATH, exist_ok=True)
35print(f"Configuration set for model: {MODEL_NAME}")
36print(f"Output artifacts will be saved in: {RUN_OUTPUT_PATH}")

Expected Output 0:

Configuration set for model: TransFlowSubsNet
Output artifacts will be saved in: ./pihalnet_exercise_outputs/TransFlowSubsNet_run

Step 1: Load and Inspect the Dataset

We will load the Zhongshan dataset. The logic first tries to find a local, larger version of the file but will fall back to fetching the smaller, built-in version if it’s not found.

 1try:
 2    # For this exercise, we directly use the fetch utility
 3    print("Fetching Zhongshan dataset...")
 4    data_bunch = fetch_zhongshan_data()
 5    df_raw = data_bunch.frame
 6    print(f"Successfully loaded dataset. Shape: {df_raw.shape}")
 7except Exception as e:
 8    raise RuntimeError(f"Failed to fetch dataset: {e}")
 9
10print(df_raw.head())

Expected Output 1:

Fetching Zhongshan dataset...
Dataset 'zhongshan_2000.csv' found in package resource: fusionlab.datasets.data
Using cached version (also found in package): C:\Users\Daniel\fusionlab_data\zhongshan_2000.csv
Successfully loaded full data (1999 rows) from: C:\Users\Daniel\fusionlab_data\zhongshan_2000.csv
Loading full dataset (n_samples is None or '*').
Successfully loaded dataset. Shape: (1999, 14)
    longitude   latitude  ...  rainfall_category  subsidence
0  113.240334  22.476652  ...             Medium       15.51
1  113.215866  22.510025  ...             Medium       31.60
2  113.237984  22.494591  ...             Medium        8.09
3  113.219109  22.513433  ...             Medium       15.49
4  113.210678  22.536232  ...             Medium       14.02

[5 rows x 14 columns]

Step 2: Preprocessing - Feature Selection & Cleaning

We select the features relevant to our model and handle any missing values. For these models, we need coordinates (longitude, latitude), a time column (year), targets (subsidence, GWL), and other covariates.

 1from fusionlab.utils.data_utils import nan_ops
 2
 3TIME_COL, LON_COL, LAT_COL = 'year', 'longitude', 'latitude'
 4SUBSIDENCE_COL, GWL_COL = 'subsidence', 'GWL'
 5
 6# Select relevant features
 7features_to_use = [
 8    LON_COL, LAT_COL, TIME_COL, SUBSIDENCE_COL, GWL_COL,
 9    'rainfall_mm', 'geology', 'normalized_density'
10]
11df_selected = df_raw[features_to_use].copy()
12df_cleaned = nan_ops(df_selected, ops='sanitize', action='fill')
13print(f"NaNs after cleaning: {df_cleaned.isna().sum().sum()}")

Expected Output 2:

NaNs after cleaning: 0

Step 3: Preprocessing - Encoding & Scaling

Next, we convert categorical features like geology into a numerical format using one-hot encoding and scale all numerical features to a common range (0-1) using MinMaxScaler for stable training.

 1# --- Encoding Categorical Features ---
 2ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
 3encoded_data = ohe.fit_transform(df_cleaned[['geology']])
 4encoded_cols = ohe.get_feature_names_out(['geology'])
 5df_encoded = pd.DataFrame(encoded_data, columns=encoded_cols, index=df_cleaned.index)
 6df_processed = pd.concat([df_cleaned.drop('geology', axis=1), df_encoded], axis=1)
 7
 8# --- Create a numeric time coordinate for the PINN ---
 9TIME_COL_NUMERIC = "time_numeric"
10df_processed[TIME_COL_NUMERIC] = df_processed[TIME_COL] - df_processed[TIME_COL].min()
11
12# --- Scaling Numerical Features ---
13cols_to_scale = [c for c in df_processed.columns if c != TIME_COL]
14scaler = MinMaxScaler()
15df_scaled = df_processed.copy()
16df_scaled[cols_to_scale] = scaler.fit_transform(df_scaled[cols_to_scale])
17print("Categorical features encoded and numerical features scaled.")
18print(f"Final processed data shape: {df_scaled.shape}")

Expected Output 3:

Categorical features encoded and numerical features scaled.
Final processed data shape: (1999, 13)

Step 4: Define Feature Sets & Generate Sequences

This is a critical step where we tell our data preparation utility, prepare_pinn_data_sequences, which columns to use for which input stream (static, dynamic, etc.) and then generate the sequences.

 1# 1. Split data into training and testing periods
 2df_train_master = df_scaled[df_scaled[TIME_COL] <= TRAIN_END_YEAR]
 3df_test_master = df_scaled[df_scaled[TIME_COL] >= TRAIN_END_YEAR]
 4
 5# 2. Define feature lists
 6static_features_list = encoded_cols.tolist()
 7dynamic_features_list = [GWL_COL, 'rainfall_mm', 'normalized_density']
 8future_features_list = ['rainfall_mm'] # Assume rainfall forecast is known
 9
10# 3. Generate sequences
11inputs_train, targets_train = prepare_pinn_data_sequences(
12    df=df_train_master,
13    time_col=TIME_COL_NUMERIC,
14    lon_col=LON_COL, lat_col=LAT_COL,
15    subsidence_col=SUBSIDENCE_COL, gwl_col=GWL_COL,
16    dynamic_cols=dynamic_features_list,
17    static_cols=static_features_list,
18    future_cols=future_features_list,
19    group_id_cols=None, # [LON_COL, LAT_COL] Disable a group to treat as a big series
20    time_steps=TIME_STEPS,
21    forecast_horizon=FORECAST_HORIZON_YEARS,
22)
23print("\nTraining sequences generated.")
24for name, arr in inputs_train.items():
25    print(f"  Train Input '{name}' shape: {arr.shape if arr is not None else 'None'}")

Expected Output 4:

Training sequences generated.
  Train Input 'coords' shape: (1791, 2, 3)
  Train Input 'static_features' shape: (1791, 5)
  Train Input 'dynamic_features' shape: (1791, 4, 3)
  Train Input 'future_features' shape: (1791, 2, 1)

Step 5: Create tf.data.Dataset

We convert our NumPy sequence arrays into tf.data.Dataset objects for efficient, high-performance training with TensorFlow.

 1# Standardize target keys to match model output names
 2targets_train_std = {
 3    'subs_pred': targets_train['subsidence'],
 4    'gwl_pred': targets_train['gwl']
 5}
 6
 7# Create the full dataset
 8full_dataset = tf.data.Dataset.from_tensor_slices((inputs_train, targets_train_std))
 9
10# Create train/validation split
11total_size = len(inputs_train['coords'])
12val_size = int(0.2 * total_size)
13train_size = total_size - val_size
14
15train_dataset = full_dataset.take(train_size).shuffle(train_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
16val_dataset = full_dataset.skip(train_size).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
17
18print(f"\nCreated training dataset ({train_size} samples) and validation dataset ({val_size} samples).")

Expected Output 5:

Created training dataset (1433 samples) and validation dataset (358 samples).

Step 6: Model Training

We now instantiate our chosen model, compile it with our composite loss function, and begin training.

 1# 1. Instantiate Model
 2ModelClass = TransFlowSubsNet if MODEL_NAME == 'TransFlowSubsNet' else PIHALNet
 3model = ModelClass(
 4    static_input_dim=inputs_train['static_features'].shape[-1],
 5    dynamic_input_dim=inputs_train['dynamic_features'].shape[-1],
 6    future_input_dim=inputs_train['future_features'].shape[-1],
 7    output_subsidence_dim=1, output_gwl_dim=1,
 8    forecast_horizon=FORECAST_HORIZON_YEARS,
 9    max_window_size=TIME_STEPS,
10    quantiles=QUANTILES,
11    pde_mode=PDE_MODE_CONFIG
12)
13
14# 2. Define Losses and Weights
15loss_dict = {'subs_pred': 'mse', 'gwl_pred': 'mse'}
16if QUANTILES:
17    loss_dict = {k: combined_quantile_loss(QUANTILES) for k in loss_dict}
18
19physics_loss_weights = {"lambda_cons": LAMBDA_PDE_CONS, "lambda_gw": LAMBDA_PDE_GW} \
20    if MODEL_NAME == 'TransFlowSubsNet' else {"lambda_physics": LAMBDA_PDE_CONFIG}
21
22# 3. Compile Model
23model.compile(
24    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
25    loss=loss_dict,
26    loss_weights={'subs_pred': 1.0, 'gwl_pred': 0.5},
27    **physics_loss_weights
28)
29
30# 4. Train
31print(f"\nStarting {MODEL_NAME} training...")
32history = model.fit(
33    train_dataset,
34    validation_data=val_dataset,
35    epochs=EPOCHS,
36    callbacks=[EarlyStopping('val_loss', patience=10, restore_best_weights=True)]
37)

Expected Output 6:

Starting TransFlowSubsNet training...
Epoch 1/20
6/6 [==============================] - 30s 422ms/step - loss: 0.5286 - gwl_pred_loss: 0.4087 - subs_pred_loss: 0.3243 - total_loss: 0.5063 - data_loss: 0.5022 - physics_loss: 0.0041 - consolidation_loss: 0.0041 - gw_flow_loss: 2.9738e-07 - val_loss: 0.1235 - val_gwl_pred_loss: 0.2469 - val_subs_pred_loss: 0.0000e+00
Epoch 2/20
6/6 [==============================] - 0s 49ms/step - loss: 0.1974 - gwl_pred_loss: 0.1396 - subs_pred_loss: 0.1276 - total_loss: 0.1798 - data_loss: 0.1788 - physics_loss: 9.3386e-04 - consolidation_loss: 9.3376e-04 - gw_flow_loss: 1.0276e-07 - val_loss: 0.0356 - val_gwl_pred_loss: 0.0711 - val_subs_pred_loss: 0.0000e+00
...
Epoch 20/20
6/6 [==============================] - 0s 53ms/step - loss: 0.0295 - gwl_pred_loss: 0.0122 - subs_pred_loss: 0.0234 - total_loss: 0.0293 - data_loss: 0.0293 - physics_loss: 2.1400e-05 - consolidation_loss: 2.1400e-05 - gw_flow_loss: 2.2128e-11 - val_loss: 0.0043 - val_gwl_pred_loss: 0.0086 - val_subs_pred_loss: 0.0000e+00

Step 7: Visualize Results

Finally, we can use the plotting utilities to visualize the training history and the forecast results.

 1# Plot training history
 2plot_history_in(history, title=f'{MODEL_NAME} Training History')
 3
 4# Generate and plot forecasts on the validation set
 5val_inputs_batch, val_targets_batch = next(iter(val_dataset))
 6predictions = model.predict(val_inputs_batch)
 7
 8df_forecast = format_pihalnet_predictions(
 9    pihalnet_outputs=predictions,
10    y_true_dict=val_targets_batch,
11    target_mapping={'subs_pred': SUBSIDENCE_COL, 'gwl_pred': GWL_COL},
12    quantiles=QUANTILES,
13    forecast_horizon=FORECAST_HORIZON_YEARS,
14    model_inputs=val_inputs_batch,
15)
16
17if not df_forecast.empty:
18    plot_forecasts(
19        df_forecast,
20        target_name=SUBSIDENCE_COL,
21        quantiles=QUANTILES,
22        kind="temporal",
23        sample_ids="first_n",
24        num_samples=4
25    )

Expected Plot 7:

Training Loss History

Visualization of TransFlowSubsNet traing loss history .

Discussion of Exercise

Congratulations! You have completed a full end-to-end workflow for a complex, hybrid physics-informed model. You have learned how to:

  • Take raw, tabular data and perform all necessary preprocessing steps.

  • Use the specialized prepare_pinn_data_sequences utility to create the multi-part input required by the models.

  • Configure, compile, and train a PIHALNet or TransFlowSubsNet model with a composite loss function.

  • Evaluate the training process and visualize the final probabilistic forecasts.

This comprehensive process is a powerful template for applying these advanced models to your own scientific machine learning challenges.