PINN Data Utilities

The Physics-Informed Neural Network (PINN) models in fusionlab-learn, such as PIHALNet and TransFlowSubsNet, have unique data requirements. They need not only the standard feature-based inputs (static, dynamic, future) but also spatio-temporal coordinates (\(t, x, y\)) to compute the physics-based loss.

This section details the specialized utility functions designed to handle the complex data preparation and results formatting tasks associated with these models.

Sequence Generation for PINNs (prepare_pinn_data_sequences)

API Reference:

prepare_pinn_data_sequences()

This is the most critical data preparation utility for PINN models. Its primary role is to transform a flat, time-series pandas.DataFrame into the complex, multi-part sequence format required for training.

The function iterates through the DataFrame group by group, sliding a rolling window over each series to produce a lookback sequence of a specified length (time_steps) together with the targets for the following forecast_horizon steps. Its key distinction is that it generates both the feature-based input tensors and the `coords` tensor needed by the physics module.
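The rolling-window idea can be sketched with plain NumPy. This is only an illustration under stated assumptions (a stride of one, and a horizon that starts immediately after the lookback window); `rolling_windows` is a hypothetical helper, not part of fusionlab-learn, and the library's exact windowing and sequence count may differ.

```python
import numpy as np


def rolling_windows(values, time_steps, forecast_horizon):
    """Split one group's time-ordered rows into lookback/horizon pairs.

    Hypothetical helper for illustration; not part of fusionlab-learn.
    """
    values = np.asarray(values)
    window = time_steps + forecast_horizon
    n_windows = len(values) - window + 1
    if n_windows <= 0:
        # Too few rows to build even one full window.
        empty_past = np.empty((0, time_steps) + values.shape[1:])
        empty_future = np.empty((0, forecast_horizon) + values.shape[1:])
        return empty_past, empty_future
    # Lookback part: rows i .. i + time_steps - 1
    past = np.stack([values[i:i + time_steps] for i in range(n_windows)])
    # Horizon part: the forecast_horizon rows that follow the lookback
    future = np.stack(
        [values[i + time_steps:i + window] for i in range(n_windows)]
    )
    return past, future


# Eight time-ordered rows with two features each (e.g. GWL and rainfall).
rows = np.arange(16, dtype=float).reshape(8, 2)
past, future = rolling_windows(rows, time_steps=4, forecast_horizon=2)
print(past.shape)    # (3, 4, 2)
print(future.shape)  # (3, 2, 2)
```

In this sketch a series of length 8 with a combined window of 6 yields 3 pairs; a series shorter than time_steps + forecast_horizon yields none.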

Key Parameters:

  • df: The input DataFrame containing all features, targets, and coordinates in a long format.

  • group_id_cols: A list of columns (e.g., ['longitude', 'latitude']) used to identify and separate individual time series within the DataFrame. The function generates sequences independently for each group.

  • time_steps: The length of the historical lookback window for the dynamic features.

  • forecast_horizon: The number of future steps to predict.

  • _cols arguments: Various arguments (dynamic_cols, static_cols, future_cols, subsidence_col, etc.) that map column names in the DataFrame to their respective roles.

  • normalize_coords: A boolean flag that, when True, scales the spatio-temporal coordinate values (\(t, x, y\)) to a 0-1 range, which is highly recommended for stable gradient calculations in the PINN loss.
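To make the role of group_id_cols concrete, the sketch below separates a long-format DataFrame into one time-ordered series per location using pandas. It only illustrates the grouping idea; the column names and the `series_by_group` dictionary are assumptions made for this example, not the library's internals.

```python
import pandas as pd

# Long-format data: rows from two locations, deliberately interleaved.
df = pd.DataFrame({
    "lon": [113.5, 114.5, 113.5, 114.5, 113.5, 114.5],
    "lat": [22.5, 23.5, 22.5, 23.5, 22.5, 23.5],
    "year": [2012, 2010, 2010, 2011, 2011, 2012],
    "GWL": [4.0, 4.0, 5.0, 3.5, 4.5, 3.0],
})

group_id_cols = ["lon", "lat"]

# One time-sorted DataFrame per (lon, lat) pair, so that windows are
# never built across two different locations.
series_by_group = {
    key: group.sort_values("year").reset_index(drop=True)
    for key, group in df.groupby(group_id_cols)
}

for key, group in series_by_group.items():
    print(key, group["year"].tolist(), group["GWL"].tolist())
```

Sorting by the time column inside each group matters: a rolling window over unsorted rows would mix up past and future values.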

Workflow and Outputs:

The function returns two dictionaries containing NumPy arrays (the usage example below also unpacks a third return value, which it discards):

  1. inputs_dict: Contains all the input tensors required by the model’s call method.

    • 'coords': The spatio-temporal coordinates for the forecast horizon, shape: \((N, H, 3)\).

    • 'static_features': Shape: \((N, D_s)\).

    • 'dynamic_features': Shape: \((N, T, D_d)\).

    • 'future_features': Shape: \((N, H, D_f)\).

  2. targets_dict: Contains the ground-truth target tensors.

    • 'subsidence': Shape: \((N, H, O_s)\).

    • 'gwl': Shape: \((N, H, O_g)\).

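Here, \(N\) is the total number of sequences generated, \(T\) is time_steps, and \(H\) is forecast_horizon. \(D_s\), \(D_d\), and \(D_f\) are the static, dynamic, and future feature dimensions, and \(O_s\) and \(O_g\) are the output dimensions for subsidence and GWL.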
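A quick way to catch shape mistakes before training is to check the two dictionaries against the layout listed above. The following is a minimal sketch; `check_sequence_shapes` and `_match` are hypothetical helpers written for this page, not functions provided by fusionlab-learn.

```python
import numpy as np


def _match(name, shape, pattern):
    """Compare a shape with a pattern; None in the pattern means 'any size'."""
    ok = len(shape) == len(pattern) and all(
        want is None or got == want for got, want in zip(shape, pattern)
    )
    if not ok:
        raise ValueError(f"{name}: shape {shape} does not match {pattern}")


def check_sequence_shapes(inputs, targets, time_steps, forecast_horizon):
    """Return N after verifying the documented shapes (hypothetical helper)."""
    n = inputs["coords"].shape[0]
    expected = {
        "coords": (n, forecast_horizon, 3),
        "static_features": (n, None),
        "dynamic_features": (n, time_steps, None),
        "future_features": (n, forecast_horizon, None),
    }
    for name, pattern in expected.items():
        arr = inputs.get(name)
        if arr is None:
            # Optional inputs (e.g. future_features) may be absent.
            continue
        _match(name, arr.shape, pattern)
    for name, arr in targets.items():
        _match(name, arr.shape, (n, forecast_horizon, None))
    return n


N, T, H = 4, 5, 3
inputs = {
    "coords": np.zeros((N, H, 3)),
    "static_features": np.zeros((N, 2)),
    "dynamic_features": np.zeros((N, T, 2)),
    "future_features": None,
}
targets = {"subsidence": np.zeros((N, H, 1)), "gwl": np.zeros((N, H, 1))}
print(check_sequence_shapes(inputs, targets, time_steps=T, forecast_horizon=H))  # 4
```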

Usage Example

import pandas as pd
import numpy as np
from fusionlab.nn.pinn.utils import prepare_pinn_data_sequences

# 1. Create a sample DataFrame with multiple time series groups
data = []
for group_id in range(2): # 2 locations
    for year in range(2010, 2020): # 10 years of data
        data.append({
            'year': year,
            'lon': 113.5 + group_id,
            'lat': 22.5 + group_id,
            'geology_type': f'type_{group_id}',
            'subsidence': -10 - (year - 2010) * 2 - group_id,
            'GWL': 5 - (year - 2010) * 0.5 - group_id,
            'rainfall': 1500 + np.random.randn() * 100
        })
df = pd.DataFrame(data)

# 2. Define feature sets
static_cols = ['geology_type'] # This will need to be one-hot encoded first
dynamic_cols = ['GWL', 'rainfall']
df = pd.get_dummies(df, columns=static_cols, dtype=float)
static_cols_encoded = [c for c in df.columns if 'geology_type' in c]

# 3. Generate sequences
inputs, targets, _ = prepare_pinn_data_sequences(
    df=df,
    time_col='year',
    lon_col='lon', lat_col='lat',
    subsidence_col='subsidence', gwl_col='GWL',
    dynamic_cols=dynamic_cols,
    static_cols=static_cols_encoded,
    group_id_cols=['lon', 'lat'],
    time_steps=5,
    forecast_horizon=3
)

# 4. Print the shapes of the output
print("--- Generated Input Shapes ---")
for name, arr in inputs.items():
    print(f"  '{name}': {arr.shape if arr is not None else 'None'}")
print("\n--- Generated Target Shapes ---")
for name, arr in targets.items():
    print(f"  '{name}': {arr.shape}")

Expected Output:

--- Generated Input Shapes ---
  'coords': (4, 3, 3)
  'static_features': (4, 2)
  'dynamic_features': (4, 5, 2)
  'future_features': None

--- Generated Target Shapes ---
  'subsidence': (4, 3, 1)
  'gwl': (4, 3, 1)

Formatting Model Outputs (format_pinn_predictions)

API Reference:

format_pinn_predictions()

This function is the counterpart to the preparation utility. It takes the raw dictionary of prediction tensors from a model’s .predict() call and transforms it into a clean, long-format pandas.DataFrame that is easy to analyze, visualize, or export.

It robustly handles multi-target outputs, point or quantile forecasts, and can merge the predictions with ground-truth values, coordinates, and other static metadata for a complete results summary.
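The core reshaping step, turning a (batch, horizon, quantiles) array into one row per sample and forecast step, can be sketched as follows. `to_long_frame` is a hypothetical helper written for illustration; the column naming simply imitates the `gwl_q50` / `gwl_actual` pattern visible in the expected output further down and is not guaranteed to match the library in every case.

```python
import numpy as np
import pandas as pd


def to_long_frame(pred, quantiles, prefix, y_true=None):
    """Flatten (B, H, Q) quantile predictions to one row per (sample, step).

    Hypothetical helper for illustration; not part of fusionlab-learn.
    """
    batch, horizon, n_q = pred.shape
    if n_q != len(quantiles):
        raise ValueError("last axis of pred must match len(quantiles)")
    frame = pd.DataFrame({
        # 0,0,0,1,1,1,... : which sequence each row belongs to
        "sample_idx": np.repeat(np.arange(batch), horizon),
        # 1,2,3,1,2,3,... : position inside the forecast horizon
        "forecast_step": np.tile(np.arange(1, horizon + 1), batch),
    })
    flat = pred.reshape(batch * horizon, n_q)
    for j, q in enumerate(quantiles):
        frame[f"{prefix}_q{int(round(q * 100))}"] = flat[:, j]
    if y_true is not None:
        # y_true is assumed to have shape (B, H, 1)
        frame[f"{prefix}_actual"] = y_true.reshape(batch * horizon)
    return frame


pred = np.arange(36, dtype=float).reshape(4, 3, 3)
actual = np.zeros((4, 3, 1))
frame = to_long_frame(pred, [0.1, 0.5, 0.9], prefix="gwl", y_true=actual)
print(frame.shape)  # (12, 6)
print(list(frame.columns))
```

Because NumPy reshapes in row-major order, row `i * H + h` of the flattened array corresponds to sample `i` at step `h + 1`, which is what the `repeat`/`tile` index columns encode.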

Note

The function format_pihalnet_predictions is a deprecated alias for format_pinn_predictions() and is maintained for backward compatibility. New code should use format_pinn_predictions.

Usage Example:

import pandas as pd
import numpy as np
from fusionlab.nn.pinn.utils import format_pinn_predictions

# 1. Create dummy model outputs and true values
B, H, Q_len = 4, 3, 3 # Batch, Horizon, Num Quantiles
quantiles = [0.1, 0.5, 0.9]

predictions = {
    'subs_pred': np.random.rand(B, H, Q_len),
    'gwl_pred': np.random.rand(B, H, Q_len)
}
y_true = {
    'subsidence': np.random.rand(B, H, 1),
    'gwl': np.random.rand(B, H, 1)
}
# Dummy coordinates and static IDs
coords = np.random.rand(B, H, 3)
ids = pd.DataFrame({'site_id': [f'site_{i}' for i in range(B)]})

# 2. Format the predictions into a DataFrame
df_results = format_pinn_predictions(
    predictions=predictions,
    y_true_dict=y_true,
    quantiles=quantiles,
    model_inputs={'coords': coords}, # Provide coords for inclusion
    ids_data_array=ids,
    target_mapping={'subs_pred': 'subsidence', 'gwl_pred': 'gwl'}
)

# 3. Display the head of the resulting DataFrame
print(df_results.head())

Expected Output:

   sample_idx  forecast_step   coord_t  ...   gwl_q50   gwl_q90 gwl_actual
0           0              1  0.834251  ...  0.376958  0.417579   0.625352
1           0              2  0.591587  ...  0.749004  0.635746   0.368460
2           0              3  0.990352  ...  0.103313  0.513108   0.789334
3           1              1  0.057251  ...  0.231552  0.739546   0.087821
4           1              2  0.581780  ...  0.551159  0.279155   0.791243

[5 rows x 14 columns]

Coordinate and Feature Scaling (normalize_for_pinn)

API Reference:

normalize_for_pinn()

Normalization is crucial for training PINNs. The coordinate inputs (\(t, x, y\)) that are used to compute PDE derivatives should ideally be scaled to a standard range (e.g., [0, 1]) to ensure the gradients are well-behaved and stable.
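The effect of min-max scaling on derivatives follows from the chain rule. As a short worked step, where \(u\) stands for any differentiable model output (a symbol introduced here for illustration), suppose time is rescaled as \(\tilde{t} = (t - t_{\min}) / (t_{\max} - t_{\min})\). Then

\[
\frac{\partial u}{\partial t}
  = \frac{\partial u}{\partial \tilde{t}} \, \frac{\partial \tilde{t}}{\partial t}
  = \frac{1}{t_{\max} - t_{\min}} \, \frac{\partial u}{\partial \tilde{t}},
\]

and the same holds for \(x\) and \(y\) with their own ranges. Derivatives taken with respect to the scaled coordinates therefore differ from the physical ones only by constant factors, which keeps their magnitudes comparable across the three axes. Whether those factors are folded back into the PDE residual is a modelling choice that this section does not specify.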

This utility function provides a centralized way to handle this scaling.

  • scale_coords=True: This primary option fits a single MinMaxScaler on the time_col, coord_x, and coord_y columns together, scaling each of them to the [0, 1] range.

  • cols_to_scale='auto': This feature automatically detects all other numerical columns in the DataFrame (excluding booleans/one-hot encoded columns) and applies a separate scaler to them.
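The two options can be approximated with a few lines of pandas and NumPy. This is a sketch of the idea only: `min_max_scale` and `auto_numeric_columns` are hypothetical helpers, and the rule used to skip one-hot columns (numeric columns containing only 0 and 1) is an assumption about what "auto" detection might look like, not the library's actual logic.

```python
import numpy as np
import pandas as pd


def min_max_scale(frame, columns):
    """Scale each listed column to [0, 1]; return a copy and the data ranges."""
    out = frame.copy()
    values = out[columns].to_numpy(dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    out[columns] = (values - lo) / span
    return out, hi - lo


def auto_numeric_columns(frame, exclude):
    """Pick numeric columns, skipping excluded names and 0/1 flag columns."""
    selected = []
    for name in frame.select_dtypes(include="number").columns:
        if name in exclude:
            continue
        if set(frame[name].dropna().unique()) <= {0, 1}:
            continue  # treated as a one-hot / boolean-style column
        selected.append(name)
    return selected


df = pd.DataFrame({
    "time": [2020.0, 2021.0, 2022.0],
    "lon": [-122.4, -122.3, -122.2],
    "lat": [37.7, 37.8, 37.9],
    "rainfall": [500, 600, 550],
    "is_event": [0, 1, 0],
})

coord_cols = ["time", "lon", "lat"]
scaled, coord_range = min_max_scale(df, coord_cols)
other_cols = auto_numeric_columns(df, exclude=coord_cols)
scaled, other_range = min_max_scale(scaled, other_cols)

print(other_cols)  # ['rainfall']
print(scaled)
```

Keeping the coordinate ranges around (here `coord_range`) is what allows scaled values to be mapped back to physical units later.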

Usage Example:

import pandas as pd
from fusionlab.nn.pinn.utils import normalize_for_pinn

# 1. Create a sample DataFrame
df = pd.DataFrame({
    'time': [2020.0, 2021.0, 2022.0],
    'lon': [-122.4, -122.3, -122.2],
    'lat': [37.7, 37.8, 37.9],
    'rainfall': [500, 600, 550],
    'is_event': [0, 1, 0] # A one-hot style column
})

# 2. Normalize coordinates and auto-selected features
df_scaled, coord_scaler, other_scaler = normalize_for_pinn(
    df,
    time_col='time',
    coord_x='lon',
    coord_y='lat',
    scale_coords=True,
    cols_to_scale='auto' # Auto-detect 'rainfall'
)

# 3. Display results
print("--- Original DataFrame ---")
print(df)
print("\n--- Scaled DataFrame ---")
print(df_scaled)
print(f"\nCoordinate Scaler Range: {coord_scaler.data_range_}")
print(f"Feature Scaler Range: {other_scaler.data_range_}")

Expected Output:

[INFO] Scaling time, lon, lat columns...
[INFO] Auto-selecting numeric columns to scale...
[INFO] Excluding one-hot/boolean column 'is_event' from auto-scaling.
[INFO] Auto-selected columns: ['rainfall']
[INFO] Scaling additional columns: ['rainfall']
--- Original DataFrame ---
     time    lon   lat  rainfall  is_event
0  2020.0 -122.4  37.7       500         0
1  2021.0 -122.3  37.8       600         1
2  2022.0 -122.2  37.9       550         0

--- Scaled DataFrame ---
   time  lon  lat  rainfall  is_event
0   0.0  0.0  0.0       0.0         0
1   0.5  0.5  0.5       1.0         1
2   1.0  1.0  1.0       0.5         0

Coordinate Scaler Range: [2. 0.2 0.2]
Feature Scaler Range: [100.]


Coordinate Extraction Utilities

The library includes two low-level helpers, extract_txy_in and extract_txy, used internally to robustly parse the \(t, x, y\) coordinate tensors from different input structures (e.g., a single concatenated tensor vs. a dictionary).

While you may not need to call these directly, understanding their difference is useful for advanced customization.

The Difference:

The key difference lies in how they handle the dimensionality of the output tensors.

extract_txy_in (Internal, Strict)

API Reference:

extract_txy_in()

This version is stricter and is designed for internal model components that always expect a 3D spatio-temporal tensor. It always ensures the output tensors have a rank of 3. If it receives a 2D input of shape (batch, 3), it will automatically expand it to (batch, 1, 3) before slicing, ensuring a consistent 3D output like (batch, 1, 1).

import tensorflow as tf
from fusionlab.nn.pinn.utils import extract_txy_in

# A 2D spatial tensor (batch, features)
coords_2d = tf.random.normal((4, 3))
# A 3D spatio-temporal tensor (batch, time, features)
coords_3d = tf.random.normal((4, 10, 3))

t2, x2, y2 = extract_txy_in(coords_2d)
t3, x3, y3 = extract_txy_in(coords_3d)

print(f"Input 2D shape: {coords_2d.shape}")
print(f"Output shape from 2D input: {t2.shape}")
print(f"\nInput 3D shape: {coords_3d.shape}")
print(f"Output shape from 3D input: {t3.shape}")

Expected Output:

Input 2D shape: (4, 3)
Output shape from 2D input: (4, 1, 1)

Input 3D shape: (4, 10, 3)
Output shape from 3D input: (4, 10, 1)
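The strict behaviour described above can be mimicked with NumPy to make the shape logic explicit. This is a sketch under the assumption that the last axis is ordered as (t, x, y); `split_txy_rank3` is a hypothetical stand-in and does not reproduce the library's TensorFlow implementation.

```python
import numpy as np


def split_txy_rank3(coords):
    """Split a coords array into t, x, y, always returning rank-3 arrays.

    Hypothetical NumPy stand-in; assumes the last axis is ordered (t, x, y).
    """
    coords = np.asarray(coords)
    if coords.ndim == 2:
        # (batch, 3) -> (batch, 1, 3): add a length-1 time axis
        coords = coords[:, np.newaxis, :]
    if coords.ndim != 3 or coords.shape[-1] != 3:
        raise ValueError(f"expected (batch, [time,] 3), got {coords.shape}")
    # Slicing with 0:1 (rather than indexing with 0) keeps the last axis.
    t = coords[..., 0:1]
    x = coords[..., 1:2]
    y = coords[..., 2:3]
    return t, x, y


t2, x2, y2 = split_txy_rank3(np.zeros((4, 3)))
t3, x3, y3 = split_txy_rank3(np.zeros((4, 10, 3)))
print(t2.shape)  # (4, 1, 1)
print(t3.shape)  # (4, 10, 1)
```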

extract_txy (Flexible)

API Reference:

extract_txy()

This version is more flexible and is controlled by the expect_dim parameter. It can return 2D or 3D tensors based on the input and the desired output format, making it suitable for different parts of a model that may operate on data with or without a time dimension.

import tensorflow as tf
from fusionlab.nn.pinn.utils import extract_txy

# The same kind of 2D tensor as in the previous example,
# redefined here so the snippet runs on its own
coords_2d = tf.random.normal((4, 3))

# Case 1: expect_dim=None (preserves rank)
t, x, y = extract_txy(coords_2d, expect_dim=None)
print(f"With expect_dim=None, 2D input gives output shape: {t.shape}")

# Case 2: expect_dim='3d' (expands 2D to 3D)
t, x, y = extract_txy(coords_2d, expect_dim='3d')
print(f"With expect_dim='3d', 2D input gives output shape: {t.shape}")

Expected Output:

With expect_dim=None, 2D input gives output shape: (4, 1)
With expect_dim='3d', 2D input gives output shape: (4, 1, 1)
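For comparison, the rank-preserving and rank-expanding cases shown above can be sketched in NumPy. `split_txy` is a hypothetical stand-in that covers only the two documented settings (None and '3d'); any other values the real expect_dim parameter may accept are deliberately left out, and the (t, x, y) ordering of the last axis is again an assumption.

```python
import numpy as np


def split_txy(coords, expect_dim=None):
    """Split coords into t, x, y; optionally force rank-3 outputs.

    Hypothetical NumPy stand-in covering only expect_dim=None and '3d'.
    """
    if expect_dim not in (None, "3d"):
        raise ValueError("this sketch only handles expect_dim=None or '3d'")
    coords = np.asarray(coords)
    if coords.ndim not in (2, 3) or coords.shape[-1] != 3:
        raise ValueError(f"expected (batch, [time,] 3), got {coords.shape}")
    if expect_dim == "3d" and coords.ndim == 2:
        # Insert a length-1 time axis only when rank 3 is requested.
        coords = coords[:, np.newaxis, :]
    return tuple(coords[..., i:i + 1] for i in range(3))


coords_2d = np.zeros((4, 3))
t_keep, _, _ = split_txy(coords_2d, expect_dim=None)
t_expand, _, _ = split_txy(coords_2d, expect_dim="3d")
print(t_keep.shape)    # (4, 1)
print(t_expand.shape)  # (4, 1, 1)
```

The only difference from the strict variant is that a 2D input is left at rank 2 unless a 3D result is explicitly requested.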