PINN Data Utilities
The Physics-Informed Neural Network (PINN) models in fusionlab-learn,
such as PIHALNet and TransFlowSubsNet, have unique data
requirements. They need not only the standard feature-based inputs
(static, dynamic, future) but also spatio-temporal coordinates
(\(t, x, y\)) to compute the physics-based loss.
This section details the specialized utility functions designed to handle the complex data preparation and results formatting tasks associated with these models.
Sequence Generation for PINNs (prepare_pinn_data_sequences)
- API Reference: prepare_pinn_data_sequences()
This is the most critical data preparation utility for PINN models. Its primary role is to transform a flat, time-series pandas.DataFrame into the complex, multi-part sequence format required for training.
The function iterates through the DataFrame, creating rolling windows that pair a lookback sequence of a specified length (time_steps) with the targets for the following forecast_horizon steps. Its key distinction is that it generates both the feature-based input tensors and the `coords` tensor needed by the physics module.
Key Parameters:
- df: The input DataFrame containing all features, targets, and coordinates in long format.
- group_id_cols: A list of columns (e.g., ['longitude', 'latitude']) used to identify and separate individual time series within the DataFrame. The function generates sequences independently for each group.
- time_steps: The length of the historical lookback window for the dynamic features.
- forecast_horizon: The number of future steps to predict.
- _cols arguments: Various arguments (dynamic_cols, static_cols, future_cols, subsidence_col, etc.) that map column names in the DataFrame to their respective roles.
- normalize_coords: A boolean flag that, when True, scales the spatio-temporal coordinate values (\(t, x, y\)) to a 0-1 range, which is highly recommended for stable gradient calculations in the PINN loss.
Workflow and Outputs:
The function returns two dictionaries containing NumPy arrays:
- inputs_dict: Contains all the input tensors required by the model’s call method.
  - 'coords': The spatio-temporal coordinates for the forecast horizon, shape: \((N, H, 3)\).
  - 'static_features': Shape: \((N, D_s)\).
  - 'dynamic_features': Shape: \((N, T, D_d)\).
  - 'future_features': Shape: \((N, H, D_f)\).
- targets_dict: Contains the ground-truth target tensors.
  - 'subsidence': Shape: \((N, H, O_s)\).
  - 'gwl': Shape: \((N, H, O_g)\).
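Here, \(N\) is the total number of sequences generated, \(T\) is time_steps, and \(H\) is forecast_horizon. \(D_s\), \(D_d\), and \(D_f\) are the static, dynamic, and future feature dimensions, and \(O_s\) and \(O_g\) are the output dimensions of the subsidence and gwl targets.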
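To make the shapes above concrete, the following is a minimal sketch of per-group rolling windows written with plain pandas and NumPy. It is not the library's implementation: the helper name sketch_pinn_windows, its signature, and the stride-1 windowing rule are assumptions chosen for illustration, and the number of windows that prepare_pinn_data_sequences() actually emits per group may differ.

```python
import numpy as np
import pandas as pd


def sketch_pinn_windows(df, group_cols, time_col, coord_cols, dynamic_cols,
                        target_col, time_steps, forecast_horizon):
    """Hypothetical helper (not part of fusionlab): stride-1 windows per group."""
    coords, dynamic, targets = [], [], []
    span = time_steps + forecast_horizon
    for _, g in df.groupby(group_cols, sort=False):
        g = g.sort_values(time_col)
        txy = g[[time_col, *coord_cols]].to_numpy(dtype=float)
        dyn = g[dynamic_cols].to_numpy(dtype=float)
        tgt = g[[target_col]].to_numpy(dtype=float)
        # Assumed rule: keep every start index whose lookback window and
        # forecast horizon both fit inside the group.
        for start in range(len(g) - span + 1):
            split = start + time_steps
            dynamic.append(dyn[start:split])         # (T, D_d)
            coords.append(txy[split:start + span])   # (H, 3)
            targets.append(tgt[split:start + span])  # (H, 1)
    inputs = {"coords": np.stack(coords),
              "dynamic_features": np.stack(dynamic)}
    return inputs, {"subsidence": np.stack(targets)}


# Two locations with six yearly rows each.
rows = [
    {"year": y, "lon": 113.5 + k, "lat": 22.5 + k,
     "GWL": 5.0 - 0.5 * i, "subsidence": -10.0 - 2.0 * i}
    for k in range(2) for i, y in enumerate(range(2010, 2016))
]
demo = pd.DataFrame(rows)
inputs, targets = sketch_pinn_windows(
    demo, ["lon", "lat"], "year", ["lon", "lat"], ["GWL"],
    "subsidence", time_steps=3, forecast_horizon=2)
print(inputs["coords"].shape)            # (4, 2, 3)
print(inputs["dynamic_features"].shape)  # (4, 3, 1)
print(targets["subsidence"].shape)       # (4, 2, 1)
```

Under this assumed rule, each six-row group yields 6 - (3 + 2) + 1 = 2 windows, so \(N = 4\). The `coords` array covers only the forecast steps, which is why its second axis is \(H\) rather than \(T\).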
Usage Example
import pandas as pd
import numpy as np
from fusionlab.nn.pinn.utils import prepare_pinn_data_sequences

# 1. Create a sample DataFrame with multiple time series groups
data = []
for group_id in range(2):  # 2 locations
    for year in range(2010, 2020):  # 10 years of data
        data.append({
            'year': year,
            'lon': 113.5 + group_id,
            'lat': 22.5 + group_id,
            'geology_type': f'type_{group_id}',
            'subsidence': -10 - (year - 2010) * 2 - group_id,
            'GWL': 5 - (year - 2010) * 0.5 - group_id,
            'rainfall': 1500 + np.random.randn() * 100
        })
df = pd.DataFrame(data)

# 2. Define feature sets
static_cols = ['geology_type']  # Categorical, so it is one-hot encoded below
dynamic_cols = ['GWL', 'rainfall']
df = pd.get_dummies(df, columns=static_cols, dtype=float)
static_cols_encoded = [c for c in df.columns if 'geology_type' in c]

# 3. Generate sequences
inputs, targets, _ = prepare_pinn_data_sequences(
    df=df,
    time_col='year',
    lon_col='lon', lat_col='lat',
    subsidence_col='subsidence', gwl_col='GWL',
    dynamic_cols=dynamic_cols,
    static_cols=static_cols_encoded,
    group_id_cols=['lon', 'lat'],
    time_steps=5,
    forecast_horizon=3
)

# 4. Print the shapes of the output
print("--- Generated Input Shapes ---")
for name, arr in inputs.items():
    print(f" '{name}': {arr.shape if arr is not None else 'None'}")
print("\n--- Generated Target Shapes ---")
for name, arr in targets.items():
    print(f" '{name}': {arr.shape}")
Expected Output:
--- Generated Input Shapes ---
'coords': (4, 3, 3)
'static_features': (4, 2)
'dynamic_features': (4, 5, 2)
'future_features': None
--- Generated Target Shapes ---
'subsidence': (4, 3, 1)
'gwl': (4, 3, 1)
Formatting Model Outputs (format_pinn_predictions)
- API Reference: format_pinn_predictions()
This function is the counterpart to the preparation utility. It takes
the raw dictionary of prediction tensors from a model’s .predict()
call and transforms it into a clean, long-format pandas.DataFrame
that is easy to analyze, visualize, or export.
It handles multi-target outputs and both point and quantile forecasts, and it can merge the predictions with ground-truth values, coordinates, and other static metadata for a complete results summary.
Note
The function format_pihalnet_predictions is a deprecated alias
for format_pinn_predictions() and is maintained for backward
compatibility. New code should use format_pinn_predictions.
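The long-format layout described above can be illustrated with a small hand-rolled flattening of a single \((B, H, Q)\) prediction array. This is only a sketch of the idea, not the code behind format_pinn_predictions(): the helper name to_long_frame is hypothetical, and the column names are modeled on those visible in the expected output of the usage example.

```python
import numpy as np
import pandas as pd


def to_long_frame(pred, quantiles, name, y_true=None):
    """Hypothetical helper: one row per (sample, forecast step)."""
    B, H, Q = pred.shape
    frame = pd.DataFrame({
        "sample_idx": np.repeat(np.arange(B), H),
        "forecast_step": np.tile(np.arange(1, H + 1), B),  # 1-based steps
    })
    # Row-major reshape: steps vary fastest within each sample, matching
    # the repeat/tile index columns built above.
    flat = pred.reshape(B * H, Q)
    for j, q in enumerate(quantiles):
        frame[f"{name}_q{int(round(q * 100))}"] = flat[:, j]
    if y_true is not None:
        frame[f"{name}_actual"] = y_true.reshape(B * H)
    return frame


pred = np.arange(2 * 3 * 3, dtype=float).reshape(2, 3, 3)
actual = np.zeros((2, 3, 1))
out = to_long_frame(pred, [0.1, 0.5, 0.9], "gwl", y_true=actual)
print(out)
```

Each quantile becomes its own column, so a \((2, 3, 3)\) array turns into six rows with the columns sample_idx, forecast_step, gwl_q10, gwl_q50, gwl_q90, and gwl_actual.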
Usage Example:
import pandas as pd
import numpy as np
from fusionlab.nn.pinn.utils import format_pinn_predictions

# 1. Create dummy model outputs and true values
B, H, Q_len = 4, 3, 3  # Batch, Horizon, Num Quantiles
quantiles = [0.1, 0.5, 0.9]

predictions = {
    'subs_pred': np.random.rand(B, H, Q_len),
    'gwl_pred': np.random.rand(B, H, Q_len)
}
y_true = {
    'subsidence': np.random.rand(B, H, 1),
    'gwl': np.random.rand(B, H, 1)
}
# Dummy coordinates and static IDs
coords = np.random.rand(B, H, 3)
ids = pd.DataFrame({'site_id': [f'site_{i}' for i in range(B)]})

# 2. Format the predictions into a DataFrame
df_results = format_pinn_predictions(
    predictions=predictions,
    y_true_dict=y_true,
    quantiles=quantiles,
    model_inputs={'coords': coords},  # Provide coords for inclusion
    ids_data_array=ids,
    target_mapping={'subs_pred': 'subsidence', 'gwl_pred': 'gwl'}
)

# 3. Display the head of the resulting DataFrame
print(df_results.head())
Expected Output:
sample_idx forecast_step coord_t ... gwl_q50 gwl_q90 gwl_actual
0 0 1 0.834251 ... 0.376958 0.417579 0.625352
1 0 2 0.591587 ... 0.749004 0.635746 0.368460
2 0 3 0.990352 ... 0.103313 0.513108 0.789334
3 1 1 0.057251 ... 0.231552 0.739546 0.087821
4 1 2 0.581780 ... 0.551159 0.279155 0.791243
[5 rows x 14 columns]
Coordinate and Feature Scaling (normalize_for_pinn)
- API Reference: normalize_for_pinn()
Normalization is crucial for training PINNs. The coordinate inputs (\(t, x, y\)) that are used to compute PDE derivatives should ideally be scaled to a standard range (e.g., [0, 1]) to ensure the gradients are well-behaved and stable.
This utility function provides a centralized way to handle this scaling.
- scale_coords=True: This primary option fits a single MinMaxScaler on the time_col, coord_x, and coord_y columns together, scaling each column to the [0, 1] range.
- cols_to_scale='auto': This option automatically detects all other numerical columns in the DataFrame (excluding boolean/one-hot encoded columns) and applies a separate scaler to them.
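For readers who want to see the arithmetic behind the coordinate scaling, here is a minimal NumPy sketch of per-column min-max scaling and its inverse. It reproduces what a min-max scaler computes for each column but is not the code inside normalize_for_pinn(); the variable names are illustrative, and the guard for constant columns is an assumption about sensible behavior.

```python
import numpy as np

# Columns are (t, x, y), using the same values as the usage example.
coords = np.array([[2020.0, -122.4, 37.7],
                   [2021.0, -122.3, 37.8],
                   [2022.0, -122.2, 37.9]])

col_min = coords.min(axis=0)
col_range = coords.max(axis=0) - col_min
# Assumed guard: a constant column would otherwise divide by zero.
safe_range = np.where(col_range == 0, 1.0, col_range)

scaled = (coords - col_min) / safe_range   # each column now spans [0, 1]
restored = scaled * safe_range + col_min   # inverse transform

print(np.round(col_range, 6))
print(np.round(scaled, 6))
print(np.allclose(restored, coords))
```

Keeping col_min and col_range around is what makes it possible to map model outputs or coordinates back to physical units later, which is the role the returned scaler objects play in the library call.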
Usage Example:
import pandas as pd
from fusionlab.nn.pinn.utils import normalize_for_pinn

# 1. Create a sample DataFrame
df = pd.DataFrame({
    'time': [2020.0, 2021.0, 2022.0],
    'lon': [-122.4, -122.3, -122.2],
    'lat': [37.7, 37.8, 37.9],
    'rainfall': [500, 600, 550],
    'is_event': [0, 1, 0]  # A one-hot style column
})

# 2. Normalize coordinates and auto-selected features
df_scaled, coord_scaler, other_scaler = normalize_for_pinn(
    df,
    time_col='time',
    coord_x='lon',
    coord_y='lat',
    scale_coords=True,
    cols_to_scale='auto'  # Auto-detect 'rainfall'
)

# 3. Display results
print("--- Original DataFrame ---")
print(df)
print("\n--- Scaled DataFrame ---")
print(df_scaled)
print(f"\nCoordinate Scaler Range: {coord_scaler.data_range_}")
print(f"Feature Scaler Range: {other_scaler.data_range_}")
Expected Output:
[INFO] Scaling time, lon, lat columns...
[INFO] Auto-selecting numeric columns to scale...
[INFO] Excluding one-hot/boolean column 'is_event' from auto-scaling.
[INFO] Auto-selected columns: ['rainfall']
[INFO] Scaling additional columns: ['rainfall']
--- Original DataFrame ---
time lon lat rainfall is_event
0 2020.0 -122.4 37.7 500 0
1 2021.0 -122.3 37.8 600 1
2 2022.0 -122.2 37.9 550 0
--- Scaled DataFrame ---
time lon lat rainfall is_event
0 0.0 0.0 0.0 0.0 0
1 0.5 0.5 0.5 1.0 1
2 1.0 1.0 1.0 0.5 0
Coordinate Scaler Range: [2. 0.2 0.2]
Feature Scaler Range: [100.]
Coordinate Extraction Utilities
The library includes two low-level helpers, extract_txy_in and
extract_txy, used internally to robustly parse the \(t, x, y\)
coordinate tensors from different input structures (e.g., a single
concatenated tensor vs. a dictionary).
While you may not need to call these directly, understanding their difference is useful for advanced customization.
The Difference:
The key difference lies in how they handle the dimensionality of the output tensors.
extract_txy_in (Internal, Strict)
- API Reference: extract_txy_in()
This version is stricter and is designed for internal model components that always expect a 3D spatio-temporal tensor. It always ensures the output tensors have a rank of 3. If it receives a 2D input of shape (batch, 3), it will automatically expand it to (batch, 1, 3) before slicing, ensuring a consistent 3D output like (batch, 1, 1).
import tensorflow as tf
from fusionlab.nn.pinn.utils import extract_txy_in

# A 2D spatial tensor (batch, features)
coords_2d = tf.random.normal((4, 3))
# A 3D spatio-temporal tensor (batch, time, features)
coords_3d = tf.random.normal((4, 10, 3))

t2, x2, y2 = extract_txy_in(coords_2d)
t3, x3, y3 = extract_txy_in(coords_3d)

print(f"Input 2D shape: {coords_2d.shape}")
print(f"Output shape from 2D input: {t2.shape}")
print(f"\nInput 3D shape: {coords_3d.shape}")
print(f"Output shape from 3D input: {t3.shape}")
Expected Output:
Input 2D shape: (4, 3)
Output shape from 2D input: (4, 1, 1)
Input 3D shape: (4, 10, 3)
Output shape from 3D input: (4, 10, 1)
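The rank-3 guarantee described above can be mimicked in a few lines of NumPy, which may help when reasoning about shapes without a TensorFlow session. This is a sketch of the behavior only: the function name split_txy_rank3 is hypothetical, it assumes the last axis is ordered \((t, x, y)\), and it does not cover the dictionary inputs that the library helper is said to accept.

```python
import numpy as np


def split_txy_rank3(coords):
    """Hypothetical NumPy analogue: always return rank-3 t, x, y arrays."""
    coords = np.asarray(coords)
    if coords.ndim == 2:
        # (batch, 3) -> (batch, 1, 3): add a singleton time axis first.
        coords = coords[:, np.newaxis, :]
    if coords.ndim != 3 or coords.shape[-1] != 3:
        raise ValueError("expected shape (batch, 3) or (batch, time, 3)")
    # Slicing with i:i + 1 keeps the trailing axis, so rank is preserved.
    t, x, y = (coords[..., i:i + 1] for i in range(3))
    return t, x, y


t2, x2, y2 = split_txy_rank3(np.zeros((4, 3)))
t3, x3, y3 = split_txy_rank3(np.zeros((4, 10, 3)))
print(t2.shape)  # (4, 1, 1)
print(t3.shape)  # (4, 10, 1)
```

Using a slice (i:i + 1) rather than an integer index is the detail that keeps the final axis of length 1, so each of t, x, and y can be concatenated or differentiated as a column.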
extract_txy (Flexible)
- API Reference: extract_txy()
This version is more flexible and is controlled by the expect_dim parameter. It can return 2D or 3D tensors based on the input and the desired output format, making it suitable for different parts of a model that may operate on data with or without a time dimension.
import tensorflow as tf
from fusionlab.nn.pinn.utils import extract_txy

# Same 2D tensor shape as in the previous example (batch, features)
coords_2d = tf.random.normal((4, 3))

# Case 1: expect_dim=None (preserves rank)
t, x, y = extract_txy(coords_2d, expect_dim=None)
print(f"With expect_dim=None, 2D input gives output shape: {t.shape}")

# Case 2: expect_dim='3d' (expands 2D to 3D)
t, x, y = extract_txy(coords_2d, expect_dim='3d')
print(f"With expect_dim='3d', 2D input gives output shape: {t.shape}")
Expected Output:
With expect_dim=None, 2D input gives output shape: (4, 1)
With expect_dim='3d', 2D input gives output shape: (4, 1, 1)
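The two cases shown above differ only in whether a time axis is inserted before slicing. A minimal NumPy sketch of that switch follows; the function name split_txy is hypothetical, only the two expect_dim values that appear in this section (None and '3d') are modeled, and any other options the real extract_txy() supports are deliberately left out.

```python
import numpy as np


def split_txy(coords, expect_dim=None):
    """Hypothetical NumPy analogue of the rank-preserving vs. '3d' modes."""
    if expect_dim not in (None, "3d"):
        raise ValueError("this sketch only models expect_dim=None or '3d'")
    coords = np.asarray(coords)
    if expect_dim == "3d" and coords.ndim == 2:
        coords = coords[:, np.newaxis, :]  # (batch, 3) -> (batch, 1, 3)
    # With expect_dim=None the input rank is left untouched.
    return tuple(coords[..., i:i + 1] for i in range(3))


coords_2d = np.zeros((4, 3))
t_keep, _, _ = split_txy(coords_2d, expect_dim=None)
t_3d, _, _ = split_txy(coords_2d, expect_dim="3d")
print(t_keep.shape)  # (4, 1)
print(t_3d.shape)    # (4, 1, 1)
```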