fusionlab.utils.widen_temporal_columns

fusionlab.utils.widen_temporal_columns(data, dt_col, spatial_cols=None, target_name=None, round_dt=True, ignore_cols=None, nan_op=None, nan_thresh=None, savefile=None, verbose=0)

Convert a long PIHALNet prediction table into a wide format where each temporal slice becomes a dedicated column.

The routine pivots columns whose names follow the pattern

<base>           deterministic forecast
<base>_qXX       quantile forecast (e.g., ``subsidence_q10``)
<base>_actual    ground‑truth column

and produces columns of the form

<base>_<year>            point forecast
<base>_<year>_qXX        quantile forecast
<base>_<year>_actual     ground‑truth value
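
The renaming convention above can be sketched in a few lines. This is only an illustration of the naming scheme; `wide_name` and its regular expression are hypothetical and are not part of fusionlab.

```python
import re

# Hypothetical pattern (not fusionlab code): a base name followed by an
# optional quantile suffix (_qXX) or the ground-truth suffix (_actual).
_SUFFIX = re.compile(r"^(?P<base>.+?)_(?P<kind>q\d+|actual)$")


def wide_name(column: str, year: int) -> str:
    """Map a long-format column name to its wide-format name for one year."""
    match = _SUFFIX.match(column)
    if match is None:
        # No recognised suffix: treat the column as a deterministic forecast.
        return f"{column}_{year}"
    return f"{match['base']}_{year}_{match['kind']}"


print(wide_name("subsidence", 2020))         # subsidence_2020
print(wide_name("subsidence_q10", 2020))     # subsidence_2020_q10
print(wide_name("subsidence_actual", 2019))  # subsidence_2019_actual
```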

If duplicate (spatial, year) pairs are found, their values are averaged (a pandas groupby followed by mean) before pivoting, which avoids “Index contains duplicate entries” errors.
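
Why the averaging step matters can be seen with plain pandas. The sketch below is an assumption about the general approach (group, average, then pivot), not a copy of the function's internal code; the frame and variable names are made up for illustration.

```python
import pandas as pd

df_long = pd.DataFrame(
    {
        "coord_x": [113.15, 113.15, 113.15, 113.15],
        "coord_y": [22.63, 22.63, 22.63, 22.63],
        "coord_t": [2019, 2020, 2019, 2020],
        "subsidence_q50": [0.09, 0.10, 0.12, 0.13],
    }
)
spatial = ["coord_x", "coord_y"]

# A plain pivot refuses duplicate (spatial, year) keys.
try:
    df_long.pivot(index=spatial, columns="coord_t", values="subsidence_q50")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Averaging duplicates first leaves one row per key, so the pivot succeeds.
deduped = df_long.groupby(spatial + ["coord_t"], as_index=False)["subsidence_q50"].mean()
wide = deduped.pivot(index=spatial, columns="coord_t", values="subsidence_q50")
wide.columns = [f"subsidence_{year}_q50" for year in wide.columns]
wide = wide.reset_index()
print(wide)
```

Here each year has two rows for the same location, so the wide frame holds their means rather than either original value.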

Parameters:

data (PathLike object or pandas.DataFrame) – Long‑format DataFrame returned by fusionlab.utils.format_pihalnet_predictions, or a path to a file holding such a table.
dt_col (str) – Column holding the temporal coordinate (e.g., 'coord_t'). Must be numeric or datetime‑coercible. When round_dt is True, values are rounded to integers.
spatial_cols ((str, str) or None, default None) – Names of x and y spatial coordinates. These are retained as leading columns in the output. If None, the function falls back to 'sample_idx' or an auto‑generated 'row_id'.
target_name (str or None, default None) – Restrict pivoting to a specific base (e.g., 'subsidence'). When None, every base present in data is widened.
round_dt (bool, default True) – Round dt_col to the nearest integer (helpful for fractional years such as 2020.0001).
ignore_cols (list[str] or None, default None) – Additional columns to carry through unchanged. Values are propagated per spatial location using the first non‑null entry.
nan_op ({'drop', 'fill', 'both', None}, default None) –
Strategy for NaN handling after pivot:
- 'fill' – forward‑fill then back‑fill missing values.
- 'drop' – drop rows containing NaNs (see nan_thresh).
- 'both' – fill then drop according to nan_thresh.
- None – leave NaNs untouched.
nan_thresh (float or None, default None) –
When nan_op contains 'drop', rows are dropped if the proportion of missing values exceeds nan_thresh. Set nan_thresh = 0 to require no NaNs, 0.5 to allow ≤ 50 % missing, etc.

\[\text{row kept} \;\Longleftrightarrow\; \frac{\text{NaNs in row}}{\text{row width}} \le \text{nan\_thresh}\]
savefile (str or None, default None) – If a file path is provided, the final wide‑format DataFrame is saved to it as a CSV file.
verbose (int, default 0) – Diagnostic verbosity from 0 (silent) to 5 (trace every step).
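
The nan_thresh rule can be checked by hand with plain pandas. The sketch below assumes that "row width" counts only the value columns shown here; whether the function also counts the spatial columns is not stated above, so treat the denominator as an assumption.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {
        "subsidence_2018_q50": [0.1, np.nan, np.nan, 0.4],
        "subsidence_2019_q50": [0.2, np.nan, 0.3, 0.5],
        "subsidence_2020_q50": [0.3, 0.6, np.nan, np.nan],
        "subsidence_2021_q50": [0.4, 0.7, np.nan, 0.6],
    }
)
nan_thresh = 0.5

# Fraction of missing cells per row: NaNs in row / row width.
nan_fraction = wide.isna().sum(axis=1) / wide.shape[1]

# A row is kept when its missing fraction does not exceed the threshold.
kept = wide[nan_fraction <= nan_thresh]
print(nan_fraction.tolist())  # [0.0, 0.5, 0.75, 0.25]
print(kept.index.tolist())    # [0, 1, 3]
```

Row 1 sits exactly at the threshold and is kept, because the rule uses ≤ rather than <.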

Returns:

Wide‑format frame with spatial identifiers first, followed by year‑wise forecast, quantile, and actual columns.

Return type:

pandas.DataFrame

Raises:

KeyError – dt_col or any of spatial_cols is missing from data.
ValueError – No columns match target_name or nan_thresh is outside \([0, 1]\).

Notes

Duplicate indices are aggregated with the arithmetic mean before pivoting. Modify the aggregation lambda inside the function for alternative choices.
If ignore_cols is provided, their first non‑null value per spatial location is appended to the output.
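
The "first non‑null value per spatial location" behaviour corresponds to what pandas' GroupBy.first() does. The sketch below illustrates that idea on a made-up frame; it is not taken from the function's source.

```python
import pandas as pd

df_long = pd.DataFrame(
    {
        "coord_x": [113.40, 113.40, 113.45, 113.45],
        "coord_y": [22.10, 22.10, 22.15, 22.15],
        "coord_t": [2019, 2020, 2019, 2020],
        "region": [None, "A", "B", "B"],  # first entry deliberately missing
    }
)

# GroupBy.first() skips nulls, so each location gets its first non-null value.
carried = df_long.groupby(["coord_x", "coord_y"], as_index=False)["region"].first()
print(carried)
```

The first location resolves to "A" even though its earliest row is null.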

Examples

Minimal usage on a tiny synthetic set

>>> import pandas as pd
>>> from fusionlab.utils.data_utils import widen_temporal_columns
>>>
>>> df_long = pd.DataFrame(
...     {
...         "coord_x": [113.15, 113.15, 113.15, 113.15],
...         "coord_y": [22.63, 22.63, 22.63, 22.63],
...         "coord_t": [2019, 2020, 2019, 2020],
...         "subsidence_q50": [0.09, 0.10, 0.12, 0.13],
...         "subsidence_actual": [0.08, 0.11, 0.10, 0.14],
...     }
... )
>>>
>>> wide = widen_temporal_columns(
...     df_long,
...     dt_col="coord_t",
...     spatial_cols=("coord_x", "coord_y"),
...     verbose=2,
... )
[INFO] Initial rows: 4, columns: 2
[INFO] Widening base 'subsidence' (2 columns)
[DONE] Final wide shape: (1, 4)
>>> wide
   coord_x  coord_y  subsidence_2019_actual  subsidence_2020_actual  \
0   113.15    22.63                   0.08                   0.11

   subsidence_2019_q50  subsidence_2020_q50
0                 0.12                 0.13

End‑to‑end example with NaN handling, ignored columns, and two targets

>>> import numpy as np
>>> rng = pd.date_range("2018", periods=3, freq="YS").year  # 2018, 2019, 2020
>>> n = 5  # five spatial locations
>>>
>>> # build synthetic long DataFrame
>>> df_long = pd.DataFrame(
...     {
...         "sample_idx": np.repeat(np.arange(n), len(rng)),
...         "coord_x": np.repeat(np.linspace(113.4, 113.5, n), len(rng)),
...         "coord_y": np.repeat(np.linspace(22.1, 22.2, n), len(rng)),
...         "coord_t": np.tile(rng, n),
...         "region": np.repeat(["A", "B", "A", "B", "A"], len(rng)),
...         "subsidence_q10": np.random.rand(n * len(rng)),
...         "subsidence_q50": np.random.rand(n * len(rng)),
...         "subsidence_q90": np.random.rand(n * len(rng)),
...         "subsidence_actual": np.random.rand(n * len(rng)),
...         "GWL_q50": np.random.rand(n * len(rng)),
...     }
... )
>>>
>>> # introduce NaNs for demonstration
>>> df_long.loc[df_long.sample(frac=0.2).index, "subsidence_q50"] = np.nan
>>>
>>> wide = widen_temporal_columns(
...     df_long,
...     dt_col="coord_t",
...     spatial_cols=("coord_x", "coord_y"),
...     ignore_cols=["region"],
...     target_name=None,      # widen both 'subsidence' and 'GWL'
...     nan_op="both",         # fill then drop rows with many NaNs
...     nan_thresh=0.4,        # allow at most 40 % missing
...     verbose=3,
... )
[INFO] Initial rows: 15, columns: 7
[INFO] Widening base 'GWL' (1 columns)
  └─ 0 duplicate rows in 'GWL_q50' → aggregated
[INFO] Widening base 'subsidence' (4 columns)
  └─ 0 duplicate rows in 'subsidence_q10' → aggregated
  └─ 0 duplicate rows in 'subsidence_q50' → aggregated
  └─ 0 duplicate rows in 'subsidence_q90' → aggregated
  └─ 0 duplicate rows in 'subsidence_actual' → aggregated
[INFO] Missing values filled (ffill+bfill).
[INFO] Rows with >40% NaN dropped.
[DONE] Final wide shape: (5, 19)
>>> wide.iloc[:2, :8]  # show first 8 columns
   coord_x  coord_y  GWL_2018_q50  GWL_2019_q50  GWL_2020_q50  \
0  113.400       ...         ...          ...          ...
1  113.425       ...         ...          ...          ...

   subsidence_2018_actual  subsidence_2019_actual  subsidence_2020_actual
0                     ...                     ...                     ...
1                     ...                     ...                     ...