Data Manipulation Utilities

The fusionlab.utils.data_utils module provides a collection of powerful helper functions for common and often complex data manipulation tasks encountered in time series and forecasting workflows.

This guide covers the primary utilities for handling missing values, reshaping temporal data from long to wide formats, and performing conditional data masking.


Advanced NaN Handling (nan_ops)

API Reference:

nan_ops()

This is a powerful, centralized function for managing missing values (\(NaN\)). It goes beyond a simple .dropna() or .fillna() by providing a robust framework for checking, validating, and sanitizing data, with special handling for keeping feature and target arrays aligned.

Its three main modes of operation (ops parameter) are: * `’check_only’`: Safely checks if NaN values exist and returns True or False. * `’validate’`: Asserts that no NaN`s are present, raising an error if they are, which is useful for ensuring data integrity in a pipeline. * **’sanitize’**: The main action mode, which can either `’fill’`` or 'drop' `NaN`s based on user-defined rules.

A key feature is its handling of auxiliary or “witness” data via the auxi_data parameter. When rows containing NaN`s are dropped from the primary `data, the exact same rows are dropped from auxi_data, perfectly preserving the alignment between, for example, a target vector and its corresponding feature matrix.

Usage Example:

In this example, we have a target Series and a feature DataFrame with NaN values in different locations. We use nan_ops to drop rows from both structures wherever the target has a NaN, keeping them aligned.

 1import pandas as pd
 2import numpy as np
 3from fusionlab.utils.data_utils import nan_ops
 4
 5# 1. Create a target Series and a feature DataFrame with NaNs
 6target = pd.Series([10, 20, np.nan, 40, 50])
 7features = pd.DataFrame({
 8    'A': [1, 2, 3, 4, 5],
 9    'B': [0.1, 0.2, 0.3, np.nan, 0.5] # NaN is in a different row
10})
11
12# 2. Sanitize by dropping rows where the target is NaN
13#    and apply the same row removal to the features.
14cleaned_target, cleaned_features = nan_ops(
15    data=target,
16    auxi_data=features,
17    data_kind='target', # Specifies that `data` is the target
18    ops='sanitize',
19    action='drop'
20)
21
22# 3. Display the results
23print("--- Cleaned Target ---")
24print(cleaned_target)
25print("\n--- Aligned Features ---")
26print(cleaned_features)

Expected Output:

--- Cleaned Target ---
0    10.0
1    20.0
3    40.0
4    50.0
dtype: float64

--- Aligned Features ---
     A    B
0  1.0  0.1
1  2.0  0.2
3  4.0  NaN
4  5.0  0.5

Reshaping Time Series Data (widen_temporal_columns)

API Reference:

widen_temporal_columns()

This utility transforms a long-format forecast DataFrame into a wide-format one. This is a specific type of pivot operation where temporal information is encoded into the column names. For example, a column subsidence_q50 with a time value of 2019 becomes a new column named subsidence_2019_q50.

This reshaping is particularly useful for creating summary tables or for analysis methods that require each time series sample to be represented by a single row.

Usage Example:

 1from fusionlab.utils.data_utils import widen_temporal_columns
 2
 3# 1. Create a sample long-format forecast DataFrame
 4df_long = pd.DataFrame({
 5    "coord_x": [113.15, 113.15, 113.20, 113.20],
 6    "coord_y": [22.63, 22.63, 22.80, 22.80],
 7    "coord_t": [2019, 2020, 2019, 2020],
 8    "subsidence_q50": [9.1, 10.2, 12.5, 13.6],
 9    "subsidence_actual": [9.0, 10.5, 12.4, 13.8],
10    "GWL_q50": [5.5, 5.3, 4.1, 4.0],
11    "region": ["A", "A", "B", "B"]
12})
13
14# 2. Convert to wide format
15df_wide = widen_temporal_columns(
16    df_long,
17    dt_col="coord_t",
18    spatial_cols=("coord_x", "coord_y"),
19    ignore_cols=["region"], # Carry this static column through
20    verbose=1
21)
22
23# 3. Display the wide DataFrame
24print(df_wide)

Expected Output:

[INFO] Initial rows: 4, columns: 2
[INFO] Widening base 'GWL' (1 columns)
[INFO] Widening base 'subsidence' (2 columns)
[DONE] Final wide shape: (2, 9)
   coord_x  coord_y region  GWL_2019_q50  GWL_2020_q50  subsidence_2019_actual  subsidence_2020_actual  subsidence_2019_q50  subsidence_2020_q50
0   113.15    22.63      A           5.5           5.3                     9.0                    10.5                  9.1                 10.2
1   113.20    22.80      B           4.1           4.0                    12.4                    13.8                 12.5                 13.6

Conditional Data Masking (mask_by_reference)

API Reference:

mask_by_reference()

This function provides a powerful way to conditionally modify data. It finds rows in a DataFrame based on a value in a ref_col and then replaces the values in other columns of those matched rows with a specified fill_value.

It is highly flexible, supporting both exact matching and approximate (closest value) matching for numeric data. This is useful for data cleaning, scenario creation, or preparing data for specific modeling techniques.

Usage Example:

 1from fusionlab.utils.data_utils import mask_by_reference
 2
 3# 1. Create a sample DataFrame
 4df = pd.DataFrame({
 5    "A_code": [10, 0, 8, 0],
 6    "B_value": [2.0, 0.5, 18.0, 85.0],
 7    "C_value": [34.0, 0.8, 12.0, 4.5],
 8})
 9print("--- Original DataFrame ---")
10print(df)
11
12# 2. Example: Find rows where 'A_code' is exactly 0 and
13#    mask only the 'C_value' column with -999.
14df_masked = mask_by_reference(
15    data=df,
16    ref_col="A_code",
17    values=0,
18    fill_value=-999,
19    mask_columns=["C_value"] # Only affect this column
20)
21
22print("\n--- Masked DataFrame ---")
23print(df_masked)

Expected Output:

--- Original DataFrame ---
   A_code  B_value  C_value
0      10      2.0     34.0
1       0      0.5      0.8
2       8     18.0     12.0
3       0     85.0      4.5

--- Masked DataFrame ---
   A_code  B_value  C_value
0      10      2.0     34.0
1       0      0.5   -999.0
2       8     18.0     12.0
3       0     85.0   -999.0