Datasets

The fusionlab.datasets module provides access to sample datasets relevant to the forecasting tasks addressed by the library (like land subsidence) and includes tools for generating synthetic datasets. These can be useful for testing models, demonstrating utilities, and running examples.

Loading Included Datasets

These functions load pre-packaged (or downloadable) datasets, often derived from real-world studies but potentially sampled or processed for convenience. They typically handle locating the data file (checking local cache, package data, and optionally downloading).
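
The lookup order described above can be pictured with a short sketch. It is not fusionlab's implementation: `locate_dataset`, its directory arguments, and the `download` callback are hypothetical names used only to illustrate the "cache, then package data, then optional download" sequence.

```python
import tempfile
from pathlib import Path
from typing import Callable, Optional


def locate_dataset(
    filename: str,
    cache_dir: Path,
    package_dir: Path,
    download: Optional[Callable[[Path], None]] = None,
) -> Path:
    """Return the first available copy of `filename` (hypothetical helper).

    Search order: local cache, then package data, then an optional
    download into the cache.
    """
    cached = cache_dir / filename
    if cached.exists():
        return cached

    packaged = package_dir / filename
    if packaged.exists():
        return packaged

    if download is not None:
        cache_dir.mkdir(parents=True, exist_ok=True)
        download(cached)  # the callback is expected to write the file
        if cached.exists():
            return cached

    raise FileNotFoundError(f"{filename} not found in cache or package data")


# Temporary directories stand in for the real cache and package locations.
with tempfile.TemporaryDirectory() as tmp:
    cache, package = Path(tmp) / "cache", Path(tmp) / "pkg"
    package.mkdir()
    (package / "demo.csv").write_text("x,y\n1,2\n")
    found = locate_dataset("demo.csv", cache, package)
    found_name = found.parent.name  # the package copy is used: no cache yet
```

Checking the cache first keeps repeated calls cheap, and leaving the download step optional lets the same function work offline.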

fetch_zhongshan_data

API Reference:

fetch_zhongshan_data()

Loads the sampled Zhongshan land subsidence dataset (zhongshan_2000.csv). This file contains ~2000 data points spatially sampled from a larger dataset [Liu24], including raw features like coordinates, year, GWL, rainfall, geology, seismic risk, density metrics, and the target subsidence variable.

The function can return the data as a pandas DataFrame or a scikit-learn style Bunch object containing the data and metadata. It also supports further sub-sampling via the n_samples parameter.

Basic Usage (Load as Bunch):

from fusionlab.datasets import fetch_zhongshan_data

# Load the full dataset (~2000 samples) as a Bunch
zhongshan_bunch = fetch_zhongshan_data(as_frame=False)

# Access the DataFrame
print("Zhongshan DataFrame shape:", zhongshan_bunch.frame.shape)

# Access target values
print("Target shape:", zhongshan_bunch.target.shape)

# Access feature names
print("Feature names:", zhongshan_bunch.feature_names)

fetch_nansha_data

API Reference:

fetch_nansha_data()

Loads the sampled Nansha land subsidence dataset (nansha_2000.csv). Similar to the Zhongshan dataset, this contains ~2000 spatially sampled data points with features relevant to subsidence in the Nansha area, including coordinates, year, geological information, hydrogeology, building concentration, soil thickness, risk scores, and the target subsidence.

It provides the same options as fetch_zhongshan_data for returning a DataFrame or Bunch, sub-sampling (n_samples), and controlling data loading/caching.

Basic Usage (Load Sample as DataFrame):

from fusionlab.datasets import fetch_nansha_data

# Load a random spatial sample of 500 points as a DataFrame
nansha_df_sample = fetch_nansha_data(
    n_samples=500,
    as_frame=True,
    random_state=42,  # for reproducibility
    verbose=False     # suppress messages
)

print(f"Loaded Nansha sample shape: {nansha_df_sample.shape}")
print(nansha_df_sample.head())

load_processed_subsidence_data

API Reference:

load_processed_subsidence_data()

This function provides a higher-level pipeline that loads one of the raw datasets (zhongshan_2000.csv or nansha_2000.csv via the fetch_ functions), applies a predefined preprocessing workflow (feature selection, NaN handling, categorical encoding, numerical scaling), and optionally reshapes the data into sequences suitable for TFT/XTFT models using reshape_xtft_data().

It includes options to control which preprocessing steps are applied and utilizes caching for processed DataFrames and generated sequences to speed up repeated calls.

Basic Usage (Get Processed Frame):

from fusionlab.datasets import load_processed_subsidence_data

# Load and preprocess Zhongshan data, return as DataFrame
# Applies default preprocessing: feature select, nan fill, one-hot, minmax scale
df_processed = load_processed_subsidence_data(
    dataset_name='zhongshan',
    return_sequences=False,    # Get the processed DataFrame
    as_frame=True,
    use_processed_cache=True,  # Try to load from cache first
    save_processed_frame=True  # Save if reprocessed
)
print("Processed Zhongshan DataFrame info:")
df_processed.info()

Usage for Model Training (Get Sequences):

from fusionlab.datasets import load_processed_subsidence_data

# Load Zhongshan, preprocess, and generate sequences
static, dynamic, future, target = load_processed_subsidence_data(
    dataset_name='zhongshan',
    return_sequences=True,  # Request sequence arrays
    time_steps=12,          # Example lookback
    forecast_horizon=6,     # Example horizon
    use_sequence_cache=True,
    save_sequences=True
)
print("\nGenerated sequences for model training:")
print(f"Static shape: {static.shape}")
print(f"Dynamic shape: {dynamic.shape}")
print(f"Future shape: {future.shape}")
print(f"Target shape: {target.shape}")
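
To see how time_steps and forecast_horizon determine the number of sequences, the sketch below applies a plain sliding window to a single toy series. It is a simplified illustration in dependency-free Python, not the logic of reshape_xtft_data(); `make_windows` is a hypothetical helper and ignores static and future features.

```python
from typing import List, Tuple


def make_windows(
    series: List[float], time_steps: int, forecast_horizon: int
) -> Tuple[List[List[float]], List[List[float]]]:
    """Split one series into (lookback, horizon) pairs with a sliding window."""
    inputs: List[List[float]] = []
    targets: List[List[float]] = []
    # Last index at which a full lookback plus a full horizon still fits.
    last_start = len(series) - time_steps - forecast_horizon
    for start in range(last_start + 1):
        split = start + time_steps
        inputs.append(series[start:split])
        targets.append(series[split:split + forecast_horizon])
    return inputs, targets


series = [float(v) for v in range(24)]  # 24 time steps of toy data
X, y = make_windows(series, time_steps=12, forecast_horizon=6)
print(len(X), len(X[0]), len(y[0]))  # 7 12 6
```

A series of length T yields T - time_steps - forecast_horizon + 1 windows, so a series shorter than time_steps + forecast_horizon yields none.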

load_subsidence_pinn_data

API Reference:

load_subsidence_pinn_data()

This function is the recommended all-in-one data preparation pipeline for any project using the library’s PINN models, such as PIHALNet and TransFlowSubsNet.

It is designed to handle the entire data ingestion and preprocessing workflow, which is particularly complex for physics-informed models. By abstracting away common steps like data loading, cleaning, encoding, scaling, and optional augmentation, it saves a significant amount of boilerplate code and helps prevent common errors, allowing you to focus on the modeling itself.

End-to-End Workflow

The function executes a comprehensive, multi-stage workflow, with each stage being configurable through the function’s parameters.

1. Data Sourcing & Caching

The first step is to get the data. The function employs an efficient strategy:

  • Caching: If use_cache=True, it first checks for a pre-processed version of the data in your local cache directory. If one is found, it is loaded directly and all subsequent processing steps are skipped, which saves significant time on repeated runs.

  • Data Loading: If no cache is found, it proceeds to load the raw data according to the strategy parameter: 'load' (requires the file to exist), 'fallback' (loads the file if present, otherwise generates dummy data), or 'generate' (always creates a new dummy dataset, great for testing).

  • Saving Cache: After processing the data for the first time, you can set save_cache=True to save the fully processed results, including the DataFrame and any fitted scalers/encoders, for fast retrieval in the future.
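
The three strategy values can be read as a small decision table. The sketch below mirrors that behaviour on toy CSV data; `get_raw_rows`, `read_rows`, and `make_dummy_rows` are hypothetical stand-ins written for this illustration, not functions from fusionlab.

```python
import random
import tempfile
from pathlib import Path
from typing import Dict, List

Row = Dict[str, float]


def make_dummy_rows(n: int, seed: int = 0) -> List[Row]:
    """Create toy rows; stands in for a real dummy-data generator."""
    rng = random.Random(seed)
    return [
        {"year": 2015.0 + i % 5, "subsidence": rng.uniform(0, 10)}
        for i in range(n)
    ]


def read_rows(path: Path) -> List[Row]:
    """Read a purely numeric CSV with a header line into a list of dicts."""
    lines = path.read_text().strip().splitlines()
    header = lines[0].split(",")
    return [dict(zip(header, map(float, line.split(",")))) for line in lines[1:]]


def get_raw_rows(path: Path, strategy: str = "fallback") -> List[Row]:
    if strategy not in {"load", "fallback", "generate"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if strategy == "generate":
        return make_dummy_rows(10)     # never touches the file
    if path.exists():
        return read_rows(path)         # 'load' and 'fallback' prefer the file
    if strategy == "load":
        raise FileNotFoundError(path)  # 'load' requires the file
    return make_dummy_rows(10)         # 'fallback' degrades to dummy data


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "raw.csv"
    fallback_rows = get_raw_rows(csv_path, "fallback")  # file missing: dummy rows
    csv_path.write_text("year,subsidence\n2015,1.5\n2016,2.0\n")
    loaded_rows = get_raw_rows(csv_path, "load")        # file present: real rows
```

Under this reading, 'fallback' is the forgiving default for tutorials, while 'load' fails loudly when the expected file is missing.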

2. Automated Core Preprocessing

Once the raw data is loaded, the function performs a full preprocessing pipeline:

  • It ensures essential columns (coordinates, time, targets) exist and drops rows where they are missing.

  • It robustly converts the time column (e.g., integer years) into a proper datetime object for internal calculations.

  • It one-hot encodes specified categorical columns (like geology).

  • It creates a continuous numerical time coordinate, which is essential for computing derivatives in the PINN loss function.

  • It scales specified numerical features to a [0, 1] range to ensure stable model training. The fitted scaler and encoder objects are saved along with the data.
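
As a rough illustration of the encoding, scaling, and numeric-time steps listed above, the following sketch processes three toy rows in plain Python. The helper names are hypothetical and the column names (GWL, geology, year_numeric) are borrowed from the example output further below; the library's own pipeline operates on DataFrames with fitted scaler and encoder objects.

```python
from typing import Dict, List


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / span for v in values]


def one_hot(column: str, labels: List[str]) -> Dict[str, List[float]]:
    """Create one indicator column per category, named '<column>_<category>'."""
    return {
        f"{column}_{cat}": [1.0 if lab == cat else 0.0 for lab in labels]
        for cat in sorted(set(labels))
    }


raw = {
    "year": [2015, 2016, 2018],
    "GWL": [2.0, 4.0, 6.0],
    "geology": ["Clay", "Sand", "Clay"],
}

processed = {
    # Continuous time coordinate that a PINN loss could differentiate against.
    "year_numeric": [float(y) for y in raw["year"]],
    "GWL": min_max_scale(raw["GWL"]),
    **one_hot("geology", raw["geology"]),
}
print(processed["GWL"])           # [0.0, 0.5, 1.0]
print(processed["geology_Clay"])  # [1.0, 0.0, 1.0]
```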

3. Optional Data Augmentation

By setting augment_data=True, you can invoke the augment_city_spatiotemporal_data() pipeline. This can perform two types of augmentation on the data before it is returned:

  • Temporal Interpolation: Fills in missing time steps in your data (e.g., missing years) for each location.

  • Feature Augmentation: Adds a small amount of random noise to feature columns to create a larger, more diverse training set, which can improve model robustness.
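
The two augmentation ideas can be sketched for a single location as follows. This is a minimal illustration that assumes linear interpolation between observed years and Gaussian noise; the actual behaviour of augment_city_spatiotemporal_data() may differ, and both helper names are hypothetical.

```python
import random
from typing import Dict, List


def interpolate_years(observed: Dict[int, float]) -> Dict[int, float]:
    """Fill missing years between observations by linear interpolation."""
    years = sorted(observed)
    filled: Dict[int, float] = {}
    for left, right in zip(years, years[1:]):
        for year in range(left, right):
            frac = (year - left) / (right - left)
            filled[year] = observed[left] + frac * (observed[right] - observed[left])
    filled[years[-1]] = observed[years[-1]]  # keep the final observation
    return filled


def add_noise(values: List[float], scale: float, seed: int = 0) -> List[float]:
    """Return a jittered copy of `values` using zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# One location with 2016 missing from the record.
filled = interpolate_years({2015: 10.0, 2017: 14.0, 2018: 15.0})
print(filled)  # {2015: 10.0, 2016: 12.0, 2017: 14.0, 2018: 15.0}

noisy_copy = add_noise(list(filled.values()), scale=0.05)
```

Interpolation only fills gaps between the first and last observed year; it does not extrapolate beyond them.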

4. Flexible Output Format

The function can return its results in two convenient formats, controlled by return_dataframe:

  • A single, fully processed pandas.DataFrame.

  • An XBunch object. This is often the preferred output, as it’s a self-contained object that bundles the processed DataFrame (.frame) with crucial metadata like feature names (.feature_names), target names (.target_names), and a human-readable description of all the processing steps that were applied (.DESCR).

Usage Example

This example demonstrates how to use the function to load, process, and augment a dataset in a single call. For reproducibility, we first create a dummy CSV file to simulate a raw data source.

import pandas as pd
import numpy as np
import os
from fusionlab.datasets.load import load_subsidence_pinn_data
from fusionlab.utils.geo_utils import generate_dummy_pinn_data

# --- 1. Create a dummy raw data file for the example ---
DUMMY_DATA_DIR = "./dummy_data"
# The function will look inside a 'data' subdirectory of data_home
raw_dir = os.path.join(DUMMY_DATA_DIR, "data")
os.makedirs(raw_dir, exist_ok=True)
dummy_data_path = os.path.join(raw_dir, "zhongshan_2000.csv")

dummy_dict = generate_dummy_pinn_data(n_samples=100)
dummy_dict['geology'] = np.random.choice(['Clay', 'Sand'], 100)
pd.DataFrame(dummy_dict).to_csv(dummy_data_path, index=False)
print(f"Dummy data created at: {dummy_data_path}")

# --- 2. Use the pipeline to load and process the data ---
# We will load from the file, encode 'geology', scale numericals,
# and perform augmentation, returning a rich Bunch object.
processed_bunch = load_subsidence_pinn_data(
    data_name='zhongshan',        # This configures internal column names
    strategy='load',              # Explicitly load from file
    data_home=DUMMY_DATA_DIR,     # Tell the function where to look
    encode_categoricals=True,     # Enable one-hot encoding
    scale_numericals=True,        # Enable MinMax scaling
    augment_data=True,            # Enable augmentation
    augment_mode='interpolate',   # Specify interpolation mode
    use_cache=False,              # Disable caching for this demo
    as_frame=False                # Return the rich XBunch object
)

# --- 3. Inspect the output ---
print("\n--- Processed DataFrame (from Bunch) ---")
# The XBunch contains the processed DataFrame in the 'frame' attribute
print(processed_bunch.frame.head())

print("\n--- Description of Processing (from Bunch) ---")
print(processed_bunch.DESCR)

Expected Output:

Dummy data created at: ./dummy_data/data/zhongshan_2000.csv
... (Log messages from the function will appear here) ...

--- Processed DataFrame (from Bunch) ---
        year  longitude  latitude  ...  geology_Clay  geology_Sand  year_numeric
0 2008-01-01   113.0084   22.3616  ...           1.0           0.0        2008.0
1 2017-01-01   113.0172   22.3231  ...           1.0           0.0        2017.0
2 2010-01-01   113.0226   22.7769  ...           1.0           0.0        2010.0
3 2008-01-01   113.0289   22.4596  ...           1.0           0.0        2008.0
4 2021-01-01   113.0308   22.7131  ...           0.0           1.0        2021.0

[5 rows x 10 columns]

--- Description of Processing (from Bunch) ---
Processed Zhongshan PINN data.
Load Strategy: load.
Cache Used: No, Cache Path: N/A.
Categorical Encoding: Applied.
Numerical Scaling: minmax.
Augmentation: Applied (Mode: interpolate).
Rows: 100, Features: 7 (in 'data' array).
Targets: ['subsidence', 'GWL'].
Coordinate Precision: 4 decimal places.
Time Column (numeric): year_numeric.
...

Generating Synthetic Datasets

The fusionlab.datasets.make module provides functions to create synthetic datasets with specific characteristics. These are useful for:

  • Testing model implementations (TFT, XTFT, etc.).

  • Demonstrating specific features or components.

  • Creating reproducible examples for documentation or tutorials.

  • Evaluating algorithms under controlled conditions (e.g., specific trend types, anomaly patterns).

make_multi_feature_time_series

API Reference:

make_multi_feature_time_series()

Purpose: Generates a multi-variate dataset across multiple independent series (e.g., items, locations), including static, dynamic (past), and known future features, along with a target variable influenced by these components plus trend, seasonality, and noise.

Functionality: Creates a DataFrame simulating data suitable for models like TFT and XTFT. Key generated features include:

  • series_id (static)

  • base_level (static, noisy per series)

  • month, dayofweek (dynamic/future)

  • dynamic_cov (simulated dynamic covariate)

  • target_lag1 (dynamic)

  • future_event (simulated binary future covariate)

  • target (combination of inputs + trend + seasonality + noise)

Usage Context: Ideal for creating a complete, structured dataset from scratch to test the end-to-end workflow of TFT/XTFT models, including data preparation with reshape_xtft_data().

Code Example:

from fusionlab.datasets.make import make_multi_feature_time_series

# Generate data for 3 series, 50 steps each
data_bunch = make_multi_feature_time_series(
    n_series=3,
    n_timesteps=50,
    freq='D',              # Daily frequency
    seasonality_period=7,  # Weekly seasonality
    seed=42,
    as_frame=False         # Return Bunch object
)

print("--- Multi-Feature Time Series Bunch ---")
print("Generated DataFrame shape:", data_bunch.frame.shape)
print("Static features:", data_bunch.static_features)
print("Dynamic features:", data_bunch.dynamic_features)
print("Future features:", data_bunch.future_features)
print("Target column:", data_bunch.target_col)
print("\nSample Data:")
print(data_bunch.frame.head())

make_quantile_prediction_data

API Reference:

make_quantile_prediction_data()

Purpose: Generates a dataset simulating the typical output of a multi-horizon quantile forecasting model. It includes columns for actual target values and corresponding predicted quantiles for multiple steps ahead.

Functionality: Creates a DataFrame in a “wide” format where columns represent different forecast horizons (\(h\)) and quantiles (\(q\)).

  • Target columns: target_h1, target_h2, …

  • Prediction columns: pred_qX_h1, pred_qY_h1, …, pred_qX_h2, …

Actual values are drawn from a normal distribution, and predictions are generated around a potentially biased version of the actuals, with spread controlled by parameters.

Usage Context: Useful for testing or demonstrating evaluation metrics and visualization functions that operate on quantile forecast results (e.g., calculating pinball loss, coverage scores, or plotting prediction intervals against actuals).

Code Example:

from fusionlab.datasets.make import make_quantile_prediction_data

# Generate data for 10 samples, 5 horizons, 3 quantiles
quantiles = [0.1, 0.5, 0.9]
pred_data_bunch = make_quantile_prediction_data(
    n_samples=10,
    n_horizons=5,
    quantiles=quantiles,
    seed=123,
    as_frame=False  # Return Bunch
)

print("\n--- Quantile Prediction Data Bunch ---")
print("Generated DataFrame shape:", pred_data_bunch.frame.shape)
print("Available quantiles:", pred_data_bunch.quantiles)
print("Target columns:", pred_data_bunch.target_cols)
print("Prediction columns for q0.5:",
      pred_data_bunch.prediction_cols.get('q0.5', 'N/A'))
print("\nSample DataFrame:")
print(pred_data_bunch.frame.head(3))
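
The evaluation metrics mentioned under Usage Context can be checked by hand on a few values. The sketch below computes the pinball (quantile) loss and the empirical coverage of a q0.1 to q0.9 interval for one horizon. The numbers are made-up toy values, not output from make_quantile_prediction_data(), and the two functions are written for this illustration rather than taken from fusionlab.

```python
from typing import Sequence


def pinball_loss(
    actual: Sequence[float], predicted: Sequence[float], q: float
) -> float:
    """Mean pinball loss of quantile-q predictions."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(actual)


def interval_coverage(
    actual: Sequence[float], lower: Sequence[float], upper: Sequence[float]
) -> float:
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)


# Toy values standing in for one horizon of actuals and quantile predictions.
actual = [10.0, 12.0, 9.0, 11.0]
pred_q10 = [8.0, 11.0, 9.5, 9.0]
pred_q50 = [10.0, 12.5, 10.0, 11.0]
pred_q90 = [12.0, 13.0, 11.0, 12.0]

median_loss = pinball_loss(actual, pred_q50, q=0.5)
coverage = interval_coverage(actual, pred_q10, pred_q90)
print(median_loss, coverage)  # 0.1875 0.75
```

For a well-calibrated model, the coverage of a q0.1 to q0.9 interval should be close to 0.8 on a large enough sample.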

make_anomaly_data

API Reference:

make_anomaly_data()

Purpose: Generates univariate sequence data where a specified fraction of the sequences contain injected anomalies (either spikes or level shifts).

Functionality: Creates normal sequences (e.g., sine wave + noise) and injects anomalies into a subset based on anomaly_fraction and anomaly_type. Returns the sequence data and corresponding binary labels (0=normal, 1=anomaly).

Usage Context: Designed for creating simple datasets to test anomaly detection algorithms (like LSTMAutoencoderAnomaly) or anomaly-aware training strategies.

Code Example:

import numpy as np
from fusionlab.datasets.make import make_anomaly_data

# Generate 100 sequences, 20% with spike anomalies
sequences, labels = make_anomaly_data(
    n_sequences=100,
    sequence_length=50,
    n_features=1,  # Required
    anomaly_fraction=0.2,
    anomaly_type='spike',
    anomaly_magnitude=10.0,
    seed=42,
    as_frame=False  # Return numpy arrays
)

print("\n--- Anomaly Data ---")
print(f"Generated sequences shape: {sequences.shape}")
print(f"Generated labels shape: {labels.shape}")
print(f"Number of normal sequences: {np.sum(labels == 0)}")
print(f"Number of anomalous sequences: {np.sum(labels == 1)}")
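
The injection idea (sine wave plus noise, with spikes or level shifts added to a labelled subset) can be reproduced in a few lines of plain Python. This is an illustrative sketch, not the code behind make_anomaly_data(): the function name, the noise scale, and the 'level_shift' string are assumptions made for the example.

```python
import math
import random
from typing import List, Tuple


def toy_anomaly_data(
    n_sequences: int,
    sequence_length: int,
    anomaly_fraction: float,
    anomaly_type: str = "spike",
    anomaly_magnitude: float = 10.0,
    seed: int = 0,
) -> Tuple[List[List[float]], List[int]]:
    """Build noisy sine sequences and inject anomalies into a random subset."""
    rng = random.Random(seed)
    sequences = [
        [
            math.sin(2 * math.pi * t / sequence_length) + rng.gauss(0.0, 0.1)
            for t in range(sequence_length)
        ]
        for _ in range(n_sequences)
    ]
    labels = [0] * n_sequences

    n_anomalous = round(n_sequences * anomaly_fraction)
    for idx in rng.sample(range(n_sequences), n_anomalous):
        pos = rng.randrange(sequence_length)
        if anomaly_type == "spike":
            sequences[idx][pos] += anomaly_magnitude      # single outlier
        elif anomaly_type == "level_shift":
            for t in range(pos, sequence_length):
                sequences[idx][t] += anomaly_magnitude    # persistent offset
        else:
            raise ValueError(f"unknown anomaly_type: {anomaly_type!r}")
        labels[idx] = 1
    return sequences, labels


toy_sequences, toy_labels = toy_anomaly_data(10, 50, anomaly_fraction=0.2)
print(sum(toy_labels))  # 2
```

Because the labels are per sequence rather than per time step, data of this shape suits sequence-level detectors such as reconstruction-error thresholds.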

make_trend_seasonal_data

API Reference:

make_trend_seasonal_data()

Purpose: Generates a univariate time series with clearly defined and controllable polynomial trend and multiple sinusoidal seasonal components, plus noise.

Functionality: Combines a polynomial trend (order specified by trend_order and coefficients by trend_coeffs) with one or more sine waves (defined by seasonal_periods and seasonal_amplitudes) and adds Gaussian noise (noise_level).

Usage Context: Useful for testing specific aspects of time series models, such as their ability to capture linear or non-linear trends, handle multiple overlapping seasonalities, or for demonstrating time series decomposition utilities like decompose_ts().

Code Example:

from fusionlab.datasets.make import make_trend_seasonal_data
import matplotlib.pyplot as plt

# Generate data with quadratic trend, weekly + monthly seasonality
data_bunch = make_trend_seasonal_data(
    n_timesteps=90,  # 3 months daily
    freq='D',
    trend_order=2, trend_coeffs=[20, 0.1, 0.01],  # Quadratic
    seasonal_periods=[7, 30.5],  # Weekly & approx Monthly
    seasonal_amplitudes=[3, 8],
    noise_level=0.5,
    seed=99
)

print("\n--- Trend + Seasonal Data ---")
print("Generated DataFrame shape:", data_bunch.frame.shape)
print(data_bunch.frame.head())

# Simple plot
# data_bunch.frame.plot(x='date', y='value', figsize=(10,4),
#                       title="Generated Trend + Seasonal Data")
# plt.show()
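
The signal described under Functionality has the form trend(t) + sum of sine waves + noise. The sketch below writes that out explicitly. It assumes that trend_coeffs is ordered from the constant term upward (so [20, 0.1, 0.01] means 20 + 0.1 t + 0.01 t²) and that every sine wave starts at phase zero; neither detail is confirmed by the text above, and `toy_trend_seasonal` is a hypothetical name.

```python
import math
import random
from typing import List, Sequence


def toy_trend_seasonal(
    n_timesteps: int,
    trend_coeffs: Sequence[float],
    seasonal_periods: Sequence[float],
    seasonal_amplitudes: Sequence[float],
    noise_level: float = 0.0,
    seed: int = 0,
) -> List[float]:
    """Polynomial trend + sinusoidal seasonality + Gaussian noise."""
    rng = random.Random(seed)
    values = []
    for t in range(n_timesteps):
        # Assumed ordering: trend_coeffs[k] multiplies t**k.
        trend = sum(c * t ** k for k, c in enumerate(trend_coeffs))
        seasonal = sum(
            amp * math.sin(2 * math.pi * t / period)
            for period, amp in zip(seasonal_periods, seasonal_amplitudes)
        )
        values.append(trend + seasonal + rng.gauss(0.0, noise_level))
    return values


clean = toy_trend_seasonal(
    n_timesteps=90,
    trend_coeffs=[20, 0.1, 0.01],
    seasonal_periods=[7, 30.5],
    seasonal_amplitudes=[3, 8],
    noise_level=0.0,  # no noise, so the components can be checked by hand
)
print(clean[0])  # 20.0
```

Setting noise_level to zero is a convenient way to verify that a decomposition routine recovers the known trend and seasonal parts.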

make_multivariate_target_data

API Reference:

make_multivariate_target_data()

Purpose: Generates synthetic data simulating multiple time series (e.g., different items or locations) where each series has not only static, dynamic, and future features but also multiple target variables that may exhibit some interdependencies.

Functionality: Extends the logic of make_multi_feature_time_series() to generate n_targets distinct target columns (e.g., ‘target_1’, ‘target_2’, …). The generation process includes:

  • A base signal incorporating trend and seasonality.

  • Noise specific to each target.

  • An optional lagged dependency where target_N is influenced by target_{N-1} from a previous time step (cross_target_lag), controlled by cross_target_factor.

Usage Context: Useful for developing, testing, or demonstrating forecasting models that are capable of performing multivariate forecasting, i.e., predicting multiple related target variables simultaneously (e.g., predicting sales and inventory for multiple stores). The generated data mimics scenarios where target variables might influence each other over time.

Code Example:

import numpy as np
# Assuming make_multivariate_target_data is importable
from fusionlab.datasets.make import make_multivariate_target_data

# Generate data for 2 series, 50 steps, 3 target variables
multi_target_bunch = make_multivariate_target_data(
    n_series=2,
    n_timesteps=50,
    n_targets=3,
    freq='D',
    seasonality_period=7,
    cross_target_factor=0.4,  # Example dependency
    seed=123,
    as_frame=False  # Return Bunch object
)

print("\n--- Multi-Target Data Bunch ---")
print("Generated DataFrame shape:", multi_target_bunch.frame.shape)
print("Static features:", multi_target_bunch.static_features)
print("Dynamic features:", multi_target_bunch.dynamic_features)
print("Future features:", multi_target_bunch.future_features)
# Check multiple target names and target array shape
print("Target names:", multi_target_bunch.target_names)
print("Target array shape:", multi_target_bunch.target.shape)
print("\nSample Data:")
# Display target columns
print(multi_target_bunch.frame[
    ['date', 'series_id'] + multi_target_bunch.target_names
].head())
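
The lagged cross-target dependency can be made concrete with a small sketch for a single series. It assumes an additive form in which target_N receives cross_target_factor times the lagged value of target_{N-1}; the exact formula used by make_multivariate_target_data() is not given above, so treat this as one plausible reading. `toy_multi_targets` and its noise_level argument are hypothetical.

```python
import math
import random
from typing import List


def toy_multi_targets(
    n_timesteps: int,
    n_targets: int,
    cross_target_lag: int = 1,
    cross_target_factor: float = 0.4,
    noise_level: float = 0.1,
    seed: int = 0,
) -> List[List[float]]:
    """Generate targets where each one depends on the previous target's past."""
    rng = random.Random(seed)
    # Shared base signal: linear trend plus weekly seasonality.
    base = [0.05 * t + math.sin(2 * math.pi * t / 7) for t in range(n_timesteps)]

    targets: List[List[float]] = []
    for n in range(n_targets):
        series = [b + rng.gauss(0.0, noise_level) for b in base]  # per-target noise
        if n > 0:
            previous = targets[n - 1]
            for t in range(cross_target_lag, n_timesteps):
                series[t] += cross_target_factor * previous[t - cross_target_lag]
        targets.append(series)
    return targets


# With noise disabled, the dependency can be verified exactly.
t1, t2, t3 = toy_multi_targets(10, 3, noise_level=0.0)
print(abs(t2[5] - (t1[5] + 0.4 * t1[4])) < 1e-12)  # True
```

The first cross_target_lag steps of each dependent target carry no cross-target contribution, because no lagged value exists yet.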