Datasets
The fusionlab.datasets module provides access to sample datasets
relevant to the forecasting tasks addressed by the library (like land
subsidence) and includes tools for generating synthetic datasets. These
can be useful for testing models, demonstrating utilities, and running
examples.
Loading Included Datasets
These functions load pre-packaged (or downloadable) datasets, often derived from real-world studies but potentially sampled or processed for convenience. They typically handle locating the data file (checking local cache, package data, and optionally downloading).
fetch_zhongshan_data
- API Reference: fetch_zhongshan_data()
Loads the sampled Zhongshan land subsidence dataset (zhongshan_2000.csv). This file contains ~2000 data points spatially sampled from a larger dataset [Liu24], including raw features such as coordinates, year, groundwater level (GWL), rainfall, geology, seismic risk, density metrics, and the target subsidence variable.
The function can return the data as a pandas DataFrame or a scikit-learn
style Bunch object containing the data
and metadata. It also supports further sub-sampling via the n_samples
parameter.
Basic Usage (Load as Bunch):
from fusionlab.datasets import fetch_zhongshan_data

# Load the full dataset (~2000 samples) as a Bunch
zhongshan_bunch = fetch_zhongshan_data(as_frame=False)

# Access the DataFrame
print("Zhongshan DataFrame shape:", zhongshan_bunch.frame.shape)

# Access target values
print("Target shape:", zhongshan_bunch.target.shape)

# Access feature names
print("Feature names:", zhongshan_bunch.feature_names)
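For readers unfamiliar with the Bunch pattern, the following is a minimal sketch of the idea: a dictionary whose keys can also be read as attributes. The class name MiniBunch and the toy values are hypothetical and are not part of fusionlab; the real object returned by the fetch functions carries more fields.

```python
class MiniBunch(dict):
    """Dictionary whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


# Toy content; these values are made up and are not Zhongshan data.
bunch = MiniBunch(
    frame=[{"longitude": 113.0, "latitude": 22.4, "subsidence": -3.1}],
    feature_names=["longitude", "latitude"],
    target_names=["subsidence"],
    DESCR="Toy example, not real Zhongshan data.",
)

print(bunch.feature_names)     # attribute-style access
print(bunch["target_names"])   # plain dictionary access
```

Bundling the data with its metadata in one object is what lets the usage examples read names such as .frame and .feature_names from a single return value.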
fetch_nansha_data
- API Reference: fetch_nansha_data()
Loads the sampled Nansha land subsidence dataset (nansha_2000.csv). Similar to the Zhongshan dataset, this contains ~2000 spatially sampled data points with features relevant to subsidence in the Nansha area, including coordinates, year, geological information, hydrogeology, building concentration, soil thickness, risk scores, and the target subsidence.
It provides the same options as fetch_zhongshan_data for returning a DataFrame or Bunch, sub-sampling (n_samples), and controlling data loading/caching.
Basic Usage (Load Sample as DataFrame):
from fusionlab.datasets import fetch_nansha_data

# Load a random spatial sample of 500 points as a DataFrame
nansha_df_sample = fetch_nansha_data(
    n_samples=500,
    as_frame=True,
    random_state=42,  # for reproducibility
    verbose=False     # suppress messages
)

print(f"Loaded Nansha sample shape: {nansha_df_sample.shape}")
print(nansha_df_sample.head())
load_processed_subsidence_data
- API Reference: load_processed_subsidence_data()
This function provides a higher-level pipeline that loads one of the
raw datasets (zhongshan_2000.csv or nansha_2000.csv via the fetch_
functions), applies a predefined preprocessing workflow (feature
selection, NaN handling, categorical encoding, numerical scaling), and
optionally reshapes the data into sequences suitable for TFT/XTFT
models using reshape_xtft_data().
It includes options to control which preprocessing steps are applied and utilizes caching for processed DataFrames and generated sequences to speed up repeated calls.
Basic Usage (Get Processed Frame):
from fusionlab.datasets import load_processed_subsidence_data

# Load and preprocess Zhongshan data, return as DataFrame
# Applies default preprocessing: feature select, nan fill, one-hot, minmax scale
df_processed = load_processed_subsidence_data(
    dataset_name='zhongshan',
    return_sequences=False,    # Get the processed DataFrame
    as_frame=True,
    use_processed_cache=True,  # Try to load from cache first
    save_processed_frame=True  # Save if reprocessed
)
print("Processed Zhongshan DataFrame info:")
df_processed.info()
Usage for Model Training (Get Sequences):
from fusionlab.datasets import load_processed_subsidence_data

# Load Zhongshan, preprocess, and generate sequences
static, dynamic, future, target = load_processed_subsidence_data(
    dataset_name='zhongshan',
    return_sequences=True,  # Request sequence arrays
    time_steps=12,          # Example lookback
    forecast_horizon=6,     # Example horizon
    use_sequence_cache=True,
    save_sequences=True
)
print("\nGenerated sequences for model training:")
print(f"Static shape: {static.shape}")
print(f"Dynamic shape: {dynamic.shape}")
print(f"Future shape: {future.shape}")
print(f"Target shape: {target.shape}")
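To make the roles of time_steps and forecast_horizon concrete, the sketch below windows a single toy series by hand. The helper make_windows is hypothetical and is not part of fusionlab; reshape_xtft_data() additionally handles static and future features and grouping, so treat this only as an illustration of the sliding-window idea.

```python
def make_windows(series, time_steps, forecast_horizon):
    """Split one series into (lookback, horizon) pairs with a sliding window."""
    inputs, targets = [], []
    # Last index at which a full lookback + horizon window still fits.
    last_start = len(series) - time_steps - forecast_horizon
    for start in range(last_start + 1):
        split = start + time_steps
        inputs.append(series[start:split])
        targets.append(series[split:split + forecast_horizon])
    return inputs, targets


series = list(range(24))  # toy series: 0, 1, ..., 23
inputs, targets = make_windows(series, time_steps=12, forecast_horizon=6)

print(len(inputs))   # number of windows that fit in the series
print(inputs[0])     # first lookback window
print(targets[0])    # the values a model would be asked to predict
```

A series of length 24 with a lookback of 12 and a horizon of 6 yields 24 - 12 - 6 + 1 = 7 windows, which is why short series produce few training samples.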
load_subsidence_pinn_data
- API Reference: load_subsidence_pinn_data()
This function is the recommended all-in-one data preparation
pipeline for any project using the library’s PINN models,
such as PIHALNet and TransFlowSubsNet.
It is designed to handle the entire data ingestion and preprocessing workflow, which is particularly complex for physics-informed models. By abstracting away common steps like data loading, cleaning, encoding, scaling, and optional augmentation, it saves a significant amount of boilerplate code and helps prevent common errors, allowing you to focus on the modeling itself.
End-to-End Workflow
The function executes a comprehensive, multi-stage workflow, with each stage being configurable through the function’s parameters.
1. Data Sourcing & Caching
The first step is to get the data. The function employs an efficient strategy:
Caching: If use_cache=True, it first checks for a pre-processed version of the data in your local cache directory. If found, it loads this file instantly, skipping all subsequent processing steps and saving significant time on repeated runs.
Data Loading: If no cache is found, it proceeds to load the raw data according to the strategy parameter: 'load' (requires the file to exist), 'fallback' (loads the file if present, otherwise generates dummy data), or 'generate' (always creates a new dummy dataset, great for testing).
Saving Cache: After processing the data for the first time, you can set save_cache=True to save the fully processed results, including the DataFrame and any fitted scalers/encoders, for fast retrieval in the future.
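The cache-first logic described above can be sketched with the standard library alone. The function load_with_strategy, the file names, and the placeholder rows are all hypothetical; this is not the library's implementation, only an illustration of how the three strategy values and the two cache flags interact.

```python
import os
import pickle
import tempfile


def load_with_strategy(raw_path, cache_path, strategy="fallback",
                       use_cache=True, save_cache=False):
    """Hypothetical cache-first loader mirroring the three strategies."""
    # 1. A cache hit short-circuits everything else.
    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh), "cache"

    # 2. Otherwise pick the raw source according to the strategy.
    dummy_rows = [{"year": 2020, "subsidence": 0.0}]  # placeholder data
    if strategy == "generate":
        rows, source = dummy_rows, "generated"
    elif os.path.exists(raw_path):
        with open(raw_path) as fh:
            rows, source = [line.strip() for line in fh], "file"
    elif strategy == "fallback":
        rows, source = dummy_rows, "generated"
    else:  # 'load' requires the file to exist
        raise FileNotFoundError(raw_path)

    # 3. Optionally persist the result for the next call.
    if save_cache:
        with open(cache_path, "wb") as fh:
            pickle.dump(rows, fh)
    return rows, source


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "missing.csv")
    cache = os.path.join(tmp, "processed.pkl")
    _, first = load_with_strategy(raw, cache, "fallback", save_cache=True)
    _, second = load_with_strategy(raw, cache, "fallback")
    print(first, second)  # first call generates, second call hits the cache
```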
2. Automated Core Preprocessing
Once the raw data is loaded, the function performs a full preprocessing pipeline:
It ensures essential columns (coordinates, time, targets) exist and drops rows where they are missing.
It robustly converts the time column (e.g., integer years) into a proper datetime object for internal calculations.
It one-hot encodes specified categorical columns (like geology).
It creates a continuous numerical time coordinate, which is essential for computing derivatives in the PINN loss function.
It scales specified numerical features to a [0, 1] range to ensure stable model training. The fitted scaler and encoder objects are saved along with the data.
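The encoding, time-coordinate, and scaling steps can be illustrated on three toy rows using plain Python. The column names geology and GWL follow the dataset description; the values, and the separate GWL_scaled column (kept apart here only so the raw and scaled values can be compared), are assumptions made for this sketch rather than the library's behavior.

```python
rows = [
    {"year": 2015, "geology": "Clay", "GWL": 2.0},
    {"year": 2017, "geology": "Sand", "GWL": 4.0},
    {"year": 2020, "geology": "Clay", "GWL": 3.0},
]

# One-hot encode the categorical column: one 0/1 column per category.
categories = sorted({r["geology"] for r in rows})
for r in rows:
    for cat in categories:
        r[f"geology_{cat}"] = 1.0 if r["geology"] == cat else 0.0

# Continuous numeric time coordinate (here simply the year as a float).
for r in rows:
    r["year_numeric"] = float(r["year"])

# Min-max scale a numerical feature to the [0, 1] range.
lo = min(r["GWL"] for r in rows)
hi = max(r["GWL"] for r in rows)
for r in rows:
    r["GWL_scaled"] = (r["GWL"] - lo) / (hi - lo)

for r in rows:
    print(r)
```

The minimum and maximum used for scaling play the role of the fitted scaler: they must be stored so that model outputs can later be mapped back to physical units.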
3. Optional Data Augmentation
By setting augment_data=True, you can invoke the
augment_city_spatiotemporal_data()
pipeline. This can perform two types of augmentation on the data before
it is returned:
Temporal Interpolation: Fills in missing time steps in your data (e.g., missing years) for each location.
Feature Augmentation: Adds a small amount of random noise to feature columns to create a larger, more diverse training set, which can improve model robustness.
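Both augmentation types can be sketched for a single location as follows. The helpers interpolate_years and add_noise are hypothetical stand-ins, not the functions used by augment_city_spatiotemporal_data(), and linear interpolation with Gaussian noise is an assumed choice for illustration.

```python
import random


def interpolate_years(records):
    """Linearly fill missing years between observed (year, value) pairs."""
    records = sorted(records)
    filled = []
    for (y0, v0), (y1, v1) in zip(records, records[1:]):
        for year in range(y0, y1):
            frac = (year - y0) / (y1 - y0)
            filled.append((year, v0 + frac * (v1 - v0)))
    filled.append(records[-1])  # keep the final observation
    return filled


def add_noise(values, scale, seed=0):
    """Return a jittered copy of values using zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, scale) for v in values]


# One location observed in 2015, 2018 and 2019 (2016 and 2017 are missing).
observed = [(2015, 10.0), (2018, 16.0), (2019, 20.0)]
dense = interpolate_years(observed)
print(dense)

noisy = add_noise([v for _, v in dense], scale=0.1, seed=42)
print(noisy)
```

Interpolation changes the number of rows, while noise injection keeps the row count and perturbs the feature values, so the two steps affect the dataset in different ways.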
4. Flexible Output Format
The function can return its results in two convenient formats,
controlled by return_dataframe:
A single, fully processed pandas.DataFrame.
An XBunch object. This is often the preferred output, as it’s a self-contained object that bundles the processed DataFrame (.frame) with crucial metadata like feature names (.feature_names), target names (.target_names), and a human-readable description of all the processing steps that were applied (.DESCR).
Usage Example
This example demonstrates how to use the function to load, process, and augment a dataset in a single call. For reproducibility, we first create a dummy CSV file to simulate a raw data source.
import pandas as pd
import numpy as np
import os
from fusionlab.datasets.load import load_subsidence_pinn_data
from fusionlab.utils.geo_utils import generate_dummy_pinn_data

# --- 1. Create a dummy raw data file for the example ---
DUMMY_DATA_DIR = "./dummy_data"
# The function will look inside a 'data' subdirectory of data_home
raw_data_dir = os.path.join(DUMMY_DATA_DIR, "data")
os.makedirs(raw_data_dir, exist_ok=True)
dummy_data_path = os.path.join(raw_data_dir, "zhongshan_2000.csv")

dummy_dict = generate_dummy_pinn_data(n_samples=100)
dummy_dict['geology'] = np.random.choice(['Clay', 'Sand'], 100)
pd.DataFrame(dummy_dict).to_csv(dummy_data_path, index=False)
print(f"Dummy data created at: {dummy_data_path}")

# --- 2. Use the pipeline to load and process the data ---
# We will load from the file, encode 'geology', scale numericals,
# and perform augmentation, returning a rich Bunch object.
processed_bunch = load_subsidence_pinn_data(
    data_name='zhongshan',       # This configures internal column names
    strategy='load',             # Explicitly load from file
    data_home=DUMMY_DATA_DIR,    # Tell the function where to look
    encode_categoricals=True,    # Enable one-hot encoding
    scale_numericals=True,       # Enable MinMax scaling
    augment_data=True,           # Enable augmentation
    augment_mode='interpolate',  # Specify interpolation mode
    use_cache=False,             # Disable caching for this demo
    as_frame=False               # Return the rich XBunch object
)

# --- 3. Inspect the output ---
print("\n--- Processed DataFrame (from Bunch) ---")
# The XBunch contains the processed DataFrame in the 'frame' attribute
print(processed_bunch.frame.head())

print("\n--- Description of Processing (from Bunch) ---")
print(processed_bunch.DESCR)
Expected Output:
Dummy data created at: ./dummy_data/data/zhongshan_2000.csv
... (Log messages from the function will appear here) ...
--- Processed DataFrame (from Bunch) ---
year longitude latitude ... geology_Clay geology_Sand year_numeric
0 2008-01-01 113.0084 22.3616 ... 1.0 0.0 2008.0
1 2017-01-01 113.0172 22.3231 ... 1.0 0.0 2017.0
2 2010-01-01 113.0226 22.7769 ... 1.0 0.0 2010.0
3 2008-01-01 113.0289 22.4596 ... 1.0 0.0 2008.0
4 2021-01-01 113.0308 22.7131 ... 0.0 1.0 2021.0
[5 rows x 10 columns]
--- Description of Processing (from Bunch) ---
Processed Zhongshan PINN data.
Load Strategy: load.
Cache Used: No, Cache Path: N/A.
Categorical Encoding: Applied.
Numerical Scaling: minmax.
Augmentation: Applied (Mode: interpolate).
Rows: 100, Features: 7 (in 'data' array).
Targets: ['subsidence', 'GWL'].
Coordinate Precision: 4 decimal places.
Time Column (numeric): year_numeric.
...
Generating Synthetic Datasets
The fusionlab.datasets.make module provides functions to create
synthetic datasets with specific characteristics. These are useful for:
Testing model implementations (TFT, XTFT, etc.).
Demonstrating specific features or components.
Creating reproducible examples for documentation or tutorials.
Evaluating algorithms under controlled conditions (e.g., specific trend types, anomaly patterns).
make_multi_feature_time_series
- API Reference: make_multi_feature_time_series()
Purpose: Generates a multi-variate dataset across multiple independent series (e.g., items, locations), including static, dynamic (past), and known future features, along with a target variable influenced by these components plus trend, seasonality, and noise.
Functionality: Creates a DataFrame simulating data suitable for
models like TFT and
XTFT. Key generated features include:
* series_id (static)
* base_level (static, noisy per series)
* month, dayofweek (dynamic/future)
* dynamic_cov (simulated dynamic covariate)
* target_lag1 (dynamic)
* future_event (simulated binary future covariate)
* target (combination of inputs + trend + seasonality + noise)
Usage Context: Ideal for creating a complete, structured dataset
from scratch to test the end-to-end workflow of TFT/XTFT models,
including data preparation with
reshape_xtft_data().
Code Example:
from fusionlab.datasets.make import make_multi_feature_time_series

# Generate data for 3 series, 50 steps each
data_bunch = make_multi_feature_time_series(
    n_series=3,
    n_timesteps=50,
    freq='D',              # Daily frequency
    seasonality_period=7,  # Weekly seasonality
    seed=42,
    as_frame=False         # Return Bunch object
)

print("--- Multi-Feature Time Series Bunch ---")
print("Generated DataFrame shape:", data_bunch.frame.shape)
print("Static features:", data_bunch.static_features)
print("Dynamic features:", data_bunch.dynamic_features)
print("Future features:", data_bunch.future_features)
print("Target column:", data_bunch.target_col)
print("\nSample Data:")
print(data_bunch.frame.head())
make_quantile_prediction_data
- API Reference: make_quantile_prediction_data()
Purpose: Generates a dataset simulating the typical output of a multi-horizon quantile forecasting model. It includes columns for actual target values and corresponding predicted quantiles for multiple steps ahead.
Functionality: Creates a DataFrame in a “wide” format where columns represent different forecast horizons (\(h\)) and quantiles (\(q\)).
Target columns: target_h1, target_h2, …
Prediction columns: pred_qX_h1, pred_qY_h1, …, pred_qX_h2, …
Actual values are drawn from a normal distribution, and predictions are generated around a potentially biased version of the actuals, with spread controlled by parameters.
Usage Context: Useful for testing or demonstrating evaluation metrics and visualization functions that operate on quantile forecast results (e.g., calculating pinball loss, coverage scores, or plotting prediction intervals against actuals).
Code Example:
from fusionlab.datasets.make import make_quantile_prediction_data

# Generate data for 10 samples, 5 horizons, 3 quantiles
quantiles = [0.1, 0.5, 0.9]
pred_data_bunch = make_quantile_prediction_data(
    n_samples=10,
    n_horizons=5,
    quantiles=quantiles,
    seed=123,
    as_frame=False  # Return Bunch
)

print("\n--- Quantile Prediction Data Bunch ---")
print("Generated DataFrame shape:", pred_data_bunch.frame.shape)
print("Available quantiles:", pred_data_bunch.quantiles)
print("Target columns:", pred_data_bunch.target_cols)
print("Prediction columns for q0.5:",
      pred_data_bunch.prediction_cols.get('q0.5', 'N/A'))
print("\nSample DataFrame:")
print(pred_data_bunch.frame.head(3))
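As an illustration of the evaluation metrics this kind of data is meant to exercise, the sketch below computes pinball loss and interval coverage for one forecast horizon. The lists are toy stand-ins for a target column and three quantile columns at horizon 1; the helper names pinball_loss and coverage are hypothetical and are not fusionlab functions.

```python
def pinball_loss(actual, predicted, q):
    """Mean pinball (quantile) loss for a single quantile level q."""
    total = 0.0
    for y, y_hat in zip(actual, predicted):
        diff = y - y_hat
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(actual)


def coverage(actual, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actual, lower, upper))
    return hits / len(actual)


# Toy values for horizon 1: actuals and the 0.1, 0.5 and 0.9 quantile forecasts.
target_h1 = [10.0, 12.0, 9.0, 15.0]
q10_h1 = [8.0, 11.0, 9.5, 12.0]
q50_h1 = [10.0, 12.5, 10.0, 14.0]
q90_h1 = [12.0, 14.0, 11.0, 14.5]

print(pinball_loss(target_h1, q50_h1, 0.5))
print(coverage(target_h1, q10_h1, q90_h1))
```

For a well-calibrated model, the coverage of the interval between the 0.1 and 0.9 quantiles should approach 0.8 on a large sample; the four toy points here are far too few to judge calibration.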
make_anomaly_data
- API Reference: make_anomaly_data()
Purpose: Generates univariate sequence data where a specified fraction of the sequences contain injected anomalies (either spikes or level shifts).
Functionality: Creates normal sequences (e.g., sine wave + noise) and injects anomalies into a subset based on anomaly_fraction and anomaly_type. Returns the sequence data and corresponding binary labels (0=normal, 1=anomaly).
Usage Context: Designed for creating simple datasets to test anomaly
detection algorithms (like
LSTMAutoencoderAnomaly) or
anomaly-aware training strategies.
Code Example:
import numpy as np
from fusionlab.datasets.make import make_anomaly_data

# Generate 100 sequences, 20% with spike anomalies
sequences, labels = make_anomaly_data(
    n_sequences=100,
    sequence_length=50,
    n_features=1,  # Required
    anomaly_fraction=0.2,
    anomaly_type='spike',
    anomaly_magnitude=10.0,
    seed=42,
    as_frame=False  # Return numpy arrays
)

print("\n--- Anomaly Data ---")
print(f"Generated sequences shape: {sequences.shape}")
print(f"Generated labels shape: {labels.shape}")
print(f"Number of normal sequences: {np.sum(labels == 0)}")
print(f"Number of anomalous sequences: {np.sum(labels == 1)}")
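The spike-injection idea can be sketched without the library as follows. The function make_spike_sequences, the sine period of 10, and the noise level of 0.1 are assumptions chosen for illustration; make_anomaly_data() may construct its sequences differently.

```python
import math
import random


def make_spike_sequences(n_sequences, length, anomaly_fraction,
                         magnitude, seed=0):
    """Sine-plus-noise sequences; a fraction of them receive one spike."""
    rng = random.Random(seed)
    n_anomalous = int(n_sequences * anomaly_fraction)
    anomalous_ids = set(rng.sample(range(n_sequences), n_anomalous))

    sequences, labels = [], []
    for i in range(n_sequences):
        seq = [math.sin(2 * math.pi * t / 10) + rng.gauss(0.0, 0.1)
               for t in range(length)]
        if i in anomalous_ids:
            seq[rng.randrange(length)] += magnitude  # inject a single spike
        sequences.append(seq)
        labels.append(1 if i in anomalous_ids else 0)
    return sequences, labels


toy_sequences, toy_labels = make_spike_sequences(
    n_sequences=100, length=50, anomaly_fraction=0.2,
    magnitude=10.0, seed=42,
)
print(len(toy_sequences), sum(toy_labels))  # total sequences, anomalous count
```

Labels here are per sequence, matching the binary 0/1 convention described above, so a detector is scored on whether it flags the whole sequence rather than the exact spike position.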
make_trend_seasonal_data
- API Reference: make_trend_seasonal_data()
Purpose: Generates a univariate time series with clearly defined and controllable polynomial trend and multiple sinusoidal seasonal components, plus noise.
Functionality: Combines a polynomial trend (order specified by trend_order and coefficients by trend_coeffs) with one or more sine waves (defined by seasonal_periods and seasonal_amplitudes) and adds Gaussian noise (noise_level).
Usage Context: Useful for testing specific aspects of time series
models, such as their ability to capture linear or non-linear trends,
handle multiple overlapping seasonalities, or for demonstrating time
series decomposition utilities like
decompose_ts().
Code Example:
from fusionlab.datasets.make import make_trend_seasonal_data
import matplotlib.pyplot as plt

# Generate data with quadratic trend, weekly + monthly seasonality
data_bunch = make_trend_seasonal_data(
    n_timesteps=90,  # 3 months daily
    freq='D',
    trend_order=2, trend_coeffs=[20, 0.1, 0.01],  # Quadratic
    seasonal_periods=[7, 30.5],  # Weekly & approx Monthly
    seasonal_amplitudes=[3, 8],
    noise_level=0.5,
    seed=99
)

print("\n--- Trend + Seasonal Data ---")
print("Generated DataFrame shape:", data_bunch.frame.shape)
print(data_bunch.frame.head())

# Simple plot (uncomment to display)
# data_bunch.frame.plot(x='date', y='value', figsize=(10, 4),
#                       title="Generated Trend + Seasonal Data")
# plt.show()
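The additive construction (polynomial trend plus sine waves plus Gaussian noise) can be written out directly. This sketch assumes the coefficients are ordered from the constant term upward, so [20, 0.1, 0.01] means 20 + 0.1 t + 0.01 t^2; that ordering, and the helper name trend_seasonal, are assumptions and may not match how make_trend_seasonal_data() interprets trend_coeffs.

```python
import math
import random


def trend_seasonal(n_timesteps, trend_coeffs, periods, amplitudes,
                   noise_level, seed=0):
    """Additive series: polynomial trend + sine seasonality + Gaussian noise."""
    rng = random.Random(seed)
    values = []
    for t in range(n_timesteps):
        # Assumed ordering: trend_coeffs[k] multiplies t ** k.
        trend = sum(c * t ** k for k, c in enumerate(trend_coeffs))
        seasonal = sum(a * math.sin(2 * math.pi * t / p)
                       for p, a in zip(periods, amplitudes))
        values.append(trend + seasonal + rng.gauss(0.0, noise_level))
    return values


toy_values = trend_seasonal(
    n_timesteps=90,
    trend_coeffs=[20, 0.1, 0.01],
    periods=[7, 30.5],
    amplitudes=[3, 8],
    noise_level=0.5,
    seed=99,
)
print(len(toy_values), toy_values[:3])
```

Setting noise_level to 0 gives a fully deterministic series, which is convenient when checking that a decomposition routine recovers the known trend and seasonal parts.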
make_multivariate_target_data
- API Reference: make_multivariate_target_data()
Purpose: Generates synthetic data simulating multiple time series (e.g., different items or locations) where each series has not only static, dynamic, and future features but also multiple target variables that may exhibit some interdependencies.
Functionality:
Extends the logic of make_multi_feature_time_series() to
generate n_targets distinct target columns (e.g., ‘target_1’,
‘target_2’, …). The generation process includes:
A base signal incorporating trend and seasonality.
Noise specific to each target.
An optional lagged dependency where target_N is influenced by target_{N-1} from a previous time step (cross_target_lag), controlled by cross_target_factor.
Usage Context: Useful for developing, testing, or demonstrating forecasting models that are capable of performing multivariate forecasting, i.e., predicting multiple related target variables simultaneously (e.g., predicting sales and inventory for multiple stores). The generated data mimics scenarios where target variables might influence each other over time.
Code Example:
from fusionlab.datasets.make import make_multivariate_target_data

# Generate data for 2 series, 50 steps, 3 target variables
multi_target_bunch = make_multivariate_target_data(
    n_series=2,
    n_timesteps=50,
    n_targets=3,
    freq='D',
    seasonality_period=7,
    cross_target_factor=0.4,  # Example dependency
    seed=123,
    as_frame=False            # Return Bunch object
)

print("\n--- Multi-Target Data Bunch ---")
print("Generated DataFrame shape:", multi_target_bunch.frame.shape)
print("Static features:", multi_target_bunch.static_features)
print("Dynamic features:", multi_target_bunch.dynamic_features)
print("Future features:", multi_target_bunch.future_features)
# Check multiple target names and target array shape
print("Target names:", multi_target_bunch.target_names)
print("Target array shape:", multi_target_bunch.target.shape)
print("\nSample Data:")
# Display target columns
print(multi_target_bunch.frame[
    ['date', 'series_id'] + multi_target_bunch.target_names
].head())
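The lagged cross-target dependency can be illustrated in isolation. In this sketch each target shares a base signal and, from the second target onward, adds cross_target_factor times the previous target's value cross_target_lag steps earlier. The function make_targets, the trend slope, and the noise level are assumptions for illustration, not the generator used by make_multivariate_target_data().

```python
import math
import random


def make_targets(n_timesteps, n_targets, cross_target_lag=1,
                 cross_target_factor=0.4, noise=0.1, seed=0):
    """Targets sharing a base signal, each influenced by the previous target."""
    rng = random.Random(seed)
    targets = [[0.0] * n_timesteps for _ in range(n_targets)]
    for t in range(n_timesteps):
        # Base signal: linear trend plus weekly seasonality.
        base = 0.05 * t + math.sin(2 * math.pi * t / 7)
        for n in range(n_targets):
            value = base + rng.gauss(0.0, noise)  # target-specific noise
            if n > 0 and t >= cross_target_lag:
                # target n depends on target n-1 from an earlier time step.
                value += cross_target_factor * targets[n - 1][t - cross_target_lag]
            targets[n][t] = value
    return targets


toy_targets = make_targets(n_timesteps=50, n_targets=3, seed=123)
print(len(toy_targets), len(toy_targets[0]))
print([round(series[10], 3) for series in toy_targets])
```

Because the dependency runs from each target to the next, information about the first target reaches the third only after two lags, which is the kind of structure a multivariate forecaster can exploit.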