Exercise: Data Preparation Workflow for Case History Data

Welcome to this hands-on exercise! We will walk through a typical data preparation pipeline for time series forecasting, using the Zhongshan subsidence dataset available in fusionlab-learn. This exercise focuses on cleaning, feature engineering, encoding, and scaling the data to make it ready for sequence generation (e.g., using reshape_xtft_data()) and subsequent model training.

Learning Objectives:

  • Load a sample dataset using fusionlab.datasets.

  • Perform initial data cleaning and validation, including datetime conversion and NaN handling.

  • Generate relevant time series features (lags, rolling statistics, calendar features) using ts_engineering().

  • Encode categorical features for model consumption.

  • Scale numerical features to improve model training.

  • Define distinct feature sets (static, dynamic, future, target) for advanced forecasting models.

  • Save processed data and scalers for reusability.

Let’s begin!

Prerequisites

Ensure you have fusionlab-learn and its common dependencies installed. We’ll also use joblib for saving artifacts.

pip install fusionlab-learn scikit-learn joblib matplotlib

Step 1: Imports and Setup

We start by importing the necessary libraries and fusionlab utilities. An output directory is also created for any artifacts we generate, like scalers or processed data.

 1import numpy as np
 2import pandas as pd
 3from sklearn.preprocessing import StandardScaler, OneHotEncoder
 4import joblib
 5import os
 6import warnings
 7import matplotlib.pyplot as plt # For initial data viz
 8
 9# FusionLab imports
10from fusionlab.datasets import fetch_zhongshan_data
11from fusionlab.utils.ts_utils import ts_engineering, to_dt
12from fusionlab.utils.data_utils import nan_ops
13
14# Suppress warnings and TF logs for cleaner output
15warnings.filterwarnings('ignore')
16os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # Suppress C++ level TF logs
17try:
18    import tensorflow as tf
19    tf.get_logger().setLevel('ERROR')
20    if hasattr(tf, 'autograph'):
21        tf.autograph.set_verbosity(0)
22except ImportError:
23    print("TensorFlow not found, skipping TF log suppression.")
24
25# Configuration for outputs from this exercise
26exercise_output_dir_dataprep = "./zhongshan_data_prep_exercise_outputs"
27os.makedirs(exercise_output_dir_dataprep, exist_ok=True)
28
29print("Libraries imported and setup complete for data prep exercise.")

Expected Output 1.1:

Libraries imported and setup complete for data prep exercise.

Step 2: Load Zhongshan Dataset

We’ll use the fetch_zhongshan_data() function to load the sample Zhongshan subsidence dataset. This function returns a Bunch object containing the DataFrame and metadata about the columns.

 1# Fetch the data as a Bunch object
 2zhongshan_bunch = fetch_zhongshan_data(as_frame=False, verbose=0)
 3df_raw = zhongshan_bunch.frame.copy() # Work with a copy
 4
 5print(f"Loaded Zhongshan data. Shape: {df_raw.shape}")
 6print(f"Available columns: {df_raw.columns.tolist()}")
 7print("\nSample of raw data:")
 8print(df_raw.head())
 9print("\nData types:")
10print(df_raw.info())
Expected Output 2.2:

(Details will match the `zhongshan_2000.csv` structure)

Loaded Zhongshan data. Shape: (1999, 14)
Available columns: ['longitude', 'latitude', 'year', 'GWL', 'seismic_risk_score', 'rainfall_mm', 'geology', 'normalized_density', 'density_tier', 'subsidence_intensity', 'density_concentration', 'normalized_seismic_risk_score', 'rainfall_category', 'subsidence']

Sample of raw data:
    longitude   latitude  ...  rainfall_category  subsidence
0  113.240334  22.476652  ...             Medium       15.51
1  113.215866  22.510025  ...             Medium       31.60
2  113.237984  22.494591  ...             Medium        8.09
3  113.219109  22.513433  ...             Medium       15.49
4  113.210678  22.536232  ...             Medium       14.02

[5 rows x 14 columns]

Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 14 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   longitude                      1999 non-null   float64
 1   latitude                       1999 non-null   float64
 2   year                           1999 non-null   int64
 3   GWL                            1999 non-null   float64
 4   seismic_risk_score             1999 non-null   float64
 5   rainfall_mm                    1999 non-null   float64
 6   geology                        1999 non-null   object
 7   normalized_density             1999 non-null   float64
 8   density_tier                   1998 non-null   object
 9   subsidence_intensity           1998 non-null   object
 10  density_concentration          1999 non-null   object
 11  normalized_seismic_risk_score  1999 non-null   float64
 12  rainfall_category              1999 non-null   object
 13  subsidence                     1999 non-null   float64
dtypes: float64(8), int64(1), object(5)
memory usage: 218.8+ KB
None

Step 3: Initial Data Cleaning and Validation

The ‘year’ column is currently an integer. For time series operations, it’s often better to have a proper datetime index or column. We’ll convert ‘year’ to a datetime object representing the start of each year. Then, we’ll handle any missing values.

 1dt_col_exercise = 'Date' # We will create this column
 2df_clean = df_raw.copy()
 3
 4# Convert 'year' to a datetime column (start of year)
 5df_clean[dt_col_exercise] = pd.to_datetime(
 6    df_clean['year'], format='%Y'
 7    )
 8print(f"\nConverted 'year' to datetime column '{dt_col_exercise}'.")
 9
10# Handle missing values using nan_ops for robust ffill/bfill
11print(f"NaNs before cleaning: "
12      f"{df_clean.isna().any().sum()} columns have NaNs.")
13df_clean = nan_ops(df_clean, ops='sanitize', action='fill', # Uses ffill then bfill
14                   verbose=0)
15print(f"NaNs after cleaning: "
16      f"{df_clean.isna().any().sum()} columns have NaNs.")
17print("Cleaned data sample (with new 'Date' column):")
18print(df_clean[['Date', 'year', 'subsidence', 'GWL']].head())

Expected Output 3.3:

Converted 'year' to datetime column 'Date'.
NaNs before cleaning: 2 columns have NaNs.
NaNs after cleaning: 0 columns have NaNs.
Cleaned data sample (with new 'Date' column):
        Date    year  subsidence       GWL
0 2015-01-01  2015.0       15.51  2.865853
1 2023-01-01  2023.0       31.60  1.924022
2 2018-01-01  2018.0        8.09  0.752556
3 2019-01-01  2019.0       15.49  1.043998
4 2015-01-01  2015.0       14.02  1.700558

Step 4: Feature Engineering

Generate new time-based features (lags, rolling statistics, calendar features) from the ‘subsidence’ column using ts_engineering(). This function requires data to be sorted by time for each group if grouping is implicit. Since our data has multiple spatial points per year, we should ideally group by spatial identifiers before applying ts_engineering if lags/rolling stats are per location. For this exercise, we’ll apply it globally first, then drop NaNs. A more advanced workflow would group by [‘longitude’, ‘latitude’] before this step.

 1target_col_exercise = 'subsidence' # Target and base for engineering
 2
 3# For ts_engineering, ensure data is sorted if applying globally
 4# or group by spatial identifiers first.
 5# Here, we sort by ItemID (if exists) then Date.
 6# Zhongshan data has 'longitude', 'latitude' as identifiers.
 7# For simplicity, we'll sort by Date and assume global features.
 8# A production workflow would group by 'longitude', 'latitude'.
 9df_for_eng = df_clean.sort_values(by=[dt_col_exercise]).copy()
10
11df_featured = ts_engineering(
12    df=df_for_eng,
13    value_col=target_col_exercise,
14    dt_col=dt_col_exercise, # Use the new 'Date' column
15    lags=2,             # Create subsidence_lag_1, _lag_2
16    window=3,               # Rolling mean/std over 3 periods (years)
17    window_type='triang',   # Example window type
18    seasonal_period=0,      # No explicit seasonal decomp here
19    diff_order=0,
20    apply_fourier=False,
21    time_features=['year', 'month', 'quarter', 'day_of_week'],
22    scaler=None             # Scale later
23)
24print(f"\nShape after feature engineering (before dropna): "
25      f"{df_featured.shape}")
26
27# Drop rows with NaNs introduced by lags/rolling features
28df_featured.dropna(inplace=True)
29df_featured.reset_index (inplace =True)
30print(f"Shape after dropna: {df_featured.shape}")
31print("Sample of engineered features (new columns):")
32print(df_featured[['Date', target_col_exercise, 'lag_1',
33                   'rolling_mean_3', 'month', 'quarter']].head())
Expected Output 4.4:

(Shapes and new columns will reflect `ts_engineering` output)

Shape after feature engineering (before dropna): (1997, 25)
Shape after dropna: (1997, 26)
Sample of engineered features (new columns):
        Date  subsidence  lag_1  rolling_mean_3  month  quarter
0 2015-01-01        5.13  13.71       11.450000      1        1
1 2015-01-01       22.17   5.13       13.670000      1        1
2 2015-01-01       22.65  22.17       16.650000      1        1
3 2015-01-01       10.98  22.65       18.600000      1        1
4 2015-01-01        8.98  10.98       14.203333      1        1

Step 5: Categorical Feature Encoding

Models require numerical inputs. Categorical features like ‘geology’ and ‘density_tier’ from the Zhongshan dataset need to be encoded. We’ll use One-Hot Encoding.

 1df_to_encode = df_featured.copy()
 2categorical_cols_zhongshan = ['geology', 'density_tier']
 3encoded_feature_names = [] # To store names of new one-hot columns
 4
 5for col in categorical_cols_zhongshan:
 6    if col in df_to_encode.columns:
 7        encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore',
 8                                dtype=np.float32)
 9        encoded_data = encoder.fit_transform(df_to_encode[[col]])
10        new_cols = [f"{col}_{cat.replace(' ', '_')}" for cat in encoder.categories_[0]]
11        encoded_df_part = pd.DataFrame(
12            encoded_data, columns=new_cols, index=df_to_encode.index
13            )
14        df_to_encode = pd.concat([df_to_encode, encoded_df_part], axis=1)
15        df_to_encode.drop(columns=[col], inplace=True)
16        encoded_feature_names.extend(new_cols)
17        print(f"  Encoded '{col}' into: {new_cols}")
18    else:
19        print(f"  Warning: Categorical column '{col}' not found for encoding.")
20
21df_encoded = df_to_encode
22print(f"\nShape after one-hot encoding: {df_encoded.shape}")
23print(f"Added one-hot encoded columns: {encoded_feature_names}")

Expected Output 5.5:

  Encoded 'geology' into: ['geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand']
  Encoded 'density_tier' into: ['density_tier_High', 'density_tier_Low', 'density_tier_Medium']

Shape after one-hot encoding: (1997, 32)
Added one-hot encoded columns: ['geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand', 'density_tier_High', 'density_tier_Low', 'density_tier_Medium']

Step 6: Define Final Feature Sets and Scale Numerical Features

Now that all features are generated and encoded, we define our final sets of static, dynamic, and future features. Then, we scale all numerical features (including the target) that will be fed into the model.

 1df_for_scaling = df_encoded.copy()
 2
 3# Define feature sets based on available columns
 4# Static: longitude, latitude, and one-hot encoded categoricals
 5static_feature_names_ex = ['longitude', 'latitude'] + encoded_feature_names
 6static_feature_names_ex = [
 7    c for c in static_feature_names_ex if c in df_for_scaling.columns
 8    ]
 9
10# Dynamic: Original numericals + engineered numericals + calendar features
11dynamic_feature_names_ex = [
12    'GWL', 'rainfall_mm', 'normalized_density',
13    'normalized_seismic_risk_score',
14    'subsidence_lag_1', 'subsidence_lag_2', # Check if lag_2 was created
15    'rolling_mean_3', 'rolling_std_3',      # Check if _std_3 was created
16    'month', 'quarter', 'day_of_week'     # Calendar features
17]
18# Filter to existing columns
19dynamic_feature_names_ex = [
20    c for c in dynamic_feature_names_ex if c in df_for_scaling.columns
21    ]
22# Ensure target is not in dynamic features if it's handled separately
23if target_col_exercise in dynamic_feature_names_ex:
24    dynamic_feature_names_ex.remove(target_col_exercise)
25
26# Future: For this example, assume some calendar features are "known future"
27# In a real scenario, these would be genuinely known ahead of time.
28future_feature_names_ex = ['month', 'quarter', 'day_of_week']
29future_feature_names_ex = [
30    c for c in future_feature_names_ex if c in df_for_scaling.columns
31    ]
32
33# Columns to be scaled: all numerical features including target
34# Exclude already one-hot encoded and simple calendar integers if
35# they are to be embedded or treated as categorical by the model.
36# For this exercise, we scale most numericals.
37numerical_cols_to_scale_ex = [
38    'longitude', 'latitude', 'GWL', 'rainfall_mm',
39    'normalized_density', 'normalized_seismic_risk_score',
40    target_col_exercise # Include target
41]
42# Add engineered numerical features if they exist
43engineered_to_scale = [
44    'subsidence_lag_1', 'subsidence_lag_2',
45    'rolling_mean_3', 'rolling_std_3'
46    ]
47numerical_cols_to_scale_ex.extend(
48    [c for c in engineered_to_scale if c in df_for_scaling.columns]
49    )
50# Ensure unique and existing columns
51numerical_cols_to_scale_ex = list(set(
52    c for c in numerical_cols_to_scale_ex if c in df_for_scaling.columns
53    ))
54
55print(f"\nStatic features for model: {static_feature_names_ex}")
56print(f"Dynamic features for model: {dynamic_feature_names_ex}")
57print(f"Future features for model: {future_feature_names_ex}")
58print(f"Numerical columns to be scaled: {numerical_cols_to_scale_ex}")
59
60df_final_scaled = df_for_scaling.copy()
61if numerical_cols_to_scale_ex:
62    scaler_final = StandardScaler()
63    df_final_scaled[numerical_cols_to_scale_ex] = \
64        scaler_final.fit_transform(
65            df_final_scaled[numerical_cols_to_scale_ex]
66            )
67    scaler_path_final = os.path.join(
68        exercise_output_dir_dataprep, "zhongshan_final_scaler.joblib"
69        )
70    joblib.dump(scaler_final, scaler_path_final)
71    print(f"\nFinal numerical features scaled. Scaler saved to {scaler_path_final}")
72else:
73    print("\nNo numerical columns identified for final scaling.")
74
75print("\nSample of fully processed data (first 5 rows, selected columns):")
76cols_to_show = static_feature_names_ex[:2] + \
77               dynamic_feature_names_ex[:2] + \
78               future_feature_names_ex[:1] + [target_col_exercise]
79cols_to_show = [c for c in cols_to_show if c in df_final_scaled.columns]
80print(df_final_scaled[cols_to_show].head())
Expected Output 6.6:

(Column lists and sample data will reflect the processing)

Static features for model: ['longitude', 'latitude', 'geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand', 'density_tier_High', 'density_tier_Low', 'density_tier_Medium']
Dynamic features for model: ['GWL', 'rainfall_mm', 'normalized_density', 'normalized_seismic_risk_score', 'lag_1', 'lag_2', 'rolling_mean_3', 'rolling_std_3', 'month', 'quarter', 'day_of_week']
Future features for model: ['month', 'quarter', 'day_of_week']
Numerical columns to be scaled: ['GWL', 'latitude', 'normalized_seismic_risk_score', 'longitude', 'rolling_std_3', 'rainfall_mm', 'rolling_mean_3', 'normalized_density', 'lag_1', 'subsidence', 'lag_2']

Final numerical features scaled. Scaler saved to ./zhongshan_data_prep_exercise_outputs\zhongshan_final_scaler.joblib

Sample of fully processed data (first 5 rows, selected columns):
   longitude  latitude       GWL  rainfall_mm  month  subsidence
0   0.379708 -1.794134 -0.556755    -1.135748      1   -0.804539
1   0.750644  0.812747  0.390138    -0.973036      1    0.382239
2   0.814110  0.736287  0.240789    -0.929659      1    0.415669
3  -0.583759 -0.571861  0.551307    -1.252451      1   -0.397107
4   0.225787 -2.117750 -0.392659    -1.088000      1   -0.536400

Step 7: Save Processed DataFrame

It’s good practice to save the fully processed DataFrame. This allows you to load it directly for sequence reshaping and model training in future sessions without repeating all preprocessing steps.

1processed_df_path = os.path.join(
2    exercise_output_dir_dataprep, "zhongshan_fully_processed_data.csv"
3    )
4df_final_scaled.to_csv(processed_df_path, index=False)
5print(f"\nFully processed DataFrame saved to: {processed_df_path}")
6print(f"Final DataFrame shape: {df_final_scaled.shape}")

Expected Output 7.7:

Fully processed DataFrame saved to: ./zhongshan_data_prep_exercise_outputs\zhongshan_fully_processed_data.csv
Final DataFrame shape: (1997, 32)

Discussion of Exercise

In this exercise, we performed a comprehensive data preparation workflow for the Zhongshan dataset:

  1. Loaded the raw data.

  2. Cleaned it by ensuring correct datetime formatting and handling missing values.

  3. Engineered new features like lags, rolling statistics, and calendar attributes using ts_engineering.

  4. Encoded categorical features (geology, density_tier) into a numerical format (one-hot encoding) suitable for machine learning models.

  5. Scaled numerical features (including the target variable) using StandardScaler to normalize their ranges.

  6. Defined distinct sets of features (static, dynamic, future) based on the processed data, ready for input into advanced forecasting models like TFT or XTFT (via reshape_xtft_data).

This prepared DataFrame (df_final_scaled) is now in a state where it can be passed to reshape_xtft_data() to create sequences for training models. The saved scaler (zhongshan_final_scaler.joblib) is essential for inverse-transforming predictions back to their original scale.

This workflow demonstrates key steps in transforming raw time series data from multiple sources/locations into a structured format suitable for sophisticated deep learning forecasting models.