Exercise: Data Preparation Workflow for Case History Data¶
Welcome to this hands-on exercise! We will walk through a typical
data preparation pipeline for time series forecasting, using the
Zhongshan subsidence dataset available in fusionlab-learn.
This exercise focuses on cleaning, feature engineering, encoding,
and scaling the data to make it ready for sequence generation
(e.g., using reshape_xtft_data()) and
subsequent model training.
Learning Objectives:
Load a sample dataset using
fusionlab.datasets.Perform initial data cleaning and validation, including datetime conversion and NaN handling.
Generate relevant time series features (lags, rolling statistics, calendar features) using
ts_engineering().Encode categorical features for model consumption.
Scale numerical features to improve model training.
Define distinct feature sets (static, dynamic, future, target) for advanced forecasting models.
Save processed data and scalers for reusability.
Let’s begin!
Prerequisites¶
Ensure you have fusionlab-learn and its common dependencies
installed. We’ll also use joblib for saving artifacts.
pip install fusionlab-learn scikit-learn joblib matplotlib
Step 1: Imports and Setup¶
We start by importing the necessary libraries and fusionlab
utilities. An output directory is also created for any artifacts
we generate, like scalers or processed data.
1import numpy as np
2import pandas as pd
3from sklearn.preprocessing import StandardScaler, OneHotEncoder
4import joblib
5import os
6import warnings
7import matplotlib.pyplot as plt # For initial data viz
8
9# FusionLab imports
10from fusionlab.datasets import fetch_zhongshan_data
11from fusionlab.utils.ts_utils import ts_engineering, to_dt
12from fusionlab.utils.data_utils import nan_ops
13
14# Suppress warnings and TF logs for cleaner output
15warnings.filterwarnings('ignore')
16os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2' # Suppress C++ level TF logs
17try:
18 import tensorflow as tf
19 tf.get_logger().setLevel('ERROR')
20 if hasattr(tf, 'autograph'):
21 tf.autograph.set_verbosity(0)
22except ImportError:
23 print("TensorFlow not found, skipping TF log suppression.")
24
25# Configuration for outputs from this exercise
26exercise_output_dir_dataprep = "./zhongshan_data_prep_exercise_outputs"
27os.makedirs(exercise_output_dir_dataprep, exist_ok=True)
28
29print("Libraries imported and setup complete for data prep exercise.")
Expected Output 1.1:
Libraries imported and setup complete for data prep exercise.
Step 2: Load Zhongshan Dataset¶
We’ll use the fetch_zhongshan_data()
function to load the sample Zhongshan subsidence dataset. This
function returns a Bunch object containing the DataFrame and
metadata about the columns.
1# Fetch the data as a Bunch object
2zhongshan_bunch = fetch_zhongshan_data(as_frame=False, verbose=0)
3df_raw = zhongshan_bunch.frame.copy() # Work with a copy
4
5print(f"Loaded Zhongshan data. Shape: {df_raw.shape}")
6print(f"Available columns: {df_raw.columns.tolist()}")
7print("\nSample of raw data:")
8print(df_raw.head())
9print("\nData types:")
10print(df_raw.info())
- Expected Output 2.2:
(Details will match the `zhongshan_2000.csv` structure)
Loaded Zhongshan data. Shape: (1999, 14)
Available columns: ['longitude', 'latitude', 'year', 'GWL', 'seismic_risk_score', 'rainfall_mm', 'geology', 'normalized_density', 'density_tier', 'subsidence_intensity', 'density_concentration', 'normalized_seismic_risk_score', 'rainfall_category', 'subsidence']
Sample of raw data:
longitude latitude ... rainfall_category subsidence
0 113.240334 22.476652 ... Medium 15.51
1 113.215866 22.510025 ... Medium 31.60
2 113.237984 22.494591 ... Medium 8.09
3 113.219109 22.513433 ... Medium 15.49
4 113.210678 22.536232 ... Medium 14.02
[5 rows x 14 columns]
Data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1999 entries, 0 to 1998
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 1999 non-null float64
1 latitude 1999 non-null float64
2 year 1999 non-null int64
3 GWL 1999 non-null float64
4 seismic_risk_score 1999 non-null float64
5 rainfall_mm 1999 non-null float64
6 geology 1999 non-null object
7 normalized_density 1999 non-null float64
8 density_tier 1998 non-null object
9 subsidence_intensity 1998 non-null object
10 density_concentration 1999 non-null object
11 normalized_seismic_risk_score 1999 non-null float64
12 rainfall_category 1999 non-null object
13 subsidence 1999 non-null float64
dtypes: float64(8), int64(1), object(5)
memory usage: 218.8+ KB
None
Step 3: Initial Data Cleaning and Validation¶
The ‘year’ column is currently an integer. For time series operations, it’s often better to have a proper datetime index or column. We’ll convert ‘year’ to a datetime object representing the start of each year. Then, we’ll handle any missing values.
1dt_col_exercise = 'Date' # We will create this column
2df_clean = df_raw.copy()
3
4# Convert 'year' to a datetime column (start of year)
5df_clean[dt_col_exercise] = pd.to_datetime(
6 df_clean['year'], format='%Y'
7 )
8print(f"\nConverted 'year' to datetime column '{dt_col_exercise}'.")
9
10# Handle missing values using nan_ops for robust ffill/bfill
11print(f"NaNs before cleaning: "
12 f"{df_clean.isna().any().sum()} columns have NaNs.")
13df_clean = nan_ops(df_clean, ops='sanitize', action='fill', # Uses ffill then bfill
14 verbose=0)
15print(f"NaNs after cleaning: "
16 f"{df_clean.isna().any().sum()} columns have NaNs.")
17print("Cleaned data sample (with new 'Date' column):")
18print(df_clean[['Date', 'year', 'subsidence', 'GWL']].head())
Expected Output 3.3:
Converted 'year' to datetime column 'Date'.
NaNs before cleaning: 2 columns have NaNs.
NaNs after cleaning: 0 columns have NaNs.
Cleaned data sample (with new 'Date' column):
Date year subsidence GWL
0 2015-01-01 2015.0 15.51 2.865853
1 2023-01-01 2023.0 31.60 1.924022
2 2018-01-01 2018.0 8.09 0.752556
3 2019-01-01 2019.0 15.49 1.043998
4 2015-01-01 2015.0 14.02 1.700558
Step 4: Feature Engineering¶
Generate new time-based features (lags, rolling statistics, calendar
features) from the ‘subsidence’ column using
ts_engineering().
This function requires data to be sorted by time for each group if
grouping is implicit. Since our data has multiple spatial points
per year, we should ideally group by spatial identifiers before
applying ts_engineering if lags/rolling stats are per location.
For this exercise, we’ll apply it globally first, then drop NaNs.
A more advanced workflow would group by [‘longitude’, ‘latitude’]
before this step.
1target_col_exercise = 'subsidence' # Target and base for engineering
2
3# For ts_engineering, ensure data is sorted if applying globally
4# or group by spatial identifiers first.
5# Here, we sort by ItemID (if exists) then Date.
6# Zhongshan data has 'longitude', 'latitude' as identifiers.
7# For simplicity, we'll sort by Date and assume global features.
8# A production workflow would group by 'longitude', 'latitude'.
9df_for_eng = df_clean.sort_values(by=[dt_col_exercise]).copy()
10
11df_featured = ts_engineering(
12 df=df_for_eng,
13 value_col=target_col_exercise,
14 dt_col=dt_col_exercise, # Use the new 'Date' column
15 lags=2, # Create subsidence_lag_1, _lag_2
16 window=3, # Rolling mean/std over 3 periods (years)
17 window_type='triang', # Example window type
18 seasonal_period=0, # No explicit seasonal decomp here
19 diff_order=0,
20 apply_fourier=False,
21 time_features=['year', 'month', 'quarter', 'day_of_week'],
22 scaler=None # Scale later
23)
24print(f"\nShape after feature engineering (before dropna): "
25 f"{df_featured.shape}")
26
27# Drop rows with NaNs introduced by lags/rolling features
28df_featured.dropna(inplace=True)
29df_featured.reset_index (inplace =True)
30print(f"Shape after dropna: {df_featured.shape}")
31print("Sample of engineered features (new columns):")
32print(df_featured[['Date', target_col_exercise, 'lag_1',
33 'rolling_mean_3', 'month', 'quarter']].head())
- Expected Output 4.4:
(Shapes and new columns will reflect `ts_engineering` output)
Shape after feature engineering (before dropna): (1997, 25)
Shape after dropna: (1997, 26)
Sample of engineered features (new columns):
Date subsidence lag_1 rolling_mean_3 month quarter
0 2015-01-01 5.13 13.71 11.450000 1 1
1 2015-01-01 22.17 5.13 13.670000 1 1
2 2015-01-01 22.65 22.17 16.650000 1 1
3 2015-01-01 10.98 22.65 18.600000 1 1
4 2015-01-01 8.98 10.98 14.203333 1 1
Step 5: Categorical Feature Encoding¶
Models require numerical inputs. Categorical features like ‘geology’ and ‘density_tier’ from the Zhongshan dataset need to be encoded. We’ll use One-Hot Encoding.
1df_to_encode = df_featured.copy()
2categorical_cols_zhongshan = ['geology', 'density_tier']
3encoded_feature_names = [] # To store names of new one-hot columns
4
5for col in categorical_cols_zhongshan:
6 if col in df_to_encode.columns:
7 encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore',
8 dtype=np.float32)
9 encoded_data = encoder.fit_transform(df_to_encode[[col]])
10 new_cols = [f"{col}_{cat.replace(' ', '_')}" for cat in encoder.categories_[0]]
11 encoded_df_part = pd.DataFrame(
12 encoded_data, columns=new_cols, index=df_to_encode.index
13 )
14 df_to_encode = pd.concat([df_to_encode, encoded_df_part], axis=1)
15 df_to_encode.drop(columns=[col], inplace=True)
16 encoded_feature_names.extend(new_cols)
17 print(f" Encoded '{col}' into: {new_cols}")
18 else:
19 print(f" Warning: Categorical column '{col}' not found for encoding.")
20
21df_encoded = df_to_encode
22print(f"\nShape after one-hot encoding: {df_encoded.shape}")
23print(f"Added one-hot encoded columns: {encoded_feature_names}")
Expected Output 5.5:
Encoded 'geology' into: ['geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand']
Encoded 'density_tier' into: ['density_tier_High', 'density_tier_Low', 'density_tier_Medium']
Shape after one-hot encoding: (1997, 32)
Added one-hot encoded columns: ['geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand', 'density_tier_High', 'density_tier_Low', 'density_tier_Medium']
Step 6: Define Final Feature Sets and Scale Numerical Features¶
Now that all features are generated and encoded, we define our final sets of static, dynamic, and future features. Then, we scale all numerical features (including the target) that will be fed into the model.
1df_for_scaling = df_encoded.copy()
2
3# Define feature sets based on available columns
4# Static: longitude, latitude, and one-hot encoded categoricals
5static_feature_names_ex = ['longitude', 'latitude'] + encoded_feature_names
6static_feature_names_ex = [
7 c for c in static_feature_names_ex if c in df_for_scaling.columns
8 ]
9
10# Dynamic: Original numericals + engineered numericals + calendar features
11dynamic_feature_names_ex = [
12 'GWL', 'rainfall_mm', 'normalized_density',
13 'normalized_seismic_risk_score',
14 'subsidence_lag_1', 'subsidence_lag_2', # Check if lag_2 was created
15 'rolling_mean_3', 'rolling_std_3', # Check if _std_3 was created
16 'month', 'quarter', 'day_of_week' # Calendar features
17]
18# Filter to existing columns
19dynamic_feature_names_ex = [
20 c for c in dynamic_feature_names_ex if c in df_for_scaling.columns
21 ]
22# Ensure target is not in dynamic features if it's handled separately
23if target_col_exercise in dynamic_feature_names_ex:
24 dynamic_feature_names_ex.remove(target_col_exercise)
25
26# Future: For this example, assume some calendar features are "known future"
27# In a real scenario, these would be genuinely known ahead of time.
28future_feature_names_ex = ['month', 'quarter', 'day_of_week']
29future_feature_names_ex = [
30 c for c in future_feature_names_ex if c in df_for_scaling.columns
31 ]
32
33# Columns to be scaled: all numerical features including target
34# Exclude already one-hot encoded and simple calendar integers if
35# they are to be embedded or treated as categorical by the model.
36# For this exercise, we scale most numericals.
37numerical_cols_to_scale_ex = [
38 'longitude', 'latitude', 'GWL', 'rainfall_mm',
39 'normalized_density', 'normalized_seismic_risk_score',
40 target_col_exercise # Include target
41]
42# Add engineered numerical features if they exist
43engineered_to_scale = [
44 'subsidence_lag_1', 'subsidence_lag_2',
45 'rolling_mean_3', 'rolling_std_3'
46 ]
47numerical_cols_to_scale_ex.extend(
48 [c for c in engineered_to_scale if c in df_for_scaling.columns]
49 )
50# Ensure unique and existing columns
51numerical_cols_to_scale_ex = list(set(
52 c for c in numerical_cols_to_scale_ex if c in df_for_scaling.columns
53 ))
54
55print(f"\nStatic features for model: {static_feature_names_ex}")
56print(f"Dynamic features for model: {dynamic_feature_names_ex}")
57print(f"Future features for model: {future_feature_names_ex}")
58print(f"Numerical columns to be scaled: {numerical_cols_to_scale_ex}")
59
60df_final_scaled = df_for_scaling.copy()
61if numerical_cols_to_scale_ex:
62 scaler_final = StandardScaler()
63 df_final_scaled[numerical_cols_to_scale_ex] = \
64 scaler_final.fit_transform(
65 df_final_scaled[numerical_cols_to_scale_ex]
66 )
67 scaler_path_final = os.path.join(
68 exercise_output_dir_dataprep, "zhongshan_final_scaler.joblib"
69 )
70 joblib.dump(scaler_final, scaler_path_final)
71 print(f"\nFinal numerical features scaled. Scaler saved to {scaler_path_final}")
72else:
73 print("\nNo numerical columns identified for final scaling.")
74
75print("\nSample of fully processed data (first 5 rows, selected columns):")
76cols_to_show = static_feature_names_ex[:2] + \
77 dynamic_feature_names_ex[:2] + \
78 future_feature_names_ex[:1] + [target_col_exercise]
79cols_to_show = [c for c in cols_to_show if c in df_final_scaled.columns]
80print(df_final_scaled[cols_to_show].head())
- Expected Output 6.6:
(Column lists and sample data will reflect the processing)
Static features for model: ['longitude', 'latitude', 'geology_Cohesive_Soil', 'geology_Gravelly_Soil', 'geology_Residual_Soil', 'geology_Rock', 'geology_Sand', 'density_tier_High', 'density_tier_Low', 'density_tier_Medium']
Dynamic features for model: ['GWL', 'rainfall_mm', 'normalized_density', 'normalized_seismic_risk_score', 'lag_1', 'lag_2', 'rolling_mean_3', 'rolling_std_3', 'month', 'quarter', 'day_of_week']
Future features for model: ['month', 'quarter', 'day_of_week']
Numerical columns to be scaled: ['GWL', 'latitude', 'normalized_seismic_risk_score', 'longitude', 'rolling_std_3', 'rainfall_mm', 'rolling_mean_3', 'normalized_density', 'lag_1', 'subsidence', 'lag_2']
Final numerical features scaled. Scaler saved to ./zhongshan_data_prep_exercise_outputs\zhongshan_final_scaler.joblib
Sample of fully processed data (first 5 rows, selected columns):
longitude latitude GWL rainfall_mm month subsidence
0 0.379708 -1.794134 -0.556755 -1.135748 1 -0.804539
1 0.750644 0.812747 0.390138 -0.973036 1 0.382239
2 0.814110 0.736287 0.240789 -0.929659 1 0.415669
3 -0.583759 -0.571861 0.551307 -1.252451 1 -0.397107
4 0.225787 -2.117750 -0.392659 -1.088000 1 -0.536400
Step 7: Save Processed DataFrame¶
It’s good practice to save the fully processed DataFrame. This allows you to load it directly for sequence reshaping and model training in future sessions without repeating all preprocessing steps.
1processed_df_path = os.path.join(
2 exercise_output_dir_dataprep, "zhongshan_fully_processed_data.csv"
3 )
4df_final_scaled.to_csv(processed_df_path, index=False)
5print(f"\nFully processed DataFrame saved to: {processed_df_path}")
6print(f"Final DataFrame shape: {df_final_scaled.shape}")
Expected Output 7.7:
Fully processed DataFrame saved to: ./zhongshan_data_prep_exercise_outputs\zhongshan_fully_processed_data.csv
Final DataFrame shape: (1997, 32)
Discussion of Exercise¶
In this exercise, we performed a comprehensive data preparation workflow for the Zhongshan dataset:
Loaded the raw data.
Cleaned it by ensuring correct datetime formatting and handling missing values.
Engineered new features like lags, rolling statistics, and calendar attributes using ts_engineering.
Encoded categorical features (geology, density_tier) into a numerical format (one-hot encoding) suitable for machine learning models.
Scaled numerical features (including the target variable) using StandardScaler to normalize their ranges.
Defined distinct sets of features (static, dynamic, future) based on the processed data, ready for input into advanced forecasting models like TFT or XTFT (via reshape_xtft_data).
This prepared DataFrame (df_final_scaled) is now in a state where it
can be passed to reshape_xtft_data() to
create sequences for training models. The saved scaler
(zhongshan_final_scaler.joblib) is essential for inverse-transforming
predictions back to their original scale.
This workflow demonstrates key steps in transforming raw time series data from multiple sources/locations into a structured format suitable for sophisticated deep learning forecasting models.