Data Preparation Workflow

Preparing time series data correctly is crucial for training effective forecasting models like TFT and XTFT. This example demonstrates a typical workflow using utilities from fusionlab to transform raw time series data into the structured sequence format required by these models.

We will cover the following steps:

1. Imports and configuration setup.
2. Loading raw data.
3. Basic cleaning and datetime validation.
4. Generating time series features (lags, rolling stats, etc.).
5. (Optional) Feature selection.
6. Defining feature sets and scaling numerical features.
7. Reshaping the data into sequences using reshape_xtft_data.
8. Splitting the sequences into training, validation, and test sets.
9. (Optional) Saving the processed data.

Step 1: Imports and Configuration

First, we import the necessary libraries: pandas for data handling, numpy for numerical operations, StandardScaler from scikit-learn for feature scaling, joblib for saving artifacts, and the relevant utilities from fusionlab. We also set up an output directory and suppress warnings.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# from sklearn.model_selection import train_test_split # Not used directly here
import joblib # For saving scalers
import os

# from fusionlab.core.io import read_data # Example loader (or use pd.read_csv)
# from fusionlab.utils.io import save_job # For robust saver
from fusionlab.nn.utils import reshape_xtft_data
from fusionlab.utils.ts_utils import (
    to_dt,
    ts_engineering,
    # select_and_reduce_features # Import if using this optional step
)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# --- Configuration ---
output_dir = "./data_prep_output" # Directory to save artifacts
os.makedirs(output_dir, exist_ok=True)
print(f"Artifacts will be saved to: {output_dir}")

Step 2: Load Raw Data

Load your initial dataset, typically from a CSV file or database. For this example, we generate synthetic multi-item data similar to previous examples. Replace this block with your actual data loading code.

# Replace with: df_raw = pd.read_csv("path/to/your/raw_data.csv")
n_items = 3
n_timesteps = 48 # 4 years of monthly data
date_rng = pd.date_range(start='2018-01-01', periods=n_timesteps, freq='MS')
df_list = []
for item_id in range(n_items):
    time = np.arange(n_timesteps)
    sales = (
        100 + item_id * 50 + time * (2 + item_id) +
        30 * np.sin(2 * np.pi * time / 12) +
        np.random.normal(0, 15, n_timesteps)
    )
    temp = (
        15 + 10 * np.sin(2 * np.pi * (time % 12) / 12 + np.pi) +
        np.random.normal(0, 2, n_timesteps)  # independent noise per time step
    )
    promo = np.random.randint(0, 2, n_timesteps)

    item_df = pd.DataFrame({
        'Date': date_rng, 'ItemID': item_id,
        'Temperature': temp, 'PlannedPromotion': promo, 'Sales': sales
    })
    df_list.append(item_df)
df_raw = pd.concat(df_list).reset_index(drop=True)
print(f"Loaded raw data shape: {df_raw.shape}")
print("Raw data sample:")
print(df_raw.head(3))
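
Before cleaning, it can help to confirm that the raw table has the shape the later steps assume: no duplicate (ItemID, Date) pairs, the same number of rows per item, and dates in order within each item. The following is a minimal sketch using only pandas; the `demo` frame is a hypothetical stand-in for df_raw, not part of the workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_raw: 2 items, 6 monthly rows each
dates = pd.date_range("2018-01-01", periods=6, freq="MS")
demo = pd.concat(
    [pd.DataFrame({"Date": dates, "ItemID": i, "Sales": np.arange(6.0)})
     for i in range(2)],
    ignore_index=True,
)

# No (ItemID, Date) pair should appear twice
n_dupes = int(demo.duplicated(subset=["ItemID", "Date"]).sum())

# Every item should cover the same number of time steps
rows_per_item = demo.groupby("ItemID")["Date"].count()

# Dates within each item should be in non-decreasing order
is_sorted = bool(
    demo.groupby("ItemID")["Date"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)
print(n_dupes, rows_per_item.to_dict(), is_sorted)
```

If any of these checks fails on real data, deduplicate or sort before moving on, since lag and window features depend on row order.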

Step 3: Initial Cleaning & Validation

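Ensure the time column is in the correct datetime format using to_dt(). Handle any initial missing values using an appropriate strategy (here, a forward fill with ffill()). The synthetic data has no gaps, so the fill changes nothing in this example; on real multi-item data, keep in mind that a plain ffill() can carry the last value of one item into the first rows of the next.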

dt_col = 'Date'
# Ensure datetime column is correct format
df_clean = to_dt(df_raw, dt_col=dt_col, error='raise')

# Handle missing values (example: forward fill)
df_clean = df_clean.ffill()
print("\nPerformed initial cleaning (datetime check, ffill).")
print("Cleaned data sample:")
print(df_clean.head(3))
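
On real multi-item data, a forward fill is usually safer when applied within each item. The sketch below contrasts the two behaviors on a tiny hypothetical frame (`demo` and `value_cols` are illustrative names); it uses only pandas groupby and is not part of the fusionlab API.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: item 0 ends with a value, item 1 starts with a gap
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-02-01"] * 2),
    "ItemID": [0, 0, 1, 1],
    "Temperature": [10.0, 12.0, np.nan, 20.0],
})
demo = demo.sort_values(["ItemID", "Date"]).reset_index(drop=True)

# Global ffill: item 1's first row inherits item 0's last value (12.0)
global_fill = demo.ffill()

# Per-item ffill: the gap stays NaN because item 1 has no earlier value
value_cols = ["Temperature"]
per_item = demo.copy()
per_item[value_cols] = demo.groupby("ItemID")[value_cols].ffill()

print(global_fill["Temperature"].tolist())
print(per_item["Temperature"].tolist())
```

Leading gaps that remain after a per-item fill need a separate decision, such as a backward fill or dropping those rows.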

Step 4: Feature Engineering

Generate additional time series features using ts_engineering(). This can create lag features, rolling window statistics, and time-based features (like year, month, day of week). Rows with NaNs generated by these operations are typically dropped.

target_col = 'Sales'
df_featured = ts_engineering(
    df=df_clean.copy(),   # Pass a copy to avoid modifying df_clean
    value_col=target_col, # Generate features based on Sales
    dt_col=dt_col,        # Use Date for time features
    lags=3,               # Create Sales_lag_1, _lag_2, _lag_3
    window=6,             # Create rolling mean/std over 6 months
    diff_order=0,         # No differencing
    apply_fourier=False,  # No Fourier features
    scaler=None           # Apply scaling later
)
# Drop rows with NaNs introduced by lags/rolling features
df_featured.dropna(inplace=True)
print("\nGenerated time series features.")
print("Shape after feature engineering and dropna:", df_featured.shape)
print("New columns added:", df_featured.columns.tolist())
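
With several items stacked in one DataFrame, lag and rolling features should be computed within each item so that one item's history does not feed another's. Whether ts_engineering() does this grouping is not shown here, so it is worth inspecting the first rows of each item in df_featured. As a point of comparison, here is a minimal pandas-only sketch of per-item lags and rolling means; the frame and the window of 2 are hypothetical, and the column names merely mirror the ones used in this workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two items with four observations each
demo = pd.DataFrame({
    "ItemID": [0] * 4 + [1] * 4,
    "Sales": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
})

grouped = demo.groupby("ItemID")["Sales"]
# Lag 1 within each item: the first row of every item becomes NaN
demo["Sales_lag_1"] = grouped.shift(1)
# Rolling mean over 2 steps, also restricted to each item
demo["rolling_mean_2"] = grouped.transform(lambda s: s.rolling(2).mean())

# Drop the warm-up rows that have no complete history
demo = demo.dropna().reset_index(drop=True)
print(demo)
```

Each item loses its own warm-up rows, so the first surviving row of item 1 has a lag of 10.0 taken from item 1 rather than 4.0 from item 0.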

Step 5: Feature Selection / Reduction (Optional)

After generating many features, you might want to remove redundant ones (e.g., highly correlated lags) or reduce dimensionality. You could use select_and_reduce_features() here. For this example, we proceed with all generated features.

# --- OPTIONAL STEP ---
# exclude_cols = [dt_col, 'ItemID'] # Keep identifiers
# df_selected, _ = select_and_reduce_features(
#     df=df_featured,
#     target_col=target_col, exclude_cols=exclude_cols,
#     method='corr', corr_threshold=0.95, verbose=0
# )
# print("\nApplied optional feature selection (if uncommented).")
# --- END OPTIONAL STEP ---

# Use all features generated in Step 4 for this workflow
df_selected = df_featured
print("\nSkipped optional feature selection step.")
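
To make the idea of correlation-based pruning concrete, the sketch below drops one column from every pair whose absolute correlation exceeds a threshold. It uses plain pandas and NumPy on hypothetical columns (f1, f2, f3) and is an illustration of the general technique, not the implementation of select_and_reduce_features().

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + 1.0,       # linear function of f1, so fully correlated
    "f3": rng.normal(size=200),   # independent noise
})

threshold = 0.95
corr = demo.corr().abs()
# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = demo.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```

In a forecasting setting, exclude the target and identifier columns from this check, as the commented-out exclude_cols argument above suggests.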

Step 6: Define Feature Sets & Scale Numerics

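Define the final lists of static, dynamic, and future columns based on the features now present in the DataFrame. Then apply scaling (e.g., StandardScaler) to the numerical features that will be fed into the neural network, including the target. Save the fitted scaler: it is needed at prediction time to scale new inputs and to inverse-transform the model output. The scaler here is fitted on the full dataset for simplicity; for a strict evaluation, fit it on the training period only so that later observations do not influence the scaling statistics.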

# Define final feature sets AFTER engineering/selection
static_cols = ['ItemID'] # Only ItemID remains truly static here
# Dynamic includes time features, lags, rolling stats, temp
dynamic_cols = [
    'Temperature', 'Sales_lag_1', 'Sales_lag_2', 'Sales_lag_3',
    'rolling_mean_6', 'rolling_std_6', 'year', 'month', 'day',
    'day_of_week', 'is_weekend', 'quarter'
]
# Future includes known promotions and time features known ahead
future_cols = ['PlannedPromotion', 'year', 'month', 'day',
               'day_of_week', 'is_weekend', 'quarter']
# Columns to be scaled (numerical inputs and the target)
numerical_cols = ['Temperature', target_col, # Include target
                  'Sales_lag_1', 'Sales_lag_2', 'Sales_lag_3',
                  'rolling_mean_6', 'rolling_std_6']

# Apply scaling
scaler = StandardScaler()
df_scaled = df_selected.copy()
# Scale target AND relevant numerical input features
df_scaled[numerical_cols] = scaler.fit_transform(df_scaled[numerical_cols])

# Save the scaler (important!)
scaler_path = os.path.join(output_dir, "feature_scaler.joblib")
joblib.dump(scaler, scaler_path)
print(f"\nScaled numerical features. Scaler saved to {scaler_path}")

# Note: Categorical features ('ItemID', 'month', 'PlannedPromotion',
# time features like 'day_of_week', 'is_weekend', 'quarter') are
# assumed handled by model embeddings. If not, encode them here.
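
Because one StandardScaler is fitted on several columns at once, calling inverse_transform on model output that contains only the target will not work directly. One option is to undo the standardization by hand with the target's own mean and scale, which scikit-learn exposes as the mean_ and scale_ attributes. A minimal sketch on hypothetical data, with `preds_scaled` standing in for model predictions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

numerical_cols = ["Temperature", "Sales"]
target_col = "Sales"
demo = pd.DataFrame({
    "Temperature": [10.0, 15.0, 20.0, 25.0],
    "Sales": [100.0, 120.0, 140.0, 160.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(demo[numerical_cols])

# Position of the target among the columns the scaler was fitted on
t_idx = numerical_cols.index(target_col)

# Stand-in for model predictions in scaled space
preds_scaled = scaled[:, t_idx]

# Undo standardization for the target column only
preds = preds_scaled * scaler.scale_[t_idx] + scaler.mean_[t_idx]
print(preds)
```

An alternative design is to fit a second, dedicated scaler on the target column and save it alongside the feature scaler.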

Step 7: Reshape into Sequences using reshape_xtft_data

Use the reshape_xtft_data() utility to transform the processed, scaled DataFrame into the structured NumPy arrays (static, dynamic, future, target) expected by TFT/XTFT models. This handles the rolling window creation and feature separation.

time_steps = 12         # 1 year lookback
forecast_horizons = 6   # Predict 6 months ahead
spatial_cols = ['ItemID'] # Group sequences by ItemID

print(f"\nReshaping data into sequences (T={time_steps}, H={forecast_horizons})...")
static_data, dynamic_data, future_data, target_data = reshape_xtft_data(
    df=df_scaled, # Use scaled data
    dt_col=dt_col,
    target_col=target_col, # Target column name
    dynamic_cols=dynamic_cols, # List of dynamic column names
    static_cols=static_cols,   # List of static column names
    future_cols=future_cols,   # List of future column names
    spatial_cols=spatial_cols, # List of grouping columns
    time_steps=time_steps,
    forecast_horizons=forecast_horizons,
    verbose=1 # Show resulting shapes
)
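
To build intuition for what a rolling-window reshape produces, the sketch below slices a single series into lookback and target windows. It is a generic NumPy illustration, not the implementation of reshape_xtft_data(); `make_windows` is a hypothetical helper, and the 48-point series is made up. A series of length n yields n - time_steps - horizon + 1 windows.

```python
import numpy as np

def make_windows(values, time_steps, horizon):
    """Slice one series into (lookback, target) window pairs."""
    n_windows = len(values) - time_steps - horizon + 1
    if n_windows <= 0:
        raise ValueError("Series is too short for the requested window sizes.")
    # Lookback windows: time_steps consecutive values starting at i
    X = np.stack([values[i:i + time_steps] for i in range(n_windows)])
    # Target windows: the horizon values that immediately follow
    y = np.stack([
        values[i + time_steps:i + time_steps + horizon]
        for i in range(n_windows)
    ])
    return X, y

series = np.arange(48, dtype=float)  # hypothetical series of 48 points
X, y = make_windows(series, time_steps=12, horizon=6)
print(X.shape, y.shape)  # (31, 12) (31, 6)
```

The same arithmetic explains why short series produce few or no sequences: each item needs at least time_steps + forecast_horizons rows after feature engineering.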

Step 8: Train / Validation / Test Split

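Split the generated sequence arrays into training, validation, and (optionally) test sets. Time series data normally calls for a chronological split. The positional slices below are chronological only if the sequences are ordered by time; if the arrays are instead ordered item by item, the same slices separate items rather than time periods, so check the ordering of the reshaped output first.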

# Example: 70% Train, 15% Validation, 15% Test (Chronological)
# Ensure data was sorted by time before reshaping for this split type
if static_data is not None: # Check if static data exists
    n_samples = static_data.shape[0]
elif dynamic_data is not None:
    n_samples = dynamic_data.shape[0]
else:
    raise ValueError("No data available to split.")

n_val = int(n_samples * 0.15)
n_test = int(n_samples * 0.15)
n_train = n_samples - n_val - n_test

# Perform the split for each array type
X_train_static, X_val_static, X_test_static = (
    static_data[:n_train],
    static_data[n_train:n_train + n_val],
    static_data[n_train + n_val:]
)
X_train_dynamic, X_val_dynamic, X_test_dynamic = (
    dynamic_data[:n_train],
    dynamic_data[n_train:n_train + n_val],
    dynamic_data[n_train + n_val:]
)
X_train_future, X_val_future, X_test_future = (
    future_data[:n_train],
    future_data[n_train:n_train + n_val],
    future_data[n_train + n_val:]
)
y_train, y_val, y_test = (
    target_data[:n_train],
    target_data[n_train:n_train + n_val],
    target_data[n_train + n_val:]
)

print("\nData split into Train/Validation/Test sets:")
print(f"  Train Samples : {n_train}")
print(f"  Val.  Samples : {n_val}")
print(f"  Test  Samples : {n_test}")
print(f"  Example Train Dynamic Shape: {X_train_dynamic.shape}")
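
When sequences are stored item by item, a split that stays chronological within every item needs to know which item each sequence belongs to. The sketch below assumes such a per-sequence `group_ids` array is available, which is an assumption: the reshape call above does not return one, so it would have to be derived separately. `chrono_split_indices` is a hypothetical helper.

```python
import numpy as np

def chrono_split_indices(group_ids, val_frac=0.15, test_frac=0.15):
    """Return train/val/test index arrays, split in order within each group."""
    train_idx, val_idx, test_idx = [], [], []
    for g in np.unique(group_ids):
        idx = np.flatnonzero(group_ids == g)  # positions keep their order
        n_val = int(len(idx) * val_frac)
        n_test = int(len(idx) * test_frac)
        n_train = len(idx) - n_val - n_test
        train_idx.append(idx[:n_train])
        val_idx.append(idx[n_train:n_train + n_val])
        test_idx.append(idx[n_train + n_val:])
    return (np.concatenate(train_idx),
            np.concatenate(val_idx),
            np.concatenate(test_idx))

# Hypothetical layout: 3 items with 10 sequences each, stored item by item
group_ids = np.repeat([0, 1, 2], 10)
train_idx, val_idx, test_idx = chrono_split_indices(group_ids)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index arrays can be applied to every array in the same way, for example dynamic_data[train_idx], so that all inputs and targets stay aligned.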

Step 9: Save Processed Data (Optional)

Optionally, save the final processed sequence arrays using np.savez for easy reloading during model training experiments, avoiding the need to repeat the entire preprocessing pipeline each time.

processed_data_path = os.path.join(output_dir, "processed_sequences.npz")
np.savez(
    processed_data_path,
    X_train_static=X_train_static, X_val_static=X_val_static,
    X_train_dynamic=X_train_dynamic, X_val_dynamic=X_val_dynamic,
    X_train_future=X_train_future, X_val_future=X_val_future,
    y_train=y_train, y_val=y_val,
    # Test sets from Step 8, kept so evaluation reuses the same split
    X_test_static=X_test_static, X_test_dynamic=X_test_dynamic,
    X_test_future=X_test_future, y_test=y_test
)
print(f"\nProcessed sequence data saved to {processed_data_path}")
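
Reloading the archive in a training script is the mirror image of saving it. The sketch below round-trips two small hypothetical arrays through a temporary directory to show the pattern; the key names match the keyword arguments passed to np.savez.

```python
import os
import tempfile
import numpy as np

# Hypothetical small arrays standing in for the saved sequences
X_train_dynamic = np.zeros((4, 12, 3), dtype=np.float32)
y_train = np.ones((4, 6, 1), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "processed_sequences.npz")
    np.savez(path, X_train_dynamic=X_train_dynamic, y_train=y_train)

    # np.load returns a lazy, dict-like NpzFile; the with-block closes it
    with np.load(path) as data:
        keys = sorted(data.files)
        X_loaded = data["X_train_dynamic"]
        y_loaded = data["y_train"]

print(keys, X_loaded.shape, y_loaded.shape)
```

The scaler saved in Step 6 can be restored in the same script with joblib.load(scaler_path).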