fusionlab.utils.ts_utils.select_and_reduce_features

fusionlab.utils.ts_utils.select_and_reduce_features(df, target_col=None, exclude_cols=None, method='corr', corr_threshold=0.9, n_components=None, scale_data=True, return_pca=False, verbose=0)

Perform feature selection or dimensionality reduction on a dataset, using either correlation-based filtering or Principal Component Analysis (PCA). For PCA, the fraction of total variance explained by the i-th principal component is

\[\text{Var}_{\text{explained}}(\text{PC}_i) = \frac{\lambda_i}{\sum_j \lambda_j},\]

where \(\lambda_i\) is the i-th eigenvalue of the data covariance matrix and the sum runs over all eigenvalues [1].
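The ratio above can be checked numerically. The following is a minimal NumPy sketch, not fusionlab's implementation: the helper name `explained_variance_ratio` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def explained_variance_ratio(X):
    """Return lambda_i / sum_j lambda_j for the covariance matrix of X."""
    X = np.asarray(X, dtype=float)
    # rowvar=False treats each column as one variable (feature).
    cov = np.cov(X, rowvar=False)
    # eigvalsh handles symmetric matrices and returns ascending eigenvalues,
    # so reverse them to list the largest component first.
    eigvals = np.linalg.eigvalsh(cov)[::-1]
    # Clip tiny negative values caused by floating-point round-off.
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvals / eigvals.sum()

# Hypothetical data: three independent features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
ratios = explained_variance_ratio(X)
print(ratios)
```

The ratios sum to one and are sorted in decreasing order, so the first entries show how much variance a truncated set of components retains.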

Parameters:
  • df (pandas.DataFrame) – The input DataFrame containing the dataset. It typically holds the feature columns and, optionally, one or more target columns.

  • target_col (str or list of str, optional) – Name(s) of the target column(s) to exclude from feature selection or reduction. If a list is given, every column it names is excluded. If None, no column is treated as a target.

  • exclude_cols (list of str, optional) – Additional columns to exclude from feature selection and PCA transformations (e.g. ID columns, date-time columns). Defaults to an empty list.

  • method ({'corr', 'correlation', 'pca'}, optional) –

    The approach for feature reduction:

    • 'corr' or 'correlation': Use correlation-based feature selection. When two features have an absolute correlation above corr_threshold, one of them is dropped.

    • 'pca': Use Principal Component Analysis to reduce the dimensionality.

  • corr_threshold (float, optional) – The threshold for correlation-based feature selection. For any pair of features whose absolute correlation exceeds this value, one of the two is dropped. Defaults to 0.9.

  • n_components (int or float, optional) – Number of PCA components to keep. An integer keeps exactly that many components. A float in the range (0, 1] is read as the proportion of variance to retain. Only used if method='pca'.

  • scale_data (bool, optional) – If True, standardizes the features before PCA using sklearn.preprocessing.StandardScaler. Ignored for correlation-based selection. Default is True.

  • return_pca (bool, optional) – If True and method='pca', returns the fitted PCA model along with the transformed DataFrame.

  • verbose (int, optional) –

    Verbosity level:

    • 0 : No output.

    • 1 : Basic logs of feature counts and steps.

    • 2 : More detailed information such as correlation matrix or explained variance ratio.

Returns:

  • transformed_df (pandas.DataFrame) – The resulting DataFrame after feature selection or PCA-based dimensionality reduction. If a target was specified, it is re-appended at the end.

  • pca_model (sklearn.decomposition.PCA or None) – If method='pca' and return_pca=True, returns the fitted PCA model. Otherwise None.

Examples

>>> import pandas as pd
>>> from fusionlab.utils.ts_utils import select_and_reduce_features
>>> data = {
...     'A': [1, 2, 3, 4, 5],
...     'B': [2, 4, 6, 8, 10],
...     'C': [5, 3, 6, 2, 11],
...     'Target': [0, 1, 0, 1, 0]
... }
>>> df = pd.DataFrame(data)
>>> # Correlation-based selection
>>> out_df = select_and_reduce_features(
...     df, target_col='Target',
...     method='corr', corr_threshold=0.8,
...     verbose=1
... )
Number of features before selection: 3
Excluded columns: []
Performing correlation-based feature selection...
>>> # PCA-based reduction
>>> pca_df, pca_model = select_and_reduce_features(
...     df, target_col='Target', method='pca',
...     n_components=2, scale_data=True,
...     return_pca=True, verbose=1
... )
Number of features before selection: 3
Excluded columns: []
Performing Principal Component Analysis (PCA)...
Standardizing data before PCA.
Explained variance ratio: [0.63717928 0.29160977 0.07121096]
Number of components selected: 2
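
When n_components is given as a float, the number of components kept follows from the cumulative explained variance. A rough sketch of that bookkeeping, reusing the ratios printed above; the helper `components_for_variance` is hypothetical and the exact tie-breaking rule used by fusionlab or scikit-learn may differ:

```python
import numpy as np

def components_for_variance(ratios, target):
    """Smallest k whose cumulative explained variance reaches `target`."""
    cumulative = np.cumsum(ratios)
    # side="left" returns the first index where cumulative >= target.
    k = int(np.searchsorted(cumulative, target, side="left")) + 1
    # Guard against round-off when target is at or near 1.0.
    return min(k, len(ratios))

# Explained variance ratios copied from the PCA example output.
ratios = np.array([0.63717928, 0.29160977, 0.07121096])
k60 = components_for_variance(ratios, 0.60)
k90 = components_for_variance(ratios, 0.90)
k95 = components_for_variance(ratios, 0.95)
print(k60, k90, k95)
```

Under this rule, retaining 90% of the variance needs two components for these ratios, while 95% needs all three.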

Notes

  • Correlation-based selection is efficient when many features are highly correlated, but it can discard useful signal when correlated features are jointly informative [2].

  • PCA projects the data onto orthogonal principal components, which can simplify downstream ML models but makes the resulting features harder to interpret.
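
To make the correlation-based note concrete, here is a minimal pandas sketch of one common filtering rule: scan the upper triangle of the absolute correlation matrix and drop the later column of each highly correlated pair. This is an illustration under stated assumptions, not necessarily the rule fusionlab applies; `drop_correlated` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop the later column of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is too correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Same feature columns as in the Examples section (target left out).
features = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 6, 8, 10],
    'C': [5, 3, 6, 2, 11],
})
reduced, dropped = drop_correlated(features, threshold=0.8)
print(dropped)
print(list(reduced.columns))
```

Because B is an exact multiple of A, the pair is perfectly correlated and B is removed, while C stays. Which member of a pair gets dropped depends only on column order here, which is one way such a filter can lose a feature that is informative in combination with others.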

See also

sklearn.decomposition.PCA

The scikit-learn PCA class used for dimensionality reduction.

transform_stationarity

Stabilize time-series data prior to certain modeling approaches.

References