fusionlab.utils.ts_utils.select_and_reduce_features
- fusionlab.utils.ts_utils.select_and_reduce_features(df, target_col=None, exclude_cols=None, method='corr', corr_threshold=0.9, n_components=None, scale_data=True, return_pca=False, verbose=0)
Perform feature selection or dimensionality reduction on a dataset, using either correlation-based filtering or Principal Component Analysis (PCA).
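For the PCA method, the proportion of variance explained by the \(i\)-th principal component is
\[\text{Var}_{\text{explained}}(\text{PC}_i) = \frac{\lambda_i}{\sum_j \lambda_j},\]
where \(\lambda_i\) are the eigenvalues of the covariance matrix computed in PCA [1].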
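As a minimal sketch of the explained-variance formula, the snippet below turns a list of eigenvalues into explained-variance ratios and counts how many leading components are needed to reach a target proportion, which is how a float `n_components` is usually interpreted. The eigenvalues and both helper names are hypothetical, chosen only for illustration; the library (and scikit-learn underneath it) may treat the exact boundary case differently.

```python
from itertools import accumulate


def explained_variance_ratio(eigenvalues):
    """Return lambda_i / sum_j lambda_j for each eigenvalue."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]


def components_for_variance(ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches target."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= target:
            return k
    return len(ratios)


# Hypothetical eigenvalues, sorted in descending order (not taken from the library).
eigenvalues = [4.0, 2.0, 1.0, 1.0]

ratios = explained_variance_ratio(eigenvalues)
n_needed = components_for_variance(ratios, 0.8)

print(ratios)    # [0.5, 0.25, 0.125, 0.125]
print(n_needed)  # 3
```

The ratios always sum to 1, so the cumulative sum can be read directly as the share of total variance retained by the first k components.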
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing the dataset. Typically, it includes the feature columns and, optionally, a target column.
target_col (str, list, optional) – The name(s) of the target column(s) to exclude from feature selection or reduction. If a list is provided, every listed column is excluded. If None, no column is treated as a target.
exclude_cols (list of str, optional) – Additional columns to exclude from feature selection and PCA transformations (e.g. ID columns, date-time columns). Defaults to an empty list.
method ({'corr', 'correlation', 'pca'}, optional) – The approach for feature reduction:
'corr' or 'correlation' : Use correlation-based feature selection. Features exceeding a specified correlation threshold are dropped.
'pca' : Use Principal Component Analysis to reduce the dimensionality.
corr_threshold (float, optional) – The correlation threshold for correlation-based feature selection. Any pair of features with absolute correlation above this value leads to dropping one of them. Defaults to 0.9.
n_components (int or float, optional) – Number of PCA components to keep. If an integer, keeps that many components. If a float in the range (0, 1], it indicates the proportion of variance to retain. Only used if method='pca'.
scale_data (bool, optional) – If True, standardizes the features before PCA using sklearn.preprocessing.StandardScaler. Ignored for correlation-based selection. Default is True.
return_pca (bool, optional) – If True and method='pca', returns the fitted PCA model along with the transformed DataFrame.
verbose (int, optional) – Verbosity level:
0 : No output.
1 : Basic logs of feature counts and steps.
2 : More detailed information such as correlation matrix or explained variance ratio.
- Returns:
transformed_df (pandas.DataFrame) – The resulting DataFrame after feature selection or PCA-based dimensionality reduction. If a target was specified, it is re-appended at the end.
pca_model (sklearn.decomposition.PCA or None) – If method='pca' and return_pca=True, returns the fitted PCA model. Otherwise None.
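The correlation rule described for `corr_threshold` can be illustrated with a small, dependency-free sketch. It keeps a feature only if its absolute Pearson correlation with every previously kept feature stays at or below the threshold. This is an assumption about one common way to implement such a filter, not a description of the library's internals; in particular, which member of a correlated pair the library drops is not stated in this page. The helper names `pearson` and `drop_correlated` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of the features kept by a greedy correlation filter.

    Assumption: the earlier column of a highly correlated pair is kept.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Feature columns only; a target column would be set aside before filtering.
features = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],   # exact multiple of A, so |corr(A, B)| = 1
    "C": [5, 3, 6, 2, 11],
}

kept = drop_correlated(features, threshold=0.8)
print(kept)  # ['A', 'C']
```

Here `B` is removed because it is perfectly correlated with `A`, while `C` survives because its correlation with `A` is roughly 0.5, below the 0.8 threshold.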
Examples
>>> import pandas as pd
>>> from fusionlab.utils.ts_utils import select_and_reduce_features
>>> data = {
...     'A': [1, 2, 3, 4, 5],
...     'B': [2, 4, 6, 8, 10],
...     'C': [5, 3, 6, 2, 11],
...     'Target': [0, 1, 0, 1, 0]
... }
>>> df = pd.DataFrame(data)
>>> # Correlation-based selection
>>> out_df = select_and_reduce_features(
...     df, target_col='Target',
...     method='corr', corr_threshold=0.8,
...     verbose=1
... )
Number of features before selection: 3
Excluded columns: []
Performing correlation-based feature selection...

>>> # PCA-based reduction
>>> pca_df, pca_model = select_and_reduce_features(
...     df, target_col='Target', method='pca',
...     n_components=2, scale_data=True,
...     return_pca=True, verbose=1
... )
Number of features before selection: 3
Excluded columns: []
Performing Principal Component Analysis (PCA)...
Standardizing data before PCA.
Explained variance ratio: [0.63717928 0.29160977 0.07121096]
Number of components selected: 2
Notes
Correlation-based selection can be efficient when many features are highly correlated, but it may discard relevant signal when correlated features jointly carry information that no single one of them captures [2].
PCA transforms the data into orthogonal principal components, which can simplify many ML models but makes the resulting features harder to interpret.
See also
PCA – The scikit-learn PCA class used for dimension reduction.
transform_stationarity – Stabilize time-series data prior to certain modeling approaches.
References