fusionlab.utils.ts_utils.ts_outlier_detector

fusionlab.utils.ts_utils.ts_outlier_detector(df, dt_col=None, value_col=None, method='zscore', threshold=3, view=False, fig_size=(10, 5), show_grid=True, drop=False, verbose=0)

Detect outliers in a time series using either Z-Score or Interquartile Range (IQR). Outliers can optionally be removed from the DataFrame.

In many time-series analyses, anomalous points can distort model training or skew statistical inferences. Common outlier detection approaches include the Z-Score:

\[Z_t = \frac{X_t - \mu}{\sigma},\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the series. A point is flagged as an outlier when \(|Z_t| > \text{threshold}\).
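As a rough illustration of the Z-Score rule, the sketch below computes the scores with pandas and flags values beyond the threshold. It is not the library's implementation: the helper name `zscore_flags` and the use of the population standard deviation (`ddof=0`) are assumptions made for this example.

```python
import pandas as pd


def zscore_flags(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask that is True where |Z_t| > threshold.

    Hypothetical helper for illustration only.
    """
    mu = values.mean()
    # Assumption: population standard deviation (ddof=0).
    sigma = values.std(ddof=0)
    z = (values - mu) / sigma
    return z.abs() > threshold


# Twenty ordinary readings followed by a single spike.
series = pd.Series([10.0] * 10 + [12.0] * 10 + [100.0])
mask = zscore_flags(series, threshold=3.0)

# Show only the flagged observations.
print(series[mask])
```

Here only the final value (100.0) lies more than three standard deviations from the mean, so it is the only point flagged.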

Parameters:

df (pandas.DataFrame) – The input DataFrame containing the time series data. Must include a datetime column or index.
dt_col (str, optional) – Column name representing the datetime dimension. If None, the function assumes the index is datetime-like or uses ts_validator.
value_col (str, optional) – Name of the target variable in the DataFrame (e.g., “Sales”).
method ({'zscore', 'iqr'}, optional) –
- 'zscore': Use Z-Scores to detect outliers.
- 'iqr': Use the Interquartile Range method. With \(\text{IQR} = Q_3 - Q_1\), points below \(Q_1 - \text{threshold} \times \text{IQR}\) or above \(Q_3 + \text{threshold} \times \text{IQR}\) are flagged.
threshold (int or float, optional) – Threshold multiplier for the chosen method. For Z-Scores, it is the number of standard deviations from the mean beyond which a point qualifies as an outlier (default=3). For IQR, it is the multiplier applied to the IQR to define the lower and upper bounds.
view (bool, optional) – If True, displays a plot marking outliers in red over the original time series.
fig_size (tuple of (float, float), optional) – The size of the figure (width, height) if visualizing.
show_grid (bool, optional) – Whether to display gridlines in the plot.
drop (bool, optional) – If True, removes the rows flagged as outliers from df.
verbose (int, optional) –
Verbosity level:
- 0 : No console messages.
- 1 : Basic information about outlier counts.
- 2+ : (Not implemented here, but can be extended).
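The IQR rule described for method='iqr' can be sketched as follows. This is a minimal illustration, not the function's source: the helper name `iqr_flags` is hypothetical, the quartiles use pandas' default linear interpolation, and the multiplier of 1.5 is the conventional textbook value rather than this function's default of 3.

```python
import pandas as pd


def iqr_flags(values: pd.Series, threshold: float = 1.5) -> pd.Series:
    """Return a boolean mask for points outside the IQR fences.

    Hypothetical helper for illustration only.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    # Fences: the threshold scales the IQR on both sides.
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr
    return (values < lower) | (values > upper)


series = pd.Series([10, 11, 12, 13, 14, 15, 16, 90])
mask = iqr_flags(series, threshold=1.5)

# Show only the flagged observations.
print(series[mask])
```

A larger threshold widens both fences, so fewer points are flagged.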

Returns:

result – The original DataFrame with a new column 'is_outlier' marking outlier rows (True/False), unless drop=True. In that case, it returns the DataFrame after removing these rows (and without the extra column).

Return type:

pandas.DataFrame
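The two return shapes can be pictured with plain pandas. The sketch below does not call ts_outlier_detector; the boolean mask is hard-coded and hypothetical, standing in for whatever the chosen detection method would produce.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "Sales": [10, 12, 95, 13, 12],
})

# Hypothetical detection result: only the third row is an outlier.
mask = pd.Series([False, False, True, False, False], index=df.index)

# Shape described for drop=False: original rows plus an 'is_outlier' column.
flagged = df.assign(is_outlier=mask)

# Shape described for drop=True: flagged rows removed, no extra column.
dropped = df.loc[~mask]

# With the flag column present, outliers can be inspected before any removal.
print(flagged[flagged["is_outlier"]])
print(dropped)
```

Keeping the flag column is useful when outliers should be reviewed first, since the decision to remove rows can then be deferred.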

Examples

>>> import pandas as pd
>>> from fusionlab.utils.ts_utils import ts_outlier_detector
>>> data = {
...     'Date': [
...         '2021-01-01','2021-01-02','2021-01-03',
...         '2021-01-04','2021-01-05','2021-01-06'
...     ],
...     'Sales': [10, 100, 12, 13, 200, 15]
... }
>>> df = pd.DataFrame(data)
>>> df['Date'] = pd.to_datetime(df['Date'])
>>> df_out = ts_outlier_detector(
...     df,
...     dt_col='Date',
...     value_col='Sales',
...     method='zscore',
...     threshold=2.5,
...     view=True,
...     drop=False,
...     verbose=1
... )
Target variable: Sales
Datetime column: Date
Outlier detection method: zscore, Threshold: 2.5
Detecting outliers using Z-Score...
Number of outliers detected: 2
Outliers retained in the DataFrame.

Notes

The choice of outlier detection method (Z-Score vs. IQR) is context-dependent. Z-Scores assume roughly normally distributed data [1], while the IQR method is more robust to skewed distributions [2].
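The difference in robustness can be seen on a small skewed sample. The sketch below is an illustration under stated assumptions, not the library's code: both helpers are hypothetical, the Z-Score uses the population standard deviation (`ddof=0`), and the quartiles use pandas' default interpolation. The same threshold of 3 is applied to both rules.

```python
import pandas as pd


def zscore_flags(values: pd.Series, threshold: float) -> pd.Series:
    # Hypothetical helper; assumes population standard deviation (ddof=0).
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold


def iqr_flags(values: pd.Series, threshold: float) -> pd.Series:
    # Hypothetical helper; fences are Q1/Q3 offset by threshold * IQR.
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - threshold * iqr) | (values > q3 + threshold * iqr)


# Eight ordinary values and two extreme ones.
series = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 150, 160])

n_zscore = int(zscore_flags(series, threshold=3).sum())
n_iqr = int(iqr_flags(series, threshold=3).sum())
print(f"Z-Score flags: {n_zscore}, IQR flags: {n_iqr}")
```

In this sample the two extreme values inflate the mean and standard deviation enough that neither exceeds three standard deviations, so the Z-Score rule flags nothing, while the quartile-based fences still flag both.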