fusionlab.utils.nan_ops

fusionlab.utils.nan_ops(data, auxi_data=None, data_kind=None, ops='check_only', action=None, error='raise', process=None, condition=None, savefile=None, verbose=0)

Perform operations on NaN values within data structures, handling both the primary data and optional auxiliary ("witness") data according to the specified parameters.

This function provides a toolkit for managing missing values (NaN) in data structures such as NumPy arrays, pandas DataFrames, and pandas Series. Depending on the ops parameter, it can check for the presence of NaNs, validate data integrity, or sanitize the data by filling or dropping NaN values. The function also supports witness data, which matters when the relationship between primary and witness data must be maintained.

\[\begin{split}\text{Processed\_data} = \begin{cases} \text{filled\_data} & \text{if action is 'fill'} \\ \text{dropped\_data} & \text{if action is 'drop'} \\ \text{original\_data} & \text{otherwise} \end{cases}\end{split}\]
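The case expression above can be sketched with plain pandas. This is a minimal illustration, not fusionlab code: `process_nan` is a hypothetical helper, and reading the 'both' fill method as a forward fill followed by a backward fill is an assumption.

```python
import numpy as np
import pandas as pd


def process_nan(data, action=None):
    """Hypothetical helper mirroring the three cases above (not fusionlab code)."""
    if action == "fill":
        # Assumption: 'both' means forward fill followed by backward fill.
        return data.ffill().bfill()
    if action == "drop":
        # Remove every entry (or row) that holds a NaN.
        return data.dropna()
    # Any other value leaves the data untouched.
    return data


series = pd.Series([np.nan, 2.0, np.nan, 4.0])
filled = process_nan(series, action="fill")
dropped = process_nan(series, action="drop")
unchanged = process_nan(series)
```

The sketch follows only the case expression; the defaulting rules described under the action parameter are not reproduced here.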

Parameters:

data (array-like, pandas.DataFrame, or pandas.Series) – The primary data structure containing NaN values to be processed.
auxi_data (array-like, pandas.DataFrame, or pandas.Series, optional) – Auxiliary data that accompanies the primary data. Its role depends on the data_kind parameter. If data_kind is ‘target’, auxi_data is treated as feature data, and vice versa. This is useful for operations that need to maintain the alignment between primary and witness data.
data_kind ({'target', 'feature', None}, optional) – Specifies the role of the primary data. If set to ‘target’, data is considered target data, and auxi_data (if provided) is treated as feature data. If set to ‘feature’, data is treated as feature data, and auxi_data is considered target data. If None, no special handling is applied, and witness data is ignored unless explicitly required by other parameters.
ops ({'check_only', 'validate', 'sanitize'}, default 'check_only') –
Defines the operation to perform on the NaN values in the data:
- 'check_only': Checks whether the data contains any NaN values and returns a boolean indicator.
- 'validate': Validates that the data does not contain NaN values. If NaNs are found, it raises an error or warns based on the error parameter.
- 'sanitize': Cleans the data by either filling or dropping NaN values based on the action, process, and condition parameters.
action ({'fill', 'drop'}, optional) –
Specifies the action to take when ops is set to ‘sanitize’:
- 'fill': Fills NaN values using the fill_NaN function with the method set to ‘both’.
- 'drop': Drops NaN values based on the conditions and process specified. If data_kind is ‘target’, it handles NaNs in a way that preserves data integrity for machine learning models.
- If None, defaults to ‘drop’ when sanitizing.
Note: If ops is not ‘sanitize’ and action is set, an error is raised indicating conflicting parameters.
error ({'raise', 'warn', None}, default 'raise') –
Determines the error handling policy:
- 'raise': Raises exceptions when encountering issues.
- 'warn': Emits warnings instead of raising exceptions.
- None: Defaults to the base policy, which is typically ‘warn’.
This parameter is utilized by the error_policy function to enforce consistent error handling throughout the operation.
process ({'do', 'do_anyway'}, optional) –
Works in conjunction with the action parameter when action is ‘drop’:
- 'do': Drops NaN values only if certain conditions are met.
- 'do_anyway': Forces the dropping of NaN values regardless of conditions.
This provides flexibility in handling NaNs based on the specific requirements of the dataset and the analysis being performed.
condition (callable or None, optional) – A callable that defines a condition for dropping NaN values when action is ‘drop’. For example, it can specify that the number of NaNs should not exceed a certain fraction of the dataset. If the condition is not met, the behavior is controlled by the process parameter.
verbose (int, default 0) –
Controls the verbosity level of the function’s output for debugging purposes:
- 0: No output.
- 1: Basic informational messages.
- 2: Detailed processing messages.
- 3: Debug-level messages with complete trace of operations.
Higher verbosity levels provide more insights into the function’s internal operations, aiding in debugging and monitoring.
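How condition and process might interact when action is ‘drop’ can be sketched as follows. The signature of the callable is not specified above, so this sketch assumes it receives the data and returns True when dropping is acceptable. `drop_nan_if` and `at_most_a_quarter_missing` are hypothetical names, and the sketch simply leaves the data unchanged when the condition fails, whereas the real function may raise or warn according to the error parameter.

```python
import numpy as np
import pandas as pd


def drop_nan_if(data, condition=None, process="do"):
    """Hypothetical sketch of a conditional drop (not fusionlab code).

    Assumption: `condition` receives the data and returns True when
    dropping is acceptable.
    """
    allowed = condition is None or bool(condition(data))
    if allowed or process == "do_anyway":
        return data.dropna()
    # Condition failed and process is 'do': leave the data as it is.
    return data


def at_most_a_quarter_missing(data):
    # Fraction of rows that contain at least one NaN.
    return data.isna().any(axis=1).mean() <= 0.25


frame = pd.DataFrame({"A": [1.0, np.nan, 3.0, np.nan], "B": [1.0, 2.0, 3.0, 4.0]})
kept = drop_nan_if(frame, condition=at_most_a_quarter_missing, process="do")
forced = drop_nan_if(frame, condition=at_most_a_quarter_missing, process="do_anyway")
```

Half of the rows in `frame` contain a NaN, so the condition fails: `kept` retains all four rows, while `forced` drops the two incomplete ones.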

Returns:

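The sanitized data structure with NaN values handled according to the specified parameters. If auxi_data is provided and processed, a tuple containing the sanitized data and auxi_data is returned. Otherwise, only the sanitized data is returned. When ops is ‘check_only’, a boolean indicator is returned instead; in the example below, passing auxi_data yields one boolean for each input.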

Return type:

array-like, pandas.DataFrame, or pandas.Series
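One common way to keep a target and its features aligned while dropping NaNs is to apply a single boolean mask to both objects. The sketch below uses plain pandas and assumes the two objects share the same index; whether nan_ops uses this exact strategy is not stated here.

```python
import numpy as np
import pandas as pd

target = pd.Series([1.0, np.nan, 3.0, 4.0], name="y")
features = pd.DataFrame({"A": [10.0, 20.0, 30.0, 40.0]})

# Assumption: both objects share the same index, so one boolean mask
# can be applied to each of them.
keep = target.notna()
aligned_target = target[keep]
aligned_features = features.loc[keep]
```

After masking, both objects carry the same index labels, so row i of the features still corresponds to entry i of the target.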

Raises:

ValueError –
- If an invalid value is provided for ops or data_kind.
- If auxi_data does not align with data in shape.
- If sanitization conditions are not met and the error policy is set to ‘raise’.
Warning –
- Emits warnings when NaN values are present and the error policy is set to ‘warn’.
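The difference between the ‘raise’ and ‘warn’ policies can be illustrated with the standard library alone. `report_nan` is a hypothetical stand-in, not fusionlab's error_policy helper, and the fallback of None to ‘warn’ follows the parameter description above as an assumption.

```python
import math
import warnings


def report_nan(values, error="raise"):
    """Hypothetical stand-in for an error policy (not fusionlab's error_policy)."""
    # Per the parameter description, None falls back to a base policy;
    # 'warn' is assumed here.
    policy = "warn" if error is None else error
    if policy not in ("raise", "warn"):
        raise ValueError(f"Unknown error policy: {error!r}")
    has_nan = any(isinstance(v, float) and math.isnan(v) for v in values)
    if has_nan and policy == "raise":
        raise ValueError("Data contains NaN values.")
    if has_nan:
        warnings.warn("Data contains NaN values.", UserWarning)
    return has_nan


# Capture the warning so it can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    found = report_nan([1.0, float("nan")], error="warn")
```

With error="warn" the call returns normally and records one UserWarning; with the default error="raise" the same input raises ValueError.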

Examples

>>> from fusionlab.utils.data_utils import nan_ops
>>> import pandas as pd
>>> import numpy as np
>>> # Example with target data and witness feature data
>>> target = pd.Series([1, 2, np.nan, 4])
>>> features = pd.DataFrame({
...     'A': [5, np.nan, 7, 8],
...     'B': ['x', 'y', 'z', np.nan]
... })
>>> # Check for NaNs
>>> nan_ops(target, auxi_data=features, data_kind='target', ops='check_only')
(True, True)
>>> # Validate data (will raise ValueError if NaNs are present)
>>> nan_ops(target, auxi_data=features, data_kind='target', ops='validate')
Traceback (most recent call last):
    ...
ValueError: Target contains NaN values.
>>> # Sanitize data by dropping NaNs
>>> cleaned_target, cleaned_features = nan_ops(
...     target,
...     auxi_data=features,
...     data_kind='target',
...     ops='sanitize',
...     action='drop',
...     verbose=2
... )
Dropping NaN values.
Dropped NaNs successfully.
>>> cleaned_target
0    1.0
1    2.0
3    4.0
dtype: float64
>>> cleaned_features
     A    B
0  5.0    x
3  8.0  NaN
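To cross-check the inputs used above without calling nan_ops, NaN presence can be inspected directly with pandas. This is an independent sanity check under plain pandas semantics, not a reproduction of the function's internals.

```python
import numpy as np
import pandas as pd

target = pd.Series([1, 2, np.nan, 4])
features = pd.DataFrame({
    "A": [5, np.nan, 7, 8],
    "B": ["x", "y", "z", np.nan],
})

# Series: one reduction is enough.
target_has_nan = bool(target.isna().any())
# DataFrame: reduce over columns, then over the resulting Series.
features_has_nan = bool(features.isna().any().any())
# Per-column NaN counts help decide between filling and dropping.
nan_counts = features.isna().sum().to_dict()
```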

Notes

The nan_ops function is designed to provide a robust framework for handling missing values in datasets, especially in machine learning workflows where the integrity of target and feature data is paramount. By allowing conditional operations and providing flexibility in error handling, it ensures that data preprocessing can be tailored to the specific needs of the analysis.

The function leverages helper utilities such as fill_NaN, drop_nan_in, and error_policy to maintain consistency and reliability across different data structures and scenarios. The verbosity levels aid developers in tracing the function’s execution flow, making it easier to debug and verify data transformations.