fusionlab.utils.mask_by_reference
fusionlab.utils.mask_by_reference(data, ref_col, values=None, find_closest=False, fill_value=0, mask_columns=None, error='raise', verbose=0, inplace=False, savefile=None)
Masks (replaces) values in columns other than the reference column for rows in which the reference column matches (or is closest to) the specified value(s).
If a row’s reference-column value is matched, that row’s values in the other columns are overwritten by fill_value. The reference column itself is not modified.

This function supports both exact and approximate matching:
- Exact matching is used if find_closest=False.
- Approximate (closest) matching is used if find_closest=True and the reference column is numeric.

By default, if the reference column does not exist or if the given values cannot be found (or approximated) in the reference column, an exception is raised. This behavior can be adjusted with the error parameter.

Parameters:
- data (pd.DataFrame) – The input DataFrame containing the data to be masked.
- ref_col (str) – The column in data serving as the reference for matching or finding the closest values.
- values (Any or sequence of Any, optional) – The reference values to look for in ref_col. This can be:
  - A single value (e.g., 0 or "apple").
  - A list/tuple of values (e.g., [0, 10, 25]).
  - None, in which case all rows are masked (i.e. all rows match), effectively overwriting the entire DataFrame (except the reference column) with fill_value.
  Note that if find_closest=False, these values must appear in the reference column; otherwise, an error or warning is triggered (depending on the error setting).
- find_closest (bool, default False) – If True, performs an approximate match for numeric reference columns. For each entry in values, the function locates the row(s) in ref_col whose value is numerically closest. Non-numeric reference columns revert to exact matching regardless.
- fill_value (Any, default 0) – The value used to fill/mask the non-reference columns wherever the condition (exact or approximate match) is met. This can be any valid type, e.g., integer, float, string, np.nan, etc. If fill_value='auto' and multiple values are given, each row matched by a particular reference value is filled with that same reference value. Examples:
  - If values=9 and fill_value='auto', the fill value is 9 for matched rows.
  - If values=['a', 10] and fill_value='auto', then rows matching 'a' are filled with 'a', and rows matching 10 are filled with 10.
- mask_columns (str or list of str, optional) – If specified, only these columns are masked. If None, all columns except ref_col are masked. If any column in mask_columns does not exist in the DataFrame and error='raise', a KeyError is raised; otherwise, the problem is reported as a warning or ignored, depending on the error setting.
- error ({'raise', 'warn', 'ignore'}, default 'raise') – Controls how to handle errors:
  - 'raise': raise an error if the reference column does not exist or if any of the given values cannot be matched (or approximated).
  - 'warn': only issue a warning instead of raising an error.
  - 'ignore': silently ignore any issues.
- verbose (int, default 0) – Verbosity level:
  - 0: silent (no messages).
  - 1: minimal feedback.
  - 2 or 3: more detailed messages for debugging.
- inplace (bool, default False) – If True, performs the operation in place and returns the original DataFrame with modifications. If False, returns a modified copy, leaving the original unaltered.
- savefile (str or None, optional) – File path where the DataFrame is saved if the decorator-based saving is active. If None, no saving occurs.
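To make the values=None and inplace semantics concrete, here is a minimal sketch in plain pandas. The helper name mask_all is hypothetical and is not part of fusionlab; it only mimics the documented behavior (mask every non-reference column, either on a copy or on the original object) and is not the library's implementation.

```python
import pandas as pd


def mask_all(data: pd.DataFrame, ref_col: str, fill_value=0,
             inplace: bool = False) -> pd.DataFrame:
    """Hypothetical helper: mask every non-reference column (values=None case)."""
    # Work on the original object only when inplace=True.
    out = data if inplace else data.copy()
    targets = [c for c in out.columns if c != ref_col]
    out[targets] = fill_value
    return out


df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.0]})

masked_copy = mask_all(df, "A")          # returns a modified copy
untouched = df["B"].tolist()             # original still holds 3.0 and 4.0

same = mask_all(df, "A", inplace=True)   # modifies and returns df itself
```

With inplace=False the caller keeps the original data and receives a new object; with inplace=True the returned object is the input DataFrame.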
Returns:
A DataFrame where rows matching the specified condition (exact or approximate) have had their non-reference columns replaced by fill_value.

Return type:
pd.DataFrame

Raises:
- KeyError – If error='raise' and ref_col is not in data.columns.
- ValueError – If error='raise' and no exact/approx match can be found for one or more entries in values.
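The three error modes can be pictured as a small policy dispatcher. The sketch below uses only the standard library; handle_issue and check_ref_col are hypothetical names introduced for illustration and are not fusionlab functions, so the actual messages and internals may differ.

```python
import warnings


def handle_issue(message: str, error: str = "raise", exc=ValueError) -> None:
    """Hypothetical helper: apply a 'raise' / 'warn' / 'ignore' policy."""
    if error == "raise":
        raise exc(message)
    if error == "warn":
        warnings.warn(message)
    # error == "ignore": do nothing


def check_ref_col(columns, ref_col: str, error: str = "raise") -> bool:
    """Return True if `ref_col` is present; otherwise apply the policy."""
    if ref_col in columns:
        return True
    handle_issue(f"Column {ref_col!r} not found.", error, exc=KeyError)
    return False
```

For instance, check_ref_col(["A", "B"], "Z") raises a KeyError, while the same call with error="warn" emits a warning and returns False.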
Notes
- If values is None, all rows are masked in the non-reference columns, effectively overwriting them with fill_value.
- When find_closest=True, approximate matching is performed only if the reference column is numeric. For non-numeric data, it falls back to exact matching.
- When multiple reference values are provided, each is processed in turn. If fill_value='auto', each matched row is filled with that specific reference value.
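The matching rules in these notes can be sketched with plain pandas boolean masks. The helper match_rows below is hypothetical and not part of fusionlab. It assumes that ties in distance are all selected, which is consistent with the documented example where 9 matches both 8 and 10, but the library's actual tie-breaking is not confirmed here.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def match_rows(ref: pd.Series, value, find_closest: bool = False) -> pd.Series:
    """Return a boolean mask of the rows of `ref` that match `value`.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if find_closest and is_numeric_dtype(ref):
        # Closest match: keep every row at the minimum absolute distance,
        # so ties (e.g. 8 and 10 around 9) are all selected (assumption).
        dist = (ref - value).abs()
        return dist == dist.min()
    # Exact match, also used as the fallback for non-numeric columns.
    return ref == value


df = pd.DataFrame({"A": [10, 0, 8, 0], "B": [2, 0.5, 18, 85]})

exact = match_rows(df["A"], 0)                        # rows where A == 0
closest = match_rows(df["A"], 9, find_closest=True)   # rows nearest to 9
labels = match_rows(pd.Series(["a", "b"]), "a", find_closest=True)  # falls back to exact
```

The resulting masks can then be passed to DataFrame.loc to overwrite the non-reference columns.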
Examples
>>> import pandas as pd
>>> from fusionlab.utils.data_utils import mask_by_reference
>>>
>>> df = pd.DataFrame({
...     "A": [10, 0, 8, 0],
...     "B": [2, 0.5, 18, 85],
...     "C": [34, 0.8, 12, 4.5],
...     "D": [0, 78, 25, 3.2]
... })
>>>
>>> # Example 1: Exact matching, replace all columns except 'A' with 0
>>> masked_df = mask_by_reference(
...     data=df,
...     ref_col="A",
...     values=0,
...     fill_value=0,
...     find_closest=False,
...     error="raise"
... )
>>> print(masked_df)
>>> # 'B', 'C', 'D' for rows where A=0 are replaced with 0.
>>>
>>> # Example 2: Approximate matching for a numeric ref_col
>>> # 'A' holds the values 0, 8 and 10; 9 lies between 8 and 10, so the
>>> # rows with A=8 and A=10 are the closest and get masked in the
>>> # non-reference columns.
>>> masked_df2 = mask_by_reference(
...     data=df,
...     ref_col="A",
...     values=9,
...     find_closest=True,
...     fill_value=-999
... )
>>> print(masked_df2)
>>> # Rows 0 (A=10) and 2 (A=8) are replaced with -999 in columns B, C, D
>>>
>>> # Example 3: fill_value='auto' with multiple values
>>> # Rows matching A=0 => fill with 0; rows matching A=8 => fill with 8
>>> res3 = mask_by_reference(df, "A", [0, 8], fill_value='auto')
>>> print(res3)
>>> # rows with A=0 => B, C, D replaced by 0
>>> # rows with A=8 => B, C, D replaced by 8
>>>
>>> # Example 4: mask_columns=['C', 'D'] => only columns C and D are masked
>>> res4 = mask_by_reference(df, "A", values=0, fill_value=999,
...                          mask_columns=["C", "D"])
>>> print(res4)
>>> # Rows where A=0 => columns C, D replaced by 999, while B remains unchanged
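For readers who want to check the fill_value='auto' and mask_columns cases without installing the package, the same effects can be approximated with plain pandas indexing. This is a sketch of the documented behavior on the example data, not the function's actual code.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [10, 0, 8, 0],
    "B": [2, 0.5, 18, 85],
    "C": [34, 0.8, 12, 4.5],
    "D": [0, 78, 25, 3.2],
})

# fill_value='auto': each matched row is filled with its own reference value.
auto = df.copy()
targets = [c for c in auto.columns if c != "A"]
for value in [0, 8]:
    rows = auto["A"] == value
    auto.loc[rows, targets] = value

# mask_columns=['C', 'D']: only the listed columns are overwritten.
partial = df.copy()
partial.loc[partial["A"] == 0, ["C", "D"]] = 999
```

In auto, the rows with A=0 hold 0 in B, C and D and the row with A=8 holds 8; in partial, column B keeps its original values.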