holisticai.security.mitigation.Anonymize

class holisticai.security.mitigation.Anonymize(k: int, quasi_identifiers: ndarray | list, features: list | None = None, features_names: list | None = None, quasi_identifer_slices: list | None = None, categorical_features: list | None = None, is_regression: bool | None = False, train_only_QI: bool | None = False)

Class for performing tailored, model-guided anonymization of training datasets for ML models.

Parameters

k: int

The privacy parameter: the number of records that will be indistinguishable from each other when looking only at the quasi-identifiers. Should be at least 2.
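To make the meaning of k concrete, the sketch below checks whether a small table is k-anonymous, that is, whether every combination of quasi-identifier values is shared by at least k records. It is an illustration only: the helper names `smallest_group_size` and `is_k_anonymous` and the sample records are hypothetical and are not part of holisticai.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records that share the same
    values on all quasi-identifier fields (0 for an empty table)."""
    groups = Counter(
        tuple(record[name] for name in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Hypothetical, already-generalized records: "age" and "zip" are the
# quasi-identifiers, "diagnosis" is left untouched.
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "946**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "946**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "946**", "diagnosis": "flu"},
]

qi = ["age", "zip"]
group_size = smallest_group_size(records, qi)
satisfies_k2 = is_k_anonymous(records, qi, k=2)
satisfies_k3 = is_k_anonymous(records, qi, k=3)
```

Here the smallest group holds two records, so the table satisfies k = 2 but not k = 3.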

quasi_identifiers: np.ndarray or list of strings or integers

The features that need to be minimized, given as feature names for pandas data or as feature indexes for NumPy data.

quasi_identifer_slices: list of lists of strings or integers

If some of the quasi-identifiers represent one-hot encoded features that need to remain consistent after anonymization, provide a list of lists, where each inner list holds the column names or indexes that together represent a single feature.
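As an illustration of what "consistent" means for one-hot encoded quasi-identifiers, the sketch below groups column indexes into slices and checks that each slice has exactly one active column in every row. The helper `slices_are_consistent`, the column layout, and the sample rows are assumptions made for this example; they are not part of holisticai.

```python
def slices_are_consistent(rows, slices):
    """True if, in every row, each slice has exactly one column equal to 1
    and all of its other columns equal to 0."""
    for row in rows:
        for columns in slices:
            values = sorted(row[c] for c in columns)
            expected = [0] * (len(columns) - 1) + [1]
            if values != expected:
                return False
    return True


# Hypothetical NumPy-style layout (column indexes):
#   0      -> age
#   1, 2, 3 -> one-hot encoded marital status
#   4, 5    -> one-hot encoded sex
slices = [[1, 2, 3], [4, 5]]

rows_ok = [
    [34, 1, 0, 0, 0, 1],
    [51, 0, 0, 1, 1, 0],
]
rows_bad = [
    [34, 1, 1, 0, 0, 1],  # two marital-status columns are set at once
]

ok_result = slices_are_consistent(rows_ok, slices)
bad_result = slices_are_consistent(rows_bad, slices)
```

A list shaped like `slices` is the kind of value the parameter description above calls for: one inner list per original feature.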

categorical_features: list

The list of categorical features (if supplied, these features will be one-hot encoded before using them to train the decision tree model).

is_regression: bool

Whether the model is a regression model or not (if False, assumes a classification model). Default is False.

train_only_QI: bool

Whether to train the decision tree used for anonymization on the quasi-identifiers only. Default is False, which trains the tree on all features.

References

Goldsteen, Abigail, Gilad Ezov, Ron Shmelkin, Micha Moffie, and Ariel Farkash. “Anonymizing machine learning models.” In International Workshop on Data Privacy Management, pp. 121-136. Cham: Springer International Publishing, 2021.

anonymize(X_train, y_train)

Description

Method for performing model-guided anonymization.

Parameters

X_train: np.ndarray or pandas DataFrame

Dataset containing the training data for the model.

y_train: np.ndarray

The predictions of the original model on the training data.

Returns

The anonymized training dataset as a pandas DataFrame.