Anonymization Mitigator#

Definition#

Anonymization is a privacy-preserving technique designed to protect sensitive information in datasets by making individual records indistinguishable from at least (k-1) other records. This ensures that even if a dataset is compromised, identifying specific individuals remains infeasible. The primary goal of anonymization is to balance data utility and privacy by transforming the data in a way that preserves its analytical value while safeguarding privacy.

Methodology#

The anonymization process involves the following key steps:

Identify Quasi-Identifiers (QI): Determine which attributes in the dataset can be used to identify individuals when combined with external data sources.
Train Initial Model: Train a machine learning model on the original dataset to capture its patterns and relationships.
Generate Predictions: Use the trained model to generate predictions for the training data. These predictions guide the anonymization process.
Train Anonymization Model: Train a decision tree model using the QI and the predictions from the initial model. The decision tree helps create groups of records that are similar in terms of their QI values and model predictions.
Group and Generalize: Group records in the leaf nodes of the decision tree, ensuring each group has at least (k) records. Replace the original values in each group with representative values to achieve k-anonymity.

Algorithm#

Initial Model Training: Train a machine learning model ( M ) on the original dataset ( D ).
Prediction Generation: Generate predictions ( hat{y} ) using ( M ) on ( D ).
Anonymization Model Training: Train a decision tree model ( T ) using ( QI ) and ( hat{y} ).
Record Grouping: Group records based on the leaf nodes of ( T ).
Generalization: Replace original values with representative values in each group.

Mathematically, the process can be summarized as follows:

\[\text{Train } M: D \rightarrow \hat{y}\]

\[\text{Train } T: (QI, \hat{y}) \rightarrow \text{Groups}\]

\[\text{Generalize}: \text{Groups} \rightarrow D_{\text{anonymized}}\]

Reference#

For more details, see the paper Anonymizing Machine Learning Models.