Quickstart#
This quickstart guide will show you how to get started with the holisticai library.
Install holisticai using pip:
[1]:
#!pip install -q holisticai
[2]:
# skip warnings
import warnings
warnings.filterwarnings("ignore")
Load an example dataset and split
[4]:
from holisticai.datasets import load_dataset
dataset = load_dataset('law_school')
dataset_split = dataset.train_test_split(test_size=0.3)
# some stats about the data
dataset['x'].describe()
[4]:
| age | decile1 | decile3 | fam_inc | lsat | ugpa | |
|---|---|---|---|---|---|---|
| count | 20800.000000 | 20800.000000 | 20800.000000 | 20800.000000 | 20800.000000 | 20800.000000 |
| mean | 59.112115 | 5.746779 | 5.565096 | 3.465625 | 36.762808 | 3.226500 |
| std | 5.271906 | 2.780279 | 2.857598 | 0.853842 | 5.386819 | 0.412616 |
| min | 3.000000 | 1.000000 | 1.000000 | 1.000000 | 11.000000 | 0.000000 |
| 25% | 58.000000 | 3.000000 | 3.000000 | 3.000000 | 33.000000 | 3.000000 |
| 50% | 61.000000 | 6.000000 | 6.000000 | 4.000000 | 37.000000 | 3.300000 |
| 75% | 62.000000 | 8.000000 | 8.000000 | 4.000000 | 41.000000 | 3.500000 |
| max | 69.000000 | 10.000000 | 10.000000 | 5.000000 | 48.000000 | 4.000000 |
[5]:
dataset['y'].value_counts()
[5]:
bar
1 18507
0 2293
Name: count, dtype: int64
Dataset clustering and correlation analysis
[6]:
import seaborn as sns
g = sns.clustermap(dataset['x'].corr(), center=0, cmap="seismic", dendrogram_ratio=(.1, .2), cbar_pos=(.02, .32, .03, .2),
linewidths=.75, figsize=(5, 5))
g.ax_row_dendrogram.remove()
Separate the data into train and test sets
[7]:
train_data = dataset_split['train']
test_data = dataset_split['test']
print(train_data['x'].shape)
print(test_data['x'].shape)
(14560, 8)
(6240, 8)
Rescale the training and testing data
[8]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_t = scaler.fit_transform(train_data['x'])
X_test_t = scaler.transform(test_data['x'])
Training a logistic regression model and compute predictions
[9]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42, max_iter=500)
model.fit(X_train_t, train_data['y'])
# make predictions
y_pred = model.predict(X_test_t)
A simple classification report for model’s predictions
[10]:
from sklearn.metrics import classification_report
print(classification_report(test_data['y'], y_pred))
precision recall f1-score support
0 0.60 0.22 0.33 707
1 0.91 0.98 0.94 5533
accuracy 0.90 6240
macro avg 0.75 0.60 0.64 6240
weighted avg 0.87 0.90 0.87 6240
Classification bias metrics for the model’s predictions
[11]:
from holisticai.metrics.bias import classification_bias_metrics
metrics = classification_bias_metrics(
group_a = test_data['group_a'],
group_b = test_data['group_b'],
y_true = test_data['y'],
y_pred = y_pred
)
metrics
[11]:
| Value | Reference | |
|---|---|---|
| Metric | ||
| Statistical Parity | 0.172941 | 0 |
| Disparate Impact | 1.213069 | 1 |
| Four Fifths Rule | 0.824356 | 1 |
| Cohen D | 0.902750 | 0 |
| 2SD Rule | 24.618566 | 0 |
| Equality of Opportunity Difference | 0.085044 | 0 |
| False Positive Rate Difference | 0.313732 | 0 |
| Average Odds Difference | 0.199388 | 0 |
| Accuracy Difference | 0.164560 | 0 |
Plot the bias metrics in a simple plot
[12]:
from holisticai.plots.bias import bias_metrics_report
bias_metrics_report(model_type='binary_classification', table_metrics=metrics)