Datasets
In this section, we provide an overview of the datasets used in our projects. The datasets are categorized into two groups: Raw Datasets and Processed Datasets.
Raw Datasets
Raw datasets are the initial data collected from the original sources. These datasets are typically unprocessed and require significant preparation before they can be used for machine learning tasks. The table below provides a summary of the raw datasets.
| Dataset | Method | Description |
|---|---|---|
| Student | load_hai_dataset("student") | Student performance dataset, used to predict students' grades based on various features. |
| Adult | load_hai_dataset("adult") | Adult dataset, used for income classification tasks based on census data. |
| Law School | load_hai_dataset("law_school") | Law school admission dataset, used to predict law school admission based on various features. |
| Last FM | load_hai_dataset("lastfm") | Last FM dataset, used for music recommendation tasks based on user listening history. |
| US Crime | load_hai_dataset("us_crime") | US Crime dataset, used to analyze and predict crime rates based on various socio-economic factors. |
| Clinical Records | load_hai_dataset("clinical_records") | Heart failure clinical records dataset, used to predict heart failure events based on clinical features. |
| German Credit | load_hai_dataset("german_credit") | German Credit dataset, used for credit risk classification based on financial and demographic factors. |
| Census KDD | load_hai_dataset("census_kdd") | Census KDD dataset, used for income prediction based on detailed census data. |
| Bank Marketing | load_hai_dataset("bank_marketing") | Bank Marketing dataset, used to predict whether a client will subscribe to a term deposit based on direct marketing campaigns. |
| Compas | load_hai_dataset("compas_two_year_recid") | COMPAS dataset, used for recidivism prediction and analyzing bias in criminal justice algorithms. |
| Compas | load_hai_dataset("compas_is_recid") | COMPAS dataset, used for recidivism prediction and analyzing bias in criminal justice algorithms. |
| Diabetes | load_hai_dataset("diabetes") | Diabetes dataset, used for predicting the onset of diabetes based on health features. |
| ACS Income | load_hai_dataset("acsincome") | American Community Survey Income dataset, used to predict income levels based on demographic and socioeconomic data. |
| ACS Public Coverage | load_hai_dataset("acspublic") | American Community Survey Public Coverage dataset, used to analyze public health insurance coverage based on demographic data. |
For example, if we want to load the Adult dataset, we can use the following code:
from holisticai.datasets import load_hai_datasets
data, target = load_hai_datasets(dataset_name="adult")
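Because the loader is driven by a string name, a typo only surfaces when the call runs. The following is a minimal sketch of a wrapper that checks the name against the raw dataset names listed in the table above before loading. The helper `load_raw_dataset` and the constant `RAW_DATASET_NAMES` are hypothetical names introduced here for illustration; the sketch assumes the loader is importable as `load_hai_datasets` with a `dataset_name` keyword, as in the example above.

```python
# Raw dataset names copied from the table above.
RAW_DATASET_NAMES = frozenset({
    "student", "adult", "law_school", "lastfm", "us_crime",
    "clinical_records", "german_credit", "census_kdd", "bank_marketing",
    "compas_two_year_recid", "compas_is_recid", "diabetes",
    "acsincome", "acspublic",
})


def load_raw_dataset(name):
    """Hypothetical helper: validate the name, then delegate to the loader."""
    if name not in RAW_DATASET_NAMES:
        valid = ", ".join(sorted(RAW_DATASET_NAMES))
        raise ValueError(f"Unknown raw dataset {name!r}; expected one of: {valid}")
    # Imported lazily so the name check works even before holisticai is installed.
    # Assumption: the loader name and keyword match the example above.
    from holisticai.datasets import load_hai_datasets
    return load_hai_datasets(dataset_name=name)


# Usage (requires holisticai to be installed):
# data, target = load_raw_dataset("adult")
```

Keeping the import inside the function is a design choice for this sketch only; in a project that always has holisticai installed, a top-level import would be more conventional.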
Processed Datasets
Processed datasets are refined and structured for specific machine learning tasks and address technical concerns such as bias, efficacy, and explainability. These datasets are encapsulated within datasets_objects, which contain variables that are ready for the machine learning process. The table below provides details on these processed datasets. The function load_dataset in dataset_loading_functions allows us to load the processed datasets. The function receives the following parameters:
Protected Attribute: The attribute that is considered sensitive and should be protected from bias.
Processed: The method used to process X and y, normally categorical-to-numerical encoding, normalization, and standardization.
Target: If the dataset has more than one target, the primary target is specified here.
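To make the processing step concrete, here is a minimal pure-Python sketch of the three transformations named above: categorical-to-numerical encoding, normalization, and standardization. It is an illustration on made-up toy records, not the library's implementation; the function names, column names, and values are all assumptions introduced for this example.

```python
from statistics import mean, pstdev

# Toy records; column names and values are made up for illustration.
rows = [
    {"sex": "Male", "age": 25, "hours": 40},
    {"sex": "Female", "age": 35, "hours": 50},
    {"sex": "Female", "age": 45, "hours": 30},
]


def encode_categorical(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    codes = {category: index for index, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values], codes


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


sex_codes, sex_mapping = encode_categorical([row["sex"] for row in rows])
age_normalized = min_max_normalize([row["age"] for row in rows])
hours_standardized = standardize([row["hours"] for row in rows])

print(sex_mapping, sex_codes)
print(age_normalized)
print(hours_standardized)
```

Which of these transformations is applied to which column in a given processed dataset is not specified here, so treat the sketch only as a picture of what each step does.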
| dataset_name | Dataset | Learning Task | protected_attribute |
|---|---|---|---|
| adult | Adult | Binary Classification | race / sex |
| clinical_records | Heart Clinical Records | Binary Classification | sex |
| law_school | Law School | Binary Classification | race / gender |
| lastfm | Last FM | Recommender | |
| student | Student | Binary Classification | sex / address |
| student_multiclass | Student (Multiclass) | Multiclassification | sex / address |
| us_crime | US Crime | Regression | race |
| us_crime_multiclass | US Crime (Multiclass) | Regression | race |
| german_credit | German Credit | Binary Classification | sex |
| census_kdd | Census KDD | Binary Classification | sex / race |
| bank_marketing | Bank Marketing | Binary Classification | marital |
| compas_two_year_recid | Compas (two_year_recid) | Binary Classification | sex / race |
| compas_is_recid | Compas (is_recid) | Binary Classification | sex / race |
| diabetes | Diabetes | Binary Classification | sex / race |
| acsincome | ACS Income | Binary Classification | sex / race |
| acspublic | ACS Public Coverage | Binary Classification | sex / race |
You can load these processed datasets with the function load_dataset from holisticai.datasets. For example, to load the processed version of the Adult dataset and use the protected attribute sex, represented by group_a and group_b, we can use the following code:
from holisticai.datasets import load_dataset
dataset = load_dataset(dataset_name="adult", preprocessed=True, protected_attribute="sex")
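The idea of representing a protected attribute by group_a and group_b can be sketched without the library as two complementary boolean masks over the rows. This is an illustrative sketch on made-up data: which attribute value the library assigns to group_a and which to group_b is not stated here, so the assignment below is an assumption, as is the helper `select`.

```python
# Toy protected attribute and target; values are made up for illustration.
sex = ["Male", "Female", "Female", "Male", "Female"]
y = [1, 0, 1, 1, 0]

# Assumption: group_a flags one attribute value and group_b is its complement.
group_a = [value == "Female" for value in sex]
group_b = [not flag for flag in group_a]


def select(values, mask):
    """Keep only the entries whose mask position is True."""
    return [value for value, keep in zip(values, mask) if keep]


y_group_a = select(y, group_a)
y_group_b = select(y, group_b)

print(len(y_group_a), len(y_group_b))
```

With masks of this shape, any per-group quantity can be computed by selecting rows with one mask and then the other and comparing the results.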