Datasets

In this section, we provide an overview of the datasets used in our projects. The datasets are categorized into two groups: Raw Datasets and Processed Datasets.

Raw Datasets

Raw datasets are the initial data collected from the original sources. These datasets are typically unprocessed and require significant preparation before they can be used for machine learning tasks. The table below provides a summary of the raw datasets.

Table 1 Raw Datasets

| Dataset | Method | Description |
|---|---|---|
| Student | load_hai_datasets("student") | Student performance dataset, used to predict students' grades based on various features. |
| Adult | load_hai_datasets("adult") | Adult dataset, used for income classification tasks based on census data. |
| Law School | load_hai_datasets("law_school") | Law school admission dataset, used to predict law school admission based on various features. |
| Last FM | load_hai_datasets("lastfm") | Last FM dataset, used for music recommendation tasks based on user listening history. |
| US Crime | load_hai_datasets("us_crime") | US Crime dataset, used to analyze and predict crime rates based on various socio-economic factors. |
| Clinical Records | load_hai_datasets("clinical_records") | Heart failure clinical records dataset, used to predict heart failure events based on clinical features. |
| German Credit | load_hai_datasets("german_credit") | German Credit dataset, used for credit risk classification based on financial and demographic factors. |
| Census KDD | load_hai_datasets("census_kdd") | Census KDD dataset, used for income prediction based on detailed census data. |
| Bank Marketing | load_hai_datasets("bank_marketing") | Bank Marketing dataset, used to predict whether a client will subscribe to a term deposit based on direct marketing campaigns. |
| Compas (two_year_recid) | load_hai_datasets("compas_two_year_recid") | COMPAS dataset with target two_year_recid, used for recidivism prediction and analyzing bias in criminal justice algorithms. |
| Compas (is_recid) | load_hai_datasets("compas_is_recid") | COMPAS dataset with target is_recid, used for recidivism prediction and analyzing bias in criminal justice algorithms. |
| Diabetes | load_hai_datasets("diabetes") | Diabetes dataset, used for predicting the onset of diabetes based on health features. |
| ACS Income | load_hai_datasets("acsincome") | American Community Survey Income dataset, used to predict income levels based on demographic and socioeconomic data. |
| ACS Public Coverage | load_hai_datasets("acspublic") | American Community Survey Public Coverage dataset, used to analyze public health insurance coverage based on demographic data. |

For example, if we want to load the Adult dataset, we can use the following code:

from holisticai.datasets import load_hai_datasets

data, target = load_hai_datasets(dataset_name="adult")

Processed Datasets

Processed datasets are refined and structured for specific machine learning tasks and address technical concerns such as bias, efficacy, and explainability. These datasets are encapsulated within datasets_objects, which contain variables that are ready for the machine learning process. The table below provides details on these processed datasets. The function load_dataset in dataset_loading_functions allows us to load the processed datasets. Besides dataset_name, the function receives the following parameters:

  • Protected Attribute (protected_attribute): The attribute that is considered sensitive and should be protected from bias.

  • Processed (preprocessed): Whether X and y are preprocessed, typically with categorical-to-numerical encoding, normalization, and standardization.

  • Target: If the dataset has more than one target, the primary target is specified here.
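To make the preprocessing steps named in the list above more concrete, here is a minimal sketch of categorical-to-numerical encoding, normalization, and standardization in plain Python. It is an illustration only: the helper names encode_categories, min_max_normalize, and standardize are hypothetical and are not part of holisticai, and the library's own preprocessing may differ in its details.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def encode_categories(values: Sequence[str]) -> List[int]:
    """Map each distinct category to an integer code (sorted for determinism)."""
    codes: Dict[str, int] = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy columns, invented for illustration only.
encoded = encode_categories(["b", "a", "b", "c"])
scaled = min_max_normalize([10, 20, 30])
z_scores = standardize([1, 2, 3])
```

Normalization and standardization are usually alternatives for a given column: the first bounds the values, the second centers them.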

Table 2 Processed Datasets

| dataset_name | Dataset | Learning Task | protected_attribute |
|---|---|---|---|
| adult | Adult | Binary Classification | race / sex |
| clinical_records | Heart Clinical Records | Binary Classification | sex |
| law_school | Law School | Binary Classification | race / gender |
| lastfm | Last FM | Recommender | |
| student | Student | Binary Classification | sex / address |
| student_multiclass | Student (Multiclass) | Multiclass Classification | sex / address |
| us_crime | US Crime | Regression | race |
| us_crime_multiclass | US Crime (Multiclass) | Multiclass Classification | race |
| german_credit | German Credit | Binary Classification | sex |
| census_kdd | Census KDD | Binary Classification | sex / race |
| bank_marketing | Bank Marketing | Binary Classification | marital |
| compas_two_year_recid | Compas (two_year_recid) | Binary Classification | sex / race |
| compas_is_recid | Compas (is_recid) | Binary Classification | sex / race |
| diabetes | Diabetes | Binary Classification | sex / race |
| acsincome | ACS Income | Binary Classification | sex / race |
| acspublic | ACS Public Coverage | Binary Classification | sex / race |
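Since each dataset in Table 2 accepts only certain protected attributes, it can help to validate a combination before loading. The following sketch transcribes the dataset_name and protected_attribute columns of Table 2 into a dictionary. The names PROTECTED_ATTRIBUTES and check_protected_attribute are hypothetical helpers written for this example, not part of holisticai, and how the library itself reacts to an unsupported attribute is not covered here.

```python
from typing import Dict, Tuple

# dataset_name -> protected attributes listed in Table 2 (empty when none is listed).
PROTECTED_ATTRIBUTES: Dict[str, Tuple[str, ...]] = {
    "adult": ("race", "sex"),
    "clinical_records": ("sex",),
    "law_school": ("race", "gender"),
    "lastfm": (),
    "student": ("sex", "address"),
    "student_multiclass": ("sex", "address"),
    "us_crime": ("race",),
    "us_crime_multiclass": ("race",),
    "german_credit": ("sex",),
    "census_kdd": ("sex", "race"),
    "bank_marketing": ("marital",),
    "compas_two_year_recid": ("sex", "race"),
    "compas_is_recid": ("sex", "race"),
    "diabetes": ("sex", "race"),
    "acsincome": ("sex", "race"),
    "acspublic": ("sex", "race"),
}


def check_protected_attribute(dataset_name: str, protected_attribute: str) -> None:
    """Raise if the combination does not appear in Table 2; return None otherwise."""
    if dataset_name not in PROTECTED_ATTRIBUTES:
        raise KeyError(f"Unknown dataset_name: {dataset_name!r}")
    allowed = PROTECTED_ATTRIBUTES[dataset_name]
    if protected_attribute not in allowed:
        raise ValueError(
            f"{protected_attribute!r} is not listed for {dataset_name!r}; "
            f"listed attributes: {list(allowed)}"
        )


check_protected_attribute("adult", "sex")  # listed in Table 2, so no exception
```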

You can load these processed datasets with the function load_dataset from holisticai.datasets. For example, to load the processed version of the Adult dataset with sex as the protected attribute, represented by group_a and group_b, we can use the following code:

from holisticai.datasets import load_dataset

dataset = load_dataset(dataset_name="adult", preprocessed=True, protected_attribute="sex")
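To illustrate what a group_a / group_b representation of a protected attribute can look like, the sketch below builds two complementary boolean masks from a toy column and compares the positive-label rate in each group. This is an assumption-laden illustration: split_groups and positive_rate are hypothetical helpers, the data is invented, and the exact structure returned by load_dataset (and which value it assigns to group_a) may differ.

```python
from typing import List, Sequence, Tuple


def split_groups(
    protected: Sequence[str], group_a_value: str
) -> Tuple[List[bool], List[bool]]:
    """Return two masks: group_a marks rows equal to group_a_value, group_b the rest."""
    group_a = [value == group_a_value for value in protected]
    group_b = [not flag for flag in group_a]
    return group_a, group_b


def positive_rate(y: Sequence[int], mask: Sequence[bool]) -> float:
    """Share of positive labels (1) among the rows selected by mask."""
    selected = [label for label, keep in zip(y, mask) if keep]
    if not selected:
        raise ValueError("mask selects no rows")
    return sum(selected) / len(selected)


# Toy values, invented for illustration; not taken from the Adult data.
sex = ["Female", "Male", "Male", "Female", "Male"]
y = [0, 1, 1, 1, 0]

group_a, group_b = split_groups(sex, group_a_value="Female")
rate_a = positive_rate(y, group_a)  # rate over the "Female" rows
rate_b = positive_rate(y, group_b)  # rate over the remaining rows
```

Comparing rate_a and rate_b is the kind of group-wise check that bias metrics build on.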