Datasets

In this section, we provide an overview of the datasets used in our projects. The datasets are categorized into two groups: Raw Datasets and Processed Datasets.

Raw Datasets

Raw datasets are the initial data collected from the original sources. These datasets are typically unprocessed and require significant preparation before they can be used for machine learning tasks. The table below provides a summary of the raw datasets.

Table 1 Raw Datasets
Dataset	Method	Description
Student	load_hai_datasets("student")	Student performance dataset, used to predict students' grades based on various features.
Adult	load_hai_datasets("adult")	Adult dataset, used for income classification tasks based on census data.
Law School	load_hai_datasets("law_school")	Law school admission dataset, used to predict law school admission based on various features.
Last FM	load_hai_datasets("lastfm")	Last FM dataset, used for music recommendation tasks based on user listening history.
US Crime	load_hai_datasets("us_crime")	US Crime dataset, used to analyze and predict crime rates based on various socio-economic factors.
Clinical Records	load_hai_datasets("clinical_records")	Heart failure clinical records dataset, used to predict heart failure events based on clinical features.
German Credit	load_hai_datasets("german_credit")	German Credit dataset, used for credit risk classification based on financial and demographic factors.
Census KDD	load_hai_datasets("census_kdd")	Census KDD dataset, used for income prediction based on detailed census data.
Bank Marketing	load_hai_datasets("bank_marketing")	Bank Marketing dataset, used to predict whether a client will subscribe to a term deposit based on direct marketing campaigns.
Compas (two_year_recid)	load_hai_datasets("compas_two_year_recid")	COMPAS dataset (two_year_recid variant), used for recidivism prediction and analyzing bias in criminal justice algorithms.
Compas (is_recid)	load_hai_datasets("compas_is_recid")	COMPAS dataset (is_recid variant), used for recidivism prediction and analyzing bias in criminal justice algorithms.
Diabetes	load_hai_datasets("diabetes")	Diabetes dataset, used for predicting the onset of diabetes based on health features.
ACS Income	load_hai_datasets("acsincome")	American Community Survey Income dataset, used to predict income levels based on demographic and socioeconomic data.
ACS Public Coverage	load_hai_datasets("acspublic")	American Community Survey Public Coverage dataset, used to analyze public health insurance coverage based on demographic data.

For example, if we want to load the Adult dataset, we can use the following code:

from holisticai.datasets import load_hai_datasets

data, target = load_hai_datasets(dataset_name="adult")
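Because the loader is driven by a string name, a typo only surfaces when the call fails. As a minimal sketch, the names listed in Table 1 can be checked up front. The helper check_dataset_name and the constant RAW_DATASET_NAMES are hypothetical names introduced for this illustration; they are not part of holisticai.

```python
from difflib import get_close_matches

# Names copied from the Method column of Table 1.
RAW_DATASET_NAMES = (
    "student", "adult", "law_school", "lastfm", "us_crime",
    "clinical_records", "german_credit", "census_kdd", "bank_marketing",
    "compas_two_year_recid", "compas_is_recid", "diabetes",
    "acsincome", "acspublic",
)


def check_dataset_name(name: str) -> str:
    """Return name unchanged if it is listed in Table 1, else raise ValueError.

    Hypothetical helper for illustration; not a holisticai function.
    """
    if name in RAW_DATASET_NAMES:
        return name
    # Suggest up to three listed names that are spelled similarly.
    suggestions = get_close_matches(name, RAW_DATASET_NAMES, n=3)
    hint = f" Did you mean: {', '.join(suggestions)}?" if suggestions else ""
    raise ValueError(f"Unknown dataset name {name!r}.{hint}")
```

The validated name can then be passed on as the dataset_name argument of the loader.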

Processed Datasets

Processed datasets are refined and structured for specific machine learning tasks and address technical concerns such as bias, efficacy, and explainability. Each one is encapsulated in a dataset object whose variables are ready for the machine learning process. The table below provides details on these processed datasets. The function load_dataset in dataset_loading_functions allows us to load the processed datasets. The function receives the following parameters:

Protected Attribute: The attribute that is considered sensitive and should be protected from bias.
Processed: The method used to process X and y. This typically involves categorical-to-numerical encoding, normalization, and standardization.
Target: If the dataset has more than one target, the primary target is specified here.
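To make the processing step more concrete, the following is a minimal, standard-library sketch of two of the transformations named above: categorical-to-numerical (one-hot) encoding and standardization. It is an illustration only and not the implementation used by load_dataset. The function names one_hot and standardize and the toy values are assumptions made for this example.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode category labels as 0/1 columns, one column per sorted label."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy column values, invented for illustration only.
encoded, categories = one_hot(["rural", "urban", "urban"])
scaled = standardize([10.0, 20.0, 30.0])
```

Here encoded holds one row per input value with a single 1 in the column of its category, and scaled is centered on zero.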

Table 2 Processed Datasets
dataset_name	Dataset	Learning Task	protected_attribute
adult	Adult	Binary Classification	race / sex
clinical_records	Heart Clinical Records	Binary Classification	sex
law_school	Law School	Binary Classification	race / gender
lastfm	Last FM	Recommender
student	Student	Binary Classification	sex / address
student_multiclass	Student (Multiclass)	Multiclass Classification	sex / address
us_crime	US Crime	Regression	race
us_crime_multiclass	US Crime (Multiclass)	Multiclass Classification	race
german_credit	German Credit	Binary Classification	sex
census_kdd	Census KDD	Binary Classification	sex / race
bank_marketing	Bank Marketing	Binary Classification	marital
compas_two_year_recid	Compas (two_year_recid)	Binary Classification	sex / race
compas_is_recid	Compas (is_recid)	Binary Classification	sex / race
diabetes	Diabetes	Binary Classification	sex / race
acsincome	ACS Income	Binary Classification	sex / race
acspublic	ACS Public Coverage	Binary Classification	sex / race

You can load these processed datasets with the function load_dataset from holisticai.datasets. For example, to load the processed version of the Adult dataset with sex as the protected attribute, represented by group_a and group_b, we can use the following code:

from holisticai.datasets import load_dataset

dataset = load_dataset(dataset_name="adult", preprocessed=True, protected_attribute="sex")
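The idea of representing a protected attribute as group_a and group_b can be sketched as a pair of complementary boolean masks over the rows. This is a standalone illustration under stated assumptions, not the library's internal code: the helper split_groups is hypothetical, the values are invented, and which value is assigned to group_a is an arbitrary choice here.

```python
def split_groups(values, group_a_value):
    """Return two boolean masks: rows equal to group_a_value, and all other rows."""
    group_a = [v == group_a_value for v in values]
    group_b = [not flag for flag in group_a]
    return group_a, group_b


# Toy values, invented for illustration; choosing "Female" as group_a is arbitrary.
sex = ["Male", "Female", "Female", "Male"]
group_a, group_b = split_groups(sex, group_a_value="Female")
```

Every row belongs to exactly one of the two masks, which is what lets a metric compare outcomes between the two groups.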
