holisticai.datasets.Dataset

class holisticai.datasets.Dataset(_data: DataFrame | None = None, _metadata=None, **kargs)

Represents a tabular dataset backed by a pandas DataFrame.

Parameters

data: pd.DataFrame

The underlying data of the dataset.

features: list[str]

The list of features in the dataset.

num_rows: int

The number of rows in the dataset.

random_state: np.random.RandomState

The random state used for sampling.

filter(fn)

Returns a new dataset with rows filtered based on the given function.
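As a rough illustration of these semantics (a new object is returned and the original is left untouched), the sketch below uses plain pandas rather than `Dataset` itself. The predicate `is_adult` and the column names are hypothetical, and the assumption that `fn` is evaluated once per row is not confirmed by this page.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 25, 40], "group": ["a", "b", "a"]})

def is_adult(row):
    # Hypothetical row-wise predicate: True means "keep this row".
    return row["age"] >= 18

# Evaluate the predicate on every row to build a boolean mask,
# then index with the mask to obtain a new frame.
mask = df.apply(is_adult, axis=1)
filtered = df[mask].reset_index(drop=True)
```

Here `df` still holds all three rows, while `filtered` holds only the rows for which the predicate returned True.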

groupby(key: list[str] | str)

Returns a new GroupByDataset object based on the given key.
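The `key` argument accepts either one column name or a list of names. The pandas analogue below shows the difference between the two forms. It is only a sketch of the grouping idea, not of `GroupByDataset`, and the column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "b"],
        "sex": ["f", "f", "m", "m"],
        "score": [1.0, 2.0, 3.0, 4.0],
    }
)

# A single string groups by one column.
by_one = df.groupby("group")
group_sizes = by_one.size().to_dict()

# A list of strings groups by every distinct combination of the columns.
by_two = df.groupby(["group", "sex"])
n_pairs = by_two.ngroups
```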

map(fn, vectorized=True)

Applies a function to the dataset and returns a new dataset.

Parameters

fn: function

The function to apply to the dataset.

vectorized: bool

Whether to apply the function in a vectorized manner. Defaults to True.
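To make the two modes concrete, here is a minimal pandas sketch. It assumes that "vectorized" means `fn` receives whole columns at once, and that the non-vectorized mode calls `fn` once per row. That reading is an assumption, and `bmi` and the column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [150.0, 180.0], "weight_kg": [50.0, 81.0]})

def bmi(x):
    # Written so that x may be a single row or the whole frame.
    height_m = x["height_cm"] / 100
    return x["weight_kg"] / (height_m * height_m)

# Vectorized: one call, operating on entire columns.
vectorized = bmi(df)

# Row-wise: one call per row, usually much slower on large data.
row_wise = df.apply(bmi, axis=1)
```

Both calls produce the same values. The vectorized form avoids per-row Python overhead, but it only works when the function is written in terms of column operations.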

remove_columns(columns: str | list)

Returns a new dataset with the given columns removed.
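Since `columns` may be a single name or a list, the equivalent operation in plain pandas looks like the sketch below. It illustrates the intended effect only and uses hypothetical column names.

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# A single column name.
without_a = df.drop(columns="a")

# A list of column names.
without_bc = df.drop(columns=["b", "c"])
```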

rename(renames)

Returns a new dataset with renamed columns.
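The type of `renames` is not documented here. Assuming it is a mapping from old column names to new ones, the pandas equivalent would be roughly the following sketch, with hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({"a": [30, 41], "b": ["x", "y"]})

# Assumed shape of the argument: {old_name: new_name}.
renames = {"a": "age"}
renamed = df.rename(columns=renames)
```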

sample(n, random_state=None)

Returns a random sample of n rows from the dataset.
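The role of `random_state` can be illustrated with pandas' own `DataFrame.sample`, which accepts a parameter of the same name. This is an analogue, not the `Dataset` implementation: fixing the seed makes the draw reproducible.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Same seed, same rows; sampling is without replacement by default.
first = df.sample(n=3, random_state=0)
second = df.sample(n=3, random_state=0)
```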

select(indices: Iterable)

Returns a new dataset with selected rows based on the given indices.
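Because `indices` is typed as a general `Iterable`, a generator should be as acceptable as a list. The sketch below shows the idea in plain pandas, assuming the indices are row positions rather than index labels, which this page does not specify.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})

# Any iterable of positions works once it is materialized as a list.
indices = (i for i in range(len(df)) if i % 2 == 0)
selected = df.iloc[list(indices)].reset_index(drop=True)
```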

train_test_split(test_size=0.3, **kargs)

Splits the dataset into train and test datasets.
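A minimal sketch of a shuffled split follows, assuming `test_size` is the fraction of rows assigned to the test set. It uses NumPy and pandas directly. How `Dataset` performs the split internally, and which extra keyword arguments it forwards, is not stated on this page.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(10)})
test_size = 0.3  # assumed to be a fraction of the rows

# Shuffle row positions with a seeded generator for reproducibility.
rng = np.random.RandomState(0)
perm = rng.permutation(len(df))

# The first n_test shuffled positions form the test set, the rest the train set.
n_test = int(round(test_size * len(df)))
test = df.iloc[perm[:n_test]]
train = df.iloc[perm[n_test:]]
```

The two parts are disjoint and together cover every row of the original frame.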