fluke.data.datasets

This module contains the Datasets class for loading the supported datasets.

Classes

static class fluke.data.datasets.Datasets

get – Get a dataset by name initialized with the provided arguments.
MNIST – Load the MNIST dataset.
MNISTM – Load the MNIST-M dataset.
EMNIST – Load the Extended MNIST (EMNIST) dataset.
CIFAR10 – Load the CIFAR-10 dataset.
CIFAR100 – Load the CIFAR-100 dataset.
SVHN – Load the Street View House Numbers (SVHN) dataset.
CINIC10 – Load the CINIC-10 dataset.
FASHION_MNIST – Load the Fashion MNIST dataset.
TINY_IMAGENET – Load the Tiny-ImageNet dataset.
FEMNIST – Load the Federated EMNIST (FEMNIST) dataset.
SHAKESPEARE – Load the Federated Shakespeare dataset.
FCUBE – Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079.

class fluke.data.datasets.Datasets

Static class for loading datasets. Datasets are downloaded (if needed) into the path folder. The supported datasets are: MNIST, MNIST-M, SVHN, FEMNIST, EMNIST, CIFAR-10, CIFAR-100, Tiny-ImageNet, Shakespeare, Fashion MNIST, and CINIC-10. Every dataset except FEMNIST and Shakespeare can be transformed using the transforms argument. Each dataset is returned as a fluke.data.DataContainer object.

Important

onthefly_transforms are transformations that are applied on-the-fly to the data through the data loader. This is useful when the transformations are stochastic and should be applied at each iteration. These transformations cannot be configured through the configuration file.
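For instance, a stochastic augmentation pipeline can be passed through onthefly_transforms (a minimal sketch assuming torchvision is installed; the specific transforms are illustrative, not defaults of the library):

from torchvision import transforms

from fluke.data.datasets import Datasets

# Stochastic augmentations are re-sampled at every iteration, so they are
# natural candidates for on-the-fly transforms applied by the data loader.
augmentations = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
])

# Illustrative call: pass the pipeline through `onthefly_transforms`.
cifar10 = Datasets.CIFAR10(path="./data", onthefly_transforms=augmentations)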

classmethod get(name: str, **kwargs) → DataContainer | tuple

Get a dataset by name initialized with the provided arguments. Supported datasets are: mnist, mnistm, svhn, femnist, emnist, cifar10, cifar100, tiny_imagenet, shakespeare, fashion_mnist, and cinic10. If name is not in the supported datasets, it is assumed to be a fully qualified name of a custom dataset function (callable[..., DataContainer]).

Parameters:
  • name (str) – The name of the dataset to load or the fully qualified name of a custom dataset function.

  • **kwargs – Additional arguments to pass to construct the dataset.

Returns:

The DataContainer object containing the dataset.

Return type:

DataContainer

Raises:

ValueError – If the dataset is not supported or the name is wrong.
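For example, a dataset can be loaded by name, with keyword arguments forwarded to the corresponding loader (a usage sketch; the custom loader named in the comment is hypothetical):

from fluke.data.datasets import Datasets

# Load a built-in dataset by name; extra keyword arguments are forwarded
# to the corresponding classmethod (here, MNIST's `channel_dim`).
mnist = Datasets.get("mnist", path="./data", channel_dim=True)

# A custom dataset can be referenced by the fully qualified name of a
# function returning a DataContainer, e.g. the hypothetical
# "mypackage.mydata.load_my_dataset":
# custom = Datasets.get("mypackage.mydata.load_my_dataset", root="./data")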

classmethod MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None, channel_dim: bool = False) → DataContainer

Load the MNIST dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.MNIST class. If no transformations are provided, the data is normalized to the range [0, 1]. Each example is a 28x28 grayscale image, i.e., a tensor of shape (28, 28). The dataset has 10 classes, corresponding to the digits 0-9.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

  • channel_dim (bool, optional) – Whether to add a channel dimension to the data, i.e., the shape of an example becomes (1, 28, 28). Defaults to False.

Returns:

The MNIST dataset.

Return type:

DataContainer
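A sketch of the channel_dim option described above:

from fluke.data.datasets import Datasets

# Without the channel dimension, each example is a (28, 28) tensor.
mnist = Datasets.MNIST(path="./data")

# With channel_dim=True, each example becomes a (1, 28, 28) tensor,
# the layout expected by most convolutional models.
mnist_cnn = Datasets.MNIST(path="./data", channel_dim=True)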

classmethod MNISTM(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the MNIST-M dataset. MNIST-M is a dataset where the MNIST digits are placed on random color patches. The dataset is split into training and testing sets according to the default split of the data at https://github.com/liyxi/mnist-m/releases/download/data/. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 10 classes, corresponding to the digits 0-9.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The MNIST-M dataset.

Return type:

DataContainer

classmethod EMNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the Extended MNIST (EMNIST) dataset. The dataset is split into training and testing sets according to the default split of the data at https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist as provided by the torchvision.datasets.EMNIST class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 47 classes, corresponding to the digits 0-9 and to uppercase and lowercase letters (visually similar letter classes are merged).

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The EMNIST dataset.

Return type:

DataContainer

classmethod SVHN(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the Street View House Numbers (SVHN) dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.SVHN class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 10 classes, corresponding to the digits 0-9.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The SVHN dataset.

Return type:

DataContainer

classmethod CIFAR10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the CIFAR-10 dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.CIFAR10 class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The shape of the images is (3, 32, 32).

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The CIFAR-10 dataset.

Return type:

DataContainer

classmethod CINIC10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the CINIC-10 dataset. CINIC-10 is an augmented extension of CIFAR-10. It contains the images from CIFAR-10 (60,000 images, 32x32 RGB pixels) and a selection of images from the ImageNet database (210,000 images downsampled to 32x32). It was compiled as a ‘bridge’ between CIFAR-10 and ImageNet, for benchmarking machine learning applications. It is split into three equal subsets - train, validation, and test - each of which contains 90,000 images.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The CINIC-10 dataset.

Return type:

DataContainer

classmethod CIFAR100(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the CIFAR-100 dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.CIFAR100 class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 100 classes, corresponding to different types of objects. The shape of the images is (3, 32, 32).

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The CIFAR-100 dataset.

Return type:

DataContainer

classmethod FASHION_MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the Fashion MNIST dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.FashionMNIST class. The dataset contains 10 classes, corresponding to different types of clothing. The shape of the images is (28, 28).

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The Fashion MNIST dataset.

Return type:

DataContainer

classmethod TINY_IMAGENET(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) → DataContainer

Load the Tiny-ImageNet dataset. This version of the dataset is the one offered by Hugging Face. The dataset is split into training and testing sets according to the default split of the data. The dataset contains 200 classes, corresponding to different types of objects. The shape of the images is (3, 64, 64).

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "../data".

  • transforms (callable, optional) – The transformations to apply to the data. Defaults to None.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

The Tiny-ImageNet dataset.

Return type:

DataContainer

classmethod FEMNIST(path: str = './data', batch_size: int = 10, filter: str = 'all', onthefly_transforms: callable | None = None) → DataContainer

Load the Federated EMNIST (FEMNIST) dataset. This dataset is the one offered by the Leaf project. FEMNIST contains images of handwritten characters of size 28 by 28 pixels (with the option to make them all 128 by 128 pixels) taken from 3500 users. The dataset has 62 classes corresponding to different characters. The label-class correspondence is as follows:

Labels 0-9 correspond to the digits '0'-'9', labels 10-35 to the uppercase letters 'A'-'Z', and labels 36-61 to the lowercase letters 'a'-'z'.

Important

Unlike the other datasets (except SHAKESPEARE()), the FEMNIST dataset cannot be downloaded directly through fluke; it must be downloaded from the Leaf project and stored in the path folder. The dataset must also be prepared according to the instructions provided by the Leaf project. The expected folder structure is:

path
└── FEMNIST
    ├── train
    │   ├── user_data_0.json
    │   ├── user_data_1.json
    │   └── ...
    └── test
        ├── user_data_0.json
        ├── user_data_1.json
        └── ...

where each user_data_X.json file contains a dictionary with the key user_data holding the data of the users.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "./data".

  • batch_size (int, optional) – The batch size. Defaults to 10.

  • filter (str, optional) – The filter for the selection of a specific portion of the dataset. The options are: all, uppercase, lowercase, and digits. Defaults to "all".

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

A tuple containing the training and testing data loaders for the clients. The server data loader is None.

Return type:

tuple
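A usage sketch, assuming the Leaf FEMNIST data has already been generated and placed under ./data/FEMNIST as described above (parameter values are illustrative):

from fluke.data.datasets import Datasets

# Per the documented return value, this yields the clients' training and
# testing data loaders (the server data loader is None).
# filter="digits" keeps only the 10 digit classes.
femnist = Datasets.FEMNIST(path="./data", batch_size=32, filter="digits")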

classmethod SHAKESPEARE(path: str = './data', batch_size: int = 10, onthefly_transforms: callable | None = None) → tuple

Load the Federated Shakespeare dataset. This dataset is the one offered by the Leaf project. Shakespeare is a text dataset containing dialogues from Shakespeare’s plays. Dialogues are taken from 660 users and the task is to predict the next character in a dialogue (which is solved as a classification problem with 100 classes).

Important

Unlike the other datasets (except FEMNIST()), the SHAKESPEARE dataset cannot be downloaded directly through fluke; it must be downloaded from the Leaf project and stored in the path folder. The dataset must also be prepared according to the instructions provided by the Leaf project. The expected folder structure is:

path
└── shakespeare
    ├── train
    │   ├── user_data_0.json
    │   ├── user_data_1.json
    │   └── ...
    └── test
        ├── user_data_0.json
        ├── user_data_1.json
        └── ...

where each user_data_X.json file contains a dictionary with the key user_data holding the data of the users.

Parameters:
  • path (str, optional) – The path where the dataset is stored. Defaults to "./data".

  • batch_size (int, optional) – The batch size. Defaults to 10.

  • onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to None.

Returns:

A tuple containing the training and testing data loaders for the clients. The server data loader is None.

Return type:

tuple

classmethod FCUBE(path: str = './data', batch_size: int = 10, n_points: int = 1000, test_size: float = 0.1, dimensions: int = 3) → tuple

Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079. This implementation generalizes to $d$ dimensions. The number of clients depends on the number of dimensions: the dataset is divided into $2^{d-1}$ partitions, where $d$ is the number of dimensions.

Warning

This procedure may become very slow for large values of dimensions, e.g., dimensions > 8.

Parameters:
  • path (str, optional) – The path where to save or load the dataset. Defaults to "./data".

  • batch_size (int, optional) – The batch size. Defaults to 10.

  • n_points (int, optional) – The total number of points to generate. Defaults to 1000.

  • test_size (float, optional) – The fraction of points to include in the test sets for both the clients and the server. Defaults to 0.1.

  • dimensions (int, optional) – The number of dimensions of the points. Defaults to 3.

Returns:

The FCUBE dataset.

Return type:

tuple
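A minimal sketch showing how the number of partitions follows from the dimensions argument (parameter values are illustrative):

from fluke.data.datasets import Datasets

dimensions = 3

# The dataset is split into 2**(d - 1) partitions, so d = 3 yields 4 clients.
n_partitions = 2 ** (dimensions - 1)  # = 4

fcube = Datasets.FCUBE(
    path="./data",
    batch_size=10,
    n_points=1000,
    test_size=0.1,
    dimensions=dimensions,
)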