fluke.data.datasets¶
This module contains the class for loading the supported datasets.
Classes¶
static class fluke.data.datasets.Datasets
| `get` | Get a dataset by name initialized with the provided arguments. |
| `MNIST` | Load the MNIST dataset. |
| `MNISTM` | Load the MNIST-M dataset. |
| `EMNIST` | Load the Extended MNIST (EMNIST) dataset. |
| `CIFAR10` | Load the CIFAR-10 dataset. |
| `CIFAR100` | Load the CIFAR-100 dataset. |
| `SVHN` | Load the Street View House Numbers (SVHN) dataset. |
| `CINIC10` | Load the CINIC-10 dataset. |
| `FASHION_MNIST` | Load the Fashion MNIST dataset. |
| `TINY_IMAGENET` | Load the Tiny-ImageNet dataset. |
| `FEMNIST` | Load the Federated EMNIST (FEMNIST) dataset. |
| `SHAKESPEARE` | Load the Federated Shakespeare dataset. |
| `FCUBE` | Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079. |
- class fluke.data.datasets.Datasets[source]¶
- Static class for loading datasets. Datasets are downloaded (if needed) into the `path` folder. The supported datasets are: MNIST, MNISTM, SVHN, FEMNIST, EMNIST, CIFAR10, CIFAR100, Tiny ImageNet, Shakespeare, Fashion MNIST, and CINIC10. Every dataset except `femnist` and `shakespeare` can be transformed using the `transforms` argument. Each dataset is returned as a `fluke.data.DataContainer` object.
- Important
- `onthefly_transforms` are transformations that are applied on-the-fly to the data through the data loader. This is useful when the transformations are stochastic and should be applied at each iteration. These transformations cannot be configured through the configuration file.
- classmethod get(name: str, **kwargs) DataContainer | tuple[source]¶
- Get a dataset by name initialized with the provided arguments. Supported datasets are: `mnist`, `mnistm`, `svhn`, `femnist`, `emnist`, `cifar10`, `cifar100`, `tiny_imagenet`, `shakespeare`, `fashion_mnist`, and `cinic10`. If `name` is not among the supported datasets, it is assumed to be the fully qualified name of a custom dataset function (`callable[..., DataContainer]`).
- Parameters:
- name (str) – The name of the dataset to load or the fully qualified name of a custom dataset function. 
- **kwargs – Additional arguments to pass to construct the dataset. 
 
- Returns:
- The `DataContainer` object containing the dataset.
- Return type:
- DataContainer | tuple
- Raises:
- ValueError – If the dataset is not supported or the name is wrong. 
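- Example
- A minimal usage sketch, assuming the default arguments are acceptable and the dataset can be downloaded to the given path:
```python
from fluke.data.datasets import Datasets

# Load a supported dataset by name; extra keyword arguments are
# forwarded to the corresponding loader (here, Datasets.MNIST).
mnist = Datasets.get("mnist", path="./data")

# A custom dataset can be referenced by the fully qualified name of a
# callable returning a DataContainer. The module and function below are
# hypothetical placeholders, not part of fluke.
# custom = Datasets.get("mypackage.mydata.load_mydataset")
```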
 
 - classmethod MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None, channel_dim: bool = False) DataContainer[source]¶
- Load the MNIST dataset. The dataset is split into training and testing sets according to the default split of the `torchvision.datasets.MNIST` class. If no transformations are provided, the data is normalized to the range [0, 1]. An example of the dataset is a 28x28 image, i.e., a tensor of shape (28, 28). The dataset has 10 classes, corresponding to the digits 0-9.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
- channel_dim (bool, optional) – Whether to add a channel dimension to the data, i.e., the shape of the example becomes (1, 28, 28). Defaults to `False`.
 
- Returns:
- The MNIST dataset. 
- Return type:
- DataContainer
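- Example
- A sketch of loading MNIST with a channel dimension and a static transformation. The normalization statistics below are illustrative values, not defaults applied by fluke, and it is assumed that the `transforms` callable receives the tensor images:
```python
from torchvision import transforms
from fluke.data.datasets import Datasets

# Shape (1, 28, 28) per example, as expected by most convolutional models.
mnist = Datasets.MNIST(path="./data", channel_dim=True)

# Optionally apply a static (deterministic) transformation at load time.
normalize = transforms.Normalize((0.1307,), (0.3081,))
mnist_norm = Datasets.MNIST(path="./data", transforms=normalize, channel_dim=True)
```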
 
 - classmethod MNISTM(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the MNIST-M dataset. MNIST-M is a dataset where the MNIST digits are placed on random color patches. The dataset is split into training and testing sets according to the default split of the data at https://github.com/liyxi/mnist-m/releases/download/data/. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 10 classes, corresponding to the digits 0-9.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The MNIST-M dataset. 
- Return type:
- DataContainer
 
 - classmethod EMNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None, channel_dim: bool = False) DataContainer[source]¶
- Load the Extended MNIST (EMNIST) dataset. The dataset is split into training and testing sets according to the default split of the data at https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist as provided by the `torchvision.datasets.EMNIST` class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 47 classes, corresponding to the digits 0-9 and uppercase and lowercase letters.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
- channel_dim (bool, optional) – Whether to add a channel dimension to the data, i.e., the shape of the example becomes (1, 28, 28). Defaults to `False`.
 
- Returns:
- The EMNIST dataset. 
- Return type:
- DataContainer
 
 - classmethod SVHN(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the Street View House Numbers (SVHN) dataset. The dataset is split into training and testing sets according to the default split of the `torchvision.datasets.SVHN` class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 10 classes, corresponding to the digits 0-9.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The SVHN dataset. 
- Return type:
- DataContainer
 
 - classmethod CIFAR10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the CIFAR-10 dataset. The dataset is split into training and testing sets according to the default split of the `torchvision.datasets.CIFAR10` class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains the following 10 classes: `airplane`, `automobile`, `bird`, `cat`, `deer`, `dog`, `frog`, `horse`, `ship`, and `truck`. The image shape is (3, 32, 32).
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The CIFAR-10 dataset. 
- Return type:
- DataContainer
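- Example
- A sketch of pairing CIFAR-10 with stochastic augmentations via `onthefly_transforms`, so that a new realization is drawn at every iteration of the data loader. The specific augmentations are illustrative, and it is assumed they receive tensor images of shape (3, 32, 32):
```python
from torchvision import transforms
from fluke.data.datasets import Datasets

# Random transformations belong in onthefly_transforms so they are
# re-sampled each time a batch is loaded, not fixed once at load time.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
])

cifar10 = Datasets.CIFAR10(path="./data", onthefly_transforms=augment)
```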
 
 - classmethod CINIC10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the CINIC-10 dataset. CINIC-10 is an augmented extension of CIFAR-10. It contains the images from CIFAR-10 (60,000 images, 32x32 RGB pixels) and a selection of ImageNet images (210,000 images downsampled to 32x32). It was compiled as a 'bridge' between CIFAR-10 and ImageNet, for benchmarking machine learning applications. It is split into three equal subsets (train, validation, and test), each of which contains 90,000 images.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The CINIC-10 dataset. 
- Return type:
- DataContainer
 
 - classmethod CIFAR100(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the CIFAR-100 dataset. The dataset is split into training and testing sets according to the default split of the `torchvision.datasets.CIFAR100` class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 100 classes, corresponding to different types of objects. The image shape is (3, 32, 32).
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The CIFAR-100 dataset. 
- Return type:
- DataContainer
 
 - classmethod FASHION_MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the Fashion MNIST dataset. The dataset is split into training and testing sets according to the default split of the `torchvision.datasets.FashionMNIST` class. The dataset contains 10 classes, corresponding to different types of clothing. The image shape is (28, 28).
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The Fashion MNIST dataset.
- Return type:
- DataContainer
 
 - classmethod TINY_IMAGENET(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer[source]¶
- Load the Tiny-ImageNet dataset. This version of the dataset is the one offered on the Hugging Face Hub. The dataset is split into training and testing sets according to the default split of the data. The dataset contains 200 classes, corresponding to different types of objects. The image shape is (3, 64, 64).
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `../data`.
- transforms (callable, optional) – The transformations to apply to the data. Defaults to `None`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The Tiny-ImageNet dataset. 
- Return type:
- DataContainer
 
 - classmethod FEMNIST(path: str = './data', batch_size: int = 10, filter: str = 'all', onthefly_transforms: callable | None = None) DummyDataContainer[source]¶
- Load the Federated EMNIST (FEMNIST) dataset. This dataset is the one offered by the Leaf project. FEMNIST contains images of handwritten characters of size 28 by 28 pixels (with the option to make them all 128 by 128 pixels) taken from 3500 users. The dataset has 62 classes corresponding to different characters. The label-class correspondence is as follows: labels 0-9 correspond to the digits 0-9, labels 10-35 to the uppercase letters A-Z, and labels 36-61 to the lowercase letters a-z.
- Important
- Differently from the other datasets (except `SHAKESPEARE()`), the FEMNIST dataset cannot be downloaded directly by `fluke`; it must be downloaded from the Leaf project and stored in the `path` folder. The dataset must also be created according to the instructions provided by the Leaf project. The expected folder structure is:
- path
  ├── FEMNIST
  │   ├── train
  │   │   ├── user_data_0.json
  │   │   ├── user_data_1.json
  │   │   └── ...
  │   └── test
  │       ├── user_data_0.json
  │       ├── user_data_1.json
  │       └── ...
- where each user_data_X.json file contains a dictionary with the key `user_data` holding the data of the user.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `"./data"`.
- batch_size (int, optional) – The batch size. Defaults to `10`.
- filter (str, optional) – The filter for the selection of a specific portion of the dataset. The options are: `all`, `uppercase`, `lowercase`, and `digits`. Defaults to `"all"`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
 
- Returns:
- The `DummyDataContainer` object containing the training and testing data loaders for the clients. The server data loader is `None`.
- Return type:
- DummyDataContainer
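- Example
- A sketch of loading FEMNIST, assuming the Leaf-generated json files are already in place under ./data/FEMNIST as described above:
```python
from fluke.data.datasets import Datasets

# The Leaf files must already exist under ./data/FEMNIST/{train,test}.
# filter="digits" keeps only the 10 digit classes.
femnist = Datasets.FEMNIST(path="./data", batch_size=10, filter="digits")
```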
 
 - classmethod SHAKESPEARE(path: str = './data', batch_size: int = 10, onthefly_transforms: callable | None = None) DummyDataContainer[source]¶
- Load the Federated Shakespeare dataset. This dataset is the one offered by the Leaf project. Shakespeare is a text dataset containing dialogues from Shakespeare's plays. Dialogues are taken from 660 users and the task is to predict the next character in a dialogue (which is solved as a classification problem with 100 classes).
- Important
- Differently from the other datasets (except `FEMNIST()`), the SHAKESPEARE dataset cannot be downloaded directly by `fluke`; it must be downloaded from the Leaf project and stored in the `path` folder. The dataset must also be created according to the instructions provided by the Leaf project. The expected folder structure is:
- path
  ├── shakespeare
  │   ├── train
  │   │   ├── user_data_0.json
  │   │   ├── user_data_1.json
  │   │   └── ...
  │   └── test
  │       ├── user_data_0.json
  │       ├── user_data_1.json
  │       └── ...
- where each user_data_X.json file contains a dictionary with the key `user_data` holding the data of the user.
- Parameters:
- path (str, optional) – The path where the dataset is stored. Defaults to `"./data"`.
- batch_size (int, optional) – The batch size. Defaults to `10`.
- onthefly_transforms (callable, optional) – The transformations to apply on-the-fly to the data through the data loader. Defaults to `None`.
- Returns:
- The `DummyDataContainer` object containing the training and testing data loaders for the clients. The server data loader is `None`.
- Return type:
- DummyDataContainer
 
- classmethod FCUBE(path: str = './data', batch_size: int = 10, n_points: int = 1000, test_size: float = 0.1, dimensions: int = 3) tuple[source]¶
- Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079. This implementation generalizes to $n$ dimensions. The number of clients depends on the number of dimensions, i.e., the dataset is divided into $2^{d-1}$ partitions, where $d$ is the number of dimensions.
- Warning
- This procedure may become very slow for large values of dimensions, e.g., dimensions > 8.
- Parameters:
- path (str, optional) – The path where to save or load the dataset. Defaults to “./data”. 
- batch_size (int, optional) – The batch size. Defaults to 10. 
- n_points (int, optional) – The total number of points to generate. Defaults to 1000. 
- test_size (float, optional) – The fraction of points to include in the test sets for both the clients and the server. Defaults to 0.1.
- dimensions (int, optional) – The number of dimensions of the points. Defaults to 3. 
 
- Returns:
- The FCUBE dataset.
- Return type:
- tuple
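- Example
- A sketch of generating FCUBE in 3 dimensions, which yields $2^{3-1} = 4$ partitions (one per client); the arguments shown are the documented defaults:
```python
from fluke.data.datasets import Datasets

# 3 dimensions -> 2**(3-1) = 4 client partitions.
fcube = Datasets.FCUBE(
    path="./data",
    batch_size=10,
    n_points=1000,
    test_size=0.1,
    dimensions=3,
)
```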