fluke.data.datasets
This module contains the Datasets class for loading the supported datasets.
Classes
static class fluke.data.datasets.Datasets
get | Get a dataset by name initialized with the provided arguments. |
MNIST | Load the MNIST dataset. |
MNISTM | Load the MNIST-M dataset. |
EMNIST | Load the Extended MNIST (EMNIST) dataset. |
CIFAR10 | Load the CIFAR-10 dataset. |
CIFAR100 | Load the CIFAR-100 dataset. |
SVHN | Load the Street View House Numbers (SVHN) dataset. |
CINIC10 | Load the CINIC-10 dataset. |
FASHION_MNIST | Load the Fashion MNIST dataset. |
TINY_IMAGENET | Load the Tiny-ImageNet dataset. |
FEMNIST | Load the Federated EMNIST (FEMNIST) dataset. |
SHAKESPEARE | Load the Federated Shakespeare dataset. |
FCUBE | Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079. |
- class fluke.data.datasets.Datasets
Static class for loading datasets. Datasets are downloaded (if needed) into the path folder. The supported datasets are: MNIST, MNISTM, SVHN, FEMNIST, EMNIST, CIFAR10, CIFAR100, Tiny Imagenet, Shakespeare, Fashion MNIST, and CINIC10. Every dataset except femnist and shakespeare can be transformed using the transforms argument. Each dataset is returned as a fluke.data.DataContainer object.
Important
onthefly_transforms are transformations that are applied on the fly to the data through the data loader. This is useful when the transformations are stochastic and should be applied at each iteration. These transformations cannot be configured through the configuration file.
- classmethod get(name: str, **kwargs) DataContainer | tuple
Get a dataset by name initialized with the provided arguments. Supported datasets are: mnist, mnistm, svhn, femnist, emnist, cifar10, cifar100, tiny_imagenet, shakespeare, fashion_mnist, and cinic10. If name is not among the supported datasets, it is assumed to be the fully qualified name of a custom dataset function (callable[..., DataContainer]).
- Parameters:
name (str) – The name of the dataset to load or the fully qualified name of a custom dataset function.
**kwargs – Additional arguments passed to construct the dataset.
- Returns:
The DataContainer object containing the dataset.
- Return type:
DataContainer | tuple
- Raises:
ValueError – If the dataset is not supported or the name is wrong.
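The lookup rule described for get (a supported name first, otherwise a fully qualified function name) can be pictured with a minimal sketch. This is not fluke's implementation: SUPPORTED, resolve_dataset, and the toy loader are hypothetical names, and a plain dictionary stands in for the DataContainer a real loader would return.

```python
import importlib
from typing import Any, Callable, Dict

# Hypothetical registry standing in for the built-in loaders.
SUPPORTED: Dict[str, Callable[..., Any]] = {
    "mnist": lambda **kwargs: {"name": "mnist", **kwargs},
}


def resolve_dataset(name: str) -> Callable[..., Any]:
    """Return a registered loader, or import a fully qualified callable."""
    if name in SUPPORTED:
        return SUPPORTED[name]
    module_name, _, attribute = name.rpartition(".")
    if not module_name:
        raise ValueError(f"Unknown dataset: {name!r}")
    try:
        loader = getattr(importlib.import_module(module_name), attribute)
    except (ImportError, AttributeError) as exc:
        raise ValueError(f"Cannot resolve dataset function {name!r}") from exc
    if not callable(loader):
        raise ValueError(f"{name!r} is not callable")
    return loader


# A supported name is looked up in the registry and called with the kwargs.
builtin = resolve_dataset("mnist")(path="../data")

# Any importable callable works for the demo; a real custom dataset function
# would return a DataContainer instead.
custom = resolve_dataset("os.path.join")
```

Both failure modes (an unsupported short name and an unimportable qualified name) surface as ValueError in this sketch, mirroring the Raises entry above.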
- classmethod MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None, channel_dim: bool = False) DataContainer
Load the MNIST dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.MNIST class. If no transformations are provided, the data is normalized to the range [0, 1]. An example of the dataset is a 28x28 image, i.e., a tensor of shape (28, 28). The dataset has 10 classes, corresponding to the digits 0-9.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
channel_dim (bool, optional) – Whether to add a channel dimension to the data, i.e., the shape of an example becomes (1, 28, 28). Defaults to False.
- Returns:
The MNIST dataset.
- Return type:
DataContainer
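The effect of the default normalization and of channel_dim on a single MNIST example can be illustrated as follows. This is a sketch under stated assumptions: plain nested lists stand in for tensors so that the snippet runs without PyTorch, dividing 8-bit pixel values by 255 is assumed to be the normalization rule, and normalize, add_channel_dim, and shape are hypothetical helpers.

```python
from typing import List


def normalize(raw: List[List[int]]) -> List[List[float]]:
    """Scale 8-bit pixel values to [0, 1] (assumed rule: divide by 255)."""
    return [[pixel / 255 for pixel in row] for row in raw]


def add_channel_dim(image: list) -> list:
    """Wrap a (28, 28) image so that it becomes a (1, 28, 28) one."""
    return [image]


def shape(x) -> tuple:
    """Shape of a rectangular nested list, read along the first elements."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


# Synthetic 28x28 "image" with values in 0..255 (not real MNIST data).
raw = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
image = normalize(raw)

print(shape(image))                   # shape without the channel dimension
print(shape(add_channel_dim(image)))  # shape with the channel dimension
```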
- classmethod MNISTM(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the MNIST-M dataset. MNIST-M is a dataset where the MNIST digits are placed on random color patches. The dataset is split into training and testing sets according to the default split of the data at https://github.com/liyxi/mnist-m/releases/download/data/. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 10 classes, corresponding to the digits 0-9.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The MNIST-M dataset.
- Return type:
DataContainer
- classmethod EMNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the Extended MNIST (EMNIST) dataset. The dataset is split into training and testing sets according to the default split of the data at https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist as provided by the torchvision.datasets.EMNIST class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset has 47 classes, corresponding to the digits 0-9 and the uppercase and lowercase letters.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The EMNIST dataset.
- Return type:
DataContainer
- classmethod SVHN(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the Street View House Numbers (SVHN) dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.SVHN class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 10 classes, corresponding to the digits 0-9.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The SVHN dataset.
- Return type:
DataContainer
- classmethod CIFAR10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the CIFAR-10 dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.CIFAR10 class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each image has shape (3, 32, 32).
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The CIFAR-10 dataset.
- Return type:
DataContainer
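The difference between the transforms and onthefly_transforms arguments shared by these loaders can be sketched with a toy loader. ToyLoader, scale, and random_flip are hypothetical names that do not belong to fluke. The sketch only shows the timing: the static transform runs once when the data is loaded, whereas the stochastic one is re-applied at every iteration.

```python
import random
from typing import Callable, Iterator, List, Optional


class ToyLoader:
    """Hypothetical stand-in for a data loader (not a fluke class)."""

    def __init__(
        self,
        data: List[list],
        transforms: Optional[Callable[[list], list]] = None,
        onthefly_transforms: Optional[Callable[[list], list]] = None,
    ) -> None:
        # Static transforms: applied once, the result is stored.
        self.data = [transforms(x) if transforms else x for x in data]
        self.onthefly_transforms = onthefly_transforms

    def __iter__(self) -> Iterator[list]:
        # On-the-fly transforms: re-applied every time a sample is served.
        for x in self.data:
            yield self.onthefly_transforms(x) if self.onthefly_transforms else x


def scale(row: list) -> list:
    """Deterministic transform: map 8-bit values to [0, 1]."""
    return [value / 255 for value in row]


rng = random.Random(0)


def random_flip(row: list) -> list:
    """Stochastic transform: reverse the row with probability 0.5."""
    return list(reversed(row)) if rng.random() < 0.5 else list(row)


loader = ToyLoader([[0, 255], [255, 0]], transforms=scale, onthefly_transforms=random_flip)
epochs = [list(loader) for _ in range(20)]
```

Across epochs the same stored sample is served in different orientations, which is why a stochastic augmentation belongs in onthefly_transforms rather than in transforms.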
- classmethod CINIC10(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the CINIC-10 dataset. CINIC-10 is an augmented extension of CIFAR-10. It contains the images from CIFAR-10 (60,000 images, 32x32 RGB pixels) and a selection of ImageNet database images (210,000 images downsampled to 32x32), for a total of 270,000 images. It was compiled as a 'bridge' between CIFAR-10 and ImageNet, for benchmarking machine learning applications. It is split into three equal subsets (train, validation, and test), each of which contains 90,000 images.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The CINIC-10 dataset.
- Return type:
DataContainer
- classmethod CIFAR100(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the CIFAR-100 dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.CIFAR100 class. If no transformations are provided, the data is normalized to the range [0, 1]. The dataset contains 100 classes, corresponding to different types of objects. Each image has shape (3, 32, 32).
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The CIFAR-100 dataset.
- Return type:
DataContainer
- classmethod FASHION_MNIST(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the Fashion MNIST dataset. The dataset is split into training and testing sets according to the default split of the torchvision.datasets.FashionMNIST class. The dataset contains 10 classes, corresponding to different types of clothing. Each image has shape (28, 28).
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The Fashion MNIST dataset.
- Return type:
DataContainer
- classmethod TINY_IMAGENET(path: str = '../data', transforms: callable | None = None, onthefly_transforms: callable | None = None) DataContainer
Load the Tiny-ImageNet dataset. This version of the dataset is the one offered by Hugging Face. The dataset is split into training and testing sets according to the default split of the data. The dataset contains 200 classes, corresponding to different types of objects. Each image has shape (3, 64, 64).
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "../data".
transforms (callable, optional) – The transformations to apply to the data. Defaults to None.
onthefly_transforms (callable, optional) – The transformations to apply on the fly to the data through the data loader. Defaults to None.
- Returns:
The Tiny-ImageNet dataset.
- Return type:
DataContainer
- classmethod FEMNIST(path: str = './data', batch_size: int = 10, filter: str = 'all', onthefly_transforms: callable | None = None) DataContainer
Load the Federated EMNIST (FEMNIST) dataset. This dataset is the one offered by the Leaf project. FEMNIST contains images of handwritten characters (digits and letters) of size 28 by 28 pixels (with the option to make them all 128 by 128 pixels) taken from 3500 users. The dataset has 62 classes corresponding to different characters. The label-class correspondence is as follows:
classes: 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
labels : 01234567890123456789012345678901234567890123456789012345678901
                  10        20        30        40        50        60
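The correspondence above is simply the position of each character in the string of digits, uppercase letters, and lowercase letters. A short sketch makes the mapping explicit. CLASSES, LABEL_TO_CHAR, and CHAR_TO_LABEL are hypothetical names used for illustration, not fluke attributes.

```python
import string

# Digits first, then uppercase letters, then lowercase letters, as in the table.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase

LABEL_TO_CHAR = dict(enumerate(CLASSES))
CHAR_TO_LABEL = {char: label for label, char in LABEL_TO_CHAR.items()}

# Label ranges implied by the table: 0-9 digits, 10-35 uppercase, 36-61 lowercase.
digit_labels = [CHAR_TO_LABEL[c] for c in string.digits]
uppercase_labels = [CHAR_TO_LABEL[c] for c in string.ascii_uppercase]
lowercase_labels = [CHAR_TO_LABEL[c] for c in string.ascii_lowercase]
```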
Important
Unlike the other datasets (except SHAKESPEARE()), the FEMNIST dataset cannot be downloaded directly through fluke: it must be downloaded from the Leaf project and stored in the path folder. The datasets must also be created according to the instructions provided by the Leaf project. The expected folder structure is:
path
├── FEMNIST
│   ├── train
│   │   ├── user_data_0.json
│   │   ├── user_data_1.json
│   │   └── ...
│   └── test
│       ├── user_data_0.json
│       ├── user_data_1.json
│       └── ...
where each user_data_X.json file contains a dictionary with the key user_data holding the data of the user.
- Parameters:
path (str, optional) – The path where the dataset is stored. Defaults to "./data".
batch_size (int, optional) – The batch size. Defaults to 10.
filter (str, optional) – The filter for the selection of a specific portion of the dataset. The options are: all, uppercase, lowercase, and digits. Defaults to "all".
- Returns:
A tuple containing the training and testing data loaders for the clients. The server data loader is None.
- Return type:
- classmethod SHAKESPEARE(path: str = './data', batch_size: int = 10, onthefly_transforms: callable | None = None) tuple
Load the Federated Shakespeare dataset. This dataset is the one offered by the Leaf project. Shakespeare is a text dataset containing dialogues from Shakespeare's plays. Dialogues are taken from 660 users and the task is to predict the next character in a dialogue (which is solved as a classification problem with 100 classes).
Important
Unlike the other datasets (except FEMNIST()), the SHAKESPEARE dataset cannot be downloaded directly through fluke: it must be downloaded from the Leaf project and stored in the path folder. The datasets must also be created according to the instructions provided by the Leaf project. The expected folder structure is:
path
├── shakespeare
│   ├── train
│   │   ├── user_data_0.json
│   │   ├── user_data_1.json
│   │   └── ...
│   └── test
│       ├── user_data_0.json
│       ├── user_data_1.json
│       └── ...
where each user_data_X.json file contains a dictionary with the key user_data holding the data of the user.
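Before calling FEMNIST() or SHAKESPEARE(), it can help to check that the folder prepared from the Leaf project matches the expected layout. The sketch below builds a throwaway directory with placeholder files and then inspects it. list_user_files and load_user_data are hypothetical helpers, and the empty dictionary stored under user_data is a placeholder, since the inner structure of that entry is not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def list_user_files(path: str, dataset: str) -> Dict[str, List[Path]]:
    """Collect the user_data_X.json files found under train/ and test/."""
    root = Path(path) / dataset
    return {
        split: sorted((root / split).glob("user_data_*.json"))
        for split in ("train", "test")
    }


def load_user_data(file: Path):
    """Read one file and return the content stored under the user_data key."""
    with open(file, encoding="utf-8") as handle:
        return json.load(handle)["user_data"]


# Demo on a temporary directory that mimics the expected structure.
with tempfile.TemporaryDirectory() as tmp:
    for split in ("train", "test"):
        folder = Path(tmp) / "shakespeare" / split
        folder.mkdir(parents=True)
        placeholder = json.dumps({"user_data": {}})
        (folder / "user_data_0.json").write_text(placeholder, encoding="utf-8")

    files = list_user_files(tmp, "shakespeare")
    n_train, n_test = len(files["train"]), len(files["test"])
    first_user = load_user_data(files["train"][0])
```

An empty list for either split is a quick sign that the path argument points to the wrong folder or that the Leaf preprocessing step has not been run.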
- classmethod FCUBE(path: str = './data', batch_size: int = 10, n_points: int = 1000, test_size: int = 0.1, dimensions: int = 3) tuple
Create the FCUBE dataset as described in the paper https://arxiv.org/pdf/2102.02079. This implementation generalizes to $d$ dimensions. The number of clients depends on the number of dimensions: the dataset is divided into $2^{d-1}$ partitions, where $d$ is the number of dimensions.
Warning
This procedure may become very slow for large values of dimensions, e.g., dimensions > 8.
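The $2^{d-1}$ count can be illustrated with a small sketch. One way to obtain that many partitions is to pair each orthant of the space with its mirror image through the origin. Whether fluke assigns points exactly this way is an assumption, and n_partitions and partition_id are hypothetical helpers.

```python
from itertools import product
from typing import Sequence


def n_partitions(dimensions: int) -> int:
    """Number of partitions stated in the text: 2^(d-1)."""
    return 2 ** (dimensions - 1)


def partition_id(point: Sequence[float]) -> int:
    """Hypothetical assignment: a point and its mirror image share an id."""
    signs = [x > 0 for x in point]
    if not signs[0]:
        # Fold the orthant onto its mirror image through the origin.
        signs = [not s for s in signs]
    pid = 0
    for s in signs[1:]:  # the first sign is now always True, so it is skipped
        pid = (pid << 1) | int(s)
    return pid


# One representative point per orthant: all sign combinations of +/-1.
for d in (2, 3, 8, 12):
    corners = list(product((-1.0, 1.0), repeat=d))
    ids = {partition_id(c) for c in corners}
    print(d, len(corners), len(ids), n_partitions(d))
```

The number of orthants doubles with every added dimension, and so does the number of partitions, which is consistent with the warning about large values of dimensions.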
- Parameters:
path (str, optional) – The path where to save or load the dataset. Defaults to "./data".
batch_size (int, optional) – The batch size. Defaults to 10.
n_points (int, optional) – The total number of points to generate. Defaults to 1000.
test_size (float, optional) – The fraction of points to include in the test sets for both the clients and the server. Defaults to 0.1.
dimensions (int, optional) – The number of dimensions of the points. Defaults to 3.
- Returns:
_description_
- Return type:
tuple