fluke.data

This module contains the data utilities for fluke.

Submodules

datasets

This module contains the Datasets class for loading the supported datasets.

Classes

DataContainer

Container for train and test data.

DummyDataContainer

DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.

FastDataLoader

A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader because the standard DataLoader fetches individual indices of the dataset and calls cat to collate them (slow).

DataSplitter

Utility class for splitting the data across clients.

class fluke.data.DataContainer

class fluke.data.DataContainer(X_train: Tensor, y_train: Tensor, X_test: Tensor, y_test: Tensor, num_classes: int, transforms: callable | None = None)[source]

Container for train and test data.

Parameters:
  • X_train (torch.Tensor) – The training data.

  • y_train (torch.Tensor) – The training labels.

  • X_test (torch.Tensor) – The test data.

  • y_test (torch.Tensor) – The test labels.

  • num_classes (int) – The number of classes.

  • transforms (Optional[callable], optional) – The transformation to be applied to the data when loaded. Defaults to None.
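
A minimal usage sketch (the tensors below are synthetic placeholders, not part of the API):

    import torch
    from fluke.data import DataContainer

    # Synthetic 3-class dataset with 8 features per example.
    X_train, y_train = torch.randn(100, 8), torch.randint(0, 3, (100,))
    X_test, y_test = torch.randn(20, 8), torch.randint(0, 3, (20,))
    data = DataContainer(X_train, y_train, X_test, y_test, num_classes=3)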

class fluke.data.FastDataLoader

__getitem__

Get the entry at the given index for each tensor.

set_sample_size

Set the sample size.

class fluke.data.FastDataLoader(*tensors: Tensor, num_labels: int, batch_size: int = 32, shuffle: bool = False, transforms: callable | None = None, percentage: float = 1.0, skip_singleton: bool = True, single_batch: bool = False)[source]

A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader because the standard DataLoader fetches individual indices of the dataset and calls cat to collate them (slow).

Note

This implementation is based on the following discussion: https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6

Parameters:
  • *tensors (Sequence[torch.Tensor]) – tensors to be loaded.

  • num_labels (int) – the number of labels (classes) in the dataset.

  • batch_size (int) – batch size. Defaults to 32.

  • shuffle (bool) – whether the data should be shuffled. Defaults to False.

  • transforms (Optional[callable]) – the transformation to be applied to the data. Defaults to None.

  • percentage (float) – the percentage of the data to be used. Defaults to 1.0.

  • skip_singleton (bool) – whether to skip batches with a single element. If you have batchnorm layers, you might want to set this to True. Defaults to True.

  • single_batch (bool) – whether to return a single batch at each generator iteration. Defaults to False.

Caution

When sampling a percentage of the data (i.e., percentage < 1), the data is sampled at each epoch. This means that the data varies at each epoch. If you want to keep the data constant across epochs, you should sample the data once and pass the sampled data to the FastDataLoader and set the percentage parameter to 1.0.
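
A minimal sketch of the fixed-subsample pattern suggested above (assuming, as the attribute list below indicates, that iterating the loader yields one batch per tensor):

    import torch
    from fluke.data import FastDataLoader

    X, y = torch.randn(1000, 8), torch.randint(0, 10, (1000,))

    # Sample 50% of the data once, then keep it fixed across epochs
    # by leaving percentage at 1.0.
    idx = torch.randperm(X.shape[0])[: X.shape[0] // 2]
    loader = FastDataLoader(X[idx], y[idx], num_labels=10,
                            batch_size=32, shuffle=True, percentage=1.0)

    for batch_X, batch_y in loader:
        ...  # training step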

tensors

Tensors of the dataset. Ideally, the first tensor should be the input data, and the second tensor should be the labels. However, this is not enforced and the user is responsible for ensuring that the tensors are used correctly.

Type:

Sequence[torch.Tensor]

batch_size

batch size.

Type:

int

shuffle

whether the data should be shuffled. If True, the data is reshuffled at the start of each epoch, i.e., each time the loader is iterated.

Type:

bool

transforms

the transformation to be applied to the data.

Type:

callable

percentage

the percentage of the data to be used. If 1.0, all the data is used. Otherwise, the data is sampled according to the given percentage.

Type:

float

skip_singleton

whether to skip batches with a single element. If you have batchnorm layers, you might want to set this to True.

Type:

bool

single_batch

whether to return a single batch at each generator iteration.

Type:

bool

size

the size of the dataset according to the percentage of the data to be used.

Type:

int

max_size

the total size (regardless of the sampling percentage) of the dataset.

Type:

int

Raises:

AssertionError – if the tensors do not have the same size along the first dimension.

__getitem__(index: int) tuple[source]

Get the entry at the given index for each tensor.

Parameters:

index (int) – the index.

Raises:

IndexError – if the index is out of bounds.

Returns:

the entry at the given index for each tensor.

Return type:

tuple
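
A one-line usage sketch, continuing the loader from the example above (the first tensor holds the inputs, the second the labels):

    x0, y0 = loader[0]  # the entry at index 0 of each tensor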

set_sample_size(percentage: float) int[source]

Set the sample size.

Parameters:

percentage (float) – the percentage of the data to be used.

Returns:

the sample size.

Return type:

int
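
For example, continuing the loader from the sketch above:

    new_size = loader.set_sample_size(0.5)  # use 50% of the data from now on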

class fluke.data.DataSplitter

num_classes

Return the number of classes of the dataset.

assign

Assign the data to the clients and the server according to the configuration.

iid

Distribute the examples uniformly across the users.

quantity_skew

Distribute the examples across the clients according to the probability density function \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)) and \(a\) = alpha > 0.

label_quantity_skew

This method distributes the data across clients according to a specific type of label skewness.

label_dirichlet_skew

The method samples \(p_k \sim \text{Dir}_n(\beta)\) and allocates a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\).

label_pathological_skew

The method first sorts the data by label, divides it into n * shards_per_client shards, and assigns shards_per_client shards to each of the n clients.

class fluke.data.DataSplitter(dataset: DataContainer, distribution: str = 'iid', client_split: float = 0.0, sampling_perc: float = 1.0, server_test: bool = True, keep_test: bool = True, server_split: float = 0.0, uniform_test: bool = False, dist_args: DDict | None = None)[source]

Utility class for splitting the data across clients.

assign(n_clients: int, batch_size: int = 32) tuple[tuple[FastDataLoader, FastDataLoader | None], FastDataLoader][source]

Assign the data to the clients and the server according to the configuration. Specifically, we can have the following scenarios:

  1. server_test = True and keep_test = True: The server has a test set that corresponds to the test set of the dataset. The clients have a training set and, if client_split > 0, a test set.

  2. server_test = True and keep_test = False: The server has a test set that is sampled from the whole dataset (training set and test set are merged). The server’s sample size is indicated by the server_split parameter. The clients have a training set and, if client_split > 0, a test set.

  3. server_test = False and keep_test = True: The server does not have a test set. The clients have a training set and a test set that corresponds to the test set of the dataset distributed uniformly across the clients. In this case the client_split is ignored.

  4. server_test = False and keep_test = False: The server does not have a test set. The clients have a training set and, if client_split > 0, a test set.

If uniform_test = True, the test set is distributed IID across the clients. If uniform_test = False, the training and test sets are distributed across the clients according to the provided distribution; the only exception is the test set in scenario 3, which is always distributed uniformly.

Parameters:
  • n_clients (int) – The number of clients.

  • batch_size (Optional[int], optional) – The batch size. Defaults to 32.

Returns:

The clients’ training and testing assignments and the server’s testing assignment.

Return type:

tuple[tuple[FastDataLoader, Optional[FastDataLoader]], FastDataLoader]
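
A minimal sketch tying the pieces together (data is the DataContainer from the earlier example; the unpacking follows the return type above, with clients_tr and clients_te holding the clients' training and test assignments):

    from fluke.data import DataSplitter

    splitter = DataSplitter(data, distribution="iid", client_split=0.1)
    (clients_tr, clients_te), server_te = splitter.assign(n_clients=10, batch_size=32)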

iid(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int) list[Tensor][source]

Distribute the examples uniformly across the users.

Parameters:
  • X_train (torch.Tensor) – The training examples.

  • y_train (torch.Tensor) – The training labels. Not used.

  • X_test (torch.Tensor) – The test examples.

  • y_test (torch.Tensor) – The test labels. Not used.

  • n (int) – The number of clients upon which the examples are distributed.

Returns:

The examples’ ids assignment.

Return type:

list[torch.Tensor]
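
Conceptually, an IID assignment amounts to shuffling the example ids and splitting them into n equal parts; a standalone sketch of the idea (not fluke's implementation):

    import torch

    def iid_sketch(num_examples: int, n: int) -> list[torch.Tensor]:
        perm = torch.randperm(num_examples)  # shuffle the example ids
        return list(torch.chunk(perm, n))    # one id tensor per client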

label_dirichlet_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, beta: float = 0.1, min_ex_class: int = 2, balanced: bool = True) list[Tensor][source]

The method samples \(p_k \sim \text{Dir}_n(\beta)\) and allocates a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\). Here \(\text{Dir}(\cdot)\) denotes the Dirichlet distribution and beta is a concentration parameter \((\beta > 0)\). See: https://arxiv.org/pdf/2102.02079.pdf

Parameters:
  • X_train (torch.Tensor) – The training examples.

  • y_train (torch.Tensor) – The training labels.

  • X_test (torch.Tensor) – The test examples.

  • y_test (torch.Tensor) – The test labels.

  • n (int) – The number of clients upon which the examples are distributed.

  • beta (float, optional) – The concentration parameter. Defaults to 0.1.

  • min_ex_class (int, optional) – The minimum number of examples per class. Defaults to 2.

  • balanced (bool, optional) – Whether to ensure a balanced distribution of the examples. Defaults to True.

Returns:

The examples’ ids assignment.

Return type:

list[torch.Tensor]
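
A simplified standalone sketch of the allocation scheme (ignoring min_ex_class and balanced; not fluke's implementation):

    import numpy as np

    def dirichlet_sketch(y: np.ndarray, n: int, beta: float = 0.1,
                         seed: int = 0) -> list[list[int]]:
        rng = np.random.default_rng(seed)
        assignment = [[] for _ in range(n)]
        for k in np.unique(y):
            idx = np.where(y == k)[0]
            rng.shuffle(idx)
            p = rng.dirichlet([beta] * n)  # p_k ~ Dir_n(beta)
            cuts = (np.cumsum(p) * len(idx)).astype(int)[:-1]
            for j, part in enumerate(np.split(idx, cuts)):
                assignment[j].extend(part.tolist())  # p_{k,j} share of class k to client j
        return assignment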

label_pathological_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, shards_per_client: int = 2) list[Tensor][source]

The method first sorts the data by label, divides it into n * shards_per_client shards, and assigns shards_per_client shards to each of the n clients. This is a pathological non-IID partition of the data, as most clients will only have examples of a limited number of classes. See: http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf

Parameters:
  • X_train (torch.Tensor) – The training examples. Not used.

  • y_train (torch.Tensor) – The training labels.

  • X_test (torch.Tensor) – The test examples. Not used.

  • y_test (torch.Tensor) – The test labels.

  • n (int) – The number of clients upon which the examples are distributed.

  • shards_per_client (int, optional) – The number of shards per client. Defaults to 2.

Returns:

The examples’ ids assignment.

Return type:

list[torch.Tensor]
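
A standalone sketch of the sharding scheme (not fluke's implementation):

    import numpy as np

    def pathological_sketch(y: np.ndarray, n: int, shards_per_client: int = 2,
                            seed: int = 0) -> list[np.ndarray]:
        rng = np.random.default_rng(seed)
        order = np.argsort(y)  # example ids sorted by label
        shards = np.array_split(order, n * shards_per_client)
        picks = rng.permutation(n * shards_per_client)
        # Each client receives shards_per_client randomly chosen shards.
        return [np.concatenate([shards[s] for s in picks[j::n]]) for j in range(n)]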

label_quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, class_per_client: int = 2) list[Tensor][source]

This method distributes the data across clients according to a specific type of label skewness. Specifically, suppose each party only has data samples of class_per_client different labels. We first randomly assign class_per_client different label IDs to each party. Then, for the samples of each label, we randomly and equally divide them among the parties that own the label. In this way, the number of labels in each party is fixed, and there is no overlap between the samples of different parties. See: https://arxiv.org/pdf/2102.02079.pdf

Parameters:
  • X_train (torch.Tensor) – The training examples. Not used.

  • y_train (torch.Tensor) – The training labels.

  • X_test (torch.Tensor) – The test examples. Not used.

  • y_test (torch.Tensor) – The test labels.

  • n (int) – The number of clients upon which the examples are distributed.

  • class_per_client (int, optional) – The number of classes per client. Defaults to 2.

Returns:

The examples’ ids assignment.

Return type:

list[torch.Tensor]
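
A simplified standalone sketch of the scheme (not fluke's implementation; it does not guarantee that every class is owned by at least one client):

    import numpy as np

    def label_quantity_sketch(y: np.ndarray, n: int, class_per_client: int = 2,
                              seed: int = 0) -> list[list[int]]:
        rng = np.random.default_rng(seed)
        classes = np.unique(y)
        # Randomly assign class_per_client label ids to each client.
        owners = {k: [] for k in classes}
        for j in range(n):
            for k in rng.choice(classes, class_per_client, replace=False):
                owners[k].append(j)
        assignment = [[] for _ in range(n)]
        for k, js in owners.items():
            if not js:
                continue  # class owned by no client: its samples are dropped
            idx = np.where(y == k)[0]
            rng.shuffle(idx)
            # Equally divide the samples of class k among its owners.
            for j, part in zip(js, np.array_split(idx, len(js))):
                assignment[j].extend(part.tolist())
        return assignment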

property num_classes: int

Return the number of classes of the dataset.

Returns:

The number of classes.

Return type:

int

quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, min_quantity: int = 2, alpha: float = 4.0) list[Tensor][source]

Distribute the examples across the clients according to the following probability density function: \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)) and \(a\) = alpha > 0, with:

  • alpha = 1: examples are equidistributed across clients;

  • alpha = 2: the examples are “linearly” distributed across clients;

  • alpha >= 3: the examples are power-law distributed;

  • alpha \(\rightarrow \infty\): all clients but one have min_quantity examples, and the remaining client has all the rest.

Each client is guaranteed to have at least min_quantity examples.

Parameters:
  • X_train (torch.Tensor) – The training examples.

  • y_train (torch.Tensor) – The training labels. Not used.

  • X_test (torch.Tensor) – The test examples.

  • y_test (torch.Tensor) – The test labels. Not used.

  • n (int) – The number of clients upon which the examples are distributed.

  • min_quantity (int, optional) – The minimum number of examples per client. Defaults to 2.

  • alpha (float, optional) – The skewness parameter. Defaults to 4.

Returns:

The examples’ ids assignment.

Return type:

list[torch.Tensor]
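
A standalone sketch of how client sizes can be drawn from this density via inverse-CDF sampling (since \(F(x) = x^a\) on \([0, 1]\), a uniform draw \(u\) maps to \(u^{1/a}\); not fluke's implementation):

    import numpy as np

    def quantity_skew_sizes(num_examples: int, n: int, alpha: float = 4.0,
                            min_quantity: int = 2, seed: int = 0) -> np.ndarray:
        rng = np.random.default_rng(seed)
        # Reserve min_quantity examples per client, then distribute the rest.
        u = rng.uniform(size=num_examples - n * min_quantity)
        ids = np.minimum((u ** (1.0 / alpha) * n).astype(int), n - 1)
        return np.bincount(ids, minlength=n) + min_quantity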

class fluke.data.DummyDataContainer

class fluke.data.DummyDataContainer(clients_tr: Iterable[FastDataLoader], clients_te: Iterable[FastDataLoader], server_data: FastDataLoader, num_classes: int)[source]

Bases: DataContainer

DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.

Parameters:
  • clients_tr (Iterable[FastDataLoader]) – data loaders for the clients’ training set.

  • clients_te (Iterable[FastDataLoader]) – data loaders for the clients’ test set.

  • server_data (FastDataLoader) – data loader for the server’s test set.

  • num_classes (int) – the number of classes in the dataset.
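
A usage sketch (clients_tr, clients_te, and server_data are assumed to be pre-built FastDataLoader objects; 62 is FEMNIST's class count):

    from fluke.data import DummyDataContainer

    # clients_tr[i] / clients_te[i]: training / test loader of client i.
    femnist = DummyDataContainer(clients_tr, clients_te, server_data, num_classes=62)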