fluke.data

This module contains the data utilities for fluke.
Classes

- DataContainer: Container for train and test data.
- DummyDataContainer: DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.
- FastDataLoader: A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the standard DataLoader fetches individual indices of the dataset and calls cat (which is slow).
- DataSplitter: Utility class for splitting the data across clients.
class fluke.data.DataContainer
- class fluke.data.DataContainer(X_train: Tensor, y_train: Tensor, X_test: Tensor, y_test: Tensor, num_classes: int, transforms: callable | None = None)
Container for train and test data.
- Parameters:
X_train (torch.Tensor) – The training data.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test data.
y_test (torch.Tensor) – The test labels.
num_classes (int) – The number of classes.
transforms (Optional[callable], optional) – The transformation to be applied to the data when loaded. Defaults to None.
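For illustration, a minimal construction sketch using random tensors (the shapes and the number of classes are arbitrary assumptions, not requirements of the class):

```python
import torch

from fluke.data import DataContainer

# Hypothetical toy dataset: 100 train / 20 test examples with
# 8 features each, and 4 classes.
data = DataContainer(
    X_train=torch.randn(100, 8),
    y_train=torch.randint(0, 4, (100,)),
    X_test=torch.randn(20, 8),
    y_test=torch.randint(0, 4, (20,)),
    num_classes=4,
)
```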
class fluke.data.FastDataLoader
- Get the entry at the given index for each tensor.
- Set the sample size.
- class fluke.data.FastDataLoader(*tensors: Tensor, num_labels: int, batch_size: int = 32, shuffle: bool = False, transforms: callable | None = None, percentage: float = 1.0, skip_singleton: bool = True, single_batch: bool = False)
A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the standard DataLoader fetches individual indices of the dataset and calls cat (which is slow).
Note
This implementation is based on the following discussion: https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6
- Parameters:
*tensors (Sequence[torch.Tensor]) – tensors to be loaded.
num_labels (int) – the number of labels (classes) in the dataset.
batch_size (int) – batch size.
shuffle (bool) – whether the data should be shuffled.
transforms (Optional[callable]) – the transformation to be applied to the data. Defaults to None.
percentage (float) – the percentage of the data to be used.
skip_singleton (bool) – whether to skip batches with a single element. If you have batch normalization layers, you might want to set this to True.
single_batch (bool) – whether to return a single batch at each generator iteration.
Caution
When sampling a percentage of the data (i.e., percentage < 1), the data is re-sampled at each epoch, so the selected examples vary from epoch to epoch. If you want to keep the data constant across epochs, sample the data once, pass the sampled data to the FastDataLoader, and set the percentage parameter to 1.0.
- tensors
Tensors of the dataset. Ideally, the first tensor should be the input data, and the second tensor should be the labels. However, this is not enforced and the user is responsible for ensuring that the tensors are used correctly.
- Type:
Sequence[torch.Tensor]
- shuffle
whether the data should be shuffled at each epoch. If True, the data is shuffled at each iteration.
- Type:
bool
- transforms
the transformation to be applied to the data.
- Type:
callable
- percentage
the percentage of the data to be used. If 1.0, all the data is used; otherwise, the data is sampled according to the given percentage.
- Type:
float
- skip_singleton
whether to skip batches with a single element. If you have batch normalization layers, you might want to set this to True.
- Type:
bool
- Raises:
AssertionError – if the tensors do not have the same size along the first dimension.
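As a usage sketch (the data is the same hypothetical toy setup as above; the per-batch unpacking in the loop reflects this class's description, with one slice per tensor passed at construction):

```python
import torch

from fluke.data import FastDataLoader

X = torch.randn(100, 8)
y = torch.randint(0, 4, (100,))

# num_labels is assumed to match the number of classes in y.
loader = FastDataLoader(X, y, num_labels=4, batch_size=32, shuffle=True)

for batch_X, batch_y in loader:
    print(batch_X.shape, batch_y.shape)
```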
class fluke.data.DataSplitter

- num_classes: Return the number of classes of the dataset.
- assign: Assign the data to the clients and the server according to the configuration.
- iid: Distribute the examples uniformly across the clients.
- quantity_skew: Distribute the examples across the clients according to the probability density function \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)).
- label_quantity_skew: Distribute the data across clients according to a specific type of label skewness.
- label_dirichlet_skew: Sample \(p_k \sim \text{Dir}_n(\beta)\) and allocate a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\).
- label_pathological_skew: Sort the data by label, divide it into shards, and assign a fixed number of shards to each client.
- class fluke.data.DataSplitter(dataset: DataContainer, distribution: str = 'iid', client_split: float = 0.0, sampling_perc: float = 1.0, server_test: bool = True, keep_test: bool = True, server_split: float = 0.0, uniform_test: bool = False, dist_args: DDict | None = None)
Utility class for splitting the data across clients.
- assign(n_clients: int, batch_size: int = 32) → tuple[tuple[FastDataLoader, FastDataLoader | None], FastDataLoader]
Assign the data to the clients and the server according to the configuration. Specifically, we can have the following scenarios:
- server_test = True and keep_test = True: The server has a test set that corresponds to the test set of the dataset. The clients have a training set and, if client_split > 0, a test set.
- server_test = True and keep_test = False: The server has a test set that is sampled from the whole dataset (training set and test set are merged). The server's sample size is indicated by the server_split parameter. The clients have a training set and, if client_split > 0, a test set.
- server_test = False and keep_test = True: The server does not have a test set. The clients have a training set and a test set that corresponds to the test set of the dataset, distributed uniformly across the clients. In this case, client_split is ignored.
- server_test = False and keep_test = False: The server does not have a test set. The clients have a training set and, if client_split > 0, a test set.

If uniform_test = False, the training and test sets are distributed across the clients according to the provided distribution; the only exception is the test set in scenario 3, which is always distributed uniformly. If uniform_test = True, the test set is distributed IID across the clients.
- Parameters:
n_clients (int) – The number of clients.
batch_size (int, optional) – The batch size. Defaults to 32.
- Returns:
The clients’ training and testing assignments and the server’s testing assignment.
- Return type:
tuple[tuple[FastDataLoader, Optional[FastDataLoader]], FastDataLoader]
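A minimal end-to-end sketch (reusing the toy DataContainer from above; the unpacking of the result follows the return type documented here):

```python
import torch

from fluke.data import DataContainer, DataSplitter

data = DataContainer(
    X_train=torch.randn(100, 8),
    y_train=torch.randint(0, 4, (100,)),
    X_test=torch.randn(20, 8),
    y_test=torch.randint(0, 4, (20,)),
    num_classes=4,
)

# Scenario 1: server_test=True and keep_test=True (the defaults),
# with a 10% client-side test split.
splitter = DataSplitter(dataset=data, distribution="iid", client_split=0.1)
client_assignments, server_test = splitter.assign(n_clients=5, batch_size=32)
```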
- iid(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int) → list[Tensor]
Distribute the examples uniformly across the clients.
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels. Not used.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels. Not used.
n (int) – The number of clients upon which the examples are distributed.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
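The underlying idea can be sketched as follows (an illustrative re-implementation, not fluke's actual code; iid_assignment is a hypothetical helper):

```python
import torch

def iid_assignment(num_examples: int, n: int) -> list[torch.Tensor]:
    """Shuffle the example ids and deal them out evenly across n clients."""
    perm = torch.randperm(num_examples)
    return [perm[i::n] for i in range(n)]

# 10 examples across 3 clients: sizes 4, 3, 3.
assignments = iid_assignment(10, 3)
```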
- label_dirichlet_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, beta: float = 0.1, min_ex_class: int = 2, balanced: bool = True) → list[Tensor]
The method samples \(p_k \sim \text{Dir}_n(\beta)\) and allocates a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\). Here \(\text{Dir}(\cdot)\) denotes the Dirichlet distribution and beta is a concentration parameter \((\beta > 0)\). See: https://arxiv.org/pdf/2102.02079.pdf
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
beta (float, optional) – The concentration parameter. Defaults to 0.1.
min_ex_class (int, optional) – The minimum number of examples per class. Defaults to 2.
balanced (bool, optional) – Whether to ensure a balanced distribution of the examples. Defaults to True.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
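The core of the Dirichlet allocation can be sketched as follows (illustrative only: this simplified version ignores the min_ex_class and balanced constraints of the actual method, and dirichlet_assignment is a hypothetical helper):

```python
import torch

def dirichlet_assignment(y: torch.Tensor, n: int,
                         beta: float = 0.1) -> list[list[int]]:
    """For each class k, sample p_k ~ Dir_n(beta) and give a p_{k,j}
    fraction of the class-k example ids to client j."""
    clients: list[list[int]] = [[] for _ in range(n)]
    for k in y.unique():
        ids = torch.nonzero(y == k, as_tuple=True)[0]
        ids = ids[torch.randperm(len(ids))]
        p = torch.distributions.Dirichlet(torch.full((n,), beta)).sample()
        # Cumulative proportions become split points over the class-k ids.
        splits = (p.cumsum(0) * len(ids)).long()[:-1]
        for j, chunk in enumerate(torch.tensor_split(ids, splits.tolist())):
            clients[j].extend(chunk.tolist())
    return clients
```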
- label_pathological_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, shards_per_client: int = 2) → list[Tensor]
The method first sorts the data by label, divides it into n * shards_per_client shards, and assigns shards_per_client shards to each of the n clients. This is a pathological non-IID partition of the data, as most clients will only have examples of a limited number of classes. See: http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf
- Parameters:
X_train (torch.Tensor) – The training examples. Not used.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples. Not used.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
shards_per_client (int, optional) – The number of shards per client. Defaults to 2.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
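A simplified sketch of the shard-based partition (illustrative, not fluke's exact implementation; pathological_assignment is a hypothetical helper):

```python
import torch

def pathological_assignment(y: torch.Tensor, n: int,
                            shards_per_client: int = 2) -> list[torch.Tensor]:
    """Sort ids by label, cut them into n * shards_per_client shards,
    and hand each client shards_per_client randomly chosen shards."""
    sorted_ids = torch.argsort(y)
    shards = torch.tensor_split(sorted_ids, n * shards_per_client)
    order = torch.randperm(n * shards_per_client).tolist()
    return [
        torch.cat([shards[s]
                   for s in order[i * shards_per_client:(i + 1) * shards_per_client]])
        for i in range(n)
    ]
```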
- label_quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, class_per_client: int = 2) → list[Tensor]
This method distributes the data across clients according to a specific type of label skewness. Specifically, suppose each party only has data samples of class_per_client different labels. We first randomly assign class_per_client different label IDs to each party. Then, for the samples of each label, we randomly and equally divide them among the parties that own the label. In this way, the number of labels in each party is fixed, and there is no overlap between the samples of different parties. See: https://arxiv.org/pdf/2102.02079.pdf
- Parameters:
X_train (torch.Tensor) – The training examples. Not used.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples. Not used.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
class_per_client (int, optional) – The number of classes per client. Defaults to 2.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
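The procedure can be sketched as follows (illustrative only; label_quantity_assignment is a hypothetical helper, and labels that no client happens to pick are simply dropped in this simplified version):

```python
import random

import torch

def label_quantity_assignment(y: torch.Tensor, n: int,
                              class_per_client: int = 2) -> list[list[int]]:
    """Give each client class_per_client random labels, then split each
    label's example ids equally among the clients that own it."""
    labels = y.unique().tolist()
    owners: dict[int, list[int]] = {k: [] for k in labels}
    for client in range(n):
        for k in random.sample(labels, class_per_client):
            owners[k].append(client)
    clients: list[list[int]] = [[] for _ in range(n)]
    for k, owner_list in owners.items():
        if not owner_list:
            continue  # no client picked this label in the simplified sketch
        ids = torch.nonzero(y == k, as_tuple=True)[0]
        ids = ids[torch.randperm(len(ids))]
        for j, chunk in enumerate(torch.tensor_split(ids, len(owner_list))):
            clients[owner_list[j]].extend(chunk.tolist())
    return clients
```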
- property num_classes: int
Return the number of classes of the dataset.
- Returns:
The number of classes.
- Return type:
int
- quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, min_quantity: int = 2, alpha: float = 4.0) → list[Tensor]
Distribute the examples across the clients according to the following probability density function: \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)) and a = alpha > 0:
- alpha = 1: examples are equidistributed across clients;
- alpha = 2: the examples are “linearly” distributed across clients;
- alpha >= 3: the examples are power-law distributed;
- alpha \(\rightarrow \infty\): all clients but one have min_quantity examples, and the remaining client has all the rest.

Each client is guaranteed to have at least min_quantity examples.
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels. Not used.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels. Not used.
n (int) – The number of clients upon which the examples are distributed.
min_quantity (int, optional) – The minimum number of examples per client. Defaults to 2.
alpha (float, optional) – The skewness parameter. Defaults to 4.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
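The density can be sampled with inverse-transform sampling, as in this simplified sketch (illustrative; it omits the min_quantity guarantee of the actual method, and quantity_skew_assignment is a hypothetical helper):

```python
import torch

def quantity_skew_assignment(num_examples: int, n: int,
                             alpha: float = 4.0) -> list[torch.Tensor]:
    """Draw a client for each example from P(x; a) = a * x^(a - 1) on
    [0, 1), then rescale to the client ids 0..n-1."""
    # If u ~ U(0, 1), then u^(1/a) has density a * x^(a-1) on [0, 1).
    u = torch.rand(num_examples) ** (1.0 / alpha)
    owner = (u * n).long().clamp(max=n - 1)
    return [torch.nonzero(owner == j, as_tuple=True)[0] for j in range(n)]
```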
class fluke.data.DummyDataContainer
- class fluke.data.DummyDataContainer(clients_tr: Iterable[FastDataLoader], clients_te: Iterable[FastDataLoader], server_data: FastDataLoader, num_classes: int)
Bases:
DataContainer
DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.
- Parameters:
clients_tr (Iterable[FastDataLoader]) – data loaders for the clients’ training set.
clients_te (Iterable[FastDataLoader]) – data loaders for the clients’ test set.
server_data (FastDataLoader) – data loader for the server’s test set.
num_classes (int) – the number of classes in the dataset.
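A construction sketch (the loaders reuse the hypothetical toy setup from the FastDataLoader example above; make_loader is a hypothetical helper):

```python
import torch

from fluke.data import DummyDataContainer, FastDataLoader

def make_loader(n: int = 50) -> FastDataLoader:
    """Build a toy loader over random data (illustrative)."""
    X = torch.randn(n, 8)
    y = torch.randint(0, 4, (n,))
    return FastDataLoader(X, y, num_labels=4, batch_size=32)

# One training and one test loader per client, plus the server's test loader.
fixed = DummyDataContainer(
    clients_tr=[make_loader() for _ in range(3)],
    clients_te=[make_loader(10) for _ in range(3)],
    server_data=make_loader(20),
    num_classes=4,
)
```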