fluke.data

This module contains the data utilities for fluke.
Classes

- DataContainer: Container for train and test data.
- DummyDataContainer: DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.
- FastDataLoader: A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the standard DataLoader fetches individual indices of the dataset and calls cat (which is slow).
- DataSplitter: Utility class for splitting the data across clients.
class fluke.data.DataContainer
- class fluke.data.DataContainer(X_train: Tensor, y_train: Tensor, X_test: Tensor, y_test: Tensor, num_classes: int, transforms: callable | None = None)
Container for train and test data.
- Parameters:
X_train (torch.Tensor) – The training data.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test data.
y_test (torch.Tensor) – The test labels.
num_classes (int) – The number of classes.
transforms (Optional[callable], optional) – The transformation to be applied to the data when loaded. Defaults to None.
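For illustration, a minimal construction sketch using random tensors (the shapes and the number of classes are arbitrary assumptions, not requirements of the class):

```python
import torch

from fluke.data import DataContainer

# Hypothetical toy dataset: 100 train / 20 test examples with
# 8 features each, and 4 classes.
data = DataContainer(
    X_train=torch.randn(100, 8),
    y_train=torch.randint(0, 4, (100,)),
    X_test=torch.randn(20, 8),
    y_test=torch.randint(0, 4, (20,)),
    num_classes=4,
)
```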
class fluke.data.FastDataLoader
- Get the entry at the given index for each tensor.
- Set the sample size.
- class fluke.data.FastDataLoader(*tensors: Tensor, num_labels: int, batch_size: int = 32, shuffle: bool = False, transforms: callable | None = None, percentage: float = 1.0, skip_singleton: bool = True, single_batch: bool = False)
A DataLoader-like object for a set of tensors that can be much faster than TensorDataset + DataLoader, because the standard DataLoader fetches individual indices of the dataset and calls cat (which is slow).
Note
This implementation is based on the following discussion: https://discuss.pytorch.org/t/dataloader-much-slower-than-manual-batching/27014/6
- Parameters:
*tensors (Sequence[torch.Tensor]) – tensors to be loaded.
num_labels (int) – the number of labels (classes) in the dataset.
batch_size (int) – batch size.
shuffle (bool) – whether the data should be shuffled.
transforms (Optional[callable]) – the transformation to be applied to the data. Defaults to None.
percentage (float) – the percentage of the data to be used.
skip_singleton (bool) – whether to skip batches with a single element. If you have batch normalization layers, you might want to set this to True.
single_batch (bool) – whether to return a single batch at each generator iteration.
Caution
When sampling a percentage of the data (i.e., percentage < 1), the data is re-sampled at each epoch, so the selected examples vary from epoch to epoch. If you want to keep the data constant across epochs, sample the data once, pass the sampled data to the FastDataLoader, and set the percentage parameter to 1.0.
- tensors
Tensors of the dataset. Ideally, the first tensor should be the input data, and the second tensor should be the labels. However, this is not enforced and the user is responsible for ensuring that the tensors are used correctly.
- Type:
Sequence[torch.Tensor]
- shuffle
whether the data should be shuffled at each epoch. If True, the data is shuffled at each iteration.
- Type:
bool
- transforms
the transformation to be applied to the data.
- Type:
callable
- percentage
the percentage of the data to be used. If 1.0, all the data is used; otherwise, the data is sampled according to the given percentage.
- Type:
float
- skip_singleton
whether to skip batches with a single element. If you have batch normalization layers, you might want to set this to True.
- Type:
bool
- Raises:
AssertionError – if the tensors do not have the same size along the first dimension.
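As a usage sketch (the data is the same hypothetical toy setup as above; the per-batch unpacking in the loop reflects this class's description, with one slice per tensor passed at construction):

```python
import torch

from fluke.data import FastDataLoader

X = torch.randn(100, 8)
y = torch.randint(0, 4, (100,))

# num_labels is assumed to match the number of classes in y.
loader = FastDataLoader(X, y, num_labels=4, batch_size=32, shuffle=True)

for batch_X, batch_y in loader:
    print(batch_X.shape, batch_y.shape)
```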
class fluke.data.DataSplitter

- num_classes: Return the number of classes of the dataset.
- assign: Assign the data to the clients and the server according to the configuration.
- iid: Distribute the examples uniformly across the clients.
- quantity_skew: Distribute the examples across the clients according to the probability density function \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)).
- label_quantity_skew: Distribute the data across clients according to a specific type of label skewness.
- label_dirichlet_skew: Sample \(p_k \sim \text{Dir}_n(\beta)\) and allocate a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\).
- label_pathological_skew: Sort the data by label, divide it into shards, and assign a fixed number of shards to each client.
- class fluke.data.DataSplitter(dataset: DataContainer, distribution: str = 'iid', client_split: float = 0.0, sampling_perc: float = 1.0, server_test: bool = True, keep_test: bool = True, server_split: float = 0.0, uniform_test: bool = False, dist_args: DDict | None = None)
Utility class for splitting the data across clients.
- assign(n_clients: int, batch_size: int = 32) → tuple[tuple[FastDataLoader, FastDataLoader | None], FastDataLoader]
Assign the data to the clients and the server according to the configuration. Specifically, we can have the following scenarios:
- server_test = True and keep_test = True: The server has a test set that corresponds to the test set of the dataset. The clients have a training set and, if client_split > 0, a test set.
- server_test = True and keep_test = False: The server has a test set that is sampled from the whole dataset (training set and test set are merged). The server's sample size is indicated by the server_split parameter. The clients have a training set and, if client_split > 0, a test set.
- server_test = False and keep_test = True: The server does not have a test set. The clients have a training set and a test set that corresponds to the test set of the dataset, distributed uniformly across the clients. In this case, client_split is ignored.
- server_test = False and keep_test = False: The server does not have a test set. The clients have a training set and, if client_split > 0, a test set.

If uniform_test = False, the training and test sets are distributed across the clients according to the provided distribution; the only exception is the test set in scenario 3, which is always distributed uniformly. If uniform_test = True, the test set is distributed IID across the clients.
- Parameters:
n_clients (int) – The number of clients.
batch_size (int, optional) – The batch size. Defaults to 32.
- Returns:
The clients’ training and testing assignments and the server’s testing assignment.
- Return type:
tuple[tuple[FastDataLoader, Optional[FastDataLoader]], FastDataLoader]
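A minimal end-to-end sketch (reusing the toy DataContainer from above; the unpacking of the result follows the return type documented here):

```python
import torch

from fluke.data import DataContainer, DataSplitter

data = DataContainer(
    X_train=torch.randn(100, 8),
    y_train=torch.randint(0, 4, (100,)),
    X_test=torch.randn(20, 8),
    y_test=torch.randint(0, 4, (20,)),
    num_classes=4,
)

# Scenario 1: server_test=True and keep_test=True (the defaults),
# with a 10% client-side test split.
splitter = DataSplitter(dataset=data, distribution="iid", client_split=0.1)
client_assignments, server_test = splitter.assign(n_clients=5, batch_size=32)
```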
- iid(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int) → list[Tensor]
Distribute the examples uniformly across the clients.
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels. Not used.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels. Not used.
n (int) – The number of clients upon which the examples are distributed.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
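The underlying idea can be sketched as follows (an illustrative re-implementation, not fluke's actual code; iid_assignment is a hypothetical helper):

```python
import torch

def iid_assignment(num_examples: int, n: int) -> list[torch.Tensor]:
    """Shuffle the example ids and deal them out evenly across n clients."""
    perm = torch.randperm(num_examples)
    return [perm[i::n] for i in range(n)]

# 10 examples across 3 clients: sizes 4, 3, 3.
assignments = iid_assignment(10, 3)
```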
- label_dirichlet_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, beta: float = 0.1, min_ex_class: int = 2, balanced: bool = True) → list[Tensor]
The method samples \(p_k \sim \text{Dir}_n(\beta)\) and allocates a \(p_{k,j}\) proportion of the instances of class \(k\) to party \(j\). Here \(\text{Dir}(\cdot)\) denotes the Dirichlet distribution and beta is a concentration parameter \((\beta > 0)\). See: https://arxiv.org/pdf/2102.02079.pdf
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
beta (float, optional) – The concentration parameter. Defaults to 0.1.
min_ex_class (int, optional) – The minimum number of examples per class. Defaults to 2.
balanced (bool, optional) – Whether to ensure a balanced distribution of the examples. Defaults to True.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
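The core of the Dirichlet allocation can be sketched as follows (illustrative only: this simplified version ignores the min_ex_class and balanced constraints of the actual method, and dirichlet_assignment is a hypothetical helper):

```python
import torch

def dirichlet_assignment(y: torch.Tensor, n: int,
                         beta: float = 0.1) -> list[list[int]]:
    """For each class k, sample p_k ~ Dir_n(beta) and give a p_{k,j}
    fraction of the class-k example ids to client j."""
    clients: list[list[int]] = [[] for _ in range(n)]
    for k in y.unique():
        ids = torch.nonzero(y == k, as_tuple=True)[0]
        ids = ids[torch.randperm(len(ids))]
        p = torch.distributions.Dirichlet(torch.full((n,), beta)).sample()
        # Cumulative proportions become split points over the class-k ids.
        splits = (p.cumsum(0) * len(ids)).long()[:-1]
        for j, chunk in enumerate(torch.tensor_split(ids, splits.tolist())):
            clients[j].extend(chunk.tolist())
    return clients
```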
- label_pathological_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, shards_per_client: int = 2) → list[Tensor]
The method first sorts the data by label, divides it into n * shards_per_client shards, and assigns shards_per_client shards to each of the n clients. This is a pathological non-IID partition of the data, as most clients will only have examples of a limited number of classes. See: http://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf
- Parameters:
X_train (torch.Tensor) – The training examples. Not used.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples. Not used.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
shards_per_client (int, optional) – The number of shards per client. Defaults to 2.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
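A simplified sketch of the shard-based partition (illustrative, not fluke's exact implementation; pathological_assignment is a hypothetical helper):

```python
import torch

def pathological_assignment(y: torch.Tensor, n: int,
                            shards_per_client: int = 2) -> list[torch.Tensor]:
    """Sort ids by label, cut them into n * shards_per_client shards,
    and hand each client shards_per_client randomly chosen shards."""
    sorted_ids = torch.argsort(y)
    shards = torch.tensor_split(sorted_ids, n * shards_per_client)
    order = torch.randperm(n * shards_per_client).tolist()
    return [
        torch.cat([shards[s]
                   for s in order[i * shards_per_client:(i + 1) * shards_per_client]])
        for i in range(n)
    ]
```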
- label_quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, class_per_client: int = 2) → list[Tensor]
This method distributes the data across clients according to a specific type of label skewness. Specifically, suppose each party only has data samples of class_per_client different labels. We first randomly assign class_per_client different label IDs to each party. Then, for the samples of each label, we randomly and equally divide them among the parties that own the label. In this way, the number of labels in each party is fixed, and there is no overlap between the samples of different parties. See: https://arxiv.org/pdf/2102.02079.pdf
- Parameters:
X_train (torch.Tensor) – The training examples. Not used.
y_train (torch.Tensor) – The training labels.
X_test (torch.Tensor) – The test examples. Not used.
y_test (torch.Tensor) – The test labels.
n (int) – The number of clients upon which the examples are distributed.
class_per_client (int, optional) – The number of classes per client. Defaults to 2.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
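The procedure can be sketched as follows (illustrative only; label_quantity_assignment is a hypothetical helper, and labels that no client happens to pick are simply dropped in this simplified version):

```python
import random

import torch

def label_quantity_assignment(y: torch.Tensor, n: int,
                              class_per_client: int = 2) -> list[list[int]]:
    """Give each client class_per_client random labels, then split each
    label's example ids equally among the clients that own it."""
    labels = y.unique().tolist()
    owners: dict[int, list[int]] = {k: [] for k in labels}
    for client in range(n):
        for k in random.sample(labels, class_per_client):
            owners[k].append(client)
    clients: list[list[int]] = [[] for _ in range(n)]
    for k, owner_list in owners.items():
        if not owner_list:
            continue  # no client picked this label in the simplified sketch
        ids = torch.nonzero(y == k, as_tuple=True)[0]
        ids = ids[torch.randperm(len(ids))]
        for j, chunk in enumerate(torch.tensor_split(ids, len(owner_list))):
            clients[owner_list[j]].extend(chunk.tolist())
    return clients
```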
- property num_classes: int
Return the number of classes of the dataset.
- Returns:
The number of classes.
- Return type:
int
- quantity_skew(X_train: Tensor, y_train: Tensor, X_test: Tensor | None, y_test: Tensor | None, n: int, min_quantity: int = 2, alpha: float = 4.0) → list[Tensor]
Distribute the examples across the clients according to the following probability density function: \(P(x; a) = a x^{a-1}\), where \(x\) is the id of a client (\(x \in [0, n-1]\)) and a = alpha > 0:
- alpha = 1: examples are equidistributed across clients;
- alpha = 2: the examples are “linearly” distributed across clients;
- alpha >= 3: the examples are power-law distributed;
- alpha \(\rightarrow \infty\): all clients but one have min_quantity examples, and the remaining client has all the rest.

Each client is guaranteed to have at least min_quantity examples.
- Parameters:
X_train (torch.Tensor) – The training examples.
y_train (torch.Tensor) – The training labels. Not used.
X_test (torch.Tensor) – The test examples.
y_test (torch.Tensor) – The test labels. Not used.
n (int) – The number of clients upon which the examples are distributed.
min_quantity (int, optional) – The minimum number of examples per client. Defaults to 2.
alpha (float, optional) – The skewness parameter. Defaults to 4.
- Returns:
The examples’ ids assignment.
- Return type:
list[torch.Tensor]
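The density can be sampled with inverse-transform sampling, as in this simplified sketch (illustrative; it omits the min_quantity guarantee of the actual method, and quantity_skew_assignment is a hypothetical helper):

```python
import torch

def quantity_skew_assignment(num_examples: int, n: int,
                             alpha: float = 4.0) -> list[torch.Tensor]:
    """Draw a client for each example from P(x; a) = a * x^(a - 1) on
    [0, 1), then rescale to the client ids 0..n-1."""
    # If u ~ U(0, 1), then u^(1/a) has density a * x^(a-1) on [0, 1).
    u = torch.rand(num_examples) ** (1.0 / alpha)
    owner = (u * n).long().clamp(max=n - 1)
    return [torch.nonzero(owner == j, as_tuple=True)[0] for j in range(n)]
```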
class fluke.data.DummyDataContainer
- class fluke.data.DummyDataContainer(clients_tr: Iterable[FastDataLoader], clients_te: Iterable[FastDataLoader], server_data: FastDataLoader, num_classes: int)
Bases:
DataContainer
DataContainer designed for datasets with a fixed data assignment, e.g., FEMNIST, Shakespeare, and FCUBE.
- Parameters:
clients_tr (Iterable[FastDataLoader]) – data loaders for the clients’ training set.
clients_te (Iterable[FastDataLoader]) – data loaders for the clients’ test set.
server_data (FastDataLoader) – data loader for the server’s test set.
num_classes (int) – the number of classes in the dataset.
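A construction sketch (the loaders reuse the hypothetical toy setup from the FastDataLoader example above; make_loader is a hypothetical helper):

```python
import torch

from fluke.data import DummyDataContainer, FastDataLoader

def make_loader(n: int = 50) -> FastDataLoader:
    """Build a toy loader over random data (illustrative)."""
    X = torch.randn(n, 8)
    y = torch.randint(0, 4, (n,))
    return FastDataLoader(X, y, num_labels=4, batch_size=32)

# One training and one test loader per client, plus the server's test loader.
fixed = DummyDataContainer(
    clients_tr=[make_loader() for _ in range(3)],
    clients_te=[make_loader(10) for _ in range(3)],
    server_data=make_loader(20),
    num_classes=4,
)
```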