Using your own data with fluke

This tutorial will guide you through the steps required to use a custom dataset with fluke.

Install fluke (if not already done)

pip install fluke-fl

Define your dataset function

To make your dataset usable in fluke, define a function that returns a DataContainer object. A DataContainer is a simple class that wraps your data, which is expected to be already split into training and test sets.

Hint

You can also use a dataset with no predefined test set. To make it work properly with fluke, set the test examples and labels to two empty tensors, then set keep_test to False in the configuration.
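
A minimal sketch of the hint above, assuming the test split is the one replaced by empty tensors. The helper name empty_test_split is hypothetical and not part of fluke; it only builds zero-length placeholders whose feature shape and dtype match the training data.

```python
import torch


def empty_test_split(X_train: torch.Tensor, y_train: torch.Tensor):
    """Return zero-length test tensors shaped like the training data.

    Hypothetical helper, not part of fluke.
    """
    # Zero examples, same trailing feature dimensions and dtype as X_train.
    X_test = X_train.new_empty((0, *X_train.shape[1:]))
    # Zero labels, same dtype as y_train.
    y_test = y_train.new_empty((0,))
    return X_test, y_test


X_train = torch.randn(100, 2)
y_train = torch.randint(0, 2, (100,))
X_test, y_test = empty_test_split(X_train, y_train)
print(X_test.shape, y_test.shape)
```

These placeholders can then be passed as X_test and y_test to DataContainer, together with keep_test set to False in the configuration.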

The following is an example of a dataset function that returns a random dataset with 100 examples (80 for training and 20 for testing).

from fluke.data.datasets import DataContainer
import torch

def MyDataset() -> DataContainer:

    # Random dataset with 100 2D points from 2 classes
    X = torch.randn(100, 2)
    y = torch.randint(0, 2, (100,))

    return DataContainer(X_train=X[:80],
                         y_train=y[:80],
                         X_test=X[80:],
                         y_test=y[80:],
                         num_classes=2)
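
The random data above can be split by simple slicing because its rows are in no particular order. Real datasets are often sorted (for example by class), in which case the first 80% of the rows would not be representative. The following is a sketch of a shuffled split on plain tensors; shuffled_split and its parameters are hypothetical names, not part of fluke.

```python
import torch


def shuffled_split(X, y, test_fraction=0.2, seed=0):
    """Shuffle the examples, then split them into train and test parts."""
    generator = torch.Generator().manual_seed(seed)
    # Random permutation of the row indices, reproducible through the seed.
    perm = torch.randperm(X.shape[0], generator=generator)
    n_test = int(round(X.shape[0] * test_fraction))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Toy data whose labels are sorted: 50 examples of class 0, then 50 of class 1.
X = torch.arange(200, dtype=torch.float32).reshape(100, 2)
y = torch.cat([torch.zeros(50, dtype=torch.long),
               torch.ones(50, dtype=torch.long)])
X_train, y_train, X_test, y_test = shuffled_split(X, y)
print(X_train.shape, X_test.shape)
```

The four returned tensors map directly onto the X_train, y_train, X_test and y_test arguments used in the dataset function.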

Using your dataset with fluke CLI

You can now use your dataset with the fluke CLI. In the configuration, set the name of the dataset to the fully qualified name of the function. Let’s say you have saved the function above in a file called my_dataset.py and the function is called MyDataset; then you can use it as follows:

dataset:
  name: my_dataset.MyDataset
  ...

Then, you can run fluke as usual:

fluke --config config.yaml federation fedavg.yaml

where config.yaml is the configuration file and fedavg.yaml is the configuration file for the federated averaging algorithm.
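
To illustrate what a fully qualified name such as my_dataset.MyDataset refers to, here is a sketch of how a dotted name is commonly resolved to a callable in Python. This is an assumption about the general mechanism, not a copy of fluke's loading code, and resolve_qualified_name is a hypothetical helper.

```python
import importlib


def resolve_qualified_name(qualified_name: str):
    """Split 'package.module.attr' into module path and attribute, then import it."""
    module_path, _, attr_name = qualified_name.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.function', got {qualified_name!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Demonstrated on a standard-library function; with my_dataset.py on the
# import path, "my_dataset.MyDataset" would be resolved the same way.
join = resolve_qualified_name("os.path.join")
print(join("data", "train.csv"))
```

In practice this means the module must be importable from where the command is run, for example by running it from the directory that contains my_dataset.py.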

Tip

Make sure to configure the algorithm with a model that is compatible with the dataset!
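
One way to check compatibility before launching an experiment is to run a single forward pass and compare the output shape with the number of classes. The sketch below uses a stand-in model for 2D inputs and 2 classes, as in the random dataset above; check_model_matches_data is a hypothetical helper, not part of fluke.

```python
import torch


def check_model_matches_data(model, X, num_classes):
    """Run one small forward pass and verify the output shape.

    Hypothetical helper, not part of fluke.
    """
    batch = X[:4]
    with torch.no_grad():
        out = model(batch)
    expected = (batch.shape[0], num_classes)
    if tuple(out.shape) != expected:
        raise ValueError(
            f"model output shape {tuple(out.shape)} != expected {expected}")
    return True


# Stand-in model: 2 input features, 2 output classes.
model = torch.nn.Sequential(
    torch.nn.Linear(2, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
)
X = torch.randn(100, 2)
ok = check_model_matches_data(model, X, num_classes=2)
print(ok)
```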

Using your dataset with fluke API

This use case is straightforward: instead of calling Datasets.get, call your own function to get the dataset.

To run the example, we define a tiny network that works with our dataset.

import torch
import torch.nn.functional as F

class MyMLP(torch.nn.Module):

    def __init__(self):
        super().__init__()
        # 2 input features -> 3 hidden units -> 2 classes
        self.fc1 = torch.nn.Linear(2, 3)
        self.fc2 = torch.nn.Linear(3, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return x

Now to run, for example, FedAVG on our dataset we do:

from fluke.data import DataSplitter
from fluke import DDict
from fluke.utils.log import Log
from fluke.evaluation import ClassificationEval
from fluke import GlobalSettings

settings = GlobalSettings()
settings.set_seed(42)  # we set a seed for reproducibility
settings.set_device("cpu")  # we use the CPU for this example

dataset = MyDataset()  # here is our dataset

# we set the evaluator to be used by both the server and the clients
settings.set_evaluator(ClassificationEval(eval_every=1, n_classes=dataset.num_classes))

splitter = DataSplitter(dataset=dataset,
                        distribution="iid")

client_hp = DDict(
    batch_size=10,
    local_epochs=5,
    loss="CrossEntropyLoss",
    optimizer=DDict(
        lr=0.01,
        momentum=0.9,
        weight_decay=0.0001),
    scheduler=DDict(
        gamma=1,
        step_size=1)
)

hyperparams = DDict(client=client_hp,
                    server=DDict(weighted=True),
                    model=MyMLP())  # we use our network :)
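
For intuition about distribution="iid" in the splitter above: an IID split conceptually shuffles the training examples and hands each client a near-equal random share. The sketch below works on plain indices and is not DataSplitter's implementation; iid_partition is a hypothetical name.

```python
import random


def iid_partition(n_examples: int, n_clients: int, seed: int = 42):
    """Shuffle example indices and deal them out in near-equal shares."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Client i takes every n_clients-th shuffled index, starting at position i.
    return [indices[i::n_clients] for i in range(n_clients)]


# 80 training examples shared between 2 clients, as in this tutorial.
parts = iid_partition(n_examples=80, n_clients=2)
print([len(p) for p in parts])
```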

Next, we instantiate the federated algorithm and attach a logger to it.

from fluke.algorithms.fedavg import FedAVG
algorithm = FedAVG(n_clients=2,
                   data_splitter=splitter,
                   hyper_params=hyperparams)

logger = Log()
algorithm.set_callbacks(logger)

Now we just need to run it!

algorithm.run(n_rounds=10, eligible_perc=0.5)
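
In each round of federated averaging, the participating clients train locally and the server averages their models. In the standard formulation, weighted aggregation makes each client's contribution proportional to its number of local examples. The sketch below shows that rule on plain lists, under the assumption that the weighted=True server setting used earlier corresponds to it; it is not fluke's code, and weighted_average is a hypothetical name.

```python
def weighted_average(client_params, client_sizes):
    """Average parameter vectors, weighting each client by its number of examples."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * size
            for params, size in zip(client_params, client_sizes)) / total
        for j in range(n_params)
    ]


# Two clients with flattened toy "models" and different amounts of local data.
client_params = [[1.0, 2.0], [3.0, 6.0]]
client_sizes = [30, 10]
global_params = weighted_average(client_params, client_sizes)
print(global_params)  # [1.5, 3.0]
```

The client with 30 examples pulls the average three times as strongly as the client with 10, so the result lies closer to the first parameter vector.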