Using your own data with fluke
This tutorial will guide you through the steps required to use a custom dataset with fluke.
Install fluke (if not already done)
pip install fluke-fl
Define your dataset function
To make your dataset ready to be used in fluke, you need to define a function that returns a DataContainer object. A DataContainer is a simple class that wraps your data, which is expected to be already split into training and test sets.
Hint
You can have a dataset with no pre-defined test set. To make it work properly with fluke, you must set the test examples and labels to two empty tensors. Then, in the configuration, you must set keep_test to False.
The following is an example of a dataset function that returns a random dataset with 100 examples (80 for training and 20 for testing).
from fluke.data.datasets import DataContainer
import torch

def MyDataset() -> DataContainer:

    # Random dataset with 100 2D points from 2 classes
    X = torch.randn(100, 2)
    y = torch.randint(0, 2, (100,))

    return DataContainer(X_train=X[:80],
                         y_train=y[:80],
                         X_test=X[80:],
                         y_test=y[80:],
                         num_classes=2)
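Slicing off the first 80 rows is fine for random data. For a real dataset whose rows are ordered (for instance, sorted by class), you would usually shuffle before splitting. The following is a minimal sketch using only the standard library; `train_test_indices` is a hypothetical helper introduced here for illustration and is not part of fluke.

```python
import random


def train_test_indices(n_examples: int, test_fraction: float = 0.2, seed: int = 42):
    """Return two disjoint lists of row indices: (train, test)."""
    if not 0.0 <= test_fraction < 1.0:
        raise ValueError("test_fraction must be in [0, 1)")
    indices = list(range(n_examples))
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_examples * test_fraction))
    n_train = n_examples - n_test
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_indices(100)
print(len(train_idx), len(test_idx))  # 80 20
```

The two index lists can then be used to select rows from your tensors, e.g. `X[train_idx]` and `y[train_idx]` for the training set.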
Using your dataset with fluke CLI
You can now use your dataset with the fluke CLI. In the configuration, specify the fully qualified name of the function as the name of the dataset. Let's say you have saved the function above in a file called my_dataset.py and the function is called MyDataset; then you can use it as follows:
dataset:
  name: my_dataset.MyDataset
  ...
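A fully qualified name has the form `<module>.<attribute>`, so `my_dataset.MyDataset` refers to the `MyDataset` function inside the module `my_dataset`. The sketch below shows one way such a string can be resolved with the standard library. `resolve` is a hypothetical helper written for illustration; fluke's own lookup may be implemented differently. It is demonstrated on a standard-library function so that it runs anywhere.

```python
import importlib


def resolve(qualified_name: str):
    """Import the module part of 'pkg.module.attr' and return the attribute."""
    module_name, _, attr_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"'{qualified_name}' is not a fully qualified name")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Demonstrated on the standard library; with my_dataset.py on the import path,
# resolve("my_dataset.MyDataset") would return the dataset function instead.
join = resolve("os.path.join")
print(join("data", "train.csv"))
```

If name resolution works along these lines, my_dataset.py must be importable from wherever the command is launched, which usually means running it from the directory that contains the file.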
Then, you can run fluke as usual:
fluke --config config.yaml federation fedavg.yaml
where config.yaml is the configuration file and fedavg.yaml is the configuration file for the federated averaging algorithm.
Tip
Make sure to configure the algorithm with a model that is compatible with the dataset!
Using your dataset with fluke API
This use case is really straightforward! Instead of using Datasets.get, call your own function to get the dataset.
Just for running the example, we define a tiny network that can work with our dataset.
import torch
import torch.nn.functional as F

class MyMLP(torch.nn.Module):
    """Tiny MLP: 2 input features -> 3 hidden units -> 2 class scores."""

    def __init__(self):
        super(MyMLP, self).__init__()
        self.fc1 = torch.nn.Linear(2, 3)
        self.fc2 = torch.nn.Linear(3, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return x
Now to run, for example, FedAVG on our dataset we do:
from fluke.data import DataSplitter
from fluke import DDict
from fluke.utils.log import Log
from fluke.evaluation import ClassificationEval
from fluke import GlobalSettings

settings = GlobalSettings()
settings.set_seed(42)  # we set a seed for reproducibility
settings.set_device("cpu")  # we use the CPU for this example

dataset = MyDataset()  # here is our dataset

# we set the evaluator to be used by both the server and the clients
settings.set_evaluator(ClassificationEval(eval_every=1, n_classes=dataset.num_classes))

splitter = DataSplitter(dataset=dataset,
                        distribution="iid")

client_hp = DDict(
    batch_size=10,
    local_epochs=5,
    loss="CrossEntropyLoss",
    optimizer=DDict(
        lr=0.01,
        momentum=0.9,
        weight_decay=0.0001),
    scheduler=DDict(
        gamma=1,
        step_size=1)
)

hyperparams = DDict(client=client_hp,
                    server=DDict(weighted=True),
                    model=MyMLP())  # we use our network :)
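With `distribution="iid"`, each client is meant to receive a random, roughly equal share of the training data. The sketch below illustrates that idea on plain example indices. `iid_partition` is a hypothetical helper written for this tutorial; it is not how `DataSplitter` is implemented.

```python
import random


def iid_partition(n_examples: int, n_clients: int, seed: int = 42):
    """Shuffle example indices and deal them out evenly to the clients."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Client k takes every n_clients-th index starting at k, so shard
    # sizes differ by at most one example.
    return [indices[k::n_clients] for k in range(n_clients)]


shards = iid_partition(n_examples=80, n_clients=2)
print([len(s) for s in shards])  # [40, 40]
```

Because the indices are shuffled before being dealt out, every shard is a random sample of the whole training set, which is what makes the split IID.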
Next, we instantiate the federated algorithm and attach a logger to it.
from fluke.algorithms.fedavg import FedAVG
algorithm = FedAVG(n_clients=2,
                   data_splitter=splitter,
                   hyper_params=hyperparams)

logger = Log()
algorithm.set_callbacks(logger)
Now we just need to run it!
algorithm.run(n_rounds=10, eligible_perc=0.5)
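For context, federated averaging builds the global model by averaging the parameters returned by the participating clients. With `weighted=True`, the average is presumably weighted by each client's number of training examples, as in the standard FedAvg formulation. Below is a toy sketch on flat parameter lists; the `weighted_average` helper and the numbers are made up for illustration and are not fluke code.

```python
def weighted_average(client_params, client_sizes):
    """Average parameter vectors, weighting each client by its data size."""
    if len(client_params) != len(client_sizes):
        raise ValueError("one size per client is required")
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two clients holding 30 and 10 examples respectively.
global_params = weighted_average([[1.0, 2.0], [5.0, 6.0]], [30, 10])
print(global_params)  # [2.0, 3.0]
```

The client with more data pulls the result towards its own parameters; with equal sizes this reduces to a plain mean.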