Custom evaluation with fluke

This tutorial will guide you through the steps required to implement a new evaluation metric and test it with fluke.

Install fluke (if not already done)

!pip install fluke-fl

Weighted accuracy

In this tutorial, we show how to implement in fluke a metric that is quite common in Personalized Federated Learning. A common technique for evaluating a client's local model is to use a balanced test set while weighting the accuracy by the number of samples of each class. Intuitively, the weighted accuracy accounts for how frequent each class is locally, so that an error on a less frequent class is penalized less. This metric is taken from Tackling Data Heterogeneity in Federated Learning with Class Prototypes, Dai et al., and it is defined as follows:

$$ acc_i = \frac{\sum_{(x_j,y_j)\in D_{test}}\alpha_i(y_j)\,\mathbb{1}(y_j = \hat{y}_j)}{\sum_{(x_j,y_j)\in D_{test}}\alpha_i(y_j)} $$

where $\alpha_i(\cdot)$ is a positive-valued function, defined as the probability that a sample of class $y$ is observed at the $i^{th}$ client. Notice that, for $\alpha_i(\cdot) = 1$, we recover the traditional accuracy. In this tutorial, we interpret $\alpha_i(\cdot)$ as the proportion of local samples of class $y$ over all the samples of that client. Specifically, we compute the coefficient for client $i$ and class $y_j$ as $\alpha_i(y_j) = \frac{Y^i_j}{Y^i}$, where $Y^i_j$ is the number of training samples of class $y_j$ at client $i$, and $Y^i$ is the total number of training samples of client $i$.

In our case, $D_{test}$ is the dataset held by the server, that is, (usually) the original test set of the dataset.
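To make the definition concrete, here is a small worked example (the numbers are made up for illustration): a hypothetical client with 10 training samples spread over 3 classes.

import torch

# hypothetical local training labels of client i:
# 6 samples of class 0, 3 of class 1 and 1 of class 2
labels = torch.tensor([0] * 6 + [1] * 3 + [2] * 1)

# alpha_i(y) = Y^i_y / Y^i: per-class count over the local training-set size
alpha = torch.bincount(labels, minlength=3).float() / labels.numel()
print(alpha)  # tensor([0.6000, 0.3000, 0.1000])

With these weights, a misclassified test sample of class 2 lowers the weighted accuracy far less than a misclassified sample of class 0.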


Implementing the metric

In the following, we start from the classification metric implemented in fluke.evaluation (ClassificationEval) and modify it to take the per-class weights into account. As a sanity check, in the global evaluation (where the weights are uniform) accuracy and weighted accuracy will coincide.

import torch
from torchmetrics import Metric

class WeightedAccuracy(Metric):

    def __init__(self, num_classes: int, weights: torch.Tensor):
        super().__init__()
        self.num_classes = num_classes
        self.weights = weights   # one weight alpha_i(y) per class
        self.true_weights = []   # alpha_i(y_j) of the ground-truth labels
        self.pred_weights = []   # alpha_i(y_hat_j) of the predicted labels
        self.mask = []           # 1(y_j == y_hat_j)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # if `preds` are logits/probabilities, reduce them to class labels
        if preds.ndim == 2:
            preds = torch.argmax(preds, dim=1)
        # keep the weights on the same device as the incoming batch
        weights = self.weights.to(target.device)
        self.true_weights.append(weights[target])
        self.pred_weights.append(weights[preds])
        self.mask.append(torch.eq(preds, target))

    def compute(self) -> float:
        true_weights = torch.cat(self.true_weights, dim=0)
        pred_weights = torch.cat(self.pred_weights, dim=0)
        mask = torch.cat(self.mask, dim=0)
        # on correct predictions pred_weights == true_weights, so this is
        # sum_j alpha_i(y_j) * 1(y_j == y_hat_j) / sum_j alpha_i(y_j)
        pred_weights = pred_weights * mask
        return pred_weights.sum() / true_weights.sum()
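Before plugging the metric into fluke, we can run a quick standalone sanity check (again with made-up labels): with uniform weights the metric reduces to the plain accuracy, while with the $\alpha_i$ from the example above an error on the rare class 2 is discounted.

preds = torch.tensor([0, 1, 2, 1])   # predicted labels
target = torch.tensor([0, 1, 2, 2])  # ground-truth labels (one error on class 2)

# uniform weights: behaves exactly like the standard accuracy (3 correct out of 4)
metric = WeightedAccuracy(num_classes=3, weights=torch.ones(3))
metric.update(preds, target)
print(metric.compute())  # tensor(0.7500)

# non-uniform weights: the error on the rare class 2 is penalized less
metric = WeightedAccuracy(num_classes=3, weights=torch.tensor([0.6, 0.3, 0.1]))
metric.update(preds, target)
print(metric.compute())  # tensor(0.9091)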

Implementing the server-side logic

Notice that server.evaluate is called inside server.fit with only two arguments (the evaluator and the test set). As a consequence, if we wanted server.evaluate to also receive the class weights, we would have to modify server.fit as well, which would be unnecessarily verbose. The most straightforward solution is to leave server.fit and the signature of server.evaluate untouched, and instead pass the weighted metric to evaluator.evaluate through its additional_metrics argument.

import torch

from fluke.data import FastDataLoader  # NOQA
from fluke.evaluation import Evaluator  # NOQA
from fluke.client import Client  # NOQA
from fluke.server import Server  # NOQA

class MyServer(Server):

    def evaluate(self,
                 evaluator: Evaluator,
                 test_set: FastDataLoader) -> dict[str, float]:
        if test_set is not None:
            # the server-side test set is balanced, so we use uniform class
            # weights: here the weighted accuracy must match the plain accuracy
            return evaluator.evaluate(self.rounds + 1,
                                      self.model,
                                      test_set,
                                      device=self.device,
                                      additional_metrics={
                                          "weighted_accuracy": WeightedAccuracy(
                                              evaluator.n_classes,
                                              torch.ones(evaluator.n_classes))
                                      })
        return {}
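Note that the server passes torch.ones(evaluator.n_classes) as class weights: on the (balanced) server-side test set every class counts the same, so the weighted accuracy reported by the server coincides with the standard accuracy, which is exactly the sanity check mentioned above.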

Implementing the client-side logic

Following the same logic as for the server, we modify the call to evaluator.evaluate rather than touching client.local_update or the signature of client.evaluate. The only addition is that each client computes its own class weights $\alpha_i$ from its local training labels.

from torch.nn import Module

from fluke.config import OptimizerConfigurator  # NOQA

class MyClient(Client):

    def __init__(self,
                 index: int,
                 train_set: FastDataLoader,
                 test_set: FastDataLoader,
                 optimizer_cfg: OptimizerConfigurator,
                 loss_fn: Module,
                 local_epochs: int = 3,
                 **kwargs):
        super().__init__(index,
                         train_set,
                         test_set,
                         optimizer_cfg,
                         loss_fn,
                         local_epochs,
                         **kwargs)
        # alpha_i(y) = Y^i_y / Y^i: per-class count over the local training-set size
        self.class_weights = torch.bincount(self.train_set.tensors[1]).float()
        self.class_weights /= self.train_set.size

    def evaluate(self,
                 evaluator: Evaluator,
                 test_set: FastDataLoader) -> dict[str, float]:
        if self.model is not None:
            # pad with zeros in case some classes never occur in the local
            # training set (possible with highly skewed splits)
            weights = torch.zeros(evaluator.n_classes)
            weights[:self.class_weights.shape[0]] = self.class_weights
            return evaluator.evaluate(self._last_round,
                                      self.model,
                                      test_set,
                                      device=self.device,
                                      additional_metrics={
                                          "weighted_accuracy": WeightedAccuracy(
                                              evaluator.n_classes, weights)
                                      })
        return {}

Testing the new metric

Now, we are ready to test our metric!

from fluke.algorithms import CentralizedFL

class MyFLAlgorithm(CentralizedFL):

    def get_client_class(self) -> type[Client]:
        return MyClient

    def get_server_class(self) -> type[Server]:
        return MyServer

Ready to test the new federated algorithm

The rest of the code is similar to the First steps with fluke API tutorial. We keep ClassificationEval as the base evaluator; the weighted accuracy is added on top of it by our custom client and server.

from fluke.data import DataSplitter
from fluke.data.datasets import Datasets
from fluke.evaluation import ClassificationEval
from fluke import DDict
from fluke.utils.log import Log
from fluke import FlukeENV

env = FlukeENV()
env.set_seed(42)  # we set a seed for reproducibility
env.set_eval_cfg(pre_fit=True, post_fit=True)
env.set_evaluator(ClassificationEval(eval_every=1, n_classes=10))

dataset = Datasets.get("mnist", path="./data")
splitter = DataSplitter(dataset=dataset, distribution="dir",
                        client_split=0.1, dist_args=DDict(beta=0.5))

client_hp = DDict(
    batch_size=10,
    local_epochs=5,
    loss="CrossEntropyLoss",
    optimizer=DDict(
        lr=0.01,
        momentum=0.9,
        weight_decay=0.0001),
    scheduler=DDict(
        gamma=1,
        step_size=1)
)

# we put together the hyperparameters for the algorithm
hyperparams = DDict(client=client_hp,
                    server=DDict(weighted=True),
                    model="MNIST_2NN")

Here is where the new federated algorithm comes into play.

algorithm = MyFLAlgorithm(n_clients=10,  # 10 clients in the federation
                          data_splitter=splitter,
                          hyper_params=hyperparams)

logger = Log()
algorithm.set_callbacks(logger)

We just need to run it!

algorithm.run(n_rounds=100, eligible_perc=0.5)
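If everything is wired correctly, the evaluation logs now report a weighted_accuracy entry next to the standard classification metrics: on the server side it matches the plain accuracy (uniform weights), while each client's value reflects its own local class proportions.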