Parallel training [Experimental]
Since v0.7.8, it is possible to parallelize the training on multiple GPUs; since v0.7.9, multiple clients can also be trained in parallel, one per GPU.
Parallel training for a single client (v0.7.8+)
In fluke v0.7.8 and later, to enable parallel training on a single client, you simply set more than one GPU in the device configuration, e.g.:
# This is the experiment configuration file
...
exp:
  device: [cuda:0, cuda:1]
...
This will automatically enable parallel training on the specified GPUs.
This functionality relies on torch.nn.DataParallel, a PyTorch wrapper that replicates a single model across multiple GPUs and splits each training batch among them.
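As a rough illustration of the mechanism (plain PyTorch, not fluke internals; the model and batch below are placeholders), wrapping a model in torch.nn.DataParallel makes each forward pass replicate it on the listed GPUs, split the batch between them, and gather the outputs on the first device:

```python
import torch
from torch import nn

# Placeholder model: fluke wraps whatever model the client is training.
model = nn.Linear(10, 2).to("cuda:0")

# On each forward pass the model is replicated on cuda:0 and cuda:1, the
# batch is split between them, and the outputs are gathered on cuda:0.
parallel_model = nn.DataParallel(model, device_ids=[0, 1])

x = torch.randn(64, 10, device="cuda:0")  # 64 samples -> 32 per GPU
y = parallel_model(x)                     # forward runs on both GPUs in parallel
```

The rest of the training loop (loss computation, optimizer step) stays unchanged, since the gradients are accumulated in the original model's parameters.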
Parallel training for multiple clients - one per GPU (v0.7.9+)
In fluke v0.7.9 and later, you can also run multiple clients in parallel, each on a separate GPU.
In this case, in addition to the device configuration shown above, your algorithm object must inherit from the ParallelAlgorithm class, and similarly your client and server objects must inherit from ParallelClient and ParallelServer, respectively.
ParallelAlgorithm is a ready-to-use base class that provides the necessary functionality for parallel FedAVG training.
To use it, you simply need to set this class as the algorithm name in the configuration file:
# This is the algorithm configuration file
client:
  ...
server:
  ...
name: fluke.distr.ParallelAlgorithm
Clearly, if you define your own algorithm class, the name should be the full path to your class, e.g. my_package.my_module.MyParallelAlgorithm.
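As a minimal sketch of what such a custom class might look like (the import paths and the empty class bodies are assumptions; check the fluke API reference for the actual hooks to override):

```python
# Hypothetical module my_package/my_module.py
from fluke.distr import ParallelAlgorithm, ParallelClient, ParallelServer  # import paths assumed


class MyParallelClient(ParallelClient):
    # Override the client-side hooks you need (e.g., the local training step);
    # see the fluke API reference for the exact method names and signatures.
    ...


class MyParallelServer(ParallelServer):
    # Override the server-side hooks you need (e.g., the aggregation step).
    ...


class MyParallelAlgorithm(ParallelAlgorithm):
    # Tie the custom client and server to the algorithm; how they are declared
    # depends on the fluke version, so refer to the API documentation.
    ...
```

With this in place, the configuration file would point to my_package.my_module.MyParallelAlgorithm, as explained above.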
This functionality relies on the torch.multiprocessing module, a PyTorch feature that allows you to run multiple processes in parallel, each on a separate GPU.
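For intuition, this is the general pattern that torch.multiprocessing enables (a standalone PyTorch sketch, not fluke code; the toy model and batch are placeholders): one worker process is spawned per GPU, and each worker pins its computation to its own device.

```python
import torch
import torch.multiprocessing as mp


def run_client(rank: int) -> None:
    # Each spawned process receives a unique rank and uses its own GPU.
    device = torch.device(f"cuda:{rank}")
    model = torch.nn.Linear(10, 2).to(device)   # placeholder for a client's model
    batch = torch.randn(32, 10, device=device)  # placeholder local data
    loss = model(batch).sum()
    loss.backward()
    print(f"client process {rank} trained on {device}")


if __name__ == "__main__":
    num_gpus = torch.cuda.device_count()
    # Spawn one process per available GPU and wait for all of them to finish.
    mp.spawn(run_client, nprocs=num_gpus, join=True)
```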