The following projects are currently pursued by students in our lab, and are therefore not available anymore. They are published for reference and inspiration.
Harnessing Increased Client Participation with Cohort-Parallel Federated Learning
MSc Thesis
Contact: Martijn de Vos ([email protected])
Federated Learning (FL) is a machine learning approach where nodes collaboratively train a global model. However, as more nodes participate in a round of FL, the effectiveness of each node's individual model update diminishes. In this project, we intend to increase the effectiveness of client updates by dividing the network into smaller partitions, or cohorts.
We name this approach Cohort-Parallel Federated Learning (CPFL), a novel learning approach where each cohort independently trains a global model using FL until convergence. The models produced by the cohorts are then unified using an ensemble. We already have preliminary evidence that smaller, isolated networks converge quicker than a single network in which all nodes participate. This project mainly focuses on two aspects:
- designing a practical algorithm for cohort-parallel FL, and
- conducting experiments that will quantify the effectiveness of this approach.
You will experiment with different datasets, data distributions and client clustering methods. These experiments will investigate the balance between the number of cohorts, model accuracy, training time, and compute and communication resources. Experience with PyTorch is highly recommended.
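As a rough illustration of the cohort idea described above, the sketch below splits clients into disjoint cohorts and combines the per-cohort model outputs. It is only a sketch under stated assumptions: the round-robin assignment, the averaging of class probabilities as the ensemble rule, and the names `partition_into_cohorts` and `ensemble_predict` are all hypothetical choices, not the project's actual algorithm. The FL training inside each cohort is omitted.

```python
from typing import List, Sequence


def partition_into_cohorts(client_ids: Sequence[int], num_cohorts: int) -> List[List[int]]:
    """Split clients into disjoint cohorts (round-robin assignment is an assumption)."""
    if num_cohorts < 1:
        raise ValueError("num_cohorts must be at least 1")
    cohorts: List[List[int]] = [[] for _ in range(num_cohorts)]
    for index, client_id in enumerate(client_ids):
        cohorts[index % num_cohorts].append(client_id)
    return cohorts


def ensemble_predict(cohort_probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors output by each cohort's model."""
    num_models = len(cohort_probabilities)
    num_classes = len(cohort_probabilities[0])
    return [
        sum(probabilities[c] for probabilities in cohort_probabilities) / num_models
        for c in range(num_classes)
    ]


# Ten clients split over three cohorts; each cohort would then run FL on its own.
cohorts = partition_into_cohorts(range(10), num_cohorts=3)

# Hypothetical outputs of three cohort models for one two-class sample.
averaged = ensemble_predict([[0.7, 0.3], [0.4, 0.6], [0.7, 0.3]])
```

Other assignment strategies (for example, clustering clients by data distribution) and other ensemble rules would slot into the same two functions.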
One-shot federated learning benchmark
Master’s thesis or Master’s semester project: 12 credits
Contact: Akash Dhasade ([email protected])
One-shot federated learning (OFL) is an evolving area of research where communication between clients and the server is restricted to a single round. Standard approaches to OFL are based on techniques such as knowledge distillation and ensemble learning. These approaches further take various forms based on the assumptions made, for instance:
- Can clients reveal extra information about their training in the one-shot communication?
- Is there a public dataset available for knowledge transfer?
While the availability of such information may seem like a minor detail, it can result in significantly different performance properties for competing algorithms.
Besides the assumptions, these algorithms have been evaluated on very specific metrics, e.g., accuracy. Their performance on other metrics, such as computational efficiency and scalability, is not well studied. The goal of this benchmark is to exhaustively evaluate OFL algorithms under different assumptions and performance metrics, with the broader aim of devising more performant OFL algorithms. Your task will be to implement and evaluate OFL algorithms across different datasets in a unified benchmark. Experience with PyTorch is highly recommended.
Optimizing the Simulation of Decentralized Learning Algorithms
Contact: Martijn de Vos ([email protected])
The simulation of distributed learning algorithms is important for understanding their achievable model accuracy and the overhead in terms of total training time and communication cost. For this purpose, our lab has recently designed and implemented a distributed simulator specifically devised to simulate decentralized learning (DL) algorithms [1].
The main idea of this simulator is that first, the timestamps of all events in the system (training, model transfers, aggregation, etc.) are determined using a discrete-event simulator. Meanwhile, the simulator devises a compute graph containing all compute tasks (train, aggregate, or test). Then, the compute graph is solved in a distributed manner, possibly using different machines and multiple workers. An advantage of this simulator over others, such as DecentralizePy, is that it supports the integration of real-world mobile traces that include each node’s training and network capacity.
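The discrete-event idea described above can be sketched with a priority queue of timestamped events. This is a minimal illustration, not the lab's simulator: the function `simulate`, the event names, and the simplification that each node alternates between training and a fixed-duration transfer are assumptions made for the example. The returned log plays the role of an ordered list of tasks whose timestamps are known before any real computation happens.

```python
import heapq
from typing import List, Tuple


def simulate(train_times: List[float], transfer_time: float, rounds: int) -> List[Tuple[float, int, str]]:
    """Return (timestamp, node, event) tuples in the order the events occur."""
    # Each heap entry is (timestamp, node, event kind, round index).
    events = [(duration, node, "train_done", 0) for node, duration in enumerate(train_times)]
    heapq.heapify(events)
    log: List[Tuple[float, int, str]] = []
    while events:
        now, node, kind, round_index = heapq.heappop(events)
        log.append((now, node, kind))
        if kind == "train_done":
            # Training finished: schedule the end of the model transfer.
            heapq.heappush(events, (now + transfer_time, node, "transfer_done", round_index))
        elif round_index + 1 < rounds:
            # Transfer finished: schedule the end of the next local training.
            heapq.heappush(events, (now + train_times[node], node, "train_done", round_index + 1))
    return log


# Node 0 trains in 1.0 time units, node 1 (a slower device) in 3.0.
log = simulate(train_times=[1.0, 3.0], transfer_time=0.5, rounds=2)
```

Because simulated time only advances when an event is popped, per-node training and network speeds taken from a trace can be plugged in as the durations.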
Because of the discrete-event simulation, we maintain full control over the passing of time, enabling the evaluation of DL algorithms with nodes with different hardware characteristics. The FedScale simulator [2] uses a similar idea but only supports Federated Learning, which is generally easier to simulate than decentralized learning. While the first version of our simulator is already in use for various projects involving DL, there is a significant opportunity to enhance its scalability regarding the number of nodes it can support.
This project holds the potential to significantly improve the scalability of our simulator by identifying and implementing various optimization techniques. For instance, the current simulator is limited by the available memory, as many DL algorithms induce a memory footprint that scales linearly with the number of nodes in the DL network. One potential approach is to reduce the simulator’s memory footprint by strategically aggregating models as soon as they arrive in a machine. Other optimizations can revolve around the manipulation of the compute graph that is generated during the discrete-event simulation.
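To illustrate the idea of aggregating models as soon as they arrive, the sketch below keeps a running mean instead of buffering every incoming model. It assumes plain unweighted averaging and represents a model as a flat list of floats; the class name `StreamingAverager` is hypothetical. Memory use stays at one model regardless of how many models arrive.

```python
from typing import List


class StreamingAverager:
    """Maintains the mean of all models seen so far using O(model size) memory."""

    def __init__(self, model_size: int) -> None:
        self.mean = [0.0] * model_size
        self.count = 0

    def add(self, model: List[float]) -> None:
        """Fold one incoming model into the running mean, then let it be discarded."""
        self.count += 1
        for i, value in enumerate(model):
            self.mean[i] += (value - self.mean[i]) / self.count


averager = StreamingAverager(model_size=2)
for incoming in ([1.0, 2.0], [3.0, 4.0], [5.0, 6.0]):
    averager.add(incoming)
```

Whether this applies to a given DL algorithm depends on its aggregation rule; rules that need all models at once (for example, a median) cannot be folded in this way.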
Affinity and experience with implementing distributed systems are required for this project. Since the project primarily focuses on the performance of simulation rather than on the ML algorithms, a deep understanding of ML/DL algorithms is optional but can be helpful during the project.
[1] Source code: https://github.com/sacs-epfl/decentralized-learning-simulator
[2] FedScale simulator: https://fedscale.ai
Scalable and Distributed LoRA Adapter Selection and Serving
Contact: Martijn de Vos ([email protected])
Fine-tuning LLMs to specialize their performance on a particular task is essential for unlocking their full capabilities, and it has become an important paradigm in the field. Among fine-tuning methods, LoRA has recently gained much attention [1]. LoRA introduces trainable parameters, or adapters, that interact with the pre-existing ones through low-rank matrices, allowing the model to adapt to new tasks without retraining it fully. It is based on the assumption that the differences between the pre-trained and fine-tuned model exhibit low-rank properties. LoRA keeps the pre-trained model parameters frozen and uses auxiliary low-rank matrices that are randomly initialized.
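The low-rank update described above can be made concrete with a small sketch. It computes W x + B (A x) for a frozen weight matrix W and two small adapter matrices A and B. The dimensions, the helper names `matvec` and `lora_forward`, and the `scale` parameter are assumptions for illustration. Following the initialization described in [1], A is drawn at random and B starts at zero, so the adapted layer initially behaves exactly like the frozen one.

```python
import random
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: Vector, scale: float = 1.0) -> Vector:
    """Compute W x + scale * B (A x); only A and B would be trained."""
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    return [h + scale * u for h, u in zip(base, update)]


rng = random.Random(0)
d_out, d_in, rank = 4, 3, 1
w = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # random init
b = [[0.0] * rank for _ in range(d_out)]                                # zero init

x = [1.0, 2.0, 3.0]
output = lora_forward(w, a, b, x)

frozen_parameters = d_out * d_in              # size of W
trainable_parameters = rank * (d_in + d_out)  # size of A plus size of B
```

An adapter only stores A and B, which is why many adapters for different tasks can share a single copy of the frozen base model.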
In some settings, there may be many LoRA adapters for diverse types of downstream tasks, such as text translation or summarization. Our goal is to build a disaggregated, distributed system architecture for fetching and utilizing the right adapters for a particular query input while minimizing the end-to-end inference latency. Related work can be found in [2] and [3]. The main goal of this project is to design, implement and evaluate (parts of) a scalable and distributed system that efficiently selects and serves LoRA adapters based on specific query inputs, aiming to reduce the end-to-end inference latency for various downstream tasks. Previous expertise with distributed ML systems is highly recommended. Potential research questions:
- How can a disaggregated system architecture speed up inference with LoRA and multiple adapters?
- How can we build a distributed adapter routing and selection mechanism?
- How can we adapt our system architecture to support batched requests?
[1] Hu, Edward J., et al. “Lora: Low-rank adaptation of large language models.” arXiv preprint arXiv:2106.09685 (2021).
[2] Sheng, Ying, et al. “S-lora: Serving thousands of concurrent lora adapters.” arXiv preprint arXiv:2311.03285 (2023).
[3] Zhao, Ziyu, et al. “LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild.” arXiv preprint arXiv:2402.09997 (2024).
Dynamic Expert Model Management: Smart Caching and Real-time Swapping in Mixture of Experts
Contact: Martijn de Vos ([email protected])
In the context of Mixture of Experts (MoE), we aim to develop a system that manages the storage and retrieval of expert models, allowing for efficient real-time swapping during inference. Given the varying popularity and usage of different experts, our system will implement smart caching mechanisms to optimize the performance and availability of the necessary experts. By leveraging a disaggregated architecture, we propose to separate storage and compute responsibilities: storage nodes will not only hold the expert models but also have some computational power for pre-processing tasks, while compute nodes will focus on the heavy lifting of model inference and the gating network. Potential research questions:
- How can we implement an effective caching strategy that dynamically adjusts to the popularity and demand of specific expert models in MoE systems?
- In what ways can we design and orchestrate the interaction between storage and compute nodes to facilitate rapid and efficient retrieval and loading of expert models, ensuring that the system scales dynamically with varying workloads?
- How can hot swapping of expert chunks be achieved seamlessly during inference to ensure minimal latency and maximum throughput?
- Can we quantify the benefits of this smart caching and disaggregated architecture approach in terms of response time, resource utilization, and overall system scalability?
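As a baseline for the caching question, the sketch below shows a least-recently-used (LRU) cache for expert weights on a compute node. LRU is only one possible policy and is chosen here for brevity; the class `ExpertCache` and the `load_expert` callback (standing in for a fetch from a storage node) are hypothetical names, and the weights are represented as opaque bytes.

```python
from collections import OrderedDict
from typing import Callable


class ExpertCache:
    """Keeps at most `capacity` experts resident and evicts the least recently used."""

    def __init__(self, capacity: int, load_expert: Callable[[int], bytes]) -> None:
        self.capacity = capacity
        self.load_expert = load_expert  # hypothetical fetch from a storage node
        self.resident: "OrderedDict[int, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id: int) -> bytes:
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
            return self.resident[expert_id]
        self.misses += 1
        weights = self.load_expert(expert_id)
        self.resident[expert_id] = weights
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict the least recently used expert
        return weights


cache = ExpertCache(capacity=2, load_expert=lambda expert_id: f"expert-{expert_id}".encode())
for requested in [0, 1, 0, 2, 0, 1]:
    cache.get(requested)
```

A popularity-aware policy would replace the eviction step, for instance by tracking how often the gating network selects each expert.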
End-to-end auditing of decentralized learning
Contact: Martijn de Vos <[email protected]>
Motivation
Decentralized learning involves local training of models at nodes, followed by aggregation. In this setting, nodes may deviate from honest behavior and attempt to disrupt the training process or poison the learning. Detecting this malicious behavior in decentralized learning is very difficult. One possible approach is to verify the learning process after the fact when malicious activity is suspected. This, however, entails many challenges in terms of trust, data transfer, communication, storage and computation. We would like to develop and study a system that can verify a learning process and identify the culprit node in the event of suspicion.
Components of the system
1. An optimistic mechanism to suspect malicious behavior in the learning process based on the current state.
2. A data structure (for example, a computational graph) on a trusted party (say, a server) which records (like a tape) the entire learning process: nodes being the state and inputs, and edges denoting the transition (gradient descent and averaging).
3. A communication-efficient and trustworthy way of data exchange between the learning parties and the server for future verification.
4. An incentive or stake mechanism for learning parties to behave honestly to reduce the chances of invoking the verification system.
The intent is to design this verification system with the trusted server code executing on a Trusted Execution Environment and compare it against alternatives such as Homomorphic Encryption. Ideally, this project is intended as 1-2 independent master semester projects or a master thesis project.
Must-haves
- Python programming experience
- Knowledge of Machine Learning
Good to know
- Computer networks
- Experience with PyTorch
- Calculus
- Concurrency
Contact: Martijn de Vos <[email protected]>
Decentralized Learning (DL) is a relatively new class of ML algorithms where the learning process takes place on a network of interconnected devices with no central server that supervises the training. In standard DL algorithms such as D-PSGD, each device in the network independently updates its own model in each round based on the data available locally, sends its model to some neighbours, and merges all models received in that round with its own model. Since training progresses in discrete rounds, these algorithms are synchronous and therefore require synchronisation amongst all processes to determine when to move to the next round. Slow nodes, or stragglers, can thus significantly prolong the model training.
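The round structure described above (local update, exchange with neighbours, merge) can be sketched as follows. This is a simplified illustration under stated assumptions: models are flat lists of floats, gradients are supplied by the caller rather than computed from data, merging is an unweighted average over a node and its neighbours, and the function name `dl_round` is hypothetical.

```python
from typing import Dict, List


def dl_round(
    models: Dict[int, List[float]],
    neighbours: Dict[int, List[int]],
    gradients: Dict[int, List[float]],
    learning_rate: float,
) -> Dict[int, List[float]]:
    """Run one synchronous round and return the merged model of every node."""
    # Step 1: every node takes a local gradient step on its own model.
    updated = {
        node: [w - learning_rate * g for w, g in zip(models[node], gradients[node])]
        for node in models
    }
    # Steps 2 and 3: every node receives its neighbours' updated models
    # and averages them with its own.
    merged: Dict[int, List[float]] = {}
    for node in models:
        group = [updated[node]] + [updated[peer] for peer in neighbours[node]]
        merged[node] = [sum(values) / len(group) for values in zip(*group)]
    return merged


# Three fully connected nodes with zero gradients: one round reaches consensus.
models = {0: [0.0], 1: [3.0], 2: [6.0]}
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
gradients = {0: [0.0], 1: [0.0], 2: [0.0]}
after_round = dl_round(models, neighbours, gradients, learning_rate=0.1)
```

The loop over all nodes inside a single function call is exactly the global synchronisation point: no node can start the next round before the slowest node has finished the current one.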
A few asynchronous DL algorithms have been proposed. Nodes in such algorithms progress individually and usually do not have to wait for other nodes before continuing. While this avoids global synchronisation, one has to handle the situation where some fast nodes get ahead of other nodes, which impacts model convergence. To address this, asynchronous ML algorithms such as Gossip Learning merge received models based on the model age.
The goal of this project is to design, implement and empirically analyse the performance of asynchronous learning algorithms. A theoretical contribution in this project could, for example, be a convergence proof showing that asynchronous DL algorithms result in a consensus model, even with varying levels of model staleness during training. Another focus can be on the real-world performance, resource usage and convergence speed of such algorithms.
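One way to merge models based on age is to weight each model by the number of updates it has seen. The sketch below shows such a rule; it is an illustrative assumption rather than the exact rule of any particular Gossip Learning variant, and the function name `merge_by_age` is hypothetical. The merged model inherits the larger of the two ages.

```python
from typing import List, Tuple


def merge_by_age(
    local: List[float], local_age: int, received: List[float], received_age: int
) -> Tuple[List[float], int]:
    """Average two models, giving more weight to the one with the higher age."""
    total_age = local_age + received_age
    # Fall back to a plain average when neither model has been trained yet.
    weight = 0.5 if total_age == 0 else received_age / total_age
    merged = [(1 - weight) * l + weight * r for l, r in zip(local, received)]
    return merged, max(local_age, received_age)


# A model of age 1 merged with a received model of age 3.
merged_model, merged_age = merge_by_age([0.0, 0.0], 1, [4.0, 8.0], 3)
```

With this rule, a model from a node that has fallen behind contributes less to the merge than a model that has seen more updates.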
A Comparative Evaluation of Decentralized Learning Algorithms using Realistic Real-world Traces
Contact: Martijn de Vos <[email protected]>
Decentralized Learning (DL) has gained significant attention in recent years due to its potential to enhance the privacy, fault tolerance, and scalability of machine learning compared to centralized settings. Various algorithms have been proposed for DL, such as Decentralized Stochastic Gradient Descent (D-SGD), AllReduce-SGD (in which nodes are connected in a ring topology and use AllReduce to average their models), Asynchronous Decentralized Parallel Stochastic Gradient Descent (AD-PSGD) and Gossip Learning (GL). While these algorithms all share the same objective (collaboratively training a model without sharing data), they make different trade-offs, e.g., whether there is round synchronisation and how models are averaged across peers. To better understand these trade-offs, a comprehensive experimental evaluation is needed. In this project, we propose a detailed comparative analysis of these DL algorithms using real-world traces that capture compute power, network capacity, data heterogeneity, and node availability in realistic FL settings.
The primary goal of this project is to compare and evaluate state-of-the-art DL algorithms under realistic conditions. We will provide you with real-world traces to mimic the actual behaviour of compute power, network capacity, data heterogeneity, and node availability in FL settings. You will integrate these traces and implement different DL algorithms in DecentralizePy, a framework to develop and deploy DL algorithms. By doing so, we provide a more accurate and practical assessment of these algorithms’ strengths and weaknesses, which will guide future research and development in the field of DL.
Building Inclusive ML Models with Decentralized Learning
Master project
Contact: Sayan Biswas <[email protected]>
One of the recently popularised ways to efficiently handle the rapid growth of size and complexity of the currently deployed Machine Learning (ML) models is by taking a decentralised approach which ameliorates various challenges associated with traditional, centralized ML paradigms including but not limited to data privacy, ownership and control, scalability, robustness and fault tolerance, and communication overhead. Decentralised Learning (DL) is a relatively new class of ML algorithms where the learning process takes place collaboratively on a network of interconnected devices without the reliance on any central server supervising the training.
On the other hand, a series of recent unfortunate incidents like Facebook mislabelling black men as primates and facial-analysis software having 0.8% error for light-skinned men and 34.7% for dark-skinned women have indicated the lack of representation of minorities in the ML models in use. Thus, the need for training personalised ML models catering to the differing requirements, data distribution, and attributes pertaining to the different communities has been unequivocally acknowledged. Clustering-based approaches such as the Iterative Federated Clustering Algorithm (IFCA) to achieve the personalisation of ML models have recently been in the spotlight primarily in the context of Federated Learning (FL).
The main goal of this project is to lay down the foundational framework needed to carry out such clustering-based personalised model training in DL. In particular, we wish to develop a privacy-preserving, communication-efficient, and decentralised way to estimate key statistical summaries of the data/models held by the nodes (possibly using some techniques based on sampling or sketching) iteratively over the training rounds to compare their similarity and, eventually, furnish a dynamic way to cluster the network based on that. This will, in turn, help in the development of a DL equivalent of some of the state-of-the-art clustering-based personalised model training algorithms.
To contribute effectively to this project, we highly value:
- A strong mathematical grasp and interest in probability theory, combinatorics, and analysis.
- Proficiency in basic machine learning implementation.
Boosting Decentralized Learning with Bandwidth Pooling
Contact: Martijn de Vos <[email protected]>
Decentralized Learning (DL) is a relatively new class of ML algorithms where the learning process takes place on a network of interconnected devices with no central server that supervises the training. While DL was initially applied within data centers to improve the efficiency and scalability of large-scale ML tasks in homogeneous environments, it is increasingly being used to train ML models between end-user devices in heterogeneous environments. With DL, each device in the network independently updates its own model based on the data available locally and directly shares the updated model with other clients. Then, each client periodically aggregates the models it has received. DL uses a peer-to-peer communication topology that prescribes which clients share their model with which other clients.
As DL moves beyond homogeneous data centers to large-scale, heterogeneous end-user environments such as smartphone networks, the variability in computational and communication resources becomes a substantial issue. The discrepancies in bandwidth among nodes can lead to inefficiencies in model dissemination, which is critical to the DL process and directly affects the duration of a round. This project aims to design and evaluate a bandwidth pooling strategy where nodes with surplus bandwidth can assist other nodes in the dissemination of their models. The main research question we seek to address is: “How can a node in DL effectively utilize the surplus bandwidth of neighboring nodes to accelerate dissemination of its model in the network?”.