The following projects are available for Master and Bachelor students. They are performed in close collaboration with an experienced member of the lab. Apply for a project by sending an email to the contact mentioned for the project.
You may also suggest new projects, ideally close to our ongoing or previously completed projects. In that case, you will have to convince Anne-Marie Kermarrec that the project is worthwhile, of reasonable scope, and that someone in the lab can mentor you!
Projects available for Spring 2025.
Scalable and Distributed LoRA Adapter Selection and Serving
Master’s Thesis or MSc semester project
Contact: Martijn de Vos ([email protected])
Fine-tuning LLMs to specialize their performance on a particular task is essential for unlocking their full capabilities and has become an important paradigm in the field of LLMs. LoRA has recently gained much attention for this purpose [1]. LoRA introduces trainable parameters, or adapters, that interact with the pre-existing weights through low-rank matrices, allowing the model to adapt to new tasks without retraining it fully. It is based on the assumption that the difference between the pre-trained and fine-tuned model exhibits low-rank structure. LoRA keeps the pre-trained model parameters frozen and trains auxiliary low-rank matrices that are randomly initialized.
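To make the mechanism concrete, the core idea can be sketched as a frozen linear layer augmented with a trainable low-rank correction. The following is a minimal, illustrative PyTorch sketch; the class name, rank, and scaling are our own assumptions and not part of any existing codebase.

import torch
import torch.nn as nn

# Minimal LoRA-style layer: the pre-trained weights stay frozen, only the
# low-rank matrices A and B are trained (illustrative sketch, not project code).
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / rank) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Usage: wrap an existing projection and fine-tune only A and B.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 768))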
In some settings, there may be many LoRA adapters for diverse downstream tasks, such as text translation or summarization. The goal of this project is to design, implement, and evaluate (parts of) a disaggregated, distributed system that efficiently fetches, selects, and serves the right LoRA adapters for a particular query input while minimizing end-to-end inference latency across downstream tasks. Related work can be found in [2] and [3]. Previous experience with distributed ML systems is highly recommended. Potential research questions (a toy routing sketch follows the questions below):
- How can a disaggregated system architecture speed up inference with LoRA and multiple adapters?
- How can we build a distributed adapter routing and selection mechanism?
- How can we adapt our system architecture to support batched requests?
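As a starting point for the selection question, one could imagine a simple input-aware router in the spirit of [3]: embed the incoming query and pick the adapter whose task centroid is most similar, before fetching that adapter from wherever it is stored. The sketch below is purely illustrative; the class, the centroid representation, and the embedding dimension are assumptions, not part of the project.

import numpy as np

# Hypothetical input-aware adapter routing: route a query to the adapter whose
# centroid embedding is most similar (cosine similarity). Illustrative only.
class AdapterRouter:
    def __init__(self, adapter_centroids: dict[str, np.ndarray]):
        self.names = list(adapter_centroids)
        self.centroids = np.stack([adapter_centroids[n] for n in self.names])

    def route(self, query_embedding: np.ndarray, top_k: int = 1) -> list[str]:
        q = query_embedding / np.linalg.norm(query_embedding)
        c = self.centroids / np.linalg.norm(self.centroids, axis=1, keepdims=True)
        scores = c @ q
        best = np.argsort(-scores)[:top_k]
        return [self.names[i] for i in best]

router = AdapterRouter({
    "translation": np.random.rand(384),
    "summarization": np.random.rand(384),
})
print(router.route(np.random.rand(384)))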
[1] Hu, Edward J., et al. “LoRA: Low-rank adaptation of large language models.” arXiv preprint arXiv:2106.09685 (2021).
[2] Sheng, Ying, et al. “S-LoRA: Serving thousands of concurrent LoRA adapters.” arXiv preprint arXiv:2311.03285 (2023).
[3] Zhao, Ziyu, et al. “LoraRetriever: Input-Aware LoRA Retrieval and Composition for Mixed Tasks in the Wild.” arXiv preprint arXiv:2402.09997 (2024).
Decentralized Learning with Trust Graphs
Master’s Thesis or MSc semester project
Decentralized Learning (DL) has opened new possibilities in privacy-preserving and scalable machine learning. However, current approaches generally treat nodes as equals, without taking into account trust or privacy concerns between nodes. In many enterprise settings, there are varying trust levels among participating entities. This project explores a DL approach where nodes are connected in a weighted graph, with edge weights indicating the trust levels between users. These weights guide the application of Differential Privacy (DP) to the model updates exchanged between nodes: the trust level of an edge determines the amount of privacy noise added to the update sent along it. By incorporating trust-based DP, we aim to provide a more realistic and adaptable privacy framework suited to heterogeneous collaborative environments.
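One possible instantiation of this mechanism, purely as a sketch under our own assumptions (Gaussian mechanism, noise scale decreasing with trust), is shown below.

import numpy as np

# Trust-weighted noisy update: clip the local update, then add Gaussian noise
# whose scale shrinks as the trust in the receiving neighbour grows.
# The mechanism and its parameters are illustrative assumptions.
def noisy_update_for_neighbor(update: np.ndarray, trust: float,
                              clip_norm: float = 1.0, base_sigma: float = 1.0) -> np.ndarray:
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))
    sigma = base_sigma * (1.0 - trust)  # trust in (0, 1]: full trust => (almost) no noise
    return clipped + np.random.normal(0.0, sigma, size=clipped.shape)

# The same update is perturbed differently for two neighbours.
u = np.random.randn(10)
print(noisy_update_for_neighbor(u, trust=0.9))
print(noisy_update_for_neighbor(u, trust=0.2))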
The primary objectives of this project are as follows. First, you will implement a DL protocol that operates on weighted trust graphs, applying privacy-preserving mechanisms based on trust levels between nodes. Second, to accelerate convergence, you will explore optimizing the graph topology and weights to balance privacy requirements with training/convergence efficiency. Third, you will analyze the privacy implications of this approach, examining how trust-based DP influences privacy guarantees and model accuracy. This analysis will provide insights into the trade-offs between privacy and performance in decentralized settings.
Using an existing simulator [1], you will implement this trust-weighted DL algorithm and benchmark it against standard DL techniques on various network topologies. The simulator integrates traces that capture realistic settings, including compute power, network capacity, and data heterogeneity, allowing for a practical assessment of privacy and convergence trade-offs in trust-weighted decentralized learning.
[1] Source code: https://github.com/sacs-epfl/decentralized-learning-simulator
Boosting Decentralized Learning with Bandwidth Pooling
Master’s Thesis or MSc semester project
Decentralized Learning (DL) is a relatively new class of ML algorithms where the learning process takes place on a network of interconnected devices with no central server supervising the training. While DL was initially applied within data centers to improve the efficiency and scalability of large-scale ML tasks in homogeneous environments, it is increasingly being used to train ML models across end-user devices in heterogeneous environments. With DL, each node in the network independently updates its own model based on its locally available data and shares the updated model directly with other nodes. Each node then periodically aggregates the models it has received. DL uses a peer-to-peer communication topology that prescribes which nodes share their model with which other nodes.
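For concreteness, a single synchronous DL round can be sketched as local training followed by neighbour averaging. The toy snippet below uses made-up names and a placeholder local step; it is not the project's codebase.

import numpy as np

def local_step(model: np.ndarray, lr: float = 0.1) -> np.ndarray:
    # Placeholder for a real SGD step on the node's local data.
    return model - lr * np.random.randn(*model.shape)

def dl_round(models: dict[int, np.ndarray], topology: dict[int, list[int]]) -> dict[int, np.ndarray]:
    # 1) every node trains locally; 2) every node averages its model with its neighbours'.
    trained = {n: local_step(m) for n, m in models.items()}
    return {n: np.mean([trained[n]] + [trained[p] for p in topology[n]], axis=0)
            for n in models}

# Four nodes connected in a ring topology.
models = {n: np.zeros(5) for n in range(4)}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
models = dl_round(models, ring)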
As DL moves beyond homogeneous data centers to large-scale, heterogeneous end-user environments such as smartphone networks, the variability in computational and communication resources becomes a substantial issue [1]. Discrepancies in bandwidth among nodes can lead to inefficiencies in model dissemination, which is critical to the DL process and directly affects the duration of a round. This project aims to design and evaluate a bandwidth pooling strategy where nodes with surplus bandwidth assist other nodes in disseminating their models. The main research question we seek to address is: “How can nodes in DL effectively utilize the surplus bandwidth of other nodes to accelerate model dissemination?” Some related work on this topic can be found in references [2-4].
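As a very rough illustration of what such a strategy could look like, the sketch below greedily offloads part of a node's model transfers to an idle helper with spare upload bandwidth. All names, units, and the greedy rule are our own assumptions, not a proposed design.

# Hypothetical bandwidth-pooling plan: a sender keeps as many direct transfers
# as its upload bandwidth allows and delegates the rest to idle helpers with
# surplus bandwidth. Bandwidths and the model size are in arbitrary units.
def plan_dissemination(sender: str, recipients: list[str],
                       upload_bw: dict[str, float], model_size: float) -> dict[str, list[str]]:
    helpers = sorted((n for n in upload_bw if n != sender and n not in recipients),
                     key=lambda n: -upload_bw[n])
    own_capacity = max(1, int(upload_bw[sender] // model_size))
    plan = {sender: recipients[:own_capacity]}
    remaining = recipients[own_capacity:]
    for helper in helpers:
        if not remaining:
            break
        share = max(1, int(upload_bw[helper] // model_size))
        plan[helper] = remaining[:share]
        remaining = remaining[share:]
    plan[sender] = plan[sender] + remaining  # fall back to direct transfers if no helper is left
    return plan

# Node "e" has surplus bandwidth and relays the model of "a" to two recipients.
print(plan_dissemination("a", ["b", "c", "d"],
                         {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 4.0}, model_size=1.0))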
[1] Lai, Fan, et al. “FedScale: Benchmarking model and system performance of federated learning at scale.” International Conference on Machine Learning. PMLR, 2022.
[2] Chen, Yifan, Shengli Liu, and Dingzhu Wen. “Communication Efficient Decentralized Learning over D2D Network: Adaptive Relay Selection and Resource Allocation.” IEEE Wireless Communications Letters (2024).
[3] Yemini, Michal, et al. “Robust Semi-Decentralized Federated Learning via Collaborative Relaying.” IEEE Transactions on Wireless Communications (2023).
[4] Tang, Zhenheng, et al. “GossipFL: A decentralized federated learning framework with sparsified and adaptive communication.” IEEE Transactions on Parallel and Distributed Systems 34.3 (2022): 909-922.
Optimizing the Simulation of Decentralized Learning Algorithms
Master’s Thesis or MSc semester project
Contact: Martijn de Vos ([email protected])
The simulation of distributed learning algorithms is important for understanding their achievable model accuracy and the overhead in terms of total training time and communication cost. For this purpose, our lab has recently designed and implemented a distributed simulator specifically devised to simulate decentralized learning (DL) algorithms [1].
The main idea of this simulator is that the timestamps of all events in the system (training, model transfers, aggregation, etc.) are first determined using a discrete-event simulation; while doing so, the simulator builds a compute graph containing all compute tasks (train, aggregate, or test). The compute graph is then solved in a distributed manner, possibly using different machines and multiple workers. An advantage of this simulator over others is that it supports the integration of real-world mobile traces that include each node’s training and network capacity.
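As a rough illustration of this two-phase design, the toy sketch below runs a discrete-event pass that assigns timestamps to events and records the resulting compute tasks; event names and timings are our own and do not reflect the actual simulator.

import heapq

def simulate(num_nodes: int, rounds: int, train_time: float = 2.0, transfer_time: float = 1.0):
    # Phase 1: discrete-event pass; each heap entry is (time, node, kind, round).
    events = [(0.0, n, "train", 0) for n in range(num_nodes)]
    heapq.heapify(events)
    compute_graph = []  # ordered list of (timestamp, node, task) to be executed later
    while events:
        t, node, kind, r = heapq.heappop(events)
        compute_graph.append((t, node, kind))
        if kind == "train":
            heapq.heappush(events, (t + train_time, node, "transfer", r))
        elif kind == "transfer":
            heapq.heappush(events, (t + transfer_time, node, "aggregate", r))
        elif kind == "aggregate" and r + 1 < rounds:
            heapq.heappush(events, (t, node, "train", r + 1))
    return compute_graph  # Phase 2 (not shown): execute these tasks on distributed workers

for entry in simulate(num_nodes=2, rounds=2):
    print(entry)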
Because of the discrete-event simulation, we maintain full control over the passing of time, enabling the evaluation of DL algorithms with nodes with different hardware characteristics. The FedScale simulator [2] uses a similar idea but only supports Federated Learning, which is generally easier to simulate than decentralized learning. While the first version of our simulator is already in use for various projects involving DL, there is a significant opportunity to enhance its scalability regarding the number of nodes it can support.
This project aims to improve the scalability of our simulator by identifying and implementing various optimization techniques. For instance, the current simulator is limited by the available memory, as many DL algorithms induce a memory footprint that scales linearly with the number of nodes in the DL network. One potential approach is to reduce the simulator’s memory footprint by strategically aggregating models as soon as they arrive at a machine. Other optimizations can revolve around the manipulation of the compute graph that is generated during the discrete-event simulation.
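The eager-aggregation idea can be sketched as keeping only a running sum per node instead of retaining every received model until the end of a round; the class name and interface below are hypothetical.

import numpy as np

# Running aggregator: memory stays O(1) in the number of received models,
# instead of O(#neighbours) when all models are buffered until aggregation.
class RunningAggregator:
    def __init__(self):
        self.total = None
        self.count = 0

    def add(self, model: np.ndarray) -> None:
        self.total = model.copy() if self.total is None else self.total + model
        self.count += 1

    def result(self) -> np.ndarray:
        return self.total / self.count

agg = RunningAggregator()
for _ in range(3):
    agg.add(np.random.randn(4))
print(agg.result())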
Affinity and experience with implementing distributed systems are required for this project. Since the project primarily focuses on the performance of the simulation rather than on the ML algorithms themselves, familiarity with ML/DL algorithms is optional but can be helpful during the project.
[1] Source code: https://github.com/sacs-epfl/decentralized-learning-simulator
A Comparative Evaluation of Decentralized Learning Algorithms using Realistic Real-world Traces
Master’s Thesis or MSc semester project
Contact: Martijn de Vos ([email protected])
Decentralized Learning (DL) has gained significant attention in recent years due to its potential to enhance the privacy, fault tolerance, and scalability of machine learning compared to centralized settings. Various algorithms have been proposed for DL, such as Decentralized Stochastic Gradient Descent (D-SGD), AllReduce-SGD (in which nodes are connected in a ring topology and use AllReduce to average their models), Asynchronous Decentralized Parallel Stochastic Gradient Descent (AD-PSGD) and Gossip Learning (GL). While these algorithms all share the same objective – collaboratively training a model without sharing data – they make different trade-offs, e.g., whether rounds are synchronised and how models are averaged across peers. To better understand these trade-offs, a comprehensive experimental evaluation is needed. In this project, we propose a detailed comparative analysis of these DL algorithms using real-world traces that capture compute power, network capacity, data heterogeneity, and node availability in realistic FL settings.
The primary goal of this project is to compare and evaluate state-of-the-art DL algorithms under realistic conditions. You will implement these algorithms in an existing simulator [1], which already integrates traces that mimic the real-world behaviour of compute power, network capacity, data heterogeneity, and node availability. You will leverage these traces and implement and compare the different DL algorithms. By doing so, you will provide a more accurate and practical assessment of these algorithms’ strengths and weaknesses, which will guide future research and development in the field of DL.
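To illustrate one of the trade-offs mentioned above, the toy sketch below contrasts a synchronous, D-SGD-style neighbour average with an asynchronous, gossip-style merge applied whenever a single model arrives; the age-based weighting is our own simplification, not the exact rule of any of the cited algorithms.

import numpy as np

def dsgd_aggregate(own: np.ndarray, neighbours: list[np.ndarray]) -> np.ndarray:
    # Synchronous: wait for all neighbours in the round, then average uniformly.
    return np.mean([own] + neighbours, axis=0)

def gossip_merge(own: np.ndarray, received: np.ndarray,
                 own_age: int, received_age: int) -> np.ndarray:
    # Asynchronous: merge one incoming model, weighted by how long each model
    # has been trained, without any round barrier.
    w = received_age / max(1, own_age + received_age)
    return (1 - w) * own + w * received

a, b, c = np.random.randn(3), np.random.randn(3), np.random.randn(3)
print(dsgd_aggregate(a, [b, c]))
print(gossip_merge(a, b, own_age=10, received_age=5))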
[1] Source code: https://github.com/sacs-epfl/decentralized-learning-simulator
Privacy-Preserving Personalized Decentralized Learning
Master’s Thesis or MSc semester project
Contact: Sayan Biswas ([email protected])
Decentralized learning (DL) is an emerging collaborative framework that enables nodes in a network to train machine learning models without sharing their private datasets or relying on a centralized entity (e.g., a server) [1].
However, the growing heterogeneity of data used in model training, alongside recent incidents—such as Facebook mislabeling Black men as primates [2] and facial-analysis software exhibiting a 0.8% error rate for light-skinned men compared to a 34.7% error rate for dark-skinned women [3]—has exposed the lack of minority representation in current ML models. This has underscored the pressing need for training personalized ML models that account for the diverse data distributions and attributes of various communities.
While personalized model training has recently gained attention in centralized ML and Federated Learning (FL), it remains relatively underexplored in DL. Moreover, the limited research on personalized DL has primarily focused on fairness and efficiency [4].
However, personalized model training raises concerns about potential privacy risks. Models trained on data from different communities may inadvertently leak sensitive information, making it easier for adversaries to identify members of minority groups and compromise their privacy.
The primary objective of this project is to analyze the trade-off between data privacy—examined from both empirical (e.g., privacy-invasive attacks such as membership inference and gradient inversion) and information-theoretical perspectives (e.g., differential privacy)—and model personalization in decentralized frameworks. We aim to establish a foundational framework to characterize this trade-off and explore methods for developing privacy-aware personalized DL algorithms. This will contribute to creating privacy-preserving and fair approaches for training models in a decentralized manner, taking a pioneering step toward ethical and trustworthy DL.
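As a concrete example of the empirical side of this analysis, the classic loss-threshold membership-inference baseline can be sketched as follows; the model interface and threshold are illustrative assumptions, not a project deliverable.

import numpy as np

# Loss-threshold membership inference: samples on which the model has unusually
# low loss are guessed to be training members. Toy interfaces only.
def membership_scores(loss_fn, model, samples: list) -> np.ndarray:
    return np.array([loss_fn(model, x) for x in samples])

def infer_membership(scores: np.ndarray, threshold: float) -> np.ndarray:
    # Predict "member" whenever the per-sample loss falls below the threshold.
    return scores < threshold

# Example with a dummy scalar "model" and a squared-error loss.
dummy_model = 0.0
squared_error = lambda m, x: (x - m) ** 2
scores = membership_scores(squared_error, dummy_model, [0.1, 2.5, 0.05])
print(infer_membership(scores, threshold=0.5))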
To contribute effectively to this project, we highly value:
- A strong mathematical foundation and interest in probability theory, algebra, and analysis.
- Proficiency in basic machine learning implementation.
[1] Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent. Lian et al. NeurIPS 2017.
[2] https://www.bbc.com/news/technology-58462511
[3] https://news.mit.edu/2018/study-finds-gender-skin-type-bias-artificial-intelligence-systems-0212
[4] Fair Decentralized Learning. Biswas et al. Under Review.