Available Projects Spring 2025

If you are interested in doing a research project (“semester project”) or a master’s project at IVRL, you can do this through the Master Programs in Communication Systems or in Computer Science. Note that you must be accredited at EPFL. This page lists available semester/master’s projects for the Spring 2025 semester.

For any other type of application (research assistantship, internship, etc.), please check this page.

Description:

Recent advances in neural rendering-based 3D scene reconstruction have demonstrated a strong capacity for representing visually plausible real-world scenes. However, most of these methods rely heavily on dense multi-view captures, which restricts their broader applicability.

In this project, we aim to exploit the strong priors of latent video diffusion models to synthesize high-fidelity novel views of generic scenes from single or sparse input captures. We adopt a radiance field as the scene representation and explore the implicit 3D understanding and intra-frame attention correlations exhibited by video diffusion models as a substitute for the multi-view captures required in prior works.
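
As a rough illustration of the envisioned pipeline (a minimal sketch only; VideoDiffusionPrior, its sample method, and fit_radiance_field are hypothetical placeholders, not an existing API), the video diffusion prior stands in for the dense multi-view captures that prior works require:

    # Sketch of the envisioned pipeline: sparse captures -> diffusion-generated
    # pseudo-views -> radiance-field optimization. All names are placeholders.
    def novel_views_from_sparse(images, cameras, target_cameras):
        # 1. A pretrained latent video diffusion model acts as a multi-view prior.
        prior = VideoDiffusionPrior.from_pretrained("some/video-diffusion")  # hypothetical

        # 2. Condition on the sparse captures and sample plausible intermediate
        #    frames along a camera trajectory (pseudo multi-view captures).
        pseudo_views = prior.sample(cond_images=images,
                                    cond_cameras=cameras,
                                    target_cameras=target_cameras)

        # 3. Fit the radiance field on the union of real and generated views.
        return fit_radiance_field(images + pseudo_views,          # hypothetical
                                  cameras + target_cameras)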

Type of work:

  • MS Level: semester project / master’s project
  • 65% research, 35% development

Prerequisite:

  • Knowledge of deep learning frameworks (e.g., PyTorch or TensorFlow), image processing, and computer vision.
  • Experience with 3D vision is required (e.g., courses taken, independent projects, etc.)

Supervisor:

Reference Literature:

  • High-resolution image synthesis with latent diffusion models
  • ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis
  • FSGS: Real-time few-shot view synthesis using Gaussian splatting

Startup company Innoview has developed a software framework to create hidden watermarks printed on paper and to acquire and decode them with a smartphone. The smartphone acquisition pipeline comprises many separate parametrizable parts. The project consists of improving some of these parts in order to optimize the recognition rate of the hidden watermarks (under Android).

Deliverables:

  • Report and running prototype.

Prerequisites:

  • Basic knowledge of image processing and computer vision
  • Coding skills in Java (Android), C#, and/or Matlab

Level: BS or MS semester project

Supervisors:

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44

Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27

Startup company Innoview has developed arrangements of lenslets that can be used to create document security features. The goal is to improve these security features and to optimize them by simulating the interaction of light with these 3D lenslet structures, using the Blender software.
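
As an indication of the kind of simulation involved (a minimal sketch, assuming a recent Blender release; shader input names such as “Transmission” vary between versions), a lenslet array can be modeled as flattened glass hemispheres and rendered with the Cycles path tracer:

    # Minimal Blender (bpy) sketch: build a small lenslet array as flattened
    # glass hemispheres and render it with Cycles. Run from Blender's
    # scripting workspace; shader input names differ across Blender versions.
    import bpy

    mat = bpy.data.materials.new(name="LensletGlass")
    mat.use_nodes = True
    bsdf = mat.node_tree.nodes["Principled BSDF"]
    bsdf.inputs["IOR"].default_value = 1.5
    # Blender 3.x input name; in 4.x it is "Transmission Weight".
    bsdf.inputs["Transmission"].default_value = 1.0

    for i in range(4):                       # 4 x 4 lenslet grid
        for j in range(4):
            bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(i, j, 0.0))
            lens = bpy.context.object
            lens.scale.z = 0.3               # flatten the sphere into a lenslet
            lens.data.materials.append(mat)

    bpy.context.scene.render.engine = 'CYCLES'   # path tracing handles refraction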

Deliverables:

  • Report and running prototype (Matlab). Blender lenslet simulations.

Prerequisites:

  • Knowledge of computer graphics and of the interaction of light with 3D mesh objects
  • Basic knowledge of Blender
  • Coding skills in Matlab

Level: BS or MS semester project

Supervisors:

Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27

Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44

Startup company Innoview has developed new moiré features that can prevent counterfeits. Some types of moiré features rely on grayscale images. The present project aims at creating a grayscale image editor. Designers should be able to shape their grayscale image by various means (interpolation between spatially defined grayscale values, geometric transformations, image warping, etc.).
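
As an illustration of the first editing mode, interpolation between spatially defined grayscale values, here is a minimal Python sketch (the actual prototype would be in Matlab, e.g. with scatteredInterpolant) that builds a smooth grayscale image from a handful of user-placed control points:

    # Interpolate a smooth grayscale image from a few control points.
    import numpy as np
    from scipy.interpolate import griddata

    h, w = 256, 256
    points = np.array([[20, 30], [200, 40], [128, 128], [60, 220]])  # (x, y)
    values = np.array([0.1, 0.9, 0.5, 0.3])   # gray levels in [0, 1]

    ys, xs = np.mgrid[0:h, 0:w]
    gray = griddata(points, values, (xs, ys), method='cubic')
    # Cubic interpolation is undefined outside the control points' convex
    # hull; fall back to nearest-neighbor values there.
    nearest = griddata(points, values, (xs, ys), method='nearest')
    gray = np.where(np.isnan(gray), nearest, gray)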
 
Deliverables: Report and running prototype (Matlab).
 
Prerequisites:
  • Knowledge of image processing / computer vision
  • Coding skills in Matlab
 
Level: BS or MS semester project
 
Supervisors:
Prof. Roger D. Hersch, BC110, [email protected], cell: 077 406 27
Dr Romain Rossier, Innoview Sàrl, [email protected], tel 078 664 36 44

This project aims to explore whether off-the-shelf diffusion models encode semantic information that helps us, and other deep learning models, understand the content of an image or the relationships between images.

Diffusion models [1] have become the new paradigm for generative modeling in computer vision. Despite their success, they remain black boxes during generation. At each step, the model provides a direction, namely the score, towards the data distribution. As shown in recent work [2], the score can be decomposed into different meaningful components. The first research question is: does the score encode any semantic information about the generated image?
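
For concreteness, in the epsilon parameterization of DDPMs [1] the score of the noisy marginal relates to the predicted noise as score(x_t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t). A minimal sketch of extracting it from a pretrained model (assuming the Hugging Face diffusers library and the public google/ddpm-cifar10-32 checkpoint):

    # Extract the score from a pretrained DDPM via the epsilon parameterization.
    import torch
    from diffusers import DDPMPipeline

    pipe = DDPMPipeline.from_pretrained("google/ddpm-cifar10-32")
    unet, scheduler = pipe.unet, pipe.scheduler

    x0 = torch.randn(1, 3, 32, 32)            # stand-in for a real image batch
    t = torch.tensor([100])
    x_t = scheduler.add_noise(x0, torch.randn_like(x0), t)  # forward-diffuse to step t

    with torch.no_grad():
        eps = unet(x_t, t).sample             # predicted noise
    alpha_bar_t = scheduler.alphas_cumprod[t].view(-1, 1, 1, 1)
    score = -eps / torch.sqrt(1.0 - alpha_bar_t)   # estimate of grad log p(x_t)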

Moreover, there is evidence that the representations learned by diffusion models are helpful to discriminative models. For example, they can boost classification performance through knowledge distillation [3]. Furthermore, a diffusion model can itself be used as a robust classifier [4]. Discriminative information can thus be extracted from a diffusion model. The second question is then: what is this information about? Is it about object shape? Location? Texture? Or other kinds of information?

This is an exploratory project. We will try to interpret the black box inside diffusion models and dig into the semantic information they encode. Together, we will also brainstorm applications of diffusion models beyond image generation. This project can be a good chance for you to develop interest and skills in scientific research.

References:

[1] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020, 33: 6840-6851.

[2] Alldieck, T., Kolotouros, N., and Sminchisescu, C. Score distillation sampling with learned manifold corrective. arXiv preprint arXiv:2401.05293, 2024.

[3] Yang, X., and Wang, X. Diffusion model as representation learner. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 18938-18949.

[4] Chen, H., Dong, Y., Shao, S., et al. Your diffusion model is secretly a certifiably robust classifier. arXiv preprint arXiv:2402.02316, 2024.

Deliverables: Code, well cleaned up and easily reproducible, as well as a written report explaining the models, the steps taken during the project, and the results.

Prerequisites: Python and PyTorch. Basic understanding of diffusion models.

Level: MS research project

Number of students: 1

Contact: Yitao Xu, [email protected]

Introduction
3D mesh generation plays a pivotal role in virtual reality, gaming, and digital content creation, but generating high-quality, detailed meshes remains a challenging task. Traditional methods often fail to capture fine-grained details or optimize computational efficiency, especially for complex, textured surfaces. This proposal seeks to enhance 3D mesh generation by incorporating frequency decomposition models, leveraging multi-resolution analysis to capture both broad structural features and intricate details.

Objective
The primary goal of this research is to develop a frequency-based decomposition model for 3D mesh generation, enabling precise control over the detail level of generated meshes. By decomposing spatial and frequency components, we aim to improve mesh quality, reduce processing times, and enhance texture and surface detail.

Methodology

  1. Frequency Decomposition: Apply discrete wavelet transforms (DWT) to the spatial and normal maps of 3D meshes, separating high-frequency components (surface details) from low-frequency components (broad structural shapes); a code sketch of this step, together with step 3, follows this list.

  2. Component-specific Optimization: Tailor the mesh generation model to optimize specific frequency components. For example, low-frequency structures can be prioritized for smooth topology, while high-frequency details can be preserved in texture-rich areas.

  3. Multi-level Reconstruction: Iteratively reconstruct the mesh from frequency components using an inverse wavelet transform (IDWT), allowing for customizable levels of detail depending on the desired quality.

  4. Evaluation and Benchmarking: Compare the proposed approach against existing methods on benchmarks, measuring structural consistency, texture fidelity, and computational efficiency.
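
As referenced above, a minimal sketch of steps 1 and 3 (assuming the PyWavelets library and a single normal-map channel as input; the mesh-specific extensions are the open research questions of this project):

    # One-level 2D wavelet analysis/synthesis of a normal-map channel.
    import numpy as np
    import pywt

    normal_map = np.random.rand(256, 256)     # stand-in for one normal channel

    # Step 1: one analysis level splits the map into a low-frequency
    # approximation and three high-frequency detail bands.
    cA, (cH, cV, cD) = pywt.dwt2(normal_map, 'haar')

    # Step 2 would act on the bands separately, e.g. attenuating detail
    # bands where smooth topology is preferred.
    cH, cV, cD = 0.8 * cH, 0.8 * cV, 0.8 * cD

    # Step 3: reconstruct via the inverse transform (IDWT).
    recon = pywt.idwt2((cA, (cH, cV, cD)), 'haar')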

Expected Contributions

  1. A novel, frequency-based approach for enhancing 3D mesh quality.
  2. A multi-level decomposition and reconstruction framework that allows selective detail optimization.
  3. An efficient algorithm capable of handling complex surfaces without compromising mesh detail.

Prerequisites: Python and PyTorch. Basic understanding of diffusion models.

Level: MS research project

Number of students: 1

Contact: Yufan Ren, [email protected]

Introduction
With the advancement of Vision-Language Models (VLMs), generating coherent, contextually relevant narratives from images has become an exciting yet challenging frontier. Current models struggle to maintain narrative consistency, often introducing contradictory details or missing contextually vital elements when interpreting a sequence of images. This proposal seeks to enhance the storytelling capabilities of VLMs by introducing a self-consistency mechanism, aimed at reinforcing coherence, maintaining character continuity, and upholding narrative flow across multiple image inputs.

Objective
The main objective of this research is to develop a self-consistency framework for VLMs that enables improved narrative coherence in visual storytelling tasks. This mechanism will monitor and enforce consistency in story elements such as character traits, setting, actions, and progression, producing narratives that align closely with human expectations for logical and consistent storytelling.

Methodology

  1. Self-Consistency Module: Integrate a self-consistency module within the VLM architecture, which will cross-reference details across sequential images, ensuring that entities, actions, and story elements remain logically consistent. This module will evaluate consistency by tracking character attributes, scene elements, and temporal relationships, adjusting model outputs to rectify inconsistencies.

  2. Memory and Reference Mechanisms: Implement a memory-based mechanism to store narrative elements identified in each image, maintaining a “story memory” that captures the main characters, locations, and story arcs. This will allow the VLM to reference earlier parts of the story and avoid contradictions or omissions as it progresses (a minimal sketch of such a memory follows this list).

  3. Training with Self-Supervision: Use self-supervised learning to fine-tune the model on datasets where story coherence is crucial. During training, the model will be penalized for introducing inconsistencies in narrative elements or disrupting logical story progression.

  4. Evaluation and Benchmarking: Develop a new visual storytelling benchmark focused on self-consistency, assessing narrative coherence, character consistency, and story progression. The model will be evaluated on metrics such as narrative accuracy, coherence, and alignment with human story interpretation.
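
A minimal, purely illustrative sketch of the story memory from step 2 (the toy conflict table and trait sets are invented placeholders; extracting traits reliably from images is the actual research problem):

    # A running record of narrative entities that each newly generated
    # caption is checked against.
    from dataclasses import dataclass, field

    EXCLUSIVE = {("young", "old"), ("indoors", "outdoors")}  # toy conflict table

    def conflicts(a: str, b: str) -> bool:
        return (a, b) in EXCLUSIVE or (b, a) in EXCLUSIVE

    @dataclass
    class StoryMemory:
        characters: dict = field(default_factory=dict)   # name -> set of traits

        def update(self, name: str, traits: set):
            self.characters.setdefault(name, set()).update(traits)

        def inconsistencies(self, name: str, traits: set) -> set:
            # Traits contradicting what was established earlier in the story.
            seen = self.characters.get(name, set())
            return {t for t in traits if any(conflicts(t, s) for s in seen)}

    memory = StoryMemory()
    memory.update("Alice", {"young", "outdoors"})
    print(memory.inconsistencies("Alice", {"old"}))      # {'old'}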

Expected Contributions

  1. A novel self-consistency mechanism for VLMs to enhance coherence in multi-image storytelling tasks.
  2. A memory-based reference model that maintains continuity across scenes, characters, and settings.
  3. A new benchmark and evaluation framework for testing and measuring consistency in visual storytelling.

Prerequisites: Python and PyTorch. Basic understanding of diffusion models.

Level: MS research project

Number of students: 1

Contact: Yufan Ren, [email protected]

Description

Diffusion models have advanced the field of image generation, enabling one to create realistic and detailed images of almost any scene. However, these models depend more on rote learning of example scenes than on a true understanding of a scene and its geometry. As a consequence, generated images can feature incorrect perspectives and geometric features. By contrast, natural photographs obey specific geometric constraints. In particular, lines that are parallel in a scene converge in the photograph to a vanishing point, and all vanishing points derived from lines on parallel planes lie on the same vanishing line. When these principles are broken in generated images, the images can lack realism. In augmented or virtual reality systems, such violations can disrupt viewer immersion. Geometric accuracy is furthermore crucial for applications such as architectural visualization. Thus, improving perspective in generated images would not only enhance their aesthetic quality, but also expand the utility of generative models in professional domains. Addressing this challenge could push the boundaries of what generative models can achieve. On the other hand, current geometric artefacts could be analysed to distinguish generated images from real ones and detect deepfakes.
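
The underlying projective geometry is compact: in homogeneous coordinates, the line through two image points is their cross product, and the intersection of two lines is again a cross product, so the vanishing point of two scene-parallel lines is one cross product away. A minimal NumPy sketch (with made-up pixel coordinates):

    # Vanishing point of two scene-parallel lines via homogeneous coordinates.
    import numpy as np

    def line_through(p, q):
        return np.cross([*p, 1.0], [*q, 1.0])

    def intersection(l1, l2):
        v = np.cross(l1, l2)
        return v[:2] / v[2]            # back to pixels (v[2] == 0 => at infinity)

    # Two rails of a scene-parallel track, as observed in an image:
    l1 = line_through((100, 400), (220, 100))
    l2 = line_through((500, 400), (320, 100))
    vp = intersection(l1, l2)          # their vanishing point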

In this project, we aim to investigate the geometry of images, both real and generated. We will review geometry analysis methods, most notably vanishing point detection. This problem has long been studied in the literature, both with geometric and algorithmic methods and with more recent learning-based tools. However, it remains to be seen which of these methods still apply when geometric correctness cannot be assumed in the first place. Furthermore, many generative models focus on generating faces, which are more difficult to analyse due to the absence of straight lines.

The developed tools will be used to assess and quantify the geometric inaccuracies produced by various diffusion models. Then, depending on interests and early results, this project could branch into two possible topics. A first application would be to develop a deepfake detection tool based on geometric analysis. Beyond deepfake detection, one could also seek to improve diffusion model generation to ensure the geometric correctness of generated images.

Deliverables

The final report should contain a review, both experimental and theoretical, of existing vanishing point detection methods. It might be relevant to implement one or several of the older methods, for which code is not always available. The review should also address the specificities of generated-image analysis detailed above. It should contain experiments assessing to what extent different diffusion models create geometric artefacts.

The report should also detail the proposed innovations on at least one of the following topics:

  • Improvements made to vanishing point detection
  • Deepfake detection using geometric analysis
  • Improving the perspective of generated images

Overall, the report should be structured, and present experiments done during the project and conclusions that can be drawn from them. New proposed methods, as well as reimplemented ones, should be explained in a reproducible manner, for example with pseudo-code. If any training is involved, training details should be comprehensively explained.

In addition to the report, a clean, well-documented code enabling the reproduction of experiments will be expected.

Prerequisites

  • Strong skills in geometry
  • Proficiency in writing clean code, ideally with Python and PyTorch
  • Depending on the directions of this project, statistics and probability and/or a basic understanding of diffusion models (ideally both)

Type of work and number of students

Either one Master’s thesis student, or one or two (ideally two) MS research project students (semester projects)

Supervision

Quentin Bammey, [email protected]

Main references

As the first reference contains important details pertaining to this project, please read it before applying.

  • Farid, Hany. “Perspective (in)consistency of paint by text.” arXiv preprint arXiv:2206.14617 (2022). https://arxiv.org/abs/2206.14617
  • Desolneux, Agnes, Lionel Moisan, and Jean-Michel Morel. From gestalt theory to image analysis: a probabilistic approach. Vol. 34. Springer Science & Business Media, 2007. (Chapter 8 is of particular interest to this project, but the whole book will be relevant if focusing on deepfake detection)
  • Almansa, Andrés, Agnes Desolneux, and Sébastien Vamech. “Vanishing point detection without any a priori information.” IEEE Transactions on Pattern Analysis and Machine Intelligence 25.4 (2003): 502-507.
  • Upadhyay, Rishi, et al. “Enhancing diffusion models with 3d perspective geometry constraints.” ACM Transactions on Graphics (TOG) 42.6 (2023): 1-15.
  • Santana-Cedrés, Daniel, et al. “Automatic correction of perspective and optical distortions.” Computer Vision and Image Understanding 161 (2017): 1-10.
  • Tehrani, Mahdi Abbaspour, Aditi Majumder, and M. Gopi. “Correcting perceived perspective distortions using object specific planar transformations.” 2016 IEEE International Conference on Computational Photography (ICCP). IEEE, 2016.
  • Zhang, Lvmin, Anyi Rao, and Maneesh Agrawala. “Adding conditional control to text-to-image diffusion models.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
  • Tutorial on diffusion models: https://cvpr2022-tutorial-diffusion-models.github.io/

Introduction:

Existing image generation methods [1,2,3] typically rely on large datasets of clean images to learn data distributions. However, a large corpus of clean data is not always available. This project seeks to explore whether noisy data alone can be used for image generation.

Several works have shown that clean images can be restored using only noisy data, such as Noise2Noise [4], Noise2Void [5] and Noiser2Noise [6]. They typically train their denoisers using paired noisy data or even just a single noisy realization of each training image.

Meanwhile, recent findings have established relationships between Score Matching density estimates [7] and denoising, showing that the score can be estimated using a Gaussian denoiser. Consequently, Gaussian denoisers can be employed to draw high-probability samples from the implicit prior embedded within them [8].
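
Concretely, this link is Tweedie's identity: for Gaussian noise of variance sigma^2, the score of the noisy density is (D(x) - x) / sigma^2, where D is the (MMSE) Gaussian denoiser. A minimal sketch of using it for coarse sampling in the spirit of [8] (`denoiser` is a placeholder for any trained Gaussian denoising network):

    # Turn a Gaussian denoiser into a score estimate via Tweedie's identity,
    # score(x) = (D(x) - x) / sigma**2, and take coarse ascent steps toward
    # high-probability images. `denoiser` is a placeholder network.
    import torch

    def score_from_denoiser(denoiser, x, sigma):
        with torch.no_grad():
            return (denoiser(x) - x) / sigma**2

    x = torch.randn(1, 3, 64, 64)              # start from pure noise
    for sigma in torch.linspace(1.0, 0.05, 50):
        s = score_from_denoiser(denoiser, x, sigma)   # placeholder network
        x = x + sigma**2 * s                   # one coarse step toward the prior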

By combining these two techniques, a natural question arises: can we use only noisy data for realistic clean image generation?

 

Objective:

The primary objective of this project is to develop a framework for realistic image generation using only noisy data. Specifically, the model will be trained exclusively on noisy data but should generate clean images during inference. This approach is especially valuable in scenarios where clean datasets are scarce or unavailable.

 

Type of work:

master semester project

65% research, 35% development

 

Prerequisite:

Proficiency in deep learning frameworks (e.g., PyTorch)

Familiarity with image processing and computer vision

(Optional) Prior knowledge of diffusion models is advantageous

 

Supervisor:

Liying Lu ([email protected])

 

Reference:

[1]. Generative Adversarial Networks

[2]. Denoising Diffusion Probabilistic Models

[3]. High-Resolution Image Synthesis with Latent Diffusion Models

[4]. Noise2Noise: Learning Image Restoration without Clean Data

[5]. Noise2Void – Learning Denoising from Single Noisy Images

[6]. Noisier2Noise: Learning to Denoise from Unpaired Noisy Data

[7]. Generative modeling by estimating gradients of the data distribution

[8]. Stochastic Solutions for Linear Inverse Problems using the Prior Implicit in a Denoiser