Here is a list of master and semester projects currently proposed at the DHLAB. For most projects, descriptions are initial seeds and the work can be adjusted depending on the skills and the interests of the students. For a list of already completed projects (with code and reports), see this GitHub page.
- Are you interested in a project listed below that is marked as available? Write an email to the contact person(s) mentioned in the project description, stating your section and year, and possibly including an overview of your recent grades.
- Would you like to propose a project, or are you interested in the work done at the DHLAB? Write an email to Frédéric Kaplan, Emanuela Boros, and Maud Ehrmann, explaining what you would like to do.
Spring 2025
Available
Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1
Context: The Impresso Project enriches large collections of radio and newspaper archives using image and text processing techniques. Among the many processes applied to a dataset of over 130 digitised historical newspapers containing millions of pages and images (15B tokens) is traditional named entity recognition and linking. While location names are recognised and linked to Wikidata, a crucial dimension is still missing: accurately georeferencing the relevant location names, so that the spatial dimension can be integrated with the temporal one.
Objective: This project aims to scale multilingual location detection and georeferencing across the Impresso corpus, with a particular focus on sub-city levels.
Main Steps
- Familiarising yourself with the Impresso project and data.
- Literature Review: Explore existing research on multilingual location detection and georeferencing.
- Data Analysis: Examine location names already recognised in the corpus, analyzing their statistical profiles, common errors, and areas for improvement, particularly at the sub-city level. Additionally, identify which location entities in historical newspapers are most relevant for mapping purposes.
- System Implementation: Develop or adapt a system for fine-grained location name recognition and linking. Various directions could be followed.
- Relevance Filtering: Design a method to determine which recognised place names are meaningful and should be georeferenced.
- Evaluation: Assess the system’s performance using appropriate metrics and benchmarks.
- Application: Deploy the system on the entire Impresso corpus.
The student can leverage tools such as the T-Res library, DeezyMatch, the HIPE entity evaluation pipeline, and the TopRes19th dataset.
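To make the georeferencing step concrete, here is a minimal sketch that retrieves coordinates for an already-linked entity directly from Wikidata; it assumes the entity linking step has produced Wikidata QIDs, uses only the public SPARQL endpoint, and is not a substitute for the tools listed above.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def coordinates_for_qid(qid: str) -> tuple[float, float] | None:
    """Fetch (latitude, longitude) of a Wikidata entity from its P625 coordinate claim."""
    query = f"SELECT ?coord WHERE {{ wd:{qid} wdt:P625 ?coord . }}"
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "impresso-georeferencing-sketch/0.1"},  # polite UA, placeholder name
        timeout=30,
    )
    resp.raise_for_status()
    bindings = resp.json()["results"]["bindings"]
    if not bindings:
        return None  # entity has no coordinates, e.g. a sub-city place not yet georeferenced
    # The literal looks like "Point(6.63 46.52)" in WKT (longitude latitude) order.
    lon, lat = bindings[0]["coord"]["value"].removeprefix("Point(").removesuffix(")").split()
    return float(lat), float(lon)

if __name__ == "__main__":
    print(coordinates_for_qid("Q807"))  # Q807 should be Lausanne; any linked QID would do
```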
Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with natural language processing, proficiency in Python, experience with a DL framework (preferably Pytorch), interest in historical data.
A few references:
- Ardanuy, M. C., Nanni, F., Beelen, K., & Hare, L. (2023). The past is a foreign place: Improving toponym linking for historical newspapers. CEUR Workshop Proceedings (ISSN 1613-0073).
- Meijers, E., & Peris, A. (2019). Using toponym co-occurrences to measure relationships between places: Review, application and evaluation. International Journal of Urban Sciences, 23(2), 246-268.
Type of project: Semester
Supervisor: Maud Ehrmann
Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:
1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:
- the annotation of a small training set (this step is best done in collaboration with the project on image classification);
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
UPDATE: an annotated dataset already exists.
2/ Learn a model for map classification (which country or region of the world is represented)
- first exploration and qualification of map types in the corpus.
- building of a training set, probably with external sources
- the training of a model by fine-tuning an existing visual model (see the sketch below);
- the evaluation of the said model.
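As a rough illustration of the fine-tuning steps referenced above, here is a minimal PyTorch/torchvision sketch; the backbone choice, the frozen-feature strategy, and the hyperparameters are placeholders, not prescriptions.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_map_classifier(num_classes: int = 2) -> nn.Module:
    """Binary map vs. non-map classifier fine-tuned from an ImageNet-pretrained backbone."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    for param in model.parameters():
        param.requires_grad = False                            # freeze the pretrained backbone
    model.fc = nn.Linear(model.fc.in_features, num_classes)    # new trainable classification head
    return model

def train_one_epoch(model: nn.Module, loader, device: str = "cuda") -> None:
    """One pass over a DataLoader yielding (B, 3, 224, 224) image tensors and integer labels."""
    model.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```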
Required skills:
- basic knowledge in computer vision
- ideally experience with PyTorch
Type: MSc (12 ECTS) Semester project
Sections: Digital Humanities, Data Science, Computer Science
Supervisors: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2
Context
This project aims to explore, retrieve, and analyze data from various sources to monitor and understand Russia’s manipulation and weaponisation of cultural heritage in Ukraine. By employing data analysis techniques, this project seeks to document and provide insights into these actions, contributing to the preservation of cultural heritage and supporting international awareness and policy-making.
Objective and (Possible) Main Steps:
One can choose to analyse Wikipedia to:
- Track Editing Histories: The revision history of contentious articles may reveal politically or ideologically motivated edits (e.g., articles about Ukrainian cities or cultural artifacts)
- Automated Page Tracking: Use tools like Wikimedia’s API to monitor changes in articles about cultural heritage in real-time.
- Cross-Check Narratives: Compare Wikipedia content with scholarly sources and publications from multiple perspectives.
- Investigate Talk Pages: The discussions on an article’s talk page often reveal disputes and biases.
- Web Scraping for Content and Metadata: Use scraping libraries like BeautifulSoup or scrapy to collect article text, editor information, and metadata not available through the API.
- Revision Analysis: Compare successive revisions of articles using diff algorithms (e.g., difflib) to detect content additions, deletions, or modifications, and highlight changes in sentiment, bias, or framing (see the sketch after this list).
- Semantic Page Selection: Employ embedding models like Alibaba-NLP/gte-multilingual to identify articles with semantic relevance to “cultural heritage” or “cultural manipulation.”
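A minimal sketch of the revision-tracking and diffing idea referenced above, using only the public MediaWiki API and Python's difflib; the article title is just an example.

```python
import difflib
import requests

API = "https://en.wikipedia.org/w/api.php"

def last_two_revisions(title: str) -> list[str]:
    """Fetch the wikitext of the two most recent revisions of an article (newest first)."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content|timestamp",
        "rvslots": "main",
        "rvlimit": 2,
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(API, params=params,
                        headers={"User-Agent": "heritage-monitoring-sketch/0.1"}, timeout=30)
    resp.raise_for_status()
    revisions = resp.json()["query"]["pages"][0]["revisions"]
    return [rev["slots"]["main"]["content"] for rev in revisions]

def revision_diff(title: str) -> str:
    """Unified diff between the previous and the latest revision of the article."""
    newer, older = last_two_revisions(title)
    return "\n".join(difflib.unified_diff(older.splitlines(), newer.splitlines(), lineterm="", n=1))

if __name__ == "__main__":
    print(revision_diff("Saint Sophia Cathedral, Kyiv"))  # example article title
```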
Requirements: Excellent Python knowledge, experience with web scraping, and familiarity with large language models.
Significance:
This project will contribute to understanding cultural heritage manipulation in Ukraine, providing valuable insights and digital resources for future academic and cultural research. It will also contribute to the creation of strategies to protect cultural heritage in conflict zones. The methodologies developed in this project can also be applied to other conflict areas, enhancing global efforts to safeguard cultural heritage.
Type: MSc (12 ECTS) Semester project
Sections: Digital Humanities, Data Science, Computer Science
Supervisors: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2
Context
This project aims to explore, retrieve, and analyse data from the relevant resources, focusing on a select collection of books related to epigraphy and cultural heritage. The primary objective is to gain insights into Armenian epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.
Objective
This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Objectives and Main steps
- Data Retrieval: Collect and aggregate data from academic resources, specifically targeting books and resources about epigraphy and cultural heritage.
- Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
- Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
- Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts (see the sketch after this list). This will help us understand the predominant themes and patterns in Armenian epigraphy and cultural heritage.
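As a deliberately simple illustration of the term-extraction step above, the following TF-IDF sketch with scikit-learn surfaces candidate key terms; the two text snippets are invented placeholders for cleaned book passages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder mini-corpus; in practice each string would be the cleaned text of one book section.
documents = [
    "Inscription on the southern wall of the monastery, carved in 1286 by the master builder.",
    "The khachkar bears a dedicatory inscription commemorating the donor and his family.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000, stop_words="english")
tfidf = vectorizer.fit_transform(documents)
terms = vectorizer.get_feature_names_out()

# Top-weighted terms per document as a first, crude approximation of "key terms".
for doc_id, row in enumerate(tfidf.toarray()):
    top = [terms[i] for i in row.argsort()[::-1][:5] if row[i] > 0]
    print(doc_id, top)
```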
Requirements
- Proficiency in Python, knowledge of NLP techniques.
Significance
This project will contribute to the understanding of Armenia’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.
Type: MSc (12 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1
Context: The field of AI ethics has become increasingly relevant as language models have proliferated into the public sphere.
Objective: Find novel ways of quantifying normative ethics and persistence of ethical frameworks across various scenarios.
Taken
Type: MSc (12 ECTS), BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more, together or separately)
Type: MSc (12 ECTS), BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more, together or separately)
Type: MSc (12 ECTS), BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more, together or separately)
Type: MSc (12 ECTS), BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Paul Guhennec
Number of students: 1–2 (or more, together or separately)
Fall 2024
Taken
Type: MA (12 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Tristan Karch
Number of students: 1
Context:
The study of urban history is a complex and multidimensional field that involves analyzing various types of historical data, including cadastre (land registry) records. Traditionally, this process has been manual and time-consuming. However, with the advent of Large Language Models (LLMs) and their ability to process and analyze vast amounts of data, there is an opportunity to automate and enhance historical discoveries. Identifying divergences between present and past data is a critical starting point for many historical investigations, as it allows researchers to uncover patterns, transformations, and anomalies in the urban landscape over time.
Cadastre Data:
Cadastre data typically includes detailed information about land ownership, property boundaries, land use, and the value of properties. This data is crucial for understanding the historical layout and development of urban areas. Importantly, all data points in cadastre records are geolocalized, which facilitates direct comparison with today’s data from sources like OpenStreetMap.
Objective:
The primary objective of this project is to develop an automated system that leverages LLM agents to compare historical cadastre data with present-day data. The LLM agent would rely on a coding assistant, as in [1], to convert hypotheses expressed in natural language into Python programs that operate efficiently on tabular data.
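To make the intended hypothesis-to-code loop concrete, here is a hedged sketch; it assumes the openai Python client (>= 1.0) with an API key in the environment, a placeholder model name, and an invented miniature cadastre table, and a real system would sandbox the generated code rather than executing it directly.

```python
import pandas as pd
from openai import OpenAI  # assumes openai>=1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

def answer_hypothesis(df: pd.DataFrame, hypothesis: str) -> str:
    """Ask the model for pandas code testing the hypothesis, run it, return the result."""
    prompt = (
        "You are given a pandas DataFrame named `df` with columns "
        f"{list(df.columns)}. Write Python code that tests the hypothesis below and stores "
        "a short textual answer in a variable named `result`. Return only code.\n\n"
        f"Hypothesis: {hypothesis}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    code = completion.choices[0].message.content.strip().strip("`")
    scope = {"df": df, "pd": pd}
    exec(code, scope)  # a real agent would sandbox and validate this step
    return str(scope.get("result", "no result produced"))

# Invented cadastre extract: parcel id, 1808 land use, present-day land use (e.g. from OpenStreetMap).
df = pd.DataFrame({
    "parcel": ["A1", "A2", "A3"],
    "landuse_1808": ["vineyard", "garden", "vineyard"],
    "landuse_today": ["residential", "garden", "residential"],
})
print(answer_hypothesis(df, "Most parcels that were vineyards in 1808 are built up today."))
```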
Main Steps:
- Data Collection: Gather historical cadastre data and current urban data from various sources.
- Preprocessing: Clean and preprocess the collected data to ensure compatibility and accuracy.
- LLM Integration: Integrate LLM agents to analyze and compare the historical and contemporary datasets.
- Analysis: Conduct a detailed analysis to identify significant changes and patterns in the urban landscape.
Additional Comparisons:
In addition to comparing cadastre data with present-day geolocalized data from OpenStreetMap, other comparisons can be envisioned. For instance, leveraging genealogical databases or other open registers from today can provide further insights into the socio-economic transformations and population dynamics over time.
References:
[1] Majumder, Bodhisattwa Prasad, Harshit Surana, Dhruv Agarwal, Sanchaita Hazra, Ashish Sabharwal, and Peter Clark. “Data-Driven Discovery with Large Generative Models.” arXiv, February 21, 2024. http://arxiv.org/abs/2402.13610.
Requirements:
Proficiency in data science and machine learning. Familiarity with LLMs and natural language processing. Experience with Langchain, Huggingface or Autogen is a plus.
Type: Master or Bachelor research project (12/8 ECTS)
Sections: Data Science
Supervisors: Pauline Conti, Maud Ehrmann
Number of students: 1
Ideal as an optional semester project for a data science student.
Context: The Impresso project semantically enriches large collections of radio and newspaper archives by applying image and text processing techniques. A complex pipeline of data preparation and processing steps is applied to millions of content elements, creating and manipulating millions of data points.
Objective: The aim of this project is to implement a data visualisation dashboard to enable monitoring and quality control of the different data and their processing steps. Based on different sources of information, i.e. data processing manifests, inventories and statistics, the dashboard should provide an overview of what data is at what stage of the pipeline, allow a comparative view of different processing stages and support general understanding.
The solution adopted should ideally be modular and lightweight, and will ultimately be deployed online to allow everyone from the project (and perhaps more) to follow the data processing pipeline.
Steps:
- Understanding of the different Impresso processes and data, and the associated visualisation needs
- Detailed review of existing open-source dashboard data visualisation tools
- Implementation of tools, customisation to meet needs, and visualisation proposals based on identified opportunities (see the sketch below)
- Test/revision loop
- Online deployment
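As one illustration of a lightweight, modular solution (see the implementation step above), a minimal Streamlit sketch is shown below; the manifest file name and its columns are assumptions about a yet-to-be-defined data format. It would be launched with `streamlit run dashboard.py`.

```python
# dashboard.py – minimal monitoring view over a hypothetical processing manifest.
import json
import pandas as pd
import streamlit as st

st.title("Impresso processing overview (sketch)")

# Assumed manifest format: one record per newspaper title and processing stage.
with open("manifest.json") as f:                     # placeholder file name
    df = pd.DataFrame(json.load(f))                  # e.g. columns: title, stage, n_items

stage = st.selectbox("Processing stage", sorted(df["stage"].unique()))
subset = df[df["stage"] == stage]

st.metric("Content items at this stage", int(subset["n_items"].sum()))
st.bar_chart(subset.set_index("title")["n_items"])   # per-title coverage for the selected stage
```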
Requirements:
Background in data science and data visualisation, basics of software engineering, good knowledge of Python, interest in data management.
Organisation of work
- Weekly meeting with supervisor(s)
- The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
- The student is advised to keep a logbook regularly and to note updates on progress, potential questions or problems in the logbook before the weekly meeting (at least 4 hours before).
- A slack channel is used for communication outside the weekly meeting.
- The student is advised to start his/her project report between 3 and 2 weeks before the end of the project. Report on overleaf using the EPFL template.
Type: Master thesis project (30 ECTS)
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1
Context: The Impresso project comprises a dataset of 130 digitised historical newspapers, totalling approximately 7.5 million pages. About half of these newspapers were digitised with only optical character recognition (OCR), while the rest also underwent optical layout recognition (OLR), separating text zones (lines, paragraphs, margins) and organising and labelling them into logical units corresponding to the various areas of the page (articles, headlines, section heads, tables, footnotes, etc).
For newspapers lacking OLR, the text from different content units is not differentiated, which negatively impacts the performance of NLP tools (and often compromises their relevance when applied to mixed contents). Identifying the bounding regions of the various content areas on newspaper pages could help us disentangle their respective texts and allow for separate processing.
The objective of the project is to investigate the ability of Large Vision Models (LVMs) to interpret the physical layout of digitised print documents, in this case historical newspaper facsimiles. Specifically, the project aims to test and evaluate different models at segmenting and labelling logical units on pages, such as text paragraphs, titles, subtitles, tables and images (either semantic or instance segmentation). The project will benefit from existing OLRed data from the Impresso corpus, which could be sampled to create a training set. The project will address, among others, the following research questions:
- Can LVMs (multimodal, vision-only) accurately recognise the layout of historical newspaper pages, and which approach is best suited?
- Can the identified approach be generalised to a large-scale dataset spanning around 300 years with significant variation in layout?
Main Steps:
- Familiarise with the Impresso project and data to understand the specific needs
- Review literature on document layout recognition and instance segmentation, with the goal of identifying the most promising approaches and recent models.
- Programmatically create a dataset based on existing OLR data, showcasing layout variety across newspaper titles and over time.
- Explore, apply, and evaluate selected multimodal and/or large vision model(s) (a baseline sketch follows after this list)
- Depending on results and progress, potentially explore post-processing to order or group regions corresponding to the same articles or contents.
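For orientation, a baseline run with an off-the-shelf model from the LayoutParser toolkit (cited in the references below) might look like the sketch that follows; the PubLayNet model is generic rather than newspaper-specific and would almost certainly need fine-tuning on the Impresso OLR data, and the image path is a placeholder.

```python
import cv2
import layoutparser as lp  # assumes layoutparser with the Detectron2 backend installed

# Generic PubLayNet model as a starting point; a newspaper-specific model is the project goal.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.5],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)

image = cv2.imread("newspaper_page.jpg")             # placeholder path to a page facsimile
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

layout = model.detect(image)
for block in layout:                                 # each block has a label, a score and a box
    print(block.type, round(block.score, 2), block.coordinates)
```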
Requirements: Knowledge of machine learning (ML) and deep learning (DL), familiarity with computer vision, proficiency in Python, experience with a DL framework (preferably Pytorch), interest in historical data.
References:
- Galal M. Binmakhashen and Sabri A. Mahmoud. 2019. Document Layout Analysis: A Comprehensive Survey. ACM Comput. Surv. 52, 6 (October 2019), 109:1-109:36.
- Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, and Weining Li. 2021. LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. In Document Analysis and Recognition – ICDAR 2021, 2021. Springer International Publishing, Cham, 131–146; Links: Layout Parser, Model Zoo — Layout Parser 0.3.2 documentation
- Christoph Auer, Ahmed Nassar, Maksym Lysak, Michele Dolfi, Nikolaos Livathinos, and Peter Staar. 2023. ICDAR 2023 Competition on Robust Layout Segmentation in Corporate Documents. In Document Analysis and Recognition – ICDAR 2023, 2023. Springer Nature Switzerland, Cham, 471–482.
- Florian Bordes et al. 2024. An Introduction to Vision-Language Modeling.
- Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In Proceedings of the 28th International Conference on Computational Linguistics, December 2020.
- Timo Lüddecke and Alexander Ecker. 2022. Image Segmentation Using Text and Image Prompts. 2022. 7086–7096. See also: timojl/clipseg.
- Mikaela Funkquist. 2022. Layout Analysis on modern Newspapers using the Object Detection model Faster R-CNN.
And also:
- Bowen Cheng, Alex Schwing, and Alexander Kirillov. 2021. Per-Pixel Classification is Not All You Need for Semantic Segmentation. In Advances in Neural Information Processing Systems, 2021. Curran Associates, Inc.
- Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. 2023. Segment Anything. 2023.
- Boyi Li, Kilian Q. Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. 2022. Language-driven Semantic Segmentation.
- Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, and Ping Luo. 2021. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Advances in Neural Information Processing Systems, 2021. Curran Associates, Inc., 12077–12090.
Type: MA (12 ECTS) or BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Tristan Karch
Number of students: 1–2
Context:
Large Language Models (LLMs) have revolutionised how we interact with and extract information from vast textual datasets. These models have been trained on extensive corpora, incorporating a broad spectrum of knowledge. However, a significant challenge arises in determining whether a given piece of text presents new information or reflects content that the model has already encountered during its training.
Evaluating the novelty of texts is crucial for applications in historical research. As the volume of historical literature grows, particularly with the digitization of vast archives, historians face the challenge of navigating and synthesising information from these extensive datasets.
Competent LLMs, adept at recognizing and integrating new knowledge, can significantly enhance this process. They would allow historians to uncover previously unseen patterns, connections, and insights, leading to groundbreaking historical discoveries and more robust applications in the digital humanities.
Objective:
This project aims to develop algorithms that can effectively evaluate whether textual sources represent new pieces of knowledge that were never distilled in open-source LLMs during pre-training.
Research Questions:
- What algorithms can be developed to assess the novelty of texts with respect to LLM training data?
- How can these algorithms be combined with standard retrieval approaches [1] to improve them in the domain of interest?
Main Steps:
- Literature Review: Conduct a comprehensive review of existing approaches to novelty detection [2,3,4], hallucination detection [5] and knowledge evaluation [6] in the context of LLMs.
- Problem Definition: Formally define knowledge in the context of textual data: information (content) vs novel pattern of language (form)
- Data Preparation:
- Novel data selection (Sources curated by EPFL – Secondary sources about Venice, EPFL thesis or new data)
- Standard statistical analysis of data (unsupervised NLP techniques)
- Algorithm Development: Develop algorithms for novelty detection that may include:
- Statistical comparison methods.
- Embedding-based similarity measures (see the sketch after this list).
- Anomaly detection techniques.
- Token / phrase-level / chunk analysis.
- Testing and Evaluation: Test the developed algorithms using various datasets to assess their accuracy and effectiveness in identifying novel texts. This includes:
- Benchmarking against known LLM training data.
- Evaluating performance across different genres and languages.
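As a minimal illustration of the embedding-based similarity direction listed above, the sketch below scores a candidate passage against a small reference set with sentence-transformers; the passages, the model choice, and the reading of the score as a "novelty" signal are placeholders for a proper experimental design.

```python
from sentence_transformers import SentenceTransformer

# Placeholder reference set standing in for text the LLM has likely seen during pre-training.
reference_passages = [
    "The Republic of Venice was a sovereign state in parts of present-day Italy.",
    "Napoleon abdicated in 1814 and was exiled to the island of Elba.",
]
candidate = "An unpublished 1756 notarial act records the sale of a workshop near the Arsenal."

model = SentenceTransformer("all-MiniLM-L6-v2")      # small general-purpose model, placeholder choice
ref_emb = model.encode(reference_passages, normalize_embeddings=True)
cand_emb = model.encode([candidate], normalize_embeddings=True)[0]

# With normalised vectors, the dot product equals cosine similarity.
novelty_score = 1.0 - float((ref_emb @ cand_emb).max())
print(f"novelty score = {novelty_score:.3f}")        # higher = less similar to the reference set
```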
References
[1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks – NeurIPS 2020.
[2] Shi, Weijia, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. “Detecting Pretraining Data from Large Language Models.” arXiv, March 9, 2024. http://arxiv.org/abs/2310.16789.
[3] Golchin, Shahriar, and Mihai Surdeanu. “Data Contamination Quiz: A Tool to Detect and Estimate Contamination in Large Language Models.” arXiv, February 10, 2024. http://arxiv.org/abs/2311.06233.
[4] Hartmann, Valentin, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. “SoK: Memorization in General-Purpose Large Language Models.” arXiv, October 24, 2023. https://doi.org/10.48550/arXiv.2310.18362.
[5] Farquhar, Sebastian, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. “Detecting Hallucinations in Large Language Models Using Semantic Entropy.” Nature 630, no. 8017 (June 2024): 625–30. https://doi.org/10.1038/s41586-024-07421-0.
[6] Wang, Cunxiang, Sirui Cheng, Qipeng Guo, Yuanhao Yue, Bowen Ding, Zhikun Xu, Yidong Wang, Xiangkun Hu, Zheng Zhang, and Yue Zhang. “Evaluating Open-QA Evaluation.” arXiv, October 23, 2023. https://doi.org/10.48550/arXiv.2305.12421.
Requirements: Good programming skills, a strong interest for LLM research, experience with LLMs is a plus.
Type: MA Research project or MA thesis
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Pauline Conti, Emanuela Boros, Maud Ehrmann
Number of students: 1
The Impresso project applies natural language and computer vision processing techniques to enrich large collections of radio and newspaper archives and develops new ways for historians to explore and use them. Exploring, analysing and interpreting historical media sources and their enrichment is only possible with contextual information about the sources themselves (e.g. what is the political orientation of a newspaper), and the processes applied to them (what is the accuracy of the tools that produced this or that enrichment).
Information on newspapers, or metadata, already exists but can always be supplemented. In this respect, the “Swiss Press Bibliography”, published by Fritz Blaser in 1956, is a treasure trove of information on the origins and history of Swiss newspapers. This bibliography documents 483 newspapers and around a thousand periodicals published in Switzerland between 1803 and 1958, describing each in great detail according to a given template – a database on paper.
The objective of the project is to extract the semi-structured information from Blaser’s newspaper bibliography (PDF files) and build a lightweight database (possibly in JSON only, or graph DB).
The extracted information will be used to
- document the newspapers present in the Impresso web application;
- support the study of the newspaper ecosystem in Switzerland at that time, e.g. by studying clusters of publications by political orientation over time, tracking publishers or editors, etc.
Steps
- Review tools that can be used to correct/redo the OCR of PDF files and select one;
- Define a data model based on the information contained in the bibliography;
- Extract and systematically store the information (a minimal parsing sketch follows after this list);
- Devise a way of assessing the quality of the extraction process.
- If time permits, carry out an initial analysis of the database created.
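To illustrate the extraction-and-storage step (see the forward reference above), here is a minimal parsing sketch; the entry layout, the sample text, and the regular expression are hypothetical and would have to be derived from Blaser's actual template once the OCR is corrected.

```python
import json
import re

# Invented example of what one OCRed Blaser entry might look like.
sample_entry = """\
123. Gazette de Lausanne.
Lausanne, 1804-1958. Paraît quotidiennement.
Éditeur: Imprimerie Corbaz.
"""

# Hypothetical pattern: entry number, title, place of publication, years of publication.
ENTRY_PATTERN = re.compile(
    r"(?P<number>\d+)\.\s+(?P<title>.+?)\.\s*\n"
    r"(?P<place>[^,]+),\s*(?P<years>[\d\-–]+)\."
)

def parse_entry(text: str) -> dict:
    match = ENTRY_PATTERN.search(text)
    if not match:
        return {"raw": text}              # keep unparsed entries for manual inspection
    return match.groupdict() | {"raw": text}

print(json.dumps(parse_entry(sample_entry), ensure_ascii=False, indent=2))
```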
If taken as a Master project:
- Additional steps:
- Perform named entity recognition and linking on the information present in some descriptive fields
- Conduct a first analysis of the database, e.g.
- Map printing locations of newspapers in Switzerland, and their evolution through time
- Create a network of main actors (editors, publishers)
- …and more, this is a very rich source.
- Similar sources at the European level could be integrated.
This project will be done in collaboration with researchers from the History Department of UNIL, members of the Impresso project.
Requirements
Good knowledge of Python, basics of software and data engineering, interest in historical data. Medium to good knowledge of French or German is required.
Organisation of work
- Weekly meeting with supervisor(s)
- The student is asked to submit a detailed project plan (envisaged steps, milestones) by the end of week 2.
- The student is advised to document his/her work in a logbook regularly. Updates on progress, potential questions or problems will be listed in the logbook before the weekly meeting (at least 4 hours before).
- A Slack channel is used for communication outside the weekly meeting.
- The student is advised to start his/her project report between 3 and 2 weeks before the end of the project. Report on overleaf using the EPFL template.
Type: MSc (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Isabella di Lenardo, Raimund Schnürer
Number of students: 1–2
Context:
In recent years, various features have been extracted from historical maps thanks to advancements in machine learning. While the content of maps is relatively well studied, elements around the maps still deserve some attention. These elements are used, amongst others, for decoration (e.g. ornamentation, cartouches), orientation (e.g. scale bar, wind rose, north arrow), illustration (e.g. heraldic, figures, landscape scenes), and description (e.g. title, explanations, legend). The analysis of the style and arrangement of these elements will give valuable hints about the cartographer’s background.
Objective:
In this project, map layout elements shall be analysed in depth using a given dataset of 400,000 historical maps.
Main Steps:
- Review literature about extracting map layout elements
- Detect map layout elements in historical maps using artificial neural networks (e.g. segmentation)
- Find similar elements between maps (e.g. by t-SNE; see the sketch after this list)
- Identify clusters among authors, between different regions and time periods
- Visualize these connections
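As a small sketch of the similarity step referenced above, the code below embeds already-cropped layout elements with a pretrained CNN and projects them to 2D with t-SNE; the file paths are placeholders, and a real run would use thousands of crops produced by the detection step.

```python
import torch
from torchvision import models, transforms
from sklearn.manifold import TSNE
from PIL import Image

# Placeholder paths to crops of detected layout elements (cartouches, wind roses, scale bars, ...).
crop_paths = ["cartouche_01.png", "cartouche_02.png", "windrose_01.png"]
crops = [Image.open(p).convert("RGB") for p in crop_paths]

preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # use pooled visual features instead of class logits
backbone.eval()

with torch.no_grad():
    features = torch.stack([backbone(preprocess(c).unsqueeze(0)).squeeze(0) for c in crops])

# 2-D embedding of the visual features; similar elements should land close together.
coords = TSNE(n_components=2, perplexity=min(30, len(crops) - 1)).fit_transform(features.numpy())
print(coords)
```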
Research Questions:
- How accurately can the elements be detected on historic maps?
- Which visual properties are suited to find similarities between the elements?
- Which connections exist between different maps?
References:
- Petitpierre et al. 2024: A fragment-based approach for computing the long-term visual evolution of historical maps. https://doi.org/10.1057/s41599-024-02840-w
- Schnürer et al. 2021: Detection of Pictorial Map Objects with Convolutional Neural Networks. https://doi.org/10.1080/00087041.2020.1738112
Requirements:
Good programming skills, familiarity with machine learning, interest in historical maps
Type: MSc research project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1
Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)
Context
Historical collections present multiple challenges that stem from the quality of digitisation, the need to handle documents deteriorated by the effects of time, poor-quality printing materials, or inaccurate scanning processes such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose a further challenge because documents are distributed over a long enough period of time to be affected by language change. This is especially true in the case of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have taken over with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models have witnessed substantial growth, such as ChatGPT and GPT-4. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner for improving the application of LLMs to historical data analysis, in the context of the impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio project.
Research Questions
- Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
- Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?
Objective
To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.
Instruction/Prompt: When was the Red Cross founded?
Example Answer: 1864
Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”
Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
Main Steps
- Data Curation:
- Collect OCR-based datasets.
- Analyze historical newspaper articles to understand common features and challenges.
- Dataset Creation:
- Decide on what type of instructions should be generated and utilise existing models such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA (see the sketch after these steps).
- Model Training/Fine-Tuning:
- Train or fine-tune a language model like LLaMA on this dataset.
- Evaluation:
- Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
- Compare models trained on the new dataset with those trained on standard datasets.
- Employ metrics like accuracy, perplexity, F1 score.
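A hedged sketch of the instruction-generation step mentioned under Dataset Creation above, using an instruction-tuned seq2seq model from Hugging Face; the model choice and prompt are placeholders, and generated pairs would still need manual filtering, as in the ArchivalQA pipeline.

```python
from transformers import pipeline

# Placeholder model; any T5/BART-style generator could be swapped in here.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

passage = (
    "En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, "
    "succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde."
)
prompt = f"Generate a question that can be answered from the following passage:\n{passage}"
question = generator(prompt, max_new_tokens=40)[0]["generated_text"]

# One record of the instruction dataset: the generated question plus its source passage.
print({"instruction": question, "context": passage})
```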
Requirements
- Proficiency in Python, ideally PyTorch.
- Strong writing skills.
- Commitment to the project.
Output
- Potential publications in NLP and historical document processing.
- Contribution to advancements in handling historical texts with LLMs.
Deliverables
- A comprehensive dataset for training LLMs on historical texts.
- A report or paper detailing the methodology, findings, and implications of the project.
References
- impresso App
- Text-To-Text Transfer Transformer (T5)
- Denoising Autoencoder for Pretraining Sequence-to-sequence Models (BART)
- Multilingual BART (mBART)
- ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Historical News Collections
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Finetuning LLAMA on instruction-based datasets
- Finetuning LLaMA with Megatron
Deferred
Type: MA (12 ECTS), BA (8 ECTS) Research project
Sections: Computer Science, Data Science, Digital Humanities
Supervisor: Tristan Karch
Number of students: 1–2
Context:
Language shapes the way we take actions. We use it all the time, to plan our day, organize our work and most importantly, to think about the world. Through language, we not only communicate with others but also internally process and understand our experiences. This ability to think about the world through language is also what enables us to engage in counterfactual reasoning [1], imagining alternative scenarios and outcomes to refine and enhance our experience of the world.
There is a growing body of work that investigates the deep interactions between language and decision-making systems [2]. In this context, large language models (LLMs) are used to design autonomous agents [3,4] that achieve complex tasks involving different reasoning patterns in textual interactive environments [5,6]. Such agents are equipped with different mechanisms such as reflexive [7] and memory [8] modules to continually adapt to new sets of tasks and foster generalisation. These modules rely on efficient prompts that help agents combine their environmental trajectories with their foundational knowledge of the world to solve advanced tasks.
Objective:
The objective of this project is to design and evaluate counterfactual learning mechanisms in LLM agents evolving in textual environments.
Research Questions:
- Can we design reflexive mechanisms that autonomously generate counterfactuals from behavioral traces of agents evolving in textual environments?
- What is the effect of counterfactuals on exploration? Adaptation? Generalization?
Main Steps:
- Interdisciplinary literature review (LLM agents, language and reasoning, counterfactual reasoning);
- Get familiar with benchmarks (ScienceWorld, ALFWorld, others?);
- Re-implement baselines: Reflexion [7] and Clin [8];
- Design counterfactual generation;
- Derive metrics to analyze the impact of the proposed approach.
Requirements: Good programming skills, Experience working with RL and LLM is a plus.
References:
[1] The Functional Theory of Counterfactual Thinking – K. Epstude and N. Roese, Pers Soc Psychol Rev. 2008 May;12(2):168-92. doi: 10.1177/1088868308316091. PMID: 18453477; PMCID: PMC2408534.
[2] Language and Culture Internalisation for Human-Like Autotelic AI – Cédric Colas, Tristan Karch, Clément Moulin-Frier, Pierre-Yves Oudeyer. Nature Machine Intelligence.
[3] ReAct: Synergizing Reasoning and Acting in Language Models – Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao, https://arxiv.org/abs/2210.03629
[4] Language Models are Few-Shot Butlers, Vincent Micheli, Francois Fleuret, https://arxiv.org/abs/2104.07972
[5] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning – Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew Hausknecht, https://arxiv.org/abs/2010.03768
[6] ScienceWorld: Is your Agent Smarter than a 5th Grader? – Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, Prithviraj Ammanabrolu, https://arxiv.org/abs/2203.07540
[7] Reflexion: Language Agents with Verbal Reinforcement Learning – Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao, https://arxiv.org/abs/2303.11366
[8] CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization – Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Peter Jansen, Oyvind Tafjord, Niket Tandon, Li Zhang, Chris Callison-Burch, Peter Clark. https://arxiv.org/pdf/2310.10134
Spring 2024
Available
Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisors: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)
Context
The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijani authorities [2]. As part of a series of actions coordinated by EPFL, the Digital Humanities Institute is currently prototyping methods for the rapid deployment of DH technology in situations of crisis. As part of these actions, over 200 Armenian inscriptions of Artsakh are being systematised and digitised, together with essential information such as language data (including diplomatic and interpretive transcriptions), the translation into English, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitised data will help not only to preserve the invaluable inscriptions but can also be used for further investigations and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.
Figure: By Julian Nyča – Own work, CC BY-SA 3.0
Research questions
- How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
- What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
- How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
- What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
- How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?
Objective
This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Main steps
- Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
- Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
- 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
- Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
- Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
- Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localised inscription.
Explored methods
- Proportional analysis
- 3D modelling using Rhino
- 3D segmentation and annotation with the inscription
- Exploration of visualization methodologies for this additionally embedded information
Requirements
- Previous experience with architectural 3D modelling using Rhino.
[1] A toponym used by the local Armenians to refer to the Nagorno-Karabakh territory.
[2] The European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1
Context
- To investigate trajectory divergence in LLMs: We will study how minor variations in input (such as small changes in text) can lead to significantly different outputs, illustrating sensitivity to initial conditions.
- To identify attractors in LLMs: We will explore if there are recurring themes or patterns in the model’s outputs that act as attractors, regardless of varied inputs.
- To analyze chaotic sequences in model responses: By feeding a series of chaotic or nonlinear inputs, we aim to understand how the model’s responses demonstrate characteristics of chaotic systems.
- To utilize reinforcement learning in training LLMs: To observe how the introduction of reward-based training influences the development of these complex behaviors.
- Data Collection and Preparation: We will generate a diverse set of input data to feed into the LLM, ensuring a range that can test for trajectory divergence and chaotic behavior.
- Model Training: An LLM will be trained using reinforcement learning techniques to adapt its response strategy based on predefined reward systems.
- Experimentation: The trained model will be subjected to various tests, including slight input modifications and chaotic input sequences, to observe the outcomes and patterns.
- Analysis and Visualization: Data analysis tools will be used to interpret the results, and visualization techniques will be applied to illustrate the complex dynamics observed.
- A deeper understanding of how complex system theories apply to LLMs.
- Insights into the stability, variability, and predictability of LLMs.
- Identification of potential attractor themes or patterns in LLM outputs.
- A contribution to the broader discussion on AI behavior and its implications.
Requirements
Excellent technical skills, previous practical experience with LLMs and passion for the subject.
References
- Generative Models as a Complex Systems Science: How can we make sense of large language model behavior
- A Theory for Emergence of Complex Skills in Language Models
- Can Large Language Models Understand Real-World Instructions?
- Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
- A Watermark for Large Language Models
- Explaining the Complex Task Reasoning of Large Language Models with Template-Content Structure
- Are Emergent Abilities of Large Language Models a Mirage?
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1
Context
The project aims to develop a compiler for GPT models that transcends traditional prompt engineering, enabling the creation of more structured and complex written pieces. Drawing parallels from the early days of computer programming, where efficiency and compactness in commands were crucial due to memory and space constraints, this project seeks to elevate the way we interact with and utilize Large Language Models (LLMs) for text generation.
Objectives
- To Create a Compiler for Enhanced Text Generation: Develop a compiler that translates user intentions into complex, structured narratives, moving beyond simple prompt responses.
- To Establish ‘Libraries’ for Complex Writing Projects: Similar to programming libraries, these would contain comprehensive information about characters, settings, and narrative logic, which can be loaded at the start of a writing session.
- To Facilitate Hierarchical Abstraction in Writing: Implement a system that allows for the creation of high-level abstractions in storytelling, akin to programming.
- To Enable Specialization in Narrative Elements: Support the development of specialized modules for characters, settings, narrative logic, and stylistic effects.
Methodology
- Compiler Design: Designing a compiler capable of interpreting and translating complex narrative instructions into executable text generation tasks for LLMs like GPT.
- Library Development: Creating a framework for users to build and store detailed narrative elements (characters, settings, etc.) that can be referenced by the compiler.
- Abstraction Layers Implementation: Developing a system to manage and utilize different levels of narrative abstraction.
- Integration with Various LLMs: Ensuring the compiler is adaptable to different LLMs, including OpenAI, Google, or open-source models.
- Testing and Iteration: Conducting extensive testing to refine the compiler and its ability to handle complex writing tasks.
Expected Outcomes
- A tool that allows for the creation of detailed and structured written works using LLMs.
- A new approach to text generation that mirrors the evolution and specialization seen in computer programming.
- Contributions to the field of AI-driven creative writing, enabling more complex and nuanced storytelling.
guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1
Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)
Context
Historical collections present multiple challenges that stem from the quality of digitisation, the need to handle documents deteriorated by the effects of time, poor-quality printing materials, or inaccurate scanning processes such as optical character recognition (OCR) or optical layout recognition (OLR). Moreover, historical collections pose a further challenge because documents are distributed over a long enough period of time to be affected by language change. This is especially true in the case of Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have taken over with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, subsequent models have witnessed substantial growth, such as ChatGPT and GPT-4. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail at understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner for improving the application of LLMs to historical data analysis, in the context of the impresso – Media Monitoring of the Past II. Beyond Borders: Connecting Historical Newspapers and Radio project.
Research Questions
- Is it feasible to create a dataset for training LLMs to better comprehend historical documents, using semi-automatic or automatic methods?
- Can a specialized, resource-efficient LLM effectively process and understand noisy, historical digitized documents?
Objective
To develop an instruction-based dataset to improve LLMs’ capabilities in interpreting historical documents. The focus will be on sourcing and analyzing historical Swiss and Luxembourgish newspapers (spanning 200 years) and other historical collections in ancient Greek or Latin.
Instruction/Prompt: When was the Red Cross founded?
Example Answer: 1864
Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”
Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
Main Steps
- Data Curation:
- Collect OCR-based datasets.
- Analyze historical newspaper articles to understand common features and challenges.
- Dataset Creation:
- Decide on what type of instructions should be generated and utilise existing models such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents, similar to the method presented in ArchivalQA.
- Model Training/Fine-Tuning:
- Train or fine-tune a language model like LLaMA on this dataset.
- Evaluation:
- Assess LLMs’ performance on NLP tasks (NER, EL) using historical documents.
- Compare models trained on the new dataset with those trained on standard datasets.
- Employ metrics like accuracy, perplexity, F1 score.
Requirements
- Proficiency in Python, ideally PyTorch.
- Strong writing skills.
- Commitment to the project.
Output
- Potential publications in NLP and historical document processing.
- Contribution to advancements in handling historical texts with LLMs.
Deliverables
- A comprehensive dataset for training LLMs on historical texts.
- A report or paper detailing the methodology, findings, and implications of the project.
References
- impresso App
- Text-To-Text Transfer Transformer (T5)
- Denoising Autoencoder for Pretraining Sequence-to-sequence Models (BART)
- Multilingual BART (mBART)
- ArchivalQA: A Large-scale Benchmark Dataset for Open Domain Question Answering over Historical News Collections
- Llama 2: Open Foundation and Fine-Tuned Chat Models
- Finetuning LLAMA on instruction-based datasets
- Finetuning LLaMA with Megatron
Taken
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Frédéric Kaplan, Emanuela Boros, Maud Ehrmann
Number of students: 1
Context
This project aims to design and develop an innovative interface for iterative text composition, leveraging the capabilities of Large Language Models (LLMs) like GPT. The interface will enable users to collaboratively compose texts with the LLM, providing control and flexibility in the creative process.
Objectives
- To create a user-friendly interface for text composition: The interface should allow users to input, modify, and refine text generated by the LLM.
- To enable iterative interaction: Users should be able to interact iteratively with the LLM, adjusting and fine-tuning the generated text according to their needs and preferences.
- To incorporate customization options: The system should offer options to tailor the style, tone, and thematic elements of the generated text.
Methodology
- Interface Design: Designing a user-friendly interface that allows for easy input and manipulation of text generated by the LLM.
- LLM Integration: Integrating a LLM into the interface to generate text based on user inputs and interactions.
- Customization and Control Features: Implementing features that allow users to customize the style and tone of the text and maintain control over the content.
- User Testing and Feedback: Conducting user testing sessions to gather feedback and refine the interface and its functionalities.
Expected Outcomes
- A functional interface that allows for collaborative text composition with a LLM.
- Enhanced user experience in text creation, providing a blend of AI-generated content and human creativity.
- Insights into how users interact with AI in creative processes.
guidance is a programming paradigm that offers superior control and efficiency compared to conventional prompting and chaining.
Type: BA (8 ECTS) Semester project, MSc (12 ECTS)
Sections: Digital Humanities, Data Science, Computer Science
Supervisors: Hamest Tamrazyan, Emanuela Boros
Number of students: 1–2
Context
This project aims to explore, retrieve, and analyse data from the Digital Laboratory of Ukraine, focusing on a select collection of approximately ten books related to epigraphy and cultural heritage. The primary objective is to gain insights into Ukraine’s epigraphic and cultural heritage through detailed data analysis, term extraction, and database management.
Objective
This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Objectives and Main steps
- Data Retrieval: Collect and aggregate data from the Digital Laboratory of Ukraine, specifically targeting books and resources about epigraphy and cultural heritage.
- Data Cleaning and Formatting: Implement data preprocessing techniques to ensure data quality. This includes removing irrelevant or corrupt data, handling missing values, and standardizing formats.
- Database Setup: Design and implement a database to store and manage retrieved data efficiently. The database should allow easy access and manipulation of the data for analysis.
- Term Extraction and Analysis: Employ natural language processing (NLP) techniques to extract key terms, concepts, and thematic elements from the texts. This will help us understand the predominant themes and patterns in Ukrainian epigraphy and cultural heritage.
Requirements
- Proficiency in Python, knowledge of NLP techniques.
Significance
This project will contribute to the understanding of Ukraine’s rich cultural heritage. It will provide valuable digital resources for future academic and cultural research in this field.
Type: MSc Semester project (12 ECTS)
Sections: Data Science, Computer Science, Digital Humanities
Supervisor: Alexander Rusnak
Number of students: 1-3
Figure: Illustration of the SALMON training pipeline.
Context
All language models are taught some form of ethical system, whether implicitly through the curation of the dataset or by utilising some form of explicit training and prompting scheme. One type of ethical guidance framework is the Constitutional AI system proposed by the Anthropic team; this approach is predicated on prompting a language model to revise its own responses relative to a set of values, with the revisions then used to retrain the language model through supervised finetuning and reinforcement learning tutored by a preference model, as per the standard RLHF setup. This approach has shown very strong results in improving both the ‘harmlessness’ (i.e. ethical behaviour) and ‘helpfulness’ of language models. However, the deontological ethical system they utilised has some key drawbacks.
Objective
This project will attempt to encode a virtue ethics framework into the model, both in the selection of the values by which the responses are revised and in the architectural structure itself. Virtue ethics focuses on three types of evaluation: the ethicality of the action itself, the motivation behind the action, and the utility of the action in promoting a virtuous character in the agent. To this end, the student will implement a separate preference model for each of these three avenues of moral evaluation, which will then be used for RL training of an LLM assistant. This should result in a model that improves not only in harmlessness and helpfulness, but also in explainability.
Main Steps
- Curate a dataset of adversarial prompts and ethics-oriented prompts to be used for training.
- Implement a reinforcement learning from AI feedback training structure following from Anthropic’s Claude or IBM’s SALMON.
- Create a custom prompting pipeline for the virtuous action preference model, motivational explanation preference model, and virtue formation preference model.
- Train the chatbot using each of the preference models separately and finally combined, and measure their comparative performance on difficult ethical questions.
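To illustrate how the three preference models could be combined into a single reward signal for RL training, here is a minimal sketch; the scoring heads, embedding size and weights are illustrative assumptions, not the actual Constitutional AI or SALMON implementation:

```python
import torch
import torch.nn as nn

# Minimal sketch, assuming each preference model is a scalar scorer over a shared
# sentence embedding (in practice, three separately fine-tuned transformer reward models).
class PreferenceHead(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(embedding).squeeze(-1)

action_pm = PreferenceHead()      # ethicality of the action itself
motivation_pm = PreferenceHead()  # motivation behind the action
virtue_pm = PreferenceHead()      # contribution to a virtuous character

def combined_reward(embedding: torch.Tensor, weights=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum of the three virtue-ethics evaluations, used as the RL reward."""
    w_a, w_m, w_v = weights
    return (w_a * action_pm(embedding)
            + w_m * motivation_pm(embedding)
            + w_v * virtue_pm(embedding))

# Example: score a batch of 4 candidate responses represented by 768-d embeddings.
rewards = combined_reward(torch.randn(4, 768))
print(rewards.shape)
```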
Requirements
- Knowledge of machine learning and deep learning principles, familiarity with language models, proficiency in Python and PyTorch, and an interest in ethics and philosophy.
Fall 2023
Available
Type: MSc Semester project
Sections: Architecture, Digital Humanities, Data Science, Computer Science
Supervisors: Beatrice Vaienti, Hamest Tamrazyan
Keywords: 3D modelling
Number of students: 1–2 (min. 12 ECTS in total)
Context: The Armenian cultural heritage in Artsakh [1] is in danger after explicit threats of irreversible destruction coordinated by Azerbaijani authorities [2]. As part of a series of actions coordinated by the EPFL, the Digital Humanities Institute is currently prototyping methods to offer rapid deployment of DH technology in situations of crisis. Within this effort, over 200 Armenian inscriptions of Artsakh have been systematised and digitised, with essential information such as language data (including diplomatic and interpretive transcriptions), the translation into English, the location of the inscription on the monument (if applicable), geographical and chronological data, the type of monument, and the type of inscription. This digitised data will help not only to preserve the invaluable inscriptions but can also be used for further investigation and research. The aim of this project is to create an accurate 3D model of the church, precisely locate the inscriptions within it, and contribute to preserving, studying, and promoting Armenian cultural heritage in Nagorno-Karabakh.
Figure: By Julian Nyča – Own work, CC BY-SA 3.0
Research questions:
- How can advanced imaging and 3D modelling technologies be utilized to accurately capture and represent the intricate details of the church and the inscriptions?
- What methods can be employed to ensure the precise alignment and placement of the digitized inscriptions within the 3D model of the church?
- How can the digital representation of the church and its inscriptions be effectively integrated with the database of Armenian inscriptions in Nagorno-Karabakh?
- What insights can be gained from analysing the spatial distribution and arrangement of the inscriptions within the church, shedding light on the historical and cultural context in which they were created?
- How can the integration of the 3D model and the digitized inscriptions contribute to the preservation, documentation, and study of Armenian epigraphic heritage in Nagorno-Karabakh?
Objectives: This project aims to deepen the knowledge of the architectural and epigraphic significance of the church, explore innovative techniques for digitizing and visualizing cultural heritage, and contribute to the preservation and accessibility of Armenian inscriptions in Nagorno-Karabakh.
Main steps:
- Research and Planning: Conduct thorough research on the church and its inscriptions, architectural features, and existing documentation, and develop a detailed plan for creating a 3D model and locating the inscriptions within the model.
- Study of the plan, section, and elevation views from a survey: Proportional and compositional study aimed at the 3D reconstruction.
- 3D reconstruction: A low level of detail mesh already exists, but with this project, the student will try to transfer the information from the architectural survey to a refined architectural 3D model with interiors and exteriors.
- Data Processing and Digitization: Process the collected data to digitally represent the church and the inscriptions.
- Inscription Localization: Analyse the collected data and determine the precise location of each inscription within the 3D model of the church.
- Data Integration: Ensure that the essential information, such as language data, translations, geographical and chronological data, monument and inscription types, bibliographic references, and photographs, are correctly linked to the localized inscription.
Explored methods:
- Proportional analysis
- 3D modelling using Rhino
- 3D segmentation and annotation with the inscription
- Exploration of visualization methodologies for this additionally embedded information
Requirements: previous experience with architectural 3D modelling using Rhino.
[1] A toponym used by the local Armenians to refer to the Nagorno-Karabakh territory.
[2] The European Parliament resolution on the destruction of cultural heritage in Nagorno-Karabakh (2022/2582(RSP)), dated 09.03.2022.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1–2
Figure: Historical Instruction Mining for (finetuning) Large Language Models (Generated with Midjourney)
Context: Historical collections present multiple challenges that stem from the quality of digitisation, the need to handle documents deteriorated by the effects of time, poor-quality printing materials, or inaccurate recognition processes such as optical character recognition (OCR) and optical layout recognition (OLR). Moreover, historical collections pose another challenge: documents are distributed over a long enough period of time to be affected by language change. This is especially true for Western European languages, which only acquired their modern spelling standards roughly around the 18th or 19th centuries. At the same time, language models (LMs) have been leveraged for several years now, obtaining state-of-the-art performance in the majority of NLP tasks, generally by being fine-tuned on downstream tasks (such as entity recognition). LLMs, or instruction-following models, have taken over with relatively new capabilities for solving some of these tasks in a zero-shot manner through prompt engineering. For example, the generative pre-trained transformer (GPT) family of LLMs refers to a series of increasingly powerful and large-scale neural network architectures developed by OpenAI. Starting with GPT, the subsequent models have witnessed substantial growth, such as ChatGPT and GPT-4. These increased sizes allow the models to capture more intricate patterns in the training data, resulting in better performance on various tasks (like acing exams). Nevertheless, they seem to fail in understanding and reasoning when handling historical documents. This project aims at building a dataset in a semi-automatic manner to improve the application of LLMs in historical data analysis.
Research Questions:
- Can we create a dataset in a (semi-automatic/automatic) manner for training an LLM to better understand historical documents?
- Can a specialized, resource-efficient LLM effectively process noisy historical digitised documents?
Objective: The objective of this project is to develop an instruction-based dataset to enhance the ability of LLMs to understand and interpret historical documents. This will involve sourcing historical Swiss and Luxembourgish newspapers spanning 200 years, as well as other historical collections such as those in ancient Greek or Latin. Two fictitious examples:
Instruction/Prompt: When was the Red Cross founded?
Example Answer: 1864
Instruction / Prompt: Given the following excerpt from a Luxembourgish newspaper from 1919, identify the main event and key figures involved. Excerpt: “En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde, qui avait abdiqué en raison de controverses liées à la Première Guerre mondiale.”
Example Response: Grand Duchess Charlotte and her sister, Grand Duchess Marie-Adélaïde.
Main Steps:
- Data Curation: Gather datasets based on OCR level, and familiarize with the corpus by exploring historical newspaper articles. Identify common features of historical documents and potential difficulties.
- Dataset Creation: Decide on what type of instruction should be generated and utilise other existing LLMs such as T5, BART, etc., to generate instructions (or questions) from Swiss historical documents (see the sketch after this list).
- (Additional) Train or finetune a LLaMA language model based on this dataset.
- (Additional) Evaluation: Utilise the newly created dataset to evaluate existing LLMs and assess their performance on various NLP tasks such as named entity recognition (NER) and linking (EL) in historical documents. Use standard metrics such as accuracy, perplexity, or F1 score for the evaluation. Compare the performance of models trained with the new dataset against those trained with standard datasets to ascertain the effectiveness of the new dataset.
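For the Dataset Creation step, a minimal sketch of instruction/question generation with an off-the-shelf text2text model is given below; the model choice (google/flan-t5-base, as a stand-in for the T5/BART variants mentioned above) and the prompt wording are illustrative assumptions:

```python
from transformers import pipeline

# Minimal sketch, assuming a generic instruction-tuned text2text model.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

excerpt = (
    "En 1919, la Grande-Duchesse Charlotte est montée sur le trône du Luxembourg, "
    "succédant à sa sœur, la Grande-Duchesse Marie-Adélaïde."
)

# Ask the model to produce a question that the newspaper excerpt answers.
prompt = f"Generate a question that this newspaper excerpt answers: {excerpt}"
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```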
Requirements: Proficiency in Python, preferably PyTorch, excellent writing skills, and dedication to the project.
Outputs: The project’s results could potentially lead to publications in relevant research areas and would contribute to the field of historical document processing.
Type: MSc Semester project (12 ECTS) or MSc project
Sections: Data Science, Computer Science, Digital Humanities
Supervisors: Emanuela Boros, Maud Ehrmann
Number of students: 1-2
Figure: Forecasting News (Generated with Midjourney)
Context: The rapid evolution and widespread adoption of digital media have led to an explosion of news content. News organizations are continuously publishing articles, making it increasingly challenging to keep track of daily developments and their implications. To navigate this overwhelming amount of information, computational methods capable of processing and making predictions about this content are of significant interest. With the advent of advanced machine learning techniques, such as Generative Adversarial Networks (GANs) and Large Language Models (LLMs), it is possible to forecast future content based on existing articles. This project proposes to leverage the strengths of both GANs and LLMs to predict the content of next-day articles based on current-day news. This approach will not only allow for a better understanding of how events evolve over time, but could also serve as a tool for news agencies to anticipate and prepare for future news developments.
Research Questions:
- Can we design a system that effectively leverages GANs and LLMs to predict next-day article content based on current-day articles?
- How accurate are the generated articles when compared to the actual articles of the next day (how close to reality are they)?
- What are the limits and potential biases of such a system and how can they be mitigated?
Objective: The objective of this project is to design and implement a system that uses a combination of GANs and LLMs to predict the content of next-day news articles. This will be measured by the quality, coherence, and accuracy of the generated articles compared to the actual articles from the following day.
Main Steps:
- Dataset Acquisition: Procure a dataset consisting of sequential daily articles from multiple sources.
- Data Preprocessing: Clean and preprocess the data for training. This involves text normalization, tokenization, and the creation of appropriate training pairs.
- Generator Network Design: Leverage an LLM as the generator network in the GAN. This network will generate the next-day article based on the input from the current-day article.
- Discriminator Network Design: Build a discriminator network capable of distinguishing between the actual next-day article and the generated article.
- GAN Training: Train the GAN system by alternating between training the discriminator to distinguish real vs generated articles, and training the generator to fool the discriminator.
- Evaluation: Assess the generated articles based on measures of text coherence, relevance, and similarity to the actual next-day articles (see the sketch after this list).
- Bias and Limitations: Examine and discuss the potential limitations and biases of the system, proposing ways to address these issues.
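As a starting point for the Evaluation step, here is a minimal sketch of a crude token-overlap F1 between a generated article and the actual next-day article; in practice this would be complemented by embedding-based similarity measures and human judgement:

```python
from collections import Counter

def overlap_f1(generated: str, reference: str) -> float:
    """Token-level F1 between a generated article and the actual next-day article.
    A crude stand-in for more elaborate coherence/similarity metrics."""
    gen_tokens = Counter(generated.lower().split())
    ref_tokens = Counter(reference.lower().split())
    common = sum((gen_tokens & ref_tokens).values())
    if common == 0:
        return 0.0
    precision = common / sum(gen_tokens.values())
    recall = common / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)

print(overlap_f1("the parliament voted on the new law", "parliament votes new law today"))
```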
Master Project Additions:
If the project is taken as a master project, the student will further:
- Refine the Model: Apply advanced training techniques, and perform a detailed hyperparameter search to optimize the GAN’s performance.
- Multi-Source Integration: Extend the model to handle and reconcile articles from multiple sources, aiming to generate a more comprehensive next-day article.
- Long-Term Predictions: Investigate the model’s capabilities and limitations in making longer-term predictions, such as a week or a month in advance.
Requirements: Knowledge of machine learning and deep learning principles, familiarity with GANs and LLMs, proficiency in Python, and experience with a deep learning framework, preferably PyTorch.
Taken
Type: BA (8 ECTS) Semester project
Sections: Data Science, Computer Science, Digital Humanities
Supervisor: Emanuela Boros
Number of students: 1–2
Keywords: Document processing, Natural Language Processing (NLP), Information Extraction, Machine Learning, Deep Learning
Figure: Assessing Climate Change Perceptions and Behaviours in Historical Newspapers (Generated with Midjourney)
Context: With emissions in line with current Paris Agreement commitments, global warming is projected to exceed 1.5°C above pre-industrial levels, even if these commitments are complemented by very challenging increases in the magnitude and ambition of mitigation after 2030. Even with this seemingly slight increase, the consequences of global warming are already observable today, with the number and intensity of certain natural hazards continuing to increase (e.g., extreme weather events, floods, forest fires). Near-term warming and the increased frequency, severity, and duration of extreme events will place many terrestrial, freshwater, coastal and marine ecosystems at high or very high risk of biodiversity loss. Exploring historical documents can help address gaps in our understanding of the historical roots of climate change, possibly uncover evidence of early efforts to address environmental issues, and show how environmentalism has evolved over time. This project aims to fill these gaps by examining a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus).
Research Questions:
- How have perceptions of climate change evolved over time, as seen in historical newspapers?
- What behavioural trends towards climate change can be identified from historical newspapers?
- Can we track the frequency and intensity of extreme weather events over time based on historical documents?
- Can we identify any patterns or trends in early efforts to address environmental issues?
- How has the sentiment towards climate change and environmentalism evolved over time?
Objective: This work explores several NLP techniques (text classification, information extraction, etc.) for providing a comprehensive understanding of the evolution and reporting of extreme weather events in historical documents.
Main Steps:
- Data Preparation: Identify relevant keywords and phrases related to climate change and environmentalism, such as “global warming”, “carbon emissions”, “climate policy”, or “hurricane”, “flood”, “drought”, “heat wave”, and others. Compile a training dataset of articles that are around these relevant keywords.
- Data Analysis: Analyse the data and identify patterns in climate change perceptions and behaviours over time. This includes the identification of changes in the frequency of climate-related terms, changes in sentiment towards climate change, changes in the topics discussed in relation to climate change, the detection of mentions of locations, persons, or events, and the extraction of important keywords in weather forecasting news.
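As an illustration of the Data Analysis step, a minimal sketch of tracking keyword frequencies per decade with pandas is shown below; the column names and toy articles are illustrative assumptions about how the corpus might be loaded:

```python
import pandas as pd

# Minimal sketch, assuming articles are available as a DataFrame with
# 'year' and 'text' columns (column names are illustrative).
articles = pd.DataFrame({
    "year": [1921, 1921, 1954, 1988],
    "text": [
        "Severe flood reported along the Rhône ...",
        "Unusual heat wave in the Valais ...",
        "Drought threatens harvests ...",
        "Scientists warn about global warming and carbon emissions ...",
    ],
})

keywords = ["flood", "drought", "heat wave", "global warming", "carbon emissions"]

# Flag each article that mentions each keyword.
for kw in keywords:
    articles[kw] = articles["text"].str.lower().str.contains(kw)

# Number of articles per decade mentioning each keyword.
articles["decade"] = (articles["year"] // 10) * 10
print(articles.groupby("decade")[keywords].sum())
```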
Requirements: Candidates should have a background in machine learning, data engineering, and data science, with proficiency in NLP techniques such as Named Entity Recognition, Topic Detection, or Sentiment Analysis. A keen interest in climate change, history, and media studies is also beneficial.
Resources:
- Analyzing the impact of climate change on critical infrastructure from the scientific literature: A weakly supervised NLP approach
- Climate of scepticism: US newspaper coverage of the science of climate change
- We provide an nlp-beginner-starter Jupyter notebook.
Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Computer Science, Data Science, Digital Humanities
Supervisors: Maud Ehrmann, Emanuela Boros
Number of students: 1–2
Figure: Exploring Large Vision-Language Pre-trained Models for Historical Images Classification and Captioning (Generated with Midjourney)
Context: The impresso project features a dataset of around 90 digitised historical newspapers containing approximately 3 million images. These images have no labels, and only 10% of them have a caption, two aspects that hinder their retrieval.
Two previous student projects focused on the automatic classification of these images, trying to identify 1. the type of image (e.g. photograph, illustration, drawing, graphic, cartoon, map), and 2. in the case of maps, the country or region of the world represented on that map. Good performances on image type classification were achieved by fine-tuning the VGG-16 pre-trained model (see report).
Objective: On the basis of these initial experiments and the annotated dataset compiled on this occasion, the present project will explore recent large-scale language-vision pre-trained models. Specifically, the project will attempt to:
- Evaluate zero-shot image type classification of the available dataset using the CLIP and LLaMA-Adapter-V2 Multi-modal models, and compare with previous performances (see the sketch below);
- Explore and evaluate image captioning of the same dataset, including trying to identify countries or regions of the world. This part will require adding caption information on the test part of the dataset. In addition to the fluency and accuracy of the generated captions, a specific aspect that might be taken into account is distinctiveness, i.e. whether the image contains details that differentiate it from similar images.
Given the recency of the models, further avenues of exploration may emerge during the course of the project, which is exploratory in nature.
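As an illustration of the first objective, a minimal zero-shot classification sketch closely following the openai/CLIP README is given below; the label set, prompt template and image path are illustrative assumptions:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate image types (illustrative subset of the project typology).
labels = ["a photograph", "an illustration", "a drawing", "a cartoon", "a map"]
image = preprocess(Image.open("newspaper_image.png")).unsqueeze(0).to(device)
text = clip.tokenize([f"{label} in a historical newspaper" for label in labels]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(dict(zip(labels, probs[0].round(3))))
```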
Requirements: Knowledge of machine learning and deep learning principles, familiarity with computer vision, proficiency in Python, experience with a deep learning framework (preferably PyTorch), and interest in historical data.
References:
- CLIP repository and model: https://github.com/openai/CLIP
- LLaMA-Adapter-V2 Multi-modal: https://github.com/ZrrSkywalker/LLaMA-Adapter/tree/main/llama_adapter_v2_multimodal
- Zhang, Y., Wang, J., Wu, H., Xu, W. (2023). Distinctive Image Captioning via CLIP Guided Group Optimization. In: Karlinsky, L., Michaeli, T., Nishino, K. (eds) Computer Vision – ECCV 2022 Workshops. ECCV 2022. Lecture Notes in Computer Science, vol 13804. Springer, Cham.
- Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever Proceedings of the 38th International Conference on Machine Learning, PMLR 139:8748-8763, 2021.
Spring 2023
Taken
Type: MSc Semester project
Sections: Digital humanities, Data Science, Computer Science
Supervisor: Rémi Petitpierre
Keywords: Data visualisation, Web design, IIIF, Memetics, History of Cartography
Number of students: 1–2 (min. 12 ECTS in total)
Context: Cultural evolution can be modeled as an ensemble of ideas, and conventions, that spread from one mind to another. This is referred to as memes, elementary replicators of culture inspired by the concept of a gene. A meme can be a tune, a catch-phrase, or an architectural detail, for instance. If we take the example of maps, a meme can be expressed in the choice of a certain texture, a colour, or a symbol, to represent the environment. Thus, memes spread from one cartographer to another, through time and space, and can be used to study cultural evolution. With the help of computer vision, it is now possible to track those memes, by finding replicated visual elements through large datasets of digitised historical maps.
Below is an example of visual elements extracted from a 17th century map (original IIIF image). By extending the process to tens of thousands of maps and embedding these elements in a feature space, it becomes possible to estimate which elements correspond to the same replicated visual idea, or meme. This opens up new ways to understand how ideas and technologies spread across time and space.
Despite the immense potential that such data holds for better understanding cultural evolution, it remains difficult to interpret, since it involves tens of thousands of memes, corresponding to millions of visual elements. Somewhat like in genomics research, it is now becoming essential to develop a microscope to observe memetic interactions more closely.
Objectives: In this project, the student will tackle the challenge of designing and building a prototype interface for the exploration of memes on the basis of replicated visual elements. The scientific challenge is to create a design that reflects the layered and interconnected nature of the data, composed of visual elements, maps, and larger sets. The student will develop their project by working with digital embeddings of replicated visual elements and digitised images in the IIIF framework. The interface will make use of the metadata to visualise how time, space, as well as the development of new technologies influence cartographic figuration, by filtering the visual elements. Finally, to reflect the multi-layered nature of the data, the design must be transparent and provide the ability to switch between visual elements and their original maps.
The project will draw on an exceptional dataset of tens of thousands of American, French, German, Dutch, and Swiss maps published between 1500 and 1950. Depending on the student’s interests, an interesting case study to demonstrate the benefits of the interface could be to investigate the impact of the invention of lithography, a revolutionary technology from the end of the 18th century, on the development of modern cartographic representations.
Prerequisites: Web Programming, basics of JavaScript.
Type: MSc (12 ECTS) or BA (8 ECTS) Semester project
Sections: Digital humanities, Computer Science
Supervisor: Beatrice Vaienti
Keywords: OCR, database
Number of students: 1
Context: An ongoing project at the Lab is focusing on the creation of a 4D database of the urban evolution of Jerusalem between 1840 and 1940. This database will not only depict the city in time as a 3D model, but also embed additional information and metadata about the architectural objects, thus employing the spatial representation of the city as a relational database. However, the scattered, partial and multilingual nature of the available historical sources underlying the construction of such a database makes it necessary to combine and structure them together, extracting from each one of them the information describing the architectural features of the buildings.
Two architectural catalogues, “Ayyubid Jerusalem” (1187-1250) and “Mamluk Jerusalem: an Architectural Study”, contain respectively 22 and 64 chapters, each one describing a building of the Old City in a thorough way. The information they provide includes for instance the location of the buildings, their history (founder and date), an architectural description, pictures and technical drawings.
Objectives: Starting from the scanned version of these two books, the objective of the project is to develop a pipeline to extract and structure their content in a relational spatial database. The content of each chapter is organised in sections that systematically cover the various aspects of each building’s architecture and history. Along with this already structured information, photos and technical drawings accompany the text: the richness of the images and architectural representations in the books should also be considered and integrated into the project outcomes. Particular emphasis will be placed on the extraction of information about the architectural appearance of buildings, which can then be used for procedural modelling tasks. Given these elements, three main sub-objectives are envisioned:
- OCR of the scanned pages (see the sketch after this list);
- Organisation of the extracted text into the original sections (or in new ways) and extraction of the information pertaining to the architectural features of the buildings;
- Encoding of the information in a spatial DB, using the locational information present in each chapter to geolocate the building, possibly associating its position with the existing geolocated building footprints.
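A minimal sketch of the first sub-objective, OCR of a scanned page with pytesseract, is given below; the file name, language setting and section headings are illustrative assumptions (Tesseract must be installed on the system):

```python
import pytesseract
from PIL import Image

# Minimal sketch, assuming scanned chapter pages are available as image files.
page = Image.open("mamluk_jerusalem_page_042.png")
text = pytesseract.image_to_string(page, lang="eng")

# Naive check for the recurring chapter sections (headings are illustrative).
for section in ["LOCATION", "FOUNDER", "HISTORY", "ARCHITECTURE"]:
    if section in text:
        print(section, "found at offset", text.index(section))
```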
Prerequisites: basic knowledge of database technologies and Python
Type: MSc Semester project or Master project
Sections: Computer Science, Data Science, Digital humanities
Supervisor: Maud Ehrmann, as well as historians and network specialists from the C2DH Center at the University of Luxembourg.
Keywords: Document processing, NLP, machine learning
Context: News agencies (e.g. AFP, Reuters) have always played an important role in shaping the news landscape. Created in the 1830s and 1840s by groups of daily newspapers in order to share the costs of news gathering (especially abroad) and transmission, news agencies have gradually become key actors in news production, responsible for providing accurate and factual information in the form of agency releases. While studies exist on the impact of news agency content on contemporary news, little research has been done on the influence of news agency releases over time. During the 19C and 20C, to what extent did journalists rely on agency content to produce their stories? How was agency content used in historical newspapers, as simple verbatims (copy and paste) or with rephrasing? Were they systematically attributed or not? How did news agency releases circulate among newspapers and which ones went viral?
Objective: Based on a corpus of historical Swiss and Luxembourgish newspapers spanning 200 years (i.e., the impresso project’s corpus), the goal of this project is to develop a machine learning-based classifier to distinguish newspaper articles based on agency releases from other articles. In addition to detecting agency releases, the system could also identify the news agency behind the release.
Main steps:
- Ground preparation with a) exploration of the corpus (via the impresso interface) to become familiar with historical newspaper articles and identify typical features of agency content as well as potential difficulties; b) compilation of a list of news agencies active throughout the 19C and 20C.
- Construction of a training corpus (in French), based on:
- sampling and manual annotation;
- a collection of manually labelled agency releases, which could be automatically extended by using previously computed text reuse clusters (data augmentation).
- Training and evaluation of two (or more) agency release classifiers:
- a keyword baseline (i.e. where the article contains the name of the agency; see the sketch after this list);
- a neural-based classifier
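To make the keyword baseline concrete, a minimal sketch is given below; the agency list and the input snippet are illustrative assumptions:

```python
import re

# Minimal sketch of the keyword baseline: flag an article as an agency release
# if it contains the name or acronym of a known agency (list is illustrative).
AGENCIES = ["Havas", "Reuters", "AFP", "ATS", "Wolff", "United Press"]
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, AGENCIES)) + r")\b")

def keyword_baseline(article_text: str):
    """Return the first agency name found, or None if no agency is mentioned."""
    match = PATTERN.search(article_text)
    return match.group(1) if match else None

print(keyword_baseline("(ATS) Le Conseil fédéral a annoncé hier ..."))  # -> 'ATS'
```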
Master project – If the project is taken as a master project, the student will also work on the following:
Processing:
- Multilingual processing (French, German, Luxembourgish);
- Systematic evaluation and comparison of different algorithms;
- Identification of the news agency behind the release;
- Fine-grained characterisation of text modification;
- Application of the classifier to the entire corpus.
Analysis:
- Study of the distribution of agency content and its key characteristics over time (in a data science fashion).
- Based on the computed data, study of information flows based on a network representation of news agencies (nodes), news releases (edges) and newspapers (nodes).
Eventually, such work will enable the study of news flows across newspapers, countries, and time.
Requirements: Good knowledge in machine learning, data engineering and data science. Interest in media and history.
Fall 2022
There were a few projects only and we no longer have the capacity to host projects for that period. Check out around December 2022 what will be proposed for Spring 2023!
Spring 2022
Type of project: Semester
Supervisors: Sven Najem-Meyer (DHLAB), Matteo Romanello (UNIL).
Context: Optical Character Recognition aims at transforming images into machine-readable texts. Though neural networks helped to improve performances, historical documents remain extremely challenging. Commentaries to classical Greek literature epitomize this difficulty, as systems must cope with noisy scans, complex layouts and mixed Greek and Latin scripts.
Objective: The objective of the project is to produce a system that can handle this highly complex data. Depending on your skills and interests, the project can focus on:
- Image pre-processing (see the sketch after this list)
- Multitasking: can we improve OCR by processing tasks like layout analysis or named entity recognition in parallel?
- Benchmarking and fine-tuning available frameworks
- Optimizing post-processing with NLP
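As an illustration of the image pre-processing direction, a minimal denoising and binarisation sketch with OpenCV is given below; the file name and parameter values are illustrative assumptions:

```python
import cv2

# Minimal sketch: greyscale loading, denoising and Otsu binarisation of a
# scanned commentary page (file name is illustrative).
image = cv2.imread("commentary_scan.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.fastNlMeansDenoising(image, None, 10, 7, 21)
_, binarised = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite("commentary_scan_binarised.png", binarised)
```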
Prerequisites:
- Good skills in Python; libraries such as OpenCV or PyTorch are a plus.
- Good knowledge of machine learning is advised; basics in computer vision and image processing would be a real plus.
- No knowledge of ancient Greek/literature is required.
Type of project: Semester
Supervisor: Maud Ehrmann
Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:
1/ Learn a model to classify images according to their types (e.g. map, photograph, illustration, comics, ornament, drawing, caricature).
This first step will consist in:
- the definition of the typology by looking at the material – although the targeted typology will be rather coarse;
- the annotation of a small training set;
- the training of a model by fine-tuning an existing visual model (see the sketch below);
- the evaluation of the said model.
2/ Apply this model on a large-scale collection of historical newspapers (inference), and possibly compute the statistical profile of the recognised elements through time.
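A minimal sketch of the fine-tuning step (1/), using a pre-trained torchvision backbone, is given below; the backbone choice, the frozen-feature strategy and the class list are illustrative assumptions:

```python
import torch.nn as nn
from torchvision import models

# Classes from the coarse typology above: map, photograph, illustration,
# comics, ornament, drawing, caricature.
NUM_CLASSES = 7

# Load a pre-trained backbone and replace the classification head
# (requires torchvision >= 0.13 for the weights enum).
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                                # freeze the backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)        # trainable head

# The model can now be trained with a standard cross-entropy loop
# on the small annotated training set.
```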
Required skills:
- basic knowledge in computer vision
- ideally experience with PyTorch
Type of project: Semester (ideally done in loose collaboration with the other semester project on image classification)
Supervisor: Maud Ehrmann
Objective: Given a large archive of historical newspapers (cf. the impresso project) containing both text and image material, the objective of this project is to:
1/ Learn a model for binary image classification: map vs. non-map.
This first step will consist in:
- the annotation of a small training set (this step is best done in collaboration with project on image classification);
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
2/ Learn a model for map classification (which country or region of the world is represented)
- first exploration and qualification of map types in the corpus.
- building of a training set, probably with external sources;
- the training of a model by fine-tuning an existing visual model;
- the evaluation of the said model.
Required skills:
- basic knowledge in computer vision
- ideally experience with PyTorch
Type of project: Semester
Supervisors: Matteo Romanello (DHLAB), Maud Ehrmann (DHLAB), Andreas Spitz (DLAB)
Context: Digitized newspapers constitute an extraordinary goldmine of information about our past, and historians are among those who can most benefit from it. Impresso, an ongoing, collaborative research project based at the DHLAB, has been building a large-scale corpus of digitized newspapers: it currently contains 76 newspapers from Switzerland and Luxembourg (written in French, German and Luxembourgish) for a total of 12 billion tokens. This corpus was enriched with several layers of semantic information such as topics, text reuse and named entities (persons, locations, dates and organizations). The latter are particularly useful for historians as co-occurrences of named entities often indicate (and help to identify) historical events. The impresso corpus currently contains some 164 million entity mentions, linked to 500 thousand entities from DBpedia (partly mapped to Wikidata).
Yet, making use of such a large knowledge graph in an interactive tool for historians — such as the tool impresso has been developing — requires an underlying document model that facilitates the effective and efficient retrieval of entity relations, contexts, and related events from the documents. This is where LOAD comes into play: a graph-based document model that supports browsing, extracting and summarizing real-world events in large collections of unstructured text, based on named entities such as Locations, Organizations, Actors and Dates.
Objective: The student will build a LOAD model of the impresso corpus. In order to do so, an existing Java implementation, which uses MongoDB as its back-end, can be used as a starting point. Also, the student will have access to the MySQL and Solr instances where impresso semantic annotations have already been stored and indexed. Once the LOAD model is built, an already existing front-end called EVELIN will be used to create a first demonstrator of how LOAD can enable entity-driven exploration of the impresso corpus.
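For illustration (in Python rather than the Java of the existing implementation), a simplified sketch of building an entity co-occurrence graph in the LOAD spirit with networkx is given below; the article identifiers and entities are invented placeholders:

```python
import itertools
import networkx as nx

# Illustrative annotated articles: each carries the named entities found in it.
articles = [
    {"id": "GDL-1918-11-11-a-i0042",
     "entities": ["Woodrow Wilson", "Paris", "1918-11-11"]},
    {"id": "GDL-1918-11-12-a-i0007",
     "entities": ["Georges Clemenceau", "Paris", "1918-11-12"]},
]

graph = nx.Graph()
for article in articles:
    # Every pair of co-occurring entities becomes (or reinforces) an edge.
    for a, b in itertools.combinations(article["entities"], 2):
        if graph.has_edge(a, b):
            graph[a][b]["weight"] += 1
            graph[a][b]["articles"].append(article["id"])
        else:
            graph.add_edge(a, b, weight=1, articles=[article["id"]])

print(graph.number_of_nodes(), "entities,", graph.number_of_edges(), "co-occurrence edges")
```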
Required skills:
- good proficiency in Java or Scala
- familiarity with graph/network theory
- experience with big data technologies (Kubernetes, Spark, etc.)
- experience with PostgreSQL, MySQL or MongoDB
Note for potential candidates: In Spring and Fall 2020, two students already worked on this project, but the work and research perspectives are far from exhausted and many things remain to be explored. We therefore propose to pursue this work, and the focus of the next edition of the project will be adapted to the candidate’s background and preferences. Do not hesitate to get in touch!
Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at exploring the relevance of Transformer architectures for 3D point clouds. In a first series of experiments, the student will use the large 3D point clouds of models produced at the DHLAB for the city of Venice or the city of Sion and try to predict missing parts. In a second series of experiments, the student will use the SRTM (Shuttle Radar Topography Mission, a NASA mission conducted in 2000 to obtain elevation data for most of the world) to encode/decode terrain prediction.
Contact: Prof. Frédéric Kaplan
Type of project: Semester
Supervisors: Didier Dupertuis and Paul Guhennec.
Context: The Federal Office of Topography (Swisstopo) has recently made accessible high-resolution digitizations of historical maps of Switzerland, covering every year between 1844 and 2018. In order to use these assets for geo-historical research and analysis, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form. This conversion from an input raster to an output vector geometry is typically well performed by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques, but might prove challenging for datasets with a larger figurative diversity, as in the case of Swiss historical maps, whose style varies over time.
Objective: The ambition of this work is to develop a pipeline capable of transforming the buildings and roads of the Swisstopo set of historical maps into geometries correctly positioned in an appropriate Geographic Coordinates System. The student will have to train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it for it to be applicable on the entire dataset. Depending on the interest of the student, some first diachronic analyses of the road network evolution in Switzerland can be considered.
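As an illustration of the raster-to-vector step, a minimal sketch turning a predicted binary building mask into pixel-space polygons with OpenCV is given below; the file name and thresholds are illustrative assumptions, and the georeferencing step is not covered:

```python
import cv2

# Minimal sketch, assuming a binary mask of the 'building' class has already
# been predicted by the segmentation network and saved as an image.
mask = cv2.imread("building_mask.png", cv2.IMREAD_GRAYSCALE)
_, mask = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)

# Extract outer contours and simplify them into polygons (pixel coordinates).
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
polygons = [cv2.approxPolyDP(c, 2.0, True).squeeze(1).tolist()
            for c in contours if cv2.contourArea(c) > 50]
print(len(polygons), "building polygons extracted (still in pixel coordinates)")
```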
Prerequisites:
- Good skills with Python.
- Basics in machine learning and computer vision are a plus.
Type of project: Semester
Supervisors: Rémi Petitpierre (IAGS), Paul Guhennec (DHLAB)
Context: The Bibliothèque Historique de la Ville de Paris digitised more than 700 maps covering in detail the evolution of the city from 1836 (plan Jacoubet) to 1900, including the famous Alphand atlases. They are of particular interest for urban studies of Paris, which was at the time heavily transfigured by the Haussmannian transformations. For administrative and political reasons, the City of Paris did not benefit from the large cadastration campaigns that occurred throughout Napoleonic Europe at the beginning of the 19th century. The atlas sheets are therefore the finest source of information available on the city over the 19th century. To exploit the great potential of this dataset, the information contained in the maps must be converted from its visual representation to an abstract, geometrical form, on which to base quantitative studies. This conversion from an input raster to an output vector geometry is typically well performed by combining Convolutional Neural Networks for pixelwise classification with standard computer vision techniques. In a second phase, the student will develop quantitative techniques to investigate the transformation of the urban fabric.
Tasks
- Semantically segment the Atlas de Paris maps as well as the 1900 plan, train a pre-existing semantic segmentation neural network on a manually annotated training set, evaluate its success, and fine-tune it for it to be applicable on the entire dataset.
- Develop a semi-automatic pipeline to align the vectorised polygons on a geographic coordinate system.
- Depending on the student’s interest, tackle some questions such as:
- analysing the impact of the opening of new streets on mobility within the city;
- detecting the appearance of locally aligned neighbourhoods (lotissements);
- investigating the relation between the unique Parisian hygienist “ilot urbain” and housing salubrity;
- studying the morphology of the city’s infrastructure network (e.g. sewers)
Prerequisites
Good skills with Python.
Bases in machine learning and computer vision are a plus.
Supervisors: Paul Guhennec and Federica Pardini.
Context: In the early days of the 19th century, and made possible by the Napoleonic invasions of Europe, a vast administrative endeavour started to map as faithfully as possible the geometry of most cities of Europe. The resulting so-called Napoleonic cadasters are a very precious testimony of the past state of these cities at the European scale. For matters of practicality and detail, most cities are represented on separate sheets of paper, with margin overlaps between them and indications on how to reconstruct the complete picture.
Recent work at the laboratory has shown that it is possible to make use of the great homogeneity in the visual representations of the parcels of the cadaster to automatically vectorize them and classify them according to a fixed typology (private building, public building, roads, etc). However, the process of aligning these cadasters with the “real” world, by positioning them in a Geographic Coordinate System, and thus allowing large-scale quantitative analyses, remains challenging.
Objective: A first problem to tackle is the combination of the different sheets to make up the full city. Building on a previous student project, the student will develop a process to automatically align neighbouring sheets, while accounting for the imperfections and misregistrations in these historical documents. In a second stage, a pipeline will be developed in order to align the combined sheets obtained at the previous step on contemporary geographic data.
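A minimal sketch of the sheet-alignment idea, using ORB feature matching and a RANSAC-estimated homography in OpenCV, is given below; the file names and parameter values are illustrative assumptions:

```python
import cv2
import numpy as np

# Minimal sketch: align two neighbouring cadaster sheets via their margin overlap.
sheet_a = cv2.imread("sheet_12.png", cv2.IMREAD_GRAYSCALE)
sheet_b = cv2.imread("sheet_13.png", cv2.IMREAD_GRAYSCALE)

# Detect and describe keypoints on both sheets.
orb = cv2.ORB_create(nfeatures=5000)
kp_a, des_a = orb.detectAndCompute(sheet_a, None)
kp_b, des_b = orb.detectAndCompute(sheet_b, None)

# Match descriptors and keep the best correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)[:500]

# Estimate the transform from sheet A to sheet B, rejecting outliers with RANSAC.
src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
print("Estimated transform from sheet A to sheet B:\n", H)
```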
Prerequisites:
- Good skills in Python.
- Experience with computer vision libraries
Type of project: Semester project or Master thesis
Supervisors: Didier Dupertuis and Frédéric Kaplan.
Context: In 1798, after a millennium as a republic, Venice was taken over by Napoleonic armies. A new centralized administration was erected in the former city-state. It went on to create many valuable archival documents giving a precise image of the city and its population.
The DHLAB just finished the digitization of two complementary sets of documents: the cadastral maps and their accompanying registries, the “Sommarioni”. The cadastral maps give an accurate picture of the city with a clear delineation of numbered parcels. The Sommarioni registers contain information about each parcel, including a one-line description of its owner(s) and type of usage (housing, business, etc.).
The cadastral maps have been vectorized, with precise geometries and numbering for each parcel. The 230’000 records of the Sommarioni have been transcribed. Resulting datasets have been brought together and can be explored in this interactive platform (only available via EPFL intranet or VPN).
Objective: The next challenge is to extract structured data from the Sommarioni owner descriptions, i.e. to recognize and disambiguate people, business and institution names. The owner description is a noisy text snippet mentioning several relatives’ names; some records only contain the information that they are identical to the previous one; institution names might have different spellings; and there are many homonyms among people names.
The ideal output would be a list of disambiguated institutions and people with, for the latter, the network of their family members.
The main steps are as follows:
- Definition of entity typology (individual, family or collective, business, etc.);
- Entity extraction in each record, handling the specificities of each type;
- Entity disambiguation and linking between records (see the sketch after this list);
- Creation of a confidence score for the linking and disambiguation to quantify uncertainty, and of different scenarios for different degrees of uncertainty;
- If time permits, analysis and discussion of results in relation to the Venice of 1808;
- If time permits, integration of the results in the interactive platform.
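As an illustration of a first-pass disambiguation heuristic, a minimal sketch grouping near-identical owner strings with difflib is given below; the records and the similarity threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

# Illustrative owner descriptions as transcribed in the Sommarioni.
records = [
    "Contarini Alvise quondam Giacomo",
    "Contarini Alvise q. Giacomo",
    "Scuola Grande di San Rocco",
]

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Crude string similarity test; a real pipeline would add normalisation rules."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy clustering: attach each record to the first sufficiently similar cluster.
clusters = []
for record in records:
    for cluster in clusters:
        if similar(record, cluster[0]):
            cluster.append(record)
            break
    else:
        clusters.append([record])

print(clusters)  # the first two records should end up in the same cluster
```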
Prerequisites:
- Good knowledge of python and data-wrangling;
- No special knowledge of Venetian history is needed;
- Proficiency in Italian is not necessary but would be a plus.
More explorative projects
For students who want to explore exciting uncharted territories in an autonomous manner.
Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project studies the properties of text kernels: sentences that are invariant after automatic translation back and forth into a foreign language. The goal is to develop a prototype of a text editor / transformation pipeline that associates any sentence with its invariant form.
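A minimal sketch of the round-trip translation test is given below, assuming the Helsinki-NLP MarianMT models via the transformers library (any translation backend would do; the example sentence is illustrative):

```python
from transformers import pipeline

# English -> French and French -> English translation pipelines.
en_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

sentence = "The meeting will take place tomorrow at noon."
french = en_fr(sentence)[0]["translation_text"]
back = fr_en(french)[0]["translation_text"]

# A sentence is a candidate "text kernel" if it survives the round trip unchanged.
print(back, "| invariant:", back.strip() == sentence)
```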
Contact: Prof. Frédéric Kaplan
Type of project : Master thesis
Supervisors: Frédéric Kaplan
Project summary: This project aims at studying the potential of Transformer architectures for new kinds of language games between artificial agents and the evolution of artificial languages. Artificial agents interact with one another about situations in the “world” and autonomously develop their own language on this basis. This project extends a long series of experiments that started in the 2000s.
Contact: Prof. Frédéric Kaplan
Type of project : Master/Semester thesis
Supervisors: Frédéric Kaplan
Project summary: This explorative project consists in mining a large collection of novels to automatically extract characters and create an automatically generated dictionary of the novel characters. The characters will be associated with the novels in which they appear, with the characters they interact with and possibly with specific traits.
Contact: Prof. Frédéric Kaplan
Past projects
Autumn 2022
Project type: Master
Supervisors: Maud Ehrmann and Simon Clematide (UZH)
Context: The impresso project aims at semantically enriching 200 years of newspapers archives by applying a range of NLP techniques (OCR correction, named entity processing, topic modeling, text reuse, etc.). The source material comes from Swiss and Luxembourg national libraries and corresponds to the fac-similes and OCR outputs of ca. 200 newspaper titles in German and French.
Problems
- Automatic processing of these sources is greatly hindered by the sometimes low quality of legacy OCR and OLR (when present) processes: articles are incorrectly transcribed and incorrectly segmented. This has consequences on downstream text processing (e.g. a topic with only badly OCRized tokens). One solution is to filter out elements before they enter the text processing pipeline. This would imply recognizing specific parts of newspaper pages known to be particularly noisy such as: meteo tables, transport schedules, cross-words, etc.
- Additionally, besides advertisement recognition, OLR does not provide any section classification (is this segment a title banner, a feuilleton, an editorial, etc.) and it would be useful to provide basic section/rubrique information.
Objectives: Building on a previous master thesis which explored the interplay between textual and visual features for segment recognition and classification in historical newspapers (see project description, master thesis, and published article), this master project will focus on the development, evaluation, application and release of a documented pipeline for the accurate recognition and fine-grained semantic classification of tables.
Tables present several challenges, among others:
- as usual, differences across time and sources;
- visual clues can be confusing:
- presence of mixed layouts: table-like parts + normal text;
- existence of “quasi-tables” (e.g. lists);
- variety of semantic classes: stock exchanges, TV/Radio program, transport schedules, sport results, events of the day, meteo, etc.
Main objectives will be:
- Creation and evaluation of table recognition and classification models.
- Application of these models on large-scale newspaper archives thanks to a software/pipeline which will be documented and released in view of further usage in academic context. This will support the concrete use case of specific dataset export by scholars.
- (bonus) Statistical profile of the large-scale table extraction data (frequencies, proportion in title/pages, comparison through time and titles).
Spring 2021
Type of project: Semester
Supervisors: Albane Descombes, Frédéric Kaplan.
Photogrammetry is a 3D modelling technique which enables making highly precise models of our built environment, thus collecting lots of digital data on our architectural heritage – at least, as it remains today.
Over the years, a place could have been recorded by drawing, painting, photographing, scanning, depending on the evolution of measuring techniques.
For this reason, one has to mix various media to show the evolution of a building through the centuries. This project proposes to study the techniques which enable to overlay images over 3D photogrammetric models of Venice and of the Louvre museum. The models are provided by the DHLab, and were computed in the past years. The images of Venice come from the photo library of the Cini Foundation, and the images of Paris can be collected on Gallica (the digital French Library).
Eventually this project will deal with the issues of perspective, pattern and surface recognition in 2D and 3D, customizing 3D viewers to overlay images, and showcasing the result on a web page.
Type of project: Master
Supervisors: Albane Descombes, Frédéric Kaplan, Alain Dufaux.
A collection of hundreds of magnetic tapes contains the association’s first radio shows, recorded in the early 1990s after its creation. They include interviews and podcasts about Vivapoly, Forum EPFL or Balélec, which have set the pace of student life on campus for many years.
This project aims at studying the existing methods for digitizing magnetic tapes in the first place, then building a browsable database with all the digitized radio shows. The analysis of this audio content will be done using adapted speech recognition models.
This project is done in collaboration with Alain Dufaux, from the Cultural Heritage & Innovation Center.
Type of project : Master/Semester thesis
Supervisors: Paul Guhennec, Fabrice Berger, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: The project consists in the development of a scriptable pipeline for producing procedural architecture using the Houdini 3D procedural environment. The project will start from existing procedural models of Venice in 1808 developed at the DHLAB and automate a pipeline that generates the model from the historical information recorded about each parcel.
Contact: Prof. Frédéric Kaplan
Type of project : Master thesis
Supervisors: Maud Ehrmann, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: Thanks to the digitisation and transcription campaign conducted during the Venice Time Machine, a digital collection of secondary sources offers an arguably complete coverage of the historiography concerning Venice and its population in the 19th century. Through a manual and automatic process, the project will identify a series of hypotheses concerning the evolution of Venice’s functions, morphology and ownership networks. These hypotheses will be translated into a formal language, keeping a direct link with the books and journals where they are expressed, and systematically tested against the model of the city of Venice established through the integration of the models of the cadastral maps. In some cases, the data of the computational model of the city may contradict the hypotheses of the database, which will lead either to a revision of the hypotheses or a revision of the computational models of Venice established so far.
Type of project : Master/Semester thesis
Supervisors: Didier Dupertuis, Paul Guhennec, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: As part of the ScanVan project, the city of Sion has been digitised in 3D. The goal is now to generate images of the city by day and by night and for the different seasons (summer, winter, spring and autumn). A Contrastive Unpaired Translation architecture, like the one used for transforming Venice images, will be used for this project.
Contact: Prof. Frédéric Kaplan
Type of project : Master/Semester thesis
Supervisors: Albane Descombes, Frédéric Kaplan
Semester of project: Spring 2021
Project summary: This project aims at automatically transforming YouTube videos into 3D models using photogrammetry techniques. It extends the work of several Master/Semester projects that have made significant progress in this direction. The goal here is to design a pipeline that permits georeferencing the extracted models and exploring them with a 4D navigation interface developed at the DHLAB.
Contact: Prof. Frédéric Kaplan