Cross-lingual transfer learning for low-resource languages

Unsupervised representation learning, from pre-trained word embeddings to pre-trained contextual representations and transformer-based language models, has had a significant impact on natural language processing (NLP) and understanding. To achieve their impressive performance on downstream tasks, however, such models still rely on large-scale annotated datasets. While such datasets exist for a few high-resource languages, such as English, Spanish, and German, most languages lack the labeled data required to train deep neural networks.
Collecting annotated data for every language is too expensive and time-consuming to be practical. Cross-lingual transfer learning instead trains a model on the annotated data of a high-resource language and then transfers it to low-resource languages. A large body of work leverages one or more related high-resource languages to train models for low-resource languages, achieving significantly better performance across various NLP tasks than training on the scarce target-language data alone.
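
To make the paradigm concrete, below is a minimal sketch of zero-shot cross-lingual transfer, assuming a shared multilingual encoder (here "xlm-roberta-base") and the Hugging Face transformers library; the toy sentences, binary label set, and hyperparameters are illustrative placeholders and not part of this project, whose actual method may differ. The model is fine-tuned on a few English labeled examples and then applied, without any target-language labels, to a low-resource target language.

```python
# Minimal sketch of zero-shot cross-lingual transfer (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "xlm-roberta-base"  # shared multilingual encoder (assumed choice)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy English training data, standing in for a large annotated source-language corpus.
english_texts = ["The movie was wonderful.", "The plot made no sense."]
english_labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (placeholder labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few gradient steps on the same toy batch
    batch = tokenizer(english_texts, padding=True, truncation=True, return_tensors="pt")
    loss = model(**batch, labels=english_labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Zero-shot evaluation on a low-resource target language (Swahili example sentence).
model.eval()
target_text = ["Filamu hii ilikuwa nzuri sana."]  # "This movie was very good."
with torch.no_grad():
    batch = tokenizer(target_text, padding=True, truncation=True, return_tensors="pt")
    prediction = model(**batch).logits.argmax(dim=-1)
print(prediction)  # predicted class index for the target-language sentence
```

Because the encoder's subword vocabulary and representation space are shared across languages, a classifier head trained only on the high-resource language can often be applied to other languages with no target-language supervision at all.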

In this project, the candidate will explore a specific cross-lingual knowledge transfer method, focusing on challenging low-resource languages. The task is to implement the selected method and conduct a series of experiments analyzing different aspects of the problem.

Prerequisites
The candidate should have programming experience (ideally in Python) and familiarity with machine learning and natural language processing.

References
Artetxe, Mikel, Sebastian Ruder, and Dani Yogatama. “On the cross-lingual transferability of monolingual representations.” arXiv preprint arXiv:1910.11856 (2019).

Lample, Guillaume, and Alexis Conneau. “Cross-lingual language model pretraining.” arXiv preprint arXiv:1901.07291 (2019).

Contact: negar.foroutan@epfl.ch, mohammadreza.banaei@epfl.ch