Datasets and Models for Historical Newspaper Article Segmentation ‒ DHLAB ‐ EPFL

This release consists of:

A dataset of newspaper articles, manually segmented and classified into 4 classes (serials, stock exchange tables, weather forecasts and death notices), from 4 historical newspapers.
Models trained with dhSegment-text to recognize these elements in historical newspapers.

For more information please check:

Github repository: https://github.com/dhlab-epfl/dhSegment-text
Zenodo record: https://doi.org/10.5281/zenodo.3706863

License: these items are under different licenses; please check the details in the Zenodo record.

Contact: Maud Ehrmann

Project: impresso – Media Monitoring of the Past

Related publications:

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

R. Barman; M. Ehrmann; S. Clematide; S. Ares Oliveira; F. Kaplan

The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. Although the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of more fine-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. We introduce a multimodal neural model for the semantic segmentation of historical newspapers that directly combines visual features at pixel level with text embedding maps derived from, potentially noisy, OCR output. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to the wide variety of our material.

Journal of Data Mining & Digital Humanities. 2021. Vol. 2021, num. Special Issue on HistoInformatics: Computational Approaches to History, p. 1-26. DOI: 10.5281/zenodo.4065271.
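
To make the idea in the abstract above more concrete, here is a minimal sketch of what a "text embedding map" could look like: each OCR token's embedding vector is painted over the pixels of its bounding box, and the result is concatenated with the visual channels pixel by pixel. This is an illustration only, not the dhSegment-text implementation. The names `text_embedding_map` and `concat_channels`, the token format, and the choice to fuse at the input level are all assumptions made for the example; this page does not say where in the network the fusion takes place.

```python
from typing import List, Sequence, Tuple

# Hypothetical OCR token: an (x0, y0, x1, y1) pixel box plus its embedding.
# The box is half-open: x1 and y1 are exclusive.
Token = Tuple[Tuple[int, int, int, int], Sequence[float]]
Grid = List[List[List[float]]]  # indexed as grid[y][x][channel]


def text_embedding_map(height: int, width: int, dim: int,
                       tokens: Sequence[Token]) -> Grid:
    """Paint each token's embedding over the pixels of its bounding box.

    Pixels not covered by any token keep an all-zero vector.
    """
    grid = [[[0.0] * dim for _ in range(width)] for _ in range(height)]
    for (x0, y0, x1, y1), vec in tokens:
        if len(vec) != dim:
            raise ValueError("embedding size does not match dim")
        # Clip the box to the image so noisy OCR coordinates cannot overflow.
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                grid[y][x] = list(vec)
    return grid


def concat_channels(visual: Grid, text_map: Grid) -> Grid:
    """Concatenate visual and textual features along the channel axis."""
    return [[list(v) + list(t) for v, t in zip(v_row, t_row)]
            for v_row, t_row in zip(visual, text_map)]


# Toy example: a 3x4 "image" with 3 visual channels and one OCR token
# carrying a 2-dimensional embedding.
visual = [[[0.5, 0.5, 0.5] for _ in range(4)] for _ in range(3)]
tokens = [((1, 0, 3, 2), [1.0, -1.0])]
text_map = text_embedding_map(3, 4, 2, tokens)
fused = concat_channels(visual, text_map)
```

In this toy example every pixel of `fused` has 5 channels: 3 visual ones followed by 2 textual ones, with the textual channels left at zero wherever no OCR token was found.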

Datasets and Models for Historical Newspaper Article Segmentation

R. Barman; M. Ehrmann; S. Clematide; S. Ares Oliveira

Dataset and models used and produced in the work described in the paper “Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers”: https://infoscience.epfl.ch/record/282863?ln=en

2021.
