is a Command Line Interface (CLI) to export METS/ALTO documents to other
formats, such as Dublin Core XML files. It parses the raw data (METS and ALTO)
and extracts the full text and metadata of every single article, section,
Command Line Interface tool aimed at enhancing the quality of original OCR data using Artificial Intelligence. The software takes METS/ALTO packages as input and produces METS/ALTO packages as output.
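Both tools above read METS/ALTO packages. As a rough illustration of what pulling the full text out of an ALTO file can involve, here is a minimal Python sketch. It is not the implementation of either tool: the function names `local_name` and `alto_text_lines` and the sample fragment are hypothetical, and the code only assumes the standard ALTO layout in which `TextLine` elements contain `String` elements carrying a `CONTENT` attribute.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the sketch works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


def alto_text_lines(alto_xml):
    """Return one string per ALTO TextLine, joining its String/@CONTENT values.

    Hypothetical helper for illustration; hyphenation (HYP) handling is omitted.
    """
    root = ET.fromstring(alto_xml)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in elem
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


# Invented sample fragment, trimmed to the elements the sketch relies on.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="TB1">
      <TextLine>
        <String CONTENT="Luxemburger"/><SP/><String CONTENT="Wort"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Erste"/><SP/><String CONTENT="Seite"/>
      </TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

if __name__ == "__main__":
    for line in alto_text_lines(SAMPLE):
        print(line)
```

Matching on the local element name rather than a fixed namespace URI is a deliberate simplification, since ALTO files in a collection may use different schema versions.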
Read the papers
Combining Morphological and Histogram based Text Line Segmentation in the OCR Context
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed by this paper has been designed for this exact purpose. Because of the promising segmentation results that are joined by low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, in the context of the initiative of reprocessing their historic newspaper collection.
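To give a feel for the histogram side of text line segmentation, the following Python sketch splits a binarized image into lines using a plain horizontal projection profile. This is a generic textbook technique, not the algorithm proposed in the paper, and it leaves out the morphological step entirely. The function name `segment_lines`, the `min_ink` threshold and the toy image are assumptions made for illustration.

```python
def segment_lines(image, min_ink=1):
    """Split a binarized image into text lines with a horizontal projection profile.

    image: list of rows, each a list of 0/1 values (1 = ink pixel).
    Returns a list of (start_row, end_row) pairs, end_row exclusive.
    """
    # Count the ink pixels in every row: this is the projection histogram.
    profile = [sum(row) for row in image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # leaving a text line
            start = None
    if start is not None:                  # line runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy 7x3 image with two "text lines" separated by blank rows.
TOY_IMAGE = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]

if __name__ == "__main__":
    print(segment_lines(TOY_IMAGE))
```

A pure projection profile like this tends to fail on skewed or touching lines, which is one reason to combine it with other cues.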
Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction
Iterating with new and improved OCR solutions enforces decision making when it comes to targeting the right candidates for reprocessing. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. They are crucial in order to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement.
Nautilus – An End-To-End METS/ALTO OCR Enhancement Pipeline
Enhancing OCR in a digital library not only demands improved machine learning models, but also requires a coherent reprocessing strategy in order to apply them efficiently in production systems. The newly developed software tool, Nautilus, fulfils these requirements using METS/ALTO as a pivot format. This paper covers the creation of the ground truth, the details of the reprocessing pipeline, its production use on the entirety of the BnL collection, along with the estimated results.