Tools

Open source software, tools & libraries

Our data has been produced using state-of-the-art technologies and standards. Following international standards lets us reach a wider audience, maximize the reusability of the data and support long-term preservation goals.

However, cultural heritage data can be daunting at first for some users. For this reason, the BnL has started to open source software that was developed internally to work with this data.

Check out the BnL on GitHub

BnLMetsExporter

This tool is a Command Line Interface (CLI) to export METS/ALTO documents to other formats, such as Dublin Core XML files. It parses the raw data (METS and ALTO) and extracts the full text and metadata of every single article, section and advertisement.

Export Format

The default export is XML-based and follows the Dublin Core format. The Dublin Core data is wrapped in an OAI-PMH envelope. Every XML file corresponds to one article. The following sections cover the most important tags.
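As a rough illustration of reading one exported file, the following Python sketch parses a record with the standard library. The namespace URIs are the standard ones for OAI-PMH, Dublin Core and DCMI Terms. The sample record, its nesting, its values and the helper name `read_record` are assumptions made for this example, not the exporter's documented output.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for OAI-PMH, Dublin Core and DCMI Terms.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical, heavily shortened record. Real exports may nest the
# elements differently and contain more fields.
SAMPLE = """<record xmlns="http://www.openarchives.org/OAI/2.0/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:dcterms="http://purl.org/dc/terms/">
  <header>
    <identifier>generated-id-0001</identifier>
    <datestamp>2020-01-01T00:00:00Z</datestamp>
  </header>
  <metadata>
    <dc:source>newspaper/luxwort/1848-12-15</dc:source>
    <dcterms:isPartOf>Luxemburger Wort</dcterms:isPartOf>
    <dc:date>1848-12-15</dc:date>
    <dc:title>Example title</dc:title>
    <dc:description>Example title Some full text.</dc:description>
    <dc:type>ARTICLE</dc:type>
  </metadata>
</record>"""


def read_record(xml_text):
    """Return selected Dublin Core fields of one exported article as a dict."""
    root = ET.fromstring(xml_text)

    def first(path):
        # Search the whole tree so the sketch does not depend on exact nesting.
        element = root.find(f".//{path}", NS)
        return element.text if element is not None else None

    return {
        "source": first("dc:source"),
        "is_part_of": first("dcterms:isPartOf"),
        "date": first("dc:date"),
        "title": first("dc:title"),
        "description": first("dc:description"),
        "type": first("dc:type"),
        "language": first("dc:language"),  # None when the tag is absent
    }


record = read_record(SAMPLE)
print(record["type"], record["date"], record["title"])
```

Because every file holds a single article, a function like this can be applied to each file in an export directory to build a table of articles.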

<header>

This element contains a generated unique identifier (<identifier>) as well as a datestamp (<datestamp>) recording when the data was exported. It is not recommended to work with this identifier. Instead, use the value in <dc:identifier>.

<dc:identifier>

This is a unique and persistent identifier using ARK. The BnL is in the process of transitioning to ARK, which is why PID-based identifiers are still provided in other fields.

<dc:source>

Describes the source of the document. For example
<dc:source>newspaper/luxwort/1848-12-15</dc:source>
means that this article comes from the newspaper “luxwort” (ID for Luxemburger Wort) issued on 15.12.1848.
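The value can be split into its parts with a few lines of Python. This is a minimal sketch that assumes every value has exactly three slash-separated parts (document type, publication ID, ISO date), as in the example above. Other shapes may exist in real exports, and the names `Source` and `parse_source` are hypothetical.

```python
from datetime import date
from typing import NamedTuple


class Source(NamedTuple):
    doc_type: str        # e.g. "newspaper"
    publication_id: str  # e.g. "luxwort"
    issued: date         # issue date


def parse_source(value):
    """Split a <dc:source> value such as 'newspaper/luxwort/1848-12-15'."""
    # Raises ValueError if the value does not have exactly three parts
    # or if the last part is not an ISO date (YYYY-MM-DD).
    doc_type, publication_id, issued = value.strip().split("/")
    return Source(doc_type, publication_id, date.fromisoformat(issued))


source = parse_source("newspaper/luxwort/1848-12-15")
print(source)
```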

<dcterms:isPartOf>

The complete title of the source document e.g. “Luxemburger Wort”.

<dcterms:isReferencedBy>

Another generated string that uniquely identifies the exported resource.

<dc:date>

The publishing date of the document e.g. “1848-12-15”.

<dc:publisher>

The publisher of the document e.g. “Verl. der St-Paulus-Druckerei”.

<dc:relation>

The unique identifier of the parent document (e.g. newspaper issue), also referred to as PID.

<dcterms:hasVersion>

The link to the BnLViewer on eluxemburgensia.lu to view the resource online.

<dc:title>

The main title of the article, section, advertisement, etc.

<dc:description>

The full text of the entire article, section, advertisement etc. It includes any titles and subtitles as well. The content does not contain layout information, such as headings, paragraphs or lines.

<dc:type>

The type of the exported data e.g. ARTICLE, SECTION, ADVERTISEMENT, …

<dc:language>

The detected language of the text.

<dcterms:extent>

The number of words in the <dc:description> field.
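A quick sanity check is to compare this value with a word count of <dc:description>. The sketch below counts whitespace-separated tokens. How the exporter itself tokenizes the text is not stated here, so this counting rule is an assumption and small differences are possible. The function names are hypothetical.

```python
def count_words(text):
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


def extent_matches(description, extent):
    """Return True if the <dcterms:extent> value equals the counted words."""
    return count_words(description) == int(extent)


words = count_words("Luxemburger Wort  vom 15. Dezember")
print(words)
```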