Web Archive

About the Luxembourg Web Archive

The National Library collects and preserves the Luxembourg web as part of our digital heritage

Unique Source of Information

With the growing emphasis on digitalization in all aspects of society, more and more information is published only online, which makes it faster to distribute and easier to access. On the other hand, the lack of physical copies increases the risk of data loss. Relevant information, discussions and data that are not printed or preserved in any other form could be lost, depriving future generations of the sources of knowledge available to us today.

Web Harvesting

As a component of the BnL’s digital legal deposit, web harvesting ensures long-term access to publicly available Internet content.

webarchive.lu

List of websites containing Luxembourgish language content

After the BnL collects the Luxembourg web, an automatic language detection algorithm attempts to determine the language of each downloaded document. This list contains the hosts of websites with a sizeable portion of content in Luxembourgish. It can be used as a starting point to identify web resources that may be useful for NLP pipelines.

Dataset

The Luxembourg web is downloaded in full four times a year, and selected sites are downloaded more often. As a result, the collection contains duplicates of the same documents captured on different dates. A language identification algorithm is applied to the text extracted from each document to determine its natural language. The identification is imprecise, and a text that mixes German and French is sometimes incorrectly labeled as Luxembourgish.

The file contains one JSON document per line. Each document contains:
– the Internet host, which is the part of a URL that follows the “https://” prefix and ends before the first “/” of the path.
– the count of documents from that host in Luxembourgish in the BnL web archive.
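
The file format described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the key names "host" and "count" and the sample hosts are assumptions made for illustration, since the actual field names are not stated here and should be checked against the file. It also shows how a full URL could be reduced to its host so that it can be matched against the list.

```python
import json
from urllib.parse import urlsplit


def parse_host_counts(lines, host_key="host", count_key="count"):
    """Parse JSON Lines text into a {host: document_count} dict.

    The key names are assumptions; adjust them to the real file.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record[host_key]] = int(record[count_key])
    return counts


def host_of(url):
    """Return the host part of a URL, e.g. for matching against the list."""
    return urlsplit(url).hostname


# Hypothetical sample lines standing in for the real file contents.
sample = [
    '{"host": "www.example.lu", "count": 120}',
    '{"host": "blog.example.lu", "count": 7}',
]

counts = parse_host_counts(sample)

# Hosts ordered by number of Luxembourgish documents, largest first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

To read the real file, an open file object can be passed in place of the sample list, since iterating over a file yields one line at a time. Because the collection contains duplicate captures and the language labels are imprecise, the counts are better treated as a rough ranking signal than as exact figures.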