1 Overview

Repository location

https://zenodo.org/record/4571507.

Context

The annotations were originally from a character-network analysis paper (Besnier, 2020). The aim was to compare the evolution of character sets in stories with similar backgrounds over time and space.

2 Method

Steps

We first retrieved normalized (Decem Libri Historiarum/DLH, Völsunga saga/VÖL) or transcribed (Nibelungenlied/NIB) texts. We then tokenized the texts with the CLTK package (Johnson et al., 2021) and selected the tokens beginning with a capital letter, since capitalization often marks proper nouns. Forms that also appeared in lowercase were discarded as not being proper nouns. The remaining tokens, as well as sentence-initial tokens, required expert knowledge to classify. Finally, we used translations and indexes (Anonymous, 2011; Gregory of Tours, 2019) to classify these tokens as proper nouns or not.
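The capitalization heuristic can be sketched in a few lines of Python. The authors tokenized with CLTK; here a simple regex tokenizer stands in for it, and the toy sentence is invented for illustration:

```python
import re

def candidate_proper_nouns(tokens):
    """Keep capitalized tokens, then discard any form that also
    occurs in lowercase elsewhere in the text."""
    lowercase_forms = {t for t in tokens if t and t[0].islower()}
    return {t for t in tokens
            if t and t[0].isupper() and t.lower() not in lowercase_forms}

# Toy example: "Rex" is dropped because "rex" also occurs in lowercase;
# "Chlodovechus" and "Francorum" survive as candidates.
tokens = re.findall(r"\w+", "Chlodovechus rex Francorum erat Rex magnus rex")
print(sorted(candidate_proper_nouns(tokens)))
```

Sentence-initial tokens are always capitalized, which is why the remaining candidates still needed expert review.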

Sampling strategy

We fully annotated the three texts.

Quality control

We checked our method against indexes (Gregory of Tours, 2019; Anonymous, 2011), when available. Most mistakes occurred when word forms belonging to the same lemma were assigned different lemmata: spelling variation, for example, led us to treat Abiti and Avitus as coming from different lemmata, whereas the index (Krusch & Levison, 1951) assigns both to Abitus. The peculiarity of Latin imparisyllabic third-declension nouns1 made us doubt that Agila, on the one hand, and Agilane and Agilanem, on the other, belonged to the same lemma. For VÖL and NIB, such mistakes were easier to avoid because of the rather low number of proper nouns in those texts.
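A toy version of this index check can be written as a comparison of token-to-lemma assignments. The dictionaries below are invented stand-ins (the erroneous split of Abiti/Avitus versus a small dict standing in for the Krusch & Levison index), not the actual annotation data:

```python
# Our (erroneous) assignments: the two spellings were split across lemmata.
ours = {"Abiti": "Abiti", "Avitus": "Avitus"}
# The index assigns both forms to the lemma Abitus.
index = {"Abiti": "Abitus", "Avitus": "Abitus"}

def disagreements(ours, index):
    """Return tokens whose lemma differs between our annotation and the index."""
    return {tok: (lemma, index[tok])
            for tok, lemma in ours.items()
            if tok in index and lemma != index[tok]}

print(disagreements(ours, index))
```

Each reported pair is then resolved manually in favor of the index, when one is available.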

3 Dataset Description

Object name

The annotations are lists of lemmata of proper nouns in three texts: the Decem Libri Historiarum by Gregory of Tours, written in Medieval Latin; the Völsunga saga, written in Old Norse (ON); and the Nibelungenlied, written in Middle High German (MHG). For each lemma, we provide a list of associated tokens.

Format names and versions

CSV file. The column names are “text” (the name of the text), “category” (the category of the lemma and its tokens: PERSON, PLACE, or GROUP), “language” (latin, middle-high-german, old-norse), “lemma” (the nominative singular form), and “tokens” (the forms present in the text that belong to the lemma, separated by semicolons). Current version: 1.0.0.
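Reading the file requires only the Python standard library; the “tokens” column then needs to be split on semicolons. The sample rows below are invented to illustrate the format described above, not taken from the actual deposit:

```python
import csv
import io

# Hypothetical rows in the documented column layout.
sample = """text,category,language,lemma,tokens
DLH,PERSON,latin,Avitus,Abiti;Avitus
DLH,PLACE,latin,Roma,Roma;Romam
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Split the semicolon-separated token list into a Python list.
for row in rows:
    row["tokens"] = row["tokens"].split(";")

print(rows[0]["lemma"], rows[0]["tokens"])
```

For the real file, replace the `io.StringIO` wrapper with `open(...)` on the CSV downloaded from Zenodo.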

Statistics on the dataset and its texts.


                            DLH     NIB     VÖL
Number of tokens         123920   83961   26779
Number of PERSON lemmata    812      67     111
Number of PERSON tokens    1787     168     196
Number of PLACE lemmata     349      65      32
Number of PLACE tokens      990      86      37
Number of GROUP lemmata       –       3       5
Number of GROUP tokens        –       4       5

GROUP entities were annotated for NIB and VÖL only (see Section 4).

Creation dates

Start: 2020-06-05. End: 2020-12-29.

Languages

Data: Latin, Middle High German and Old Norse; Metadata: English.

License

CC BY 4.0

Publication date

2021-03-01

4 Reuse Potential

The dataset combines named-entity annotations of texts written in several pre-modern languages. Since digital language resources for these languages are scarce, the annotations are a good starting point for training named-entity recognition (NER) models. Although Classical Latin has received much attention in Natural Language Processing (NLP) compared to other ancient and medieval languages, Medieval Latin requires separate training sets because of the nominal and syntactic changes the language underwent during the Middle Ages, and it has received less attention than its Classical counterpart. These datasets provide the first steps toward formally incorporating Medieval Latin into Latin NLP models.

The datasets for ON and MHG allowed us to train NER neural network models for both ON and MHG and provide the necessary first steps toward wider applicability and reuse across ancient and medieval languages. When studying the datasets, we developed a workflow to automate the training of NER models for highly inflected ancient and medieval languages (Honnibal, Montani, Van Landeghem, & Boyd, 2020). Our Classical Latin model can identify PERSON (e.g. Romanus – a Roman man), PLACE (e.g. Roma – Rome), and GROUP (e.g. Romani – Romans).

Latin presents certain challenges for model-based NER, two of which merit particular attention. First, PERSON entities are complex and often span multiple tokens: a name may possess a praenomen, nomen, and cognomen.2 Second, Latin is highly inflected, with each entity appearing in three to six forms.3 A model must, therefore, be able to identify names that appear in these varying forms.
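To see how quickly the form count grows, consider a toy decliner for one regular paradigm. This handles only regular second-declension masculine nouns and is purely illustrative (the script cited below, Mattingly, 2021, detects a noun's declension class and declines it generally):

```python
# Endings of the regular Latin second declension (masculine).
SECOND_DECL_ENDINGS = {
    "nom_sg": "us", "gen_sg": "i", "dat_sg": "o", "acc_sg": "um",
    "abl_sg": "o", "voc_sg": "e",
    "nom_pl": "i", "gen_pl": "orum", "dat_pl": "is", "acc_pl": "os",
    "abl_pl": "is",
}

def decline_second(lemma):
    """Generate all case forms of a regular second-declension noun."""
    stem = lemma[:-2]  # strip the nominative ending "-us"
    return {case: stem + ending for case, ending in SECOND_DECL_ENDINGS.items()}

forms = decline_second("Romanus")
print(sorted(set(forms.values())))
```

Even this single regular lemma yields eight distinct surface forms that must all be mapped back to one entity.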

In the initial implementation of the ON and MHG datasets, we only identified PERSON and PLACE. We noticed that a potential disadvantage of this dataset was the absence of entities in the GROUP category.4 For Latin, our machine learning models struggled with GROUP entities, such as “Romans”: the models switched between PERSON and PLACE for such entities. By incorporating GROUP into the model, we improved the identification of PERSON. Later, we annotated the GROUP category in ON and MHG.

Unlike our manual annotations for ON and MHG, we automated the creation of a training set for Latin. To do this, we first gathered as many potential praenomen, nomen, and cognomen instances as we could for Latin PERSON entities from Wikipedia.5 Second, we used Orbis Latinus (Graesse, 1909), hosted by Columbia University, to collate a list of places. Third, we manually compiled a list of GROUP instances from Caesar’s Gallic Wars.6 Next, we generated all potential variations of these words in their declined forms via a Python script (Mattingly, 2021) that identified a noun’s declension (class) and declined it accordingly. With all potential forms for each word, we created an EntityRuler in spaCy (Honnibal et al., 2020). We ran the EntityRuler over a single text: Caesar’s Gallic Wars. Finally, we used this auto-generated training set to train a spaCy model.
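The step from declined forms to an EntityRuler can be sketched as follows. The pattern dictionaries use the format spaCy's EntityRuler accepts; since loading them requires spaCy and a pipeline, only the pattern construction is executed here, and the Caesar forms are an illustrative hand-picked list:

```python
def make_patterns(label, forms):
    """Build spaCy EntityRuler patterns, one per surface form."""
    return [{"label": label, "pattern": form} for form in forms]

# Illustrative declined forms of one name (third declension).
person_forms = ["Caesar", "Caesaris", "Caesari", "Caesarem", "Caesare"]
patterns = make_patterns("PERSON", person_forms)
print(patterns[0])

# With spaCy installed, the patterns would then be registered like this:
#   ruler = nlp.add_pipe("entity_ruler")
#   ruler.add_patterns(patterns)
```

Running such a ruler over a text yields entity spans that can serve as silver-standard training data for a statistical model.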

Overall, the datasets for ON and MHG allowed us to identify patterns for automating the creation of NER training sets in highly inflected ancient and medieval languages. This demonstrates the wider applicability of these datasets. Beyond the plan we have for the data we present here, other researchers may use it to train their own models.

With NER models, complex tasks like producing prosopographies may be partly automated. Nevertheless, without correct anaphora resolution, i.e., the assignment of a personal pronoun or noun phrase to its referent, such tasks can only be partially performed.