1 Context and motivation
Catalan is a Romance language that originated to the east of the Pyrenees mountain ridge during the Middle Ages. Its affiliation to either the Ibero-Romance or the Gallo-Romance group within the Romance languages has been debated since the emergence of historical Romance linguistics, and is still contended. Catalan possesses a Medieval textual record that, while limited, offers enough data to allow for an in-depth study of the language. To date, the largest publicly available searchable corpus of Old Catalan is the Corpus Informatitzat del Català Antic, the ‘Digitised Corpus of Old Catalan,’ commonly referred to as CICA (Torruella et al., 2009). CICA contains 414 texts dating from the 11th to the 18th century, and it allows for simple and complex token and lemma searches, enabling the study of collocations or the distribution of specific lexical items. Nevertheless, as the texts contained in CICA are not morphosyntactically annotated, this corpus does not lend itself to the diachronic study of Catalan morphosyntax. The lack of an open-access fully parsed corpus or treebank of Old Catalan renders the language inaccessible to the research community for certain types of linguistic and philological studies, such as those focusing on diachronic syntax, information structure and patterns of language change. With the ultimate objective of developing a morphosyntactically annotated corpus of Old Catalan, we present here a Part-of-Speech (POS) tagger developed for Old Catalan and trained on a 13th-century chronicle: Llibre dels Fets (composed between 1229 and 1276). This text has been deemed suitable for training the POS tagger and, in the future, a syntactic annotator, on the basis of three factors:
- It is the first crònica ‘chronicle’: The Llibre dels Fets is the first of the so-called Great Catalan Chronicles, four historiographical texts written in prose from the end of the 13th century and throughout the 14th. While they differ in form, they all have a common theme: they narrate and praise the feats of several Catalan kings who reigned during the 13th and 14th centuries in the Crown of Aragon, a Medieval kingdom that resulted from the dynastic union of the Kingdom of Aragon with the County of Barcelona and its vassal territories. Llibre dels Fets broke with the preceding historiographic tradition modelled on Latin annales (lists of dated historical events) by narrating the feats of King James I in the 1st person, using a vivid style that has been described as ‘spontaneous, colloquial, primitive and careless’ (Bruguera, 1991) due to the abundance of direct reported speech, code-switching into languages other than Catalan, references to the book’s audience and the presence of information of a personal nature, among other traits. Koch & Oesterreicher (2012) propose assessing linguistic variation along the immediacy/distance axis, instead of the written/oral one. Within their framework, the Llibre dels Fets would be a written text with features associated with immediate language, closer to the language of informal spoken interactions linked to informal registers (Biber & Conrad, 2009). Given that the text does not abide by rigid literary conventions which, in Medieval times, often favoured the use of Latinising syntax, and that it exhibits a high degree of oral-like features, it provides us with a unique opportunity to investigate the word order of 13th-century Old Catalan.
- The probable involvement of King James I in the production of the text: The Medieval notions of authorship and author differ significantly from the modern ones, where literary works are attributed to the individual or collective that has produced them and cannot be lawfully altered without explicit permission from the author(s). In Medieval times, literary works were not seen as immutable entities. They might or might not be attributed to someone, but each scribe or reader was free to intervene in the text and add or subtract any material as they saw fit without creating a new text. Therefore, it is not possible to say that James I is the author of Llibre dels Fets in the modern sense. However, given the abundance of oral-like traits (Soldevila, 1971; Bruguera, 1991; Pujol i Campeny, 2021) and intimate information found throughout the text, it is possible that he dictated most of its content, and that the text was later put together as a Chronicle by a scribe. If the text was indeed dictated by the King, this would bring us closer to the modern notion of authorship and we could take the text to represent his speech without its having undergone too much change. Regardless of whether the King actually uttered the words found in the text, we do know that the different manuscripts of Llibre dels Fets that have survived to the present day show discrepancies in spelling and in the choice of lexical items stemming from scribal intervention, while they remain stable in terms of word order.
- A linguistically-aware edition of the text is available: Medieval Catalan texts present editors with challenges due to the lack of standardisation of the orthography and the presence of scribal errors which render certain fragments of a text difficult to understand. Bruguera’s (1991) edition of Llibre dels Fets takes a linguistically-aware perspective, relying on the oldest surviving manuscript (manuscript H, 1343) to produce a regularised (not standardised) version of the text that (i) spells out abbreviations; (ii) regularises upper and lower case letters; (iii) regularises word separation, accentuation and punctuation according to standard Modern Catalan rules (elisions that are not currently rendered graphically in standard Modern Catalan are marked with a punt volat ‘flying dot’, ‘·’); and (iv) marks the thematic chapters established by the editorial tradition of the text, while also keeping the folia annotation. Given the qualities of this edition, it is a good starting point for producing an annotated version of the text: abbreviations have already been spelled out and words are conveniently separated, while valuable spelling particularities which can point towards language change are preserved. Bruguera’s edited text was updated for its inclusion in CICA, with the addition of graphic accents contributing to the disambiguation of homographs. We have worked with this version of Bruguera’s edition.
2 Establishing the morphosyntactic tag set
The investigation of the evolution of syntactic constructions and the diachronic distribution of different forms benefits heavily from the analysis of large corpora that are consistently annotated. Manual annotation of thousands (and ideally, millions) of words is an extremely time-consuming task that is very prone to error. Natural Language Processing (NLP) tools that automate the annotation process and consistently treat large amounts of data in a minimal period of time have been successfully applied to various languages for the production of historical corpora, including Welsh, English and French.
Like most Medieval languages, Old Catalan had not undergone standardisation, and therefore, it displayed a great degree of variation at orthographic level. While the regularisation of spelling is possible, it is a time-consuming task that requires editorial decisions of philological nature in order to establish which form should be used as the ‘standard’ or ‘regularised’ form. In addition, the standardisation of the text can conceal dialectal variation as well as differences in scribal practices, impoverishing the text for the purposes of the diachronic study of a language. In the specific case of Llibre dels Fets, it would have concealed features that have attracted the interest of many researchers, as Colón Domènech’s (2012) volume shows.
One of the key steps in the development of a POS tagger is the definition of the tag-set, since it interacts directly with the effectiveness of the POS tagger (a lower number of tags makes morphosyntactic classification easier, thus yielding better results) and it potentially limits the possible searches that can be carried out. For the Old Catalan POS tagger, the defined tag-set is based on the standard UPenn annotation scheme, in order to render it readily comparable with similar resources in the UPenn historical corpora collections. At the same time, it has been simplified where possible and enriched where needed in order to adapt it to Old Catalan grammar and to provide as much information as possible for the study of word order variation at clausal level. As a result, tags devoted to the nominal domain have been simplified, as Old Catalan does not systematically display comparative and superlative morphology for adjectives, for instance, and different determiner categories have been created in order to reflect the emerging article system that distinguishes proper names from other nouns.
Certain part-of-speech categories have been further specified with different grammatical attributes, such as case, person and number. These are added to the part-of-speech label, separated by the delimiter ^. Therefore, a third person plural accusative pronoun would be tagged as PRO^A^3^PL, with the attributes case, person and number occurring in this order, following the convention developed for the HeliPaD corpus (Walkden, 2016). A comprehensive list of the tags and attributes used in this corpus can be found in the Appendix. Our fully-tagged corpus is deposited and made available open access on Zenodo (https://doi.org/10.5281/zenodo.5615759), as are the word embeddings we created for the neural-based tagger (https://doi.org/10.5281/zenodo.5615556).
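As an illustration of the delimiter convention, the following minimal Python sketch splits a fully specified tag such as PRO^A^3^PL into its category and attributes. The function name and the dictionary keys are hypothetical labels chosen for this example, and the sketch assumes that attributes always appear in the order case, person, number; tags that carry fewer attributes (see the Appendix) would need their own handling.

```python
from typing import Dict

# Attribute order as described in the text: case, person, number.
# The key names are illustrative labels, not part of the published tag set.
ATTRIBUTE_ORDER = ("case", "person", "number")


def parse_pronoun_tag(tag: str, delimiter: str = "^") -> Dict[str, str]:
    """Split a tag such as PRO^A^3^PL into its category and attributes."""
    category, *attributes = tag.split(delimiter)
    parsed = {"category": category}
    # zip stops at the shorter sequence, so a bare tag (e.g. "N")
    # yields only its category.
    parsed.update(zip(ATTRIBUTE_ORDER, attributes))
    return parsed


print(parse_pronoun_tag("PRO^A^3^PL"))
```

Keeping the attributes in a fixed order means that a search for, say, all accusative pronouns only needs to inspect one field rather than pattern-match on the whole label.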
2.1 Verbal tags and challenges within the verbal domain
As is the case in most Romance languages, verbal inflection is expressed through suffixes attached to the verbal root. Inflected verbs are specified as follows: first, the word category tag VB appears, followed by tense (P for present, D for past, F for future) and then by mood (I for indicative, S for subjunctive and C for conditional). The verbal tag is then completed with person and number information (^1/^2/^3 + ^SG/^PL). Therefore, VBPI^1^SG indicates a verb in the present indicative, first person singular. In order to keep tags to a minimum, no aspectual tags were added, as most perfect tenses are expressed through verbal periphrases (auxiliary + non-finite form), with the exception of the synthetic past perfect indicative. There are no tags for passive morphology either, as it was also expressed by means of analytic constructions.
Past perfect participles can agree with an element from their context in gender and number. While they were not tagged for these categories, whether they are inflected and agree with an element from their context is specified with the addition of the tag I (‘inflected’) to the past perfect participle label VN, yielding VNI.
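The composition of finite verb tags described above can be sketched as a small decoder. This is only an illustration under stated assumptions: the regular expression covers the pattern VB + tense + mood + ^person + ^number as given in the text, the function name and the output keys are hypothetical, and any tag that does not fit this pattern (for example the participle label VNI) is simply returned as unrecognised.

```python
import re
from typing import Dict, Optional

# Letter codes as given in the text.
TENSES = {"P": "present", "D": "past", "F": "future"}
MOODS = {"I": "indicative", "S": "subjunctive", "C": "conditional"}
NUMBERS = {"SG": "singular", "PL": "plural"}

# VB + tense + mood, then ^person and ^number.
FINITE_VERB = re.compile(r"^VB([PDF])([ISC])\^([123])\^(SG|PL)$")


def decode_finite_verb_tag(tag: str) -> Optional[Dict[str, str]]:
    """Return the features encoded in a finite verb tag, or None if the
    tag does not follow the finite verb pattern."""
    match = FINITE_VERB.match(tag)
    if match is None:
        return None
    tense, mood, person, number = match.groups()
    return {
        "tense": TENSES[tense],
        "mood": MOODS[mood],
        "person": person,
        "number": NUMBERS[number],
    }


print(decode_finite_verb_tag("VBPI^1^SG"))
print(decode_finite_verb_tag("VNI"))
```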
Within the Catalan verbal domain we find a rich paradigm of pronouns that cliticise onto verbs and pose a real challenge for the tagger. The clitic pronoun paradigm includes reflexive, accusative and dative clitic pronouns that distinguish person and number (and in the case of 3rd person accusative pronouns, also gender: masculine, feminine and neuter). The paradigm also includes two adverbial pronouns.
The paradigm exhibits a high degree of syncretism between reflexive, accusative and dative clitic pronouns in the 1st and 2nd persons. This is illustrated in examples (1–3) using the 1st person reflexive, accusative and dative clitic pronoun em, me, m’:
- ‘Firstly, I tell you that I do not agree with anything that you say’
- Fol. 151r, l. 8
- ‘and I beg you to take me there, (…)’
- Fol. 143v, l. 12
- ‘since they have told it to me (…)’
- Fol. 125r, l. 18
Etymologically, third person accusative clitic pronouns derive from the Latin demonstratives ILLE ILLA ILLUD, which are also the origin of one of the sets of definite articles found in the language (the other being derived from the demonstrative IPSE IPSA IPSUD). This accounts for the homophony between definite articles and 3rd person accusative clitic pronouns. This is the case of the masculine plural article los (see 4) and the 3rd person plural masculine accusative clitic pronoun los (see 5), which are additionally homophonous with the 3rd person plural dative clitic pronoun los (see 6).
- ‘the children and lay people were joyful (…)’
- Fol. 17v, l. 22
- ‘And they showed them to us’
- Fol. 17r, l. 4
- ‘and they killed and destroyed anything belonging to the moors wherever they found them.’
- Fol. 36r, l. 11
- ‘and we told them that (…)’
- Fol. 197v, l. 4
- ‘We even said to them (…)’
- Fol. 62v, l. 22
Old Catalan had two adverbial clitic pronouns: en and hi. The tagging of pronominal en is especially challenging, given its homophony with the atonic preposition en and the title En applied to some masculine names (akin to English ‘Sir’), as illustrated in examples (7–9):
- ‘Leave it to us because in the long run we will keep you from shame and embarrassment’
- Fol. 122r, l. 1
- ‘when they arrived in Catalonia’
- Fol. 5v, l. 19
- ‘And, once Valencia was taken, Sir Ramon Folch of Cardona came, and (…)’
- Fol. 122r, l. 11
So far we have identified syncretism within the clitic pronoun paradigm and homophony between clitic pronouns and other lexical items as major challenges for the tagger within the verbal domain. In addition, we encountered other challenges that required special attention at the manual correction stage, stemming from the high frequency of homophonous (and homographic) forms across tenses and persons within the Old Catalan verbal paradigm. Among others, these include homophony between the 1st and 3rd person singular of the past imperfect indicative (example 10) and homophony between the 3rd person singular of the past perfect of certain verbs and the past participle of the same verb, as is the case of promès (example 11).
- ‘And what I did, (…)’
- Fol. 20r, l. 2
- ‘and, as he entered the tent, we took him by the hair and dragged him out.’
- Fol. 132v, l. 14
- ‘and he promised to accomplish all the aforementioned things.’
- Fol. 20r, l. 2
- ‘abiding by the commitment that he had made to us, (…)’
- Fol. 75v, l. 7
2.2 Nominal tags and challenges within the nominal domain
Within the nominal domain, tags were kept to a minimum. Nouns, adjectives and determiners were labelled for their category (N for nouns, ADJ for adjectives, D for determiners), but not for gender or number.
Proper person names could be preceded by the proper noun articles Don/En for masculine nouns, and Dona for feminine nouns. The category DPR was created to account for them and distinguish them from other determiners.
Some Catalan pronouns and determiners (labelled D) are indefinite quantifiers. Their distribution exhibits particularities when compared to that of other members of the determiner class and, therefore, it was in our interest to mark them differently. Since they do not form a natural part-of-speech category, they have been further specified with the attribute D^Q.
Tonic pronouns are marked for person and number, in contrast with clitic pronouns, which receive further specification (see section 2.1).
One of the challenges that we encountered regarding the tagging of nominal categories is the abundance of non-finite verbal forms that underwent nominalisation and thus have a homophonous nominal counterpart. These mainly include past participles (e.g., vinguda, presa, feyt, anada, estada, partida, meaning ‘arrival, taking, event/fact, coming, stay, leaving’; example 12) and infinitives (e.g., poder, saber, ‘to be able to, to know’; example 13).
- ‘and we [said] that we would increase our fortune and our honour with their coming.’
- Fol. 148r, l. 18
- ‘And when she arrived, (…)’
- Fol. 20r, l. 19
- ‘they tried to deceive all the others with their knowledge.’
- Fol. 56r, l. 23
- ‘And we wanted to know about the others, if they agreed with that piece of advice, (…)’
- Fol. 104v, l. 21
Finally, proper names presented a challenge, as most of them constitute unseen words and present unpredictable morphology. However, after the third round of training, the accuracy with which the tagger identified them increased significantly, most likely because the tagger is able to recognise capital letters.
3 POS tagging historical low-resource languages from scratch
For extremely low-resource languages with complex morphology and a large amount of short, homophonous forms like Old Catalan, even basic NLP tasks like part-of-speech tagging can prove challenging at first. Apart from the digitised edition of the text, no further resources were available to us for Old Catalan at this point.
In order to train the tagger, we therefore first manually annotated a text sample containing 4,500 words. Manual annotation was carried out with the Pyrrha annotation tool (Clérice et al., 2021). Pyrrha allows for manual annotation of lemmas, morphological features and POS tags. Legitimate POS tags can be entered as so-called ‘control lists’ ahead of tagging. These options are then made available by means of a dropdown list while annotating, thus avoiding mistakes that could easily occur when typing every POS label separately. Pyrrha can furthermore automatically extrapolate from already tagged tokens: identical and/or similar tokens in the rest of the corpus can be presented in a convenient list of instances in context, and their POS tags can then be adjusted in bulk, allowing annotators to decide whether the same tag is to be applied to all similar tokens, to only some, or to none. While this feature significantly speeds up the tagging process, manual annotation remains a time-consuming endeavour. We therefore decided to limit this initial round of manual annotation to 4,500 tokens. As we were still getting used to our newly designed tag set, this initial task took 32 working hours (140.6 words/hour). In order to create a sizeable gold standard, we then proceeded in a semi-supervised manner, incrementally building a large training set using memory-based and neural taggers.
3.1 Incrementally building up our training set
The manually annotated 4,500 tokens allowed for the training of the memory-based POS tagger (MBT, based on TiMBL). After each round of manual annotation or correction, we used the MBT to generate a new tagger based on the growing set of training data. We then used this new tagger to tag the rest of the corpus, as we predicted that the increasing accuracy rates would make subsequent correction less time-consuming.
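The shape of this retrain-and-correct loop can be sketched as follows. The sketch is not the MBT: a most-frequent-tag lookup stands in for the tagger so that the example stays self-contained, the `correct` callback stands in for the human correction pass in Pyrrha, and all function names and the fallback tag are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word form, POS tag)


def train_baseline(gold: List[Token]) -> Dict[str, str]:
    """Stand-in for the real tagger: remember each form's most frequent tag."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for word, tag in gold:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def pretag(model: Dict[str, str], words: List[str], fallback: str = "N") -> List[Token]:
    """Propose a tag for every word; unseen forms receive the fallback tag."""
    return [(word, model.get(word, fallback)) for word in words]


def incremental_rounds(
    gold: List[Token],
    batches: List[List[str]],
    correct: Callable[[List[Token]], List[Token]],
) -> List[Token]:
    """Retrain on the growing gold standard before each batch is corrected."""
    for batch in batches:
        model = train_baseline(gold)   # retrain on everything corrected so far
        proposed = pretag(model, batch)  # automatic pre-annotation
        gold = gold + correct(proposed)  # manual correction (simulated here)
    return gold


# Toy demonstration with an oracle playing the role of the annotator.
truth = {"los": "D", "vinguda": "N", "poder": "N"}
oracle = lambda proposed: [(word, truth[word]) for word, _ in proposed]
final_gold = incremental_rounds(
    [("los", "D"), ("vinguda", "N")], [["los", "poder"], ["poder", "los"]], oracle
)
print(len(final_gold))
```

The point of the design is that each correction pass starts from the output of a model trained on all previously corrected data, so fewer proposals need changing as the gold standard grows.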
The second round of manual correction resulted in a new training set of 10,000 tokens that were corrected over 32 working hours (312.5 words/hour). In turn, a further 10,000 tokens were manually corrected over 18 hours, at a rate of 556 words/hour, reaching a total of 20,000 corrected tokens. With the third round of manual correction, we reached 40,000 corrected tokens over 23h30min, at 851 words/hour. In the final round, a further 20,000 tokens were manually corrected, at a rate of 1,666.7 words/hour. Table 1 shows how much time was invested in manual annotation and correction in this semi-supervised, incremental build-up of the training data.
|TRAINING ROUND|HOURS INVESTED|WORDS/HOUR|WORDS/HOUR INCREASE|
From Table 1 it is clear that even though this POS tagger was initially trained with a very small sample of only 4,500 tokens, our semi-supervised method renders the process of tagging almost three times faster than manual annotation of larger data sets from the outset. The large increase in correction rates is partly due to the increased experience of the annotator (including familiarity with the data and tag set), but also to the efficiency of working with Pyrrha, which allows for automatic extension of POS labels to similar tokens. Most importantly, however, the global accuracy of the memory-based tagger increased significantly with each round of training; we present the parameters and further technical details along with the results in the following section.
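The correction rates quoted in the text follow directly from dividing the tokens handled in an iteration by the hours invested. The short sketch below reproduces that arithmetic from the figures given in the prose; the iteration labels are ours, and the hours for the final iteration are not stated in the text, so they are only derived here from the reported rate.

```python
# (label, tokens handled in this iteration, hours invested), from the prose.
# Labels are illustrative; the final iteration is left out of this list
# because its hours are not reported.
ITERATIONS = [
    ("iteration 1", 4_500, 32.0),
    ("iteration 2", 10_000, 32.0),
    ("iteration 3", 10_000, 18.0),
    ("iteration 4", 20_000, 23.5),
]


def words_per_hour(tokens: int, hours: float) -> float:
    """Annotation or correction rate for one iteration."""
    return tokens / hours


rates = [words_per_hour(tokens, hours) for _, tokens, hours in ITERATIONS]

# Hours implied by the reported final rate (20,000 tokens at 1,666.7 words/hour).
implied_final_hours = 20_000 / 1666.7

for (label, _, _), rate in zip(ITERATIONS, rates):
    print(f"{label}: {rate:.1f} words/hour")
print(f"final iteration: roughly {implied_final_hours:.1f} hours implied")
```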
3.2 Memory-based vs neural tagging
We initially chose TiMBL’s memory-based tagger (MBT) because it has yielded very good results for historical, low-resource languages in the past (see, for example, the results for the Middle Welsh corpus in Meelen (2016) or the Tibetan historical corpora in Meelen et al. (2021)). In recent years, however, POS taggers based on neural networks have yielded very good results. One particularly good off-the-shelf model is TARGER, a BiLSTM-CNN-CRF tagger by Chernodub et al. (2019). Neural taggers generally work well on large data sets. As our initial set of training data was extremely small, we skipped the first round and only started testing once we had a gold standard of 10,000 tokens. Word embeddings, which are essential for TARGER to perform well, were created with FastText (www.fasttext.cc). The total amount of Old Catalan material available to us at present, just over 156k tokens, is much too small to create very good embeddings, but it could at least facilitate neural tagging. We evaluated the results for both the 10k and 60k gold standards, each divided into 80/10/10 splits for training, development and test sets, with the following key parameter settings:
A global accuracy of 91.4% is not bad considering the large tag set consisting of 114 distinct POS labels. However, results could most certainly be improved once more Old Catalan data becomes available so that better word embeddings can be created. In addition, higher accuracies could be achieved through feature engineering and by switching from a recurrent to a convolutional neural network or by adding further (Bi)LSTM and Conditional Random Field layers. We leave this for future research.
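The 80/10/10 division of a gold standard into training, development and test sets mentioned above can be sketched as follows. This is a generic illustration rather than the procedure actually used: whether the data were shuffled, and whether the split was made over tokens or sentences, is not stated, so the shuffling, the seed and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_80_10_10(items: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle reproducibly, then cut into train/dev/test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: random assignment
    n_train = len(shuffled) * 8 // 10      # integer arithmetic avoids float rounding
    n_dev = len(shuffled) // 10
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # remainder goes to the test set
    return train, dev, test


train, dev, test = split_80_10_10(range(100))
print(len(train), len(dev), len(test))
```

In practice the items would normally be whole sentences, so that no sentence is divided between the training and evaluation portions.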
Unlike TARGER, the memory-based tagger (MBT) does not need large data sets or word embeddings to get decent results on challenging historical data like our Old Catalan corpus. The MBT allows for different parameter settings according to features of the words themselves or the context in which they appear. We started by testing the default settings, but then adjusted the parameters for known and unknown words so that morphological suffixes in particular could feed better into the morphosyntactic classifier. We specifically focused on context, selecting the maximal windows for tags preceding (d) and following (a) the focus word (f). In addition, for unknown tokens, we made the tagger focus on the last three characters (s) in order to make optimal use of Old Catalan morphological suffixes, which usually contain up to three letters. The optimal settings tested so far for Old Catalan are therefore as follows (see the MBT manual, Daelemans et al. (2010), for further details):
-p dddfaaa -P sssdddFchnaaa
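To make the two feature patterns easier to read, the following sketch counts the feature letters in each pattern and shows what the three suffix slots would contain for a given word. The glosses for d, a, f and s follow the description in the text; the glosses for F, c, h and n reflect our reading of the MBT manual and are labelled as assumptions. The function names and the padding character are hypothetical and are not part of MBT.

```python
from collections import Counter
from typing import Dict, List

FEATURES = {
    "d": "tag of a preceding word (left context)",
    "a": "tag of a following word (right context)",
    "f": "focus word (known words)",
    "s": "one character from the end of the word",
    "F": "focus position for unknown words (assumed gloss)",
    "c": "capitalisation (assumed gloss)",
    "h": "hyphen (assumed gloss)",
    "n": "numeral (assumed gloss)",
}


def describe_pattern(pattern: str) -> Dict[str, int]:
    """Count how often each feature letter occurs in a pattern string."""
    unexpected = set(pattern) - set(FEATURES)
    if unexpected:
        raise ValueError(f"unrecognised feature letters: {sorted(unexpected)}")
    return dict(Counter(pattern))


def suffix_characters(word: str, n: int = 3, pad: str = "_") -> List[str]:
    """Last n characters of a word, left-padded when the word is shorter."""
    padded = pad * n + word
    return list(padded[-n:])


print(describe_pattern("dddfaaa"))
print(describe_pattern("sssdddFchnaaa"))
print(suffix_characters("vinguda"))
```

Read this way, the known-word pattern uses three tags of context on either side of the focus word, while the unknown-word pattern adds three suffix characters and the surface cues.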
The accuracy of the memory-based POS tagger has also improved with each training round, reaching levels of accuracy akin to those of a human annotator for seen words and close to 96% globally:
|TOKENS|GLOBAL ACCURACY|KNOWN WORDS|UNKNOWN WORDS|
Through a 10-fold cross-validation, we calculated the Precision (percentage of system-provided tags that were correct), Recall (percentage of tags in the input that were correctly identified by the system) and F-score or Global Accuracy (weighted harmonic mean of recall and precision). For the individual categories, Precision and Recall give more insight into the degree to which the model over- or under-generalises certain tags. When analysing the results in more detail, we see that for both known and unknown words, the frequency of the specific POS tag plays an important role in overall accuracy. For both known and unknown words, there are some tags that occur very infrequently (e.g. 1 or 2 times). These are mostly pronominal clitics, whose tags include specification for morphological case, number and person (see the lower part of Tables 5 and 4). Precision, Recall and F-scores for those are generally low, not only because they are infrequent, but also because there are a number of homophonous tokens with these POS labels, as we have shown in examples 1–9 above.
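The per-tag measures defined above can be computed from two aligned tag sequences as in the sketch below. It is an illustration only: the function name is hypothetical, the toy data are invented, and the F-score is computed as the balanced harmonic mean (F1), which is one common choice of weighting and an assumption here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_tag_scores(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for every tag in either sequence."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must be aligned")
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_counts = Counter(gold)
    predicted_counts = Counter(predicted)
    scores = {}
    for tag in set(gold_counts) | set(predicted_counts):
        # Precision: share of the system's uses of this tag that were right.
        precision = correct[tag] / predicted_counts[tag] if predicted_counts[tag] else 0.0
        # Recall: share of the gold occurrences of this tag that were found.
        recall = correct[tag] / gold_counts[tag] if gold_counts[tag] else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Toy data: the clitic pronoun is mistaken for the homophonous determiner.
toy_gold = ["D", "N", "PRO^A^3^PL", "N"]
toy_predicted = ["D", "N", "D", "N"]
toy_scores = per_tag_scores(toy_gold, toy_predicted)
for tag in sorted(toy_scores):
    print(tag, toy_scores[tag])
```

In the toy data the determiner tag is over-generalised (perfect recall, reduced precision), while the rare clitic tag is never predicted at all, which mirrors the pattern described for infrequent, homophonous categories.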
High frequency, on the other hand, unsurprisingly leads to higher accuracies. The parameters of the MBT were set to pay attention to initial capital letters, which clearly accounts for the high F-scores for proper nouns (NPR). Even when names have not yet been seen, the tagger can accurately predict the tag because of the initial capital letter. Other frequent POS tags for unknown tokens are regular nouns (N) and third-person past-tense verb forms (VBDI^3^SG/PL). Recall results are slightly higher than Precision here, but overall F-scores vary between 64% and 77%. Again, unsurprisingly, the most frequent tags for known tokens score extremely high, with only nouns not reaching 100% accuracy in the top 5 presented in Table 5 (for a full overview of the MBT results, see Appendix).
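The distinction between known and unknown tokens used throughout this section can be made concrete with a short sketch that reports accuracy separately for forms seen and not seen in the training vocabulary. The function name, the data layout and the toy tokens are hypothetical; this is not the evaluation code behind the reported figures.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Scored = Tuple[str, str, str]  # (word form, gold tag, predicted tag)


def accuracy_by_familiarity(train_vocab: Set[str], tokens: Iterable[Scored]) -> Dict[str, Optional[float]]:
    """Accuracy over all tokens, over forms seen in training, and over unseen forms."""
    hits = {"global": 0, "known": 0, "unknown": 0}
    totals = {"global": 0, "known": 0, "unknown": 0}
    for word, gold_tag, predicted_tag in tokens:
        group = "known" if word in train_vocab else "unknown"
        for key in ("global", group):
            totals[key] += 1
            hits[key] += int(gold_tag == predicted_tag)
    # None signals that a group had no tokens, rather than dividing by zero.
    return {key: (hits[key] / totals[key] if totals[key] else None) for key in totals}


toy_result = accuracy_by_familiarity(
    {"los"},
    [
        ("los", "D", "D"),
        ("los", "PRO^A^3^PL", "D"),
        ("Ramon", "NPR", "NPR"),
        ("vinguda", "N", "N"),
    ],
)
print(toy_result)
```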
Generally, with a global accuracy of 95.8%, the MBT trained on 60k manually corrected tokens performs very well. The >1% increase in F-score between the 40k and 60k training sets furthermore suggests that it is still possible to get higher accuracies as we keep extending our training data. The high accuracy rates, in combination with Pyrrha’s efficient manual correction and tag extension function, mean that the time annotators need to spend creating gold standards of the POS-tagged corpus is significantly reduced.
This paper describes our pipeline to create the first morphosyntactically annotated corpus of Old Catalan. We first presented a tag set and annotation manual for Old Catalan, based on that used for the standard UPenn corpora, with specific extensions especially in the domain of case, person, and number features. This resulted in a large number of POS tags (>110 in total), which presents a real challenge for any automatic morphosyntactic classifier. Without those additional features, however, future research opportunities for scholars in Catalan studies, cross-linguistic syntax and beyond would be extremely limited. In addition, these extra agreement features will facilitate future conversion of this corpus, which, like the other UPenn historical corpora, will be constituency-based, to CoNLL-U dependency formats as well.
Building on previous work on other historical, low-resource languages like Middle Welsh and Classical Tibetan, we presented the results of our semi-supervised method consisting of five iterations of manual tagging and correction, incrementally building up our training set to >60k tokens. After each iteration, the memory-based tagger (MBT) was trained and the entire corpus was retagged based on the newly trained version, thus speeding up manual correction of subsequent batches. In addition to generating a memory-based tagger for Old Catalan, we also created word embeddings with FastText in order to test TARGER, a BiLSTM-CNN-CRF tagger. Because of the small dataset, global accuracies were still lower for TARGER. Once more digitised Old Catalan data is made available, this tagger will most likely yield much higher accuracies as well.
For small and highly complex data sets like our Old Catalan corpus, semi-supervised, incremental annotation methods like these can thus yield highly accurate morphosyntactic taggers with minimum effort. Depending on the complexity of the tag set, multiple iterations of manual correction might be necessary to get started, but with each iteration, results clearly improve and the time invested on subsequent correction sessions is significantly reduced. This semi-supervised method of memory-based tagging, combined with manual correction in Pyrrha, is thus highly efficient to create reliable training data for historical, low-resource and morphologically rich languages.
The data and code for this paper can be found on our GitHub repository: https://github.com/lothelanor/catalancorpora.