1 Context and motivation

Institutional and academic contexts Handwritten text recognition (HTR) and its upstream task layout segmentation (LS) have become two important topics in the context of Digital Humanities and digital approaches to cultural heritage collections in the GLAM domain. Its growth over the past three to four years in digital projects can easily be linked to the emergence of user interfaces (UI) allowing for the annotation of ground truths (GTs, data that will be used for training), training new models (for the transcription and, lately, for the segmentation) and for the automatic transcription of the users’ own data. At first, only Transkribus () provided such a service through the READ project without fees or infrastructure requirements. At the end of the 2020 European Union funding, Transkribus became a paid service accelerating the interest growth of at least one alternative, namely eScriptorium () at the EPHE-Scripta-PSL. Unlike the former, the latter is completely open source, at the cost of not offering a centralized server.

In this context, the Consortium pour la Reconnaissance d’Écritures Manuscrites des Matériaux Anciens (CREMMA) project was created to fund a regional server. Its aims are to support students’ training and to provide local researchers with a free solution. The CREMMA funding consisted of a grant for the initial cost of the infrastructure as well as an evaluation grant for providing base models for the community of CREMMA’s users. The latter was divided into two main languages: French and Latin, from the 9th to the 21st century. A postdoctoral position, CREMMAlab provided the infrastructure with complementary time for building a dataset (CREMMA Medieval) and expertise around transcribing medieval manuscripts.

As the CREMMA project was being drafted, Chagué and Clérice () provided a solution for facilitating the FAIR principles in an HTR context and providing machine-actionable metadata for datasets. HTR-United, both a catalog of open source HTR ground truths and a toolkit to strengthen the control of documentation and validity of HTR data, records, as of late October 2022, 56 datasets composed of 41.5 million characters, 725,862 lines in over 13 languages and 6 scripts. HTR-United’s catalog provides a useful overview to build new datasets which can complement previous ones.

HTR for Latin and Old French Handwriting in the Middle Ages can, in a simplistic way, be divided into two big writing systems: cursive and calligraphy (). They reflect two complementary practices, namely cursive hands (écritures d’usage), which are more common to everyday and administrative documents such as accounting books and letters, and book hands. While cursive represents a harder challenge due to the variability of handwriting styles, both families have the potential to be highly abbreviated, depending on the expected audience of the document: literary classics, such as Cicero or Vergilius, might be less abbreviated than pharmaceutical recipes, scholastic works, or accounting books. This situation resulted in mainly two different kinds of strategies for creating HTR ground truth datasets: (1) datasets that would resolve abbreviations directly in the transcription (a practice found mostly used by historians, and quite common for cursive, specifically in France) and (2) datasets that would keep a diplomatic approach to transcription.

Our dataset builds on the experience of Ariane Pinche, specifically her work on the CREMMA Medieval dataset, which treats different variations of Old French from the 13th to the 15th century, with a heavy focus on the first section of the period. As the first recipient of the CREMMALab post-doctoral funding, Pinche co-organized a research seminar around the formalization of transcription guidelines for graphemic transcription of Old French (). Based on her recommendations, a few datasets emerged around the École nationale des chartes and the CREMMA project. Notably, the Gallic(orpor)a corpora () and the course project DecameronFR () provided two additions for Old French and Middle French data, centered around the end of the middle ages. On the opposite, the Caroline Minuscule project () was realigned in ALTO XML and adapted to the guidelines as it provided some foundations for recognizing the Caroline script specific to the first centuries of the early middle age. Vlachou-Efstathiou (, ) provided a complementary dataset based on the transcriptions of two Latin manuscripts from the 9th century. When the work for CREMMA Medii Aevi began, we identified a lack of data for the second half of the middle age (1100–1500, see Table 1).

Table 1

Datasets following the Pinche Guidelines or adapted through Choco-Mufin. Characters’ counts are rounded to the closest thousands.


AUTHORSDATASETPROJECTCHARACTERSPERIODLANGUAGE

White, Karaisl, and Clérice ()Caroline MinusculeRescribe17,000800–1200Latin

Vlachou-Efstathiou ()Eutyches87,000850–900Latin

Pinche ()CREMMA MedievalCREMMAlab593,0001100–1499French

CREMMA Medii AeviCREMMA263,0001100–1600Latin

Biay et al. ()DecameronFR20,0001430–1455French

Gabay et al. ()Manuscrits du 15e siècleGalliCorpora169,0001400–1500French

Total1,149,000

2 Dataset description

Object name: CREMMA-Medieval-LAT-0.1.1.zip

Format names and versions: XML (ALTO), JPEG

Creation dates 2022-01-01 / 2022-09-22

Dataset creators: Thibault Clérice (Organization, Curation, Transcription, Design), Malamatenia Vlachou- Efstathiou (Curation, Transcription, Design), Alix Chagué (Organization)

Language: Latin

License: CC0

Repository name: Zenodo (http://dx.doi.org/10.5281/zenodo.7013436)

Publication date: 2022-10-20

3 Method

3.1 General aspects of the corpus

Corpus construction theory Borrowing the terminology from the linguistic domain () where data construction methods have long been examined, evaluated, and reconsidered, we shall examine the following methodological aspects. Contrary to the notion of “sampling” which is, by definition, a random selection procedure, “corpus construction” implies a systematic selection of materials that obey a specific rationale, where its efficiency depends on the research question. “Representative sampling” is where these two approaches converge. Sampling secures efficiency in research by providing a rationale for studying only parts of a population without losing information. Its key feature is “representativeness” of the system in question. Sampling criteria and focal variables correlate. In HTR for medieval manuscripts, “representativeness” was approached in terms of the medieval handwritten Latin language’s characteristics as a system comprised of abbreviations, ligatures, and punctuation signs alongside graphemes. Different genres, scripts, and their degrees of formality served as instances of this system.

Document sampling strategy From the three registers making up the construction of a qualitative corpus according to Bauer and Aarts (), namely channel, domain, and function, only the first parameter is constant in our case: the sample represents exclusively the written Latin language while giving room to texts of multiple functions addressed to different audiences belonging to various genres (while not aiming at exhaustiveness at this stage). The corpus construction can be regarded as a cyclical process: it has not been entirely determined a priori but rather evolved, bearing in mind the logic of complementarity regarding the already existing datasets. Estimated abbreviation rate and use of specific characters, known genres and scripts were implemented to compensate for what was thought to be missing from the network of the corpus and the corpus itself in order to make it as “representative” as possible. HTR engines are language agnostic, but the same cannot be said for the resulting models, which means that it depends on the representativeness of the sample to determine whether a model will work on “similar” or “out-of-domain” documents.

Three distinctive selection processes have been applied in our case:

  1. The first set of documents was selected purely on their linguistic feature, their readability, and their availability as both digitized manuscripts and editions which could be found either online or in local libraries. It led to the inclusion of classical texts such as Seneca’s Medea. Script was not taken into account.
  2. In a logic of complementarity, the second part of the corpus was dictated inversely by content. More specifically, given the relative absence of ligatures and abbreviations in classical texts, we chose documents that would display a higher degree of abbreviations. This both induced or led to a genre selection process, specifically for medical and scholastic data. At the same time, script diversity was added to the consideration and came naturally as a sort of by-product.
  3. Finally, as we wanted to test Kraken models, we sought a transcription project that would provide us with data that would help us evaluate our own. This led to the alignment of the Eichenberger and Suwelack () dataset, produced in the context of a transcribathon in Berlin and containing genres new to our corpus (Book of Hours, Psalms, etc.).

Quantitative aspects of the corpus Corpus size depends largely on the subjective criteria and resources of each project and little can be said as a general rule: one needs to consider the limitations that stem from the effort put into producing the corpus, the budget available, the number of representations one wants to characterize, and some minimal and maximal requirements (in our case the quota for the production of an efficient HTR model). Building a turn-key HTR model applicable to as large a range of unseen manuscripts as possible is undoubtedly the end goal. With the production of ground truth being expensive but with increasingly more open-access models available to the public, the challenge is finding the right combination of GTs (either to create a model from scratch or to fine-tune an existing one) that yield the best results. This is where considerations of size and variety enter the discussion and affect directly the quantitative corpus construction strategy.

More specifically, while conducting an experiment on Caroline Minuscule OCR models, Hawk et al. () conclude that “relative preponderance” in small training pools was a considerably more important factor than that of size, which inversely impacts the accuracy of the models resulting from larger training pools. A careful conclusion would be that a specific combination of manuscripts can yield exceptional results, even though the reasons behind such results or the criteria for the respective manuscripts to be combined are not entirely clear yet. This means that quantity-wise we sought to find a balance between the diversity and size of the GT, always making sure that the ground truth yields an efficient model for individual manuscripts on the training set. Training and fine-tuning experiments conducted by Pinche showed that a specialized model per script isn’t always necessary, but the variety of the training set increases its robustness. Therefore, the size of each GT belonging to the training set was limited to 5 pages per script variation (depending on the density of the layout), examining whether this balance can contribute to the production of generic models.

Segmentation vocabulary: SegmOnto With the emergence of efficient layout analyzers and easy-to- use interfaces, the need for efficient segmentation models increases (as does the need for large amounts of data) based on the aggregation of heterogeneous documents. Alongside text recognition, eScriptorium allows for layout annotation using ontologies and controlled vocabularies. For this, researchers need to agree on a limited common vocabulary and share common practices to facilitate the interoperability of their ground truth.

In order to identify the different areas of the document and the type of lines present on the page as well as to characterize them from a codicological point of view, we decided to implement the controlled vocabulary SegmOnto (). SegmOnto was born out of the need for a small/restricted common ontology based on existing standards for the description and analysis of document layout, ranging from content categorization to text recognition, mainly addressing the case of manuscripts and early printed books.

SegmOnto has already been implemented in several projects led by Pinche and connected to the CREMMALab project such as Gabay et al. (), resulting in segmentation models mainly for late medieval manuscripts and early prints. As per the CREMMA Medii Aevi dataset, the documents present two kinds of layout: multi-columns and singular columns, for which lines are most often long, except for the Psalms and Book of Hours. SegmOnto offers multiple levels of description, of which only the first is completely standardized, as the second is intended for custom refinement and the third for local and document-based differentiation. For the purposes of the project, only the first level of SegmOnto has been utilized, such as MainZone for columns and MargintextZone for marginalia.

Pinche’s Transcription Guidelines Pinche () stressed that HTR was an answer to the need for scientific projects to acquire textual data either to undertake editions or to constitute large corpora. Her guidelines address the need to establish principles common to projects dealing with the transcription of manuscripts in order to:

  • build shareable, reusable, and durable ground truth data sets;
  • produce robust generic models, reusable in “out-of-domain” manuscripts;
  • minimize the collective cost, including that of training people;
  • build GT that seeks to optimize the learning space of HTR models.

Pinche has privileged a graphemic transcription, which reproduces graphemes, i.e. a canonical form for each character, instead of a graphetic one, which tries to reproduce each variation of a letter (such as ſ and s). Pushing the imitation too far through a graphetic approach induces a risk of making the transcription harder to complete (as it requires technical skills to recognize differentiated shapes of characters), harder to make uniform (specifically as more annotators are to participate in a dataset) and potentially unusable for HTR (as it might introduce more characters and ultimately noise for HTR engine to learn). Therefore, in cases where functional signs have more than one graphetic manifestation but essentially the same function, they could be represented by the same sign: for example, for every manifestation of the paragraph sign, we opt for the pilcrow sign “¶” (U+00B6) on every occasion, instead of several variations such as “” (U+F1E1). In the context of the guidelines, we set up a list of allowed characters and a list of common and rare cases (such as Table 2 and 3).

Table 2

Punctuation, functional signs and hyphenation.


TYPETRANSCRIPTIONUNICODEDESCRIPTION OR RESOLUTIONEXAMPLES

PunctuationU+00B6Content change

PunctuationU+002DHyphenation

Punctuation/U+2215Diastole

Reference markU+2038Omission sign ‘caret’(reintroduction of content)

Punctuation:U+003APunctus elevatus

Punctuation:U+003APunctus interrogativus

Table 3

Freestanding, letter-combining abbreviations and their corresponding transcription signs. đ cannot be found in our dataset and is mentioned here as it might be a common case in other datasets.


CHARACTER(S)UNICODERESOLUTIONEXAMPLES

U+204AEt

⁊+◌̃U+204A + U+0303Etiam

U+A76D–is

đU+0111d + any desinence truncation

U+A76Fcon

U+2248esse

÷U+00F7est/id est

;U+F1AC-que/-bus/-m/-et

U+A775-rum

On the topic of abbreviations, resolving them produces specific difficulties for HTR engines, as it leads them to learn more about the language than they originally intended. Abbreviations are not resolved in our dataset, as this constitutes rather an interpretative act linked to the specificity of each document. It is not the same as a textual prediction and it could prove to be detrimental to the extension of an HTR model in the long term. Pinche’s graphemic approach without abbreviation resolution simplifies the interpretation step of the text, and in turn, the reduction of characters diversity ultimately smooths both the human transcriber and the HTR engine’s learning curves.

In order to ensure the rigorous application of these guidelines and the homogeneity of the data produced, we introduced quality control to the production and publication workflow. Each manuscript transcription was passed through ChocoMufin (), using project-provided character translation and control tables.

This software, alongside these tables, allows for each dataset to be both controlled at the character level and adapted to guideline specifications and modifications. It also allows for project-specific transcription guidelines to be translated to a more common one such as CREMMALab’s (). This process has been largely used in the first months of the CREMMA Medieval project, as the guidelines were still being drafted. It allowed Pinche to produce or align datasets first and harmonize later, as long as the harmonization was from an upper level of details (closer to graphetic) to a lower level (closer to graphemic).

3.2 Transcription Guidelines for the CREMMA Medii Aevi

The section that follows aims to guide the reader through the transcription norms followed for the Medii Aevi dataset, illustrating the process and the more common and complex cases, especially where new characters have been introduced compared to the CREMMA Medieval dataset.

The project adheres to the general principles laid out by Pinche (, Tables pp. 4–15) concerning the base cases (punctuation, word separation, functional signs, superscript letters, abbreviations, ligatures, and roman numerals). Using the project-provided character conversion table, ChocoMufin controls the transcription and corrects any anticipated error by transforming the character automatically so it conforms to the pre-defined guidelines (data should be used in their post-ChocoMufin converted state as it sometimes corrected mistranscription). However, where the guidelines were not directly addressing the situation (new characters, new types of abbreviations), we positioned ourselves and interpreted the guidelines in light of the situation. Each decision was discussed with the original guidelines’ author.

In general, the main differences that we isolated between the CREMMA Medieval and Medii Aevi datasets, stemming from the language as well as the genre’s own characteristics, are:

  1. the dataset bears no accentuated vowels like in the Old French texts (a rare event though for the corpus);
  2. no normalization or distinction of u and v was provided, nor of i and j;
  3. two variations of con are found, namely the antisigma and the 9-shaped form;
  4. a higher diversity of abbreviating character usage and signification;
  5. Arabic numerals alongside roman, mostly in scholastic and medical treatises.

Reference marks, functional signs, and punctuation In general, complex medieval punctuation has been simplified as much as possible, with single sign punctuation being reduced to “.” and commas will be rendered as “,”. Double sign punctuation (mainly punctus elevatus and punctus interrogativus) are consistently reduced to “:”. The hyphenation for words that continue to the next line has been marked with a unique “-” (U+002D) sign, following 3.1. Table 2 gives a representative example.

Contractions, Abbreviations, and Ligatures Cappelli () categorized abbreviations into six categories: truncation, contraction, abbreviation marks significant in themselves, abbreviation marks significant in context, superscript letters, and conventional signs. As Pluta () stresses, the six aforementioned categories are not mutually exclusive, but the functional grouping is helpful.

Contractions: A word is abbreviated by contraction when one or more of the middle letters are missing. Such an omission is indicated by one of the general signs of abbreviation, present in both corpora, always following Pinche (). Thus, macrons and generally horizontal lines diacritics over the letter such as tildes are represented by combining horizontal tildes, and any vertical zigzag and similarly shaped forms are simplified into combining vertical tildes. In our corpus, in cases where a macron is extended to more than one letter due to the cursivity of the script, this trait has been reproduced in the transcription, as well as in the case of stacked diacritics, usually in later medieval manuscripts (cf. Table 4), as long as it was a semantic feature and not a decorative one.

Table 4

Ligatures and special contraction cases.


TYPETRANSCRIPTIONUNICODEDESCRIPTION OR RESOLUTIONEXAMPLES

LigaturestNormally transcribed ligature

Ligature.n.enim

Ligatureqrquia

Monogrammatic Ligatureqdquod

Monogrammatic ligatureEtEt

Contractionaũt̃Long vertical tilde transcribed by two tildes

ContractionẽẽLong vertical tilde transcribed by two tildes;

Contractiontp̃̃aTwo stacked tildes

Abbreviation marks significant in themselves: “Standard” Abbreviations signs have been preserved as such, like pr(a)e -p̃ (p + combining tilde, p + U+0303), pro - ꝓ (U+A753), hoc - ħ (U+0127), ẜ (s with diagonal stroke, U+1E9C) for secundum or ser-, ꝯ for 9 shaped con/cum (U+A76F), Tironian sign ꝰ for the desinence -us (U+A770), for (t)ur (U+1DD1), and Ꝙ / ꝙ for quod. Absent from the CREMMA Medieval but present in Medii Aevi, the truncated ending -is is transcribed using the character ꝭ (U+A76D). The “inverted c” variation of the preposition con/cum is a good example for the difference of approach between the graphetic and graphematic approach: while using the antistigma (ↄ) is more faithful, it simply is an allograph of the original ꝯ. For -rum, the symbol ꝵ is used rather than the rotunda -rum ꝝ (U+A75D).

Abbreviation marks significant in context: The abbreviation for the enclitic -que, or simply -bus or vertical -m in later manuscripts, has been reduced to the semicolon-shaped ; sign (U+F1AC), avoiding the private domain ligature specific (U+E8BF) character but also avoiding confusion with the regular semi-colon.

Conventional signs: a category that includes all signs that stand for a frequently used word or phrase, and they are almost always isolated (cf. ). First, a rather frequent one, the abbreviation sign for esse is represented by the mathematical operation ≈ (U+2248). The Division sign ÷ is used ubiquitously for the abbreviation sign of est/id est. Tironian et (U+204A, all variations of it, cf. below) is transcribed by ⁊. Etiam can also be found abbreviated by a combination of the Tironian et and the macron symbol (see Table 4).

Ligatures, ie. combinations of more than two letters in one form with the reduction of proclitic and enclitic letters or abbreviating symbols placed above or joined with letters are reduced to their original alphabetical components. Ligatures between letters in cursive scripts such as the ſt (U+FB05) ligature or the two ff (U+FB00) ligature are resolved as -st- and -ff-. For the very frequent quia, the transcription qr has been privileged, avoiding the MUFI sign that belongs to the private domain. More examples are provided in Table 4.

Superscripts letters and interlinear additions A standard way of contracting a word is by adding a superscript letter which gives information about the abbreviated sequence. Frequent ones are open a, u, o, or the ending of a word altogether. These were all rendered with the aid of superscript characters (). Ergo and igitur are two of the most frequent examples of abbreviations with superscript letters. Letters without any baseline letter are simply represented with the same combining superscript character and space as the supporting baseline character (e.g. “ͣ ͭ”: space + combining a + space + combining t cf. Figure 1).

Figure 1 

Examples of contraction use of superscript letters. Manuscripts in the following order: BIS 193, CML 13027, Montpelier H-318, Montpelier H-318, Vat. Pal. lat.373, BIS 193.

Superscript letters, alongside abbreviating functions, were sometimes used to render interlinear additions. Missing content or annotations are added in the interlinear space, especially in manuscripts of scholastic and medical content. This was something that was at first a challenge for the transcription process due to segmentation constraints. It can be, at times, impossible to completely differentiate the segmentation masks of two vertically adjacent letters (like the interlinear additions). Therefore, provided that the corresponding combining letter exists and both words can be formulated, no new lines were carved for the interlinear additions. Where this was deemed too complex, interlinear additions were omitted (see Figure 2).

Figure 2 

All examples come from the CML 13027 manuscript.

Rare characters and Numerals Referring to corpus construction practices for balanced corpora, Maniaci () stresses that “sporadically attested variables will therefore be preferred to those that appear in all – or almost all – the individuals that are part of the corpus.” Rare characters, a subset of freestanding abbreviation signs, specifically occurring in the Medii Aevi dataset are therefore given special attention (cf. Table 5). In two of the manuscripts, both of medical content, some occurrences of graphemes for the denotation of the metric values ounce and semuncia were encountered. For their transcription, ℥ (U+2125) and (U+10192) were used. “Barred O” is represented by ∅ (U+2205) and is widely used to transcribe the word instans instead of ꝋ (U+A74B) that, according to MUFI documentation stands for the abbreviation of obi(i)t ().

Table 5

Rare characters found in Montpellier H318, Phil., Col. of Phys. 10a 135 and BIS 193.


TYPETRANSCRIPTIONUNICODEDESCRIPTION OR RESOLUTIONEXAMPLES

SymbolsU+2125Ounce

SymbolsU+10192*Semi-Ounce

AbbreviationsU+2205instans

Last but not least, in addition to roman numerals, often preceded and followed by dots such as “.ii.”, Arabic numerals are also comprised in the dataset, mainly due to the medical treatises (see Figures 3 and 4).

Figure 3 

Manuscripts in the following order: Latin 16195, Phi. 10 a. 135 (x3), BIS 193, CML13027, Egerton 821, Latin 6395.

Figure 4 

Snippet of Arabic numerals from BnF, lat.15461, fol.13r for comparison purposes.

Production pipeline The data was built using eScriptorium and Kraken for both segmentation of zones and lines (specifically the BLLA model). Manuscripts were annotated successively. First, the manuscript is automatically segmented, then its segmentation is manually corrected, and finally the text is transcribed. Once each sample is entirely annotated, its use of characters is controlled via the ChocoMufin software, while its conformity to the segmentation classification vocabulary is controlled by HTRVX. Finally, data are released on Github. All the combining and abbreviation signs suggested for use by the present adaptation of Pinche’s guidelines can be also found on a custom-made eScriptorium keyboard configuration, in order to facilitate reuse and compatibility with the guidelines.

4 Results and discussion

Properties of the resulting dataset The resulting version of the dataset (see Table 6) is built on 18 + 3 manuscripts. All alignments are original alignments, but some draw their original transcription from online projects (cf. Acknowledgements).

Table 6

Basic features and length of the dataset in chronological order. Medic. stands for medical, Lit. for literature, Schol. for scholastic commentaries, Gramm. for grammatical commentaries, Eccl. for church literature (book of hours, psalms, etc.). Texts preceded by a ‡ are aligned and corrected using the Berlin Transcribathon dataset, by a † using the SCTA TEI editions. The complete metadata table can be found in the more detailed data-registry.csv of the dataset.


SHELFMARK IDPAGESTYPEDATESTATUSSCRIPTFOLIO SAMPLINGDEGREE OF ABBREVIATIONS

Egerton 8214Medic.1100–1199ColorPraegothicaSequentialmedium

Montpellier H3185Medic.1100–1299ColorSemitextualis LibrariaSequentialhigh

CCCC MSS 2365Lit.1200–1225ColorTextualis LibrariaSequentialmedium

CLM 130275Medic.1250–1299ColorSouthern Textualis LibrariaSequentialhigh

Latin 161954Medic.1250–1299MicrofilmSemitextualis CurrensSequentialhigh

† MsWettF 155Schol.1270–1280ColorTextualis LibrariaSequentialhigh

Laur. Plut. 33.315Lit.1300–1310ColorTextualis MeridionalisSequentiallow

Arras 8615Lit.1300–1399ColorTextualis FormataSequentialmedium

† BIS 1935Schol.1300–1399ColorTextualis currensSequentialhigh

Phil., Col. of Phys. 10a 1355Medic.1300–1399ColorCursiva recentiorSequentialmedium

† Mazarine Ms. 9154Schol.1300–1399ColorTextualis MeridionalisSequentialhigh

‡ UBL, Ms 75815Eccl.1320–1340ColorTextualis LibrariaSemi-Sequentiallow

Latin 63956Lit.1325–1399MicrofilmSemitextualis LibrariaSequentiallow

Laur. Plut. 39.345Lit.1400–1499ColorHumanistica CursivaSequentiallow

† Vat. Pal. Lat. 3734Schol.1400–1499MicrofilmHybrida CurrensSequentiallow

Laur. Plut. 53.084Gramm.1459ColorPersonal HumanisticaSequentialmedium

Laur. Plut. 53.094Gramm.1400–1499ColorHumanistica RotundaSequentiallow

‡ Berlin, Hdschr. 2517Eccl.1400–1499ColorTextualis FormataSemi-Sequentiallow

‡ Berlin, Germ. Oct. 5116Eccl.1400–1499ColorHybrida formataSemi-Sequentiallow

Latin 82365Lit.1471–1499MicrofilmHumanistica CursivaRandomlow

† CCCC MSS 1655Schol.1500–1599ColorPersonal CursiveSequentialmedium

The current version of the dataset shows a wide variety of genres, and thus a wide vocabulary. From medical and grammatical content to literary and scholastic, a certain level of arbitrariness is introduced in the sequence of characters as they are not as repetitive and predictable from the machine as in a homogeneous genre or topic-driven dataset. The collection was built not to be representative of one specific use of the Latin language and is not thematically unified, while the CREMMA Medieval dataset focuses more on literary texts, specifically hagiographic and chanson de geste texts. Medical and scholastic genres, furthermore, induce the use of a range of rare characters and often underrepresented letters (such as “z”, “y” and “k”).

Other features, such as layout and type of digitization (microfilm or original), provide different representations of texts, with more or less noise in the mask of each line given the space between them, with more or less contrast between information. Colored text yields less “information” in digitized manuscripts as they tend to be a duller form of grey that black ink, while clearly departing from the manuscript “background” in color.

A timespan of 5 centuries separates the earliest and the oldest manuscripts, with a clear focus on the period starting in the 1200s and finishing in 1500. This leads to a good representation of a variety of Gothic scripts, including personal hands alongside formal categories such as the one described by Rossi (), with different levels of execution (cursivity and formality).

Character frequencies in the CREMMA Medieval and the Medii Aevi datasets We set up this corpus to both complement the CREMMA Medieval dataset and grow the available set of data for Latin through the Middle Ages, noting that at least two datasets for Medieval Latin existed already (Caroline Minuscule and Eutyches) in abbreviated form for pre-10th century documents.

Unlike CREMMA Medieval, our approach has been feature-driven to compensate for rare characters in the dataset network. In this regard, we succeeded, as we have a higher frequency of special characters in our dataset than in Pinche’s dataset, despite being smaller overall (see Table 7 and Figure 5). Only three characters are more represented in CREMMA Medieval: the Tironian Et, the superscript combining R (common on words such as “grand”), and “&”. The character ꝯ is equally present in both datasets: resolved as con- or com-, it is often used in words such as ꝯmence (commence). Some very frequent diacritics, such as the horizontal lines and vertical lines transcribed as tildes, are more frequent in our dataset, by a factor of 2.51 for horizontal ones and of 3.93 for vertical ones. This will allow better recognition of these two frequent marks, as it now totals around 19,000 occurrences in both datasets for the horizontal tilde and 4,500 for the vertical one, making them the first and the third most represented abbreviating characters.

Table 7

Comparative statistics table on abbreviations: for each dataset, we look at words that are abbreviated (abbr.) or non-abbreviated (others). It reads the following way: “11.94% of words in the Latin corpus are abbreviated.”


LANGTYPEWORDSWORDS %UNIQUE WORDSUNIQUE WORDS %FREQ. OF UNIQUE WORDS > 1

Latinabbr.6,85511.94%1,4606.24%279

Latinothers50,55788.06%21,93593.76%5,025

Old Frenchabbr.5,7554.15%1,4574.89%286

Old Frenchothers132,82895.85%28,31595.11%8,726

Figure 5 

Frequences of character classes across manuscripts.

Some manuscripts have nearly no abbreviation (cf. Table 9). Laur. Plut. 39.34 notably so, as it only contains 3 abbreviated words which is a single character abbreviation (⁊, et). A little less than half of our manuscripts are less abbreviated than the most abbreviated text in the CREMMA Medieval dataset, while the other half can exceed it by up to ten points. However, both languages show similar maximum frequencies in terms of non-single letter abbreviations (abbreviations made up of a single Unicode codepoint such as ⁊, &, ꝑ).

Table 9

Statistics per manuscript. “Un.” stands for Unique, “Abbr.” for Abbreviated or Abbreviation, “NSCA” for Non-Single Character Abbreviation. The lowest and highest values are in bold typeface. The separation between Laur. Plut. 53.08 and UBLMs. 758 represents the highest abbreviation ratio in the CREMMA Medieval dataset.


MANUSCRIPTWORDSUN. WORDSABBR. WORDSABBR. RATIONSCA NSCA RATIOUN. ABBR.UN. ABBR. RATIO

Laur. Plut. 39.3478357130.38%00.00%10.18%

Berlin, Germ. Oct. 51117113410.58%00.00%10.75%

Berlin, Hdschr. 25961654121.25%30.31%60.92%

Latin 823614751057332.24%50.34%60.57%

Laur. Plut. 33.311278858362.82%171.33%212.45%

Laur. Plut. 53.091300798382.92%100.77%91.13%

CCCC MSS 1651521713493.22%281.84%233.23%

CCCC MSS 2361239874685.49%443.55%242.75%

Latin 6395330424181905.75%852.57%722.98%

Laur. Plut. 53.08298518701956.53%943.15%673.58%

UBL, Ms. 758446823932976.65%721.61%642.67%

Arras 861241616011646.79%1014.18%805.00%

Egerton 821981677717.24%282.85%314.58%

Phil., Col. of Phys. 10a 1351487105715110.15%523.50%444.16%

Montpellier H3184456231645810.28%1312.94%1094.71%

Vat. Pal. Lat. 3732258120323410.36%693.06%675.57%

Latin 161954135167656913.76%1684.06%1076.38%

MsWettF 153574145250114.02%1724.81%1077.37%

CLM 130276499361297014.93%3405.23%2577.12%

BIS 19373702731116115.75%4135.60%2448.93%

Mazarine Ms. 9154751187382417.34%3507.37%19510.41%

Finally, despite showing a similar number of pages, we see a large variation in terms of word density with a limited variation in terms of unique words (cf. Table 8). This shows how pages as a metric are not enough to characterize a corpus for HTR and Layout segmentation purposes: the number of columns, lines, and potentiality of words or characters supplements the first. To showcase this argument, the Berlin, Hdschr. 25 manuscript has the highest number of pages (17) but the third lowest amount of words (961).

Table 8

Abbreviating signs, present more than 50 times in both the Latin and the Old French CREMMA datasets. The CREMMA Medieval (Old French) dataset is comprised of 693,052 characters in total, which makes it more than twice the size of CREMMA Medii Aevi. Despite this difference, most abbreviated characters are more represented in the Latin dataset.


CHARACTERUNICODELATINOLD FRENCH% IN LATINRATIO

U+204A2228.04400.033.610.51

ͬU+036C148.0219.040.330.68

&U+002683.0116.041.710.72

U+A76F850.0779.052.181.09

U+A7511500.0919.062.011.63

ͥU+03651486.0820.064.441.81

̃U+030314445.05759.071.502.51

ͣU+03632024.0732.073.442.77

U+A7701763.0523.077.123.37

̾U+033E3827.0973.079.733.93

ͤU+0364518.0120.081.194.32

U+A753462.080.085.245.78

U+1DD11018.0137.088.147.43

U+1DE4978.055.094.6817.78

ͦU+0366870.061.093.4514.26

5 Implications/Applications

With this addition to the overall amount of datasets available, we now have 1.149 million characters for medieval manuscripts with book scripts, ranging from the 9th to the 15th century. These data offer more than characters: we can imagine using them in the context of linguistic studies (evolution of dialects, abbreviation usage, etc.) thanks to the shared transcription norm, or in codicology studies (evolution of layouts, relation between layouts) using the common segmentation vocabulary, both using the original data or automatically annotated one.

HTR data and models have a fairly high level of reuse potential. First and foremost, while it is still relatively a rare reuse, these data, visualised correctly, can easily serve as teaching materials: e-teaching of paleography has been gaining some traction, but simply moving away from printed to digital and interactive hand-outs using open data and transcription is a first step that undoubtedly some have already taken. Then reuse can move to the analysis of transcription themselves: Stutzmann () and Stutzmann, Mariotti, and Ceresato () have shown that analysis of graphematic data can yield information about scribal practices. Finally, such data can be used for model training. Project like Possamaï, Gaiffre, Souvaye, Duval, and Ducos () and Foehr-Janssens, Ventura, Carnaille, and Meylan () have used automatic transcription models to speed-up the transcription process of large collection of manuscripts, using base models which were then fine-tuned on sample of data to yield better results, such as described by Pinche (). Finally, models can be used for data-mining and operating research at scale on non-manually transcribed manuscripts: Camps, Clérice, and Pinche () proved the hypothesis of a 19th-century scholar by analysing a full manuscript with automatic transcription, Franzini et al. () proposed also a stylometrical analysis of data obtained through automatic transcription.

As a direct output, we trained a model which would allow for transcribing or starting the transcription of Latin medieval manuscripts. In order to evaluate the gain from our data, we trained three models:

  1. a model containing all data from Table 1, to help transcribe Latin and Medieval French manuscripts, which is the end goal of this paper;
  2. a model containing every dataset but our own, to evaluate the impact regarding the quantity of data we add for Latin (i.e., to find out if the original Carolingian datasets were enough to break the language model of the Old French datasets);
  3. a model containing only Old French data, from incunabula of the 15th century to the main dataset CREMMA Medieval.

From Medii Aevi, as stated earlier, all aligned data from the Faithful Transcription Data Set are kept for testing, as an out-of-domain set. Each model uses at least 10% of the pages of each dataset for the development set. CREMMA Medieval and Medii Aevi are split furthermore with another 10% subset for evaluation, proposing “in Domain” evaluation.

The results show a massive improvement for the in-domain Latin dataset (see Table 10) and an insignificant one for Old French. The addition of Medii Aevi provides overall better results on out-of-domain datasets: UBL Mss. 758 and Berlin, Hdschr. 25 transcriptions improved by 4.2 points at least while Berlin, Germ. Oct. 511 (BGO), the smallest transcription set of the dataset, only improved by 2.4. This improvement derives equally from the simple addition of Latin into the model, as shown by the clear gap between the mixed model with Carolingian data: not only the model might benefit from Latin in general (as potentially shown by the simple addition of the Carolingian data), but it also gains in performance out of the amount of data from the same period as CREMMA Medieval. We actually see in Table 11 that there are much fewer errors on characters that saw their frequencies reach new highs. The All model does only a fourth of the error of the Only Old French model on tildes or two-thirds on vertical tildes for the UBL manuscript. The -rum abbreviation (ꝵ) or the -et/-ed/-ibus one (;) are quite new to Medieval datasets in general, which explains the clear difference in results. Overall, this dataset helped create a model allowing for readable outputs (see Appendix Table 12 for a side-by-side comparison) on Medieval manuscripts, or at least transcriptions that can help produce new data.

Table 10

General accuracy results of the models. Model All contains all data presented in Table 1, model No CREMMA Medii Aevi contains everything but the present dataset, model Only Old French contains all datasets but Latin one (Eutyches, Caroline, CREMMA Medii Aevi). Two types of test sets are present: the “In Domain” dataset are pages from the same manuscripts as the models, all others (UBL 758, BGO 511, and B.H. 25) are manuscripts from the Faithful Transcriptions Data Set aligned in CREMMA Medii Aevi but not used for training purposes.


MODELMEDIEVAL OLD FRENCH (IN DOMAIN)MEDIEVAL LATIN (IN DOMAIN)UBLBGOBH25

All94.3090.1571.6979.1285.10

No CREMMA Medii Aevi94.0480.6867.6878.0281.89

Only Old French94.0178.1067.4976.8180.74

Table 11

Details on errors from the test presented in Table 10. Space % shows the portion of error points due to bad spacing, e.g. All Model has a 94.30% accuracy on CREMMA Medieval test set, which means a 5.7% Character Error Rate (CER): not recognized SPACES represent 1.7 points of CER, more than a quarter of the CER. Other numbers are absolute values of missed characters (deletion or substitutions) to make comparisons between models possible; insertions are not accounted for.


MODELTEST[SPACE] %[SPACE]TILDEVERT. TILDE7ħł;

AllCREMMA Medieval1.780377460101715000400

No CREMMA Medii AeviCREMMA Medieval1.772689550151215000300

Only Old FrenchCREMMA Medieval1.773386500121520000200

AllCREMMA Medii Aevi1.77427310323082201

No CREMMA Medii AeviCREMMA Medii Aevi2.813878920178111171521532

Only Old FrenchCREMMA Medii Aevi3.114991870161091172021533

AllBGO2.922101100000012

No CREMMA Medii AeviBGO2.313301100000012

Only Old FrenchBGO2.313101100000012

AllBH251.963441802080512310

No CREMMA Medii AeviBH251.873682104090512312

Only Old FrenchBH252.1100712105050512312

AllUBL4.4274674801205410302143043

No CREMMA Medii AeviUBL6.04822567603806911377167171

Only Old FrenchUBL5.74842397702805911377157171

Additional File

The additional file for this article can be found as follows:

Appendix Table 12

Ground-truth (left) and prediction (right) of the new model on UBL Mss. 758, 24r. Yellow highlighting shows the differences between transcriptions. DOI: https://doi.org/10.5334/johd.97.s1