1 Overview

Repository location

Zenodo: https://doi.org/10.5281/zenodo.8300851.

Context

The dataset was produced by the ERC project LiLa: Linking Latin; it has been used in all the project publications, including Pellegrini et al. (); Sprugnoli, Mambrini, Passarotti, and Moretti (); Sprugnoli, Moretti, and Passarotti (). A full list of publications is available at https://lila-erc.eu/output (last accessed: 26 October 2023).

2 Method

Steps

Our goal was to generate a dataset of all dictionary forms (known as canonical forms or lemmas) that can be adopted by projects dealing with the lemmatization of Latin, independently of the particular strategies each project follows. We wanted to provide each lemma with a stable Uniform Resource Identifier (URI) and to model its linguistic properties with the help of OWL ontologies for Linguistic Linked Open Data (LOD) ().

The starting point was the list of lemmas used by the morphological analyzer LEMLAT 3.0 (). The list of Latin lexical items used by the software was compiled from three sources: a set of dictionaries of Classical Latin (; ; ), the Onomasticon by Forcellini (), and the Medieval glossary of Du Cange et al. (). For the Classical words, LEMLAT also includes information on the derivational history of words ().

LEMLAT’s lemmas have undergone a twofold process of revision. Firstly, we manually identified and merged the duplicate entries from Classical and Medieval Latin. Secondly, we generated all possible inflected forms that may be chosen as lemmas (such as the present, perfect and future participles of verbs, or de-adjectival adverbs) that were not already in LEMLAT. After this revision, we developed an OWL ontology, based on widely adopted standards for Linguistic Linked Data like Ontolex (), to express the different classes and properties used in lemmatization (), and we modeled the lemmas according to it.

The first version, published in 2020, included 196,853 lemmas; it has since been revised and extended (see below under “Quality control”). Version 1.2 now includes 215,102 canonical forms. The information was originally stored in a relational database (MariaDB); the RDF triples were generated from this source. Both formats (RDF and Structured Query Language (SQL)) are provided, together with instructions on how to generate the RDF from the database.
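The step from relational rows to RDF triples can be pictured with a minimal sketch. It is not the project's conversion script: the row layout (an identifier and a written representation), the BASE namespace, and the function names are hypothetical. Only the Ontolex namespace, the class ontolex:Form, and the property ontolex:writtenRep come from the Ontolex standard.

```python
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
BASE = "http://example.org/lemma/"  # hypothetical namespace, not the LiLa one


def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_turtle(rows):
    """Serialize (id, written_rep) rows as Turtle, one ontolex:Form per row."""
    lines = [f"@prefix ontolex: <{ONTOLEX}> .", ""]
    for lemma_id, written_rep in rows:
        lines.append(f"<{BASE}{lemma_id}> a ontolex:Form ;")
        lines.append(
            f'    ontolex:writtenRep "{escape_literal(written_rep)}"@la .'
        )
    return "\n".join(lines)


# Hypothetical rows, standing in for the result of a SQL SELECT.
rows = [(1, "amo"), (2, "rosa")]
turtle = rows_to_turtle(rows)
print(turtle)
```

The instructions distributed with the dataset remain the authoritative description of how the published RDF is derived from the database.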

Quality control

Quality control has been performed continuously while linking lexical and textual resources within the LiLa network. The Lemma Bank has been used to interlink 10 lexical resources and 9 textual corpora of various sizes. Linking those resources means matching the lemma string used to lemmatize the entries in the original resource to the entries in our Lemma Bank. This process helped us identify several missing lemmas (in particular for proper nouns) and several duplicated entries. After manual revision, the former were added to the collection, while the latter were merged with pre-existing entries.

For an example of this workflow involving the ca. 1.7 million lemmatized words of the Opera Latina by LASLA, see Fantoli, Passarotti, Mambrini, Moretti, and Ruffolo ().
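The matching step described under quality control can be sketched as follows. This is an illustration under stated assumptions, not the LiLa linking pipeline: the toy bank, the ex: identifiers, and the function names are hypothetical, and treating strings with several candidates as a separate "ambiguous" group is a choice of this sketch. Strings with no candidate are the ones that would go to manual revision as possible missing lemmas.

```python
from collections import defaultdict


def build_index(lemma_bank):
    """Map each written representation to the set of lemma URIs bearing it."""
    index = defaultdict(set)
    for uri, written_rep in lemma_bank:
        index[written_rep].add(uri)
    return index


def link_lemmas(resource_lemmas, index):
    """Sort lemma strings into linked, ambiguous, and missing."""
    linked, ambiguous, missing = {}, {}, []
    for lemma in resource_lemmas:
        candidates = index.get(lemma, set())
        if len(candidates) == 1:
            linked[lemma] = next(iter(candidates))
        elif candidates:
            ambiguous[lemma] = sorted(candidates)  # needs disambiguation
        else:
            missing.append(lemma)  # candidate for manual revision
    return linked, ambiguous, missing


# Toy data (hypothetical identifiers): "populus" has two homographs.
bank = [("ex:1", "amo"), ("ex:2", "rosa"),
        ("ex:3", "populus"), ("ex:4", "populus")]
linked, ambiguous, missing = link_lemmas(
    ["amo", "populus", "Caesar"], build_index(bank)
)
print(linked, ambiguous, missing)
```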

3 Dataset Description

Object name The LiLa Lemma Bank (V1.2).

Format names and versions V1.2: Turtle serialization of RDF; SQL file. The first version of the Turtle RDF was included in the ILC-CNR for CLARIN-IT repository under a more restrictive license (CC BY-NC-SA 4.0).

Creation dates 2020-11-25 to 2023-08-30.

Dataset creators Marco Carlo Passarotti (supervisor), Flavio Massimiliano Cecchini (developer), Greta Franzini (annotator), Federica Iurescia (annotator), Eleonora Litta (annotator), Francesco Mambrini (annotator), Giovanni Moretti (developer), Giulia Pedonese (annotator), Matteo Pellegrini (annotator), Paolo Ruffolo (developer), Rachele Sprugnoli (annotator), Marinella Testori (annotator). Affiliation of all (at the time of data development): Università Cattolica del Sacro Cuore, Milan, Italy.

Language Latin; English for metadata.

License Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Repository name Zenodo, GitHub.

Publication date 2023-08-30 (V1.2).

4 Reuse Potential

LOD publication

Any project wishing to publish linguistic information about Latin words or texts may use the URIs from the Lemma Bank to link their data. The dataset provides easily reusable unique identifiers for a wide set of Latin canonical forms, which are already linked to a wealth of textual and lexical information. The dataset relies on a W3C de facto standard for lexical information (Ontolex): any project adopting this model may easily reuse our data. For instance, Wikidata Latin lexemes () provide links to the LiLa lemmas with an ad hoc property, LiLa Linking Latin URI.

Linguistic research

The LOD paradigm used to connect resources via the URIs from the LiLa Lemma Bank allows researchers to run sophisticated queries across multiple layers of information. Users can, for instance, find out how many Latin derivatives are etymologically linked to an Indo-European root and where they are attested (), or how negative and positive words are distributed in the lyrics of Horace ().

The LiLa project provides a SPARQL endpoint, with pre-compiled queries that showcase some of these applications. As other SPARQL services start to provide access to data linked to the LiLa Lemma Bank (such as the Wikidata query service, where it is now possible to run federated queries to the LiLa SPARQL endpoint), this potential will only grow.
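As a hedged illustration of the kind of query such an endpoint can answer, the sketch below only builds a SPARQL query string that looks up lemmas by written representation; it does not contact any service, and the endpoint address is deliberately left out because it is not given here. The function names are hypothetical. The only vocabulary used is ontolex:writtenRep from the Ontolex standard, and the comparison goes through STR() so that the query does not depend on whether the literals carry a language tag.

```python
def escape_sparql_string(text):
    """Escape backslashes and double quotes for a SPARQL string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def lemma_query(written_rep, limit=10):
    """Build a SELECT query for forms with the given written representation."""
    literal = escape_sparql_string(written_rep)
    return (
        "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>\n"
        "SELECT ?lemma WHERE {\n"
        "  ?lemma ontolex:writtenRep ?wr .\n"
        f'  FILTER(STR(?wr) = "{literal}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


query = lemma_query("amo")
print(query)
```

The resulting string can be pasted into any SPARQL query interface; the pre-compiled queries offered by the project are the better starting point for real use.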

Language learning

The interoperability between resources can also be leveraged in the context of language learning. Latin is still widely studied in universities and secondary schools worldwide; however, the community strongly feels the need for newer methods, drawn from current research in computational and corpus linguistics, that ease students’ access to the language, especially in the domain of word usages and meanings (). The ability to cross-reference multiple lexical resources (like sentiment, valency and word-formation lexicons) with textual attestations can be extremely helpful. Word lists can be easily generated to help teachers, including, for instance, nouns with positive polarity, grouped by derivational patterns (such as verb-to-noun derivations involving the suffix -(t)io(n)) in reverse frequency order, based on the number of occurrences in one or more reference corpora. The web-based query interface provided by LiLa can be used for that purpose.
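The word-list example can be sketched in a few lines once the relevant information has been retrieved. Everything in the data below is made up for illustration: the polarity labels, the pattern labels, and the frequency counts are not taken from any LiLa resource, and the function name is hypothetical. The frequency ordering mentioned above is read here as most frequent first, which is an assumption.

```python
from collections import defaultdict


def word_list(entries, polarity):
    """Group lemmas of one polarity by derivational pattern,
    most frequent first (ties broken alphabetically)."""
    groups = defaultdict(list)
    for lemma, pol, pattern, freq in entries:
        if pol == polarity:
            groups[pattern].append((lemma, freq))
    return {
        pattern: [
            lemma
            for lemma, _ in sorted(items, key=lambda item: (-item[1], item[0]))
        ]
        for pattern, items in groups.items()
    }


# Illustrative entries: (lemma, polarity, pattern, corpus frequency).
# All values are invented for the sake of the example.
entries = [
    ("laudatio", "positive", "-(t)io(n)", 12),
    ("consolatio", "positive", "-(t)io(n)", 30),
    ("amator", "positive", "-(t)or", 7),
    ("damnatio", "negative", "-(t)io(n)", 20),
]
positive_nouns = word_list(entries, "positive")
print(positive_nouns)
```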

Natural Language Processing

The LiLa Lemma Bank can support NLP tasks such as (1) lemmatization, (2) Part-of-Speech (PoS) tagging and (3) morphological analysis. As for (1), tools for automatic lemmatization can benefit from the connections of the canonical forms in the Lemma Bank to their occurrences in the interlinked corpora. These connections can be used to build a large lemmatized meta-corpus for Latin that may serve as a training set for a stochastic tool. As for (2), all forms in the Lemma Bank are assigned a PoS, and that information can be exploited by PoS taggers in both training and testing phases. As for (3), the Lemma Bank enhances the canonical forms with morphological features such as gender and inflectional category, as well as derivational information on word formation.