(1) Overview


This database was created in the context of a PhD project on language contact in Ancient Italy, entitled The interplay between language contact and language change in a fragmentary linguistic area: the Italic peninsula in the first millennium BCE.


(2) Methodology


Most of the data was entered manually by the author, based on standard reference works for the languages in question. In some cases, basic forms of automation were used to create an initial dataset which was then corrected. For instance, an initial morphological analysis for Venetic was created by linking the attested tokens to a digitised version of Lejeune’s (1974: 315–341) Venetic word list, and the result was then systematically checked and corrected by the author. The method used for any given field is described in the accompanying documentation on GitHub.
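The word-list linking described above amounts to a lookup of attested tokens against a digitised lexicon, with unmatched forms flagged for manual correction. The sketch below illustrates the idea only; the entries and tokens are invented placeholders, not Lejeune's actual data, and the real procedure was followed by systematic manual checking.

```python
# Illustrative sketch of the semi-automated first pass: attested tokens
# are matched against a digitised word list, and unmatched forms are
# flagged for the editor. Entries below are invented placeholders.
word_list = {
    "ego": "pronoun, nom. sg.",
    "donasto": "verb, 3 sg.",
}

attested_tokens = ["ego", "donasto", "vhraterei"]

draft_analysis = {
    token: word_list.get(token, "UNMATCHED")  # flag gaps for manual review
    for token in attested_tokens
}
```

The resulting draft analysis pairs each token with either a word-list entry or an "UNMATCHED" flag, so that the manual correction pass can concentrate on the gaps.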

A few fields were generated automatically using Python modules. These include, for instance, the field “Token_clean”, which uses the unidecode package to generate a version of the token stripped of special characters, intended for ease of searching. Once again, the documentation on GitHub describes in detail which fields are automatic and how they are generated.
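The actual "Token_clean" field is generated with the unidecode package; a minimal standard-library approximation of the same idea (decomposing accented characters and discarding the combining marks) might look as follows. The sample token is the Oscan coin legend víteliú.

```python
import unicodedata

def token_clean(token: str) -> str:
    """Approximate the unidecode-style cleaning described above:
    decompose accented characters (NFKD) and drop combining marks,
    leaving a plain-ASCII-friendly form for searching."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(token_clean("víteliú"))  # -> "viteliu"
```

Note that unidecode itself goes further than this sketch (it also transliterates characters that do not decompose into a base letter plus marks), which is why the corpus relies on that package rather than on unicodedata alone.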

Sampling strategy

The aim of the database is to include all texts in Oscan, Umbrian, Old Sabellic, Messapic and Venetic, as well as epigraphic Latin texts before 100 BCE. The corpus does not include Etruscan, due to the additional complexities of incorporating a non-Indo-European language into the structure of the database. Within the languages encompassed by the database, however, the primary aim is exhaustivity, and the corpus currently contains over 36,000 tokens.

Quality control

Data was entered manually and checked multiple times by the author.

(3) Dataset Description

Object name

Corpus of the Epigraphy of the Italian Peninsula in the 1st Millennium BCE (CEIPoM)

Format names and versions


Creation dates


Dataset creators

Reuben J. Pitts


Metadata are provided in English.
Language


License
Creative Commons Attribution-ShareAlike 4.0 International License

Repository name

A continually updated version of the corpus is hosted on GitHub, and each released version is permanently archived on Zenodo. In traditional publications, CEIPoM should be cited via this paper, where relevant also specifying the version of the corpus used to obtain a given research result.

Publication date


(4) Reuse Potential

This database has a wide range of applications in linguistic research on the languages of ancient Italy. Currently, such research is hampered by the absence of searchable digital information, as the description of these languages is mostly spread over disparate written reference works (e.g. Bakkum, 2009; Lejeune, 1974; Santoro, 1982; Untermann, 2000; Wachter, 1987). This database aims to address that research need head-on.

The salience of digital and corpus-based approaches to ancient languages has increased in recent years (e.g. Adamik, 2016; Eckhoff et al., 2018; Mambrini et al., 2020; Qiu et al., 2018), and these methods have proven their effectiveness even in relatively poorly attested languages. A digital dataset can be queried far more easily and efficiently than a printed corpus, facilitating research results that would otherwise be difficult or impossible to achieve. Moreover, the use of a digital dataset means that any research results thus obtained can be replicated by other researchers, a key advantage in terms of academic transparency. These advantages hold true in fragmentary languages such as Venetic or Messapic as much as in large corpus languages such as Classical Latin or Greek.

Since annotation is provided on multiple levels of description, this corpus can serve as a tool for linguistic research of various kinds, including work on the syntax, word order, morphology, lexicon, semantics, phonology and orthography of the ancient languages in question. To give an example of a simple linguistic query in CEIPoM, a researcher interested in the usage of syntactic objects in these languages can simply use spreadsheet software to search for instances of OBJ in the field Relation, and thus obtain a list of all tokens in the corpus whose syntactic analysis contains this value. The GitHub documentation offers considerable detail on how each of these features is annotated, and on how the different levels of linguistic description can be related to one another to formulate more complex queries.
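The same OBJ query can be run programmatically on a CSV export of the corpus. The sketch below is hypothetical: the field name Relation comes from the text above, but the column name Token and the miniature rows are invented for illustration.

```python
import csv
import io

# Invented miniature extract standing in for a CSV export of the corpus;
# only the field name "Relation" is taken from the documentation above,
# the column "Token" and the rows themselves are assumed for illustration.
sample_csv = """Token,Relation
deivos,OBJ
ego,SBJ
donasto,PRED
ekvon,OBJ
"""

reader = csv.DictReader(io.StringIO(sample_csv))
objects = [row["Token"] for row in reader if row["Relation"] == "OBJ"]
print(objects)  # -> ['deivos', 'ekvon']
```

Filtering on other annotated fields (morphology, lemma, language) works the same way, and the filters can be combined to build the more complex queries mentioned above.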

In addition to the strictly linguistic annotation, chronological and geographical information (including longitude and latitude) is integrated into the data throughout, allowing the evolution and distribution of linguistic features to be tracked through time and space. Although the focus of the corpus does not lie on epigraphic metadata, the texts in the corpus are linked to their IDs in the Trismegistos database (Depauw & Gheldof, 2014), through which further metadata and bibliography, as well as other epigraphic databases (such as EDR or EDCS), can easily be reached. In addition to its linguistic uses, therefore, the database also holds promise for related fields such as history, epigraphy and onomastics.
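Because each text carries a Trismegistos ID, that ID can serve as a join key between the corpus and any external resource that also uses it. The sketch below uses invented IDs and records purely for illustration of the join.

```python
# Invented records: a Trismegistos ID per text acts as the join key
# between corpus-internal information and external epigraphic metadata.
corpus_texts = {
    "TM 100001": {"language": "Venetic", "tokens": 12},
    "TM 100002": {"language": "Oscan", "tokens": 45},
}
external_metadata = {
    "TM 100001": {"findspot": "Este"},  # hypothetical external record
}

# Merge external metadata into the corpus records where the ID matches.
merged = {
    tm_id: {**record, **external_metadata.get(tm_id, {})}
    for tm_id, record in corpus_texts.items()
}
```

Texts without a matching external record simply keep their corpus-internal fields, so partial coverage of the external source does not break the merge.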

The corpus focuses strongly on ensuring that the information provided for the languages of ancient Italy is intercomparable. This makes it particularly well suited to the study of convergence, language contact and other cross-linguistic typological trends in ancient Italy. The region has sometimes been described as a linguistic area (Zair, 2016: 311–312), i.e. a geographic zone in which prolonged language contact is responsible for grammatical similarities across distantly related languages (Friedman & Joseph, 2017: 55). Because the data is (with a few clearly signalled exceptions) annotated in the same way for all six languages currently in the corpus, it is possible to track the evolving differences and similarities between these languages, and to test hypotheses on contact-based change in this region.

The main current limitation of the database is that its data is, inevitably, not yet complete. In particular, the emphasis until now has been on providing a single plausible linguistic analysis for each token, even when the scholarly literature offers multiple possible interpretations. Since this is frequently the case in disputed fragmentary texts, queries may miss potentially relevant and interesting forms. However, since the state of the data in each field is described in detail in the documentation on GitHub, researchers can take these limitations into account and adjust their use of this research tool to their research aims. Future updates to the corpus will continue to improve and fine-tune the quality of the data offered, as well as expand the coverage of alternative analyses for individual tokens.