The corpus was produced and is used by the project Digital Grammar of Greek Documentary Papyri (PapyGreek). Parts of the data were used in Vierros and Yordanova (in press).
We obtained our source files from the Duke Databank of Documentary Papyri (DDbDP),1 a corpus of non-literary Greek and Latin texts written on papyri, ostraca, or wooden tablets, encoded in TEI EpiDoc XML (Elliott, Bodard, Cayless, et al., 2006). Based on the encoded information about the texts’ modern editorial treatment, we split each file into two versions, one containing the plain transcribed text and the other the editorial corrections and regularizations (for details, see Vierros, 2018; Vierros & Henriksson, 2017). We then annotated both versions separately. For example, if the original document has the Greek word moi ‘me’ (dative) and the editor has corrected it to mou ‘me’ (genitive), we have annotated both forms. The two versions are encoded in our dataset using XML attributes of <word> elements suffixed with _orig and _reg (e.g., postag_orig and postag_reg).
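As a minimal sketch of how the paired attributes might be read, the following Python snippet parses a hypothetical sentence fragment and lists the attribute pairs whose original and regularized values differ. Only the <word> element name and the postag_orig/postag_reg attribute names come from the description above; the <sentence> wrapper, the form_orig/form_reg attributes, and the tag values are assumptions made for illustration and may not match the released files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only <word>, postag_orig and postag_reg are attested
# in the text above; all other names and values are illustrative assumptions.
SAMPLE = """
<sentence id="1">
  <word id="1" form_orig="ἀδελφός" form_reg="ἀδελφός" postag_orig="n-s---mn-" postag_reg="n-s---mn-"/>
  <word id="2" form_orig="μοι" form_reg="μου" postag_orig="p-s---md-" postag_reg="p-s---mg-"/>
</sentence>
"""


def differing_words(xml_text):
    """Return (word id, attribute stem, original value, regularized value)
    for every *_orig/*_reg attribute pair whose two values differ."""
    root = ET.fromstring(xml_text.strip())
    diffs = []
    for word in root.iter("word"):
        for name, orig_value in word.attrib.items():
            if not name.endswith("_orig"):
                continue
            stem = name[: -len("_orig")]
            reg_value = word.get(stem + "_reg")
            # Report only pairs where the editor's version departs from the original.
            if reg_value is not None and reg_value != orig_value:
                diffs.append((word.get("id"), stem, orig_value, reg_value))
    return diffs


for row in differing_words(SAMPLE):
    print(row)
```

In this toy fragment only the second word is reported, once for its form and once for its morphological tag, mirroring the moi/mou example above.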
The annotation has been done at the PapyGreek website2 using an embedded Arethusa Treebank editor (Perseids Project and Alpheios Project, Ltd.).3 We have used the morphological analyses provided by Morpheus (Crane, 1991), an automatic morphological tagger which ships with Arethusa, as well as those produced by Alek Keersmaekers4 using machine learning techniques (Keersmaekers & Depauw, in press), making corrections as necessary. Syntactic annotation has been done manually.
The dataset also includes various kinds of metadata. Dating and location metadata has been retrieved from the Heidelberger Gesamtverzeichnis der Griechischen Papyrusurkunden Ägyptens, or HGV (part of the idp.data repository). References to the source HGV and DDbDP files can be found in the <document_meta> element of each file. In addition, we have described the persons associated with producing each text: authors (e.g., the sender of a letter), writers (who penned the text), addressees, and external scribal officials. We use our own designated person IDs for these persons, but when available, we have added the Trismegistos (TM) Person identifiers as well (e.g., Depauw & Van Beek, 2009). TM identifiers are not available for anonymous scribes; because their handwriting may be recognised across different documents, we consider it worthwhile to mark them. We have also described the documents’ text types (not indicated in the source files). The document types are listed on three levels: hypercategory (e.g., law), category (e.g., contract) and subcategory (e.g., marriage). We plan to make this typology compatible with the data of the project Everyday Writing in Graeco-Roman and Late Antique Egypt: A Socio-Semiotic Study of Communicative Variation (EVWRIT).5
Although the dataset is in XML, we use a MySQL database to store and process our data on our server, with separate tables for documents, sentences and words. Working with a relational database rather than XML files has two main advantages: the database is easier to update incrementally, and indexed SQL tables are arguably better suited for data queries, the development of which is central to the PapyGreek project (see Section 4 below). The dataset’s XML files have been generated using SQL table joins and the Python package lxml.6
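The export step described above can be illustrated roughly as follows. This is a sketch under stated assumptions rather than the project's actual code: an in-memory SQLite database stands in for the MySQL server, the standard-library xml.etree.ElementTree stands in for lxml, and all table, column, element and attribute names are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumption: SQLite replaces MySQL here so that the example is self-contained.
# The schema below is hypothetical and much simpler than a real one would be.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sentences (id INTEGER PRIMARY KEY, document_id INTEGER, position INTEGER);
CREATE TABLE words (id INTEGER PRIMARY KEY, sentence_id INTEGER, position INTEGER,
                    form_orig TEXT, form_reg TEXT);
INSERT INTO documents VALUES (1, 'example-document');
INSERT INTO sentences VALUES (10, 1, 1);
INSERT INTO words VALUES (100, 10, 1, 'ἀδελφός', 'ἀδελφός'),
                         (101, 10, 2, 'μοι', 'μου');
""")


def export_document(conn, document_id):
    """Join the three tables and build one XML tree for a single document."""
    rows = conn.execute(
        """
        SELECT d.name, s.id, w.position, w.form_orig, w.form_reg
        FROM documents AS d
        JOIN sentences AS s ON s.document_id = d.id
        JOIN words AS w ON w.sentence_id = s.id
        WHERE d.id = ?
        ORDER BY s.position, w.position
        """,
        (document_id,),
    ).fetchall()

    root = ET.Element("treebank")
    sentence_nodes = {}
    for name, sentence_id, position, form_orig, form_reg in rows:
        root.set("document", name)
        # Create each <sentence> node once, the first time one of its words appears.
        if sentence_id not in sentence_nodes:
            sentence_nodes[sentence_id] = ET.SubElement(
                root, "sentence", id=str(sentence_id)
            )
        ET.SubElement(
            sentence_nodes[sentence_id], "word",
            id=str(position), form_orig=form_orig, form_reg=form_reg,
        )
    return root


tree = export_document(conn, 1)
print(ET.tostring(tree, encoding="unicode"))
```

Keeping the relational tables as the working copy and regenerating the XML on demand means that a correction to a single word only touches one row, which is the incremental-update advantage mentioned above.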
The DDbDP contains many different types of documents. For the first phase, we have chosen to annotate mostly private and business letters (ca. 25K tokens) and petitions (ca. 11K tokens). We have selected the texts from certain archives in order to have a known context for the texts and the people within (e.g., the Zenon Archive, the Archive of Katochoi of the Sarapieion, the Athenodoros Archive, the ostraca from Mons Claudianus and a selection of women’s letters from different time periods). The present release (v1.01) focuses on the BCE period (ca. 32K tokens out of the total of 44K). Later versions are planned to cover a larger time frame and range of text types.
We have followed the Ancient Greek Dependency Treebank (AGDT) Guidelines 2.0 (Celano, 2014b), which build on version 1.1 (Bamman & Crane, 2008) and originate in the Prague Dependency Treebank (see e.g., Celano, 2019; Hajič, 1998). We have not applied the advanced syntactic/semantic layer of AGDT Guidelines 2.0. Additional PapyGreek Guidelines can be found in a separate document in the data repository. Each text has gone through a human review process.7
3 Dataset Description
Format names and versions
Creation dates
2019-09-06 to 2021-06-23.
Dataset creators
Marja Vierros (reviewer, annotator); Erik Henriksson (developer); Polina Yordanova (reviewer, annotator); Arttu Alaranta (annotator); Petri Lahtinen (annotator); Lauri Marjamäki (annotator); Jamie Vesterinen (annotator); Iida Huitula (annotator); Sari Kock (annotator). Affiliation of all: University of Helsinki.
Language
Original texts: Ancient Greek. Other: English.
License
CC BY-SA 4.0
4 Reuse Potential
Ancient Greek documentary sources have not previously been linguistically annotated with a review process (Vierros, 2018, pp. 105–106), which makes the PapyGreek treebanks a valuable resource for the study of Postclassical Greek (e.g., historical morphology and syntax, linguistic variation, and historical sociolinguistics). The present release (v1.01) contains less material from the CE centuries and is thus mostly indicative of Greek usage in the early Postclassical period. The genres of private and administrative letters and petitions represent the everyday uses of the language well, including both the formulaic and narrative parts of petitions and the private language use in letters.
Word order, an understudied topic in Greek syntax, has been fruitfully investigated with treebanked data (Mambrini & Passarotti, 2013), but better-studied questions may also benefit from being revisited quantitatively with treebanks (e.g., Celano, 2014a; Mambrini, 2019). A PapyGreek search interface8 developed by Erik Henriksson combines orthographic queries with morphosyntactic ones, making it a powerful tool to study morphosyntactic variation in tandem with phonology.
Universal Dependencies and NLP
Treebanks using the Ancient Greek Dependency Grammar specification can be converted into other formats, such as Universal Dependencies (UD), and thus be used together with other languages or corpora. We decided to use the AGDT formalism instead of UD because we wish our data to be directly comparable to other genres of Ancient Greek data, for which treebanked data existed only in AGDT when we started. Due to the simplicity of the schemata used, converting XML treebanks from one formalism to another is potentially a trivial task (Celano, 2019, p. 283); but see Cecchini, Korkiakangas, and Passarotti (2020) for an example of a more complicated transition.
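To give a rough idea of the simplest part of such a conversion, the Python sketch below maps the part-of-speech position of an AGDT-style morphological tag to a UD universal POS label. It assumes that the first character of the nine-position tag encodes the part of speech; the mapping table is our own simplification, not an official conversion table. A real conversion would also need lemma and syntactic information (e.g., to separate AUX, SCONJ and PROPN) and would have to restructure the dependency trees themselves, none of which is attempted here.

```python
# Assumption: position 1 of the nine-position AGDT postag encodes the part of
# speech. The target labels are UD UPOS tags; the mapping is a simplification.
AGDT_TO_UPOS = {
    "n": "NOUN",   # noun (proper nouns are not distinguished here)
    "v": "VERB",   # verb (auxiliaries are not distinguished here)
    "a": "ADJ",    # adjective
    "d": "ADV",    # adverb
    "l": "DET",    # article
    "g": "PART",   # particle
    "c": "CCONJ",  # conjunction (subordinators would need SCONJ)
    "r": "ADP",    # preposition
    "p": "PRON",   # pronoun
    "m": "NUM",    # numeral
    "i": "INTJ",   # interjection
    "u": "PUNCT",  # punctuation
}


def agdt_postag_to_upos(postag):
    """Return a coarse UPOS label for an AGDT-style postag, or 'X' if unknown."""
    if not postag:
        return "X"
    return AGDT_TO_UPOS.get(postag[0], "X")


for tag in ("p-s---md-", "n-s---mn-", ""):
    print(repr(tag), agdt_postag_to_upos(tag))
```

Falling back to the UD label X for empty or unrecognized tags keeps the function total, so that gaps in the annotation surface in the output instead of raising errors.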
Treebanks can also be used as training data for automatic lemmatizers and morphosyntactic parsers. We hope that the PapyGreek treebanks will be used for such purposes, in particular for the development of automatic parsers of non-standard documentary Greek (see Keersmaekers, Mercelis, Swaelens, & Van Hal, 2019; Mambrini & Passarotti, 2012). Treebanked data can also serve as a basis for automatic semantic role labeling (Keersmaekers, 2020).
Ancient Greek historical prose treebanks (V. B. Gorman, 2020) have been used in, e.g., stylometric and authorship attribution studies (R. Gorman, 2019; V. B. Gorman & R. J. Gorman, 2016). A paper on authorship attribution, which utilizes the PapyGreek treebanks’ person metadata (see Section 2 above), is being prepared by the current authors. The results, we hope, will be useful in evaluating the practicality of authorship attribution based on short and fragmentary texts.
Treebanking has proven to be an effective way to teach ancient languages (V. B. Gorman, 2021; Mambrini, 2016). The annotation software developed by the Perseids Project9 and the Alpheios Project, Ltd.10 is freely available, which promotes equal access to learning Greek and Latin worldwide. The PapyGreek website is likewise open for all to learn papyrological Greek by using (and even creating new) treebanks. A pivotal part of the PapyGreek project is the development of an online Grammar of Postclassical Greek based on the treebanked material, which will serve as a valuable teaching resource.
Thus far, the size of the PapyGreek treebanks repository is limited because semi-manual annotation and review are costly. The token count is 44K, which amounts to ca. 1.6% of the total number of tokens in the DDbDP texts that are “treebankable” (ca. 2.8M tokens; we have counted, e.g., lists and labels as “not treebankable”). Even small-scale vetted data, however, are a valuable resource for improving the accuracy of automatic parsers.