Dependency Treebanks of Ancient Greek Prose

Format names and versions

Creation dates
2014-03-01 to 2019-12-31

Dataset Creators
Vanessa Gorman is the manual annotator of these trees. Original Greek texts came from the Perseus Project [14] and are pre-processed within the Arethusa program at the Perseids Project [13]. Arethusa offers possible lemmas and morphology options, from which the proper form must be selected and, if necessary, corrected or created. The syntactic analysis (relation labeling) is performed manually, except that the files for Demosthenes 1 and 59 are hand-corrected from a computer version pre-parsed by

This dataset is a collection of dependency syntax trees of representative texts from ancient Greek prose authors (Aeschines, Antiphon, Appian, Athenaeus, Demosthenes, Dionysius of Halicarnassus, Herodotus, Josephus, Lysias, Plutarch, Polybius, Thucydides, and Xenophon), totaling to date 550,000+ tokens. It is hand-annotated by one person, using the Arethusa program on the Perseids website. Original texts were obtained from the Perseus Digital Library, and some (as indicated) were computer pre-parsed at the Pedalion Project. The database is stored in a stable form (2019-12-31) on Zenodo (DOI: 10.5281/zenodo.3596076) and in a continuously updated form on GitHub in .xml format (https://vgorman1.github.io/). The repository can be used for pedagogical purposes and for research in linguistic analysis and corpus linguistics, stylistics, natural language processing, classification, and literary and historical analysis.

I made the trees using the Arethusa software on the Perseids website [13]. Original text files were obtained from the Perseus Project [14] (Tufts Univ.) and from the Pedalion Project (KU Leuven). I followed the rules of dependency syntax, employing the standard AGDT 1.1 tagset [2] and refining it according to the discussion of dependency syntax offered by Pinkster [15]. I have not used the 2.0 tagset based on Smyth developed by Celano [4], for two reasons. First, its level of specificity increases the subjectivity of the annotation decisions exponentially, often relying more on semantics than syntax (what is the difference between a partitive genitive and a genitive of material in the phrase 'piece of pie'?). Second, the tagset is specific to Greek, which makes linguistic comparison between languages more difficult.

Sampling strategy
While no formal statistical sampling methods were used, I chose to annotate at least 20,000 tokens each from a variety of Greek prose authors. That figure is roughly the size of an average 'book' by many of these authors, and it yields a sample large enough for statistically meaningful analysis. I have included works from the Classical, Hellenistic, and Roman periods: Aeschines, Antiphon, Appian, Athenaeus, Demosthenes, Dionysius of Halicarnassus, Herodotus, Josephus, Lysias, Plutarch, Polybius, Thucydides, and Xenophon.

Quality Control
The relation labeling follows the general instructions for the AGDT 1.1 tagset given in Bamman and Crane [2]. I have created more detailed instructions for annotating major linguistic phenomena not covered in Bamman and Crane [2] in the 'Treebanking Tips' file within this dataset, relying heavily on the parallel interpretation of dependency syntax offered for Latin by Pinkster [15].

Many recent advances in linguistic knowledge are due to the development of the methods of corpus linguistics. Treebanks such as the ones presented here are a resource for the application of these methods. For example, Greek copular verbs and subject-verb agreement have recently been studied on the basis of annotated dependency data [10,11]. McGillivray and Vatri [12] use treebanks to examine the relationship of acoustic and syntactic information. I intend future work on a valency dictionary of Greek verbs based on this dataset.

Ancient Greek allowed a relatively free word order, and the rules that govern it are not easy to discern. Treebanks offer a powerful tool for discovering those rules. Syntactically and morphologically annotated data allow word order to be studied in a controlled fashion. For example, the frequency of the relative order of a participial indirect object and a nominal direct object can be easily determined and all examples quickly identified. The advantages offered by such specificity are apparent in the recent literature [3,4,8].
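As an illustration of this kind of query, the following is a minimal sketch in Python. It does not reproduce the specific participial/nominal comparison mentioned above; it answers a simpler, related question: how often does a token labeled OBJ precede or follow its verbal head? It assumes an AGDT-style XML layout in which each `<sentence>` contains `<word>` elements with `id`, `head`, `relation`, and `postag` attributes, and in which the first character of `postag` marks the part of speech (`v` for verbs). The two-sentence sample and the function name `object_verb_order` are invented for the example and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented two-sentence sample in an assumed AGDT-style layout.
SAMPLE = """<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἀνὴρ" lemma="ἀνήρ" postag="n-s---mn-" relation="SBJ" head="4"/>
    <word id="3" form="ἵππον" lemma="ἵππος" postag="n-s---ma-" relation="OBJ" head="4"/>
    <word id="4" form="ἔχει" lemma="ἔχω" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
  <sentence id="2">
    <word id="1" form="λέγει" lemma="λέγω" postag="v3spia---" relation="PRED" head="0"/>
    <word id="2" form="ταῦτα" lemma="οὗτος" postag="p-p---na-" relation="OBJ" head="1"/>
  </sentence>
</treebank>"""


def object_verb_order(root):
    """Count OBJ tokens that precede (OV) or follow (VO) their verbal head."""
    counts = Counter()
    for sentence in root.iter("sentence"):
        # Index the words of this sentence by id so heads can be looked up.
        words = {w.get("id"): w for w in sentence.iter("word")}
        for word in words.values():
            if word.get("relation") != "OBJ":
                continue
            head = words.get(word.get("head"))
            # Skip objects whose head is missing or is not tagged as a verb.
            if head is None or not (head.get("postag") or "").startswith("v"):
                continue
            before = int(word.get("id")) < int(head.get("id"))
            counts["OV" if before else "VO"] += 1
    return counts


root = ET.fromstring(SAMPLE)
print(object_verb_order(root))
```

To run the same count over a file from the repository, `ET.parse(path).getroot()` could replace `ET.fromstring(SAMPLE)`, provided the file follows the layout assumed here. Narrowing the query to particular cases or parts of speech would mean adding tests on further `postag` positions.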

Natural language processing
Accurate machine parsing of natural language syntax is a high priority among computer scientists. Success in this area requires a sufficiently large corpus of accurately annotated texts to provide both training and testing data. This dataset is unusual in that it is a corpus manually annotated by one person and represents primarily Attic and Atticizing Greek, rather than the non-standard poetic or dialectal Greek of other collections (AGDT or PROIEL). It thus offers the opportunity to develop and evaluate algorithms for a standard dialect with relatively complex morphology and frequent discontinuous syntactic structures [9].
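Evaluation of a parser against hand-annotated trees is commonly reported as attachment scores. The sketch below, in Python, computes the unlabeled attachment score (UAS: the share of tokens with the correct head) and the labeled attachment score (LAS: the share with both the correct head and the correct relation label). The input representation, one `(head, relation)` pair per token, and the function name `attachment_scores` are assumptions made for the example; the gold and predicted analyses are invented.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, relation) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    # UAS: the head matches, whatever the relation label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head and the relation label match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / total, las_hits / total


# Invented four-token example: one mislabeled token and one wrong head.
gold = [(2, "ATR"), (4, "SBJ"), (4, "OBJ"), (0, "PRED")]
predicted = [(2, "ATR"), (4, "OBJ"), (3, "OBJ"), (0, "PRED")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In practice the gold pairs would be read from the held-out portion of the treebank and the predicted pairs from the parser's output for the same tokens.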

Classification
Categorizing uncertain texts is one of the original concerns of the digital humanities. Most studies in this area have relied on various measures of vocabulary richness for their criteria of analysis, while the value of syntactic information has been discounted. In contrast, recent work shows that the morpho-syntactic data provided by the present database may significantly improve the results in some classification problems, such as author attribution [5,6]. I am pursuing future studies on issues surrounding dubious passages and the level of stylometric variation within any one author.
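To suggest what morpho-syntactic features for classification might look like, here is a minimal sketch in Python. It is not the method of the studies cited above. It turns a treebank file into a profile of relative frequencies of (part of speech, relation label) pairs, which could then serve as one input to a classifier. It assumes an AGDT-style XML layout with `postag` and `relation` attributes on each `<word>` element; the sample sentence and the function name `morphosyntactic_profile` are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented one-sentence sample in an assumed AGDT-style layout.
SAMPLE_TEXT = """<treebank>
  <sentence id="1">
    <word id="1" form="ὁ" lemma="ὁ" postag="l-s---mn-" relation="ATR" head="2"/>
    <word id="2" form="ἀνὴρ" lemma="ἀνήρ" postag="n-s---mn-" relation="SBJ" head="4"/>
    <word id="3" form="ἵππον" lemma="ἵππος" postag="n-s---ma-" relation="OBJ" head="4"/>
    <word id="4" form="ἔχει" lemma="ἔχω" postag="v3spia---" relation="PRED" head="0"/>
  </sentence>
</treebank>"""


def morphosyntactic_profile(xml_text):
    """Map each (part of speech, relation) pair to its relative frequency."""
    root = ET.fromstring(xml_text)
    pairs = Counter()
    for word in root.iter("word"):
        # Fall back to "-" when an attribute is missing or empty.
        postag = word.get("postag") or "-"
        relation = word.get("relation") or "-"
        pairs[(postag[0], relation)] += 1
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


profile = morphosyntactic_profile(SAMPLE_TEXT)
for pair, share in sorted(profile.items()):
    print(pair, round(share, 3))
```

Profiles computed for texts of known authorship could be compared with the profile of a disputed text using any standard distance measure or classifier; the choice of features and measure is left open here.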

Literary and historiographical analysis
Applied to single texts, classification methods using morpho-syntactic annotation can reveal divisions and segmentation invisible to the unaided eye. Investigation of these units may lead to a deeper understanding of, inter alia, the compositional structure of the work.

Pedagogy
The availability of a large database of syntactically annotated sentences is an important asset for students of the ancient Greek language. The structures posited by dependency grammar are close to the kinds of grammatical analysis traditional in Greek pedagogy. This similarity makes them more helpful to students than, e.g., phrase structure trees would be. In addition, recent software, such as the Alpheios browser extension, allows the trees in this corpus to be combined with vocabulary glosses and links to a standard reference grammar [1]. The result is an on-line reading environment that guides students word-by-word through definitions, morphology, and syntax (e.g., https://vgorman1.github.io/Greek-Language-Class/ [7, in progress]).

Limitations
The principal limitation of this repository lies in the human element. Just as different people make different decisions in annotating specific structures, so also the same human annotator may change her mind over time. We lack detailed instructions on specific annotation choices. I am compiling just such documentation, a very preliminary version of which can be viewed in the repository ('Treebanking Tips').