A Full Morphosyntactic Annotation of the State Archives of Assyria Letter Corpus

Matthew Ong

1 Overview

Repository location

https://doi.org/10.5281/zenodo.10622983

Context

The royal archives of the late Neo-Assyrian kings (8th–7th century BCE) are an important source for our understanding of the Neo-Assyrian empire. They contain a wide variety of texts ranging from treaty tablets and legal documents to prophecies, ritual instructions, and court literature. Over the past four decades, much from these archives has been published in the State Archives of Assyria (SAA) volumes at the University of Helsinki, and in more recent years has appeared digitally under the Munich Open-access Cuneiform Corpus Initiative (LMU Munich) as the State Archives of Assyria online (SAAo).

In particular, the letters from these archives constitute the largest subgroup (some 2,600 of the more than 5,000 texts published so far) and are a valuable resource for reconstructing aspects of late imperial administration, royal ideology, biographies of notable palace authorities, and social history.

The letters published in SAAo span a period from the reign of Tiglath-pileser III (r. 744–727) down to Sin-šarri-iškun (r. 627?–612). However, most are dated to the reigns of Sargon II (r. 721–705), Sennacherib (r. 704–681), Esarhaddon (r. 680–669), and Assurbanipal (r. 668–627). The letters are published across ten volumes of the SAAo series, namely SAA 1, 5, 10, 13, 15, 16, 17, 18, 19, and 21. To keep the dataset fairly homogeneous in form and content, we exclude SAA 8 (astrological reports to the king), since such reports differ in both respects from the letter corpus proper. The small number of literary letters found in SAA 3 (Court Poetry and Literary Miscellanea) are likewise excluded because of their special use context.

Importantly, the letters in SAAo are one of the best resources for the study of the Neo-Assyrian dialect, a vernacular form of Akkadian different from the Neo-Babylonian dialect to the south as well as older forms of Akkadian such as Old Babylonian. The letters do employ various formulas and fixed expressions as a matter of genre, particularly in the introductory sections or when introducing new topics (see , and ). However, in comparison to texts from the same period that are written in a more conservative style (such as royal inscriptions and liturgical works), the letters feature grammatical constructions, linguistic forms, and discourse patterns more evocative of spoken discourse, in addition to reflecting influence from Aramaic (, ). Moreover, the letters involve communication between individuals or groups of different social backgrounds, undertaken for a variety of aims. Combined with the fact that we have dossiers on many of these individuals and can often date the letters to a particular ruler’s reign, the letter corpus offers opportunities for sociolinguistic, historical linguistic, and stylistic studies of late-stage Akkadian. Studies of this sort have been done previously by hand, but they can now also be done with the help of this dataset.

Most of the letters are written in the Neo-Assyrian dialect; a minority are written in Neo-Babylonian. The difference in dialect goes hand in hand with a difference in script, since scribes trained in the Assyrian tradition used a different form of cuneiform from those trained in the Babylonian tradition. Some letters written by the most educated scribes at the Assyrian court show code switching ().

2 Method

Generating the morphosyntactic annotations

The morphosyntactic annotations for the letter corpus were created using the same process described in and . At the most abstract level, this centered on a cyclic bootstrapping procedure illustrated in Figure 1. We first trained a spaCy language model () on an initial batch of manually generated annotations in CONLLU format, then applied the model to a new group of letters to yield a set of imperfect or incomplete annotations. This set of annotations was corrected and completed by hand using Inception (), and then added to the initial training data. The language model was then retrained on the extended training set, resulting in a slight improvement in model performance, and hence faster hand-annotation of the next batch of letters in Inception. This process was repeated until all letters were annotated. The result is a set of CONLLU files, one for each letter. For the convenience of those wishing to train their own spaCy model on this data, the CONLLU files were also converted to SPACY binary files.

Figure 1

Abstract pipeline involving bootstrapping.
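
The cycle in Figure 1 can be sketched in a few lines of Python. This is only a schematic illustration under stated assumptions, not the author's pipeline: the function bootstrap_annotate and the three callables it takes (train, predict, correct) are hypothetical names, and the toy stand-ins below replace spaCy training and manual correction in Inception with trivial functions so that the loop can be run as written.

```python
from typing import Callable, List, Sequence, TypeVar

Text = TypeVar("Text")
Annotation = TypeVar("Annotation")
Model = TypeVar("Model")


def bootstrap_annotate(
    seed: List[Annotation],
    batches: Sequence[Sequence[Text]],
    train: Callable[[List[Annotation]], Model],
    predict: Callable[[Model, Text], Annotation],
    correct: Callable[[Annotation], Annotation],
) -> List[Annotation]:
    """Grow a corrected annotation set one batch at a time (hypothetical helper)."""
    gold = list(seed)
    for batch in batches:
        # Retrain on everything that has been corrected so far.
        model = train(gold)
        # Apply the model to the next batch to get imperfect draft annotations.
        drafts = [predict(model, text) for text in batch]
        # Correct the drafts (a manual step in practice) and add them to the gold set.
        gold.extend(correct(draft) for draft in drafts)
    return gold


# Toy stand-ins, for illustration only: the "model" is the set of forms seen so far.
def toy_train(gold):
    return {form for annotation in gold for form, _ in annotation}


def toy_predict(model, text):
    return [(form, "KNOWN" if form in model else "?") for form in text.split()]


def toy_correct(draft):
    return [(form, "CHECKED") for form, _ in draft]


result = bootstrap_annotate(
    seed=[[("a", "CHECKED")]],
    batches=[["a b"], ["b c"]],
    train=toy_train,
    predict=toy_predict,
    correct=toy_correct,
)
print(len(result))  # one seed annotation plus one per text in the two batches
```

Passing the training, prediction, and correction steps in as callables keeps the loop itself independent of any particular toolchain; in a real setting they would wrap model training, model application, and the export of hand-corrected files respectively.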

A detailed discussion of the labels and conventions used in the morphosyntactic annotations, including the labels used in the morphological parsing, is provided on the author’s GitHub account. Certain points of grammatical interpretation are also covered there.

In brief, the annotations provide, for each interpretable form in a text, the Universal Dependencies part of speech tag (UPOS), the lemma, and a morphological decomposition expressed as a string of feature-value pairs. For nouns and adjectives, the morphological analysis gives the gender, number, and case of the form, and whether the form is in the bound or free state. For finite verbs, it specifies the person, number, gender, tense, stem, and mood. Suffixes and enclitics attached to head forms are encoded as additional feature-value pairs within the morphological analysis of the head form. Sentence breaks, in most cases, were not marked. The part of speech of each form was taken directly from Oracc metadata. Other features such as verb stem and suffix patterns required annotator judgment, although the Oracc translation of the text was almost always accepted and used in making a decision. Grammatical analyses were largely based on , save that verbal adjective forms in the stative were labeled as verbs.
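
As a rough illustration of how such an annotation line can be read programmatically, the following sketch splits one token line in the standard ten-column CoNLL-U layout and turns its feature string into a dictionary. The helper names (Token, parse_feats, parse_token) are hypothetical, and the example line is invented for illustration: the feature names Case, Gender, and Number follow general Universal Dependencies usage and may not match the labels actually used in the dataset.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns (hypothetical container)."""
    id: str
    form: str
    lemma: str
    upos: str
    feats: Dict[str, str]
    head: str
    deprel: str


def parse_feats(field: str) -> Dict[str, str]:
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if field == "_":
        return {}
    return dict(pair.split("=", 1) for pair in field.split("|"))


def parse_token(line: str) -> Token:
    """Parse one tab-separated CoNLL-U token line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    # Column order: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
    return Token(cols[0], cols[1], cols[2], cols[3],
                 parse_feats(cols[5]), cols[6], cols[7])


# Invented example line, not taken from the dataset.
line = "1\tšarru\tšarru\tNOUN\t_\tCase=Nom|Gender=Masc|Number=Sing\t2\tnsubj\t_\t_"
tok = parse_token(line)
print(tok.lemma, tok.upos, tok.feats)
```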

The most valuable part of the annotations is perhaps the set of universal dependency relations between interpretable forms. As discussed in , predicting syntactic dependencies is the most difficult task for our spaCy model, and the area where most of the manual correction and completion takes place. At the same time, it is still rare to find digitized Akkadian corpora annotated with syntactic parses, or language models trained to produce them (, ).

Illustrations of how these features were marked within Inception and the underlying CONLLU file are given in Figures 2 and 3.

Figure 2

Morphosyntactic annotation of SAAo letter in Inception.

Figure 3

Morphosyntactic annotation of SAAo letter in CONLLU format.
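
For readers who want to inspect the dependency relations outside Inception, the sketch below lists each form in a CoNLL-U block together with its relation label and the form of its head. It relies only on the standard CoNLL-U column layout; the function name dependency_triples is hypothetical, and the two-token sample is invented rather than taken from the dataset.

```python
from typing import List, Tuple


def dependency_triples(block: str) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) for each token in a block."""
    rows = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
        rows.append(cols)
    # Map token IDs to forms; HEAD value 0 marks the root of the tree.
    forms = {cols[0]: cols[1] for cols in rows}
    forms["0"] = "ROOT"
    return [(cols[1], cols[7], forms.get(cols[6], "?")) for cols in rows]


# Invented sample, not taken from the dataset.
sample = "\n".join([
    "# text = šarru illik",
    "1\tšarru\tšarru\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tillik\talāku\tVERB\t_\t_\t0\troot\t_\t_",
])
triples = dependency_triples(sample)
for dependent, relation, head in triples:
    print(f"{dependent} --{relation}--> {head}")
```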

Converting the annotations to linked open data

All the CONLLU files representing letter annotations were concatenated into a single file, with a comment line above each section indicating which text it represents. This combined file was converted to RDF Turtle format (TTL) using the Java package conll-rdf. The value of converting the annotations to linked open data is that they may then be easily searched via SPARQL queries for various morphological and syntactic features (see for an example). They can also be loaded into graph databases such as Neo4j, which enable even more sophisticated graph queries.
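
The concatenation step can be approximated as follows. This is a minimal sketch and not the script actually used: the function name concatenate_conllu and the comment key "# text_id" are assumptions made for illustration, and the demonstration writes two invented one-token files to a temporary directory so that no real data is touched.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def concatenate_conllu(paths: Iterable[Path], out_path: Path) -> int:
    """Write all input files into one file, each preceded by a comment line.

    The '# text_id = ...' key is a hypothetical convention; CoNLL-U only
    requires that comment lines begin with '#'.
    """
    files = sorted(paths)  # resolve the inputs before the output file is created
    with out_path.open("w", encoding="utf-8") as out:
        for path in files:
            body = path.read_text(encoding="utf-8").strip("\n")
            # A blank line after each section keeps the sections separate.
            out.write(f"# text_id = {path.stem}\n{body}\n\n")
    return len(files)


# Demonstration on invented files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "letter_a.conllu").write_text(
        "1\tšarru\tšarru\tNOUN\t_\t_\t0\troot\t_\t_\n", encoding="utf-8")
    (tmp_dir / "letter_b.conllu").write_text(
        "1\tillik\talāku\tVERB\t_\t_\t0\troot\t_\t_\n", encoding="utf-8")
    combined_path = tmp_dir / "combined.txt"
    n_files = concatenate_conllu(tmp_dir.glob("*.conllu"), combined_path)
    combined = combined_path.read_text(encoding="utf-8")

print(n_files)
print(combined)
```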

Generating the metadata

The process for providing metadata for the letters is similar to that in . Preexisting metadata was extracted from SAAo catalogue files and combined in a CSV file. This metadata includes sender, recipient, estimated date of composition, script, and dialect of Akkadian (if determinable).

3 Dataset Description

Object name

A Full Morphosyntactic Annotation of the State Archives of Assyria Letter Corpus.

Format names and versions

CSV, CONLLU, SPACY, TTL, TXT

Creation dates

2022-09-01–2024-02-01

Dataset creator

Matthew Ong (UC Berkeley) was responsible for all aspects of the project.

Language

English, Akkadian

License

Creative Commons Attribution-ShareAlike 4.0

Repository name

Zenodo

Publication date

2023-08-28

4 Reuse Potential

The dataset (both morphosyntactic annotations and metadata) can be used in a variety of ways. Scholars may search the normalized letter corpus for text patterns conforming to any number of syntactic, morphological, or even phonological features (i.e. spelling), provided those features are marked in the annotations and the query itself can be expressed in SPARQL (or in another graph query language, such as Cypher for Neo4j). This goes beyond the current search capabilities of the letter corpus on Oracc, which are largely based on keywords. When the metadata is incorporated into this search, one may also begin to search for sociolectal, idiolectal, topolectal, and other linguistic patterns in the letters that have so far escaped human readers.
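
To give a sense of what combining annotations with metadata might look like, the sketch below counts tokens with a given part of speech per dialect. It is a plain-Python stand-in and not the SPARQL route described above. Everything specific in it is an assumption made for illustration: the CSV column names text_id and dialect, the "# text_id" comment key, the function name count_upos_by_dialect, and the sample data, none of which is taken from the dataset.

```python
import csv
import io
from collections import Counter

# Invented metadata with hypothetical column names.
METADATA_CSV = """text_id,dialect
letter_a,Neo-Assyrian
letter_b,Neo-Babylonian
"""

# Invented annotations; each section starts with a hypothetical '# text_id' comment.
ANNOTATIONS = "\n".join([
    "# text_id = letter_a",
    "1\tšarru\tšarru\tNOUN\t_\tCase=Nom\t2\tnsubj\t_\t_",
    "2\tillik\talāku\tVERB\t_\t_\t0\troot\t_\t_",
    "",
    "# text_id = letter_b",
    "1\tšarri\tšarru\tNOUN\t_\tCase=Gen\t0\troot\t_\t_",
    "",
])


def count_upos_by_dialect(metadata_csv: str, conllu: str, upos: str) -> Counter:
    """Count tokens with the given UPOS tag, grouped by the dialect of their text."""
    reader = csv.DictReader(io.StringIO(metadata_csv))
    dialect_of = {row["text_id"]: row["dialect"] for row in reader}
    counts: Counter = Counter()
    current = None
    marker = "# text_id = "
    for line in conllu.splitlines():
        if line.startswith(marker):
            current = line[len(marker):].strip()  # start of a new text section
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 10 and cols[3] == upos:
                counts[dialect_of.get(current, "unknown")] += 1
    return counts


noun_counts = count_upos_by_dialect(METADATA_CSV, ANNOTATIONS, "NOUN")
verb_counts = count_upos_by_dialect(METADATA_CSV, ANNOTATIONS, "VERB")
print(dict(noun_counts))
print(dict(verb_counts))
```

The same grouping could in principle be expressed as a query over the linked open data version once metadata and annotations share text identifiers; the dictionary lookup here simply plays the role of that join.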

The methods used to generate the annotations (as described in ) can be applied to other lemmatized Akkadian corpora on Oracc. Researchers wishing to annotate such texts in conjunction with spaCy model training may benefit from the spaCy Akkadian language package being developed by the author. The package is still under development and currently works only for normalized texts. However, one of its most useful features at present is the ability to correctly tokenize a large number of lexicalized construct phrases in Oracc (encoded by a ‘long dash’ in the online edition).

Finally, the process of converting CONLLU annotations to RDF triples allows the data to be integrated into other linked open data projects, particularly those involving other annotated corpora or Mesopotamian culture more generally.

Journal of Open Humanities Data

Data Papers