1 Introduction

In this article we present a curated dataset of 38,947 2D photographs of Sumerian and Akkadian cuneiform tablets with their accompanying transcriptions in cuneiform Unicode – as well as lineart, transliterations, and metadata specifying attributes like period and genre. In contrast to the data provided by digital libraries, which offer general access to cuneiform texts, our dataset was designed from the outset for machine learning, with an emphasis on consistency of format. We therefore developed the dataset with strict preprocessing and filtering criteria, and we present preliminary baseline experiments for three classification tasks supported by our data: period, provenance, and genre prediction, conditioned on major face cutouts from tablet photographs.

The CuneiML dataset was produced by processing photographs and transliterations available online in the Cuneiform Digital Library Initiative (CDLI) (). This library gives access to 56,694 photographs of inscribed objects classified by time period, genre, provenance, and museum collection. Current digitized cuneiform archives like CDLI were designed as portals where experts can consult photographs of inscribed objects (tablets, seals, inscriptions, etc.), transliterations, dictionaries, and other working tools. Although the CDLI is an extraordinary resource that has proven invaluable for Assyriologists, it offers its data in a format that is not well suited to machine learning experiments. CDLI photographs are of varied quality: some are high-resolution while others do not meet the minimum requirements for machine learning tasks. In addition, CDLI images are composite, meaning they contain multiple perspectives of the same object: the front, back, and sides of a tablet (see Figure 1). Another issue is that the transliterations of Sumerian and Akkadian that accompany CDLI images lack a rendering into cuneiform Unicode. Thus, with machine learning in mind, we have meticulously filtered and processed the CDLI corpus, isolating the most salient fragments of 38,947 high-resolution composite images, and have tokenized and converted the Latin transliterations of Sumerian and Akkadian into cuneiform Unicode. This latter part of the dataset retains the original polysemy of the cuneiform sign, a feature that neural network architectures like transformers are capable of capturing ().

Figure 1 

An overview of CuneiML. An example tablet of ID 453248 with multi-modal data: (1) Metadata consisting of time period, provenience, genre, and measurements. (2) High-resolution 2D photograph of six faces. (3) Lineart from paleographers. (4) Latin transliteration downloaded directly from CDLI. (5) Cuneiform Unicode transcription that we automatically converted from the Latin transliteration. (6) Major face cutouts automatically processed from the 2D photograph.

Although 3D scans are preferable to 2D photographs, to our knowledge there is only one existing open-access 3D dataset of cuneiform tablets, and it is limited in both size and historical coverage. The Hilprecht – Heidelberg Cuneiform Benchmark Dataset for the Hilprecht Collection (HeiCuBeDa) contains 3D scans of only 1,977 tablets, restricted to merely four historical periods, namely ED IIIb (ca. 2500-2340 BCE), Ur III (ca. 2100-2000 BCE), Old Assyrian (ca. 1950-1850 BCE), and Old Babylonian (ca. 1900-1600 BCE) (). In an ideal world, all cuneiform tablets in museums would be 3D scanned, but such a scenario is not foreseeable in the near future. Based on existing photographs collected by museums and scholars, our data provide roughly 20 times more tablets, covering almost the entire history of cuneiform writing, as well as richer metadata, linearts (hand drawings made by modern scholars), and transcriptions into cuneiform Unicode.

Our dataset supports the development of a variety of machine learning tools – for example, the training and evaluation of automatic classifiers that predict period, genre, or provenance from an artifact’s photograph, its Unicode transcription, or a transliteration of that transcription, as well as end-to-end automatic transcription systems from lineart or photographs (). Next, we provide summary information about our dataset, followed by a description of how it was collected and processed. Finally, we include initial experiments with baseline classifiers on three classification tasks supported by our dataset.

2 Dataset description

As we stated in the introduction, our aim is to curate a dataset that can support the development of novel machine learning tools for cuneiform. Our data consist of composite 2D photographs of tablets, as well as their major face cutouts, lineart, transliterations, transcriptions into cuneiform Unicode, and metadata. An example is shown in Figure 1. We also plot histograms for time period (Figure 2), genre, and provenience.

Figure 2 

Number of tablets by metadata attributes: time period, genre, and provenance.

2.1 Summary

Below, we provide a brief summary of metadata for CuneiML:

Object name CuneiML_v1.0.tar.gz, tran.

Format names and versions JPEG and JSON.

Creation dates 2023-09-01

Dataset creators Danlu Chen, Aditi Agarwal, Taylor Berg-Kirkpatrick, Jacobo Myerston.

Language Sumerian and Akkadian.

License CC BY-NC 4.0.

Repository name https://doi.org/10.5281/zenodo.8307503

Publication date 2023-09-01

3 Methods

Our processing methods for dataset creation break down into several phases. First, we systematically scrape composite 2D photographs, transliterations, and artifact metadata for all artifacts represented in CDLI. Next, we process the composite 2D photographs in order to split them into images of individual tablet faces. Finally, we develop a set of de-transliteration rules for converting each transliteration into Unicode. Throughout this process, we automatically filter out unusual and rare forms of artifacts that could introduce spurious correlations into the prediction tasks supported by our dataset. In the next sections, we describe each phase of the pipeline in more detail. For implementation details, please check out the code repository at https://github.com/taineleau/CuneiML.

3.1 Downloading the metadata from CDLI

We used the CDLI GitHub repository to obtain the public catalog data containing a list of P-numbers for all tablets. A “P-number” is a six-digit identifier prefixed with the letter P that the CDLI initiative uses to uniquely identify each artifact. Using the P-number of every tablet, we then crawled 2D images, lineart images, and metadata, including transliterations, from CDLI. We gathered a total of 133,923 tablets from CDLI, of which 56,694 come with a 2D photograph and 52,637 come with a lineart image.
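
As an illustration of this phase, the sketch below downloads composite photographs by P-number. It is a minimal sketch only: the catalog column name ("id_text") and the image URL template are assumptions for illustration, and the exact endpoints we used are documented in the code repository.

```python
import csv
import requests

# Hypothetical URL template for composite tablet photographs; the actual
# CDLI endpoints may differ (see the code repository for the ones we used).
PHOTO_URL = "https://cdli.ucla.edu/dl/photo/{pnum}.jpg"

def download_photo(pnum: str, out_dir: str = "photos") -> bool:
    """Fetch the composite photograph for one P-number, if it exists."""
    resp = requests.get(PHOTO_URL.format(pnum=pnum), timeout=30)
    if resp.status_code != 200:
        return False  # no photograph available for this artifact
    with open(f"{out_dir}/{pnum}.jpg", "wb") as fh:
        fh.write(resp.content)
    return True

# The public catalog is a CSV with one row per artifact; "id_text" is assumed
# here to hold the numeric part of the P-number.
with open("cdli_catalogue.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        download_photo("P" + row["id_text"].zfill(6))
```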

3.2 Data sampling and filtering

Most 2D photographs from CDLI are high-resolution color images showing six faces of a single tablet. Since our goal is to create a dataset for training and evaluating machine learning systems, consistency is paramount: if non-standard examples are included (e.g. black-and-white images when the majority of the dataset is in full color), machine learning systems may learn to leverage these features as false predictors (e.g. if most black-and-white images tend to come from the same period due to how they were collected, a classifier will learn to depend on this spurious correlation). Thus, we filtered out non-standard and low-quality images according to several conditions:

  • The image is black-and-white.
  • The resolution of the image is lower than 100×100 px.
  • The tablet is in poor condition, e.g. too many fragments or barely readable cuneiform.
  • The image does not contain a well-defined major face.

We also filtered out artifact types such as cones, cylinders, and prisms, and only kept entries whose artifact type is tablet. After this processing, we end up with 38,947 tablets with high-quality 2D images.
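
A minimal sketch of how such filters can be applied is shown below. The metadata field name artifact_type and the grayscale heuristic are illustrative assumptions; the condition-based checks (poor tablet condition, absence of a well-defined major face) are handled separately.

```python
from PIL import Image
import numpy as np

MIN_SIDE = 100  # minimum acceptable image resolution in pixels

def is_grayscale(img: Image.Image, tol: int = 5) -> bool:
    """Heuristic: treat an image as black-and-white if its RGB channels
    are (nearly) identical everywhere."""
    arr = np.asarray(img.convert("RGB"), dtype=np.int16)
    return (np.abs(arr[..., 0] - arr[..., 1]).max() <= tol
            and np.abs(arr[..., 1] - arr[..., 2]).max() <= tol)

def keep_entry(meta: dict, image_path: str) -> bool:
    """Apply the tablet-only, resolution, and color filters to one entry."""
    if meta.get("artifact_type", "").lower() != "tablet":
        return False
    img = Image.open(image_path)
    if min(img.size) < MIN_SIDE:
        return False
    if is_grayscale(img):
        return False
    return True
```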

3.3 Cutting out the major faces

An additional issue for machine learning development arises from the variability of the imaging setup used to capture the raw composite photographs from CDLI. While the majority of composite photographs consist of six tablet faces, arranged in a fixed layout from a consistent camera angle, these properties vary to some extent based on when and where imaging was performed. In order to increase consistency, and therefore reduce the danger of overfitting to false correlations that are dependent on the imaging process itself, we systematically extract cutouts of the major tablet face in each composite and include this as an additional, more consistent, photographic representation for machine learning systems. Specifically, we build and test three different computer vision methods to segment and obtain individual faces of each tablet. As a way of validating and producing final extractions, we reconcile differences between the bounding boxes each system produces by computing the area of their overlap. When the area is large, the methods are in agreement and our output is reliably high-quality. The three methods are described briefly as follows.

  1. Connected component segmentation. We first convert the images into black and white where the background is black. This is a classical rule-based segmentation that clusters adjacent pixels of the same color (see the sketch after this list). We use OpenCV’s implementation cv.connectedComponents().
  2. Watershed segmentation. We first convert the images to grayscale. The watershed algorithm views a grayscale image as a topographic surface where high intensities denote hills and low intensities denote valleys. Each valley is labelled with a different color of water. As the water rises, unknown pixels are colored and therefore clustered. We use OpenCV’s implementation cv.watershed().
  3. Segment Anything. This is a state-of-the-art general-purpose neural segmentation model. We use the official toolkit () with default model weights to obtain cutouts.
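
The following is a minimal sketch of the first, connected-component approach, assuming (as is typical for CDLI composites) that the tablet faces are bright against a dark background; the threshold settings and the "largest component is the major face" heuristic are illustrative rather than the exact settings we used.

```python
import cv2

def face_boxes_connected_components(image_path: str, min_area: int = 10_000):
    """Method (1): binarize the composite photograph so the background is
    black, then return bounding boxes of large connected components
    (candidate tablet faces), largest first."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding; the tablet faces end up white on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = []
    for i in range(1, n):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((x, y, w, h))
    # Heuristic: the major face is the component with the largest bounding box.
    return sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)
```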

Quality checking We automatically cut the images using the three methods and only keep tablet cutouts for which the overlapping area between methods is larger than 90%. We sampled 100 images at random to validate the cutouts; 97% met our quality requirements. Figure 3 shows a sample of 20 major cutouts produced by our algorithm.
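
A sketch of this agreement check follows; the overlap measure shown (intersection area divided by the smaller box's area) is one reasonable choice and an assumption here, the requirement being only that the methods' outputs overlap by more than 90%.

```python
def overlap_ratio(box_a, box_b) -> float:
    """Overlap between two (x, y, w, h) boxes: intersection area divided
    by the area of the smaller box (one possible definition)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    return (iw * ih) / min(aw * ah, bw * bh)

def methods_agree(boxes, threshold: float = 0.9) -> bool:
    """Keep a cutout only if every pair of method outputs overlaps by >90%."""
    return all(
        overlap_ratio(boxes[i], boxes[j]) >= threshold
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )
```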

Figure 3 

A random sample of 20 major face cutouts.

3.4 Converting transliteration to cuneiform Unicode

CDLI and ORACC offer transliterations of cuneiform tablets, which are sometimes but not always accompanied by photographs and linearts. The transliteration standard used in both projects is called ATF and is explained in detail on the ORACC website. Transliteration is the process of transcribing cuneiform signs into the Latin alphabet, using conventions that have varied over time and can take particular forms in different Assyriological projects. Despite these possible inconsistencies, transliteration has played a crucial role in making Akkadian and Sumerian more accessible to the non-specialist; it has also facilitated the creation of dictionaries and critical editions. From the point of view of data processing, transliteration is also important because it reveals how modern scholars read certain signs that allow multiple interpretations. In this sense, transliteration is a form of disambiguation. Take for example the sign AN, which can be read as “sky” or “god” but can also be interpreted as the syllables an or il. This issue of interpretation occurs with many cuneiform signs that need to be disambiguated so that modern editors can stabilize what seems to them the most plausible reading of a text. A final example may serve to further illustrate this point. In the well-known Epic of Creation or Enūma eliš, the name of the mother of the gods is often spelled with the signs TI and GÉME, a combination of signs that is usually transliterated as Ti-amat and rendered into English as Tiamat. This transliteration somewhat conceals that the goddess is the Sea, something that can be expressed more directly if one transliterates TI GÉME as ti-amtu, the “sea” in Akkadian. Thus, transliteration implies a reduction of the possible choices which were present for an ancient audience but which are concealed from modern readers who use Latinized editions of cuneiform texts.

One possible issue with using the transliteration directly for machine learning is circular reasoning: the transliteration itself might exhibit bias towards specific time periods and other attributes – e.g., an expert’s approach to transliterating a tablet is already influenced by preconceived notions about its time period. Thus, as an additional layer in our dataset, we produce and provide cuneiform Unicode conversions of the original Latin transliterations. Specifically, we follow the ATF convention to remove some of the editorial marks, tokenize, and map the transliteration into machine-readable cuneiform Unicode format. We use cuneifyplus to map the Latin transliteration to cuneiform Unicode. If a Latinized sign cannot be resolved this way, we then query eBL’s sign list to obtain the cuneiform Unicode. We briefly describe the rules here; an illustrative sketch follows the list.

  1. Uncertainty. The question mark (?) placed after a grapheme indicates uncertainty, and the asterisk (*) indicates a collated reading. We remove these marks but keep the grapheme by default.
  2. Breakage. The $ sign represents breakage, sometimes also indicating how many lines are broken. If the number is recorded, we insert the same number of <LB>. E.g. 2 lines broken → <BREAK><LB><BREAK><LB>. Moreover, the annotation […] indicates missing signs. We also insert a special token <BREAK> to indicate the missing content.
  3. Compound words. We remove the markers of compound words. E.g. |SU.KUR| → su-kur.
  4. Reading. sudx(|SU.KUR|) means the reading is sudx, while the signs are su-kur; we remove the reading and only keep the actual signs for tokenization.
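
The sketch below illustrates rules 1–4 on a single ATF line before the sign-by-sign Unicode lookup; the regular expressions are simplified for illustration and do not cover every ATF construct.

```python
import re

BREAK, LB = "<BREAK>", "<LB>"

def clean_atf_line(line: str) -> list:
    """Apply simplified versions of rules 1-4 to one ATF transliteration line
    and return whitespace-separated tokens ready for Unicode lookup."""
    # Rule 2: "$ 2 lines broken" -> repeated <BREAK><LB> tokens.
    m = re.match(r"\$\s*(\d+)\s+lines?\s+broken", line)
    if m:
        return [BREAK, LB] * int(m.group(1))
    # Rule 2: bracketed gaps [...] become a single <BREAK> token.
    line = re.sub(r"\[(\.\.\.|…)\]", f" {BREAK} ", line)
    # Rule 1: drop uncertainty (?) and collation (*) marks, keep the grapheme.
    line = re.sub(r"[?*]", "", line)
    # Rule 4: sudx(|SU.KUR|) -> keep only the actual signs, su-kur.
    line = re.sub(r"\w+\(\|([^|]+)\|\)",
                  lambda g: g.group(1).lower().replace(".", "-"), line)
    # Rule 3: remove compound-word markers, |SU.KUR| -> su-kur.
    line = re.sub(r"\|([^|]+)\|",
                  lambda g: g.group(1).lower().replace(".", "-"), line)
    return line.split()
```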

Quality checking

We downloaded and extracted a dataset with human-annotated transliteration-transcription pairs from the Akkademia project to use as a validation reference for our method. We took 2,719 lines of human-labeled Latin transliteration/cuneiform Unicode transcription pairs, ran our program to tokenize and convert the transliterations into Unicode transcriptions, and obtained 99% character accuracy against the reference.
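
Character accuracy can be computed in several ways; a minimal sketch based on normalized edit distance (one common definition, and an assumption here rather than the exact metric implementation) is:

```python
def char_accuracy(pred: str, ref: str) -> float:
    """Character accuracy as 1 minus the normalized Levenshtein distance
    between the predicted and reference Unicode strings."""
    prev = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (p != r)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(ref), 1)
```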

4 Potential usage and tasks

As described above, each cuneiform tablet in CuneiML comes with multiple layers of information across several modalities. The potential uses of this dataset for machine learning development can be roughly split into unimodal and multi-modal tasks (the possible inputs and outputs from CuneiML are summarized in Table 1). Beyond the more standard classification tasks like period, genre, and provenance prediction, we suggest several additional examples of potential tasks that our dataset can support. This list is not intended to be exhaustive.

Table 1

Task summary with possible input and output pairs. (1) Metadata consisting of time period, provenience, genre, and measurements. (2) High-resolution 2D photograph of six faces. (3) Lineart from paleographers. (4) Latin transliteration. (5) Cuneiform Unicode transcription. (6) Major face cutouts.


TASK NAME               INPUT               OUTPUT

Language Modeling       (4)(5)              (4)(5)
Transliteration         (5)                 (4)
Lineart generation      (2)(6)              (3)
Attribute prediction    (2)(3)(4)(5)(6)     (1)
Sign identification     (2)(3)(6)           (5)

Unimodal tasks

  • Language modeling. One of the most popular and broadly useful machine learning applications is to train a language model on text in a given domain. Language models trained on our dataset could be used to encode cuneiform Unicode for further processing and analysis, or as a generative prior in related downstream tasks like transcription and restoration (; ).
  • Transliteration. As mentioned above, there are multiple ways to transliterate the same cuneiform sign sequence. Gordin et al. () proposed several models, including HMMs and LSTMs, to automatically transliterate and segment Unicode cuneiform glyphs. Our dataset could be used as further training or validation data for this task.
  • Lineart generation. Analogously, in the image modality, there are potential use cases for automatically “translating” photographic representations into lineart, which can increase the readability of tablets for scholars. This task is structurally similar to image generation tasks in the broader field of computer vision. Following similar techniques, CuneiML could be used to train neural models (; ) capable of accurate lineart generation conditioned on a tablet image.

Multi-modal tasks

  • Attribute prediction. Our dataset supports training and evaluating classifiers for predicting metadata based on images or lineart. The attributes in the metadata include geographical, genre, and chronological attribution ().
  • Sign identification / automatic transcription. A useful but particularly challenging task that our data support is automatic transcription of tablet images or lineart into cuneiform Unicode text. There is very little text-line annotation data for cuneiform tablets, and even with line-level annotations the task is much harder than for documents written on paper. Recently, new page-level end-to-end OCR systems () have been developed that are capable of high-accuracy transcription of more modern languages without line-level annotation. The lineart-cuneiform Unicode parallel data presented in our dataset is an ideal testbed for extending these techniques to more ancient languages.

In the following section, we present preliminary results on attribute prediction tasks using major face images, cuneiform Unicode and Latin transliteration in order to demonstrate a specific use case of our dataset for machine learning.

5 Preliminary experiments with attribute prediction

We analyze the task of attribute prediction for cuneiform and present results for an image classification baseline using deep neural networks. We take three different types of attributes (time period, provenance, and genre) and treat each separately as a target output for an automatic classifier. As shown in Figure 2, the distribution of these attributes is imbalanced and long-tailed; therefore, we discard classes with fewer than 50 examples. We then split the data randomly into training, validation, and testing sets with a ratio of {.9, .05, .05}. In all cases, our image classifier is a pretrained version of ResNet-101 that we continue to optimize on our training set. For the textual features, we train two-layer LSTMs from scratch.
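
As a rough illustration of the image baseline, the sketch below fine-tunes an ImageNet-pretrained ResNet-101 with a new classification head (assuming torchvision ≥ 0.13); the optimizer, learning rate, and other hyperparameters shown are illustrative rather than the exact settings used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes: int) -> nn.Module:
    """ImageNet-pretrained ResNet-101 with a fresh head for attribute labels
    (e.g. 14 time-period classes)."""
    model = models.resnet101(weights="IMAGENET1K_V1")
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_classifier(num_classes=14)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of major-face cutouts."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```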

Result and analysis

Table 2 shows the results of attribute prediction using three different types of input as features. We can see that the image baseline model achieves reasonable accuracy on all three types of attributes. These preliminary results demonstrate the effectiveness of our data for training machine learning systems that make predictions based on tablet images, and further underscore the difficulty of provenience and genre attribution. Note that the major face cutout features seem to have the best overall performance, but it is possible that lighting and camera configurations influence the classification of a tablet. Furthermore, label imbalance and the distribution shift between the training and testing sets remain significant challenges in CuneiML.

Table 2

Summary of test accuracy for attribute prediction using different features.


                  IMAGE    UNICODE    TRANS.    # OF CLASSES

Time period       97.66    90.50      87.17     14
Provenience       85.72    61.71      68.60     25
Genre             89.00    81.50      86.21     12

Further research and analysis are necessary to assess the reliability of the predicted results.