(1) Overview

Repository location

GitHub: https://github.com/UgaritAlignment/Alignment-Gold-Standards

Zenodo: Palladino, Shamsian & Yousef (); Palladino, Wright & Yousef (); d’Orange Ferreira, Ferreira dos Reis & Yousef ().

Ugarit: The alignments can also be visualized on the Ugarit Website (https://ugarit.ialigner.com/).

Context

Data were produced as part of the research illustrated in Yousef et al. (), Yousef et al. (), and Yousef (). They are currently used for the following purposes:

  • Evaluating the performance of the Ugarit Automatic Alignment Model (https://huggingface.co/UGARIT/grc-alignment).
  • Producing aligned datasets in the Ugarit Alignment Editor (https://ugarit.ialigner.com/).
  • Creating new aligned corpora with the Grc-Por Guidelines in the project “Letras clássicas digitais: interligando línguas antigas ao português e aprimorando um modelo automático de alinhamento de tradução” (Digital classics: linking ancient languages to Portuguese and enhancing an automatic model for translation alignment).

(2) Methods

Creating the Datasets and the Guidelines

Each dataset was created by two domain experts, who designed the Guidelines and annotated the corpus. The Grc-Eng Guidelines were created first and served as a model for the other two. Prior to aligning the corpus, the two domain experts wrote a first draft of the Guidelines. They then aligned a subset of the corpus to test the general consistency and feasibility of the Guidelines. Each new issue encountered in this phase was briefly discussed, and a preferred annotation style was agreed upon. After the subset was completed, the experts aligned the rest of the corpus without further discussion. The corpus was aligned using the Ugarit translation alignment editor (https://ugarit.ialigner.com/) (Figures 1, 2).

Figure 1 

An example of a paragraph from Xenophon, Cyropaedia, aligned on Ugarit as part of the Gold Standard for Grc-Eng.

Figure 2 

An example of a fragment from the Digital Fragmenta Historicorum Graecorum, aligned on Ugarit as part of the Gold Standard for Grc-Lat.

The Grc-Por and Grc-Eng datasets consist of 2,010 words from the Iliad, 1,829 words from Plato’s Crito, and 1,520 words from Xenophon’s Cyropaedia, with the corresponding translations (; ; ; ; ; ). The corpus for Grc-Lat includes 100 fragments from the Digital Fragmenta Historicorum Graecorum Project (; ), with the corresponding Latin translation by philologist Karl Müller ().

Quality Control

To test the reliability and consistency of the resulting Guidelines and Gold Standard, we measured inter-annotator agreement (IAA) over each dataset. Agreement is counted when both annotators align the same pair of tokens or when both annotators leave a token unaligned. Multi-word alignments (1–N, N–1, N–N) are flattened into 1–1 pairs. Let A1 and A2 be the sets of flattened translation pairs created by each annotator, and I the intersection between them. Writing |X| for the number of pairs in a set X, we calculate IAA as follows:

IAA = 2 × |I| / (|A1| + |A2|)

The resulting IAA was measured at 90.5% for Grc-Lat, 86.08% for Grc-Eng, and 83.31% for Grc-Por.
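The IAA calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that each annotation is given as a list of groups, each pairing a list of source token indices with a list of target token indices. The names flatten and iaa are hypothetical, and the example annotations are invented. The text does not specify how unaligned tokens enter the sets, so the sketch only covers aligned pairs.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[int, int]  # (source token index, target token index)
Group = Tuple[List[int], List[int]]  # one alignment: source ids, target ids


def flatten(groups: Iterable[Group]) -> Set[Pair]:
    """Flatten 1-N, N-1 and N-N alignments into a set of 1-1 pairs."""
    pairs: Set[Pair] = set()
    for source_ids, target_ids in groups:
        for s in source_ids:
            for t in target_ids:
                pairs.add((s, t))
    return pairs


def iaa(a1: Set[Pair], a2: Set[Pair]) -> float:
    """Compute 2 * |I| / (|A1| + |A2|), with I the intersection of A1 and A2."""
    total = len(a1) + len(a2)
    if total == 0:
        return 1.0  # assumption: two empty annotations count as full agreement
    return 2 * len(a1 & a2) / total


# Invented example: annotator 1 links source tokens 1 and 2 to target token 1.
annotator_1 = flatten([([0], [0]), ([1, 2], [1])])
annotator_2 = flatten([([0], [0]), ([1], [1]), ([3], [2])])
score = iaa(annotator_1, annotator_2)
print(f"IAA = {score:.4f}")
```

In this invented example each annotator produces three flattened pairs and two of them are shared, so the score is 2 × 2 / (3 + 3), roughly 0.667.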

(3) Dataset description

Object name

Folders:

grc-eng

grc-lat

grc-por

Each folder contains:

  • alignment_source_target.txt
  • target.txt
  • source-target-goldstandards.json
  • source.txt
  • guidelines_source-target.pdf
  • text_source_target.txt

Format names and versions

We exported the Gold Standards in NAACL format (). This format allows the retrieval of translation pairs and their corresponding sentences, and it makes the corpus of parallel sentences available alongside the word-level alignments. Each dataset consists of the following files:

  • Source sentences: source.txt (e.g. grc.txt).
  • Target sentences: target.txt (e.g. eng.txt).
  • The file source-target-goldstandards.json (e.g. grc-eng-goldstandards.json) contains the Gold Standard in JSON format. It provides the complete aligned sentences and translation pairs. Each entry in the JSON file includes two aligned source and target sentences and the complete sequence of translation pairs with unique IDs and link types, as they appear in the translation pairs file.
  • The file alignment_source_target.txt (e.g. alignment_grc_eng.txt) contains the Gold Standard as a list of translation pairs. Each line in the file corresponds to a parallel sentence in the corpus. Each translation pair is identified through a source token ID within the source sentence, and a target token ID within the target sentence. Each translation pair is given a link type, S for Sure and P for Possible.
  • The file text_source_target.txt (e.g. text_grc_eng.txt) contains the parallel sentences used in the Gold Standard, one pair per line, concatenated with the symbol |||.
  • Alignment Guidelines are available as PDF files, in English and Portuguese, named according to the pattern guidelines_source-target.pdf.
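As an illustration of the parallel-sentence file described in the list above, the following Python sketch splits each line on the ||| symbol. It is a sketch under stated assumptions rather than a reference reader: the function names are hypothetical, the files are assumed to be UTF-8 encoded, tokens are assumed to be separated by whitespace, and the example line is invented rather than taken from the dataset. The per-line layout of the alignment file is not detailed here, so it is not parsed.

```python
from typing import Iterator, List, Tuple

SEPARATOR = "|||"  # symbol joining source and target sentence on each line


def parse_parallel_line(line: str) -> Tuple[List[str], List[str]]:
    """Split one line into source tokens and target tokens."""
    source, separator, target = line.rstrip("\n").partition(SEPARATOR)
    if not separator:
        raise ValueError(f"missing {SEPARATOR!r} separator in line: {line!r}")
    # Assumption: tokens are separated by whitespace.
    return source.split(), target.split()


def read_parallel_file(path: str) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (source tokens, target tokens) for each non-empty line."""
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_parallel_line(line)


# Invented example line, not copied from the dataset.
example = "μῆνιν ἄειδε θεά ||| Sing , goddess , the wrath"
source_tokens, target_tokens = parse_parallel_line(example)
print(source_tokens)
print(target_tokens)
```

With a local copy of the repository, a call such as read_parallel_file("grc-eng/text_grc_eng.txt") would iterate over the sentence pairs, assuming the folder and file names listed above.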

Creation dates

Start: 2022-01-19

End: 2022-11-14 (first release)

Dataset Creators and Contributions

  1. Chiara Palladino: Grc-Eng, Grc-Lat Guidelines, dataset creation, conceptualization, data curation, writing – original draft, writing – revision.
  2. Farnoosh Shamsian: Grc-Eng Guidelines, dataset creation, conceptualization, data curation.
  3. Tariq Yousef: Gold Standard and IAA Calculation, formal analysis, software.
  4. David J. Wright: Grc-Lat Guidelines, dataset creation, data curation.
  5. Anise d’Orange Ferreira: Grc-Por Guidelines, dataset creation, data curation.
  6. Michel Ferreira dos Reis: Grc-Por Guidelines, dataset creation, data curation.

Languages

Ancient Greek; Latin; English; Portuguese

License

Creative Commons Attribution 4.0 International https://creativecommons.org/licenses/by/4.0/legalcode

Publication date

2022-11-14

(4) Reuse potential

Word-level manual alignments are extremely rare and challenging to create. Moreover, while there are some Alignment Guidelines available for modern languages (see https://ugarit.ialigner.com/guidelines.php for a partial list), these are currently the first ones explicitly addressing an ancient corpus, which has a set of specific problems usually not considered in modern languages. They provide an important contribution to NLP, but most importantly they define a workflow and the conceptual foundations to develop more aligned corpora of ancient and historical languages with their translations. Below are some suggested applications to reuse these datasets.

The Gold Standards provide a reliable, high-quality dataset to test and train automatic translation alignment models. They are essential for evaluating such models, serving as references against which predictions can be compared to assess quality and reliability. We used them to evaluate the first transformer-based Translation Alignment model for ancient languages (; ), which is a multilingual model, but they can also be used to evaluate the performance of monolingual models focusing on Ancient Greek. For example, the Grc-Por dataset and guidelines are currently being reused in the context of the project “Letras clássicas digitais: interligando línguas antigas ao Português e aprimorando um modelo automático de alinhamento de tradução”, to improve the automatic alignment of Ancient Greek to Portuguese.
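One common way to compare a model's predictions with a Gold Standard that distinguishes Sure and Possible links is to compute precision, recall, and Alignment Error Rate (AER). The sketch below uses the widely used definitions of these metrics. The text does not state which metrics were used in the evaluation mentioned above, so this is an assumption for illustration only. The function name evaluate and the example links are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]  # (source token index, target token index)


def evaluate(predicted: Set[Pair], sure: Set[Pair], possible: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, AER) for one set of predicted links."""
    # By convention every Sure link also counts as a Possible link.
    possible = possible | sure
    hits_sure = len(predicted & sure)
    hits_possible = len(predicted & possible)
    precision = hits_possible / len(predicted) if predicted else 0.0
    recall = hits_sure / len(sure) if sure else 0.0
    denominator = len(predicted) + len(sure)
    # Assumption: with no predictions and no Sure links, the error is 0.
    aer = 1 - (hits_sure + hits_possible) / denominator if denominator else 0.0
    return precision, recall, aer


# Invented example: two Sure links, one Possible link, three predictions.
gold_sure = {(0, 0), (1, 1)}
gold_possible = {(2, 1)}
model_output = {(0, 0), (2, 1), (3, 3)}
precision, recall, aer = evaluate(model_output, gold_sure, gold_possible)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")
```

In this invented example, two of the three predicted links are at least Possible (precision 2/3), one of the two Sure links is recovered (recall 1/2), and AER is 1 − (1 + 2) / (3 + 2) = 0.4.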

The aligned corpora can also be used to train AI models, both multilingual and monolingual, to perform other tasks, such as word sense disambiguation, Named Entity Recognition, and annotation projection. Yousef et al. () and Yousef, Palladino & Jänicke () offer insights on how to use these datasets for similar tasks.
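To make the idea of annotation projection concrete, the following Python sketch copies token-level labels from a source sentence to its translation through the aligned pairs. It is a deliberately simple illustration and not the procedure used in the cited works: the function name project_labels, the label scheme, and the example sentence are all hypothetical, and conflicts are resolved by keeping the first label that reaches a target token.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]  # (source token index, target token index)


def project_labels(
    source_labels: List[str],
    pairs: Set[Pair],
    target_length: int,
    default: str = "O",
) -> List[str]:
    """Copy non-default source labels onto aligned target tokens."""
    target_labels = [default] * target_length
    # Sorting makes the result deterministic when several links compete.
    for s, t in sorted(pairs):
        label = source_labels[s]
        if label != default and target_labels[t] == default:
            target_labels[t] = label
    return target_labels


# Invented example: the second source token is tagged as a person name.
source_labels = ["O", "B-PER", "O"]
alignment = {(0, 2), (1, 0), (2, 1)}
projected = project_labels(source_labels, alignment, target_length=3)
print(projected)
```

Here the person label on source token 1 is carried over to target token 0 through the link (1, 0), while the other target tokens keep the default label.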

The Guidelines provide a conceptual reference for people who wish to design or create aligned corpora in ancient languages. Being the first ones that explicitly address ancient texts, they cover issues such as controversial translations, fragmentary evidence, uncertainty, and phenomena connected to inflection. As they are not project-specific, they can be expanded and adapted depending on context. For example, they can be reused by scholars who use translation alignment for research, to provide an out-of-the-box reference to create a consistent corpus (). The Guidelines can also be adapted by teachers who use Ugarit in the classroom to create aligned corpora for tests and assignments (; ; ). They can provide a general reference on how to handle typical linguistic phenomena that tend to be translated inconsistently, such as the genitive absolute, the use of the dative, proverbial expressions, or changes in verbal tense and voice. Using these guidelines, students can also be instructed to create consistent lexicographic indexes generated from the alignment of ancient texts.

Fundamentally, the Guidelines serve as a conceptual reference to create new sets of guidelines for other languages or contexts. Students and teachers can use analogous principles and select relevant phenomena to create a more specific style guide that fits their needs. Moreover, scholars may use the general criteria and strategies illustrated in the guidelines to design new ones. This is currently being done in the project “Creating a corpus of Akkadian inscriptions with Ugarit”, which aims to create an aligned corpus of Akkadian and English, and in the Beyond Translation project for the creation of a Latin-English aligned text of the Bellum Alexandrinum ().