1 Overview

Repository location

The dataset can be found on GitHub at https://github.com/quadrama/knowledge-annotation as well as on Zenodo at https://doi.org/10.5281/zenodo.8319261.


This dataset was collected in the project Q:TRACK – Quantitative Drama Analytics: Tracking Character Knowledge. Q:TRACK starts from the observation that a play’s dramatic characters can have different levels of awareness of certain information. Hence, the transmission and distribution of knowledge is a central object of study for drama analysis. In his Poetics, Aristotle emphasizes the importance of so-called anagnorisis: recognition scenes in which a character, for instance, recognizes a long-lost relative and all previous events appear in a new light (). The “discrepant awareness” () between different characters, or between characters and the audience, can propel the plot of a play, create suspense, and thus greatly contribute to the play’s effect (; ; ). Therefore, the project aims to systematically model and track the distribution of knowledge in plays through annotation.

This adds to existing research in the field of computational literary studies where characters and their relationships in plays have recently gained attention (; ; ; ). The knowledge distribution can also be used to specify the character interactions with regard to network analysis (). Character relationships are also covered in a dataset by Massey, Xia, Bamman, and Smith () that is based on English narratives. However, they do not distinguish between the diverging and developing knowledge of individual characters and work with text summaries only.

We restricted the annotation to the domain of knowledge about character relations, as this domain is key in many plays. In Johann Gottlob Benjamin Pfeil’s tragedy Lucie Woodvil (1756), for instance, the main character Lucie learns too late that her lover and the father of her unborn child is also her brother. The annotated relations in our dataset include family relations (parent_of(A, B), child_of(B, A), siblings(B, C) …), love relations (in_love_with(B, D), engaged(B, D), spouses(B, D) …), questions of identity (identity(A, E), has_name(A, ‘name’)), and death (dead(A), murderer_of(B, A)). In addition to the knowledge itself, the annotations contain information about the source and target of each knowledge transfer. This results in the following tag structure:


SOURCE is the character that passes on the knowledge, TARGET is one or several characters that receive the knowledge, and KNOWLEDGE specifies the knowledge itself as one of the character relations described above. Optional attributes make it possible to include additional information, e.g., whether SOURCE is lying or whether the information is still uncertain (see details in , in German).
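As a rough illustration of the three components described above, a single knowledge transfer could be modeled as follows. This is only a sketch: the class names, field names, and attribute keys (`source_is_lying`, `uncertain`) are assumptions made for this example and do not reproduce the dataset's actual file schema or the CorefAnnotator format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A character relation such as parent_of(A, B)."""
    predicate: str
    arguments: tuple[str, ...]

    def __str__(self) -> str:
        return f"{self.predicate}({', '.join(self.arguments)})"


@dataclass
class KnowledgeTransfer:
    """One annotated transfer: who tells whom which relation."""
    source: str                      # character passing on the knowledge
    targets: list[str]               # one or several receiving characters
    knowledge: Relation              # the relation being communicated
    # Optional attributes; the keys used below are hypothetical.
    attributes: dict[str, bool] = field(default_factory=dict)


transfer = KnowledgeTransfer(
    source="A",
    targets=["B", "C"],
    knowledge=Relation("parent_of", ("A", "B")),
    attributes={"source_is_lying": False, "uncertain": True},
)
print(transfer.source, "->", transfer.targets, ":", transfer.knowledge)
```

Keeping the relation as a separate, immutable object makes it easy to compare what different characters know without regard to who told them.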

2 Method

This dataset was created by manual annotation using the tool CorefAnnotator ().


The dataset comprises the 30 German plays listed in Table 1, with a total size of 736,808 tokens (including all utterances as well as stage directions). The plays were retrieved in the TEI-XML format from the Drama Corpora Project () and imported into CorefAnnotator. The data were annotated in three rounds:

Table 1

List of all plays included in the corpus.


No. | Author | Title | Year
1 | Brentano, C. | Ponce de Leon | 1803
2 | von Eichendorff, J. | Die Freier | 1833
3 | Gellert, C. F. | Die zärtlichen Schwestern | 1747
4 | Goethe, J. W. | Die natürliche Tochter | 1803
5 | Goethe, J. W. | Iphigenie auf Tauris | 1787
6 | Goethe, J. W. | Stella | 1776
7 | Goethe, J. W. | Clavigo | 1774
8 | Gottsched, L. A. V. | Das Testament | 1745
9 | Grillparzer, F. | Die Ahnfrau | 1817
10 | von Günderode, K. | Magie und Schicksal | 1805
11 | von Günderode, K. | Udohla | 1805
12 | Hauptmann, G. | Vor Sonnenaufgang | 1889
13 | Hebbel, F. | Maria Magdalene | 1844
14 | von Hofmannsthal, H. | Der Rosenkavalier | 1911
15 | von Hofmannsthal, H. | Elektra | 1903
16 | von Kleist, H. | Familie Schroffenstein | 1803
17 | Klinger, F. M. | Die Zwillinge | 1776
18 | Lenz, J. M. R. | Der Hofmeister | 1774
19 | Lessing, G. E. | Nathan der Weise | 1779
20 | Lessing, G. E. | Emilia Galotti | 1772
21 | Lessing, G. E. | Miß Sara Sampson | 1755
22 | Pfeil, J. G. B. | Lucie Woodvil | 1756
23 | Schiller, F. | Die Braut von Messina | 1803
24 | Schiller, F. | Die Räuber | 1781
25 | Schiller, F. | Maria Stuart | 1800
26 | Schlegel, J. E. | Canut | 1746
27 | Schnitzler, A. | Komtesse Mizzi oder Der Familientag | 1909
28 | Wagner, H. L. | Die Kindermörderin | 1776
29 | Wagner, R. | Die Walküre | 1853
30 | von Weißenthurn, J. | Das Manuscript | 1817

  1. In the initial round, 16 plays were annotated by two annotators following a preliminary guideline. Issues were discussed with one of the authors and, where necessary, with the whole team. This process resulted in the final annotation guideline (, in German).
  2. In the second round, the other 14 plays were annotated independently following the guideline. These plays were used to calculate the inter-annotator agreement using the measure gamma by Mathet, Widlöcher, and Métivier (), as presented (and discussed critically) in Andresen, Krautter, Pagel, and Reiter ().
  3. In a final round, every play was discussed and double-checked by at least one annotator. In this round, three more relations were added for murder, death, and pregnancy.

The final version of the corpus (round 3) comprises 37 files: for seven plays, both annotators performed the last step of finalizing the annotations, which resulted in two final versions of these plays. We decided to keep two versions instead of creating a single gold standard, because in many cases more than one way of annotating the play was justified (see below). In total, there are 1277 annotated text passages, which corresponds to an average of 34.5 annotations per file, with a considerable standard deviation of 18.8.
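The file count and the average reported above follow directly from the figures given in the text. The short check below uses only those figures. The standard deviation cannot be reproduced this way, since it depends on the per-file counts, which are not listed here.

```python
# Figures taken from the text; variable names are illustrative only.
n_plays = 30               # plays in the corpus (Table 1)
n_double_finalized = 7     # plays finalized by both annotators
total_annotations = 1277   # annotated text passages in round 3

# Each doubly finalized play contributes one additional file.
n_files = n_plays + n_double_finalized
mean_per_file = total_annotations / n_files

print(f"files: {n_files}, mean annotations per file: {mean_per_file:.1f}")
```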

Sampling strategy

The plays were manually selected to cover

  • plays of which we knew that knowledge about character relations is important for the plot, as well as plays where this was not the case,
  • tragedies as well as comedies,
  • plays from different literary epochs (the plays listed in Table 1 date from 1745 to 1911).

Accordingly, the dataset is not designed to be representative of a specific group of texts, but to cover a wide range of relevant phenomena.

Quality control

All plays were annotated by two people independently, making it possible to calculate the inter-annotator agreement. The agreement is rather low for many of the plays (see Table 2). This is due to the high complexity of the task and its dependence on interpretation: in many cases, more than one way of modeling the data is plausible. In addition, measuring inter-annotator agreement in a way that makes the scores comparable to other studies is challenging for annotations without predefined annotation spans. See Andresen et al. () for a more in-depth discussion and the repository for more detailed scores. We publish several versions of each annotation as well as the annotation guidelines (, in German) for comparability and transparency.

Table 2

IAA scores (Gamma) for the 14 texts of annotation round 2. For the unlabeled scores, only the position of annotations is taken into account. For the labeled scores, position and labels are considered.


Play | Unlabeled | Labeled
Brentano: Ponce de Leon | 0.576 | 0.355
Eichendorff: Die Freier | 0.573 | 0.375
Gellert: Die zärtlichen Schwestern | 0.474 | 0.476
Goethe: Clavigo | 0.427 | 0.438
Gottsched: Das Testament | 0.401 | 0.290
Günderrode: Magie und Schicksal | 0.536 | 0.428
Günderrode: Udohla | 0.467 | 0.194
Hauptmann: Vor Sonnenaufgang | 0.644 | 0.493
Lessing: Miß Sara Sampson | 0.531 | 0.362
Schiller: Maria Stuart | 0.651 | 0.496
Schlegel: Canut | 0.519 | 0.431
Wagner: Die Kindermörderin | 0.493 | 0.410
Wagner: Die Walküre | 0.602 | 0.400
Weißenthurn: Das Manuscript | 0.634 | 0.510
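
To get a quick overview of the scores in Table 2, one could summarize them as in the sketch below. It assumes that the first score in each row is the unlabeled gamma and the second the labeled gamma, following the order in the caption. The script only aggregates the published values and does not recompute gamma itself.

```python
from statistics import mean

# (play, unlabeled gamma, labeled gamma), copied from Table 2.
# Assumption: column order follows the caption (unlabeled, then labeled).
iaa_scores = [
    ("Brentano: Ponce de Leon", 0.576, 0.355),
    ("Eichendorff: Die Freier", 0.573, 0.375),
    ("Gellert: Die zärtlichen Schwestern", 0.474, 0.476),
    ("Goethe: Clavigo", 0.427, 0.438),
    ("Gottsched: Das Testament", 0.401, 0.290),
    ("Günderrode: Magie und Schicksal", 0.536, 0.428),
    ("Günderrode: Udohla", 0.467, 0.194),
    ("Hauptmann: Vor Sonnenaufgang", 0.644, 0.493),
    ("Lessing: Miß Sara Sampson", 0.531, 0.362),
    ("Schiller: Maria Stuart", 0.651, 0.496),
    ("Schlegel: Canut", 0.519, 0.431),
    ("Wagner: Die Kindermörderin", 0.493, 0.410),
    ("Wagner: Die Walküre", 0.602, 0.400),
    ("Weißenthurn: Das Manuscript", 0.634, 0.510),
]

mean_unlabeled = mean(unlabeled for _, unlabeled, _ in iaa_scores)
mean_labeled = mean(labeled for _, _, labeled in iaa_scores)
# Play with the lowest labeled agreement.
lowest_labeled = min(iaa_scores, key=lambda row: row[2])

print(f"mean unlabeled gamma: {mean_unlabeled:.3f}")
print(f"mean labeled gamma:   {mean_labeled:.3f}")
print(f"lowest labeled score: {lowest_labeled[0]} ({lowest_labeled[2]:.3f})")
```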


3 Dataset Description

Object name: quadrama/knowledge-annotation

Format names and versions: CSV, JSON, ca2z (a compressed data format used by CorefAnnotator)

Creation dates: 2020-11-01 to 2023-08-02

Dataset creators: Melanie Andresen (University of Stuttgart), Benjamin Krautter (University of Cologne), Janis Pagel (University of Cologne), Nils Reiter (University of Cologne), Christian Lantzinger (student assistant, University of Stuttgart), and Jonas Hirner (student assistant, University of Stuttgart).

Language: The plays in the dataset are in German; the annotation labels and variable names are in English.

License: CC-BY-4.0

Repository name: GitHub, Zenodo

Publication date: 2023-09-05

4 Reuse Potential

The dataset can be reused in a number of ways. Literary scholars might take the data as a starting point for a systematic analysis of knowing and not-knowing, knowledge distribution, and knowledge transmission between characters in one or several individual plays. This is often considered a crucial piece of information for the interpretation of dramatic texts (; ). Horstmann () has proposed strengthening theater studies narratologically by including focalization, understood as relations of knowledge, in the analysis. Analyses of individual plays can be supported by visualizing the data, as we have suggested in Andresen, Krautter, Pagel, and Reiter () and Andresen et al. ().

Quantitative analyses of the frequency of specific types of knowledge transfers, for instance, are limited by the size of the dataset, but are still possible on a small scale. This allows insights into which relations are discussed most often, which characters are the most important for knowledge transfer and similar questions. The annotations could also be aligned with the attempt to model character relationships based on topic modeling as presented in Iyyer, Guha, Chaturvedi, Boyd-Graber, and Daumé III ().
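
A small-scale frequency analysis of this kind might look like the following sketch. The records are invented for illustration and use the placeholder characters A to D from the relation examples in the overview. The tuple layout (source, target, relation) is an assumption and not the column layout of the published CSV or JSON files.

```python
from collections import Counter

# Hypothetical knowledge transfers: (source, target, relation label).
transfers = [
    ("A", "B", "parent_of"),
    ("A", "C", "parent_of"),
    ("B", "D", "in_love_with"),
    ("C", "B", "siblings"),
    ("A", "D", "dead"),
]

# Which relations are discussed most often?
relation_counts = Counter(relation for _, _, relation in transfers)
# Which characters pass on knowledge most often?
source_counts = Counter(source for source, _, _ in transfers)

print("relations:", relation_counts.most_common())
print("sources:  ", source_counts.most_common())
```

With real data, the same counting could be done per play or per act to see where knowledge transfers cluster.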

To solve the problem of data scarcity in the long term, the dataset can be used as training and/or test data for attempts to automate this type of annotation, for instance by prompting large language models (; ). As we provide the annotations of two annotators for most plays, the data can also be used to investigate annotation disagreement. One may investigate if annotation disagreements point to ambiguous and potentially crucial text passages or look into the causes of disagreements (; ).
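
As a first step toward studying disagreement, the two annotators' versions of a play could be compared as sets, as sketched below. The annotations are invented, and identifying each annotation by a (scene, relation) pair is a simplifying assumption: the real annotations have no predefined spans, so exact matching like this is much cruder than the gamma measure used for Table 2.

```python
# Hypothetical annotations for one play, keyed by (scene, relation).
annotator_1 = {
    ("1.2", "parent_of(A, B)"),
    ("2.1", "in_love_with(B, D)"),
    ("3.4", "dead(A)"),
}
annotator_2 = {
    ("1.2", "parent_of(A, B)"),
    ("2.1", "engaged(B, D)"),
    ("3.4", "dead(A)"),
    ("4.1", "siblings(B, C)"),
}

shared = annotator_1 & annotator_2   # annotated identically by both
only_1 = annotator_1 - annotator_2   # found only by annotator 1
only_2 = annotator_2 - annotator_1   # found only by annotator 2

# Jaccard overlap as a simple, span-agnostic similarity score.
jaccard = len(shared) / len(annotator_1 | annotator_2)

print("shared:", sorted(shared))
print("only annotator 1:", sorted(only_1))
print("only annotator 2:", sorted(only_2))
print(f"Jaccard overlap: {jaccard:.2f}")
```

The passages in `only_1` and `only_2` are the candidates worth close reading, for example scene 2.1 here, where the annotators agree that a love relation is communicated but label it differently.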