(1) Overview
Repository location
Our dataset is located in a GitHub repository within the forTEXT organisation: https://github.com/forTEXT/EvENT_Dataset. Additionally, this repository is published as a Zenodo dataset.
Context
The annotations were produced as part of the research project EvENT, based at the Technical University Darmstadt and the University of Hamburg. The EvENT project is part of the priority programme Computational Literary Studies (CLS), funded by the German Research Foundation (DFG). For further information, see the programme website: https://dfg-spp-cls.github.io/. We developed an event annotation tagset based on narrative theory, in which events are considered the smallest units of narratives. The tagset was used to annotate the texts by assigning each subclause one of four categories (non-event, stative event, process event, and change of state). Depending on the event type, additional properties were assigned.
(2) Method
The dataset was created by manual annotation in the CATMA tool; the annotation data were processed with the GitMA package.
Steps
The annotation procedure includes the following steps:
- Corpus collection: The six texts were collected from the Textgrid Corpus and the d-Prose corpus. We selected narratives representing the literary developments between 1800 and 1920. To represent the most common narrative genres of this period, we included short stories, novellas and novels. The corpus consists of:
  - Ludwig Tieck (1797): Der blonde Eckbert
  - Heinrich von Kleist (1807): Das Erdbeben in Chili
  - Annette von Droste-Hülshoff (1842): Die Judenbuche
  - Theodor Fontane (1894): Effi Briest
  - Marie von Ebner-Eschenbach (1896): Krambambuli
  - Franz Kafka (1915): Die Verwandlung
- Annotation guidelines: We developed guidelines for the annotation of narratological event types.
- Manual annotation process:
  - Pilot annotations: The annotation guidelines were developed and improved through extensive pilot annotations.
  - Annotator training: Annotators were first trained by annotating and discussing a training text.
  - Systematic annotations: Every text was annotated by two independent annotators (see Table 1). The annotation process was accompanied by regular meetings to discuss cases of doubt. The annotators used a dedicated tag to document these cases.
  - Gold standard annotations: Based on the double annotations of every text, gold standard annotations were created by one annotator who resolved inconsistent annotations (Table 3). Here again, cases of doubt were discussed. In this process, the GitMA package was developed to support the extraction, comparison and integration of annotations in CATMA.
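The comparison behind the gold standard step can be pictured as follows. This is a minimal sketch in plain Python, not the GitMA API: it assumes that both annotators' labels are already aligned and keyed by hypothetical `(start, end)` character offsets, and it only lists the spans that an adjudicator would have to resolve.

```python
def find_disagreements(annotator_a, annotator_b):
    """Return spans that need adjudication, sorted by start offset.

    Both arguments map a (start, end) span to an event type label.
    A span that only one annotator marked is reported with None
    for the other annotator.
    """
    conflicts = []
    for span in sorted(set(annotator_a) | set(annotator_b)):
        label_a = annotator_a.get(span)
        label_b = annotator_b.get(span)
        if label_a != label_b:
            conflicts.append((span, label_a, label_b))
    return conflicts


# Toy data for illustration only, not taken from the dataset.
a = {(0, 12): "process", (13, 30): "stative_event", (31, 40): "non_event"}
b = {(0, 12): "process", (13, 30): "process"}
conflicts = find_disagreements(a, b)
for span, label_a, label_b in conflicts:
    print(span, label_a, label_b)
```

In practice the two annotators' span boundaries may differ, so a real comparison would also need an alignment step before labels can be compared.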
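Table 1: Inter-annotator agreement (Krippendorff’s α) for the event type classification, per text.

| | ECKBERT | EFFI BRIEST | ERDBEBEN | JUDENBUCHE | KRAMBAMBULI | VERWANDLUNG |
|---|---|---|---|---|---|---|
| event type | 0.73 | 0.57 | 0.75 | 0.61 | 0.66 | 0.73 |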
Quality control
The multi-annotator approach, with comprehensive annotator training and the feedback loops described above, was designed to control the quality of the manual annotations. The main annotation task was the classification of event types into four categories:
- non_event
- stative_event
- process
- change_of_state.
Here, we achieved an agreement of Krippendorff’s α > 0.55 for each of the six texts. The inter-annotator agreement (IAA) results for the final annotations are documented in Table 1.
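For readers unfamiliar with the measure, the following sketch computes Krippendorff’s α for nominal labels from a coincidence matrix. It is an illustration under stated assumptions, not the evaluation code used for this dataset: it assumes that the units of comparison are already aligned and that each unit carries one label per annotator. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of label lists, one inner list per annotated unit,
    e.g. [["process", "process"], ["non_event", "process"]].
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit with fewer than two labels cannot be paired
        # every ordered pair of labels in the unit, weighted by 1 / (m - 1)
        for first, second in permutations(labels, 2):
            coincidences[(first, second)] += 1 / (m - 1)

    totals = Counter()
    for (first, _), weight in coincidences.items():
        totals[first] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category occurs, so no disagreement is possible
    return 1 - observed / expected


# Toy example: two annotators, four units, one disagreement.
toy_units = [
    ["process", "process"],
    ["process", "non_event"],
    ["non_event", "non_event"],
    ["non_event", "non_event"],
]
alpha = krippendorff_alpha_nominal(toy_units)
print(round(alpha, 3))
```

Values of α close to 1 indicate near-perfect agreement, 0 corresponds to agreement at chance level, and negative values indicate systematic disagreement.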
Table 2 shows additional event classifications that are also grounded in narrative theory and depend on the event type classification. These categories are implemented as properties of specific event types. For instance, only process events and changes of state can be iterative. As the lower IAA values indicate, some of these categories are highly interpretative. The strongly varying agreement values are also due to the fact that different classification systems are provided for these event properties; the possible values are listed below the table.
Table 2: Inter-annotator agreement for the event properties, per text.

| | ECKBERT | EFFI BRIEST | ERDBEBEN | JUDENBUCHE | KRAMBAMBULI | VERWANDLUNG |
|---|---|---|---|---|---|---|
| unpredictable | –0.25 | –0.30 | –0.08 | –0.35 | –0.21 | –0.55 |
| mental | 0.79 | 0.33 | 0.58 | 0.39 | 0.46 | 0.79 |
| representation_type | 0.94 | 0.87 | 0.86 | 0.91 | 0.86 | 0.67 |
| persistent | 0.09 | 0.13 | 0.28 | –0.14 | 0.25 | –0.89 |
| iterative | 0.62 | 0.20 | –0.29 | 0.35 | 0.07 | 0.70 |
| intentional | 0.75 | 0.24 | 0.45 | 0.43 | 0.32 | 0.70 |
| non_event_type | 0.66 | 0.68 | 0.80 | 0.71 | 0.80 | 0.69 |
- unpredictable: 0, 1, 2, 3, 4
- mental: yes, no
- representation_type: (any combination of) narrator_speech, character_speech, thought_representation
- persistent: 0, 1, 2, 3, 4
- iterative: yes, no
- intentional: yes, no
- non_event_type: conditional_sentence, subjunctive_sentence, modalised_statement, negation, generic_sentence, ellipsis, imperative_sentence, question, request
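The value sets above, together with the rule that only process events and changes of state can be iterative, can be expressed as a small consistency check. This is a sketch under assumptions: the dictionary layout with the keys `event_type` and `properties` is hypothetical and does not reflect the actual structure of the published JSON file. Only the one dependency stated in the text is encoded, and the multi-valued `representation_type` and `non_event_type` are left out for brevity.

```python
# Allowed values per property, copied from the list above.
ALLOWED_VALUES = {
    "unpredictable": {0, 1, 2, 3, 4},
    "mental": {"yes", "no"},
    "persistent": {0, 1, 2, 3, 4},
    "iterative": {"yes", "no"},
    "intentional": {"yes", "no"},
}

# The only dependency stated in the text: which event types can be iterative.
ITERATIVE_EVENT_TYPES = {"process", "change_of_state"}


def validate(annotation):
    """Return a list of problems found in one (hypothetical) annotation dict."""
    problems = []
    event_type = annotation["event_type"]
    properties = annotation.get("properties", {})
    for name, value in properties.items():
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            problems.append(f"{name}: unexpected value {value!r}")
    if "iterative" in properties and event_type not in ITERATIVE_EVENT_TYPES:
        problems.append(f"iterative is not defined for {event_type}")
    return problems


ok = validate({"event_type": "process", "properties": {"iterative": "yes", "persistent": 2}})
bad = validate({"event_type": "stative_event", "properties": {"iterative": "yes"}})
print(ok, bad)
```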
(3) Dataset Description
Object name
Annotations_EvENT.json
Format names and versions
JSON
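Since the internal structure of the JSON file is not described here, a reasonable first step for reuse is to inspect its keys and value types. The helper below is a schema-agnostic sketch; the function name `describe` and the sample document are hypothetical, and the sample keys are not claimed to match the real file.

```python
import json


def describe(node, depth=0, max_depth=3):
    """Return indented lines summarising keys and value types of parsed JSON."""
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}..."]
    if isinstance(node, dict):
        lines = []
        for key, value in node.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(describe(value, depth + 1, max_depth))
        return lines
    if isinstance(node, list):
        if not node:
            return [f"{pad}(empty list)"]
        # only the first element is inspected, assuming a homogeneous list
        header = f"{pad}[list of {len(node)}, first item shown]"
        return [header] + describe(node[0], depth + 1, max_depth)
    return [f"{pad}{type(node).__name__}"]


# Hypothetical sample; the keys are invented for illustration.
sample = json.loads('{"text": "Eckbert", "annotations": [{"tag": "process", "start": 0}]}')
lines = describe(sample)
print("\n".join(lines))

# For the real file, after downloading it from the repository:
# with open("Annotations_EvENT.json", encoding="utf-8") as handle:
#     print("\n".join(describe(json.load(handle))))
```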
Table 3: Number of annotations (COUNT) and token counts (TOKEN) per event type and text.

| | ERDBEBEN COUNT | ERDBEBEN TOKEN | VERWANDLUNG COUNT | VERWANDLUNG TOKEN | ECKBERT COUNT | ECKBERT TOKEN | KRAMBAMBULI COUNT | KRAMBAMBULI TOKEN | JUDENBUCHE COUNT | JUDENBUCHE TOKEN | EFFI BRIEST COUNT | EFFI BRIEST TOKEN | ALL TEXTS COUNT | ALL TEXTS TOKEN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| non_event | 167 | 1,086 | 757 | 5,938 | 212 | 1,488 | 116 | 712 | 856 | 4,732 | 2,887 | 16,655 | 4,995 | 30,611 |
| stative_event | 136 | 1,046 | 455 | 3,830 | 243 | 1,667 | 82 | 637 | 476 | 3,502 | 1,675 | 11,656 | 3,067 | 22,338 |
| process | 400 | 3,459 | 1,126 | 9,748 | 450 | 3,225 | 268 | 1,990 | 1,120 | 8,146 | 2,061 | 15,180 | 5,425 | 41,748 |
| change_of_state | 9 | 63 | 26 | 216 | 25 | 163 | 4 | 39 | 39 | 324 | 43 | 362 | 146 | 1,167 |
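The class distribution matters for any reuse as training data. The short calculation below takes the totals from the last two columns of the table above and derives each event type's share of all annotations and its mean length in tokens. The variable names are illustrative.

```python
# Totals across all six texts, copied from the table above.
counts = {"non_event": 4995, "stative_event": 3067, "process": 5425, "change_of_state": 146}
tokens = {"non_event": 30611, "stative_event": 22338, "process": 41748, "change_of_state": 1167}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
mean_length = {label: tokens[label] / counts[label] for label in counts}

for label in counts:
    print(f"{label:16s} {shares[label]:6.1%}  {mean_length[label]:.1f} tokens per annotation")
```

Changes of state make up roughly 1% of all annotations, so a classifier trained on these data will likely need some form of class weighting or resampling.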
Creation dates
2020-12-01 – 2022-03-31
Dataset creators
Evelyn Gius, Michael Vauth, Michael Weiland (student assistant), Gina Maria Sachse (student assistant), Angela Nöll (student assistant). All contributors are affiliated with Technical University Darmstadt.
Language
German (texts) and English (annotation categories)
License
GPL-3.0 License.
Repository name
EvENT_Dataset
Publication date
2022-04-01
(4) Reuse Potential
The dataset can be reused for several natural language processing (NLP) tasks focused on the detection of events. Based on the manual annotations in the dataset, we automated the recognition of narratological event types. More generally, the event annotations can be used as features for detecting phenomena related to narrative text structures.
Furthermore, based on the event annotations, we developed and evaluated an approach to model narrativeness/eventfulness and to identify the most ‘tellable’ parts of a narrative. As a next step, the modelling of narrativity will be used in text comparisons.
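To give an impression of how event annotations can feed such a model, the sketch below maps each event type to a numeric score and smooths the resulting sequence with a moving average. It is an illustration only, not the approach evaluated in the cited work: the score values, the window size and the function name are hypothetical assumptions.

```python
# Hypothetical scores per event type; NOT the values used in the cited work.
SCORES = {"non_event": 0.0, "stative_event": 1.0, "process": 2.0, "change_of_state": 3.0}


def narrativity_curve(event_types, window=3):
    """Centred moving average over the per-annotation scores.

    At the edges of the sequence the window is truncated.
    """
    scores = [SCORES[event_type] for event_type in event_types]
    half = window // 2
    curve = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        curve.append(sum(segment) / len(segment))
    return curve


# Toy sequence of event types in text order, invented for illustration.
sequence = ["non_event", "process", "change_of_state", "process", "stative_event"]
curve = narrativity_curve(sequence)
peak = max(range(len(curve)), key=curve.__getitem__)
print(curve, "peak at position", peak)
```

The position with the highest smoothed value would be a candidate for a particularly eventful passage under these assumed scores.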