The ChildPoeDE Corpus: 1082 German Children’s Poems for Computational and Experimental Studies on Poetry Reception

Marina Lehmann; Anne Heumann; Moniek M. Kuijpers; Gerhard Lauer; Jana Lüdtke

(1) Overview

Repository location

Zenodo: https://zenodo.org/record/7936860

Context

Few of the German literary corpora available today focus on children’s and young adult literature (e.g. ) and none of them specifically on poetry. The childPoeDE corpus tries to fill this gap by providing a set of 1082 poems for children, which can be used in computational and experimental studies. It was built within a project on sentiment analysis in children’s and young adult literature (). Seven anthologies – published between 1991 and 2018 – form the basis of this corpus. The poems were written by 356 different authors (84 female, 271 male, 1 unknown) and cover a wide range of topics from animals and nature to family life and children’s dreams, from everyday situations to adventures, from humorous to serious. The texts also vary in style. Some poems follow traditional formats with strict rhyme and metre, while others show free, experimental or playful characteristics, such as onomatopoeia.

(2) Method

Steps

Based on recommendations by five experts on German children’s literature and poetry – all professors or research associates from the fields of German Studies, Didactics or Pedagogy – we selected the following seven anthologies as sources: Großer Ozean (), Die schönsten Kindergedichte (), Sieben Ziegen fliegen durch die Nacht (), Sieben kecke Schnirkelschnecken (), Im Mondlicht wächst das Gras (), Ich liebe dich wie Apfelmus () and So viele Tage wie das Jahr hat (). The experts were asked to suggest anthologies that met the following criteria: the poems should still be widely read today, be aimed mostly at primary school children, have been written in German between 1800 and 2018, cover a wide range of poetry (classics as well as lesser-known poems) and rely on text rather than pictures to convey meaning. We chose these seven anthologies because they were named multiple times by different experts. Since we intended to provide data suitable as stimulus material in contemporary studies, we only included anthologies with editions published in the last 25 years. The texts were digitised with OCR software (Tesseract and Adobe).

Further, we collected poem-level and token-level metadata (CSV). Some poem-level information was added manually (author, title, anthology, anthology count, publisher, publication year, ISBN). From the Integrated Authority File (GND), we retrieved additional data about the authors (GND id, author gender, year of birth, year of death) to ensure their accurate identification. We provide the author’s year of birth and year of death as an indication of when a poem could have been published, as it was difficult and, in some cases, impossible to find original publication dates for single poems.

Most features, however, were extracted with our own Python script (poemtool.py), e.g. word/stanza/line counts and data on case, punctuation, layout, rhyme and sonority. To determine rhyme patterns, we used rhymetagger (). Calculations for the sonority score are based on Jacobs () and Stenneken et al. (). We also calculated the lexical density and type-token ratio (TTR) for each poem to provide information on lexical richness. Along with the standard TTR, we computed Moving-Average-TTRs (MATTRs) to account for different text lengths (). As MATTRs are usually computed for longer texts, we used several window sizes. All TTR and MATTR values were calculated using the R package quanteda (). Data on onomatopoeia was annotated manually.

The token-level metadata file additionally provides data on word length, word position and parts of speech at different levels of granularity. Part-of-speech information was generated with TreeTagger (). We also published a frequency table with absolute and relative frequencies for all tokens present in the corpus.

Figure 1 summarises the childPoeDE corpus with descriptive statistics. It includes frequency tables for the features special layout, rhyme and onomatopoeia, histograms with boxplots for poem length (measured in the number of stanzas and lines), poem sonority, TTR, lexical density and rhyming degree, a word cloud of the most frequent content words, a pie chart on gender distribution and a table with the ten most frequent authors and the number of poems they contributed to the corpus.
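To make the two lexical richness measures mentioned above concrete, the following Python sketch computes a standard TTR and a MATTR for a list of tokens. It is only an illustration of the general definitions and not the code used for the corpus, whose values were calculated with quanteda in R. The function names, the example tokens and the fallback to the plain TTR for texts shorter than the window are assumptions of this sketch.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens divided by number of tokens."""
    if not tokens:
        raise ValueError("cannot compute a TTR for an empty token list")
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window):
    """Moving-average TTR: mean TTR over all consecutive windows of fixed size."""
    if window <= 0:
        raise ValueError("window size must be positive")
    if len(tokens) < window:
        # Assumption of this sketch: texts shorter than the window
        # fall back to the plain TTR. Other tools may behave differently.
        return ttr(tokens)
    scores = [
        ttr(tokens[start:start + window])
        for start in range(len(tokens) - window + 1)
    ]
    return sum(scores) / len(scores)


# Hypothetical example tokens (already lower-cased and without punctuation).
example = ["der", "hund", "und", "der", "kater"]
print(ttr(example))       # 4 distinct tokens out of 5
print(mattr(example, 4))  # mean of the TTRs of the two 4-token windows
```

Because every window has the same length, the MATTR is less dependent on overall text length than the plain TTR, which is why short poems call for small window sizes.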

Figure 1

ChildPoeDE corpus – overview of poem-level metadata.

Sampling strategy

We included as many poems from the anthologies as possible. However, poems relying on pictures, graphical layout or typography to convey meaning were excluded, as were poems that used archaic or difficult language (e.g. all poems from “Des Knaben Wunderhorn”) and poems consisting of a single repeated word. A list of the omitted poems can be found on Zenodo. If a poem appeared in more than one anthology, this was noted in the column “anthology count” in the poem-level metadata file. The childPoeDE corpus in its current state is a first (yet still imperfect) attempt to collect data on German poetry for children. Ideally, a corpus should be balanced with regard to author gender. We will work towards this in the future. For now, the gender imbalance of the corpus reflects the gender imbalance present in the anthologies.
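The idea behind the “anthology count” column can be sketched as follows. This is a minimal illustration which assumes that a poem is identified by its author and title; the record layout, the function name and the placeholder values are hypothetical and do not reproduce how the column was actually filled (according to the description above, it was added manually).

```python
from collections import defaultdict


def anthology_counts(records):
    """Count in how many distinct anthologies each (author, title) pair occurs."""
    seen = defaultdict(set)
    for author, title, anthology in records:
        seen[(author, title)].add(anthology)
    return {poem: len(anthologies) for poem, anthologies in seen.items()}


# Placeholder records, not taken from the corpus.
records = [
    ("Author A", "Poem 1", "Anthology X"),
    ("Author A", "Poem 1", "Anthology Y"),
    ("Author B", "Poem 2", "Anthology X"),
]
counts = anthology_counts(records)
print(counts[("Author A", "Poem 1")])  # number of anthologies containing this poem
```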

Quality control

All texts were checked for OCR errors. Additionally, whitespace and special characters, such as quotation marks, were normalised. We also harmonised the poems’ structure to simplify automatic text processing. Detailed information on normalisation processes and explanations of text features can be found in the README files on Zenodo. The part-of-speech data was checked and manually corrected if necessary. Finally, we conducted a quality check by reviewing randomly selected data.
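As an illustration of the kind of normalisation described above, the following Python sketch unifies quotation marks, collapses runs of whitespace and reduces repeated blank lines to a single stanza break. The concrete rules applied to the corpus are documented in the README files on Zenodo and may differ; the character mapping, the function names and the treatment of blank lines are assumptions of this sketch.

```python
import re

# Assumption: typographic quotation marks are mapped to straight ones.
QUOTE_MAP = str.maketrans({
    "„": '"', "“": '"', "”": '"', "«": '"', "»": '"',
    "‚": "'", "‘": "'", "’": "'",
})


def normalise_line(line):
    """Unify quotation marks and collapse spaces, tabs and no-break spaces."""
    line = line.translate(QUOTE_MAP)
    line = re.sub(r"[ \t\u00a0]+", " ", line)
    return line.strip()


def normalise_poem(text):
    """Normalise every line and keep exactly one blank line between stanzas."""
    result = []
    for line in (normalise_line(raw) for raw in text.splitlines()):
        # Skip leading blank lines and repeated blank lines.
        if line == "" and (not result or result[-1] == ""):
            continue
        result.append(line)
    # Drop a trailing blank line, if any.
    while result and result[-1] == "":
        result.pop()
    return "\n".join(result)


# Hypothetical input with mixed quotes, a tab and surplus blank lines.
raw = "„Hallo“,  sagt\tder Wind. \n\n\n\nZweite   Strophe\n"
print(normalise_poem(raw))
```

Keeping a single blank line as the only stanza separator makes stanza and line counts straightforward to derive by splitting the text.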

(3) Dataset Description

Object name

childPoeDE

Format names and versions

TXT, CSV

Version 2.0

Creation dates

Start: 2021-01, End: 2023-02

Dataset creators

Moniek Kuijpers, University of Basel

Priska Hadayani Rüegg, University of Basel

Jana Lüdtke, Freie Universität Berlin

Marina Lehmann, Johannes Gutenberg-Universität Mainz

Anne Heumann, Johannes Gutenberg-Universität Mainz

Language

Data: German

Metadata: English

Licence

Poem-level metadata, token-level metadata, word-frequency table, TTR data and poemtool.py: CC0

Repository name

Zenodo

Publication date

Version 2.0: 2023-05-15

(4) Reuse Potential

Although there is much research available on German poetry, both on corpora (e.g. ) and computational assessments (e.g. ), none of these works focuses exclusively on German poetry for children. Thus, our data offers new research scenarios for anyone interested in poetry for children, such as empirical scholars, researchers in didactics or digital humanists. In experimental studies the texts can be used as stimulus material to investigate children’s emotional involvement when reading poetry. Elaborate metadata allows for a precise poem selection along specific criteria, including rhyme, sonority or onomatopoeia. However, the corpus does not provide all information that might be useful for empirical studies; it lacks, for example, publication dates for individual poems and an evaluation of age appropriateness.
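Selecting stimulus poems along such criteria could look like the following Python sketch, which filters rows of a poem-level metadata table. The column names (title, line_count, rhyme, onomatopoeia), their 0/1 coding and the sample rows are hypothetical placeholders; the actual column names and value formats are documented in the README files on Zenodo.

```python
import csv
import io


def select_poems(rows, max_lines, require_rhyme=True, allow_onomatopoeia=False):
    """Return the titles of poems that satisfy simple selection criteria."""
    selected = []
    for row in rows:
        if int(row["line_count"]) > max_lines:
            continue
        if require_rhyme and row["rhyme"] != "1":
            continue
        if not allow_onomatopoeia and row["onomatopoeia"] == "1":
            continue
        selected.append(row["title"])
    return selected


# Placeholder table standing in for the poem-level metadata file.
SAMPLE = """title,line_count,rhyme,onomatopoeia
Poem A,8,1,0
Poem B,24,1,0
Poem C,6,0,0
Poem D,10,1,1
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(select_poems(rows, max_lines=12))
```

With the real file, `io.StringIO(SAMPLE)` would be replaced by an opened CSV file, and the column names would need to be adapted accordingly.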

In the context of digital humanities, especially computational literary studies, our data allows for investigations of different poetic features and their correlations. Many approaches from the field of Natural Language Processing can be applied to the data and might yield new insights into German poetry for children. These include linguistic corpus analysis, sentiment analysis (for an example for children’s books see ), text similarity assessment, topic modelling, named entity recognition or explorative approaches through visualisations.

Overall, the childPoeDE corpus lays the foundations for a wide range of research scenarios while being extensible at the same time. The data could be enriched with additional metadata (e.g. sentiment values, reading age or text complexity measures), linked to other data sets through the authors’ GND ids or used for comparisons with corpora from other genres (e.g. childLex ()).

Journal of Open Humanities Data

Data Papers

