“d-Prose 1870–1920” was created as part of the project “Gender and Illness” within the cooperation project “Automated modelling of hermeneutic processes – The use of annotation in social research and the humanities for analyses on health”.1 The project examined the description of illness from a gender perspective, specifically how the presentation and perception of illness, and the ways of dealing with it, differ among characters depending on their gender.
The text files were taken from the KOLIMO corpus (Herrmann & Lauer, 2017), which in turn is based on the repositories TextGrid,2 Deutsches Textarchiv,3 and Gutenberg-de.4 The KOLIMOtoText tool (Adelmann, 2020d) served as the basis for the corpus creation, extracting all texts in the KOLIMO corpus published between 1870 and 1920. A manual check was then performed to exclude all non-prose texts as well as all texts that were not originally published in German (i.e., translations). Since the resulting text collection contained duplicates due to different editions with different titles, these were removed in a further step using the author-name-title comparison program ANTComp (Adelmann, 2020a) and the full-text comparison program BatchSED (Adelmann, 2020b). All texts were manually cleaned of paratexts (i.e., author names, dedications, prefaces, etc.) and supplemented with metadata: author’s name and pseudonyms, author’s gender, author’s dates of birth and death, title of the work, repository source, file name, number of words and types, and publication date. The publication date was extracted from the metadata of the original repositories, checked, and where necessary corrected or extended with data from the literary encyclopedias Killy (Kühlmann et al., 2016) and Kindler (Arnold, 2009).
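As an illustration of the author-name-title comparison step, the following is a minimal sketch of how duplicate candidates could be flagged for manual review. It is not the ANTComp implementation: the normalization, the similarity measure (Python's `difflib.SequenceMatcher`), the threshold of 0.85, and the record keys `author` and `title` are all assumptions made for this example.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def normalize(s: str) -> str:
    """Casefold, strip diacritics and punctuation, collapse whitespace."""
    s = unicodedata.normalize("NFKD", s.casefold())
    s = "".join(ch for ch in s if not unicodedata.combining(ch))
    s = re.sub(r"[^a-z0-9 ]+", " ", s)
    return " ".join(s.split())


def duplicate_candidates(records, threshold=0.85):
    """Return (i, j, similarity) for record pairs with similar author+title.

    `records` is a list of dicts with the hypothetical keys 'author' and
    'title'. The threshold is an assumed value, not one from the project.
    """
    keys = [normalize(f"{r['author']} {r['title']}") for r in records]
    pairs = []
    for i, j in combinations(range(len(keys)), 2):
        ratio = SequenceMatcher(None, keys[i], keys[j]).ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


# Illustrative records: two editions of one work under slightly different titles.
records = [
    {"author": "Theodor Fontane", "title": "Effi Briest"},
    {"author": "Theodor Fontane", "title": "Effi Briest. Roman"},
    {"author": "Theodor Storm", "title": "Der Schimmelreiter"},
]
candidates = duplicate_candidates(records)
```

The pairwise comparison is quadratic in the number of texts, which is acceptable for a few thousand records. Flagged pairs are only candidates and would still need a full-text comparison or manual check.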
In order to cover a broad variety of phenomena in literary prose texts, a certain degree of heterogeneity in form and content was aimed for in the sampling of the corpus. Beyond that, there were no content-related restrictions on the text selection. Thus, the only criteria for text selection were date of first publication, text language, genre, and text length (for a more detailed discussion of sampling criteria, see Gius, Krüger, & Sökefeld, 2019). As a result, the corpus includes the works of 334 authors from at least three different literary movements (i.e., naturalism, realism, and modernism). It contains approximately equal proportions of long texts (novels) and shorter prose forms (cf. Table 1).
| Property | Value |
|---|---|
| number of texts | 2511 |
| number of texts written by female authors | 346 |
| number of texts written by male authors | 2165 |
| number of authors | 334 |
| number of female authors | 72 |
| number of male authors | 262 |
| text size average | 31146 words; 4753 types |
| standard deviation | 58117 words; 5323 types |
| shortest text | 1006 words |
| longest text | 990351 words |
| number of texts per decade, 1870–1879 | 226 |
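The size figures in the table above (words, types, average, standard deviation) can be approximated along the following lines. This is a sketch under stated assumptions: the source does not specify how words were tokenized, whether types were counted case-sensitively, or which standard deviation formula was used, so the regular expression, the case-insensitive type count, and the population standard deviation below are illustrative choices only.

```python
import re
from statistics import mean, pstdev

# Assumption: a word is a run of word characters, optionally hyphenated.
TOKEN = re.compile(r"\w+(?:-\w+)*")


def words_and_types(text):
    """Return (number of word tokens, number of distinct word forms)."""
    tokens = [t.casefold() for t in TOKEN.findall(text)]  # assumed: case-insensitive types
    return len(tokens), len(set(tokens))


def size_summary(texts):
    """Summarize text sizes for an iterable of plain-text strings."""
    counts = [words_and_types(t) for t in texts]
    words = [w for w, _ in counts]
    types = [t for _, t in counts]
    return {
        "mean_words": mean(words),
        "sd_words": pstdev(words),   # assumed: population standard deviation
        "mean_types": mean(types),
        "sd_types": pstdev(types),
        "shortest": min(words),
        "longest": max(words),
    }
```

Because of these assumptions, counts computed this way may differ somewhat from the values reported in the corpus metadata.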
The steps of automatic processing described above (Adelmann, 2020a, 2020b, 2020d) were evaluated by manually checking random samples of the output. This led to an iterative improvement of the results. For the de-duplication process (Adelmann, 2020b), all duplicate pairs identified by the process (i.e., about 1,000 text pairs) were evaluated manually (Adelmann & Gius, 2020). The automated cleaning was followed by a manual cleaning in which all texts were cleaned of paratexts such as author name, author biography, dedication, preface, remarks, etc. This was done collaboratively: the data curators worked in a review process so that each text was double-checked. The same procedure was used for the enrichment with metadata. A first data curator added the respective meta information and a second data curator reviewed the entered information.
3 Dataset Description
Format names and versions
Plain text files (.txt, UTF-8); spreadsheet (.xlsx) with metadata. Version 2.0
Creation dates
2017-05-01 – 2021-06-22
Dataset creators
Adelmann, Benedikt (Developer, University of Hamburg); Gius, Evelyn (Conceptualization, Project administrator, Supervisor, Technical University of Darmstadt); Guhr, Svenja (Data curator, Supervisor, Technical University of Darmstadt); Kurz, Laura (Data curator, Technical University of Darmstadt); Otte, Felicitas (Data curator, University of Hamburg); Schlesiger, Nicole (Data curator, Technical University of Darmstadt); Schreiber, Annekea (Data curator, Technical University of Darmstadt); Sökefeld, Carla (Data curator, University of Hamburg); Krüger, Katharina (Project member, University of Hamburg); Murawska, Anna Aline (Project member, University of Hamburg); Uglanova, Inna (Validation, Application, Technical University of Darmstadt).
License
Creative Commons Attribution Non Commercial Share Alike 4.0 International
Publication date
2020-12-15 (2021-06-22 (V.2.0))
4 Reuse Potential
“d-Prose 1870–1920” is primarily of interest to literary scholars and linguists, as well as to scholars involved in the modelling of textual and cultural phenomena. The corpus can be used to test literary, literary-historical, cultural, and linguistic hypotheses. It may be especially helpful for studying the literary developments and artistic movements of the period it covers. The corpus is large enough for machine learning techniques such as clustering, classification, or topic modelling (see Uglanova & Gius, 2020, for an example based on topic modelling). It can also serve as material for didactic purposes, such as developing core knowledge and skills for working with a literary corpus (studying patterns of usage, historical dynamics, typical and unique contexts of language or literary phenomena, authors’ neologisms, collocates across metadata, etc.).
The dataset can be extended and included as a subcorpus for more global tasks. Because the corpus is heterogeneous in its structure, it can also be divided into smaller subcorpora tailored to specific research objectives. In particular, subcorpora can be created on the basis of publication year, gender of the author, or size (short stories, novellas, novels) or, with some additional work on the metadata, by genre (historical novel, adventure novel, social novel, etc.), literary movement (realism, modernism, naturalism), or other criteria.
For more specialized tasks, the corpus can be pre-processed with a linguistic analysis pipeline developed specifically for this dataset by Adelmann (2020c). The pipeline is optimized for the corpus and provides tokenization, part-of-speech and morphological tagging, lemmatization, and dependency parsing. It is accompanied by detailed instructions that can be followed by users with minimal technical knowledge.
The data format makes it easy to use the corpus with linguistic and web-based tools without further extraction or conversion. Additionally, it can be used with popular GUI-based open-source software for data mining and analysis such as Voyant Tools (Sinclair & Rockwell, 2021), the manual annotation tool CATMA (Gius et al., 2021), the concordance program AntConc (Anthony, 2020), and others.
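The creation of subcorpora by publication year, author gender, or text size described in this section could look roughly like the sketch below. The field names, the gender coding (`"f"`/`"m"`), and the sample records are hypothetical. The actual column names in the metadata spreadsheet may differ, and the rows would first have to be loaded from the .xlsx file (for example with `pandas.read_excel`).

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class TextRecord:
    # Hypothetical field names; adapt them to the real metadata columns.
    file_name: str
    author_gender: str        # assumed coding: "f" or "m"
    publication_year: int
    word_count: int


def select_subcorpus(
    records: Iterable[TextRecord],
    *,
    gender: Optional[str] = None,
    years: Optional[Tuple[int, int]] = None,   # inclusive (first, last)
    words: Optional[Tuple[int, int]] = None,   # inclusive (min, max)
) -> List[TextRecord]:
    """Return the records that satisfy every criterion that was given."""
    selected = []
    for r in records:
        if gender is not None and r.author_gender != gender:
            continue
        if years is not None and not (years[0] <= r.publication_year <= years[1]):
            continue
        if words is not None and not (words[0] <= r.word_count <= words[1]):
            continue
        selected.append(r)
    return selected


# Made-up sample records, for illustration only.
sample = [
    TextRecord("text_a.txt", "f", 1875, 12_000),
    TextRecord("text_b.txt", "m", 1875, 95_000),
    TextRecord("text_c.txt", "f", 1912, 140_000),
]
short_1870s = select_subcorpus(sample, years=(1870, 1879), words=(0, 20_000))
by_women = select_subcorpus(sample, gender="f")
```

The `file_name` values of the selected records can then be used to copy or load the corresponding plain-text files. The word-count boundaries between short stories, novellas, and novels are not defined in the source and would have to be chosen by the user.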