(1) Overview

Repository location

Zenodo: https://doi.org/10.5281/zenodo.10404966

Context

Our goal was to create a balanced dataset of French novels first published between 1751 and 1800. The dataset can be used to generate Resource Description Framework (RDF) statements for a knowledge graph on French Enlightenment novels, and it can also serve as a resource for other projects in the Digital Humanities or in the domain of Eighteenth-Century French literature. The dataset was produced by the project Mining and Modeling Text and has been used in several project publications.

(2) Method

Steps

The starting point was a subset of novels carefully digitized by double keying. Using this first group of novels, an OCR model was trained in cooperation with Christian Reul (Centre for Philology and Digitality, University of Würzburg), one of the developers of OCR4all (Figure 1).

Figure 1 

Training the OCR model for late Eighteenth-Century prints of French novels with OCR4all.

We applied this OCR model for French prints of the late 18th century to additional scans, provided for instance by Gallica (bnf.fr) and other sources (see metadata for details). This produced a second group of novels that were previously not available in full text, or only in low quality. A third group of texts, based on existing full texts, brought the total to 200 volumes.

Sampling strategy

As shown in Figure 2, we used bibliographic data on the overall literary production in France 1751–1800 to balance the corpus of full texts with respect to three parameters: gender, year of first publication and narrative form. The aim was for the corpus composition to approximate the historical distribution of these parameters.

Figure 2 

Narrative forms of French novels 1751–1800 in the bibliographic data and in the corpus metadata.

We compared overall novel production with the corpus data and added novels per year in line with the known historical publication proportions. Regarding gender (Figure 3), we used information from Wikidata as well as a Python script designed to identify gender-specific titles such as “Abbé” or “Marquis”.

Figure 3 

Gender balance in bibliographic metadata and in corpus metadata.
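The title-based part of the gender annotation can be illustrated with a minimal sketch. This is not the project's script: the function name `guess_gender` and the title lists are assumptions made for illustration, and only “Abbé” and “Marquis” are taken from the text above.

```python
import re

# Hypothetical title lists. Only "Abbé" and "Marquis" come from the text;
# the remaining entries are illustrative assumptions.
MALE_TITLES = ("abbé", "marquis", "comte", "chevalier", "monsieur")
FEMALE_TITLES = ("marquise", "comtesse", "madame", "mademoiselle")


def _compile(titles):
    # Whole-word matching, so that "marquis" does not fire on "marquise".
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in titles) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


MALE_RE = _compile(MALE_TITLES)
FEMALE_RE = _compile(FEMALE_TITLES)


def guess_gender(author_name):
    """Return 'male', 'female' or 'unknown' based on titles in the name."""
    has_male = bool(MALE_RE.search(author_name))
    has_female = bool(FEMALE_RE.search(author_name))
    if has_male and not has_female:
        return "male"
    if has_female and not has_male:
        return "female"
    # No title found, or conflicting titles: leave the decision to a human.
    return "unknown"
```

A heuristic of this kind only covers names that carry a title, which is presumably why it is combined with Wikidata information; names without a title come back as `unknown` and need another source.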

Data on narrative form were derived from bibliographic metadata, complemented by human evaluation of the full texts.
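The balancing idea described in this section can be sketched as a proportional allocation: given historical counts per category (for example per decade), compute how many corpus items each category should receive and how many are still missing. The sketch below uses the largest-remainder method as one possible way to round the shares; this choice, the function names and all numbers are assumptions for illustration, not the procedure or figures used in the project.

```python
def target_counts(historical, corpus_size):
    """Allocate corpus_size items proportionally to the historical counts."""
    total = sum(historical.values())
    if total <= 0:
        raise ValueError("historical counts must sum to a positive number")
    exact = {k: corpus_size * v / total for k, v in historical.items()}
    # Start from the rounded-down shares ...
    targets = {k: int(x) for k, x in exact.items()}
    leftover = corpus_size - sum(targets.values())
    # ... and give the remaining slots to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - targets[k], reverse=True)
    for k in by_remainder[:leftover]:
        targets[k] += 1
    return targets


def missing(targets, current):
    """Items still to be added per category (negative means over-represented)."""
    return {k: targets[k] - current.get(k, 0) for k in targets}


# Invented numbers, for illustration only.
historical = {"1751-1760": 50, "1761-1770": 30, "1771-1780": 20}
targets = target_counts(historical, corpus_size=10)
to_add = missing(targets, {"1751-1760": 2, "1761-1770": 3})
```

The same function can be applied to any single parameter (year, gender, narrative form). Balancing several parameters at once requires further decisions that this sketch does not address.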

Quality control

Optical character recognition

The output of the OCR4all pipeline underwent several rounds of quality control, including correction by a native speaker of French. The corrections are documented through version control (GitHub).

Metadata

Additionally, we made sure that the dataset meets the FAIR data criteria of findability, accessibility, interoperability and reusability. Every item is provided with a stable Uniform Resource Identifier (URI), the MiMoTextID, and with additional authoritative data. When reconciling the data against entities in Wikidata, we manually corrected the output of OpenRefine where necessary.

(3) Dataset Description

Repository name

Zenodo, GitHub.

Object name

Collection de romans français du dix-huitième siècle (1751–1800)/Collection of Eighteenth-Century French Novels 1751–1800 (V1.2).

Format names and versions

V1.2: 200 files in TEI/XML according to the ‘level 1’-schema of the European Literary Text Collection; TXT files in two versions: (automatically) normalized and historical spelling; controlled vocabularies used to describe metadata are documented on GitHub.
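As a minimal sketch of how a TEI/XML file of this kind might be read, the following uses Python's standard library. The embedded sample document is a simplified, made-up stand-in and not a file from the collection; the element paths are assumptions based on common TEI practice (`titleStmt`, `body`, `p`) and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# The TEI namespace; the prefix "tei" is only a local convention.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Simplified, invented sample; real files contain a much richer header.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Titre du roman</title>
        <author>Nom de l'auteur</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text><body><div type="chapter">
    <p>Premier paragraphe.</p>
    <p>Second paragraphe.</p>
  </div></body></text>
</TEI>"""


def read_tei(xml_string):
    """Extract title, author and paragraph texts from a TEI document."""
    root = ET.fromstring(xml_string)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    author = root.findtext(".//tei:titleStmt/tei:author", default="", namespaces=TEI_NS)
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
    ]
    return {"title": title.strip(), "author": author.strip(), "paragraphs": paragraphs}


record = read_tei(SAMPLE)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`.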

Creation dates

2019-12-01 to 2023-12-06.

Dataset creators

Julia Röttgermann (editor), Johanna Konstanciak (researcher), Christof Schöch (researcher), Julia Dudar (researcher), Henning Gebhard (researcher), Anne Klee (researcher), Sarah Ondraszek (researcher), Amélie Probst (researcher), Damir Padieu (researcher). Affiliation of all (at the time of data development): University of Trier, Trier, Germany.

Language

French; English for metadata.

License

Public Domain.

Publication date

2023-12-06 (V1.2).

(4) Reuse Potential

Our dataset can be used for various language-processing tasks involving 18th-Century French. Because we provide detailed metadata, users can generate subsets of the dataset, for example by gender, decade of publication or narrative form. One could, for example, study words that are distinctive for different decades or investigate linguistic differences between male and female authors from a diachronic perspective.
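Building such subsets from a metadata table could look like the following sketch. The column names (`id`, `author_gender`, `first_published`, `narrative_form`) and the sample rows are invented for illustration; the actual metadata files may use different names and vocabularies.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; column names and values are assumptions.
SAMPLE_TSV = (
    "id\tauthor_gender\tfirst_published\tnarrative_form\n"
    "novel_a\tfemale\t1757\tepistolary\n"
    "novel_b\tmale\t1762\tfirst-person\n"
    "novel_c\tmale\t1768\tepistolary\n"
    "novel_d\tfemale\t1789\tthird-person\n"
)


def load_metadata(text):
    """Parse tab-separated metadata into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def subset(rows, **criteria):
    """Keep the rows whose columns equal all the given values."""
    return [r for r in rows if all(r.get(k) == v for k, v in criteria.items())]


def by_decade(rows, year_column="first_published"):
    """Group item ids by the decade (e.g. 1760) of first publication."""
    groups = defaultdict(list)
    for r in rows:
        groups[int(r[year_column]) // 10 * 10].append(r["id"])
    return dict(groups)


rows = load_metadata(SAMPLE_TSV)
female_ids = [r["id"] for r in subset(rows, author_gender="female")]
decades = by_decade(rows)
```

The resulting id lists can then be used to select the corresponding TXT or TEI files for a comparison between groups.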

Moreover, the dataset provides a comprehensive and structured resource for analysing literary and cultural trends during the Enlightenment era, enabling researchers to gain insights into the intellectual and societal transformations of that time. As the dataset is balanced according to several parameters, it can be regarded as representative of the time period with respect to these parameters.

Computational Literary Studies methods that have already been applied to the dataset in the context of the project include topic modeling, named entity recognition, sentiment analysis and stylometry.

Furthermore, the Linked Open Data paradigm used to connect these full-text resources with the knowledge graph ‘MiMoTextbase’ makes it possible to run sophisticated SPARQL queries on these texts, combining them in the graph with metadata on about 2000 French novels 1751–1800.