(1) Overview

An idiolect is the language of an individual and, like language in general, it is subject to change over time. However, idiolect remains understudied, especially in quantitative linguistics, because relevant large corpora are lacking. We thus developed the CIDRE corpus, the first corpus for the diachronic and quantitative study of idiolect in French (logo in Figure 1). Together with the EMMA corpus on 17th-century English, it is one of the rare quantitative resources suited to stylochronometry.

Figure 1 

The logo of CIDRE.

To collect as much data per person as possible within a single genre (to enable comparison), we decided to use the fiction works of prolific 19th-century writers. This type of data has several advantages: fiction works tend to be long, providing us with large quantities of data; they are in the public domain; high-quality e-books are available; and the orthography of that period is very similar to today’s, which makes it possible to use off-the-shelf NLP systems.

Using various websites distributing free epub files, we included in CIDRE, as exhaustively as possible, the fiction works of Gustave Aimard, Honoré de Balzac, Paul Féval, Henry Gréville, Daniel Lesueur, Pierre-Alexis Ponson du Terrail, George Sand, La Comtesse de Ségur, Jules Verne, Michel Zévaco and Émile Zola (see Table 1 and Figures 2 and 3 for more details). We dated each work with the year in which it was written when this information was available, and with the year of first publication otherwise. In this way, each work can be seen as a datapoint characteristic of the way the author was writing at the time.
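The dating rule described above can be sketched as a small helper. This is only an illustration: the function name and its two parameters are hypothetical and do not come from the CIDRE scripts or metadata file.

```python
from typing import Optional


def dating_year(year_written: Optional[int],
                year_first_published: Optional[int]) -> Optional[int]:
    """Return the year used to date a work.

    Hypothetical helper: prefer the year of writing when it is known,
    otherwise fall back to the year of first publication. Returns None
    when neither is known (such works cannot be dated).
    """
    if year_written is not None:
        return year_written
    return year_first_published


# Illustrative values only, not taken from the CIDRE metadata.
print(dating_year(1886, 1888))  # year of writing is known, so it is used
print(dating_year(None, 1888))  # falls back to first publication
```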

Table 1

Summary of the content of CIDRE.


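| Author | Number of works | Earliest | Latest |
|---|---|---|---|
| Gustave Aimard | 24 | 1858 | 1881 |
| Honoré de Balzac | 59 | 1829 | 1848 |
| Paul Féval | 23 | 1843 | 1881 |
| Henry Gréville | 36 | 1876 | 1892 |
| Daniel Lesueur | 31 | 1882 | 1911 |
| Pierre-Alexis Ponson du Terrail | 42 | 1852 | 1870 |
| George Sand | 62 | 1831 | 1875 |
| La Comtesse de Ségur | 22 | 1856 | 1871 |
| Jules Verne | 58 | 1862 | 1905 |
| Michel Zévaco | 29 | 1906 | 1926 |
| Émile Zola | 35 | 1864 | 1903 |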

Figure 2 

The distribution of the data in CIDRE.

Figure 3 

Summary of the content of CIDRE.

Context

This resource was produced as part of a research project investigating large corpora of French literature with advanced natural language processing methods.

(2) Method

CIDRE was produced using Python scripts and manually gathered metadata.

Steps

  1. For each author, we produced a list of fiction works to be included in the corpus (stored in the metadata file).
  2. We collected hyperlinks to downloadable sources of these works in epub format.
  3. We manually dated each fiction work using various sources. The result is stored in the metadata file, together with the source used for the dating.
  4. We fed the metadata into a first Python script that downloads all epub files.
  5. We applied a second Python script to the downloaded files to obtain .txt files stripped of paratext (e.g. prefaces or license declarations).
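Steps 4 and 5 can be sketched roughly as follows. This is not the authors' code: the CSV column names (`author`, `title`, `year`, `url`), the output layout and the helper names are hypothetical. The sketch also assumes that pandoc, which is listed among the prerequisites of the scripts, serves as the epub-to-TEI converter.

```python
import csv
import io
import urllib.request
from pathlib import Path


def plan_downloads(metadata_csv: str, out_dir: str = "epub"):
    """Map each metadata row to (url, local epub path).

    The column names used here are assumptions for illustration only.
    """
    plan = []
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        name = f"{row['year']}_{row['title'].replace(' ', '_')}.epub"
        plan.append((row["url"], Path(out_dir) / row["author"] / name))
    return plan


def download(url: str, target: Path) -> None:
    """Fetch one epub file, creating the author directory if needed."""
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(target))


def pandoc_command(epub_path: Path) -> list:
    """Build a pandoc call that converts an epub file to TEI XML."""
    return ["pandoc", "--from", "epub", "--to", "tei", "--standalone",
            str(epub_path), "--output", str(epub_path.with_suffix(".xml"))]


# Illustrative metadata row with a placeholder URL; nothing is downloaded here.
sample = ("author,title,year,url\n"
          "lesueur,Un mysterieux amour,1886,https://example.org/a.epub\n")
for url, path in plan_downloads(sample):
    print(url, path.as_posix(), pandoc_command(path))
```

Separating the planning of downloads from the network and conversion calls makes it possible to check the file naming without fetching anything; `download` and the pandoc command would then be run on each planned entry, for instance through `subprocess.run`.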

Sampling strategy

To select relevant authors, we searched for 19th-century authors whose fiction works are in the public domain and available as good-quality epub files. We kept only works whose preprocessed .txt file is at least 100 kB (~16,500 words). We excluded works with co-authors (for example some posthumous novels by Jules Verne), authors known to have worked with ghostwriters (e.g. Alexandre Dumas), and works that we were not able to date.
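The selection criteria above can be expressed as a single predicate. The sketch below is hypothetical: the `Work` record, its field names and the example values are invented for illustration, and the size threshold is read as 100,000 bytes (100 * 1024 would be an equally plausible reading).

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the size threshold is interpreted as 100,000 bytes.
MIN_SIZE_BYTES = 100 * 1000

# Authors excluded because they are known to have worked with ghostwriters.
EXCLUDED_AUTHORS = {"Alexandre Dumas"}


@dataclass
class Work:
    """Hypothetical record for one candidate work."""
    author: str
    title: str
    year: Optional[int]       # None when the work could not be dated
    size_bytes: int           # size of the preprocessed .txt file
    has_coauthor: bool = False


def keep(work: Work) -> bool:
    """Return True if the work satisfies all sampling criteria."""
    return (work.author not in EXCLUDED_AUTHORS
            and not work.has_coauthor
            and work.year is not None
            and work.size_bytes >= MIN_SIZE_BYTES)


# Invented candidates, used only to exercise the predicate.
candidates = [
    Work("Author A", "Long dated novel", 1870, 450_000),
    Work("Author A", "Short story", 1871, 40_000),
    Work("Author B", "Undated novel", None, 300_000),
    Work("Author C", "Co-written novel", 1880, 300_000, has_coauthor=True),
]
print([w.title for w in candidates if keep(w)])
```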

Quality control

The first Python script, step1-getEBooks.py, is responsible for the correct naming of all e-books. The second, step2-convertToTei.py, removes prefaces, image descriptions and license declarations by first converting the epub file into a TEI file, using the software, then parsing through the TEI structure and selecting only the text that has been written by the author in the year of writing of the novel. Finally, a manual cleaning phase removed dedications and prefaces that remained undetected.
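To illustrate the idea of selecting only the author's text from a TEI structure, here is a minimal sketch on a toy document. It assumes that paratext sits in `front` and `back` elements and that the novel itself sits in `body`. The TEI files actually produced during conversion may be organised differently, and the real script may apply other rules.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Toy document invented for this example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Toy</title></titleStmt></fileDesc></teiHeader>
  <text>
    <front><p>Preface by the editor.</p></front>
    <body>
      <div><head>I</head><p>First paragraph.</p><p>Second <hi>paragraph</hi>.</p></div>
    </body>
    <back><p>License declaration.</p></back>
  </text>
</TEI>"""


def body_paragraphs(tei_xml: str) -> list:
    """Return the paragraphs found under text/body, skipping front and back.

    Chapter headings (<head>) are not returned; whitespace is normalised.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(f"{TEI_NS}text/{TEI_NS}body")
    if body is None:
        return []
    return [" ".join("".join(p.itertext()).split())
            for p in body.iter(f"{TEI_NS}p")]


print("\n".join(body_paragraphs(SAMPLE)))
```

A structural filter of this kind cannot catch paratext that is encoded as ordinary body paragraphs, which is consistent with the need for a final manual cleaning phase.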

(3) Dataset Description

Object name

CIDRE.zip

Format names and versions

Fiction works are distributed in .txt format, in directories named after the authors’ last names. Each filename starts with the year of writing, followed by an underscore and the title of the work, with words separated by underscores. For example, 1886_Un_mysterieux_amour.epub.txt in the directory ‘lesueur’.
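The naming convention makes it straightforward to recover the author, year and title from a file path. The following sketch assumes that every file follows the pattern of the example above, including the `.epub.txt` ending; the function name is hypothetical.

```python
import re
from pathlib import Path

# Four-digit year, underscore, title with underscores, then ".epub.txt".
FILENAME_RE = re.compile(r"^(?P<year>\d{4})_(?P<title>.+?)\.epub\.txt$")


def parse_work_path(path: str) -> dict:
    """Split a corpus file path into author folder, year and title."""
    p = Path(path)
    match = FILENAME_RE.match(p.name)
    if match is None:
        raise ValueError(f"unexpected filename: {p.name}")
    return {
        "author": p.parent.name,                     # name of the parent folder
        "year": int(match["year"]),
        "title": match["title"].replace("_", " "),   # underscores back to spaces
    }


print(parse_work_path("lesueur/1886_Un_mysterieux_amour.epub.txt"))
```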

The metadata of the corpus is stored in a CSV file. The scripts that gather corpora from online libraries (e.g. Wikisource, Project Gutenberg, etc.) must be executed using Python 3. Beforehand, one needs to install the Python package selenium, as well as Geckodriver (from https://github.com/mozilla/geckodriver/releases) and pandoc (from https://pandoc.org/installing.html).

Creation dates

The corpus was created between 2020-10-01 and 2021-04-07.

Dataset creators

Olga Seminck and Philippe Gambette

Language

French

License

Corpus: public domain; metadata: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); processing scripts: GPLv3 License.

Repository name

Zenodo and Ortolang

Publication date

2021-03-30

(4) Reuse Potential

Our resource can be used not only for idiolect studies but also in other contexts, such as authorship attribution, stylometric studies, and literary studies of realism, naturalism, adventure novels and detective fiction of the 19th and early 20th centuries. The fact that four of the eleven authors in our corpus are women may also open perspectives for gender studies; see Rybicki for an example on English. Moreover, the scripts can be reused by anyone who wants to compile their own corpus of e-books.