The Corpus for Idiolectal Research (CIDRE)

The Corpus for Idiolectal Research (CIDRE) is a collection of fiction works by 11 prolific 19th-century French authors (4 women, 7 men; 22–62 works per author; 37 million words in total). Every work is dated with the year in which it was written. The works were gathered by script from open source platforms, for example La Bibliothèque électronique du Québec, and stripped of paratext (text that is not part of the novel, e.g. prefaces). We distribute the text files, the dating, other metadata and the scripts under an open source license. CIDRE is the first French-language resource for the large-scale diachronic study of style and idiolect (i.e. stylochronometry).

(1) OVERVIEW
An idiolect is the language of an individual and, like language in general, it is subject to change over time (Dittmar, 1996; Kuhl, 2003). However, the notion of idiolect remains an understudied topic, especially in quantitative linguistics, due to the lack of relevant large corpora (Barlow, 2010, 2013; Mollin, 2009). We thus developed the CIDRE corpus, the first corpus for the diachronic and quantitative study of idiolect in French (logo in Figure 1). Together with the EMMA corpus on 17th-century English (Petré et al., 2019), it is one of the rare quantitative resources suited to stylochronometry (see Klaussner and Vogel, 2018 and Stamou, 2008 for examples of stylochronometric studies).
To collect as much data per person as possible within a single genre (to enable comparison), we decided to use the fiction works of prolific 19th-century writers. This type of data has several advantages: fiction works tend to be long, providing large quantities of data; they are in the public domain and high-quality e-books are available; and the orthography of the period is very similar to today's, which makes it possible to use off-the-shelf NLP systems.
Using various websites that distribute free epub files, we included in CIDRE, as exhaustively as possible, the fiction works of Gustave Aimard, Honoré de Balzac, Paul Féval, Henry Gréville, Daniel Lesueur, Pierre-Alexis Ponson du Terrail, George Sand, La Comtesse de Ségur, Jules Verne, Michel Zévaco and Émile Zola (see Table 1 and Figures 2 and 3 for more details). We dated each work with the year in which it was written when this information was available, and with the year of first publication otherwise. In this way, each work can be seen as a data point characteristic of the way the author was writing at the time.

https://zenodo.org/record/4707812#.YK-Tai8ivs0
CONTEXT
This resource was produced as part of a research project investigating large corpora of French literature with advanced natural language processing methods.
(2) METHOD
CIDRE was produced using Python scripts together with manually gathered metadata.

1. For each author, we produce a list of fiction works to be included in the corpus (in the metadata file).
2. Hyperlinks to downloadable sources of the aforementioned works in epub format are collected.
3. Each fiction work is manually dated using various sources. The result is stored in the metadata file, together with the source used for the dating.
4. We feed the metadata into a first Python script that downloads all epub files.
5. We apply a second Python script to the downloaded files to obtain .txt files that are stripped of paratext (e.g. prefaces or license declarations).
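
The download step can be illustrated with a minimal sketch. It is not the distributed step1-getEBooks.py script: the column names (author, title, year, epub_url), the example.org links and the output layout are assumptions made for illustration, and plain urllib stands in for the browser automation that the actual scripts rely on.

```python
import csv
import io
from pathlib import Path
from urllib.request import urlretrieve

# Hypothetical metadata layout; the column names in the real CSV file may differ,
# and the example.org links are placeholders, not actual sources.
SAMPLE_METADATA = """author,title,year,epub_url
lesueur,Un mysterieux amour,1886,https://example.org/lesueur/amour.epub
verne,Vingt mille lieues sous les mers,1870,https://example.org/verne/vingt.epub
"""


def plan_downloads(csv_text, out_dir="epub"):
    """Return one (url, target_path) pair per row of the metadata."""
    plan = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = f"{row['year']}_{row['title'].replace(' ', '_')}.epub"
        plan.append((row["epub_url"], Path(out_dir) / row["author"] / name))
    return plan


def download_all(plan):
    """Fetch every planned file (needs network access; not called in this sketch)."""
    for url, target in plan:
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(target))


plan = plan_downloads(SAMPLE_METADATA)
for url, target in plan:
    print(target.as_posix(), "<-", url)
```

Separating the planning from the fetching makes it possible to inspect the target names before any file is downloaded.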

SAMPLING STRATEGY
To select relevant authors, we searched for 19th-century authors whose fiction works are available in the public domain as good-quality epub files. Once preprocessed, a work had to be at least 100 kB (~16,500 words) in .txt format to be retained (see footnote 2). We removed works that have co-authors (for example some posthumous novels by Jules Verne), works by authors who are known to have worked with ghostwriters, e.g. Alexandre Dumas (Chodorowicz, 2019), and works that we were not able to date.
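
The selection criteria above can be expressed as a simple filter. The sketch below is illustrative only: the Work record, its field names and the sample entries are made up, and the threshold assumes that the size limit means about 100,000 bytes.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the size threshold is read as roughly 100,000 bytes of cleaned text.
MIN_SIZE_BYTES = 100_000
# Authors excluded as a whole because they are known to have used ghostwriters.
GHOSTWRITER_AUTHORS = {"dumas"}


@dataclass
class Work:
    """Hypothetical record for one candidate work (not the real metadata schema)."""
    title: str
    author: str
    year: Optional[int]       # None when the work could not be dated
    size_bytes: int           # size of the cleaned .txt file
    co_authored: bool = False


def keep(work: Work) -> bool:
    """Apply the four exclusion criteria described in the text."""
    return (
        work.year is not None
        and not work.co_authored
        and work.author not in GHOSTWRITER_AUTHORS
        and work.size_bytes >= MIN_SIZE_BYTES
    )


# Made-up candidates, one per exclusion reason plus one that passes.
candidates = [
    Work("Work A", "author_a", 1870, 450_000),
    Work("Work B", "author_a", None, 300_000),         # undated
    Work("Work C", "author_b", 1846, 60_000),          # too short
    Work("Work D", "author_a", 1905, 350_000, True),   # co-authored
    Work("Work E", "dumas", 1844, 900_000),            # excluded author
]
selected = [w.title for w in candidates if keep(w)]
print(selected)  # ['Work A']
```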

QUALITY CONTROL
The first Python script, step1-getEBooks.py, is responsible for the correct naming of all e-books. The second, step2-convertToTei.py, removes prefaces, image descriptions and license declarations. It first converts the epub file into a TEI file with a conversion tool, then parses the TEI structure and keeps only the text that the author wrote in the year of writing of the novel. Finally, a manual cleaning phase removed the dedications and prefaces that had remained undetected.
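
As an illustration of the TEI parsing step, the sketch below keeps only the paragraphs found in the body of a small TEI document and skips divisions marked as paratext. It is not the logic of step2-convertToTei.py: the sample document, the type labels in SKIPPED_DIV_TYPES and the assumption that paratext is marked through the type attribute of div elements are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Made-up TEI sample; files produced by a real converter may be structured differently.
SAMPLE_TEI = f"""<TEI xmlns="{TEI_NS}">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text>
    <front><div type="preface"><p>Preface by the editor.</p></div></front>
    <body>
      <div><head>I</head><p>First paragraph of the novel.</p><p>Second paragraph.</p></div>
      <div type="license"><p>License declaration.</p></div>
    </body>
  </text>
</TEI>"""

# Hypothetical labels for divisions that count as paratext.
SKIPPED_DIV_TYPES = {"preface", "license", "dedication"}


def _collect(elem, out):
    """Walk the tree, skipping paratext divisions together with their subtrees."""
    for child in elem:
        tag = child.tag.split("}")[-1]  # drop the namespace prefix
        if tag == "div" and child.get("type") in SKIPPED_DIV_TYPES:
            continue
        if tag == "p":
            # Join nested text nodes and normalise whitespace.
            out.append(" ".join("".join(child.itertext()).split()))
        else:
            _collect(child, out)


def novel_paragraphs(tei_xml):
    """Return the paragraphs of the body; front and back matter are never visited."""
    root = ET.fromstring(tei_xml)
    body = root.find(f".//{{{TEI_NS}}}body")
    paragraphs = []
    if body is not None:
        _collect(body, paragraphs)
    return paragraphs


paragraphs = novel_paragraphs(SAMPLE_TEI)
print("\n".join(paragraphs))
```

Restricting the traversal to the body element discards front matter by construction, so the type-based filter only has to catch paratext that ended up inside the body.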

FORMAT NAMES AND VERSIONS
Fiction works are distributed in .txt format, in repositories named after the author's last name. Each filename starts with the year of writing, followed by an underscore and the title of the novel, with words separated by underscores. For example, 1886_Un_mysterieux_amour.epub.txt in the repository 'lesueur'.
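
The naming convention can be handled programmatically, as in the sketch below. The functions build_filename and parse_filename are hypothetical helpers, not part of the distributed scripts; the accent stripping is inferred from the example filename above, and the treatment of apostrophes and other punctuation is an assumption.

```python
import re
import unicodedata

# Year (four digits), underscore, title words joined by underscores, then ".epub.txt".
FILENAME_RE = re.compile(r"^(?P<year>\d{4})_(?P<title>.+)\.epub\.txt$")


def build_filename(year, title):
    """Build a corpus-style filename from a year and a title."""
    # Assumed rule: accents are stripped, as in the example "mysterieux".
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    # Assumed rule: any run of non-alphanumeric characters separates two words.
    words = re.findall(r"[A-Za-z0-9]+", ascii_title)
    return f"{year}_{'_'.join(words)}.epub.txt"


def parse_filename(name):
    """Return (year, title) from a corpus-style filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    return int(match["year"]), match["title"].replace("_", " ")


name = build_filename(1886, "Un mystérieux amour")
print(name)                  # 1886_Un_mysterieux_amour.epub.txt
print(parse_filename(name))  # (1886, 'Un mysterieux amour')
```

Because every filename starts with a four-digit year, sorting the filenames of one author alphabetically also orders the works chronologically.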
The metadata of the corpus is stored in a CSV file. The scripts that gather corpora from online libraries (e.g. Wikisource, Project Gutenberg) must be executed with Python 3. Beforehand, one needs to install the Python package selenium, as well as Geckodriver (from https://github.com/mozilla/geckodriver/releases) and pandoc (from https://pandoc.org/installing.html).

Footnote 2: We chose the plain text format to facilitate import into different NLP tools. However, our programming scripts allow users to produce a TEI-format file from the epub downloads.

(4) REUSE POTENTIAL
Our resource can be used not only for idiolect studies but also in other contexts, such as authorship attribution, stylometric studies, and literary studies on realism, naturalism, adventure novels and detective fiction of the 19th and early 20th centuries. The fact that four women are included among the eleven authors of our corpus may also open some perspectives in gender studies; see Rybicki (2016) for an example on English. Moreover, the scripts can be reused by anyone who wants to compose their own corpus of e-books.