An idiolect is the language of an individual and, like language in general, it is subject to change over time (Dittmar, 1996; Kuhl, 2003). However, the notion of idiolect remains an understudied topic, especially in quantitative linguistics, due to the lack of relevant large corpora (Barlow, 2010, 2013; Mollin, 2009). We thus developed the CIDRE corpus, the first corpus for the diachronic and quantitative study of idiolect in French (logo in Figure 1). Together with the EMMA corpus on 17th-century English (Petré et al., 2019), it is one of the rare quantitative resources suited to stylochronometry (see Klaussner and Vogel, 2018 and Stamou, 2008 for examples of stylochronometric studies).
To collect as much data per person as possible within a single genre (to enable comparison), we used the fiction works of prolific 19th-century writers. This type of data has several advantages: fiction works tend to be long, providing large quantities of data; they are in the public domain, and high-quality e-books are available; and the orthography of the period is very similar to today’s, which makes it possible to use off-the-shelf NLP systems.
Using various websites distributing free epub files,1 we included in CIDRE, as exhaustively as possible, the fiction works of Gustave Aimard, Honoré de Balzac, Paul Féval, Henry Gréville, Daniel Lesueur, Pierre-Alexis Ponson du Terrail, George Sand, La Comtesse de Ségur, Jules Verne, Michel Zévaco and Émile Zola (see Table 1 and Figures 2 and 3 for more details). We dated the works with the year they were written in, if this information was available, and with the first year of publication otherwise. In this way, each work can be seen as a datapoint characteristic of the way the author was writing at the time.
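The dating rule described above can be written as a small helper. This is only an illustrative sketch: the function name and its two parameters are hypothetical and do not come from the CIDRE scripts.

```python
from typing import Optional


def date_of_work(year_written: Optional[int],
                 year_first_published: Optional[int]) -> Optional[int]:
    """Return the year used to date a work.

    The year of writing is preferred; the first year of publication is
    the fallback. None means that the work could not be dated.
    """
    if year_written is not None:
        return year_written
    return year_first_published
```

For instance, `date_of_work(None, 1847)` falls back to the publication year, while `date_of_work(1845, 1847)` keeps the year of writing.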
| AUTHOR | NUMBER OF WORKS | EARLIEST | LATEST |
| Honoré de Balzac | 59 | 1829 | 1848 |
| Pierre-Alexis Ponson du Terrail | 42 | 1852 | 1870 |
| La Comtesse de Ségur | 22 | 1856 | 1871 |
This resource was produced as part of a research project investigating large corpora of French literature with advanced natural language processing methods.
CIDRE was produced using Python scripts and manually gathered metadata, in the following steps:
- For each author, we produced a list of fiction works to be included in the corpus (stored in the metadata file).
- We collected hyperlinks to downloadable sources of these works in epub format.
- We manually dated each fiction work using various sources. The result is stored in the metadata file, together with the source used for the dating.
- We fed the metadata into a first Python script that downloads all epub files.
- We applied a second Python script to the downloaded files to obtain .txt files stripped of paratext (e.g. prefaces or license declarations).
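As an illustration of how the metadata can drive the download step, the sketch below reads metadata in CSV form and derives, for each work, the URL to fetch and the local path to save it under. The column names (`author`, `year`, `title`, `url`) and the function name are assumptions made for this example; the actual layout of the CIDRE metadata file may differ, and no download is performed here.

```python
import csv
import io
from pathlib import PurePosixPath
from typing import Iterator, Tuple


def planned_downloads(metadata_csv: str) -> Iterator[Tuple[str, str]]:
    """Yield (url, target_path) pairs from CSV text.

    Hypothetical columns: author, year, title, url.
    """
    reader = csv.DictReader(io.StringIO(metadata_csv))
    for row in reader:
        # Join the words of the title with underscores.
        title = "_".join(row["title"].split())
        # One directory per author; file name starts with the year.
        target = PurePosixPath(row["author"]) / f"{row['year']}_{title}.epub"
        yield row["url"], str(target)
```

Keeping the planning step separate from the network access makes it possible to check names and dates before anything is fetched.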
To select relevant authors, we searched for 19th-century authors whose fiction works are available in the public domain as good-quality epub files. After preprocessing, a work had to be at least 100 kB (~16,500 words) in .txt format to be included.2 We excluded works with co-authors (for example, some posthumous novels by Jules Verne), authors known to have worked with ghostwriters, e.g. Alexandre Dumas (Chodorowicz, 2019), and works that we were not able to date.
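The size criterion can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and reading the 100-kilobyte threshold as 100,000 bytes (rather than 102,400) is an assumption.

```python
from pathlib import Path

# Assumption: the threshold is interpreted as 100,000 bytes.
MIN_SIZE_BYTES = 100 * 1000


def large_enough(txt_path: Path, min_bytes: int = MIN_SIZE_BYTES) -> bool:
    """Return True if the preprocessed .txt file meets the size threshold."""
    return txt_path.stat().st_size >= min_bytes
```

Such a check would be applied to the .txt files, after paratext removal, since the threshold concerns the preprocessed text.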
The first Python script, step1-getEBooks.py, downloads the e-books and names them consistently. The second, step2-convertToTei.py, removes prefaces, image descriptions and license declarations: it first converts the epub file into a TEI file using conversion software, then parses the TEI structure and keeps only the text written by the author at the time the novel was written. Finally, a manual cleaning phase removed the dedications and prefaces that had remained undetected.
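To give an idea of what selecting text from a TEI structure can look like, the sketch below keeps only the paragraphs found in the `<body>` of a TEI document and ignores `<front>` and `<back>` matter. It is not the logic of step2-convertToTei.py, whose selection rules are not detailed here; the function name is hypothetical, and the sketch assumes a standard TEI layout with the usual TEI namespace.

```python
import xml.etree.ElementTree as ET

TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def body_text(tei_xml: str) -> str:
    """Return the paragraphs of the TEI <body>, one per line.

    Front and back matter (e.g. prefaces, license notes) are skipped
    because only the <body> element is traversed.
    """
    root = ET.fromstring(tei_xml)
    body = root.find("./tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter(f"{{{TEI_URI}}}p"):
        # Flatten inline markup and normalise whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

A structural filter of this kind cannot detect a preface or dedication placed inside the body, which is consistent with the need for a final manual cleaning phase.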
(3) Dataset Description
Format names and versions
Fiction works are distributed in .txt format, in directories named after the authors’ last names. Each filename starts with the year of writing, followed by an underscore and the title of the novel, with words separated by underscores. For example, the directory ‘lesueur’ contains the file 1886_Un_mysterieux_amour.epub.txt.
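Because the year and title are encoded in the filename, they can be recovered without consulting the metadata file. The following sketch assumes that every file follows the pattern of the example above (a four-digit year, an underscore, the title, and the `.epub.txt` suffix); the names `WorkName` and `parse_work_filename` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class WorkName(NamedTuple):
    year: int
    title: str


# Assumed pattern: <4-digit year>_<title with underscores>.epub.txt
_PATTERN = re.compile(r"^(\d{4})_(.+)\.epub\.txt$")


def parse_work_filename(filename: str) -> Optional[WorkName]:
    """Split a corpus filename into year and title, or return None."""
    match = _PATTERN.match(filename)
    if match is None:
        return None
    return WorkName(int(match.group(1)), match.group(2).replace("_", " "))
```

With the example above, `parse_work_filename("1886_Un_mysterieux_amour.epub.txt")` gives the year 1886 and the title "Un mysterieux amour".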
The metadata of the corpus is stored in a CSV file. The scripts that gather corpora from online libraries (e.g. Wikisource, Project Gutenberg, etc.) must be run with Python 3. Beforehand, one needs to install the Python package selenium, as well as the external tools Geckodriver (from https://github.com/mozilla/geckodriver/releases) and pandoc (from https://pandoc.org/installing.html).
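Before running the scripts, it can be useful to verify that these prerequisites are in place. The sketch below is a hypothetical helper, not part of the distributed scripts; it assumes that the `geckodriver` and `pandoc` executables are expected on the `PATH`, which may not match how the scripts actually locate them.

```python
import importlib.util
import shutil
from typing import List


def missing_dependencies() -> List[str]:
    """List the prerequisites that cannot be found on this machine."""
    missing = []
    # selenium is a Python package, so look it up as an importable module.
    if importlib.util.find_spec("selenium") is None:
        missing.append("selenium (Python package)")
    # Geckodriver and pandoc are external programs, looked up on the PATH.
    for executable in ("geckodriver", "pandoc"):
        if shutil.which(executable) is None:
            missing.append(f"{executable} (executable on PATH)")
    return missing


if __name__ == "__main__":
    for item in missing_dependencies():
        print("Missing:", item)
```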
The corpus was created between 2020-10-01 and 2021-04-07.
Olga Seminck and Philippe Gambette
Corpus: public domain; metadata: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); processing scripts: GPLv3 License.
Zenodo and Ortolang
(4) Reuse Potential
Our resource can be used not only for idiolect studies but also in other contexts, such as authorship attribution, stylometric studies, and literary studies of 19th- and early 20th-century realism, naturalism, adventure novels and detective fiction. The fact that four of the eleven authors in our corpus are women may also open up perspectives in gender studies; see Rybicki (2016) for an example on English. Moreover, the scripts can be reused by anyone who wants to compile their own corpus of e-books.