The Corpus for Idiolectal Research (CIDRE)

Olga Seminck; Philippe Gambette; Dominique Legallois; Thierry Poibeau

(1) Overview

An idiolect is the language of an individual and, like language in general, it is subject to change over time. However, the idiolect remains understudied, especially in quantitative linguistics, because large, relevant corpora are lacking. We therefore developed the CIDRE corpus, the first corpus for the diachronic and quantitative study of idiolect in French (logo in Figure 1). Together with the EMMA corpus of 17th-century English, it is one of the few quantitative resources suited to stylochronometry.

Figure 1

The logo of CIDRE.

To collect as much data per person as possible within a single genre (to enable comparison), we decided to use the fiction of prolific 19th-century writers. This type of data has several advantages: fiction works tend to be long, which provides large quantities of data; they are in the public domain; high-quality e-books are available; and the orthography of the period is very similar to today's, which makes it possible to use off-the-shelf NLP systems.

Using various websites distributing free epub files, we included in CIDRE, as exhaustively as possible, the fiction works of Gustave Aimard, Honoré de Balzac, Paul Féval, Henry Gréville, Daniel Lesueur, Pierre-Alexis Ponson du Terrail, George Sand, La Comtesse de Ségur, Jules Verne, Michel Zévaco and Émile Zola (see Table 1 and Figures 2 and 3 for more details). We dated the works with the year they were written in, if this information was available, and with the first year of publication otherwise. In this way, each work can be seen as a datapoint characteristic of the way the author was writing at the time.
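The dating rule described above can be sketched in a few lines of Python. This is only an illustration, not the project's own code: the `Work` record and its field names are hypothetical and do not correspond to the columns of the CIDRE metadata file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Work:
    # Hypothetical record; these field names are assumptions,
    # not the column names of the CIDRE metadata file.
    title: str
    year_written: Optional[int] = None
    year_first_published: Optional[int] = None


def assign_date(work: Work) -> Optional[int]:
    """Prefer the year of writing; otherwise use the first publication year."""
    if work.year_written is not None:
        return work.year_written
    # May still be None, in which case the work cannot be dated.
    return work.year_first_published
```

A work for which `assign_date` returns `None` has no usable date and therefore cannot serve as a datapoint on the author's timeline.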

Table 1

Summary of the content of CIDRE.


AUTHOR	NUMBER OF WORKS	EARLIEST YEAR	LATEST YEAR
Gustave Aimard	24	1858	1881
Honoré de Balzac	59	1829	1848
Paul Féval	23	1843	1881
Henry Gréville	36	1876	1892
Daniel Lesueur	31	1882	1911
Pierre-Alexis Ponson du Terrail	42	1852	1870
George Sand	62	1831	1875
La Comtesse de Ségur	22	1856	1871
Jules Verne	58	1862	1905
Michel Zévaco	29	1906	1926
Émile Zola	35	1864	1903

Figure 2

The distribution of the data in CIDRE.

Figure 3

Summary of the content of CIDRE.

Repository location

https://zenodo.org/record/4707812#.YK-Tai8ivs0

Context

This resource was produced as part of a research project investigating large corpora of French literature with advanced natural language processing methods.

(2) Method

CIDRE was produced using programming scripts in Python and manual gathering of metadata.

Steps

For each author, we compiled a list of fiction works to be included in the corpus (recorded in the metadata file).
We collected hyperlinks to downloadable epub versions of these works.
We dated each fiction work manually using various sources, and stored the result in the metadata file together with the source used for the dating.
We fed the metadata into a first Python script that downloads all epub files.
We applied a second Python script to the downloaded files to obtain .txt files stripped of paratext (e.g. prefaces or license declarations).

Sampling strategy

To select authors, we searched for 19th-century writers whose fiction is available in the public domain as good-quality epub files. We kept only works whose preprocessed .txt file is at least 100 kB (~16,500 words). We excluded works with co-authors (for example, some posthumous novels by Jules Verne), authors known to have worked with ghostwriters (e.g. Alexandre Dumas), and works that we were unable to date.
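The selection criteria can be expressed as a simple filter. The sketch below is illustrative and rests on stated assumptions: the `Candidate` record and its fields are hypothetical, and the 100-kilobyte threshold is read here as 100,000 bytes, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the 100-kilobyte threshold is interpreted as 100,000 bytes.
MIN_SIZE_BYTES = 100_000


@dataclass
class Candidate:
    # Hypothetical record describing one candidate work.
    title: str
    size_bytes: int                 # size of the preprocessed .txt file
    year: Optional[int]             # None when the work could not be dated
    has_coauthor: bool = False
    author_used_ghostwriters: bool = False


def is_eligible(c: Candidate) -> bool:
    """Return True if the candidate meets every sampling criterion."""
    return (
        c.size_bytes >= MIN_SIZE_BYTES
        and c.year is not None
        and not c.has_coauthor
        and not c.author_used_ghostwriters
    )
```

Applying `is_eligible` to a list of candidates, for instance with `[c for c in candidates if is_eligible(c)]`, keeps only the works that are long enough, dated, and attributable to a single author.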

Quality control

The first Python script, step1-getEBooks.py, is responsible for the correct naming of all e-books. The second, step2-convertToTei.py, removes prefaces, image descriptions and license declarations by first converting the epub file into a TEI file, using the software, then parsing through the TEI structure and selecting only the text that has been written by the author in the year of writing of the novel. Finally, a manual cleaning phase removed dedications and prefaces that remained undetected.

(3) Dataset Description

Object name

CIDRE.zip

Format names and versions

Fiction works are distributed in .txt format, in directories named after the authors' last names. Each filename starts with the year of writing (or of first publication, when the year of writing is unknown), followed by an underscore and the title of the work, with words separated by underscores. For example, 1886_Un_mysterieux_amour.epub.txt is located in the directory 'lesueur'.
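Because the year and the title are encoded in each filename, a work's metadata can be recovered from its path alone. The following sketch shows one way to do this; the function and type names are hypothetical, and the assumption that files end in `.epub.txt` is drawn from the single example above.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class WorkFile(NamedTuple):
    author_dir: str
    year: int
    title: str


def parse_work_path(path: str) -> WorkFile:
    """Split 'author/YEAR_Title_words.epub.txt' into its components."""
    p = PurePosixPath(path)
    name = p.name
    suffix = ".epub.txt"  # assumption based on the documented example
    # Fall back to dropping only the last extension if the suffix differs.
    stem = name[: -len(suffix)] if name.endswith(suffix) else p.stem
    year_text, _, title_part = stem.partition("_")
    if not (year_text.isdigit() and title_part):
        raise ValueError(f"Unexpected filename pattern: {name}")
    return WorkFile(p.parent.name, int(year_text), title_part.replace("_", " "))


example = parse_work_path("lesueur/1886_Un_mysterieux_amour.epub.txt")
print(example.author_dir, example.year, example.title)
```

Sorting the parsed records of one author by `year` yields that author's works in chronological order, which is the arrangement a diachronic study needs.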

The metadata of the corpus is stored in a CSV file. The scripts that gather corpora from online libraries (e.g. Wikisource, Project Gutenberg) must be executed with Python 3. Beforehand, one needs to install the Python package selenium, as well as Geckodriver (from https://github.com/mozilla/geckodriver/releases) and pandoc (from https://pandoc.org/installing.html).

Creation dates

This corpus was created between 2020-10-01 and 2021-04-07.

Dataset creators

Olga Seminck and Philippe Gambette

Language

French

License

Corpus: public domain; metadata: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0); processing scripts: GPLv3 License.

Repository name

Zenodo and Ortolang

Publication date

2021-03-30

(4) Reuse Potential

Our resource can be used not only for idiolect studies but also in other contexts, such as authorship attribution, stylometric studies, and literary studies of realism, naturalism, adventure novels and detective fiction of the 19th and early 20th centuries. The fact that four of the eleven authors in our corpus are women may also open perspectives in gender studies; see Rybicki for an example on English. Moreover, the scripts can be reused by anyone who wants to compile their own corpus of e-books.

Journal of Open Humanities Data

Data Papers