The CONLIT Dataset of Contemporary Literature

Andrew Piper

(1) Overview

Repository location

https://doi.org/10.6084/m9.figshare.21166171.v1

Context

Access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to provide researchers with freely accessible derived data from a robust collection of professionally published writing in English produced since 2001, spanning 12 genre categories. While the term “genre” has been understood in multiple ways within the research community over the years, we define genre for our purposes as a form of institutionally framed classification. According to this definition, a book’s genre is the distinct category of writing under which a given institution labels it.

As we show with the overview of our data (Table 1), our institutional frameworks can include bestseller lists, prize committee shortlists, book review lists, user-generated “choice awards”, or corporate forms of categorization. Taken together, they allow research on three different types of institutional framing: cultural capital, stylistic affinity, and reading level. Rather than rely on a single “best” framework, we choose to include multiple forms of selection to allow researchers to explore the effects of different institutional frameworks on stylistic behavior.

Table 1

List of genres, their selection criteria, and the total number of documents per category.


CODE	GENRE	INSTRUMENTALITY	PLATFORM	SELECTION CRITERIA	# DOCS
BIO	Biography	Non-fiction	Goodreads	“Best memoir/biography/autobiography” list	193
BS	Bestseller	Fiction	New York Times	Fiction published since 2001 with the longest aggregate time on the New York Times bestseller list	249
HIST	History	Non-fiction	Amazon	Books listed under “history” under the “bestsellers” tag	205
MEM	Memoir	Non-fiction	Amazon	Books listed under “memoir” under the “bestsellers” tag	229
MID	Middle school	Fiction	Goodreads	Goodreads Choice Awards for “Middle Grade” books	166
MIX	Assorted non-fiction	Non-fiction	Amazon	Books listed under assorted non-fiction tags such as “health”, “politics”, and “business”, under the “bestsellers” tag	193
MY	Mystery	Fiction	Amazon	Books listed under “Mystery, Thriller, Suspense” under the “bestsellers” tag	234
NYT	New York Times reviewed	Fiction	New York Times	Fiction reviewed in the New York Times Book Review	419
PW	Prizelists	Fiction	5 Prizelists (US, UK, Canada)	Works shortlisted for the National Book Award (US), PEN/Faulkner Award (US), Governor General’s Award (Canada), Giller Prize (Canada), and the Man Booker Prize (UK)	258
ROM	Romance	Fiction	Amazon	Books listed under “Romance” under the “bestsellers” tag	208
SF	Science-Fiction	Fiction	Amazon	Books listed under “Science Fiction & Fantasy” under the “bestsellers” tag	223
YA	Young Adult	Fiction	Goodreads	Goodreads Choice Awards for Young Adult Fiction	177

In addition to our manually curated selection of books, we also provide researchers with a set of derived features that can be used for further research on the style and content of books (described in Table 2).

Table 2

List of 20 features included in our data.


FEATURE	DESCRIPTION	ANNOTATION TYPE
Category	Fiction or non-fiction	Manual
Genre	Twelve categories	Manual
Publication Date	Date of first publication	Manual
Author Gender	Perceived authorial gender	Manual
POS	Part-of-speech uni- and bigrams	Computational
Supersense	Frequency of 41 word supersenses	Computational
Word Frequencies	Word frequencies for every book/1,000-word passage	Computational
Token Count	Work length measure	Computational
Total Characters	Estimated total number of named characters	Computational
Protagonist Concentration	Percentage of all character mentions by main character	Computational
Avg. Sentence Length	Average length of all sentences per book	Computational
Avg. Word Length	Average length of all words per book	Computational
Tuldava Score	Reading difficulty measure	Computational
Event Count	Estimated number of diegetic events	Computational
Goodreads Avg. Rating	Average user rating on Goodreads	Computational
Goodreads Total Ratings	Total number of ratings on Goodreads as of June 2022	Computational
Average Speed	Measure of narrative pace	Computational
Minimum Speed	Measure of narrative distance	Computational
Volume	Measure of topical heterogeneity	Computational
Circuitousness	Measure of narrative non-linearity	Computational
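To make a few of the simpler features in Table 2 concrete, the following minimal Python sketch computes token count, average sentence length, average word length, and protagonist concentration for a toy text. It is an illustration only and not the pipeline used to build CONLIT: the function name `simple_features`, the regular-expression tokenizer, the naive sentence splitter, and the `mentions` mapping are all assumptions made for this example, whereas the dataset's computational features come from bookNLP.

```python
import re


def simple_features(text, mentions):
    """Compute toy versions of four Table 2 features for one book.

    text: the book as a plain string.
    mentions: hypothetical mapping of character name -> number of mentions.
    """
    # Naive sentence split on ., !, ? (an assumption; real segmentation differs).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Naive word tokenizer: runs of letters and apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    total_mentions = sum(mentions.values())
    return {
        "token_count": len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Share of all character mentions belonging to the most-mentioned character.
        "protagonist_concentration": 100 * max(mentions.values()) / total_mentions,
    }


features = simple_features(
    "Ana ran home. Ben saw Ana. Ana slept.",
    {"Ana": 3, "Ben": 1},
)
print(features)
```

In this toy case the most-mentioned character accounts for three of four mentions, so the protagonist concentration is 75 percent.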

(2) Method

Steps

Dataset construction proceeded in four steps. Books were (1) manually selected according to the sampling strategies described in Table 1; (2) digitized and manually cleaned; (3) processed using the “large model” of bookNLP; and (4) manually and computationally annotated for the features indicated in Table 2.

Sampling strategy

All books were chosen to represent “popular” writing across 12 different genres of contemporary publishing spanning a 20-year timeframe dating from 2001 through 2021. We define “popular” through multiple criteria that include user-generated awards or lists, elite prize committee lists or book reviews, or bestseller tags on platforms like Amazon or the New York Times. As a further way to validate popularity, we provide two measures drawn from the platform Goodreads.

We define genre through three different kinds of institutional framing: cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age-level (middle-grade and young adult (YA)). This allows researchers a high degree of flexibility to better understand stylistic behavior of professionally published books targeting different kinds of readerships. We also segment our genres by the “instrumentality” of the information contained (“fiction” or “non-fiction”).

While our genre categories are not mutually exclusive (mysteries may appear in Bestsellers and vice versa), no book appears in more than one category. It is important to note that our larger genre categories (cultural capital, style, age) are not necessarily commensurate with one another, and thus researchers should use caution when comparing across these categories. Experimentation with alternative genre labeling systems can be a further affordance of this dataset. Finally, we aimed to select ca. 200 works per category, which we have found is sufficient for training robust text classification algorithms. Due to text availability, list sizes, and cleaning, some categories have more or fewer than this number. In the case of books reviewed in the New York Times, we iterated twice on this process, yielding a larger category (419 books). In total, we assemble 2,754 books representing 2,234 unique authors across 12 genres.
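As a quick consistency check, the per-category counts in Table 1 can be summed to confirm the stated total. The sketch below hard-codes the counts and the fiction/non-fiction labels from Table 1; the variable names are chosen for this illustration only.

```python
# Document counts per genre code, copied from Table 1.
doc_counts = {
    "BIO": 193, "BS": 249, "HIST": 205, "MEM": 229,
    "MID": 166, "MIX": 193, "MY": 234, "NYT": 419,
    "PW": 258, "ROM": 208, "SF": 223, "YA": 177,
}
# Genre codes labeled "Non-fiction" in Table 1; all others are fiction.
non_fiction_codes = {"BIO", "HIST", "MEM", "MIX"}

total = sum(doc_counts.values())
non_fiction = sum(n for code, n in doc_counts.items() if code in non_fiction_codes)
fiction = total - non_fiction
mean_per_genre = total / len(doc_counts)
largest = max(doc_counts, key=doc_counts.get)
smallest = min(doc_counts, key=doc_counts.get)

print(total, fiction, non_fiction, mean_per_genre, largest, smallest)
```

The counts sum to 2,754, of which 1,934 are fiction and 820 non-fiction, with an average of 229.5 books per category.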

To further understand our data, we provide figures of the distribution of publication dates (Figure 1), the average user rating on Goodreads (Figure 2), and the log-transformed number of ratings on Goodreads (Figure 3) to capture book popularity. Finally, while no attention was given to the selection of books based on author gender, our gender distribution across all books is 49.76% women and 49.94% men, with only eight books written by self-identified non-binary authors. We note, however, that there are meaningful within-genre differences (Figure 4), as predicted by prior research.

Figure 1

Distribution of publication dates of books in our sample.

Figure 2

Distribution of the average user rating on Goodreads for books in our sample. Only includes books with > 9 ratings.

Figure 3

Distribution of the log-transformed number of ratings on Goodreads for books in our sample. Only includes books with > 9 ratings.

Figure 4

Distribution of author gender by genre.
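The filtering and transformation behind Figures 2 and 3 can be sketched as follows. This is a minimal illustration on invented records: the field names `avg_rating` and `total_ratings` are hypothetical, and the base-10 logarithm is an assumption, since the base of the log transform is not specified here.

```python
import math

# Invented example records; field names are hypothetical.
books = [
    {"title": "A", "avg_rating": 4.1, "total_ratings": 125000},
    {"title": "B", "avg_rating": 3.6, "total_ratings": 9},
    {"title": "C", "avg_rating": 3.9, "total_ratings": 10},
]

MIN_RATINGS = 10  # "> 9 ratings" in the figure captions

# Keep only books with enough ratings for the average to be meaningful.
rated = [b for b in books if b["total_ratings"] >= MIN_RATINGS]

# Log-transform the rating counts (base 10 is an assumption).
log_counts = [math.log10(b["total_ratings"]) for b in rated]

print([b["title"] for b in rated], log_counts)
```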

Quality Control

All texts were manually cleaned of front and end matter. Metadata such as publication date, authorial gender, author name and title were all manually entered. The dataset was manually reviewed for the appropriateness of genre labels for every book. Finally, duplicates were removed and any books that were not at least 15,000 tokens in length were also removed. No maximum length was set.
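The last two quality-control steps, duplicate removal and the minimum-length threshold, can be sketched as a simple filter. The record fields and the choice to identify duplicates by normalized author and title are assumptions for this illustration; the review described above was carried out manually.

```python
MIN_TOKENS = 15_000  # minimum length stated above; no maximum is applied


def quality_filter(records):
    """Drop duplicate books and books shorter than MIN_TOKENS.

    records: list of dicts with hypothetical keys
    "author", "title", and "token_count".
    """
    seen = set()
    kept = []
    for rec in records:
        # Assumption: a duplicate is the same author and title, ignoring case.
        key = (rec["author"].strip().lower(), rec["title"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        if rec["token_count"] < MIN_TOKENS:
            continue
        kept.append(rec)
    return kept


sample = [
    {"author": "Author One", "title": "First Book", "token_count": 98_000},
    {"author": "author one", "title": "First Book ", "token_count": 98_000},
    {"author": "Author Two", "title": "Short Book", "token_count": 14_999},
    {"author": "Author Three", "title": "Long Book", "token_count": 15_000},
]
cleaned = quality_filter(sample)
print([rec["title"] for rec in cleaned])
```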

Limitations

Our data is limited by intellectual property restrictions that do not allow access to full text data. To overcome this limitation, we provide a robust set of derived data that has served in prior research as a reliable foundation for the stylistic understanding of creative writing. Our data is also limited by focusing on a single language. Future work will want to emphasize multilingual data construction to facilitate our understanding of cross-cultural stylistic behavior. Finally, for both manually and computationally derived features, we expect there to be some level of error. For the manual features, we have undertaken two levels of review. For the computational features, the bookNLP documentation provides estimates of the expected error rates of different predictive models. Nevertheless, it is important for researchers to be aware that our derived features are always estimates. We would flag “Total Characters” and “Event Count” as two features that are worth further research due to the challenging nature of their prediction.

(3) Dataset description

Object name

CONLIT

Format names and versions

.CSV

Creation dates

Start date: 2015-03-10; End date: 2022-06-22.

Dataset Creators

Andrew Piper (McGill University) was responsible for the overall design of the dataset. Eve Kraicer (McGill University) and Joey Love (McGill University) assisted with cleaning and processing the data.

Language

English

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Repository name

Figshare

Publication date

2022-09-22

(4) Reuse potential

Prior work on the computationally driven study of genre has focused on using different selection mechanisms to better understand the role that genre plays in organizing literary communities and reader responses, ranging from studies of historical text data to contemporary reader response data. Summarizing this work, one could say that research on the content or stylistic aspects of genre has largely focused on historical data, while research into contemporary genre formations has largely focused on metadata or non-professionally published writing.

Our dataset is thus designed to give researchers access to stylistic data of contemporary, professionally published writing that spans a range of genre definitions and institutional frameworks. Doing so can help further research into understanding the role genre plays in constraining authorial behavior. It can also facilitate further understanding of the role that differentiation plays in genre classification (Sharma et al., 2022). As genre theorist Ralph Cohen argued some time ago, “A genre, therefore, is to be understood in relation to other genres, so that its aims and purposes at a particular time are defined by its interrelation with and differentiation from others”. Our data will facilitate the empirical exploration of such theories.

By providing Goodreads user response data, our dataset also allows further research into the relationship between style and success. The links provided to the Goodreads versions of our books also allow our data to be combined with reader-based response data. An exciting new avenue of literary study aims to better understand the causes and conditions of readers’ responses to texts, and our data provides the infrastructure to undertake such a research program across a large, diverse set of professionally published contemporary writing.
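Combining the derived features with external reader-response data amounts to a join on the book link. The sketch below shows one way this could look with plain Python dictionaries. The column names (`link`, `genre`, `avg_sentence_length`, `n_reviews`), the placeholder URLs, and the helper `left_join` are all hypothetical and do not reflect the actual column names in the CONLIT files.

```python
# Hypothetical CONLIT rows keyed by a placeholder link column.
conlit_rows = [
    {"link": "https://example.org/book/1", "genre": "MY", "avg_sentence_length": 12.4},
    {"link": "https://example.org/book/2", "genre": "ROM", "avg_sentence_length": 10.9},
]
# Hypothetical external reader-response records.
reader_rows = [
    {"link": "https://example.org/book/1", "n_reviews": 840},
]


def left_join(left, right, key):
    """Attach fields from `right` to each row of `left` that shares `key`."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        extra = index.get(row[key], {})
        # Values from the left table win if a field name appears in both.
        joined.append({**extra, **row})
    return joined


merged = left_join(conlit_rows, reader_rows, "link")
print(merged)
```

A left join keeps every book in the dataset, so books without matching reader-response records remain available for stylistic analysis.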

Journal of Open Humanities Data

Data Papers