(1) Overview

Repository location

https://doi.org/10.6084/m9.figshare.21166171.v1

Context

Access to well-defined collections of contemporary writing remains extremely limited due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to provide researchers with freely accessible derived data for a robust collection of professionally published English-language writing produced since 2001, spanning 12 different genre categories. While the term “genre” has been understood in multiple ways within the research community over the years (; ), we define genre for our purposes as a form of institutionally framed classification (). Under this definition, a genre is whatever distinct category of writing a given institution uses to label a book.

As the overview of our data shows (Table 1), our institutional frameworks include bestseller lists, prize committee shortlists, book review lists, user-generated “choice awards”, and corporate forms of categorization. Taken together, they allow research on three different types of institutional framing: cultural capital, stylistic affinity, and reading level. Rather than rely on a single “best” framework, we include multiple forms of selection so that researchers can explore the effects of different institutional frameworks on stylistic behavior.

Table 1

List of genres, their selection criteria, and the total number of documents per category.


| CODE | GENRE | INSTRUMENTALITY | PLATFORM | SELECTION CRITERIA | # DOCS |
| --- | --- | --- | --- | --- | --- |
| BIO | Biography | Non-fiction | Goodreads | “Best memoir/biography/autobiography” list | 193 |
| BS | Bestseller | Fiction | New York Times | Fiction published since 2001 with the longest aggregate time on the New York Times bestseller list | 249 |
| HIST | History | Non-fiction | Amazon | Books listed under “history” under the “bestsellers” tag | 205 |
| MEM | Memoir | Non-fiction | Amazon | Books listed under “memoir” under the “bestsellers” tag | 229 |
| MID | Middle school | Fiction | Goodreads | Goodreads Choice Awards for “Middle Grade” books | 166 |
| MIX | Assorted non-fiction | Non-fiction | Amazon | Books listed under assorted non-fiction tags such as “health”, “politics”, and “business”, under the “bestsellers” tag | 193 |
| MY | Mystery | Fiction | Amazon | Books listed under “Mystery, Thriller, Suspense” under the “bestsellers” tag | 234 |
| NYT | New York Times reviewed | Fiction | New York Times | Fiction reviewed in the New York Times Book Review | 419 |
| PW | Prizelists | Fiction | 5 prizelists (US, UK, Canada) | Works shortlisted for the National Book Award (US), PEN/Faulkner Award (US), Governor General’s Award (Canada), Giller Prize (Canada), and the Man Booker Prize (UK) | 258 |
| ROM | Romance | Fiction | Amazon | Books listed under “Romance” under the “bestsellers” tag | 208 |
| SF | Science-Fiction | Fiction | Amazon | Books listed under “Science Fiction & Fantasy” under the “bestsellers” tag | 223 |
| YA | Young Adult | Fiction | Goodreads | Goodreads Choice Awards for Young Adult Fiction | 177 |

In addition to our manually curated selection of books, we also provide researchers with a set of derived features that can be used for further research on the style and content of these books (described in Table 2; a minimal loading sketch follows the table).

Table 2

List of 20 features included in our data.


| FEATURE | DESCRIPTION | ANNOTATION TYPE |
| --- | --- | --- |
| Category | Fiction or non-fiction | Manual |
| Genre | Twelve categories | Manual |
| Publication Date | Date of first publication | Manual |
| Author Gender | Perceived authorial gender | Manual |
| POS | Part-of-speech uni- and bigrams | Computational |
| Supersense | Frequency of 41 word supersenses | Computational |
| Word Frequencies | Word frequencies for every book/1,000-word passage | Computational |
| Token Count | Work length measure | Computational |
| Total Characters | Estimated total number of named characters | Computational |
| Protagonist Concentration | Percentage of all character mentions belonging to the main character | Computational |
| Avg. Sentence Length | Average length of all sentences per book | Computational |
| Avg. Word Length | Average length of all words per book | Computational |
| Tuldava Score | Reading difficulty measure | Computational |
| Event Count | Estimated number of diegetic events | Computational |
| Goodreads Avg. Rating | Average user rating on Goodreads | Computational |
| Goodreads Total Ratings | Total number of ratings on Goodreads as of June 2022 | Computational |
| Average Speed | Measure of narrative pace | Computational |
| Minimum Speed | Measure of narrative distance | Computational |
| Volume | Measure of topical heterogeneity | Computational |
| Circuitousness | Measure of narrative non-linearity | Computational |
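To give a sense of how these features can be put to use, the following minimal Python sketch loads the metadata table and summarizes one derived feature by genre. The file name (CONLIT_META.csv) and column names are illustrative assumptions; the actual headers are documented in the Figshare deposit.

```python
import pandas as pd

# Load the CONLIT metadata table. The file name is an illustrative
# assumption; see the Figshare deposit for the actual file names.
meta = pd.read_csv("CONLIT_META.csv")

# Inspect the four manually annotated fields (column names assumed).
print(meta[["Category", "Genre", "Pub_Date", "Author_Gender"]].head())

# Summarize one derived stylistic feature by genre, e.g. average
# sentence length (column name assumed).
print(meta.groupby("Genre")["Avg_Sentence_Length"].mean().sort_values())
```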

(2) Method

Steps

Dataset construction proceeded in four steps: (1) books were manually selected according to the sampling strategies described in Table 1; (2) texts were digitized and manually cleaned; (3) texts were processed using the “large model” of bookNLP (); and (4) books were manually and computationally annotated for the features indicated in Table 2.
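For reference, the sketch below shows how a single cleaned text can be processed with bookNLP’s large model, following the library’s documented interface; the file paths and book identifier are placeholders.

```python
from booknlp.booknlp import BookNLP

# Run the full bookNLP pipeline with the large ("big") model,
# as documented by the library.
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
}
booknlp = BookNLP("en", model_params)

# Placeholder paths: one cleaned plain-text book in; token, entity,
# supersense, quote, and event files out.
input_file = "texts/example_book.txt"
output_dir = "booknlp_output/example_book/"
book_id = "example_book"

booknlp.process(input_file, output_dir, book_id)
```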

Sampling strategy

All books were chosen to represent “popular” writing across 12 different genres of contemporary publishing, spanning a 20-year timeframe from 2001 through 2021. We define “popular” through multiple criteria, including user-generated awards or lists, elite prize committee lists or book reviews, and bestseller tags on platforms like Amazon or the New York Times. As a further way to validate popularity, we provide two measures drawn from the platform Goodreads.

We define genre through three different kinds of institutional framing: cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age-level (middle-grade and young adult (YA)). This allows researchers a high degree of flexibility to better understand the stylistic behavior of professionally published books targeting different kinds of readerships. We also segment our genres by the “instrumentality” of the information they contain (“fiction” or “non-fiction”).
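For researchers who want to regroup the genres by framing type, the mapping can be encoded as a simple lookup over the Table 1 codes. Note that this dictionary is our own illustrative reading of the paragraph above (the “etc.” is resolved by assigning the remaining stylistic genres), not a field shipped with the data.

```python
# Genre codes from Table 1 mapped to the three kinds of institutional
# framing described above. Entries marked (*) are inferred from the
# text's "etc." rather than listed explicitly.
FRAMING = {
    "BS": "cultural capital",
    "PW": "cultural capital",
    "NYT": "cultural capital",
    "MY": "stylistic affinity",
    "SF": "stylistic affinity",
    "BIO": "stylistic affinity",
    "ROM": "stylistic affinity",   # (*)
    "MEM": "stylistic affinity",   # (*)
    "HIST": "stylistic affinity",  # (*)
    "MIX": "stylistic affinity",   # (*)
    "MID": "age-level",
    "YA": "age-level",
}
```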

While our genre categories are not mutually exclusive (mysteries may appear among Bestsellers and vice versa), no book appears in two separate categories. It is important to note that our larger genre categories (cultural capital, style, age) are not necessarily commensurate with one another, and researchers should therefore use caution when comparing across these categories. Experimentation with alternative genre labeling systems is a further affordance of this dataset. Finally, we aimed to select ca. 200 works per category, a number we have found sufficient for training robust text classification algorithms. Due to text availability, list sizes, and cleaning, some categories contain more or fewer than this number. In the case of books reviewed in the New York Times, we iterated twice on this process. In total, we assembled 2,754 books representing 2,234 unique authors across 12 genres.
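The exclusivity and coverage claims above are easy to check against the metadata table; a sketch follows, with file and column names assumed as before.

```python
import pandas as pd

meta = pd.read_csv("CONLIT_META.csv")  # illustrative file name

# Since each book carries exactly one genre label, a title/author
# pair should never recur under a second category.
dupes = meta.duplicated(subset=["Title", "Author"], keep=False)
assert not dupes.any(), "a book appears in more than one category"

# Documents per genre (cf. Table 1) and unique authors overall
# (expected: 2,754 books by 2,234 unique authors).
print(meta["Genre"].value_counts())
print(meta["Author"].nunique())
```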

To further characterize our data, we provide figures of the distribution of publication dates (Figure 1), the average user rating on Goodreads (Figure 2), and the log-transformed number of ratings on Goodreads (Figure 3), the latter two capturing book popularity. Finally, although author gender played no role in book selection, the gender distribution across all books is 49.76% women and 49.94% men, with only eight books written by self-identified non-binary authors. We note, however, that there are meaningful within-genre differences (Figure 4), as predicted by prior research ().
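Figures 2 and 3 can be approximated from the two Goodreads measures with a short script along the following lines; column names are assumed, and books with fewer than 10 ratings are excluded, matching the figure captions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

meta = pd.read_csv("CONLIT_META.csv")  # illustrative file name

# Keep only books with more than 9 Goodreads ratings.
rated = meta[meta["GR_Total_Ratings"] > 9]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Cf. Figure 2: distribution of average user ratings.
ax1.hist(rated["GR_Avg_Rating"], bins=30)
ax1.set_xlabel("Average Goodreads rating")

# Cf. Figure 3: ratings counts are heavily skewed, so log-transform.
ax2.hist(np.log10(rated["GR_Total_Ratings"]), bins=30)
ax2.set_xlabel("log10(number of Goodreads ratings)")

plt.tight_layout()
plt.show()
```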

Figure 1 

Distribution of publication dates of books in our sample.

Figure 2 

Distribution of the average user rating on Goodreads for books in our sample. Only includes books with > 9 ratings.

Figure 3 

Distribution of the log-transformed number of ratings on Goodreads for books in our sample. Only includes books with > 9 ratings.

Figure 4 

Distribution of author gender by genre.

Quality Control

All texts were manually cleaned of front and end matter. Metadata such as publication date, author gender, author name, and title were all manually entered. The dataset was manually reviewed for the appropriateness of the genre label assigned to every book. Finally, duplicates were removed, as were any books shorter than 15,000 tokens. No maximum length was set.
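In code, the deduplication and length filter described here reduce to two operations; as before, the file and column names are assumptions.

```python
import pandas as pd

meta = pd.read_csv("CONLIT_META.csv")  # illustrative file name

# Keep one record per work (duplicates removed on title/author).
meta = meta.drop_duplicates(subset=["Title", "Author"])

# Enforce the 15,000-token minimum; no maximum length is imposed.
meta = meta[meta["Token_Count"] >= 15000]
```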

Limitations

Our data is limited by intellectual property restrictions that do not allow access to full-text data. To overcome this limitation, we provide a robust set of derived data that has served in prior research as a reliable foundation for the stylistic understanding of creative writing. Our data is also limited to a single language. Future work should emphasize multilingual data construction to facilitate our understanding of cross-cultural stylistic behavior. Finally, for both manually and computationally derived features, we expect some level of error. For the manual features, we have undertaken two levels of review. For the computational features, the bookNLP documentation provides estimates of the expected error rates of its different predictive models. Nevertheless, researchers should be aware that our derived features are always estimates. We would flag “Total Characters” and “Event Count” as two features that warrant further research due to the challenging nature of their prediction.

(3) Dataset description

Object name

CONLIT

Format names and versions

.CSV

Creation dates

Start date: 2015-03-10; End date: 2022-06-22.

Dataset Creators

Andrew Piper (McGill University) was responsible for the overall design of the dataset. Eve Kraicer (McGill University) and Joey Love (McGill University) assisted with cleaning and processing the data.

Language

English

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

Repository name

Figshare

Publication date

2022-09-22

(4) Reuse potential

Prior work on the computationally driven study of genre has used different selection mechanisms to better understand the role that genre plays in organizing literary communities and reader responses, ranging from studies of historical text data (; ; ) to contemporary reader response data (; ; ). Summarizing this work, one could say that research on the content or stylistic aspects of genre has largely focused on historical data, while research into contemporary genre formations has largely focused on metadata or non-professionally published writing.

Our dataset is thus designed to give researchers access to stylistic data of contemporary, professionally published writing that spans a range of genre definitions and institutional frameworks. Doing so can help further research into the role genre plays in constraining authorial behavior. It can also facilitate further understanding of the role that differentiation plays in genre classification (Sharma et al., 2022). As genre theorist Ralph Cohen argued some time ago, “A genre, therefore, is to be understood in relation to other genres, so that its aims and purposes at a particular time are defined by its interrelation with and differentiation from others” (). Our data will facilitate the empirical exploration of such theories.

By providing Goodreads user response data, our dataset also allows further research into the relationship between style and success (). The links provided to the Goodreads versions of our books also allow our data to be combined with reader-based response data. An exciting new avenue of literary study aims to better understand the causes and conditions of readers’ responses to texts (; ; ), and our data provides the infrastructure to undertake such a research program across a large, diverse set of professionally published contemporary writing.
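As one concrete example of such reuse, a researcher might test whether a stylistic feature covaries with reception, for instance by correlating narrative pace with average Goodreads rating within each genre. This is a sketch under the same assumed file and column names as above, not a result we report.

```python
import pandas as pd
from scipy.stats import spearmanr

meta = pd.read_csv("CONLIT_META.csv")  # illustrative file name
rated = meta[meta["GR_Total_Ratings"] > 9]

# Rank correlation between narrative pace and reader ratings,
# computed within each genre to control for genre-level differences.
for genre, group in rated.groupby("Genre"):
    rho, p = spearmanr(group["Average_Speed"], group["GR_Avg_Rating"])
    print(f"{genre}: rho={rho:.2f} (p={p:.3f})")
```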