(1) Overview
Repository location
https://doi.org/10.6084/m9.figshare.21166171.v1
Context
Access to well-defined collections of contemporary writing is extremely limited today due to intellectual property restrictions, corporate control of data, and the absence of clear consensus surrounding literary categorization. Our dataset is designed to provide researchers with freely accessible derived data from a robust collection of professionally published writing in English produced since 2001, which spans 12 different genre categories. While the term “genre” has been understood in multiple ways within the research community over the years, we define genre for our purposes as a form of institutionally framed classification. According to this definition, a book’s genre is the distinct category of writing that a given institution assigns to it.
As we show with the overview of our data (Table 1), our institutional frameworks can include bestseller lists, prize committee shortlists, book review lists, user-generated “choice awards”, or corporate forms of categorization. Taken together, they allow research on three different types of institutional framing: cultural capital, stylistic affinity, and reading level. Rather than rely on a single “best” framework, we choose to include multiple forms of selection to allow researchers to explore the effects of different institutional frameworks on stylistic behavior.
CODE | GENRE | INSTRUMENTALITY | PLATFORM | SELECTION CRITERIA | # DOCS |
---|---|---|---|---|---|
BIO | Biography | Non-fiction | Goodreads | “Best memoir/biography/autobiography” list | 193 |
BS | Bestseller | Fiction | New York Times | Fiction published since 2001 with the longest aggregate time on the New York Times bestseller list | 249 |
HIST | History | Non-fiction | Amazon | Books listed under “history” under the “bestsellers” tag | 205 |
MEM | Memoir | Non-fiction | Amazon | Books listed under “memoir” under the “bestsellers” tag | 229 |
MID | Middle school | Fiction | Goodreads | Goodreads Choice awards for “Middle Grade” books | 166 |
MIX | Assorted non-fiction | Non-fiction | Amazon | Books listed under assorted non-fiction tags such as “health”, “politics”, and “business”, under the “bestsellers” tag | 193 |
MY | Mystery | Fiction | Amazon | Books listed under “Mystery, Thriller, Suspense” under the “bestsellers” tag | 234 |
NYT | New York Times reviewed | Fiction | New York Times | Fiction reviewed in the New York Times Book Review | 419 |
PW | Prizelists | Fiction | 5 Prizelists (US, UK, Canada) | Works shortlisted for the National Book Award (US), PEN/Faulkner Award (US), Governor General’s Award (Canada), Giller Prize (Canada), and the Man Booker Prize (UK) | 258 |
ROM | Romance | Fiction | Amazon | Books listed under “Romance” under the “bestsellers” tag | 208 |
SF | Science-Fiction | Fiction | Amazon | Books listed under “Science Fiction & Fantasy” under the “bestsellers” tag | 223 |
YA | Young Adult | Fiction | Goodreads | Goodreads Choice Awards for Young Adult Fiction | 177 |
In addition to our manually curated selection of books, we also provide researchers with a set of derived features that can be used for further research on the style and content of books (described in Table 2).
FEATURE | DESCRIPTION | ANNOTATION TYPE |
---|---|---|
Category | Fiction or non-fiction | Manual |
Genre | Twelve categories | Manual |
Publication Date | Date of first publication | Manual |
Author Gender | Perceived authorial gender | Manual |
POS | Part-of-speech uni- and bigrams | Computational |
Supersense | Frequency of 41 word supersenses | Computational |
Word Frequencies | Word frequencies for every book/1,000-word passage | Computational |
Token Count | Work length measure | Computational |
Total Characters | Estimated total number of named characters | Computational |
Protagonist Concentration | Percentage of all character mentions by main character | Computational |
Avg. Sentence Length | Average length of all sentences per book | Computational |
Avg. Word Length | Average length of all words per book | Computational |
Tuldava Score | Reading difficulty measure | Computational |
Event Count | Estimated number of diegetic events | Computational |
Goodreads Avg. Rating | Average user rating on Goodreads | Computational |
Goodreads Total Ratings | Total number of ratings on Goodreads as of June 2022 | Computational |
Average Speed | Measure of narrative pace | Computational |
Minimum Speed | Measure of narrative distance | Computational |
Volume | Measure of topical heterogeneity | Computational |
Circuitousness | Measure of narrative non-linearity | Computational |
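To make two of the feature definitions in Table 2 concrete, the following is a minimal sketch rather than the code used to build the dataset. Protagonist concentration follows the table's description directly. The Tuldava formula is not given here, so the sketch assumes its commonly cited form, syllables per word multiplied by the natural log of words per sentence. The function names and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def protagonist_concentration(mentions):
    """Percentage of all character mentions that belong to the
    most frequently mentioned character (taken here as the protagonist)."""
    counts = Counter(mentions)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_count = counts.most_common(1)[0][1]
    return 100.0 * top_count / total


def tuldava_score(n_syllables, n_words, n_sentences):
    """Reading difficulty under an ASSUMED formula:
    (syllables per word) * ln(words per sentence)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    return (n_syllables / n_words) * math.log(n_words / n_sentences)


# Hypothetical inputs: one entry per character mention in a book.
mentions = ["Ana", "Ana", "Ben", "Ana", "Cy"]
print(protagonist_concentration(mentions))   # 60.0

# Hypothetical counts: 150 syllables, 100 words, 5 sentences.
print(round(tuldava_score(150, 100, 5), 3))  # 4.494
```

In practice the mention list would come from a character-tagging pipeline, and the syllable count from a separate estimator; neither is shown here.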
(2) Method
Steps
The dataset was constructed in four steps. Books were (1) manually selected according to the sampling strategies described in Table 1; (2) digitized and manually cleaned; (3) processed using the “large model” of bookNLP; and (4) manually and computationally annotated for the features indicated in Table 2.
Sampling strategy
All books were chosen to represent “popular” writing across 12 different genres of contemporary publishing spanning a 20-year timeframe dating from 2001 through 2021. We define “popular” through multiple criteria that include user-generated awards or lists, elite prize committee lists or book reviews, or bestseller tags on platforms like Amazon or the New York Times. As a further way to validate popularity, we provide two measures drawn from the platform Goodreads.
We define genre through three different kinds of institutional framing: cultural capital (bestsellers, prizewinners, elite book reviews), stylistic affinity (mysteries, science fiction, biography, etc.), and age-level (middle-grade and young adult (YA)). This allows researchers a high degree of flexibility to better understand stylistic behavior of professionally published books targeting different kinds of readerships. We also segment our genres by the “instrumentality” of the information contained (“fiction” or “non-fiction”).
While our genre categories are not mutually exclusive in principle (a mystery may also be a bestseller, and vice versa), no book appears in more than one category in our data. It is important to note that our larger genre groupings (cultural capital, stylistic affinity, age level) are not necessarily commensurate with one another, and researchers should therefore use caution when comparing across them. Experimentation with alternative genre labeling systems is a further affordance of this dataset. We aimed to select ca. 200 works per category, which we have found is sufficient for training robust text classification algorithms. Due to text availability, list sizes, and cleaning, some categories have more or fewer than this number. In the case of books reviewed in the New York Times, we iterated twice on this process. In total, we assemble 2,754 books representing 2,234 unique authors across 12 genres.
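As a quick consistency check, the per-genre counts in Table 1 can be tallied against the stated total. The sketch below hard-codes the counts from Table 1; the variable names are illustrative only.

```python
# Document counts per genre code, copied from Table 1.
doc_counts = {
    "BIO": 193, "BS": 249, "HIST": 205, "MEM": 229,
    "MID": 166, "MIX": 193, "MY": 234, "NYT": 419,
    "PW": 258, "ROM": 208, "SF": 223, "YA": 177,
}

TARGET_PER_GENRE = 200  # the "ca. 200 works per category" goal

total = sum(doc_counts.values())
largest = max(doc_counts, key=doc_counts.get)
below_target = sorted(c for c, n in doc_counts.items() if n < TARGET_PER_GENRE)

print(total)         # 2754
print(largest)       # NYT
print(below_target)  # ['BIO', 'MID', 'MIX', 'YA']
```

The counts sum to the reported 2,754 books, and the NYT category, where the selection process was run twice, is roughly double the target size.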
To further characterize our data, we provide figures of the distribution of publication dates (Figure 1), the average user rating on Goodreads (Figure 2), and the log-transformed number of ratings on Goodreads (Figure 3) to capture book popularity. Finally, although author gender played no role in the selection of books, our gender distribution across all books is 49.76% women and 49.94% men, with only eight books written by self-identified non-binary authors. We note, however, that there are meaningful within-genre differences (Figure 4), as predicted by prior research.
Quality Control
All texts were manually cleaned of front and end matter. Metadata such as publication date, authorial gender, author name and title were all manually entered. The dataset was manually reviewed for the appropriateness of genre labels for every book. Finally, duplicates were removed and any books that were not at least 15,000 tokens in length were also removed. No maximum length was set.
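The two automatic filters described above (duplicate removal and the 15,000-token minimum) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the record fields `author`, `title`, and `token_count` are hypothetical, and duplicates are assumed to be detectable by a case-insensitive author and title match.

```python
MIN_TOKENS = 15_000  # books shorter than this were removed; no maximum was set


def apply_quality_filters(records, min_tokens=MIN_TOKENS):
    """Drop duplicate books and books under the minimum token length.

    Assumes each record is a dict with hypothetical keys
    'author', 'title', and 'token_count'.
    """
    seen = set()
    kept = []
    for rec in records:
        # Assumed duplicate key: normalized (author, title) pair.
        key = (rec["author"].strip().lower(), rec["title"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        if rec["token_count"] < min_tokens:
            continue
        kept.append(rec)
    return kept


# Hypothetical records for illustration only.
books = [
    {"author": "A. Writer", "title": "First Book", "token_count": 82_000},
    {"author": "a. writer", "title": "First Book ", "token_count": 82_000},  # duplicate
    {"author": "B. Author", "title": "Short Work", "token_count": 9_500},    # too short
    {"author": "C. Scribe", "title": "Edge Case", "token_count": 15_000},    # kept
]

for rec in apply_quality_filters(books):
    print(rec["title"].strip(), rec["token_count"])
```

A book of exactly 15,000 tokens is kept, matching the "at least 15,000 tokens" criterion.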
Limitations
Our data is limited by intellectual property restrictions that do not allow access to full-text data. To overcome this limitation, we provide a robust set of derived data that has served in prior research as a reliable foundation for the stylistic understanding of creative writing. Our data is also limited by its focus on a single language. Future work will want to emphasize multilingual data construction to facilitate our understanding of cross-cultural stylistic behavior. Finally, for both manually and computationally derived features, we expect there to be some level of error. For the manual features, we have undertaken two levels of review. For the computational features, the bookNLP documentation provides estimates of the expected error rates of the different predictive models. Nevertheless, it is important for researchers to be aware that our derived features are always estimates. We would flag “Total Characters” and “Event Count” as two features that are worth further research due to the challenging nature of their prediction.
(3) Dataset description
Object name
CONLIT
Format names and versions
.CSV
Creation dates
Start date: 2015-03-10; End date: 2022-06-22.
Dataset Creators
Andrew Piper (McGill University) was responsible for the overall design of the dataset. Eve Kraicer (McGill University) and Joey Love (McGill University) assisted with cleaning and processing the data.
Language
English
License
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).
Repository name
Figshare
Publication date
2022-09-22
(4) Reuse potential
Prior work on the computationally driven study of genre has focused on using different selection mechanisms to better understand the role that genre plays in organizing literary communities and reader responses, ranging from studies of historical text data to contemporary reader response data. Summarizing this work, one could say that research on the content or stylistic aspects of genre has largely focused on historical data while research into contemporary genre formations has largely focused on metadata or non-professionally published writing.
Our dataset is thus designed to give researchers access to stylistic data of contemporary, professionally published writing that spans a range of genre definitions and institutional frameworks. Doing so can help further research into the role genre plays in constraining authorial behavior. It can also further our understanding of the role that differentiation plays in genre classification (Sharma et al., 2022). As genre theorist Ralph Cohen argued some time ago, “A genre, therefore, is to be understood in relation to other genres, so that its aims and purposes at a particular time are defined by its interrelation with and differentiation from others.” Our data will facilitate the empirical exploration of such theories.
By providing Goodreads user response data, our dataset also allows further research into the relationship between style and success. The links provided to the Goodreads versions of our books also allow our data to be combined with reader-based response data. An exciting new avenue of literary study aims to better understand the causes and conditions of readers’ responses to texts, and our data provides the infrastructure to undertake such a research program across a large, diverse set of professionally published contemporary writing.