(1) Overview

Context

Digital Humanities researchers interested in studying fiction face hurdles in acquiring sufficient quantities of non-English literature on which to conduct their experiments. While digital heritage collections like HathiTrust have made progress in providing researchers with meaningful access to non-English literature, works written in languages other than English continue to suffer disproportionately from metadata and accessibility issues relative to works written in English (; ; ).

The prevalence of low-quality metadata is apparent when surveying metadata statistics on the HathiTrust Digital Library, where non-English works are more likely to be missing (non-)fiction tags than works written in English (Figure 1). A discrepancy of this magnitude has ramifications for studies seeking to incorporate the most general distribution of written materials possible: if 10% (or more) of available texts in a given language are effectively unavailable because of metadata issues, the potentialities afforded to the humanities by big data are hampered (). This danger is exacerbated given that non-English texts make up 47% of HathiTrust’s catalogue.

Figure 1 

Percentage of books missing a fiction tag for the 20 most frequent languages.

We seek to provide this missing metadata with a bespoke fictionality classifier trained on monolingual data. Fictionality for our purposes is an institutionally-defined classification indicating whether a work is intended to be fictional or not, a classification rendered through how a work is written (). The resulting dataset provides quantitative researchers with a complete list of volumes predicted to be fictional or non-fictional available in the HathiTrust Digital Library. We increase the total number of tagged works in HathiTrust from ca. 9.7 million to ca. 10.2 million, an increase of ca. 400,000 works. Our final dataset captures 95% of all works provided by HathiTrust. To ease researcher access to this collection, we provide additional metadata which allows for easy subsetting of the global list according to individual researcher preferences (Table 1).

Table 1

List of attributes included in our dataset.


HTID: The HathiTrust ID by which the work is accessible.

Access Restrictions: Whether the work is made public by HathiTrust.

HathiTrust Bibliography Key: The respective bibliography key for the work, used for retrieving MARC records.

Title: The title of the volume in question.

Year Published: The year in which the work was published.

Language: The language in which the work was published.

Author: The author of the work in question.

Fictionality: Whether the work is intended to be fictional (1) or not (0).

Length: The length of the work.
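The attributes in Table 1 make it straightforward to subset the released CSV, for instance to pull all fiction in a given language. A minimal sketch follows; the sample rows and the exact column header spellings are assumptions for illustration, not the dataset's actual contents.

```python
import csv
import io

# Hypothetical sample mirroring the released CSV; real rows and the exact
# header names may differ from these assumed values.
SAMPLE = """htid,access,bib_key,title,year,language,author,fictionality,length
mdp.001,allow,000001,Ein Roman,1962,ger,Mueller,1,312
mdp.002,deny,000002,Chemie Lehrbuch,1971,ger,Schmidt,0,540
mdp.003,allow,000003,Un roman,1955,fre,Durand,1,280
"""

def subset(rows, language, fictionality):
    """Return rows matching a language code and a fictionality label."""
    return [r for r in rows
            if r["language"] == language and r["fictionality"] == fictionality]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
german_fiction = subset(rows, "ger", "1")
print([r["htid"] for r in german_fiction])  # ['mdp.001']
```

The same pattern extends to filtering on Access Restrictions or Year Published when only public-domain or period-specific material is wanted.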

(2) Method

Steps

We began constructing our dataset by obtaining an extant list of works currently provided by HathiTrust. This Hathifile lists all volumes made available on the HathiTrust Digital Library platform. We cleaned the dataset and proceeded to download bibliographic data for every entry. We subsetted the dataset for works missing fiction tags and passed these on to our classification process running on the HTRC Data Capsule. We then reintegrated the predicted values. A detailed description of this classification process follows.
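The subsetting step above amounts to selecting every volume whose fiction tag is empty. A sketch of that filter, assuming a toy stand-in for the merged Hathifile and bibliographic data (real Hathifiles are tab-separated, and these column names are assumptions):

```python
import csv
import io

# Toy stand-in for the merged Hathifile + bibliographic data; the column
# names here are illustrative assumptions, not the actual Hathifile schema.
MERGED = "htid\tlanguage\tfiction_tag\nuc1.a\tger\t\nuc1.b\teng\t1\nuc1.c\tjpn\t\n"

rows = list(csv.DictReader(io.StringIO(MERGED), delimiter="\t"))

# Volumes with an empty fiction tag are the ones sent on to the classifier.
untagged = [r["htid"] for r in rows if not r["fiction_tag"]]
print(untagged)  # ['uc1.a', 'uc1.c']
```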

Classification Process

We used a Transformer-based classification process to predict missing fiction tags. Given our goal was to increase the number of classified non-English works in the Hathifile, we required a multilingual bidirectional encoder. We selected XLM-RoBERTa (base) for this task because it was pretrained on substantially more literary data than competing models and remains the state of the art among cross-lingual language models ().

We equipped XLM-RoBERTa with an additional classification layer and trained this layer for five epochs on 144,000 examples of 512-word spans of English fiction and non-fiction drawn from the CONLIT dataset (). We assessed model performance on novels written in ten different languages as contained in the European Literary Text Collection () and on private non-fiction corpora consisting of textbooks and biographies. Further tests were conducted on private non-fiction German, Japanese, and French corpora. We found that our model performs well (minimum 80% F1-score) in all tests despite having been trained only on English samples (see Quality Control for further details). We quantized the model to improve classification speed in the data capsule ().
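The 512-word training spans described above can be produced with a simple word-level chunker. A minimal sketch (note this counts whitespace-delimited words, an assumption; XLM-RoBERTa's own subword tokenization would yield a different token count per span):

```python
def word_spans(text, span_len=512):
    """Split a document into consecutive spans of at most span_len
    whitespace-delimited words, mirroring the 512-word training
    examples described above."""
    words = text.split()
    return [" ".join(words[i:i + span_len])
            for i in range(0, len(words), span_len)]

# A 1,100-word toy document yields two full spans and one 76-word remainder.
doc = ("word " * 1100).strip()
spans = word_spans(doc)
print(len(spans), len(spans[0].split()), len(spans[-1].split()))  # 3 512 76
```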

After moving the model into the HTRC Data Capsule, we downloaded ten random pages from each volume to be classified. Given ten classifications per volume, we used majority vote to assign the volume an overall fictionality tag. We exported the predicted ca. 400,000 tags and re-integrated the data into the overall Hathifile. As can be seen in Figure 2, our process significantly improves the metadata coverage of non-English languages in the Hathi catalogue.
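The per-volume majority vote can be sketched as a small pure function. Note the source does not specify how a 5–5 tie over ten pages is broken; in this sketch a tie falls to non-fiction, which is an assumption.

```python
def majority_vote(page_labels):
    """Assign a volume-level fictionality tag from per-page predictions
    (1 = fiction, 0 = non-fiction) by simple majority.
    Ties fall to non-fiction here -- an assumption, as the tie-breaking
    rule is not specified in the text."""
    return int(sum(page_labels) > len(page_labels) / 2)

print(majority_vote([1, 1, 1, 0, 1, 1, 0, 1, 1, 0]))  # 7 of 10 pages -> 1
```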

Figure 2 

Number of books tagged as fiction for the 18 most frequent languages, before and after classification.

When examining the temporal distribution of non-English texts in the HathiTrust Digital Library (Figure 3), we find our classification process provides the researcher with substantially more texts published during the post-war period. While this data is copyrighted and access is thus limited by HathiTrust, researchers interested in studying non-English contemporary fiction will find HathiTrust has more to offer than incomplete metadata may previously have suggested.

Figure 3 

Relative number of non-English books by decade before and after classification.

Quality Control

We sampled and tested classified works to ensure accuracy. To assess the quality of our classifications, we provided native speakers of ten different languages with titles and pages from twenty random works of fiction and non-fiction written in their respective languages. This process indicates our RoBERTa-based classifier achieves a harmonized F1-score of 88.9%. We provide per-language scores in Table 2. We then cleaned and de-duplicated the dataset to ensure label consistency.
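The F1 scores in Table 2 are the harmonic mean of each language's precision and recall, and can be reproduced directly from those two columns:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (values in percent),
    as reported per language in Table 2."""
    return 2 * precision * recall / (precision + recall)

# Reproduce a few Table 2 entries.
print(round(f1(80, 88)))   # German  -> 84
print(round(f1(100, 90)))  # Italian -> 95
print(round(f1(90, 90)))   # Russian -> 90
```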

Table 2

List of evaluated languages and their respective precision, recall, and F1 scores.


Language    Precision    Recall    F1
German      80%          88%       84%
Italian     100%         90%       95%
Japanese    100%         90%       95%
Russian     90%          90%       90%
Dutch       80%          100%      88%
Hebrew      80%          100%      88%
Danish      100%         76%       87%
Chinese     100%         83%       91%
Arabic      50%          100%      66%
Polish      90%          100%      94%

Limitations

Our dataset is limited by access restrictions. While many of the works are accessible via public APIs provided by HathiTrust, the majority of written fiction published after 1923 is only accessible through the HTRC Data Capsule due to intellectual property restrictions in the United States. We account for this limitation by indicating in the metadata whether a given volume is accessible to the public. A further limitation concerns the imperfect nature of our classification algorithm. While the overall harmonized score for our classifier is high, we note we only tested the ten most frequently used languages in the dataset. Researchers who require fully accurate data can nevertheless use ours as a starting point that reduces the burden of manually curating data sets. At the same time, prior research has shown that imperfectly classified data can be used for large-scale inferences of cultural behavior (; ).

(3) Dataset description

Object name

MultiHATHI

Format names and versions

.CSV

Creation dates

Start date: 2022-05-01; End date: 2022-10-30

Dataset Creators

Sil Hamilton and Andrew Piper

Language

English, German, French, Spanish, Italian, Russian, Japanese, Chinese, and others.

License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)

Repository name

FigShare

Publication date

01/12/2022

(4) Reuse potential

Prior work in the Digital Humanities has highlighted the importance of multilingual corpora for cultural study (; ; ). To address this gap, we present researchers with a dependable list of volumes with predicted fictionality tags representing over 500 languages. In doing so, we significantly increase the number of readily available (non-)fictional texts currently provided by the HathiTrust Digital Library. This dataset can be used to further our understanding of categories like ‘fictionality’ (), ‘narrativity’ (), ‘genre’ (), and ‘place’ () across numerous cultural contexts beyond English. At the same time, it can be useful for the NLP community in search of multilingual data sets for genre-specific or linguistic prediction tasks (Ogueji, 2021).

We also wish to underscore the viability of training multilingual Transformer-based classifiers with monolingual data. While this technique has been previously investigated in natural language processing (; ), we are not aware of any prior work in the Digital Humanities explicitly employing a multilingual classifier trained solely on monolingual data. The consequences of this technique are far-reaching for low-resource languages whose digitized material may not be sufficient in volume to train bespoke models. Future researchers will want to verify whether the same technique can be applied in other classification tasks (e.g. topic classification, sentiment analysis).

Alongside the dataset, we release a collection of Python scripts intended to enable researchers to more easily access and conduct experiments on private works accessible on the HTRC Data Capsule, available at https://git.sr.ht/~srhm/hathi-scripts. While HathiTrust does provide a set of basic utilities for conducting certain NLP experiments, downloading larger volumes of text remains a disproportionately difficult task, especially given the security conditions of the capsule. We hope these scripts will aid future researchers in conducting experiments on the platform.