1 Overview

As language is subject to continuous change, the computational analysis of digital heritage should attune models and methods to the specific historical contexts in which the texts emerged. This paper aims to facilitate the “historicization” of Natural Language Processing (NLP) methods by releasing various language models trained on a 19th-century book collection. These models can support research in digital and computational humanities, history, computational linguistics and the cultural heritage or GLAM sector (galleries, libraries, archives, and museums). To accommodate different research needs, we release a wide variety of models, from word type embeddings (word2vec and fastText) to more recent language models that produce context-dependent word or string embeddings (BERT and Flair, respectively). Word type embeddings assign a single vector to a token, regardless of the textual context in which the token appears. By contrast, “contextual” models generate a distinct token embedding at inference time, depending on the surrounding text.

Repository location The dataset is available on Zenodo at http://doi.org/10.5281/zenodo.4782245.

Context This work was produced as part of Living with Machines (LwM),1 an interdisciplinary project focused on the lived experience of Britain’s industrialization during the long 19th century. The language models presented here have been used in several research projects: to assess the impact of optical character recognition (OCR) on NLP tasks (van Strien et al., 2020), to detect atypical animacy (Coll Ardanuy et al., 2020), and for targeted sense disambiguation (Beelen et al., 2021).

2 Method

2.1 Original corpus

The original collection consists of ≈48K digitized books in English, made openly available by the British Library in partnership with Microsoft, henceforth Microsoft British Library Corpus (MBL). The digitized books are available as JSON files from the British Library web page.2 Figure 1 gives an overview of the number of books by publication date. The bulk of the material is dated between 1800 and 1900, with the number of documents rising steeply at the end of the 19th century. Since all copyrights are cleared and the data are in the public domain, they have already become a popular resource for (digital) historians and literary scholars.3 However, one notable issue with this collection (when used for historical research) is the somewhat opaque selection process of books: while the data provide decent coverage of the 19th century, the exact criteria for inclusion remain unclear, and future work might profitably assess the characteristics of this collection in more detail (see e.g. Pechenick, Danforth, and Dodds (2015)).

Figure 1 Number of books by publication date. The preprocessed dataset has 47,685 books in English consisting of 5.1 billion tokens. The red vertical dashed lines mark the boundaries between the time periods we used to slice the dataset. See Section 2.2 for details.

2.2 Steps

Preprocessing Each book was minimally normalized: we converted the text to ASCII, fixed common punctuation errors, dehyphenated broken tokens, removed most punctuation and separated the remaining punctuation marks from tokens. While the large majority of books in the MBL corpus are written in English, the collection still contains a substantial number of documents in other languages. We therefore kept only the books identified as English by spaCy’s language detector (Honnibal, Montani, Van Landeghem, & Boyd, 2020). Finally, we used syntok4 to split each book into sentences and tokenize the text. This process resulted in one file per book, in which each line corresponds to a sentence with space-separated tokens.5
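To make the normalization steps more concrete, the following is a minimal sketch using only the Python standard library. It is not the project's pipeline: the function names, the regular expressions and the set of punctuation marks that are kept are all assumptions chosen for illustration, and sentence splitting and language filtering are left out.

```python
import re
import unicodedata


def to_ascii(text: str) -> str:
    # Decompose accented characters, then drop whatever is still non-ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def dehyphenate(text: str) -> str:
    # Rejoin a word that was split by a hyphen at a line break.
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def separate_punctuation(text: str) -> str:
    # Assumption: keep only . , ; : ! ? ' and - ; remove every other mark.
    text = re.sub(r"[^\w\s.,;:!?'-]", " ", text)
    # Put spaces around sentence-level marks so they become separate tokens.
    text = re.sub(r"([.,;:!?])", r" \1 ", text)
    # Collapse runs of whitespace (including line breaks) into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def normalize(text: str) -> str:
    return separate_punctuation(dehyphenate(to_ascii(text)))


print(normalize("The indus-\ntrial café, 1851."))
```

In this sketch, dehyphenation runs before whitespace is collapsed, because the line break is the only cue that a hyphen marks a broken token rather than a genuine compound.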

Data selection For each model architecture, we trained an instance using the whole dataset (i.e., books from all over the 19th century; see Figure 1). For the word2vec and fastText models, we have also trained instances on text published before 1850. Moreover, for BERT, we have fine-tuned four model instances on different time slices, with data from before 1850, between 1850 and 1875, between 1875 and 1890, and between 1890 and 1900. Each slice contains ≈1.3B tokens, except for 1890–1900, which contains ≈1.1B tokens. While this periodization was largely motivated by the number of tokens, the different models (that resulted from the data partitioning) may enable historians to track cultural changes over the long 19th century.6
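The slicing by publication date can be sketched as a simple lookup. The text does not say to which slice a book dated exactly 1850, 1875, 1890 or 1900 belongs, so the half-open intervals below (start year included, end year excluded) are an assumption, as are the labels and the function name.

```python
from typing import Optional

# (label, first year included, first year excluded).
# Boundary handling is an assumption, not taken from the paper.
PERIODS = [
    ("pre-1850", None, 1850),
    ("1850-1875", 1850, 1875),
    ("1875-1890", 1875, 1890),
    ("1890-1900", 1890, 1900),
]


def period_for(year: int) -> Optional[str]:
    """Return the label of the time slice containing `year`, or None."""
    for label, start, end in PERIODS:
        if (start is None or year >= start) and year < end:
            return label
    return None


for year in (1820, 1850, 1889, 1899, 1900):
    print(year, period_for(year))
```

Under this assumption a book dated 1900 falls outside all slices; an inclusive upper bound for the last slice would be an equally plausible reading.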

Word2vec and fastText We trained the word2vec (Mikolov, Chen, Corrado, & Dean, 2013) and fastText (Bojanowski, Grave, Joulin, & Mikolov, 2016) models as implemented in the Gensim library (Rehurek & Sojka, 2011). In addition to the preprocessing steps described above, we lowercased all tokens before training. For word2vec, we used the skip-gram architecture, which we trained for one epoch.7 We set the dimension of the word embedding vectors to 300 and removed tokens appearing fewer than 20 times. The same hyperparameters were used for training fastText models.8
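As a hedged illustration, the hyperparameters named above map onto Gensim keyword arguments roughly as follows. The parameter names are those of Gensim 4.x (earlier releases call them `size` and `iter`). Every setting not mentioned in the text, such as the context window, is left at the library default, which is an assumption. The helper names are hypothetical.

```python
# Settings stated in the text, expressed as Gensim 4.x keyword arguments.
W2V_PARAMS = {
    "sg": 1,             # skip-gram architecture
    "vector_size": 300,  # embedding dimension
    "min_count": 20,     # ignore tokens seen fewer than 20 times
    "epochs": 1,         # a single pass over the corpus
}


def lowercase_sentences(lines):
    """Yield one lowercased token list per non-empty line.

    Each line is expected to hold one sentence of space-separated tokens.
    """
    for line in lines:
        tokens = line.lower().split()
        if tokens:
            yield tokens


def train_embeddings(lines, model_cls=None):
    """Train a model on an iterable of sentence lines (requires gensim)."""
    if model_cls is None:
        from gensim.models import Word2Vec as model_cls
    # A list is used because Gensim iterates over the corpus more than once;
    # a corpus of billions of tokens would need a restartable, streaming iterable.
    sentences = list(lowercase_sentences(lines))
    return model_cls(sentences=sentences, **W2V_PARAMS)
```

Passing `gensim.models.FastText` as `model_cls` reuses the same settings, in line with the statement that both model types share their hyperparameters.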

Flair Flair is a character language model based on the Long Short-Term Memory (LSTM) variant of recurrent neural networks (Akbik et al., 2019; Hochreiter & Schmidhuber, 1997). Although less popular than Transformer-based models, it has been shown to obtain state-of-the-art results in Named Entity Recognition (NER). We trained a character-level, forward-pass Flair language model on all the books in the MBL corpus for one epoch, with a sequence length of 250 characters during training. We used the default character dictionary in Flair. The LSTM component had one layer and a hidden dimension of 2048.9
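The configuration described above could be expressed with Flair's language model trainer along the following lines. This sketch follows the pattern of Flair's language model tutorial and is not the authors' training script. Reading "forward-pass" as Flair's forward (left-to-right) language model is an assumption, as are the function name, the paths and the mini-batch size, which the text does not state.

```python
# Settings stated in the text.
FLAIR_LM = {"is_forward_lm": True, "hidden_size": 2048, "nlayers": 1}
FLAIR_TRAIN = {"sequence_length": 250, "max_epochs": 1}


def train_flair_lm(corpus_dir, output_dir, mini_batch_size=100):
    """Train a character-level Flair language model (requires flair).

    `corpus_dir` must follow Flair's corpus layout (train/ folder plus
    valid.txt and test.txt). `mini_batch_size` is a placeholder value.
    """
    from flair.data import Dictionary
    from flair.models import LanguageModel
    from flair.trainers.language_model_trainer import (
        LanguageModelTrainer,
        TextCorpus,
    )

    # Default character dictionary shipped with Flair.
    dictionary = Dictionary.load("chars")
    corpus = TextCorpus(
        corpus_dir,
        dictionary,
        FLAIR_LM["is_forward_lm"],
        character_level=True,
    )
    model = LanguageModel(dictionary, **FLAIR_LM)
    trainer = LanguageModelTrainer(model, corpus)
    trainer.train(output_dir, mini_batch_size=mini_batch_size, **FLAIR_TRAIN)
```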

BERT To fine-tune BERT model instances, we started with a contemporary model: ‘BERT base uncased’,10 hereinafter referred to as BERT-base (Devlin, Chang, Lee, & Toutanova, 2019; Wolf et al., 2019). This instance was then fine-tuned on the earliest time period (i.e., books predating 1850). For the consecutive period (1850–1875), we used the pre-1850 language model instance as a starting point and continued fine-tuning with texts from the following period. This procedure of consecutive incremental fine-tuning was repeated for the other two time periods.
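The consecutive procedure can be summarized as a loop in which each period starts from the checkpoint produced for the previous one. In the sketch below, `fine_tune` is a hypothetical callable standing in for the actual fine-tuning run, and checkpoints are represented as plain strings for illustration only.

```python
from typing import Callable, Dict, List


def chain_fine_tuning(
    base_checkpoint: str,
    periods: List[str],
    fine_tune: Callable[[str, str], str],
) -> Dict[str, str]:
    """Fine-tune period by period, each run starting from the previous result.

    `fine_tune(checkpoint, period)` must return the new checkpoint.
    """
    checkpoints: Dict[str, str] = {}
    current = base_checkpoint
    for period in periods:
        current = fine_tune(current, period)
        checkpoints[period] = current
    return checkpoints


# Stand-in for a real training run: it only records the lineage of a checkpoint.
def fake_fine_tune(checkpoint: str, period: str) -> str:
    return f"{checkpoint} -> {period}"


lineage = chain_fine_tuning(
    "bert-base-uncased",
    ["pre-1850", "1850-1875", "1875-1890", "1890-1900"],
    fake_fine_tune,
)
print(lineage["1850-1875"])
```

One consequence of this design is that each later model instance has also seen the text of all earlier periods during training.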

We used the original BERT-base tokenizer as implemented by Hugging Face11 (Wolf et al., 2019). We did not train new tokenizers for each time period. This way, the resulting language model instances can be compared easily with no further processing or adjustments. The tokenized and lowercased sentences were fed to the language model fine-tuning tool, in which only the masked language model (MLM) objective was optimized. We used a batch size of 5 per GPU and fine-tuned for 1 epoch over the books in each time period. The choice of batch size was dictated by the available GPU memory (we used 4 × NVIDIA Tesla K80 GPUs in parallel). As in the original BERT pre-training procedure, we used the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.0001, β1 = 0.9, β2 = 0.999 and L2 weight decay of 0.01. In our fine-tuning procedure, we used a linear learning-rate warm-up over the first 2,000 steps. A dropout probability of 0.1 was applied in all layers.
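The optimization settings can be collected in a small configuration sketch. Two points are assumptions rather than statements from the text: that the per-GPU batches are combined across the four GPUs in a data-parallel fashion (giving an effective batch of 20), and that the learning rate stays constant once the warm-up ends, since the schedule after the first 2,000 steps is not described.

```python
# Values stated in the text.
ADAM = {"lr": 1e-4, "betas": (0.9, 0.999), "weight_decay": 0.01}
WARMUP_STEPS = 2000
PER_GPU_BATCH = 5
NUM_GPUS = 4
DROPOUT = 0.1

# Assumption: data-parallel training, one batch of 5 per GPU per step.
EFFECTIVE_BATCH = PER_GPU_BATCH * NUM_GPUS


def warmup_lr(step: int) -> float:
    """Learning rate after `step` optimizer updates.

    Rises linearly from 0 to the base rate over the warm-up; holding it
    constant afterwards is an assumption made for this sketch.
    """
    return ADAM["lr"] * min(1.0, step / WARMUP_STEPS)


for step in (0, 500, 1000, 2000, 10000):
    print(step, warmup_lr(step))
```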

Quality control The quality of our language models was evaluated on multiple downstream tasks. In van Strien et al. (2020), we investigated the impact of OCR quality on the 19th-century word2vec model and showed that language models trained on large OCR’d corpora still yield robust word embedding vectors. However, OCR errors can be unevenly distributed over time and potentially distort the comparison of language models. The BERT models have been used in Coll Ardanuy et al. (2020) and Beelen et al. (2021), where they generally improved performance on various downstream tasks when the data of the experiments was contemporaneous with that of the language models, thereby confirming their quality via extrinsic evaluation.

Lastly, we need to stress that our language models can contain historical stereotypes and prejudices (related to race, gender, or sexual orientation, among others). We did not attempt to quantify or remove these biases. Therefore, these models should be used critically and responsibly, to avoid the propagation of historical biases (see also Hengchen and Tahmasebi (2021)).

3 Language model zoo

Object name histLM.

Format names and versions The models are shared as ZIP files (one per model architecture). The directory structure is described in the README.md file.

Creation dates 2020-01-31 to 2020-10-07.

Dataset creators Kasra Hosseini, Kaspar Beelen and Mariona Coll Ardanuy (The Alan Turing Institute) preprocessed the text, created a database, and trained and fine-tuned the language models as described in this paper. Giovanni Colavizza (University of Amsterdam) initiated this work on historical language models. All authors contributed to planning and designing the experiments.

Language The language models have been trained on 19th-century texts in English.

License The models are released under open license CC BY 4.0, available at https://creativecommons.org/licenses/by/4.0/legalcode.

Repository name All the language models are published in Zenodo at http://doi.org/10.5281/zenodo.4782245. We have also provided scripts to work with the language models, available on GitHub at https://github.com/Living-with-machines/histLM.

Publication date 2021-05-23.

4 Reuse Potential

Even though word2vec has been around for almost a decade (an eternity in the fast-moving NLP ecosystem), the word type embeddings it produces persist as popular instruments, especially for interdisciplinary research (Azarbonyad et al. 2017; Hengchen, Ros, & Marjanen, 2019). The more recent fastText model extends word2vec by using subword information. Contextualized language models have meant a breakthrough in NLP research (see Smith (2019) for an overview), as they represent words in the contexts in which they appear instead of conflating all senses, which is one of the main criticisms of word type embeddings. The potential of using such models for historical research is immense, as they allow a more accurate, context-dependent representation of meaning. These embeddings can also be used in existing tools for historical research (e.g. Hosseini, Nanni, and Coll Ardanuy (2020)).

Given that existing libraries, such as Gensim, Flair, or Hugging Face, provide convenient interfaces to work with these embeddings, we are confident that our historical models will serve the needs of a wide variety of scholars, from NLP and data science to the humanities, for different tasks and research purposes, such as measuring how words change meaning over time (Kulkarni, Al-Rfou, Perozzi, & Skiena, 2015; Tahmasebi, Borin, & Jatowt, 2018), automatic OCR correction (Hämäläinen & Hengchen, 2019), interactive query expansion12 or, more generally, any research that involves diachronic language change.