In this paper, we present a new dataset for the task of toponym resolution in digitized historical newspapers in English. Toponym resolution is a subtask of entity linking, focused on detecting and resolving mentions of places (i.e., toponyms) to their corresponding referent in a gazetteer or other knowledge base. Resolving toponyms in texts enables new forms of large-scale semantic and geographic analyses. However, most approaches to entity linking and toponym resolution are optimized to perform well with clean texts originally intended for a global audience and they do not generalize well to noisy, historical, or regional texts (Ehrmann, Romanello, Flückiger, & Clematide, 2020; Gritta, Pilehvar, Limsopatham, & Collier, 2018; Wang & Hu, 2019). Some entity linking datasets have been created to address this issue, such as Ehrmann et al. (2020) and Hamdi et al. (2021), both built from digitized historical newspaper collections.
Our dataset differs from others in its emphasis on the geographical aspect of newspaper data. The British provincial press, from which we sampled our articles, was strongly anchored in place: articles and advertisements were selected and edited with a local audience in mind. After the repeal of the ‘taxes on knowledge’ in the 1850s and 1860s, the provincial press proliferated; its readership expanded, as did the number of titles, and it overtook the London-based press in size. Despite this plethora of available materials, historians have to date mostly favored the metropolitan papers at the expense of the local press, which remains largely understudied (Beelen, Lawrence, Wilson, & Beavan, under submission; Hobbs, 2018). As shown in Lieberman, Samet, and Sankaranarayanan (2010) and Coll Ardanuy et al. (2019), the distribution of places mentioned in newspapers varies considerably depending on their intended audience (grounded in a certain place and time), which hinders the resolution of ambiguous place names. Our dataset has been created to assess the robustness of entity linking and toponym resolution methods in this particularly challenging but common scenario. We hope that improved toponym resolution for these newspapers will translate into greater interest in them as research materials.
This dataset comprises 343 articles carefully sampled from a variety of provincial nineteenth-century newspapers based in four different locations in England. The articles have been manually annotated with mentions of places, which are linked, whenever possible, to their corresponding entry on Wikipedia. A total of 3,364 toponyms have been annotated, of which 2,784 have been linked to Wikipedia. The text of the articles is OCR-generated and has not been manually corrected. The dataset has been created with the aim of becoming a benchmark for several tasks, including fuzzy string matching and toponym recognition and resolution, all of which contribute to the challenging pursuit of improving semantic access to OCRed historical texts in English.
This dataset has been produced as part of Living with Machines,1 a multidisciplinary research project focused on the lived experience of industrialization in Britain during the long nineteenth century and, in particular, on the social and cultural impact of mechanization as reported in newspapers and other sources. Living with Machines is one of many projects that harness the growing volume of digitized newspaper collections for humanities research.2 A fraction of the annotated data has been used in previous studies from Living with Machines, in particular Coll Ardanuy et al. (2019), and for fuzzy string matching in Hosseini, Nanni, and Coll Ardanuy (2020) and Coll Ardanuy et al. (2020).
The initial source of the data was formatted as Metadata Encoding and Transmission Standard/Analyzed Layout and Text Object (METS/ALTO) files3 and consisted of 72 newspaper titles (including subsequent variant titles) from the English counties of Lancashire and Dorset. These were obtained from the genealogy company Find My Past, custodians of the British Newspaper Archive, the most extensive corpus of digitized British newspapers.4 The METS/ALTO file format contains both logical and physical layout information, along with the textual contents of the document, expressed as Extensible Markup Language (XML).5 It is verbose and does not lend itself directly to manipulation in natural language processing pipelines and tools. We therefore used Extensible Stylesheet Language Transformations (XSLT)6 to extract the plain text of each article. Each article is explicitly segmented and identified in the METS logical structure map, and its plain text consists of all the physical ALTO text blocks attributed to that article. This plain text is supplemented by minimal metadata extracted into a companion file. This step is performed by alto2txt, a Python wrapper for those XSLT transformations, which is being prepared for public release via GitHub. The resulting corpus consisted of 11,761,898 articles (as defined above). The metadata was ingested into a PostgreSQL7 relational database for ease of querying and filtering; its relational schema directly mirrors the hierarchy of the metadata XML files.
We created a subsample that consists of 343 articles published between 1780 and 1870 in local newspapers based in four different locations: Manchester and Ashton-under-Lyne (a large town and a medium-sized market town, broadly representing the industrial north of England), and Poole and Dorchester (respectively medium-sized port and market towns, representing the rural south).8 Figure 1 gives an overview of the number of annotated articles per decade and place of publication. We biased our sample toward articles that have a length between 150 and 550 words and an OCR quality confidence score greater than 0.7 (calculated as the mean of the per-word OCR confidence scores as reported in the source metadata). Most of the text is legible, even though it contains many OCR errors. See Table 1 for a more detailed overview of the sample.
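The extraction step can be illustrated with a minimal Python sketch. It is not the alto2txt implementation, which relies on XSLT; here Python's standard `xml.etree.ElementTree` is swapped in for readability. The sample fragment, the block identifiers, and the namespace version are assumptions made for illustration, and the mapping from an article to its text blocks (which in practice comes from the METS logical structure map) is simply passed in by hand.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Hand-written ALTO-like fragment for illustration only (not taken from the dataset).
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="pa0001001">
      <TextLine>
        <String CONTENT="Market" WC="0.95"/><SP/>
        <String CONTENT="at" WC="0.90"/>
      </TextLine>
      <TextLine>
        <String CONTENT="Dorchester" WC="0.70"/>
      </TextLine>
    </TextBlock>
    <TextBlock ID="pa0001002">
      <TextLine><String CONTENT="Unrelated" WC="0.80"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

# Assumed namespace; real files may declare a different ALTO version.
NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}


def article_text(alto_xml: str, block_ids: Iterable[str]) -> str:
    """Join the words of the text blocks attributed to one article."""
    wanted = set(block_ids)
    root = ET.fromstring(alto_xml)
    lines = []
    for block in root.iterfind(".//alto:TextBlock", NS):
        if block.get("ID") not in wanted:
            continue  # block belongs to another article
        for line in block.iterfind("alto:TextLine", NS):
            words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", NS)]
            lines.append(" ".join(words))
    return "\n".join(lines)


# Hypothetical article made up of a single text block.
print(article_text(ALTO_SAMPLE, ["pa0001001"]))
```

Only the first block is returned, because the second block is not listed as part of the hypothetical article.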
| Number of articles | 36 | 36 | 36 | 36 | 21 | 34 | 36 | 36 | 36 | 36 |
| Avg word count | 300 | 323 | 313 | 325 | 311 | 368 | 378 | 354 | 312 | 288 |
| Avg OCR quality: mean | 0.89 | 0.86 | 0.88 | 0.89 | 0.75 | 0.77 | 0.87 | 0.88 | 0.84 | 0.9 |
| Avg OCR quality: sd | 0.18 | 0.21 | 0.19 | 0.18 | 0.27 | 0.27 | 0.21 | 0.19 | 0.23 | 0.14 |
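The sampling criteria described above can be sketched as follows. This is a simplified illustration rather than the sampling procedure actually used: the text describes a bias toward such articles, whereas the sketch applies a hard filter, and the use of the population standard deviation is an assumption. The function names and default thresholds are illustrative.

```python
from statistics import mean, pstdev


def ocr_quality(word_confidences):
    """Return (mean, standard deviation) of per-word OCR confidence scores."""
    # Assumption: population standard deviation; the source does not say which is used.
    return mean(word_confidences), pstdev(word_confidences)


def is_candidate(word_confidences, min_words=150, max_words=550, min_quality=0.7):
    """Check the length and OCR quality criteria for one article."""
    n_words = len(word_confidences)
    if n_words == 0:
        return False  # an empty article has no defined quality score
    quality, _ = ocr_quality(word_confidences)
    return min_words <= n_words <= max_words and quality > min_quality


# Hypothetical article: 200 words, each recognized with confidence 0.9.
print(is_candidate([0.9] * 200))
```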
We did not perform any manual post-processing to correct the errors produced in the OCR or layout recognition steps. Therefore, the toponyms in this dataset often contain OCR errors (e.g., ‘iHancfjrcter’ for ‘Manchester’). Additionally, our dataset is rich with name variations that are characteristic of historical data, such as spelling variations (e.g., ‘Leipsic’ for ‘Leipzig’) and other forms of name change (e.g., ‘Kingstown’ for ‘Dún Laoghaire’).
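To give a sense of how fuzzy string matching can cope with such variation, the following sketch ranks candidate place names by normalized Levenshtein similarity. It is a deliberately simple baseline and not the method of the studies cited in this paper; the candidate list and function names are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def best_match(query: str, candidates):
    """Return the candidate most similar to the (possibly noisy) query."""
    return max(candidates, key=lambda c: similarity(query, c))


# Hypothetical gazetteer restricted to the four places of publication.
gazetteer = ["Manchester", "Dorchester", "Poole", "Ashton-under-Lyne"]
print(best_match("iHancfjrcter", gazetteer))
print(levenshtein("Leipsic", "Leipzig"))
```

Pure edit distance handles OCR noise and spelling variants to some extent, but it cannot relate name changes such as ‘Kingstown’ and ‘Dún Laoghaire’, which need alternate-name information from a knowledge base.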
Six annotators from different disciplinary backgrounds (history, literature, data science, and linguistics) manually annotated the toponyms present in the subsample. We used the INCEpTION annotation platform9 (Klie, Bugert, Boullosa, de Castilho, & Gurevych, 2018). A toponym is a mention of a location in a text. We defined a location as any entity that is static and can be represented by its geographical coordinates. Toponyms were classified into the following categories: BUILDING (names of buildings, such as the ‘British Museum’), STREET (streets, roads, and other odonyms, such as ‘Great Russell St’), LOC (any other real world places regardless of type or scale, such as ‘Bloomsbury’, ‘London’, or ‘Great Britain’), ALIEN (extraterrestrial locations, such as ‘Venus’), FICTION (fictional or mythical places, such as ‘Hell’), and OTHER (other types of entities with coordinates, such as events, like the ‘Battle of Waterloo’). Where possible, toponyms were linked by URL to the corresponding Wikipedia entries (from which geographic coordinates can be derived). The link was left empty if the location had no Wikipedia entry or if the annotators were uncertain of the correct disambiguation, either because OCR errors made it impossible to determine the referent or because the context was insufficient.10 While the annotations were made on the OCRed text, annotators could consult the original page image online on the British Newspaper Archive. Annotators were encouraged to discuss difficult choices with each other, and to document their decisions in a shared document. Table 2 gives an overview of the annotations for each class.
| CLASS | ANNOTATIONS | UNIQUE TOPONYMS | UNIQUE WIKIPEDIA LINKS | UNLINKED TOPONYMS |
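For readers who want to load the annotations into their own code, the annotation scheme can be modeled with a small data structure. The following is one possible sketch: the six class names come from the scheme described above, whereas the field names and the character-offset representation are assumptions, not the layout of the released files.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ToponymClass(Enum):
    """The six annotation categories described in the text."""
    BUILDING = "BUILDING"
    STREET = "STREET"
    LOC = "LOC"
    ALIEN = "ALIEN"
    FICTION = "FICTION"
    OTHER = "OTHER"


@dataclass(frozen=True)
class ToponymAnnotation:
    """One annotated mention; field names are hypothetical."""
    mention: str                         # surface form as it appears in the OCRed text
    start: int                           # character offset where the mention begins
    end: int                             # character offset just past the mention
    label: ToponymClass
    wikipedia_url: Optional[str] = None  # None when the toponym was left unlinked

    @property
    def is_linked(self) -> bool:
        return self.wikipedia_url is not None


# Illustrative annotations, not copied from the dataset.
linked = ToponymAnnotation("British Museum", 10, 24, ToponymClass.BUILDING,
                           "https://en.wikipedia.org/wiki/British_Museum")
unlinked = ToponymAnnotation("Great Russell St", 30, 46, ToponymClass.STREET)
print(linked.is_linked, unlinked.is_linked)
```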
To assess the quality of the annotations, we had 77 newspaper articles annotated by two people, for a total of 740 annotation pairs. We used the INCEpTION agreement functionality to assess the inter-annotator agreement between the two sets of annotations. Using Krippendorff’s alpha (nominal), we obtained an agreement of 0.87 for place name detection and classification and 0.89 for linking to Wikipedia. To further ensure the quality of our resource, a curator reviewed all the annotations once the annotation process was complete and made final decisions on which annotations to keep and which to discard, making sure the annotations were consistent throughout the dataset.
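For reference, Krippendorff’s alpha for nominal labels can be computed from annotation pairs as in the sketch below. It covers only the simplest case (two annotators, no missing values) and is not the annotation platform’s implementation; how spans marked by only one annotator are paired is not specified here and is outside the scope of the sketch. The example pairs are invented.

```python
from collections import Counter


def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for two annotators, nominal labels, no missing data.

    `pairs` is a sequence of (label_a, label_b) tuples, one per annotated unit.
    """
    pairs = list(pairs)
    n = 2 * len(pairs)  # total number of labels across both annotators
    if n < 2:
        raise ValueError("at least one annotation pair is required")
    # Each disagreeing unit adds two off-diagonal entries to the coincidence matrix.
    observed = 2 * sum(1 for a, b in pairs if a != b)
    totals = Counter(label for pair in pairs for label in pair)
    # Sum of n_c * n_k over ordered pairs of distinct categories.
    expected = n * n - sum(count * count for count in totals.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected


# Invented example: four units, one disagreement.
example = [("LOC", "LOC"), ("LOC", "LOC"), ("STREET", "STREET"), ("BUILDING", "LOC")]
print(round(krippendorff_alpha_nominal(example), 3))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance.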
3 Dataset Description
Format names and versions
We are sharing the annotated files in the WebAnno TSV (tab-separated values) file format, version 3.2 (see footnote 11). There are 343 files, one for each newspaper article. Accompanying the dataset is an additional TSV file that contains the metadata associated with each article: word count, OCR quality mean and standard deviation, date (and decade) of publication, place of publication, newspaper publication code and publication title, and an additional field (annotation_batch) in which the article is assigned to one of three batches that are similarly distributed in terms of place and decade of publication (this field was used during the sampling process, and may be useful for researchers wishing to split the dataset for experimental purposes). We have also prepared a README file and the original annotation guidelines in Markdown. The present paper describes version 2 of the dataset.
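As an illustration of how the annotation_batch field might be used to split the data, the sketch below groups articles by batch using Python’s standard `csv` module. Only the annotation_batch field name is taken from the description above; the other column names, the batch values, and the sample rows are hypothetical and should be checked against the released metadata file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical metadata rows; only `annotation_batch` is a documented field name.
SAMPLE_METADATA = (
    "file_id\tplace\tdecade\tannotation_batch\n"
    "art_001\tManchester\t1860\t1\n"
    "art_002\tPoole\t1830\t2\n"
    "art_003\tDorchester\t1850\t1\n"
    "art_004\tAshton-under-Lyne\t1860\t3\n"
)


def split_by_batch(tsv_text: str, id_column: str = "file_id"):
    """Group article identifiers by the value of their annotation_batch field."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    batches = defaultdict(list)
    for row in reader:
        batches[row["annotation_batch"]].append(row[id_column])
    return dict(batches)


batches = split_by_batch(SAMPLE_METADATA)
# One possible design: hold out one batch for testing and train on the rest.
test_ids = batches["3"]
train_ids = [i for batch, ids in batches.items() if batch != "3" for i in ids]
print(train_ids, test_ids)
```

Because the batches are similarly distributed in terms of place and decade of publication, holding out a whole batch should keep the test set comparable to the training set along those two dimensions.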
Creation dates: 2019-01-01 to 2021-07-27.
The dataset is released under the open license CC-BY-NC-SA 4.0, available at https://creativecommons.org/licenses/by-nc-sa/4.0/.
The dataset is stored in the British Library shared research repository at https://doi.org/10.23636/r7d4-kw08.
4 Reuse Potential
The vast holdings of the British Newspaper Archive and other British historical newspaper corpora will be reused by hundreds of scholars in the coming years. Establishing benchmark datasets like this one provides a foundation for others to assess the performance of methods for identifying and locating places in historical newspapers. Although toponym density was always greatest for a newspaper’s immediate locality, all newspapers included a rich diversity of national and international place names linked to reports of trade, war, conquest, and state politics. Our annotations cover the different scales of places that make up the locations of the political, economic, and everyday life reported in nineteenth-century provincial newspapers. We hope that this dataset contributes to improving methods for finding difficult-to-recognize toponyms in digitized texts and linking them to context-appropriate knowledge base records.