Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications

YANN RYAN

This paper describes the production of a title-level list of British, Irish, British Overseas Territories and Crown Dependencies newspapers (1621–2019) held by the British Library, and its potential for reuse and research. The data was extracted from the British Library’s catalogue of over 24,000 British and Irish newspaper titles, cleaned, and published on the British Library Research Repository, an open access repository for the research produced by staff and research associates of the British Library. Bespoke versions of the data have been made available to specialist users, notably the British Library/Alan Turing Institute’s ‘Living with Machines’ project, enabling greater historical analysis of nineteenth-century British news and selective digitisation.

The newspaper catalogue is at title level: a change in a newspaper's title, or a regional variant, results in a new catalogue record, and a change in format (i.e., microfilm or digital copies) often does so too. Over the long period of collecting, inconsistencies and gaps in the metadata have inevitably built up. The titles in the dataset are exactly as recorded in the catalogue, following the British Library cataloguing practice in force when each title was acquired.
The data was extracted from the British Library catalogue (through the Aleph Integrated Library System) by the Collection Metadata team. Aleph stores metadata relating to the newspaper collection, in the form of MARC records, in a number of fields, often with complicated holdings information. The team extracted years of publication from the free-text information about date holdings; these years are published alongside the original holdings field. The extracted data was then aligned with data from a separate Master Negative Database of microfilm copies of newspaper print originals (newspaper titles linked by system ID numbers), as well as with an up-to-date list of digital holdings on the British Newspaper Archive, which hosts digitised newspapers from the British Library collection (https://www.britishnewspaperarchive.co.uk).
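The year-extraction step can be illustrated with a minimal sketch. It is not the script the team used (that work was done on Aleph exports and in R); the pattern, the function name `extract_year_range` and the sample holdings string are all assumptions made for illustration.

```python
import re

# Assumed pattern: any standalone four-digit year from 1600 to 2099.
YEAR = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")


def extract_year_range(holdings):
    """Return (first_year, last_year) found in a free-text holdings
    string, or None when the string contains no recognisable year."""
    years = [int(y) for y in YEAR.findall(holdings)]
    if not years:
        return None
    return min(years), max(years)


# Hypothetical holdings statement, not taken from the catalogue.
print(extract_year_range("1 Jan. 1855-31 Dec. 1901; 3 Mar. 1910"))
```

A rule this simple will miss abbreviated or implied years (for example "1855-60"), which is one reason a manual pass over the records would still be needed.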
The initial data extraction was followed by a process of cleaning by the News Collections team. We manually adjusted some 2,000 holdings records, mostly where years had not been extracted or where there were inconsistencies in place of publication, either because alternative spellings or punctuation had been used, or because the record did not follow the single set of UK county boundaries applied across the dataset. For this, and for joining the initial dataset to the microfilm and digital holdings, custom scripts for cleaning and extracting the data were developed in R, before the data was exported to the final .csv format.
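The two scripted operations described above, harmonising place names and joining on system ID, might look roughly like the following. This is a sketch in Python rather than the original R, and the field names (`system_id`, `on_microfilm`, `digitised`), the alias table and both function names are hypothetical.

```python
def normalise_place(place, aliases):
    """Strip stray punctuation and whitespace, then map known variant
    spellings onto one preferred form."""
    key = " ".join(place.replace(".", " ").replace(",", " ").split()).lower()
    return aliases.get(key, key.title())


def join_holdings(titles, microfilm_ids, digitised_ids):
    """Flag each title record according to whether its system ID also
    appears in the microfilm list or the digital holdings list."""
    joined = []
    for row in titles:
        merged = dict(row)  # copy, so the input records are left untouched
        merged["on_microfilm"] = row["system_id"] in microfilm_ids
        merged["digitised"] = row["system_id"] in digitised_ids
        joined.append(merged)
    return joined


# Illustrative values only.
aliases = {"newcastle-upon-tyne": "Newcastle upon Tyne"}
titles = [{"system_id": "A1", "place": normalise_place("Newcastle-upon-Tyne.", aliases)}]
print(join_holdings(titles, microfilm_ids={"A1"}, digitised_ids=set()))
```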

SAMPLING STRATEGY
To produce a reusable dataset, we limited the list to British and Irish newspapers, where the data presented fewer complications. Including other titles would have meant dealing with languages other than English, researching the history of some titles, and consulting other British Library curators in the relevant area studies. A complete listing of all titles in the newspaper collection will be a follow-up project, scheduled to take place in 2021.

REUSE POTENTIAL
The past six years or so have seen the rise of 'Collections as Data': the idea that metadata from holdings of cultural heritage collections can function as data to be analysed in its own right (see Collections as Data National Forum, 2018, for a definition and discussion of the term). Tim Sherratt, for example, has used the metadata from the National Library of Australia's Trove digitised newspaper collection to undertake historical analyses (Sherratt, 2019).
As newspapers are digitised, detailed metadata are produced in tandem, providing issue-level details on the place and date of publication, which can then be exploited by researchers (Fyfe, 2016). However, this only relates to the portion of the collection which has been digitised, currently just over 8% of the entire British Library newspaper collection of 450 million pages. Until now, no survey of the Library's print holdings has been easily available to researchers.
We see the main reuse potential of this dataset as four-fold. Firstly, the list can be used in conjunction with the physical holdings and the Library's Explore catalogue (https://explore.bl.uk) as a general finding aid, or to narrow down one's search to a specific corpus of newspapers. While Explore is already an excellent search tool, this list aids discovery by enabling easy browsing by date and location.
Secondly, it opens up newspaper data to the non-specialist. We purposely standardised and simplified the data fields so that users could take advantage of the filtering, sorting and graphing functions in software such as Excel or Google Sheets.
Thirdly, it allows for geographical and diachronic analyses of the British and Irish newspaper industries. It makes it easy to produce, for example, time-series statistics on the establishment of new titles, or maps of individual 'hotspots' of newspaper growth at county or city level. While geographic coordinates were beyond the scope of the dataset, code to georeference the structured data accurately is being developed specifically for use with the title list (Ryan et al., 2020).
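As an illustration of the time-series use, the sketch below counts newly established titles per decade. The record layout and the `first_year` field name are assumptions and do not necessarily match the column names in the published .csv file.

```python
from collections import Counter


def new_titles_per_decade(records):
    """Count titles by the decade of their first recorded year,
    skipping records where no year is available."""
    counts = Counter()
    for rec in records:
        year = rec.get("first_year")
        if year is None:
            continue
        counts[(year // 10) * 10] += 1
    return dict(sorted(counts.items()))


# Hypothetical records for illustration.
sample = [
    {"title": "Example Mercury", "first_year": 1855},
    {"title": "Example Gazette", "first_year": 1859},
    {"title": "Example Herald", "first_year": 1861},
    {"title": "Undated Title", "first_year": None},
]
print(new_titles_per_decade(sample))
```

Grouping on a (county, decade) pair instead of the decade alone would give the input for a county-level 'hotspot' map.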
Finally, understanding the print collection helps us to understand the digitised portion in context. Researchers across the world now use the data from the British Library's digitised newspaper collection for historical research. Many of these projects employ large-scale text mining or image analytics over the entire collection to make broad historical claims: previous projects, for instance, have used the corpus to estimate when electricity took over from horses, or to analyse 'subjective well-being' (Lansdall-Welfare et al., 2017; Hills, Proto, Sgroi, & Seresinhe, 2019). However, it is also recognised that claims of this kind must be understood in light of the idiosyncrasies of the digitised collection. The corpus ultimately represents only a fraction of the Library's newspaper collection and was not produced to be particularly systematic or representative (Shaw, 2005). This list helps to contextualise the data in the digitised collection. A text-mining project, for example, could adjust its weighting methods once it knows what proportion of a particular place's print holdings each digitised newspaper represents. The 'Living with Machines' project, a British Library and Alan Turing Institute initiative using data analysis to understand the lived experience of the nineteenth century, is already using a version of this list to carry out a 'topographical survey' of the digitised newspaper collection (Vane, 2020).
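One way such a weighting adjustment could work is sketched below. It uses title counts as a rough proxy for the size of holdings and applies simple inverse-coverage weights; both choices, along with the field and function names, are assumptions for illustration and do not describe the method of any project cited here.

```python
from collections import Counter


def coverage_by_place(records):
    """Share of catalogued titles in each place that have been digitised."""
    total = Counter()
    digitised = Counter()
    for rec in records:
        total[rec["place"]] += 1
        if rec["digitised"]:
            digitised[rec["place"]] += 1
    return {place: digitised[place] / total[place] for place in total}


def inverse_coverage_weights(coverage):
    """Weight each place by 1 / coverage, so that thinly digitised places
    count for more. Places with no digitised titles get no weight."""
    return {place: 1.0 / share for place, share in coverage.items() if share > 0}


# Hypothetical records: 1 of 4 Leeds titles and 2 of 2 York titles digitised.
records = (
    [{"place": "Leeds", "digitised": i == 0} for i in range(4)]
    + [{"place": "York", "digitised": True} for _ in range(2)]
)
print(inverse_coverage_weights(coverage_by_place(records)))
```

A count of titles ignores how long each title ran and how many pages survive, so a real analysis would probably weight by years or pages held instead.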