Converting the British Library’s Catalogue of British and Irish Newspapers into a Public Domain Dataset: Processes and Applications

Yann Ryan; Luke McKernan

OVERVIEW

REPOSITORY LOCATION

CONTEXT

Produced as part of the British Library’s Heritage Made Digital newspaper project, digitising historical newspapers and exploring options for creative re-use of newspaper data (https://blogs.bl.uk/thenewsroom/2019/01/heritage-made-digital-the-newspapers.html).

METHOD

STEPS

The original data comes from the British Library’s catalogue of world newspapers (there is no separate newspaper catalogue, but all newspaper titles are included in the British Library’s Aleph management system and are discoverable via its integrated catalogue at https://explore.bl.uk). The collection has been built up over c. 150 years of newspaper collecting and now comprises some 35,000 titles, or 60 million issues, dating from 1619 to the present day. The collection of British and Irish titles (around two-thirds of the entire newspaper collection) runs from 1621 to the present day. It is not absolutely complete, but most titles published from the 1840s onwards are held, and effectively all titles are held from 1869 onwards, when Legal Deposit was introduced, requiring publishers to send one copy of each newspaper issue to the British Library. There are a few omissions (either entire titles or gaps in the run of a title), and for reasons of space the Library has usually taken only one edition of each issue since 1869.

The newspaper catalogue is at title-level, with changes in a newspaper title and regional variants resulting in a new catalogue record, and often a new catalogue record where there has been a change in format (i.e., microfilm or digital copies). Over the long period of collecting, inevitable inconsistencies and gaps in the metadata have built up. The titles in the dataset are exactly as reflected in the catalogue, following the British Library cataloguing practice of the times when the titles were acquired.

The data was extracted from the British Library catalogue (through the Aleph Integrated Library System) by the Collection Metadata team. Aleph stores metadata relating to the newspaper collection as MARC records, spread across a number of fields and often with complicated holdings information. The Collection Metadata team extracted years of publication from the free-text information about date holdings; the extracted years are published alongside the original holdings field. This was then aligned to data from a separate Master Negative Database of microfilm copies of newspaper print originals (newspaper titles linked by system ID numbers), as well as an up-to-date list of digital holdings on the British Newspaper Archive, which hosts digitised newspapers from the British Library collection (https://www.britishnewspaperarchive.co.uk).
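As an illustration of the kind of extraction described above, the following is a minimal Python sketch of pulling a first and last year out of a free-text holdings statement. It is not the Collection Metadata team's procedure, which is not documented here; the function name `extract_years`, the year range accepted by the pattern and the example holdings strings are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Assumption: years of interest are four-digit numbers between 1600 and 2099.
YEAR = re.compile(r"\b(1[6-9]\d{2}|20\d{2})\b")

def extract_years(holdings: str) -> Optional[Tuple[int, int]]:
    """Return (first_year, last_year) found in a free-text holdings string,
    or None when no plausible year is present."""
    years = [int(y) for y in YEAR.findall(holdings)]
    if not years:
        return None
    return min(years), max(years)

# Hypothetical holdings statements, invented for this example.
print(extract_years("1 Jan. 1855-31 Dec. 1869; 1871-1900"))
print(extract_years("imperfect run"))
```

A pattern this simple misses abbreviated ranges such as "1855-69" and cannot represent gaps within a run, which is one reason manual checking of the extracted years remains necessary.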

The initial data extraction was followed by a process of cleaning the extracted data by the News Collections team. We manually adjusted some 2,000 holdings records, mostly where years had not been extracted or there were inconsistencies surrounding place of publication—either where alternative spellings or punctuation had been used, or to ensure that the entire dataset used the same set of UK county boundaries. For this, and for joining the initial dataset to the microfilm and digital holdings, custom scripts for cleaning and extracting the data were developed using R, before exporting to the final .csv format.
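The cleaning and joining steps can be sketched as follows. The project's own scripts were written in R; this is an illustrative Python equivalent under stated assumptions, not a port of them. The field names `system_id` and `place`, and the helper names `normalise_place` and `join_holdings`, are hypothetical.

```python
import re

def normalise_place(place: str) -> str:
    """Collapse punctuation, spacing and capitalisation variants of a place name."""
    cleaned = re.sub(r"[^\w\s-]", "", place)        # drop punctuation such as full stops
    cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace
    return cleaned.title()

def join_holdings(titles, microfilm_ids):
    """Left-join title records to the set of system IDs that have microfilm copies.

    `titles` is a list of dicts; `microfilm_ids` is a set of system ID strings.
    Every title is kept, with a flag recording whether a microfilm copy exists.
    """
    joined = []
    for record in titles:
        row = dict(record)  # copy, so the input records are left unchanged
        row["place"] = normalise_place(row["place"])
        row["has_microfilm"] = row["system_id"] in microfilm_ids
        joined.append(row)
    return joined
```

Rule-based normalisation of this kind only handles mechanical variation; alternative spellings and changes in county boundaries still need a manually maintained lookup, as the adjustment of some 2,000 records suggests.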

SAMPLING STRATEGY

To produce a reusable dataset, we decided to limit it to British and Irish newspapers, where there were fewer complications with the data, such as dealing with languages other than English, the need for research into the history of some titles, or the need to consult other British Library curators in relevant area studies. A complete listing of all titles in the newspaper collection will be a follow-up project, scheduled to take place in 2021.

QUALITY CONTROL

Not Applicable.

DATASET DESCRIPTION

OBJECT NAME

British and Irish Newspapers: A title-level list of British, Irish, British Overseas Territories and Crown Dependencies newspapers held by the British Library

FORMAT NAMES AND VERSIONS

Excel; CSV; plaintext

CREATION DATES

2016-01-22–2019-11-18

DATASET CREATORS

Danskin, Alan – Collection Metadata Standards Manager (British Library)

Lester, Stephen – Curator, Newspaper Collections (British Library)

McKernan, Luke – Lead Curator, News and Moving Image (British Library)

Ryan, Yann – Curator, Newspaper Data (now post-doctoral researcher, Queen Mary University of London)

LANGUAGE

English

LICENSE

CC0

REPOSITORY NAME

British Library Research Repository

PUBLICATION DATE

2019-11-18

REUSE POTENTIAL

The past six years or so have seen the rise of ‘Collections as Data’: the idea that metadata from holdings of cultural heritage collections can function as data to be analysed in its own right. Tim Sherratt, for example, has used the metadata from the National Library of Australia’s Trove digitised newspaper collection to undertake historical analyses.

As newspapers are digitised, detailed metadata are produced in tandem, providing issue-level details on the place and date of publication, which can then be exploited by researchers. However, this relates only to the portion of the collection that has been digitised, currently just over 8% of the entire British Library newspaper collection of 450 million pages. Until now, no readily accessible survey of the Library’s print holdings has been available to researchers. We see the main reuse potential of this dataset as four-fold:

Firstly, the list can be used in conjunction with the physical holdings and the Library’s Explore catalogue (https://explore.bl.uk) as a general finding aid, or to narrow down one’s search to a specific corpus of newspapers. While Explore is already an excellent search tool, this list aids discovery by enabling easy browsing by date and location.

Secondly, it opens up newspaper data to the non-specialist. We purposely standardised and simplified the data fields so that users could take advantage of the filtering, sorting and graphing functions in software such as Excel or Google Sheets.

Thirdly, it allows for geographical and diachronic analyses of the British and Irish newspaper industries, making it easy to produce, for example, time-series statistics on the establishment of new titles, or maps of individual ‘hotspots’ of newspaper growth at county or city level. While geographic coordinates were beyond the scope of the dataset, code to georeference the structured data accurately is being developed specifically for use with the title list.
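A time series of new titles can be derived from the first year of publication recorded for each title. The following Python sketch counts titles by the decade in which they were established; the function name `new_titles_per_decade` is hypothetical, and the list of years passed to it is assumed to have been read from whichever column of the title list holds the first year of publication.

```python
from collections import Counter

def new_titles_per_decade(first_years):
    """Count titles by the decade of their first year of publication.

    `first_years` is an iterable of integer years, one per title.
    Returns a dict mapping decade start (e.g. 1850) to number of titles,
    ordered chronologically.
    """
    counts = Counter((year // 10) * 10 for year in first_years)
    return dict(sorted(counts.items()))
```

Because a change of title produces a new catalogue record, counts of this kind measure new catalogue titles rather than strictly new publications, and should be read with that in mind.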

Finally, understanding the print collection helps us to understand the digitised portion in context. Researchers across the world now use the data from the British Library’s digitised newspaper collection for historical research. Many of these projects employ large-scale text mining or image analytics over the entire collection to make broad historical claims: previous projects, for instance, have used the corpus to estimate dates when electricity took over from horses, or to analyse ‘subjective well-being’. However, it is also recognised that these types of claims must be understood in terms of the idiosyncrasies of the digitised collection. The corpus ultimately represents only a fraction of the Library’s newspaper collection and was not assembled to be systematic or representative. This list helps to contextualise the data in the digitised collection. A text-mining project, for example, may adjust its weighting methods once it is known what proportion of the print holdings for a particular place each digitised newspaper represents. The ‘Living with Machines’ project, a British Library and Alan Turing Institute initiative using data analysis to understand the lived experience of the nineteenth century, is already using a version of this list to carry out a ‘topographical survey’ of the digitised newspaper collection.
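One simple way to use the list for weighting is to compare, place by place, the number of titles held in print with the number digitised. The Python sketch below computes an inverse-coverage weight for each place. This is only one possible scheme, offered as an assumption for illustration; it is not the method used by Living with Machines or any other project mentioned here, and the function name `coverage_weights` and the example counts are invented.

```python
def coverage_weights(print_counts, digitised_counts):
    """Weight each place by the inverse of its digitised share of print titles.

    Both arguments map a place name to a count of titles. Places with no
    print titles or no digitised titles are skipped, since no weight can
    be derived for them.
    """
    weights = {}
    for place, n_print in print_counts.items():
        n_digitised = digitised_counts.get(place, 0)
        if n_print == 0 or n_digitised == 0:
            continue
        weights[place] = n_print / n_digitised
    return weights

# Invented counts, for illustration only.
print_counts = {"Lancashire": 200, "Kent": 50, "Rutland": 4}
digitised_counts = {"Lancashire": 20, "Kent": 25}
print(coverage_weights(print_counts, digitised_counts))
```

Counting titles treats a short-lived weekly and a long-running daily as equal; a weighting based on years of coverage or number of issues may be more appropriate, depending on the research question.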

Journal of Open Humanities Data

Data Papers