Scissors and Paste: The Georgian Reprints, 1800–1837

This dataset, part of the Scissors and Paste Project (https://osf.io/nm2rq), describes instances of reprinting and text reuse (scissors-and-paste journalism) in British newspapers between 1800 and 1837. It was derived from the 19th-Century British Library Newspapers, Part 1 digitised newspaper collection by using plagiarism-detection software to identify instances of substantially similar text. It contains a series of manifests that describe (a) instances of shared content, (b) the likely directionality of copying and (c) which instances are evolutionary dead-ends with no known reprints. It comprises 1,824 TSV files, divided into four directories, with each file representing one month between January 1800 and December 1837.

allows us to work with reprinted and reworked materials with a greater confidence as to their provenance. News content, broadly defined as the time-sensitive recordings of events, was likely to be reprinted quickly and maintain a high fidelity regardless of the number of generations, making it particularly well suited for electronic discovery.
The Scissors and Paste Project (http://www.scissorsandpaste.net) tracks reprinting and reuse in the long 19th century (1783–1914) across the Anglophone world. The initial phase of the project involved the development of a suite of tools and methodologies to efficiently identify reprint families and then suggest both directionality and branching within these subsets. From these case studies, detailed analyses of additions, omissions and wholesale changes can offer insights into the mechanics of reprinting that left behind few if any other traces in the historical record.
The Georgian Reprints represents the first discrete dataset to come out of this project, focusing on the years 1800–1837 within Great Britain. It comprises 1,824 monthly listings of reprinting within the 19th-Century British Library Newspapers collection. As the wider project progresses, it is expected that further datasets for additional years and wider corpora will be made available through the project website (https://osf.io/nm2rq). A fuller description of the methods used to create the dataset, and the rationale behind these, can be found in the related article within the Journal of Victorian Culture [6].

Steps
The Source Data
The Georgian Reprints is derived from 226,507 page-level XML files from the 19th-Century British Library Newspapers, Part 1 collection [7]. The collection contains transcriptions from 51 newspaper titles, regularised into 31 distinct publications, across 38 years. The original page-level XML transcription files were first transformed using XSL (included in the dataset) into non-encoded plaintext files; all metadata and XML tags were removed. The values of the NormalisedDate, title, and pageSequence metadata tags were retained and used to name the plaintext files, and these filenames served as unique identifiers in the subsequent processing steps.
Matching with Copyfind64 v4.1.4
The plaintext files were first analysed, and instances of shared text found, using the plagiarism-detection software Copyfind64 v4.1.4 by Lou Bloomfield [8]. The following settings were used:
These settings were designed to be as forgiving as possible to OCR errors while not accruing an unmanageable number of false positives. The collection was divided into one-month sets across the 38-year period. Each of these monthly sets was compared against itself and the succeeding seven months. Manual testing suggested that any matches more than 200 days apart were either false positives, annual notices, advertisements or miscellany content rather than news. The manifests output by Copyfind64, the Raw Matching Reports, were then further processed by a series of heuristic filters, described in the next section.
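The comparison schedule described above can be sketched as follows. This is a minimal illustration rather than the project's own code: the function names `month_range` and `comparison_windows` and the (year, month) tuple representation are assumptions introduced here.

```python
from typing import Iterator, List, Tuple

Month = Tuple[int, int]  # (year, month); a hypothetical representation


def month_range(start: Month, end: Month) -> List[Month]:
    """Return every month from start to end, inclusive."""
    (year, month), months = start, []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def comparison_windows(
    months: List[Month], following: int = 7
) -> Iterator[Tuple[Month, List[Month]]]:
    """Pair each monthly set with itself and up to `following` later months."""
    for i, month in enumerate(months):
        yield month, months[i : i + following + 1]


months = month_range((1800, 1), (1837, 12))
windows = list(comparison_windows(months))
```

Because a window of eight calendar months can span more than 200 days, a separate day-based cut-off is still needed after this step. Windows near the end of the period are simply shorter, as no later months exist within the corpus.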
Memetracker applied three basic filtering heuristics to remove false positives from the Raw Matching Reports. These heuristics are not set by the user but are hard-coded into the software; they can, however, be modified by re-compiling the annotated source code available on GitHub. The first removed all self-matches (reprints in which the earlier and later instances were both from the same newspaper) through a simple deletion of these entries. Manual testing consistently indicated that these were advertisements, notices or other forms of boilerplate text. The second heuristic removed all matches that were more than 200 days apart. This harmonised the data to precisely 200 days, inclusive, as the initial eight-month filter within Copyfind64 varied slightly throughout the year. The final heuristic further tightened the word count required for a match. The settings chosen for Copyfind64 required at least 200 matching words divided among phrases of no fewer than 10 words each. Memetracker, on the other hand, looked at the three quantitative similarity measures (the overall perfect match and the imperfect matching scores for each document) and filtered out those that had a perfect match of fewer than 160 words as well as an imperfect match of fewer than 90 words in both documents. These levels were chosen by testing the Raw Matching Report from the year 1815 to remove as many false positives as possible while retaining all true matches.
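The three heuristics can be sketched as a single predicate. This is not Memetracker's source code: the `Match` record, its field names and the function `keep` are hypothetical, and the word-count rule follows one reading of the description above, namely that a match is dropped only when the perfect score and both imperfect scores fall below their thresholds.

```python
from dataclasses import dataclass
from datetime import date, timedelta

MAX_DAYS = 200       # matches further apart than this are dropped
MIN_PERFECT = 160    # threshold for the overall perfect-match score
MIN_IMPERFECT = 90   # threshold for each document's imperfect-match score


@dataclass
class Match:
    """One raw match between two pages (field names are assumptions)."""
    earlier_title: str
    earlier_date: date
    later_title: str
    later_date: date
    perfect: int
    imperfect_earlier: int
    imperfect_later: int


def keep(match: Match) -> bool:
    """Return True if the match survives all three heuristics."""
    # 1. Self-matches: both pages come from the same newspaper.
    if match.earlier_title == match.later_title:
        return False
    # 2. Date range: 200 days is still allowed, 201 is not.
    if (match.later_date - match.earlier_date).days > MAX_DAYS:
        return False
    # 3. Word counts: assumed reading, see the note above the block.
    too_short = (
        match.perfect < MIN_PERFECT
        and match.imperfect_earlier < MIN_IMPERFECT
        and match.imperfect_later < MIN_IMPERFECT
    )
    return not too_short


start = date(1815, 1, 1)
example = Match("Title A", start, "Title B", start + timedelta(days=200), 300, 250, 250)
example_kept = keep(example)
```

If the thresholds were meant to apply independently, the third test would use `or` rather than `and`; the annotated source code is the authority on that point.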
ReprintMapper, unlike Memetracker, describes specific ancestor-descendent relationships rather than all matching content. It applied the same heuristics as Memetracker before carrying out additional processing. First, it removed all same-day matches. Although it was technically possible for one newspaper to reprint material from another on the same day, the lack of edition metadata and the paucity of newspapers printed in the same geographical location made such matches highly unlikely. Future iterations of this dataset may take geographical information into account more precisely.
Next, it compared all possible predecessors of each reprint on the number of matching words and the date difference. The match with the highest fidelity was determined to be the most likely ancestor of that reprint. A comparison of computer-generated and manually created stemmata (trees) indicated that ordering on raw word matches produced results identical or near-identical to ordering based on close reading. Where two matches had identical fidelity, the earlier match was determined to be the ancestor as, in the absence of other information, it was logical to ascribe ancestry to the earliest possible source.
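The selection rule can be illustrated with a short sketch. It is not ReprintMapper's implementation: the `Candidate` record and the function `most_likely_ancestor` are hypothetical names, and fidelity is approximated here by the raw count of matching words, with the print date used only to break ties.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """A possible predecessor of a reprint (field names are assumptions)."""
    page_id: str
    printed: date
    matching_words: int


def most_likely_ancestor(
    reprint_date: date, candidates: List[Candidate]
) -> Optional[Candidate]:
    """Pick the earlier page with the most matching words; ties go to the earliest."""
    # Same-day matches are excluded, so only strictly earlier pages qualify.
    earlier = [c for c in candidates if c.printed < reprint_date]
    if not earlier:
        return None
    # Sort key: most matching words first, then earliest print date.
    return min(earlier, key=lambda c: (-c.matching_words, c.printed))


candidates = [
    Candidate("page-1", date(1815, 6, 20), 410),
    Candidate("page-2", date(1815, 6, 24), 530),
    Candidate("page-3", date(1815, 6, 22), 530),
    Candidate("page-4", date(1815, 6, 30), 900),  # same day as the reprint
]
ancestor = most_likely_ancestor(date(1815, 6, 30), candidates)
```

In this invented example, `page-2` and `page-3` tie on matching words, so the earlier of the two, `page-3`, is chosen; `page-4` is ignored because it was printed on the same day as the reprint.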
A manifest of these ancestor-descendent relationships was then output. A second manifest was also created of all pages that appeared to be evolutionary dead-ends; that is, pages that did not appear to be the ancestor of any subsequent reprints.
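Given a list of ancestor-descendent pairs, the dead-end manifest follows from a set difference. The sketch below assumes each pair is a (descendant, ancestor) tuple of page identifiers; the function name `dead_ends` and the identifiers are invented for illustration.

```python
from typing import Iterable, List, Tuple


def dead_ends(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Return pages that never appear as the ancestor of another page.

    Each pair is (descendant, ancestor), an assumed ordering.
    """
    pairs = list(pairs)
    ancestors = {ancestor for _, ancestor in pairs}
    pages = {page for pair in pairs for page in pair}
    return sorted(pages - ancestors)


# A small invented family: A is reprinted by B and C, and B by D.
family = [("B", "A"), ("C", "A"), ("D", "B")]
leaves = dead_ends(family)
```

Here `C` and `D` are dead-ends, while `A` and `B` each have at least one known descendant.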

Sampling strategy
The original XML dataset contained 42 corrupted files (out of a total corpus of 226,507 files) that could not be transformed into plain text transcriptions; a full listing is available at https://github.com/mhbeals/BL19thC_Reprints/tree/master/Errors. All other files within the collection were analysed.

Quality Control
Over the 38-year period, several publications altered their title. During the initial comparison process, the title indicated by the XML "title" tag was used to prevent data loss. In the final derived dataset, these titles have been normalised to enable consistent analysis across all years. A full manifest of titles and their normalisations in the derived dataset has been included. Versions of this data without this normalisation can be found at the Scissors and Paste Project Website.
After running Copyfind64 on a five-year set of pages, I sampled those raw matches that were excluded by Memetracker and ReprintMapper. I found that fewer than 2% of matches were incorrectly removed from the dataset, occasionally because they were over 200 days apart, but largely because they were very short articles. This represents a likely loss of 300 out of 15,700 records across the 38-year period. There was no evidence that this percentage was higher or lower in particular years or titles. This was considered an acceptable false-negative rate, as lowering the threshold would have significantly increased the rate of false positives.

(3) Dataset description
Object name
Scissors and Paste: The Georgian Reprints.

Format names and versions
TSV, XSL.

Dataset Creators
The dataset was devised and created by M. H. Beals, Loughborough University.

Language
The dataset contains 1,824 TSV files, divided into four directories. Each directory contains 456 files, each representing one month between January 1800 and December 1837. The headings for the TSV files in each directory are as follows:

RawMatchingReports
Files in this directory represent all matches as determined by Copyfind64, the month of the filename referring to that of the earlier of the two pages. See Table 1.

Memes
Files in this directory list those pairs of pages that share a significant amount of content. Each match is listed only once; that is, as B-A and not also as A-B. See Table 2.

AncestorDescendent
Files in this directory list every page, linking it to the one match that is most likely its direct ancestor (though not necessarily its direct predecessor). If there are no earlier variants of the page, the page is excluded from the list. The column headers are identical to those in Memes.

Deadends
Files in this directory describe pages that do not have any descendent pages, as determined by ReprintMapper. See Table 3.
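A file from the Memes or AncestorDescendent directory could be read along the following lines. This is a sketch under several assumptions: that each file begins with a header row using the column names RYEAR, RMONTH, RDAY, RTITLE, RPAGE, OYEAR, OMONTH, ODAY, OTITLE and OPAGE, and that months are stored as numbers. The sample row, the titles and the function name `read_pairs` are invented.

```python
import csv
import io
from datetime import date

# Invented sample in the assumed layout; real files may differ.
SAMPLE = (
    "RYEAR\tRMONTH\tRDAY\tRTITLE\tRPAGE\tOYEAR\tOMONTH\tODAY\tOTITLE\tOPAGE\n"
    "1815\t7\t3\tTitle B\t2\t1815\t6\t29\tTitle A\tS3\n"
)


def read_pairs(handle):
    """Parse matching pairs and compute the lag in days between printings."""
    pairs = []
    for row in csv.DictReader(handle, delimiter="\t"):
        later = date(int(row["RYEAR"]), int(row["RMONTH"]), int(row["RDAY"]))
        earlier = date(int(row["OYEAR"]), int(row["OMONTH"]), int(row["ODAY"]))
        pairs.append(
            {
                "later_title": row["RTITLE"],
                # Page values stay as text because some carry an S prefix.
                "later_page": row["RPAGE"],
                "earlier_title": row["OTITLE"],
                "earlier_page": row["OPAGE"],
                "lag_days": (later - earlier).days,
            }
        )
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
```

For a file on disk, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`; the encoding is also an assumption.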

(4) Reuse potential
The dataset was created to explore trends and correlations in text reuse within 19th-century British newspapers. By understanding the extent to which identical, or near-identical, texts spread in rapid succession, it becomes clearer to what degree Britain shared a common knowledge of domestic and global events. By understanding the general directionality of this news flow, we are also able to better understand the power relationship between metropolitan and provincial newspapers, as well as those in port, industrial and agricultural communities. The dataset was also created to supplement existing knowledge about the political and commercial alignments of individual newspapers by allowing for high-resolution, longitudinal studies of shared content. Thus, there is particular potential for reuse of the dataset in periodical studies. It provides a quantitative context for any discussions of the influence of a particular newspaper, especially if it is further filtered to articles known to have originated in that title.
Other potential uses are as a reference text and as a basis for further research into specific memes or reprint families. As a reference text, The Georgian Reprints is currently the largest index of reprints within British periodicals. Any individual working with the British Library newspaper collection, in whatever context and from whichever disciplinary background, can look up the individual pages they are working with and see if that content is a reprint or was reprinted elsewhere. As explicit attribution was rare in this period, evidence indicating the possible origins of a text can help inform users as to its usefulness or fundamentally change the arguments based upon it.
Those researching particular events or texts can also further develop the dataset by filtering for texts on a particular topic (manually or through topic modelling of the original collection) and then adding specific descriptions to the pair listings. These augmented datasets could then be used to qualify the trends and correlations seen across the wider dataset. For example, news of a certain genre or regarding a particular topic may have a different pattern or rate of dissemination than the corpus as a whole. Likewise, although Memetracker filtered out the majority of advertisements by removing same-title matches, a large number of national advertisements for books, patent medicines and the lottery are also listed. Filtering for these entries using full-text searching within the original collection could offer new insights into Georgian advertising. Likewise, filtering for only same-title matches in the Raw Matching Reports is likely to return a corpus largely composed of local advertising.
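As one illustration of the directional analysis mentioned in this section, the flow of content between titles can be tallied from ancestor-descendent rows. The sketch assumes rows have already been loaded as dictionaries keyed by the OTITLE and RTITLE column names; the titles and the function name `flow_counts` are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def flow_counts(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count how often content passes from one title (earlier) to another (later)."""
    flows: Counter = Counter()
    for row in rows:
        source_and_target: Tuple[str, str] = (row["OTITLE"], row["RTITLE"])
        flows[source_and_target] += 1
    return flows


# Invented rows for illustration only.
rows = [
    {"OTITLE": "Title A", "RTITLE": "Title B"},
    {"OTITLE": "Title A", "RTITLE": "Title B"},
    {"OTITLE": "Title B", "RTITLE": "Title C"},
]
flows = flow_counts(rows)
busiest = flows.most_common(1)[0]
```

Counts of this kind describe pages rather than articles, so they are best read as a rough indicator of direction rather than a measure of influence.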

Limitations and Provisos
Although the Raw Matching Reports are a complete representation of the original digitised newspaper corpus, there are some key limitations to the data within them. First, the 19th-Century British Library Newspapers, Part 1 collection contains only 31 titles and does not represent a complete corpus of the British press for this period. Careful examination of which titles are included is recommended. Second, the Raw Matching Reports only list those document pairs for which 200 words of shared content can be computationally identified. As the machine-readable transcriptions of these pages were obtained through optical character recognition (OCR) from digitised images, the accuracy rate can vary significantly; some pages are largely illegible. While these errors are unlikely to result in false positives, they may have caused a large number of false negatives. The number of true matches is therefore certainly higher than the number recognised by the comparison process, and these manifests should be considered a minimum rather than an average or maximum reprinting rate for any given title or period.
Likewise, ReprintMapper can only find the best match within the corpus. If two descendants of a single ancestor are present, but their common ancestor is not, ReprintMapper will link the later to the earlier version, even if these actually represent two different branches. Such false positives must be excluded manually by using contextual knowledge. Finally, the editions or individual copies digitised by the original British Library project were not indicated in the page-level metadata and could not be accounted for in the text comparison process.
An important final proviso is that the transcriptions used in the text comparison process were at page rather than article resolution; that is, each file represents a whole page of text rather than a smaller subdivision of it. This decision was taken owing to (a) the imprecision of computational subdivision for newspapers from this period and (b) the improvement in the matching of ancestors and descendants when there was evidence of multiple reprints from a single source. However, if a page has reprints from two separate ancestors within the corpus, ReprintMapper will only link it to the source with the larger match; the other connection will be lost. Iterations of the process at article level would produce additional pairs but lose the added certainty obtained from matching multiple-article reprints.

Column headers for the TSV files

RYEAR
This column indicates the year in which the later of two matching pages was printed.

RMONTH
This column indicates the month in which the later of two matching pages was printed.

RDAY
This column indicates the day on which the later of two matching pages was printed.

RTITLE
This column indicates the title of the later of two matching pages.

RPAGE
This column indicates the page number of the later of two matching pages. Page numbers were given an S prefix when two editions of the same date-title combination were discovered within the original XML collection. In the original collection, these may have been designated with either an S or a V in the XML filename.

OYEAR
This column indicates the year in which the earlier of two matching pages was printed.

OMONTH
This column indicates the month in which the earlier of two matching pages was printed.

ODAY
This column indicates the day on which the earlier of two matching pages was printed.

OTITLE
This column indicates the title of the earlier of two matching pages.

OPAGE
This column indicates the page number of the earlier of two matching pages. Page numbers were given an S prefix when two editions of the same date-title combination were discovered within the original XML collection. In the original collection, these may have been designated with either an S or a V in the XML filename.

PAGE
This column indicates the page number. Page numbers were given an S prefix when two editions of the same date-title combination were discovered within the original XML collection. In the original collection, these may have been designated with either an S or a V in the XML filename.