Annotated References in the Historiography on Venice: 19th–21st centuries

We publish a dataset containing more than 40,000 manually annotated references from a broad corpus of books and journal articles on the history of Venice. References were taken from both reference lists and footnotes, and include primary and secondary sources, in full or abbreviated form. The dataset comprises references from publications from the 19th to the 21st century. References were collected from a newly digitized corpus and manually annotated in all their constituent parts. The dataset is stored in a GitHub repository, persisted in Zenodo, and accompanied by code to train parsers that extract references from other publications. Two trained Conditional Random Fields models are provided along with their evaluation, to act as a baseline for a parsing shared task. No comparable public dataset exists to support the task of reference parsing in the humanities. The dataset is of interest to all those working on reference parsing and citation extraction in the humanities.

• The inherent complexity and variety of referencing practices in the humanities, both at the syntactic and semantic levels. Such variety is mostly due to disciplinary traditions, to the use of footnotes as a textual space in itself, and to the variety of cited sources.
• The lack of annotated data with sufficient coverage in two critical areas: locality (of language and scholarly practice) and time (going back at least to the 19th century, when modern academic scholarship starts).
These two challenges make reference parsing in the humanities not intrinsically different from reference parsing in the sciences, simply more involved. Several projects already aim to provide frameworks for extracting reference data from humanities publications, either exclusively or alongside other fields [5,7,10]. The manually annotated dataset of references released here is part of the Linked Books project, whose goal is to develop an in-depth approach to the problem of indexing humanities publications via citations. The project considers a single field of historiography, the history of Venice, but covers both local historiography and the whole modern period of the discipline (19th century to the present). The core idea of the project is to involve research libraries in a collaborative and distributed digitization and indexing process, by developing and providing the necessary IT infrastructure. The dataset being released was produced by librarians working for the Linked Books project from 2014 to 2016. It is used to power the reference parsing and extraction modules of the platform that supports the daily operations of the project. To the best of our knowledge, no comparable dataset has been published yet.
This release is directed towards practitioners in the domain of reference parsing, with the hope that it can be used to enrich their datasets. It is also intended for anyone interested in this specific machine learning task, with the hope that they can improve on the results presented here. Lastly, it is meant to contribute to and encourage a better integration of datasets and technical tools in this domain.

Methods
The main characteristic of this dataset is its provenance, namely the corpus of publications from which it was extracted: a mixed set of monographs and journal articles, with special attention to local publications in non-English languages. Secondly, it contains references to any possible source cited by historians, over a very long period of time; the annotation taxonomies were therefore refined with a bottom-up approach. Thirdly, it contains references from both reference lists and footnotes, including abbreviated references, which are commonplace in the humanities. Lastly, the dataset is used to train parsers using a standard technique in this domain, Conditional Random Fields.

Steps
Annotation was conducted using Brat [11]. It proceeded by page: all the references on a randomly picked page were annotated. If a reference spanned two pages, both pages were annotated in full. An initial testing period was needed to stabilize the annotation taxonomy; the annotations produced during that period were discarded. The main challenge encountered during annotation was the presence of outlier tags: rarely occurring, yet sufficiently distinct to warrant a category of their own. This is especially true for unpublished primary sources, whose tag variety is greater than that of published materials. Outlier tags need to be taken into account for automated parsing. After annotation, all annotations were consolidated and exported for further use.

Sampling strategy
The selection of the corpus of publications from which to extract references to annotate is described in detail elsewhere [3]. The rationale was to select recent monographs and the complete archive of specific journals, with the aid of library catalogs, scholarly bibliographies and domain experts. The result was a first collection of 1922 monographs and 3 journals: Ateneo Veneto, Archivio Veneto and Studi Veneziani, for a total of 552 issues. After digitization and OCR, the latter done using ABBYY FineReader Corporate v12, a second sampling was conducted for annotation, namely:
• 196 monographs were randomly picked and their reference lists completely annotated.
• 144 journal issues were randomly picked and a set of references was annotated from their footnotes (a minimum of two contiguous pages for each article in the issue, leaving annotators free to select pages dense in references).
The first issue was published in 1866, the last in 2013, so as to cover all periods of interest and the variations in referencing practices therein.

Quality Control
The quality of the annotations relies on the joint work of the annotators, who worked at the same time in the same room and consulted each other on problematic choices. No double-keyed annotation of a subset of the data has been conducted to date.

Annotation taxonomies
The main annotation distinction was made between generic and specific tags, that is, between whole references and their components. Generic tags distinguish between primary sources (such as archival documents), secondary sources (books) and meta sources (secondary sources published within a container source, such as journal articles or contributions in edited volumes). This classification choice is motivated by a) the difference in their components (specific tags) and b) the needs of the look-up module in our pipeline, which matches a reference with a unique identifier in an internal or external repository, such as a library catalog, in order to define a citation; different external resources are used for each generic category. Specific tags cover instead all the possible components of the three classes of references mentioned above, such as author, title and publication year for secondary sources, or archive, archival reference and archival unit for primary sources. More examples are given in Tables 1 and 2. The full taxonomy is available in the GitHub repository associated with this article.
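The two-level taxonomy can be pictured as a small data structure. The following is a minimal sketch, not the format used in the released JSON file: the class names, field names and tag spellings are assumptions made for illustration, and the example reference is a placeholder rather than an entry from the dataset.

```python
from dataclasses import dataclass, field
from enum import Enum


class GenericTag(Enum):
    """Whole-reference categories named in the text (spellings are assumed)."""
    PRIMARY = "primary"      # e.g. archival documents
    SECONDARY = "secondary"  # e.g. books
    META = "meta"            # e.g. journal articles, contributions in edited volumes


@dataclass
class Component:
    """One specific tag: a labeled character span inside a reference."""
    tag: str    # e.g. "author", "title", "archive" (hypothetical spellings)
    start: int  # inclusive character offset
    end: int    # exclusive character offset


@dataclass
class Reference:
    """A whole reference with its generic tag and its annotated components."""
    text: str
    generic: GenericTag
    components: list = field(default_factory=list)

    def component_text(self, tag):
        """Return the surface strings of all components carrying `tag`."""
        return [self.text[c.start:c.end] for c in self.components if c.tag == tag]


# Placeholder reference, invented for this example only.
ref = Reference(
    text="A. Author, A Title, 1900",
    generic=GenericTag.SECONDARY,
    components=[
        Component("author", 0, 9),
        Component("title", 11, 18),
        Component("publication_year", 20, 24),
    ],
)
print(ref.component_text("title"))
```

Keeping the generic tag on the reference and the specific tags on its spans mirrors the distinction drawn above, and lets a look-up step choose its external resource from the generic tag alone.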

Dataset description
The annotated dataset is provided as a zipped JSON file within a repository containing extra details and code to train parsing models. The interested reader can find an introduction to CRFs in [12]. Preliminary parsing results on subsets of the dataset are already reported elsewhere, including a more detailed description of the challenges encountered, the features used and the ablation tests conducted in order to select the best performing combination of features [2,3]. The full code is provided for replication in the repository associated with this article.
Both models were trained as follows. First, the annotated dataset was consolidated in order to group similar and under-represented tags under the same tag. Details are given in the repository's README file. Afterwards, for each model, 10% of the relevant annotations were set aside for validation; of the remaining 90%, 25% was used as test data and 75% as training data. When trained with a quasi-Newton method (L-BFGS), CRFs have two main parameters: c1 and c2, for L1 and L2 regularization, respectively. The provided models use the following parameters:
• Model 1, c1: 0.07; c2: 0.378.
• Model 2, c1: 0.09; c2: 0.447.
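The split described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the function name, the shuffling with a fixed seed and the rounding of split sizes are all assumptions, and the "algorithm" key is a hypothetical way of recording the choice of L-BFGS. Only the proportions and the c1 and c2 values come from the text.

```python
import random


def split_annotations(items, seed=0):
    """Split items into train, test and validation sets.

    10% is held out for validation; of the remaining 90%,
    25% becomes test data and 75% training data.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (an assumption)

    n_validation = round(0.10 * len(shuffled))
    validation = shuffled[:n_validation]
    remaining = shuffled[n_validation:]

    n_test = round(0.25 * len(remaining))
    test = remaining[:n_test]
    train = remaining[n_test:]
    return train, test, validation


# Regularization values reported in the text (c1: L1, c2: L2).
CRF_PARAMS = {
    "model_1": {"algorithm": "lbfgs", "c1": 0.07, "c2": 0.378},
    "model_2": {"algorithm": "lbfgs", "c1": 0.09, "c2": 0.447},
}

train, test, validation = split_annotations(range(1000))
print(len(train), len(test), len(validation))
```

With 1,000 items this gives 675 training, 225 test and 100 validation items, that is 67.5%, 22.5% and 10% of the whole.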
Another relevant choice for the CRF models is the dependency window to consider, which was set to two tokens before and after the one under consideration.
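A window of two tokens on each side can be expressed as a feature function of the following kind. This is a minimal sketch: the individual features (lowercased form, digit and title-case flags, boundary markers) are assumptions chosen for illustration, whereas the features actually used are described in [2,3].

```python
def token_features(tokens, i, window=2):
    """Features for tokens[i], drawn from the token itself and from
    `window` neighbors on each side (two by default)."""
    features = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            # Position falls before the start of the sequence.
            features[f"{offset}:BOS"] = True
        elif j >= len(tokens):
            # Position falls after the end of the sequence.
            features[f"{offset}:EOS"] = True
        else:
            tok = tokens[j]
            features[f"{offset}:lower"] = tok.lower()
            features[f"{offset}:is_digit"] = tok.isdigit()
            features[f"{offset}:is_title"] = tok.istitle()
    return features


def sequence_features(tokens, window=2):
    """One feature dictionary per token of a tokenized reference."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


# Placeholder tokens, invented for this example only.
tokens = ["A.", "Author", ",", "A", "Title", ",", "1900"]
feats = sequence_features(tokens)
print(sorted(feats[0]))
```

Each feature key carries its relative offset, so the model can weight the same surface property differently depending on whether it appears on the current token or on a neighbor.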
The evaluation of both models with their best parameters on the test set is given in Tables 1 and 2, to be read along with the confusion matrices in Figure 1.
Both models perform acceptably well on their most important tasks. For model 1, this means being correct on the most discriminative (and best represented) tags, such as author, title or archival reference. For model 2, this means getting the extraction task (begin-end) right, which is more important than getting the classification right. The extraction task is performed decently, given that most errors are misclassifications in the classification task rather than in the begin-end task. The 5-fold validation and final validation results are given in Table 3, for models now trained on all but the validation data.
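For readers who wish to compute comparable figures on their own predictions, per-tag scores and confusion counts can be derived as in the sketch below. It assumes token-level precision, recall and F1 over aligned gold and predicted tag sequences; whether the tables use exactly these definitions is an assumption, and the function names and toy tags are illustrative.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold_tag, predicted_tag) pairs over two aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def per_tag_scores(gold, predicted):
    """Return {tag: (precision, recall, f1)}, computed token by token."""
    counts = confusion_counts(gold, predicted)
    scores = {}
    for tag in sorted(set(gold) | set(predicted)):
        tp = counts[(tag, tag)]
        fp = sum(n for (g, p), n in counts.items() if p == tag and g != tag)
        fn = sum(n for (g, p), n in counts.items() if g == tag and p != tag)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)
    return scores


# Toy sequences, invented for this example only.
gold = ["author", "author", "title", "title", "year"]
pred = ["author", "title", "title", "title", "year"]
scores = per_tag_scores(gold, pred)
for tag, (p, r, f) in scores.items():
    print(f"{tag}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The off-diagonal entries of `confusion_counts` are what a confusion matrix displays: here the single error is one author token predicted as title.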

Reuse potential and future work
The main use of this dataset is to train new reference extraction tools, or to enrich existing ones, with more data of a kind normally difficult and costly to find. The dataset might also be of use to teachers and interested researchers willing to experiment with machine learning techniques in order to improve upon our results: the code is shared in order to encourage not only replication but especially improvement.
The dataset comes with a number of limitations, most notably its domain specificity. It is unknown, and an interesting open question, to what extent models trained on this annotated dataset can perform well on similar tasks for different contents. The provided models are, to a large extent, language-independent, since the corpus already contains a variety of languages.
We plan to focus next on the release of larger quantities of both manually and automatically produced annotations as linked data. We suggest that two immediate open challenges for the community are: sharing and federating annotated data for reference parsing in the humanities under unique standards; subsequently developing general parsers which could be reliably applied to a variety of different collections.