Citation indexes, such as Google Scholar, the Web of Science and Scopus, are among the main literature retrieval tools available to modern scholars. They rest on large-scale reference parsers that are by now reasonably reliable. Nevertheless, the disciplines traditionally part of the humanities are still poorly covered by citation indexes of any sort, something that hinders both the work of humanists and the understanding of the humanities as scholarly disciplines, not to mention their evaluation. A key aspect of the problem is the lack of citation data, especially for local publications not in English and for non-article publications such as scholarly monographs. The availability of citation data in turn depends on solving the technical challenge of parsing and extracting references from literature in the humanities.
Reference parsing does not exist in isolation. It depends on the prior digital availability of publications, and its ultimate results rest on the possibility of disambiguating each extracted reference and linking it back to the identifier of the resource it points to. Open challenges for the former step are copyright, digitization and classification of publications. The main open challenge for the latter step is the absence of global repositories of metadata on the sources used by humanists, a problem that is especially acute for archival materials, whose metadata ecosystems are less integrated than library catalogs.
Reference parsing poses a set of challenges in itself, which are of two kinds:
- The inherent complexity and variety of referencing practices in the humanities, both at the syntactic and semantic levels. Such variety is mostly due to disciplinary traditions, to the use of footnotes as a textual space in itself, and the variety of cited sources.
- The lack of annotated data with sufficient coverage in two critical areas: locality (of language and scholarly practice) and time (going backwards at least to the 19th century, when modern academic scholarship starts).
These two challenges make reference parsing in the humanities not intrinsically different from reference parsing in the sciences, simply more involved. Several projects already aim to provide frameworks for extracting reference data from collections that include, or consist specifically of, humanities publications [5, 7, 10].
The manually annotated dataset of references released here is part of the Linked Books project, whose goal is to develop an in-depth approach to the problem of indexing humanities publications via citations. The project considers only one field in historiography, the history of Venice, but does so by covering local historiography and the whole modern period of the discipline (from the 19th century to the present). The core idea of the project is to involve research libraries in a collaborative and distributed digitization and indexing process, by developing and providing the necessary IT infrastructure. The dataset being released was produced by librarians working for the Linked Books project during the period 2014 to 2016. Its purpose is to power the reference parsing and extraction modules of the platform used for the daily operations of the project. To the best of our knowledge, no comparable dataset has been published yet.
This release is directed towards practitioners in the domain of reference parsing, in the hope that it can be used to enrich their datasets. It is also meant for anyone interested in this specific machine learning task, in the hope that they can improve on the results presented here. Lastly, it is meant to contribute to and encourage a better integration of datasets and technical tools in this domain.
The main characteristic of this dataset is its provenance, namely the corpus of publications from which it was extracted: a mixed set of monographs and journal articles, with special attention to local publications in non-English languages. Secondly, it contains references to any possible source cited by historians over a very long period of time, which is why the annotation taxonomies were refined with a bottom-up approach. Thirdly, it contains references from both reference lists and footnotes, including abbreviated references, which are commonplace in the humanities. Lastly, the dataset is used to train parsers using a standard technique in this domain, Conditional Random Fields.
Annotation was conducted using Brat. It proceeded by page: all the references on a randomly picked page were annotated. If a reference spanned two pages, both pages were annotated in full. An initial testing period was needed to stabilize the annotation taxonomy; the annotations produced during it were discarded. The main challenge encountered during annotation was the presence of outlier tags: rarely occurring, yet sufficiently distinct to warrant a category of their own. This is especially true for unpublished primary sources, whose tag variety is greater than that of published materials. Outlier tags need to be taken into account for automated parsing. After annotation, all annotations were consolidated and exported for further use.
The selection of the corpus of publications from which to extract references to annotate is described in detail elsewhere. The rationale was to select recent monographs and the complete archive of specific journals, with the aid of library catalogs, scholarly bibliographies and domain experts. The result was a first collection of 1922 monographs and 3 journals: Ateneo Veneto, Archivio Veneto and Studi Veneziani, for a total of 552 issues. After digitization and OCR, the latter done using ABBYY FineReader Corporate v12, a second sampling was conducted for annotation, namely:
- 196 monographs were randomly picked and their reference lists completely annotated.
- 144 journal issues were randomly picked and a set of references was annotated from their footnotes (a minimum of two contiguous pages for each article in the issue, leaving annotators free to select pages dense in references). The first issue was published in 1866, the last in 2013, in order to cover all periods of interest and the variations in referencing practices therein.
The quality of the annotations rests on the joint work of the annotators, who worked at the same time in the same room and consulted each other on problematic choices. No double-keyed annotation on a subset of the data has been conducted to date.
The main annotation distinction was made between generic and specific tags, that is, between whole references and their components. Generic tags distinguish primary sources (such as archival documents), secondary sources (books) and meta sources (secondary sources published within a container source, such as journal articles or contributions in edited volumes). This classification choice is motivated by a) the difference in their components (specific tags) and b) the needs of the look-up module in our pipeline, which matches a reference with a unique identifier in an internal or external repository, such as a library catalog, in order to define a citation; different external resources are used for each generic category. Specific tags instead include all the possible components of the three classes of references mentioned above, such as author, title and publication year for secondary sources, or archive, archival reference and archival unit for primary sources. More examples are given in Tables 1 and 2. The full taxonomy is available in the GitHub repository associated with this article.
|Avg / total||0.837||0.840||0.836||128’794|
|Avg / total||0.875||0.874||0.874||147’088|
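The generic/specific distinction described above can be pictured as a small mapping. The sketch below is illustrative only: the specific tags listed are the examples named in the text (the full taxonomy is in the repository), and the look-up target names are hypothetical placeholders, since the text only states that different external resources are used per generic category.

```python
from enum import Enum


class Generic(Enum):
    PRIMARY = "primary"      # e.g. archival documents
    SECONDARY = "secondary"  # e.g. books
    META = "meta"            # e.g. journal articles, contributions in edited volumes


# Example specific tags per generic class, taken from the text.
# This is a subset for illustration, not the full taxonomy.
EXAMPLE_SPECIFIC_TAGS = {
    Generic.PRIMARY: ("archive", "archival reference", "archival unit"),
    Generic.SECONDARY: ("author", "title", "publication year"),
}

# Hypothetical target names (assumption): each generic category is routed
# to a different kind of external resource by the look-up module.
LOOKUP_TARGET = {
    Generic.PRIMARY: "archival finding aid",
    Generic.SECONDARY: "library catalog",
    Generic.META: "article index",
}


def lookup_target(generic: Generic) -> str:
    """Return the kind of resource a reference of this class is matched against."""
    return LOOKUP_TARGET[generic]
```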
The annotated dataset is given as a zipped JSON file within a repository containing extra details and code to train parsing models.
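A zipped JSON file can be read with the Python standard library alone. The following is a minimal sketch: the member name `sample.json` and the record content are placeholders built in memory so that the example runs without the real archive, and nothing here reflects the dataset's actual file names or schema.

```python
import io
import json
import zipfile


def load_zipped_json(source):
    """Return {member_name: parsed_json} for every .json member of a zip archive.

    `source` may be a file path or a binary file-like object.
    """
    records = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                with archive.open(name) as handle:
                    records[name] = json.load(handle)
    return records


def _demo_archive():
    """Build a tiny in-memory archive (placeholder content, not the real data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample.json", json.dumps([{"text": "placeholder"}]))
    buffer.seek(0)
    return buffer


demo = load_zipped_json(_demo_archive())
```

To load the released data, the same function would be called with the path to the downloaded archive.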
Format names and versions
JSON, Python 3.
Creation dates
2014 to 2016.
Dataset creators
Giovanni Colavizza, Matteo Romanello, Martina Babetto and Silvia Ferronato.
Language
English. Contents are in a variety of languages, mainly Italian, English, French, German, Spanish and Latin.
License
CC BY Attribution 4.0 International.
Repository name
GitHub and Zenodo.
Statistics and contents
Basic statistics of the dataset.
|Generic from monographs (reference lists)||11’360|
|Generic from journal articles (footnotes)||29’711|
|Annotated documents over the whole database||14%|
|Avg. annotated pages per annotated document||17|
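The 14% figure appears consistent with the sampling numbers given earlier, assuming that "documents" counts monographs and journal issues together (an assumption, as the text does not say so explicitly). A short check of the arithmetic:

```python
# Corpus and sample sizes as reported in the text.
monographs_total, issues_total = 1922, 552
monographs_annotated, issues_annotated = 196, 144

# Assumption: a "document" is either a monograph or a journal issue.
documents_total = monographs_total + issues_total
documents_annotated = monographs_annotated + issues_annotated
share_annotated = documents_annotated / documents_total

# Generic-tag counts from the statistics table.
generic_total = 11360 + 29711

print(f"{documents_annotated} of {documents_total} documents "
      f"({share_annotated:.1%}); {generic_total} generic annotations in total")
```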
Parsers trained with the dataset
Two parsers were trained using the annotated dataset. The first parser assigns specific tags to the full text of new publications (model 1: citation parsing). The second parser assigns generic and begin-end tags to the same full text, relying on the results of the first parser (model 2: citation extraction and classification). Both parsers use Conditional Random Fields (CRF), a standard technique for text parsing tasks. The interested reader can find an introduction to CRFs in . Preliminary parsing results on subsets of the dataset are already reported elsewhere, including a more detailed description of the challenges encountered, the features used and the ablation tests conducted in order to select the best performing combination of features [2, 3]. The full code is provided for replication in the repository associated with this article.
Both models were trained as follows. First, the annotated dataset was consolidated in order to group similar and under-represented tags under the same tag; details are given in the repository’s README file. Afterwards, for each model, 10% of the relevant annotations were set aside for validation. Of the remaining 90%, 25% was used as test data and 75% as training data. Training used a quasi-Newton optimization method (L-BFGS), for which the CRFs have two main parameters: c1 and c2, the coefficients for L1 and L2 regularization, respectively. The provided models use the following parameters:
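One plausible reading of how the second model relies on the first is that the tags predicted by model 1 are passed to model 2 as an additional feature. The sketch below illustrates only that chaining. The two taggers are toy stand-ins rather than CRFs, and the feature names and the `b-`/`i-`/`e-` tag format are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Features = List[Dict[str, str]]
Tagger = Callable[[Features], List[str]]


def run_pipeline(tokens: List[str], model1: Tagger, model2: Tagger) -> List[Tuple[str, str, str]]:
    """Chain two sequence taggers: model 2 sees the output of model 1."""
    # Stage 1: specific tags predicted from token features.
    stage1_features = [{"token": token} for token in tokens]
    specific = model1(stage1_features)
    # Stage 2: same tokens, with stage-1 predictions as an extra feature.
    stage2_features = [{"token": t, "specific": s} for t, s in zip(tokens, specific)]
    generic = model2(stage2_features)
    return list(zip(tokens, specific, generic))


def toy_model1(features: Features) -> List[str]:
    """Toy stand-in: digits are years, everything else is title."""
    return ["year" if f["token"].isdigit() else "title" for f in features]


def toy_model2(features: Features) -> List[str]:
    """Toy stand-in: treat the whole sequence as one secondary-source reference."""
    last = len(features) - 1
    tags = []
    for i, _ in enumerate(features):
        position = "b" if i == 0 else "e" if i == last else "i"
        tags.append(f"{position}-secondary")
    return tags


result = run_pipeline(["Storia", "di", "Venezia", "1866"], toy_model1, toy_model2)
```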
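The nested split described above can be sketched as follows. The unit being split (reference, page or document), the shuffling and the seed are assumptions, as the text only gives the proportions. With these proportions, the training, test and validation sets hold 67.5%, 22.5% and 10% of the annotations respectively.

```python
import random


def three_way_split(items, seed=0):
    """Hold out 10% for validation, then split the rest 75/25 into train/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility (assumption)
    n_validation = round(0.10 * len(shuffled))
    validation, rest = shuffled[:n_validation], shuffled[n_validation:]
    n_test = round(0.25 * len(rest))
    test, train = rest[:n_test], rest[n_test:]
    return train, test, validation


train, test, validation = three_way_split(range(1000))
```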
- Model 1, c1: 0.07; c2: 0.378.
- Model 2, c1: 0.09; c2: 0.447.
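For orientation, the role of the two parameters can be written as a regularized training objective. This is a sketch assuming the common elastic-net formulation for CRFs; the exact scaling of each term depends on the implementation and is not stated in the text. Here $w$ are the feature weights and $(x^{(i)}, y^{(i)})$ are the $N$ annotated training sequences.

```latex
\min_{w} \; -\sum_{i=1}^{N} \log p\left(y^{(i)} \mid x^{(i)}; w\right)
  \; + \; c_1 \lVert w \rVert_1 \; + \; c_2 \lVert w \rVert_2^2
```

A larger $c_1$ pushes more weights to exactly zero, while a larger $c_2$ shrinks all weights smoothly.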
Another relevant choice for the CRF models is the dependency window to consider, which was set to two tokens before and after the one under consideration.
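A window of two tokens on either side can be expressed as a feature function that, for each token, also records properties of its neighbours. The sketch below is a minimal illustration: the feature names and the two properties used (lowercased form, digit test) are hypothetical, and the features actually used are described in [2, 3] and in the repository.

```python
def window_features(tokens, index, window=2):
    """Features for tokens[index], including neighbours up to `window` away."""
    features = {"bias": 1.0}
    for offset in range(-window, window + 1):
        position = index + offset
        if 0 <= position < len(tokens):
            token = tokens[position]
            # Keys are prefixed with the signed offset, e.g. "-2:lower", "+0:lower".
            features[f"{offset:+d}:lower"] = token.lower()
            features[f"{offset:+d}:is_digit"] = token.isdigit()
        elif position < 0:
            features["BOS"] = True  # window reaches before the sequence start
        else:
            features["EOS"] = True  # window reaches past the sequence end
    return features


tokens = ["G.", "Cozzi", ",", "Venezia", "1982"]
first = window_features(tokens, 0)
last = window_features(tokens, 4)
```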
Both models perform acceptably well on the tasks that matter most for each of them. For model 1, this means being correct on the most discriminative (and best represented) tags, such as author, title or archival reference. For model 2, this means getting the extraction task (begin-end) right, which is more important than getting the classification right. Extraction is performed decently, given that most errors are misclassifications in the classification task rather than errors in the begin-end task.
The 5-fold cross-validation and final validation results are given in Table 3; the final validation uses models trained on all data except the validation set.
|Model||5-fold average f1 score||Validation precision||Validation recall||Validation f1 score|
|2: extraction and classification||0.930||0.908||0.908||0.908|
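The 5-fold procedure behind the first column of the table can be sketched as an index generator. The interleaved assignment of items to folds is an assumption, since the text does not say how folds were formed.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    # Interleaved folds: item i goes to fold i % k (assumption).
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


splits = list(k_fold_indices(10, k=5))
```

The reported average is then the mean of the scores obtained on the five held-out folds.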
Reuse potential and future work
The main use of this dataset is to train new reference extraction tools, or to enrich existing ones, with data of a kind that is normally difficult and costly to find. The dataset might also be of use to teachers and interested researchers willing to experiment with machine learning techniques in order to improve upon our results: the code is shared to encourage not only replication but especially improvement.
The dataset comes with a number of limitations, most notably its domain specificity. It is currently unknown, and an interesting open question, to what extent models trained on this annotated dataset can perform well on similar tasks with different contents. To the extent possible, the provided models are language-independent, because the corpus already contains a variety of languages.
We plan to focus next on the release of larger quantities of both manually and automatically produced annotations as linked data. We suggest that two immediate open challenges for the community are: sharing and federating annotated data for reference parsing in the humanities under common standards; and subsequently developing general parsers which could be reliably applied to a variety of different collections.