(1) Context and motivation
The area of Digital Humanities (DH) stands at the intersection of computing and the humanities. Each day, DH involves more collaborative and transdisciplinary research, bringing digital tools and methods for studying the humanities. DH is an essential growing research field for which natural language processing (NLP) has much to offer. NLP may enable researchers to explore large amounts of data and discover new elements and correlations according to different aspects of each research area. This trend is also known for historical research, and it is claimed that computers are already ubiquitous in historical scholarship (Romein et al., 2020).
Named Entity Recognition (NER) is an NLP task that finds proper nouns in a given text and classifies them according to pre-defined categories (Jurafsky & Martin, 2000). In recent years, the advent of neural networks that automatically learn effective features from raw text (Collobert et al., 2011) has allowed NER systems to bypass the manual selection of features. The effectiveness of word and character embeddings (continuous real-valued vectors) obtained from pre-trained neural language models (LMs) (Chiu & Nichols, 2016; Collobert et al., 2011; dos Santos & Guimarães, 2015; Lample et al., 2016) has significantly impacted these systems. These embeddings encapsulate semantic, syntactic, and morphological information about the language in a way that can easily be incorporated into a sequence tagger, usually implemented as a recurrent neural network. Such systems are available for easy use in NLP frameworks such as spaCy.1
More recent state-of-the-art NER systems use contextualised embeddings such as BERT and Flair (Santos et al., 2019a; Souza et al., 2019). This is accomplished by adding to the system the embedding matrix of a pre-trained LM and incorporating the pre-trained LM itself as the initial layers of the system. An example of a system that uses this method for English and German is presented in Akbik et al. (2018). For the Portuguese language, such systems are presented in Souza et al. (2019) and Santos et al. (2019a).
This paper aims to present a corpus of texts dated 1758–1761, the Parish Memories, enriched with the (automatic) annotation of entities of the classes person (PER), location (LOC), and organisation (ORG). The reasons for this choice of classes are both the availability of systems that recognise them and their relevance for historians, as explained later in the paper.
For the automatic annotation, we evaluated two available Portuguese NER systems. One is a commonly used framework, spaCy, which includes Portuguese, among other languages. The second was developed specifically for the Portuguese language and has been well evaluated in many domains (Santos et al., 2019a). The evaluation considered a portion of the corpus, where the automatic extraction was contrasted with human analysis. The manual annotation is presented and discussed regarding the particularities of the corpus, with insights for improving the annotation process, both manual and automatic, in further research. For this paper, however, its main role is to demonstrate the quality of the enriched corpus provided.
The paper is organised as follows: Section 2 presents the original corpus, introduces the NER systems considered for evaluation, and describes the manual annotation; Section 3 discusses the systems’ evaluation; Section 4 presents the resulting annotated corpus and the usefulness of obtaining knowledge through the automatic processing of textual historical corpora; Section 5 presents conclusions and future work.
(2) Material and methods
This section presents the corpus provided, the systems considered for automatic annotation, and the process of manually annotating the entities.
(2.1) The Corpus: Portuguese Parish Memories – Alentejo
The Parish Memories-Alentejo (PM-A) corpus is an essential source for obtaining Portugal’s description in 1758–1761. Digitised copies of the original documents are available at Arquivo Nacional da Torre do Tombo.2 Figure 1 presents an example of the original manuscript.
The collection corresponds to a survey sent at the beginning of 1758 to those responsible for all the country’s dioceses and then distributed to the parish priests. The Portuguese crown sent the survey to better know the country’s real situation less than three years after the great earthquake of 1755. At that time, it was usual for sovereign states and some academies to conduct surveys to improve their knowledge about certain territories or topics. Portugal was no exception, and the coverage and great detail of the Parish Memories make it one of the most important surveys of its kind (Chorão, 1987). It contains detailed descriptions of almost every parish; there were about 4,232 of these religious units in Portugal in 1798 (Santos, 1995). The parish was the smallest organisational unit in the country at that time.
The survey was organised into three major parts: land, mountain, and river. It included 27 questions about the land, 13 about the mountains, and 20 about the rivers. The parish priests only responded to what suited their territory, and their texts were handwritten. Regarding the land section, the questions included historical, administrative, jurisdictional (ecclesiastical and secular), geographical, and economic topics. It also asked about the impact of the 1755 earthquake in each locality and requested people to describe the reconstruction’s status. Questions about the mountains were asked to obtain descriptions of orography, fountains, medicinal herbs, mines, lagoons, villages, and monasteries. Concerning the river section, the detail was also significant: size and intensity of flow, navigability, the direction of the stream, fish and fishery-related activities, bridges, mills, cultivation of the banks, and many other subjects were asked about. Each of the three parts ended with an invitation to describe what was specific and relevant about each place and was not analysed in the previous topics (Santos et al., 2020a). Some parish priests accepted this last invitation, and others had nothing special to point out.
Transcribed digital versions of these reports are available through the CIDEHUS website.3 These transcribed versions constitute a sub-corpus with texts from 366 parishes of today’s Alentejo region, Southern Portugal’s largest administrative area. The PM-A represents 90% of the total of Alentejo’s parishes in 1758.
In these originally handwritten texts, the orthography variations are remarkable, as there was no standardisation at that time, in the sense that there was no uniform spelling that everyone followed (Mateus & Cardeira, 2007). This classical Portuguese language period is, in fact, characterised by significant variation in orthography (Gonçalves, 2003). Nasal diphthongs have many spellings, and the use of double consonants and pseudo-etymological spelling were frequent. The sibilants register is also an excellent example of variation, among others (Kemmler, 2001).
In the PM-A corpus, these phenomena are pervasive and illustrate the orthographic variation: ‘tãobém/também’ [also], ‘administrassam/administração’ [administration], ‘officiais/oficiaes’ [officials], ‘parocho/paroco’ [priest], ‘oitosentos/oitocentos’ [eight hundred], and ‘freguesia/freguezia’ [parish]. The variation is sometimes difficult to predict, as we can find many spelling variants of a single word, as in ‘noticcia/noticia/notiçia/notticia’ [news]. In this period of the history of the Portuguese language, there was a remarkable renewal in the lexicon; many new words entered the written language (Verdelho, 1987), enlarging the possibilities of variation in words.
Due to the large number of texts (one for each of the 366 parishes), the original documents were transcribed by a group of people, including undergraduate students and research fellows, over a period of 12 years (2008–2020). The corpus also comprises transcriptions previously published by other authors outside our research group. The transcriptions were collected as MS Word documents. Although they do not have a significant impact on the historical research objectives, these different origins reflect distinct, and sometimes contrasting, transcription criteria. The resulting variation is a challenge for the automatic processing of classical Portuguese, as most existing tools operate on contemporary Portuguese (Cameron et al., 2020). In addition, uppercase and lowercase letters were used randomly in the original documents, with some transcribers keeping the randomness and others interpreting and updating it to current Portuguese. This constitutes a particular challenge for NER, since uppercase is usually a strong indicator of a named entity. Also, the syntax and punctuation marks do not precisely follow contemporary usage and conventions, which constitutes another challenge.
(2.1.1) Annotated Entities in the Parish Memories
Named entities recognised by most NER systems are generally of the types PER, LOC, and ORG. These entities are as essential for historians as they are for NLP in general. They structure a necessary part of the historian’s inquiry, answering basic questions: Who? Where? Which institution? In the past, everything happened in a specific place and was developed or executed almost always by someone, even if it is not known by whom, and organisations/institutions often framed the actions. Figure 2 shows examples of each category occurring in the PM-A.
(2.2) The NER Systems
There are many software options for the NER task (Schmitt et al., 2019). We evaluated two NER systems: the Portuguese NER model available in the spaCy framework and a Portuguese NER model based on Flair contextual embeddings, which has been assessed in several domains (Santos et al., 2019a). The former was developed with industrial needs in mind, and the latter is an academic product that has shown the best results in several evaluations. The intention was to choose the NER system with the best performance for these classical Portuguese texts. Both the spaCy and Flair frameworks provide their own tokenisation.
The spaCy framework is ready to use for several natural language tasks. It features Convolutional Neural Network (CNN) models for NER and POS tagging, among other tasks. Pre-built statistical neural network models to perform these tasks are available for several languages, including contemporary Portuguese. It contains three Portuguese models based on word embeddings. We used the largest model available at the time.4 The Portuguese multi-task CNN sequence tagging model was trained on the Portuguese WikiNER (Nothman et al., 2013) corpus for the NER task. Word vectors were trained using FastText CBOW (Bojanowski et al., 2017) on Wikipedia5 and OSCAR (Ortiz Suárez et al., 2020).
The second system used was BCF6 (BiLSTM-CRF+FlairBBP), which is composed of a neural network previously used for NER in English and German (Akbik et al., 2018) and a language model called FlairBBP. This model was developed based on a raw text corpus of 4.9 billion words from contemporary Portuguese texts (Santos et al., 2019a). BCF was previously evaluated in several domains, including geoscience (Consoli et al., 2020), law (Santos et al., 2019b), and health (Santos et al., 2020b), and has been the one with the best performance across domains. We trained the BCF system on the First HAREM corpus (Santos & Cardoso, 2007), a Portuguese corpus manually annotated to develop and evaluate Portuguese NER systems. BCF is publicly available.
(2.3) Manual Annotation
We performed manual annotation in a part of the PM-A corpus. We chose one of the large texts, aiming for a good number of examples for each class. Our choice of text was considered adequate, since the smallest number of examples in a class is 368 (as presented in Table 1). We followed the guidelines proposed by the HAREM project (Santos & Cardoso, 2007). We considered only three main categories, as previously explained. We made no distinctions between different sub-types of mentions. For instance, in HAREM, the PER category was also marked for sub-types regarding its kind as being an individual or an occupation, as in Carlos I or Sua Majestade (The King), respectively. For performing the manual annotation, two annotators, historians, used the INCEpTION open-source tool (Klie et al., 2018).7
Table 1 presents the number of manually annotated instances for each class. The agreement between the two annotators was measured with the Kappa statistic, commonly employed for measuring annotation quality (McHugh, 2012), using the implementation provided by Python’s sklearn (Pedregosa et al., 2011) library.9 The resulting Kappa was 0.71. Although higher agreement has been achieved for this type of annotation task, for this work we considered it satisfactory, since our main goal was not to deliver a gold standard but to provide the automatic annotation of the collection. Another reason for accepting this agreement level is that the range 0.61–0.80 is, in general, considered substantial agreement.
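To make the agreement measure concrete, the following is a minimal pure-Python sketch of Cohen's Kappa for two annotators. It is an illustration only, not the script used in this work (the text states that sklearn was used; `sklearn.metrics.cohen_kappa_score` computes the same statistic). The function name, the token-level unit of comparison, and the sample tags are assumptions made for the example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one single label; returning 1.0 is a
        # convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level tags from two annotators (not data from the corpus).
annotator_1 = ["PER", "O", "LOC", "O", "ORG", "O", "LOC", "O"]
annotator_2 = ["PER", "O", "LOC", "O", "LOC", "O", "LOC", "PER"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

On real data the two label lists would come from aligned exports of the two annotators' work; how the items were aligned in this study is not described in the text.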
(3) Evaluation, Results, and Discussion
This section discusses the performance of the automatic annotation, based on the usual metrics for evaluating NER systems: recall, precision, and F-measure. To obtain the metrics, we exported the manually annotated text in the IOB tagging scheme of CoNLL-2002 (Sang & Erik, 2002). We used the original CoNLL-2002 Perl script.10 We present the results below. Regarding the number of instances annotated by the two systems (Tables 2 and 3), we can see that the class ORG is considerably smaller than in the manual annotation. One of the systems (BCF) is more conservative than the other, resulting in a smaller number of instances in the three classes.
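As an illustration of how these metrics are obtained from IOB-tagged output, the sketch below computes exact-match precision, recall, and F1 over entity spans. It is a simplified stand-in written for this example, not the CoNLL-2002 Perl script; the function names and the sample tag sequences are assumptions.

```python
def iob_to_spans(tags):
    """Collect (start, end, category) spans from a sequence of IOB tags."""
    spans, start, cat = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and cat == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, cat))
            start, cat = None, None
        # A B- tag always opens a span; a stray I- tag is treated leniently
        # as an opening tag as well.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, cat = i, tag[2:]
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    """Exact-match scores: a span counts only if boundaries and class agree."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Hypothetical example: the system finds one LOC and mislabels an ORG as LOC.
gold = ["O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-PER"]
pred = ["O", "B-LOC", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
p, r, f = precision_recall_f1(gold, pred)
```

In this example the mislabelled span counts against both precision and recall, which is the same kind of ORG/LOC confusion discussed for the corpus.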
Tables 4 and 5 present the evaluation measures. Considering that the systems were trained on contemporary texts, were not adapted to 18th century language, and saw no new examples that could reflect the differences in entity annotation, the results point to a fairly good performance. In general, considering all classes, the BCF system performed better in the identification and classification of named entities. Being more conservative, it achieved better precision; the general F1 was 38% for spaCy and 45% for BCF. Note that these systems reached a general F1 of 60.74% (spaCy) and 80.58% (BCF) when evaluated on a test set of contemporary Portuguese, the Mini-Harem golden corpus,11 considering these same three categories.
These differences in the results point to the impact of the differences between current and 18th century language. Uppercase is less consistent in these texts and there is considerable orthographic variation, which affects both the form of the entities and the context that is also used to identify them. For instance, for hospital (ORG) we find ospital, ispital, and espital.
As expected, we confirmed the better performance of BCF, which might be due to the use of language models that provide high-quality representations based on context (Akbik et al., 2018; Santos et al., 2019a).
For the category LOC, BCF presents an F1 of 51%, and for PER, 54%, with precision higher than recall for PER and recall higher than precision for LOC. There is a remarkable fall in recall for the ORG category for both systems, meaning that they did not find many of the ORG cases.
In the example below (from Monforte – Nossa Senhora da Graça, p. 1195), the context points to a kind of relation (paying money) that indicates that the referent is an ORG and not LOC.
A igreja tem de fabrica 12 mil reis e o Reverendo Prior tem obrigação de dar para a <LOC> Real Capela de Villa Vicoza </LOC>, vinte e sinco mil reis.
The amount of money the church has for works is twelve thousand reis, and the Reverend Prior must give the <LOC> Royal Chapel of Villa Vicoza </LOC> twenty-five thousand reis.
Despite the difficulties, based on this analysis, we chose BCF to annotate the corpus. The corpus enriched with the automatic identification of three categories of named entities is described next.
(4) Resulting Dataset and Its Applications
This section presents a description of the enriched dataset provided with examples and gives some notes on entity extraction applications in history research.
(4.1) Description of the dataset
The dataset, Parish Memories with named entities,12 consists of 366 transcribed texts from the original handwritten collection, where each text contains the description of a parish. It amounts to approximately 650,000 word tokens and 35,000 word types. These texts are annotated with the entities PER, LOC, and ORG. The dataset also contains the list of the extracted entities for each text and a global list with all entities from the collection. All these lists include frequency counts of the terms in each category and the number of mentions in each text. The transcribed texts with no annotations are also included. We performed the annotation automatically, and the agreement of the system with the manual annotation was around 0.65, whereas the agreement between human annotators was 0.71. The manually annotated text used in the evaluation and the corresponding list of entities are provided in the dataset.
In total, the collection has 13,600 person mentions, 23,511 location mentions, and 1,321 organisation mentions. The total number of automatically identified mentions is 38,432. The annotation is given in two IOB formats: (i) CoNLL, which lists each word/token with both positive and negative tags, and (ii) the in-text tags in which positive tags follow the respective words. In both formats (B-CAT) identifies the first word of an entity belonging to category CAT and (I-CAT) identifies the continuation of an entity, as exemplified in Table 6. The dataset provides the manually annotated texts, used for evaluation purposes, in the CoNLL format. A better visualisation interface of the annotated texts can be obtained with the INCEpTION tool, as shown in Figure 3.
CoNLL:
edificada O
na O
Estremadura B-LOC
, O
e O
vizinhancas O
de O
Castella B-LOC
, O
em O
distancia O
de O
huma O
legoa O
, O
na O
Provincia B-LOC
do I-LOC
Alenteio I-LOC
In text tags:
… edificada na Estremadura <B-LOC> , e vizinhancas de Castella <B-LOC> , em distancia de huma legoa , na Provincia <B-LOC> do <I-LOC> Alenteio <I-LOC>
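The relation between the two formats can be sketched in a few lines of Python. This is an illustrative conversion only, under the assumption that each CoNLL line holds a token and its tag separated by whitespace; the function name is hypothetical and the exact layout of the distributed files should be checked before reuse.

```python
def conll_to_inline(lines):
    """Turn 'token TAG' lines into running text with tags after tagged tokens."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            # Empty lines are simply skipped in this sketch.
            continue
        token, tag = line.rsplit(None, 1)  # split on the last whitespace run
        # Tokens tagged O (outside any entity) are written without a tag.
        out.append(token if tag == "O" else f"{token} <{tag}>")
    return " ".join(out)


sample = [
    "legoa O",
    ", O",
    "na O",
    "Provincia B-LOC",
    "do I-LOC",
    "Alenteio I-LOC",
]
inline = conll_to_inline(sample)
```

Splitting on the last whitespace run keeps the sketch tolerant of tabs or multiple spaces between the two columns.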
Table 7 shows examples of automatically extracted entities for each category, while Table 8 illustrates some of the most frequent expressions referring to entities. For counting the global frequencies, we normalised words to lowercase and removed diacritics. However, there are other spelling variant issues to deal with for further statistical analysis (e.g., matris/matriz).
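The normalisation used for the global frequency counts (lowercasing and removing diacritics) can be sketched as follows. This is a minimal illustration with Python's standard library, not the script used to build the dataset; the function name and the sample mentions are assumptions.

```python
import unicodedata
from collections import Counter


def normalise(mention):
    """Lowercase a mention and strip its diacritics."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", mention.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical mentions, for illustration only.
mentions = ["Évora", "evora", "Sua Magestade", "sua magestade", "São Pedro"]
counts = Counter(normalise(m) for m in mentions)
```

Variants such as matris/matriz remain distinct after this step, which is why further handling of spelling variation is still needed.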
|PER||LOC||ORG|
|Deos||Aviz||Companhia de Jesus|
|Divus Augustus Cesar||Elvas||Conselho|
|el-Rey D. Deniz||Estremadura||Fontalia|
|Francisco de Freytas||Lisboa||Igreja Parochial|
|Rey de Portugal||Occidente||Ordem da Comarca|
|Salvator Mundi||Poente||Senado da Camara|
|Sancta Maria||Porta da Trayção||Torre dos Coelheyros|
|Senhor Rey D. Deniz||Reyno||Villa|
|PER||Freq.||LOC||Freq.||ORG||Freq.|
|sua magestade||128||evora||552||igreja matris||41|
|vossa excelencia||87||norte||409||igreja parochial||29|
|sao pedro||58||elvas||214||villa de monforte||19|
In terms of using this type of information, these frequencies confirm the impact of the two most prominent figures in Ancien Régime society: God and the King. Regarding LOC entities, the dominance of “villa” is very marked, and the main towns of the South are also present, as well as Lisbon, since the distance from each village to the Court is mentioned in every text. Among the organisations, “villa” is also dominant. Disambiguating between these two uses of the term is indeed a challenge for the annotation.
Available in an open-access format, this corpus provides both the list of extracted entities and the entities annotated in-text, along with their original textual context. Keeping context is essential for historians to decode the precise meaning of a word and to assess the relevance of historical data, and this approach allows us to preserve it.
Entities, once identified, can be linked to other information, and texts can be reused by other researchers and projects. When we tag historical sources, enabling better document retrieval and information extraction, we improve the capacity of analysis. It is possible to find correlations between entities faster than manually or even to find new ones that would not be found otherwise. In this way, when we extract named entities from this kind of corpora, we can cross this information with other datasets from the same period, working with a larger number of sources. For instance, the PER category can be linked to prosopographical data, making it easier to identify biographical aspects, like birth or burial places, among others. This type of approach allows us to extract data to compare and carry out different analyses.
Regarding the connection with geographic information systems, although we are not aware of the exact borders of all the freguesias in the corpus, we know the exact location of the most important villages and the locations of some points of interest referred to in the texts (e.g., archaeological ruins). By mapping this data with artificial grids (lattices) (Birch et al., 2007) for representing the regions geographically, we can compare the parishes with each other to verify the institutions’ homogeneity in the various geographical areas. We can also find the most valued institutions of that time through the frequency of citations in certain regions, clarifying the profile: ecclesiastic, secular, or mixed. Inspired by central-place theory (Christaller & Baskin, 1966; Romão, 2019), which measures the area of influence of a geographical point, we can also verify the area of influence of a parish, organisation, or person by measuring the distance over which it is mentioned in other Parish Memories. These are only a few of the many possibilities of research that can be developed through this corpus.
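One simple way to operationalise the distance over which an entity is mentioned is sketched below. It is only an assumption about how such a measure could be computed, not a method used in this work: it takes the great-circle (haversine) distance between an entity's home location and each place whose text mentions it, and reports the largest one. The function names and the coordinates in the example are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def influence_radius_km(origin, mentioning_places):
    """Largest distance from origin to any place whose text mentions it."""
    return max(
        (haversine_km(*origin, *place) for place in mentioning_places),
        default=0.0,
    )


# Hypothetical (latitude, longitude) pairs, for illustration only.
origin = (38.0, -8.0)
mentioned_in = [(38.5, -8.0), (38.0, -7.5), (39.0, -8.0)]
radius = influence_radius_km(origin, mentioned_in)
```

Other summaries (mean distance, or a count of mentioning parishes per grid cell) would fit the same data and may suit the historical question better.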
We acknowledge that the data, being entirely automatically produced, is far from perfect, especially for the class ORG, which is rather incomplete. The idea is to provide some input, even if there is missing information; the findings brought about by the extraction may serve as partial evidence when seen through the lens of the historians. For instance, even for the case of ORG, which is low in recall, by looking at the data we can observe the prevalence of churches, councils, counties, brotherhoods, hospitals, and parishes. The material made available can be further improved by other research groups interested either in the material itself or in the technical challenges of improving it. This first release serves as silver data to train other NER systems that may be more apt for texts from the classical period.
(5) Concluding Remarks and Future Work
This paper presents a research resource for Digital History, a transcribed subset of originally handwritten texts from the 18th century, describing the geography and economy of Alentejo in Portugal, which is now enriched with the annotation of named entities in three categories: PER, LOC, and ORG. The annotation was made automatically for the whole corpus, whereas a part of it was manually annotated for evaluation purposes. The dataset provides both automatic and manual annotations.
There are similar studies made for other languages, including Hubková et al. (2020) who present a study for named entities in a Czech historical corpus. In our case, it is a Portuguese historical corpus, which can be useful not only for historians, but also for architects, demographers, territory administrators, and planners.
As future work, we plan to develop further manual annotations that consider new categories, such as time, occupations, and social categories of people, which are crucial distinctions for historical research purposes. The last of these is intended to adjust the labels to the society of the time and its legal system, strongly marked by the inequality of each person or group before the law. Given the difficulty of distinguishing ORG and LOC, we will consider the GPE class (geopolitical entities for municipalities, countries, and others), first introduced in the ACE13 project, mainly meant to overcome (or ignore) the metonymy problem between ORG and LOC.
We aim to develop new and specific guidelines based on the lessons learned from this first manual annotation. We will improve the manual annotation process with discussions between annotators and will add a curation phase, as there was a noticeable difference in the number of entities annotated by the human annotators. To improve the annotation quality, it is essential to create new guidelines suited to the classical period of the Portuguese language. With such improved annotation, we can tune the system with the newly annotated categories. A new LM that includes 18th century texts, with a detailed description of spelling variants, is also relevant to building a more robust system. We believe that a future, more robust NER system can be equally helpful for other 18th century handwritten sources and perhaps for earlier periods.
There are problems with variations in spelling to be tackled in order to achieve better statistics and indexing. One solution is to lemmatise variants or to annotate variation, indexing all variants to the corresponding lemma. Examples of variants are: igreja matris/igreja matriz, parochia/parrochia. This pre-processing task is very challenging (Baron & Rayson, 2008; Dereza, 2018). There are some tools for classical languages, like CLTK: the classical language toolkit,14 but concerning the classical period of the Portuguese language, existing tools15 still need to be trained for this period.
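Indexing variants to a lemma can be illustrated with a small lookup-based sketch. The mapping below is hand-made for the example and is an assumption, as is the choice of which form serves as the lemma; a real solution would need a much larger, curated or learned resource.

```python
from collections import defaultdict

# Hypothetical variant-to-lemma table built from variants quoted in the text.
VARIANT_TO_LEMMA = {
    "matris": "matriz",
    "noticcia": "noticia",
    "notticia": "noticia",
    "parrochia": "parochia",
}


def index_by_lemma(tokens, variant_to_lemma):
    """Group token counts under their lemma, keeping each attested form."""
    index = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        form = token.lower()
        # Forms missing from the table are treated as their own lemma.
        lemma = variant_to_lemma.get(form, form)
        index[lemma][form] += 1
    return {lemma: dict(forms) for lemma, forms in index.items()}


tokens = ["noticcia", "noticia", "notticia", "matris", "matriz", "Matris"]
lemma_index = index_by_lemma(tokens, VARIANT_TO_LEMMA)
```

Keeping the attested forms under each lemma preserves the original spellings for historians while allowing frequencies to be aggregated.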
From here we can link the Parish Memories to other sources, such as the book Corografia Portuguesa (Costa, 1706–1712), which contains data for the same region about the foundations of the cities and convents, bishops’ catalogues, comments on illustrious men and genealogies of noble families, topographic and nature descriptions, and other observations.
Another possibility for future study is to identify other entities and concepts particular to specific domain studies, such as more fine-grained location categorisation with a focus on mountains and rivers, or mapping the descriptive data of the destruction caused by the Lisbon earthquake in 1755, for an Earthquake Damage Assessment. Studies from the corpus can serve to answer questions such as: Do the descriptions in the corpus reflect current scientific data? Can they be corroborated with archaeological data?
The INCEpTION annotation tool (Klie et al., 2018) includes semantic annotation (e.g., concept linking, fact linking, knowledge base population, semantic frame annotation) which are relevant features for historians. In this stage of the study, we only used it for annotating the three categories mentioned above; other features of the tool are relevant for the future development just described.
Supplementary File 1: Parish Memories with Named Entities Dataset. ParishMemorieswithNEs_V2.zip. DOI: https://doi.org/10.5281/zenodo.4946479
The dataset is organised as follows:
Parish Memories with Named Entities (ParishMemorieswithNEs) contains digitized, transcribed texts from 1758 Portuguese parochial surveys. This collection refers to the surveys from the Alentejo region. Here they are provided with annotation of named entities (person, organization, location). The annotation was done automatically; it is therefore incomplete and not totally precise, but it is potentially useful and can be further improved by other research groups interested either in the material itself or in the technical challenges of improving it. All the issues are discussed in more detail in the paper.
  - GlobalTotal: all classified elements with frequencies for each category
  - TextTotals: number of classified elements and number of occurrences for each text
- Texts+Entities(AutomaticAnnotation): 366 texts automatically annotated with NEs
  - PM-1: for each parish memory PM
    - PM1.txt (original texts)
    - ptTagged-PM1 (NE tagged texts)
    - CoNLL-PM1 (NE CoNLL format)
    - Named_Enties-PM1 (list of named entities)
  - PM-2: same as above for PM2
  - PM-366: same as above for PM366
  obs: empty lines in CoNLL files correspond to original document breaks
- Texts+Entities(ManualAnnotation): manually annotated texts with NEs
  - ManualAnnot_CoNLL-PM.txt
  - ManualAnnot_Named_Entities-PM.txt
  - PM_Manual_SourceText.txt