(1) Context and motivation

Ever since the concept of a motif was introduced some 200 years ago, the quest to identify elements of content above the word level has been a standard preoccupation in literary science. There, a motif stands for a recurrent theme, whereas in musicology a motif is considered “the smallest structural unit possessing thematic identity”. In the field of folktale research, Stith Thompson defined motifs as “the smallest element in a tale having a power to persist in tradition”.

The overlap between these definitions suggests that such higher-order content units exist as narrative building blocks, yet their automatic extraction by computational means has eluded folk narrative studies so far. As we will argue below, folk narrative studies are not yet up to the task of a scalable pattern hunt. One reason for our scepticism is that Thompson’s Motif Index of Folk Literature alone lists over 45,000 motifs on a global scale, and many more regional motif indexes exist whose material would doubtless inflate that number. If we want to apply machine learning to motif identification or discovery, we first need suitable datasets which enable research teams to replicate each other’s results. Below we report work in progress in this direction, while guarding against any hubris in our promises regarding motif detection; the first analytical results will be reported elsewhere. Ideally, we would like to see a motif annotation system emerge, similar to Prodigy, that expert folklorists could use in a crowd-sourced fashion.

The structure of this paper is as follows. In Section 1, we offer our motivation in context, bring examples of related research with converging trends, publicly available databases and datasets, and introduce the Ashliman Folktexts collection. Section 2 focuses on methodology, progressing from our motivation to support reproducible research in computational folkloristics toward dataset creation and repository access, including steps of data harvesting and cleaning, concluding with current limitations of use. Section 3 discusses features of the result, the Annotated Folktales (aft) corpus, with descriptive statistics. In Section 4, we briefly outline directions of future collection development to support folktale research.

As our pilot was not concerned with the structural analysis of folk narratives, this overview omits significant research results, such as those concerning the automatic detection of Proppian functions, or their use in ontology building. Instead, our focus will be on precursory efforts to support motif detection using two standard tools, the Thompson Motif Index (TMI) and the Aarne-Thompson-Uther tale typology (ATU). Important extensions to these, and to our current work, are due to Declerck and colleagues. As motifs and motifemes abound in myths as well, we admit the latter into our scope under the reasoning that “myth is a traditional tale with secondary, partial reference to something of collective importance”, and we consider the debate about the difference between myths and folktales to be open.

We consider finding characteristic patterns of semantic content by automatic means an open research problem. The relevant research question is this: if we were going to extract features from the descriptive text of the TMI, what kind of features could we build, and could these features also be identified in tale corpora?

The convergence of two major trends in computational folkloristics will likely shape the results of the next decade. The first is a focus on the evolutionary aspect of motif and/or tale type distributions, either with regard to certain tale types, or to the geographical distribution of globally occurring narrative motifs, even inferring the presence of lost narratives. A genetic metaphor seems to inform some approaches, perhaps inspired by the modelling capacities inherent in Dawkins’ meme theory; these compare tale types as motif sequences to ‘narrative DNA’, or look at the evolution of narrative/story networks as a quasi-biological process based on the mutation and recombination of narrative elements, extended even to the framework of cultural evolution via population genetics. Such methods resemble bioinformatic applications such as network motif identification, a problem analogous to ours. The context is that of evolving semantics, an emerging research area both in lexical semantic change and digital preservation.

The second trend is to use probabilistic and/or multivariate statistical methods for the analysis of binary versus non-binary matrices of events over cases. Here events can be index terms, motifs, motif sequences, and so on, while cases is an umbrella term for documents in general, such as abstracts describing narratives, or tale types, which ultimately constitute text corpora or databases. On such collections, one can then experiment, for instance, with sub-corpus topic modelling (STM) by Latent Dirichlet Allocation (LDA) as a means of supervised passage exploration in partly unknown corpora.

What little can be said about the plethora of methods listed is that, regardless of the corpora, their regionality, and the analytical units whose distributions characterise the body of texts in question, they express similarity between items in terms of distance, with more similar items forming dense groups as the outcome of mass comparison. Cluster analysis, Principal Component Analysis (PCA), Labelled Latent Dirichlet Allocation (L-LDA), Support Vector Machines (SVM), and deep learning by Recurrent Neural Networks (RNN), however, all produce static snapshots of collections. Hence there is a contradiction in principle in addressing text evolution, a dynamic phenomenon, through tools tailored to static measurements: the notion asks for vector fields instead of vector spaces. The most promising recent direction seems to be the combination of word embeddings – increasingly condensed and geometrically located types of word meaning – with deep learning: Pompeu reports successful application of a Hierarchical Attention Network (HAN) for the prediction of ATU categories on a multilingual database of folk texts.
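To make the notion of a binary matrix of events over cases concrete, the following minimal sketch builds such a matrix from toy data and expresses similarity between cases as a Jaccard distance. It is an illustration only: the tale and motif labels are hypothetical placeholders rather than TMI entries, and Jaccard distance is one possible choice of measure, not the one used in any particular study cited here.

```python
from itertools import combinations

# Hypothetical toy data: each case (tale) maps to the set of events (motifs) it contains.
cases = {
    "tale_1": {"motif_a", "motif_b", "motif_c"},
    "tale_2": {"motif_a", "motif_b"},
    "tale_3": {"motif_d"},
}

# Fixed column order for the binary events-over-cases matrix.
events = sorted(set().union(*cases.values()))

# One row per case, one 0/1 entry per event.
matrix = {
    name: [int(event in present) for event in events]
    for name, present in cases.items()
}


def jaccard_distance(row_a, row_b):
    """Return 1 - |intersection| / |union| for two binary rows."""
    both = sum(a & b for a, b in zip(row_a, row_b))
    either = sum(a | b for a, b in zip(row_a, row_b))
    return 1.0 - both / either if either else 0.0


# Pairwise distances between all cases; smaller values mean more shared events.
distances = {
    (a, b): jaccard_distance(matrix[a], matrix[b])
    for a, b in combinations(sorted(matrix), 2)
}
```

Cases that share most of their events end up close together, which is the sense in which the methods above form dense groups of similar items.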

As computing results for both trends discussed above requires datasets, the next section briefly addresses their availability.

(1.1.2) Databases and datasets

Progress in computational folkloristics requires that results be replicable. To this end we sought open access datasets of ATU-annotated tales in English, but could not identify suitable candidates on GitHub, Kaggle or Google, although websites with separate tale collections are available. Neither could we find the big folklore data anticipated by Tangherlini & Leonard and Tangherlini. Based on Meder and Ilyefalvi, the largest databases seem to be the Dutch Folktale Database of the Meertens Institute, and the Danish Folklore Archive’s Tang Kristensen Collection, the former on the order of 50,000 texts, the latter around 34,000 texts. Other important databases exist, but are either beyond public access, or in their original languages only, or both. The notable exception is the Meertens Institute, whose texts are in Dutch and Frisian plus a number of local dialects, but can be read in English translation as well.

Other researchers who have shared their data as supporting material for their articles include, for instance, Bortolini et al., da Silva & Tehrani, Tehrani, and Tehrani, Nguyen, & Roos. Declerck et al. also report that a large amount of ATU data has recently been made available online by the Multilingual Folk Tale Database (MFTD), which also offers annotation facilities for tales in multilingual versions. We found only a single recent study which published a corresponding tale corpus to promote reproducibility, albeit without ATU type labels.

Among the ATU-annotated tale collections publicly available on the internet, the most promising candidate was Ashliman’s Folktexts collection. The process of the conversion of this collection to the desired format will be described below.

(1.1.3) The Ashliman Folktexts collection

The Folktexts site has been populated and maintained since 1996 by D.L. Ashliman, who kindly agreed to donate his collection to the interested research communities. While other sites may sport a more lavish design, this one is the largest and most extensively annotated. It serves as a respected scholarly resource for folklorists, with a large and curated set of tale texts. Whereas our dataset contains only tales from pages with clear ATU annotations (214 pages), the total content of the website is much larger (370 pages), containing various creation myths, stories of changelings, Faust legends, and Christiansen’s tale types. However, it is the ATU annotation that makes this corpus particularly valuable as a potential training dataset for classification methods.

Despite the richness of this resource, it has rarely been used in folklore research as a larger corpus. Some previous studies reference it, yet these often include only a small portion of the entire set of texts. To the best of our knowledge, none of the published studies provided open access to the data.

(2) Method

(2.1) Support for reproducibility in folklore studies

Reproducibility is a defining characteristic of science, yet a wide gamut of scientific fields has been plagued by a ‘replicability crisis’: a situation where trusted research findings have been impossible to reproduce. While the problem has come to the fore in the health and social sciences, it has been acknowledged in disciplines as broad as archaeology, public health, biology, and economics.

Reproducible research entails that study results be accompanied by:

  1. a detailed description of the methods used to obtain and operate on the data;
  2. the full dataset(s) used in the study;
  3. the full code used to transform the data and compute the results.

(2.1.1) Guiding principles

The following features guided our selection of tools and format for the code and data:

  • Open data: In order to use tale data consistently, it must be made freely and openly available to anyone. The dataset is therefore distributed under a Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
  • Extensible data: The dataset can be added to or modified, in order to develop a more complete repository of tales. This can be done by submitting pull requests to the project’s GitHub repository.
  • Open code: Any user is allowed to view and run the code that produces the dataset, as well as downstream analyses which use the dataset. This allows for inspection, refinement, and reasoning about the effects of transformation and statistical modelling on the data.
  • Common form: We have chosen to use the “tidy” dataframe as the structure of the dataset, in which (a) each variable forms a column, (b) each observation forms a row, and (c) a single type of observational unit forms the dataframe.
  • Common tools: The data must also be structured in a way that allows for ease of use with the standard tools of the trade of data science, such as R or Python.
  • Modifiable form: The structure must allow for reshaping the data into sparse matrices, nested structures, and graph-based structures as dictated by the needs of a given text analysis, while starting from a common source dataset (that is, the aft).
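The "common form" and "modifiable form" principles can be illustrated with a short sketch that starts from tidy rows and reshapes them into a nested structure, a sparse term-count representation, and a graph-style edge list. This is a minimal illustration using only the Python standard library: the rows and function names are hypothetical, the column names follow the data dictionary in Section 3, and whitespace splitting stands in for whatever tokenisation a given analysis would use.

```python
from collections import Counter, defaultdict

# Hypothetical tidy rows: one observation (tale) per row, one variable per key.
tidy_rows = [
    {"atu_id": "510A", "tale_title": "Tale one", "text": "the girl lost the shoe"},
    {"atu_id": "510A", "tale_title": "Tale two", "text": "the prince found the shoe"},
    {"atu_id": "275", "tale_title": "Tale three", "text": "the hare lost the race"},
]


def to_nested(rows):
    """Group tale rows under their ATU type (nested structure)."""
    nested = defaultdict(list)
    for row in rows:
        nested[row["atu_id"]].append(
            {"tale_title": row["tale_title"], "text": row["text"]}
        )
    return dict(nested)


def to_sparse_counts(rows):
    """Build sparse tale-by-term counts: only non-zero entries are stored."""
    return {row["tale_title"]: Counter(row["text"].split()) for row in rows}


def to_edges(rows):
    """Build an edge list linking each tale type to its tales (graph structure)."""
    return [(row["atu_id"], row["tale_title"]) for row in rows]
```

All three reshaped views are derived from the same tidy source, so the source itself never needs to be stored in a tool-specific format.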

(2.1.2) Accessing and growing the corpus

Snapshot versions of the aft corpus will be cached on Zenodo, with development and collaboration ongoing in the trilogy GitHub repository, where a vignette provides information on how to access, use, and augment the dataset. While long-term curation of the result will require academic resources, a logical next step would be to create temporary merger options with other multilingual tale collections, such as the MFTD, or the ones analyzed by Tehrani or Karsdorp and Fonteyn, for analytical studies using, for instance, multilingual word embeddings.

The open-source Git functionality allows motifs, tale types and annotated tales to be added over time, and the corpus to serve as a communal resource. We welcome inquiries and suggestions about how best to manage this resource as a “commons”.

(2.2) Data harvesting and cleaning

(2.2.1) Steps

Web-scraping of the Folktexts site was completed using the R statistical programming language. The following high-level summary is provided to allow for an understanding of the methods used and their limitations:

  1. Obtain URLs and associated label text for all ‘child’ pages of the main website to create a dataframe of page names and URLs, removing links to external websites.
  2. Retain all URLs with the pattern “type…”, which denote pages containing tales which belong to an ATU type, and recode links which do not follow this form, such as the page for Animal Brides and Animal Bridegrooms which was recoded as belonging to ATU type 402.
  3. Extract the ATU type ID from the URL for each page, resulting in a dataframe listing 214 webpages, each associated with a tale type and containing the page name, page URL, and associated ATU ID.
  4. Loop through each webpage identified in the dataframe above and extract the text, using the following steps: (a) extract HTML nodes from the page, creating a dataframe using the text, name and attribute elements of the nodes; (b) remove superfluous text other than tale texts, titles, and other associated metadata (e.g. source documents, notes); (c) use a fuzzy-joining method to align missing body text with the well-formatted HTML.
  5. Take the resulting dataframe and apply the following steps: (a) select the longest text, choosing between the tagged HTML version and the version extracted from the body; (b) select available metadata; (c) remove irrelevant entries using regular expressions; (d) create unique tale titles where these were duplicated across multiple variants of tales; (e) clean tale text data (e.g. removing remnant HTML tags, replacing internal double quotes with single quotes).
  6. Add manually extracted tales into a consistent format for web pages which generated errors during web scraping (ATU IDs 1696, 2, 545B, 57, 675, 75, 779J*, 676). Other than this final step, all steps were fully automatic.
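Parts of steps 2, 3, 5(d) and 5(e) above can be sketched as follows. The original pipeline was written in R; this Python version is only an illustrative analogue, not the authors' code. The file-name pattern ("type", digits, an optional letter, ".html"), the example.org URLs in the usage note, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

# Assumed page-name pattern: "type" + digits + optional letter suffix + ".html".
TYPE_PAGE = re.compile(r"/type(\d+)([a-z]?)\.html$", re.IGNORECASE)


def extract_atu_id(url):
    """Return an ATU ID such as '510A' for a type page, or None for other pages."""
    match = TYPE_PAGE.search(url)
    if match is None:
        return None
    number, letter = match.groups()
    # Drop zero padding and normalise the letter suffix to upper case.
    return str(int(number)) + letter.upper()


def clean_text(raw):
    """Remove remnant HTML tags and replace internal double quotes with single quotes."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    single_quoted = no_tags.replace('"', "'")
    # Collapse the whitespace left behind by removed tags.
    return re.sub(r"\s+", " ", single_quoted).strip()


def make_unique(titles):
    """Append a running number to titles that occur more than once."""
    totals = Counter(titles)
    seen = Counter()
    unique = []
    for title in titles:
        if totals[title] > 1:
            seen[title] += 1
            unique.append(f"{title} ({seen[title]})")
        else:
            unique.append(title)
    return unique
```

For example, `extract_atu_id("https://example.org/type0510a.html")` yields an ID for a type page, while a URL that does not follow the assumed pattern returns `None` and would be dropped or recoded by hand, as in step 2. The fuzzy-joining of step 4(c) is not covered by this sketch.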

(2.2.2) Limitations

Web-scraping is an inherently messy exercise, as the data contained in web pages are often not formatted with the intent of being analysed. While the output has been reviewed at a cursory level, we anticipate that greater use of the dataset will result in the need for additional cleaning and processing.

The provenance field does not meet the definition of ‘tidy’ outlined above, since multiple types of descriptors (e.g. country, region, tale collection) are stored in a single column. While additional cleaning may be able to distinguish some of these, we have chosen to leave it as entered in the original to avoid losing potentially valuable detail.

The final limitation is purposefully adopted for the sake of downstream analyses. We have included only tales which were annotated with a single tale type, despite the existence of some tales which can be characterized by multiple types. This decision was made in order to avoid repeating texts or using data structures which are tool specific.

(3) Results and discussion

(3.1) Features of the Annotated Folktales (aft) dataset

(3.1.1) Data dictionary

The aft (henceforth standing for Annotated Folktales to allow for the future inclusion of other resources) dataframe contains 1518 rows, each corresponding to a single tale. Its eight columns are described briefly below:

  • atu_id: The ATU tale type identifier which classifies the tale.
  • tale_title: The title of the tale.
  • provenance: The person, place or tradition from which the tale came. In Ashliman’s collection, this refers variously to the person recording the tales (e.g. Giambattista Basile), the country or region from which the version of the tale came (e.g. North Africa), or the larger collection of tales in which the tale is found (e.g. the Kathasaritsagara).
  • notes: Additional notes related to the tale.
  • source: The bibliographic citation for the original published source of the tale.
  • text: The full text of the tale identified in tale_title.
  • data_source: The source of the annotated tales. At the time of this writing, the source of all tales is Ashliman’s Folktexts, but this is intended to change as the dataset grows.
  • date_obtained: The date on which the dataset identified as a data_source was last downloaded and compiled.
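A consumer of the dataset might check that a loaded copy exposes the eight columns listed above before analysing it. The sketch below assumes a CSV export and uses a one-row in-memory sample with placeholder values; the file format, the sample row, and the function name are assumptions for illustration, not a description of how the repository actually stores or serves the data.

```python
import csv
import io

# Column names as given in the data dictionary.
EXPECTED_COLUMNS = [
    "atu_id", "tale_title", "provenance", "notes",
    "source", "text", "data_source", "date_obtained",
]

# Hypothetical one-row sample standing in for an exported copy of the corpus.
SAMPLE_CSV = (
    "atu_id,tale_title,provenance,notes,source,text,data_source,date_obtained\n"
    '910B,Example title,Example place,,Example citation,'
    '"Once, there was a tale.",Example source,2000-01-01\n'
)


def load_tales(handle):
    """Read tale rows from a CSV handle, failing early if a column is missing."""
    reader = csv.DictReader(handle)
    present = reader.fieldnames or []
    missing = [column for column in EXPECTED_COLUMNS if column not in present]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


tales = load_tales(io.StringIO(SAMPLE_CSV))
```

Failing early on a missing column keeps downstream analyses from silently running on a differently structured copy as the dataset grows.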

Table 1 below shows the initial characters of fields from the first six rows of the dataset, in order to illustrate its appearance:

Table 1

Example output of the dataset.


| atu_id | tale_title | provenance | source | text |
|---|---|---|---|---|
| 910B | The Highlander Takes… | Scotland | Cuthbert Bede | In one of the glens of… |
| 910B | The Prince Who Acquired | India | Cecil Henry Bompas | There was once a raja … |
| 910B | The Three Admonitions | Italy | Thomas Frederick Crane | A man once left his co… |
| 910B | The Three Advices | Ireland | T. Crofton Croker | The stories current am… |
| 910B | The Three Advices Which… | Ireland | Patrick Kennedy | The name of the young … |
| 1430 | Buttermilk Jack | | Thomas Hughes | Oh mother, my buttermilk |

(3.1.2) Descriptive statistics

The 1518 tales in the dataset average 979.1 tokens in length, though the individual texts vary with a minimum of 10 tokens and a maximum of 12,406 (Table 2).

Table 2

Summary statistics of the AFT dataset.


| MEASURE | VALUE |
|---|---|
| Number of tales | 1518 |
| Number of tale types | 182 |
| Mean tokens per tale | 979.1 |
| Median tokens per tale | 642 |
| Minimum tokens per tale | 10 |
| Maximum tokens per tale | 12,406 |
| Mean sentences per tale | 45.7 |
| Median sentences per tale | 31 |
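Length statistics of the kind reported in Table 2 could be recomputed along the following lines. The tokeniser behind the published figures is not specified here, so the regular-expression word count below is an assumption and may not reproduce the reported values exactly. The toy texts and function names are likewise hypothetical.

```python
import re
import statistics


def count_tokens(text):
    """Count word-like tokens (assumed tokenisation: runs of word characters)."""
    return len(re.findall(r"\w+", text))


def length_summary(texts):
    """Summarise token counts over a collection of tale texts."""
    counts = [count_tokens(text) for text in texts]
    return {
        "n_tales": len(counts),
        "mean_tokens": round(statistics.mean(counts), 1),
        "median_tokens": statistics.median(counts),
        "min_tokens": min(counts),
        "max_tokens": max(counts),
    }


# Hypothetical toy texts standing in for the `text` column.
toy_texts = [
    "Once there was a king.",
    "The end.",
    "A fox met a crow by the river.",
]
summary = length_summary(toy_texts)
```

Reporting the median alongside the mean matters for this corpus, since the mean (979.1) lies well above the median (642), which indicates a long tail of lengthy tales.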

The histogram below (see Figure 1) shows the distribution of tale lengths for all tales in the corpus:

Figure 1 

Distribution of tale lengths.

The tales compiled in the aft data are annotated by ATU tale type, and represent 182 distinct types. There are on average 8.3 tales in each tale type, with a range of one to 31. The tale types with the largest representative group of tales in the corpus are shown in Table 3 below:

Table 3

Ten tale types with the largest number of representative tales.


| ATU ID | TALE NAME | N OF TALES |
|---|---|---|
| 275 | The Race between Two Animals (previously The Race of the Fox and the Crayfish) | 31 |
| 777 | The Wandering Jew | 30 |
| 1645 | The Treasure at Home | 26 |
| 510B | Peau d’Asne (previously The Dress of Gold, of Silver, and of Stars [Cap o Rushes]) | 26 |
| 500 | The Name of the Supernatural Helper | 23 |
| 510A | Cinderella | 21 |
| 700 | Thumbling (previously Tom Thumb) | 21 |
| 155 | The Ungrateful Snake Returned to Captivity | 20 |
| 545B | Puss in Boots | 20 |
| 980 | The Ungrateful Son (previously Ungrateful Son Reproved by Naive Actions of Own Son) | 20 |
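The per-type counts behind Table 3 amount to a frequency count over the atu_id column. A minimal sketch, assuming the column is available as a plain list of strings (the toy IDs and the function name are hypothetical), is given below. As a consistency check on the reported figures, 1518 tales over 182 types gives 1518 / 182 ≈ 8.3 tales per type.

```python
from collections import Counter


def tales_per_type(atu_ids, top_n=10):
    """Return the most frequent tale types and the mean number of tales per type."""
    counts = Counter(atu_ids)
    mean_per_type = sum(counts.values()) / len(counts)
    return counts.most_common(top_n), mean_per_type


# Hypothetical toy column standing in for `atu_id`.
toy_ids = ["275", "275", "275", "777", "777", "510A"]
top, mean_per_type = tales_per_type(toy_ids, top_n=2)
```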

(4) Implications/Applications

Under a Creative Commons license, we have published on Zenodo and GitHub an open-access, ATU-annotated dataset of 1518 tales for motif detection by machine learning. This dataset resulted from the conversion of the Ashliman Folktexts collection, and we hope it will become the core of an expanding corpus to support reproducible research in computational folkloristics. As a next step we plan to integrate information from the TMI and the ATU, to be applied in trawling for motifs by deep learning.