There are few disciplines in the humanities that show the impact of quantitative, computer-based methods as strongly as historical linguistics. While individual scholarship and intuition long played a major role, with only minimal attempts to formalize or automate the painstaking methodology, the last twenty years have seen a rapid increase in quantitative applications. Quantitative approaches are reflected in the proposal of new algorithms that automate what was formerly done by inspection alone, in the publication of large cross-linguistic databases that allow for a data-driven investigation of linguistic diversity, and in numerous publications in which the new methods are used to tackle concrete questions on the history of the world’s languages (for recent examples, see [4, 5]).
While it is true that, due to increasing amounts of data, the classical methods are reaching their practical limits, it is also true that computer applications are still far from being able to replace experts’ experience and intuition, especially in those cases where data are sparse (as they still are for many language families). If computers cannot replace experts and experts do not have enough time to analyze the massive amounts of data, a new framework is needed, neither completely computer-driven nor ignorant of the assistance computers provide. Translation offers a useful parallel. Current machine translation systems are efficient and consistent, but they are by no means accurate, and no one would use them in place of a trained expert. Trained experts, on the other hand, do not necessarily work consistently and efficiently. In order to enhance both the quality of machine translation and the efficiency and consistency of human translation, a new paradigm of computer-assisted translation has emerged.
Following the idea of computer-assisted frameworks in translation and biology, scholars have begun to propose frameworks for computer-assisted language comparison (CALC), in which the flexibility and intuition of human experts are combined with the efficiency and consistency of computational approaches. In this study, we want to introduce what we consider the state of the art in this endeavor, and describe a workflow that starts from raw, cross-linguistic data. These raw data are then consistently lifted to the level of an etymologically annotated dataset, using advanced algorithms for historical language comparison along with interactive tools for data annotation and curation.
2 A workflow for computer-assisted language comparison
Our workflow consists of five stages, as shown in Figure 1. It starts from raw data (tabular data from fieldwork notes or data published in books and articles), which we re-organize and re-format in such a way that the data can be automatically processed (Step 1). Once we have lifted the data to this stage, we can infer sets of etymologically related words (cognate sets) (Step 2). At this stage, we only infer cognates inside the same meaning slot, which means that all cognate words have the same meaning in their respective languages. Once this has been done, we align all cognate words phonetically (Step 3). Since Step 2 only identifies cognate words that have the same meaning, we then use a new method to infer cognates across meanings by employing the information in the aligned cognate sets (Step 4). Finally, in Step 5, we employ a recently proposed method for the detection of correspondence patterns in order to infer sound correspondences across the languages in our sample.
Our workflow is strictly computer-assisted, and by no means solely computer-based. That means that during each stage of the workflow, the data can be manually checked and modified by experts and then used in this modified form in the next stage of the workflow. Our goal is not to replace human experts, but to increase the efficiency of human analysis by providing assistance especially in those tasks which are time consuming, while at the same time making sure that any manual input is checked for internal consistency.
Our study is accompanied by a short tutorial along with code and data needed to replicate the studies illustrated in the following. The workflow runs on all major operating systems. In addition, we have prepared a Code Ocean Capsule to allow users to test the workflow without installing the software.
3 Illustration of the workflow
The data we use was originally collected by Chén (2012), later added in digital form to the SEALANG project, and then converted to a computer-readable format as part of the CLICS database (https://clics.clld.org). Chén’s collection comprises 885 concepts translated into 25 Hmong-Mien varieties. Hmong-Mien languages are spoken in China, Thailand, Laos and Vietnam in Southeast Asia. Scholars divide the family into two main branches, Hmong and Mien. The Hmong-Mien languages have been developing in close contact with neighboring languages from different language families (Sino-Tibetan, Tai-Kadai, Austroasiatic, and Austronesian [11, p. 224]). Chén’s study concentrates on Hmong-Mien varieties spoken in China.
In order to make sure that the results can be easily inspected, we decided to reduce the data by taking a subset of 502 concepts of 15 varieties from the dataset. While we selected the languages due to their geographic distribution and their representativeness with respect to the Hmong-Mien language family, we selected the concepts for reasons of comparability with previous linguistic studies. We focus both on concepts that are frequently used in general studies in historical linguistics (reflecting the so-called basic vocabulary [12, 13, 14, 15]), and concepts that have been specifically applied in studies on Southeast Asian languages [4, 16, 17, 18, 19]. The 15 varieties are shown in their geographic distribution in Figure 2. While the reduction of the data is done for practical reasons, since smaller datasets can be more easily inspected manually, the workflow can also be applied to the full dataset, and we illustrate in the tutorial how the same analysis can be done with all languages in the original data sample.
3.2.1 From raw data to tokenized data
As a first step, we need to lift the data to a format in which they can be automatically digested. Data should be human- and machine-readable at the same time. Our framework works with data in tabular form, which is usually given in a simple text file in which the first line serves as table header and the following lines provide the content. In order to apply our workflow, each word in a given set of languages must be represented in one row of the data table, and four obligatory values need to be supplied: an identifier (ID), the name of the language variety (DOCULECT), the elicitation gloss for the concept (CONCEPT), and a phonetic transcription of the word form, provided in tokenized form (TOKENS). Additional information can be flexibly added by placing it in additional columns. Table 1 gives a minimal example for four words in Germanic languages.
| ID | DOCULECT | CONCEPT | VALUE | TOKENS |
|---|---|---|---|---|
| 1 | English | house | house | h aʊ s |
| 2 | German | house | Haus | h au s |
| 3 | Dutch | house | huis | h ʊɪ s |
| 4 | Swedish | house | hus | h ʉː s |
As can be seen from Table 1, the main reference of our algorithms is the phonetic transcription in its tokenized form as provided by the column TOKENS. Tokenized, in this context, means that the transcription explicitly marks what an algorithm should treat as one sound segment. In Table 1, for example, we have decided to render diphthongs as one sound. We could, of course, also treat them as two sounds each, but since we know that diphthongs often evolve as a single unit, we made this explicit decision with respect to the tokenization.
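The input format described above can be illustrated with a minimal sketch in plain Python. It is not part of the accompanying tutorial: the function name `read_wordlist` is hypothetical, the tab character as column separator is an assumption, and so is the column name VALUE for the orthographic form. The sketch checks that the four obligatory columns are present and splits the TOKENS column on spaces, so that each diphthong remains a single segment.

```python
import csv
import io

# The four obligatory columns named in the text.
REQUIRED = ("ID", "DOCULECT", "CONCEPT", "TOKENS")


def read_wordlist(handle):
    """Read a tab-separated wordlist and split TOKENS into segment lists."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [col for col in REQUIRED if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing obligatory columns: " + ", ".join(missing))
    rows = []
    for row in reader:
        row = dict(row)
        # Segments are separated by spaces, so "aʊ" stays one unit.
        row["TOKENS"] = row["TOKENS"].split()
        rows.append(row)
    return rows


# The four Germanic words from Table 1 (column name VALUE is an assumption).
SAMPLE = (
    "ID\tDOCULECT\tCONCEPT\tVALUE\tTOKENS\n"
    "1\tEnglish\thouse\thouse\th aʊ s\n"
    "2\tGerman\thouse\tHaus\th au s\n"
    "3\tDutch\thouse\thuis\th ʊɪ s\n"
    "4\tSwedish\thouse\thus\th ʉː s\n"
)

wordlist = read_wordlist(io.StringIO(SAMPLE))
for row in wordlist:
    print(row["DOCULECT"], row["TOKENS"])
```

Keeping the segments as a list rather than a string means that later steps never have to guess where one sound ends and the next begins.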
Transcriptions are usually not provided in tokenized form. The tokenization thus needs to be done prior to analyzing the data further. While one can easily tokenize a few words manually, as shown in Table 1, doing so for larger datasets becomes tedious and error-prone. In order to increase the consistency of this step in the workflow, we recommend using orthography profiles. An orthography profile can be thought of as a simple text file with two columns: the first column lists the character sequences as one finds them in the data, and the second column specifies the desired format into which each exact sequence is converted. An orthography profile thus allows tokenizing a given transcription into meaningful units. It can further be used to modify the original transcription by replacing tokenized units with new values. How an orthography profile can be applied is illustrated in more detail in Figure 3.
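How a two-column profile can drive tokenization may be sketched as follows. This is a simplified illustration, not the implementation used in the tutorial: we assume greedy longest-match segmentation, the profile entries are invented for the Germanic examples in Table 1, and the function name `tokenize` is hypothetical. Characters that the profile does not cover are wrapped in angle brackets so that they can be spotted and added to the profile.

```python
def tokenize(form, profile):
    """Segment `form` by greedy longest match against the profile keys."""
    longest = max(len(key) for key in profile)
    tokens, i = [], 0
    while i < len(form):
        # Try the longest possible chunk first, then shorter ones.
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                i += size
                break
        else:
            # Unknown character: flag it so that the profile can be extended.
            tokens.append("<" + form[i] + ">")
            i += 1
    return tokens


# Hypothetical profile: left = sequence in the data, right = desired output.
PROFILE = {
    "h": "h",
    "s": "s",
    "a": "a",
    "u": "u",
    "au": "au",
    "ui": "ʊɪ",
}

print(tokenize("haus", PROFILE))
print(tokenize("huis", PROFILE))
print(tokenize("hxs", PROFILE))
```

Because the longest match wins, the sequence "au" is treated as one diphthong even though "a" and "u" are also listed individually.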
Our data format can be described as a wide-table format [23, 24, 25] and conforms to the strict principle of entering only one value per cell in a given data table. This contrasts with the way in which linguists traditionally code their data, as shown in Table 2, where we contrast the original data from Chén with our normalized representation. To keep track of the original data, we reserve the column VALUE to store the original word forms, including those cases where multiple values are placed in the same cell. The separated forms are placed in the column FORM, which itself is converted into a tokenized transcription with the help of orthography profiles.
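The normalization from VALUE to FORM can be sketched in a few lines. The example is purely illustrative: the word forms and variety names are invented placeholders, the set of separators (comma, semicolon, slash) is an assumption about how multiple forms might be delimited in a cell, and the function name `normalize` is hypothetical.

```python
import re


def normalize(rows, separators=",;/"):
    """Expand rows so that each output row carries exactly one FORM."""
    pattern = "[" + re.escape(separators) + "]"
    out, next_id = [], 1
    for row in rows:
        for form in re.split(pattern, row["VALUE"]):
            form = form.strip()
            if not form:
                continue
            out.append({
                "ID": next_id,
                "DOCULECT": row["DOCULECT"],
                "CONCEPT": row["CONCEPT"],
                "VALUE": row["VALUE"],  # original cell, kept unchanged
                "FORM": form,           # one separated form per row
            })
            next_id += 1
    return out


# Invented toy input with two forms in one cell.
RAW = [
    {"DOCULECT": "VarietyA", "CONCEPT": "moon", "VALUE": "la, lha"},
    {"DOCULECT": "VarietyB", "CONCEPT": "moon", "VALUE": "ta"},
]

normalized = normalize(RAW)
for row in normalized:
    print(row["ID"], row["DOCULECT"], row["VALUE"], "->", row["FORM"])
```

Retaining the untouched VALUE next to each FORM makes every normalization decision traceable to the source.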
In order to make sure that our data is comparable with other datasets, we follow the recommendations by the Cross-Linguistic Data Formats initiative (CLDF, https://cldf.clld.org) and link our languages to the Glottolog database (https://glottolog.org) and our concepts to the Concepticon (https://concepticon.clld.org), and we follow the transcription standards proposed by the Cross-Linguistic Transcription Systems initiative (CLTS, https://clts.clld.org).
In the accompanying tutorial, we show how the data can be retrieved from the CLDF format and converted into plain tabular format. We also show how the original data can be tokenized with the help of an orthography profile (TUTORIAL 3.1).
3.2.2 From tokenized data to cognate sets
Having transformed the original data into a machine-readable format, we can start to search for words in the data which share a common origin. Identifying these etymologically related words (also called cognates) is the first and most crucial step in historical language comparison. The task is not trivial, especially when dealing with languages that diverged a long time ago. A crucial problem is that words are often not entirely cognate across languages. What we find instead is that languages share cognate morphemes (word parts). When languages make frequent use of compounding to coin new words, as Southeast Asian languages do, partial cognacy is the norm rather than the exception, which is well known to historical linguists working in this area. We explicitly address partial cognacy by adopting a numerical annotation in which each morpheme, instead of each word form, is assigned to a specific cognate set, as shown in Figure 4.
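The morpheme-level annotation can be pictured as a small data structure, sketched here under stated assumptions. The class `Word`, the two helper functions, and all forms and cognate identifiers are invented for illustration and do not reproduce the format of the tutorial. The point is only that each word carries one cognate identifier per morpheme, which makes the difference between partial and full cognacy easy to state.

```python
from dataclasses import dataclass


@dataclass
class Word:
    doculect: str
    concept: str
    morphemes: list  # one list of sound segments per morpheme
    cogids: list     # one cognate-set identifier per morpheme

    def __post_init__(self):
        if len(self.morphemes) != len(self.cogids):
            raise ValueError("need exactly one cognate ID per morpheme")


def partially_cognate(a, b):
    """True if the two words share at least one cognate morpheme."""
    return bool(set(a.cogids) & set(b.cogids))


def fully_cognate(a, b):
    """True if the two words consist of the same cognate morphemes in order."""
    return a.cogids == b.cogids


# Invented toy data: a compound in variety A, simple words in B and C.
a = Word("VarietyA", "tears", [["m", "a", "55"], ["ʔ", "ŋ", "44"]], [1, 2])
b = Word("VarietyB", "tears", [["ʔ", "o", "33"]], [2])
c = Word("VarietyC", "tears", [["k", "a", "31"]], [3])

print(partially_cognate(a, b), fully_cognate(a, b))
print(partially_cognate(a, c))
```

An annotation with a single identifier per word would have to call the words in varieties A and B either cognate or not, losing the information that only one part is shared.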
In order to infer partial cognates in our data, we make use of the partial cognate detection algorithm proposed by List et al., which is, so far, the only available algorithm that addresses this problem. In the tutorial submitted along with this paper, we illustrate in detail how partial cognates can be inferred from the data and how the results can be inspected (TUTORIAL 3.2). In addition, the tutorial briefly explains how the web-based EDICTOR tool (https://digling.org/tsv/) can be used to manually correct the partial cognates identified by the algorithm (TUTORIAL 3.2).
3.2.3 From cognate sets to alignments
An alignment analysis is a very general and convenient way to compare sequences of various kinds. The basic idea is to place two sequences into a matrix in such a way that corresponding segments appear in the same column, while placeholder symbols are used to represent those cases where a corresponding segment is lacking (Figure 5). As the core of historical language comparison lies in the identification of regularly recurring sound correspondences across cognate words in genetically related languages, it is straightforward to make use of alignment analyses once cognates have been detected in order to find patterns of corresponding sounds. In addition to forming the essential step for the identification of sound correspondences, alignment analyses also make it easier for scholars to inspect and correct algorithmic findings.
Automated phonetic alignment analysis has greatly improved during the last 20 years. The most popular alignment algorithms used in the field of historical linguistics today all have their origin in alignment applications developed for biological sequence comparison tasks, which were later adjusted and modified for linguistic purposes.
While the currently available alignment algorithms are all very complex, scholars often forget that the same amount of algorithmic complexity is not needed for all languages. Since most Southeast Asian languages have fixed syllable templates, alignments are often predicted by the syllable structure. As a result, one does not need to employ complicated sequence comparison methods in order to find the right matchings between cognate morphemes. All one needs to have is a template-representation of each morpheme in the data.
As an example, consider the typical template for many Southeast Asian languages: syllables consist maximally of an initial consonant (i), a medial glide (m), a nucleus vowel (n), a coda consonant (c), and the tone (t). Individual syllables do not need to have all these positions filled, as can be seen in the example in Figure 6a.
Once the templates of all words are annotated, aligning any word with any other word is extremely simple. Instead of aligning the words with each other, we simply align them to the template, by filling those spots in the template which have no sounds with gap symbols (“–”). We can then place all words that have been aligned to a template in our alignment and only need to delete those columns in which only gaps occur, as illustrated in Figure 6b.
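The procedure just described can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions rather than the code of the tutorial: the function names are hypothetical, the morphemes are invented toy forms, each morpheme is assumed to come with a structure string over the letters i, m, n, c, t, and a plain hyphen serves as the gap symbol.

```python
TEMPLATE = "imnct"  # initial, medial, nucleus, coda, tone


def align_to_template(tokens, structure):
    """Place each segment in its template slot and gap the empty slots."""
    if len(tokens) != len(structure):
        raise ValueError("need one template position per segment")
    if any(pos not in TEMPLATE for pos in structure):
        raise ValueError("unknown template position in " + structure)
    slots = dict(zip(structure, tokens))
    return [slots.get(pos, "-") for pos in TEMPLATE]


def template_alignment(words):
    """Align all words to the template and drop columns that are all gaps."""
    rows = [align_to_template(tokens, structure) for tokens, structure in words]
    keep = [i for i in range(len(TEMPLATE)) if any(row[i] != "-" for row in rows)]
    return [[row[i] for i in keep] for row in rows]


# Invented toy morphemes with their template annotation.
WORDS = [
    (["t", "u", "55"], "int"),
    (["t", "w", "a", "n", "33"], "imnct"),
    (["a", "ŋ", "31"], "nct"),
]

for row in template_alignment(WORDS):
    print("\t".join(row))
```

Since every word is aligned against the same template and never against another word, the cost grows only linearly with the number of words in a cognate set.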
Our accompanying tutorial illustrates how template-based alignments can be computed from the data (TUTORIAL 3.3). In addition, we also show how the alignments can be inspected with the help of the EDICTOR tool (TUTORIAL 3.3).
3.2.4 From alignments to cross-semantic cognates
As in many Southeast Asian languages, most morphologically complex words in Hmong-Mien languages are compounds, as shown in Table 3. The word for ‘fishnet’ in Northeast Yunnan Chuanqiandian, for example, is a combination of the morpheme meaning ‘bed’ [dzʱaɯ35] and the morpheme meaning ‘fish’ [ⁿpə53]. The word for ‘eagle’ in Dongnu is composed of the words [po53] ‘father’ and [tɬəŋ53] ‘hawk’. As can be seen from the word for ‘bull’ in the same variety, [po53vɔ231], [po53] can be used to denote male animals, but in the word for ‘eagle’ it is more likely to denote strength [8, p. 328]. As a final example, Younuo lexicalizes the concept ‘tears’ as [ki55mo32ʔŋ44], with [ki55mo32] meaning ‘eye’ and [ʔŋ44] meaning ‘water’.
An important consequence of the re-use of word parts in order to form new words in highly isolating languages of Southeast Asia is that certain words are not only cognate across languages, but also inside one and the same language. However, since our algorithm for partial cognate detection only identifies those word parts as cognate which appear in words denoting the same meaning, we need to find ways to infer the information on cross-semantic cognates in a further step.
As an example, consider the data for ‘son’ and ‘daughter’ in five language varieties of our illustration data. As can be seen immediately, two languages, Chuanqiandian and East Qiandong, show striking partial colexifications for the two concepts. In both cases, one morpheme recurs in the words for the two concepts. In the other cases, we find different words, but if we compare the overall cognacy, we can also see that all five languages share one cognate morpheme for ‘son’ (corresponding to the Proto-Hmong-Mien *tu̯ɛn in Ratliff’s reconstruction), and three varieties share one cognate morpheme for ‘daughter’ (corresponding to *mphjeD in Ratliff’s reconstruction), with the morpheme for ‘son’ occurring also in the words for ‘daughter’ in East Qiandong and Chuanqiandian, as mentioned before.
While a couple of strategies have been proposed to search for cognates across meaning slots [36, 37], none of the existing algorithms is sensitive to partial cognate relations, as shown in Table 4. In order to address this problem in our workflow, we propose a novel approach that is relatively simple, but surprisingly efficient. We start from all aligned cognate sets in our data, and then systematically compare all alignments with each other. Whenever two alignments are compatible, i.e., (1) at least one morpheme in one language occurs in identical form in both aligned cognate sets, and (2) there are no shared morphemes in the two alignments which are not identical, we treat them as belonging to one and the same cognate set (see Figure 7). Note that this approach can, by design, only infer strict cognates with different meanings, since not even the slightest variation in form is allowed for colexifications inside the same language. We iterate over all alignments in the data algorithmically, merging the alignments into larger sets in a greedy fashion, and re-assigning cognate sets in the data.
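The two compatibility conditions and the greedy merging can be sketched as follows. This is a simplified reading of the description above and not the implementation in the tutorial: each cognate set is assumed to contain at most one morpheme per language, stored as a tuple of segments; the function names, language labels, and forms are all invented; and the single greedy pass depends on the order in which the sets are visited.

```python
def compatible(a, b):
    """Two sets are compatible if they share a language and agree wherever they overlap."""
    shared = set(a) & set(b)
    return bool(shared) and all(a[lang] == b[lang] for lang in shared)


def merge_cognate_sets(sets):
    """Greedily merge compatible cognate sets into cross-semantic sets."""
    merged = []
    for cogid, forms in sets.items():
        for group in merged:
            if compatible(group["forms"], forms):
                group["forms"].update(forms)
                group["members"].append(cogid)
                break
        else:
            # No compatible group found: open a new one.
            merged.append({"forms": dict(forms), "members": [cogid]})
    return merged


# Invented toy data: language -> morpheme (tuple of segments) per cognate set.
SETS = {
    1: {"A": ("t", "u", "55"), "B": ("t", "o", "33")},  # e.g. from 'son'
    2: {"A": ("t", "u", "55"), "C": ("t", "a", "31")},  # e.g. from 'daughter'
    3: {"A": ("p", "a", "55"), "B": ("t", "o", "33")},  # conflicts in language A
}

for group in merge_cognate_sets(SETS):
    print(group["members"], sorted(group["forms"]))
```

Sets 1 and 2 are merged because language A has the identical morpheme in both, while set 3 stays apart because its morpheme in language A differs, even though it agrees in language B.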
The results can be easily inspected with the help of the EDICTOR tool, for example, by inspecting cognate set distributions in the data, as illustrated in detail in the tutorial (TUTORIAL 3.4). When inspecting only those cognate sets that occur in at least 10 language varieties in our sample, we already find quite a few interesting cases of cross-semantic cognate sets: morphemes denoting the concept ‘one’, for example, recur in the words for ‘hundred’ (indicating that hundred is a compound of ‘one’ plus ‘hundred’ in all languages); morphemes recur in ‘snake’ and ‘earthworm’ (reflecting that words for ‘snake’ and ‘earthworm’ are composed of a morpheme ‘worm’); and ‘left’ and ‘right’ share a common morpheme (indicating an original meaning of ‘side’ for this part, such as ‘left side’ vs. ‘right side’).
3.2.5 From cross-semantic cognates to sound correspondence patterns
Sound correspondences, and specifically sound correspondence patterns across multiple languages, can be seen as the core objective of the classical comparative method and form the basis of further endeavors such as the reconstruction of proto-forms or the reconstruction of phylogenies. Linguists commonly propose sound correspondence sets, that is, collections of sound correspondences which reconstruct back to a common proto-sound (or sequence of proto-sounds) in the ancestor language, as one of the final stages of historical language comparison. For Hmong-Mien languages, for example, Wang proposed 30 sets and Ratliff reduced the number of correspondence sets to 28.
An example of the representation of sound correspondence sets in the classical literature is provided in Table 5. The supposed proto-sound *ntshj- in proto-Hmong-Mien is inferred from the initials of four words in 11 contemporary Hmong-Mien languages.
Although this kind of data representation is typical for classical accounts on sound correspondence patterns in historical language comparison, it has several shortcomings. First, the representation shows only morphemes, and we are not informed about the full word forms underlying the patterns. This is unfortunate, since we cannot exclude that compound words were already present in the ancestral language, and it may likewise be possible that processes of compounding left traces in the correspondence patterns themselves. Second, since scholars tend to list sound correspondence patterns merely in an exemplary fashion, with no intent to provide full frequency accounts, it is often not clear how strong the actual evidence is, and whether the pattern at hand is exhaustive, or merely serves to provide an example. Third, we are not being told where a given sound in a given language fits a general pattern less well. Thus, we can find two different reflexes in language 8 in the table, [ɕ] and [dʑ], but without further information, we cannot tell if the differences result from secondary, conditioned sound changes, or whether they reflect irregularities that the author has not yet resolved.
To overcome these shortcomings, we employ a two-fold strategy. We first make use of a new method for sound correspondence pattern detection in order to identify exhaustively, for each column in each alignment of our data, to which correspondence pattern it belongs. In a second step, we use the EDICTOR tool to closely inspect the patterns identified by the algorithm and to compare them with those patterns proposed in the classical literature.
The method for correspondence pattern identification starts by assembling all alignment sites (all columns) in the aligned cognate sets of the data, and then clusters them into groups of compatible sound correspondence patterns. Compatibility essentially makes sure that no language has more than one reflex sound in all partitioned alignment sites (see  for a detailed explanation of this algorithm).
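The notion of compatibility can be made concrete with a deliberately simplified sketch. It is not the published algorithm, which uses a more elaborate partitioning strategy. Here we assume that each alignment site is a mapping from language to reflex sound, that languages missing from a cognate set are simply absent from the mapping, that two sites must share at least one language to be grouped, and that sites are assigned greedily, fullest sites first. All names and sounds are invented.

```python
def compatible(pattern, site):
    """A site fits a pattern if no language would receive a second reflex sound."""
    shared = set(pattern) & set(site)
    return bool(shared) and all(pattern[lang] == site[lang] for lang in shared)


def cluster_sites(sites):
    """Greedily cluster alignment sites into correspondence patterns."""
    patterns = []
    # Start with the sites that are attested in the most languages.
    for site in sorted(sites, key=len, reverse=True):
        for pattern in patterns:
            if compatible(pattern["reflexes"], site):
                pattern["reflexes"].update(site)
                pattern["size"] += 1
                break
        else:
            patterns.append({"reflexes": dict(site), "size": 1})
    return patterns


# Invented alignment sites (columns) from four cognate sets.
SITES = [
    {"L1": "ts", "L2": "s", "L3": "ɕ"},
    {"L1": "ts", "L2": "s"},
    {"L1": "ts", "L2": "tʰ", "L3": "ɕ"},
    {"L3": "ɕ", "L4": "dʑ"},
]

patterns = cluster_sites(SITES)
for pattern in patterns:
    print(pattern["size"], pattern["reflexes"])
print("singletons:", sum(1 for p in patterns if p["size"] == 1))
```

In this toy run the last site would fit either pattern, and the greedy pass simply assigns it to the first one it meets. Ambiguities of this kind are one reason why a careful partitioning strategy and manual inspection matter.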
Table 6 provides some statistics regarding the results of the correspondence pattern analysis. The analysis yielded a total of 1392 distinct sound correspondence patterns (with none of the patterns being compatible with any of the other patterns). While this may seem a lot, we find that 234 patterns only occur once in the data (probably reflecting borrowing events, erroneously coded cognates, or errors in the data). Among the 1158 non-singleton patterns, we find 302 corresponding to initials, 74 to medials, 389 to nucleus vowels, 95 to the codas, and 298 to the tone patterns. These numbers may seem surprising, but one should keep in mind that phonological reconstruction will assign several distinct correspondence patterns to the same proto-form and explain the divergence by means of conditioning context in sound change. So far, there are few studies on the numbers of distinct correspondence patterns one should expect, but the results we find for the Hmong-Mien dataset are in line with previous studies on other language families. More studies are needed in order to fully understand what one ought to expect in terms of the numbers of correspondence patterns in datasets of various sizes and types.
While the representation in textbooks usually breaks the unity of morphemes and word forms, our workflow never loses track of the words, although it enables users to look at the morphemes and at the correspondence patterns in isolation. Our accompanying tutorial shows not only how the correspondence patterns can be computed (TUTORIAL 3.5), but also how they can be inspected in the EDICTOR tool (TUTORIAL 3.5), where we can further see that our analysis uncovers the correspondence pattern shown in Table 5 above, as we illustrate in Table 7. Here, we can see that our approach confirms Ratliff’s pattern by clustering initial consonants of cognates for ‘blood’ and ‘fear (be afraid)’ into one correspondence pattern.
Although our workflow represents what we consider the current state of the art in the field of computational historical linguistics, it is not complete yet, and it is also not perfect. Many more aspects need to be integrated, discussed, and formalized. Based on a quick discussion of the general results of our study, we will discuss three important aspects, namely, (a) the current performance of the existing algorithms in our workflow, (b) possible improvements of the algorithms, and (c) general challenges for all future endeavors in computer-assisted or computational historical linguistics.
4.1 Current performance
Historical language comparison deals with the reconstruction of events that happened in the past and can rarely be directly verified. Our knowledge about a given language family is constantly evolving. At the same time, debate on language history is never free of disagreement among scholars, and this is also the case with the reconstruction of Hmong-Mien. As a result, it is not easy to provide a direct evaluation of the performance of the computational part of the workflow presented here.
In addition to these theoretical problems, evaluation faces practical problems. First, classical resources on historical language comparison of Hmong-Mien are not available in digital form (and digitizing them would be beyond the scope of this study). Second, and more importantly, even if we had recent data on Hmong-Mien reconstruction in digital form, we could not compare them directly with our results due to the difference in the workflows. All current studies merely consist of morphemes that were taken from different sources without giving reference to the original words. Full words, which are the starting point in our study, are not reported and apparently not taken into account. For a true evaluation of our workflow, however, we would need a manually annotated dataset that would show the same completeness in terms of annotation as the one we have automatically produced. Furthermore, since our workflow is explicitly conceived as computer-assisted and not purely computational, the question of algorithmic performance is more aesthetic than substantive, given that the computational approaches are merely used to ease the labor of the experts.
Nevertheless, to some degree, we can evaluate the algorithms which we assembled for our workflow, and it is from these evaluations, which have been made in the past, that we draw confidence in the overall usefulness of our workflow. Partial cognate detection, as outlined in Section 3.2, for example, has been substantially evaluated, with results ranging between 90% (Chinese dialects) and 94% (Bai dialects) compared to expert judgments. The alignment procedure we propose is supposed to work as well as an expert, provided that experts agree on the prosodic structure we assign to all morphemes. For the cross-semantic cognate set detection procedure we propose, we do not yet have substantial evaluations, since we lack sufficient test data. Finally, the correspondence pattern detection algorithm has been indirectly evaluated by testing how well so far unobserved cognate words could be predicted, showing an accuracy between 59% (Burmish languages) and 81% (Polynesian languages) for trials in which 25% of the data was artificially deleted and later predicted.
As another quick way to check if the automated aspects of our workflow are going in the right direction, we can compute a phylogeny based on shared cross-semantic cognates between all language pairs and see if the phylogeny matches with those proposed in the literature. This analysis, which can be inspected in detail in the accompanying tutorial (TUTORIAL 4.2), shows that the automated workflow yields a tree that correctly separates not only Hmongic from Mienic languages but also identifies all smaller subgroups commonly recognized.
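The first ingredient of such a check, a distance between every pair of languages derived from shared cognate sets, can be sketched as follows. The text does not specify the distance measure, so the one used here (one minus the proportion of shared cognate sets among all cognate sets attested in either language) is an assumption, as are the function name, the language labels, and the cognate identifiers. Building the tree from the distances would be a separate step and is not shown.

```python
from itertools import combinations


def cognate_distances(cogsets):
    """Pairwise distances: 1 minus the share of cognate sets two languages have in common."""
    dist = {}
    for a, b in combinations(sorted(cogsets), 2):
        union = cogsets[a] | cogsets[b]
        shared = cogsets[a] & cogsets[b]
        dist[a, b] = 1 - len(shared) / len(union) if union else 0.0
    return dist


# Invented toy data: language -> set of cross-semantic cognate set IDs.
COGSETS = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 6, 7, 8},
}

distances = cognate_distances(COGSETS)
for pair, value in sorted(distances.items()):
    print(pair, round(value, 3))
print("closest pair:", min(distances, key=distances.get))
```

In the toy data, languages A and B share three of their five cognate sets and therefore come out as the closest pair, which is the kind of signal a tree-building method would then turn into a subgroup.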
4.2 Possible improvements
The major desideratum in terms of possible improvements is the further integration of our preliminary attempts at semi-automated reconstruction, starting from already identified sound correspondence patterns. Experiments are ongoing in this regard, but we have not yet had time to integrate them fully. In general, our workflow also needs a clearer integration of automatic and manual approaches, ideally accompanied by extensive tutorials that would allow users to start with the tools independently. This study can be seen as a first step in this direction, but much more work will be needed in the future.
4.3 General challenges
General challenges include the full-fledged lexical reconstruction of words, i.e., a reconstruction that would potentially also provide compounds in etymological dictionaries. This might help to overcome a huge problem in historical language comparison in the Southeast Asian area, where scholars tend to reconstruct only morphemes, and rarely attempt the reconstruction of real word forms in the ancestral languages. Furthermore, we will need a convincing annotation of sound change that would ideally allow us to even check which sounds changed at which time during language history.
This article provides a detailed account of what we consider the current state of the art in computer-assisted language comparison. Starting from raw data, we have shown how these can be successively lifted to higher levels of annotation. While our five-step workflow is intended to be applied in a computer-assisted fashion, we have shown that even with a purely automatic approach, one can already achieve insightful results that compare favorably to results obtained in a purely manual approach. In the future, we hope to further enhance the workflow and make it more accessible to a wider audience.
Supplementary information and material
The appendix that is submitted along with this study consists of two parts. First, there is a glossary explaining the most important terms that were used throughout this study. Second, there is a tutorial explaining the steps of the workflow in detail. In addition to this supplementary information, we provide supplementary material in the form of data and code. The data used in this study is archived on Zenodo (DOI: 10.5281/zenodo.3741500) and curated on GitHub (Version 2.1.0, https://github.com/lexibank/chenhmongmien). The code, along with the tutorial, has also been archived on Zenodo (DOI: 10.5281/zenodo.3741771) and is curated on GitHub (Version 1.0.0, https://github.com/lingpy/workflow-paper). Additionally, our Code Ocean Capsule allows users to run the code without installing anything on their machine; it can be accessed from https://codeocean.com/capsule/8178287/ (Version 2).