1 Overview

One of the most prominent trends in the linguistics of the 21st century is the unparalleled growth of machine-readable resources. While some legacy databases are steadily being converted into standardized formats like CLDF (Forkel & List, 2020), more and more new datasets get published every year. These datasets, while often including well-described languages, increasingly add languages that have so far enjoyed little to no scholarly attention, or have not been aggregated and made publicly accessible for various reasons (Dellert et al., 2020; Dellert, Daneyko, & Münch, 2019; Kassian, 2020). An example of the latter is TuLeD (Tupían Lexical Database) (Gerardi, Reichert, Aragon, List, & Wientzek, 2021; Gerardi, Reichert, & Aragon, 2021) which grew out of a wide variety of sources on Tupían languages (living and extinct) and was subsequently used in a phylogenetic classification of the Tupí-Guaraní branch (Gerardi & Reichert, 2021).

However, there is a clear need for further resources that capture even more of the linguistic and cultural diversity of South America. Our overarching goal is not only to continue providing resources that spread knowledge of Amazonian languages and thus broaden our understanding of linguistic typology, but also to do so in a way that enables us to test empirically some of the hypotheses put forth in the research literature. One such hypothesis is that two language families, Katukinan and Harakmbut, are genetically related (Adelaar, 2000, 2007). We also conjecture a macrofamily which adds the Arawan family to Katukinan1 and Harakmbut, following proposals by dos Anjos (2011) and Jolkesky (2016). For these reasons we are building the Katukinan-Arawan-Harakmbut Database (KAHD) by aggregating the published sources and making sure the data are consistently transcribed, aligned, and enriched with information on cognacy. Far from being a purely lexical database, KAHD is planned to encompass phonetic-phonological and morphological information as well.

At present, the size of the database can neither support nor refute the genetic relationship between these languages. Our goal in this paper is thus to introduce the database as an instrument which could, among other things, be employed in attempts to answer this question of genetic relatedness. The quantitative methods presented in Section 2 are intended to demonstrate the current status of the database.

1.1 The Arawan language family

The Arawan family has been recognized since about 1891, when Brinton noted similarities between Arawá and Paumari. It consists of six languages:2 Paumari, Madi (with its dialects Jarawara, Jamamadí, and Banawá; see Dixon 2004), Sorowaha, Deni, Kulina, and the extinct Arawá (Ehrenreich, 1897). The number of speakers varies, as does their social vulnerability and, consequently, the status of each language: Sorowaha is vigorous with fewer than 200 speakers, whereas Kulina is threatened with 2500 speakers.

Most Arawan-speaking groups were contacted at the end of the nineteenth century. Some of them, such as the Sorowaha, escaped the intensive invasion of Indigenous territories that took place on the middle Purus River and still live as a recently contacted group (Aparicio, 2015; Huber, 2012). Others remain isolated, like the groups who live in the Hi-Merimã Indigenous Area on the middle Purus (Shiratori, Cangussu, & Furquim, 2021). Table 1 presents information on ethnic population, number of speakers, and language status according to Eberhard, Simons, and Fennig (2021), whose figures stem from a source dating back to 2012, and Figure 1 shows the location of the languages. These numbers do not necessarily reflect the current situation, but they offer a general picture of the state of the Arawan language family. The ethnic population for Kulina, for example, differs significantly from that given by Dienst (2014) (5500 in Brazil and 600 in Peru), while a source from 2015 cites a comparable figure for the Sorowaha (Aparicio, 2015). In the absence of official or more precise and recent sources, Ethnologue (Eberhard et al., 2021) seems to be the most reliable source to quote.

Table 1

The Arawan languages in KAHD. Information on ethnic population, speakers and status taken from Eberhard et al. (2021).


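Language | Dialect  | ISO 639-3 | Glottocode | Ethnic population | Speakers | Status
---------|----------|-----------|------------|-------------------|----------|------------
Arawá    |          | aru       | arua1263   | 0                 | 0        | Extinct
Dení     |          | dny       | deni1241   | 880               | 740      | Developing
Madí     | Banawá   | jaa       | bana1307   | (780)             | 100      | Educational
         | Jamamadí |           | jama1261   | 780               | 450      | Educational
         | Jarawara |           | jara1276   | (780)             | 230      | Educational
Kulina   |          | cul       | culi1244   | 3500              | 3000     | Threatened
Paumari  |          | pad       | paum1247   | 890               | 290      | Moribund
Sorowaha |          | swx       | suru1263   | 140               | 140      | Vigorous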

Figure 1 

Location of the Arawan languages according to Hammarström et al. (2021).

The Arawan communities are located in Brazil, in south-western Amazonia, except for the Kulina speakers, who live near the Peruvian border (Ucayali). The Purus basin and the Juruá River are the historical seats of the Arawan groups. Their presence on the banks of the Purus and Juruá rivers, especially along the middle course of the Purus (which extends from the surroundings of the Acre River to the surroundings of the city of Tapauá, between the Acre River and the Tauamirim stream), was marked by the continuous exploitation of rubber and the presence of proselytizing missionaries (Aparício, 2019). Only after the 1990s did the Brazilian authorities begin to delimit and recognize their territories (Aparício, 2011), too late to avert the devastating effects of the genocide and epidemics that followed the introduction of rubber extraction on the Purus (Kroemer, 1985).

The Arawá, a group whose name is now used for the Arawan language family, are a case in point. Their presence on the Juruá River was first signaled by Castelnau (1851, 87). The tribe was reported to have been exterminated by an epidemic of measles, introduced by the first wave of migrants from the north-eastern state of Ceará, on the east coast of Brazil, who had been displaced by the drought of 1877. The few survivors sought refuge with the Kulina, speakers of a language from the same family, who are said to have massacred them (Rivet & Tastevin, 1938, 72). Little is known about the Arawá language, and it is possible that the remnants of the group were incorporated into the Kulina, whose language they may have influenced.

1.2 The Harakmbut-Katukinan language family

Harakmbut is spoken along the Madre de Dios River and its upper tributaries in Peru. There are several dialects, which fall into two large clusters (Helberg Chávez, 1984, xv, 50; Helberg Chávez & Solís Fonseca, 1990, 227–228). Toyoeri and Huachipaeri form one cluster, while the other is formed by Sapiteri, Arasaeri, and Amarakaeri, the last of which is the best known and has the largest number of speakers (see also van Linden, 2022). Harakmbut was initially classified as belonging to the Arawak family (Matteson, 1972; McQuown, 1955), but more recently, based on lexical evidence, Adelaar (2000, 2007) has proposed that it is genetically related to the Brazilian Katukina family. Wise (1999) seems to consider it an isolate.

The Katukinan family is known thanks to the work of Tastevin (1920) (see also Rivet 1920; Rivet and Tastevin 1921, 1923) and Natterer (1817–1835). Rodrigues (1986, 79–81) takes it for granted. It was Adelaar (2000) who first proposed the link between Harakmbut and the Katukinan languages, which has not been challenged and seems to be widely accepted by now. Of the two dialects of Kanamari, Katukina is probably the only surviving member of the family, since Katawixi was already said to have disappeared in 1926 (dos Anjos, 2011, 16–17). Table 2 offers a brief overview of the current situation of these two language families.

Table 2

The Harakmbut and Katukinan languages in KAHD. Information on ethnic population, speakers and status taken from dos Anjos (2011); Eberhard et al. (2021).


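Language     | ISO 639-3 | Glottocode | Ethnic population | Speakers | Status
-------------|-----------|------------|-------------------|----------|-----------
Harakmbut    | aru       | hara1260   | 2090              | 1910     | Threatened
Kanamari     | knm       | cuti1242   | ?                 | 1700     | Vigorous
Katukina Biá | knm       | katu1276   | ?                 | 550      | Vigorous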

2 Method

In an era of a rapidly growing number of linguistic resources, arriving at comparable results in cross-linguistic research entails working with comparable datasets and standardized sets of tools and specifications. Despite their proliferation, datasets often fail to conform to the FAIR (Findable, Accessible, Interoperable, Reusable) data principles outlined in Wilkinson et al. (2016) and may require laborious and costly preprocessing before any analysis can take place. To address this need, we decided to follow the standards of the CLDF (Cross-Linguistic Data Formats) initiative, which enjoys growing popularity in the (computational) linguistic community (Forkel et al., 2018). CLDF offers ways to ensure the integrity of the data and its connection to major reference catalogs like Glottolog (Hammarström et al., 2021) and Concepticon (List et al., 2022), as well as scripts written specifically for (historical) linguists to get the most out of their data. CLDF works with simple text formats that can be read and modified in any environment and allows for automatic validation of datasets against the specifications. Additionally, projects based on the CLDF specification, like CLICS (Database of Cross-Linguistic Colexifications) (List et al., 2018) or the CLTS (Cross-Linguistic Transcription Systems) initiative, which endorses the use of unified phonetically transcribed forms (Anderson et al., 2018; List, Anderson, Tresoldi, & Forkel, 2021), constantly add new ways to explore available data and increase its cross-linguistic interoperability. This framework, together with its tools and the agreed-upon workflow, was used in the preparation of the TuLeD dataset (Gerardi, Reichert, & Aragon, 2021).
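To make the idea of automatic validation more concrete, the following minimal Python sketch checks one kind of integrity constraint: every form should point to an existing language and an existing concept. It is not the CLDF validator itself. The column names (ID, Language_ID, Parameter_ID, Form, Glottocode) follow the CLDF defaults as we understand them; the rows, identifiers, and placeholder forms are invented for illustration, and only the two Glottocodes are taken from Table 1.

```python
import csv
import io

# Toy stand-ins for three CLDF tables. All rows are invented for illustration;
# "xxx", "yyy", "zzz" are placeholders, not attested forms.
LANGUAGES = """ID,Name,Glottocode
deni,Dení,deni1241
kulina,Kulina,culi1244
"""

PARAMETERS = """ID,Name
water,WATER
fire,FIRE
"""

FORMS = """ID,Language_ID,Parameter_ID,Form
deni-water-1,deni,water,xxx
kulina-water-1,kulina,water,yyy
kulina-stone-1,kulina,stone,zzz
"""


def read_table(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def dangling_references(forms, languages, parameters):
    """Return (form ID, column, value) for every reference without a target row."""
    language_ids = {row["ID"] for row in languages}
    parameter_ids = {row["ID"] for row in parameters}
    problems = []
    for row in forms:
        if row["Language_ID"] not in language_ids:
            problems.append((row["ID"], "Language_ID", row["Language_ID"]))
        if row["Parameter_ID"] not in parameter_ids:
            problems.append((row["ID"], "Parameter_ID", row["Parameter_ID"]))
    return problems


problems = dangling_references(
    read_table(FORMS), read_table(LANGUAGES), read_table(PARAMETERS)
)
for problem in problems:
    print(problem)
```

In this toy example the third form refers to a concept (`stone`) that is missing from the concept table, which is the kind of inconsistency a format-level validator is meant to catch before any analysis starts.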

Similarly, in the case of the Arawan dataset, the data harvested from numerous sources are being curated and expanded using the JavaScript graphical application EDICTOR (List 2017, 2021), from which they can be easily exported in csv format and processed further with various modules of LingPy, a state-of-the-art suite of computational tools for historical linguistics (List & Forkel, 2021, July 29).

The pre-release version of the dataset, which this paper describes, consists of 8 doculects (Good & Cysouw, 2013) and 2503 forms for 556 concepts.3 The lexical coverage for each language in the dataset is given in Table 3.

Table 3

Lexical coverage for each language in the database.


Language     | Lexical coverage
-------------|-----------------
Amarakaeri   | 51
Arawa        | 36
Banawa       | 309
Deni         | 400
Jamamadi     | 294
Jarawara     | 419
Kanamari     | 37
Katawishi    | 56
KatukinaBiá  | 18
Kulina       | 405
Paumari      | 268
Proto-Arawan | 386
Sorowaha     | 423

The choice of concepts follows established lists such as Swadesh (1955, 2017) and the Leipzig-Jakarta list (Tadmor, Haspelmath, & Taylor, 2010), but also adds multiple concepts whose inclusion is motivated by their cultural prominence in the (daily) life of the native speakers.4 These concepts cover a variety of semantic domains: food and drink, kinship, the physical world, agriculture and vegetation, basic actions and technology, emotions and values, as well as fauna and flora, among others. The specifics for each concept, including its semantic domain, can be accessed on Concepticon (List et al., 2022), except for some fauna and flora items, since the names of concepts in our database are based on this source.

Cognacy was at first obtained through the five methods for automatic cognate detection implemented in LingPy and discussed in List, Greenhill, and Gray (2017) using the default parameters with the number of permutations set to 10,000 for each method, thus closely following the workflow of the original paper. The B-Cubed scores used for evaluation of each analysis are given in Table 4.
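As a rough illustration of the simplest kind of approach among these methods, the sketch below groups words for one concept by normalized edit distance. It does not reproduce LingPy's implementation: the threshold of 0.5, the single-linkage grouping, and the toy segment lists for the hypothetical doculects A, B, and C are all assumptions made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sound segments."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def cluster_forms(forms, threshold=0.5):
    """Single-linkage grouping: any pair below the threshold shares a cognate ID."""
    labels = list(forms)
    cogid = {label: index for index, label in enumerate(labels)}
    for i, first in enumerate(labels):
        for second in labels[i + 1:]:
            if normalized_distance(forms[first], forms[second]) < threshold:
                old, new = cogid[second], cogid[first]
                for label in labels:
                    if cogid[label] == old:
                        cogid[label] = new
    return cogid


# Invented segment lists for one concept in three hypothetical doculects.
toy_forms = {
    "A": ["p", "a", "t", "a"],
    "B": ["p", "a", "d", "a"],
    "C": ["k", "o", "r", "i"],
}
toy_cogids = cluster_forms(toy_forms)
print(toy_cogids)
```

Here A and B differ in one of four segments (distance 0.25) and are grouped together, while C shares no segment with either and stays apart. Methods such as SCA or LexStat replace the crude segment identity test with sound-class-based or language-pair-specific scores.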

Table 4

Comparison of tests using B-Cubed scores.


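Method       | Precision | Recall | F-score
-------------|-----------|--------|--------
SCA          | 0.952     | 0.963  | 0.944
LexStat      | 0.972     | 0.931  | 0.951
InfoMap      | 0.960     | 0.942  | 0.951
EditDistance | 0.973     | 0.884  | 0.926
Turchin      | 0.985     | 0.810  | 0.889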
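For readers unfamiliar with B-Cubed scores, the following sketch shows how precision, recall, and F-score can be computed for a single comparison of gold-standard and automatically inferred cognate sets. It is an illustrative reimplementation of the general B-Cubed idea, not the evaluation code used for Table 4, and the word IDs and cluster labels are invented. Assuming the three columns of Table 4 are precision, recall, and F-score, the F-score is the harmonic mean of the other two; for instance, 0.972 and 0.931 combine to roughly 0.951.

```python
from collections import defaultdict


def bcubed(gold, predicted):
    """Return B-Cubed (precision, recall, F-score) for two clusterings.

    Both arguments map each word ID to a cluster label.
    """
    def cluster_of(assignment):
        groups = defaultdict(set)
        for item, label in assignment.items():
            groups[label].add(item)
        return {item: groups[label] for item, label in assignment.items()}

    gold_of, pred_of = cluster_of(gold), cluster_of(predicted)
    items = list(gold)
    # Precision: share of each word's predicted cluster that is truly cognate.
    precision = sum(
        len(gold_of[i] & pred_of[i]) / len(pred_of[i]) for i in items
    ) / len(items)
    # Recall: share of each word's gold cluster that was recovered.
    recall = sum(
        len(gold_of[i] & pred_of[i]) / len(gold_of[i]) for i in items
    ) / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented example: four words, the automatic method splits one gold set
# and wrongly merges its last member with an unrelated word.
gold = {"w1": 1, "w2": 1, "w3": 1, "w4": 2}
predicted = {"w1": "a", "w2": "a", "w3": "b", "w4": "b"}
precision, recall, f_score = bcubed(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f_score, 3))
```

High precision with lower recall, as in the EditDistance and Turchin rows, would correspond to a method that rarely lumps unrelated words together but tends to split true cognate sets.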

Initially, we relied on the LexStat method because of its performance in cognate assignment (see Table 4) and subsequently refined the results manually using expert judgment. This did not lead to any significant improvement, because the family appears to be quite shallow, as indicated by the low cognate diversity: 0.169 for cogids and 0.186 for cogid (see List et al. (2017) for cognate diversity in other families). Even though we have assigned cogids for partial cognacy and added morpheme glosses, partial cognacy will only be thoroughly addressed in the next release. This means that morphological segmentation will be made available as well.
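The cognate diversity figures can be read against the following small sketch. It assumes the definition we understand List et al. (2017) to use, namely the number of cognate sets minus the number of concepts, divided by the number of words minus the number of concepts; the concept labels and cognate IDs are invented.

```python
def cognate_diversity(rows):
    """Cognate diversity for an iterable of (concept, cognate ID) pairs.

    Assumed definition: (cognate sets - concepts) / (words - concepts).
    A value near 0 means almost every concept has a single cognate set.
    """
    rows = list(rows)
    concepts = {concept for concept, _ in rows}
    cognate_sets = {(concept, cogid) for concept, cogid in rows}
    words = len(rows)
    if words == len(concepts):
        return 0.0
    return (len(cognate_sets) - len(concepts)) / (words - len(concepts))


# Invented data: concept X has four words in one cognate set,
# concept Y has four words split into two cognate sets.
toy_rows = [
    ("X", 1), ("X", 1), ("X", 1), ("X", 1),
    ("Y", 2), ("Y", 2), ("Y", 3), ("Y", 3),
]
toy_diversity = cognate_diversity(toy_rows)
print(round(toy_diversity, 3))
```

Under this reading, values such as 0.169 and 0.186 indicate that most concepts are covered by very few cognate sets, which is what one expects of a shallow family.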

LingPy also implements an alignment algorithm, which was used for this pre-release version of the dataset.5 It should be noted that the resulting alignments have not been manually checked and no changes have been made to the output of LingPy. An example of the alignment for the concept “shoot with blow-gun” is given in Figure 2.

Figure 2 

Example of an alignment from the KAHD database.

We have further computed maximal mutual coverage6 for all doculects in the dataset. The result is 6 doculects with an average mutual coverage of 219.
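The notion of mutual coverage can be sketched as follows: for each pair of doculects, count the concepts for which both have at least one form. The subset search below (the largest group of doculects in which every pair shares at least a given number of concepts) is one possible selection criterion and an assumption on our part; the doculect names, concept numbers, and threshold are invented.

```python
from itertools import combinations


def mutual_coverage(concepts_by_doculect):
    """Number of shared concepts for every pair of doculects."""
    return {
        (a, b): len(concepts_by_doculect[a] & concepts_by_doculect[b])
        for a, b in combinations(sorted(concepts_by_doculect), 2)
    }


def average_mutual_coverage(concepts_by_doculect):
    """Mean number of shared concepts over all doculect pairs."""
    pairs = mutual_coverage(concepts_by_doculect)
    return sum(pairs.values()) / len(pairs)


def largest_subset(concepts_by_doculect, threshold):
    """Largest group of doculects whose pairs all share >= threshold concepts.

    Brute force over subsets, which is fine for a dozen doculects.
    """
    names = sorted(concepts_by_doculect)
    coverage = mutual_coverage(concepts_by_doculect)
    for size in range(len(names), 1, -1):
        for subset in combinations(names, size):
            if all(coverage[pair] >= threshold for pair in combinations(subset, 2)):
                return subset
    return ()


# Invented attestation sets: concept numbers for which a doculect has a form.
toy = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {3, 4, 5, 6},
    "D": {9},
}
subset = largest_subset(toy, threshold=2)
subset_average = average_mutual_coverage({name: toy[name] for name in subset})
print(subset, round(subset_average, 2))
```

In the toy data, doculect D shares no concept with the others and is dropped, leaving three doculects with an average mutual coverage of about 2.67.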

We have carried out a simple classification experiment in order to compare the results with the classification of Arawan proposed by Dienst (2008), shown in Figure 3. We are not proposing a classification of our own, but testing the validity of the automated cognate judgments against an existing classification.

Figure 3 

Classification of Arawan from Dienst (2008).

We obtained a similar classification using our cognates as input to the UPGMA algorithm (Sokal & Michener, 1958). The result of this classification, an unrooted tree, is given in Figure 4.

Figure 4 

UPGMA classification of Arawan from KAHD data.
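For illustration, a compact version of UPGMA clustering on cognate-based distances is sketched below. The distance measure (one minus the share of jointly attested concepts with identical cognate IDs) is an assumption for the example, as are the four hypothetical doculects and their cognate IDs; the sketch returns a nested tuple rather than a drawn tree and is not the code that produced Figure 4.

```python
from itertools import combinations


def cognate_distance(cogs_a, cogs_b):
    """1 minus the share of jointly attested concepts with the same cognate ID."""
    shared = [concept for concept in cogs_a if concept in cogs_b]
    same = sum(1 for concept in shared if cogs_a[concept] == cogs_b[concept])
    return 1 - same / len(shared)


def upgma(distances, names):
    """Average-linkage clustering; distances are keyed by frozenset({a, b})."""
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    dist = dict(distances)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # New distances are size-weighted averages of the old ones.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Invented cognate IDs for four concepts in four hypothetical doculects.
toy_cognates = {
    "A": {"c1": 1, "c2": 1, "c3": 1, "c4": 1},
    "B": {"c1": 1, "c2": 1, "c3": 1, "c4": 2},
    "C": {"c1": 1, "c2": 2, "c3": 2, "c4": 3},
    "D": {"c1": 2, "c2": 2, "c3": 2, "c4": 3},
}
toy_distances = {
    frozenset(pair): cognate_distance(toy_cognates[pair[0]], toy_cognates[pair[1]])
    for pair in combinations(toy_cognates, 2)
}
toy_tree = upgma(toy_distances, list(toy_cognates))
print(toy_tree)
```

The toy doculects A and B share three of four cognate sets, as do C and D, so the procedure first joins each of these pairs and then joins the two groups.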

3 Results and discussion

Despite the various ways of hosting scientific datasets on the web, data validation and curation may require a considerable investment of time and money, alongside technical skills and acumen. An additional consideration is the need to increase interoperability between datasets for typological and phylogenetic analyses, among others. In the case of South American language families, having freely available data in standardized transcription, enriched with information on linguistic features like cognacy, would bring together the many valuable contributions from ethnographers and linguists alike. We believe that this next crucial step can be made much easier by using the toolset built around CLDF datasets. The effort involved in checking the data’s integrity is minimal, and the steadily growing number of datasets published in adherence to these standards attests to their robustness and utility for (primarily) linguistic purposes. In making our database open-access, we rely on the cldfbench framework, which greatly reduces the cost of FAIR data curation by providing ways to read, write, and validate standardized CLDF datasets (Forkel & List, 2020).

This pre-release version is not yet hosted in the CLLD web-application despite being publicly available. The official release is planned to include a suitable graphical user interface, but the dataset can be accessed in its entirety via a permanent link in the EDICTOR which offers various search and analysis tools (List, 2017) as well as an option to download the full dataset.7

With the publication of the pre-release, we now begin to focus on the primary official release (version 1.0), which will contain enough data with cognacy assignment in all three families to allow a preliminary test of the hypothesis that these families are related (Adelaar, 2000, 2007; Jolkesky, 2011, 2016). The inclusion of morphological items will provide valuable insights for comparison and allow for a better typological description of languages for which few resources are available.

We submitted our dataset to Zenodo8 for archiving.

4 Implications/Applications

The last decades have witnessed a growing number of phylogenetic classifications of language families thanks to the use of lexical databases with cognacy assignment (Heggarty, 2021; Kolipakam et al., 2018; Sagart et al., 2019; Walworth, 2017; Zhang, Yan, Pan, & Jin, 2019). Such databases, besides elucidating the internal classification of language families, play a role in the understanding of displacement and linguistic contact, for example through borrowing. The words of a language are valuable for understanding the culture in which it is spoken (Harrison, 2008), even more so when the whole family is considered. In addition, culturally relevant lexical items offer us insights into possible genetic relations between individual languages, and it is even possible to putatively reconstruct items that were part of a proto-culture (Corrêa-da Silva, 2013; Rodrigues, 2010).

Apart from its value for (computational) historical linguistics mentioned in the previous section, the KAHD database also serves as a language documentation and preservation effort for Amazonian language families since, as shown in Section 1.1, the number of speakers of some of the languages is diminishing at a fast rate (see e.g. D’Ávila 2019). Lehmann (2001, 5) affirms that the primary purpose of language documentation is to “represent the language for those who do not have direct access to the language itself.” KAHD strives to achieve this goal by collecting primary data and making it publicly available after careful pre-processing, e.g. by performing cognacy judgment. Aside from language documentation (Romaine, 2015), the preparation of the Arawan dataset reveals the vast amount of work that is still to be done. The relative scarcity of published linguistic research on this language family underscores the need for a project like KAHD that could become the central hub for collaboration and research into the lexical richness of these three underdescribed language families. Access to further sources to include in the dataset is essential for substantiating any theories on these language families.

An important future direction of the project is its use as a source for creating learning materials for the Indigenous communities, helping them raise their language vitality and providing an authentic context for language acquisition. Dictionaries, for instance, are one type of pedagogical material whose compilation could be made easier and more cost-efficient by relying on a database like KAHD. An obvious advantage of an online database is the quick and effortless addition of new concepts and words. Thus, KAHD is being prepared with an eye toward wedding technology with ongoing language revitalization efforts. Moreover, as with KAHD’s precursor TuLeD, we intend to actively involve community members in shaping KAHD into a useful and free tool for a variety of purposes, starting with the local preparation of educational resources. We welcome contributions of any kind to the project.

Supplementary Files

All data relevant to the creation of this pre-release version of the Arawan dataset can be accessed and downloaded from our GitHub repository (https://github.com/LanguageStructure/KAHD_pre_release). All output files produced by running LingPy scripts are uploaded into the folder LingPy.