A Global Lexical Database (GLED) for Computational Historical Linguistics

Tiago Tresoldi

(1) Overview

Repository location

Context

The Global Lexical Database (GLED) is a resource for computational historical linguistics encompassing a dataset of basic vocabulary for most known natural languages, with accompanying information on machine-detected cognates and phonological alignments, along with per-family and global phylogenetic resources. The latest release holds 262,859 entries for 6,572 doculects (documented language varieties) in 344 families (Figure 1) and is available under the CC-BY licence. The database’s key component, a lexical dataset ultimately derived from the word lists of the Automated Similarity Judgement Program (ASJP), carries lemmas for between 30 and 40 comparative concepts for each doculect, all rendered with a broad phonetic transcription. The average concept coverage per doculect is 90.3%, and the average mutual pairwise coverage between doculects is 82.2%. Table 1 details the distribution of concept counts across doculects, and Table 2 lists the concepts along with their coverage.
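The two coverage figures can be computed from any table of doculect and concept pairs. The following is a minimal sketch, assuming that concept coverage is the share of the concept list attested in a doculect and that mutual coverage between two doculects is the share of the concept list attested in both. These definitions, the toy data, and the function names are illustrative assumptions rather than part of GLED.

```python
from itertools import combinations


def concept_coverage(concepts_by_doculect, n_concepts):
    """Share of the concept list attested in each doculect."""
    return {d: len(c) / n_concepts for d, c in concepts_by_doculect.items()}


def mean_mutual_coverage(concepts_by_doculect, n_concepts):
    """Average, over all doculect pairs, of the share of concepts attested in both."""
    pairs = list(combinations(concepts_by_doculect.values(), 2))
    return sum(len(a & b) for a, b in pairs) / (len(pairs) * n_concepts)


# Toy data (hypothetical): three doculects and a four-concept list.
toy = {
    "A": {"dog", "eye", "fire", "water"},
    "B": {"dog", "eye", "fire"},
    "C": {"dog", "water"},
}
coverage = concept_coverage(toy, n_concepts=4)
mean_coverage = sum(coverage.values()) / len(coverage)
mutual = mean_mutual_coverage(toy, n_concepts=4)
print(f"{mean_coverage:.3f} {mutual:.3f}")  # 0.750 0.500
```

For the real data, the sets would be built from the concept column of the lexical table and `n_concepts` would be 40.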

Table 1

Number of doculects per number of concepts, expressed in absolute and relative terms. Note that a doculect has more entries than concepts when synonyms are recorded.


NUMBER OF CONCEPTS	DOCULECTS	PERCENTAGE OF DOCULECTS

30	330	5.0

31	306	4.7

32	361	5.5

33	401	6.1

34	595	9.1

35	627	9.5

36	786	12.0

37	605	9.2

38	627	9.5

39	736	11.2

40	1198	18.2

Figure 1

Location of the doculects included in the dataset, using information from Hammarström et al.; colours are automatically assigned to differentiate language families.

Table 2

Absolute and relative doculect coverage per concept, along with the Concepticon mapping for each concept.


CONCEPT GLOSS	DOCULECTS (RATIO)	CONCEPTICON NAME/ID

1pl	5265 (0.801)	WE/1212

1sg	5379 (0.818)	I/1209

2sg	5231 (0.795)	THOU/1215

blood	6426 (0.977)	BLOOD/946

bone	6351 (0.966)	BONE/1394

breast	5957 (0.906)	BREAST/1402

come	6130 (0.932)	COME/1446

die	6125 (0.931)	DIE/1494

dog	6430 (0.978)	DOG/2009

drink	6058 (0.921)	DRINK/1401

ear	6475 (0.985)	EAR/1247

eye	6494 (0.988)	EYE/1248

fire	6417 (0.976)	FIRE/221

fish	6226 (0.947)	FISH/227

full	4190 (0.637)	FULL/1429

hand	5693 (0.866)	HAND/1277

hear	5898 (0.897)	HEAR/1408

horn	4317 (0.656)	HORN (ANATOMY)/1393

knee	5357 (0.815)	KNEE/1371

leaf	6077 (0.924)	LEAF/628

liver	5454 (0.829)	LIVER/1224

louse	5711 (0.868)	LOUSE/1392

mountain	5321 (0.809)	MOUNTAIN/639

name	6042 (0.919)	NAME/1405

new	5711 (0.868)	NEW/1231

night	6289 (0.956)	NIGHT/1233

nose	6404 (0.974)	NOSE/1221

one	6296 (0.958)	ONE/1493

path	6151 (0.935)	PATH/2252

person	5552 (0.844)	PERSON/683

see	6104 (0.928)	SEE/1409

skin	6182 (0.940)	SKIN/763

star	6220 (0.946)	STAR/1430

stone	6290 (0.957)	STONE/857

sun	5877 (0.894)	SUN/1343

tongue	6430 (0.978)	TONGUE/1205

tooth	6399 (0.973)	TOOTH/1380

tree	5850 (0.890)	TREE/906

two	6285 (0.956)	TWO/1498

water	6413 (0.975)	WATER/948

The collection is not as accurate as alternative global resources or family and areal resources, which merge different sources, offer greater concept coverage, and are manually curated for linguistic and data quality. Such alternatives should be favoured when they cover all the languages an investigation needs. Nonetheless, GLED constitutes a reliable and convenient source for probing language relationships, prototyping studies, and bootstrapping phylolinguistic analyses. It is likewise designed to support the development of new methods for tasks in computational historical linguistics, including phonological alignment, cognate detection, and sound correspondence inference. Finally, the language distances built into the database can be used for adjusted language sampling, as illustrated in Section 4.

(2) Method

The dataset provided by Jäger, derived from ASJP, was used as the lexical source, excluding doculects that did not fit the design (such as artificial languages, reconstructions, and duplicates). The original transcription system, “ASJPcode”, was mapped to a broad transcription consistent with CLTS/BIPA through an orthographic profile. This profile was based on the one produced by the author for including ASJP in the Lexibank project. Mapping decisions followed the non-exhaustive examples of phonological mapping and tokenization given in the original ASJP paper and the phonemic transcriptions of the ASJP word lists provided by other datasets.
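The idea of an orthographic profile can be illustrated with a greedy longest-match segmenter: the source string is scanned from left to right, and at each position the longest grapheme listed in the profile is replaced by its target sound. This is only a sketch of the general technique. The profile fragment below is a hypothetical toy, not the mapping used for GLED, and the function name is illustrative.

```python
def tokenize(form, profile):
    """Greedy longest-match segmentation of `form` using `profile` (grapheme -> target)."""
    longest = max(len(g) for g in profile)
    out, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                out.append(profile[chunk])
                i += size
                break
        else:
            # Unknown grapheme: flag it rather than guess a sound.
            out.append(f"<{form[i]}>")
            i += 1
    return out


# Hypothetical fragment of a profile; NOT the mapping used for GLED.
TOY_PROFILE = {"a": "a", "i": "i", "k": "k", "kw": "kʷ", "n": "n", "N": "ŋ", "7": "ʔ"}

print(tokenize("kwaN7i", TOY_PROFILE))  # ['kʷ', 'a', 'ŋ', 'ʔ', 'i']
```

Flagging unmapped graphemes instead of dropping them makes gaps in the profile visible during conversion.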

Per-family automatic cognate attribution was performed with LexStat for small and medium families (i.e., fewer than 18,000 items) and with the SVM technique for large ones. Phonological alignments of the resulting cognate sets were computed with LingPy. Finally, the data was organized into a single tabular resource, with entries sorted, in order, by family, concept, language, and form (Table 3).

Table 3

A modified snippet from the lexical dataset, showing the most critical columns for a subset of Tupian words for the concept “dog”. The data includes a unique language name, a Glottocode (when available), the family name, a concept gloss derived from the Concepticon catalog, the phonological transcription of the word, the phonological alignment of the word in its cognate set (with hyphens indicating gaps), and a cognate set index.


LANGUAGE	CODE	FAMILY	CONCEPT	FORM	ALIGNMENT	COGSET

Aché	ache1246	Tupian	DOG	bɐegi	b ɐ e g i	16

Amundava	amun1246	Tupian	DOG	ɲɐɲwɐrɐ	ɲ ɐ ɲ w - ɐ r ɐ	17

Avá Canoeiro	avac1239	Tupian	DOG	jɐwɐrɐ	j ɐ - w - ɐ r ɐ	17

Paraguayan Guaraní	para1311	Tupian	DOG	dʒɐgwɐ	dʒ ɐ g w - ɐ - -	17

Kaiwá	kaiw1246	Tupian	DOG	jɐgwɐ	j ɐ g w - ɐ - -	17

Eastern Bolivian Guaraní	east2555	Tupian	DOG	jeimbɐ	j e - i m b ɐ	19

Tapieté	tapi1253	Tupian	DOG	ɲɐʔəmbɐ	ɲ ɐ ʔ ə m b ɐ	19

Cinta Larga	cint1239	Tupian	DOG	ɐwəli	ɐ w ə l i	20

Gavião Do Jiparaná	gavi1246	Tupian	DOG	ɐvələ	ɐ v ə l ə	20
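The alignment column can be read as a matrix: within a cognate set, every aligned form has the same number of slots, and each column groups the segments that correspond to one another. The sketch below uses rows copied from Table 3 to check this property and to extract the columns. The function name and data layout are illustrative assumptions.

```python
from collections import defaultdict

# Rows copied from Table 3: (language, alignment, cognate set index).
ROWS = [
    ("Amundava", "ɲ ɐ ɲ w - ɐ r ɐ", 17),
    ("Avá Canoeiro", "j ɐ - w - ɐ r ɐ", 17),
    ("Paraguayan Guaraní", "dʒ ɐ g w - ɐ - -", 17),
    ("Kaiwá", "j ɐ g w - ɐ - -", 17),
    ("Cinta Larga", "ɐ w ə l i", 20),
    ("Gavião Do Jiparaná", "ɐ v ə l ə", 20),
]


def alignment_columns(rows):
    """Group alignments by cognate set and return their columns."""
    by_set = defaultdict(list)
    for _language, alignment, cogset in rows:
        by_set[cogset].append(alignment.split())
    columns = {}
    for cogset, matrix in by_set.items():
        # All aligned forms in a cognate set must have the same number of slots.
        if len({len(row) for row in matrix}) != 1:
            raise ValueError(f"ragged alignment in cognate set {cogset}")
        columns[cogset] = list(zip(*matrix))
    return columns


columns = alignment_columns(ROWS)
print(columns[17][0])  # ('ɲ', 'j', 'dʒ', 'j')
```

Columns of this kind are the usual starting point for sound correspondence inference, since each one lists the segments that occupy the same position across related forms.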

Per-family distance matrices based on the proportion of shared cognates were obtained from this dataset (Figure 2), and unrooted trees were constructed with the Neighbor-Joining method. Models for inferring phylogenetic trees were produced with a patched version of BEASTling and monophyletically constrained using Glottolog 4.6. Bayesian MCMC analyses were carried out with BEAST2, and summary Maximum Clade Credibility (MCC) trees were obtained with TreeAnnotator. Finally, custom scripts were employed to normalize distances and join these trees, along with the language isolates, into a single unrooted tree (Figure 3). It must be stressed that this global tree is in no way proposed as support for “Proto-Human” hypotheses; it is merely a convenient resource for measuring language distance.
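A distance based on the proportion of shared cognates can be sketched as follows. The exact formula used in the pipeline is not given here, so this example makes two assumptions: only concepts attested in both languages are compared, and a pair with no comparable concept receives the maximum distance of 1.0. The toy data and function names are hypothetical.

```python
from itertools import combinations


def cognate_distance(cogsets_a, cogsets_b):
    """1 minus the proportion of shared concepts whose cognate sets overlap."""
    shared = [c for c in cogsets_a if c in cogsets_b]
    if not shared:
        return 1.0  # assumption: no comparable concept counts as maximally distant
    matches = sum(1 for c in shared if cogsets_a[c] & cogsets_b[c])
    return 1.0 - matches / len(shared)


def distance_matrix(data):
    """Return {(lang_a, lang_b): distance} for every unordered pair of languages."""
    return {
        (a, b): cognate_distance(data[a], data[b])
        for a, b in combinations(sorted(data), 2)
    }


# Toy data (hypothetical): language -> concept -> set of cognate set IDs.
toy = {
    "L1": {"dog": {17}, "eye": {3}, "water": {8}},
    "L2": {"dog": {17}, "eye": {3}, "water": {9}},
    "L3": {"dog": {20}, "eye": {4}},
}
matrix = distance_matrix(toy)
print(round(matrix[("L1", "L2")], 3))  # 0.333
```

Storing cognate set IDs as sets allows for synonyms, where a language has more than one form, and hence possibly more than one cognate set, for a concept. A matrix of this kind can then be passed to a Neighbor-Joining implementation from a phylogenetics library.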

Figure 2

A neighbour-net for the Tupian languages in the dataset, plotted with SplitsTree v4.

Figure 3

The “global” language tree from the combined Bayesian MCMC phylogenetic inferences, plotted with iTOL.

The complete pipeline is accessible via the public GitHub repository at https://github.com/tresoldi/gled and takes approximately three days to run on a typical laptop (i5 processor, 8GB RAM, Fedora Linux 37). It will expedite planned releases that aggregate sources for languages missing from ASJP, such as recently documented isolates, and that employ alternative methods for computational tasks, such as new methods of cognate detection.

(3) Dataset Description

Object name

gled

Format names and versions

The dataset has the following components:

– A TSV file (“gled.tsv”) with columns for (a) unique entry ID, (b) language ID (as provided in ASJP), (c) language name (provided by Glottolog, ASJP, or the author), (d) Glottocode when available, (e) Glottolog name when available, (f) family name, (g) concept gloss, (h) Concepticon ID, (i) ASJP original form, (j) reconstructed form, (k) broad IPA transcription, (l) alignment, (m) cognate set ID, and (n) cognate set ID as an integer
– A YAML file (“gled.resource.yaml”) with the metadata as per the FrictionlessData project
– NEXUS files (“nexus/*.nex”) for families with more than one language
– Distance Matrices (“phylo/*.dst”) for families with more than one language, based on the percentage of shared cognates
– NJ trees in Newick notation (“phylo/*.tree”) for families with more than one language, based on the corresponding distance matrix
– Bayesian MCMC per-family (“trees/*.tree”) and global (“trees/global.tree”) trees in Newick notation
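The tabular component can be read with any TSV parser. The sketch below uses Python's standard `csv` module to count rows per value of a chosen column. The header names in the inline sample are hypothetical stand-ins, since the actual column names of “gled.tsv” should be taken from the file header or from “gled.resource.yaml”. The sample rows reuse forms from Table 3.

```python
import csv
import io
from collections import Counter


def count_by(handle, column):
    """Count rows per value of `column` in a tab-separated file opened as `handle`."""
    reader = csv.DictReader(handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# Inline stand-in for "gled.tsv"; the header names here are hypothetical.
SAMPLE = (
    "LANGUAGE\tFAMILY\tCONCEPT\tFORM\tCOGSET\n"
    "Aché\tTupian\tDOG\tbɐegi\t16\n"
    "Avá Canoeiro\tTupian\tDOG\tjɐwɐrɐ\t17\n"
    "Kaiwá\tTupian\tDOG\tjɐgwɐ\t17\n"
)

counts = count_by(io.StringIO(SAMPLE), "COGSET")
print(counts["17"])  # 2
```

With the real file, the handle would come from `open("gled.tsv", encoding="utf-8", newline="")`, and the column name would be replaced by the one found in the header.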

Language

English

Licence

CC-BY-4.0

Publication date

2022-11-27

(4) Reuse Potential

Provided that its limitations in extent and rigour, which arise from ASJP and are examined in Brown et al. and Jäger, are taken into account, the dataset offers many opportunities for reuse in empirical historical linguistics focused on lexical and phonetic data. Furthermore, as the doculects are linked to Glottolog, the data can be integrated with other global-level resources, such as the World Loanword Database, the World Atlas of Language Structures, and Phoible.

The distance matrices and phylogenetic trees offer a convenient starting point for comparing the results of different and more advanced analyses, notably for under-studied and under-resourced language families for which no distance matrix or phylogenetic tree with branch lengths is available. Table 4 illustrates such distances, showing values from the trees inferred without (NJ) and with (B) a molecular clock. Such distances can be used to perform weighted random sampling at global, family, and sub-family levels, addressing issues such as sample bias and autocorrelation in cross-linguistic analyses.
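One simple form of distance-weighted sampling is sketched below: starting from a reference language, other languages are drawn without replacement, each with probability proportional to its distance from the reference, so that close relatives are less likely to be picked. This is an illustrative scheme under stated assumptions and not necessarily the procedure intended by the author. The function name and the seed are hypothetical, while the distances are the normalized Bayesian (NB) values listed in Table 4.

```python
import random


def sample_by_distance(distances, k, seed=0):
    """Draw up to k distinct languages, each with probability proportional to its distance."""
    rng = random.Random(seed)
    pool = dict(distances)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[n] for n in names]
        if sum(weights) == 0:
            break  # only zero-distance languages remain
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]
    return chosen


# Normalized Bayesian (NB) distances from Swedish, as listed in Table 4.
NB_FROM_SWEDISH = {
    "norw1259": 0.02, "dani1285": 0.01, "dutc1256": 0.35, "stan1293": 0.35,
    "ital1282": 0.40, "hind1269": 0.48, "hitt1242": 0.49, "basq1248": 1.00,
}
sample = sample_by_distance(NB_FROM_SWEDISH, k=3)
print(sample)
```

A fuller scheme would update the weights after each draw, for instance by using the minimum distance to all languages already selected, which requires the complete pairwise matrix rather than distances from a single language.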

Table 4

Distance between Swedish (swed1254) and other languages, as computed using the Neighbor-Joining trees (NJ, from zero to infinity), the Bayesian trees (B, from zero to 4.0), and the normalized Bayesian trees (NB, from zero to 1.0).


LANGUAGE (GLOTTOCODE)	NJ	B	NB

Norwegian Bokmål (norw1259)	0.21	0.11	0.02

Danish (dani1285)	0.24	0.02	0.01

Dutch (dutc1256)	0.41	1.40	0.35

English (stan1293)	0.42	1.40	0.35

Italian (ital1282)	0.84	1.60	0.40

Hindi (hind1269)	0.90	1.95	0.48

Hittite (hitt1242)	0.90	1.97	0.49

Basque (basq1248)	∞	4.00	1.00

Journal of Open Humanities Data

Data Papers