(1) Overview

Context

The Global Lexical Database (GLED) is a resource for computational historical linguistics encompassing a dataset of basic vocabulary for most known natural languages, with accompanying information on machine-detected cognates and phonological alignments, along with per-family and global phylogenetic resources. The latest release holds 262,859 entries for 6,572 doculects (documented language varieties, see ) in 344 families (Figure 1) and is available under the CC-BY licence. The database’s key component, a lexical dataset ultimately derived from the word lists of the Automated Similarity Judgement Program (ASJP), carries lemmas for between 30 and 40 comparative concepts for each doculect, all rendered with a broad phonetic transcription. The average concept coverage per doculect is 90.3%, and the average mutual pairwise coverage between doculects is 82.2%. Table 1 details the distribution of concept counts across doculects, and Table 2 lists the concepts along with their coverage.
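The two coverage figures above can be illustrated with a minimal sketch. It assumes that concept coverage is the share of the concept list attested in a doculect, and that mutual coverage is the share of the concept list attested in both members of a doculect pair; these definitions, the toy data, and the function names are assumptions for illustration and are not taken from the GLED pipeline.

```python
from itertools import combinations

# Toy data (hypothetical): doculect -> concepts with at least one entry.
attested = {
    "A": {"dog", "eye", "fire", "water"},
    "B": {"dog", "eye", "fire"},
    "C": {"dog", "water"},
}
concepts = {"dog", "eye", "fire", "water"}  # the full concept list


def concept_coverage(attested, concepts):
    """Mean share of the concept list attested in each doculect."""
    ratios = [len(found & concepts) / len(concepts) for found in attested.values()]
    return sum(ratios) / len(ratios)


def mutual_coverage(attested, concepts):
    """Mean share of the concept list attested in both doculects of each pair."""
    ratios = [
        len(attested[a] & attested[b] & concepts) / len(concepts)
        for a, b in combinations(sorted(attested), 2)
    ]
    return sum(ratios) / len(ratios)


average_coverage = concept_coverage(attested, concepts)
average_mutual = mutual_coverage(attested, concepts)
```

Mutual coverage is never higher than the coverage of the sparser doculect in a pair, which is why the second average is the lower of the two.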

Table 1

Number of doculects per number of concepts expressed in absolute and relative terms. Note that the number of entries for a doculect will be higher than the number of concepts in the case of synonyms.


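| NUMBER OF CONCEPTS | DOCULECTS | PERCENTAGE OF DOCULECTS |
|---|---|---|
| 30 | 330 | 5.0 |
| 31 | 306 | 4.7 |
| 32 | 361 | 5.5 |
| 33 | 401 | 6.1 |
| 34 | 595 | 9.1 |
| 35 | 627 | 9.5 |
| 36 | 786 | 12.0 |
| 37 | 605 | 9.2 |
| 38 | 627 | 9.5 |
| 39 | 736 | 11.2 |
| 40 | 1198 | 18.2 |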

Figure 1 

Location of the doculects included in the dataset, using information from Hammarström et al. (); colours are automatically assigned to differentiate language families.

Table 2

Absolute and relative doculect coverage per concept, along with the Concepticon mapping for each concept.


| CONCEPT GLOSS | DOCULECTS (RATIO) | CONCEPTICON NAME/ID |
|---|---|---|
| 1pl | 5265 (0.801) | WE/1212 |
| 1sg | 5379 (0.818) | I/1209 |
| 2sg | 5231 (0.795) | THOU/1215 |
| blood | 6426 (0.977) | BLOOD/946 |
| bone | 6351 (0.966) | BONE/1394 |
| breast | 5957 (0.906) | BREAST/1402 |
| come | 6130 (0.932) | COME/1446 |
| die | 6125 (0.931) | DIE/1494 |
| dog | 6430 (0.978) | DOG/2009 |
| drink | 6058 (0.921) | DRINK/1401 |
| ear | 6475 (0.985) | EAR/1247 |
| eye | 6494 (0.988) | EYE/1248 |
| fire | 6417 (0.976) | FIRE/221 |
| fish | 6226 (0.947) | FISH/227 |
| full | 4190 (0.637) | FULL/1429 |
| hand | 5693 (0.866) | HAND/1277 |
| hear | 5898 (0.897) | HEAR/1408 |
| horn | 4317 (0.656) | HORN (ANATOMY)/1393 |
| knee | 5357 (0.815) | KNEE/1371 |
| leaf | 6077 (0.924) | LEAF/628 |
| liver | 5454 (0.829) | LIVER/1224 |
| louse | 5711 (0.868) | LOUSE/1392 |
| mountain | 5321 (0.809) | MOUNTAIN/639 |
| name | 6042 (0.919) | NAME/1405 |
| new | 5711 (0.868) | NEW/1231 |
| night | 6289 (0.956) | NIGHT/1233 |
| nose | 6404 (0.974) | NOSE/1221 |
| one | 6296 (0.958) | ONE/1493 |
| path | 6151 (0.935) | PATH/2252 |
| person | 5552 (0.844) | PERSON/683 |
| see | 6104 (0.928) | SEE/1409 |
| skin | 6182 (0.940) | SKIN/763 |
| star | 6220 (0.946) | STAR/1430 |
| stone | 6290 (0.957) | STONE/857 |
| sun | 5877 (0.894) | SUN/1343 |
| tongue | 6430 (0.978) | TONGUE/1205 |
| tooth | 6399 (0.973) | TOOTH/1380 |
| tree | 5850 (0.890) | TREE/906 |
| two | 6285 (0.956) | TWO/1498 |
| water | 6413 (0.975) | WATER/948 |

The collection is not as accurate as alternative global (e.g., ) and family or areal resources (e.g., ), which merge different sources, offer broader concept coverage, and are manually curated for linguistic and data quality. Such alternatives should be preferred when they cover all the languages an investigation needs. Nonetheless, GLED constitutes a reliable and convenient source for probing language relationships, prototyping studies, and bootstrapping phylolinguistic analyses (). It is likewise designed to support the development of new methods for tasks in computational historical linguistics, including phonological alignment, cognate detection, and sound correspondence inference (). Finally, the language distances built into the database can be used for adjusted language sampling, as illustrated in Section 4.

(2) Method

The dataset provided by Jäger (), derived from ASJP (), was used as the lexical source, excluding doculects that did not fit the design (such as artificial languages, reconstructions, and duplicates). The original transcription system, “ASJPcode”, was mapped to a broad transcription consistent with CLTS/BIPA () through an orthographic profile (). This profile was based on the one produced by the author for including ASJP in the Lexibank project. Mapping decisions followed the non-exhaustive examples of phonological mapping and tokenization given in the original ASJP paper and the phonemic transcriptions of the ASJP word lists provided by other datasets.
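The idea of an orthographic profile can be sketched as a lookup table applied with greedy longest-match segmentation. The sketch below is an assumption-laden illustration: the `PROFILE` entries are a small hypothetical excerpt chosen for the example, not the profile shipped with GLED or Lexibank, and the function name `segment` is likewise invented here.

```python
# Hypothetical excerpt of a profile: ASJPcode grapheme -> broad IPA segment.
# These pairs are illustrative assumptions, not the GLED profile itself.
PROFILE = {
    "kw~": "kʷ",  # multi-character grapheme, must win over "k" + "w"
    "k": "k",
    "w": "w",
    "y": "j",
    "r": "r",
    "m": "m",
    "b": "b",
    "a": "ɐ",
    "3": "ə",
    "5": "ɲ",
    "7": "ʔ",
    "N": "ŋ",
    "S": "ʃ",
}


def segment(form, profile):
    """Tokenize `form` by greedy longest match against the profile keys."""
    longest = max(len(key) for key in profile)
    segments = []
    i = 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                segments.append(profile[chunk])
                i += size
                break
        else:
            # Unknown grapheme: flag it instead of guessing a mapping.
            segments.append(f"<{form[i]}>")
            i += 1
    return segments


tokens = segment("yawara", PROFILE)
```

Flagging unmapped graphemes, rather than dropping them, makes gaps in the profile visible when a new word list is processed.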

Per-family automatic cognate attribution was performed with LexStat () for small and medium families (i.e., fewer than 18,000 items) and with the SVM technique () for large ones. Phonological alignments of the resulting cognate sets were computed with LingPy (). Finally, the data was organized into a single tabular resource; entries were sorted by family, concept, language, and form, in that order (Table 3).
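The choice of method by family size and the final sort order can be sketched as follows. This is a minimal illustration, assuming that "items" means entries per family and that entries are dictionaries with hypothetical keys `family`, `concept`, `language`, and `form`; it does not call LexStat or the SVM method, and "ToyFamily" is an invented placeholder.

```python
from collections import Counter

LARGE_FAMILY_ITEMS = 18_000  # threshold stated in the text


def choose_method(entries, threshold=LARGE_FAMILY_ITEMS):
    """Map each family to 'lexstat' or 'svm' according to its number of items."""
    sizes = Counter(entry["family"] for entry in entries)
    return {
        family: "lexstat" if count < threshold else "svm"
        for family, count in sizes.items()
    }


def sort_entries(entries):
    """Sort by family, concept, language, and form, in that order."""
    return sorted(
        entries,
        key=lambda e: (e["family"], e["concept"], e["language"], e["form"]),
    )


# Two rows from Table 3 plus one invented row for a placeholder family.
entries = [
    {"family": "Tupian", "concept": "DOG", "language": "Kaiwá", "form": "jɐgwɐ"},
    {"family": "Tupian", "concept": "DOG", "language": "Aché", "form": "bɐegi"},
    {"family": "ToyFamily", "concept": "DOG", "language": "L1", "form": "kɐtɐ"},
]
ordered = sort_entries(entries)
methods = choose_method(entries)
```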

Table 3

A modified snippet from the lexical dataset, showing the most critical columns for a subset of Tupian words for the concept “dog”. The data includes a unique language name, a Glottocode (when available), the family name, a concept gloss derived from the Concepticon catalog, the phonological transcription of the word, the phonological alignment of the word in its cognate set (with hyphens indicating gaps), and a cognate set index.


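| LANGUAGE | CODE | FAMILY | CONCEPT | FORM | ALIGNMENT | COGSET |
|---|---|---|---|---|---|---|
| Aché | ache1246 | Tupian | DOG | bɐegi | b ɐ e g i | 16 |
| Amundava | amun1246 | Tupian | DOG | ɲɐɲwɐrɐ | ɲ ɐ ɲ w - ɐ r ɐ | 17 |
| Avá Canoeiro | avac1239 | Tupian | DOG | jɐwɐrɐ | j ɐ - w - ɐ r ɐ | 17 |
| Paraguayan Guaraní | para1311 | Tupian | DOG | dʒɐgwɐ | dʒ ɐ g w - ɐ - - | 17 |
| Kaiwá | kaiw1246 | Tupian | DOG | jɐgwɐ | j ɐ g w - ɐ - - | 17 |
| Eastern Bolivian Guaraní | east2555 | Tupian | DOG | jeimbɐ | j e - i m b ɐ | 19 |
| Tapieté | tapi1253 | Tupian | DOG | ɲɐʔəmbɐ | ɲ ɐ ʔ ə m b ɐ | 19 |
| Cinta Larga | cint1239 | Tupian | DOG | ɐwəli | ɐ w ə l i | 20 |
| Gavião Do Jiparaná | gavi1246 | Tupian | DOG | ɐvələ | ɐ v ə l ə | 20 |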

Per-family distance matrices based on the proportion of shared cognates were obtained from this dataset (Figure 2), and unrooted trees were constructed with the Neighbor-Joining method (). Models for inferring phylogenetic trees were produced with a patched version of BEASTling () and monophyletically constrained using Glottolog 4.6 (). Bayesian MCMC analyses were carried out with BEAST2 (), and summary Maximum Clade Credibility (MCC) trees were obtained with TreeAnnotator (). Finally, custom scripts were employed to normalize distances and join these trees, along with the language isolates, into a single unrooted tree (Figure 3). It must be stressed that this global tree is in no way proposed as support for “Proto-Human” hypotheses; it is merely a convenient resource for measuring language distance.
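A distance based on the proportion of shared cognates can be sketched as below. The exact formula used in the pipeline is not given in the text, so the sketch assumes one common formulation: over the concepts attested in both languages, the distance is one minus the share of concepts for which the two languages have at least one cognate set in common. The toy cognate IDs and the function name are hypothetical.

```python
from itertools import combinations

# Toy input (hypothetical): (language, concept) -> set of cognate set IDs.
cognates = {
    ("A", "dog"): {17}, ("A", "eye"): {3}, ("A", "fire"): {8},
    ("B", "dog"): {17}, ("B", "eye"): {3}, ("B", "fire"): {9},
    ("C", "dog"): {19}, ("C", "eye"): {3},
}


def shared_cognate_distance(cognates):
    """Return a symmetric {(a, b): distance} mapping for all language pairs."""
    languages = sorted({lang for lang, _ in cognates})
    matrix = {}
    for a, b in combinations(languages, 2):
        concepts_a = {concept for lang, concept in cognates if lang == a}
        concepts_b = {concept for lang, concept in cognates if lang == b}
        common = concepts_a & concepts_b
        if not common:
            distance = 1.0  # assumption: no comparable concepts means maximal distance
        else:
            shared = sum(
                1 for concept in common
                if cognates[(a, concept)] & cognates[(b, concept)]
            )
            distance = 1 - shared / len(common)
        matrix[(a, b)] = matrix[(b, a)] = distance
    return matrix


matrix = shared_cognate_distance(cognates)
```

A matrix of this kind is the usual input to a Neighbor-Joining implementation, which then yields an unrooted tree with branch lengths.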

Figure 2 

A neighbour-net for the Tupian languages in the dataset, plotted with SplitsTree v4 ().

Figure 3 

The “global” language tree from the combined Bayesian MCMC phylogenetic inferences, plotted with iTOL ().

The complete pipeline is available in the public GitHub repository at https://github.com/tresoldi/gled and takes approximately three days to run on a typical laptop (i5 processor, 8GB RAM, Fedora Linux 37). It will expedite forthcoming releases, which are planned to aggregate sources for languages missing from ASJP, such as recently documented isolates, and to employ alternative methods for computational tasks, such as new methods of cognate detection.

(3) Dataset Description

Object name

gled

Format names and versions

The dataset has the following components:

  • A TSV file (“gled.tsv”) with columns for (a) unique entry ID, (b) language ID (as provided in ASJP), (c) language name (provided by Glottolog, ASJP, or the author), (d) Glottocode when available, (e) Glottolog name when available, (f) family name, (g) concept gloss, (h) Concepticon ID (), (i) ASJP original form, (j) reconstructed form, (k) broad IPA transcription, (l) alignment, (m) cognate set ID, and (n) cognate set ID as an integer
  • A YAML file (“gled.resource.yaml”) with the metadata as per the FrictionlessData project
  • NEXUS files (“nexus/*.nex”) for families with more than one language
  • Distance matrices (“phylo/*.dst”) for families with more than one language, based on the percentage of shared cognates
  • NJ trees in Newick notation (“phylo/*.tree”) for families with more than one language, based on the corresponding distance matrix
  • Bayesian MCMC per-family (“trees/*.tree”) and global (“trees/global.tree”) trees in Newick notation
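Reading the tabular component can be sketched with the Python standard library. The header names below (`LANGUAGE`, `COGSET`, and so on) follow Table 3 and are assumptions; the actual column names in “gled.tsv” should be checked against the file or its metadata. The inline sample stands in for the real file so that the sketch is self-contained.

```python
import csv
import io
from collections import defaultdict

# Inline stand-in for "gled.tsv"; header names are assumed, rows are from Table 3.
SAMPLE = (
    "LANGUAGE\tCODE\tFAMILY\tCONCEPT\tFORM\tALIGNMENT\tCOGSET\n"
    "Aché\tache1246\tTupian\tDOG\tbɐegi\tb ɐ e g i\t16\n"
    "Amundava\tamun1246\tTupian\tDOG\tɲɐɲwɐrɐ\tɲ ɐ ɲ w - ɐ r ɐ\t17\n"
    "Kaiwá\tkaiw1246\tTupian\tDOG\tjɐgwɐ\tj ɐ g w - ɐ - -\t17\n"
    "Cinta Larga\tcint1239\tTupian\tDOG\tɐwəli\tɐ w ə l i\t20\n"
)


def load_cognate_sets(handle):
    """Group language names by (concept, cognate set) from a tab-separated file."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        groups[(row["CONCEPT"], row["COGSET"])].append(row["LANGUAGE"])
    return dict(groups)


cognate_sets = load_cognate_sets(io.StringIO(SAMPLE))
```

For the real file, `io.StringIO(SAMPLE)` would be replaced with something like `open("gled.tsv", encoding="utf-8", newline="")`, with the keys adjusted to the actual header.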

Language

English

Licence

CC-BY-4.0

Publication date

2022-11-27

(4) Reuse Potential

Provided that its limitations in extent and rigour, which arise from ASJP and are examined in Brown et al. () and Jäger (), are taken into account, the dataset provides many opportunities for reuse in empirical historical linguistics focused on lexical and phonetic data. Furthermore, as the doculects are linked to Glottolog, the data can be integrated with other global-level resources, such as the World Loanword Database (), the World Atlas of Language Structures (), and Phoible ().

The distance matrices and phylogenetic trees offer a convenient starting point for comparing the results of different and more advanced analyses, notably for under-studied and under-resourced language families for which no distance matrix or phylogenetic tree with branch lengths is available. Table 4 illustrates such distances, showing values from the trees inferred without (NJ) and with (B) a molecular clock. Such distances can be used to perform weighted random sampling at global, family, and sub-family levels, addressing issues such as sample bias and autocorrelation in cross-linguistic analyses.

Table 4

Distance between Swedish (swed1254) and other languages, as computed using the Neighbor-Joining trees (NJ, from zero to infinity), the Bayesian trees (B, from zero to 4.0), and the normalized Bayesian trees (NB, from zero to 1.0).


| LANGUAGE (GLOTTOCODE) | NJ | B | NB |
|---|---|---|---|
| Norwegian Bokmål (norw1259) | 0.21 | 0.11 | 0.02 |
| Danish (dani1285) | 0.24 | 0.02 | 0.01 |
| Dutch (dutc1256) | 0.41 | 1.40 | 0.35 |
| English (stan1293) | 0.42 | 1.40 | 0.35 |
| Italian (ital1282) | 0.84 | 1.60 | 0.40 |
| Hindi (hind1269) | 0.90 | 1.95 | 0.48 |
| Hittite (hitt1242) | 0.90 | 1.97 | 0.49 |
| Basque (basq1248) |  | 4.00 | 1.00 |
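Distance-weighted sampling of the kind mentioned above can be sketched with the Bayesian distances from Table 4. Two assumptions are made: normalization is taken to be division by the maximum distance of 4.0, as the ranges in the caption suggest (the published NB values may derive from unrounded distances and so need not match exactly), and the sampling weight of a language is simply its distance from the reference language. Neither choice is claimed to be the procedure used by the author.

```python
import random

# Bayesian-tree distances (B) from Swedish, copied from Table 4.
distances = {
    "norw1259": 0.11, "dani1285": 0.02, "dutc1256": 1.40, "stan1293": 1.40,
    "ital1282": 1.60, "hind1269": 1.95, "hitt1242": 1.97, "basq1248": 4.00,
}


def normalize(distances, maximum=4.0):
    """Rescale distances to the range 0..1 (assumed: division by the maximum)."""
    return {name: value / maximum for name, value in distances.items()}


def sample_by_distance(distances, k, seed=None):
    """Draw k distinct languages, favouring those far from the reference."""
    rng = random.Random(seed)
    pool = dict(distances)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        weights = [pool[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del pool[pick]  # sample without replacement
    return chosen


normalized = normalize(distances)
picks = sample_by_distance(normalized, k=3, seed=1)
```

Weighting by distance makes close relatives of the reference language, such as Danish here, unlikely to be drawn together with it, which is one simple way to reduce autocorrelation in a sample.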