(1) Overview
Repository location
Context
The Global Lexical Database (GLED) is a resource for computational historical linguistics encompassing a dataset of basic vocabulary for most known natural languages, with accompanying information on machine-detected cognates and phonological alignments, along with per-family and global phylogenetic resources. The latest release holds 262,859 entries for 6,572 doculects (documented language varieties, see ) in 344 families (Figure 1) and is available under the CC-BY licence. The database’s key component, a lexical dataset ultimately derived from the word lists of the Automated Similarity Judgement Program (ASJP), carries lemmas for between 30 and 40 comparative concepts for each doculect, all rendered with a broad phonetic transcription. The average concept coverage per doculect is 90.3%, and the average mutual pairwise coverage between doculects is 82.2%. Table 1 details the distribution of concept counts across doculects, and Table 2 lists the concepts along with their coverage.
NUMBER OF CONCEPTS | DOCULECTS | PERCENTAGE OF DOCULECTS |
---|---|---|
30 | 330 | 5.0 |
31 | 306 | 4.7 |
32 | 361 | 5.5 |
33 | 401 | 6.1 |
34 | 595 | 9.1 |
35 | 627 | 9.5 |
36 | 786 | 12.0 |
37 | 605 | 9.2 |
38 | 627 | 9.5 |
39 | 736 | 11.2 |
40 | 1198 | 18.2 |
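The average concept coverage quoted above can be recomputed from the distribution in Table 1. The short sketch below does so; the variable and function names are illustrative and are not part of the GLED pipeline.

```python
# Table 1: number of concepts attested -> number of doculects.
CONCEPT_COUNTS = {
    30: 330, 31: 306, 32: 361, 33: 401, 34: 595, 35: 627,
    36: 786, 37: 605, 38: 627, 39: 736, 40: 1198,
}
TOTAL_CONCEPTS = 40  # size of the concept list (Table 2)


def mean_coverage(counts: dict[int, int], total_concepts: int) -> float:
    """Average share of the concept list attested per doculect."""
    doculects = sum(counts.values())
    attested = sum(n * k for n, k in counts.items())
    return attested / (doculects * total_concepts)


coverage = mean_coverage(CONCEPT_COUNTS, TOTAL_CONCEPTS)
print(f"{sum(CONCEPT_COUNTS.values())} doculects, mean coverage {coverage:.1%}")
# prints: 6572 doculects, mean coverage 90.3%
```

The doculect counts in Table 2 add up to the same number of attested doculect-concept pairs (237,477), so the two tables agree.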
CONCEPT GLOSS | DOCULECTS (RATIO) | CONCEPTICON NAME/ID |
---|---|---|
1pl | 5265 (0.801) | WE/1212 |
1sg | 5379 (0.818) | I/1209 |
2sg | 5231 (0.795) | THOU/1215 |
blood | 6426 (0.977) | BLOOD/946 |
bone | 6351 (0.966) | BONE/1394 |
breast | 5957 (0.906) | BREAST/1402 |
come | 6130 (0.932) | COME/1446 |
die | 6125 (0.931) | DIE/1494 |
dog | 6430 (0.978) | DOG/2009 |
drink | 6058 (0.921) | DRINK/1401 |
ear | 6475 (0.985) | EAR/1247 |
eye | 6494 (0.988) | EYE/1248 |
fire | 6417 (0.976) | FIRE/221 |
fish | 6226 (0.947) | FISH/227 |
full | 4190 (0.637) | FULL/1429 |
hand | 5693 (0.866) | HAND/1277 |
hear | 5898 (0.897) | HEAR/1408 |
horn | 4317 (0.656) | HORN (ANATOMY)/1393 |
knee | 5357 (0.815) | KNEE/1371 |
leaf | 6077 (0.924) | LEAF/628 |
liver | 5454 (0.829) | LIVER/1224 |
louse | 5711 (0.868) | LOUSE/1392 |
mountain | 5321 (0.809) | MOUNTAIN/639 |
name | 6042 (0.919) | NAME/1405 |
new | 5711 (0.868) | NEW/1231 |
night | 6289 (0.956) | NIGHT/1233 |
nose | 6404 (0.974) | NOSE/1221 |
one | 6296 (0.958) | ONE/1493 |
path | 6151 (0.935) | PATH/2252 |
person | 5552 (0.844) | PERSON/683 |
see | 6104 (0.928) | SEE/1409 |
skin | 6182 (0.940) | SKIN/763 |
star | 6220 (0.946) | STAR/1430 |
stone | 6290 (0.957) | STONE/857 |
sun | 5877 (0.894) | SUN/1343 |
tongue | 6430 (0.978) | TONGUE/1205 |
tooth | 6399 (0.973) | TOOTH/1380 |
tree | 5850 (0.890) | TREE/906 |
two | 6285 (0.956) | TWO/1498 |
water | 6413 (0.975) | WATER/948 |
The collection is less accurate than alternative global resources (e.g., ) and family-level or areal resources (e.g., ), which merge different sources, offer broader concept coverage, and are manually curated for linguistic and data quality. Such alternatives should be preferred when they cover all the languages an investigation needs. Nonetheless, GLED is a reliable and convenient source for probing language relationships, prototyping studies, and bootstrapping phylolinguistic analyses (). It is also designed to support the development of new methods for tasks in computational historical linguistics, including phonological alignment, cognate detection, and sound correspondence inference (). Finally, the language distances included in the database can be used for adjusted language sampling, as illustrated in Section 4.
(2) Method
The dataset provided by Jäger (), derived from ASJP (), was used as the lexical source, excluding doculects that did not fit the design (such as artificial languages, reconstructions, and duplicates). The original transcription system, “ASJPcode”, was mapped to a broad transcription consistent with CLTS/BIPA () through an orthographic profile (). This profile was based on the one the author produced for including ASJP in the Lexibank project. Mapping decisions followed the non-exhaustive examples of phonological mapping and tokenization given in the original ASJP paper, as well as the phonemic transcriptions of ASJP word lists provided by other datasets.
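As a rough illustration of how an orthographic profile works, the sketch below segments a form by greedy longest match and maps each grapheme to a broad IPA symbol. The profile is a toy subset written for this example, and `apply_profile` is a hypothetical helper; neither reproduces the actual GLED profile or the tooling used to apply it.

```python
# Toy, hypothetical profile: a handful of grapheme -> broad IPA mappings.
# The real GLED profile is far larger and may differ in every detail.
TOY_PROFILE = {
    "kw~": "kʷ",  # multi-character grapheme, must win over "k" + "w"
    "k": "k",
    "w": "w",
    "N": "ŋ",
    "S": "ʃ",
    "7": "ʔ",
    "a": "a",
    "i": "i",
}


def apply_profile(form: str, profile: dict[str, str]) -> list[str]:
    """Segment `form` by greedy longest match and map each grapheme."""
    longest = max(len(g) for g in profile)
    segments = []
    pos = 0
    while pos < len(form):
        for size in range(min(longest, len(form) - pos), 0, -1):
            grapheme = form[pos:pos + size]
            if grapheme in profile:
                segments.append(profile[grapheme])
                pos += size
                break
        else:
            # Unknown character: flag it instead of guessing a sound.
            segments.append("<?>")
            pos += 1
    return segments


print(" ".join(apply_profile("kw~aNi", TOY_PROFILE)))
# prints: kʷ a ŋ i
```

Trying the longest grapheme first keeps multi-character sequences together, and flagging unknown characters makes gaps in the profile visible instead of silently dropping material.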
Per-family automatic cognate detection was performed with LexStat () for small and medium families (i.e., fewer than 18,000 items) and with the SVM technique () for large ones. Phonological alignments of the resulting cognate sets were computed with LingPy (). Finally, the data were organized into a single tabular resource, with entries sorted, in order, by family, concept, language, and form (Table 3).
LANGUAGE | CODE | FAMILY | CONCEPT | FORM | ALIGNMENT | COGSET |
---|---|---|---|---|---|---|
Aché | ache1246 | Tupian | DOG | bɐegi | b ɐ e g i | 16 |
Amundava | amun1246 | Tupian | DOG | ɲɐɲwɐrɐ | ɲ ɐ ɲ w - ɐ r ɐ | 17 |
Avá Canoeiro | avac1239 | Tupian | DOG | jɐwɐrɐ | j ɐ - w - ɐ r ɐ | 17 |
Paraguayan Guaraní | para1311 | Tupian | DOG | dʒɐgwɐ | dʒ ɐ g w - ɐ - - | 17 |
Kaiwá | kaiw1246 | Tupian | DOG | jɐgwɐ | j ɐ g w - ɐ - - | 17 |
Eastern Bolivian Guaraní | east2555 | Tupian | DOG | jeimbɐ | j e - i m b ɐ | 19 |
Tapieté | tapi1253 | Tupian | DOG | ɲɐʔəmbɐ | ɲ ɐ ʔ ə m b ɐ | 19 |
Cinta Larga | cint1239 | Tupian | DOG | ɐwəli | ɐ w ə l i | 20 |
Gavião Do Jiparaná | gavi1246 | Tupian | DOG | ɐvələ | ɐ v ə l ə | 20 |
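In an aligned cognate set, every row should have the same number of columns, with gaps ("-") filling the empty positions. The sketch below checks this property on the rows of Table 3; the function name is illustrative and the check is not taken from the GLED pipeline.

```python
from collections import defaultdict

# Rows of Table 3 as (language, alignment, cognate set).
ROWS = [
    ("Aché", "b ɐ e g i", 16),
    ("Amundava", "ɲ ɐ ɲ w - ɐ r ɐ", 17),
    ("Avá Canoeiro", "j ɐ - w - ɐ r ɐ", 17),
    ("Paraguayan Guaraní", "dʒ ɐ g w - ɐ - -", 17),
    ("Kaiwá", "j ɐ g w - ɐ - -", 17),
    ("Eastern Bolivian Guaraní", "j e - i m b ɐ", 19),
    ("Tapieté", "ɲ ɐ ʔ ə m b ɐ", 19),
    ("Cinta Larga", "ɐ w ə l i", 20),
    ("Gavião Do Jiparaná", "ɐ v ə l ə", 20),
]


def alignment_widths(rows):
    """Map each cognate set to the set of alignment lengths found in it."""
    widths = defaultdict(set)
    for _language, alignment, cogset in rows:
        # Segments are space-separated; "dʒ" counts as one segment.
        widths[cogset].add(len(alignment.split()))
    return dict(widths)


widths = alignment_widths(ROWS)
for cogset, found in sorted(widths.items()):
    status = "ok" if len(found) == 1 else "inconsistent"
    print(cogset, sorted(found), status)
```

Each cognate set in the excerpt has a single width (5, 8, 7, and 5 columns for sets 16, 17, 19, and 20), so all four are reported as consistent.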
Per-family distance matrices based on the proportion of shared cognates were obtained from this dataset (Figure 2), and unrooted trees were constructed with the Neighbor-Joining method (). Models for inferring phylogenetic trees were generated with a patched version of BEASTling () and constrained to be monophyletic according to Glottolog 4.6 (). Bayesian MCMC analyses were carried out with BEAST2 (), and summary Maximum Clade Credibility (MCC) trees were obtained with TreeAnnotator (). Finally, custom scripts were used to normalize distances and join these trees, along with the language isolates, into a single unrooted tree (Figure 3). It must be stressed that this global tree is in no way proposed as support for “Proto-Human” hypotheses; it is merely a convenient resource for measuring language distance.
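One way to read "proportion of shared cognates" is sketched below: the distance between two doculects is one minus the share of mutually attested concepts for which they have at least one cognate set in common. This formula, the handling of synonyms and missing data, and the toy data are all assumptions made for illustration; the GLED scripts may compute distances differently.

```python
from itertools import combinations

# Hypothetical toy data: doculect -> {concept: set of cognate-set IDs}.
# A set is used because a doculect may list synonyms for one concept.
COGNATES = {
    "A": {"DOG": {17}, "FIRE": {3}, "WATER": {8}},
    "B": {"DOG": {17}, "FIRE": {4}, "WATER": {8}},
    "C": {"DOG": {19}, "FIRE": {4}},
}


def cognate_distance(a, b):
    """1 - share of mutually attested concepts with a common cognate set."""
    shared_concepts = a.keys() & b.keys()
    if not shared_concepts:
        return None  # no overlap, so the distance is undefined
    matches = sum(1 for c in shared_concepts if a[c] & b[c])
    return 1 - matches / len(shared_concepts)


def distance_matrix(data):
    """Upper triangle of the matrix, keyed by sorted doculect pairs."""
    return {
        (x, y): cognate_distance(data[x], data[y])
        for x, y in combinations(sorted(data), 2)
    }


for pair, dist in distance_matrix(COGNATES).items():
    print(pair, None if dist is None else round(dist, 2))
# prints: ('A', 'B') 0.33, ('A', 'C') 1.0, ('B', 'C') 0.5 (one pair per line)
```

Restricting the comparison to mutually attested concepts keeps missing data from inflating the distance, which matters because concept coverage varies between doculects.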
The complete pipeline is available in the public GitHub repository at https://github.com/tresoldi/gled and takes approximately three days to run on a typical laptop (i5 processor, 8 GB RAM, Fedora Linux 37). It will facilitate the planned forthcoming releases, which will aggregate sources for languages missing from ASJP, such as recently documented isolates, and employ alternative methods for computational tasks, such as new methods of cognate detection.
(3) Dataset Description
Object name
gled
Format names and versions
The dataset has the following components:
- A TSV file (“gled.tsv”) with columns for (a) unique entry ID, (b) language ID (as provided in ASJP), (c) language name (provided by Glottolog, ASJP, or the author), (d) Glottocode when available, (e) Glottolog name when available, (f) family name, (g) concept gloss, (h) Concepticon ID (), (i) ASJP original form, (j) reconstructed form, (k) broad IPA transcription, (l) alignment, (m) cognate set ID, and (n) cognate set ID as an integer
- A YAML file (“gled.resource.yaml”) with the metadata as per the FrictionlessData project
- NEXUS files (“nexus/*.nex”) for families with more than one language
- Distance matrices (“phylo/*.dst”) for families with more than one language, based on the percentage of shared cognates
- NJ trees in Newick notation (“phylo/*.tree”) for families with more than one language, based on the corresponding distance matrix
- Bayesian MCMC per-family (“trees/*.tree”) and global (“trees/global.tree”) trees in Newick notation
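The TSV file can be read with standard tooling. The sketch below groups forms by concept and doculect using Python's `csv` module; the column names and the two-row sample are placeholders chosen for this example, since the actual header of “gled.tsv” is not reproduced here.

```python
import csv
import io
from collections import defaultdict

# Hypothetical two-row excerpt; the column names are placeholders and
# should be replaced by the actual header of "gled.tsv".
SAMPLE = (
    "ID\tGLOTTOCODE\tFAMILY\tCONCEPT\tFORM\tALIGNMENT\tCOGSET\n"
    "1\tache1246\tTupian\tDOG\tbɐegi\tb ɐ e g i\t16\n"
    "2\tkaiw1246\tTupian\tDOG\tjɐgwɐ\tj ɐ g w - ɐ - -\t17\n"
)


def read_entries(handle):
    """Yield one dict per entry from a tab-separated file object."""
    yield from csv.DictReader(handle, delimiter="\t")


def forms_by_concept(entries):
    """Group forms as {concept: {glottocode: [forms]}}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in entries:
        grouped[row["CONCEPT"]][row["GLOTTOCODE"]].append(row["FORM"])
    return grouped


grouped = forms_by_concept(read_entries(io.StringIO(SAMPLE)))
print(dict(grouped["DOG"]))
```

To process the released file, the in-memory sample can be replaced with `open("gled.tsv", encoding="utf-8", newline="")`, after adjusting the column names to match the file.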
Language
English
Licence
CC-BY-4.0
Publication date
2022-11-27
(4) Reuse Potential
Provided that its limitations in scope and rigour, which are inherited from ASJP and examined in Brown et al. () and Jäger (), are taken into account, the dataset offers many opportunities for reuse in empirical historical linguistics focused on lexical and phonetic data. Furthermore, as the doculects are linked to Glottolog, the data can be integrated with other global-level resources, such as the World Loanword Database (), the World Atlas of Language Structures (), and Phoible ().
The distance matrices and phylogenetic trees offer a convenient starting point for comparing the results of different and more advanced analyses, notably for under-studied and under-resourced language families for which no distance matrix or phylogenetic tree with branch lengths is available. Table 4 illustrates such distances, showing values from the trees inferred without (NJ) and with (B) a molecular clock. Such distances can be used to perform weighted random sampling at the global, family, and sub-family levels, addressing issues such as sample bias and autocorrelation in cross-linguistic analyses.
LANGUAGE (GLOTTOCODE) | NJ | B | NB |
---|---|---|---|
Norwegian Bokmål (norw1259) | 0.21 | 0.11 | 0.02 |
Danish (dani1285) | 0.24 | 0.02 | 0.01 |
Dutch (dutc1256) | 0.41 | 1.40 | 0.35 |
English (stan1293) | 0.42 | 1.40 | 0.35 |
Italian (ital1282) | 0.84 | 1.60 | 0.40 |
Hindi (hind1269) | 0.90 | 1.95 | 0.48 |
Hittite (hitt1242) | 0.90 | 1.97 | 0.49 |
Basque (basq1248) | ∞ | 4.00 | 1.00 |
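A minimal sketch of distance-weighted sampling follows. Each new language is drawn with probability proportional to its distance from the closest language already in the sample, so near-duplicates are rarely picked together. This scheme is only one possible approach and is not described in the source; the language labels and pairwise distances are hypothetical values on a normalized 0 to 1 scale.

```python
import random

# Hypothetical normalized pairwise distances (0 = identical, 1 = unrelated),
# keyed by alphabetically sorted label pairs.
DIST = {
    ("dan", "nor"): 0.02,
    ("dan", "ita"): 0.40,
    ("dan", "eus"): 1.00,
    ("ita", "nor"): 0.40,
    ("eus", "nor"): 1.00,
    ("eus", "ita"): 1.00,
}


def distance(a, b):
    """Look up a symmetric distance; identical labels are at distance 0."""
    return 0.0 if a == b else DIST[tuple(sorted((a, b)))]


def sample(languages, k, rng):
    """Draw k languages, favouring those far from the ones already drawn."""
    pool = sorted(languages)
    chosen = [pool.pop(rng.randrange(len(pool)))]
    while pool and len(chosen) < k:
        # Weight = distance to the closest language already in the sample.
        weights = [min(distance(c, p) for c in chosen) for p in pool]
        if sum(weights) == 0:
            weights = [1.0] * len(pool)  # fall back to uniform sampling
        pick = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(pick))
    return chosen


rng = random.Random(42)  # fixed seed so the draw is reproducible
print(sample(["dan", "nor", "ita", "eus"], 3, rng))
```

With these toy values, a sample of two languages pairs "dan" with "nor" far less often than with "eus", which is the intended correction for over-represented groups of closely related languages.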