(1) Overview


While Modern Chinese is known for its short words, its simple syllable structure, and its tones, in the distant past Chinese was a very different language; Old Chinese (1300–100 BCE) lacked tones, had complex syllable structure with consonant clusters, and used prefixes and suffixes to form new words. By the early 7th century, when the earliest extant Chinese pronunciation dictionary was published, Middle Chinese was already recognizably a form of the language we know today.

The radical transition between those two stages occurred during the Hàn 漢 dynasty (206 BCE–220 CE), the first enduring empire in Chinese history and among the most formative periods for Chinese thought and literature. At this time, the Confucian cultural milieu accompanying classical scholarship thrived. The Confucian classics themselves were edited and (literally) set in stone, while poetry and belletristic prose flourished. The Hàn also saw unprecedented exposure to, and influence from, foreign cultures, from grapes to backgammon, with Buddhism standing out as the period’s most abiding foreign influence.

As part of the spread of Buddhism from the west, works of Buddhist literature were brought to China and translated by teams of editors (); amongst them, three figures from the later days of the Hàn dynasty are worth mentioning:

  • Ān Shìgāo 安世高 (fl. 148–170), a Central Asian translator active in the Chinese imperial capital of Luòyáng 洛陽, was the first translator of Buddhist texts into Chinese whose name we know ().
  • Lokakṣema (Zhī Lóujiāchèn 支婁迦讖) (fl. 147–189) was a Buddhist monk from Gandhara also active in Luòyáng ().
  • Kāng Mèngxiáng 康孟詳, of whom little is known, but who is generally considered to have been born in China to Sogdian parents ().

These three figures are of particular interest to us here because of their use of transcription in their translations. For instance, while a concept such as dharma ended up being translated into Chinese as *puɑp 法 (rule, way, doctrine), it can also be found in the translations of Lokakṣema as *dəm-mɑ 曇摩, a phonetic transcription of a Prakrit word comparable to Pāli dhamma or Gandhari dhaṃma.

The most extensive discussion of the implications of such transcriptions for the phonology of Late Hàn Chinese is Coblin (). Since that publication, however, three things have changed. First, new manuscripts have been discovered and attributed to Ān Shìgāo (), providing new transcriptional data, while some texts traditionally attributed to him have been reclassified as later commentaries (). Second, our understanding of Old Chinese phonology has changed dramatically: it is now accepted that Old Chinese had a complex syllable structure, with consonant clusters in syllable-initial and syllable-final position as well as prefixes and suffixes, cf. Baxter (), Baxter and Sagart (). Finally, our understanding of languages that could have been close to the source languages of the texts translated by Ān Shìgāo, Lokakṣema, and Kāng Mèngxiáng – in particular Gandhari () – has progressed.

These developments make it necessary to revisit Coblin’s conclusions regarding the contributions of the Buddhist transcriptional data to our understanding of Hàn Chinese. The dataset presented here is an attempt to lay out all of the available Buddhist transcriptional data from the Late Hàn period and annotate it with state-of-the-art linguistic knowledge: Sanskrit, Pāli and Gandhari equivalents serve as points of comparison for what the pronunciation of the words might have been in the unknown source language, and Late Hàn Chinese and Middle Chinese reconstructions illustrate the transcriptions’ target language.

(2) Method

Base corpus

The basis of the dataset is Coblin (), whose Buddhist transcriptional data includes the following texts from the Taishō Tripiṭaka:

  • Ān Shìgāo
    – T13: Cháng āhán shí bàofǎ jīng 長阿含十報法經
    – T14: Rén běn yù shēng jīng 人本欲生經
    – T31: Yī qiē liú shè shǒu yīn jīng 一切流攝守因經
    – T32: Sì dì jīng 四諦經
    – T98: Pǔfǎ yì jīng 普法義經
    – T150A: Zá jīng sìshísì piān 雜經四十四篇
    – T150A (1): Qī chù sān guān jīng 七處三觀經
    – T150A (30): Jī gǔ [jīng] 積骨[經]
    – T150A (31): Jiǔ héng [jīng] 九横[經]
    – T602: Dà ānbān shǒuyì jīng 大安般守意經
    – T607: Dào dì jīng 道地經
  • Lokakṣema
    – T224: Dàoxíng bōrě jīng 道行般若經
    – T280: Dōushā jīng 兜沙經
    – T313: Āchù fóguó jīng 阿閦佛國經
    – T418: Bānzhōu sānmèi jīng 般舟三昧經
    – T458: Wénshūshīlì wèn púsà shǔ jīng 文殊師利問菩薩署經
    – T626: Āshéshì wáng jīng 阿闍世王經
  • Kāng Mèngxiáng
    – T184: Xiūxíng běnqǐ jīng 修行本起經
    – T196: Zhōng běnqǐ jīng 中本起經

Additions and removals

Over the years, scholars have expressed doubts about the inclusion of particular texts in the corpora of these translators, while other texts have been proposed for inclusion. For the Ān Shìgāo corpus, a consensus gradually emerged and is described in detail in Zacchetti (), itself based on the work of Zürcher () and Zürcher (). Some of the texts in Zacchetti’s list were long considered part of Ān Shìgāo’s works but were not studied by Coblin. We therefore added the following texts to Coblin’s Ān Shìgāo corpus:

  • T36: Běnxiàng yīzhì jīng 本相猗致經
  • T48: Shì fǎ fēi fǎ jīng 是法非法經
  • T57: Lòu fēnbù jīng 漏分佈經
  • T101: Zá āhán jīng 雜阿含經
  • T112: Bā zhèng dào jīng 八正道經
  • T603: Yīn chí rù jīng 陰持入經
  • T1508: Āhán kǒu jiě shí’èr yīnyuán jīng 阿含口解十二因緣經
  • T1557: Āpítán wǔ fǎ xíng jīng 阿毘曇五法行經

In addition, T602 Dà ānbān shǒuyì jīng 大安般守意經, originally listed in Coblin (), was removed.

For Lokakṣema and Kāng Mèngxiáng, no new texts were added, but for Lokakṣema more transcription words were added from T224 Dàoxíng bōrě jīng 道行般若經, on the basis of Karashima (). All the transcription material mentioned so far for the three translators can be found in Hill et al. ().

On top of these, two manuscripts discovered in 1999 in the Kongō-ji 金剛寺 temple were ascribed to Ān Shìgāo in Zacchetti (); Vetter (), in his study of Ān Shìgāo’s lexicon, includes material from the Kongō-ji as well as from T101, and we have retrieved the transcription material from there. The final Ān Shìgāo corpus, starting from Coblin () and applying all the additions and removals, comprises the following texts:

  • T13: Cháng āhán shí bàofǎ jīng 長阿含十報法經
  • T14: Rén běn yù shēng jīng 人本欲生經
  • T31: Yī qiē liú shè shǒu yīn jīng 一切流攝守因經
  • T32: Sì dì jīng 四諦經
  • T36: Běnxiàng yīzhì jīng 本相猗致經
  • T48: Shì fǎ fēi fǎ jīng 是法非法經
  • T57: Lòu fēnbù jīng 漏分佈經
  • T98: Pǔfǎ yì jīng 普法義經
  • T101: Zá āhán jīng 雜阿含經
  • T112: Bā zhèng dào jīng 八正道經
  • T150A: Zá jīng sìshísì piān 雜經四十四篇
  • T150A (1): Qī chù sān guān jīng 七處三觀經
  • T150A (30): Jī gǔ [jīng] 積骨[經]
  • T150A (31): Jiǔ héng [jīng] 九横[經]
  • T603: Yīn chí rù jīng 陰持入經
  • T607: Dào dì jīng 道地經
  • T1508: Āhán kǒu jiě shí’èr yīnyuán jīng 阿含口解十二因緣經
  • T1557: Āpítán wǔ fǎ xíng jīng 阿毘曇五法行經
  • Kongō-ji: Ānbān shǒuyì jīng 安般守意經 (‘KA’)
  • Kongō-ji: Shí’èr mén jīng 十二門經, Jiě shí’èr mén jīng 解十二門經, and the anonymous commentary (‘TG’)

Altogether, this forms the Chinese basis of our dataset, along with the identification of the corresponding Sanskrit and/or Pāli equivalents. For these, we have relied on the identification made in Vetter () for the Kongō-ji texts and Hill et al. () for the rest.
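The additions and removals described above amount to simple list bookkeeping, which can be sketched as follows. This is only an illustration of the procedure, not the script used to build the dataset; the variable and function names are hypothetical, and the sigla are copied from the lists above.

```python
# Illustrative sketch only: names below are hypothetical, sigla follow the lists above.
coblin_base = [
    "T13", "T14", "T31", "T32", "T98", "T150A", "T150A (1)",
    "T150A (30)", "T150A (31)", "T602", "T607",
]
added_texts = ["T36", "T48", "T57", "T101", "T112", "T603", "T1508", "T1557"]
removed_texts = {"T602"}
kongoji_manuscripts = ["Kongō-ji KA", "Kongō-ji TG"]


def build_corpus(base, added, removed, manuscripts):
    """Merge the three lists, drop removed sigla, and skip duplicates."""
    corpus = []
    for siglum in [*base, *added, *manuscripts]:
        if siglum not in removed and siglum not in corpus:
            corpus.append(siglum)
    return corpus


an_shigao_corpus = build_corpus(
    coblin_base, added_texts, removed_texts, kongoji_manuscripts
)
print(len(an_shigao_corpus))  # 20, the number of entries in the final list above
```

The result keeps the order in which sigla are first seen, so it lists the same twenty items as the final corpus above, though not in Taishō number order.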

Source summary

In summary, the transcriptions listed in the dataset come directly from the following sources. For Ān Shìgāo, we collate Hill et al. (), which expands Coblin’s work with more texts and more entries for the existing texts, and Baley (), which collects transliteration terms from Vetter () for the Kongō-ji 金剛寺 manuscripts. For Lokakṣema and Kāng Mèngxiáng, we use Hill et al. (), which extends Coblin’s work on Lokakṣema using Karashima (). A comparison of the number of entries in Coblin (), Hill et al. (), and our dataset, for each translator, can be found in Table 1.

Table 1

Entries in Coblin (), Hill et al. (), and the present dataset.


Translator       Coblin ()   Hill et al. ()   Present dataset
Ān Shìgāo        33          33               67
Kāng Mèngxiáng   54          54               54

Indic Transcriptions

As the Sanskrit/Pāli information in Hill et al. () was incomplete – for some entries only one of the two languages was provided – we have aimed to complete it where possible; in addition, we have used Baums and Glass () to provide Gandhari equivalents to the Sanskrit/Pāli whenever we were able to identify such equivalents. This will help explore the question of the translations’ source language(s) from a quantitative as well as qualitative point of view. We think that expanding this process to other languages of Central Asia, as their scholarship improves, would be desirable; in particular, we aim to explore Tocharian equivalents in a later project.

Chinese Reconstructions

We have added columns to provide reconstructions of various stages of Chinese phonology:

  • Late Hàn: Schuessler () and Schuessler ()
  • Middle Chinese: we use the Middle Chinese transcription system (based on the rime books and rime tables) described in Baxter ()

(3) Dataset Description

Object name

Chinese Transcription of Buddhist Terms in the Late Hàn Dynasty.

Format names and versions

OpenDocument Spreadsheet

Creation dates

2023-04-01 to 2023-05-06

Dataset creators

Julien Baley, SOAS University of London: Data curation, Investigation, Methodology, Validation.


Language

English, Chinese (Late Hàn, Middle, Modern), Sanskrit, Pāli, Gandhari


License

Creative Commons Attribution 4.0 International

Repository name


Publication date



If you find errors in the dataset, please email the corresponding author.

(4) Re-use Potential

By bringing together the work of many scholars, this dataset can serve as a basis for further analysis of the transcription practices of the Chinese Buddhist translators of the late Hàn dynasty. For instance, the attribution of translations is a recurring question, and for translators such as Ān Shìgāo and Lokakṣema – as we have seen – the debate about the authorship of individual texts can span many centuries. Our dataset provides a quick reference that can help argue, on internal grounds, whether the transcriptional vocabulary used in a text is typical of a certain translation team; it can therefore contribute to discussions of text attribution, including discussions of layering in the translation process.
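As a rough illustration of how transcriptional vocabulary could feed into an attribution argument, the sketch below ranks translators by the overlap between their known transcriptions and those found in a text under discussion. It is a toy under stated assumptions: Jaccard similarity is our choice of measure rather than a method proposed in the literature cited here, the profiles contain only transcriptions quoted in this paper, and the "unknown" text is invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy profiles using only transcriptions quoted in this paper. A real profile
# would be read from the dataset, whose column layout is not assumed here.
profiles = {
    "Ān Shìgāo": {"舍利弗", "沙門", "袈裟"},
    "Lokakṣema": {"曇摩", "阿旃陀", "羼提"},
}


def rank_translators(text_vocab, profiles):
    """Return (translator, score) pairs, highest overlap first."""
    scores = {name: jaccard(text_vocab, vocab) for name, vocab in profiles.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical text: the set of transcriptions it contains is invented.
unknown_text = {"沙門", "袈裟", "曇摩"}
ranking = rank_translators(unknown_text, profiles)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Widely shared transcriptions would inflate every score, so an actual study would probably weight terms by how distinctive they are of a given translation team.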

Another potential re-use of our dataset is to help interpret Gandhari texts. A good number of the texts included in the present dataset are translations of texts that are no longer extant. With new excavations of manuscripts in Gandhari and other languages, as well as the gradual cataloguing of existing ones, our dataset of equivalences between Chinese and Gandhari may in the future help to identify the source text of such translations. Since the editorial history of such texts is generally more complicated, it may at least help to identify passages that bear similarities to our known Chinese texts, and so inform the interpretation of the Gandhari manuscripts and our understanding of the doctrinal development underlying the diffusion of such texts.

Finally, as the dataset contains Chinese transcriptions of Buddhist concepts and their equivalents in several languages, this information can be used to try to qualify the source language of those transcriptions. For example, does a given Chinese transcription of a Buddhist term show greater similarity to its equivalent in Sanskrit, Pāli, Gandhari or yet another language, and what does it tell us about the likely phonetic characteristics of the translation’s source language?

In the earlier example of dharma transcribed by Lokakṣema as *dəm-mɑ 曇摩, the reconstruction of a final *-m is certain for *dəm 曇, which seems to exclude a transcription from Sanskrit dharma. Instead, the choice of two syllables, the first ending in *-m and the second starting with *m-, would indicate a gemination in the source language, as is found for instance in Prakrits such as Pāli dhamma and Gandhari dhaṃma.
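The reasoning in this example can be made mechanical, as in the minimal sketch below. It assumes that a transcription is given as a list of reconstructed syllables without the asterisk, and it follows the paragraph above in counting anusvāra plus m (ṃm) alongside a doubled consonant; the function names are hypothetical.

```python
import unicodedata


def nfc(text):
    """Normalise to NFC so precomposed and decomposed diacritics compare equal."""
    return unicodedata.normalize("NFC", text)


def implied_geminate(syllables):
    """Return the consonant shared across the first syllable boundary, or None."""
    first, second = nfc(syllables[0]), nfc(syllables[1])
    return first[-1] if first[-1] == second[0] else None


def has_geminate(form, consonant):
    """True if `form` has the doubled consonant, or anusvāra + m when it is m."""
    form = nfc(form)
    if consonant * 2 in form:
        return True
    return consonant == "m" and nfc("ṃm") in form


# Late Hàn reconstruction of 曇摩 as quoted above, split into syllables.
transcription = ["dəm", "mɑ"]
candidates = {"Sanskrit": "dharma", "Pāli": "dhamma", "Gandhari": "dhaṃma"}

consonant = implied_geminate(transcription)
compatible = {
    language: consonant is not None and has_geminate(form, consonant)
    for language, form in candidates.items()
}
print(consonant, compatible)
```

A check of this kind only flags candidates that are compatible with the transcription; it does not by itself choose between the Pāli-like and Gandhari-like forms.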

Following such analysis at the corpus level, does a trend emerge from all the transcriptions of a certain translator or translation team? For instance, one may notice in Ān Shìgāo’s transcriptions a tendency for sibilants to match Gandhari better than Sanskrit or Pāli, as illustrated in Table 2. Lokakṣema, who was from Gandhara, shows more variation in his transcriptions: some words more closely match Pāli models, as in his use of *ʔɑ tśan dai 阿旃陀, which better matches Pāli accanta than Skt. atyanta or Gdh. acada, while others show a Gandhari slant, such as *tṣan diei 羼提, which is closer to Gandhari kṣaṃti than to Pāli khanti.

Table 2

Sibilants in Ān Shìgāo’s transcriptions closely match Gandhari.


Sanskrit     Pāli        Gandhari    Chinese   Late Hàn
Śāriputra    Sāriputta   Śariputra   舍利弗    śaᶜ liᶜ put
śramaṇa      samaṇa      ṣamana      沙門      ṣa mǝn
kāṣāya       kāsāva      kaṣaya      袈裟      ka ṣai
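The comparison in Table 2 can be tallied with a short script such as the following sketch, which counts, for each Indic language, how often the first sibilant of the Indic form agrees with the first sibilant of the Late Hàn reconstruction. Two assumptions are made: only the first sibilant of each form is compared, and the retroflex ṣ in the Gandhari and Late Hàn forms of the last two rows is our reading of the table rather than a value checked against the dataset.

```python
import unicodedata

# Sibilant letters as used in the romanisations of Table 2.
SIBILANTS = tuple(unicodedata.normalize("NFC", c) for c in ("ś", "ṣ", "s"))


def first_sibilant(form):
    """Return the first sibilant letter in `form` (case-insensitive), or None."""
    for char in unicodedata.normalize("NFC", form).lower():
        if char in SIBILANTS:
            return char
    return None


# Rows: (Sanskrit, Pāli, Gandhari, Late Hàn). The ṣ in the Gandhari and
# Late Hàn forms of rows 2 and 3 is an assumption (see the lead-in).
rows = [
    ("Śāriputra", "Sāriputta", "Śariputra", "śaᶜ liᶜ put"),
    ("śramaṇa", "samaṇa", "ṣamana", "ṣa mǝn"),
    ("kāṣāya", "kāsāva", "kaṣaya", "ka ṣai"),
]

languages = ("Sanskrit", "Pāli", "Gandhari")
matches = dict.fromkeys(languages, 0)
for *indic_forms, late_han in rows:
    target = first_sibilant(late_han)
    for language, form in zip(languages, indic_forms):
        if first_sibilant(form) == target:
            matches[language] += 1

print(matches)
```

With only three rows this is a demonstration of the counting, not evidence in itself; the same tally over all of a translator's entries is what would support or weaken the trend.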

Conversely, the parallel question can also be investigated: given the Chinese transcriptions, what can one learn about the dialect of Chinese spoken by the translator team? What phonological features of that dialect can be discovered from the choice of Chinese characters to transcribe certain syllables of the original Buddhist term? Such questions are of extreme importance to the reconstruction of the historical development of Chinese phonology during the late Hàn period.