(1) Overview

Context

Newar (also referred to as Nepāl Bhāṣā) is the indigenous language of the Kathmandu Valley. In its pre-print phase, the highly literate and creative Newar culture produced thousands of works that have remained largely unstudied in either Western or Nepalese scholarship. Much of Newar literature is a mixture of Newar, Sanskrit, and Maithili (Malla, 1981, 6–9). While Newar literature is written in various scripts, the most common by far is the Pracalit script, which has thus also come to be known as Newar Lipi (Newar script) (Pandey, 2012). Thus, both for Indologists interested in Nepalese manuscripts written in Sanskrit and for students of Newar language and culture, the need for a means of compiling a digital corpus more quickly through optical character recognition (OCR) becomes apparent.

OCR engines have gradually become more effective in recent decades. Handwritten text recognition (HTR) has proven to be far more problematic. Deep learning neural networks have made it possible to build HTR models based on images of handwritten text linked with corresponding transcriptions (called “ground truth”). A character error rate (CER) under 10% allows for effective automatic transcription (Muehlberger et al., 2019). Advances in computing power and storage made by the Transkribus platform developed by READ-COOP have enabled the training of large data sets involving multiple hands, allowing for generalised HTR models for particular writing styles (Hodel et al., 2021). Transkribus hosts two HTR engines: CITIlab-HTR+ (Michael et al., 2018) and PyLaia, a PyTorch-based model (Mocholí Calvo et al., 2018).
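To make the CER threshold above concrete, the following is a minimal sketch of how a character error rate can be computed: the Levenshtein edit distance between the ground truth and the model output, divided by the length of the ground truth. The function names are illustrative, and the sketch counts Unicode code points, so a consonant and its vowel sign count as two characters. How Transkribus itself computes CER is not specified here and may differ.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of
    # `reference` and the first j characters of `hypothesis`.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of `hypothesis` against `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("namo buddhaya", "namo budhaya")` is 1/13, about 7.7%, which would fall under the 10% threshold cited above.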

In principle, models for HTR of Indic texts can be developed similarly to those in Roman scripts. Transkribus already has two publicly available HTR+ models for printed 19th and 20th century Devanagari developed by Nicole Merkel-Hilf (2022). This project extends HTR models to Indic texts in pre-print and non-Devanagari sources, focusing on Sanskrit and Newar (Nepāl Bhāṣā) manuscripts in Pracalit script from the 16th to 19th centuries.

(2) Method

Training an HTR model requires diplomatic transcriptions of Pracalit manuscripts that are aligned with the text in manuscript photographs. Critical editions can speed up transcription and ground truth generation through de-correction, that is, by reverting editorial corrections to the readings actually found in the manuscript. Databases like GRETIL, from which we sourced the published transcriptions, make it possible to bootstrap an HTR model where none yet exists by using texts available in other scripts (Georg-August-Universität Göttingen, 2020). To this end, transcriptions were prepared based on the following four Nepalese manuscripts, each with different varieties of Pracalit script. For each entry in the list below, in order, the manuscript title is followed by the call number in parentheses, deposit location, manuscript languages and date, and sources of the corresponding transcriptions:

  1. Hitopadeśa (MIK I 4851)
    Staatsbibliothek zu Berlin
    Mixed Newar and Sanskrit, 1561 CE
    Original transcription by Alexander James O’Neill
  2. Vetālapañcaviṃśati (HS. Or. 6414)
    Staatsbibliothek zu Berlin
    Newar, 1675 CE
    Adapted transcription based on unpublished materials by Felix Otter (Otter, n.d.a)
  3. Avalokiteśvaraguṇakāraṇḍavyūha (MS Add. 1322)
    Cambridge Digital Library
    Sanskrit, 18th century
    Adapted transcription based on an edition by Lokesh Chandra (Chandra, 1999)
  4. Madhyamasvayaṃbhūpurāṇa (RAS Hodgson MS 23)
    Royal Asiatic Society Online Collection
    Mixed Newar and Sanskrit, c. 1800
    Adapted transcription based on unpublished materials by Felix Otter (Otter, n.d.b) and the published Nagarjuna Institute transcription (Shakya & Bajracharya, 2001)

While the HTR+ engine appeared to have difficulty working with the lack of word division, PyLaia produced better results, and we used it for the rest of the training. We trained the model on 441 pages of manual transcriptions of the above four manuscripts, with validation performed on 242 pages that were not part of the training set. It was further tested and continues to be used on pages that were not part of the training or validation sets. We decided it would be most appropriate and culturally sensitive to transcribe into Unicode Pracalit (Unicode, Inc., 2021) (see Figure 1).

Figure 1. Screenshot of a completed transcription of a folio of Hitopadeśa (MIK I 4851) in Transkribus.
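As an illustration of how an edition in another script might be turned into Unicode Pracalit text for bootstrapping, the sketch below maps Devanagari characters to their counterparts in the Unicode Newa block by swapping the script prefix in the Unicode character name. This is an assumption made for illustration, not the procedure used in this project: it presumes a source already in Devanagari Unicode (romanised sources would need a separate transliteration step), and the function name `devanagari_to_newa` is hypothetical. Characters with no same-named Newa counterpart are left unchanged, so the output should still be checked by hand.

```python
import unicodedata


def devanagari_to_newa(text):
    """Map Devanagari characters to same-named characters in the Newa block.

    Illustrative only: relies on parallel Unicode names, for example
    DEVANAGARI LETTER KA -> NEWA LETTER KA. Anything without such a
    counterpart (and all non-Devanagari characters) is passed through.
    """
    prefix = "DEVANAGARI "
    converted = []
    for char in text:
        name = unicodedata.name(char, "")
        if name.startswith(prefix):
            try:
                char = unicodedata.lookup("NEWA " + name[len(prefix):])
            except KeyError:
                pass  # no Newa character with the parallel name
        converted.append(char)
    return "".join(converted)
```

Matching by character name rather than by a fixed code point offset is deliberate: the two blocks are not laid out in parallel, so an offset would not line up.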

Over 250 epochs, Transkribus trained a model with a CER of 2.6% on the training set and 0.1% on the validation set. This discrepancy may signify little more than that the validation set had fewer complex characters to recognise. The model thus produces accurate results when transcribing hands that are the same as, or similar to, those responsible for these four manuscripts (see Figure 2).

Figure 2. Screenshot of the model’s learning curve on Transkribus.

Quality control

The model has a higher CER when applied to irregular forms of Pracalit script, including more ornate or rougher hands (Figure 3). However, with a trained base model in place, adapting to a new hand requires significantly less material: ten to thirty pages of new ground truth. We will update and refine the model with new ground truth as we encounter variant hands.

Figure 3. An example of a cruder form of Pracalit, from Vetālapañcaviṃśati (HS. Or. 6414), transcribed on Transkribus.

The main limitation of this model’s initial and continued training is the lack of transcriptions. However, bootstrapping from existing editions and transcriptions and feeding corrected machine-generated transcriptions back into the model are workable solutions.

In transcription, the model encounters difficulties with damaged or soiled manuscripts, irregular spacing, punctuation, and illustrations interrupting the text. While the vast majority of Pracalit manuscripts are written in scriptio continua, occasional spacing and irregular punctuation conventions produce mixed results for the model. Mistakes in ground truth also produce incorrect transcriptions, but a larger mass of correct ground truth reduces the impact of any one mistake.
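When reviewing output, it can help to separate spacing errors from genuine misreadings. The sketch below is one simple way to do so, under the assumption that whitespace is the only spacing device of interest; the function names are illustrative and not part of Transkribus.

```python
def strip_whitespace(text):
    """Remove all whitespace, approximating scriptio continua."""
    return "".join(text.split())


def differs_only_in_spacing(reference, hypothesis):
    """True if the two lines differ, but only in where whitespace falls."""
    return (reference != hypothesis
            and strip_whitespace(reference) == strip_whitespace(hypothesis))
```

A line flagged by `differs_only_in_spacing` contains the right characters and needs only its spacing reviewed, whereas other mismatches point to misread characters.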

(3) Dataset Description

Object name – OCR model for Pracalit for Sanskrit and Newar MSS 16th to 19th C., Ground Truth

Format names and versions – png and xml

Creation dates – 2022-04-01 – 2022-08-04

Dataset creators – Alexander James O’Neill, SOAS University of London, Data curation, Formal Analysis, Investigation, Methodology, Validation, Visualization

Language – Sanskrit and Newar

License – Creative Commons Attribution 4.0 International

Repository name – Zenodo

Publication date – 2022-08-05

(4) Reuse Potential

While it is possible to share models within Transkribus, this has limited potential for the shared creation of ground truth. As modelled by the GitHub collection “HTR united,” which combines the ground truth of French documents (Chaqué & Clérice, 2021), it is possible to make ground truth data sets available in ways that others can use within platforms such as Transkribus and elsewhere. We have therefore made our dataset publicly available on Zenodo in the form of PNG and XML files that can be used on HTR platforms (O’Neill, 2022). For the future, in collaboration with the Centre of Asian and Transcultural Studies (CATS) Bibliothek at the University of Heidelberg, we are participating in the development of a South Asian Studies-specific ground truth database in a FID4SA (Fachinformationsdienst für Südasien: Specialised Information Service for South Asia) dataverse, called “Ground truth data for HTR on South Asian Scripts,” as part of the University of Heidelberg’s research data archive heiDATA (Universität Heidelberg, 2022).
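For readers who want to reuse the XML files outside Transkribus, the following sketch shows one way to pull line-level transcriptions out of a page file. It assumes the files follow the PAGE XML layout that Transkribus commonly exports (`TextLine` elements containing `TextEquiv` and `Unicode` children); the actual structure of the dataset should be checked before relying on this. The function name, the sample document, and the file name `folio_001.png` are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def read_line_texts(xml_text):
    """Return (line id, transcription) pairs for every TextLine element."""
    lines = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "TextLine":
            continue
        text = ""
        # Only direct TextEquiv children, so region-level text is ignored.
        for child in element:
            if local_name(child.tag) == "TextEquiv":
                for grandchild in child:
                    if local_name(grandchild.tag) == "Unicode":
                        text = grandchild.text or ""
        lines.append((element.get("id"), text))
    return lines


# Hypothetical sample in the assumed PAGE XML layout.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="folio_001.png">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
      <TextEquiv><Unicode>region text</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""
```

Matching on the local tag name rather than a fixed namespace keeps the sketch usable across schema versions, at the cost of being less strict than schema-aware parsing.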

Because transcription is the most labour-intensive part of philological practice, the ability to quickly produce machine-readable transcriptions of various witnesses of an Indic text is of great value to Indology and other disciplines. Such transcriptions enable high-speed searches and comparisons of corpora, as well as linguistic analysis through machine-learning methods (Meelen et al., 2021). In disciplines such as Newar studies, where there is both a paucity of trained scholars and a profusion of manuscripts, this tool can ease the burden of compiling and editing a digital corpus from primary manuscript sources, which will benefit linguistic, literary, and historical analysis of the Newar language.