1 Overview

1.1 Repository location

https://doi.org/10.5281/zenodo.8325375

1.2 Context

Transkribus, a specialized tool for Handwritten Text Recognition (HTR) powered by the PyLaia engine, currently lacks publicly available models for recognizing Chinese characters and/or symbols of the International Phonetic Alphabet (IPA) (). In the context of my own research on the Tujia language in China, I have encountered numerous related studies, published predominantly between the 1980s and the 2000s. The original materials are typically set in an outdated print style, which makes the scanned copies difficult to use for further analysis. Unlike materials that have been properly digitized, these scans cannot be searched for specific keywords, nor can their recorded lexical entries be converted into the Excel or XML formats that benefit our studies and research. Lexical lists, or word lists, have played a significant role in prior Tujia studies; however, their antiquated format and print style pose challenges for further analysis. Given this context, I developed two models on Transkribus for the recognition of printed texts; their learning curves are shown in Figure 1. The baseline model achieved a loss rate of 7.87% on the validation set, while the transcription model attained a character error rate (CER) of 5.90% on the validation set. The loss rate denotes the percentage of information that the baseline model failed to capture at the page level, and the CER indicates the rate of incorrect transcriptions generated at the character level. These models were trained on the digitized lexical lists of Burmish languages made openly accessible by Hill and Cooper (). The Burmish languages belong to the Tibeto-Burman language family and are primarily spoken in the Republic of the Union of Myanmar and neighboring China ().
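For reference, the CER reported here is a standard edit-distance metric: the number of character insertions, deletions, and substitutions needed to turn the model output into the ground truth, divided by the length of the ground truth. The following minimal Python sketch illustrates the metric; the function names are mine, not part of Transkribus:

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between a reference and a hypothesis string."""
    # Dynamic-programming table over prefix lengths, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits needed per reference character."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(f"{cer('tho55 pa35', 'tho55 pa25'):.2%}")  # one substitution -> 10.00%
```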

Figure 1 

Learning curves of the latest models on Transkribus.

2 Method

Accurate printed text recognition with Transkribus requires two distinct models: a baseline model and a transcription model. Each must be tailored to specific requirements, including the text content, the alignment of the input text, the document format, and the desired output, among other considerations. While Transkribus provides default models for each purpose, their results may not be optimal given variations in the input data, so it is crucial to develop models that match individual needs and document types. The process of using Transkribus for printed text recognition falls into two main stages: first, the text is segmented at the page level using the trained baseline model; then, transcriptions are generated for the segmented text. Both models were trained on lexical lists adapted from Hill and Cooper (), comprising a total of 345 pages. Of these, 311 pages were used for training, and the remaining 34 pages served as a validation set. The baseline model was trained for 100 epochs, the transcription model for 250 epochs. The most recent model covers approximately 2,363 Chinese characters, 100 IPA symbols, 44 common symbols, and the Arabic numerals. This section elaborates on the training and usage of both the baseline model and the transcription model.
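To make the scale concrete, the page-level split described above can be sketched as follows; the file names and the random seed are hypothetical placeholders, not the actual export layout used in this project:

```python
import random

# Hypothetical names for the 345 annotated pages (illustrative only).
pages = [f"page_{i:03d}" for i in range(1, 346)]

# Reproduce the roughly 90/10 page-level split: 311 training pages,
# 34 validation pages, drawn at random.
random.seed(42)
random.shuffle(pages)
train, val = pages[:311], pages[311:]
assert len(train) == 311 and len(val) == 34

# Training budgets reported above, in epochs per model.
epochs = {"baseline_model": 100, "transcription_model": 250}
```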

2.1 Layout segmentation – baseline model

The preliminary step in preparing for the transcription process is layout segmentation. This entails dividing the content of each page into distinct text regions, text lines, and baselines, and establishing the appropriate reading order. This segmentation must be performed before Transkribus can generate transcriptions for the input data. While Transkribus offers default baseline models, their effectiveness depends heavily on how the text of the original document is aligned.
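Transkribus stores segmentation results in PAGE XML, so the regions, lines, and baselines discussed here can be inspected programmatically. A minimal sketch, assuming the common 2013 PAGE namespace (a given export may carry a different schema version):

```python
import xml.etree.ElementTree as ET

# PAGE XML namespace commonly used by Transkribus exports (assumption:
# check the root element of your own export).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def read_layout(path: str):
    """Yield (region_id, line_id, baseline_points) from one PAGE XML file."""
    root = ET.parse(path).getroot()
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            baseline = line.find("pc:Baseline", NS)
            if baseline is None:
                continue
            # Baseline points are stored as "x1,y1 x2,y2 ..." pairs.
            points = [tuple(map(int, p.split(",")))
                      for p in baseline.get("points", "").split()]
            yield region.get("id"), line.get("id"), points

# for region_id, line_id, pts in read_layout("page_001.xml"):
#     print(region_id, line_id, pts[:2])
```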

In the context of our study, the primary objective was to develop Transkribus models capable of effectively processing lexical lists such as the one illustrated in Figure 2. These lexical lists typically comprise multiple columns: one column presents the lexical meanings in Chinese, while the remaining columns provide corresponding IPA transcriptions for three distinct dialects. To segment the content in Figure 2, we first employed the default baseline model provided by the Transkribus team, namely the 'Horizontal Text Line Orientation' model (). According to the description in the Transkribus desktop software, this default model is trained exclusively on the cBAD dataset, which exhibits a similar layout; consequently, the model primarily identifies horizontal and vertical lines.

Figure 2 

Text regions identified by Transkribus models: default and desired.

2.1.1 Text region recognition

As depicted in Figure 2(a), the content of the image has been divided into four text regions, labeled 1, 2, 3, and 4 and delineated by the green squares. While this segmentation identifies several text regions, certain content in the image, such as the page number '98', has been overlooked. For our purposes, it would be preferable to encompass all the content on the page within a single text region, as demonstrated in Figure 2(b), which represents our desired outcome for the recognition of lexical list images at the text region level, achieved through manual correction.
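Programmatically, the manual correction in Figure 2(b) amounts to merging the detected regions into one region covering the whole page. A minimal sketch, assuming regions are given as lists of (x, y) corner points (the coordinates below are toy values, not taken from Figure 2):

```python
def union_region(regions):
    """Merge several rectangular text regions, each a list of (x, y)
    corner points, into one region covering them all."""
    xs = [x for region in regions for x, _ in region]
    ys = [y for region in regions for _, y in region]
    x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
    # Return the merged region as its four corners, clockwise.
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]

# Four detected regions collapse into one page-wide region.
print(union_region([[(10, 20), (300, 90)], [(10, 100), (300, 400)],
                    [(320, 20), (600, 400)], [(560, 420), (600, 440)]]))
```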

2.1.2 Baseline recognition

In addition to text region recognition, baseline recognition provides further detail on the reading and transcribing order within blocks or paragraphs on the image. Figure 3(a) illustrates the baselines identified by the default Transkribus baseline model within each text region, represented by red lines. Reviewing the reading order, indicated by the numbers, reveals no apparent mistakes in the vertical direction, apart from some ignored text. In the horizontal direction, however, the numbering is inconsistent: in Figure 3(a), the first baselines (2 and 3) in the third column from the left should be numbered in the opposite order. Furthermore, these two baselines should be recognized as a single baseline, since they represent different parts of the same lexical word. Figure 3(b) displays the manually corrected baselines that address these inconsistencies. In summary, we expect the baseline model to recognize content on the same line as a single baseline. The workflow involves initially segmenting the texts using the default baseline model and then training a new baseline model on the manually corrected results.
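The two manual fixes described here, restoring left-to-right order within a row and fusing fragments of the same physical line, can be approximated with simple geometry on the baseline coordinates. A hedged sketch: the pixel tolerance is an assumption to tune per document, and this is not how Transkribus itself implements reading order.

```python
def midpoint_y(baseline):
    """Vertical midpoint of a baseline given as (x, y) points."""
    return sum(y for _, y in baseline) / len(baseline)

def order_and_merge(baselines, tol=10):
    """Sort baselines into rows top-to-bottom, then left-to-right, and
    fuse baselines whose vertical midpoints differ by less than `tol`
    pixels, i.e. fragments of the same physical line."""
    rows = []
    for bl in sorted(baselines, key=midpoint_y):
        if rows and abs(midpoint_y(rows[-1]) - midpoint_y(bl)) < tol:
            # Same physical line: fuse into one baseline, ordered by x.
            rows[-1] = sorted(rows[-1] + list(bl))
        else:
            rows.append(list(bl))
    return rows

# Two fragments of the same lexical entry become one ordered baseline.
print(order_and_merge([[(200, 52), (320, 53)], [(40, 50), (180, 51)]]))
```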

Figure 3 

Baselines identified by Transkribus models: default and desired.

2.2 Text transcription – transcription model

The transcription process in Transkribus relies on the PyLaia engine (), which is built with the PyTorch toolkit (). In general, we input transcriptions for the content previously identified by the baseline model; specifically, we transcribe each recognized baseline, following the assigned baseline order. It is important to note, however, that Transkribus only accepts transcriptions for text that falls within the identified text regions and text lines. If any text has been ignored by the model and not manually corrected, no transcription can be entered for it.
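For orientation, PyLaia belongs to the family of convolutional-recurrent networks trained with the CTC loss in PyTorch. The sketch below shows a deliberately tiny model of that family; it is illustrative only, not PyLaia's actual architecture, API, or hyperparameters.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Minimal CRNN for line images: conv features -> BiLSTM -> CTC logits."""
    def __init__(self, num_classes: int, height: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(32 * (height // 4), 128,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(256, num_classes + 1)  # +1 for the CTC blank

    def forward(self, x):                 # x: (batch, 1, height, width)
        f = self.conv(x)                  # (batch, 32, h/4, w/4)
        b, c, h, w = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, w, c * h)  # one step per column
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)  # (batch, steps, classes)

# One training step with CTC loss on a dummy batch.
# ~2,363 Hanzi + 100 IPA + 44 symbols ~= 2,507 classes (illustrative).
model = TinyCRNN(num_classes=2507)
loss_fn = nn.CTCLoss(blank=2507)
x = torch.randn(2, 1, 32, 256)
logits = model(x).permute(1, 0, 2)  # CTCLoss wants (steps, batch, classes)
targets = torch.randint(0, 2507, (2, 10))
loss = loss_fn(logits, targets,
               input_lengths=torch.full((2,), logits.size(0), dtype=torch.long),
               target_lengths=torch.full((2,), 10, dtype=torch.long))
loss.backward()
```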

The transcription model we developed was trained on a compilation of ten Burmish lexical lists, as shown in Table 1. These lexical lists originate from ten distinct sources documenting various Burmish languages spoken in China, specifically Achang, Bola, Chashan, Langsu, Leqi, and Zaiwa. They are predominantly composed of both Chinese characters and IPA symbols. As Table 1 shows, most of these works were published in the 2000s, with some dating back to the 1980s; consequently, the printing quality and content alignment may not be ideal compared with contemporary publications.

Table 1

Burmish lexical lists.


LANGUAGE RECORDED   SOURCE
Achang              Dai and Cui ()
Bola                He and Chen () and Dai, Jiang, and Kong ()
Chashan             Dai ()
Langsu              He and Chen () and Dai ()
Leqi                He and Chen (), Dai and Li (), and Dai and Li ()
Zaiwa               Xu and Xu ()

Since the transcription model was trained on these lexical lists, its focus lies primarily in recognizing Chinese characters and IPA symbols. As a result, the challenges we encountered mainly concern variations in how Chinese characters are represented across different systems, as well as differences in the usage of IPA symbols.

In relation to Chinese characters, when the model cannot identify a character, it substitutes a visually similar alternative from its inventory. For instance, the model employs the character '作' as a substitute for '昨', and '了' as a substitute for '子'. It is also apparent that the quality of the recognized baseline significantly affects the error rate of the transcription model: if the text is inadequately covered by the baseline, the likelihood of an incorrect transcription for a given character or symbol increases. For IPA symbols, the error rate is generally low and the overall transcription quality is better. However, one specific type of error appears in the IPA transcriptions generated by the trained model, related to tone representation. The lexical lists we used employ two different tonal recording systems: Chao tone letters () and the corresponding representation by numbers. The model performs well when tones are represented by numbers, but it makes numerous mistakes when transcribing tones written with Chao tone letters. This indicates a specific challenge for our trained model in accurately transcribing tones recorded with Chao tone letters.
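Since the Chao tone letters map one-to-one onto the pitch levels 5 (high) through 1 (low), one pragmatic workaround is to normalize both tonal systems to digits before evaluation or further processing. A minimal sketch:

```python
# Chao tone letters map to pitch levels 5 (extra-high) ... 1 (extra-low).
CHAO_TO_DIGIT = {"˥": "5", "˦": "4", "˧": "3", "˨": "2", "˩": "1"}

def normalize_tones(text: str) -> str:
    """Rewrite Chao tone letters as digit sequences so that both tonal
    recording systems in the source lists share one representation."""
    return "".join(CHAO_TO_DIGIT.get(ch, ch) for ch in text)

print(normalize_tones("tɕi˥˥ po˧˥"))  # -> "tɕi55 po35"
```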

In summary, we trained new models by using the default model as the base and incorporating manually corrected transcriptions for the recognized texts. This process helps improve the accuracy and quality of the transcriptions generated by Transkribus.

3 Performance of Transkribus models

To evaluate the efficiency of using Transkribus for transcribing lexical lists, a test was conducted involving three PhD students from Trinity College Dublin, all specializing in linguistics. Each participant was given one hour to transcribe a Tujia lexical list from Tian et al. (). The task involved typing the content seen on each page into Word or Excel files using their preferred input method. All three participants commenced simultaneously within the same room. The lexical list consisted of three columns per page: one for the transcriptions of the northern Tujia variant, one for the transcriptions of the southern Tujia variant, and another for the corresponding meanings in both Chinese and English.

During the test, each line was treated as one entry, and the precise character count was also recorded. Table 2 presents details on the participants, the number of entries and characters they completed within the one-hour time frame, and the CER they achieved.

Table 2

Participants’ information and performance.


PARTICIPANT   INPUT METHOD   ENTRIES   CHARACTERS   CER (%)
1             TypeIt ()      64        2,027        1.62
2             Keyman ()      109       3,424        0.32
3             Sougou ()      60        1,952        1.22

To obtain transcriptions of the same Tujia lexical list using Transkribus, we first utilized the baseline model to segment the document's layout. Since Participant 2 transcribed the most entries (109), we only needed to segment six pages, containing a total of 131 lexical entries, for the comparison. The Transkribus model nevertheless completed the segmentation for all 47 pages of the lexical list in just one minute. We then manually corrected the segmentation produced by the baseline model, which took approximately six minutes. Next, we used the transcription model we had trained to transcribe the first six pages of the lexical list, which contained a similar number of entries to Participant 2's output. The model completed 131 lexical entries and 4,173 characters in one minute and 40 seconds, with a CER of 8.65%.

Of the 28 minutes we then spent correcting the model's transcriptions, most of the time was dedicated to correcting the English content. In summary, the workflow took one minute for layout analysis, six minutes for its manual correction, one minute and 40 seconds for transcription, and 28 minutes for transcription correction. The total time required to transcribe the six pages using Transkribus was thus 36 minutes and 40 seconds, whereas Participant 2 took one hour to transcribe 109 entries, without even considering the error rate. In essence, for the most recent models we trained, manual correction remains essential for both layout analysis and transcription; even so, the workflow significantly reduces the overall time required compared with entirely manual transcription.
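The arithmetic behind this comparison is easy to verify; the figures below are the ones reported above:

```python
# Step timings reported above, in seconds, for 131 entries over six pages.
steps = {
    "layout analysis": 60,
    "layout correction": 6 * 60,
    "transcription": 60 + 40,
    "transcription correction": 28 * 60,
}
total = sum(steps.values())                        # 2,200 s
print(f"total: {total // 60} min {total % 60} s")  # -> total: 36 min 40 s
print(f"speed-up vs. one manual hour: {3600 / total:.2f}x")  # -> 1.64x
```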

4 Dataset description

  • Object name – OCR model for lexical lists in Chinese-IPA Glossing, Ground Truth
  • Format names and versions – jpg, xml, pdf and docx
  • Creation dates – 2023-09-07
  • Data creators – Shihua Li, Trinity Centre for Asian Studies, Trinity College Dublin, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization
  • Language – Chinese, Burmish languages including Achang, Bola, Chashan, Langsu, Leqi, and Zaiwa
  • License – Creative Commons Attribution 4.0 International
  • Repository name – Zenodo
  • Publication date – 2023-09-07

5 Reuse potential and future development

The primary objective of training the baseline and transcription models on Transkribus is to facilitate the digitization of Tujia lexical lists for future research. Nevertheless, these models can also be employed effectively for documents written primarily in Chinese and IPA, which matches the customary practice in studies of national languages in China. Consequently, this project will not only contribute to the digitization of lexical lists using Chinese-IPA glossing, but also enable the digitization of other materials that follow the same glossing style. These efforts will additionally contribute to the Lexibank project by expanding its collection of wordlists for national languages in China. There is still room for improvement in both trained models, as their training was based on a limited amount of data.

5.1 Layout analysis

The training of the baseline model for layout analysis is highly dependent on the specific characteristics of the documents being considered. The key factors that significantly influence the training process include the original alignment of content on each page and the desired format and structure of the transcription. In our case, we focused on training the model to accurately segment content into lines. However, for alternative purposes, it may be beneficial to refer to other publicly available baseline models or train new models for specific needs.

5.2 Chinese transcription

Contrary to our initial expectations, the transcription model performs impressively in recognizing and transcribing Chinese characters in documents outside its training set, particularly for characters it has already been trained on and can readily identify. When confronted with untrained characters, however, the model tends to substitute characters from its existing inventory, leading to frequent inaccuracies. The model is also susceptible to errors when the original printing quality of a document is lower than anticipated: for example, tone '35' was frequently recognized as tone '25' because of poor printing quality.

5.3 IPA transcription

Contrary to our expectations, the transcription model's performance on IPA symbols falls short of its proficiency with Chinese characters. This discrepancy may be attributed to the fact that the ten Burmish lexical lists used for training cover only a limited range of IPA symbols. Surprisingly, even when the original documents have excellent printing quality, the model still tends to confuse similar IPA symbols, such as [ɛ] and [e], [l] and '1', and [ʔ] and '2'. It is therefore imperative to train the model on a more extensive dataset of IPA symbols to improve its accuracy and proficiency in identifying and recognizing such symbols.
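Before retraining, it helps to quantify which confusions are systematic. A minimal sketch that tallies substituted character pairs between a ground-truth line and a model hypothesis, using difflib's alignment as a rough stand-in for a proper edit-distance alignment:

```python
from collections import Counter
import difflib

def confusion_pairs(ref: str, hyp: str) -> Counter:
    """Count substituted (reference, hypothesis) character pairs."""
    pairs = Counter()
    sm = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        # Only count equal-length replacements as direct substitutions.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            pairs.update(zip(ref[i1:i2], hyp[j1:j2]))
    return pairs

# Systematic confusions surface as high counts, e.g. ɛ -> e, ʔ -> 2.
print(confusion_pairs("mɛ55 ʔa31", "me55 2a31"))
```

Summing these counters over a validation set would show whether the [ɛ]/[e] and [ʔ]/'2' confusions noted above dominate the IPA error rate.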

5.4 Field Models

Transkribus is currently testing new layout analysis models known as Field Models. These models are specifically designed to analyze various types of contexts, such as newspapers containing distinct text regions on the same page. Once the testing phase is complete and these models are officially released, we plan to develop a new version of our layout analysis model based on the newly introduced Field Models.