(1) Overview

Context

This paper presents a typeface model, Vaybertaytsh.YidTakNL, that facilitates reading and researching Yiddish texts printed in the Vaybertaytsh typeface. Yiddish is the historical, traditional language of Ashkenazi Jews. Ashkenazi Jewry originates from the Holy Roman Empire. In the 11th century, Ashkenazim gradually started spreading throughout Europe. Yiddish developed into quite a uniform literary language and is one of the most popular languages in Jewish culture and history, succeeding classical Hebrew and (Jewish) Aramaic. Yiddish was the vernacular used by Ashkenazi Jews on a daily basis. Since its mediaeval origins, it has been used at home, in religious institutions, and in literature. Later, it also became prominent in theatres, politics, schools, and journalism.

The syntax of Yiddish was, in its inception, a combination of Germanic with Hebrew and Aramaic, yet it has always also been influenced by local languages. Yiddish is written in the Hebrew alphabet, which is primarily vowel-free. With time, the Semitic Hebrew script was adjusted into an alphabet containing consonants and vowels, for example, by using the letter Ayin (ע) for e, Alef (א) for a and o, and the combination Vav-Yud (וי) and two Yuds (יי) as diphthongs. The Yiddish writing system continuously evolved, and conventions of word separation and of the representation of unstressed vowels and diphthongs changed substantially during the 13th–19th centuries. Moreover, dialects differ in terms of grammar, as well as phonologically, in terms of the evolution of vowels. Scholars often use the distinct categories Old Yiddish, Modern Eastern Yiddish, and Western Yiddish.

In early modern times, Amsterdam had a flourishing Ashkenazi Jewish community that spoke Western Yiddish on a daily basis. Their dialect most probably originated mainly in German-speaking areas. During this period, Amsterdam thrived as one of the foremost centres of Yiddish book production, renowned for both the exceptional quality and impressive quantity of its literary output. Western Yiddish has a distinct grammar, phonology, pronunciation, and vocabulary compared to German. From the 17th century, Western Yiddish in the Netherlands gradually became influenced by Dutch. During the 19th century, Western Yiddish declined and was replaced by the local languages.

Yiddish texts were printed in the Vaybertaytsh typeface throughout Europe during the 16th–19th centuries. One of the oldest and most well-known examples is the Tz’enah Ur’enah prose work, also known as the Women’s Bible, containing segments from the Torah and Haftarahs used in Jewish prayer services. The earliest surviving edition is dated 1622 and was written in Hanau. In Amsterdam, the typeface was used to print regulations throughout the 18th century. These regulations have a distinct vocabulary and layout and give fascinating insight into the relationship between the Ashkenazi Jewish leadership and the Dutch government.

Vaybertaytsh is a semi-cursive Ashkenazi typeface, also called vayberksav, taytsh, ivre-taytsh, Tsene-(u)rene-ksav, Tkhine-ksav, kleyn-taytsh and mashket/mesheyt. In a semi-cursive script, elements of both cursive and print (block) writing are combined. To the untrained eye, this typeface can be difficult to read. Hence, there was a need for a text recognition model.

Dataset description

Object Name

Vaybertaytsh.YidTakNL

Access

The model is publicly accessible via Transkribus: https://readcoop.eu/model/vaybertaytsh-typeface-18th-19th-century/. The complete dataset, including transcriptions and images of the texts used for training this model, can be accessed at: doi.org/10.6084/m9.figshare.25422844.

See also https://doi.org/10.5281/zenodo.10017358 for an overview of a corpus of regulations and announcements written by Amsterdam’s Ashkenazi Jewish community between 1708 and 1846.

Dataset creators

Ronny Reshef (creator, annotator), Mirjam Gutschow (annotator)

Language

English, Yiddish, Hebrew

Licence

CC BY 4.0

Reuse Potential

The text recognition model and accompanying baseline model can be used to advance research on Jewish history and Yiddish language, literature and culture. The model is very robust, and using it will enable the preparation and editing of digital transcriptions of Yiddish texts written in Vaybertaytsh efficiently, quickly and precisely. Transcriptions made using the model can especially contribute to research on the linguistic development of (Western) Yiddish, for example, by producing word lists for analysing linguistic structures and patterns. Moreover, transcriptions made using the model can also be processed for lexicographical and scientific purposes.
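As an illustration of the word-list use case mentioned above, the following minimal sketch counts word forms in exported transcription lines. It is not part of the published dataset or of Transkribus. The function name `word_list` and the tokenisation rule (a word is a run of Hebrew-script letters, points, Yiddish ligatures, and geresh/gershayim) are assumptions made for this example only.

```python
import re
from collections import Counter

# Assumption for this sketch: a "word" is an unbroken run of Hebrew-script
# letters (U+05D0..U+05EA), Yiddish ligatures and geresh/gershayim
# (U+05F0..U+05F4), and a few combining points. Everything else separates words.
TOKEN = re.compile(r"[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C7\u05D0-\u05EA\u05F0-\u05F4]+")


def word_list(lines):
    """Return a Counter mapping each word form to its number of occurrences."""
    counts = Counter()
    for line in lines:
        counts.update(TOKEN.findall(line))
    return counts


if __name__ == "__main__":
    # Hypothetical transcription lines, used only to show the call.
    sample = ["דאש איז", "דאש"]
    for word, n in word_list(sample).most_common():
        print(word, n)
```

A frequency list of this kind can be sorted or filtered further, for example to compare spellings across documents.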

Historians can use the model to transcribe texts easily and conduct content analysis for efficient and thorough study of data, occurrences, and stories. Both models can be further improved by enriching the dataset in terms of vocabulary, text types (literature, religious texts), and layout. Such improvements will benefit all the users of the models and, subsequently, advance research on Jewish history and Yiddish language and literature. Finally, as Hodel et al. suggest, collaboration is the key to advancing text recognition models, and Vaybertaytsh.YidTakNL and its accompanying baseline model are no exception to this. Libraries and archives can provide substantial quantities of images for GT sets, academics can contribute transcriptions, and digital humanities experts can assemble and unify the inputs in a standardised manner. To preserve, make accessible, and study our cultural heritage, we need to combine our wide range of expertise and unite our efforts.

Text recognition of Vaybertaytsh

The Vaybertaytsh.YidTakNL model was created via Transkribus using PyLaia, a Handwritten Text Recognition (HTR) engine. Transkribus is a platform that enables (automatic) text recognition, structure recognition, textual analysis, tagging, and image analysis of handwritten and printed historical documents, while applying machine learning to improve its technology continuously. It is a user-friendly, accessible platform that makes historical texts in different formats readable, editable, and searchable by dividing pages into text regions, lines, and words and recognising the text. Its HTR models are neural networks trained on so-called ground truths (GTs). A GT is a manually transcribed version of a document which is accurate and has been manually verified insofar as possible. When trained on a sufficiently large number of GTs, Transkribus is able to provide accurate results.

Modern HTR tools like Transkribus can process handwritten and historical texts much more efficiently than traditional Optical Character Recognition (OCR). OCR was developed in the 1990s to identify single characters in modern printed texts accurately. HTR engines can be trained to recognise a practically limitless quantity of shapes representing a specific character, as well as strings of characters, meaning (sub)words belonging to a particular corpus.

HTR engines on Transkribus, such as the deep learning toolkit PyLaia, can also be used successfully for printed documents. A GT of 10,000 words leads to results requiring relatively few corrections. The trained model is able to recognise printed texts which are comparable to the GT. Currently, Transkribus represents the state of the art among the text recognition systems available. At the time of this project, PyLaia was certainly one of the best HTR engines available within Transkribus. A new feature is the “Transformer Based Models”, or “Super Models”, which could be superior to PyLaia and may facilitate future projects even better.

It is possible to train the model with a base model: an existing suitable HTR model and GT. A base model is not obligatory, yet it can be applied to transfer information from an already existing model to boost the creation of a new model. This allows the training algorithm to apply knowledge from an existing model based on a substantial dataset to perform pre-calibration and classification of the characters. This accelerates the process of training a new model and improves its performance.

(2) Method

Steps

Transcriptions for Yiddish texts in the Vaybertaytsh typeface were prepared using Transkribus in 2023. Since Transkribus already had two public PyLaia HTR models for Yiddish, the Dybbuk and DiJeSt, we chose to base our model on one of the two existing models. However, Vaybertaytsh is substantially different from other common Hebrew and modern Yiddish typefaces; therefore, much work needed to be done before the transcriptions were at a GT level (see Table 1 for a comparison of scripts).

Table 1

The classic Hebrew typeface Meruba and Vaybertaytsh.

[Glyph images for the MERUBA and VAYBERTAYTSH columns are not reproduced.]

UNICODE ID | LETTER NAME
U+05D0 (1488) | Alef
U+05D1 (1489) | Bet
U+05D2 (1490) | Gimel
U+05D3 (1491) | Dalet
U+05D4 (1492) | He
U+05D5 (1493) | Vav
U+05D6 (1494) | Zayin
U+05D7 (1495) | Het
U+05D8 (1496) | Tet
U+05D9 (1497) | Yud
U+05DA (1498) | Kaf Sofit
U+05DB (1499) | Kaf
U+05DC (1500) | Lamed
U+05DD (1501) | Mem Sofit
U+05DE (1502) | Mem
U+05DF (1503) | Nun Sofit
U+05E0 (1504) | Nun
U+05E1 (1505) | Samech
U+05E2 (1506) | Ayin
U+05E3 (1507) | Pe Sofit
U+05E4 (1508) | Pe
U+05E5 (1509) | Tsadi Sofit
U+05E6 (1510) | Tsadi
U+05E7 (1511) | Qof
U+05E8 (1512) | Resh
U+05E9 (1513) | Shin
U+05EA (1514) | Tav
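The Unicode identifiers in Table 1 can be checked programmatically. The sketch below uses only Python's standard `unicodedata` module to list the code points U+05D0 to U+05EA with their decimal values and official Unicode character names. The function name `hebrew_letter_table` is an assumption for this example. Note that the official names differ from the letter names used in Table 1 (for instance, Unicode calls Kaf Sofit "HEBREW LETTER FINAL KAF").

```python
import unicodedata


def hebrew_letter_table(first=0x05D0, last=0x05EA):
    """Return (unicode_id, decimal, character, unicode_name) for each code point."""
    rows = []
    for code_point in range(first, last + 1):
        character = chr(code_point)
        rows.append(
            (f"U+{code_point:04X}", code_point, character, unicodedata.name(character))
        )
    return rows


if __name__ == "__main__":
    for unicode_id, decimal, character, name in hebrew_letter_table():
        print(f"{unicode_id} ({decimal})  {character}  {name}")
```

The 27 rows cover the 22 base letters and the five final (sofit) forms, in the same order as Table 1.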

The first transcriptions were made using the Dybbuk model since it seemed most suitable for our purposes. They were then corrected and double-checked. After manual correction, the transcriptions had a high degree of precision, so the output could be used for scientific purposes. Since the initial results from the first models we created based on GT and the pre-existing Yiddish HTR models were unsatisfactory, we added more GT. We deployed a few private models to keep correcting and improving the model before making it public. In Table 2, we present a comparison of the word error rates for the three publicly available Yiddish HTR models within Transkribus. This analysis was conducted using a selection of ground truth texts printed in the Vaybertaytsh typeface. The texts used for the comparison are distinct from those employed for the development of the Vaybertaytsh.YidTakNL model.

Table 2

Word error rate for texts written in Vaybertaytsh typeface.

MODEL | # OF WORDS | # OF PAGES | WORD ERROR RATE (%) | WORD ACCURACY (%)
The Dybbuk for Yiddish Handwriting (Nov 13, 2022) | 6307 | 22 | 92.546 | 4.856
DiJeSt 2.0 (Nov 10, 2022) | 6307 | 22 | 42.122 | 43.036
Vaybertaytsh.YidTakNL (Nov 29, 2023) | 6307 | 22 | 9.26 | 85.468

The character error rate (CER) measures the proportion of characters that are transcribed incorrectly. A CER of 1% or lower can be achieved with a large enough corpus and sufficient training. However, a CER of less than 10% is generally considered sufficient for automatic transcriptions, depending on the desired application and goals. The first private Vaybertaytsh.YidTakNL model was trained in March 2023, using the Dybbuk as a base model and a training set of 7,163 words. The CER of this first model, Yid.Dutch.ver3, was 0.20% on the validation set, as can be seen in Table 3. This model was then used to transcribe the next texts, which were also corrected by both authors.
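To make the error measures concrete, the following sketch computes CER and word error rate (WER) in the conventional way: the Levenshtein edit distance between the reference (GT) and the model output, divided by the length of the reference. This is a minimal illustration under that standard definition. It is not Transkribus's own implementation, which may normalise text differently, and the function names are assumptions for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_item == hyp_item else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must not be empty")
    return edit_distance(reference_words, hypothesis.split()) / len(reference_words)
```

Because a single wrong character makes the whole word count as an error, WER is typically much higher than CER for the same output.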

Table 3

Overview of the 3 models with additional information.

MODEL | YID.DUTCH.VER3 | YIDNL.7 | VAYBERTAYTSH.YIDTAKNL
ID | 51030 | 52365 | 57147
Date | 2023-03-30 | 2023-05-22 | 2023-11-29
No of words | 7 163 | 21 661 | 66 497
No of lines | 752 | 2 434 | 8 062
Max epochs | 250 | 100 | 100
Early stopping | 20 | 20 | -
Epochs trained | 250 | 100 | 100
Learning rate | 0.0003 | 0.0003 | 0.0003
Base model | 46159 / The Dybbuk for Yiddish Handwriting | 46159 / The Dybbuk for Yiddish Handwriting |
CER on Training Set | 0.60% | 0.80% | 0.60%
CER on Validation Set | 0.20% | 0.20% | 0.91%

No of pages in Training Validation Set217123

The second private model, YidNL.7, was deployed in May 2023, again using the Dybbuk as a base model, with a training set of 21,661 words. Although it was based on more GT, the CER of the second model was also 0.20%, as presented in Table 3. A possible explanation could be that the vocabulary of the second model was still limited, and additional GT was required to decrease the errors it made. However, considering the satisfactory score, it could also well be that a training set of about 7,000 words is sufficient for this sort of typeface.

We transcribed additional texts in the Vaybertaytsh typeface to improve the model's performance. The public Vaybertaytsh.YidTakNL text recognition model has a training set of 66,497 words and a validation-set CER of 0.91% (see Table 3). Despite its slightly higher CER (0.91% compared with 0.20%), Vaybertaytsh.YidTakNL performs considerably better than our private text recognition models. A possible explanation could be related to the increase in vocabulary and source diversity.

We also deployed a baseline model for layout to work more efficiently with the printed texts. The layout model accompanying Vaybertaytsh.YidTakNL was necessary since the default layout model was not able to recognise the unique layout of the regulations. The default model divided the page into a few arbitrary regions when one region was sufficient. It also over-segmented the lines, needlessly splitting many of them into two or three parts.

In order to build the layout model, we manually separated running titles, leaf numbers, signatures and catchwords, and placed the numbering and titles of the chapters (or, as in this genre, mainly paragraphs) on separate lines. After manually correcting the layout of around 50 pages according to the rules we developed, a baseline model was trained and then continually improved. The new baseline model, trained on a set of 60,036 words, renders most pages largely the way we want them to appear.

Quality control

The main challenges in training a model for Vaybertaytsh were recurrent confusions between characters that are very similar to each other. Some examples are confusions between Yud (י), Vav (ו) and Geresh (‘), Resh (ר) and Dalet (ד), and Tsadi Sofit (ץ) and Fe Sofit (ף). Other challenges were omitted superscripts (diacritical signs), errors with word separation, irregular spacing, disregarded hyphenation, punctuation marks, and some hypercorrection errors, where letters were added for no clear reason. Rabus had similar issues training a Transkribus model for Croatian Glagolitic. Table 4 summarises the most common letter confusions we encountered. These confusions were repeatedly corrected by hand to teach the algorithm the difference between these letters. With every new model, there were fewer letter confusions and hypercorrections, and the overall result became more satisfactory. We attribute the ongoing improvement of the model to the systematic incorporation of additional texts at a GT level, coupled with meticulous correction of recurring issues. The public text recognition model functions much better than the previous models, and there is substantially less confusion.

Table 4

Recurrent issues with training the Vaybertaytsh model.

[Glyph images for the two PICTURE columns are not reproduced.]

LETTER | CONFUSED WITH (AND VICE VERSA)
Mem מ | Ayin ע
Vav + Nun ו + נ | Mem מ
Vav ו | Yud י
Tet ט | Shin ש
Dalet ד | Resh ר
Mem מ | Alef א
) | Lamed ל
He ה | Het ח
Geresh ׳ | Yud י
Geresh ׳ | Vav ו
Vav ו | Nun Sofit ן
Vav ו | Zayin ז
Samech ס | Mem Sofit ם
Tsadi Sofit ץ | Fe Sofit ף
Ayin ע | Tet ט
Samech ס | Tet ט
Nun נ | Gimel ג
Nun נ | Kaf כ
Bet ב | Kaf כ
Shin ש | Samech ס
Tav ת | Het ח
Lamed ל | Tsadi צ
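A tally of letter confusions like the one in Table 4 can be approximated automatically by aligning each GT line with the corresponding model output and counting one-to-one character substitutions. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment. It is an illustration only and not the procedure used for Table 4, which was compiled during manual correction. The function name `count_confusions` is an assumption, and `SequenceMatcher` does not guarantee a minimal-edit alignment, so the counts are approximate.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_confusions(pairs):
    """Count (truth_char, predicted_char) substitutions over (truth, predicted) line pairs."""
    confusions = Counter()
    for truth, predicted in pairs:
        matcher = SequenceMatcher(None, truth, predicted, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            # Only equal-length replacements can be paired character by character;
            # insertions, deletions and uneven replacements are skipped.
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                confusions.update(zip(truth[i1:i2], predicted[j1:j2]))
    return confusions


if __name__ == "__main__":
    # Hypothetical line pair: the model read Resh where the GT has Dalet.
    sample = [("דבר", "רבר")]
    for (truth_char, predicted_char), n in count_confusions(sample).most_common():
        print(f"{truth_char} read as {predicted_char}: {n}")
```

Sorting the resulting counter by frequency shows which letter pairs deserve the most attention when correcting the next batch of pages.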

Conclusion

The less-than-satisfactory initial results in early 2023 underscore the importance of tenacity and perseverance when developing an HTR model. Proofreading and correcting additional texts to GT status and periodically retraining the model significantly improved the quality, with each new batch of pages transcribed by the latest version. Basing the first models on existing Yiddish HTR models was crucial since they provided a solid foundation for Vaybertaytsh.YidTakNL. The model can automatically recognise 18th-century Yiddish text from Amsterdam, enabling information searches within the text without additional human editing. Achieving a flawless text suitable for a scholarly edition requires additional human work, but this is common when preparing scholarly editions.

Given that the model is derived from specific text types with a characteristic layout and a moderate vocabulary, pages of several other Yiddish texts printed in Amsterdam during the 18th century will be added to it in the near future. These texts feature alternative layout structures and distinct vocabularies, enriching and strengthening the Vaybertaytsh.YidTakNL model. Furthermore, as certain community regulations include cursive Ashkenazi, a distinctive typeface mimicking handwriting, the next model version will incorporate several fully set book pages using this style. Afterwards, we aspire to continue improving and elaborating the model by adding a variety of texts from other Ashkenazi communities in Europe. With time, we anticipate that the Vaybertaytsh.YidTakNL model will become more diverse and, as a result, improve significantly, thanks to community efforts.