1 Introduction

Against the background of the current Russian-Ukrainian war, the Second World War is pervasive in political discourse. This study is a first step towards the future goal of comparing wartime political discourse in a publication that can be considered the voice of the Belarusian authorities, with a resource that allows close and distant reading to be combined.

The main aim of the present work is to evaluate a set of resources in order to construct a full pipeline for Belarusian texts, from the digitized images of newspaper pages to a database with extended query and visualization capabilities. To facilitate working with scarce resources, the whole pipeline is to be based on freely available tools compatible with Linux. This exploratory study also allows the identification of missing or weak parts of the designed pipeline that will need additional work. As the task is defined, it covers both language-non-specific components and Belarusian language components, either tools specifically designed for Belarusian or Belarusian models for non-specific tools.

In order to evaluate the adequacy of the pipeline, a second, practical goal was set: this article initiates a limited pilot study to reveal the drastic discourse shift in a totalitarian system. Indeed, one can expect that a press organ in a tightly controlled system can undergo a radical discourse reorientation that would be far more gradual in a freer and more polyphonic media space. The current analysis focuses on the associations of the adjective ‘German’ in the Belarusian Soviet press before and after the Treaty of Non-Aggression of August 1939 between Germany and the Union of Soviet Socialist Republics. We would like to note that we do not consider the information treatment of this event in a Belarusian publication to be significantly different from the common Soviet discourse. Given the extreme authoritarianism of the Stalinist system, a wider multilingual analysis would probably show identical results across all the languages of the Soviet press.

This article successively presents similar or related projects, the selected data, the structure of the pipeline from the raw PDF files to the exploitable corpus, and the results of an attempt to use the database in a minimal pilot study of the Soviet discourse shift after August 1939.

2 Similar studies

The topic of building and using pipelines from images to structured text data is standard in Digital Humanities, hence only a few of the many available references were selected for this contribution. For example, Nundloll, Smail, Stevens, and Blair () present an ambitious pipeline from raw data to usable text data and propose some ways to improve the quality of the OCR (optical character recognition) output. In addition, that article lists other related text-driven projects in several fields. The present article is very similar to Schätzle, Hund, Dennig, Butt, and Keim (), with a tool presentation and a brief pilot study, but that work focuses on a concrete, specifically designed tool rather than on a combination of available tools.

Given the relatively easy access to data, there are many newspaper corpora (as can be seen in the Clarin main repository). Among other relevant examples related to the totalitarian communist period, Chronopress for Polish (), the DDR-Zeitungsportal for German and the digitized version of Izvestiia for Russian deserve mention. A wider perspective is provided in the introduction of Ehrmann, Romanello, Clematide, Ströbel, and Barman ().

The issue of OCR and its correction is discussed in Alex and Burns () and in Nguyen, Jatowt, Coustaty, Nguyen, and Doucet (). The correction of OCR errors with ML (machine learning) techniques is not systematically addressed in the current pipeline, although it is probably of crucial importance for future versions of the dataset.

The potential of graph-based tools and models, such as graph databases, is also increasingly discussed, since it is directly in line with the development of linked data, for example in Perak () (these proceedings are specifically about this topic) or McGillivray, Cassotti, Basile, Di Pierro, and Ferilli (). This type of database offers a reliable option for representing corpora () or structuring digital editions (). Its main advantage is to facilitate the development of relatively independent and overlapping information layers (; ). The graph schema used for the present database is relatively similar to Efer (). Although the most frequently mentioned graph database appears to be Neo4j (when no tool is specifically designed), the present work relies on the more recent Memgraph.

In addition to a graph database, the work presented by Perak () used UDPipe for morphological and syntactic annotation, like the present pipeline. For this pipeline, priority was given to morphological and syntactic annotation over the identification of named entities, while the usual priority with newspaper corpora tends to be the opposite (). Named entity recognition was not considered for this initial Belarusian pipeline.

Although the visualization task remains limited in the present work, planning future development is nonetheless relevant. As a way to summarize complex data, visualization is an important topic in Digital Humanities, for example in Lamirel, Dugué, and Cuxac (), Allen () or Beck and Butt (). In the last of these references, the authors also briefly discuss the question of data (in)consistency, which goes far beyond the issue of visualization. Applications in the field of literary studies, as in Scrivner and Davis (), could give useful insights for rhetorical and discourse analysis.

3 Dataset description

The CoNLL-U files that were generated and used for the pilot study can be found under the following reference:

Object name: Annotated files of the Soviet Belarusian newspaper Zviazda (years 1938, 1939, 1940) (0.1.1)

Direct link: https://zenodo.org/record/8424771

Format names and versions: CoNLL-U

Creation dates: 2023-06-17

Dataset creators: Loïc Boizou

Language: Belarusian

License: CC-BY

Repository name: Zenodo

Publication date: 2023-10-09

4 The data

The corpus consists of Issues of Zviazda (in Belarusian Звязда). This newspaper, which still exists as a pro-government publication in Belarus, was the official publication of the Central Committee and the Minsk regional branch of the Communist Party (Bolsheviks) of Belarus. As such, it was the direct voice of the authorities. Nevertheless, it must be noted that in a situation where the whole press was tightly controlled by the state, all kinds of publications had to speak in unison to a very large extent. “Provincial Russian papers took their cues about wording and descriptions of events from these major publications. Following a pattern developed in 1917, what appeared in Pravda one day was likely to appear in Izvestiia on the same day or on the next day, and a day or so later in the regional or specialized newspapers and magazines” (, p. 388). The same trend affected Zviazda and Soviet newspapers in all other languages as well.

As a rule, the newspaper consisted of four pages, but some special Issues were longer (e.g., 8 pages for Issue 281 of 1940). It was issued about five times a week, with the non-publication days fluctuating. Zviazda was first published in Russian; then, due to the Soviet korenizatsiia (‘indigenization’) policy (in Belarusian Карэнiзацыя) of the 1920s, it was published in both Russian and Belarusian from 1925, before finally adopting Belarusian as its only language in the summer of 1927 (entry Звязда in Энцыклапедыя Гісторыi Беларусi (Encyclopedia of Belarusian History), vol. 3, 1996).

The selected Issues cover the years 1938, 1939 and 1940. In addition to 1939, the year of the signing of the Treaty of Non-Aggression between Germany and the Union of Soviet Socialist Republics, it was decided to extend the time span to the previous and following years. The year 1938 was significant for its multiple prewar crises: a turning point in the Spanish Civil War that paved the way for the Republican defeat in 1939, the Anschluss and the Munich Conference. Throughout this year, the Soviet Union and Germany were obviously antagonistic powers in the diplomatic arena. The year 1940 was the main year of peaceful coexistence between the two former ideological enemies. This time span allows the Soviet discourse before and after the Treaty to be fully captured.

All these Issues were downloaded from the website of the Presidential Library of Belarus. They constitute a subset of a wider set of Issues covering 1918 to 1945. The Issues are digitized as PDF documents (at a resolution of 300 ppi) and cannot be used in this form to extract and process textual information. For the years 1938 and 1940, each page was scanned as one PDF page, but for the year 1939, pages were scanned in two parts (the upper and the lower part), with the middle of the text repeated in both scans (see Figure 1), except for Issue 59/1939.

Figure 1 

Example of overlapping PDF pages.

A certain number of Issues are missing from the documents provided by the Presidential Library (see Table 1). Some original documents are also physically damaged to a certain extent (e.g., Issue 152 of 1940). The number of Issues and pages is provided in Table 2. Thirteen Issues do not have the usual four pages. Some special Issues are longer (281/1940 – 8 p.; 1/1939, 58/1939 and 62/1939 – 6 p.). Issues 270/1938 and 38/1939 have 5 pages because one page was scanned twice. In some cases, Issues are incomplete (134/1939 – 3.5 p.; 54/1939, 187/1939 and 280/1939 – 3 p.; 26/1939 and 251/1939 – 2 p.). For one Issue, only a single article was scanned (11/1939).

Table 1

Missing Issues by year.


YEAR   MISSING ISSUES
1938   151, 190, 221, 228, 255
1939   37, 50, 66, 114, 118, 119, 127, 129, 206, 226, 241, 271, 284, 296
1940   59, 65, 88, 111, 204, 282, 294

Table 2

Basic statistics related to Issues.


YEAR    PAGES (NUMBER)                               ISSUES (NUMBER)
1938    1185                                         296
1939    1148 (2288 PDF half pages + 4 full pages)    288
1940    1188                                         296
total   3521                                         880

The processing of these documents is explained in the following section about the pipeline.

5 The pipeline

The main task of the present study was to develop a pipeline from the PDF files to a searchable database. As a rule, all components had to be free and to carry out the bulk of the work, with only minimal ad hoc functions (mainly for format conversion). The pipeline was developed on Fedora 38.

5.1 Text extraction and preprocessing

The conversion from image-based PDF to image files (TIFF or JPG) relies on Poppler, while the OCR relies on Tesseract 5.3.0. Given the uneven quality of the PDF images of old newspapers, the OCR is relatively difficult, especially without any attempt to improve the recognition process (either by improving the quality of the pictures, by providing a lexicon or by using additional tools like ocrd_tesserocr). We made several attempts to perform OCR with both TIFF and JPG as input and with two different page segmentation modes (the default mode --psm 3 and --psm 6). As it stands, the pipeline goes through JPG images with the default page segmentation mode because it recognizes the text flow across different columns. Nonetheless, it is unable to properly recognize the article structure on a given page, so unrelated parts of the page can appear as following one another. Early attempts were made to link the text flow to the spatial structure of the page through --psm 6, but this approach needs to be tested further. An important missing step in the pipeline is the merging of the two half pages of the 1939 Issues and the deletion of the duplicated middle part. As a consequence, the text flow is always interrupted in the middle of the page. A similar issue arises when an article continues across several pages.
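As an illustration, the two conversion steps can be chained as in the following minimal sketch (the file layout and output naming are assumptions; it presupposes that the Tesseract Belarusian model bel is installed):

    import subprocess
    from pathlib import Path

    def ocr_issue(pdf_path: str, out_dir: str) -> None:
        """Convert one scanned Issue to JPG pages, then OCR each page."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        prefix = out / Path(pdf_path).stem
        # Poppler's pdftoppm: one JPG per PDF page, keeping the 300 ppi of the scans
        subprocess.run(["pdftoppm", "-jpeg", "-r", "300", pdf_path, str(prefix)],
                       check=True)
        for jpg in sorted(out.glob("*.jpg")):
            # Tesseract with the Belarusian model; --psm 3 is the default fully
            # automatic page segmentation used by the current pipeline
            subprocess.run(["tesseract", str(jpg), str(jpg.with_suffix("")),
                            "-l", "bel", "--psm", "3"], check=True)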

A limited attempt to give an approximate measure of the OCR quality was performed on the files corresponding to the first page of Issue 200/1938 and to the first half page of Issue 100/1939. Table 3 shows the word error rate for these two extracted text files.

Table 3

OCR word errors.


FILE                         TOKENS (ORIGINAL TEXT)   TOKENS (OCRED FILE)   COMMON TOKENS   MISSPELLED OR MISSING TOKENS
200/1938 (first page)        4223                     4217                  3795            428 (10%)
100/1939 (first half page)   1864                     1602                  1352            512 (27%)

The WERs (word error rates) are 10% and 27%, respectively. A significant part of the misspelled words shows a single-character mistake that might be recoverable with specific OCR correction techniques. The most problematic issue manifests itself in the second file (100/1939), where the number of words is notably smaller in the OCRed file than in the original text: all the lowest lines of the page, which appear in a darker area of the PDF (over 250 words out of 1864 tokens), were totally missed by the OCR tool, hence the much higher WER for the second file (given the token numbers, the WER for the parts of the page that were actually OCRed might be around 13%, very close to the WER of the first file). This means that some parts of the content are missing from the extracted data, a serious shortcoming that needs further investigation. The WERs evaluated for the current data are very similar to the rates mentioned in Nguyen et al. () (in Section 2), from 9% to 27%, for a dataset that is mostly related to older newspapers but consists of texts in English; moreover, that OCR task was performed at least partly with commercial software.
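For reference, the common-token counts of Table 3 can be approximated by an order-insensitive comparison of token multisets, as in the sketch below (a rough estimate rather than a true alignment-based WER; the inputs are assumed to be plain text):

    from collections import Counter

    def wer_estimate(reference: str, ocr_output: str) -> float:
        """Share of reference tokens without a match in the OCR output."""
        ref = Counter(reference.split())
        hyp = Counter(ocr_output.split())
        common = sum(min(count, hyp[token]) for token, count in ref.items())
        return 1 - common / sum(ref.values())

Applied to the counts of Table 3, this yields 1 - 3795/4223 ≈ 10% for the first file and 1 - 1352/1864 ≈ 27% for the second.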

Two limited cleaning operations were performed on the OCR results. First, a rudimentary script was written to dehyphenate the text and thus improve the text flow; the result is segmented into sentences and words and converted into a CoNLL-U file by UDPipe 1 with the HSE Belarusian model (). Second, the Hunspell Python library was used to correct some country names and nationality nouns or adjectives, like Germany, German, Belarus, Belarusian and Soviet, in the CoNLL-U files. Fully automatic spelling correction is very problematic, since all unknown words would be replaced by known ones. Our approach to minimize the risks (except for case endings) was to accept a correction only if one of our keywords (either a country name or a nationality noun/adjective) is among the Hunspell suggestions. For future steps, a different decision, possibly involving a partly manual correction of words unrecognized by Hunspell, will be necessary. Since OCR by Tesseract with different inputs and parameters results in different mistakes, it may be possible to combine the different outputs for the same text in order to reduce the number of mistakes, but such an approach has not been tested. However, the main factor for improvement is probably an increase in the quality of the OCR process itself.
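The keyword-gated correction can be sketched as follows (the dictionary paths and the keyword list are illustrative; unlike this simplified version, the actual pipeline also tolerates differing case endings when matching suggestions against the keywords):

    import hunspell

    # Assumed paths to the Belarusian Hunspell dictionary files
    h = hunspell.HunSpell("/usr/share/hunspell/be_BY.dic",
                          "/usr/share/hunspell/be_BY.aff")

    # Illustrative subset of the country/nationality keywords
    KEYWORDS = {"германія", "германскі", "беларусь", "беларускі", "савецкі"}

    def correct_token(form: str) -> str:
        """Accept a Hunspell suggestion only if it is one of the keywords."""
        if h.spell(form):
            return form
        for suggestion in h.suggest(form):
            if suggestion.lower() in KEYWORDS:
                return suggestion
        return form  # otherwise keep the (possibly misrecognized) token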

5.2 Morphological and syntactic analysis

The morphological and syntactic analysis was also performed by UDPipe with the already mentioned HSE Belarusian model. We were not able to find other freely available analyzers or Belarusian language models (although Zubov () mentions that such tasks were performed at the Belarusian Academy of Sciences). The HSE Belarusian model is based on a relatively small treebank of about 30,000 tokens, and only half of the data were used to train the model (the second part being used for testing). As a result, the quality of the linguistic analysis is insufficient.
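A minimal sketch of the annotation step with the ufal.udpipe Python bindings is given below (the model file name follows the UD 2.5 UDPipe release and the input file name is hypothetical):

    from ufal.udpipe import Model, Pipeline, ProcessingError

    model = Model.load("belarusian-hse-ud-2.5-191206.udpipe")
    # Tokenize, tag and parse, producing CoNLL-U output
    pipeline = Pipeline(model, "tokenize",
                        Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")
    error = ProcessingError()

    with open("zviazda_1938_200.txt", encoding="utf-8") as f:
        conllu = pipeline.process(f.read(), error)
    if error.occurred():
        raise RuntimeError(error.message)
    print(conllu[:500])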

Hunspell was considered as another option for tagging the data, but morphological information is totally absent from the standard Hunspell Belarusian dictionary, so it cannot be used as a basic morphological analyzer for Belarusian. Furthermore, even lemmatization is practically impossible, since the Belarusian Hunspell dictionary relies on a very fragmented description of the lexicon. Due to the concatenative nature of Hunspell, the vowel alternations that are frequent in Belarusian stems (in relation to the movement of lexical stress) cannot easily be dealt with in the traditional way, that is, with a word (lemma) and paradigm approach. This led the dictionary authors to divide single lexical units into several base forms when the stem alternates. Separate singular and plural base forms for nouns are quite frequent (e.g., год ‘year’, гады ‘years’), but some highly variable nouns are split into even more base forms (e.g., дзень/дня/дні/дзён ‘day(s)’). Other parts of speech can be split as well, e.g., маю and мець for the verb ‘to have’. Such an approach proved to be suitable for spell-checking, which is the core purpose of Hunspell, but it does not allow the tool to be used directly for other purposes in Belarusian.

A preliminary evaluation of the results was performed on a very small sample, the first 100 sentences of Issue 200/1938. It consists of 1093 tokens in total, but the evaluation was limited to the 864 alphabetic tokens, since punctuation signs and numeric expressions are almost always correctly lemmatized and tagged for part of speech.

The results for lemmatization are given in Table 4. For correct word forms, lemmatization was successful for about 79% of the alphabetic tokens, and even close to 85% if the cases in which lemmatization was correct except for capitalization (either incorrectly assigned to the lemma or missing when needed) are included. When the word forms themselves are incorrect (because of mistakes in OCR or dehyphenation), lemmatization was nevertheless correct in about a quarter of the cases.

Table 4

Preliminary results (lemmatization).


LEMMATIZATION OUTCOME               ON CORRECT WORD FORM   ON INCORRECT WORD FORM
Correct                             604 (78.9%)            24 (24.5%)
Correct except for capitalization   43 (5.6%)              –
Incorrect                           119 (15.5%)            74 (75.5%)
Total                               766 (100%)             98 (100%)

The POS (part-of-speech) tagging was evaluated on the same list of tokens. The results (see Table 5) are broadly similar to the lemmatization results: correct POS assignment amounts to 84% of the alphabetic tokens with correctly recognized word forms, while the proportion of correct POS tags for incorrect word forms (35.7%) is higher than the corresponding lemmatization figure. This is a direct consequence of the greater role of context in determining the POS, while lemmatization is more strongly lexicon-based.

Table 5

Preliminary results (POS tagging).


POS TAGGING RESULT   ON CORRECT WORD FORM   ON INCORRECT WORD FORM
Correct              644 (84.1%)            35 (35.7%)
Incorrect            122 (15.9%)            63 (64.3%)
Total                766 (100%)             98 (100%)

As for the parsing, a quick overview of the 100 sentences mentioned above suggests that about half of the sentences are not properly segmented. The mistakes are especially obvious in the part of the page where headers announce the content of Issue 200/1938. In denser paragraphs, segmentation tends to perform better, provided that the sequence of the text lines is correct. When the sentence segmentation is correct or almost correct, the syntactic analysis appears good or relatively good in about half of the sentences. The quality of the parsing does not seem to be related to sentence length alone: nominal sentences were often incorrectly parsed, while several longer sentences were surprisingly well analyzed.

Given the small size of the sample, all these preliminary results have to be considered with caution. Nonetheless, they show that, at the current stage, the data are excessively noisy. Besides the mistakes inherited from the OCR step, the quality of the syntactic and morphological analysis is too low, and this issue must be addressed in the future. Some steps are underway to develop a morphological analyzer on the basis of the Hunspell Belarusian lexicon. An option for improving parsing could be to use a larger Russian dependency model, given the high syntactic similarity between Russian and Belarusian.

5.3 Corpus storage in a graph database

The data were finally stored in a database. With a view to developing linked information layers, the decision was made to store the data in a graph database. After trying several options, Memgraph was selected for its reliability and usability with relatively big databases, its ability to import CSV files at high speed, its graph visualization function allowing quick exploration of the database, and the option to query the database from Python programs. The CoNLL-U files were converted to suitable CSV files by an ad hoc script.
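The conversion script itself is ad hoc; a minimal sketch of what it has to do, under an assumed CSV layout (one row per token, with the sentence and document ids that the graph import needs), is given below:

    import csv

    def conllu_to_token_csv(conllu_path: str, csv_path: str, doc_id: str) -> None:
        """Flatten a CoNLL-U file into one CSV row per token."""
        with open(conllu_path, encoding="utf-8") as src, \
             open(csv_path, "w", newline="", encoding="utf-8") as dst:
            writer = csv.writer(dst)
            writer.writerow(["doc_id", "sent_id", "position", "form",
                             "lemma", "upos", "feats", "head", "deprel"])
            sent_id, position = 0, 0
            for line in src:
                line = line.strip()
                if not line:              # blank line = sentence boundary
                    sent_id += 1
                    continue
                if line.startswith("#"):  # skip sentence-level comments
                    continue
                cols = line.split("\t")
                if "-" in cols[0] or "." in cols[0]:
                    continue              # skip multiword and empty tokens
                position += 1
                writer.writerow([doc_id, f"{doc_id}-s{sent_id}", position,
                                 cols[1], cols[2], cols[3], cols[5],
                                 cols[6], cols[7]])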

In the current database schema (see Figure 2), vertices (or nodes) have the following properties:

Figure 2 

The database schema (visualization generated by Arrows, https://arrows.app).

  • Token: position in the document (Int), id field, form, lemma, part of speech, grammatical features, glue (all are strings)
  • Sentence: id field (String)
  • Document: id field (String), year (Int), issue number (Int)

The edges (or relationships) express the relation between each token and the document it belongs to (IS_IN_DOC), the relation between each token and the sentence it belongs to (IS_IN_SENT), the sequential relation between tokens in the text flow (IS_NEXT), the dependency relation between words (IS_DEP) and the relation between the root and its sentence (IS_ROOT). Only the dependency relation has a property, the syntactic type of the dependency relation. The relation IS_ROOT between the root and the sentence is largely redundant with the relation IS_IN_SENT and should probably be replaced by the boolean property is_root for each token.
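A possible import, using Memgraph's LOAD CSV clause through the Bolt-compatible neo4j Python driver, could look like the sketch below (file path, empty credentials and column names are assumptions matching the CSV layout sketched above; the edges can be created afterwards by MATCHing on the id fields):

    from neo4j import GraphDatabase  # Memgraph accepts Bolt connections

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

    # Server-side bulk load of the token CSV into Token vertices
    IMPORT_TOKENS = """
    LOAD CSV FROM '/data/tokens.csv' WITH HEADER AS row
    CREATE (:Token {id: row.doc_id + '-' + row.position,
                    position: toInteger(row.position),
                    form: row.form, lemma: row.lemma,
                    pos: row.upos, feats: row.feats})
    """

    with driver.session() as session:
        session.run(IMPORT_TOKENS)
    driver.close()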

At its present stage, article, page and paragraph structures are not expressed in the database (although information about newlines could be retrieved through the field glue). The quantitative information about the present database is summarized in Table 6.

Table 6

Database summary.


VERTICES (NODES)            EDGES (RELATIONSHIPS)
tokens       14,297,480     IS_IN_DOC    14,297,480
sentences     1,404,085     IS_IN_SENT   14,297,480
documents           880     IS_NEXT      14,296,600
                            IS_DEP       12,893,395
                            IS_ROOT       1,404,085
total        15,702,445     total        57,189,040

5.4 Visualization

While Memgraph Lab allows a quick view of the data, its visualization options are limited to relation graphs. Given its wide range of possible visualizations, the Python library Plotly was selected for inclusion in this pipeline. For the current study, its use was restricted to simple time diagrams and to static views of the data, but Plotly will make it possible to develop dynamic visualizations connected to the Memgraph database through Python scripts.

The next section presents how these data, despite the above-mentioned shortcomings, can be used for a concrete small-scale study of the radical Soviet discourse shift of 1939.

6 A minimal pilot study

As a minimal exploratory study, the time distribution of some pejorative collocates of the adjective германскі (hermanski) ‘German’ was extracted from the database. Given the low quality of the parsing, this short study relies on a simple collocation approach with a span of five tokens before and five tokens after the adjective. This means that it could also have been carried out with a traditional text database such as BlackLab or NoSketch Engine, and that the potential of the graph database is underutilized at this point.
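Under the schema of Section 5.3, the ±5-token span can nonetheless be expressed naturally over the IS_NEXT chain. The following query, run here through the neo4j driver against Memgraph, is a sketch of such an extraction (the property names follow the schema above, and the direction of IS_IN_DOC is an assumption):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("", ""))

    # An undirected traversal of 1 to 5 IS_NEXT edges reaches the five tokens
    # before and the five tokens after each occurrence of 'германскі'
    COLLOCATES = """
    MATCH (t:Token {lemma: 'германскі'})-[:IS_NEXT*1..5]-(c:Token)
    MATCH (t)-[:IS_IN_DOC]->(d:Document)
    RETURN c.lemma AS collocate, d.year AS year, count(*) AS freq
    ORDER BY freq DESC
    """

    with driver.session() as session:
        for record in session.run(COLLOCATES):
            print(record["collocate"], record["year"], record["freq"])
    driver.close()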

After a quick manual check of the 32,230 collocates of the word германскi, aimed at identifying a set of potentially negative words with a minimum frequency (at least four occurrences), the following list was created: фашыст ‘fascist’ (and related words like фашызм ‘fascism’), агрэсiя ‘aggression’ (and related words like агрэсар ‘aggressor’), правакацыя ‘provocation’ (and related words) and тэрор ‘terror’ (and related words). The last three words have a direct negative connotation, while the first names a major antagonistic ideological system (at some point the main one). Two more words, related to supposed or actual subversive activities and used to reinforce the feeling of an omnipresent threat, were added to the list: шпiён ‘spy’ (and related words) and агент ‘agent’. The extraction of the selected collocates was performed with regular expressions (e.g., *фашы* ‘*fasci*’) in order to partly recover incorrect tokens due to OCR errors, hyphenation issues or incorrect lemmatization.

These selected collocates were visualized diachronically for the years 1938, 1939 and 1940. Besides a usual diagram with the monthly frequency of each selected collocate (Figure 3), an unusual alternative is provided as Figure 4.

Figure 3 

The monthly frequency of the selected collocates of германскi (visualized with Plotly).

Figure 4 

The co-occurrence schema of германскi (visualized with Plotly).

This specific co-occurrence schema attempts to represent the appearance of collocates by Issue, in order to obtain a very fine sequential alignment (by date or Issue number), even for collocates that are meaningful not through their higher frequency but through their mere presence or absence. Since the frequency of the selected collocates in each Issue is mostly 0 or 1, rarely more, a (raw or normalized) frequency on the y-axis would have been hardly readable when two or more selected collocates appear once in the same Issue; in general, the line would have been almost flat between 0 and 1 with a few peaks at 2. Instead, the ordinate was used to encode the total frequency over the whole period: each occurrence, with the Issue as abscissa, is placed on the line representing the total frequency of its collocate in the selected period. Since this number is constant, the dots appear on a straight line, but this decision makes all the values readable (except in the improbable case in which the total frequency is exactly the same for two collocates) and provides a measure of the relative weight of each collocate in addition to its temporal distribution. In order not to lose information about possible multiple occurrences of a collocate in an Issue, the size of the dots varies accordingly.
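This encoding can be reproduced in a few lines of Plotly; the sketch below uses made-up occurrence data to show the principle (constant ordinate per collocate, dot size proportional to within-Issue frequency):

    import plotly.graph_objects as go

    # Hypothetical data: for each collocate pattern, the sequential Issue
    # indices in which it occurs and the occurrence count in each Issue
    occurrences = {
        "фашы*":  {"issues": [12, 40, 487, 492], "counts": [1, 2, 1, 1]},
        "агрэс*": {"issues": [101, 310],         "counts": [1, 1]},
    }

    fig = go.Figure()
    for collocate, data in occurrences.items():
        total = sum(data["counts"])  # constant ordinate = total frequency
        fig.add_trace(go.Scatter(
            x=data["issues"],
            y=[total] * len(data["issues"]),
            mode="markers",
            # dot size encodes multiple occurrences within a single Issue
            marker=dict(size=[8 * c for c in data["counts"]]),
            name=collocate,
        ))
    fig.update_layout(xaxis_title="Issue (sequential index)",
                      yaxis_title="Total frequency over 1938-1940")
    fig.show()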

The data in Figures 3 and 4 show that the negatively connotated words abruptly stopped being used immediately after the Soviet-German Pact and that this situation continued until the end of the period under study (December 1940). It must be emphasized that the later German invasions of Poland, Denmark, Norway, the Netherlands, Belgium, Luxembourg and France did not provoke any return to the previous inflammatory rhetoric against Germany. In the Issues of Zviazda, the events around the tipping point appear as follows:

  • Issue 191 (492 in Figure 4) of August 21: last pejorative mentions of Germany (фашыст ‘fascist’, фашызм ‘fascism’, тэрар ‘terror’).
  • Issue 192 (493 in Figure 4) of August 22: nothing remarkable.
  • Issue 193 (494 in Figure 4) of August 23: the arrival of Ribbentrop in Moscow is announced.
  • Issue 194 (495 in Figure 4) of August 24: the signing of the Soviet-German Treaty is announced.

The data confirm at least the first two of the following statements by Ewa M. Thompson: “The signing of the Molotov-Ribbentrop Pact reversed the tone spectacularly. The word fascist was eliminated and virtually overnight the press adopted a pro-Nazi point of view regarding Europe” (, p. 389). The suddenness of such a radical shift illustrates how easily a totalitarian system can switch a certain tone on or off. While the negatively connotated words were no longer used, this does not mean that German topics disappeared from Zviazda, as shown in Figure 5.

Figure 5 

The monthly frequency of германскі (visualized with Plotly).

If we observe the period before the Soviet-German Pact, there is no obvious sign that a political shift was coming in relation to the events considered by historians as potential turning points, namely Stalin's speech of March 10, 1939, or the replacement of Litvinov by Molotov as the People's Commissar for Foreign Affairs of the Soviet Union on May 3, 1939 (, , ). Nonetheless, the frequency of rhetorical attacks seems to decrease about three months before the Pact, from the middle of May 1939, that is, shortly after Molotov's nomination as Commissar for Foreign Affairs.

7 Final remarks

This paper shows that it is possible to build a full pipeline from the PDF copies of a Soviet Belarusian newspaper from the Second World War to an annotated corpus with the prospect of developing future richer layers of information and visualizations. Despite the obvious weakness of the data, the minimal exploratory analysis clearly illustrates the radical discourse shift in the Soviet press when the critical coverage of Germany was abandoned “overnight” after the Molotov-Ribbentrop Pact. From this point of view, the exploratory study can be deemed successful.

The approach and the data open the way for a larger outlook on this rhetorical shift. For example, do collocates with a positive connotation appear after the signing of the Molotov-Ribbentrop Pact or is the tone about Germany strictly neutral? Can we observe an opposite trend for the countries that were considered as prospective circumstantial allies before this Treaty, France in particular? In addition, the data allow us to discuss and explore more broadly the topic of public discourse in the Stalinist context and to compare it with other spaces and periods. Some ideas and hypotheses from de Leeuwe, Azrout, Rekker, and Van Spanje () could perhaps be used to analyze the influence of the Soviet legacy on the present Belarusian government press, which is still under strict control.

Nevertheless, the resulting data are extremely noisy, to the extent that some information layers are not usable yet. The Belarusian linguistic component, as far as freely available resources are concerned, is still underdeveloped. It is necessary to significantly enhance these resources and to make them easily available. Improving the lemmatization and the syntactic analysis would increase the usability of the graph database, which remains underutilized. As it stands, the same analysis could have been carried out with a simpler tool such as the AntConc concordancer, but one of the main goals was to explore a sustainable pipeline.

In general, more layers, for example with the rhetorical structure, the annotation of named entities or ideologically marked terms, would also provide this graph database with more options for distant reading, but the priority is clearly to improve the OCR and the morphological and syntactic analysis before enriching the information contained in the Zviazda corpus.