1 Overview

Repository location

South African Centre for Digital Language Resources: https://repo.sadilar.org/; data set: https://hdl.handle.net/20.500.12185/568.


School texts, e.g., reading comprehension () or language instruction texts (, ; ), have historically been used in complexity studies. We leverage the reading comprehension and summary writing texts from examination question papers to avoid the limitations involved in reproducing copyrighted textbook materials. We utilise texts from the home language (HL) and the first additional language (FAL) examinations. The home language subject is aimed at learners who start the first grade with competencies such as reading, writing, speaking, and listening in the language (). The first additional language subject is tailored to learners who do not necessarily start the first grade with competencies in, or exposure to, the language being taught (). According to Makalela (), the objectives of the two subject levels are largely similar. However, the texts administered to learners in the HL subject are more linguistically complex () and are harder to read than those in the FAL classes, at least as far as English is concerned (). The current data set has already been used in the following articles:

  • Sibeko, J. (). A comparative analysis of the linguistic complexity of grade 12 English Home Language and English First Additional Language examination papers. Per Linguam: a Journal of Language Learning, 37(2), 50–64. DOI: https://doi.org/10.5785/37-2-976
  • Sibeko, J., and van Zaanen, M. (). An analysis of readability metrics on English examination texts. Journal of the Digital Humanities Association of Southern Africa, 3(1), 1–11. DOI: https://doi.org/10.55492/dhasa.v3i01.3864

2 Method


The data collection process consisted of four steps. First, PDF files of the examination papers were downloaded from South Africa’s Department of Basic Education’s website; consequently, no student responses are available. These files (like all other files in the data set) are manually organized per language, per subject (either HL or FAL), and per examination opportunity. Language examinations consist of three papers, i.e., paper one for language, paper two for literature, and paper three for creative writing (, , ). Second, plain text was extracted from the PDF files using pdftotext (version 22.02.0), which is language-independent, on an Ubuntu Linux platform. Third, the plain texts were tokenized (and sentencized) using Ucto (version 0.21.1) to identify the individual words and sentences in the texts; both tools are open source. Fourth, the reading comprehension and summarization texts were manually extracted from the tokenized plain text files. Note that some examination papers contain more than one reading comprehension text. The names of all text files contain relevant metadata (language, subject, year, month, and file type).
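The second and third steps above can be sketched as follows. The exact command-line options are assumptions based on common usage of the two tools (pdftotext writes a plain text file next to the PDF; Ucto's `-L` selects a language configuration and `-n` emits one sentence per line), as are the file paths:

```python
import subprocess
from pathlib import Path


def pipeline_commands(pdf_path: Path) -> list[list[str]]:
    """Build the extraction and tokenization commands for one paper.

    The option names are assumptions based on common usage of the
    tools, not a record of the authors' exact invocations.
    """
    txt_path = pdf_path.with_suffix(".txt")
    tok_path = pdf_path.with_suffix(".tok")
    return [
        # Step two: language-independent plain-text extraction.
        ["pdftotext", str(pdf_path), str(txt_path)],
        # Step three: tokenization and sentence splitting with Ucto
        # ("generic" is a placeholder language configuration).
        ["ucto", "-L", "generic", "-n", str(txt_path), str(tok_path)],
    ]


def run_pipeline(pdf_path: Path) -> None:
    """Run both commands, stopping on the first failure."""
    for cmd in pipeline_commands(pdf_path):
        subprocess.run(cmd, check=True)
```

Separating command construction from execution makes the pipeline easy to inspect or dry-run before the external tools are invoked.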

Table 1 provides an overview of the distribution of the files in the data set. Table 2 provides an overview of the token and type (i.e., unique tokens) counts of the full examination texts, whereas Table 3 provides the same information for the extracted reading comprehension and summarization texts.

Table 1

Distribution of texts per language and subject level for both examination texts and extracted reading comprehension and summarization texts.

Table 2

Token and type count per language and subject level for the full examination texts.

Table 3

Token and type count per language and subject level for the extracted reading comprehension and summarization texts.


The data set contains 429 full examination text files. Of these, 223 are HL texts, containing 689,730 tokens and 88,009 types, whereas the 206 FAL texts contain 624,821 tokens and 73,451 types. In addition to the full examination texts, the reading comprehension and summary writing parts of the examinations were extracted manually, resulting in 929 texts (481 HL and 448 FAL) with 472,430 tokens and 87,779 types. The extracted HL texts consist of 269,881 tokens and 59,007 types, whereas the extracted FAL texts consist of 202,549 tokens and 46,356 types.
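The aggregate figures above are internally consistent, with one caveat worth making explicit: file and token counts sum across the two subjects, but type counts do not, because the HL and FAL texts share vocabulary. A quick check (all numbers copied from the paragraph above):

```python
# Counts reported in the text, per subject level.
full = {
    "HL":  {"files": 223, "tokens": 689_730, "types": 88_009},
    "FAL": {"files": 206, "tokens": 624_821, "types": 73_451},
}
extracted = {
    "HL":  {"files": 481, "tokens": 269_881, "types": 59_007},
    "FAL": {"files": 448, "tokens": 202_549, "types": 46_356},
}

# File and token counts are additive across subjects.
assert full["HL"]["files"] + full["FAL"]["files"] == 429
assert extracted["HL"]["files"] + extracted["FAL"]["files"] == 929
assert extracted["HL"]["tokens"] + extracted["FAL"]["tokens"] == 472_430

# Type counts are not additive: the union over both subjects
# (87,779 types) is smaller than the sum of the per-subject counts,
# because many word forms occur in both HL and FAL texts.
assert extracted["HL"]["types"] + extracted["FAL"]["types"] > 87_779
```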

The data set is distributed as a single ZIP file. The files are first divided by file type (directories called pdf for PDF files, txt for the extracted UTF-8 Unicode text files, tok for the corresponding tokenized files, and ext for the manually extracted reading comprehension and summarization texts). Next, the files are divided into directories corresponding to their languages. Within these directories, the files are divided into directories for the two subjects, namely HL and FAL. Finally, the files are divided over three examination months, namely February or March, May or June, and November, in the Feb, May, and Nov directories respectively. The files follow a consistent naming scheme: lang_subj_month_year.ext, with lang the name of the language; subj the subject level; month either Feb-March, May-June, or Nov, depending on the months in which the examinations were written; and year ranging from 2008 to 2020. ext represents the file extension, either txt for text files or pdf for PDF files. For the extracted reading comprehension and summarization files (found in the ext directory), a _type component appears before the extension (.ext). This type can take the values RC1 or RC2 for the first and second reading comprehension texts respectively, or SUM for the summarization texts. For instance, the filename IsiZulu_FAL_Nov_2009_SUM.txt indicates a summary text from an isiZulu FAL examination written in November 2009.
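The naming scheme above can be parsed mechanically; a minimal sketch (the function name and error handling are our own, not part of the data set):

```python
import re
from typing import Optional


def parse_filename(name: str) -> dict[str, Optional[str]]:
    """Split a file name following the lang_subj_month_year[_type].ext
    scheme into its metadata fields.

    The optional "type" field (RC1, RC2, or SUM) is None for full
    examination texts, which carry no such component.
    """
    pattern = re.compile(
        r"^(?P<lang>[^_]+)_(?P<subj>HL|FAL)_"
        r"(?P<month>Feb-March|May-June|Nov)_"
        r"(?P<year>20(?:0[89]|1[0-9]|20))"   # 2008-2020
        r"(?:_(?P<type>RC1|RC2|SUM))?"
        r"\.(?P<ext>txt|pdf)$"
    )
    match = pattern.match(name)
    if match is None:
        raise ValueError(f"unrecognised filename: {name}")
    return match.groupdict()
```

For example, `parse_filename("IsiZulu_FAL_Nov_2009_SUM.txt")` yields the language, subject, month, year, text type, and extension as separate fields, which is convenient when grouping files for the comparisons discussed below.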

Sampling strategy

All available examination texts have been downloaded from South Africa’s Department of Basic Education’s website. However, as can be seen in Table 1, for some languages certain examination texts have not been made available.

One full examination text consists of reading comprehension texts in the first section, summary writing texts in the second section, and visual texts and language convention texts in the third section. We excluded the third section as it regularly contains graphics, such as cartoons or advertisements, and often contains deliberate errors.

Quality control

The authors manually checked the contents of the texts to ensure that all sections found in the PDF documents are also present in the plain text variants. Additionally, the texts were checked for consistent use of text encodings, in particular relating to diacritics in Tshivenḓa and Afrikaans.
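One part of such an encoding check can be automated: diacritics like the ḓ in Tshivenḓa can be stored either as a single precomposed character or as a base letter plus a combining mark, and mixing the two forms breaks string matching and frequency counts. A sketch that flags lines not in NFC (composed) form; this is our illustration of the idea, not the authors' actual procedure:

```python
import unicodedata


def non_nfc_lines(text: str) -> list[int]:
    """Return the 1-based line numbers whose content is not in NFC form.

    A decomposed "ḓ" (ASCII "d" followed by U+032D, COMBINING CIRCUMFLEX
    ACCENT BELOW) is flagged, while the precomposed character U+1E13
    passes unchanged.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if unicodedata.normalize("NFC", line) != line
    ]
```

Running this over the txt and tok files would surface any file where the PDF extraction produced decomposed diacritics.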

3 Data Set Description

Object name

Final year high school examination texts of South African home language and first additional language subjects.

Format names and versions

PDF, UTF-8 encoded text files.

Creation dates

Start date: 2021-02-01; end date: 2022-10-15.

Data set creators

Johannes Sibeko and Menno van Zaanen.

Language

The data set contains texts in Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, Siswati, Tshivenḓa, and Xitsonga. Metadata is provided in English.

License

Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Repository name

South African Centre for Digital Language Resources.

Publication date


4 Reuse Potential

Research in linguistic and text complexity (related to the readability of texts) has been ongoing for over a century (; ). However, such research on South African languages has lagged behind, resulting in limited resources for analysing the readability and complexity of texts in the indigenous languages ().

The texts in the data set allow for several linguistic comparisons: diachronically (over time), between languages (i.e., cross-lingual comparison), between subjects (HL versus FAL), between types of texts (summary versus reading comprehension), and between examination dates (February versus May versus November). Further annotation of the texts (on part-of-speech, partial parsing, named entities, etc.) allows for investigations into these textual properties. As the data set contains texts in languages with both disjunctive and conjunctive orthographies, investigations into the influence of orthography can be performed. For instance, the orthography of isiZulu has been shown to affect reading ability ().
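As a minimal illustration of a between-subject comparison, a type-token ratio (a crude lexical-diversity proxy, not one of the readability measures used in the articles cited above) can be computed from the tokenized files:

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Ratio of unique word forms (types) to total word forms (tokens).

    Tokens are lower-cased so that sentence-initial capitalisation does
    not inflate the type count. Note that the ratio depends on text
    length, so texts should be compared at similar sizes.
    """
    if not tokens:
        raise ValueError("empty token list")
    return len({token.lower() for token in tokens}) / len(tokens)
```

Applied per subject (e.g., to all HL versus all FAL extracted texts of one language), this gives a first, rough indication of the lexical-complexity difference the articles above investigate with more robust measures.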

More content-oriented research can consider the different genres used. For instance, from around 2008 to 2011, literary texts (taken from books that are not part of the official curriculum in the language) were used for reading comprehension, whereas from 2012 onwards these texts were drawn mostly from newspaper and magazine articles. This allows for research into the influence of the different genres ().

The data set is, to our knowledge, the first corpus of its kind in the South African educational context. The data allows for research in the realm of education. As examples, we mention investigations into the themes of the texts used for the different languages and subjects, investigations into overall learner achievement across similar languages (e.g., those in the Nguni or the Sotho group), and investigations into learners’ reading abilities, as the current Progress in International Reading Literacy Study (PIRLS) indicates declining reading abilities over the years ().