(1) Overview

Repository location

DaYta Ya Rona: https://doi.org/10.25388/nwu.23708229.


Swearing, offensive language and taboo language have been studied for many years in a variety of scientific contexts, including computational linguistics, psychology, sociology, and various subdisciplines of linguistics – see Stapleton et al. () for a recent overview. While the majority of the scientific literature focuses on English, studies have also been undertaken for other languages, including Cantonese, Danish, Dutch, Finnish, French, Italian, Japanese, Latin, and Russian. In the South African context, and specifically for Afrikaans, relatively little work has been done in this area, bar some studies focussing on the lexicographic handling of swearwords (; ), language acquisition (), language change (), lexicology and onomastics (, , ; ; ), sociolinguistics (; ; ), and grammatical aspects of swearing (; ; ; , ). Until recently, no research had been done on user attitudes to Afrikaans taboo language.

To address this shortcoming, a multidisciplinary research project – What the Swearword! – was initiated to investigate various aspects of taboo language in Afrikaans and other languages in its ecosystem (). An important part of the project is the collection of empirical data related to, among other things, the prototypicality of swearwords (; ), attitudes to parental control (, ), and user attitudes to swearwords (). The methodology and resultant dataset of the latter are the focus of this article.

(2) Method


To collect data on self-reported attitudes to swearwords, short online surveys for individual words have been posted periodically on the project website and advertised via social media platforms. All respondents must first register, free of charge, as users on the project website. During registration, respondents first give their informed consent and then provide, once only, some sociodemographic information, translated and summarised in Table 1. These sociodemographic factors and their values have been informed by the above-mentioned previous studies, as well as other sociopragmatic studies of offensive words, where one or more of these factors have been statistically correlated with usage of and attitudes to such words (see , , ), and Beers Fägersten (); Beers Fägersten and Stapleton (); Beers Fägersten and Stapleton () especially). A summary of the sociodemographic responses of the survey participants is available in the data repository as part of the dataset.

Table 1

Summary of sociodemographic factors.


Factor | Scale (number of values) | Values
Age group | Ordinal (3) | 18–39; 40–59; 60+
Sex | Nominal (2) | Male; Female
Population group | Nominal (4) | Black; Coloured; Indian; White¹
Length | Ordinal (7) | >199; 190–199; 180–189; 170–179; 160–169; 150–159; <150
Mother’s primary language | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch
Father’s primary language | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch
Language used primarily with family | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option
Language used primarily with friends | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option
Language primarily used for work | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option
Languages proficient in | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch
Identification with a geolect | Nominal (2) | Yes; No. If “yes”, then the respondent gets a list of typical Afrikaans geolects to choose from, or to specify their own geolect.
Country of residence | Nominal (12) | South Africa, with a specification for one of the nine provinces; Namibia; Belgium; The Netherlands
Period in country of residence | Ordinal (4) | >6 years; 4–6 years; 1–3 years; <1 year
Country of childhood | Nominal (12) | South Africa, with a specification for one of the nine provinces; Namibia; Belgium; The Netherlands
Highest qualification | Nominal (10) | List of typical kinds of qualification in South Africa
Income group | Ordinal (7) | List of typical categories
Identification with a gender group | Nominal (2) | Yes; No. If “yes”, then the respondent can specify their own gender group.
Religiousness as a child/teenager | Nominal (5) | Very religious; Religious; Somewhat religious; Not really; Not at all
Religiousness currently | Nominal (5) | Very religious; Religious; Somewhat religious; Not really; Not at all
Political views | Nominal (5) | Very conservative; Conservative; Moderate; Liberal; Very liberal
World view (pertaining to moral and social issues) | Nominal (5) | Very conservative; Conservative; Moderate; Liberal; Very liberal

To gather responses on participants’ attitudes towards different words, an online single-word survey (SWS) template was designed. In each SWS, only one swearword is presented to respondents, in an attempt to prevent so-called “respondent fatigue” – a well-documented phenomenon that occurs when survey participants become tired of the survey task, and the quality of the data they provide begins to deteriorate (). The assumption is that more words can be covered over a period of time in this way than could reliably be presented to participants in a single session ().

One significant challenge of this SWS approach is that the responses for the different words are not collected during a single session from the same respondents. For example, the SWS for word X could have been completed by 200 of the more than 2,000 registered users, while the SWS for word Y, posted a week later, could have been completed by only 120 respondents, with only some (if any) overlap between the two groups.
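The overlap problem can be made concrete with a small sketch. The respondent identifiers and word labels below are invented for illustration and are not taken from the dataset; the Jaccard index is only one possible way of quantifying overlap, not a measure used by the project.

```python
from itertools import combinations

# Hypothetical respondent IDs per single-word survey (not real data).
respondents_by_word = {
    "word_x": {"r01", "r02", "r03", "r04", "r05"},
    "word_y": {"r04", "r05", "r06"},
    "word_z": {"r07", "r08"},
}


def pairwise_overlap(groups):
    """Map each pair of words to (number of shared respondents, Jaccard index)."""
    result = {}
    for a, b in combinations(sorted(groups), 2):
        shared = groups[a] & groups[b]
        union = groups[a] | groups[b]
        jaccard = len(shared) / len(union) if union else 0.0
        result[(a, b)] = (len(shared), jaccard)
    return result


overlap = pairwise_overlap(respondents_by_word)
for (a, b), (n_shared, jaccard) in overlap.items():
    print(f"{a} vs {b}: {n_shared} shared respondents, Jaccard = {jaccard:.2f}")
```

A low overlap between two surveys suggests treating their responses as independent samples rather than as repeated measures from the same people.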

Table 2 provides a summary of the words in the data set along with the number of respondents who completed the survey for each word.

Table 2

Summary of swearwords and number of respondents for each word.





































Each word is judged on at least seven dimensions relating to a respondent’s attitude to the word; an eighth dimension only pertains to some words where the sex of the referent might be relevant (e.g., whether a word like soutie ‘English person’ can be used to refer to men and women alike). These dimensions and their corresponding questions are translated and listed in Table 3. For each dimension, a respondent must assign a value between 1 and 9, where only the two extreme values of the scale are labelled.

Table 3

Response dimensions.


Dimension | Question | Scale endpoints
Production frequency | How often do you say or write the word? | Never … Very often
Perception frequency | How often do you hear or read the word? (E.g., in conversations, on the radio or TV, in magazines or books, on the internet, etc.) | Never … Very often
Offensiveness (self) | How offensive do you find the word personally? | Not at all … Very
Tabooness (others) | How taboo or socially unacceptable is the word for people in general? (E.g., in a workspace, classroom, at a party with friends, family, and colleagues) | Not at all … Very
Emotionality | What emotional charge does the word have for you? | Very negative … Very positive
Conspicuousness | How conspicuous is the word? (To what degree does it grab your attention?) | Not at all … Very
Familiarity | How well do you know what the word means? | Not at all … Very well
Sex of referent | Can the word be used to refer only to men, to men and women, or only to women? | Women only … Men only
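The response format in Table 3 can be expressed as a simple validation routine. The sketch below is illustrative only: the dimension keys are hypothetical names (the column names in the published file may differ), and it assumes that 1 corresponds to the first endpoint label and 9 to the second.

```python
SCALE_MIN, SCALE_MAX = 1, 9

# Hypothetical keys; endpoint labels follow Table 3 (assumed: 1 = first, 9 = second).
SCALE_ENDPOINTS = {
    "production_frequency": ("Never", "Very often"),
    "perception_frequency": ("Never", "Very often"),
    "offensiveness": ("Not at all", "Very"),
    "tabooness": ("Not at all", "Very"),
    "emotionality": ("Very negative", "Very positive"),
    "conspicuousness": ("Not at all", "Very"),
    "familiarity": ("Not at all", "Very well"),
    "sex_of_referent": ("Women only", "Men only"),
}
# The eighth dimension is only asked for some words, so it may be absent.
OPTIONAL_DIMENSIONS = {"sex_of_referent"}


def validate_response(response):
    """Return a list of problems found in one respondent's answers for one word."""
    problems = []
    for dimension in SCALE_ENDPOINTS:
        value = response.get(dimension)
        if value is None:
            if dimension not in OPTIONAL_DIMENSIONS:
                problems.append(f"{dimension}: missing")
            continue
        if not isinstance(value, int) or not SCALE_MIN <= value <= SCALE_MAX:
            problems.append(f"{dimension}: {value!r} is outside {SCALE_MIN}-{SCALE_MAX}")
    return problems


# Invented example answers: one complete response and one faulty response.
complete = {d: 5 for d in SCALE_ENDPOINTS if d not in OPTIONAL_DIMENSIONS}
faulty = dict(complete, offensiveness=12)
del faulty["familiarity"]

print(validate_response(complete))
print(validate_response(faulty))
```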

All data are stored in a relational database, and then extracted to create a single UTF-8 encoded CSV file. Each line in the file has 54 columns consisting of the swearword, the respondent’s unique identifier, the responses of the respondent to the word, and the sociodemographic information of the respondent in both ordinal and text format.
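As a rough illustration of how such a file might be read, the sketch below groups one response dimension by word and respondent. The header names (`word`, `respondent_id`, `offensiveness`, `age_group`) and all values are assumptions made for this example; the real file has 54 columns whose names should be taken from the repository.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for the real file: hypothetical headers and invented scores.
SAMPLE_CSV = """word,respondent_id,offensiveness,age_group
fokken,r01,7,18-39
fokken,r02,9,60+
flippen,r01,2,18-39
"""


def load_scores(handle, dimension):
    """Read CSV rows and return {word: {respondent_id: score}} for one dimension."""
    scores = defaultdict(dict)
    for row in csv.DictReader(handle):
        scores[row["word"]][row["respondent_id"]] = int(row[dimension])
    return dict(scores)


# For a file on disk one would pass open(path, encoding="utf-8", newline="").
scores = load_scores(io.StringIO(SAMPLE_CSV), "offensiveness")
print(scores)
```

Keying the scores by respondent identifier keeps the link to the sociodemographic columns, which are repeated on every line for the same respondent.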

Sampling strategy

Because the aim of the project is the sociopragmatic description of swearwords rather than the collection of data for decision making, it is not as important to target fully stratified respondent samples. Consequently, non-probability sampling of respondents is a valid approach: volunteer respondents are recruited through respondent-driven opportunistic sampling, as formalised by Heckathorn (), and snowball sampling via social media (). These techniques have the potential advantage of including so-called “hidden populations”, that is, respondents who would not otherwise participate in research projects dealing with taboo topics and swearwords.

(3) Dataset Description

Object name – Afrikaans swearword scores

Format names and versions – UTF-8 encoded CSV version 1.0

Creation dates – 2019/07/01 – 2023/05/31

Dataset creators

Gerhard B. van Huyssteen (Organisation, Design, Collection, Quality Control), North-West University

Cornelius van der Walt (Website development, Data processing), BlueTek Computers

Jaco du Toit (Data processing), North-West University

Roald Eiselen (Data processing), North-West University

Nico Oosthuizen (Data processing), Independent

Language – Afrikaans (af)

License – Creative Commons Attribution 4.0 International

Repository name – DaYta ya Rona

Publication date – 2023-07-19

(4) Reuse Potential

Since this is the first empirical dataset on user perceptions of Afrikaans swearwords, it holds great potential for use in numerous language-specific (i.e., Afrikaans) sociopragmatic and/or sociolinguistic investigations. For example, the data can be used to compare specific words within the same domain, as Van Huyssteen and Eiselen () have done for the words feeks (“shrew”) and helleveeg (“harridan”), or across semantic domains (e.g., a comparison of words from the sex domain with words from the religious domain). The dataset could also be used fruitfully to investigate sociodemographic predictors of tabooness, offensiveness, and the like.
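One way such a word-by-word comparison could be set up for ordinal 1 to 9 scores is with Cliff's delta, a rank-based effect size. This is a sketch of one possible approach and not the method used in the cited study; the scores below are invented, and the two groups are treated as independent samples.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs; ranges from -1 to 1."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Invented offensiveness scores for two hypothetical words.
word_a_scores = [7, 8, 9, 6, 8]
word_b_scores = [3, 4, 6, 2, 5]

delta = cliffs_delta(word_a_scores, word_b_scores)
print(f"Cliff's delta = {delta:.2f}")
```

A value near 1 means that scores for the first word are almost always higher than scores for the second; a value near 0 means the two distributions largely coincide.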

Given that the sociodemographic factors and their values are based on well-known international research, the dataset could also be used in comparative linguistic research. While specific words could not necessarily be compared across languages, semantic domains or taboo types (like blasphemies, slurs, or epithets) could be compared. It would, of course, be easier to do such comparisons with Germanic languages, e.g., with the data of Van Sterkenburg () for Dutch, or Beers Fägersten () for Danish.

From a statistical point of view, the data could be used in the modelling of problematic or challenging data. For example, one of the shortcomings of the dataset is the large variation in the number of respondents per swearword, ranging from moffie (“gay man”) with 188 responses with complete metadata, to gat (“buttocks”) with only 3 comparable responses (see Table 2). The validity and reliability of data collected over a period of time by means of SWSs should also be compared with those of data collected in a single, longer survey.
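Before any modelling, the uneven coverage can be inspected with a simple count per word. In the sketch below the two counts (188 and 3) come from the text above, whereas the respondent identifiers and the threshold of 30 responses are arbitrary assumptions for illustration.

```python
from collections import Counter


def flag_sparse_words(rows, min_responses):
    """Count responses per word and list the words with fewer than min_responses."""
    counts = Counter(word for word, _respondent in rows)
    sparse = sorted(word for word, n in counts.items() if n < min_responses)
    return counts, sparse


# (word, respondent_id) pairs; identifiers are invented placeholders.
rows = [("moffie", f"r{i:03d}") for i in range(188)]
rows += [("gat", f"r{i:03d}") for i in range(3)]

counts, sparse = flag_sparse_words(rows, min_responses=30)
print(counts.most_common())
print("Words below threshold:", sparse)
```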

Lastly, the dataset could also be utilised for practical, applied purposes. For example, it is currently being used in the so-called Vloekmeter (‘swearing meter’; see vloek.co.za/vloekmeter). The Vloekmeter is purely data-driven: based on this dataset, statistics are presented on an interactive dashboard on the website (see Figure 1). Such an application can be of practical use not only for content creators (like authors and film makers), but especially also for publishers, broadcasting companies (like Netflix), or the South African Film and Publication Board, which might want to provide age and content advisories for books, television series, films, and computer games.

Figure 1 

Vloekmeter showing results for fokken (“fucking”) and flippen (“fricking”).