(1) Overview
Repository location
DaYta Ya Rona: https://doi.org/10.25388/nwu.23708229.
Context
Swearing, offensive language, and taboo language have been studied for many years in a variety of scientific contexts, including computational linguistics, psychology, sociology, and various subdisciplines of linguistics – see Stapleton et al. () for a recent overview. While the majority of the scientific literature focuses on English, studies have also been undertaken for other languages, including Cantonese, Danish, Dutch, Finnish, French, Italian, Japanese, Latin, and Russian. In the South African context, and specifically for Afrikaans, relatively little research has been done in this area, apart from work on the lexicographic handling of swearwords (; ), language acquisition (), language change (), lexicology and onomastics (, , ; ; ), sociolinguistics (; ; ), and grammatical aspects of swearing (; ; ; , ). Until recently, no research had been done on user attitudes to Afrikaans taboo language.
To address this shortcoming, a multidisciplinary research project – What the Swearword! – was initiated to investigate various aspects of taboo language in Afrikaans and other languages in its ecosystem (). An important part of the project is the collection of empirical data related to, among other things, the prototypicality of swearwords (; ), attitudes to parental control (, ), and user attitudes to swearwords (). The methodology and resultant dataset of the latter are the focus of this article.
(2) Method
Steps
To collect data on self-reported attitudes to swearwords, short online surveys for individual words have been posted periodically on the project website and advertised via social media platforms. All respondents must first register, free of charge, as users on the project website. During registration, respondents first give their informed consent and then provide, once only, some sociodemographic information, translated and summarised in Table 1. These sociodemographic factors and their values were informed by the above-mentioned previous studies, as well as other sociopragmatic studies of offensive words, in which one or more of these factors have been statistically correlated with usage of and attitudes to such words (see , , ), and Beers Fägersten (); Beers Fägersten and Stapleton (); Beers Fägersten and Stapleton () especially). A summary of the sociodemographic responses of the survey participants is available in the data repository as part of the dataset.
SOCIODEMOGRAPHIC FACTORS | DATA TYPE | OPTIONS |
---|---|---|
Age group | Ordinal (3) | 18–39; 40–59; 60+ |
Sex | Nominal (2) | Male; Female |
Population group | Nominal (4) | Black; Coloured; Indian; White |
Height (cm) | Ordinal (7) | >199; 190–199; 180–189; 170–179; 160–169; 150–159; <150 |
Mother’s primary language | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch |
Father’s primary language | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch |
Language used primarily with family | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option |
Language used primarily with friends | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option |
Language primarily used for work | Nominal (13) | List of South Africa’s eleven official languages, plus Dutch, as well as a bilingual (Afrikaans and English) option |
Languages proficient in | Nominal (12) | List of South Africa’s eleven official languages, plus Dutch |
Identification with a geolect | Nominal (2) | Yes; No. If “yes”, then the respondent gets a list of typical Afrikaans geolects to choose from, or to specify their own geolect. |
Country of residence | Nominal (12) | South Africa, with a specification for one of the nine provinces; Namibia; Belgium; The Netherlands |
Period in country of residence | Ordinal (4) | >6 years; 4–6 years; 1–3 years; <1 year |
Country of childhood | Nominal (12) | South Africa, with a specification for one of the nine provinces; Namibia; Belgium; The Netherlands |
Highest qualification | Nominal (10) | List of typical kinds of qualification in South Africa |
Income group | Ordinal (7) | List of typical categories |
Identification with a gender group | Nominal (2) | Yes; No. If “yes”, then the respondent can specify their own gender group. |
Religiousness as a child/teenager | Nominal (5) | Very religious; Religious; Somewhat religious; Not really; Not at all |
Religiousness currently | Nominal (5) | Very religious; Religious; Somewhat religious; Not really; Not at all |
Political views | Nominal (5) | Very conservative; Conservative; Moderate; Liberal; Very liberal |
World view (pertaining to moral and social issues) | Nominal (5) | Very conservative; Conservative; Moderate; Liberal; Very liberal |
To gather responses on participants’ attitudes towards different words, an online single-word survey (SWS) template was designed. In each SWS, only one swearword is presented to respondents, in an attempt to prevent so-called “respondent fatigue” – a well-documented phenomenon that occurs when survey participants become tired of the survey task and the quality of the data they provide begins to deteriorate (). The assumption is that this approach covers more words over a period of time than would be feasible if all the words were presented to participants in a single session ().
A significant challenge of this SWS approach is that the responses for the different words are not collected during a single session from the same respondents. For example, the SWS for word X could have been completed by 200 of the more than 2,000 registered users, while the SWS for word Y, posted a week later, could have been completed by only 120 respondents, with only some (if any) overlap between the two groups of respondents.
Table 2 provides a summary of the words in the dataset along with the number of respondents who completed the survey for each word.
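The overlap between the respondents of two SWSs can be quantified directly from the respondent identifiers. The following is a minimal sketch in Python; the identifiers and the helper name `respondent_overlap` are hypothetical illustrations and are not part of the dataset or of the project's own tooling.

```python
def respondent_overlap(ids_x, ids_y):
    """Return (number of shared respondents, Jaccard index) for two SWSs."""
    set_x, set_y = set(ids_x), set(ids_y)
    shared = set_x & set_y
    union = set_x | set_y
    # Jaccard index: shared respondents as a fraction of all respondents.
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard


# Hypothetical respondent identifiers for two single-word surveys.
word_x_ids = ["r001", "r002", "r003", "r004"]
word_y_ids = ["r003", "r004", "r005"]

shared_count, jaccard = respondent_overlap(word_x_ids, word_y_ids)
print(shared_count, round(jaccard, 2))
```

A Jaccard index close to 0 would indicate that two words were rated by largely different groups, which matters whenever scores for those words are compared.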
WORD | TOTAL RESPONSES | RESPONSES WITH COMPLETE METADATA | WORD | TOTAL RESPONSES | RESPONSES WITH COMPLETE METADATA |
---|---|---|---|---|---|
asshole | 8 | 7 | jirre | 13 | 10 |
ballas | 12 | 10 | jissis | 184 | 152 |
bebliksemd | 189 | 155 | kak | 197 | 167 |
bedonderd | 147 | 123 | kerriekop | 25 | 19 |
befok | 13 | 9 | kont | 48 | 38 |
bekak | 11 | 8 | kots | 163 | 135 |
blerrie | 31 | 18 | magtig | 12 | 12 |
bliksem | 19 | 16 | ma-se-poes | 15 | 11 |
bliksems | 104 | 88 | moer | 20 | 16 |
boudservette | 194 | 155 | moerskont | 12 | 9 |
demmit | 12 | 11 | moffie | 222 | 180 |
donder | 12 | 11 | naai | 23 | 20 |
doos | 34 | 24 | naaier | 18 | 15 |
drol | 14 | 11 | piel | 29 | 18 |
eiers | 7 | 6 | piele | 208 | 167 |
etter | 22 | 13 | pis | 9 | 9 |
feeks | 184 | 147 | poep | 10 | 9 |
flerrie | 77 | 67 | poephol | 39 | 30 |
flippen | 133 | 114 | poes | 26 | 19 |
fok | 21 | 16 | rooikop | 125 | 104 |
fokken | 208 | 174 | shit | 21 | 16 |
fokker | 55 | 44 | skyt | 130 | 109 |
fokkit | 45 | 31 | slet | 18 | 14 |
fokkof | 9 | 5 | slymkonyn | 169 | 138 |
fokkol | 16 | 13 | stront | 9 | 8 |
foktog | 18 | 15 | swerkater | 20 | 17 |
frieken | 158 | 126 | swernoot | 178 | 147 |
fuck | 16 | 12 | teef | 10 | 7 |
gat | 5 | 3 | tos | 159 | 126 |
god | 26 | 22 | tril | 5 | 4 |
gots | 154 | 125 | voëlverklikker | 150 | 120 |
hel | 16 | 15 | wetter | 6 | 6 |
helleveeg | 130 | 108 | wolgordyn | 129 | 110 |
hoer | 29 | 19 | wortelkop | 222 | 176 |
hol | 10 | 9 | | | |
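The counts in Table 2 can be used to decide which words have enough responses for a given analysis. The sketch below copies a handful of rows from Table 2 and computes the share of responses with complete metadata; the cut-off `MIN_COMPLETE` is a hypothetical value chosen for illustration only, not a threshold recommended by the dataset creators.

```python
# (total responses, responses with complete metadata), copied from Table 2
# for a few words only.
table2 = {
    "gat": (5, 3),
    "kak": (197, 167),
    "magtig": (12, 12),
    "moffie": (222, 180),
    "wortelkop": (222, 176),
}


def completeness(total, complete):
    """Share of responses that come with complete sociodemographic metadata."""
    return complete / total


MIN_COMPLETE = 100  # hypothetical cut-off, for illustration only

# Words with at least MIN_COMPLETE responses with complete metadata.
usable = sorted(
    word for word, (_, complete) in table2.items() if complete >= MIN_COMPLETE
)

for word, (total, complete) in table2.items():
    print(f"{word}: {completeness(total, complete):.0%} complete")
print("usable:", usable)
```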
Each word is judged on at least seven dimensions relating to a respondent’s attitude to the word; an eighth dimension only pertains to some words where the sex of the referent might be relevant (e.g., whether a word like soutie ‘English person’ can be used to refer to men and women alike). These dimensions and their corresponding questions are translated and listed in Table 3. For each dimension, a respondent must assign a value between 1 and 9, where only the two extreme values of the scale are labelled.
DIMENSION | QUESTION | END-POINT LABELS |
---|---|---|
Production frequency | How often do you say or write the word? | Never … Very often |
Perception frequency | How often do you hear or read the word? (E.g., in conversations, on the radio or TV, in magazines or books, on the internet, etc.) | Never … Very often |
Offensiveness (self) | How offensive do you find the word personally? | Not at all … Very |
Tabooness (others) | How taboo or socially unacceptable is the word for people in general? (E.g., in a workspace, classroom, at a party with friends, family, and colleagues) | Not at all … Very |
Emotionality | What emotional charge does the word have for you? | Very negative … Very positive |
Conspicuousness | How conspicuous is the word? (To what degree does it grab your attention?) | Not at all … Very |
Familiarity | How well do you know what the word means? | Not at all … Very well |
Sex of referent | Can the word be used to refer only to men, to men and women, or only to women? | Women only … Men only |
All data are stored in a relational database, and then extracted to create a single UTF-8 encoded CSV file. Each line in the file has 54 columns consisting of the swearword, the respondent’s unique identifier, the responses of the respondent to the word, and the sociodemographic information of the respondent in both ordinal and text format.
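As an illustration of how such a file might be read, the sketch below computes the mean rating per word for one dimension using only the Python standard library. The column names (`word`, `respondent_id`, `offensiveness`) and the sample rows are assumptions made for this example; the actual header names in the 54-column file may differ and should be checked against the repository.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical excerpt: the real file has 54 columns and its header names
# may differ from the ones assumed here.
SAMPLE_CSV = """word,respondent_id,offensiveness
kak,r001,3
kak,r002,5
moffie,r001,9
moffie,r003,8
moffie,r004,
"""


def mean_rating_per_word(handle, column):
    """Average one rating column per word, skipping empty answers."""
    ratings = defaultdict(list)
    for row in csv.DictReader(handle):
        raw = row[column].strip()
        if not raw:
            continue  # missing answer
        value = int(raw)
        if not 1 <= value <= 9:  # ratings are given on a 1 to 9 scale
            raise ValueError(f"rating out of range: {value}")
        ratings[row["word"]].append(value)
    return {word: mean(values) for word, values in ratings.items()}


means = mean_rating_per_word(io.StringIO(SAMPLE_CSV), "offensiveness")
print(means)
```

For the real file, the in-memory sample would be replaced by something like `open(path, encoding="utf-8", newline="")`, where `path` points to the downloaded CSV.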
Sampling strategy
Because the aim of the project is not to collect data for decision making, but rather to describe swearwords sociopragmatically, it is less important to target fully stratified respondent samples. Consequently, non-probability sampling of respondents is a valid approach: volunteer respondents are recruited through respondent-driven opportunistic sampling, as formalised by Heckathorn (), and snowball sampling via social media (). These techniques have the potential advantage of including so-called “hidden populations”, that is, respondents who would not otherwise participate in research projects dealing with taboo topics and swearwords.
(3) Dataset Description
Object name – Afrikaans swearword scores
Format names and versions – UTF-8 encoded CSV version 1.0
Creation dates – 2019/07/01 – 2023/05/31
Dataset creators
Gerhard B. van Huyssteen (Organisation, Design, Collection, Quality Control), North-West University
Cornelius van der Walt (Website development, Data processing), BlueTek Computers
Jaco du Toit (Data processing), North-West University
Roald Eiselen (Data processing), North-West University
Nico Oosthuizen (Data processing), Independent
Language – Afrikaans (af)
License – Creative Commons Attribution 4.0 International
Repository name – DaYta Ya Rona
Publication date – 2023/07/19
(4) Reuse Potential
Since this is the first empirical dataset on user perceptions of Afrikaans swearwords, it holds great potential for use in numerous language-specific (i.e., Afrikaans) sociopragmatic and/or sociolinguistic investigations. For example, the data can be used to compare specific words within the same domain, as Van Huyssteen and Eiselen () have done for the words feeks (“shrew”) and helleveeg (“harridan”), or across semantic domains (e.g., a comparison of words from the sex domain with words from the religious domain). The dataset could also be used fruitfully to investigate sociodemographic predictors of tabooness, offensiveness, and the like.
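One simple way to compare two words on an ordinal 1 to 9 dimension is a rank-based effect size such as Cliff's delta. The sketch below is only one possible approach and is not necessarily the method used in the study cited above; the ratings are invented placeholders, not values from the dataset.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs of ratings.

    Ranges from -1 (all xs below all ys) to +1 (all xs above all ys).
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Invented placeholder ratings on the 1 to 9 scale for two words.
word_a = [2, 3, 3, 5]
word_b = [4, 6, 7]

delta = cliffs_delta(word_a, word_b)
print(round(delta, 3))
```

Because the measure relies only on the ordering of ratings, it does not assume that the distance between adjacent scale points is equal, and it can be applied to groups of different sizes.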
Given that the sociodemographic factors and their values are based on well-known international research, the dataset could also be used in comparative linguistic research. While specific words cannot necessarily be compared across languages, semantic domains or taboo types (like blasphemies, slurs, or epithets) could be compared. It would, of course, be easier to make such comparisons with Germanic languages, e.g., with the data of Van Sterkenburg () for Dutch, or Beers Fägersten () for Danish.
From a statistical point of view, the data could be used in the modelling of problematic or challenging data. For example, one of the shortcomings of the dataset is the large variation in number of respondents per swearword, ranging from moffie (“gay man”) with 180 responses with complete metadata, to gat (“buttocks”) with only 3 comparable responses (see Table 2). The validity and reliability of data collected over a period of time by means of SWSs should also be compared to data collected in a single, longer survey.
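The practical effect of unequal respondent numbers can be illustrated with the standard error of the mean, which shrinks as the number of ratings grows. The sketch below uses invented ratings and a hypothetical helper, `mean_with_standard_error`; it is meant as an illustration of the issue, not as a recommended modelling strategy.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_standard_error(ratings):
    """Return (mean, standard error of the mean) for a list of ratings."""
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two ratings to estimate the spread")
    return mean(ratings), stdev(ratings) / sqrt(n)


# Invented ratings: a word with 3 responses versus a word with 30 responses
# that follow the same pattern.
small_sample = [2, 5, 8]
large_sample = [2, 5, 8] * 10

m_small, se_small = mean_with_standard_error(small_sample)
m_large, se_large = mean_with_standard_error(large_sample)
print(m_small, round(se_small, 2))
print(m_large, round(se_large, 2))
```

Both samples have the same mean, but the estimate based on three ratings is far less precise, which is why scores for sparsely rated words should be interpreted with caution.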
Lastly, the dataset could also be used for practical, applied purposes. For example, it is currently being used in the so-called Vloekmeter (‘swearing meter’; see vloek.co.za/vloekmeter). The Vloekmeter is purely data-driven: based on this dataset, statistics are presented on an interactive dashboard on the website (see Figure 1). Such an application can be of practical use not only for content creators (like authors and film makers), but especially also for publishers, broadcasting companies (like Netflix), or the South African Film and Publication Board, which might want to provide age and content advisories for books, television series, films, and computer games.