1 Context and motivation

In multilingual contexts, mixed output, featuring elements from two or more languages, is ubiquitous. Utterance (1), for example, demonstrates an instance of what is known as “code-switching”, a construction in which a speaker alternates between different languages (in this case, Vietnamese and English).

(1)  mỗi    group    phải          a different focus
     each            must have
     “Each group must have a different focus.”
     (CanVEC, )

Although multilingualism is the norm world-wide (), NLP tools capable of processing more than one language per “sentential unit” as in (1) are still rather limited. This effectively limits important applications such as machine translation (MT) and information retrieval (IR), as well as the utility of NLP-based technology in contexts where language-users readily employ two or more languages side by side. Furthermore, as in other areas of NLP, while some efforts have been made to investigate relatively high-resource language pairs such as English-Spanish (e.g. ; ; ; ) or English-Chinese (e.g. ; ; ), work examining code-switching involving low-resource or less-described languages remains largely neglected. This means very few resources are available to automatically process this kind of data. With this in mind, two members of our team recently developed a toolkit to process the Canberra Vietnamese-English Corpus (CanVEC), an original corpus of 10 hours of natural mixed speech involving 45 Vietnamese-English migrant speakers living in Canberra. The corpus is semi-automatically annotated with language information and part-of-speech (POS) tags, obtaining >90% accuracy on both tasks ().

In this work, we test the wider feasibility of this framework in processing multilingual corpora by extending its application to another language pair, Hindi-English. Although Hindi-English is one of the more thoroughly investigated language pairs in the context of code-switching (e.g. ; ; ; ), it nevertheless still provides a good test-bed in which to evaluate multilingual-corpus processing tools. We particularly focus on the language-identification task, for which we rely on the annotated data released in the International Conference on Natural Language Processing (ICON) 2016 shared task (). In what follows, we report the result of this pilot as well as the challenges and implications that emerged.

2 Related Work

It should be noted at the outset that language identification is one of the most important and well-studied tasks in computational approaches to code-switching, because it is often a prerequisite for more complex downstream NLP tasks such as POS tagging, machine translation and speech recognition (; ; ). However, since monolingual processing tools tend to be less accurate in short or unidentified code-switching contexts, custom multilingual approaches such as dictionary lookup, language models, morphological and phonological analysis, and machine learning techniques have become increasingly popular in recent years (; ; ; ; ; ). In particular, a wide range of machine learning algorithms, such as Maximum Entropy, Naïve Bayes, Logistic Regression, and Support Vector Machines, have been developed for code-switching language identification in many different language pairs or even triples, including English-Spanish (; ; ), English-Hindi (), English-Mandarin (), Spanish-Wixarika (), German-Turkish (), Turkish-Dutch (), Modern Standard Arabic-Egyptian Arabic (), English-Hindi-Bengali (), and Romanized Moroccan Arabic (Darija)-English-French (), among others. Reported performance often reaches a mid-90s F-score for English-Spanish or English-Hindi, but is much lower (80–85 F-score) for less popular language pairs such as Modern Standard Arabic-Egyptian Arabic or Nepalese-English.

Machine learning methods, however, typically require a large amount of training data which may not always be available for low-resource languages participating in language contact. While this kind of data is nevertheless available for Hindi-English code-switching, in this study, we use it purely as a test set to investigate the performance of our approach in the hope that a similar methodology can also be applied to other, less well-resourced language pairs. We particularly hope that this will be of interest to traditional linguists who may be inexperienced with machine learning but who would otherwise have to annotate data manually.

3 Methodology

3.1 ICON-2016 data

The goal of the ICON-2016 shared task was to automatically annotate code-switched Hindi-English, Bengali-English and Telugu-English social media posts (Facebook/Twitter/WhatsApp) with either fine-grained or coarse-grained part-of-speech (POS) tags (). Participants were provided with word tokenised social media posts that were already annotated with native language information. Since the goal of this paper is to investigate the automatic annotation of language information in code-switched data, we ignore the POS annotations and only make use of the language tags. Specifically, we focus on the Hindi-English subset of the corpus for which there are seven possible tags (Table 1).

Table 1

The different language tags in the data, their meaning and some examples.


TAG      MEANING         EXAMPLE
en       English         I, the, songs, listening
hi       Hindi           Apna (mine), ladki (girl), peeti (drinks)
univ     Universal       #, !, @abc, #happy
mixed    Mixed           Dedh-litre (1.5 litre)
acro     Acronym         IITB, USA
ne       Named Entity    Europe, Paris
undef    Undefined       M

We downloaded the Facebook, Twitter and WhatsApp Hindi-English data from the shared task website. The distribution of the seven language tags for each dataset and overall is shown in Table 2.

Table 2

The distribution of language tags across datasets and overall.


TAG      FACEBOOK    TWITTER    WHATSAPP    OVERALL
en       13,214      3,732      363         17,309
hi       2,857       9,779      2,539       15,175
univ     3,628       3,354      281         7,263
mixed    7           1          0           8
acro     251         32         0           283
ne       656         413        35          1,104
undef    2           0          0           2
Total    20,615      17,311     3,218       41,144

Since several of these tags are relatively low frequency, we collapsed the mixed, acro, ne and undef tags into the univ category. This was partly because multi-class classification is more challenging with a greater number of labels (especially extremely rare labels), but also because we saw little reason to differentiate between these tags in the language identification task. For example, certain acronyms (e.g. DJ) and named entities (e.g. Holi) can be said to belong to both languages, yet are rarely indicative of code-switching. Similarly, while mixed tokens are certainly interesting examples of code-switching at a morphological level, they are extremely rare in the given dataset (N = 6) and so did not warrant a dedicated label.
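As a minimal sketch of this preprocessing step (assuming the ICON-2016 files have already been read into (token, tag) pairs; that representation is our assumption, not the shared task's format), the collapsing amounts to a simple tag mapping:

```python
# Sketch: collapse the rare tags into the "univ" category.
# The (token, tag) pair representation is an assumption about how the
# ICON-2016 files are read in, not the shared task's actual format.
TAG_MAP = {"mixed": "univ", "acro": "univ", "ne": "univ", "undef": "univ"}

def collapse_tags(tagged_tokens):
    """Map mixed/acro/ne/undef to 'univ'; leave 'en', 'hi' and 'univ' unchanged."""
    return [(token, TAG_MAP.get(tag, tag)) for token, tag in tagged_tokens]

# Example: collapse_tags([("Holi", "ne"), ("DJ", "acro"), ("aur", "hi")])
# -> [("Holi", "univ"), ("DJ", "univ"), ("aur", "hi")]
```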

The final distribution of labels across the reprocessed datasets is shown in Table 3. It is interesting to note that the distribution of languages is different across datasets, with Facebook being predominantly English (64%), and Twitter and WhatsApp being predominantly Hindi (56% and 78% respectively). It is also notable that universal tokens comprise a significant proportion of the data and are roughly as prevalent as the minority code-switching language in all datasets.

Table 3

Final distribution of language tags after preprocessing.


TAG      FACEBOOK    TWITTER    WHATSAPP    OVERALL
en       13,214      3,732      363         17,309
hi       2,857       9,779      2,539       15,175
univ     4,544       3,800      316         8,660
Total    20,615      17,311     3,218       41,144

This can possibly be explained by the fact that social media data comes with its own set of particular challenges (as reviewed in ), e.g. typos, intentional spelling deviations (e.g. “okkkk”), abbreviated Internet slang (e.g. “lol”, “smh”), and non-linguistic expressions (e.g. emoticons, URLs, hashtags, @ mentions, etc.), many of which are language-agnostic (i.e. universal). Universal tokens may thus be more prevalent in social media posts than in other genres of text. These challenges nevertheless play a central role in our decision-making process, and will be discussed throughout this paper.

3.2 Approach

Following L. Nguyen and Bryant (), our approach to token-based language identification is rule-based and relies on a word list for each language. For English, we used a custom Hunspell word list that contained a combination of American, British, Canadian and Australian variant spellings. It was important to allow all these variants in order to maximise the chance that a word would be properly classified. For Hindi, we used a list of 30,000 transliterations that had been extracted from an online Hindi lyric database () and made available in the Forum for Information Retrieval Evaluation (FIRE) 2013 shared task (). We used this dataset because social media users tend not to switch between Devanagari script for Hindi and Roman script for English, and instead use Roman script for everything, transliterating Hindi as necessary. Since there is no standard way of transliterating Hindi to English however (see Section 5 for more discussion), this list represents the largest resource we could find that also contains several variant Roman transliterations for the same Hindi word. We consequently hoped it would have sufficiently large coverage. It is worth mentioning that although an equivalent Hunspell word list for Hindi is also publicly available, it uses Devanagari script and so is incompatible with the ICON-2016 data.
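For concreteness, word lists of this kind can be loaded into sets for fast lookup roughly as follows (a sketch only: the one-word-per-line format and the file names are assumptions, not a description of the actual resources):

```python
def load_word_list(path):
    """Load a word list (assumed one word per line) into a lowercased set."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Hypothetical file names for the English Hunspell variants and the
# FIRE 2013 Hindi transliteration list.
en_words = load_word_list("english_hunspell_variants.txt")
hi_words = load_word_list("hindi_fire2013_transliterations.txt")
```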

Before making use of these resources, however, we first wrote a number of rules to classify universal (i.e. language-agnostic) tokens. In particular, a token is classified as universal if it meets at least one of the following criteria (a code sketch of these rules follows the list):

  1. It does not contain any alphanumeric characters; e.g. punctuation;
  2. It contains “@”, “#” or “http”, or else is “RT”; e.g. @usernames, #topics, URLs and retweets;
  3. If non-alphanumeric characters are deleted, the string is a number; e.g. dates and times;
  4. It starts with “:” or “;”; e.g. emoticons.
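The sketch below shows one way these four rules could be implemented; it approximates the rules listed above rather than reproducing the toolkit's exact code.

```python
import re

def is_universal(token):
    """Return True if a token should be tagged 'univ' (a sketch of the four rules above)."""
    # 1. No alphanumeric characters at all, e.g. punctuation.
    if not any(ch.isalnum() for ch in token):
        return True
    # 2. @mentions, #topics, URLs and retweet markers.
    if "@" in token or "#" in token or "http" in token or token == "RT":
        return True
    # 3. A number once non-alphanumeric characters are deleted, e.g. dates and times.
    stripped = re.sub(r"[^0-9A-Za-z]", "", token)
    if stripped.isdigit():
        return True
    # 4. Emoticons starting with ":" or ";".
    if token.startswith((":", ";")):
        return True
    return False
```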

Having tagged universal tokens, the next step was to use the English and Hindi word lists. Specifically, if a token appears in the English word list but not the Hindi word list, it is tagged as English, and if a token appears in the Hindi word list but not the English word list, it is tagged as Hindi. This approach successfully accounted for the vast majority of tokens, but left 3,629 tokens that did not meet either criterion untagged. We hence extracted these tokens and annotated the top 1,000 most frequent ones manually. It is worth noting that 2,569 of the automatically untagged tokens occurred only once in the dataset, so we effectively only annotated tokens that appeared at least twice. The top 20 of these most frequent tokens and their counts are shown in Table 4.
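A sketch of this lookup step, together with the extraction of the remaining untagged tokens for manual annotation, might look as follows (reusing the hypothetical load_word_list and is_universal helpers from the sketches above):

```python
from collections import Counter

def wordlist_tag(token, en_words, hi_words):
    """Tag a token by exclusive word-list membership; return None if unresolved."""
    low = token.lower()
    if low in en_words and low not in hi_words:
        return "en"
    if low in hi_words and low not in en_words:
        return "hi"
    return None  # in both lists or in neither: handled by later steps

def collect_untagged(tokens, en_words, hi_words):
    """Count tokens resolved by neither the universal rules nor the word lists."""
    counts = Counter(t for t in tokens
                     if not is_universal(t)
                     and wordlist_tag(t, en_words, hi_words) is None)
    return counts.most_common()  # e.g. take the top 1,000 for manual annotation
```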

Table 4

The top 20 most frequent ambiguous-language tokens and their frequency.


TOKEN    FREQ.    TOKEN    FREQ.
to       556      this     134
I        496      my       126
a        357      for      126
of       258      aur      122
in       236      h        111
you      212      it       108
is       185      have     104
me       184      on       100
accha    152      or       91
ho       145      hi       88

Of the top 1,000 tokens that were annotated, there were 59 that we were unable to confidently classify as Hindi, English or universal. Most of these (N = 41/59) were ambiguous high-frequency words in both languages, e.g. “to”, which is a discourse marker in Hindi, and “me”, which means either “I” or “in” in Hindi. Of the remaining unannotated tokens, three were unknown abbreviations (“clg”, “mst”, “em”), seven were mixed tokens combining material from more than one language (“100ka”, “sirji”, “prajii”, “newsAik”, “masterni”, “Ep3/18”, “chahiyeShopkeeper”), and eight were simply unknown or indecipherable (“o”, “Yese”, “furra”, “fufa”, “B”, “t”, “tem”, “s”).

Finally, whenever a token was not classified by any word list or rule, it was assigned a tag based on the previous non-universal token in the current message, or else tagged English if it was the first token in the sentence. The decision to ignore universal tokens in this manner was based on the observation that universal tokens form the rarest category and tend not to occur in long contiguous sequences, while the decision to use English as the default language for ambiguous first-word tokens was based solely on the observation that English is slightly more prevalent in the data than Hindi (17k vs. 15k tokens). The final system hence classifies tokens according to the following ordered rules:

  1. Assign label based on manually defined disambiguation word list; else;
  2. Assign label based on universal token rules; else;
  3. Assign label based on exclusive English or Hindi word list membership; else;
  4. Assign label based on previous token label.

It should be noted that the manual disambiguation list takes the highest priority in this system because manual human judgements are considered to be the most reliable.
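Putting these pieces together, the ordered rules can be sketched as a single classification function (again an illustration built on the hypothetical helpers above, not the toolkit's actual implementation):

```python
def classify_message(tokens, manual_tags, en_words, hi_words, default="en"):
    """Tag each token in a message as 'en', 'hi' or 'univ' using the ordered rules.

    manual_tags maps manually disambiguated tokens to labels; default is the
    fallback for an unresolved token with no preceding en/hi context.
    """
    labels = []
    prev_lang = None  # most recent non-universal label in this message
    for token in tokens:
        low = token.lower()
        if low in manual_tags:                               # 1. manual disambiguation list
            label = manual_tags[low]
        elif is_universal(token):                            # 2. universal token rules
            label = "univ"
        else:
            label = wordlist_tag(token, en_words, hi_words)  # 3. exclusive word-list membership
            if label is None:
                label = prev_lang if prev_lang else default  # 4. previous token label / default
        if label in ("en", "hi"):
            prev_lang = label
        labels.append(label)
    return labels
```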

4 Experiments and Results

4.1 Manual disambiguation list size

We evaluated the effectiveness of our approach by comparing the predicted labels against the gold labels in terms of the F1 score, which combines precision (P) and recall (R). In particular, precision is calculated as the proportion of correct labels over predicted labels for a given tag (x_cor/x_pred), while recall is calculated as the proportion of correct labels over gold labels for a given tag (x_cor/x_gold). In other words, precision measures the extent to which a system's predictions of a given tag are correct (i.e. correctness), while recall measures the extent to which a system finds all gold instances of a given tag (i.e. coverage). The F1 score is the harmonic mean of the two. In the context of this work, we specifically compared micro-averaged F1 scores (which weight each class by its frequency) using manual disambiguation lists of different sizes, in order to better understand the relationship between manual annotation and performance, i.e. to what extent a larger word list increases performance. Results are shown in Figure 1.
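Written out explicitly, for a given tag these quantities are

P = \frac{x_{\text{cor}}}{x_{\text{pred}}}, \qquad R = \frac{x_{\text{cor}}}{x_{\text{gold}}}, \qquad F_1 = \frac{2PR}{P + R}

where the micro-averaged scores reported here are obtained by pooling x_cor, x_pred and x_gold over all tags before computing these ratios.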

Figure 1 

Language tagging performance as a function of manual disambiguation list size.

As expected, Figure 1 shows diminishing returns as more manual labels are added. There is nevertheless a large gain, from 84.2 to 86.0 F1, for the first 100 manual tags, which shows that even a small word list of the most frequently ambiguous tokens can provide a significant boost to overall performance. Figure 1 also shows that this improvement begins to level out at roughly 400–600 tokens, which roughly equates to tokens that occur at least 3–4 times in the data. This is a significant point to note as it potentially indicates an optimal level of manual annotation for future work (scaled according to the size of the data).

4.2 General evaluation

In addition to evaluating our system overall, we also evaluated it in terms of P, R and F1 for each language tag in each of the Facebook, Twitter and WhatsApp subsets of the corpus. The results are shown in Table 5, where all systems make use of the full manual disambiguation list.

Table 5

Precision, Recall and F1 scores for each language tag in each corpus.


TAG      FACEBOOK                   TWITTER
         P       R       F1         P       R       F1
en       93.34   98.35   95.78      70.26   81.32   75.39
hi       89.04   85.61   87.30      90.72   82.08   86.19
univ     97.36   84.51   90.48      80.35   87.61   83.82

TAG      WHATSAPP                   OVERALL
         P       R       F1         P       R       F1
en       39.52   80.99   53.12      85.98   94.32   89.95
hi       96.65   78.30   86.51      91.28   82.12   86.45
univ     59.71   78.80   67.94      87.23   85.66   86.44

One of the most interesting results from this table is that performance on Hindi classification is stable across all datasets at 86–87 F1, while performance on English classification varies considerably. Most notably, English classification scores 95.8 F1 on the Facebook data but just 53.1 F1 on the WhatsApp data. This is largely due to precision being very low in the WhatsApp data (39.5). A similar effect is observed in the Twitter data, where precision for English is the lowest of the three tags at 70.3. Our first hypothesis for this observation was that the lower scores on the Twitter and WhatsApp data were a by-product of the decision to label unknown sentence-initial tokens as English by default. In particular, since the majority of tokens in the Twitter and WhatsApp data are Hindi, unlike the Facebook data, these datasets would be more likely to benefit from Hindi as the default language. We hence tried labelling all unknown sentence-initial tokens (i.e. those that do not have a previous token) as Hindi rather than English, ultimately observing little improvement in the classification of English tokens in the Twitter data (75.4 F1 → 76.5 F1) and a noticeable improvement in the WhatsApp data (53.1 F1 → 59.9 F1). Precision for English in the WhatsApp data nevertheless remained low, rising only from 39.5 to 49.8. In order to investigate why there might be such a difference between datasets, and also to further evaluate the efficacy of our approach, we next carried out a manual evaluation of the first 500 tokens in each dataset.

4.3 Qualitative evaluation

4.3.1 Coarse-grained

In our manual qualitative evaluation, we first annotated both the predicted and gold-standard language labels of the first 500 tokens in each dataset as either correct (COR) or incorrect (INC). While it might seem unusual to reannotate the gold standard for correctness, we encountered many cases where the gold standard was incorrect and we wanted to take this into account in the evaluation. Table 6 hence shows the confusion matrices for all combinations of correct and incorrect labels in both our predictions (rows) and the gold standard (columns) for each dataset and overall.

Table 6

Confusion matrices for correct (COR) and incorrect (INC) labels in each dataset.


         FACEBOOK               TWITTER
         GOLD                   GOLD
PRED     COR     INC            COR     INC
COR      466     6              425     25
INC      24      4              35      15

         WHATSAPP               OVERALL
         GOLD                   GOLD
PRED     COR     INC            COR     INC
COR      403     49             1294    80
INC      41      7              100     26

This table shows that there were 1294/1500 (86%) tokens across all datasets where both the prediction and the gold standard were correct. There were a further 80/1500 (5%) tokens where our prediction was correct but the gold standard was incorrect (49 of which occurred in the WhatsApp data), and 100/1500 (7%) tokens where our prediction was incorrect but the gold standard was correct. The remaining 26/1500 (2%) tokens were incorrect in both the prediction and the gold standard. The most significant finding from these results is that of the 206/1500 tokens where at least one label was considered incorrect, just over half of them (106/206) involved an error in the gold standard. This suggests our classifier may actually be more reliable than reported above, as almost 40% of all error cases (80/206) were ones where the prediction was correct and only the gold standard was wrong. It is also notable that most of the gold-standard errors occurred in the WhatsApp and Twitter data, which suggests these datasets are noisier than the Facebook data. Examples of gold-standard errors include English abbreviations that were tagged as Hindi (e.g. “thnk u” for “thank you” and “ofc” for “of course”), universal emojis that were tagged as Hindi, and real English words that were tagged as either Hindi or universal (e.g. “life” and “path”).

4.3.2 Fine-grained

To further investigate the limitations of our approach, we also manually classified the 126/1500 errors made by our system into five different categories depending on the perceived reason for the error. The definitions of the categories and examples are shown in Table 7.

Table 7

The five different types of classification errors with examples.


CODE    MEANING                         EXAMPLES (AND INCORRECT PREDICTED TAG)
A       Tokenisation/Orthography        LøVĕ (hi), -*Subha (en), 2014–15)ka (en)
B       Named entity                    Tanzeel (en), Amir (hi), chennai (en)
C       Token in both word lists        he (en), to (en), are (en)
D       Token in neither word list      Achhi (en), Namaskar (en), tiket (hi)
E       Token in incorrect word list    Mt (en), thy (en), pre (hi)

More specifically, tokens were classified as Type A when the error was the result of incorrect tokenisation or non-standard orthography; Type B when the token was a named entity that was not classified as universal; Type C when the token was a frequently used word in both word lists; Type D when the token was a rare token or spelling error found in neither word list (in both C and D, falling back on the language of the previous token then produced the wrong label); and Type E when the token occurred only in the word list of the incorrect language. The results are shown in Table 8.

Table 8

The error type distribution between datasets.


CODE     FACEBOOK    TWITTER    WHATSAPP    OVERALL
A        3           20         1           24
B        5           16         7           28
C        4           4          21          29
D        12          8          12          32
E        4           2          7           13
Total    28          50         48          126

One of the most significant findings from this table is that, overall, no single error category significantly outnumbered the rest. This suggests there is no obvious weakness in our classifier. We do note, however, that the distribution of error types can vary significantly between datasets. For example, Twitter has the highest incidence of Type A tokenisation errors (N = 20/24), while WhatsApp has the highest incidence of Type C ‘both word lists’ errors (N = 21/29). On closer inspection, we found that the former was largely caused by a single tweet in the Twitter dataset that contained repeated multi-punctuation strings of the form “:-*Subha”, which were systematically tokenised incorrectly (N = 15/24 errors), while the latter was an artefact of shorter messages and slang in the WhatsApp dataset. Specifically, since WhatsApp messages are much shorter than Facebook or Twitter posts (typically only 2–5 words), there was a lower chance for a token that is ambiguous in both word lists to have a reliable previous language context. This unique property of the WhatsApp dataset is hence something to be aware of when processing very short messages.

In summary, we note that our approach is quite robust for processing Hindi-English social media code-switched text. This is significant because the methodology was originally developed to process transcribed natural code-switched speech between Vietnamese and English, an entirely different dataset both in terms of the languages involved and the media through which the code-switching is conducted. This highlights the potential for further extending the approach to different code-switched datasets across different media and language pairs.

5 NLP challenges in processing multilingual discourse

Despite this encouraging result, it is worth noting that several challenges in processing multilingual discourse remain. The first of these is specific to processing Hindi social media data. Hindi is traditionally written in Devanagari script; however, social media users primarily use Roman script to write Hindi, in what is sometimes called Romanagari script (; ; ). Although there are several commonly used conventions for Romanagari, there is no standardised spelling. For example, “d” is used for “द” /ȡ/ (dental d), “ड” /ɖ/ (retroflex d), and sometimes “ड़” /ɽ/ (retroflex r). Dialectal differences also sometimes create many-to-one mappings in the Devanagari-to-Roman direction, since people tend to transliterate based on what they hear rather than on formal Devanagari spellings. For example, “ज़” /z/ is pronounced as /ʤ/ in some dialects and so is represented as “z” or “j” in Roman script despite always being written as “ज़” in Devanagari. This, together with the fact that Hindi has a larger inventory of vowels and consonants than the Roman alphabet (11 vowels and 35 consonants in Devanagari script vs. 5 vowels and 21 consonants in Roman script), highlights the lack of a one-to-one mapping between Devanagari and Roman letters and leads to several issues in writing Romanagari ().

The second problem, which remains challenging across the field, is the inherent bias towards English (see e.g. ; ), both in terms of available resources and human judgements. In our case, for example, most of the errors involve target Hindi tokens, as Table 9 illustrates.

Table 9

Distribution of error types based on the target gold standard.


CODE    TYPE                            TARGET                                        OVERALL
                                        ENGLISH    HINDI    UNIVERSAL    UNDEFINED
A       Tokenisation/Orthography        1          18       2            3            24
B       Named entity                    0          0        28           0            28
C       Token in both word lists        1          28       0            0            29
D       Token in neither word list      7          21       4            0            32
E       Token in incorrect word list    2          9        2            0            13
Total                                   11         76       35           3            126

It is clear from the table that target Hindi errors significantly outnumber those involving English and universal tokens (N = 76/126 compared to 11/126 and 35/126 respectively). Although the Hindi word list we used was specifically chosen to offset the lack of standardised Romanagari spellings, in that it featured commonly used alternative spellings for each word, the high degree of variability in Romanagari spelling meant that some spelling variants were inevitably missing. These missing spellings led to a high number of Type D (neither word list) errors for target Hindi words (N = 21/32). There were also some spelling alternatives that were missing from the Hindi word list but were found in the English word list instead (Type E ‘incorrect word list’ target Hindi errors, N = 9/13). Most of these (N = 8/9) involved very short Hindi words with omitted vowels that coincidentally matched English abbreviations in the word list and were consequently tagged as English (e.g. “mt” represents “mǝt” in Hindi meaning “do not”, but is an abbreviation in English meaning “mountain”). These Hindi-specific issues are particularly amplified by social media text, which is self-transcribed by each user, so no single spelling convention is followed. We suggest normalisation of spelling and/or a more comprehensive Hindi word list as ways to improve performance.
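As a purely illustrative example of what such normalisation might look like (this was not implemented or evaluated in this work, and the specific rules are assumptions), a lightweight normaliser applied to both the word-list entries and the input tokens could collapse some common Romanagari spelling variation:

```python
import re

def normalise_romanagari(token):
    """Speculative spelling normalisation for Romanagari lookup (illustrative only)."""
    t = token.lower()
    t = re.sub(r"(.)\1{2,}", r"\1", t)  # collapse elongations: "okkkk" -> "ok"
    t = t.replace("w", "v")             # "w" and "v" are often interchangeable
    t = re.sub(r"ee+", "i", t)          # collapse long vowel spellings
    t = re.sub(r"oo+", "u", t)
    t = re.sub(r"aa+", "a", t)
    return t
```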

Furthermore, the bias towards English is not constrained solely by available resources but also extends to human judgements. For example, the dataset contained the words “India” and “Bharat”, which are the English and Hindi names for the same named entity respectively. Although both should thus be tagged as universal, we noted a preference by the annotators for tagging “India” as universal but “Bharat” as Hindi. Upon recognising this bias, we ultimately decided that both the language-specific tag (i.e. English for “India” and Hindi for “Bharat”) and the universal tag were equally valid answers. This example nevertheless shows that while English named entities are often more likely to be considered universal, perhaps partly due to the status of English as a global lingua franca, Hindi named entities may be more ambiguous, especially if they have an English counterpart. This possible bias is something that annotators should keep in mind for future work.

6 Implications

In this paper, we examined the extent to which we could standardise the automated processing of multilingual corpora, using a rule-based system originally developed to annotate transcribed bilingual code-switched Vietnamese-English speech data (). We applied this approach to Hindi-English social media text and achieved a high performance of 87.99 F1 on the language identification task. We furthermore carried out an error analysis and found that almost 40% of all classification errors were caused by problems with the gold standard, and so performance is actually likely to be even higher. These findings are particularly promising given the inherently challenging nature of social media text as well as the idiosyncratic conventions of the language pairs involved.

In the broader context, our work further highlighted how well a rule-based system can handle various kinds of code-switched input. In particular, we found that the approach generalises to both isolating (i.e. Vietnamese) and fusional (i.e. Hindi) languages paired with English, and is not dependent on annotated training data for machine learning. Ultimately, the most significant challenge is instead to obtain a suitably diverse word list that is not limited to standardised spellings. Unfortunately, however, research in multilingual NLP has rarely considered languages that lack a standardised orthography, or whose varieties are not so well established. In an era where the worldwide ‘normality’ of multilingualism is becoming increasingly visible and language innovation continues to spread rapidly, this lack of resources poses an ever more urgent problem. Devising an efficient way to create and update word lists across different language varieties is thus a worthwhile avenue for future research.

Additional Files

The resources associated with this paper can be accessed at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QD94F9.