RESEARCH PAPER Automatic Language Identification in Code-Switched Hindi-English Social Media Text

Natural Language Processing (NLP) tools typically struggle to process code-switched data and so linguists are commonly forced to annotate such data manually. As this data becomes more readily available, automatic tools are increasingly needed to help speed up the annotation process and improve consistency. Last year, such a toolkit was developed to semi-automatically annotate transcribed bilingual code-switched Vietnamese-English speech data with token-based language information and POS tags (hereafter the CanVEC toolkit, L. Nguyen & Bryant, 2020). In this work, we extend this methodology to another language pair, Hindi-English, to explore the extent to which we can standardise the automation process. Specifically, we applied the principles behind the CanVEC toolkit to data from the International Conference on Natural Language Processing (ICON) 2016 shared task, which consists of social media posts (Facebook, Twitter and WhatsApp) that have been annotated with language and POS tags (Molina et al., 2016). We used the ICON-2016 annotations as the gold-standard labels in the language identification task. Ultimately, our tool achieved an F1 score of 87.99% on the ICON-2016 data. We then manually evaluated the first 500 tokens of each social media subset, and found that almost 40% of all errors were caused entirely by problems with the gold standard, i.e. our system was correct. It is thus likely that the overall accuracy of our system is higher than reported. This shows great potential for effectively automating the annotation of code-switched corpora, on different language combinations, and in different genres. We finally discuss some limitations of our approach and release our code and human evaluation together with this paper.


CONTEXT AND MOTIVATION
In multilingual contexts, mixed output, featuring elements from two or more languages, is ubiquitous. Utterance (1), for example, demonstrates an instance of what is known as "codeswitching", a construction in which a speaker alternates between different languages (in this case, Vietnamese and English).
(1) mỗi group phải có a different focus
    each must have
    "Each group must have a different focus." (CanVEC, L. Nguyen & Bryant, 2020)

Although multilingualism is the norm world-wide (Grosjean & Li, 2013), NLP tools capable of processing more than one language per "sentential unit" as in (1) are still rather limited. This effectively circumscribes important applications such as machine translation (MT) and information retrieval (IR), and also the utility of NLP-based technology in contexts where language-users readily employ two or more languages side by side. Furthermore, as in other areas of NLP, while some efforts have been made to investigate relatively high-resource language pairs such as English-Spanish (e.g. Ahn, Jimenez, Tsvetkov, & Black, 2020; Bullock, Guzmán, Serigos, Sharath, & Toribio, 2018; Solorio & Liu, 2008; Soto & Hirschberg, 2018) or English-Chinese (e.g. Chan, Ching, & Lee, 2005; Lyu, Tan, Chng, & Li, 2015; Shen, Wu, Yang, & Hsu, 2011), code-switching involving low-resource or less-described languages is still largely neglected. This means very few resources are available to automatically process this kind of data. With this in mind, two members of our team recently developed a toolkit to process the Canberra Vietnamese-English Corpus (CanVEC), an original corpus of 10 hours of natural mixed speech involving 45 Vietnamese-English migrant speakers living in Canberra. The corpus is semi-automatically annotated with language information and part-of-speech (POS) tags, obtaining >90% accuracy on both tasks (L. Nguyen & Bryant, 2020).
In this work, we test the wider feasibility of this framework in processing multilingual corpora by extending its application to another language pair, Hindi-English. Although Hindi-English is one of the more thoroughly investigated language pairs in the context of code-switching (e.g. Aguilar & Solorio, 2020; Bali, Sharma, Choudhury, & Vyas, 2014; Dey & Fung, 2014; Si, 2011), it nevertheless still provides a good test-bed in which to evaluate multilingual-corpus processing tools. We particularly focus on the language-identification task, for which we rely on the annotated data released in the International Conference on Natural Language Processing (ICON) 2016 shared task (Jamatia, Gambäck, & Das, 2015). In what follows, we report the result of this pilot as well as the challenges and implications that emerged.

RELATED WORK
It should be noted at the outset that language identification is one of the most important and well-studied tasks in computational approaches to code-switching. This is because it is often the prerequisite for many more complex downstream NLP tasks such as POS tagging, machine translation and speech recognition (Çetinoğlu, Schulz, & Vu, 2016; Choudhury, Chittaranjan, Gupta, & Das, 2014; Solorio & Liu, 2008). However, since monolingual processing tools tend to be less accurate in short or unidentified code-switching contexts, custom multilingual techniques such as dictionary lookup, language models, morphological and phonological analysis, and machine learning have become increasingly popular in recent years (Attia et al., 2019; Barman, Das, Wagner, & Foster, 2014; Mave, Maharjan, & Solorio, 2018; D. Nguyen & Doğruöz, 2013; Voss, Tratz, Laoudi, & Briesch, 2014; Xia, 2016).

ICON-2016 DATA
The goal of the ICON-2016 shared task was to automatically annotate code-switched Hindi-English, Bengali-English and Telugu-English social media posts (Facebook/Twitter/WhatsApp) with either fine-grained or coarse-grained part-of-speech (POS) tags (Jamatia et al., 2015). Participants were provided with word tokenised 1 social media posts that were already annotated with native language information. Since the goal of this paper is to investigate the automatic annotation of language information in code-switched data, we ignore the POS annotations and only make use of the language tags. Specifically, we focus on the Hindi-English subset of the corpus for which there are seven possible tags (Table 1).
We downloaded the Facebook, Twitter and WhatsApp Hindi-English data from the shared task website. 2 The distribution of the seven language tags for each dataset and overall is shown in Table 2.
1 Not sentence segmented; i.e. each Facebook/Twitter/WhatsApp message may consist of more than one sentence or a single sentence may also be split across messages.

Since several of these tags are relatively low frequency, we collapsed the mixed, acro, ne and undef tags into the univ category. This was partly because multi-class classification is more challenging with a greater number of labels (especially extremely rare labels), but also because we saw little reason to differentiate between these tags in the language identification task. For example, certain acronyms (e.g. DJ) and named entities (e.g. Holi) can be said to belong to both languages, yet are rarely indicative of code-switching. Similarly, while mixed tokens are certainly interesting examples of code-switching at a morphological level, they are extremely rare in the given dataset (N = 6) and so did not warrant a dedicated label.
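The tag collapsing described above can be sketched in a few lines of Python. This is only an illustration: the tag names mixed, acro, ne, undef and univ come from the text, while the strings "en" and "hi" for the two language tags and the function name `collapse_tag` are assumptions made for the example.

```python
# Tags folded into the universal category, as described in the text.
COLLAPSED_TAGS = {"mixed", "acro", "ne", "undef"}


def collapse_tag(tag: str) -> str:
    """Map a gold tag onto the three-way scheme (assumed here: en, hi, univ)."""
    return "univ" if tag in COLLAPSED_TAGS else tag


# Illustrative gold tags, not taken from the ICON-2016 data.
gold = ["hi", "en", "ne", "acro", "univ", "mixed", "undef"]
collapsed = [collapse_tag(t) for t in gold]
```

After this step only three labels remain, which is the label set used in the rest of the evaluation.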
The final distribution of labels across the reprocessed datasets is shown in Table 3. It is interesting to note that the distribution of languages is different across datasets, with Facebook being predominantly English (64%), and Twitter and WhatsApp being predominantly Hindi (56% and 78% respectively). It is also notable that universal tokens comprise a significant proportion of the data and are roughly as prevalent as the minority code-switching language in all datasets.
This can possibly be explained by the fact that social media data comes with its own set of particular challenges (as reviewed in Çetinoğlu et al., 2016), e.g. typos, intentional spelling deviations (e.g. "okkkk"), abbreviated Internet slang (e.g. "lol", "smh"), and non-linguistic expressions (e.g. emoticons, URLs, hashtags, @ mentions, etc.), many of which are language-agnostic (i.e. universal). Universal tokens may thus be more prevalent in social media posts than other genres of text. These challenges nevertheless play a central role in our decision-making process, and will be discussed throughout this paper.

APPROACH
Following L. Nguyen and Bryant (2020), our approach to token-based language identification is rule-based and relies on a word list for each language. For English, we used a custom Hunspell word list that contained a combination of American, British, Canadian and Australian variant spellings. 3 It was important to allow all these variants in order to maximise the chance that a word would be properly classified. For Hindi, we used a list of 30,000 transliterations that had been extracted from an online Hindi lyric database (Gupta, Choudhury, & Bali, 2012) and made available in the Forum for Information Retrieval Evaluation (FIRE) 2013 shared task (Roy, Choudhury, Majumder, & Agarwal, 2013). 4 We used this dataset because social media users tend not to switch between Devanagari script for Hindi and Roman script for English, and instead use Roman script for everything, transliterating Hindi as necessary. However, since there is no standard way of transliterating Hindi into Roman script (see Section 5 for more discussion), this list represents the largest resource we could find that also contains several variant Roman transliterations for the same Hindi word. We consequently hoped it would have sufficiently large coverage. It is worth mentioning that although an equivalent Hunspell word list for Hindi is also publicly available, 5 it uses Devanagari script and so is incompatible with the ICON-2016 data.
Before making use of these resources, however, we first wrote a number of rules to classify universal tokens that are language-agnostic. In particular, a token is classified as universal if it meets at least one of the following criteria:
1. It does not contain any alphanumeric characters; e.g. punctuation;
2. It contains "@", "#" or "http", or else is "RT"; e.g. @usernames, #topics, URLs and retweets;
3. If non-alphanumeric characters are deleted, the string is a number; e.g. dates and times;
4. It starts with ":" or ";"; e.g. emoticons.
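The four criteria above translate quite directly into code. The following is a minimal sketch rather than the released implementation; the function name `is_universal` is hypothetical, and the use of Python's `str.isalnum` and `str.isdigit` as the definitions of "alphanumeric" and "number" is an assumption.

```python
def is_universal(token: str) -> bool:
    """Return True if the token meets at least one universal-token criterion."""
    # Criterion 1: no alphanumeric characters at all (e.g. punctuation).
    if not any(ch.isalnum() for ch in token):
        return True
    # Criterion 2: usernames, hashtags, URLs and retweet markers.
    if "@" in token or "#" in token or "http" in token or token == "RT":
        return True
    # Criterion 3: a number once non-alphanumeric characters are deleted
    # (e.g. dates and times such as "12:30").
    stripped = "".join(ch for ch in token if ch.isalnum())
    if stripped.isdigit():
        return True
    # Criterion 4: emoticons that start with ":" or ";".
    return token.startswith((":", ";"))
```

Note that criterion 4 also fires on any string that merely begins with ":" or ";", so a token in which an emoticon is fused to a word would be treated as universal under this sketch.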
Having tagged universal tokens, the next step was to use the English and Hindi word lists. Specifically, if a token appears in the English word list, but not the Hindi word list, it is tagged as English, and if a token appears in the Hindi word list, but not the English word list, it is tagged as Hindi. This approach successfully accounted for the vast majority of tokens, but revealed 3,629 tokens that did not meet either criterion and were left untagged. We hence extracted these tokens and annotated the top 1,000 most frequent ones manually. It is worth noting that 2,569 of the automatically untagged tokens only occurred once in the dataset, so we effectively only annotated tokens that appeared at least twice. The top 20 of these most frequent tokens and their counts are shown in Table 4.
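As a rough illustration of this step, the sketch below tags a token only when it belongs to exactly one word list, and then ranks the remaining untagged tokens by frequency so that the most common ones can be annotated by hand. The function name, the lowercasing of tokens and the toy word lists are all assumptions made for the example; they are not taken from the released code or from the Hunspell and FIRE resources.

```python
from collections import Counter


def tag_by_word_list(token, english_words, hindi_words):
    """Return "en" or "hi" for exclusive membership, otherwise None."""
    word = token.lower()  # assumption: matching is case-insensitive
    in_en = word in english_words
    in_hi = word in hindi_words
    if in_en and not in_hi:
        return "en"
    if in_hi and not in_en:
        return "hi"
    return None  # in both lists or in neither: left untagged


# Toy word lists and tokens, for illustration only.
english_words = {"group", "different", "to", "me"}
hindi_words = {"hai", "nahi", "to", "me"}
tokens = ["group", "hai", "to", "yaar", "to", "yaar", "to"]

# Count untagged tokens so the most frequent can be annotated manually.
untagged = Counter(
    t for t in tokens if tag_by_word_list(t, english_words, hindi_words) is None
)
top = untagged.most_common(2)
```

In this toy example `top` contains "to" (in both lists) and "yaar" (in neither), which mirrors the two ways a token can fail the exclusive-membership test.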
Finally, whenever a token was not classified by any word list or rule, it was assigned a tag based on the previous non-universal token in the current message, or else tagged English if it was the first token in the sentence. The decision to ignore universal tokens in this manner was based on the observation that universal tokens form the rarest category and tend not to occur in long contiguous sequences, while the decision to use English as the default language for ambiguous first-word tokens was based solely on the observation that English is slightly more prevalent in the data than Hindi (17k vs. 15k tokens). 6 The final system hence classifies tokens according to the following ordered rules:
1. Assign label based on manually defined disambiguation word list; else
2. Assign label based on universal token rules; else
3. Assign label based on exclusive English or Hindi word list membership; else
4. Assign label based on previous token label.

6 Future work might prefer to label ambiguous first-word tokens according to the language of the following token rather than using a default.

It should be noted that the manual disambiguation list takes the highest priority in this system because manual human judgements are considered to be the most reliable.
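The ordered rules can be sketched as a single pass over the tokens of a message. This is a simplified illustration under several assumptions: the names `tag_message` and `simple_is_universal` are hypothetical, tokens are lowercased before lookup, the universal check is reduced to a stand-in covering only the "no alphanumeric characters" criterion, and the word lists are toy data.

```python
def simple_is_universal(token):
    # Stand-in for the universal token rules (only the first criterion).
    return not any(ch.isalnum() for ch in token)


def tag_message(tokens, manual, english_words, hindi_words,
                is_universal=simple_is_universal, default="en"):
    """Label each token of one message with "en", "hi" or "univ"."""
    labels = []
    previous = None  # label of the most recent non-universal token
    for token in tokens:
        word = token.lower()
        if word in manual:                      # Rule 1: manual list
            label = manual[word]
        elif is_universal(token):               # Rule 2: universal rules
            label = "univ"
        elif (word in english_words) != (word in hindi_words):
            # Rule 3: member of exactly one word list
            label = "en" if word in english_words else "hi"
        else:                                   # Rule 4: previous label
            label = previous if previous is not None else default
        if label != "univ":
            previous = label
        labels.append(label)
    return labels


# Toy resources, for illustration only.
manual = {"to": "hi"}
english_words = {"group", "to", "focus"}
hindi_words = {"hai", "to", "nahi"}
labels = tag_message(["yaar", "group", "nahi", "!", "xyz", "to"],
                     manual, english_words, hindi_words)
```

Exposing the fallback language as a `default` parameter makes it easy to compare English and Hindi defaults for message-initial unknown tokens.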

MANUAL DISAMBIGUATION LIST SIZE
We evaluated the effectiveness of our approach by comparing the predicted labels against the gold labels in terms of the F1 score, which combines precision (P) and recall (R). In particular, precision is calculated as the proportion of correct labels over predicted labels for a given tag ($x_{cor}/x_{pred}$), while recall is calculated as the proportion of correct labels over gold labels for a given tag ($x_{cor}/x_{gold}$). In other words, precision measures the extent to which a system can correctly predict a given tag (i.e. correctness), while recall measures the extent to which a system can correctly predict all intended instances of a given tag (i.e. coverage). The F1 score is the harmonic mean of the two. 7 In the context of this work, we specifically compared the micro F1 scores (which take the differences between class labels into account) using manual disambiguation lists of different sizes in order to better understand the relationship between manual annotation and performance; i.e. to what extent a larger word list increases performance. Results are shown in Figure 1.
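For concreteness, the per-tag precision, recall and F1 defined above can be computed as in the sketch below. The function names and the toy label sequences are assumptions for illustration. The micro-averaged variant shown here pools counts over all tags, which is one common reading of "micro F1"; when every token receives exactly one label, this pooled score equals token-level accuracy. Whether the reported scores were computed in exactly this way is not stated in the text.

```python
from collections import Counter


def per_tag_scores(gold, pred):
    """Return {tag: (precision, recall, f1)} from parallel label lists."""
    correct, predicted, actual = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        predicted[p] += 1
        actual[g] += 1
        if g == p:
            correct[g] += 1
    scores = {}
    for tag in set(actual) | set(predicted):
        p = correct[tag] / predicted[tag] if predicted[tag] else 0.0
        r = correct[tag] / actual[tag] if actual[tag] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
        scores[tag] = (p, r, f1)
    return scores


def micro_f1(gold, pred):
    """Pooled F1 over all tags; equals accuracy for one label per token."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy labels, for illustration only.
gold = ["en", "en", "hi", "hi", "univ"]
pred = ["en", "hi", "hi", "hi", "univ"]
scores = per_tag_scores(gold, pred)
overall = micro_f1(gold, pred)
```

In the toy example, English has perfect precision but only half its gold tokens are recovered, while Hindi has perfect recall but lower precision.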
As expected, Figure 1 shows diminishing returns as more manual labels are available. There is nevertheless a large gain from 84.2 to 86 F1 for the first 100 manual tags, which shows that even a small word list of the most frequently ambiguous tokens can provide a significant boost to the overall performance. Figure 1 also shows that this performance increase begins to level out at roughly 400-600 tokens, which roughly equates to tokens that occur at least 3-4 times in the data. This is a significant point to note as it potentially indicates an optimum level of manual annotation that should be carried out in future work (scaled according to the size of the data).

GENERAL EVALUATION
In addition to evaluating our system overall, we also evaluated it in terms of P, R and F1 for each language tag in each of the Facebook, Twitter and WhatsApp subsections of the overall corpus. The results are shown in Table 5, where all systems make use of the full manual disambiguation list.
One of the most interesting results from this table is that performance on Hindi classification is stable across all datasets at 86-87 F1, while performance on English classification varies considerably. Most notably, English classification scores 95.8 F1 on the Facebook data, but just 53.1 F1 on the WhatsApp data. This is largely due to precision being so low in the WhatsApp data (39.5). A similar effect is observed in the Twitter data, where the precision for English is the lowest out of the 3 tags at 70.3. Our first hypothesis for this observation was that the lower scores on the Twitter and WhatsApp data were a by-product of the decision to label unknown sentence-initial tokens as English by default. In particular, since the majority of tokens in the Twitter and WhatsApp data are Hindi, unlike the Facebook data, they would be more likely to benefit from Hindi as the default language. We hence tried labelling all unknown sentence-initial tokens (i.e. those that do not have a previous token) as Hindi rather than English, ultimately observing little improvement in the classification of English tokens in the Twitter data (75.4 F1 → 76.5 F1) and a noticeable improvement in the WhatsApp data (53.1 F1 → 59.9 F1). Precision in the WhatsApp data nevertheless remained very low, rising only from 39.5 to 49.8. In order to investigate why there might be such a difference between datasets and also to further evaluate the efficacy of our approach, we next carried out a manual evaluation of the first 500 tokens in each dataset.

7 For more details on how the F1 score is computed, see e.g. Sasaki 2007.

Figure 1 Language tagging performance as a function of manual disambiguation list size.

Coarse-grained
In our manual qualitative evaluation, we first annotated both the predicted and gold-standard language labels of the first 500 tokens in each dataset as either correct (COR) or incorrect (INC). While it might seem unusual to reannotate the gold standard for correctness, we encountered many cases where the gold standard was incorrect and we wanted to take this into account in the evaluation. Table 6 hence shows the confusion matrices for all combinations of correct and incorrect labels in both our predictions (rows) and the gold standard (columns) for each dataset and overall.
This table shows that there were 1294/1500 (86%) tokens across all datasets where both the prediction and gold standard were correct. There were a further 80/1500 (5%) tokens where our prediction was correct but the gold standard was incorrect (49 of which occurred in the WhatsApp data), and 100/1500 (7%) tokens where our prediction was incorrect but the gold standard was correct. The remaining 26/1500 (2%) tokens were incorrect in both the prediction and gold standard. The most significant finding from these results is that of the 206/1500 tokens where at least one label was considered incorrect, just over half (106/206) involved an incorrect gold-standard label. This suggests our classifier may actually be more reliable than reported above, as almost 40% of all errors (80/206) are caused solely by problems with the dataset. It is also notable that most of the gold-standard errors occurred in the WhatsApp and Twitter data, which suggests these datasets are noisier than the Facebook data. Examples of gold-standard errors include English abbreviations that were tagged as Hindi (e.g. "thnk u" (for "thank you") and "ofc" (for "of course")), universal emojis that were tagged as Hindi, and real English words that were tagged as either Hindi or universal (e.g. "life" and "path").
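The arithmetic behind these figures can be checked with a short script. The four counts are those reported above for all datasets combined; the variable names and the dictionary layout (prediction status first, gold-standard status second) are assumptions made for the example.

```python
# (prediction, gold standard) -> number of tokens, overall counts from the text.
counts = {
    ("COR", "COR"): 1294,
    ("COR", "INC"): 80,
    ("INC", "COR"): 100,
    ("INC", "INC"): 26,
}

total = sum(counts.values())
# Tokens where at least one of the two labels was judged incorrect.
any_incorrect = total - counts[("COR", "COR")]
# Tokens where the gold-standard label was judged incorrect.
gold_incorrect = counts[("COR", "INC")] + counts[("INC", "INC")]
# Tokens where the system prediction was judged incorrect.
system_incorrect = counts[("INC", "COR")] + counts[("INC", "INC")]
# Share of all errors due solely to the gold standard.
gold_only_share = counts[("COR", "INC")] / any_incorrect

print(f"{any_incorrect} tokens with an error, "
      f"{gold_incorrect} gold-standard errors, "
      f"{system_incorrect} system errors, "
      f"{gold_only_share:.1%} due solely to the gold standard")
```

The gold-only share comes out just under 40%, and the system error count matches the 126 errors analysed in the fine-grained evaluation.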

Fine-grained
To further investigate the limitations of our approach, we also manually classified the 126/1500 errors made by our system into five different categories depending on the perceived reason for the error. The definitions of the categories and examples are shown in Table 7.
More specifically, tokens were classified as Type A when the error was the result of incorrect tokenisation or non-standard orthography, Type B when the token was a named entity that was not classified as universal, Type C or D when either the token was a frequently-used word in both word lists or a rare token/spelling error in neither word list and it was furthermore incorrect to rely on the language of the previous token, and Type E when the token occurred only in the word list of the incorrect language. The results are shown in Table 8.
One of the most significant findings from this table is that, overall, among the errors made by our system, no single category significantly outnumbered the rest. This suggests there is no obvious weakness to our classifier. We do note, however, that the distribution of error types can vary significantly between datasets. For example, Twitter has the highest incidence of Type A tokenisation errors (N = 20/24), while WhatsApp has the highest incidence of Type C 'both word list' errors (N = 21/29). On closer inspection, we found that the former was caused by a single tweet in the Twitter dataset that contained repeated multi-punctuation strings of the form ":-*Subha", which were systematically tokenised incorrectly (N = 15/24 errors), while the latter was an artefact of shorter messages and slang in the WhatsApp dataset. Specifically, since WhatsApp messages are much shorter than Facebook or Twitter posts (typically only 2-5 words), there was a lower chance for a token to have a reliable previous language context if it was ambiguous in both word lists. This unique property of the WhatsApp dataset is hence something to be aware of when processing very short messages.
In summary, we note that our approach is quite robust for processing Hindi-English social media code-switched text. This is significant because the methodology was originally developed to process transcribed natural code-switched speech between Vietnamese and English, an entirely different dataset both in terms of the languages involved and the media through which the code-switching is conducted. This highlights the potential for further extending the approach to different code-switched datasets across different media and language pairs.

Table 8 The error type distribution between datasets.

Nguyen et al. Journal of Open Humanities Data DOI: 10.5334/johd.44

NLP CHALLENGES IN PROCESSING MULTILINGUAL DISCOURSE
Despite this encouraging result, it is worth noting that several challenges in processing multilingual discourse remain. The first of these is specific to processing social media Hindi data. Specifically, Hindi is traditionally written in Devanagari script; however, social media users primarily use Roman script to write Hindi, in what is sometimes called Romanagari script (Bali et al., 2014; V.B., Choudhury, Bali, Dasgupta, & Basu, 2010; Virga & Khudanpur, 2003). 8 Although there are several commonly used conventions for Romanagari, there is no standardised spelling. For example, "d" is used for "द" /ȡ/ (dental d), "ड" /ɖ/ (retroflex d), and sometimes "ड़" /ɽ/ (retroflex r). Many-to-one mappings in the Devanagari-Roman direction are also sometimes caused by dialectal differences, since people tend to transliterate based on what they hear rather than formal Devanagari spellings. For example, "ज़" /z/ is pronounced as /ʤ/ in some dialects and so is represented as "z" or "j" in Roman script despite always being written as "ज़" in Devanagari. This, together with the fact that Hindi has a larger inventory of consonants and vowels (11 vowels and 35 consonants in Devanagari script 9 vs. 5 vowels and 21 consonants in the Roman script), highlights a lack of one-to-one mapping between Devanagari and Roman letters and leads to several issues in writing Romanagari (Mhaiskar, 2015).
The second problem, which remains challenging across the field, is the inherent bias towards English (see e.g. Anastasopoulos & Neubig, 2020; Garrido-Muñoz, Montejo-Ráez, Martínez-Santiago, & Ureña-López, 2021 for recent overviews), both in terms of available resources and human judgements. In our case, for example, most of the errors are target Hindi tokens, as Table 9 illustrates.
It is clear from the table that the target Hindi errors significantly outnumber those of English and universal tokens (N = 76/126 compared to 11/126 and 35/126 respectively). 10 Although the Hindi word list we used was specifically chosen to offset the lack of standardised Romanagari spellings, in that it featured commonly used alternative spellings for each word, the high degree of variability in Romanagari spellings meant that some spelling possibilities were inevitably missing. These missing spellings led to a high number of Type D (no word list) errors for target Hindi words (N = 21/32). There were also some spelling alternatives that were missing in the Hindi word list but were found in the English word list instead (Type E 'incorrect word list' target Hindi errors N = 9/13). The majority of these errors (N = 8/9) involved very short Hindi words with omitted vowels, which coincidentally constituted English abbreviations in the word list and were consequently incorrectly tagged as English (e.g. "mt" represents "mǝt" in Hindi meaning "do not", but is an abbreviation in English meaning "mountain"). These Hindi-specific issues are particularly amplified by social media text, which is self-transcribed by each speaker and so a single spelling convention is not used. We suggest normalisation of spelling and/or using a more comprehensive Hindi word list as a way to improve performance.

8 The same holds for other Indian languages, such as Marathi (also traditionally written in Devanagari) (Mhaiskar, 2015) and Punjabi (traditionally written in Gurmukhi) (Kaur & Singh, 2015), as well as various dialects of modern Arabic (Eskander, Al-Badrashiny, Habash, & Rambow, 2014).

9 There is disagreement on exact numbers. The numbers given are from the Government of India as reported by the BBC: https://www.bbc.co.uk/languages/other/hindi/guide/alphabet.shtml.

10 Note that the undefined tokens were made up by 3/24 Type A errors that could not be attributed to any target tag as they were mixed language tokens, e.g. "Girl-Sacchi" [en-hi] and "haiAnother" [hi-en].

Table 9 Distribution of error types based on the target gold standard.
Furthermore, the bias towards English is not confined to available resources but also extends to human judgements. For example, the dataset contained the words "India" and "Bharat", which are the English and Hindi names for the same named entity respectively. Although they should thus both be tagged as universal, we noted a preference by the annotators for tagging "India" as universal but "Bharat" as Hindi. Upon recognising this bias, we ultimately decided that both the language-specific tag (i.e. English for "India" and Hindi for "Bharat") and the universal tag were equally valid answers. This example nevertheless shows that while English named entities are often more likely to be considered universal, perhaps partly due to the status of English as a global lingua franca, Hindi named entities may be more ambiguous, especially if they have an English counterpart. This possible bias is something that annotators should keep in mind for future work.

IMPLICATIONS
In this paper, we examined the extent to which we could standardise the automated processing of multilingual corpora, using a rule-based system originally developed to annotate transcribed bilingual code-switched Vietnamese-English speech data (L. Nguyen & Bryant, 2020). We applied this approach to Hindi-English social media text and achieved a high performance of 87.99 F1 on the language identification task. We furthermore carried out an error analysis and found that almost 40% of all classification errors were caused by problems with the gold standard, and so performance is actually likely to be even higher. These findings are particularly promising given the inherently challenging nature of social media text as well as the idiosyncratic conventions of the language pairs involved.
In the broader context, our work further highlighted how well a rule-based system can handle various kinds of code-switched input. In particular, we found that the approach generalises to both isolating (i.e. Vietnamese) and fusional (i.e. Hindi) language pairs with English, and is not dependent on annotated training data for machine learning. Ultimately, the most significant challenge is to instead obtain a suitably diverse word list which is not just limited to standardised spellings. Unfortunately, however, research in multilingual NLP has rarely considered other languages that may not have standardised orthography, or whose varieties may not be so well-established. In an era where the worldwide 'normality' of multilingualism becomes increasingly visible and language innovation continues to speedily spread, this lack of resources poses an even more urgent problem. Devising an efficient way to create and update different word lists across different language varieties is thus a worthwhile avenue for future research.

ADDITIONAL FILES
The resources associated with this paper can be accessed at https://dataverse.harvard.edu/dataset.