(1) Overview
Repository location
Context
The @lis_grievances bot was first activated February 26, 2016, and continues to operate to this day. The messages that it tweets are all anonymously harvested and thus allow workers (presumably) in libraries and related fields to ‘air their grievances’ (). There has been an enthusiastic discussion in social media and within Library and Information Science (LIS) literature on the utility of the bot and both its harm and benefit to the profession ().
The five-year archive was created as the basis for a chapter () within a monograph that investigated the hypothesis that Libraries are dysfunctional workplaces (). This research conducted an analysis that partitioned the tweets into various categories in order to understand themes found in the corpus. It also introduced a novel metric called the Grief Index (GI), which gave a quantitative ratio of how many submissions made to the bot were not posted. Every submission made to the bot is checked by a moderator before it is posted, this is to ensure that no one is specifically mentioned in a tweet and that discriminatory language is not used. This difference in submissions to the bot versus actual posted messages is the basis of GI. This value provides a proxy for understanding the amount of material submitted that is not suitable for posting, which also avoids the need to share actual inappropriate posts.
(2) Method
Steps
The actual dataset is an aggregation of tweets made by the @lis_grievances account (n = 4096) and retrieved from the Twitter API using the Tweepy platform (). Some metadata of the tweets is retained and augmented with a custom metric called the Grief Index as well as the three components of the VADER sentiment score for the full text of the tweet (). The complete software used to create the bot is hosted on GitHub (). A key component of this software is that it contains a mechanism that retrieves the direct messages sent to the bot through a process that ensures anonymity of the sender from the operator of the bot. The Twitter archive of the account was requested on February 27, 2021 and as such, any favourite or retweet counts is current to that day. The etymology of ‘bot’ is preferred to describe this account since the posting and retrieval of messages is mediated through an API interface in conjunction with custom software that ensures anonymity of posts. While all posts are submitted by humans, no submission is posted without a comprehensive mediated quality control process.
The basis of the analysis was the creation of a metric dubbed Engagement Score (ES) which was the sum of the retweets and favourites a tweet received in the 5-year period. By combining this quantitative scoring with a close reading of the tweets a mixed method was conducted. Tweets with a high ES were examined in an attempt to uncover themes present in the full corpus.
Column description
Description of all columns retained from the Twitter archive export and additional data added as part of the analysis can be found in Table 1.
ID | GLOBAL ID OF TWEET AS STRING |
---|---|
favorite_count | integer count of how many times tweet was favourited |
retweet_count | integer count of how many times tweet was retweeted |
created_at | timestamp of when tweet was made |
full_text | full text of tweet |
entities.hashtags | list of hashtags found in tweet |
entities.symbols | list of symbols found in tweet |
entities.user_mentions | list of users mentioned in the tweet |
entities.urls | list of URLs found in the tweet |
possibly_sensitive | flag autogenerated to indicate possibility of sensitive content |
entities.media | python list of media found in tweet, e.g. images |
full_text_norm | Normalized full text of tweet |
vscore_pos | VADER positive dimension score for the tweet full-text, 0 to 1 inclusive |
vscore_neg | VADER negative dimension score for the tweet full-text, 0 to 1 inclusive |
vscore_neu | VADER neutral dimension score for the tweet full-text, 0 to 1 inclusive |
vscore_compound | VADER composite score for the tweet full-text, –1 to 1 inclusive |
swears | flag if tweet contains a swear word |
engaged | flag if tweet was either favourited or retweeted |
total_engagement | Engagement score, ie. combined count of number of retweets and favourites |
hashtags | flag if tweet contains a hashtag |
questions | flag if tweet contains a question (full text includes a question mark) |
media | flag if tweet contains image |
fav_quant | what quantile tweet is in based on favourite count, if applicable |
g_index | the grief index value for the month that the tweet was made |
This dataset is a combination of data exported directly from Twitter and enriched with additional analysis specific description of the provenance of each column, as described in Table 2.
PROVENANCE | COLUMNS |
---|---|
Twitter Export | favourite_count retweet_count created_at full_text entities.hastags entities.symbols entities.user_mentions entities.urls possibly_sensitive entities.media |
Derived | full_text_norm vscore_pos vscore_neg vscore_neu vscore_compound swears engaged total_engagement hashtags questions media fav_quant g_index |
(3) Dataset Description
Object name – LIS_G_5_YEAR_ARCHIVE
Format names and versions – .CSV
Creation dates – 2016-02-26 to 2021-02-27
Dataset creators – Tim Ribaric
Language – English
License – CC0 1.0
Repository name – Borealis
Publication date – 2022-10-11
Statistics and Contents
As mentioned, the investigation focused on ES of the corpus, but it contrasted this score against other dynamics of the tweets. Box plots of the different facets used to partition the tweets are seen in Figure 1. Here we see that the inclusion of swears, for example, lead to a higher mean score compared to other facets.
Figure 2 shows the general distribution of ES across all tweets in the corpus.
This provides us with a quick view of the distribution of scores along with some evidence that outliers were also present. To further shed light on the corpus, VADER sentiment scores were also calculated. Figure 3 shows an example of sentiment score composition for the swear word facet.
Lastly, to provide a general sense of what is in the corpus, a word cloud is presented in Figure 4. It appears that Librarians enjoyed talking about themselves and the places in which they work.
(4) Reuse Potential
The primary goal of the research was to propose a mixed-method approach to analysing the corpus in order to derive insights into its contents without the need of having a researcher examine each tweet and hand-code for themes; however, many other uses of the corpus can be devised. This archive of tweets has potential to inform investigations in many different areas. For example, it can be used to assess the perceived accuracy of the VADER sentiment analysis scoring system. It can also be used to study the online disinhibition effect (ODE). ODE is the supposition that when given anonymity people will express themselves in stronger ways than if their speech is attributed.
Lastly, the dataset can be used for sociological or LIS inquiry, such as to investigate a profession’s self-image. Within the LIS field, romanticisation of professional self-identity is known as vocational awe (). One tenant of vocational awe is that librarians in the course of their work will put up with outrageous workplace deficiencies simply because of the importance of the job. This candid archive of librarian self-reflection could very well prove useful in an examination of this phenomenon.