(1) Overview

Repository location

https://doi.org/10.5683/SP3/PHWSVM

Context

The @lis_grievances bot was first activated on February 26, 2016, and continues to operate to this day. The messages it tweets are all collected anonymously, which allows workers (presumably in libraries and related fields) to ‘air their grievances’ (). The utility of the bot, and both its harm and benefit to the profession, have been discussed enthusiastically on social media and within the Library and Information Science (LIS) literature ().

The five-year archive was created as the basis for a chapter () within a monograph that investigated the hypothesis that libraries are dysfunctional workplaces (). That research partitioned the tweets into categories in order to understand the themes found in the corpus. It also introduced a novel metric called the Grief Index (GI), a quantitative ratio of how many submissions made to the bot were not posted. Every submission made to the bot is checked by a moderator before it is posted, to ensure that no one is specifically mentioned in a tweet and that discriminatory language is not used. The difference between the submissions made to the bot and the messages actually posted is the basis of GI. This value provides a proxy for the amount of submitted material that is not suitable for posting, and it avoids the need to share the inappropriate posts themselves.
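The ratio behind GI can be illustrated with a minimal sketch. The exact formula is not stated here, so the code assumes that GI for a period is the share of submissions that were not posted; the function name `grief_index` and the monthly counts are hypothetical and are not taken from the dataset.

```python
def grief_index(submitted: int, posted: int) -> float:
    """Share of submissions that were not posted (assumed GI definition)."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    if not 0 <= posted <= submitted:
        raise ValueError("posted must be between 0 and submitted")
    return (submitted - posted) / submitted


# Hypothetical (submitted, posted) counts per month, for illustration only.
monthly_counts = {"2020-01": (120, 90), "2020-02": (80, 80)}

# One GI value per month, matching the per-month g_index column in spirit.
monthly_gi = {
    month: grief_index(submitted, posted)
    for month, (submitted, posted) in monthly_counts.items()
}
print(monthly_gi)
```

Under this assumed definition, a value of 0 means every submission in the period was posted, and larger values mean more material was held back by the moderator.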

(2) Method

Steps

The dataset is an aggregation of tweets made by the @lis_grievances account (n = 4096), retrieved from the Twitter API using the Tweepy library (). Some tweet metadata is retained and augmented with a custom metric called the Grief Index, as well as the three components of the VADER sentiment score for the full text of each tweet (). The complete software used to create the bot is hosted on GitHub (). A key component of this software is a mechanism that retrieves the direct messages sent to the bot through a process that keeps the sender anonymous to the operator of the bot. The Twitter archive of the account was requested on February 27, 2021; as such, all favourite and retweet counts are current as of that day. The term ‘bot’ is preferred to describe this account since the posting and retrieval of messages is mediated through an API in conjunction with custom software that ensures the anonymity of posts. While all posts are submitted by humans, no submission is posted without a comprehensive, mediated quality control process.

The basis of the analysis was a metric dubbed the Engagement Score (ES), the sum of the retweets and favourites a tweet received in the five-year period. Combining this quantitative score with a close reading of the tweets yielded a mixed-method analysis. Tweets with a high ES were examined in an attempt to uncover themes present in the full corpus.
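The ES calculation and the selection of high-ES tweets for close reading can be sketched as follows. This is an illustration rather than the code used in the analysis: the in-memory CSV stands in for the published file, its rows are invented, and the helper names `engagement_score` and `top_by_engagement` are hypothetical. Only the column names `full_text`, `favorite_count` and `retweet_count` come from the dataset.

```python
import csv
import io

# Tiny stand-in for the published CSV; the rows are made up for illustration.
SAMPLE_CSV = """full_text,favorite_count,retweet_count
first example tweet,10,2
second example tweet,0,0
third example tweet,3,10
"""


def engagement_score(row: dict) -> int:
    """ES as described in the text: favourites plus retweets."""
    return int(row["favorite_count"]) + int(row["retweet_count"])


def top_by_engagement(rows: list, n: int) -> list:
    """Return the n rows with the highest ES, highest first."""
    return sorted(rows, key=engagement_score, reverse=True)[:n]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
top = top_by_engagement(rows, 2)
for row in top:
    print(engagement_score(row), row["full_text"])
```

To run the same steps on the real file, the `io.StringIO` wrapper would be replaced with an opened handle to the downloaded CSV.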

Column description

Descriptions of all columns retained from the Twitter archive export, and of the additional data added as part of the analysis, can be found in Table 1.

Table 1

All columns found in the dataset.


ID – global ID of tweet as string
favorite_count – integer count of how many times tweet was favourited
retweet_count – integer count of how many times tweet was retweeted
created_at – timestamp of when tweet was made
full_text – full text of tweet
entities.hashtags – list of hashtags found in tweet
entities.symbols – list of symbols found in tweet
entities.user_mentions – list of users mentioned in the tweet
entities.urls – list of URLs found in the tweet
possibly_sensitive – flag autogenerated to indicate possibility of sensitive content
entities.media – Python list of media found in tweet, e.g. images
full_text_norm – normalized full text of tweet
vscore_pos – VADER positive dimension score for the tweet full text, 0 to 1 inclusive
vscore_neg – VADER negative dimension score for the tweet full text, 0 to 1 inclusive
vscore_neu – VADER neutral dimension score for the tweet full text, 0 to 1 inclusive
vscore_compound – VADER composite score for the tweet full text, –1 to 1 inclusive
swears – flag if tweet contains a swear word
engaged – flag if tweet was either favourited or retweeted
total_engagement – Engagement Score, i.e. combined count of number of retweets and favourites
hashtags – flag if tweet contains a hashtag
questions – flag if tweet contains a question (full text includes a question mark)
media – flag if tweet contains image
fav_quant – what quantile tweet is in based on favourite count, if applicable
g_index – the Grief Index value for the month that the tweet was made
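Some of the derived columns follow directly from their descriptions in Table 1, as the sketch below illustrates for `engaged` and `questions`, together with a check of the documented VADER score ranges. It assumes that flags are represented as booleans, which the column descriptions do not specify; the function names and the example tweet are hypothetical.

```python
def derive_flags(full_text: str, favorite_count: int, retweet_count: int) -> dict:
    """Recompute two flag columns from their Table 1 descriptions.

    Assumption: flags are booleans; the actual encoding in the CSV may differ.
    """
    return {
        # engaged: tweet was either favourited or retweeted
        "engaged": favorite_count > 0 or retweet_count > 0,
        # questions: full text includes a question mark
        "questions": "?" in full_text,
    }


def vader_scores_in_range(pos: float, neg: float, neu: float, compound: float) -> bool:
    """Check the ranges given in Table 1: 0 to 1 for pos/neg/neu, -1 to 1 for compound."""
    return all(0.0 <= s <= 1.0 for s in (pos, neg, neu)) and -1.0 <= compound <= 1.0


# Invented example, not a tweet from the archive.
example = derive_flags("Why is the printer always broken?", 4, 0)
print(example)
```

Checks of this kind can serve as a quick sanity test after loading the file, before any further analysis.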

This dataset combines data exported directly from Twitter with columns derived during the analysis. The provenance of each column is described in Table 2.

Table 2

Description of data origin, either direct from Twitter export or result of analysis.


Provenance – Columns

Twitter export – favorite_count, retweet_count, created_at, full_text, entities.hashtags, entities.symbols, entities.user_mentions, entities.urls, possibly_sensitive, entities.media

Derived – full_text_norm, vscore_pos, vscore_neg, vscore_neu, vscore_compound, swears, engaged, total_engagement, hashtags, questions, media, fav_quant, g_index

(3) Dataset Description

Object name – LIS_G_5_YEAR_ARCHIVE

Format names and versions – .CSV

Creation dates – 2016-02-26 to 2021-02-27

Dataset creators – Tim Ribaric

Language – English

License – CC0 1.0

Repository name – Borealis

Publication date – 2022-10-11

Statistics and Contents

As mentioned, the investigation focused on the ES of the corpus, but it also contrasted this score against other characteristics of the tweets. Box plots of the different facets used to partition the tweets are shown in Figure 1. Here we see that the inclusion of swears, for example, leads to a higher mean score compared to other facets.

Figure 1 

Engagement score box plots for tweets with different characteristics.
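A facet comparison of this kind can be approximated numerically without plotting. The sketch below groups tweets by a flag column and summarises ES for each group. The rows are invented, the helper name `summarise_by_flag` is hypothetical, and the flag values are assumed to be already parsed into booleans.

```python
from statistics import mean, median

# Hypothetical rows, not taken from the dataset.
tweets = [
    {"swears": True, "total_engagement": 30},
    {"swears": True, "total_engagement": 10},
    {"swears": False, "total_engagement": 4},
    {"swears": False, "total_engagement": 8},
    {"swears": False, "total_engagement": 0},
]


def summarise_by_flag(rows: list, flag: str) -> dict:
    """Group ES values by a boolean flag and summarise each non-empty group."""
    groups = {True: [], False: []}
    for row in rows:
        # Assumes the flag is a real bool; a string such as "False" would need parsing first.
        groups[bool(row[flag])].append(row["total_engagement"])
    return {
        value: {"n": len(scores), "mean": mean(scores), "median": median(scores)}
        for value, scores in groups.items()
        if scores
    }


summary = summarise_by_flag(tweets, "swears")
print(summary)
```

The same function can be applied to the other flag columns, such as `hashtags`, `questions` or `media`, to compare facets side by side.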

Figure 2 shows the general distribution of ES across all tweets in the corpus.

Figure 2 

Engagement score distribution of all tweets.

This provides us with a quick view of the distribution of scores along with some evidence that outliers were also present. To further shed light on the corpus, VADER sentiment scores were also calculated. Figure 3 shows an example of sentiment score composition for the swear word facet.

Figure 3 

VADER sentiment score breakdown of all tweets in the archive.

Lastly, to provide a general sense of what is in the corpus, a word cloud is presented in Figure 4. It appears that librarians enjoyed talking about themselves and the places in which they work.

Figure 4 

Word cloud of all the tweets in the archive.

(4) Reuse Potential

The primary goal of the research was to propose a mixed-method approach to analysing the corpus in order to derive insights into its contents without requiring a researcher to examine and hand-code each tweet for themes; however, many other uses of the corpus can be devised. This archive of tweets has the potential to inform investigations in many different areas. For example, it can be used to assess the perceived accuracy of the VADER sentiment analysis scoring system. It can also be used to study the online disinhibition effect (ODE), the supposition that, when given anonymity, people will express themselves more strongly than when their speech is attributed.

Lastly, the dataset can be used for sociological or LIS inquiry, such as to investigate a profession’s self-image. Within the LIS field, romanticisation of professional self-identity is known as vocational awe (). One tenet of vocational awe is that librarians, in the course of their work, will put up with outrageous workplace deficiencies simply because of the importance of the job. This candid archive of librarian self-reflection could very well prove useful in an examination of this phenomenon.