(1) Overview

The collection of this dataset was inspired by the short squeeze on GameStop stock initiated by retail investors in January 2021. A short squeeze is an unusual condition that triggers rapidly rising prices in a stock or other tradeable financial instrument. For a short squeeze to occur, an unusually high number of short sellers must hold positions in the financial instrument. The short squeeze is triggered when short sellers cut their losses and exit their positions at around the same time (Mitchell, 2021). At its height, the pre-market price of GameStop stock was more than $500 per share (Wikipedia, 2022). GameStop is a meme stock, a term for the shares of a company that have gained an online following through social media platforms. These online communities can build hype around a meme stock through narratives and conversations, which reflect public opinion of the stock (Hayes, 2022).

Reddit has been the primary platform retail investors use to communicate with each other, including sharing and discussing news from social media and mainstream media, personal trading histories, memes, technical analyses, and strategies to facilitate global wealth transfer. The goal of this online movement was colloquially named the “Mother of All Short Squeezes” (MOASS) (Anand & Pathak, 2021; Betzer & Harries, 2022). MOASS exemplifies the populist intent of an online social movement observed in the dataset. The realization of MOASS’s goals requires participation from every community member.

However, differing opinions about how the goal should be achieved and what kind of community culture should be constructed have split the community into four subreddits. Specifically, r/superstonk was born of frustration with r/wallstreetbets, which received mainstream media attention at the beginning of the short squeeze. Its creation was driven by a perceived lack of focus on achieving the common goal and by concerns about the intentions and conduct of the moderators on r/wallstreetbets. The r/superstonk profile banner, “Power to the Shareholders,” distils its populist belief in achieving the common goal. However, the integration of meme culture on r/superstonk has distanced community members who want to pursue the common goal with a more serious and less memetic attitude. The community migration into r/GME, r/DDintoGME, and r/GMEJungle was the result of this cultural disagreement. This migration does not follow a linear progression, nor does it suggest that the community is conflicted and divided. Instead, it reflects the influence of Reddit’s features on the organization of the community: content curation on Reddit is structured by topic, and users curate their content by following different subreddit communities. Thus, this dataset will help in studying online social movements and their relationship with online culture.

The collection of data was motivated by the continuous actions of community members pursuing the realization of MOASS. During the data collection period, several changes in communication patterns and communication tactics occurred, driven by both internal and external events, such as community disagreement on ways of realizing common MOASS goals, and episodic mainstream media attention.

The means and standard deviations of the word counts reported here are rounded to the nearest integer. The dataset on r/superstonk has 560,125 posts with an average word count of 15 and a standard deviation of 13. The dataset on r/GME has 1,033,236 posts with an average word count of 14 and a standard deviation of 13. The dataset on r/GMEJungle has 39,634 posts with an average word count of 15 and a standard deviation of 12. The dataset on r/DDintoGME has 5,498 posts with an average word count of 16 and a standard deviation of 13. The four HTML files of exploratory data analyses present the first 12 variables (id, title, url, score, author, number of comments, date, flair, negative sentiment, positive sentiment, neutral sentiment, and compound sentiment), their interactions, and their correlations, drawn from the dataset files ending with “features.”
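The summary statistics above can be illustrated with a minimal sketch. It assumes that a word is any whitespace-delimited token in a post title and that the population standard deviation is used; neither choice is confirmed here, and the sample titles are hypothetical rather than drawn from the dataset.

```python
import statistics


def title_word_stats(titles):
    """Return (mean, standard deviation) of title word counts, rounded to integers."""
    # ASSUMPTION: a "word" is any whitespace-delimited token.
    counts = [len(title.split()) for title in titles]
    mean = statistics.mean(counts)
    # ASSUMPTION: population standard deviation; the sample form
    # (statistics.stdev) would be the alternative.
    std = statistics.pstdev(counts)
    return round(mean), round(std)


# Hypothetical titles, not taken from the dataset.
sample_titles = [
    "GME to the moon",
    "Daily discussion thread",
    "Why I am still holding after all these months",
]
print(title_word_stats(sample_titles))
```

For a file with over a million rows, the same calculation would more likely be run on a pandas column, but the arithmetic is identical.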

Repository location


This dataset was produced as part of an ongoing research project that studies the communication patterns of subreddit communities around meme stocks and their belief in using meme stocks to facilitate a global wealth transfer movement. It has not been used in any publication yet.

(2) Method

The post ID, title, URL, score, author, number of comments, date, and flair (community-defined content filter) were collected using the Pushshift Reddit API (Baumgartner, 2018). The post comments were collected using the Python Reddit API Wrapper, PRAW (Boe, 2021). Each post’s sentiment scores were calculated using VADER (Hutto & Gilbert, 2014) with a customized dictionary that reflects the common emojis used in these subreddits. Fifty-seven meta-features on post titles were produced using the spaCy large English model (Honnibal et al., 2020). The exploratory data analyses were generated with pandas-profiling (Brugman, 2019) and sweetviz (Bertrand, 2022).


I used Pushshift to collect post titles and post metadata. Next, I used PRAW to collect post comments. The customized VADER dictionary assigned the “gem stone”, “gorilla”, “rocket”, and “moon” emojis (in its different versions), as well as the “raising hands” and “open hands” emojis (in their different skin tones), a score of four, which is the highest score in VADER and signifies highly positive sentiment. The “crayon” emoji was assigned a score of one, reflecting a moderately positive sentiment. These distinctive emoji uses reflect the communication and language patterns in these subreddits. For example, the “gem stone” emoji means “diamond hands”, which describes an investor who refrains from selling an investment despite downturns or losses. The combination of the “rocket” and “moon” emojis means “going to the moon”, which describes the price of a financial instrument rising off the charts.
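The emoji scores described above can be written down as a small lookup table. The sketch below is illustrative only: it does not call VADER, and how the entries were merged into VADER’s lexicon is not described here. The function names are hypothetical, and reading “different versions of moon” as the code points U+1F311 to U+1F31D is an assumption.

```python
# The five Fitzpatrick skin-tone modifiers, U+1F3FB to U+1F3FF.
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]


def with_skin_tones(base):
    """Return the base emoji plus its five skin-tone variants."""
    return [base] + [base + tone for tone in SKIN_TONES]


def build_emoji_scores():
    """Map each emoji to the valence described in the text (4 = highest)."""
    high = ["\U0001F48E", "\U0001F98D", "\U0001F680"]  # gem stone, gorilla, rocket
    high += with_skin_tones("\U0001F64C")  # raising hands
    high += with_skin_tones("\U0001F450")  # open hands
    # ASSUMPTION: "different versions of moon" = moon phases and moon faces.
    high += [chr(cp) for cp in range(0x1F311, 0x1F31E)]
    scores = {emoji: 4.0 for emoji in high}
    scores["\U0001F58D"] = 1.0  # crayon
    return scores


def emoji_valences(text, scores):
    """List the valence of each known emoji in text, longest sequences first."""
    keys = sorted(scores, key=len, reverse=True)
    found, i = [], 0
    while i < len(text):
        match = next((k for k in keys if text.startswith(k, i)), None)
        if match is None:
            i += 1
        else:
            found.append(scores[match])
            i += len(match)
    return found


EMOJI_SCORES = build_emoji_scores()
# Gem stone, raising hands (medium skin tone), rocket, full moon.
print(emoji_valences("GME \U0001F48E\U0001F64C\U0001F3FD \U0001F680\U0001F315", EMOJI_SCORES))
```

Matching the longest sequences first keeps a skin-toned emoji from being counted as its base emoji followed by an unmatched modifier.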

Quality control

The values collected from Pushshift, such as scores and number of comments, reflect only the values at the time of collection. There might be a discrepancy between the values collected and the real-time values. The customized update to the VADER dictionary includes only the commonly agreed-on emojis used by the GameStop retail investors. These particular emoji uses are also shared by the larger communities associated with meme culture. The pre-processing results on post titles are included in the 57 meta-features, which can support future analyses, such as creating further features.
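As a rough illustration of what title meta-features of this kind look like, the following sketch computes three of them with the Python standard library only. It is a stand-in, not the spaCy pipeline used for the dataset: the stop-word list is a tiny hypothetical sample, the cleaning rule (lowercasing and dropping non-alphanumeric characters) is an assumption, and the feature names are invented for the example.

```python
import re

# ASSUMPTION: a tiny illustrative stop-word list; spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "is", "of", "in", "i", "it"}


def title_meta_features(title):
    """Compute three simple title features (hypothetical feature names)."""
    tokens = title.split()
    # ASSUMPTION: "cleaning" lowercases and strips non-alphanumeric characters;
    # tokens that become empty (for example, a lone emoji) are dropped.
    cleaned = [re.sub(r"[^a-z0-9]", "", token.lower()) for token in tokens]
    cleaned = [token for token in cleaned if token]
    stop_count = sum(token in STOP_WORDS for token in cleaned)
    return {
        "word_count": len(tokens),
        "stop_word_count": stop_count,
        "word_count_cleaned": len(cleaned) - stop_count,
    }


# Hypothetical title, not taken from the dataset.
print(title_meta_features("GME is going to the moon!!! \U0001F680"))
```

Further features can be derived the same way by adding keys to the returned dictionary.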

(3) Dataset Description

Object name

Reddit Dataset on Meme Stock: GameStop


Format names and versions

CSV; HTML; Version 2.0

Creation dates

2022-02-15 to 2022-04-26

Dataset creators

Jing Han





Repository name


Publication date


(4) Reuse Potential

The 74 variables in this dataset provide opportunities for future analyses, such as creating further features during exploratory analysis. For example, post flairs can be used as post labels for text classification research. Researchers interested in understanding online communication patterns could use this labeled dataset to train a classifier and apply multiclass or multilabel inference to the comment threads. The results of such text classification could also be used to understand the communication processes of these subreddits. The relationship between communication processes and the effects of the online social movement (MOASS) could be studied by performing a time series analysis on the dataset and analyzing mainstream media attention to the movement. Furthermore, word count, stop word count, word count after cleaning, and part-of-speech tagging would be useful for named-entity recognition and online language studies. Studying the online language use contained in the dataset would help in understanding the community culture of these subreddits, which could contribute to studies of meme culture and, more broadly, online culture. Research has also outlined methods for using public sentiment to analyze commercial interests. For example, researchers have studied the relationship between public sentiment on social media platforms and market impact (Nguyen & Shirai, 2015; Audrino et al., 2020), and S&P Dow Jones Indices includes a social media sentiment factor (S&P Global, n.d.). Sentiment annotation on this Reddit dataset using VADER with a customized dictionary could provide a baseline comparison for researchers interested in using sentiment as a variable to study the processes and effects of public sentiment. Specifically, the sentiment annotation could assist studies of the relationship between public sentiment and stock price fluctuations, and between public sentiment and public opinion.
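One possible first step toward such a time series analysis is to aggregate the compound sentiment score by day, which yields a series that could then be aligned with daily stock prices. The sketch below is a minimal example under stated assumptions: the column names `date` and `compound` and the ISO-style timestamp format are hypothetical and should be checked against the actual CSV headers, and the rows are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def daily_mean_compound(rows):
    """Average the compound sentiment score of posts for each calendar day."""
    by_day = defaultdict(list)
    for row in rows:
        # ASSUMPTION: dates are strings that start with YYYY-MM-DD.
        day = row["date"][:10]
        by_day[day].append(float(row["compound"]))
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Hypothetical rows shaped like the output of csv.DictReader.
rows = [
    {"date": "2021-01-27 09:30:00", "compound": "0.50"},
    {"date": "2021-01-27 15:45:00", "compound": "0.25"},
    {"date": "2021-01-28 10:00:00", "compound": "-0.25"},
]
print(daily_mean_compound(rows))
```

With a real file, the rows could be read with `csv.DictReader` from the standard library, and the resulting dictionary joined to a price series on the date key.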

The Reddit dataset generated during the GameStop short squeeze stands out from other Reddit corpora because of its socio-economic relevance. The social movement following the event demonstrates the power of people and the long-term economic impact of their actions. Additionally, Reddit’s API terms of use allow more generous access to data than those of other social media platforms (Reddit, 2016). Reddit’s data structure and limited restrictions on posting content provide opportunities to study online language use, communication processes, public opinions, online culture, online communities, and online social movements.