Digital Narratives of COVID-19: A Twitter Dataset for Text Analysis in Spanish

Susanna Allés-Torrent; Gimena del Rio Riande; Jerry Bonnell; Dieyun Song; Nidia Hernández

(1) Overview

Repository location

https://doi.org/10.5281/zenodo.3824950

Context

Although Twitter can be understood as a massive corpus of texts to which many distant reading methodologies can be applied, interest in it as a resource for Digital Humanities (DH) projects has not been widespread. Its most popular use to date has been related to the movement “Twitter for scholarly networking”, in which digital humanists analysed how the DH community grew and established its particular networks. However, we can highlight relevant DH studies that relate Twitter data to Natural Language Processing (NLP), as well as the DHNow initiative, which has been offering open tools and resources to work with Twitter. Moreover, we note other responses closely related to DH that performed qualitative and quantitative analyses of Twitter.

Digital Narratives of COVID-19 (DHCovid) is a digital humanities project funded by the University of Miami (FL) and developed in collaboration with the National Scientific and Technical Research Council (CONICET, Argentina) that investigates the sociolinguistic and geographical trends and topics in Twitter conversations surrounding the COVID-19 pandemic. Since April 2020, this project has been collecting COVID-19 related tweets in Spanish as well as tweets in English and Spanish in specific geographic locations: Argentina, Colombia, Ecuador, Mexico, Peru, Spain, and South Florida.

(2) Method

Steps

To assemble the Twitter corpus, a PHP script mines the Twitter data stream through Twitter’s Application Programming Interface (API) and recovers a series of specific tweet identifiers (IDs). Our sampling strategy is built on four main variables: language, keywords, region, and date.

The corpus is available through three repositories:

GitHub. Tweet IDs are stored in a MySQL relational database where they are “hydrated,” that is, all metadata associated with the tweets, including the body text, is recovered. An additional script then organizes the tweet IDs in the database by day, language, and region, and creates a plain-text file for each combination with a list of the corresponding tweet IDs. The script generates these files daily and organizes them into folders, one directory per day. These are uploaded directly to our public GitHub repository (Table 1). Data collection began on April 24, 2020, and new tweet IDs are uploaded automatically every day until May 2021.
Project website endpoint. A free-access endpoint for querying and downloading “hydrated” tweets is available from the DHCovid website. An additional script queries the database and recovers the body text of tweets (see Quality Control section). Access to a tidied and structured Twitter corpus for on-demand querying is one of the most meaningful contributions of our project for data reuse and text mining activities.
Zenodo. A first stable version of the dataset, published on May 13, 2020, was released through Zenodo as a compressed ZIP file containing folders of daily tweets posted between April 24, 2020 and May 12, 2020. A second and final version, with the complete collection of tweet IDs, will be uploaded at the end of the project in May 2021.
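The grouping step described for the GitHub repository can be sketched as follows. This is a minimal illustration in Python, not the project's PHP script: the record fields (`id`, `day`, `lang`, `region`), the use of an empty region for the all-Spanish file, and the exact file name pattern are assumptions made for the example.

```python
from collections import defaultdict
from pathlib import Path


def group_ids(records):
    """Group tweet IDs by (day, language, region).

    `records` is a hypothetical list of dicts standing in for database rows.
    An empty region string is assumed to mean "no regional filter".
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["day"], rec["lang"], rec["region"])
        groups[key].append(rec["id"])
    return dict(groups)


def write_daily_files(groups, out_dir):
    """Write one plain-text file of IDs per combination, one folder per day."""
    written = []
    for (day, lang, region), ids in sorted(groups.items()):
        day_dir = Path(out_dir) / day
        day_dir.mkdir(parents=True, exist_ok=True)
        suffix = f"{lang}_{region}" if region else lang
        # Illustrative file name only; the repository's pattern may differ.
        path = day_dir / f"dhcovid_{day}_{suffix}.txt"
        path.write_text("\n".join(ids) + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Calling `write_daily_files(group_ids(records), "twitter-corpus")` would then produce one directory per day, each holding one ID list per language and region.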

Table 1

Organization of plain text datasets in the GitHub repository.


YEAR-MONTH-DAY	DAILY FOLDER

dhcovid_YEAR_MONTH_en_fl.txt	Tweets in English in Florida

dhcovid_YEAR_MONTH_es.txt	All Spanish tweets

dhcovid_YEAR_MONTH_es_ar.txt	Tweets in Spanish in Argentina

dhcovid_YEAR_MONTH_es_co.txt	Tweets in Spanish in Colombia

dhcovid_YEAR_MONTH_es_ec.txt	Tweets in Spanish in Ecuador

dhcovid_YEAR_MONTH_es_es.txt	Tweets in Spanish in Spain

dhcovid_YEAR_MONTH_es_fl.txt	Tweets in Spanish in Florida

dhcovid_YEAR_MONTH_es_mx.txt	Tweets in Spanish in Mexico

dhcovid_YEAR_MONTH_es_pe.txt	Tweets in Spanish in Peru
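For reuse, the plain-text files listed in Table 1 can be read and split into batches before hydration. The sketch below assumes one tweet ID per line and a batch size of 100, which we believe matches the per-request limit of Twitter's tweet lookup endpoints at the time of writing; both points should be checked against the files and the current API documentation. No network request is made here.

```python
from itertools import islice


def read_ids(lines):
    """Yield tweet IDs from an iterable of lines, skipping blank lines."""
    for line in lines:
        tweet_id = line.strip()
        if tweet_id:
            yield tweet_id


def batched(ids, size=100):
    """Yield successive lists of at most `size` IDs (size is an assumption)."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

A file object can be passed directly to `read_ids`, for example `batched(read_ids(open(path, encoding="utf-8")))`, and each batch handed to whichever hydration tool the user prefers.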

Sampling strategy

The recovery of tweets by language and keywords is straightforward: we only query tweets written in Spanish or English that contain at least one word from our user-defined keyword lists. We defined two lists of keywords, one in English and one in Spanish, related to the COVID-19 pandemic. Consequently, only tweets containing one of these words and/or hashtags are selected. The English keywords apply only to the Miami area and include terms such as “covid,” “coronavirus,” “pandemic,” “quarantine,” “#stayathome,” “outbreak,” “lockdown,” and “#socialdistancing.” The Spanish keywords include “covid,” “coronavirus,” “pandemia,” “cuarentena,” “confinamiento,” “#quedateencasa,” “desescalada,” and “#distanciamientosocial.”
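The keyword criterion can be illustrated with a short Python sketch. It uses the keywords listed above and a simple case-insensitive substring test; this matching rule is an assumption for illustration, and the Twitter API applies its own matching rules when filtering the stream.

```python
ENGLISH_KEYWORDS = [
    "covid", "coronavirus", "pandemic", "quarantine",
    "#stayathome", "outbreak", "lockdown", "#socialdistancing",
]
SPANISH_KEYWORDS = [
    "covid", "coronavirus", "pandemia", "cuarentena",
    "confinamiento", "#quedateencasa", "desescalada",
    "#distanciamientosocial",
]


def matches_keywords(text, keywords):
    """Return True if the text contains at least one keyword.

    Case-insensitive substring matching (an assumed, simplified rule).
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)
```

For example, `matches_keywords("Día 40 de CUARENTENA", SPANISH_KEYWORDS)` holds, while a tweet that mentions none of the listed terms is not selected.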

Our strategy is also shaped by Twitter API policies. First, in its free version, the API did not allow querying tweets older than seven days. Second, Twitter allows users to publish georeferenced tweets (with exact location), but retrieving geotagged tweets is complicated by the absence of a facility for querying by geographic region. A pragmatic approach led us to define “country” as a circle surrounding the area of interest, e.g., “Mexico” is defined as latitude 21.295658, longitude -100.291341, and a radius of 450 miles. Political and national borders do not always coincide with these circles, so an area-specific corpus can sometimes contain tweets from a neighbouring country, e.g., a query for Argentina also captures parts of Uruguay, and a query for Colombia captures parts of Ecuador.
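The circular definition of a region can be made concrete with a great-circle distance check. The following Python sketch uses the haversine formula and the Mexico parameters given above; the Earth radius constant and the test coordinates for the three cities are approximate values supplied for illustration, and the sketch is not the query logic used by the Twitter API itself.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius

# (centre latitude, centre longitude, radius in miles), as given in the text
MEXICO = (21.295658, -100.291341, 450.0)


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_region(lat, lon, region):
    """Return True if the point lies inside the region's circle."""
    centre_lat, centre_lon, radius = region
    return haversine_miles(lat, lon, centre_lat, centre_lon) <= radius


print(in_region(19.4326, -99.1332, MEXICO))   # Mexico City (approx.)
print(in_region(25.9017, -97.4975, MEXICO))   # Brownsville, Texas (approx.)
print(in_region(32.5149, -117.0382, MEXICO))  # Tijuana (approx.)
```

Under these assumptions, Mexico City and Brownsville fall inside the circle while Tijuana falls outside it, which illustrates both kinds of mismatch between a circle and national borders.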

Quality control

The “hydration” of the collected tweet IDs is followed by an additional data tidying process before any body text is returned to the user. We apply a set of rules to the tweet body text: convert all words to lowercase; remove accents, punctuation, and mentions of users (@users) to protect privacy; and replace all links with a generic “URL” term. While all text is enforced to be in UTF-8 encoding, a particular challenge of the Spanish corpus is that accents and graphemes such as “ñ” can be difficult to process and preserve. Most of those cases were resolved by a script that detects special entity codes and replaces them with the correct character (e.g., &ntilde; as ñ). We have also transliterated emojis into their corresponding UTF-8 characters and, for now, eliminated them from our experiments. This processing facilitates the application of NLP techniques.
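The tidying rules can be approximated in a few lines of Python. This is a sketch under stated assumptions rather than the project's own script: the order of the steps, the regular expressions, the decision to keep “ñ” while stripping other accents, and the removal of emojis together with punctuation are all choices made for this example.

```python
import html
import re

# Assumption: strip acute accents and the diaeresis but keep "ñ".
ACCENT_MAP = str.maketrans("áéíóúü", "aeiouu")


def tidy(text):
    """Apply the tidying rules to one tweet body (illustrative only)."""
    text = html.unescape(text)                     # e.g. "&ntilde;" becomes "ñ"
    text = text.lower()                            # lowercase everything
    text = re.sub(r"https?://\S+", " URL ", text)  # generic term for links
    text = re.sub(r"@\w+", " ", text)              # drop user mentions
    text = text.translate(ACCENT_MAP)              # remove accents
    text = re.sub(r"[^\w\s]", " ", text)           # drop punctuation and emojis
    return " ".join(text.split())                  # normalise whitespace


print(tidy("¡Hola @Juan! La cuarentena en Espa&ntilde;a está aquí https://t.co/AbC 😷"))
```

Because punctuation is removed last, the “#” of hashtags is also dropped in this sketch, while the hashtag word itself is kept.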

(3) Dataset Description

Object name

http://doi.org/10.5281/zenodo.3824950

https://github.com/dh-miami/narratives_covid19/tree/master/twitter-corpus

https://covid.dh.miami.edu/get/

Format names and versions

.txt

Creation dates

2020-04-24 to 2021-05-31

Dataset creators

Susanna Allés-Torrent: Conceptualization; Funding Acquisition; Project administration; Supervision; Writing

Gimena del Rio Riande: Conceptualization; Writing

Jerry Bonnell: Data curation; Software; Visualization

Dieyun Song: Data curation; Writing

Nidia Hernández: Data curation; Visualization; Writing

Language

Spanish, English (see Table 2).

Table 2

Number of tweets for each month and each region from April 2020 to March 2021. FLeng and FLspa correspond to tweets in English and in Spanish, respectively, in the Greater Miami area. The Spanish column represents all tweets in Spanish.


DATE	FLSPA	FLENG	ECUADOR	PERU	COLOMBIA	SPAIN	ARGENTINA	MEXICO	SPANISH	TOTAL

2020 Apr	1.8k	5.7k	12.2k	12.8k	39k	47.3k	16k	93.7k	512.2k	740.7k

2020 May	6.1k	22k	48.4k	56.2k	168.1k	182.7k	68.5k	411.1k	2.4M	3.4M

2020 Jun	4.7k	18.9k	34.7k	43.2k	149.6k	124.7k	76.6k	319.2k	2.1M	2.9M

2020 Jul	6.5k	28.6k	34.7k	41.4k	171.2k	127.3k	79.8k	324.5k	2.1M	2.9M

2020 Aug	4.9k	16.4k	22.5k	32k	114.7k	116.9k	76.3k	225.5k	1.5M	2.1M

2020 Sep	4.3k	15.6k	18.9k	25.5k	86.8k	137.4k	86.3k	184.5k	1.5M	2.1M

2020 Oct	5.5k	23.1k	17.6k	21.8k	85.2k	145.6k	79.3k	205.3k	1.5M	2.1M

2020 Nov	4.6k	18.8k	18.7k	18k	74.4k	134.8k	66.9k	188.7k	1.3M	1.7M

2020 Dec	5.2k	21.1k	20.2k	25.7k	100.9k	116.9k	72.6k	248.9k	1.6M	2.2M

2021 Jan	5.4k	16.4k	27k	42k	125.9k	155.5k	78k	304.4k	1.9M	2.7M

2021 Feb	3.9k	13k	20.7k	39.5k	80k	112.3k	50.3k	207.1k	1.3M	1.9M

2021 Mar	4k	14.6k	21.6k	30.8k	74k	90.3k	67.2k	170.5k	1.3M	1.8M

Total	56.9k	214.3k	297.3k	388.9k	1.3M	1.5M	817.7k	2.9M	19.2M	26.6M
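Because Table 2 reports abbreviated counts (“k” for thousands, “M” for millions), readers who wish to recompute totals need to convert these labels to numbers first. The Python sketch below does so for the FLspa column; the function name is hypothetical, and since the table values are rounded, recomputed sums for other columns may differ slightly from the printed totals.

```python
from decimal import Decimal

MULTIPLIERS = {"k": 1_000, "M": 1_000_000}


def parse_count(label):
    """Convert an abbreviated count such as '93.7k' or '2.4M' to an integer."""
    label = label.strip()
    if label and label[-1] in MULTIPLIERS:
        return int(Decimal(label[:-1]) * MULTIPLIERS[label[-1]])
    return int(Decimal(label))


# Monthly FLspa values from Table 2, April 2020 to March 2021.
FLSPA = ["1.8k", "6.1k", "4.7k", "6.5k", "4.9k", "4.3k",
         "5.5k", "4.6k", "5.2k", "5.4k", "3.9k", "4k"]

flspa_total = sum(parse_count(value) for value in FLSPA)
print(flspa_total)
```

`Decimal` is used instead of floating-point arithmetic so that values such as 1.8 × 1000 convert exactly.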

License

Creative Commons license Attribution 4.0 International (CC BY 4.0).

Repository name

GitHub for the continuously updated daily version. Zenodo for the stable release with a DOI.

Publication date

First published to the GitHub repository on 2020-04-24 and subsequently released on Zenodo on 2020-05-13.

Statistics and contents

In Table 2 and Figure 1 we offer the basic statistics of the dataset.

Figure 1

Overview of number of tweets for each month and each region from April 2020 to March 2021. FLeng and FLspa correspond to tweets in English and in Spanish in the Greater Miami area.

(4) Reuse Potential

The COVID-19 pandemic has motivated a plethora of ambitious digital research focusing on Twitter, including work on the role and importance of automated Twitter accounts, also known as bots, the increase of politically radical discourse in social media, and general public perception. Most of these efforts, however, mine and analyze tweets written in English. DHCovid bridges this gap by bringing Spanish Twitter narratives into the conversation.

We use a CC BY license to give the public full reuse of our dataset, whether as text or as input for combined technical and humanistic processing techniques such as statistics, NLP, topic modelling, and sentiment analysis. Our datasets can serve as teaching, research, and activism resources across language, disciplinary, and professional boundaries. Scholars interested in the human experience of the pandemic, health professionals, policy makers, funding agencies, journalists, and active citizens are all welcome to engage with our project.

Journal of Open Humanities Data

Data Papers