(1) Overview
Repository location
Context
Although Twitter can be understood as a massive corpus of texts to which many distant reading methodologies can be applied, interest in it as a resource for Digital Humanities (DH) projects has not been widespread. Its most popular use to date relates to the “Twitter for scholarly networking” movement, in which digital humanists analysed how the DH community grew and established its networks (; ; ). However, there are relevant DH studies that relate Twitter data to Natural Language Processing (NLP) (; ; ), and the DHNow initiative has been offering open tools and resources for working with Twitter. Other work closely related to DH has performed qualitative and quantitative analysis of Twitter data (; ).
Digital Narratives of COVID-19 (DHCovid) is a digital humanities project funded by the University of Miami (FL) and developed in collaboration with the National Scientific and Technical Research Council (CONICET, Argentina) that investigates the sociolinguistic and geographical trends and topics in Twitter conversations surrounding the COVID-19 pandemic. Since April 2020, this project has been collecting COVID-19 related tweets in Spanish as well as tweets in English and Spanish in specific geographic locations: Argentina, Colombia, Ecuador, Mexico, Peru, Spain, and South Florida.
(2) Method
Steps
To assemble the Twitter corpus, a PHP script mines the Twitter data stream through Twitter’s Application Programming Interface (API) and recovers a series of tweet identifiers (IDs). Our sampling strategy rests on four main variables: language, keywords, region, and date.
The corpus is available through three repositories:
- GitHub. Tweet IDs are stored in a MySQL relational database where they are “hydrated,” that is, all metadata associated with each tweet is recovered, including its body text. An additional script then organizes the tweet IDs in the database by day, language, and region, and creates a plaintext file for each combination with a list of the corresponding tweet IDs. The script generates these files daily and organizes them into folders, where each directory represents one day. These are uploaded directly to our public GitHub repository (Table 1). Data collection began on April 24th, 2020, and new tweet IDs are uploaded automatically every day until May 2021.
- Project website endpoint. A free-access endpoint for querying and downloading “hydrated” tweets is available from the DHCovid website. An additional script queries the database and recovers the body text of the tweets (see Quality Control section). Access to a tidied and structured Twitter corpus for on-demand querying is one of the most meaningful contributions of our project for data reuse and text mining activities.
- Zenodo. A first stable version of the dataset, published on May 13, 2020, was released through Zenodo as a compressed ZIP file containing folders of daily tweets made between April 24th, 2020 and May 12th, 2020. A second and final version will be uploaded by the end of the project in May 2021 with the complete collection of tweet IDs.
YEAR-MONTH-DAY | DAILY FOLDER |
---|---|
dhcovid_YEAR_MONTH_en_fl.txt | Tweets in English in Florida |
dhcovid_YEAR_MONTH_es.txt | All Spanish tweets |
dhcovid_YEAR_MONTH_es_ar.txt | Tweets in Spanish in Argentina |
dhcovid_YEAR_MONTH_es_co.txt | Tweets in Spanish in Colombia |
dhcovid_YEAR_MONTH_es_ec.txt | Tweets in Spanish in Ecuador |
dhcovid_YEAR_MONTH_es_es.txt | Tweets in Spanish in Spain |
dhcovid_YEAR_MONTH_es_fl.txt | Tweets in Spanish in Florida |
dhcovid_YEAR_MONTH_es_mx.txt | Tweets in Spanish in Mexico |
dhcovid_YEAR_MONTH_es_pe.txt | Tweets in Spanish in Peru |
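The organization of tweet IDs into daily folders can be sketched as follows. This is a minimal Python illustration, not the project's PHP script: the function names, the record layout, the zero-padded month, and the use of `None` for the general Spanish collection are all assumptions made for the example, and the file name pattern simply follows Table 1 as printed.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical record layout: (tweet_id, collection day, language, region or None).
Record = Tuple[str, date, str, Optional[str]]


def file_path(day: date, lang: str, region: Optional[str]) -> str:
    """Build a path following the Table 1 pattern (zero padding is assumed)."""
    folder = day.isoformat()  # YEAR-MONTH-DAY daily folder
    suffix = lang if region is None else f"{lang}_{region}"
    return f"{folder}/dhcovid_{day.year}_{day.month:02d}_{suffix}.txt"


def group_ids(records: Iterable[Record]) -> Dict[str, List[str]]:
    """Group tweet IDs by day, language, and region, one list per output file."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tweet_id, day, lang, region in records:
        groups[file_path(day, lang, region)].append(tweet_id)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        ("111", date(2020, 4, 24), "es", "ar"),
        ("222", date(2020, 4, 24), "es", None),
        ("333", date(2020, 4, 24), "es", "ar"),
    ]
    for path, ids in group_ids(sample).items():
        print(path, ids)
```

Each key of the returned dictionary corresponds to one plaintext file, and its value to the list of tweet IDs that file would contain.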
Sampling strategy
The recovery of tweets by language and keywords is straightforward: we only query tweets written in Spanish or English that contain at least one word from our user-defined keyword lists. We defined two lists of keywords, one in English and one in Spanish, related to the COVID-19 pandemic. Consequently, only tweets with one of these words and/or hashtags are selected. The English keywords apply only to the Miami area and include terms such as “covid,” “coronavirus,” “pandemic,” “quarantine,” “#stayathome,” “outbreak,” “lockdown,” and “#socialdistancing.” The Spanish keywords include “covid,” “coronavirus,” “pandemia,” “cuarentena,” “confinamiento,” “#quedateencasa,” “desescalada,” and “#distanciamientosocial” ().
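The selection rule can be illustrated with a short Python sketch. It is an approximation under stated assumptions: the lists contain only the example terms quoted above (the project's full lists may be longer), matching is done by case-insensitive substring search, and the names `KEYWORDS` and `matches_keywords` are hypothetical. In the project itself the filtering is performed by the Twitter API query, whose matching rules may differ.

```python
# Example keyword lists taken from the terms quoted in the text;
# the project's complete lists are not reproduced here.
KEYWORDS = {
    "en": ["covid", "coronavirus", "pandemic", "quarantine",
           "#stayathome", "outbreak", "lockdown", "#socialdistancing"],
    "es": ["covid", "coronavirus", "pandemia", "cuarentena", "confinamiento",
           "#quedateencasa", "desescalada", "#distanciamientosocial"],
}


def matches_keywords(text: str, lang: str) -> bool:
    """Return True if the text contains at least one keyword for the language.

    Assumption: plain case-insensitive substring matching.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS.get(lang, []))


if __name__ == "__main__":
    print(matches_keywords("Día 30 de cuarentena", "es"))  # True
    print(matches_keywords("Nice weather today", "en"))    # False
```

Note that substring matching does not normalize accents, so a hashtag written as “#QuédateEnCasa” would not match “#quedateencasa” in this sketch.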
Our strategy is also shaped by Twitter API policies. First, in its free version, the API did not allow querying tweets older than seven days. Second, Twitter allows users to publish georeferenced tweets (with exact location), but retrieving them is complicated by the absence of a facility for querying by geographic region. A pragmatic approach led us to define “country” as a circle surrounding the area of interest, e.g., “Mexico” is defined as latitude 21.295658, longitude -100.291341, and a radius of 450 miles. Because political and national borders do not follow these circles, an area-specific corpus can sometimes contain tweets from a neighbouring country, e.g., a query for Argentina also captures parts of Uruguay, and a query for Colombia parts of Ecuador.
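The circle-based definition of a region can be made concrete with a great-circle distance check. The Python sketch below uses the haversine formula with the Mexico parameters given above; the function names, the Earth radius constant, and the test coordinates are assumptions chosen for illustration, and the project itself delegates this test to the Twitter API rather than computing it locally.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

# Circle for "Mexico" as given in the text: (latitude, longitude, radius in miles).
MEXICO = (21.295658, -100.291341, 450.0)


def haversine_miles(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def in_region(lat: float, lon: float, region: tuple) -> bool:
    """Return True if the point lies inside the circle (center lat, lon, radius)."""
    center_lat, center_lon, radius = region
    return haversine_miles(lat, lon, center_lat, center_lon) <= radius


if __name__ == "__main__":
    print(in_region(19.4326, -99.1332, MEXICO))   # Mexico City
    print(in_region(-34.6037, -58.3816, MEXICO))  # Buenos Aires
    # Approximate coordinates of Brownsville, Texas, just across the border:
    # the circle ignores the border, so this point also falls inside.
    print(in_region(25.9017, -97.4975, MEXICO))
```

The last call shows the border effect described above: a location outside the country can still fall within the circle that stands in for it.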
Quality control
The “hydration” of the collected tweet IDs undergoes an additional data tidying process before any body text is returned to the user. We apply a set of rules to the tweet body text: convert all words to lowercase; remove accents, punctuation, and user mentions (@users), the latter to protect privacy; and replace all links with a general “URL” term. All text is enforced to be UTF-8 encoded. A challenge particular to the Spanish corpus is that accents and graphemes such as “ñ” can be difficult to process and preserve. Most of these cases were resolved by a script that detects special entity codes and replaces them with the correct character (e.g., the entity code for “ñ” is replaced with “ñ”). We have also transliterated emojis into their corresponding UTF-8 representation and excluded them from our experiments for now. This processing facilitates the application of NLP techniques.
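The tidying rules can be sketched in Python as follows. This is an illustrative approximation, not the project's script: the function names and regular expressions are hypothetical, entity codes are assumed to be HTML character references, “ñ” is assumed to be preserved while other accents are stripped, and emojis are simply dropped together with punctuation.

```python
import html
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def strip_accents(text: str) -> str:
    """Remove combining accents but keep the grapheme 'ñ' (an assumption)."""
    out = []
    for ch in text:
        if ch == "ñ":
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if unicodedata.category(c) != "Mn"))
    return "".join(out)


def tidy(text: str) -> str:
    """Apply the tidying rules described above to one tweet body."""
    text = html.unescape(text)                 # entity codes -> characters
    text = unicodedata.normalize("NFC", text)  # compose n + tilde into ñ
    text = text.lower()
    text = URL_RE.sub(" URL ", text)           # links -> generic "URL" term
    text = MENTION_RE.sub(" ", text)           # drop @user mentions
    text = strip_accents(text)
    # Keep letters, digits, and whitespace; punctuation and emojis become spaces.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())              # collapse repeated whitespace


if __name__ == "__main__":
    raw = "¡Quédate en casa, @usuario! Espa&ntilde;a https://t.co/abc 😷 #COVID19"
    print(tidy(raw))  # quedate en casa españa URL covid19
```

Lowercasing is applied before link replacement so that the “URL” placeholder keeps its capitals, and punctuation removal also strips the “#” from hashtags while leaving the tag text in place.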
(3) Dataset Description
Object name
http://doi.org/10.5281/zenodo.3824950
https://github.com/dh-miami/narratives_covid19/tree/master/twitter-corpus
Format names and versions
.txt
Creation dates
2020-04-24 to 2021-05-31
Dataset creators
Susanna Allés-Torrent: Conceptualization; Funding Acquisition; Project administration; Supervision; Writing
Gimena del Rio Riande: Conceptualization; Writing
Jerry Bonnell: Data curation; Software; Visualization
Dieyun Song: Data curation; Writing
Nidia Hernández: Data curation; Visualization; Writing
Language
Spanish, English (See Table 2).
DATE | FLSPA | FLENG | ECUADOR | PERU | COLOMBIA | SPAIN | ARGENTINA | MEXICO | SPANISH | TOTAL |
---|---|---|---|---|---|---|---|---|---|---|
2020 Apr | 1.8k | 5.7k | 12.2k | 12.8k | 39k | 47.3k | 16k | 93.7k | 512.2k | 740.7k |
2020 May | 6.1k | 22k | 48.4k | 56.2k | 168.1k | 182.7k | 68.5k | 411.1k | 2.4M | 3.4M |
2020 Jun | 4.7k | 18.9k | 34.7k | 43.2k | 149.6k | 124.7k | 76.6k | 319.2k | 2.1M | 2.9M |
2020 Jul | 6.5k | 28.6k | 34.7k | 41.4k | 171.2k | 127.3k | 79.8k | 324.5k | 2.1M | 2.9M |
2020 Aug | 4.9k | 16.4k | 22.5k | 32k | 114.7k | 116.9k | 76.3k | 225.5k | 1.5M | 2.1M |
2020 Sep | 4.3k | 15.6k | 18.9k | 25.5k | 86.8k | 137.4k | 86.3k | 184.5k | 1.5M | 2.1M |
2020 Oct | 5.5k | 23.1k | 17.6k | 21.8k | 85.2k | 145.6k | 79.3k | 205.3k | 1.5M | 2.1M |
2020 Nov | 4.6k | 18.8k | 18.7k | 18k | 74.4k | 134.8k | 66.9k | 188.7k | 1.3M | 1.7M |
2020 Dec | 5.2k | 21.1k | 20.2k | 25.7k | 100.9k | 116.9k | 72.6k | 248.9k | 1.6M | 2.2M |
2021 Jan | 5.4k | 16.4k | 27k | 42k | 125.9k | 155.5k | 78k | 304.4k | 1.9M | 2.7M |
2021 Feb | 3.9k | 13k | 20.7k | 39.5k | 80k | 112.3k | 50.3k | 207.1k | 1.3M | 1.9M |
2021 Mar | 4k | 14.6k | 21.6k | 30.8k | 74k | 90.3k | 67.2k | 170.5k | 1.3M | 1.8M |
Total | 56.9k | 214.3k | 297.3k | 388.9k | 1.3M | 1.5M | 817.7k | 2.9M | 19.2M | 26.6M |
License
Creative Commons license Attribution 4.0 International (CC BY 4.0).
Repository name
GitHub for the version updated daily. Zenodo for the stable version with a DOI.
Publication date
First published to the GitHub repository on 2020-04-24 and subsequently released on Zenodo on 2020-05-13.
(4) Reuse Potential
The COVID-19 pandemic has motivated a plethora of ambitious digital research focusing on Twitter, including the role and importance of automated Twitter accounts, also known as bots (), the increase of politically radical discourse in social media (), and the general public perception (). Most of the efforts, however, mine and analyze tweets written in English (; ; ). DHCovid bridges the gap by bringing Spanish Twitter narratives into the conversation.
We release the dataset under a CC BY license so that the public can freely reuse it, either as text or as input to combined technical and humanistic processing techniques such as statistics, NLP, topic modelling, and sentiment analysis. Our datasets can serve as resources for teaching, research, and activism across language, disciplinary, and professional boundaries. Scholars interested in the human experience of the pandemic, health professionals, policy makers, funding agencies, journalists, and active citizens are all welcome to engage with our project.