Misinformation and conspiratorial claims about the coronavirus hinder efforts to stem the pandemic. Public health depends upon people having accurate knowledge about the severity of the problem, how they can avoid infection, and what treatments can help them (Goldacre, 2009). Studies have also shown that belief in conspiracy theories makes people less likely to adopt behaviours that protect their health, such as getting vaccinated (Dunn et al., 2017). While all large social media platforms can host misinformation, research suggests that YouTube has played a particularly important role as a source of misinformation related to the coronavirus pandemic (Allington, Duffy, Wessely, Dhavan, & Rubin, 2020).
In April 2020, YouTube’s Chief Executive Susan Wojcicki stated that the company was increasing its efforts to remove “medically unsubstantiated” videos, using both automated detection and human moderators (Cellan-Jones, 2020). YouTube publishes only aggregated information about the videos that it removes for breaking its Community Guidelines. It is, however, possible to gather information about individual removed videos from various public data sources. This dataset describes all videos that circulated on publicly searchable social media and were then removed by YouTube because they contained false information about the coronavirus.
The dataset was created for the Computational Propaganda Project at the Oxford Internet Institute, in order to study the scale of the audience of COVID-related misinformation and its mechanisms of distribution on social media.
This dataset describes 8,122 YouTube videos that contain COVID-related misinformation. Instead of applying our own inclusion criteria, we identify these videos by following the categorisations made by YouTube itself when removing videos.
We identified COVID-related videos by looking for posts on Facebook, Reddit and Twitter that link to YouTube and that match COVID-related keywords. For Twitter, we used an open access dataset that covered the period from October 2019 to the end of April 2020 (Dimitrov et al., 2020). This dataset was based on a set of 268 COVID-related keywords. We simplified and updated this list to a total of 71 keywords (Knuutila, Herasimenka, Au, Bright, & Howard, 2020). We used the CrowdTangle service to search for posts on Reddit and Facebook between 1 October 2019 and 30 June 2020. CrowdTangle is a database that contains public groups and pages from Facebook and Reddit (CrowdTangle, 2020). It does not contain personal accounts or closed groups. The dataset will not be updated in the future.
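The matching step described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the three keywords are a hypothetical stand-in for the 71-keyword list, the function names are invented for illustration, and only the two most common YouTube link forms are handled.

```python
import re
from urllib.parse import urlparse, parse_qs

# Hypothetical subset; the actual 71 keywords are listed in Knuutila et al. (2020).
KEYWORDS = ["coronavirus", "covid", "sars-cov-2"]

URL_PATTERN = re.compile(r"https?://\S+")


def youtube_video_id(url):
    """Return the video ID for youtube.com/watch and youtu.be links, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        return None
    if host == "youtu.be":
        return parsed.path.strip("/") or None
    return None


def matches_keywords(text, keywords=KEYWORDS):
    """Case-insensitive substring match against the keyword list."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def covid_video_ids(posts):
    """Collect distinct video IDs from posts whose text matches a keyword."""
    found = set()
    for text in posts:
        if not matches_keywords(text):
            continue
        for url in URL_PATTERN.findall(text):
            # Strip punctuation that often trails a URL inside a sentence.
            video_id = youtube_video_id(url.rstrip(".,)"))
            if video_id:
                found.add(video_id)
    return found
```

Collecting IDs into a set mirrors the deduplication implied by counting "distinct videos": the same video linked from several posts or platforms is counted once.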
This search resulted in a list of 1,091,876 distinct videos. We then followed the YouTube link to each video, and where the videos were no longer available we recorded the reason that the YouTube site gave for the video having been removed. With this method, we identified 8,122 COVID-related videos that YouTube had removed because they breached its Community Guidelines.
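As a rough illustration of the recording step, the Python sketch below maps the notice shown on an unavailable video page to a coarse removal category. The notice fragments and category names are illustrative assumptions, not a verified list of YouTube's wording, and the sketch assumes the notice text has already been extracted from the page.

```python
# Illustrative fragments only (assumptions); YouTube's actual wording may differ.
REMOVAL_PATTERNS = {
    "community_guidelines": "violating youtube's community guidelines",
    "terms_of_service": "violating youtube's terms of service",
    "removed_by_uploader": "removed by the uploader",
    "account_terminated": "account associated with this video has been terminated",
    "private": "video is private",
}


def classify_removal(notice):
    """Map a page's unavailability notice to a coarse category.

    A missing or empty notice is treated as the video still being available,
    which is an assumption of this sketch.
    """
    if not notice:
        return "available"
    # Normalise case and curly apostrophes before matching.
    normalised = notice.lower().replace("\u2019", "'")
    for reason, fragment in REMOVAL_PATTERNS.items():
        if fragment in normalised:
            return reason
    return "other"
```

Under this scheme, only videos classified as `community_guidelines` would correspond to the removals described above; the other categories cover videos that became unavailable for unrelated reasons.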
For these 8,122 videos, we recovered additional information and metadata from other sources, since YouTube itself only published the reason for their removal. First, we recovered the titles and part of the description for all the videos that had been posted to Facebook. The posts on Facebook displayed the original titles and the first 157 characters of the video’s description, which we could read by programmatically retrieving every Facebook post. We also queried the Facebook Graph API to get the total number of shares, comments and reactions that the videos had received across the entire platform, including posts to individual profiles and closed groups. Data collection was undertaken in July 2020.
Lastly, we recovered metadata about the videos from archive.org’s Wayback Machine, a service that archives older versions of webpages. Copies of the deleted YouTube pages were accessible through the Wayback Machine’s API in 935 cases. For these videos, we could access the view counts, channel subscriber counts and full descriptions of the videos, as well as the videos’ creation dates. In 420 cases, we were also able to approximate how long the video had been visible, by noting the date on which the Wayback Machine had archived the first copy of the video’s page that stated its removal.
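The following Python sketch shows one way such a lookup and the visibility estimate could be expressed. It assumes the Wayback Machine's public availability endpoint, which returns only the snapshot closest to a requested timestamp; the source does not say which API the authors used, and the sketch does not show how to locate the first snapshot that states the removal. Function names are hypothetical, and no network request is made here.

```python
from datetime import datetime
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build a query URL for the availability endpoint (no request is sent)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDD or YYYYMMDDhhmmss
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def parse_wayback_timestamp(ts):
    """Wayback timestamps are 14 digits: YYYYMMDDhhmmss."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_snapshot(response):
    """Extract (snapshot_url, capture_time) from a decoded JSON response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], parse_wayback_timestamp(closest["timestamp"])


def days_visible(published, first_removed_snapshot):
    """Whole days between publication and the first capture showing removal.

    This is an upper bound: the video may have been removed some time
    before the archive next captured the page.
    """
    return (first_removed_snapshot - published).days
```

Because the estimate depends on when the archive happened to capture the page, it is an approximation rather than an exact removal time, consistent with the wording above.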
Content of dataset
The dataset contains the following information:
- The YouTube links where the videos were viewable prior to their removal.
- The titles, descriptions, and view counts of the videos, where these could be recovered.
- The identification numbers of the YouTube channels where the videos were posted and the channels’ subscriber counts.
- A timestamp of when the videos were published and removed, where these could be recovered.
- A link to archive.org pages where metadata about the videos and in many cases the videos themselves can be viewed.
- Engagement statistics from Facebook’s Graph API for every video, describing overall engagement across the platform.
- ID numbers for Twitter and public Facebook posts linking to the videos.
Format names and versions
Data was collected in July 2020 and covers a period from October 2019 to June 2020
Aleksi Knuutila, Aliaksandr Herasimenka, Hubert Au, Jonathan Bright, Philip N. Howard
30th of November 2020
The dataset is a resource for researchers in the humanities who seek to study narratives related to the coronavirus and the communities that produce them. One challenge for such studies is that medical misinformation is ephemeral and often quickly removed from social media platforms. In many cases, however, it is still possible to view and analyse the videos. Archive.org and similar services might hold copies of the videos, and the videos’ titles may help find them hosted elsewhere. The project that created the dataset raised questions about how to study the content removal policies of platforms and how to utilise the traces left behind by deleted content, and future work in this area may well suggest productive new methodological approaches.
The dataset can also be reused for research on the extent of misinformation on social media and on information diets. One benefit of the dataset is that it is a relatively comprehensive list of COVID-related misinformation videos that were shared publicly in the study period.