A Reproducible IT-Blog Corpus

Adrien Barbaresi; Jens Pohlmann

(1) Overview

Repository location

DWDS-platform (dwds.de): https://www.dwds.de/d/korpora/it_blogs

Context

This data was produced for an ongoing research project concerning public discourse about Internet Policy and some of it has been discussed by Barbaresi & Pohlmann ().

(2) Method

Steps

We first compiled a list of German IT-blogs and websites by identifying the main websites in this field and then searching for similar sites and keywords. We looked for blogs and websites that report on and discuss the latest developments in information technology and products, IT-law, and policy, as well as sites dedicated to commentary on the societal impact of technology. Aside from specific IT-news and policy portals, such as www.netzpolitik.org, and IT-product portals (e.g., www.mobilegeeks.de), we collected blogs by IT-lawyers and by scholars, intellectuals, and journalists working in the fields of Communication and Media Studies, Tech, Law, Policy, and Philosophy. After finding these initial sites, we manually extended the list by handpicking recommendations from the engine https://www.similarsites.com that fit the corpus profile.

We fetched sitemaps from the sites of interest where they were available (the sitemaps protocol primarily allows webmasters to inform search engines about pages on their sites that are available for crawling) and identified content on the remaining sites by web crawling ().
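As an illustration of the sitemap step, the following minimal sketch reads a sitemap document and separates page URLs from links to nested sitemaps. It relies only on the Python standard library and is not the code used to build the corpus; the function name and the sample URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps protocol (sitemaps.org).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_links(xml_text):
    """Return (page_urls, nested_sitemap_urls) found in a sitemap document."""
    root = ET.fromstring(xml_text)
    locs = [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text and loc.text.strip()
    ]
    # A <sitemapindex> lists further sitemaps; a <urlset> lists pages.
    if root.tag == SITEMAP_NS + "sitemapindex":
        return [], locs
    return locs, []


# Hypothetical sample sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://blog.example.org/post-1</loc><lastmod>2019-09-03</lastmod></url>
  <url><loc>https://blog.example.org/post-2</loc></url>
</urlset>"""

pages, nested = extract_sitemap_links(SAMPLE)
print(pages)
```

Nested sitemaps returned in the second list would be fetched and parsed in the same way until only page URLs remain.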

Text extraction targets three components of each page: the main text (e.g., without navigation or footer information), comments (potential user-generated content listed at the bottom of an article), and metadata (at least title, date, and URL, and possibly author, tags, categories, and summary). The documents are then stored as XML and processed by a platform for lexicographic corpus research ().

For more information see https://trafilatura.readthedocs.io/. The software is published under an open-source license: https://doi.org/10.5281/zenodo.3460969.
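To make the shape of such a record concrete, here is a minimal sketch that serializes one extracted page with its required and optional metadata. The element and attribute names are hypothetical and chosen for illustration; they are not the schema produced by Trafilatura or used on the platform.

```python
import xml.etree.ElementTree as ET


def document_to_xml(title, date, url, text, comments="", **optional):
    """Serialize one extracted page; element and attribute names are illustrative."""
    # Title, date, and URL are always expected.
    doc = ET.Element("doc", {"title": title, "date": date, "url": url})
    # Optional metadata (author, tags, categories, summary) only when present.
    for key, value in sorted(optional.items()):
        if value:
            doc.set(key, value)
    ET.SubElement(doc, "main").text = text
    ET.SubElement(doc, "comments").text = comments
    return ET.tostring(doc, encoding="unicode")


# Hypothetical example values.
record = document_to_xml(
    title="Example post",
    date="2019-09-03",
    url="https://blog.example.org/post-1",
    text="Main body of the article.",
    comments="A reader comment.",
    author="Jane Doe",
)
print(record)
```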

Sampling strategy

All pages found on the websites have been processed in full, provided that this was technically possible and that the resulting documents contained a meaningful amount of text (e.g., no image galleries). Metadata, text, and comments have been extracted and indexed on the platform, which allows for further text-based filtering using faceted searches.

Quality control

The corpus has been checked for consistency and completeness (especially concerning the accuracy of the scraping process). Relevant metadata at the website level (e.g., copyright licenses) have been checked as well.

(3) Dataset Description

Object name

IT-blog corpus, EN+DE.

Format names and versions

TXT, Python package, CSV and XML data export.

Creation dates

2019-09-03, update pending.

Dataset creators

Adrien Barbaresi (Berlin-Brandenburg Academy of Sciences), Jens Pohlmann (ZeMKI, University of Bremen; CESTA, Stanford University).

Language

English and German.

License

Access to the corpus requires a free login; software under GPLv3+ license; list of sources under CC BY-SA license v4.0.

Repository name

DWDS-platform (dwds.de)

Publication date

2020-09-15.

Size

Hundreds of different sources. German version: 1.5 million documents, amounting to 900 million tokens. English version: 2 million documents, amounting to about 1.3 billion tokens.

(4) Reuse Potential

Quotes extracted from the corpus can be used in a variety of formats. The whole dataset cannot be copied and re-used as such. However, the corpus can be re-created from the sources list using the open-source corpus building software Trafilatura (https://github.com/adbar/trafilatura), thus making it free to copy while bypassing potential copyright concerns. The data is of interest to other corpus linguists, political scientists, sociologists, and cultural studies researchers, and it can also be used for market studies or technology impact assessment.
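A re-creation run of this kind could be sketched as follows. This is a minimal illustration rather than the pipeline used for the corpus: the function `rebuild_corpus` is hypothetical, and the defaults assume a recent Trafilatura release in which `fetch_url` and `extract` (with `output_format` and `include_comments`) are available. The download and extraction callables can be replaced, which allows a dry run without network access.

```python
def rebuild_corpus(urls, fetch=None, extract=None):
    """Download and extract each URL; return {url: xml_string} for successes."""
    if fetch is None or extract is None:
        # Imported lazily so the function can be exercised with stand-ins.
        import trafilatura

        fetch = fetch or trafilatura.fetch_url
        extract = extract or (
            lambda html: trafilatura.extract(
                html, output_format="xml", include_comments=True
            )
        )
    corpus = {}
    for url in urls:
        html = fetch(url)
        if not html:  # page deleted or unreachable
            continue
        document = extract(html)
        if document:  # skip pages without usable text
            corpus[url] = document
    return corpus


# Dry run with stand-in callables and hypothetical URLs (no network access).
fake_pages = {"https://blog.example.org/a": "<html><body><p>Text</p></body></html>"}
result = rebuild_corpus(
    ["https://blog.example.org/a", "https://blog.example.org/gone"],
    fetch=fake_pages.get,
    extract=lambda html: "<doc>Text</doc>",
)
print(sorted(result))
```

In an actual run, the URL list would come from the published sources list, for example via the sitemaps of each site, and the stand-in callables would be omitted.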

In our work, the data is used to analyse the discourse about Internet Policy questions in Germany and the United States. IT-blogs represent an expert discourse regarding questions at the intersection of technology and society and may have a considerable impact on the discussion of these matters in traditional media (e.g., newspapers), and thereby on the conversation in broader swaths of the population (). In a pilot study, we compare the discussion about a particular German anti-hate speech law, the Netzwerkdurchsetzungsgesetz (NetzDG), on German IT-blogs with the discourse that is simultaneously taking place in the most important German newspapers. Based on this setup, we can draw conclusions about the impact of IT-blogs and websites on more traditional print media.

Apart from analyses based on themes and specified search terms, users can also apply more corpus-driven text data mining techniques and examine, for example, the contents of complete blogs or groups of blogs and websites in order to determine which topics prevail in these posts over time. Furthermore, they can inspect reference networks by exploring links between specific blogs and websites and thereby study communication practices within the IT-blog sphere.
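The reference networks mentioned above can be approximated by counting hyperlinks between hosts. The sketch below is one possible approach using only the Python standard library; it assumes that the HTML of each page (or at least its links) is available, and all names and sample pages are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> elements."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def host_of(url):
    """Lower-cased hostname without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def link_edges(pages):
    """Count (source_host, target_host) links over (url, html) pairs, ignoring self-links."""
    edges = Counter()
    for url, html in pages:
        collector = LinkCollector()
        collector.feed(html)
        source = host_of(url)
        for href in collector.hrefs:
            # Resolve relative links against the page URL before comparing hosts.
            target = host_of(urljoin(url, href))
            if target and target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical sample pages.
sample_pages = [
    ("https://www.blog-a.example/post",
     '<p><a href="https://blog-b.example/x">b</a> <a href="/about">self</a></p>'),
    ("https://blog-b.example/x",
     '<a href="https://www.blog-a.example/post">a</a>'
     '<a href="https://blog-a.example/other">a2</a>'),
]
edges = link_edges(sample_pages)
print(edges.most_common())
```

The resulting weighted edge list can be passed to any graph library for further analysis.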

The corpus needs to be re-created from the sources list to avoid potential copyright concerns, as some of the blog posts and website articles contained within the corpus are copyright protected. However, this does not hinder the application of text data mining techniques to the data, as long as the results are either highly aggregated and do not allow for the reconstruction of the original texts, or only provide snippets of the texts in question.

Note, however, that this stipulation may generate difficulties when it comes to sharing the underlying dataset of a study with peer-reviewers and the research community at large. The dataset can be freely reproduced by using Trafilatura to create a specific subcorpus, but its composition may change after the initial corpus has been produced if web pages or whole websites are deleted. Recovery can be automatically attempted from the Internet Archive (https://archive.org). Even so, reviewers or anyone who wants to rebuild the initial corpus may end up with a slightly different text base, especially if there is a substantial amount of time between the creation of the respective corpora. Consequently, the verifiability of individual research results may be limited, particularly if the research in question mainly draws on distant reading practices and statistical analysis of large amounts of text and metadata. For projects that follow a “blended reading” () approach and integrate elements of close reading, these limitations may be less troubling. A comprehensive discussion of this accessibility issue exceeds the scope of the short data paper format.
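As a rough sketch of how recovery from the Internet Archive could be attempted, the following code builds a request for the Wayback Machine availability endpoint and reads the closest snapshot from a response. The function names are hypothetical, the default timestamp is an assumption that simply reuses the corpus creation date, and the sample response only mimics the JSON layout of that service; no network request is made here.

```python
import json
from urllib.parse import urlencode


def availability_query(url, timestamp="20190903"):
    """Build a Wayback Machine availability request for the snapshot closest to timestamp."""
    return "https://archive.org/wayback/available?" + urlencode(
        {"url": url, "timestamp": timestamp}
    )


def closest_snapshot(response_text):
    """Return the archived URL from an availability response, or None."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None


# Hypothetical response, for illustration only.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20190903000000/https://blog.example.org/post-1",
            "timestamp": "20190903000000",
            "status": "200",
        }
    }
})

print(availability_query("https://blog.example.org/post-1"))
print(closest_snapshot(sample_response))
```

A snapshot URL obtained in this way can then be downloaded and processed like any other page, with the caveat that the archived version may differ from the page as originally collected.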

Journal of Open Humanities Data

Data Papers