The Game Walkthrough Corpus (GWTC) – A Resource for the Analysis of Textual Game Descriptions

We present the Game Walkthrough Corpus (GWTC), which contains 12,295 unique walkthrough documents covering 6,117 games. For each game walkthrough, we provide frequencies of unigrams and bigrams, treating the walkthrough document as a Bag of Words. In addition, we provide word frequencies at the sentence level. Furthermore, the GWTC contains a number of game-related metadata, including title, publisher, developer, year, and genre. All the language statistics and metadata are stored in separate plain text files and can be referenced through uniform resource names (URN). These URNs can also be used to derive any combination of statistics and metadata. Researchers, for instance, can investigate the most frequent unigrams for games in the “Adventure” genre. This way, the GWTC can be reused for different kinds of research questions on gaming language.


CONTEXT
The academic interest in studying games as cultural phenomena in their own right has reached a certain level of maturity. This maturity is reflected, among other things, in the existence of a number of dedicated organizations, such as DiGRA (Digital Games Research Association), and a large number of relevant publications in specific journals, such as Game Studies. The study of games has attracted a wide range of subject areas, including media studies, cultural studies, psychology, computer science, and many more (see Ensslin, 2012). We argue that Digital Humanities (DH) can offer yet another perspective to studying games. In DH, Moretti's (2000) notion of distant reading has become a central concept and metaphor for all kinds of computational and empirical approaches to text analysis. In the case of video games, however, the question arises as to how such a highly interactive and dynamic medium can be formalized and modeled in a way that allows it to be analyzed computationally. After all, game experiences are highly individual and explicitly quantifiable features are hardly available. Thus, to enable a quantitative research perspective on games, we propose focusing on their textual dimension: their language. According to Ensslin (2012, p. 6), the language of gaming branches into two major aspects: "the ways in which videogames and their makers convey meanings to their audience, and the ways in which gamers and other stakeholders communicate and negotiate meanings between themselves". To study the various levels of discourse evident in the language of gaming, Ensslin created a small-scale corpus of different texts about videogames. Her GameCorp comprises 184 texts from videogame magazines, gamer fora and chats, and transcribed live conversations during gameplay.
In this data paper, we present a large corpus of game walkthroughs, which are textual guides for all kinds of video games that include instructions and tips that walk players through a game, so that they can complete it successfully. Walkthroughs have a specific type of gaming language that might be categorized as a mixture of languages. First, we find language as it appears in the actual games, in terms of game mechanics as well as references to the plot and its characters. Second, walkthroughs contain gaming jargon used by actual gamers, as walkthroughs are mostly written by players of the actual game (see also Krause, 2016). Walkthroughs have been shown to be a suitable document type for purposes of digital game preservation (Newman, 2011; Nylund, 2015). Newman (2011, p. 111) summarizes the strengths of game walkthroughs in the following way: "[…] player-produced walkthroughs […] are some of the most comprehensive investigations of digital gameplay that presently exist; certainly more thorough, investigative and inventive than any professional or academic literature; […] walkthrough texts might be better able to capture and communicate the important qualities of games, as defined and understood by their players, than the playable games themselves".
While walkthroughs have been used successfully to support the study of specific games, such as Zelda 64 (Consalvo, 2013), they are not equally suited for all game genres. Games that involve a certain degree of creativity or that may not have a specified winning goal, such as Minecraft, are examples of the insufficiency of walkthrough descriptions. At the same time, complex, open gaming experiences, like Grand Theft Auto or current Assassin's Creed games, and Grand Strategy games, like the Europa Universalis or Total War franchises, may also be rather limited in their walkthrough descriptions compared to the actual gameplay. Despite these limitations for certain genres, we agree with Newman (2011) and Nylund (2015) and believe that, in general, walkthroughs are a valuable form of textual game preservation that enables large-scale corpus analyses. From a quantitative perspective, walkthroughs are particularly interesting types of gaming text, as they are widely available on the Internet. Interestingly, besides one example from linguistics, where a custom German-language walkthrough corpus has been used to study imperative language (Krause, 2016), hardly any existing studies so far have utilized this type of text to study games in a quantitative fashion. With the Game Walkthrough Corpus (GWTC), we hope to promote more research in this direction in the future.

Burghardt and Tiepmar Journal of Open Humanities Data DOI: 10.5334/johd.34

(2) METHOD
This section summarizes the main steps that were involved in creating the GWTC. It also provides an overview of the main contents of the dataset.

STEPS
i) Data sources and languages - The GWTC is designed to be continuously expanded. The ultimate goal is a multilingual corpus of many different game walkthroughs. Although this first release of the GWTC focuses on English-language walkthroughs, a multilingual perspective has already been considered in the data structure by including some German-language walkthroughs. For the current corpus of 12,295 unique walkthrough documents for a total of 6,117 games, we collected English and German language walkthroughs from the following platforms:
• Neoseeker: 8,729 documents were collected from Neoseeker (https://www.neoseeker.com/). The platform includes mostly mainstream titles like the Grand Theft Auto or Assassin's Creed series, but also has a small selection of niche titles such as "Hitomi - My Stepsister". The language of the documents is English.
• Jayisgames: 2,220 documents were collected from Jayisgames (https://jayisgames.com/). This platform is focused on puzzle games. The language of the documents is English.
• Gamesetter: 799 walkthroughs were collected from Gamesetter (http://gamesetter.com/). This platform also has a focus on puzzle games. The language of the documents is German.
• Portforward: 318 documents were collected from Portforward (https://portforward.com/games/walkthroughs/). This platform is actually focused on helping with computer network-related problems but also has a side project that provides walkthrough documents for a number of popular games. The language of the documents is English.
• Spieletipps: 229 documents were collected from Spieletipps (https://www.spieletipps.de/). The games on this platform can be considered mainstream content. The language of the documents is German.
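The platform counts listed above can be tallied to see how the corpus splits by language. The following is a minimal sketch that uses only the figures reported in the list; the variable names are illustrative and not part of the dataset.

```python
# Document counts and languages per platform, as reported in the list above.
sources = {
    "Neoseeker": (8729, "English"),
    "Jayisgames": (2220, "English"),
    "Gamesetter": (799, "German"),
    "Portforward": (318, "English"),
    "Spieletipps": (229, "German"),
}

# Sum the documents per language.
per_language = {}
for count, language in sources.values():
    per_language[language] = per_language.get(language, 0) + count

total = sum(per_language.values())

print(per_language)  # {'English': 11267, 'German': 1028}
print(total)         # 12295
```

The total matches the 12,295 documents stated for the corpus, of which roughly 8% are German-language walkthroughs.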
ii) Data processing - All HTML walkthrough documents were collected from the different platforms using individually implemented Scrapy (https://scrapy.org/) crawlers. The text content was extracted and converted into a generic uniform hierarchical TEI/XML markup. In a pre-processing step, we converted each text to lowercase and removed every character that is neither a regex word character nor a whitespace character (i.e., everything matching [^\w\s]). For the sentence collocations, all punctuation was normalized to full stops, meaning that, for example, subordinate clauses are treated as sentences to break up larger sentences. All characters were further filtered, based on a lowercase whitelist of English and German letters and the full stop, to avoid encoding problems. We did not remove any stop words and also did not perform any lemmatization or stemming. As for potential paratextual elements (e.g., introductions, general information on the game, etc.), we kept all of those in the walkthrough documents and only removed HTML-related structural elements (e.g., navigation headers) from the documents. The normalized TEI/XML files were then used to build a Canonical Text Service (CTS), which is typically used as a citation framework in classical studies (Smith, 2009; Tiepmar, Teichmann, Heyer, Berti, & Crane, 2014). The purpose of the CTS is to have persistent URNs for each game and its structural text elements. The following is an example URN for the game "Zak McKracken and the Alien Mindbenders":
• urn:cts:gwtc:zak_mckracken_and_the_alien_mindbenders:
iii) Metadata - Next, we added various metadata to the walkthrough documents, which we gathered from RAWG (https://rawg.io/) and Steam (https://store.steampowered.com/). For both platforms, it can be assumed that most of the metadata is subject to systematic editing. All metadata was collected using Python API packages for Steam (https://pypi.org/project/steamfront/) and RAWG (https://rawgpy.readthedocs.io/). A slight bias toward PC games is expected, as Steam (unlike RAWG) does not include console games. While this should not impact multi-platform titles that were also published on PC, metadata for console-only games may be underrepresented.
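The pre-processing described in step ii) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact whitelist (here a-z plus ä, ö, ü, ß), the order of the filtering steps, the handling of whitespace, and all function names are hypothetical. Read literally, normalizing "all punctuation" to full stops also splits at apostrophes and commas, which is how the sketch behaves.

```python
import re
import string
from collections import Counter

# Assumption: the paper only says "English and German letters"; the concrete
# character set below is a guess for illustration.
LETTERS = set(string.ascii_lowercase + "äöüß")


def normalize_document(text):
    # Lowercase, then remove everything that matches [^\w\s].
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Whitelist filter: keep letters and whitespace, drop digits, "_", etc.
    text = "".join(ch for ch in text if ch in LETTERS or ch.isspace())
    return text.split()


def split_sentences(text):
    # Sentence-level variant: every punctuation mark becomes a full stop.
    text = re.sub(r"[^\w\s]", ".", text.lower())
    text = "".join(
        ch for ch in text if ch in LETTERS or ch == "." or ch.isspace()
    )
    # Split at full stops and discard empty segments.
    sentences = (segment.split() for segment in text.split("."))
    return [tokens for tokens in sentences if tokens]


def ngram_counts(tokens):
    # Bag-of-words statistics: unigram and adjacent-bigram frequencies.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams
```

For example, `normalize_document("Go LEFT, then open the door!")` yields the six lowercase tokens of that sentence, while `split_sentences` on the same input returns two token lists, because the comma is treated as a sentence boundary. No stop words are removed and no stemming is applied, in line with the description above.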
The following metadata are available for the games, with a varying degree of coverage [1]:
• game title [2]
• short description (booklet text)
• gameplay tags

Aside from the game titles and short descriptions, metadata often provide multiple entries per game. For example, the gameplay tags for Max Payne 2 are singleplayer, destruction, drama, physics, romance, story, character, police and fall. Gameplay tags seem to cover a wide range of topics, from platform-specific information to ludic and narrative categorization that may overlap with game genres. While it may seem strange to have multiple release dates, developers, and publishers, this is to be expected because of repeated publications and later ports to additional game platforms that are often realized by different teams.

[1] Adding more metadata from other sources is a desideratum for future releases of the dataset.
[2] Note: Some videogames are published multiple times, as they may be the subject of patches or updates. Games can even be completely remade and published under the same name (e.g., "Tomb Raider") or differ from platform to platform, because of technical differences (e.g., the PlayStation and Nintendo Wii editions of "Resident Evil 4"). As a mapping of such parallel releases cannot easily be achieved automatically, we kept the original titles as they were provided by the authors.

iv) Text statistics - As individual copyrights protect game walkthroughs, this data set does not include the full-text documents. It rather provides various data formats that are useful for text mining and distant reading approaches while not allowing for the reconstruction of the full texts. To enable researchers to look up the original full text of specific walkthroughs, we provide the source URLs for any walkthrough document as part of the dataset. The following frequency information is available in the GWTC: