The Cybernetics Thought Collective: Machine-Generated Data Using Computational Methods

) are held by the British Library, the American Philosophical Society, the University of Illinois at Urbana-Champaign, and MIT, respectively. The data were created for “The Cybernetics Thought Collective: A History of Science and Technology Portal Project” (2017–2019), a pilot project funded by the National Endowment for the Humanities (NEH). Using computational methods and tools (machine learning, named entity recognition, and natural language processing) on digitized archival records, the data were generated to enhance archival access in three distinct but interrelated ways: as archival metadata for the digitized records, as reusable data to facilitate digital scholarly analyses, and as the basis for a series of test visualizations. The data represent entities associated with cybernetic concepts and the main actors attached to the cybernetics movement and the exchange of its ideas. The dataset is stored along with the digitized records in the University of Illinois (U of I) Library’s multi-tiered repository, a replicated preservation service based on PREMIS (Preservation Metadata: Implementation Strategies). Reuse potential for this dataset includes historical/archival, linguistic, and artistic analyses of the data to examine connections between the cybernetic entities.

McCulloch, Heinz von Foerster, and Norbert Wiener for digitization. In total, 61,067 pages of archival records were digitized, resulting in 615 digital objects (which represent folder-level or multi-page item-level aggregations of digitized records). The project created PDFs for archival access purposes as well as high-resolution preservation TIFF files. The former were processed by optical character recognition (OCR) software to make the records machine-readable. Some materials are handwritten; these were transcribed as time allowed.

Normalization and Input Creation
PDFMiner [20] was used to extract text from the OCR-ed records into plaintext files. Before testing entity extraction, natural language processing, and machine learning software, text remediation and normalization were needed both to address OCR errors and to translate some of the fonds' Italian, Spanish, French, and German texts into English. Translation was completed with the aid of N-grams and Googletrans [17], while Wolfram Text Analysis tools [35] were used to remove stopwords.
Concurrent with this step, the project team created inputs, or a cybernetics vocabulary. The project sought to specifically identify and extract cybernetic entities; fortunately, cybernetics has a distinct set of core concepts related to behavior, self-organization, and feedback mechanisms, from which a vocabulary could be derived [2,25]. Identifying this vocabulary was especially important for connecting concepts and agents to each other in the cybernetics network. The project team used Cybernetics of Cybernetics: Or, the Control of Control and the Communication of Communication as a source for generating a cybernetics vocabulary [32]. Cybernetics of Cybernetics is a compilation that prominently features Ashby, McCulloch, von Foerster, Wiener, and key cybernetic ideas at the time that they were active in the transdiscipline. A digital version of the text was run through Voyant Tools [28] to generate a list of keywords based on frequency. This list was narrowed to include the most frequently occurring terms (about 200 total). Members of the project's advisory board (who comprised technologists and subject-experts in cybernetics) reviewed this list and offered additional suggestions.
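The frequency-based keyword step can be illustrated with a minimal sketch. The project used Voyant Tools for this; the plain-Python stand-in below only shows the general idea of ranking terms by frequency and truncating the list. The stopword set, the sample text, and the function name `top_terms` are hypothetical and not taken from the project.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "or"}

def top_terms(text, limit=200):
    """Return the `limit` most frequent non-stopword terms in `text`."""
    # Lowercase and keep alphabetic tokens, allowing internal hyphens.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(limit)

# Invented sample text, standing in for the digitized source volume.
sample = (
    "Feedback is the control of control. "
    "Self-organization and feedback shape the behavior of a system, "
    "and feedback is communication."
)
vocabulary = top_terms(sample, limit=3)
print(vocabulary)
```

A list produced this way is only a starting point; as described above, the project's candidate terms were reviewed and extended by subject experts.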

Entity Extraction, Natural Language Processing, and Classification
Using this cybernetic vocabulary as inputs, one of the project's programmers experimented with a number of Python libraries for natural language processing and named entity extraction (e.g., NLTK [6] and spaCy [24]) and the University of Illinois Cognitive Computation Group's NLP pipeline software [7]. Following entity identification and extraction, the project team decided to adopt a supervised machine learning approach to classify the records into four broad categories: Mathematics/Logic, Computers/Machines, Psychology/Neuroscience, and Personal. Naïve Bayes [8] and Weka [14] were used for the machine learning portion of the project. Percentages of certainty for the classifications were also generated through this process. Additional testing was performed with sentiment analysis using NLTK and VADER [19].
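To make the classification step more concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier written in plain Python. It is not the project's Weka configuration. It assumes, for illustration only, that a "percentage of certainty" corresponds to the normalized posterior probability of each category, and the four training documents are invented.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (tokens, category). Returns the counts the model needs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, category in labeled_docs:
        class_docs[category] += 1
        term_counts[category].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def classify(tokens, model):
    """Return (best category, {category: posterior probability})."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    log_scores = {}
    for category, n_docs in class_docs.items():
        total_terms = sum(term_counts[category].values())
        score = math.log(n_docs / total_docs)  # class prior
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log(
                    (term_counts[category][t] + 1) / (total_terms + len(vocab))
                )
        log_scores[category] = score
    # Normalize log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    exp = {c: math.exp(s - peak) for c, s in log_scores.items()}
    norm = sum(exp.values())
    posteriors = {c: v / norm for c, v in exp.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Invented toy training set using the paper's four categories.
model = train([
    (["neuron", "brain", "perception"], "Psychology/Neuroscience"),
    (["theorem", "logic", "proof"], "Mathematics/Logic"),
    (["computer", "machine", "circuit"], "Computers/Machines"),
    (["family", "visit", "holiday"], "Personal"),
])
best, certainty = classify(["logic", "proof", "machine"], model)
print(best, {c: round(100 * p, 1) for c, p in certainty.items()})
```

With so few training documents the posteriors are coarse, which mirrors the paper's later observation that a larger training set would be needed for more reliable results.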

Text Processing and Remediation Pipeline
After testing various software, a Python-based pipeline was developed (see Figure 1). Following the pipeline, the project team imported text from the PDF files into plaintext; normalized the files; removed files that contained a significant amount of noise that could not be easily remediated with existing tools in the allotted timeframe; identified the language of the documents and translated into English where necessary; extracted entities; classified the documents into categories; and estimated the percentage of certainty for each category per document. The pipeline is documented in the project's GitHub Repository [30]. All of the data resulting from the entity extraction, sentiment analysis, and machine learning steps were imported into a CSV file that was made available for research use and used as metadata for the digital collection. A more detailed overview of the methodology and the steps employed is delineated in the project's white paper [4]. It is important to note that the creation of this pipeline was not a linear process and involved retesting tools and revisiting several steps.
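The overall shape of such a pipeline can be sketched as a chain of small functions that ends in a CSV file. The sketch below is a simplified illustration, not the code in the project's repository: the function names, the noise threshold, the substring-based entity matching, and the CSV column names are all assumptions, and the language identification, translation, and classification stages are omitted for brevity.

```python
import csv
import io

def normalize(text):
    """Collapse runs of whitespace, including line breaks, into single spaces."""
    return " ".join(text.split())

def is_too_noisy(text, max_non_alpha=0.5):
    """Flag text in which more than half of the non-space characters are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    non_alpha = sum(1 for c in chars if not c.isalpha())
    return non_alpha / len(chars) > max_non_alpha

def extract_entities(text, vocabulary):
    """Return the vocabulary terms that occur in the text (simple substring match)."""
    lowered = text.lower()
    return [term for term in vocabulary if term in lowered]

def run_pipeline(documents, vocabulary):
    """documents: {file name: raw text}. Returns the resulting CSV as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["file", "entities"])  # hypothetical column names
    for name, raw in documents.items():
        text = normalize(raw)
        if is_too_noisy(text):
            continue  # drop files that cannot be easily remediated
        writer.writerow([name, "; ".join(extract_entities(text, vocabulary))])
    return buffer.getvalue()

# Invented inputs: one readable document and one that is mostly OCR noise.
docs = {
    "a.txt": "Feedback   and\nhomeostasis in the nervous system",
    "b.txt": "#### 1234 %%%% ~~",
}
csv_text = run_pipeline(docs, ["feedback", "homeostasis", "entropy"])
print(csv_text)
```

Keeping each stage as a separate function makes it easier to retest or swap a single step, which fits the non-linear development process described above.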

Preservation and Access
The PDFs of the digital surrogates and files containing the inputs, classification data, percentages of certainty for those classifications, and extracted entities were ingested into the University of Illinois Library's digital repository service for preservation and access. The digital repository (known as Medusa [31]) is a replicated multi-tiered Fedora-based repository that uses PREMIS [26]. The classification data, percentages of certainty, and entities also populate the metadata application profile for each PDF in the repository's access interface. The data were made available as a dataset via a CSV file for users to download, along with a CSV file containing the original inputs and a readme file that provides additional information about the data and the process that created them [3]. The dataset includes file-level metadata, some of which is human created (e.g., level of description and title), and some of which is collection-level metadata that applies to all digital objects in the same fonds (e.g., scope and contents, parent collection, collection identifier), and provides original archival context for the machine-generated data (e.g., machine-extracted feature, cybernetic classification, certainty). These fields are described more fully in the data dictionary in the readme file. A selection of the data was also used to create test visualizations, which are available on the project site.

Sampling strategy
This pilot project aimed to produce a proof-of-concept machine learning, named entity recognition, and natural language processing pipeline for meta/data generation and classification of archival records; through this process, a representative sample of documents that illustrate prominent cybernetic concepts and consist of letters between von Foerster, Ashby, McCulloch, Wiener, and other known cyberneticians was selected from across the four fonds. However, statistical sampling techniques were employed at various stages of the natural language processing and entity-extraction workflow. For example, to translate texts into English, a test set of approximately 200 documents in English, German, French, and Italian was created in order to employ an N-gram approach to language identification. The Python library Googletrans was then used to translate the texts into English. Additionally, a training set of 154 documents from all fonds was manually annotated and prepared for the supervised classification model.
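An N-gram approach to language identification can be sketched as follows. This is one common variant (overlap of character trigrams), shown under stated assumptions; the paper does not specify which N-gram method or parameters the project used. The one-sentence language profiles are invented and far smaller than the project's test set of approximately 200 documents.

```python
def trigrams(text):
    """Return the set of character trigrams in lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + 3] for i in range(len(text) - 2)}

def identify_language(text, profiles):
    """Score each language by the number of trigrams it shares with the text."""
    query = trigrams(text)
    scores = {lang: len(query & grams) for lang, grams in profiles.items()}
    return max(scores, key=scores.get), scores

# Hypothetical one-sentence seed texts; real profiles would be built from many documents.
profiles = {
    "english": trigrams("the theory of control and communication in the animal and the machine"),
    "german": trigrams("die theorie der steuerung und kommunikation im tier und in der maschine"),
    "french": trigrams("la théorie du contrôle et de la communication chez l'animal et la machine"),
    "italian": trigrams("la teoria del controllo e della comunicazione nell'animale e nella macchina"),
}

language, scores = identify_language("the behavior of the machine", profiles)
print(language, scores)
```

Documents identified as non-English would then be passed to a translation step; that step is left out here because it depends on an external service.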

Quality Control
The majority of the records from the four fonds are typewritten; these records were processed with OCR software, and, as time allowed, handwritten documents were transcribed. The texts also required "normalization" in order to be machine-ready. After extracting the text from the OCR-ed records to import into plaintext files, character errors that resulted from OCR were remediated (e.g., extra spaces between letters in a word, or alpha-numeric characters that were misread as non-ASCII characters).
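The two kinds of OCR error mentioned above can be illustrated with a small heuristic cleanup function. This is a sketch, not the project's remediation code: the regular expression, the substitution table, and the function name `remediate` are assumptions, and a real table of misread characters would have to be built by inspecting the corpus.

```python
import re

# Three or more single letters separated by single spaces, e.g. "c y b e r".
SPACED_WORD = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

# Hypothetical examples of non-ASCII characters produced by OCR misreads.
SUBSTITUTIONS = {
    "\u00a2": "c",    # cent sign read in place of "c"
    "\ufb01": "fi",   # "fi" ligature
    "\u2019": "'",    # curly apostrophe
}

def remediate(text):
    """Apply character substitutions, rejoin spaced-out words, and tidy spacing."""
    for wrong, right in SUBSTITUTIONS.items():
        text = text.replace(wrong, right)
    text = SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)
    return re.sub(r"[ \t]+", " ", text).strip()

sample = "The  c y b e r n e t i c s  of \u00a2ontrol"
print(remediate(sample))
```

Because the spaced-letter rule is a heuristic, it could also merge a run of genuine single-letter tokens (for example, spaced-out variable names in a formula), so its output would still need spot checks.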
Statistical analysis was performed on the extracted entities to identify which entities surface most frequently in the corpus, as a means of determining which entities appear most significant. We tested this through N-grams and term frequency-inverse document frequency (TF-IDF) to determine the frequency of an entity in each document and thus its importance throughout the entire corpus. Using TF-IDF in an archival context has precedent ([10], pp. 109-110), so we hoped that it would have utility for the project. The team felt this would be useful for comparison against the original cybernetic inputs. However, despite removing "noise" such as stopwords (i.e., commonly used words like "the," "of," or "but"), TF-IDF proved not as reliable as an N-gram approach for determining entity relevancy within the corpus. For TF-IDF to produce more useful results, document length would need to be normalized. Given the overall nonuniformity of archival records in this particular corpus (and in archival fonds in general), it is difficult to normalize records for length.
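For readers unfamiliar with TF-IDF, the following is a minimal sketch of one standard formulation, in which term frequency is divided by document length and inverse document frequency is the natural logarithm of the number of documents over the number of documents containing the term. The paper does not state which variant the project tested, so the weighting scheme and the toy documents here are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            # Length-normalized term frequency times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented toy corpus of three short "documents".
corpus = [
    ["feedback", "feedback", "system"],
    ["system", "memo"],
    ["system", "neuron", "feedback", "neuron"],
]
weights = tf_idf(corpus)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this toy corpus, "system" appears in every document and so receives a weight of zero, while the rare "memo" scores highly in a two-word document. That sensitivity to very short or very long documents is the kind of length effect the paragraph above describes.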
To assess the accuracy of the machine learning results, the project team used Weka to perform a chi-squared analysis to help us better understand the accuracy of the training set in the classification process. The results revealed 71.1% "true positives" and 4% "false positives," indicating that the majority of the entities were useful in informing which documents were classified into specific categories. However, a manual analysis revealed more false positives (i.e., a few inaccurately classified documents). This assessment enabled the project team to perform a degree of quality control on the dataset and understand how we might improve the machine learning results in the future, especially by creating a larger training set.
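True positive and false positive rates of the kind reported above can be computed per category from paired lists of actual and predicted labels. The sketch below shows the standard definitions only; it does not reproduce the Weka analysis, and the labels are invented rather than drawn from the project's training set.

```python
def rates(actual, predicted, category):
    """Return (true positive rate, false positive rate) for one category."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == category and p == category)
    fn = sum(1 for a, p in pairs if a == category and p != category)
    fp = sum(1 for a, p in pairs if a != category and p == category)
    tn = sum(1 for a, p in pairs if a != category and p != category)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return tp_rate, fp_rate

# Invented labels: manual annotations versus classifier output.
actual = ["Personal", "Personal", "Mathematics/Logic",
          "Computers/Machines", "Mathematics/Logic"]
predicted = ["Personal", "Mathematics/Logic", "Mathematics/Logic",
             "Personal", "Mathematics/Logic"]

tp_rate, fp_rate = rates(actual, predicted, "Personal")
print(f"Personal: TP rate {tp_rate:.1%}, FP rate {fp_rate:.1%}")
```

Comparing these rates with a manual review of individual documents, as the project team did, helps reveal misclassifications that aggregate figures can hide.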

Assessment
As a proof of concept, the Cybernetics Thought Collective project opened up the possibility of applying computational methods to archival records. But it also opened up questions about how to develop and streamline computational workflows in an archival setting, how best to document those workflows to facilitate data reuse and reproducibility, and how to provide transparency so that users can understand the "computational provenance" of the results.
While the results of the project did reveal connections between documents across the four fonds through the extracted entities, the machine learning results indicated a need for additional refinement. For example, some of the documents which almost exclusively consisted of discussions of a technical nature were classified as "Personal." Thus, we will need larger training sets in tandem with better quality control mechanisms to produce more reliable results. Participants in computational archival projects need to be able to anticipate the labor necessary for creating viable inputs and training sets and for verifying the trustworthiness of the results.
Computational archival projects require close collaboration between archivists, programmers, data curators, and digital preservationists, who each provide vital input and expertise at different decision points. Likewise, in the future more engagement with potential users will be vital for determining the utility of the results and their implications for archival research, which should also inform the creation and refining of processes that generate these datasets.
The project raised questions about the relationship between machine-generated archival datasets and the original archival records, especially how that relationship is represented in both archival systems and visualization interfaces so that the original "archival provenance" of the data, and of the materials from which they are derived, is clearly described and decontextualization is prevented. Digital records, and the data generated from them, can provide greater context and enhance access to each other. Therefore, it is important to find ways to make them mutually discoverable in archival access systems.

University of Illinois Digital Collections repository
Publication date 2020-01-24

(4) Reuse potential
While cybernetics experienced a heyday that spanned the mid-late twentieth century, its philosophical influences are widespread. Indeed, vestiges of cybernetics continue to surface in modern computing, information theory, and cognitive science, as just a few examples [21,9]. Because of this intellectual omnipresence, the data have the potential to shed light on the etymology of concepts and disciplinary areas of specialization. For example, the data may be useful for contributing to discussions about artificial intelligence and its relationship to cybernetics [12,27]. However, these data do not provide insight into the evolution of the terms themselves or how the relationships between entities shifted and changed over time.
From a historical perspective, the data can be reused to reveal additional connections between cybernetic entities and the scientists who formed the cybernetics movement. Cybernetics continues to be of interest to historians and science and technology studies scholars (for example [21,1]). Since this was a pilot project that resulted (in part) in several test visualizations, the data are also available for bulk download to facilitate use in other digital scholarship projects. We hope that this opens the data up to new questions and explorations of the boundaries of the "thought collective," while also serving as a step toward meeting emergent research needs within a digital scholarship framework (for example [18]).
An important aspect of data reuse is providing sufficient contextual information to enable a variety of reuse(s). Because the dataset includes information about the original digital records from which the entities are generated, and thus the original fonds, this may lead to new pathways to the digitized records themselves that are in line with FAIR data reuse principles for archival materials [22]. At the same time, the relative success with which researchers are able to reuse the data and gain new insights can inform the project's future phases as it refines its software pipeline and methods for assessing quality control. It is hoped that this data paper provides additional information about the process that generated the data, so that others may test its reproducibility and assess the results.
It is worth noting that reuse should also logically extend to the digital records themselves; all digitized materials have been made machine-readable and are accessible through the University of Illinois' repository/digital collections portal. Users can download the OCR-ed records, process them through different software pipelines, and perform their own computational analyses. While the methods employed by this project sought to extract data from the records, the distinction between the reuse potential of the records themselves and that of the data generated from them is somewhat blurry, given that the data depend on the records to elucidate their context(s) and make them reusable [15]. It is thus important to emphasize that the digital records themselves are (re)usable. A future phase of this project will seek to engage researchers and the archival community in identifying additional reuse cases for both the data and the digitized records themselves, and investigate the possibility of interactive interfaces that open up explorations of records as data and user-driven reorderings of content [23,36].
The data also have potential reuse value in a visual culture space. Cybernetics (especially second-order cybernetics) invoked visual and art historical references to interrogate and illustrate many of its ideas. For example, to peruse the publications that emerged from the Biological Computer Laboratory (the center for cybernetics at the University of Illinois directed by Heinz von Foerster) is to become simultaneously immersed in scientific diagrams and esoteric imagery of ouroboros and art historical iconography (see, for example [33]). Cybernetics has inspired "cybernetic art" and explorations of media culture through a cybernetic lens [5,13]. Examples of cybernetic data either informing or becoming artistic works themselves also have precedent, indicating that such reuses are not unimaginable [16]. The data resulting from the project can contribute to cybernetic explorations at the intersection of art, technology, and new media.

Additional File
The additional file for this article can be found as follows:
• Readme for the Cybernetics Thought Collective Data. This readme file contains a brief description of the dataset, metadata fields, and the process of data creation. https://digital.library.illinois.edu/items/3cd33c50-8c95-0138-729a-02d0d7bfd6e4-8.