Normalized Datasets of Hahn’s and Zahn’s Reconstructions of Marcion’s Gospel

These four datasets are the first born-digital, normalized, lexicographically enriched, and peer-reviewed versions of the reconstructions of Marcion’s Gospel made by August Hahn in 1832 and Theodor Zahn in 1892. Two dataset files were generated for each reconstruction: the first consisting of human-readable Postclassical Greek; the second of lemmatized and morphologically tagged text following the openly licensed BibleWorks Greek Morphology schema. These datasets represent another batch in a series of normalized and enriched datasets of major reconstructions of Marcion’s Gospel published by JOHD.

Bilby Journal of Open Humanities Data DOI: 10.5334/johd.63

(1) OVERVIEW
Repository location: DOI: https://doi.org/10.7910/DVN/BYNHX6
Morphology key: Bilby, M.G. (2021a). Key to BibleWorks Greek Morphology (BGM) (v1.1). DOI: https://doi.org/10.5281/zenodo.4950243

Clear relationships exist among these reconstructions because of their shared dependence on common underlying data: over 700 patristic attestations to GMarc by over fifteen ancient witnesses, hundreds of textual variants and thousands of non-variants in manuscripts of Luke, and tens of thousands of close parallel words in other gospels. However, because of quite different methodological assumptions held and approaches taken by their respective editors, these reconstructions reflect widely varying results and reveal editorial idiosyncrasies as well as clear dependencies when cross-compared quantitatively.
Hahn's pioneering reconstruction (1832) represented the fruit of his extensive earlier study on Marcion and the texts distributed under Marcion's patronage (Hahn, 1823). As Roth notes, "Hahn's work was particularly important in that he provided the first attempt to present comprehensively Marcion's Gospel as reconstructed from the available sources" (2015: 9). While Hahn's maximalist, continuous approach to reconstruction was challenged by many later editors, he still provided a starting point for all future reconstructions. Zahn sought to correct and pare down Hahn's overly generous reconstruction, while making use of the Editio octava critica minor (EOCM) (Tischendorf, 1869) as a base text. Zahn's discontinuous approach became the precursor to more radically discontinuous and minimalist reconstructions in subsequent scholarship, namely those of Tsutsui (1992) and Roth (2015).
Given the central place of GMarc in heated debates about the compositional and editorial formation of the earliest gospels (both canonical and non-canonical), quantitative analyses, computational linguistics (CL), and historical corpus linguistics (HCL) have the potential to clarify and transcend the deep subjective and idiosyncratic divides within scholarship and may prove decisive in settling centuries-old questions about the earliest texts that arose out of the Jesus movement.
(2) METHOD STEPS
Our normalization of GMarc datasets began with the work of Adolf von Harnack (Harnack, 1921/1924) because of the public domain status of the corresponding print edition and its established place as the standard reconstruction of GMarc for most of the last century. Of all major GMarc reconstructions in history, Harnack's is the most ambiguous and inconsistent in its indications. The challenge of normalizing Harnack's reconstruction proved crucial for developing and implementing a few clear, consistent datatypes and indications that allow for meaningful comparisons by humans and machines of all major GMarc Greek datasets. Compared to Harnack's, Hahn's indications are relatively sparse and simple:

1) () parentheses for words and phrases to indicate they are necessary to the meaning of the surrounding words or that there was some doubt/confusion about them
2) [] square brackets for verses Hahn deemed were removed by Marcion
3) *) asterisk followed by right parenthesis to indicate a variant detailed in the footnotes
4) #) number followed by right parenthesis to indicate footnotes, which provide references to patristic citations of GMarc and/or Hahn's explanation of the possible inclusion or exclusion of this content from GMarc

To transform Hahn's reconstruction into normalized datasets, indication 1 is kept as is, verses corresponding to indication 2 are omitted, and indication 3 is rendered as empty square brackets. Supplemental version identifiers (here 01H for the first edition of GMarc by Hahn) are added to the beginning of each line to facilitate version identification and content alignment in computational linguistics environments as well as the creation of consistently sorted, interlinear arrangements of past editions of GMarc.
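The transformation rules for Hahn's indications can be sketched in Python as follows. This is only an illustrative sketch, not the workflow actually used for the datasets: the function name `normalize_hahn_verse`, the input form (a verse reference plus the raw verse text), the output line layout (`01H`, verse reference, text), and the sample Greek in the usage note are all assumptions. The text does not state how footnote markers (indication 4) are rendered, so dropping them here is likewise an assumption.

```python
import re

VERSION_ID = "01H"  # identifier given in the text for Hahn's first edition

# Indication 3: "*)" marks a variant detailed in the footnotes.
VARIANT_MARK = re.compile(r"\s*\*\)")
# Indication 4: "12)" style footnote markers (removal is an assumption).
FOOTNOTE_MARK = re.compile(r"\s*\d+\)")


def normalize_hahn_verse(ref, text):
    """Return one normalized line for a verse, or None if the verse is omitted.

    ref  -- hypothetical verse reference such as "3:1"
    text -- raw verse text carrying Hahn's printed indications
    """
    text = text.strip()
    # Indication 2: a verse wholly in square brackets is omitted.
    if text.startswith("[") and text.endswith("]"):
        return None
    # Indication 3: render the variant marker as empty square brackets.
    text = VARIANT_MARK.sub(" []", text)
    # Indication 4: drop footnote markers (assumption, see lead-in).
    text = FOOTNOTE_MARK.sub("", text)
    # Indication 1: parentheses are kept as they are; only tidy whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    return f"{VERSION_ID} {ref} {text}"
```

For example, with invented input, `normalize_hahn_verse("3:1", "ἐν ἔτει *) (δὲ) πεντεκαιδεκάτῳ 1)")` yields a line that starts with `01H 3:1`, keeps `(δὲ)`, and carries `[]` where the variant marker stood, while a verse passed in as `"[...]"` returns `None`.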
Zahn exercises more caution and nuance in his decisions and indications than did Hahn, resulting in a relatively lower, yet still unacceptably high, level of Lk2 contamination. Besides restoring clear wording for many verses, Zahn also uses these indications, several of which were later commonplace in Harnack's work:

To transform Zahn's reconstruction into normalized datasets, all clear wording is included, but all implicitly present content corresponding to numbers 1-3 is placed in [brackets]. For content corresponding to indication 4, whole verses are omitted but specific words within verses are retained, since these typically involve contextually required words. For indication 5, preceding words are placed in parentheses, empty parentheses render intervening words and whole verses, and empty square brackets render alternative wording. Uncertain content corresponding to indication 6 is replaced with empty parentheses, but alternative wording is rendered as empty square brackets. For indication 7, the alternative reading is signaled with empty square brackets. Content corresponding to indication 8 is omitted.

QUALITY CONTROL
As detailed in our Harnack data paper, the first dataset for each reconstruction consists of normalized, human-readable Greek, while the second manually applies lemmatization and morphological tagging using the BibleWorks Greek Morphology (BGM) schema, which is lightweight, adaptable, familiar to many scholars, openly licensed for non-commercial use, and easy to compile, edit, and query in word processor and CL environments. For quality control in the transcription of the respective Greek texts, we created interlinear parallels by verse for all GMarc editions together with canonical Luke, and made second and third passes to check each transcription against the corresponding print edition. Similarly for the lemmatizing and morphological tagging process, we sorted the editions by verse in an interlinear format and made regular use of close or exact parallel tagging already done for the canonical Gospels in BGM and the tagging we had previously done for our Harnack and Roth datasets. As a mediating step for this batch of two longer editions, we wrote and ran an R script that automated the lemmatizing and morphological tagging for about 25% of words, those that either individually or as syntagmata proved lexicographically and syntactically unambiguous. After that, we spent about 100 hours manually tagging the remainder of untagged words, looking them up in the Thesaurus Linguae Graecae whenever they varied from words in the canonical gospels or the Harnack and/or Roth reconstructions of GMarc. Finally, we ran granular, segmented crosschecks of word counts within and across datasets, confirming totals of 14442 and 10572 words for the Hahn and Zahn datasets respectively.
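The automated pre-tagging step can be illustrated with a minimal Python sketch. The authors report using an R script; the code below is not that script but a simplified stand-in that covers only individually unambiguous word forms, not syntagmata. The function names, the `(form, lemma, tag)` triple layout, and the tag strings used in the usage note are hypothetical placeholders and are not actual BGM codes.

```python
from collections import defaultdict


def build_unambiguous_lexicon(tagged_tokens):
    """Map each surface form to its (lemma, tag) if it has exactly one analysis.

    tagged_tokens -- iterable of (form, lemma, tag) triples taken from
    previously tagged texts (an assumed input layout).
    """
    analyses = defaultdict(set)
    for form, lemma, tag in tagged_tokens:
        analyses[form].add((lemma, tag))
    # Keep only forms that were always analysed the same way.
    return {form: next(iter(found))
            for form, found in analyses.items() if len(found) == 1}


def auto_tag(forms, lexicon):
    """Tag the forms found in the lexicon; leave the rest for manual work.

    Returns (form, lemma, tag) triples, with lemma and tag set to None
    for every form that still needs a human decision.
    """
    return [(form, *lexicon.get(form, (None, None))) for form in forms]


def coverage(tagged):
    """Fraction of tokens that received an automatic analysis."""
    if not tagged:
        return 0.0
    done = sum(1 for _, lemma, _ in tagged if lemma is not None)
    return done / len(tagged)
```

Given reference triples in which `καί` always carries one analysis while `τοῦ` appears with two different tags, `auto_tag` fills in `καί` and leaves `τοῦ` and any unseen form untagged; `coverage` then reports the share of tokens handled automatically, the kind of figure the text gives as about 25%.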
These UTF-8 encoded .txt files offer additional starting points for CL research on GMarc and are definitely not the final word. We welcome scholarly feedback and collaboration to correct, improve, and transform our datasets and to convert them to other schemata, especially TEI XML enriched with variants, tags, and notes placed in the markup to facilitate deeper analyses of the main text.

(4) REUSE POTENTIAL
These transformative supplements to the first two major scholarly reconstructions of GMarc, both now in the public domain, represent the third batch of open datasets of GMarc published in the Journal of Open Humanities Data. GMarc has suffered considerable degradation and disintegration as a result of its suppression over the last eighteen centuries, but CL, HCL, and open data science methods can restore it to much higher levels of fidelity than currently obtain. Data scientists and humanists alike are invited to use these datasets to identify, disambiguate, and clarify the vocal strata underlying GMarc alongside the other early canonical and non-canonical gospels, whether for the purpose of uncovering the earliest textual history of the Jesus movement or to explore how editorial voices and stages in ancient texts can be scientifically delineated and sequenced.

ADDITIONAL FILE
The additional file for this article can be found as follows:
• Dataset. These four UTF-8 encoded .txt files are the first born-digital, normalized, lexicographically enriched, and peer-reviewed versions of the reconstructions of Marcion's Gospel made by August Hahn in 1832 and Theodor Zahn in 1892. Two dataset files were generated for each reconstruction: the first consisting of human-readable Postclassical Greek; the second of lemmatized and morphologically tagged text following the openly licensed BibleWorks Greek Morphology schema. DOI: https://doi.org/10.5334/johd.63.s1