(1) Overview

Repository location: DOI: https://doi.org/10.7910/DVN/BYPOOR

Morphology key: Bilby, M. G. (). Key to BibleWorks Greek Morphology (v1.1). DOI: https://doi.org/10.5281/zenodo.4950243

Print source: Roth, D. T. (2015). The Text of Marcion’s Gospel. Leiden: Brill. DOI: https://doi.org/10.1163/9789004282377


Here we briefly recap the history of scholarly reconstructions of the Gospel of Marcion (GMarc) as outlined more thoroughly in our Harnack data paper (). Eight major published reconstructions have appeared over the last 200 years: Hahn (), Zahn (), Harnack (1921, ), Tsutsui (), BeDuhn (), Roth (), Klinghardt (, ), and Nicolotti (). Among recent reconstructions, Roth’s has gained the most acceptance among scholars, as seen in reviews by Lieu (), Guignard (), Hixson (), Gathercole (), and Holmes (). Even highly critical reviews, such as BeDuhn’s (), find much to praise in Roth’s rigorous caution, which has lent the work a greater sense of reliability than other recent reconstructions. By comparison, the ambitious restoration by Klinghardt has been extensively critiqued in the translation by Gramaglia () and in critical reviews by Bauer (), Schmid (), and Roth (). The recent reconstruction of Nicolotti (), though clearly influenced by Klinghardt’s work, restores far less content. For a recent, pragmatic example of the scholarly preference for Roth’s work, see Smith (), who uses it as the sole basis for a statistical analysis of GMarc.

The eight major scholarly reconstructions contain a plethora of interrelated yet varied data for the computational linguistic analysis of GMarc. These reconstructions are derived from much larger and more varied data: over 700 patristic attestations to GMarc by over fifteen ancient witnesses, hundreds of textual variants and thousands of non-variants in manuscripts of Luke, and tens of thousands of close parallel words in other gospels. Owing to highly divergent a priori assumptions and methods, the reconstructions vary widely in their respective use of these underlying data. This is evident in the total word counts of our normalized datasets: 14,442 for Hahn, 10,572 for Zahn, 4,338 for Harnack, 3,296 for Tsutsui (using a Latin to Greek conversion ratio of 1.082 based on the Vulgate having 18,003 words compared to 19,482 Greek words in canonical Luke), 7,419 for BeDuhn (using an English to Greek conversion ratio of 0.77 words based on a sampling of BeDuhn’s translation habits in comparison with the New Revised Standard Version’s translators), 4,169 for Roth, 12,850 for Klinghardt, and 10,870 for Nicolotti.
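The Latin to Greek ratio quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the procedure actually used for the datasets: the helper name `to_greek_equivalent` is hypothetical, and the sample input of 1,000 words is invented purely for illustration.

```python
# Word totals for Luke as stated in the text above.
VULGATE_LUKE_WORDS = 18_003   # Latin (Vulgate)
GREEK_LUKE_WORDS = 19_482     # Greek (canonical Luke)

# Latin-to-Greek ratio derived from the two totals.
latin_to_greek = GREEK_LUKE_WORDS / VULGATE_LUKE_WORDS

# English-to-Greek ratio as reported in the text (based on a sampling,
# so it is taken here as a given constant rather than derived).
ENGLISH_TO_GREEK = 0.77


def to_greek_equivalent(word_count: int, ratio: float) -> int:
    """Hypothetical helper: scale a word count by a conversion ratio."""
    return round(word_count * ratio)


print(round(latin_to_greek, 3))  # 1.082

# Invented sample input, not a figure from any reconstruction.
sample = to_greek_equivalent(1_000, ENGLISH_TO_GREEK)
```

Rounding to whole words is an assumption of this sketch; the text does not state how fractional results were handled.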

The methodological chasms separating these reconstructions reflect deep disagreements on foundational matters. Should GMarc be reconstructed as a fully continuous text, or is a radically discontinuous text the best scholars may hope to achieve? Can content that is not explicitly attested in patristic citations of GMarc be restored, and if so, on what grounds? To what degree may we lean on the variants in Codex Bezae and other so-called Western manuscript readings of Luke as a reliable basis to recover wording for GMarc? Is the target text a later, deviant evisceration of the primal, apostolic, canonized Gospel of Luke, or is the canonical version a later expansion of a text anachronistically caricatured as heretical? Do these two texts reflect different authors (the Schwegler hypothesis), the same author (the early-orthodox hypothesis), or, as Gramaglia’s () critical translation of Klinghardt’s edition posits, two recensions by the same author across two different time periods?

Despite Humanities scholars turning increasingly from theology- to text-based approaches to GMarc, many of the same debates of the last 175 years linger, with scholars running in their respective circles whilst accusing the other side of circular reasoning. As detailed more thoroughly in our Harnack data paper (), past statistical and stylometric studies have not settled the foundational questions decisively, falling short of scientific rigor and method. Lacking careful and critical engagement with the text of GMarc, Sanday’s () stylometric claims made in the defense of the early-orthodox hypothesis have held sway for nearly 150 years, only passingly defended by scholars such as Rowe (), Wolter (), Hays (), and Roth (). While (the American) John Knox () and his former student Tyson () both roundly challenged Sanday’s consensus, their conclusions—based primarily on their use of Harnack’s reconstruction—have not yet won broad acceptance, with notable exceptions in the sympathetic receptions of BeDuhn (), Vinzent (), Matthews (), and Klinghardt (, ). More recently, Smith () has compiled statistics based on verse counts from Roth’s reconstruction.

Our iterative First Gospel LODLIB () builds on the previous statistical and stylometric studies of Knox, Tyson, and Smith by running verse and word counts and separating out Single, Double, and Triple traditions. Our work differs substantially from earlier studies by integrating computational linguistics (CL) and historical corpus linguistics (HCL) methods (signals analysis, cluster analysis, binomial distributions, data visualization, etc.) and open science principles (scientifically testable hypotheses, open data, version control, etc.) in the interest of transcending the subjective impasse in GMarc studies decisively. This has led us to compile, normalize, and enrich datasets based on all major GMarc reconstructions so that we and others may analyze and correlate these data and draw scientifically sound conclusions.

(2) Method

Challenges and Resolutions

Of all major reconstructions of GMarc, Harnack’s is the most ambiguous and convoluted in its variety and inconsistent application of indications. Our data paper for the Harnack datasets () sets forth the normalization standards we developed to render clean, consistent data that can be scientifically compared with other GMarc datasets by humans and machines. While Roth’s reconstruction is far more carefully and consistently organized than Harnack’s, its variety and ambiguity of indications resemble those of Harnack in many respects. Roth details and deploys an elevenfold (!) continuum of confidence for words:

  1. bold font for secure
  2. bold italics for very likely
  3. normal font for probable
  4. italics for possible
  5. () parentheses for (precise wording not attested)
  6. [] square brackets for [likely present]
  7. [] square brackets for [may have been present]
  8. [] square brackets for [likely not present]
  9. [] square brackets for [may not have been present]
  10. [] square brackets for [possibly not present]
  11. [] square brackets for [readings with ambiguous options]
    In conjunction with the above, two more indications for words or groups of words are used.
  12. … ellipses for lacunae or unrestorable content necessary for the syntax or narrative
  13. {} braces for {uncertain word order}
    For verses (individually or as ranges), Roth also makes use of several more labels:
  14. “attested”
  15. “attested but no insight into wording can be gained”
  16. “attestation uncertain”
  17. “unattested” or “not attested”
  18. explicitly “attested as not present”
  19. implicitly (tacitus) “attested as not present”

To create normalized datasets from this profoundly ambiguous text, content corresponding to indications 1–3 is rendered in normal font. Parentheses enclose content corresponding to indications 4–7. Empty parentheses substitute for indications 12, 15, and 16. To break out indication 11, we wrap the first listed reading within parentheses and substitute empty brackets in place of the variant(s). Indication 13 is simply ignored; the datasets replicate the word order as presented. The datasets omit all content corresponding to indications 8–10 and 17–19. These decisions together employ binary decision-making to produce a clear, consistent, and tokenizable running script akin to a unitary vocal performance or recording, where likely/performed content is included, but unlikely/unperformed/alternate content is not.
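The normalization rules just described can be sketched as a small lookup. This is an illustrative sketch under stated assumptions, not the workflow used to build the datasets (which were created manually): the function name `normalize`, the single space before the empty brackets, and the use of one pair of empty brackets regardless of the number of variants are all assumptions. Indication 14 ("attested") is not assigned a treatment in the paragraph above, so the sketch refuses to guess.

```python
from typing import List

# Indication numbers follow the list above.
PLAIN = {1, 2, 3}                 # rendered in normal font
PARENTHESIZED = {4, 5, 6, 7}      # content enclosed in parentheses
EMPTY_PARENTHESES = {12, 15, 16}  # replaced by empty parentheses
OMITTED = {8, 9, 10, 17, 18, 19}  # dropped from the datasets


def normalize(indication: int, readings: List[str]) -> str:
    """Hypothetical helper: render one unit of text per its indication.

    `readings` holds the wording as printed; for indication 11 it holds
    the first listed reading followed by the variant(s).
    """
    first = readings[0] if readings else ""
    if indication in PLAIN:
        return first
    if indication in PARENTHESIZED:
        return f"({first})"
    if indication in EMPTY_PARENTHESES:
        return "()"
    if indication == 11:
        # Assumption: one pair of empty brackets stands in for all variants.
        return f"({first}) []"
    if indication == 13:
        # Uncertain word order is ignored: keep the order as presented.
        return first
    if indication in OMITTED:
        return ""
    # Indication 14 and unknown values are not covered by the stated rules.
    raise ValueError(f"no normalization rule stated for indication {indication}")
```

In the source, indication 13 combines with other indications; treating it as a standalone case here is a simplification for the sake of a short example.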

The editor explains these intricate typographical and tagging conventions as follows:

since perhaps the most pronounced weaknesses of all previous reconstructions, including Harnack’s, is the lack of distinction between various levels of certainty for attested readings, the following reconstruction clearly reveals the attempt to indicate what level of confidence can be assigned to any particular reading for Marcion’s text. Therefore, even when the wording of this reconstruction agrees with that of Harnack’s, the ability to see an assessment of the relative confidence that one can place in a specific reading seeks to provide significantly more helpful insight into Marcion’s Gospel. ()

The descriptions provided for indications 1–3 elaborate on this proposed method, according to which alignments between patristic citation habits and manuscripts of Luke merit higher levels of confidence, but disparities or lack of corroboration between them require lower levels. As BeDuhn () noted, this method—though pretending to objective neutrality—inherently biases toward the canonized form of the textual tradition and confirms by way of circular reasoning the a priori assumption of the early orthodox hypothesis. Scholars who closely scrutinize the indications across myriad decisions and indecisions may find themselves wondering to what extent they achieve an objective standard for data restoration and to what extent they obfuscate the data by focusing on the editor’s ambiguous array of skeptical sentiments. Whatever the case may be, to move from quasi-scientific confidence to actual scientific confidence, scholarship on GMarc must shift to hypothesis-driven, verifiable research and statistically significant findings based on normalized data.

Quality and Version Control

As detailed in our Harnack data paper, the first dataset consists of human-readable Greek, and the second manually lemmatizes and morphologically tags the text using the BibleWorks Greek Morphology (BGM) schema. The BGM arose out of the previous work of the Computer Assisted Tools for Septuagint/Scriptural Studies (CATSS) project at the University of Pennsylvania under Robert Kraft () to apply morphological tagging to the Septuagint. Thereafter this tagging was extended to the New Testament and other early Christian texts by Michael Bushnell of BibleWorks, Jean-Noel Aletti of the Pontifical Biblical Institute, and Andrzej Gieniusz. The BGM schema is lightweight, adaptable, familiar to many scholars, licensed for non-commercial use, and easy to compile, edit, and query in word processors and CL environments. For quality control, we ran segmented cross-checks of word counts across datasets, confirming 4,169 words total in each.
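A segmented cross-check of this kind might look like the sketch below. It rests on assumptions the text does not confirm: that both datasets can be split into parallel segments, that every word is one whitespace-delimited token in both files, and that a lemma and its tag form a single token. The function names and the `@TAG` placeholder strings are hypothetical and are not real BGM codes.

```python
from typing import List, Tuple


def count_words(segment: str) -> int:
    """Count whitespace-delimited tokens in one segment."""
    return len(segment.split())


def cross_check(readable: List[str], tagged: List[str]) -> List[Tuple[int, int, int]]:
    """Return (segment index, readable count, tagged count) for each mismatch."""
    if len(readable) != len(tagged):
        raise ValueError("datasets have different numbers of segments")
    mismatches = []
    for index, (plain, enriched) in enumerate(zip(readable, tagged)):
        plain_count = count_words(plain)
        enriched_count = count_words(enriched)
        if plain_count != enriched_count:
            mismatches.append((index, plain_count, enriched_count))
    return mismatches


# Toy, invented segments; "@TAG" is a placeholder, not a real BGM code.
readable = ["καὶ εἶπεν αὐτοῖς", "ὁ λόγος"]
tagged = ["καί@TAG λέγω@TAG αὐτός@TAG", "ὁ@TAG λόγος@TAG"]

mismatches = cross_check(readable, tagged)
total_readable = sum(count_words(segment) for segment in readable)
total_tagged = sum(count_words(segment) for segment in tagged)
```

Reporting mismatches per segment rather than only comparing grand totals makes it easier to locate the place where two files diverge.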

These UTF-8 encoded .txt files offer a starting point for CL research on GMarc, not the final word. We welcome scholarly feedback and collaboration to correct, improve, and transform our datasets and to convert them to other schemata. We likewise welcome the creation of TEI XML files enriched with unlikely and variant wording and with notes placed in the markup, so that the visualized flow of the running main text is not disrupted while deeper analysis and alternative scenarios remain possible.

(3) Dataset description

Object name: Normalized Datasets of Roth’s Reconstruction of Marcion’s Gospel

Format names and versions: UTF-8 encoded .txt

Creation dates: 2020-11-01/2021-09-09

Dataset Creators

Mark G. Bilby (California State University, Fullerton) manually created both datasets.

Languages: Postclassical Greek, English

License: CC-BY-NC-ND

Repository name: Journal of Open Humanities Data Dataverse

Publication date: [to be decided]

(4) Reuse potential

These datasets are by no means substitutes for Roth’s monograph in general or his reconstruction in particular, which we highly encourage readers to consult firsthand. These datasets do, however, provide important transformative supplements to today’s most widely accepted reconstruction of GMarc. They build on the previous publication of normalized datasets of Harnack’s reconstruction in JOHD and anticipate the future publication of datasets based on other GMarc reconstructions. As a non-canonical text suppressed for some 1,800 years, GMarc has suffered much decay and disintegration, but CL, HCL, and open data science have enormous potential to restore this text to a much higher level of fidelity than currently obtains. They can do so by means of scientific data restoration methods, including the identification and disambiguation of underlying voices and the clarification of interdependent relationships with the voices embedded in other early canonical and non-canonical Gospels. These normalized datasets anticipate GMarc becoming a major focus among data scientists and humanities scholars alike.

Additional Files

The additional files for this article can be found as follows:


These two UTF-8 encoded .txt dataset files are the first normalized, peer-reviewed, and lexicographically enriched transformations of Dieter Roth’s 2015 reconstruction of Marcion’s Gospel to be published. The first dataset consists of human-readable Postclassical Greek, while the second lemmatizes and morphologically tags the text according to the openly licensed BibleWorks Greek Morphology schema. DOI: https://doi.org/10.5334/johd.57.s1

Key to BibleWorks Greek Morphology (BGM) (v1.1)

The BibleWorks Greek Morphology (BGM) schema is, together with its datasets, openly licensed for non-commercial distribution. The schema provides a lightweight, compact means of adding Part of Speech (PoS) tags subsequent to lemmatized words. Each element of the schema occupies a set location within a given sequence. This morphological key elaborates the schema and numbers the respective positions for the sake of clarity. Each option is represented by a single alphanumeric abbreviation dependent on its precursors and position within the sequence. DOI: https://doi.org/10.5334/johd.57.s2