Digital Second Edition of Judaica Americana: A Bibliography of Publications to 1900

Context
This dataset was produced as part of the creation of the Digital Second Edition of Judaica Americana, a pilot project within the framework of the Penn Libraries' Judaica Digital Humanities program [1].
This dataset is derived from the 1,500+ page PDF file of the second edition of Robert Singerman's Judaica Americana, the copyright of which was donated to Penn Libraries in 2019 [2]. In it, Singerman identified 6,500+ monographic and serial publications, and an additional 3,000 entries in the supplement, presented with bibliographical descriptions, classification explanations, and holdings information. Supplemental entries to the second edition were identified by Singerman through digitized, full-text books accessible from freely available online services like HathiTrust and Internet Archive, auction sites, and online marketplaces for booksellers.
The first edition of Judaica Americana was notable for several differences from its predecessors in the bibliographies of American Jewish literature. It extended bibliographic listings to 1900; it included location symbols for all known copies of a work; it identified each work with references to its listing in early Jewish and standard classic American reference works; it included transliterated titles; and its extensive index allowed for search not only of author, title, and broad subject areas, but of place of publication and publisher [3].

(2) Methods
Steps
Using the document version of Judaica Americana, I developed a Python script, "extract_singerman.py," to create a dictionary keyed by Singerman ID number (or by an assigned ID number for entries added since the print edition), with the following values: year of publication, the full entry as listed in Singerman, notes, and holdings information [4]. I further extracted the author/editor of an entry, title, place of publication, printer/publisher, and asterisks through a combination of manual identification and OpenRefine, an open source desktop application for data cleanup, wrangling, and transformation activities.
With a scanned PDF of the print edition's index, I used a Python script developed by Digital Scholarship Librarian Jonathan Scott Enderle to perform Optical Character Recognition (OCR) via Tesseract, an open source text recognition engine [5]. This created three separate text files in which each line gives a heading followed by a list of locators (Singerman ID numbers). After extensive cleaning, a final Python script, "flipindex-headers.py," created another dictionary, with each locator as a key and its headings as values [6]. This dictionary was written into a separate CSV and then linked with corresponding entries in the dataset based on the Singerman ID number.
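The index inversion described above can be illustrated with a minimal sketch. It is not the project's "flipindex-headers.py" script: the line layout (a heading followed by a comma-separated run of numeric locators), the function names, the CSV column names, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import defaultdict

# Assumed layout of a cleaned index line: heading text, then a trailing
# run of comma-separated numeric locators, e.g. "Almanacs, 12, 340".
LINE_PATTERN = re.compile(r"^(.*?)[,\s]+((?:\d+\s*,\s*)*\d+)$")


def flip_index(lines):
    """Invert heading -> locators lines into a locator -> headings mapping."""
    flipped = defaultdict(list)
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # no trailing locators, e.g. a cross-reference line
        heading, locator_run = match.groups()
        for locator in re.findall(r"\d+", locator_run):
            if heading not in flipped[locator]:
                flipped[locator].append(heading)
    return dict(flipped)


def write_flipped_csv(flipped, handle):
    """Write one row per locator; the column names are hypothetical."""
    writer = csv.writer(handle)
    writer.writerow(["singerman_id", "index_headings"])
    for locator in sorted(flipped, key=int):
        writer.writerow([locator, "; ".join(flipped[locator])])


def link_headings(entries, flipped):
    """Attach headings to entry records that share the same ID (string keys)."""
    return {
        entry_id: {**record, "index_headings": flipped.get(entry_id, [])}
        for entry_id, record in entries.items()
    }


# Invented sample lines, not taken from the actual index.
sample_lines = ["Almanacs, 12, 340", "Leeser, Isaac, 340, 5021"]
flipped = flip_index(sample_lines)

buffer = io.StringIO()
write_flipped_csv(flipped, buffer)
print(buffer.getvalue())
```

Keeping the locators as strings avoids losing any non-numeric suffixes that assigned ID numbers might carry; the sketch only sorts them numerically when writing the CSV, which assumes purely numeric IDs.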

Sampling strategy
Subject matter, not author ancestry, is the determining factor for inclusion in Judaica Americana and the dataset. The basic premise was to "include all separately printed works issued under Jewish auspices including works issued under non-Jewish auspices relating to the Jewish people and their culture, from antiquity to modern times" [3].

Quality Control
To normalize the location and author data post-extraction, I used several clustering methods in OpenRefine. The dataset matches the document exactly, even where spelling variations exist. For example, "Philadelphia. Congregation Beth Israel" and "Philadelphia. Beth Israel Congregation" are treated as distinct authors. Assumed authors, denoted with square brackets, are treated as distinct from known authors, such as "[Scott, Walter, Sir]" and "Scott, Walter, Sir."
The majority of the dataset is written in English. In the title field, the dataset includes Aramaic, Danish, English, French, German, Hebrew, Judeo-German, Norwegian, Spanish, and Yiddish text. Transliteration for Hebrew, Greek, and Russian is enclosed in brackets and follows the ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts approved by the Library of Congress and the American Library Association. Transliteration for Yiddish is enclosed in brackets and follows the Weinreich/YIVO system. With the exception of the Hebrew chet (represented by "ḥ"), diacritical marks used for the Romanization of letters in the Hebrew alphabet have been ignored [3].
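The source does not say which OpenRefine clustering methods were used. As one illustration, the sketch below approximates key-collision clustering with a simplified fingerprint key (lowercase, strip punctuation and diacritics, sort unique tokens). The function names are hypothetical and the keying is a simplification rather than OpenRefine's exact implementation. It shows why the author variants quoted above surface as candidate clusters, which an editor can then choose to keep distinct, as this dataset does.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key: order- and punctuation-insensitive."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, remove punctuation (including square brackets), split on whitespace.
    cleaned = re.sub(r"[^\w\s]", "", unaccented.strip().lower())
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def candidate_clusters(values):
    """Group distinct strings that share a fingerprint; singletons are dropped."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


authors = [
    "Philadelphia. Congregation Beth Israel",
    "Philadelphia. Beth Israel Congregation",
    "[Scott, Walter, Sir]",
    "Scott, Walter, Sir",
    "Leeser, Isaac",
]

for cluster in candidate_clusters(authors):
    print(cluster)
```

Each printed cluster is only a suggestion for review. Merging is a separate editorial decision, and here the bracketed and reordered forms are deliberately left unmerged to match the printed bibliography.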

Publication date
First published to the repository on 2019-10-22.

(4) Reuse potential
This data has reuse potential for scholarly research within studies of early American Jewish history and nineteenth-century American history in general. Like its physical counterpart, the chronological focus of the data allows for mining, clustering, and historical contextualization of texts. Its broad subject scope allows for exploration of Jewish communal activity through the constitutions of societies and institutions, as well as of non-Jewish attitudes towards Jewish people during this period, as Jewish communities spread with the territorial expansion of the United States in the second half of the nineteenth century.
For book historians, this dataset offers historical bibliometrics to view trends in book data, identifying macroscopic trends in the nineteenth-century book market. It will allow researchers to identify and visualize the connections among various printers, publishers, and places: a key area of study in Hebrew press history.
This dataset also has the potential for use by those involved in the book trade. Users can utilize the dataset as a reference record for extant copies and subsequent editions. One can also quickly generate counts of monograph holdings at various libraries. Furthermore, archivists and librarians can provide more complete descriptions of existing archival materials and books across private and institutional collections.
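As a rough illustration of generating holdings counts, the sketch below tallies how many entries each library holds. The column name "holdings", the semicolon or comma separator, and the sample rows are assumptions for illustration; the actual headers are documented in the Header Explanations file.

```python
import csv
import io
import re
from collections import Counter


def count_holdings(rows, column="holdings"):
    """Count entries per library symbol; each entry counts once per library."""
    counts = Counter()
    for row in rows:
        # Assumed format: location symbols separated by semicolons or commas.
        symbols = {part.strip() for part in re.split(r"[;,]", row.get(column) or "")}
        symbols.discard("")
        counts.update(symbols)
    return counts


# Invented sample rows, not taken from the dataset.
sample_csv = """entry_id,holdings
A1,DLC; PU
A2,PU; NN; DLC
A3,
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for symbol, total in count_holdings(rows).most_common():
    print(symbol, total)
```

For the real file, the same function could be fed from csv.DictReader over the downloaded CSV, with the column name adjusted to match the dataset's header.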
This dataset also has the potential for augmentation and expansion, which is being undertaken as the project develops [7]. For example, we have linked entries listed in Yosef Goldman's Hebrew Printing in America, 1735-1926: A History and Annotated Bibliography. Goldman's bibliography overlaps with Singerman's for these entries, but is unique for its inclusion of reproductions of many of the title pages, brief notes on content and author biography, and vernacular Hebrew-character titles. The conception and design of this dataset will allow for easy reference and research between the two seminal bibliographic works, and for researchers to identify texts using Hebrew and English characters.
Virtually all of these texts are no longer restricted by US copyright, and many of them have been scanned for digital preservation by libraries. We are expanding the dataset by including links to digital facsimiles of the publications included in Judaica Americana, and intend to incorporate PDF files for integrated full text search and discovery across the corpus. For the development of the Digital Second Edition, we have added links to holdings in WorldCat, HathiTrust, Google Books, and Penn Libraries. This will allow Singerman's bibliography to serve an additional purpose as an all-inclusive digital library for full-text searchable references: an annotated bibliography for the twenty-first century.
Finally, this dataset is complemented by the Union List of Nineteenth-Century Serials included in Judaica Americana, available in the same repository. Singerman's attempt was the first to arrange this multilingual collection of serials and their post-1900 issue publications in one bibliography [8]. The Union List dataset follows the same data process described above [9]. We are working with The Ohio State University's project, "Union List of Digitized Jewish Historic Newspapers, Periodicals and e-Journals," to identify and provide access to digital facsimiles [10].

Additional File
The additional file for this article can be found as follows:
• Judaica Americana Header Explanations. This Markdown file contains background information on the dataset, including explanations for and descriptions of the headers within the dataset. DOI: https://doi.org/10.5334/johd.15.s1