(1) Overview

Repository location

The database is available in both Microsoft Access and SQLite versions on Dataverse at https://doi.org/10.7910/DVN/PAGGQS and on GitHub at https://github.com/cbdb-project/cbdb_sqlite. Both versions are regularly updated with new content and features.

Context

The China Biographical Database (CBDB) amasses biographical information from disparate historical sources to facilitate quantitative, prosopographical research on premodern China. The project originated with the dataset that Robert M. Hartwell (1932–1996) created between the mid-1970s and 1995 as part of his research on the social and political history of middle-period China (ca. 7th–13th century) and willed to the Harvard-Yenching Institute. In 2004–05, Michael A. Fuller restructured the data and converted them from dBase first into FoxPro and then into Microsoft Access format. The database has since been transferred to the Fairbank Center for Chinese Studies at Harvard University, which, together with the Center for Research on Ancient Chinese History at Peking University and the Institute of History and Philology at Academia Sinica, has continued to add new content under the direction of an international committee chaired by Peter K. Bol. Over the past sixteen years, CBDB has grown from a database of about 25,000 individuals to one of approximately 491,000 individuals (as of May 2021) whose lives spanned the seventh through nineteenth centuries, and it is available for scholarly use in several online and offline (Microsoft Access, Microsoft SQL Server, MySQL, and SQLite) versions. The contents of CBDB benefit from, and are inevitably shaped by, China’s historiographical tradition, which provides rich data on family relations, literary exchanges, intellectual interactions, and careers in government, among others, but is often reticent about issues like gender relations and economic transactions. As a result, as of May 2021 CBDB has 275,945 records on bureaucratic appointments, 482,953 records on kinship relations, and 160,219 records on non-kin social connections, but hardly any on economic activities.

(2) Method

Steps

There are two core tasks in our data collection: data mining and disambiguation. CBDB is a relational database that uses the entity-relationship model to organize biographical information. Persons are one type of entity; so are places, texts, offices, and so forth. Each entity has its own set of attributes (e.g., each person has a birth year and a death year, and each place has a longitude and a latitude), and every life event is conceptualized as an instance of a relationship between multiple entities (e.g., a bureaucratic appointment is an instance of a relationship, lasting from the beginning to the end year of that appointment, between a person, the office he held, and the jurisdiction of that office). Data collection is, in substance, a matter of identifying named entities and their relationships in historical sources that describe them in narrative form. For this purpose, we have experimented with several data mining approaches and found value in algorithms based on regular expressions and in neural network models, such as Bidirectional Encoder Representations from Transformers (BERT) and Bidirectional Long Short-Term Memory (Bi-LSTM). We use BERT, for example, to create a vector representation of each Chinese character (an approach known as “word embedding”), which allows us to capture semantic and syntactic relations between characters through mathematical operations. We also use Bi-LSTM to tag the characters and predict whether a character is part of a string that signifies a specific person, place, or bureaucratic office. Outputs from these automated data mining algorithms are reviewed by an editorial team before they are prepared for inclusion in our database.
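The character tagging step described above can be pictured with a small, self-contained sketch. It assumes a BIO-style tag scheme (B- opens an entity, I- continues it, O marks everything else) and the labels PER and LOC; the source does not specify the tag scheme, so both are assumptions made for illustration. The function only shows how per-character predictions could be grouped into entity strings and does not reproduce the project's actual pipeline.

```python
from typing import List, Tuple


def decode_spans(chars: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-character BIO tags into (label, text) spans.

    The BIO scheme and the label names are assumptions for this sketch.
    """
    if len(chars) != len(tags):
        raise ValueError("exactly one tag per character is required")
    spans: List[Tuple[str, str]] = []
    label, start = None, 0
    for i, tag in enumerate(tags):
        # An open span continues only on an I- tag carrying the same label.
        if label is not None and tag == "I-" + label:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, chars[start:i]))
            label = None
        # A B- tag opens a new span; O and stray I- tags are skipped.
        if tag.startswith("B-"):
            label, start = tag[2:], i
    if label is not None:
        spans.append((label, chars[start:]))
    return spans


# Sample sentence and tags invented for illustration.
sample = decode_spans("蘇軾知杭州", ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"])
print(sample)
```

In a workflow like the one described, spans recovered this way would still go to human reviewers before being linked to database identifiers.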

In merging newly harvested data into CBDB, the chief challenge comes from the complex relationship in natural language between a name and the entity it signifies. CBDB assigns a unique identifier (“id” or “code”) to each named entity regardless of how it is referenced in the sources, and our development team makes every effort to disambiguate all newly harvested data before incorporating them into the database. Take persons, for example. We are fortunate that most people from all walks of life in Chinese society, unlike their European counterparts, had both a family name and a given name from the Han dynasty (202 BCE–220 CE) onward and could compose given names from almost any Chinese character; even so, it is not rare for two persons to have exactly the same name. Conversely, members of the elite in imperial China were typically known by a wide variety of names and could be referred to by their office titles and other honorific appellations. It is therefore often necessary to disambiguate personal names and appellations in historical sources. In practice, we use a variety of biographical information, such as alternative names, birth and death years, native place, examination degree, and data on kinship and social connections, to distinguish a person from his namesakes and to consolidate data points about the same person whom the sources reference in various ways.
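A minimal sketch of this kind of comparison follows. The `PersonRecord` fields, the `match_score` function, the scoring rule, and the two-year tolerance are all hypothetical choices made for illustration; the source names the kinds of evidence used but not how they are weighed, and examination degrees and kinship data are omitted here for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PersonRecord:
    """Hypothetical record shape; not the CBDB schema."""
    name: str
    alt_names: Set[str] = field(default_factory=set)
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    native_place: Optional[str] = None


def match_score(a: PersonRecord, b: PersonRecord, year_tolerance: int = 2) -> int:
    """Return 0 when the pair is ruled out, otherwise a count of agreeing evidence."""
    # Without any shared name or appellation the pair is not a candidate.
    if not ({a.name} | a.alt_names) & ({b.name} | b.alt_names):
        return 0
    score = 1
    for year_a, year_b in ((a.birth_year, b.birth_year), (a.death_year, b.death_year)):
        if year_a is None or year_b is None:
            continue  # missing data is neither evidence for nor against
        if abs(year_a - year_b) > year_tolerance:
            return 0  # conflicting dates indicate a namesake
        score += 1
    if a.native_place is not None and a.native_place == b.native_place:
        score += 1
    return score


# Sample values chosen for illustration; the namesake is invented.
known = PersonRecord("蘇軾", {"蘇東坡", "子瞻"}, 1037, 1101, "眉山")
mention = PersonRecord("蘇東坡", death_year=1101)
namesake = PersonRecord("蘇軾", birth_year=1500)
print(match_score(known, mention), match_score(known, namesake))
```

A score of this kind would at most rank candidate matches for an editor; it is not a substitute for the domain judgment the text describes.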

We disambiguate and code not only entities but also kinship relations. We have designed a set of symbols to describe kinship relations with greater precision than natural language affords (e.g., we use FBS and MBS [father’s or mother’s brother’s son], among others, to distinguish different kinds of paternal and maternal cousins). We also normalize social relations by aggregating the varied expressions found in historical sources into coded categories. Natural language has numerous ways of describing social relations. While the nuances in these descriptions (e.g., to censure someone vs. to criticize someone) merit attention and may, at least in some cases, reflect subtle differences in the nature of actual social relationships or the perceptions thereof, the strength of CBDB lies in facilitating the analysis of a large amount of historical data in the aggregate. As of May 2021, we have 470 pairs of coded relations, which are further organized into larger classes and subclasses that include literary exchanges, teacher-disciple ties, supportive or oppositional political relations, and so forth. After fully disambiguating and normalizing (“coding”) named entities and their relations, we partition the data into separate tables, which are subsequently uploaded to the database. The primary key in each data table eliminates duplicate records, and the foreign keys ensure proper linkage between tables.
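The kinship notation described above can be read as a chain of steps from the reference person. The sketch below expands a code into an English gloss using only the four symbols that the examples FBS and MBS imply (F, M, B, S). The function name `expand_kin_code` and the one-letter-per-step parsing are assumptions for illustration; the full symbol set is larger than what is shown here and may not follow this simple pattern.

```python
# Only the symbols implied by the FBS and MBS examples are included.
STEPS = {"F": "father", "M": "mother", "B": "brother", "S": "son"}


def expand_kin_code(code: str) -> str:
    """Expand a kinship code such as 'FBS' into an English gloss.

    Assumes one letter per kinship step, which is a simplification.
    """
    if not code:
        raise ValueError("empty kinship code")
    words = []
    for symbol in code:
        if symbol not in STEPS:
            raise ValueError(f"unknown kinship symbol: {symbol!r}")
        words.append(STEPS[symbol])
    # Each step except the last takes a possessive: father's brother's son.
    return "'s ".join(words)


print(expand_kin_code("FBS"))
print(expand_kin_code("MBS"))
```

The value of such a notation is visible even in this toy form: the two codes stay distinct where the English word "cousin" would merge them.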

Disambiguation and normalization are time-consuming tasks that require domain knowledge in specific historical periods and topics. To expedite the process, we launched a crowdsourcing platform in 2021 to encourage contributions from historians of premodern China.

Sampling strategy

Our ultimate goal is to collect all biographical information in the extant historical record of premodern China. Resource constraints, however, require that we set priorities. To produce a large collection of data for scholarly use within a reasonable timeframe, we have worked mainly with digitized, searchable texts, especially those written and formatted in a style particularly suitable for automated data extraction, and have prioritized data sources that can systematically expand the coverage of our database. These include both modern scholarly works, such as biographical sketches and rosters of officeholders compiled by twentieth-century historians, and primary historical documents, such as biographies in official histories and local gazetteers, tomb epitaphs, records of imperial examination graduates, and the lists of letters and other writings in literary collections.

Several biographical dictionaries, compiled in the 1960s and 1970s, provide a large assemblage of material on the lives of approximately 70,000 persons between the tenth and seventeenth centuries. By systematically harvesting the data in these dictionaries, the CBDB team managed to create basic profiles for a large number of historical figures during an early phase of our project.

Since then, we have expanded coverage by concentrating data collection in three areas: bureaucratic appointments, family relations, and literary exchanges. We have collected data from two multi-volume compendia, which contributed more than 35,000 records on prefectural appointments from the seventh to thirteenth centuries. These were recently supplemented by another 107,000 entries on local appointments taken from 158 local gazetteers compiled in Ming-Qing times (1368–1912). Using fifty-two examination records from the Ming dynasty (1368–1644), we have added 14,116 metropolitan examination graduates and roughly 130,000 of their relatives to the database. We are now expanding data coverage in this area with a new dataset containing 19,576 Song-dynasty (960–1279) examination graduates based on a recent publication. With the help of Tang historians (Yao Ping and Nicolas Tackett), we have added some 100,000 instances of kinship relations from tomb epitaphs between the seventh and tenth centuries, and we are currently preparing a massive collection of officeholding data from Song-dynasty administrative documents.

At present, the majority of our data on social relations are based on records of literary exchanges. We collected 18,124 instances of poetic exchange between the seventh and tenth centuries, based on the work of a modern scholar, and some 8,800 instances of epistolary exchange between the tenth and thirteenth centuries, based on Complete Song-Dynasty Prose. We will soon add another 40,000 instances of epistolary exchange from Ming-dynasty (1368–1644) literary collections. For a full list of our data sources, see https://projects.iq.harvard.edu/cbdb/cbdb-sources.

We have also coded and incorporated data from existing databases that focus on specific social groups and historical periods. These include, for example, a massive collection of data on family relations and officeholding for more than 46,000 persons from the Database of Names and Biographies and data on some 5,000 female writers from the Ming-Qing Women’s Writings Project.

CBDB is a work in progress and has no planned end date. Its current contents reflect its history, which began with Hartwell’s dataset of Song-dynasty officials and gradually extended back into the Tang dynasty and forward into the Yuan, Ming, and Qing dynasties. As more historical texts from premodern China become available in searchable digital formats and the technology of data mining improves, the contents of CBDB will continue to grow.

Quality control

Our editorial group, composed of doctoral students in Chinese history who specialize in various topics and periods, reviews the output from data mining algorithms and, when necessary, manually inputs data into our database. When new data are prepared for uploading to CBDB, the primary and foreign keys in the data tables serve as a further line of defense for data integrity.
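How primary and foreign keys can act as such a line of defense is illustrated by the following sketch, which uses Python's standard `sqlite3` module and an in-memory database. The table and column names (`person`, `office`, `appointment`, and so on) are hypothetical and much simpler than the actual CBDB schema; the point is only that a duplicate record and a record pointing to a nonexistent person are both rejected at insert time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE office (office_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE appointment (
    person_id  INTEGER NOT NULL REFERENCES person(person_id),
    office_id  INTEGER NOT NULL REFERENCES office(office_id),
    first_year INTEGER NOT NULL,
    last_year  INTEGER,
    PRIMARY KEY (person_id, office_id, first_year)
);
""")


def try_insert(sql: str, params: tuple) -> bool:
    """Run one insert; return False if a key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical sample rows.
try_insert("INSERT INTO person VALUES (?, ?)", (1, "Person A"))
try_insert("INSERT INTO office VALUES (?, ?)", (10, "Prefect"))

ADD = "INSERT INTO appointment VALUES (?, ?, ?, ?)"
first = try_insert(ADD, (1, 10, 1089, 1091))      # valid record
duplicate = try_insert(ADD, (1, 10, 1089, 1091))  # same primary key again
orphan = try_insert(ADD, (2, 10, 1090, 1092))     # person 2 does not exist
print(first, duplicate, orphan)
```

Key constraints catch only structural problems of this kind; whether a well-formed record is historically correct remains a matter for editorial review.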

(3) Dataset Description

Object name

SQLite version: CBDB_20210525.7z;

Microsoft Access version: CBDB_bc_20210525.7z

Format names and versions

CBDB is available for download in SQLite and Microsoft Access versions. Both its content and interface are constantly evolving. Data contents are dated by the most recent update (written as yyyymmdd in the file names, e.g., 20210525), and the interface is versioned using two lowercase English letters (the latest release is the bc version).
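For readers who script their downloads, the naming convention can be parsed mechanically. The sketch below infers a pattern from the two file names listed under Object name; the pattern, the `Release` type, and the `parse_release` function are assumptions for illustration, and other releases may be named differently.

```python
import re
from datetime import date
from typing import NamedTuple, Optional


class Release(NamedTuple):
    interface: Optional[str]  # two-letter interface version, if present
    data_date: date           # date of the most recent data update


# Pattern inferred from CBDB_20210525.7z and CBDB_bc_20210525.7z only.
PATTERN = re.compile(r"CBDB_(?:(?P<iface>[a-z]{2})_)?(?P<stamp>\d{8})\.7z")


def parse_release(filename: str) -> Release:
    """Split a release file name into interface version and data date."""
    match = PATTERN.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized release file name: {filename!r}")
    stamp = match.group("stamp")
    when = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
    return Release(match.group("iface"), when)


print(parse_release("CBDB_bc_20210525.7z"))
print(parse_release("CBDB_20210525.7z"))
```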

Creation dates – 1970s to 2021-05-25

Dataset creators – Current executive committee members include Peter K. Bol, Xiaonan Deng (Center for Research on Ancient Chinese History, Peking University), Michael A. Fuller (University of California at Irvine), Song Chen (Bucknell University), Hsi-yuan Chen, Wenyi Chen, and Xin Luo (Center for Research on Ancient Chinese History, Peking University). Current project managers are Hongsu Wang and Yang Xu (Peking University). For a list of past and present committee members, editors, and other contributors, see https://projects.iq.harvard.edu/cbdb/core-institutions-and-editors. For a list of crowdsourcing contributors, see https://projects.iq.harvard.edu/cbdb/cbdb-crowdsourcing-projects.

Language – Variable names are in English. Data are bilingual (English and Chinese).

License – CC BY-NC-SA 4.0

Repository name – Dataverse and GitHub

Publication date – 2021-05-25

(4) Reuse Potential

CBDB assembles biographical information from disparate sources and is particularly suited for data-driven, social scientific research that aims to discover macroscopic patterns in Chinese history and that complements the qualitative, humanistic approach of close reading. The current coverage of CBDB makes it particularly powerful for prosopographical studies of the Chinese elite from the seventh through nineteenth centuries. The data in CBDB are continuously disambiguated and readily formatted for statistical, social network, and spatial analyses. A growing number of articles published every year use CBDB data to explore topics ranging from the career trajectories, regional composition, and family connections of civil officials to the intellectual and social networks of Neo-Confucian moral philosophers, antiquities collectors, and members of political factions. For a full list of publications that use CBDB data, see https://projects.iq.harvard.edu/cbdb/publications-use-cbdb-data.

CBDB also has immense value for developing new digital projects. Online text markup platforms, like MARKUS, use CBDB code tables to tag persons, bureaucratic offices, places, and temporal references in user-uploaded historical texts. Specialized databases (e.g., Database of Names and Biographies) access CBDB through our API to provide more context for their data collections. The Chinese Text Project integrates data from CBDB and other sources to produce a knowledge graph in its Data Wiki, and the Shanghai Library uses our data for its Linked Open Data project. Universities, such as Tsinghua, use CBDB to teach digital methods for Chinese studies and incorporate CBDB into their pedagogical platforms, which train the next generation of digital humanists.