Introducing the Historical Index of Ethnic Fractionalization (HIEF) Dataset: Accounting for Longitudinal Changes in Ethnic Diversity

The demographic composition of most countries has changed dramatically in the last decades [19]. The questions of how people with different backgrounds can peacefully interact in the modern globalized world, and how their societies may prosper, are among the most important challenges of recent years. There has been a growing interest among social scientists and policymakers in studying and uncovering the role ethnic diversity plays in shaping social, political and economic outcomes. Exploring this issue is relevant to a range of public policies, including those relating to immigration and integration. While social scientists can rely on a variety of indices measuring ethnic diversity, to the best of my knowledge, most of these indices treat ethnic fractionalization as a time-invariant phenomenon. Thus, they do not provide country ethnic fractionalization estimates for different years. However, in treating ethnic diversity as time-invariant, we severely limit our understanding of its general long-term effects as well as the effects of slow or radical changes in ethnic fractionalization. The goal of this paper is to characterize the Historical Index of Ethnic Fractionalization (HIEF) dataset, which provides annual ethnic fractionalization estimates for 162 countries for the years 1945–2013. By introducing a time perspective, the dataset expands previous ethnic fractionalization indices. HIEF estimates are based on the Cline Center for Democracy Composition of Religious and Ethnic Groups (CREG) Project's original data [8] regarding the percentage of principal ethnic groups in each country. The paper proceeds in three parts. Firstly, I provide general background information on measuring ethnic fractionalization and the problems researchers face when attempting to create ethnic diversity indices. Secondly, I outline how the HIEF dataset was created and describe the data creation in a stepwise manner. Thirdly, I provide a preliminary descriptive analysis of patterns of changes in countries' ethnic fractionalization over time.

conflict that can take place between two groups of equal size. The idea behind polarization indices is that ethnic conflicts will take place in countries where a large ethnic minority faces an ethnic majority. The mere existence of a large ethnic group, and/or ethnic dominance by this group, is not a sufficient condition for an ethnic conflict to develop. There also needs to be an ethnic minority that is large and not divided into many different groups. Theoretically, having a large ethnic minority is the worst possible situation as measures of polarization reach their maximum when two equally sized groups face each other. The two measures represent two different approaches to diversity because ultimate fractionalization occurs when each individual belongs to a different group, whereas ultimate polarization occurs when there are only two types of groups. Thus, the two measures behave quite differently [15]. The HIEF dataset quantifies fractionalization rather than polarization as a first step in providing longitudinal measures of diversity. Nevertheless, since the original data also allow computing a polarization index, this might become a possible future endeavour.
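The contrast between the two approaches can be illustrated numerically. The following is a minimal sketch, not part of the HIEF procedure: it assumes the common Reynal-Querol form of the polarization index, 4 Σ s_i² (1 − s_i), which the text above does not specify, and the function names and group shares are hypothetical.

```python
def fractionalization(shares):
    """One minus the sum of squared group shares (shares sum to 1)."""
    return 1.0 - sum(s * s for s in shares)


def polarization_rq(shares):
    """Polarization in the Reynal-Querol form (assumed here, not taken
    from the HIEF paper): 4 * sum of s_i^2 * (1 - s_i)."""
    return 4.0 * sum(s * s * (1.0 - s) for s in shares)


# Hypothetical compositions: two equally sized groups versus ten.
two_groups = [0.5, 0.5]
ten_groups = [0.1] * 10

frac_two = fractionalization(two_groups)
frac_ten = fractionalization(ten_groups)
pol_two = polarization_rq(two_groups)
pol_ten = polarization_rq(ten_groups)
```

Under this assumed form, the two-group case has the higher polarization while the ten-group case has the higher fractionalization, which is the divergence described above.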
In economics, the majority of studies employ a measure of ethnic fractionalization called Ethno-Linguistic Fractionalization (ELF). The ELF measure was first used in an influential article by Easterly and Levine [14] which argues that given Africa's high ethnic diversity and the strong link between ethnic heterogeneity and slow economic growth, these two factors played a rather important part in the explanation for the region's "growth tragedy". Easterly and Levine's ELF measure is based on the work carried out by a team of Soviet ethnographers in the early 1960s and published as Atlas Narodov Mira [6]. Despite ELF's popularity and usage by several generations of political scientists, sociologists and economists, the measure also received criticism and other fractionalization indices have been developed. Alesina, Devleeschauwer, Easterly, Kurlat, and Wacziarg [1] propose a classification that distinguishes between ethnic, linguistic and religious diversity and creates separate indices for each. Their reasoning is based on the fact that relying largely on linguistic distinctions (as the ELF does) may obscure other aspects of ethnicity like racial origin, skin colour and so forth. For instance, in many countries in South America groups are largely monolingual, yet ethnically divided. Other researchers argued that a distinction must be made between ethnically and culturally diverse groups [17] or between politically relevant ethnic groups [25].
There have also been efforts to move beyond simple fractionalization measures by focusing on conjunctures with other heterogeneities, such as the index of ethnic inequality [3], which puts forward the intersection of ethnic diversity and economic inequality, or an index that combines five cleavages, namely race, language, religion, region, and income [27]. Other indices make an effort to account for the distance between groups [16] or the historical depth of ethnic cleavages [10], or consider heterogeneity between individuals rather than groups [4].
As explained above, heterogeneity may be defined ethnically, religiously, linguistically, culturally, but also economically as income inequality. It is worth underlining that indices regarding ethnic composition are particularly vulnerable to criticism in their attempts to measure ethnicity. To begin with, empirical efforts to create an ethnic index require that we collect data on ethnic groups in different countries. However, there is no uniform criterion on how to define ethnicity. Group identities are complex and mostly socially constructed which means that quantifying and measuring them is inherently problematic. There can be multiple ways to specify ethnic groups in a country all of which may be equally valid concepts of "ethnic groups". Moreover, even within one country, definitions of ethnicities can change over time. Questions related to the definition of diversity become even harder in comparative research that involves multiple countries each of which has its own concept of ethnicity. These facts notwithstanding and being aware of the possible shortcomings in constructing ethnic classifications, the HIEF dataset is largely based on an ethnic, rather than linguistic, distinction between groups.

Why change over time matters
Definitional issues aside, I argue that a major problem with a large part of the existing social science research on the effects of ethnic diversity is that diversity is often treated as time-invariant. This limits our knowledge about diversity's long-term effects. An increase and a decline in ethnic fractionalization over time might have different consequences. For instance, countries with steadily increasing ethnic diversity might be more willing to introduce institutions that effectively manage problems connected to greater heterogeneity than countries with shorter histories of ethnically diverse societies or with lower average rates of change in diversity. These institutions may then mediate the relationship between ethnic diversity and social, economic, and political outcomes. Moreover, in cases such as the dissolution of multi-ethnic states, ethnic fractionalization may decrease rapidly, which poses completely different challenges to the newly homogeneous societies. Failing to consider these historical developments might seriously hinder our understanding of the effects of ethnic diversity. With HIEF, it is now possible to depict longitudinal relationships that might improve our understanding of the causal relationships between ethnic diversity and relevant outcomes. A number of studies consider changes in ethnic diversity longitudinally in several countries. However, these studies either rely on immigration estimates [20], consider only one country at a time [11], or focus on subnational units [29]. Recently, some scholars have published articles that use time-varying measures of ethnic fractionalization [5,7], but all of the indices used are much more limited than HIEF, either with regard to time-variation or countries covered. Moreover, these studies do not make their original datasets publicly available for use by other researchers.

Creating the Historical Index of Ethnic Fractionalization Dataset
The original data on ethnic groups were gathered from CREG initiated by the Cline Center. The project provided information regarding the percentage of principal ethnic groups present in 162 countries annually for the period 1945-2013 [8]. The main sources for the CREG data were the Britannica Book of the Year, the CIA World Factbook, and the World Almanac Book of Facts [24]. In the original dataset, data were recorded from the main sources by a group of data collectors and later assessed by a group of data integrators who performed a number of checks. These checks accounted for consistency of group names and data outliers such as if there is "a group that is reported as constituting 25% of the population in one year and 35% in the next" [24: 4], and data inconsistencies when "different editions of the same source reports (sic) a group as constituting 18% of the population and 26% of the population in 1968" [24: 4]. Nevertheless, as the original dataset still contained some inconsistencies such as repeated information regarding certain ethnic groups in a single year, the original dataset had to be carefully checked and corrected.
In the HIEF dataset, the degree of ethnic fractionalization has been calculated based on the annual percentage of ethnic groups in each country using the most universally applied formula in the empirical literature, which is a decreasing transformation of the Herfindahl concentration index:

$$EF_{ct} = 1 - \sum_{i=1}^{n} s_{ict}^{2}$$

where $EF_{ct}$ is the level of ethnic fractionalization in country $c$ at time $t$, $i$ indexes ethnic groups, and $s_{ict}$ is the proportion of the population in unit $c$ belonging to ethnic group $i$ ($i = 1, \dots, n$) at time $t$.
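As a worked illustration of this index (one minus the sum of squared group shares), the following minimal sketch computes it for a single country-year. It is not the author's script: the function name is hypothetical, the input values are invented, and the sketch simply divides percentages by 100 without making any assumption about how the original procedure treats groups that are not listed.

```python
def ethnic_fractionalization(shares_percent):
    """Return 1 - sum of squared shares for one country-year.

    shares_percent: group sizes in percent (hypothetical input layout).
    Percentages are converted to proportions; no renormalisation is
    applied, which is a simplification of this sketch.
    """
    proportions = [p / 100.0 for p in shares_percent]
    return 1.0 - sum(s * s for s in proportions)


# Invented compositions used only to show how the index behaves.
ef_single = ethnic_fractionalization([100.0])       # one group only
ef_two = ethnic_fractionalization([50.0, 50.0])     # two equal groups
ef_four = ethnic_fractionalization([25.0] * 4)      # four equal groups
```

The index is 0 for a single group and rises as the population is split across more, and more evenly sized, groups.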
As described above, the ethnic fractionalization index for each country at any given year ranges from 0, where there is no ethnic fractionalization in the country and all individuals are members of the same ethnic group, to 1, where each individual in the country belongs to his or her own ethnic group. It should be noted that, historically, who was considered as belonging to a certain ethnic group could change, reflecting the politics and science of the times. The relative meaning of being in a certain category may not be the same from one time-point to another [28] both from the societal or individual point of view. The challenge arises especially with the introduction of categories such as "mixed race", mestizo, mulatto and similar categories in data collection. Thus, the measures may only have "nominal equivalence" and lack "functional equivalence" [9] which makes collecting ethnicity data, and measuring changes over time, challenging.

Descriptive illustration of the new dataset
The HIEF dataset contains three variables, namely Country, Year, and EFindex. The variable Country lists countries for the years in which they existed. For example, it includes the entry "Yugoslavia" for the years 1945-1992 and five separate entries "Bosnia-Herzegovina", "Croatia", "Slovenia", "Macedonia" and "Serbia" for 1993-2013. It follows that countries founded after 1945 are included from the year in which they were officially established.
The variable Year contains the corresponding year of observation for each country, usually ranging from 1945 to 2013. As described above, a shorter time span may be included for certain countries which were yet to be founded or ceased to exist.
The variable EFindex contains the value of the ethnic fractionalization index in each country for all available years. As described above, every value of the index can be interpreted as one minus a weighted sum of the ethnic groups' population shares, where the weights are these shares themselves. Table 1 summarizes the countries and years for which the ethnic fractionalization index is available.
The HIEF dataset is made available as a .csv file and can be found along with a document briefly introducing the dataset on the Harvard Dataverse repository [12].
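A file with this three-column layout can be read with standard tools. The sketch below is illustrative only: it assumes the column headers are spelled exactly Country, Year and EFindex, and it uses a small in-memory sample with invented placeholder countries and values rather than the real file.

```python
import csv
import io
from collections import defaultdict

# Invented placeholder rows in the three-column layout described above.
# These are NOT values from the HIEF dataset.
SAMPLE_CSV = """Country,Year,EFindex
Country A,1945,0.100
Country A,1946,0.110
Country A,2013,0.400
Country B,1990,0.600
Country B,2013,0.550
"""


def load_series(handle):
    """Map each country to a {year: EF value} dictionary."""
    series = defaultdict(dict)
    for row in csv.DictReader(handle):
        series[row["Country"]][int(row["Year"])] = float(row["EFindex"])
    return series


def total_change(by_year):
    """Difference between the last and first available year."""
    first, last = min(by_year), max(by_year)
    return by_year[last] - by_year[first]


series = load_series(io.StringIO(SAMPLE_CSV))
changes = {country: total_change(values) for country, values in series.items()}
```

To use the downloaded file instead, `io.StringIO(SAMPLE_CSV)` would be replaced by an opened file handle. Because countries enter and leave the panel in different years, the sketch takes each country's own first and last year rather than fixed endpoints.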

Comparing changes in diversity among European countries
Explorations of the new dataset illustrate the reasons why it is important to take account of historical changes in ethnic diversity within countries. Figure 1 shows the change of ethnic fractionalization over time in a sample of European countries. We can observe, for example, that Great Britain and the Netherlands had a similar level of ethnic fractionalization in 2013, but since 1949, diversity in the Netherlands has grown at a much faster pace than in Great Britain. In other words, Dutch society had to adapt to diversity more rapidly than the British. In contrast, Finland's ethnic fractionalization has stayed quite stable over the last 50 years and is generally low.
On the other hand, many Central and East European countries are much more ethnically homogeneous than they used to be. Moreover, they became homogeneous in a short period. For instance, while former Czechoslovakia used to be an ethnically highly heterogeneous country, its successor states, Czechia and Slovakia, are much more homogeneous. Apart from the break-up of formerly heterogeneous countries such as Czechoslovakia, Yugoslavia, or the Soviet Union, there are a number of reasons why one can observe changes towards more homogeneity in Central and Eastern Europe. Firstly, after the collapse of communism, many workers left their respective countries in search of new economic opportunities. Secondly, in many post-Soviet countries, Russian minorities began to feel unwelcome, resulting in return migration [18].
As one can observe in Figure 2, many African countries are highly ethnically heterogeneous and their fractionalization is relatively stable. For instance, highly heterogeneous countries such as South Africa or Uganda have not experienced dramatic changes in fractionalization over the years. On the other hand, although its overall fractionalization is quite low, Swaziland has experienced a steady increase in heterogeneity, while diversity in Tanzania and the Democratic Republic of Congo has actually declined. Thus, there might be profound political and societal differences between these countries, not only concerning their current ethnic fractionalization but also concerning how fast the current levels of ethnic diversity were reached.

Dataset robustness check
To test the robustness of the HIEF dataset, three new datasets are created that add some noise to the original data. This procedure is adapted from Kolo [21]. The four datasets should not differ in a significant way. As described in detail in Kolo [21], the noise data are created by employing normal randomization, namely by replacing the original group size with a new size produced by a normally distributed random variable. This way, two alternative datasets have been created. Dataset sigma_1 uses the standard deviation of the group distribution over all observations, which is thus equal for all countries, while dataset sigma_2 uses a country-specific standard deviation. As a final robustness test, a third dataset, smaller, is created in which the smallest group for each country and each year is removed. It should be noted, however, that the group is only removed if the number of groups in a country in a given year is greater than one and the size of the smallest group is below 1 percent. Pearson correlations between the original HIEF dataset and the three noisy datasets are all very high (sigma_1 r = 0.982; sigma_2 r = 0.974; smaller r = 1.000), confirming high congruency. Moreover, there were no statistically significant differences between the original HIEF dataset and the three noisy ones (sigma_1 t(17568) = -0.186, p = 0.852; sigma_2 t(17568) = -1.411, p = 0.158; smaller t(17568) = -0.062, p = 0.949). Figure 3 shows the values of the three noisy datasets plotted against the HIEF original data.
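The logic of this check can be sketched on synthetic data. The code below is a simplified illustration under several assumptions and is not the script from the repository: the noise is centred on the original group size, a single fixed standard deviation (a hypothetical one percentage point) stands in for the sigma_1 and sigma_2 definitions, group sizes are renormalised before the index is computed, and the panel of group sizes is randomly generated.

```python
import math
import random


def ef_index(sizes):
    """1 - sum of squared shares; sizes are renormalised (an assumption)."""
    total = sum(sizes)
    return 1.0 - sum((s / total) ** 2 for s in sizes)


def add_normal_noise(sizes, sigma, rng):
    """Replace each size by a normal draw centred on it (assumed form).
    A tiny positive floor keeps every perturbed size above zero."""
    return [max(rng.gauss(s, sigma), 1e-9) for s in sizes]


def drop_smallest(sizes):
    """Remove the smallest group only if there is more than one group
    and the smallest group is below 1 percent."""
    smallest = min(sizes)
    if len(sizes) > 1 and smallest < 1.0:
        trimmed = list(sizes)
        trimmed.remove(smallest)
        return trimmed
    return list(sizes)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Synthetic panel: 200 country-years with 1 to 6 groups, sizes in percent.
rng = random.Random(0)
panel = []
for _ in range(200):
    raw = [rng.random() + 0.01 for _ in range(rng.randint(1, 6))]
    panel.append([100.0 * r / sum(raw) for r in raw])

original = [ef_index(groups) for groups in panel]
noisy = [ef_index(add_normal_noise(groups, 1.0, rng)) for groups in panel]
trimmed = [ef_index(drop_smallest(groups)) for groups in panel]

r_noisy = pearson(original, noisy)
r_trimmed = pearson(original, trimmed)
```

A correlation close to 1 between the original and perturbed indices indicates that small errors in the recorded group sizes would not materially change the resulting estimates.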

Conclusion
The aim of this article has been to describe the new Historical Index of Ethnic Fractionalization (HIEF) dataset and the procedures used for its calculation and, finally, to illustrate the importance of considering historical developments in ethnic fractionalization. Focusing on country-year estimates for the period 1945-2013, the HIEF dataset complements existing ethnic fractionalization indices, which do not take into consideration the variation of ethnic fractionalization over time. This is an important advancement, as the variation in ethnic heterogeneity over time might be relevant for the effects of ethnic fractionalization on diverse social, economic, and political outcomes. Many studies have concluded that ethnic diversity has a negative impact on economic development [14], macroeconomic stability [2], social trust [26], quality of governance [22], and democracy [23], among others. However, I argue that there may be value in rethinking the assumption that ethnic diversity is relatively time-invariant. Changes in heterogeneity might play a role in affecting the relationship between ethnic diversity and social, economic, and political outcomes. Looking at long-term effects and (rapid or slow) changes in ethnic diversity over time can help us to advance knowledge about peaceful co-existence in ethnically diverse societies. For instance, it might help social scientists evaluate the under-explored hypothesis that while people usually react negatively to more diversity, in the long run, ethnic diversity can prove beneficial [26].

Supplementary information and material
Supplementary material is provided in the form of data and open source Python scripts. The data used in this study are archived on Harvard Dataverse [12]. A GitHub repository [13] contains the Python script that has been used to generate the HIEF dataset and its noisy versions for data robustness check. Instructions on how to run the script are available in the Readme file contained in the repository.