Critical Reflections on Cinema Belgica: The Database for New Cinema History in Belgium

New Cinema History broadened film studies by emphasising the complexity of cinema as a multifaceted phenomenon that includes the socio-economic context in which films were made, circulated, shown and received. As part of the digital turn, the discipline adopted computational methods and created quantitative research data to research this socio-economic context at scale. However, not all datasets created in this context adhere to FAIR principles, decreasing their reusability. By reconciling 14 cinema-related datasets, Cinema Belgica facilitates research into the history of Belgian cinema. This research paper documents and critically reflects on the choices made when selecting, modelling and reconciling information for the Cinema Belgica database.

(1) INTRODUCTION
Cinema Belgica (CB) is an online database that integrates 14 existing datasets covering Belgian film production, distribution, exhibition, censorship and reception. In order to facilitate (inter)national data exchange and comparative research into cinema history, the platform (a) integrates key datasets related to Belgian cinema history, (b) makes them accessible according to FAIR principles (Wilkinson et al., 2016), and (c) enriches them with heritage collections. CB is Belgium's contribution to the growing network of historical cinema databases, such as Cinema Context and The German Early Cinema Database (Dibbets, 2010; Garncarz, 2014). To explain the historiographical value of the contributed datasets and the selection criteria underlying them, the first section of this article discusses the field of New Cinema History (NCH). Section two provides the relevant metadata about the database discussed in this paper. The creation of the database is thoroughly discussed in the third section, with an emphasis on corpus building, data modelling and reconciliation methods. The scalability of CB's approach and the re-use potential of the aggregated data are discussed in the final part of this article.

(1.1) NEW CINEMA HISTORY AND THE RENEWAL OF FILM HISTORIOGRAPHY
In their groundbreaking historiographical reflection on film history, Robert C. Allen and Douglas Gomery (1993, p. 68) argued that the approach to cinema has long been focused on the aesthetic, ideological and representational qualities of films, whereas "economic, technological, and cultural aspects of film history are subordinate to the establishment of a canon of enduring cinematic classics." Since then, film scholars have made a concerted effort to resolve this issue, mainly by paying closer attention to the socio-cultural and economic contexts in which films are produced, circulated, exhibited and received. Identifying themselves as "New Cinema Historians", a new generation of film scholars aims to comprehend the complexity of cinema as a wider phenomenon that is influenced by various spheres of life: "[cinema] involves a specific place (the cinema as exhibition and a physical venue); a space (an imaginary and socially embedded version of this site); an industry (of production, distribution, exhibition and circulation); an experience (cinemagoing as a sensory and imaginative practice); and even a way of life (in which people act, talk, play or think 'cinematically' - comme du cinéma - in everyday life)" (Biltereyst, Maltby & Meers, 2019, p. 2).
Although not entirely new (Cressey, 1938; Elsaesser, 1986; Mayer, 1948), these arguments for viewing and analysing cinema as a broad socio-cultural and economic phenomenon began to emerge more clearly on the research agenda of film/cinema historians at the end of the 1990s and the beginning of the 2000s (Kuhn, 1999, 2002; Maltby et al., 2007; Maltby & Stokes, 2004; Stacey, 1994; Stokes & Maltby, 1999a, 1999b, 2001). New cinema historians ushered in a shift from film to cinema history, addressing topics such as an audience's cinemagoing memories, engagement with films, and relationship to film venues, not just as places where films are shown, but also as social spaces embedded within local communities and neighbourhoods' memories (Kuhn, Biltereyst & Meers, 2017). In addition to this audience-centred shift in film historiography, NCH was also greatly influenced by the spatial turn, which placed a strong emphasis on issues related to the location of movie theatres, the spatial characteristics of their interiors and ambiance, and the significance of space and place in interpreting moviegoing experiences and memories. This trend of researching historical film audiences (Egan, Smith & Terrill, 2022) invites the use of a wide range of data, methods, and theoretical foundations. In an effort to understand the historical audience, researchers perform quantitative analyses of box-office revenues (Sedgwick, 2011); they use corporate reports or other recordings and testimonies on the audience coming from the industry (Sullivan, 2010); they turn to film programme analyses in order to understand what cinemagoers viewed in which venues at what kinds of locations (Biltereyst et al., 2011); they examine letters and other traces left by historical film fans; fan literature and movie magazines are now examined in different directions, also by using digitised film magazines or newspapers (Biltereyst & Van de Vijver, 2020); and they use questionnaires or interview older cinemagoers
(Taylor, 1989) more mature, inter- and multidisciplinary field. It now makes use of perspectives that were previously frequently ignored within the field of film studies, such as ethnographic research (Richards, 2003; Taylor, 1989), memory studies (e.g., Kuhn, Biltereyst & Meers, 2017), social geography and urban studies (Biltereyst & Van de Vijver, 2016), oral history methods (Treveri Gennari, Dibeltulo, Hipkins & O'Rawe, 2018) or the digital humanities (Acland & Hoyt, 2016; Noordegraaf et al., 2021; Verhoeven, 2012).
In general, NCH scholars share the ambition to create open-access corpora to study the socio-economic context of cinema at scale (e.g., quantitative research on box office figures, seat count, and screen count; see Sedgwick, 2000, 2011). Major historical or longitudinal datasets on film programming issues (van Oort & Noordegraaf, 2020) as well as on the spatiality of cinemas (Biltereyst, van Oort & Meers, 2019; Hallam & Roberts, 2014) have recently been created. These initiatives can be supported by the concept of linked data, as it facilitates the interlinking and semantic querying of (semi-)structured data from diverse source types. Specifically, it can assist NCH's contextual approach as it gives researchers the ability to query multiple variables while supporting data interoperability and distributed content creation (Noordegraaf, Lotze & Boter, 2018; van Wissen et al., 2021). In this context, the Cinema Belgica database was created.

Belgian film control
A first set of projects and datasets focused on censorship practices in Belgium from the founding of the Filmkeuringscommissie, the Belgian Film Control Board, in 1921. At the beginning of the 20th century, Belgium established a system of voluntary censorship, a unique phenomenon in the world (Biltereyst, 2020). The "Vandervelde Act" (1920) prohibited film screenings for children (-16 years) without the Board's approval. Although distributors were not required by law to submit their films, business considerations forced the industry to comply with the Board's demands. The FWO-project "Forbidden Images" (2003-2006; see Depauw & Biltereyst, 2008) supplied the Control Board's decisions on age ratings for particular film titles, reasons for cutting or refusing films, as well as detailed information on production and distribution companies. The decisions made by the Control Board between 2004 and 2010 were added as part of Tim De Canck's master thesis (De Canck, 2013). The primary source is the Board's archives, which are supplemented by additional sources such as trade journals and other archival material.

Cinema venues
A second set focuses on cinemas and other film screening venues in Belgium. The FWO-project "The Enlightened City" (2005-2008; see Biltereyst & Meers, 2007) provided a general database of exhibition venues in Flanders and Brussels for the period 1924 to 2000. This regional data was enriched with in-depth data on three major cinema cities: Antwerp, Ghent and Brussels.
1952, 1962, 1972, 1982, 1992 yes yes A, D, DC, EC, PC, V C, WS CR, PC, RY, S, TI, V

Cinema programming Ghent 5609 entries on screening weeks Ghent 1933-36, 1945, 1952, 1962, 1972

Noordegraaf, 2020). The CB model is built around eight main entities (see Figure 1):
1. "Programme" refers to the weekly screening schedule of a (travelling) cinema and includes one or more programme items, usually films. A programme is dated via "programme_date", which contains the start and end day of the weekly programme. Both days are based on the dates mentioned in a historical source, which are also retained.
2. "Film" connects the original title, release year, external identifiers (IMDb, Wikidata and/or Filmmagie), length and unit (meters, reels or minutes), and digitised film posters to the corresponding film entity. All language variations of the film's title are included via "mentioned_film_title". Films can be related to both "person" and "company" entities if a dataset includes information about people (e.g., actors) or businesses involved in the production or distribution of a film.
3. "Censorship" structures all information related to film censorship by the Belgian Film Control Board. This includes the censorship decision (e.g., 16+), motivations for the decision, a possible appeal against the decision and the date of the record. On a more detailed level, "censorship" links to information about cuts demanded by the censor, their description and categorisation. The censorship typology consists of 15 main categories (e.g., "crime") and 119 subcategories (e.g., "Burglary", "Arson" and "Smuggling") based on the work of the FWO-project "Forbidden Images" (2003-2006).
4. "Venue" organises information about the opening and closing year, status (e.g., centre cinema), type (e.g., travelling cinematographer), ideological characteristics (e.g., catholic), and infrastructure (e.g., number of seats) of a cinema operation. A venue is linked to an "address", which contains information about the geographical location and architecture of the building where a venue was located. This conceptual distinction was necessary because of travelling cinemas and venues that relocated their operations to another address. A venue can be associated with both individuals and businesses. This includes the function these entities had in relation to the venue, as well as the duration of that service.
5. "Person" stores all natural persons that are associated with a specific venue, company and/or the production or distribution of a film.
6. "Company" stores all legal entities that are associated with a specific venue and/or the production or distribution of a film.
7. "Archive_item" lists all data references to unpublished historical sources.
8. "Publication" lists all data references to bibliographical, published sources.
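To make the relations between some of these entities more concrete, the following Python sketch models a subset of them as dataclasses. It is an illustration only: apart from the names quoted in the list above ("programme_date", "mentioned_film_title"), all class names, field names and example values are assumptions and do not reproduce the actual CB schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Film:
    original_title: str
    release_year: Optional[int] = None
    imdb_id: Optional[str] = None        # external identifiers may be missing
    wikidata_id: Optional[str] = None
    # All language variants of the title as they appear in the sources.
    mentioned_film_titles: list[str] = field(default_factory=list)


@dataclass
class Address:
    street: str
    city: str


@dataclass
class Venue:
    name: str
    seats: Optional[int] = None
    # Kept separate from the venue: a venue can relocate or travel.
    addresses: list[Address] = field(default_factory=list)


@dataclass
class Programme:
    venue: Venue
    start_day: str            # hypothetical encoding of "programme_date"
    end_day: str
    source_date_text: str     # dating exactly as mentioned in the source
    items: list[Film] = field(default_factory=list)


# Invented example values, for illustration only.
film = Film("The Example Film", 1950, mentioned_film_titles=["De Voorbeeldfilm"])
venue = Venue("Example Cinema", seats=400,
              addresses=[Address("Example Street 1", "Example City")])
programme = Programme(venue, "1952-01-04", "1952-01-10",
                      "week of 4 January 1952", items=[film])
```

Keeping "Address" as its own class mirrors the distinction drawn in item 4: the venue is the cinema operation, while the address is the place where it operated at a given time.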

(3.3) RECONCILIATION AND ENRICHMENT
Corpus building and data modelling were followed by a process of exploratory data analysis (EDA). The goal of EDA is to maximise insight into the quality of the data, usually through graphical representations of the data. Each selected dataset was profiled along these steps:
1. Data type identification

Data rejection
In the first step, the data type and meaning of each variable were determined heuristically. Questions regarding the interpretation of variables were clarified by consulting the data contributors or the research from which the data originated. A recurring issue was that the meaning of similarly named variables differed between datasets. Columns that reference the variable "movie year" are an example of this, as the term can refer to either the production year or the release year of a film. Both concepts require additional clarification as, for example, the production year can be defined by different events (e.g., first draft of script, start of filming).
In general, the issue of concept variation was addressed by examining the frequency of a concept's interpretation across the datasets. The most common definition was preferred if it could be adequately mapped to the data model. In this particular case, films were most often dated by their release year (i.e., the earliest public release of a film anywhere in the world).
Data collections were further analysed in OpenRefine, a data transformation tool to visualise and manipulate data (Verborgh & De Wilde, 2013). The faceting functionality of OpenRefine was used to filter on subsets of the data and analyse how data columns could relate to the main entity types of the data model. Columns with numerical data types (i.e., year, box office figures, number of seats) were represented in histograms to determine any outliers. During this process, CB encountered both data entry errors and statistical outliers. The former were removed or cleaned, while the latter had to remain in the data. However, confirming that every statistical outlier was a real data point and not a data input error was not possible within the infrastructure project's timeline as data in the corpus was often based on archival sources, which would require considerable time and expert knowledge to consult.
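As a scripted counterpart to inspecting histograms, the sketch below flags candidate outliers in a numerical column. It uses Tukey's interquartile-range rule, which is a substitute chosen for this illustration and not the procedure CB applied in OpenRefine; the seat counts are invented.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Invented seat counts; 12000 could be a typo or a genuine extreme value.
seats = [250, 300, 320, 400, 450, 500, 520, 600, 12000]
print(flag_outliers(seats))  # [12000]
```

A flagged value is only a candidate: as noted above, deciding whether it is a data entry error or a genuine data point requires going back to the archival source.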
With regard to textual data, data entry inconsistencies (e.g., misspellings, capitalisation and pluralisation differences) were dealt with on a case-by-case basis by using the token- and character-based clustering methods available in OpenRefine. A compromise was found between harmonising errors and a more hands-off approach. Columns with literal source information (e.g., mentioned film title) were not modified in order to accurately reflect the original source. Secondary columns that were not based on the original source but added by the researcher (e.g., IMDb id) were cleaned.
The EDA's final phase involved a selection procedure in which parts of the datasets were rejected for integration with CB. This could be due to a variety of factors, including the data's quality, relevance, copyright or incompatibility with the data model. Certain datasets, for example, included information from IMDb (e.g., film genre). This information was removed as a precautionary measure to avoid infringing on any potential copyright. Wherever possible, this material was replaced with publicly available data from Wikidata during the enrichment phase.
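The token-based clustering mentioned above can be approximated in a few lines of Python. The sketch follows the general idea of a "fingerprint" key-collision method (normalise, tokenise, sort, compare keys); it is an approximation written for this illustration, not OpenRefine's own code, and the venue names are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key: lowercase, ASCII-folded, punctuation-free, sorted tokens."""
    text = value.strip().lower()
    # Fold accented characters to their ASCII base letter (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct spellings that share the same fingerprint key."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = ["Ciné Palace", "cine palace", "Palace, Ciné", "Cinema Rex"]  # invented
print(cluster(names))  # [['Ciné Palace', 'Palace, Ciné', 'cine palace']]
```

In line with the approach described above, such clusters would only be suggestions for case-by-case review, and columns holding literal source text would be left untouched.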
Creating the database required reconciliation of the research datasets. This was a time-consuming process as some data collections did not use external identifiers for films, venues, persons and/or companies. Film reconciliation was challenging when the datasets contained only the French, Dutch and/or German translated film titles. These film title translations often included language variations in order to adapt to the local distribution context (Gabrić et al., 2021; Peña Cervel, 2016). This made it unfeasible to match based on automatic retranslations of those titles. To solve this, a methodology was developed using Python that matched and identified films with their IMDb and Wikidata identifiers based on multiple film variables. For matching with IMDb, imdb-sqlite 0.1.3 was used to import relevant IMDb data into a separate SQLite database. Additional indexed columns were added to make the matching more performant. The programme datasets were matched with their IMDb identifier based on the following parameters:
1. mentioned titles (3), actors (2)
2. mentioned titles (3), actors (1)
3. mentioned titles (1), actors (1)
The number in parentheses refers to the number of values that were included for matching. If no adequate match was found during the first step, fewer values were used. This raised the frequency of matches while decreasing their accuracy (i.e., more false positives). Because programming datasets describe year of screening rather than release year, year information could not be used as a parameter. Implementing margins based on screening year was attempted but abandoned in the definitive matching process since (a) narrow margins excluded (much) older films that were rescreened and (b) the margins introduced false positives through films that coincidentally matched on year. For datasets where more information was available, the following parameters were used:

mentioned title, directors
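The stepwise matching on mentioned titles and actors described above can be sketched as follows. This is a minimal illustration under stated assumptions: the two-table schema is hypothetical and is not the schema produced by imdb-sqlite, the sample rows are invented, and the reading that every included value must match (so that fewer values loosen the match) is an interpretation of the text.

```python
import sqlite3

# (titles used, actors used) per step, from strictest to loosest.
TIERS = [(3, 2), (3, 1), (1, 1)]


def build_toy_db():
    """In-memory stand-in for the imported IMDb data (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE alt_titles (film_id TEXT, title TEXT);
        CREATE TABLE cast_members (film_id TEXT, name TEXT);
        -- Indexes on the lookup columns keep repeated matching fast.
        CREATE INDEX idx_alt_title ON alt_titles (title);
        CREATE INDEX idx_cast_name ON cast_members (name);
        """
    )
    conn.executemany("INSERT INTO alt_titles VALUES (?, ?)", [
        ("f1", "Le Film Exemple"), ("f1", "De Voorbeeldfilm"),
        ("f1", "Der Beispielfilm"), ("f2", "Le Film Exemple"),
    ])
    conn.executemany("INSERT INTO cast_members VALUES (?, ?)", [
        ("f1", "Actor One"), ("f1", "Actor Two"), ("f2", "Actor One"),
    ])
    return conn


def films_matching_all(conn, table, column, values):
    """Return the ids of films that match every value in `values`."""
    ids = None
    for value in values:
        rows = conn.execute(
            f"SELECT film_id FROM {table} WHERE {column} = ?", (value,))
        found = {row[0] for row in rows}
        ids = found if ids is None else ids & found
    return ids or set()


def suggest_matches(conn, titles, actors):
    """Try each tier in turn; return candidate ids and the tier that produced them."""
    for n_titles, n_actors in TIERS:
        by_title = films_matching_all(conn, "alt_titles", "title", titles[:n_titles])
        by_actor = films_matching_all(conn, "cast_members", "name", actors[:n_actors])
        hits = by_title & by_actor
        if hits:
            return sorted(hits), (n_titles, n_actors)
    return [], None


conn = build_toy_db()
strict = suggest_matches(
    conn, ["Le Film Exemple", "De Voorbeeldfilm", "Der Beispielfilm"],
    ["Actor One", "Actor Two"])
loose = suggest_matches(conn, ["Le Film Exemple", "Titre Inconnu"], ["Actor One"])
print(strict)  # (['f1'], (3, 2))
print(loose)   # (['f1', 'f2'], (1, 1))
```

The second call shows the trade-off noted above: once only one title and one actor are used, a candidate is found, but it is no longer unique and needs review.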
A similar method with the same parameters was used to match with Wikidata. However, due to a lack of historical film metadata on Wikidata and/or IMDb, it was not feasible to properly match and identify film entries in a fully automated way. For the vast majority of the datasets, suggested matches were manually reviewed and confirmed using specialised film databases such as Cinema Context, The German Early Cinema Database and Ciné-Ressources. To avoid misidentification, films were only matched if either (a) the original title matched or (b) their mentioned title matched, plus one other property (e.g., year of release). This was a viable method for feature films but proved inadequate for short films or newsreels, as these are less documented. As a result, records for short films and newsreels could only be matched internally and assigned Cinema Belgica identifiers. Wikidata and IMDb ids were added when feature films could be matched. Linking with Wikidata made it possible to import publicly available metadata to improve the matching process (see Figure 2).
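The acceptance rule stated above, (a) original title matches or (b) mentioned title matches plus one other property, can be written as a small predicate. The record layout, the choice of release year and director as the "other" properties, and the sample values are assumptions made for this sketch; in CB the decision was taken during manual review.

```python
def _norm(text):
    """Compare titles and names case-insensitively, ignoring outer whitespace."""
    return text.strip().casefold()


def accept_match(record, candidate):
    """True if (a) original titles agree, or (b) a mentioned title agrees
    and at least one other property (here: release year or a director) agrees."""
    original = record.get("original_title")
    if original and _norm(original) == _norm(candidate["original_title"]):
        return True
    known_titles = {_norm(candidate["original_title"])}
    known_titles.update(_norm(t) for t in candidate.get("alt_titles", []))
    title_ok = any(_norm(t) in known_titles for t in record.get("mentioned_titles", []))
    year = record.get("release_year")
    year_ok = year is not None and year == candidate.get("release_year")
    directors = {_norm(d) for d in record.get("directors", [])}
    director_ok = bool(directors & {_norm(d) for d in candidate.get("directors", [])})
    return title_ok and (year_ok or director_ok)


# Invented example values.
candidate = {"original_title": "The Example Film", "alt_titles": ["De Voorbeeldfilm"],
             "release_year": 1950, "directors": ["Director One"]}
record = {"original_title": None, "mentioned_titles": ["De Voorbeeldfilm"],
          "release_year": 1950, "directors": []}
print(accept_match(record, candidate))  # True
```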
Trustworthiness of the data has been an important concern throughout the design process of the CB database. In general, trust issues can relate to (1) data omissions and bias, (2) naming and classification, and (3) certainty and precision (Davis, Vane & Kräutli, 2021). With data derived from historical sources, users often struggle to understand the researcher's process of interpretation. CB accounts for this by providing data provenance, referring both to the research that produced the datasets and the primary historical source material when possible. Additionally, CB ran a test case in which film programming was linked to a digitised version of the primary sources to increase the quality of the data provenance (see Figure 3). For the historical newspaper "Vooruit", digitised cut-outs were created and linked using the IIIF Image API. As a result, the CB data platform supplies thumbnails for film listings and programming advertisements for the following venues:
• Leopold (1945-1950)
• Vooruit (1946-1951)
Additionally, datasets were linked to digitised film heritage, such as film posters, reviews and photos. Movies were matched to Ghent University Library's and Ghent Archives' collections of historical film posters. Respectively, 215 and 788 movies were linked with their poster, providing additional metadata for the study of their historical and cultural context (e.g., R. F. Allen, 1994; Hu, 2018). Due to copyright reasons, film posters from Ghent University Library are only accessible within the Ghent University network. CB also partnered with KADOC (Documentation and Research Centre on Religion, Culture and Society) to link the datasets with the Filmmagie database.
Between the 1930s and 2000s, the Belgian Catholic Film League collected press cuttings for films released in the country. This effort produced a documentary database of approximately 60,000 film dossiers. These dossiers are of great importance for evaluating the reception of films in Belgium as they include press releases, reviews and first images. Approximately one fifth of this collection has been linked with CB.
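The linking of newspaper cut-outs through the IIIF Image API, mentioned earlier in this section, relies on URLs that encode the requested region and size of an image. The sketch below builds such a URL following the pattern defined by the IIIF Image API (identifier, region, size, rotation, quality and format); the server address, image identifier and pixel coordinates are hypothetical and do not refer to CB's actual infrastructure.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region, width, fmt="jpg"):
    """Build an IIIF Image API URL for a rectangular cut-out.

    region is (x, y, w, h) in pixels of the full image; the result is
    scaled to `width` pixels wide, unrotated, in default quality.
    """
    x, y, w, h = region
    image_id = quote(identifier, safe="")  # identifiers must be URL-encoded
    return f"{base_url.rstrip('/')}/{image_id}/{x},{y},{w},{h}/{width},/0/default.{fmt}"


# Hypothetical server and page identifier, for illustration only.
url = iiif_image_url("https://iiif.example.org/image", "newspaper-page-0004",
                     region=(1200, 800, 600, 400), width=300)
print(url)
# https://iiif.example.org/image/newspaper-page-0004/1200,800,600,400/300,/0/default.jpg
```

Because the region is part of the URL, a programme record only needs to store the page identifier and four coordinates to point at the exact listing or advertisement it was derived from.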
(4) RESULTS AND DISCUSSION
(4.1) LIMITATIONS
This section highlights shortcomings in CB's reconciliation process and data model so that users can account for them in their analyses. Several challenges arose during the data integration process, mainly due to a lack of documentation on the research data entry methods. CB addressed concerns about the interpretation of sources, the consistency of variables, and the accuracy of data. Solving these problems necessitated the specialised knowledge of the original data producers. However, as previously mentioned, data harmonisation had to be addressed pragmatically, which resulted in some data being omitted. Learning from these experiences, CB is promoting its data cleaning and modelling practices to students and academics to improve the adoption of future datasets.
Other issues stemmed from the model's inability to account for the complexities of the (semi-)structured cinema data. One major aspect was the difficulty of representing the corporate system that encompasses film production, distribution and exhibition. CB retains the information that the contributing datasets provided on these companies, but does not attempt to systematically reconcile these entities across the datasets as this would require substantial understanding of the organisations' hierarchical structures (e.g., mergers and acquisitions). Furthermore, accounting for temporal elements of an entity (for example, the time a distribution firm was a subsidiary of a parent corporation) would require an extension of the data model. This could prove a worthwhile endeavour as it would facilitate present and future research on distribution practices and strategies in Belgium (e.g., Engelen, 2016; Vande Winkel, 2017). It was, however, outside the scope of the current project.
Other problems related to the reconciliation of differing concepts between the datasets. The issues associated with combining film dates were already covered in section 3. The same holds true for the concepts of "production company" and "country of production", which had several definitions depending on the research subject. These instances relate to broader epistemological concerns regarding the use of upper-level ontologies in the humanities (Drucker,

(4.2) RE-USE POTENTIAL AND SCALABILITY
To indicate potential areas of research that can be supported by the database, data stories were developed (Biltereyst, 2021; Sedgwick, 2021; Soberon, 2021; Sproten, 2021; Van Oort & Noordegraaf, 2021; Willems, 2021). These texts focus on specific parts of Belgian cinema history, contextualising data available on the database and proposing ideas for future research. Sproten (2020), for example, explores the prospects of using East Belgian film programming data to investigate the role of cinema in Ostbelgien's cultural history, a region where Francophone, Flemish and German film cultural influences overlapped and intertwined. Here, the author asks whether, in addition to the regular mainstream film distribution networks, more alternative distribution networks developed for language minority groups. In a similar fashion, Biltereyst (2021) discusses Hollywood's distribution practices and strategies in Belgium using data available on CB.
However, users of the CB database should be aware that the datasets were created in different research contexts. As a result, the granularity and content vary depending on the historical period, place and/or topic at hand. This raises the question of whether expanding CB's approach could address these concerns. Manual data collection is time-intensive and is typically the result of case-specific studies. The inclusion of these studies increases the granularity of CB's data but only marginally improves its fragmented nature. Computational approaches to data collection may be more effective for scaling the database. The Dutch DIGIFIL project (Kisjes et al., 2020) demonstrated how cinema screening data can be automatically extracted from digitised historical newspapers. In a recent study (Noordegraaf et al., forthcoming), the CB data and DIGIFIL data on cinemas' geographical locations, infrastructure and programming strategies were used to explore similarities and differences in the location and profile of cinemas for three sample years (1952, 1962, 1972) in Amsterdam in the Netherlands and Antwerp in Belgium.
These datasets include all known screenings in all cinemas in both cities for the years 1952, 1962 and 1972. The analysis of the data was primarily conducted with Python scripts in a Jupyter notebook for retrieving subsets of the data, performing simple ranking and counting operations on them and generating visualisations of the results. The visualisations served the heuristic purpose of identifying the role of the variable in the typology of the cinemas in both cities in the three sample years. In addition, the authors performed a cluster analysis (using the k-means clustering algorithm) to see to what extent the qualitatively obtained insights in reading the tables and charts of individual variables coincided with a more quantitative, bottom-up approach. The ambition was to see if such an approach provides new ways of identifying clusters of cinemas that share a set of characteristics which may or may not be in line with existing classifications.
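To illustrate what such a cluster analysis involves, the sketch below runs k-means on two variables per cinema. Only the choice of the k-means algorithm comes from the study described above; the variables (seat count and average run order), the six invented cinemas, the min-max scaling and the plain-Python implementation are assumptions chosen to keep the example dependency-free.

```python
import math
import random


def min_max_scale(rows):
    """Rescale every column to [0, 1] so no single variable dominates the distance."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(row, lows, spans)) for row in rows]


def kmeans(points, k, iterations=100, seed=0):
    """Assign each point to one of k clusters; returns one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels


# Invented cinemas: (number of seats, average run order of the films shown).
cinemas = [(1800, 1.2), (1500, 1.5), (2000, 1.0), (400, 4.0), (550, 4.5), (300, 3.8)]
labels = kmeans(min_max_scale(cinemas), k=2)
print(labels)
```

On this toy input the large first-run houses and the small later-run houses end up in separate clusters. With real data the clusters are less clean, which is precisely what makes them useful for questioning existing typologies.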
The results of the analysis invite a rethinking of the type 'premiere theatre', which in both cities proved to be more diverse than expected: in the clusters of cinemas that show films in the first run, 'grand picture palaces' are found next to theatres with a more niche programming, such as arthouse and sex cinemas. In addition, the cluster analysis showed there is no clear correlation between run order and seating capacity, and contextual research confirmed that there were also smaller theatres that showed films in the first run. Moreover, the centre-periphery distinction is not as clear-cut as may be expected; in particular in Antwerp we find cinemas that show films in a higher run order in the city centre, while premiere theatres with many seats showing films in the first or second run are to be found in the districts too. This invites a rethinking of location in relation to programming strategies. Considering the frequency and age of the films shown, in addition to cinemas that show only new films or, by contrast, films from

Figure 2. Cinema Belgica links to publicly available linked data for matched films. The film page for "42nd Street (1933)" is provided with thumbnails by Wikidata.

Table 1
Overview (**) Events: C/censorship practices, PS/premieres and other types of screenings, WS/weekly screenings, V/various. (***) Other: BO/box office data, CR/censorship reasons, PC/production countries, RY/release year, S/seat numbers, TI/technical information exhibition, V/various.

(3.2) DATA MODEL
It is important to consider that data modelling practices and tools employed in the integration process have an effect on knowledge representation (Flanders & Jannidis, 2020). CB used a modified version of the Cinema Context data model for integrating the datasets (van Oort &

Table 2
(van Wissen et al., 2021)). In order to effectively integrate datasets, LOD promotes the use of interoperable concepts. However, the likelihood of losing a more nuanced perspective of the data (one that more accurately reflects historical reality) increases with the generalisation of these concepts. The current constraints of the CB data model demonstrate the need for additional work towards a domain ontology that accommodates all features of NCH research data. The recently published RDF version of the CC data model could be used as a basis for such an ontology (van Wissen et al., 2021).