(1) Introduction

Cinema Belgica (CB)1 is an online database that integrates 14 existing datasets covering Belgian film production, distribution, exhibition, censorship and reception. In order to facilitate (inter)national data exchange and comparative research into cinema history, the platform (a) integrates key datasets related to Belgian cinema history, (b) makes them accessible according to FAIR principles (Wilkinson et al., 2016), and (c) enriches them with heritage collections. CB is Belgium’s contribution to the growing network of historical cinema databases, such as Cinema Context2 and The German Early Cinema Database3 (Dibbets, 2010; Garncarz, 2014). To explain the historiographical value of the contributed datasets and the selection criteria underlying them, the first section of this article discusses the field of New Cinema History (NCH). Section two provides the relevant metadata about the database discussed in this paper. The creation of the database is thoroughly discussed in the third section, with an emphasis on corpus building, data modelling and reconciliation methods. The scalability of CB’s approach and the re-use potential of the aggregated data are discussed in the final part of this article.

(1.1) New Cinema History and the renewal of film historiography

In their groundbreaking historiographical reflection on film history, Robert C. Allen and Douglas Gomery (1993, p. 68) argued that the approach to cinema has long been focused on the aesthetical, ideological and representational qualities of films, whereas “economic, technological, and cultural aspects of film history are subordinate to the establishment of a canon of enduring cinematic classics.” Since then, film scholars have made a concerted effort to resolve this issue, mainly by paying closer attention to the socio-cultural and economic contexts in which films are produced, circulated, exhibited and received. Identifying themselves as “New Cinema Historians”, a new generation of film scholars aims to comprehend the complexity of cinema as a wider phenomenon that is influenced by various spheres of life:

“[cinema] involves a specific place (the cinema as exhibition and a physical venue); a space (an imaginary and socially embedded version of this site); an industry (of production, distribution, exhibition and circulation); an experience (cinemagoing as a sensory and imaginative practice); and even a way of life (in which people act, talk, play or think ‘cinematically’ – comme du cinéma – in everyday life)” (Biltereyst, Maltby & Meers, 2019, p. 2).

Although not entirely new (Cressey, 1938; Elsaesser, 1986; Mayer, 1948), these arguments for viewing and analysing cinema as a broad socio-cultural and economic phenomenon began to emerge more clearly on the research agenda of film/cinema historians at the end of the 1990s and the beginning of the 2000s (Kuhn, 1999, 2002; Maltby et al., 2007; Maltby & Stokes, 2004; Stacey, 1994; Stokes & Maltby, 1999a, 1999b, 2001). New cinema historians ushered in a shift from film to cinema history, addressing topics such as an audience’s cinemagoing memories, engagement with films, and relationship to film venues, not just as places where films are shown, but also as social spaces embedded within local communities and neighbourhoods’ memories (Kuhn, Biltereyst & Meers, 2017). In addition to this audience-centred shift in film historiography, NCH was also greatly influenced by the spatial turn, which placed a strong emphasis on issues related to the location of movie theatres, the spatial characteristics of their interiors and ambiance, and the significance of space and place in interpreting moviegoing experiences and memories.

This trend of researching historical film audiences (Egan, Smith & Terrill, 2022) invites the use of a wide range of data, methods, and theoretical foundations. In an effort to understand the historical audience, researchers perform quantitative analyses of box-office revenues (Sedgwick, 2011); they use corporate reports or other recordings and testimonies on the audience coming from the industry (Sullivan, 2010); they turn to film programme analyses in order to understand what cinemagoers viewed in which venues at what kinds of locations (Biltereyst et al., 2011); they examine letters and other traces left by historical film fans; they study fan literature and movie magazines from different angles, also by using digitised film magazines or newspapers (Biltereyst & Van de Vijver, 2020); and they use questionnaires or interview older cinemagoers (Taylor, 1989). In recent years, NCH has developed into a more mature, inter- and multidisciplinary field. It now makes use of perspectives that were previously frequently ignored within the field of film studies, such as ethnographic research (Richards, 2003; Taylor, 1989), memory studies (e.g., Kuhn, Biltereyst & Meers, 2017), social geography and urban studies (Biltereyst & Van de Vijver, 2016), oral history methods (Treveri Gennari, Dibeltulo, Hipkins & O’Rawe, 2018) or the digital humanities (Acland & Hoyt, 2016; Noordegraaf et al., 2021; Verhoeven, 2012).

In general, NCH scholars share the ambition to create open-access corpora to study the socio-economic context of cinema at scale (e.g., quantitative research on box office figures, seat count, and screen count; see Sedgwick, 2000, 2011). Major historical or longitudinal datasets on film programming issues (van Oort & Noordegraaf, 2020) as well as on the spatiality of cinemas (Biltereyst, van Oort & Meers, 2019; Hallam & Roberts, 2014) have recently been created. These initiatives can be supported by the concept of linked data, as it facilitates the interlinking and semantic querying of (semi-)structured data from diverse source types. Specifically, it can assist NCH’s contextual approach, as it gives researchers the ability to query multiple variables while supporting data interoperability and distributed content creation (Noordegraaf, Lotze & Boter, 2018; van Wissen et al., 2021). In this context, the Cinema Belgica database was created.

(2) Dataset description

Object name

The Cinema Belgica Database.

Format names and versions

CSV

Creation dates

2003–2022

Dataset creators

Bram Van Beek: investigation.

Daniel Biltereyst: investigation.

Gertjan Willems: investigation.

Kathleen Lotze: investigation.

Khaël Velders: investigation.

Liesbet Depauw: investigation.

Liesbeth Van de Vijver: investigation.

Miro Burgelman: investigation.

Tim De Canck: investigation.

Vitus Sproten: investigation.

Amandine Warrens: data curation.

Emmanuel Vermeire: data curation.

Jari Goerlandt: data curation.

Marthe Waegeman: data curation.

Nicolas Franck: data curation.

Pieterjan De Potter: data curation.

Salma Mediavilla Aboulaoula: data curation.

Sara Van Den Berghe: data curation.

Tamar Cachet: data curation.

Vincent Ducatteeuw: data curation.

Language

Dutch, French, German, English

License

Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).

Repository name

Harvard Dataverse

Publication date

2022–08–25

(3) Method

(3.1) Corpus building

The NCH approach was supported in Belgium through the HoMER4 network (History of Moviegoing, Exhibition, and Reception), which was founded in 2004 by a group of film scholars to promote collaborative cinema historical research. NCH and the network focused on understanding cinema as a broad socio-cultural phenomenon, considering aspects of the industry that produced, distributed, promoted and exhibited films (Biltereyst, Maltby & Meers, 2012; Biltereyst, Maltby & Meers, 2019; Maltby, Biltereyst & Meers, 2011; Treveri Gennari et al., 2018). Besides standard historical research methods (e.g., archival work, document analysis), these scholars started building datasets on various aspects of Belgium’s historical cinema culture. This resulted in a series of research projects and datasets which are relatively heterogeneous in terms of research context, aims, funding and organisation, but which together provide a unique mapping of Belgian cinema history. The Cinema Belgica database was created to bring together these research data. Table 1 provides an overview of the contributed datasets, which can be distinguished into four types.

Table 1

Overview of the contributed datasets integrated and uploaded to the Cinema Belgica database.


ORIGINAL DATASET | ENTRIES (N) | GEOGRAPHY | PERIOD | FILMS | CINEMAS | PEOPLE/COMPANIES * | EVENTS ** | OTHER ***
Film control inventory 1 | 10077 entries on decisions | Belgium | 1922–2003 | yes | no | A, D, DC, EC, PC | C | PC, RY, V
Film control cuttings | 9777 entries on decisions | Belgium | 1922–1992 | yes | no | A, D, DC, EC, PC | C | CR, PC, RY, V
Film control inventory 2 | 2923 entries on decisions | Belgium | 2004–2010 | yes | no | A, D, DC, EC, PC | C | PC, RY, V
Film control refused movies | 304 entries on decisions | Belgium | 2004–2009 | yes | no | A, D, DC, EC, PC | C | CR, PC, RY, V
Cinema exhibition Flanders | 35822 entries on venues | Flanders | 1924–2000 | no | yes | EC, V | / | S, TI, V
Cinema exhibition Antwerp | 3547 entries on venues | Antwerp | 1902–2007 | no | yes | EC, V | / | S, TI, V
Cinema exhibition Ghent | 12420 entries on venues | Ghent | 1896–2009 | no | yes | EC, V | / | S, TI, V
Cinema exhibition Brussels | 9360 entries on venues | Brussels | 1900–1994 | no | yes | EC, V | / | S, TI, V
Cinema programming Antwerp | 5951 entries on screening weeks | Antwerp | 1952, 1962, 1972, 1982, 1992 | yes | yes | A, D, DC, EC, PC, V | C, WS | CR, PC, RY, S, TI, V
Cinema programming Ghent | 5609 entries on screening weeks | Ghent | 1933–36, 1945, 1952, 1962, 1972 | yes | yes | A, D, DC, EC, PC, V | C, WS | CR, PC, RY, S, TI, V
Cinema programming Eupen-Malmedy | 3294 entries on screening weeks | Eupen-Malmedy | 1918–1949 | yes | yes | A, D, EC, PC, V | WS | PC, RY, V
Cinema programming and box office: Capitole | 986 entries on weeks | Ghent | 1953–71 | yes | yes | A, D, DC, EC, PC, V | C, PS, WS | BO, CR, PC, RY, S, TI, V
Cinema programming and box office: Vooruit | 228 entries on weeks | Ghent | 1946–51 | yes | yes | A, D, DC, EC, PC, V | C, PS, WS | BO, CR, PC, RY, S, TI, V
Cinema programming and box office: Leopold | 889 entries on weeks | Ghent | 1945–54 | yes | yes | A, D, DC, EC, PC, V | C, PS, WS | BO, CR, PC, RY, S, TI, V
Film releases Belgium (Filmmagie) | 10172 entries on films released in Belgium | Belgium | 1902–2015 | yes | no | / | / | PC, RY, V
Fiction film production Belgium | 1369 entries on films produced by Belgium | Belgium | 2000–2019 | yes | no | D, PC, V | / | PC, RY, V

(*) People/Companies: A/actors, D/directors, DC/distribution companies or people, EC/exhibition companies or people, PC/production companies or people, V/various others.

(**) Events: C/censorship practices, PS/premieres and other type of screenings, WS/weekly screenings, V/various.

(***) Other: BO/box office data, CR/censorship reasons, PC/production countries, RY/release year, S/seat numbers, TI/technical information exhibition, V/various.

  1. Belgian film control
    A first set of projects and datasets focused on censorship practices in Belgium from the founding of the Filmkeuringscommissie, the Belgian Film Control Board, in 1921. At the beginning of the 20th century, Belgium established a system of voluntary censorship, a unique phenomenon in the world (Biltereyst, 2020). The “Vandervelde Act” (1920) prohibited film screenings for children (–16 years) without the Board’s approval. Although distributors were not required by law to submit their films, business considerations forced the industry to comply with the Board’s demands. The FWO-project “Forbidden Images” (2003–2006; see Depauw & Biltereyst, 2008) supplied the Control Board’s decisions on age ratings for particular film titles, reasons for cutting or refusing films, as well as detailed information on production and distribution companies. The decisions made by the Control Board between 2004 and 2010 were added as part of Tim De Canck’s master thesis (De Canck, 2013). The primary source is the Board’s archives, which are supplemented by additional sources such as trade journals and other archival material.
  2. Cinema venues
    A second set focuses on cinemas and other film screening venues in Belgium. The FWO-project “The Enlightened City” (2005–2008; see Biltereyst & Meers, 2007) provided a general database of exhibition venues in Flanders and Brussels for the period 1924 to 2000. This regional data was enriched with in-depth data on three major cinema cities: Antwerp, Ghent and Brussels. Information for Antwerp and Ghent was based on two follow-up research projects: Antwerpen Kinemastad (2009–2012; see Lotze, 2020) and Gent Kinemastad (2009–2012; see Van de Vijver, 2011). The dataset on Brussels is based on a publicly funded inventory and additional sources from trade and other journals. These files provide unique information on the location of film theatres, including names, addresses, and, if available, information on the number of seats, as well as the names of people, corporations, and organisations such as distributors and exhibitors. Different sources were employed in these projects, including company archives, trade journals, newspapers, and various sorts of archival information.
  3. Film programming
    A third collection centres on the movies that were actually shown in cinemas, more precisely in Antwerp, Ghent and the East-Belgian region of Eupen-Malmedy. Due to the vast number of venues, films and weekly changing programmes, the aforementioned projects Antwerpen Kinemastad and Gent Kinemastad created in-depth cases for sample years per decade (1933–1992). For Antwerp, citywide film programming data was gathered for 1952, 1962, 1972, 1982 and 1992. Eight sample years are available for Ghent: 1933–1936, 1945, 1952, 1962 and 1972. For Eupen-Malmedy, Vitus Sproten compiled film programming information for 1918 to 1949. These longitudinal datasets include data on cinemas, film titles, runtimes, genres, production, distribution, and exhibition firms, directors, actors, and data on official and Catholic film control. They were created by using newspaper film listings (see Table 2), trade and other publications, and other archive material such as company archives. More in-depth information, including box office data, is available for three key cinemas in Ghent. This includes the socialist “Vooruit” for 1946 to 1951 (Burgelman & Deneckere, 2010); the main premiere cinema “Capitole” for 1953 to 1971 (Velders, 2011); and the sex cinema “Leopold” from 1946 to 1954 (Biltereyst, 2018).
  4. Belgian films
    A fourth collection contains two datasets concerning the release and production of (Belgian) films. The Belgian Catholic Film League gathered press cuttings for films released in the country between the 1930s and the 2000s. This initiative resulted in the creation of a documentary database with roughly 60,000 film dossiers. Additionally, the database created in the context of the doctoral research of Bram Van Beek (2019–2023) compiles Belgian films released between 2000 and 2020.

Table 2

Overview of newspapers used as a source for film programming information for specific cities/regions. Frequency refers to how often a newspaper is referenced as a source for a programme item.


FREQUENCY | NEWSPAPER TITLE | REGION
11767 | Gazet van Antwerpen | Antwerp
10541 | De Gentenaar | Ghent
9560 | Vooruit | Ghent
831 | Grenz-Echo | Eupen-Malmedy
439 | La Semaine | Eupen-Malmedy
410 | Korespondenzblatt Eupen | Eupen-Malmedy
230 | Die Fliegende Taube | Eupen-Malmedy
161 | St. Vither Volkszeitung | Eupen-Malmedy
71 | Eupener Zeitung | Eupen-Malmedy
48 | Het Laatste Nieuws | Antwerp
3 | Eupener Nachrichten | Eupen-Malmedy

(3.2) Data model

It is important to consider that data modelling practices and tools employed in the integration process have an effect on knowledge representation (Flanders & Jannidis, 2020). CB used a modified version of the Cinema Context data model for integrating the datasets (van Oort & Noordegraaf, 2020). The CB model is built around eight main entities (see Figure 1):

  1. “Programme” refers to the weekly screening schedule of a (travelling) cinema and includes one or more programme items – usually films. A programme is dated by “programme_date”, which contains the start and end day of the weekly programme. Both days are based on the dates mentioned in a historical source, which are also kept.
  2. “Film” connects the original title, release year, external identifiers (IMDb, Wikidata and/or Filmmagie), length and unit (meters, reels or minutes), and digitised film posters to the corresponding film entity. All language variations of the film’s title are included via “mentioned_film_title”. Films can be related to both “person” and “company” entities if a dataset includes information about people (e.g., actors) or businesses involved in the production or distribution of a film.
  3. “Censorship” structures all information related to film censorship by the Belgian Film Control Board. This includes the censorship decision (e.g., 16+), motivations for the decision, possible appeal to the decision and date of the record. On a more detailed level, “censorship” links to information about cuts demanded by the censor, their description and categorisation. The censorship typology consists of 15 main categories (e.g., “crime”) and 119 subcategories (e.g., “Burglary”, “Arson” and “Smuggling”) based on the work of the FWO-project “Forbidden Images” (2003–2006).
  4. “Venue” organises information about the opening and closing year, status (e.g., centre cinema), type (e.g., travelling cinematographer), ideological characteristics (e.g., Catholic), and infrastructure (e.g., number of seats) of a cinema operation. A venue is linked to an “address”, which contains information about the geographical location and architecture of the building where a venue was located. This conceptual difference was necessary due to the phenomenon of travelling cinemas and the issue of venues that relocated their operations to another address. A venue can be associated with both individuals and businesses. This includes the function these entities had in relation to the venue, as well as the duration of that service.
  5. “Person” stores all natural persons that are associated with a specific venue, company and/or production or distribution of a film.
  6. “Company” stores all legal entities that are associated with a specific venue and/or the production or distribution of a film.
  7. “Archive_item” lists all data references to unpublished historical sources.
  8. “Publication” lists all data references to bibliographical, published sources.
Simplified Entity Relationship Diagram of the Cinema Belgica data model
Figure 1 

Simplified Entity Relationship Diagram (ERD) of the Cinema Belgica data model. The six entities highlighted in red (programme, film, censorship, venue, person and company) and two entities marked in blue (archive_item and publication) form the logical organisation of the deposited files. Most relationships from the archive_item and publication entities have been left out to prevent cluttering the diagram as these are connected to all other (highlighted) entities.
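
As a rough illustration of how some of these entities relate, the sketch below expresses four of them as Python dataclasses. It is not the CB schema: only the entity names and the notions of programme dates, mentioned film titles and a venue/address split come from the description above, while every field name, type and example value is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Address:
    # Location of a building; field names are assumptions.
    street: str
    city: str


@dataclass
class Venue:
    name: str
    # A list, because a venue may relocate or (for travelling cinemas) move around.
    addresses: List[Address] = field(default_factory=list)
    opening_year: Optional[int] = None
    closing_year: Optional[int] = None
    seats: Optional[int] = None


@dataclass
class Film:
    original_title: str
    release_year: Optional[int] = None
    imdb_id: Optional[str] = None
    wikidata_id: Optional[str] = None
    # Title variants exactly as they appear in the sources.
    mentioned_film_titles: List[str] = field(default_factory=list)


@dataclass
class Programme:
    venue: Venue
    start: date  # first day of the weekly programme
    end: date    # last day of the weekly programme
    items: List[Film] = field(default_factory=list)


# Hypothetical record; all values are invented for illustration only.
venue = Venue(name="Example Cinema",
              addresses=[Address("Example Street 1", "Ghent")],
              seats=500)
film = Film(original_title="Example Film",
            mentioned_film_titles=["Voorbeeldfilm", "Film Exemple"])
programme = Programme(venue=venue, start=date(1952, 1, 4),
                      end=date(1952, 1, 10), items=[film])
```

Keeping the address separate from the venue mirrors the conceptual distinction described above; persons, companies, censorship records and source references are omitted here for brevity.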

(3.3) Reconciliation and enrichment

Corpus building and data modelling were followed by a process of exploratory data analysis (EDA). The goal of EDA is to maximise insight into the quality of the data, usually through graphical representations of the data. Each selected dataset was profiled along these steps:

  1. Data type identification
  2. Graphical data representation
  3. Outlier determination
  4. Data rejection

In the first step, the data type and meaning of each variable were determined heuristically. Questions regarding the interpretation of variables were clarified by consulting the data contributors or the research from which the data originated. A recurring issue was that the meaning of similarly named variables differed between datasets. Columns that reference the variable “movie year” are an example of this, as the term can refer to either the production year or the release year of a film. Both concepts require additional clarification as, for example, the production year can be defined by different events (e.g., first draft of script, start of filming). In general, the issue of concept variation was addressed by examining the frequency of a concept’s interpretation across the datasets. The most common definition was preferred if it could be adequately mapped to the data model. In this particular case, films were most often dated by their release year (i.e., the earliest public release of a film anywhere in the world).

Data collections were further analysed in OpenRefine, a data transformation tool to visualise and manipulate data (Verborgh & De Wilde, 2013). The faceting functionality of OpenRefine was used to filter on subsets of the data and analyse how data columns could relate to the main entity types of the data model. Columns with numerical data types (i.e., year, box office figures, number of seats) were represented in histograms to determine any outliers. During this process, CB encountered both data entry errors and statistical outliers. The former were removed or cleaned, while the latter had to remain in the data. However, confirming that every statistical outlier was a real data point and not a data input error was not possible within the infrastructure project’s timeline as data in the corpus was often based on archival sources, which would require considerable time and expert knowledge to consult.
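
The histogram and outlier steps can be approximated outside OpenRefine as well. The following is a minimal sketch, assuming a hypothetical list of seat counts: the fixed-width binning stands in for OpenRefine's numeric facet, and the 1.5 × IQR rule is a common convention chosen for this example, not a procedure documented for CB. Flagged values are only candidates for manual review, since a statistical outlier may be a genuine data point.

```python
from collections import Counter
from statistics import quantiles


def histogram(values, bin_width):
    """Count values per bin; each bin is keyed by its lower bound."""
    return Counter((v // bin_width) * bin_width for v in values)


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical seat counts; 12000 could be a typing error or a real value.
seats = [250, 300, 320, 400, 450, 500, 520, 600, 650, 12000]

bins = histogram(seats, bin_width=500)
candidates = iqr_outliers(seats)
```
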

With regard to textual data, data entry inconsistencies (e.g., misspellings, capitalisation and pluralisation differences) were dealt with on a case-by-case basis by using the token- and character-based clustering methods available in OpenRefine. A compromise was found between harmonising errors and a more hands-off approach. Columns with literal source information (e.g., mentioned film title) were not modified in order to accurately reflect the original source. Secondary columns that were not based on the original source but added by the researcher (e.g., IMDb id) were cleaned. The EDA’s final phase involved a selection procedure in which parts of the datasets were rejected for integration with CB. This could be due to a variety of factors, including the data’s quality, relevance, copyright or incompatibility with the data model. Certain datasets, for example, included information from IMDb (e.g., film genre). This information was removed as a precautionary measure to avoid infringing on any potential copyright. Wherever possible, this material was replaced with publicly available data from Wikidata during the enrichment phase.
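
To give an idea of what token-based clustering does, the sketch below approximates the "fingerprint" key-collision method that OpenRefine offers: values are normalised to a key, and values sharing a key are proposed as one cluster. The normalisation steps follow OpenRefine's general description but are simplified, and the example values are invented spelling variants, so this should be read as an illustration and not as the tool's exact behaviour.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalised key: case, accents, punctuation and word order are ignored."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalent.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, remove duplicates, sort, and join again.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep only groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical variants of a source title as they might appear in a column.
variants = ["Gazet van Antwerpen", "gazet van antwerpen ",
            "Antwerpen, Gazet van", "De Gentenaar"]
clusters = cluster(variants)
```

Each proposed cluster would still need a human decision on whether, and to which spelling, the values should be merged, in line with the case-by-case approach described above.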

Creating the database required reconciliation of the research datasets. This was a time-consuming process as some data collections did not use external identifiers for films, venues, persons and/or companies. Film reconciliation was challenging when the datasets contained only the French, Dutch and/or German translated film titles. These film title translations often included language variations in order to adapt to the local distribution context (Gabrić et al., 2021; Peña Cervel, 2016). This made it unfeasible to match based on automatic retranslations of those titles. To solve this, a methodology was developed using Python that matched and identified films with their IMDb and Wikidata identifiers based on multiple film variables. For matching with IMDb, imdb-sqlite (version 0.1.3)5 was used to import relevant IMDb data into a separate SQLite database. Additional indexed columns were added to speed up the matching. The programme datasets were matched with their IMDb identifier based on the following parameters:

  1. mentioned titles (3), actors (2)
  2. mentioned titles (3), actors (1)
  3. mentioned titles (2), actors (2)
  4. mentioned titles (2), actors (1)
  5. mentioned titles (1), actors (2)
  6. mentioned titles (1), actors (1)

The number in parentheses refers to the number of values that were included for matching. If no adequate match was found during the first step, fewer values were used. This raised the frequency of matches while decreasing their accuracy (i.e., more false positives). Because programming datasets describe year of screening rather than release year, year information could not be used as a parameter. Implementing margins based on screening year was attempted but abandoned in the definitive matching process, since (a) narrow margins excluded (much) older films that were rescreened and (b) the margins added false positives due to films matching in year. For datasets where more information was available, the following parameters were used:

  1. original title, mentioned title, directors, release year
  2. original title, directors, release year
  3. mentioned title, directors, release year
  4. original title, mentioned title, release year
  5. original title, release year
  6. mentioned title, release year
  7. original title, mentioned title, directors
  8. original title, directors
  9. mentioned title, directors
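
The stepwise relaxation used for the programme datasets can be sketched as a simple cascade. The example below is a simplified illustration and not the project's code: it works on a hypothetical in-memory catalogue instead of the SQLite database, compares values by exact match after lowercasing, requires every included value to be present, and treats a step as successful only when exactly one candidate remains. All of these choices, as well as the record layout and identifiers, are assumptions.

```python
# (number of mentioned titles, number of actors), from strictest to loosest.
STEPS = [(3, 2), (3, 1), (2, 2), (2, 1), (1, 2), (1, 1)]


def norm(value):
    return value.strip().lower()


def match_film(record, catalogue):
    """Return (candidate id, step used), or (None, None) if no step yields a unique hit."""
    titles = [norm(t) for t in record["mentioned_titles"]]
    actors = [norm(a) for a in record["actors"]]
    for n_titles, n_actors in STEPS:
        if len(titles) < n_titles or len(actors) < n_actors:
            continue  # the record does not hold enough values for this step
        wanted_titles = set(titles[:n_titles])
        wanted_actors = set(actors[:n_actors])
        hits = [c["id"] for c in catalogue
                if wanted_titles <= c["titles"] and wanted_actors <= c["actors"]]
        if len(hits) == 1:
            return hits[0], (n_titles, n_actors)
    return None, None


# Hypothetical catalogue with normalised values; identifiers are placeholders.
catalogue = [
    {"id": "hypothetical-1",
     "titles": {"example film", "film exemple", "voorbeeldfilm"},
     "actors": {"actor a", "actor b"}},
    {"id": "hypothetical-2",
     "titles": {"example film"},
     "actors": {"actor c"}},
]
record = {"mentioned_titles": ["Film Exemple", "Voorbeeldfilm", "Unknown Variant"],
          "actors": ["Actor A", "Actor B"]}
result = match_film(record, catalogue)
```

The second parameter list could be handled by the same loop with a different list of steps (for instance, combinations of original title, mentioned title, directors and release year). Recording which step produced a match makes it easier to prioritise looser matches for manual review.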

A similar method with the same parameters was used to match with Wikidata. However, due to a lack of historical film metadata on Wikidata and/or IMDb, it was not feasible to properly match and identify film entries in a fully automated way. For the vast majority of the datasets, suggested matches were manually reviewed and confirmed using specialised film databases such as Cinema Context, The German Early Cinema Database and Ciné-Ressources. To avoid misidentification, films were only matched if either (a) the original title matched or (b) their mentioned title matched, plus one other property (e.g., year of release). This was a viable method for feature films but proved inadequate for short films or newsreels, as these are less documented. As a result, records for short films and newsreels could only be matched internally and assigned Cinema Belgica identifiers. Wikidata and IMDb identifiers were added when feature films could be matched. Linking with Wikidata made it possible to import publicly available metadata to improve the matching process (see Figure 2).

Webpage for the film "42nd Street (1933)" providing metadata and images
Figure 2 

Cinema Belgica links to publicly available linked data for matched films. The film page for “42nd Street (1933)” is provided with thumbnails by Wikidata.

Trustworthiness of the data has been an important concern throughout the design process of the CB database. In general, trust issues can relate to (1) data omissions and bias, (2) naming and classification, and (3) certainty and precision (Davis, Vane & Kräutli, 2021). With data derived from historical sources, users often struggle to understand the researcher’s process of interpretation. CB accounts for this by providing data provenance, referring both to the research that produced the datasets and the primary historical source material when possible. Additionally, CB ran a test case in which film programming was linked to a digitised version of the primary sources to increase the quality of the data provenance (see Figure 3). For the historical newspaper “Vooruit”, digitised cut-outs were created and linked using the IIIF image API. As a result, the CB data platform supplies thumbnails for film listings and programming advertisements for the following venues:

  • Leopold (1945–1950)
  • Vooruit (1946–1951)

Film programming webpage with thumbnail image of and link to the advertisement in the historical newspaper
Figure 3 

Thumbnail of a digitised advertisement that was the source for a programme item on Cinema Belgica. The IIIF image API links to the scan of the digitised newspaper at Ghent University Library.
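
For readers unfamiliar with the IIIF Image API, a cut-out thumbnail is requested by encoding the region and size in the image URL, following the pattern identifier/region/size/rotation/quality.format. The sketch below builds such a URL. The base URL, identifier and pixel coordinates are hypothetical, and the source does not state which version of the API is served, so the defaults used here ("max" size, "default" quality) are assumptions.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Compose a IIIF Image API request URL from its path components."""
    return (f"{base.rstrip('/')}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server, page identifier and advertisement coordinates (x,y,w,h in pixels).
thumbnail_url = iiif_image_url(
    base="https://iiif.example.org/image/",
    identifier="vooruit-1946-01-05-p3",
    region="1200,800,600,900",
    size="200,",  # scale the cut-out to a width of 200 pixels
)
```

Because the region is part of the URL, a programme item only needs to store the page identifier and the coordinates of the advertisement; no separate image file has to be created for the cut-out.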

Additionally, datasets were linked to digitised film heritage, such as film posters, reviews and photos. Movies were matched to Ghent University Library’s and Ghent Archives’ collections of historical film posters. Respectively, 215 and 788 movies were linked with their poster, providing additional metadata for the study of their historical and cultural context (e.g., R. F. Allen, 1994; Hu, 2018). Due to copyright reasons, film posters from Ghent University Library are only accessible within the Ghent University network. CB also partnered with KADOC (Documentation and Research Centre on Religion, Culture and Society) to link the datasets with the Filmmagie database.6 Between the 1930s and 2000s, the Belgian Catholic Film League collected press cuttings for films released in the country. This effort produced a documentary database of approximately 60,000 film dossiers. These dossiers are of great importance for evaluating the reception of films in Belgium as they include press releases, reviews and first images. Approximately one fifth of this collection has been linked with CB.

(4) Results and discussion

(4.1) Limitations

This section highlights shortcomings in CB’s reconciliation process and data model so that users can account for them in their analyses. Several challenges arose during the data integration process, mainly due to a lack of documentation on the research data entry methods. CB addressed concerns about the interpretation of sources, the consistency of variables, and the accuracy of data. Solving these problems necessitated the specialised knowledge of the original data producers. However, as previously mentioned, data harmonisation had to be addressed pragmatically, which resulted in some data being omitted. Learning from these experiences, CB is promoting its data cleaning and modelling practices to students and academics to improve the adoption of future datasets.7 Other issues stemmed from the model’s inability to account for the complexities of the (semi-)structured cinema data. One major difficulty was representing the corporate system that encompasses film production, distribution and exhibition. CB retains the information that the contributing datasets provided on these companies, but does not attempt to systematically reconcile these entities across the datasets, as this would require substantial understanding of the organisations’ hierarchical structures (e.g., mergers and acquisitions). Furthermore, accounting for temporal elements of an entity – for example, the time a distribution firm was a subsidiary of a parent corporation – would require an extension of the data model. This could prove a worthwhile endeavour as it would facilitate present and future research on distribution practices and strategies in Belgium (e.g., Engelen, 2016; Vande Winkel, 2017). It was, however, outside the scope of the current project.

Other problems related to the reconciliation of differing concepts between the datasets. The issues associated with combining film dates were already covered in section 3.3. The same holds true for the concepts of “production company” and “country of production”, which had several definitions depending on the research subject. These instances relate to broader epistemological concerns regarding the use of upper-level ontologies in the humanities (Drucker, 2012; Hacıgüzeller et al., 2021). In order to effectively integrate datasets, linked open data (LOD) promotes the use of interoperable concepts. However, the likelihood of losing a more nuanced perspective of the data – one that more accurately reflects historical reality – increases with the generalisation of these concepts. The current constraints of the CB data model demonstrate the need for additional work towards a domain ontology that accommodates all features of NCH research data. The recently published RDF version of the Cinema Context (CC) data model could be used as a basis for such an ontology (van Wissen et al., 2021).

(4.2) Re-use potential and scalability

To indicate potential areas of research that can be supported by the database, data stories were developed (Biltereyst, 2021; Sedgwick, 2021; Soberon, 2021; Sproten, 2021; Van Oort & Noordegraaf, 2021; Willems, 2021).8 These texts focus on specific parts of Belgian cinema history, contextualising data available on the database and proposing ideas for future research. Sproten (2020), for example, explores the prospects of using East Belgian film programming data to investigate the role of cinema in Ostbelgien’s cultural history, a region where Francophone, Flemish and German film cultural influences overlapped and intertwined. Here, the author asks whether the data could be used to explore if, in addition to the regular mainstream film distribution networks, alternative distribution networks developed for language minority groups. In a similar fashion, Biltereyst (2021) discusses Hollywood’s distribution practices and strategies in Belgium using data available on CB.

However, users of the CB database should be aware that the datasets were created in different research contexts. As a result, the granularity and content vary depending on the historical period, place and/or topic at hand. This raises the question of whether expanding CB’s approach could address these concerns. Manual data collection is time-intensive and is typically the result of case-specific studies. The inclusion of these studies increases the granularity of CB’s data but only marginally reduces its fragmentation. Computational approaches to data collection may be more effective for scaling the database. The Dutch DIGIFIL project (Kisjes et al., 2020) demonstrated how cinema screening data can be automatically extracted from digitised historical newspapers. In a recent study (Noordegraaf et al., forthcoming), the CB data and DIGIFIL data on cinemas’ geographical locations, infrastructure and programming strategies were used to explore similarities and differences in the location and profile of cinemas in Amsterdam in the Netherlands and Antwerp in Belgium for three sample years (1952, 1962 and 1972). These datasets include all known screenings in all cinemas in both cities for those years. The analysis was primarily conducted with Python scripts in a Jupyter notebook, which retrieved subsets of the data, performed simple ranking and counting operations on them and generated visualisations of the results.9 The visualisations served the heuristic purpose of identifying the role of each variable in the typology of the cinemas in both cities in the three sample years. In addition, the authors performed a cluster analysis (using the k-means clustering algorithm) to see to what extent the insights obtained qualitatively from reading the tables and charts of individual variables coincided with a more quantitative, bottom-up approach. The ambition was to see if such an approach provides new ways of identifying clusters of cinemas that share a set of characteristics, which may or may not be in line with existing classifications.
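To make the clustering step concrete, the following is a minimal sketch of a k-means analysis on cinema-level variables. It is not the authors' notebook: the cinema names, the three variables (seating capacity, average run order, average film age) and all values are invented for illustration, and the deterministic farthest-first initialisation is a simplification chosen here so that the example is reproducible. The published study may have used different features, preprocessing and software.

```python
from collections import Counter

# Hypothetical toy records: (cinema, seats, average run order, average film age in weeks).
# All names and values are invented; they are NOT taken from CB or DIGIFIL.
CINEMAS = [
    ("Palace A", 1800, 1.0, 2.0),
    ("Palace B", 1500, 1.2, 3.0),
    ("Arthouse C", 300, 1.1, 4.0),
    ("District D", 600, 3.5, 40.0),
    ("District E", 500, 4.0, 55.0),
    ("District F", 450, 3.8, 60.0),
]


def standardise(rows):
    """Rescale each column to zero mean and unit variance so that seats do not dominate."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first(points, k):
    """Deterministic initialisation (an assumption of this sketch, not standard random init):
    start from the first point, then repeatedly add the point farthest from the chosen ones."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return [list(c) for c in centroids]


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        if new_labels == labels:  # assignments no longer change: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


features = standardise([row[1:] for row in CINEMAS])
labels = kmeans(features, k=2)

# Simple counting step: how many cinemas fall into each cluster.
cluster_sizes = Counter(labels)
for (name, *_), label in zip(CINEMAS, labels):
    print(f"{name}: cluster {label}")
print(dict(cluster_sizes))
```

On this toy input the small, hypothetical arthouse is grouped with the two large first-run venues, because run order and film age outweigh seating capacity once the variables are standardised. This only illustrates how a bottom-up clustering can cut across a typology based on a single variable; it says nothing about the actual Amsterdam and Antwerp data.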

The results of the analysis invite a rethinking of the type ‘premiere theatre’, which in both cities proved to be more diverse than expected: in the clusters of cinemas that show films in the first run, ‘grand picture palaces’ are found next to theatres with more niche programming, such as arthouse and sex cinemas. In addition, the cluster analysis showed there is no clear correlation between run order and seating capacity, and contextual research confirmed that there were also smaller theatres that showed films in the first run. Moreover, the centre-periphery distinction is not as clear-cut as may be expected; in Antwerp in particular, we find cinemas in the city centre that show films in a higher run order, while premiere theatres with many seats showing films in the first or second run are to be found in the districts too. This invites a rethinking of location in relation to programming strategies. Considering the frequency and age of the films shown, in addition to cinemas that show only new films or, by contrast, films from a broad variety of ages, we also found hybrid cinema types which combined the screening of new films with (much) older ones and which extended the duration of these screenings to the extreme (up to 33 weeks for the new titles).

(5) Conclusion

The Cinema Belgica database is an important resource for the study of Belgian cinema history. On a primary level, its content may be used as an authoritative encyclopaedia of Belgian cinema history, with historical source references provided where possible. The database facilitates longitudinal and/or comparative analyses of cinema exhibition data (Noordegraaf et al., forthcoming). The database, however, has limitations, particularly for large-scale statistical analysis. These constraints, as previously stated, originate from the fragmentary nature of the reconciled data, as well as the decisions taken in modelling and reconciling. Future work will focus on developing computational methods for data collection and a more nuanced ontology for New Cinema History research data that can reduce these biases.