1. Context and Motivation

In the Social Sciences and Humanities (SSH) disciplines, interdisciplinary teams often need to integrate data from various formats and models, which can be complex, particularly in terms of data modeling. This intricacy is evident in the integration of differently structured bibliographic and citation data.

OpenCitations () is an independent, non-profit infrastructure organization for open scholarship, dedicated to the collection, curation, management, and publication of citation and bibliographic data. Some of the infrastructure’s distinctive traits include treating scholarly citations as first-class data entities (each with its own persistent identifier) and offering comprehensive, freely accessible global citation data, with semantic interoperability achieved through the adoption of Semantic Web technologies (). Furthermore, OpenCitations provides inclusive, transparent, and interoperable scholarly citation services, driven by open principles under academia-based governance (). The data it provides are collected from different sources, reshaped according to the OpenCitations Data Model (OCDM) (; ), and exposed using Linked Data technologies (), with the ultimate goal of providing free access to data in highly interoperable formats, such as RDF, Scholix (), and CSV.

In this paper, we illustrate the process of producing and updating OpenCitations data collections following the ingestion of an Anglo-Japanese dataset provided by the Japan Link Center (JaLC). The aim is to test the flexibility of a language-agnostic methodology for the management of multilingual or non-English datasets, arguing that this approach is functional for the safeguarding of the bibliodiversity of the acquired data, which represents a crucial aspect in cultural heritage preservation and humanities-oriented studies (), as also stressed in the Recommendation on Open Science by UNESCO ().

1.1 OpenCitations Datasets

OpenCitations currently maintains two primary datasets. The first is the OpenCitations Index, the unified collection of open citations ingested from different sources, which builds on the experience gained with its first index, COCI (), gathering citations from Crossref. The other is OpenCitations Meta (), a collection of bibliographic metadata for all the resources included in the OpenCitations Index as citing or cited entities.

To date, the information integrated into the OpenCitations infrastructure comes from Crossref (), DataCite (), PubMed (; ), OpenAIRE (; ), and – since November 2023 – the Japan Link Center (JaLC) ().

OpenCitations Meta data are available as CSV and RDF dumps. The stored metadata for the bibliographic resources include document identifiers, titles, authors, publication dates, venue information, volume and issue identifiers, page ranges, resource types, publishers, and editors. OpenCitations Index data, in turn, are made available in CSV, N-Triples, and Scholix (JSON) formats, exposing the Open Citation Identifier (OCI) () of each citation, the OpenCitations Meta Identifier (OMID, a persistent identifier for the entities included in OpenCitations Meta) of both entities involved in the citation, the citation creation date, the timespan between the publication dates of the citing and cited entities, and fields indicating whether both entities are published in the same journal and whether they share at least one author.

Both OpenCitations Index and Meta data are accompanied by provenance information (), which includes the responsible agent, the source URL, and the creation and modification dates of each record, as well as tracking changes to the data associated with an entity.

1.2 JaLC Data as a Prototype Testing Facility for a Language-agnostic Approach

This paper focuses on the process of metadata crosswalk () of multilingual data provided by JaLC to the OCDM. To this end, we implemented a particular workflow () to produce citation and bibliographic data.

The introduction of an Anglo-Japanese dataset marks OpenCitations’ first formal effort to ingest a largely non-English source, emphasizing the importance of handling multilingual data and promoting inclusivity and global knowledge dissemination.

We take this opportunity to present a methodology that covers data acquisition, curation, and the production of citation and bibliographic data, highlighting the benefits of a language-agnostic approach to data integration. Indeed, we claim that this strategy fosters the preservation of bibliodiversity, enhances scholarly research, and facilitates access to the knowledge provided by any data source, including multilingual ones.

Accordingly, the integration of the JaLC collection tests the management of multilingual datasets through a language-agnostic approach that prioritizes displaying data in the original language when available, to facilitate access and reuse by an international academic and research community.

2. Dataset Description

Object names, format names, versions, and repository location

JaLC citation data are integrated into the OpenCitations Index, and the bibliographic metadata provided by JaLC are published in OpenCitations Meta. The JaLC data are included in the latest dumps of OpenCitations Index – in CSV (), RDF (), and Scholix () formats – and of OpenCitations Meta, in CSV format ().

Creation dates

JaLC’s ingestion process took six months, from June 1, 2023, to the dataset production on November 29, 2023.

Dataset creators

The input data were supplied by the Japan Science and Technology Agency and then analyzed and processed using custom software components. Below is the list of students and researchers affiliated with the Research Centre for Open Scholarly Metadata and the Digital Humanities Advanced Research Center of the University of Bologna who took part in the creation of the OpenCitations datasets, together with their roles:

  • Marta Soricetti: Software;
  • Arianna Moretti: Data curation, Software;
  • Ivan Heibi: Data curation, Software;
  • Arcangelo Massari: Data curation, Software;
  • Silvio Peroni: Supervision, Project administration, Conceptualization;
  • Elia Rizzetto: Software.

Language

For communication and dissemination purposes, the datasets’ structure follows the OpenCitations Data Model, an English-based data model. However, since the OCDM has no restrictions concerning the language choice, metadata values are exposed in the original formulation when possible. For JaLC, the data include a combination of information in English, Japanese, and, occasionally, other languages.

License

In full compliance with FAIR principles () and the spirit of Open Science, the citation and bibliographic data produced from JaLC are released under a CC0 waiver.

Repository name

Dumps are released on OpenCitations’ Figshare page, and the related links are published at https://opencitations.net/download. In addition, the data are available through other services for programmatic access, listed on the OpenCitations website.

Publication date

The latest versions of the dumps of OpenCitations Index and OpenCitations Meta, which include JaLC data, were published on Figshare on 11 December 2023 and 30 November 2023 respectively.

3. Method

In this section, we introduce the methodology for generating citation and bibliographic data compliant with OCDM from an external source’s dataset, focusing on the extension of the OpenCitations software infrastructure to integrate data from JaLC. This case study offers the opportunity to introduce an updated version of the workflow – adopted for the first time for the JaLC data ingestion – and to elaborate on the measures taken for handling multilingual aspects.

3.1 Data Ingestion in OpenCitations

Originally, the data in Meta were derived exclusively from Crossref. However, between December 2022 and July 2023, data from PubMed, DataCite, and OpenAIRE were introduced, and the software extension activities for the integration of the JaLC dataset commenced at the same time. To facilitate software maintenance and establish a systematic approach to such expansions, a structural re-engineering of the ingestion and production workflow became necessary (Figure 1).

Figure 1 

Workflow for the ingestion of citation data and bibliographic metadata into the OpenCitations datasets.

As mentioned above, part of the methodology was formalized in a workflow () published on the Social Sciences & Humanities Open Marketplace (; ). While that workflow primarily focuses on citation data production, here we provide an overview of its application in the JaLC case study, presenting it in conjunction with the procedure for producing bibliographic metadata. Indeed, OpenCitations’ data integration methodology relies on the interconnection between the bibliographic and citation data creation processes.

3.1.1 Data Source Selection and Documentation Analysis

The Japan Link Center is a Japanese DOI registration agency, and the ingestion of its data into OpenCitations was planned under an agreement between the two parties. Accordingly, the JaLC technical team organized the data exposed via their API into a dataset in JSON format, allowing OpenCitations to reuse its previously developed software components.

The resulting dataset is organized as follows. A main ZIP archive contains the dataset directory, which stores a JSON file listing all the handled DOI prefixes and a series of ZIP archives. Each of these archives is named after a DOI prefix and contains a directory storing the JSON files of all the DOI entities having that prefix – thus, the number of files per directory is variable. Note that each file represents a single bibliographic entity, whose citation data are provided as part of its bibliographic metadata.
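For illustration, the nested structure can be traversed with a few lines of Python; the file names below are hypothetical, and only the layout follows the description above:

import io
import json
import zipfile

def iter_jalc_entities(main_zip_path):
    # Traverse the main archive: one inner ZIP archive per DOI prefix,
    # each holding one JSON file per bibliographic entity.
    with zipfile.ZipFile(main_zip_path) as main_zip:
        for name in main_zip.namelist():
            if not name.endswith(".zip"):
                continue  # e.g. the JSON file listing the handled DOI prefixes
            inner = io.BytesIO(main_zip.read(name))
            with zipfile.ZipFile(inner) as prefix_zip:
                for entity_name in prefix_zip.namelist():
                    if entity_name.endswith(".json"):
                        # One file = one bibliographic entity, whose citation
                        # data travel inside its bibliographic metadata.
                        yield json.loads(prefix_zip.read(entity_name))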

To gain a deep understanding of the source data model, we maintained an e-mail correspondence with the JaLC team to clarify any doubts and studied in detail the documentation, which is available both in Japanese and in English translation.

For the reproducibility of the process, it is crucial to state that the JaLC input dataset was not made publicly available in the form in which it was ingested by OpenCitations. Nevertheless, the data it contains can be freely retrieved by querying the API with the help of the provided documentation, and the bibliographic information can be restructured as exemplified in the sample data used for OpenCitations’ software tests. As a result of possible updates to the API source data, the dataset obtained in this way might contain more recent information than OpenCitations’.

3.1.2 Development of a Software Plug-in for Data Conversion

Although the current OpenCitations software infrastructure allows for the reuse of many general components common to the processing of all input datasets, two specific extensions have to be developed for each new data source. The first reads the structure of the source dataset, extracts the bibliographic entities’ data and citations, and produces the output files; the second performs the metadata crosswalk between the data model used by the source – in this case, JaLC – and OCDM. These components are developed as plugins of the OpenCitations software oc_ds_converter (), released under an ISC license.

Since JaLC is the DOI registration agency responsible for assigning DOIs to the citing entities, these identifiers were accepted as valid without further checks. However, no information was provided about the registration agencies that assigned DOIs to the cited entities. Thus, to avoid redundant API checks, we adopted a two-step validation process relying on an ad-hoc data storage system.

During the first iteration over the dataset, all citing DOIs are accepted as valid and stored in memory as such, and metadata CSV tables are produced for the citing entities only (Figure 2). During the second iteration, the cited identifiers are analyzed to verify their validity, and metadata and citation CSV tables are produced for the cited entities whose DOIs proved to be valid (Figure 3).

Figure 2 

Flowchart describing the preliminary processing of citing bibliographic entities.

Figure 3 

Flowchart describing the processing of cited bibliographic entities, their validation, and the production of metadata and citation tables.

The process leverages multiple storage solutions to prevent duplication and minimize external API calls. The validation pipeline includes the checks listed below (a simplified sketch in code follows the list):

  1. Search for the identifier in the temporary in-memory storage containing data concerning the current chunk of data being processed. If the identifier is among these data, it can be considered valid, since it was encountered and validated while processing the current data chunk.
  2. If the identifier was not found in (1), search for it in the main storage containing data concerning the whole dataset. Finding the DOI here implies its validity since it was detected and validated previously.
  3. If neither (1) nor (2) was successful, search for the identifier in the OpenCitations databases, containing a mapping between each identifier ever encountered while ingesting any dataset and its assigned OMID. If the DOI is found in there, it is considered valid.
  4. Use ID-schema-specific API services to check the validity of the ID if none of the previous attempts was successful.
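The four checks can be summarized in the following Python sketch; names and signatures are ours and do not correspond to the actual oc_ds_converter API:

def is_valid_id(identifier, chunk_storage, dataset_storage, oc_omid_map, api_validator):
    # (1) Validated while processing the current data chunk?
    if identifier in chunk_storage:
        return True
    # (2) Validated earlier during the ingestion of this dataset?
    if identifier in dataset_storage:
        return True
    # (3) Already known to OpenCitations, i.e. mapped to an OMID?
    if identifier in oc_omid_map:
        return True
    # (4) Last resort: the ID-schema-specific API service (e.g. a DOI resolver).
    if api_validator(identifier):
        chunk_storage.add(identifier)  # cache the result for the current chunk
        return True
    return False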

After each chunk of the dataset has been processed, the data tables for bibliographic and citation information are generated. These CSV tables serve as input for the two tools performing the data production tasks for Meta and Index. Simultaneously, the ID validation information in the temporary in-memory storage is transferred to the permanent storage, which collects all the valid identifiers encountered in the dataset up to that moment.

3.1.3 Production of Metadata and Citation Data Collections

Running the software extensions on the dataset provided by JaLC produces tables of bibliographic and citation data, which are used as input for the subsequent steps of the process.

The bibliographic entities CSV tables (Table 1) are used as input for the Meta software (), which curates the provided information and generates new data compliant with OCDM.

Table 1

Sample of Meta input tables produced by oc_ds_converter, storing bibliographic entities’ metadata.


| ID | TITLE | AUTHOR | PUB_DATE | VENUE | VOLUME | ISSUE | PAGE | TYPE | PUBLISHER | EDITOR |
|----|-------|--------|----------|-------|--------|-------|------|------|-----------|--------|
| DOI: 10.14825/kaseki.68.0_14 | 本邦産白亜紀アンモナイトデータベースおよび種多様性について | 利光, 誠一; 平野, 弘道; 松本, 崇; 高橋, 一晴 | 2000 | 化石 [issn:0022-9202 issn:2424-2632 jid:kaseki] | 68 | 0 | 14–16 | journal article | 日本古生物学会 | |
| DOI: 10.1126/science.235.4793.1156 | Chronology of fluctuating sea levels since the Triassic | | 1987 | Science | 235 | | 1156–1167 | | | |

The DOI-to-DOI citation CSV tables (Table 2) serve as input for the Index software (), a tool used for producing collections of references between bibliographic entities identified by OMIDs.

Table 2

Sample of Index input tables, produced by oc_ds_converter, storing citation data.

3.1.4 Ingestion of Metadata Collection into OpenCitations Meta

We use the Meta software to perform curation tasks and produce the collection of bibliographic metadata, in which the JaLC bibliographic entities are included together with those derived from all the other sources. The software assigns a new OMID to each new entity and propagates an existing OMID to records that are already included in OpenCitations Meta. The latter operation is performed by deduplicating identical entities ingested from different data sources, potentially bearing multiple identifiers.
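The deduplication principle can be illustrated with a toy Python sketch (not the actual oc_meta code): an entity reachable through any already-seen identifier inherits the existing OMID; otherwise, a new OMID is minted and all of the entity’s identifiers are mapped to it.

import itertools

id_to_omid = {}                    # external identifier -> OMID
_omid_counter = itertools.count(1)

def get_or_mint_omid(external_ids):
    # Reuse an existing OMID if any of the entity's identifiers is known...
    for ext_id in external_ids:
        if ext_id in id_to_omid:
            omid = id_to_omid[ext_id]
            break
    else:
        # ...otherwise mint a new one (toy numbering scheme).
        omid = f"omid:br/{next(_omid_counter)}"
    # Map every identifier of the entity to the chosen OMID.
    for ext_id in external_ids:
        id_to_omid[ext_id] = omid
    return omid

The same identifier-to-OMID mapping is what later allows DOI-to-DOI citations to be rewritten as OMID-to-OMID ones (Section 3.1.5).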

3.1.5 Production of Citation Data

After the integration of the new bibliographic records in Meta, the Index software, which is responsible for producing citation data (RDF, Scholix, and CSV) compliant with OCDM, generates the OMID-to-OMID citations from the JaLC data. In this step, OpenCitations Meta is queried to retrieve the OMIDs.

3.1.6 Input Dataset Analysis and Multilingualism Information Loss Forecast

OCDM allows only a single value for each metadata field (title, authors, venue, etc.). Therefore, we prioritized metadata in the original language whenever both the original and English versions were provided. However, an analysis of the initial JaLC dataset revealed that, in a few instances, the declared original language is not Japanese, and that linguistic information is not always provided. Adding to the complexity, the dataset permits multiple publisher values for the same entity: beyond different translations of the same publisher name, a single entity may be associated with multiple distinct publishers. Therefore, to avoid attributing wrong metadata to bibliographic entities when the linguistic information is not formally specified, or when the declared language is neither Japanese nor English, our approach was to keep the first encountered value, assuming it to be the most commonly used one and thereby mitigating the risk of inaccurate metadata attribution. This choice was motivated by the need for a pragmatic solution that follows OCDM without introducing data inconsistencies.
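To make the policy concrete, the following Python sketch reproduces the selection logic described above (function and parameter names are ours; the actual plug-in code is more elaborate):

def select_value(values, original_lang=None):
    # values: list of (text, language_code) pairs in source order;
    # the language code may be None when it is not declared in the record.
    if original_lang:
        for text, lang in values:
            if lang == original_lang:
                return text  # the version in the declared original language wins
    # Language not declared, not found, or neither Japanese nor English:
    # keep the first encountered value, assumed to be the most commonly used.
    return values[0][0] if values else None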

We assessed the impact of the single-language constraint by analyzing specific metadata elements in the JaLC dataset, namely the bibliographic entity title, the publication venue title, and the author names. We found that the field with the highest forecasted information loss compared to the input dataset (since only one language can be accepted) is the citing entities’ journal title (41.44%) (Figure 5, Table 3).

Figure 4 

Language distribution in Meta bibliographic entities, calculated on Meta dump, version 5 (https://doi.org/10.6084/m9.figshare.21747461.v5). The analysis was performed on bibliographic entities with a declared title.

Figure 5 

Bar charts illustrating the analysis of multilingualism within the input dataset, categorized by bibliographic metadata fields.

Table 3

Table showing the metadata languages in the original dataset and the linguistic information loss due to OCDM constraints. The total amount of metadata provided for a field is the sum of the number of values provided in one language only, twice the number of values supplied in two languages, and, for values provided in more than two languages, the number of such values multiplied by the exact number of languages furnished. The information loss is the number of values exceeding the single value retained per field, out of the calculated total. The publisher’s name field is not included in the table, since its information loss is not necessarily linguistic but may derive from multi-publisher values.


| METADATA FIELD | 1 LANGUAGE | 2 LANGUAGES | 3+ LANGUAGES | TOTAL VALUES PROVIDED | INFORMATION LOSS WRT. THE ORIGINAL DATASET |
|---|---|---|---|---|---|
| title citing | 5,701,285 | 1,641,895 | 39 (3 languages) | 8,985,192 | 1,641,973; 18.27% |
| title cited | 217,316 | 12,616 | 0 | 242,548 | 12,616; 5.2% |
| authors citing | 9,892,522 | 4,556,812 | 39 (3 languages) | 19,006,263 | 4,556,890; 23.98% |
| authors cited | 308,079 | 157,556 | 0 | 623,191 | 157,556; 25.28% |
| journal title citing | 1,137,368 | 2,658,678 | 21,213 (20,572 in 3 languages; 641 in 4 languages) | 6,519,004 | 2,701,745; 41.44% |
| journal title cited | 180,515 | 0 | 0 | 180,515 | 0 |
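As a worked example, consider the first row: citing titles amount to 5,701,285 + 2 × 1,641,895 + 3 × 39 = 8,985,192 provided values, of which 1,641,895 + 2 × 39 = 1,641,973 (i.e., 18.27%) exceed the single value per field that OCDM can retain.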

3.2 Tools and Software

Three OpenCitations software tools are involved in the process:

  1. oc_ds_converter (). This tool handles the metadata crosswalk from the JaLC data model to OCDM, the identifier validation, and the production of citation and bibliographic data in CSV format, meant to serve as input for the subsequent processes. The software includes two nested modules: oc_idmanager, for the validation of persistent identifiers, and oc_data_storage, for the management of the storage system. When the codebase was extended for the JaLC integration, the Identifier Manager module was extended to handle JIDs, the identifiers assigned by J-STAGE to publication venues (e.g., journals). To facilitate the reuse of the software components, we released a Python package on PyPI; the latest version is 1.0.0, distributed on 27 October 2023.
  2. oc_meta (). This software curates bibliographic data, deduplicates entities, assigns OMIDs, and generates a metadata dataset for the bibliographic entities involved in OpenCitations Index citations. As output, it produces bibliographic data in RDF and CSV formats, together with a dataset for provenance and change tracking in RDF. A PyPI package was released to maximize reuse potential; the latest available version is 1.2.4, published on 16 February 2023.
  3. index (). The Index software produces OMID-to-OMID citations in RDF, CSV, and Scholix formats, together with a corresponding collection of provenance data in CSV and RDF formats. The current version can be consulted in the “meta-index” branch, to be merged into the master branch once a stable version is assessed.

All three tools are released under an ISC license, hosted in public GitHub repositories, and developed following a Test-Driven Development approach (), with test coverage monitored via the Python “coverage” library. Currently, 83% of the code of both oc_ds_converter and oc_meta is covered by tests. All the abovementioned repositories come with README documentation and use Poetry as a dependency management system, to facilitate maintenance and foster workflow reproduction.

3.3 Language Agnostic Approach

Despite the current dominance of English-language content on the web, language-agnostic architectures are needed to address the challenges posed by globalization and the rising demand for multilingual web accessibility. Nonetheless, a universal solution for representing, storing, and processing multilingual data () – and, in particular, bibliographic data () – is still lacking. Over time, OpenCitations has developed general solutions to accommodate a wide range of data in a comprehensive and constantly expanding database, capable of accepting information from diverse sources irrespective of the language in which it is produced (). By not allowing multiple translations of the same value, the OCDM imposes the selection of a single language for each bibliographic metadata entry, even when the data source provides more options, leading to an inevitable loss of information.

Recognizing the implications of this endeavor, we emphasize the need to strike a delicate balance between preserving linguistic elements and respecting infrastructural and data model constraints, which led us to adopt a language-agnostic approach: we store metadata in the original language only, wherever possible. This choice is the most suitable in our case, since it favors the preservation of bibliodiversity over global uniformity in a dominant language.

3.3.1 The Management of Authors’ Names

In the context of adopting a language-agnostic approach, beyond the project’s design goal of preserving linguistic diversity, additional considerations emerged during the consolidation phase of the methodology. In particular, a noteworthy case arose during the process of cleaning and standardizing the authors’ names in the Meta dataset.

Initially, we considered eliminating all characters from names except for letters, numbers, periods preceded by a letter, and the ampersand. However, such a decision assumes an exhaustive knowledge of all permissible characters in personal names worldwide. This would require an understanding of all global alphabets and discerning which letters from these alphabets are genuinely used in personal names. For instance, we would need to consider scripts ranging from Basic Latin, Latin-1 Supplement, and Latin Extended-A, to Cyrillic Supplement, Armenian, Hebrew, and more. A pertinent question that arises is whether African click letters (ǀ, ǁ, ǂ, and ǃ) can be used in personal names. While it is feasible to craft a regular expression capturing all these alphabets broadly (e.g., ‘[A-Za-zÀ-ÖØ-öø-ÿĀ-ňƀ-njΑ-ω…]’), it does not seem to be a robust solution. Furthermore, other characters, not necessarily letters, are permissible in personal names: for instance, characters resembling an apostrophe (e.g., O’Connel) and a hyphen (e.g., Mun, Ji-Hye) are valid. Given these complexities, we decided to remove only the reserved characters used in the syntax of the CSV files with which Meta is populated, specifically “;”, “[”, “,”, and “]”, and to remain agnostic regarding all other characters. Our observations indicated that creating whitelists introduced far more errors than it resolved, primarily because the diversity of names is vast and it is impractical to verify them all.
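In code, the adopted policy reduces to a few lines of Python (a simplification of the actual oc_meta cleaning step):

import re

# Strip only the characters reserved by the syntax of the Meta input CSV,
# remaining agnostic about every other character.
RESERVED_CSV_CHARS = re.compile(r"[;\[\],]")

def clean_name(name):
    return RESERVED_CSV_CHARS.sub("", name).strip()

clean_name("O’Connel")     # unchanged: the apostrophe is legitimate
clean_name("Mun; Ji-Hye")  # "Mun Ji-Hye": a stray ";" would clash with the
                           # author-separator syntax of the CSV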

3.3.2 Handling a Multilingual Data Source

In this section, we delve into some additional methodological precautions of specific interest for handling multilingual or predominantly non-English sources.

  • Metadata Mapping and Data Selection. Accurate metadata mapping is essential for a successful data ingestion process. Collaborative efforts with data providers facilitate mapping source metadata to the end data model, aligning language-specific terms and concepts.
  • Proper Encoding-Decoding Choices. Appropriate encoding and decoding choices ensure the accurate transformation of data formats while preserving information integrity. When managing multilingual corpora, it is crucial to select broad encodings or develop ad-hoc solutions. For the JaLC Anglo-Japanese dataset, we adopted UTF-8 (), one of the most flexible encodings for representing Unicode characters (), as it can represent any language (see the sketch after this list). As a general rule, when the languages included in a multilingual dataset cannot be known in advance, it is good practice to avoid ASCII: although it suffices for English and basic Latin characters, it cannot represent other scripts ().
  • Culture-Specific Considerations. Each language has its peculiarities. In Asian languages such as Japanese, homonymy is more common in surnames than in given names. For this reason, in the absence of a distinct and persistent identifier, we suggest addressing the challenge with ad-hoc solutions, such as an external DOI-to-ORCID mapping.
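A minimal Python sketch of the encoding precaution discussed above (the file name is hypothetical; ensure_ascii=False keeps Japanese characters readable instead of escaping them to \uXXXX sequences):

import json

record = {"title": "本邦産白亜紀アンモナイトデータベースおよび種多様性について"}

with open("record.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False)

with open("record.json", encoding="utf-8") as f:
    assert json.load(f)["title"] == record["title"]  # lossless round trip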

4. Results and Discussion

4.1 Current Dataset Overview

The infrastructure currently comprises 1,975,552,846 unique citations and 114,621,237 bibliographic entities. Based on a computational analysis of publication titles performed in January 2024 with a script exploiting the “langdetect” Python library for language detection, 15% of the bibliographic metadata stored in OpenCitations Meta is not in English (Figure 6), a 6.4% increase over the previous version of the dataset (Figure 4), following the ingestion of JaLC data.

Figure 6 

Language distribution in Meta bibliographic entities, calculated on Meta dump, version 6 (https://doi.org/10.6084/m9.figshare.21747461.v6). The analysis was performed on bibliographic entities with a declared title.
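The analysis can be reproduced along the following lines (a simplified sketch: the extraction of the titles from the dump is omitted):

from collections import Counter

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results deterministic

def language_distribution(titles):
    counts = Counter()
    for title in titles:
        try:
            counts[detect(title)] += 1  # ISO 639-1 code, e.g. "en", "ja"
        except LangDetectException:
            counts["undetected"] += 1   # e.g. empty or symbol-only titles
    return counts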

4.2 OpenCitations Datasets After JaLC Dataset Ingestion

The JaLC input dataset counts 7,343,638 bibliographic entities and 416,125 citations. After processing the data, we eventually included 7,333,238 bibliographic entities and 396,788 citations in OpenCitations. Until the integration of this dataset, only bibliographic entities involved in Index citations were included in OpenCitations Meta. However, a prior analysis of the JaLC dataset showed that 6,908,305 entities had no citations, 329,078 entities were involved in citations to publications not identified by persistent identifiers managed by OpenCitations, and only 106,255 entities cited publications identified by a DOI. For this reason, and given the small number of entities to be processed, we decided to integrate into Meta all the bibliographic records in the dataset with DOIs assigned by JaLC.

As can be seen in Table 4, it is not unusual for different sources to have overlapping information. However, only 1,137 of the citations provided by JaLC are in common with Crossref, and the majority of its contribution to OpenCitations is unique.

Table 4

Diagonal cells (highlighted in yellow in the original figure) represent the single contribution of each collection to OpenCitations Index, i.e., the number of citations derived uniquely from a given source. The other cells (pink in the original) represent the number of citations in the pairwise intersections of the sources. The table is based on OpenCitations data at its latest update (29 November 2023).


| | INDEX | CROSSREF | DATACITE | PUBMED | OPENAIRE | JALC |
|---|---|---|---|---|---|---|
| INDEX | 1,975,552,846 | 1,563,218,160 | 169,814,412 | 695,988,810 | 14,645,838 | 396,788 |
| Crossref | | 1,100,963,346 | 27,051 | 458,309,297 | 3,917,329 | 1,137 |
| DataCite | | | 169,663,255 | 9,623 | 114,483 | 0 |
| PubMed | | | | 237,208,867 | 9,711,789 | 125 |
| OpenAIRE | | | | | 1,067,712 | 0 |
| JaLC | | | | | | 395,526 |

The most recent update of the datasets dates back to 29 November 2023. According to these data, the OpenCitations Index includes information collected from five sources and comprises 1.97 billion citations involving 72,848,673 citing entities and 71,805,805 cited entities, for a total of 89,920,081 unique entities involved in at least one citation link.

4.3 JaLC Data Retrieval From OpenCitations Datasets

JaLC data were included both in Meta and in Index, and are available in all the formats mentioned above.

JaLC citations can be retrieved from the Index in two ways. The first approach involves retrieving the identifiers of the citations from the provenance dataset and then looking for the collected OCIs in the OpenCitations Index CSV dump (a sketch in code follows the list). More in detail:

  1. Access the OpenCitations Index on the “Download” page.
  2. Download the “Citation data sources’ info” dataset (N-Triples), which contains information regarding the source collection of each citation.
  3. Identify OCI subjects containing the string “joci”, standing for JaLC OpenCitations Index, to locate citations from JaLC.
  4. Download the citation data dataset in CSV format.
  5. Match OCIs in the CSV dataset with those from N-Triples to obtain JaLC citation information.
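A Python sketch of steps 3–5 follows, under simplifying assumptions: a single local N-Triples file and a single CSV file (the actual dumps are split into several archives), and an “oci” column in the CSV dump:

import csv

jalc_ocis = set()
with open("citation_data_sources.nt", encoding="utf-8") as nt:  # hypothetical name
    for line in nt:
        if "joci" in line:  # provenance pointing at the JaLC sub-collection
            subject = line.split(" ", 1)[0].strip("<>")
            jalc_ocis.add(subject.rsplit("/", 1)[-1])  # bare OCI value

with open("index_dump.csv", encoding="utf-8", newline="") as f:  # hypothetical name
    jalc_citations = [row for row in csv.DictReader(f)
                      if row["oci"] in jalc_ocis]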

The alternative approach, suggested for Semantic Web experts, involves querying the OpenCitations Index SPARQL endpoint for the subjects of triples having the <http://www.w3.org/ns/prov#atLocation> property and, as object, the IRI identifying the internal OpenCitations Index collection dedicated to JaLC-derived data (https://w3id.org/oc/index/joci), as shown in the following SPARQL query:

SELECT ?s WHERE {
  ?s <http://www.w3.org/ns/prov#atLocation>
     <https://w3id.org/oc/index/joci/> .
}

To access JaLC records in OpenCitations Meta, it is key to note that the primary source information is contained in the provenance data, as prescribed by the OCDM. Each Meta entity is linked to a snapshot (prov:Entity) via prov:specializationOf, and each snapshot includes a prov:hadPrimarySource property indicating its primary source. For JaLC records, the primary source is https://api.japanlinkcenter.org/. Since the provenance data are not stored in a triplestore, downloading the Meta provenance dataset is necessary to identify JaLC records.
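A sketch of this lookup in Python, assuming the provenance dump has been unpacked into a single N-Quads file (in reality it is split into several archives):

JALC_API = "https://api.japanlinkcenter.org/"

jalc_snapshots = set()
with open("meta_provenance.nq", encoding="utf-8") as f:  # hypothetical name
    for line in f:
        if "prov#hadPrimarySource" in line and JALC_API in line:
            # The subject is the snapshot (prov:Entity) of a Meta record.
            jalc_snapshots.add(line.split(" ", 1)[0].strip("<>"))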

5. Implications/Applications

5.1 Reuse Potential

The user base of OpenCitations data includes Funders, Resource and Research Managers, Researchers, Policy Makers, Research Organisations, and Providers. More in detail, the OpenCitations datasets benefit scholars in developing countries and professionals outside academic institutions without access to commercial citation indexes, as well as ordinary citizens seeking open data, open science partners, academic publishers, tool developers, bibliometricians, and librarians (; ). OpenCitations’ citation collection is currently used in several projects (e.g., B!son and Optimeta), repositories (e.g., the Staatsbibliothek zu Berlin and ORBi ULiege), and search tools (e.g., PURE suggest).

JaLC data integration opens up interesting usage prospects, especially for bibliometric studies on bibliographic metadata and citations among non-English resources. This aspect is especially relevant for the preservation of bibliodiversity in cultural heritage and humanities-oriented studies, in which awareness of the reproducibility and replicability of research is still being established () and the introduction of open-access tools that facilitate open science practices is a necessary starting point. In addition, the adopted ingestion workflow not only addresses the intricacies of multilingual data ingestion but also opens doors to broader applications: researchers and institutions dealing with non-English data sources can leverage this workflow, adapting and customizing it to suit their specific needs.

5.2 Future Developments

As a future development, since OCDM excludes multilingual storage but the language of cataloguing can differ from users’ language preferences, the resulting limitations on data reuse could be mitigated by software-driven data retrieval solutions providing on-the-fly translation. In this perspective, scalable solutions to the lack of common practices for multilingual access include promoting the use of controlled terms and established value vocabularies for simplicity and cost-effectiveness (). However, current state-of-the-art open-source solutions do not guarantee the level of accuracy we aim to achieve. Thus, since sophisticated tools are needed for cross-lingual information retrieval, we plan to develop software exploiting open-source technologies for data exposure only, specifically trained on translation tasks for bibliographic metadata. Of course, the tool would come with a clear statement of the nature of the data displayed, i.e., whether it is presented in its original form or has been translated. Recently, similar tasks have been addressed with approaches to multilingual information retrieval based on pre-trained multilingual language models (). Such a solution would limit the storage issues caused by maintaining English translations of the data, while respecting the OCDM structure and maximizing the reuse potential of the exposed data, thus striking a balance between the need for information preservation and the constraints of physical space and the nature of the data model.

Data Accessibility Statement

The two main datasets are deposited on Figshare under a CC0 license and can be downloaded in different formats, alongside their provenance information.

Meta:

Index: