Investigating classical and new text mining methods on a bilingual dataset can make comparisons of these techniques more meaningful. A parallel text dataset is useful by construction: the texts are supposed to be strictly equivalent, which leads us to expect that exploratory text mining results will be similar too. We decided to explore a parallel dataset in order to extract knowledge from a technical area (e.g., finance). The choice of the Chinese–English pair has several motivations. Firstly, the data is more easily available. Secondly, there is a demand for English and Chinese tools and datasets, as English is already the lingua franca in many areas (political, economic, cultural, and scientific), and we also see an increasing interest in Chinese, which is now being taught at schools in western countries. One can keep in mind that 1.41 billion people speak Chinese as their first or second language, compared with 1.35 billion for English (the overlap is no more than 20%). Thirdly, China and the USA, where most of the native speakers live, are drivers of the world economy. The language of business and finance has always attracted interest, since the movement of stock indexes can be an indicator, a ‘barometer,’ of the general trend in the economy. When we look at the availability of domain-specific parallel corpora, the majority of them are constructed around the following drivers: biomedicine (Neves, Yepes, & Névéol, 2016), digital humanities/culture (Christodoulopoulos & Steedman, 2014), city, transport (Lefever, Macken, & Hoste, 2009), food, the environment (Xiong, 2013), ICT (Labaka, Alegria, & Sarasola, 2016), digital humanities/law, and governance (Steinberger et al., 2006).
Concerning Chinese–English, Chang (2004) from Peking University made one of the first large-scale Chinese–English parallel corpora from HTML files with alignments at the paragraph and sentence levels, reaching a size of 10 million Chinese characters across different genres (news, technical articles, subtitles). Concerning the domain of finance, there are some small corpora for different pairs of languages, but not Chinese–English (Arcan, Thomas, de Brandt, & Buitelaar, 2013; Bick & Barreiro, 2015; Smirnova & Rackevičienė, 2020; Tiedemann, 2012; Volk, Amrhein, Aepli, Müller, & Ströbel, 2016). The largest one is the SEDAR dataset,1 containing 8.6 million French–English sentence pairs in the finance domain from PDF files of the regulations of the province of Quebec (Ghaddar & Langlais, 2020). To our knowledge, the dataset discussed in our article represents new available material for the community. The question we address is to consider the state-of-the-art techniques and the main contemporary approaches to text mining, and to see what we can finally extract from a dataset of news in a specialized domain such as fintech. Knowing that each news item contains the same version in Chinese and English, another question to explore is the following: “are the efficiency and extraction exactly the same, or do some cultural aspects influence the translation and so the lexical and semantic content?” In this way, the general dataset we present in this article can be seen as a gold standard for the output of calibrated measures for all kinds of techniques. In general, studies use a text collection within the framework of a specific method such as disinformation analysis (Turenne, 2018) or the development of medical drugs (Kolchinsky, Lourenco, Wu, & Rocha, 2015), or for a specific task such as part of speech (POS) tagging (Akbik, Blythe, & Vollgraf, 2018) or named entity extraction (Chiu & Nichols, 2016).
In this article, we also take a domain dataset (namely, fintech) and a specific genre of document (news), but we do not have a specific task to improve. We try simple tasks that are intuitive and directly applicable to such a dataset: clustering (named entity and word), classification (topic and sentiment), and pattern extraction (word life and citation). We made the dataset using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021, each containing a version in English and its translation into Chinese. We focus on three families of techniques within the text-mining framework: (i) pre-processing techniques; (ii) supervised approaches involving deep learning techniques such as LSTM, BERT, and CNN, and also SVM, naïve Bayes, and random forest; and (iii) unsupervised techniques involving k-means, community detection, biclustering, co-word analysis, and topic modeling (Turenne et al., 2020). This paper is divided into the following sections: we discuss the dataset and its sub-datasets, describe the state-of-the-art research based on bilingual corpora, machine learning, and natural language processing, and then present the results of our experiments.
2 Related Work
2.1 Parallel language dataset building
Zhao and Vogel (2002) is probably one of the pioneering studies combining a parallel Chinese–English dataset with a mining approach. They used 10 years of the Xinhua bilingual news collection, but that collection is not available. Koehn (2005) is a large-scale multilingual and parallel document dataset containing ∼60 million words on average per language for 21 European languages, but nothing about Chinese. In the same way, we find a topic detection and tracking repository.2 It contains 30K in Chinese and English, but not in parallel. Christodoulopoulos and Steedman (2014) and Sturgeon (2021) are open data repositories and digital humanities projects. They contain books with English–Chinese versions, but their content is closely related to philosophy, religion, and difficult-to-understand contemporary thinking: for example, manual annotation for classification is not easy. The UCI Machine Learning Repository (Dua & Graff, 2017) and Kaggle3 are repositories of datasets, and many of them are used for the evaluation of algorithms. They contain no English–Chinese parallel corpora. Zhai, Liu, Zhong, Illouz, and Vilnat (2020) made a dataset considering 11 genres (constructed based on existing work: art, literature, law, material for education, microblogs, news, official documents, spoken, subtitles, science, and scientific articles) and made a parallel English–Chinese dataset with 2,200 sentences to test the translation of literals. Tian et al. (2014) presents UM-Corpus,4 designed for statistical machine translation (SMT) research. It contains 15 million English–Chinese parallel sentences and treats eight genres: News, Spoken, Laws, Theses, Educational Materials, Science, Speech/Subtitles, and Microblog. Globally, the dataset contains 2.2 million sentences in both languages (450,000 for news alone). This dataset is freely available but named entities are anonymized.
2.2 Building domain-specific parallel datasets
In this section we present an extensive literature review of domain-specific datasets, their language pairs, and topics. We observed an increased interest in domain-specific parallel datasets in recent years. From a computational point of view, the main use of such material is to build a specialized training dataset to improve a statistical machine translation system and to do cross-lingual information retrieval (McEnery & Xiao, 2007); from a linguistic point of view, it is to extract automatically or semi-automatically a specialized lexicon in different languages (Rosemeyer & Enrique-Arias, 2016). In the following review, we consider as domain-specific a dataset focused on all aspects of one topic. A text genre, such as news or technical publications, is considered as a domain.
2.2.1 Digital Humanities: culture
In this domain we have found 20 datasets, of which the large language-pair datasets are as follows. In the area of religious studies, Christodoulopoulos and Steedman (2014) is about the Bible in 100 languages. We also find a Chinese–English dataset (Sturgeon, 2021), an Arabic–English dataset (Hamoud & Atwell, 2017), a presentation of the same ancient religious texts in different Germanic dialects (Dipper & Schultz-Balluff, 2013), and a parallel dataset of English and Persian religious texts (Beikian & Borzoufard, 2016). In literary studies, Fraisse, Tran, Jenn, Paroubek, and Fishkin (2018) created a massively parallel dataset of translated American literary texts, with 23 languages. Altammami, Atwell, and Alsalka (2020) present a bilingual parallel English–Arabic dataset of narratives reporting different aspects of Muhammad’s life. In the domain of tourism and traveling, Espla-Gomis et al. (2014) built a domain-specific English–Croatian parallel dataset from different websites; Ponay and Cheng (2015) made an English–Tagalog dataset; Bureros, Tabaranza, and Roxas (2015) created an English–Cebuano dataset; Woldeyohannis, Besacier, and Meshesha (2018) made an Amharic–English dataset; Srivastava and Sanyal (2015) made a small parallel English–Hindi dataset; and Boldrini and Ferrández (2009) collected 4500 questions/answers from customers about tourism in Spanish translated into English.
Concerning literary texts, Rovenchak (2021) published a Bamana–French analysis concerning Bamana tales; Kenny (1999) describes GEPCOLT, an electronic collection of some fourteen works of contemporary German-language fiction alongside their translations into English; Giouli, Glaros, Simov, and Osenova (2009) made a Greek–Bulgarian dataset about cultural, literary and folk texts; Kashefi (2020) made a Persian–English dataset with masterpieces of literature; Frankenberg-Garcia (2009) built a parallel dataset of English and Portuguese literary texts; Miletic, Stosic, and Marjanović (2017) made Paracolab, a dataset of English, French and Serbian literary books; and Guzman (2013) describes a dataset of literary texts with versions in Spanish, French, German, and Catalan.
D.-Y. Lee (2011) used an interesting approach, for Korean and English, to improve financial phrase translation, but the corpora are comparable without being really parallel. There are some parallel corpora about finance, with a limited size, such as Smirnova and Rackevičienė (2020), who made a dataset of European documents in English translated to French and Lithuanian related to finance, but the size is relatively small, consisting of 154 documents from 2010 to 2014. Bick and Barreiro (2015) made a Portuguese–English parallel dataset of about 40,000 sentences in the Legal-Financial domain, coming from a company translation memory. We will next mention four notable parallel corpora about finance, for which we will give the details below: the ECB dataset,5 the DBpedia-linguee dataset, the CSB dataset,6 and the SEDAR dataset.1 All of them have been made for automatic translation and cross-lingual information retrieval purposes. In the Opus project (Tiedemann, 2012), we can find the ECB dataset, employing 19 European languages and concerning financial and legal newsletters from the European Central Bank. As an example, it contains 113,000 English–German pairs of sentences. Arcan et al. (2013) used DBpedia datasets to extract the titles of relevant Wikipedia articles, and the Linguee database, obtaining 193,000 aligned sentences (English–German, English–French, and English–Spanish) to find translations of financial terms. The Credit Suisse Bulletin dataset (CSB) is based on the world’s oldest banking magazine, published by Credit Suisse since 1895 in both German and French (Volk et al., 2016). The SEDAR dataset (i.e., the System for Electronic Document Analysis and Retrieval) contains 8.6 million French–English sentence pairs in the finance domain from PDF files of regulations of the province of Quebec (Ghaddar & Langlais, 2020). However, all these datasets are about pairs of European languages. 
Guo (2016) describes how it can be feasible to make a domain-specific Chinese–English parallel dataset in the financial service domain, but the work is restricted to giving guidelines about which tools to use to get raw data and how to use a parallel dataset, without the description and availability of a dataset. We have seen in this review that, firstly, domain-specific datasets address different topics of societal challenges. Secondly, although the finance domain is not lacking in datasets, English–Chinese is not covered yet.
2.3 Parallel language dataset exploration
Parallel corpora have been investigated to make alignments between sentences. Wu and Xia (1994) is a pioneering work using parallel sentences in the framework of automatic translation. They used literal translations of sentences from the parliamentary proceedings of the Hong Kong Legislative Council, with five million words, to predict the Chinese translation of each English entry. In Yang and Li (2003), an alignment method is presented at different levels (title, word, and character) based on dynamic programming (DP). Lu, Tsou, Jiang, Kwong, and Zhu (2010) used a non-open dataset of 157,000 files, with both Chinese and English versions. More recently, Schwenk, Chaudhary, Sun, Gong, and Guzmán (2021) have made an alignment process over 85 languages and 135 million sentences from Wikipedia (available as open data), but they found only 790 sentences for English–Chinese, which is very few for a text mining workflow. Li, Wang, Huang, and Zhao (2011) used a linear combination and minimum sample risk (MSR) algorithm to make a matching between named entities (Person, Organization) and obtained an F-score of 84%. A pioneering work in text mining and English–Chinese texts is probably C.-H. Lee and Yang (2000), who used a neural network clustering method called Self-Organizing maps to extract clusters from an English–Chinese parallel dataset (this parallel dataset is made with Sinorama magazine articles with 50,000 sentences)7 but their conclusion only reveals the potential of the approach. Lan and Huang (2017) construct a bilingual English–Chinese latent semantic space and also select k-means initial cluster centers, but the interpretation of the clustering is not very clear.
3 The Dataset
3.1 Data collection
The news was collected for the period from 2007 to 2021.
After collating the links, the pages were downloaded with ‘wget’ and stripped of HTML. The encoding of the files was normalized to UTF-8 (R package ‘httpr’). Cloud computing under the SLURM framework was used to parallelize the NLP preprocessing.
In all, we got an uncleaned raw text dataset with 90,003 documents.
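As an illustration of the HTML stripping and encoding normalization steps, here is a minimal Python sketch that uses only the standard library. It is not the pipeline used for the dataset (which relied on ‘wget’ and an R package); the function names `strip_html` and `to_utf8`, and the list of candidate encodings, are assumptions made for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_html(html: str) -> str:
    """Return the text content of an HTML page, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def to_utf8(raw: bytes, candidates=("utf-8", "gb18030")) -> bytes:
    """Decode with the first candidate encoding that works, re-encode as UTF-8.

    The candidate list is an assumption; real pages may need other encodings.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding matched")
```

Trying UTF-8 first matters here: a stricter encoding should be tested before a more permissive one, otherwise valid UTF-8 bytes could be silently misread.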
3.2 Data preprocessing
We carried out sentence segmentation, word splitting, and named entity extraction. For linguistic preprocessing, we used regular expressions for field extraction and for sentence and paragraph splitting. We used Jieba and spaCy for tokenization and tagging, and the Stanford NER framework for named entity extraction.
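The regular-expression sentence splitting mentioned above can be sketched as follows. This is a simplified illustration under stated assumptions, not the expressions actually used for the dataset: the patterns and the function names `split_sentences_en` and `split_sentences_zh` are hypothetical, and tokenization with Jieba or spaCy is not reproduced here.

```python
import re

# Chinese: a sentence is a run of characters ending with full-width
# terminal punctuation (no whitespace is expected between sentences).
_ZH_SENTENCE = re.compile(r"[^。！？]+[。！？]?")

# English: split on whitespace that follows ., ! or ? and precedes an
# uppercase letter or an opening quote. Abbreviations such as "U.S." will
# be split incorrectly by this simple rule.
_EN_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z\"'“‘])")


def split_sentences_en(text: str) -> list[str]:
    """Split English text into sentences with a punctuation heuristic."""
    return [s.strip() for s in _EN_BOUNDARY.split(text) if s.strip()]


def split_sentences_zh(text: str) -> list[str]:
    """Split Chinese text into sentences on full-width punctuation."""
    return [s.strip() for s in _ZH_SENTENCE.findall(text) if s.strip()]
```

The two languages need different rules because Chinese marks sentence ends with full-width punctuation and no following space, whereas the English heuristic depends on whitespace and capitalization.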
The use of HTML was helpful to automatically extract from each news item its timestamp, title (in both languages), text body (in both languages), and topic tags. In some cases, a translation was not available, so we took the item as is. We tried to carry out a paragraph alignment between two equivalent documents in Chinese and English. Splitting into paragraphs is quite easy using line break markers. However, in some cases the number of paragraphs does not match, and we did not achieve this alignment because of the cost of human validation.
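The paragraph splitting and the mismatch problem described above can be illustrated with a short Python sketch. The helper names `split_paragraphs` and `pair_paragraphs` are hypothetical, and pairing paragraphs by position is an assumption made for the example; it only applies when both versions have the same number of paragraphs.

```python
def split_paragraphs(body: str) -> list[str]:
    """Split a text body on line breaks and drop empty lines."""
    return [p.strip() for p in body.splitlines() if p.strip()]


def pair_paragraphs(en_body: str, zh_body: str):
    """Pair paragraphs by position, or return None when the counts differ.

    A None result marks a document whose alignment would need human
    validation, which is the costly case mentioned in the text.
    """
    en = split_paragraphs(en_body)
    zh = split_paragraphs(zh_body)
    if len(en) != len(zh):
        return None
    return list(zip(en, zh))
```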
We proceeded to clean the documents using two rules: (1) each one had to have both English and Chinese versions; (2) only files with a text body containing more than two characters were kept.
We got a cleaned raw text dataset of 60,473 documents.
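The two cleaning rules can be expressed as a simple filter. The sketch below is illustrative only: the `NewsItem` record and its field names are hypothetical, and applying the length rule to both language versions is an assumption, since the text does not say which body is checked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NewsItem:
    doc_id: str
    body_en: Optional[str]  # None when the English version is missing
    body_zh: Optional[str]  # None when the Chinese version is missing


MIN_BODY_CHARS = 3  # rule 2: a body must contain more than two characters


def is_clean(item: NewsItem) -> bool:
    """Rule 1: both versions exist. Rule 2: each body is long enough."""
    bodies = (item.body_en, item.body_zh)
    if any(body is None for body in bodies):
        return False
    return all(len(body.strip()) >= MIN_BODY_CHARS for body in bodies)


def clean(items):
    """Keep only the items that satisfy both rules."""
    return [item for item in items if is_clean(item)]
```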
The dataset is available at https://doi.org/10.5281/zenodo.5591908
3.3 Data statistics
The dataset contains various metadata, such as title and text body both in English and Chinese, the time of publication, and some topic tags. Table 1 shows the extraction of elementary linguistic features.
3.4 Categories of finance domain
We made different samples for topic prediction using classification methods. The documents contain 10 topic-metadata tags, used by the Financial Times to annotate the area of each news item. A news item can contain several tags: book, business, culture, economy, lifestyle, management, markets, people, politics, or society. There were 57,584 documents containing topic metadata. For manual annotation, we used the 10 economic-sector tags from the Financial Times website: technology, consumer services, health care, consumer goods, basic materials, industrials, financials, oil & gas, utilities, and telecommunications. There are 2,993 documents that were tagged manually.
The top influential media in Finance are: 1. The Wall Street Journal. 2. Bloomberg. 3. The New York Times. 4. The Financial Times. 5. CNBC. 6. Reuters. 7. The Economist.
Five items of the Financial Times website can be clearly identified as related to the “economy” (equities, currencies, commodities, bonds, funds & ETFs), and the item world market can be associated with “markets,” company with “business,” and director dealings with “management.” The economy, management, markets, and business are among the tags contained in each document as metadata. However, we also find other tags, such as lifestyle, politics, and people. In fact, many influential people have an impact on the evolution of markets.
Other items, such as sectors and industries, can be further split into:
- id01 – Technology (Software & Computer Services, Technology Hardware & Equipment)
- id02 – Consumer Services (General Retailers, Travel & Leisure, Food & Drug Retailers, Media)
- id03 – Health Care (Health Care Equipment & Services, Pharmaceuticals & Biotechnology)
- id04 – Consumer Goods (Automobiles & Parts, Leisure Goods, Personal Goods, Food Producers, Household Goods, Tobacco, Beverages)
- id05 – Basic Materials (Industrial Metals, Mining, Chemicals)
- id06 – Industrials (Support Services, Electronic & Electrical Equipment, Industrial Transportation, Aerospace & Defense, Construction)
- id07 – Financials (Real Estate Investment & Services, Financial Services, General Financial, Life Insurance, Banks, Nonlife Insurance)
- id08 – Oil & Gas (Alternative Energy, Oil & Gas Producers, Oil Equipment, Services & Distribution)
- id09 – Utilities (Gas, Water & Multi-utilities, Electricity)
- id10 – Telecommunications (Fixed Line Telecommunications, Mobile Telecommunications)
Sectors, in finance, act both as a guide to make promising investments in the right places and as representation of areas of activity.
Topics id01 to id10 are used for manual annotation, so they cover far fewer documents than the topics inserted into each document as metadata. From the manual annotation, the most frequent topics are: financials, consumer goods, consumer services, and technology. From the metadata, the most frequent topics are: business, the economy, markets, management, politics, lifestyle, and society.
3.5 Manual annotation
To carry out the manual annotation, we made a set of document batches, each one containing 100 distinct documents. A population of 31 students (year-3 level in computer science, with B1 to C1 level of English) received one batch each. Multiple annotation was possible, and the format of the annotation was quite elementary: a document id followed by a class id, one annotation per line, e.g.:
- 1014550; id07
- 1014871; id11
An extra annotator assessed the annotations by choosing randomly 10 files for each batch. If the annotation done by the extra annotator showed more than four differences from those produced by the annotator (i.e., >40% disagreement), the batch had to be revised by the annotator. Nineteen batches were revised. Finally, after the second round, we compiled all the batches together.
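The annotation format and the batch check can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure actually used: the function names are hypothetical, and counting a “difference” as any mismatch between the two sets of class ids for a document is an assumption, since the text does not define how multi-label disagreements were counted.

```python
import random


def parse_annotations(lines):
    """Map each document id to its set of class ids.

    Each line has the form 'doc_id; class_id'; a document may appear on
    several lines because multiple annotation was possible.
    """
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc_id, class_id = (part.strip() for part in line.split(";", 1))
        labels.setdefault(doc_id, set()).add(class_id)
    return labels


def sample_for_check(doc_ids, k=10, seed=0):
    """Randomly choose k documents of a batch for the extra annotator."""
    rng = random.Random(seed)
    return rng.sample(sorted(doc_ids), k)


def needs_revision(annotator, checker, max_differences=4):
    """True when more than max_differences checked documents disagree."""
    differences = sum(
        1 for doc_id, expected in checker.items()
        if annotator.get(doc_id, set()) != expected
    )
    return differences > max_differences
```

With 10 sampled documents and a threshold of four differences, a batch is sent back only when the disagreement exceeds 40%, which matches the rule stated above.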
3.6 Data usage
As mentioned in the previous section on the literature, there are several ways to use a parallel dataset. The same is true for our Chinese–English parallel dataset for the domain of finance. Here are five main possible usages:
- The influence of the language on the knowledge discovery
We present the results of different clusterings for topic discovery and classification for topic detection. Here, the algorithm is not supposed to take into account specificities of the language (i.e., it is to be language-independent). This dataset can be useful to study how a language-dependent algorithm could be more efficient.
- Keyword in context
Concordances of a word in the domain of finance can be extracted. In such a usage case, different contexts make possible the study of the meaning of a phrase and its variation.
- Automatic translation
A classical usage case is to exploit such a dataset to make automatic translations of documents in the domain of finance, using this dataset as a training set for a statistical machine translation (SMT) system.
- Neologism translation
Translation is always a challenge, especially for new words. A usage case of the dataset is the study of neologisms, for example, finding the Chinese equivalent of a new named entity in English (a company name or a person's name).
- Time series of a domain-specific word
The last case can be the study of the distribution of words or phrases over time and see their popularity.
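Two of the usages listed above, keyword in context and time series of a domain-specific word, can be sketched with the Python standard library. The function names `kwic` and `yearly_counts`, the context width, and the representation of documents as `(year, text)` pairs are assumptions made for this example; matching is done on plain substrings, so the same code applies to Chinese text, which has no spaces between words.

```python
import re
from collections import Counter


def kwic(text, keyword, width=30):
    """Return (left, match, right) context triples for each occurrence."""
    rows = []
    for m in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


def yearly_counts(docs, term):
    """Count case-insensitive occurrences of a term per publication year.

    docs is an iterable of (year, text) pairs.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    counts = Counter()
    for year, text in docs:
        counts[year] += len(pattern.findall(text))
    return dict(sorted(counts.items()))
```

Substring matching is a deliberate simplification: it will also count a term inside a longer word, so a study of English neologisms may prefer word-boundary matching.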
4 Discovery of some frequent interesting terms
In this section, we will search for some interesting words or phrases in the dataset and count their frequency of occurrence, which will be conducive to our further understanding of the dataset. Next, this section will be divided into three parts to explore the frequency of English proverbs and Chinese idioms, important finance related terms, and globally famous companies in the dataset. We made some experiments about lexical variation over time and proverb analysis (see appendix A for more details).
4.1 Discovery of frequent terms of finance domain
The first step is deciding how to choose some commonly used financial terms. Our decision was to use Fundera. Fundera is an online marketplace that connects small business owners with the best providers of capital for their businesses. It offers product marketplaces that cover everything from loans to legal services, free financial content, and one-on-one access to experienced lending experts. Starting from “60 Business and finance terms you should definitely know”,11 by Meredith Wood, founding editor and vice president of the Fundera Ledger, we selected the 20 financial terms that appear most frequently in the dataset. The results are shown in Table 2.
| Term | Frequency | Term | Frequency |
|---|---|---|---|
| Interest Rate | 1036 | Fixed Asset | 101 |
| Balance Sheet | 522 | Working Capital | 83 |
| Depreciation | 368 | Line of Credit | 46 |
Next, we applied the same method to detect the most frequent idioms and proverbs, extracting the statements in the dataset and calculating the frequency of occurrence of each one (see appendix file).
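The frequency counts reported in this section can be approximated with a few lines of Python. This is a minimal sketch under assumptions, not the exact counting procedure used for the tables: the function name `term_frequencies` is hypothetical, matching is case-insensitive on whole words, and plural or inflected forms are not merged with the base term.

```python
import re
from collections import Counter


def term_frequencies(documents, terms):
    """Count whole-word, case-insensitive occurrences of each term.

    Returns (term, count) pairs sorted by decreasing count.
    """
    patterns = {
        term: re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    }
    counts = Counter()
    for text in documents:
        for term, pattern in patterns.items():
            counts[term] += len(pattern.findall(text))
    return counts.most_common()
```

The word-boundary anchors suit English terms; for Chinese idioms, which are not delimited by spaces, the anchors would have to be dropped and plain substring matching used instead.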
4.2 Discovery of frequent company names
We used the same method to collect statistics on the frequency of occurrence of company names in the dataset. Among them, we find the Chinese company Huawei, which shows that with the increase of China’s international influence, Chinese technology enterprises are increasingly favored by global business people.
5 Text-mining approaches and the domain of finance
For readers interested in finance or natural language processing, the first point of interest of this dataset is that we provide a full analysis using state-of-the-art text-mining technology. The experiments were of three kinds (see appendix B and appendix C for technical details):
- lexical extraction (words, noun phrases, names of people, names of companies)
- classification (supervised learning)
- clustering (unsupervised learning)
As we showed in the section on the discovery of lexical items, this dataset is useful for identifying the important concepts and actors of the domain. These concepts are not new for an expert working in finance every day, but the dataset can be used as an educational tool for students at school or college to understand what finance is through real-life events and practical information. A list of frequent noun phrases (such as ‘asset,’ ‘interest rate’), a list of famous and influential people (such as Elon Musk, Xi Jinping), and a list of names of famous organizations (such as the IMF and the Fed) were extracted, and one hundred frequent items for these three categories can easily serve as a basic framework of concepts for educational purposes. We also studied and compared the properties of the English and Chinese languages through the use of proverbs, which is one of the high-level linguistic patterns of any language. We discovered that in the domain of finance, which is highly related to technology and also to society, proverbs are used freely in the Chinese texts but not at all in the English ones. We do not have an explanation for this except that it may be an important cultural difference in how people use language to disseminate information (even in a technological area). We have shown that, using classification techniques, potential readers could process new documents (unseen from the dataset) that may be of interest to them, according to the ontology of 20 topics described in Section 3.4.
Clustering, by definition, relies mainly on organizing knowledge about a set of unstructured data. We have carried out several experiments, and clustering has revealed some classical topics of finance, such as business or markets, but also surprising topics in the finance domain, such as lifestyle, art and life, politics, and British education, which seem to play a big role. This shows that finance is not just an activity in society, like sports for example, but also seems to be an ideological model. Secondly, the clusters show that even if finance is globalized, a polarity about the specific relationship between China and the US appears to emerge as more important than all others.
Chinese and English form an interesting combination of languages for testing algorithms and mining. Finance is a hot area of activity in our contemporary world. We made a text dataset using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021. This dataset is a bilingual Chinese–English parallel dataset of news in the domain of finance, and is open access. We used a text mining analytical framework. As a future perspective, our dataset can be used to infer the translation of new terms from English to Chinese (e.g., company names), to extract the distribution of occurrences of new concepts for time series analysis (e.g., neologisms), or to apply a more innovative clustering approach to discover new concepts (e.g., ontology learning).
The additional files for this article can be found as follows:
- Appendix A: Discovery of some frequent interesting terms. DOI: https://doi.org/10.5334/johd.62.s1
- Appendix B: Classification. DOI: https://doi.org/10.5334/johd.62.s2
- Appendix C: Clustering. DOI: https://doi.org/10.5334/johd.62.s3