1 Introduction

Idioms are multiword expressions whose prototypical meaning is figurative (; ). For instance, the Italian idiom essere la ciliegina sulla torta (“be the cherry on the cake”) means “be something extra that makes a good thing even better”, and the English kick the bucket means “die”. As idioms are non-compositional (), the linguistic contexts in which they occur typically show no reference to their literal constituents. Note, however, examples 1a and 1b:

(1) a. La lettura è un fattore di modernità, non è la ciliegina sulla torta ma è la torta stessa.
       ‘Reading is a factor of modernity; it is not the icing on the cake but the cake itself.’
       (la Repubblica)
    b. But, if I do end up on some opiate,! the bucket has been kicked and leaking significantly.

Such examples show an interesting contextual variation whereby the literal constituents of the idioms are exploited in the co-text to create ambiguity where there is interaction between literal and figurative meaning; such uses are also referred to as “creative” ().

All idioms are characterised by some degree of non-compositionality and formal complexity. However, they are also heterogeneous elements that differ from each other with respect to several variables (). In particular, idioms that are positively characterised by specific features seem to be more likely to occur in ambiguous contexts (see section 1.1.1). Consequently, to investigate the relationship between idiom external variation (i.e. contextual) and idiom internal variation (i.e. how each idiom is characterised with respect to the features of interest), it is necessary to collect quantifiable information on the idiom features that are relevant for the study of ambiguous idiomatic contexts. This is achieved by implementing a cross-linguistic norming study on a selection of English and Italian idioms. Norming studies are questionnaire-style tools by which participants are asked to provide judgements (typically Likert-scale ratings or categorical choices) with respect to the relevant variables. Our norming study led to the creation of the dataset described in this paper: a normed cross-linguistic lexicon of 150 English–Italian idiom pairs.

In the remainder of the introduction, we will define the variables included in the study (1.1) and the motivations supporting the creation of the dataset (1.2). Following this, the methodology adopted will be described in detail in section 2. The results of the norming study will then be presented and discussed in section 3, including an in-depth look at ethical decision-making in crowdsourcing scenarios (3.1), the description of the cross-linguistic lexicon (3.2) and its reliability measurement (3.3). The paper will conclude in section 4 with a concise summary and an outline of future research.

1.1 The variables

1.1.1 Content-Based Variables

Following Hubers et al. (), we define Content-Based Variables (“CBVs”) as variables whose assessment is based on an idiom’s linguistic content. The CBVs considered are literal plausibility, decomposability and transparency; they were selected because idioms positively characterised by such features seem to be more likely to occur in ambiguous contexts (; ; ).

Literal plausibility is also referred to as “literality” () or “ambiguity” (); it refers to the possibility that the literal-compositional meaning of an idiom denotes an event consistent with our world knowledge (). Since it is reasonable to imagine someone literally kicking a bucket, kick the bucket is a literally plausible idiom. In contrast, when considering the idiom be in seventh heaven (“be very happy”), it is unrealistic that someone could literally be in a heaven labelled number seven. In other words, literally plausible idioms have “possible alternate non-idiomatic readings” (). According to Vulchanova et al. (), such a feature implies that these idioms are particularly prone to trigger competition between literal and figurative meaning; conversely, implausible idioms have no such internal competition between the two meanings, hence they “do not have a strong potential ambiguity” ().

Decomposability is also referred to as “compositionality” (; ) and “analysability” (); it refers to the possibility of decomposing the figurative meaning of an idiom and associating the parts with its literal constituents (). Decomposability is, therefore, a variable that investigates the interface between the syntactic and semantic level of an idiom (). In fact, both Geeraerts () and Fadlon, Horvath, Siloni, and Wexler () argue that decomposable idioms are characterised by isomorphism between the syntactic and semantic structure. Consider two examples: kick the bucket is typically considered non-decomposable, since the figurative meaning “die” cannot be associated with both the constituents “kick” and “the bucket”; conversely, spill the beans, whose meaning is “reveal secret information”, is usually considered decomposable, since the syntactic-semantic bridge is rather easily identifiable: “spill” = “reveal” and “the beans” = “secret information” (see ).

Given its relevance to both syntax and semantics, decomposability has been included in most norming studies. It has also been identified as a key factor in explaining the degree of linguistic flexibility of idioms (). However, recent developments in linguistics and psycholinguistics suggest scaling back the overall explanatory power attributed to decomposability (; ) and considering it within a multi-layered scenario in which other variables also play a key role in the analysis of idiomatic variation.

Transparency is also referred to as “motivation” (), “relation” () and “semantic bridge” (); it refers to the possibility of tracing a synchronic relationship between the literal and figurative meaning of an idiom (). Such a relationship serves as the rationale for the figurative use, which can consequently be inferred from the literal constituents even when one is not familiar with an idiom (). Kick the bucket can be considered opaque (i.e., non-transparent), since the scene evoked by the literal meaning has no link with the scene evoked by the figurative meaning (there is an “incongruence”, in the terms of ); in contrast, the idiom pull strings, whose meaning is “leverage influential connections”, can be considered transparent when one thinks of the links between the two semantic poles: the first link is the conceptual metaphor CONTROL IS OBJECT MANIPULATION (present in the MetaNet database), while the second link is the presence in one’s world knowledge of the figure of the puppeteer, who imposes control over a puppet with the aid of some strings (). Overall, transparency has been investigated less, as it is often subsumed within decomposability itself (as in ). However, Geeraerts () and later Hubers et al. () argue that these variables are distinct constructs: decomposability draws attention to the relationship between the syntactic and semantic dimensions of an idiom, while transparency highlights any conceptual links (typically provided by metaphor, metonymy and world knowledge, see ; ) that may exist between the figurative and literal meaning of an idiom.

Only a few studies have considered both decomposability and transparency: see Carrol, Littlemore, and Dowens () and Michl (, ). However, these studies differ from each other in their definitions and operationalisations. Carrol et al. () define transparency as the ease with which the meaning of an idiom can be guessed on the basis of individual words, and operationalise it by means of a 1 to 7 Likert scale. In their study, transparency is presented pre-meaning, i.e. it is rated before the figurative meaning of the idiom is made explicit. Decomposability, in turn, is defined as the ease with which the connection between individual words and the figurative meaning can be seen once the figurative meaning of the idiom is made explicit. It is hence presented post-meaning. Like transparency, it is operationalised on a 1 to 7 Likert scale.

By contrast, Michl () divides transparency into two sub-variables: comprehensibility and relation. The first is defined as “the ease with which the meaning of an idiomatic unit can be recovered” (p. 102), and the author herself equates it to the variable meaningfulness introduced in other studies (see 1.1.2). The second describes “the relatedness between the literal and the non-literal meaning of an idiom” (p. 102). It refers to the strength of the semantic-conceptual link between the two meaning poles of the idiom. Relation is therefore closer to transparency as understood in this paper. Both sub-variables are operationalised on a 1 to 5 Likert scale and are presented post-meaning. As for decomposability, it is examined in Michl (), and is defined as “the degree to which the meaning of an idiom can be composed from the meaning of its constituents” (p. 1288). It is presented post-meaning, but this time it is operationalised on a 1 to 7 Likert scale.

Overall, such different definitions and operationalisations may lead to different results, making a comparative analysis of transparency and decomposability complex.

Transparency becomes especially relevant for the present norming study when one considers that the identification of the relationship between the two meanings allows speakers to move more easily along the continuum between literalness and figurativeness (), hence fostering the possibility of using transparent idioms in ambiguous contexts. Finally, the results reported by van Ginkel and Dijkstra () show how transparency interacts with literal plausibility in creating competitiveness between the literal and figurative meaning of an idiom.

1.1.2 Experience-Based Variables

Following Hubers et al. (), we define Experience-Based Variables (“EBVs”) as variables whose assessment strictly depends on speakers’ subjective experience with idioms. The EBVs considered are familiarity, meaningfulness and objective knowledge. EBVs have been consistently reported to influence CBVs perception and assessment (; ), which is why it is essential to consider both groups to obtain a comprehensive understanding of idioms’ internal variation.

Familiarity refers to the subjective frequency with which an idiom is used and heard/read (). In general, there is widespread consensus on the impact of familiarity on the perception of the other variables (see especially ; ), as well as on idiom cognitive processing ().

Meaningfulness is also referred to as “knowledge” (); it refers to the perceived, subjective knowledge of the figurative meaning of an idiom (; ). As both familiarity and meaningfulness measure a speaker’s subjective experience with an idiom, the two EBVs are related. This does not mean, however, that they describe the same thing (), since one may know the figurative meaning of an idiom well even if the idiom is used infrequently.

Objective knowledge refers to the actual knowledge of the figurative meaning of an idiom; it is the objective counterpart of meaningfulness, as it verifies the correctness of the figurative meaning associated with an idiom. Citron et al. () highlight that speakers can confidently believe that they know the figurative meaning (high meaningfulness) while having an incorrect paraphrase in mind (no objective knowledge). This is why it is appropriate to include objective knowledge as a counterbalance.

1.2 Motivations

The lexicon is a validation tool for future research on ambiguous idiomatic contexts. More precisely, for each idiom, the sum of the mean ratings of the three CBVs (literal plausibility, decomposability and transparency) provides the index of “Potential Idiomatic Ambiguity” (“PIA”), whereby the higher this is, the more suitable the syntactic, semantic and lexical configuration of the idiom is for occurring in ambiguous contexts like 1a and 1b. The lexicon is therefore a “structured tool” containing “empirically determined variables” (), which makes it an ideal starting point for the setup of future experiments.
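The definition of the index can be written compactly as follows. The symbols are our own shorthand and do not appear in the lexicon: for idiom i, the three barred terms stand for the mean ratings of literal plausibility, decomposability and transparency. Since each of the three variables is rated on a 1 to 5 Likert scale (see 2.3), the index is bounded between 3 and 15.

```latex
\mathrm{PIA}_i \;=\; \overline{LP}_i + \overline{D}_i + \overline{T}_i ,
\qquad 1 \le \overline{LP}_i,\ \overline{D}_i,\ \overline{T}_i \le 5
\;\Longrightarrow\; 3 \le \mathrm{PIA}_i \le 15 .
```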

Secondly, the dataset is a rich source of information to explore per se. Indeed, the cross-linguistic lexicon enables us to:

  1. Further enhance research in the field of idiom norming studies, especially because it bridges the current gap in cross-linguistic research (; ).
  2. Enrich knowledge on Italian idioms, which, as far as we know, have only been investigated in the norming study by Tabossi et al. ().
  3. Shed light in a more systematic way on two of the most complex variables considered in the literature, decomposability and transparency. As norming studies differ greatly in variable definitions and operationalisations (), the knowledge available on the two dimensions is fragmentary and diverse (; ; ; ). Since the norming study that led to the creation of the cross-linguistic lexicon exploited the same definitions and operationalisations for both English and Italian, it is possible to make progress in this regard too.

2 Method

2.1 Materials

One hundred and fifty pairs of English and Italian idioms with similar meanings were selected for the study. Initially, a larger sample (=600) of English idioms was pooled from several existing resources (the MAGPIE and TInCAP corpora by and , plus the dataset resulting from the norming study by ). Afterwards, the corresponding Italian idioms were manually selected by exploiting online resources and corpora. This resulted in 300 cross-linguistic idiom pairs.

The cross-linguistic pairs were subsequently manually annotated by categories of translatability (), which describes “how closely the idioms match in their word-for-word translations” (p. 27), meaning the degree of lexical similarity within each idiom pair. Following Irujo () and Liontas (), Beck () describes three levels of translatability: lexical (“LL”), in which there is an exact overlap of lexical material between two idioms; semi-lexical (“SL”), in which the lexical overlap is only partial; post-lexical (“PL”), where there is no lexical overlap. To obtain a balanced dataset, it was decided to include 50 pairs for each translatability category in the lexicon, resulting in a total of 150 pairs of cross-linguistic idioms. See Table 1 for some examples of idiom pairs belonging to each category.

Table 1

Idioms illustrating the three levels of translatability.


| LL | SL | PL |
|---|---|---|
| be green with envy – essere verdi d’invidia | be under someone’s feet – stare tra i piedi a qualcuno (“stay among someone’s feet”) | spill the beans – sputare il rospo (“spit the toad”) |
| pick up the pieces – raccogliere i pezzi | put your foot down – puntare i piedi (“point the feet”) | go through the wringer – passare le pene dell’inferno (“pass the hell’s pains”) |

All idioms included in the lexicon are presented within a minimal verb structure. Idioms that typically occur with the verb “to be” are also included.

2.2 Participants

Thirty participants per language were recruited via the crowdsourcing platform Prolific. Hubers et al. () showed that a minimum of twenty participants is necessary to obtain reliable data in norming studies. Furthermore, their analyses together with the ones in Nordmann and Jambazova () point out that an increase in the number of participants does not result in higher data reliability. For this reason, thirty seemed the appropriate number in view of possible exclusions due to data quality issues.

The selection of potential participants was filtered by native language (English or Italian), minimum number of submissions on Prolific (15–40), and minimum success rate in previous studies (75–100%).

Before deciding whether to take part in the study, participants were informed of the relevant details on Prolific; these consisted of a brief description of the tasks to be completed, the estimated completion time (based on a previously conducted pilot), the way we would keep track of data quality, the devices allowed to take part in the study, and payment details. In this regard, it was decided to meet a standard of £9 per hour, which Prolific labels as “good payment”.

All participants provided informed consent before starting the study. They were then asked to provide some demographic information: sex assigned at birth, age, country of residence, highest degree obtained, number and type of languages known. Data collection was divided into three batches: ten participants at a time were recruited for each language.

As a result of data quality control (2.3.1), one Italian-speaking participant and two English-speaking participants were excluded. See Table 2 for a summary of the demographic details of the workers whose data have been included in the lexicon. To view the demographic details of each participant, consult the “participant_details_ENG” and “participant_details_ITA” files in the repository.

Table 2

Overview of participants’ demographic details.


| | English | Italian |
|---|---|---|
| Sex | 20 F; 8 M | 17 F; 12 M |
| Age – mean (s.d.) | 33.3 (9.5) | 30.8 (11.1) |
| Residence countries | UK (14); South Africa (5); Ireland (3); Canada (3); Poland (1); Portugal (1); USA (1) | Italy (26); UK (3) |
| Education | High school diploma (10); Bachelor’s degree (10); Master’s degree (7); PhD (1) | High school diploma (15); Bachelor’s degree (8); Master’s degree (5); Junior high school diploma (1) |
| Number of languages per participant – median (IQR) | 1 (1) | 2 (1) |

2.3 Design

The data collection was carried out entirely online via the Gorilla Experiment Builder research platform. For each language, a two-part norming study was developed. During the first part, participants were asked to assess the EBVs and literal plausibility: familiarity, meaningfulness (which was labelled “knowledge of the meaning” in the study, for the sake of clarity) and literal plausibility were operationalised using 1 to 5 Likert scales, while objective knowledge was operationalised via a dropdown menu from which participants chose the correct idiom paraphrase among three alternatives. In the second part, participants assessed decomposability and transparency, both operationalised using 1 to 5 Likert scales (for operationalisation examples, see the Supplementary Files “example_ratings_part1.pdf” and “example_ratings_part2.pdf”).

Before accessing the evaluation phase, participants read detailed instructions including definitions of all the relevant terms and careful illustrations of the variables for each section. Where appropriate, variable explanation was accompanied by examples showing the extremes of the Likert scale. In addition, a reference question was formulated for each variable, which participants were invited to ask themselves in case of doubts during the rating phase. For instance, the reference question for transparency was “how easily can I infer the figurative meaning from the literal meaning – based on a possible relationship between the two?” (see Supplementary Files: “instructions_part1.pdf” and “instructions_part2.pdf”). Each reference question was then placed next to each variable operationalised by means of a Likert scale, so that it was readily available.

For each section, idioms were spread over five randomised pages comprising approximately 30 idioms each. Upon completion of each page, participants were invited to take a short break before continuing. After completing the first section, participants only had access to the second one after a minimum of 24 hours; they were informed in advance on Prolific that compensation for their work would be paid once both parts of the study were completed.

Setting up and monitoring a multi-part online study poses challenges with regard to potential technical issues and participant engagement. Nevertheless, there are good reasons for choosing such a complex design. The variables were divided into two separate sections because the idioms had to be presented without their figurative meaning paraphrase in the first part. This ensures that the first-part tasks (the selection of the correct meaning and the rating of the plausibility of the literal meaning) are not interfered with. In the second part, the idioms are presented with their figurative meaning paraphrase to obtain reliable ratings for decomposability and transparency (). Furthermore, the multi-part design made it possible to spread the cognitive load on the participants. Therefore, the selected structure seemed the most suitable option to have a reliable fully within-subject design in which each participant rated each idiom for each variable.

It should now be pointed out that the most popular design in idiom norming studies is the between-subject one, where different groups of participants evaluate a single variable or a group of them, after which the data are aggregated (see, among others, ; ; ; ). In this regard, Nordmann and Jambazova () have shown that design choice does not impact the correlations between the variables. However, they also claim that “it is important to collect these ratings within subjects, because they can never be independent and should not be treated as such” (p. 200). Following the same line, Carrol et al. () also stated that “familiarity has a direct influence on perceptions of transparency, and semantic judgements therefore cannot be treated as independent” (p. 40). Therefore, adopting a within-subject design is the most rigorous approach to obtain data that accurately reflect the relationships between idiom variables.

2.3.1 Data quality tracking

To ensure the quality of the answers provided by the crowdworkers, some control items were included in the study. For their creation, the norming study by Bulkes and Tanner () was taken as a model.

In the first section, ten idioms (two per page) from other languages (Thai, Japanese, Korean, Arabic, Hindi) translated into Italian and English were included. These control idioms do not exist in English and Italian, so the expectation was to obtain low ratings (≤2) on the Likert scale associated with familiarity and meaningfulness. In the second section, ten (two per page) literal expressions (such as “eat a sandwich”, which has been paraphrased as “nourish oneself with some stuffed bread”) were included. As these expressions do not have a figurative meaning, the expectation was to obtain high ratings (≥4) for both decomposability and transparency. The control elements were considered successfully passed if participants answered seven out of 10 per section correctly (but see 3.1). In case of failure, participants were contacted on Prolific, given detailed explanations of their responses and of the contradicted expectations, and finally asked to “Return” their submission. Note that at the end of each section, a debriefing was included in which the control elements were explained to the workers (see Supplementary Files).
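The pass rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not the procedure actually run on Gorilla: the function names are hypothetical, and we assume that a control item counts as "correct" only when both of its ratings meet the expectation.

```python
def count_passed(items, section):
    """Count control items rated as expected.

    items: list of (rating_a, rating_b) tuples, one per control item.
      Section 1: (familiarity, meaningfulness), expected to be <= 2.
      Section 2: (decomposability, transparency), expected to be >= 4.
    Assumption: both ratings of an item must meet the expectation.
    """
    if section == 1:
        return sum(1 for a, b in items if a <= 2 and b <= 2)
    if section == 2:
        return sum(1 for a, b in items if a >= 4 and b >= 4)
    raise ValueError("section must be 1 or 2")


def passes_controls(items, section, required=7):
    """Apply the 'seven out of 10' rule to one section."""
    return count_passed(items, section) >= required


# Invented ratings for illustration only.
section1 = [(1, 1)] * 6 + [(4, 3)] * 4   # six foreign idioms rated as unfamiliar
section2 = [(5, 4)] * 10                 # all literal expressions rated high

print(passes_controls(section1, 1))  # False: 6 of 10 is below the threshold
print(passes_controls(section2, 2))  # True
```

As section 3.1 shows, such a count is only a first screen: borderline cases were also examined qualitatively before a decision was taken.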

3 Results and discussion

3.1 Crowdsourcing outcome and ethics

Table 3 shows the average completion times of the study, divided into first part, second part and total. To view the completion times of each participant, consult the “participant_details_ENG” and “participant_details_ITA” files in the repository.

Table 3

Average completion times.



| First part | Second part | Total |
|---|---|---|
| 58 (32) | 47 (23) | 105 (52) |

To the best of our knowledge, the combination of tools used (Gorilla and Prolific) and design adopted (fully within-subject and multi-part) is novel in the field. It therefore seems appropriate to illustrate, through two case studies, the strategies followed to make ethical decisions regarding the crowdworkers. The first episode involves a participant who failed the control items of the second part (scoring 3/10). They were therefore contacted and asked to “Return” the submission. The worker subsequently wrote a detailed explanation of their ratings, which revealed a misunderstanding of decomposability and transparency. It was consequently decided to collect more information on their behaviour. First, a general tendency to give low ratings of decomposability and transparency to all idioms was noted. Second, the time taken to read the instructions and to complete the ratings was checked, and it was evident that the worker had invested time in completing the study. The combination of quantitative and qualitative observations thus revealed that their responses were intentional and consistent throughout the second section of the study. As a result, it was decided to compensate the participant, but their data were not included in the lexicon.

The second episode involves a participant who passed only six out of 10 control items in the first section. Upon examining the languages the participant claimed to know, it was noticed that these included Malay and Chinese, which are languages geographically close to Thai and Japanese, from which two idioms each were translated. Interestingly, when checking which control idioms the participant considered to be relatively familiar and meaningful, we found the two Thai idioms and one Japanese idiom. It was hypothesized that linguistic interference might be at play, as the participant may be familiar with similar idioms in the known languages. As a result, the participant was granted access to the second section, where they performed perfectly on the control items. Consequently, their data was included in the normed idiom lexicon.

In conclusion, when crowdsourcing is exploited to collect data, it is advisable to have both quantitative and qualitative tools to track data quality. When no problems arise, quantitative tools are sufficient to declare the goodness of data. However, when facing doubts and/or complex situations, qualitative observations regarding workers’ behaviour and responses should also be exploited, so that researchers can make informed, ethical decisions with respect to the hired workers.

3.2 Dataset description

The normed lexicon of cross-linguistic idioms is stored in the repository of the University of Göttingen, where it is searchable and downloadable. Its Digital Object Identifier (DOI) is 10.25625/EPSWDY, and the licence applied is CC BY-NC 4.0. The data files are in the non-proprietary csv format.

The lexicon in its entirety is contained in the “whole_dataset” file, which has the following columns:

  • IDIOM_ID: the id associated with each idiom. Note that each pair of cross-linguistic idioms has the same numeric id, differentiated by the language (e.g., the idioms 15eng and 15ita constitute a cross-linguistic pair).
  • IDIOM: the idiom.
  • LANGUAGE: two-level variable (“ENG” or “ITA”) indicating whether the idiom is Italian or English.
  • TRANSLATABILITY: three-level variable (“LL”, “SL”, “PL”) indicating the translatability relationship within the cross-linguistic idiom pair.
  • SYNTAX: shallow parsing of the syntactic structure of each idiom. This column was included for its potential in future research, especially regarding decomposability, as the variable is particularly relevant to the analysis of the interface between idiom syntax and semantics. Note that idioms presented within a verbal structure beginning with the verb “to be” are differentiated: be a breath of fresh air is analysed as VPbe_NP_PP, while get a slice of the pie as VP_NP_PP.
  • SAME_SYNTAX: two-level variable (“YES” or “NO”), it indicates whether the idioms within a cross-linguistic pair have the same syntactic structure.
  • PARTICIPANT: the id of the participant who provided the data in the row.
  • Familiarity; Meaningfulness; Literal_Plausibility; Decomposability; Transparency: the rating from 1 to 5 provided by the participant for the relevant variable.
  • Objective_Knowledge: two-level variable (0 or 1) indicating whether the participant correctly guessed (= 1) or not (= 0) the figurative meaning of the idiom.

The normed lexicon in its entirety is ready to be used for descriptive and/or inferential statistical analyses.

It should be noted that releasing all raw ratings of an idiom norming study is not common practice: in most cases, only aggregate data for each variable is released, typically the mean associated with the standard deviation (some examples: ; ; ; ; ). To ensure maximum transparency and adhere as closely as possible to the open science guidelines, it was decided to release the dataset in its entirety.

Both because it is common practice in the field and because we want to allow for observations of a more qualitative nature, it was nevertheless decided to release the dataset also in the form of aggregate data. In particular, two perspectives are present in the repository: the monolingual and the cross-linguistic one.

The “monolingual overviews” folder contains two csv files, one for English and the other for Italian (“eng_overview” and “ita_overview”). These files are designed for those interested in only one of the two languages. Let us see the columns of “ita_overview” as an example, i.e. the aggregate data regarding Italian idioms.

  • IDIOM_ID, IDIOM_ITA, SYNTAX: same as in “whole_dataset” (“IDIOM_ITA” instead of just “IDIOM”).
  • IDIOM_LITERAL_TRANSLATION: the literal English translation of the Italian idiom, to make the content accessible to everybody (absent in the “eng_overview” file).
  • IDIOM_MEANING_ITA: the figurative meaning of the idiom in Italian. Note that this is the paraphrase that was read by the participants while completing the study.
  • IDIOM_MEANING_ENG: the paraphrase in English, to make it accessible to everybody.
  • Familiarity_MEAN: mean value resulting from all raw ratings the idiom received on the familiarity scale. The same applies to “Meaningfulness_MEAN”, “Literal_Plausibility_MEAN”, “Decomposability_MEAN” and “Transparency_MEAN”.
  • Familiarity_SD: standard deviation of the raw familiarity ratings around their mean. The same applies to “Meaningfulness_SD”, “Literal_Plausibility_SD”, “Decomposability_SD” and “Transparency_SD”.
  • Objective_Knowledge_PROPORTION: proportion ranging from 0 to 1 indicating how often the meaning of the idiom was correctly selected.
  • PIA: the index of Potential Idiomatic Ambiguity associated with the idiom, resulting from the sum of the mean values of literal plausibility, decomposability and transparency.
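To make the link between the raw ratings and the aggregate columns concrete, the following sketch derives means, standard deviations, the objective-knowledge proportion and the PIA from long-format rows. It is written in Python for illustration only (the released script is in R). The sample rows are invented, and the use of the sample standard deviation (n − 1) is an assumption, since the variant used in the lexicon is not stated here.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Invented rows in the long format of "whole_dataset" (one row per
# participant and idiom); the values are NOT taken from the lexicon.
SAMPLE = """IDIOM_ID,PARTICIPANT,Literal_Plausibility,Decomposability,Transparency,Objective_Knowledge
15ita,p01,5,2,3,1
15ita,p02,4,3,3,1
15ita,p03,3,1,3,0
"""

CBVS = ["Literal_Plausibility", "Decomposability", "Transparency"]


def aggregate(rows):
    """Return per-idiom means, SDs, knowledge proportion and PIA."""
    by_idiom = defaultdict(list)
    for row in rows:
        by_idiom[row["IDIOM_ID"]].append(row)

    overview = {}
    for idiom_id, group in by_idiom.items():
        entry = {}
        for col in CBVS:
            ratings = [int(r[col]) for r in group]
            entry[col + "_MEAN"] = mean(ratings)
            # Sample SD (n - 1): an assumption, see the lead-in.
            entry[col + "_SD"] = stdev(ratings)
        known = [int(r["Objective_Knowledge"]) for r in group]
        entry["Objective_Knowledge_PROPORTION"] = sum(known) / len(known)
        # PIA: sum of the mean ratings of the three CBVs.
        entry["PIA"] = sum(entry[col + "_MEAN"] for col in CBVS)
        overview[idiom_id] = entry
    return overview


overview = aggregate(csv.DictReader(io.StringIO(SAMPLE)))
print(overview["15ita"]["PIA"])  # 9 (means 4 + 2 + 3)
```

The same function could be applied to the familiarity and meaningfulness columns by extending the list of rated variables, keeping the PIA restricted to the three CBVs.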

Finally, the “cross-linguistic overviews” folder contains three csv files, one for each translatability relation (“crossling_overview_LL”, “crossling_overview_SL”, “crossling_overview_PL”). These files bring together the pairs of idioms that share the same translatability relationship, and allow for an immediate comparison based on the aggregate data between the idioms that are within a cross-linguistic pair. The files are arranged horizontally, so that on the left-hand side are the English idioms and on the right-hand side the corresponding Italian idioms. The columns included are “IDIOM”, “IDIOM_ID”, “SYNTAX”, “SAME_SYNTAX”, “IDIOM_MEANING” (for the Italian part also “IDIOM_LITERAL_TRANSLATION”), all “MEAN” values and associated “SD” for idiom features on Likert scales, the “PROPORTION” for objective knowledge, and finally the “PIA”.
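The side-by-side layout relies on the shared numeric part of the idiom ids. A minimal sketch of that pairing step is given below, assuming ids of the form digits plus “eng” or “ita” as in the example 15eng/15ita. The records, the id numbers assigned to the idioms and the PIA values are invented for illustration.

```python
import re

# Ids are assumed to follow the pattern shown in the text: "15eng", "15ita".
ID_PATTERN = re.compile(r"^(\d+)(eng|ita)$")

# Hypothetical aggregate records keyed by IDIOM_ID (invented values).
records = {
    "15eng": {"IDIOM": "spill the beans", "PIA": 9.1},
    "15ita": {"IDIOM": "sputare il rospo", "PIA": 8.4},
    "16eng": {"IDIOM": "pick up the pieces", "PIA": 11.0},
}


def pair_up(records):
    """Group records by numeric id and keep only complete pairs."""
    pairs = {}
    for idiom_id, record in records.items():
        match = ID_PATTERN.match(idiom_id)
        if match is None:
            raise ValueError(f"unexpected IDIOM_ID: {idiom_id!r}")
        number, language = int(match.group(1)), match.group(2)
        pairs.setdefault(number, {})[language] = record
    return {n: p for n, p in pairs.items() if "eng" in p and "ita" in p}


pairs = pair_up(records)
for number, pair in pairs.items():
    # English on the left, Italian on the right, as in the overview files.
    print(number, pair["eng"]["IDIOM"], "|", pair["ita"]["IDIOM"])
```

Here the unpaired record 16eng is dropped, so only pair 15 is printed.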

3.3 Reliability

By data reliability we refer to the degree to which the ratings obtained covary in a sensible way (see ). Measuring reliability is in this case especially important, since the combination of tools (Gorilla and Prolific) and design (fully within-subject, based on a two-part structure) has not been previously tested in the literature. It is therefore a way to validate or invalidate the proposed methodology through statistical analysis.

A well-established metric is the Intraclass Correlation Coefficient (ICC), a statistical measure spanning from 0 to 1 that quantifies the degree of consistency among ratings made by different raters (see ). There are different types of ICC: the most appropriate for the present case is the ICC (2,k), which checks rating consistency while assuming random effects for both participants and items ().
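For readers who want to see what the ICC (2,k) computes, here is a minimal pure-Python sketch based on the standard two-way ANOVA mean squares (average of k raters, raters and targets both treated as random). It is not the R code used for the lexicon. The function name and the small 6 × 4 rating matrix are illustrative, with rows standing for rated targets (here: idioms) and columns for raters (here: participants).

```python
def icc_2k(ratings):
    """ICC(2,k) for a complete targets x raters matrix (list of rows)."""
    n = len(ratings)          # number of targets (rows)
    k = len(ratings[0])       # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n)
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Illustrative data: 6 targets rated by 4 raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2k(example), 2))  # 0.62
```

The sketch assumes a complete matrix with no missing ratings, which matches a fully within-subject design in which every participant rates every idiom.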

Furthermore, Hubers et al. () introduced generalizability theory for assessing data reliability in idiom norming studies. This theory exploits two coefficients: the generalizability (“g-coefficient”) and the dependability coefficient (“d-coefficient”). Following Clayson, Carbine, Baldwin, Olsen, and Larson (), the former is low when inter-individual ratings are inconsistent; the latter “is low when measurements of the same individuals are inconsistent” (p. 8). Like the ICC, both coefficients span from 0 to 1.
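The two coefficients can be illustrated for the simplest case, a one-facet crossed design in which every rater rates every object of measurement. The sketch below uses the standard formulas for that design and takes the variance components as given. It is not the gtheory analysis reported here, and the function name, argument names and numbers are hypothetical.

```python
def g_and_d_coefficients(var_object, var_rater, var_residual, n_raters):
    """Generalizability (relative) and dependability (absolute) coefficients.

    One-facet crossed design, variance components assumed to be known:
      var_object   - variance due to the objects of measurement
      var_rater    - variance due to the rater main effect
      var_residual - interaction/residual variance
      n_raters     - number of raters averaged over
    """
    relative_error = var_residual / n_raters
    absolute_error = (var_rater + var_residual) / n_raters
    g = var_object / (var_object + relative_error)
    d = var_object / (var_object + absolute_error)
    return g, d


# Invented variance components for illustration.
g, d = g_and_d_coefficients(var_object=4.0, var_rater=1.0,
                            var_residual=2.0, n_raters=2)
print(round(g, 3), round(d, 3))  # 0.8 0.727
```

Because the absolute error term also includes the rater main effect, the d-coefficient can never exceed the g-coefficient in this design.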

For each language, we report the ICC (2,k), calculated in R (in the RStudio environment) with the psych package, together with the g-coefficient and the d-coefficient, both obtained via the gtheory package. The R script is available in the “supplementary files” folder in the repository. See Table 4 for the results on English and Table 5 for the results on Italian.

Table 4

English reliability ratings.

Literal plausibility    0.97    0.97    0.97

Table 5

Italian reliability ratings.

Literal plausibility    0.98    0.97    0.97

Results between 0.75 and 0.9 indicate good reliability; results above 0.9 indicate excellent reliability (). Tables 4 and 5 report very high rating reliability for both languages, including the typically challenging CBVs, whose assessment is considered to be more variable and difficult (; ). These results therefore validate the selected research tools and the multi-part design adopted. Factors contributing to these results may include the clear variable definitions, the explicit attention paid to the crowdworkers (shown through ethical decisions towards them and the valorisation of their work), and an effective distribution of cognitive load. Together, these elements support a high-quality dataset that can be reused in future research.

4 Conclusions and future research

This paper described the creation of a normed lexicon of 150 pairs of English and Italian idioms annotated by translatability level. The lexicon contains variables that are particularly relevant for the analysis of ambiguous idiomatic contexts where there is interference between literal and figurative meaning. It is stored in the repository of the University of Göttingen and can be explored from three different perspectives. The dataset was created through a cross-linguistic norming study carried out entirely online and relying on crowdsourcing. The reliability analysis validates the research tools used and the design selected.

Future research is two-fold. First, the lexicon will be used to select idioms for experimental settings that test the relationship between the internal structure of idioms and ambiguous contexts. This will be particularly feasible thanks to the PIA index, which is, for each idiom, the sum of the mean ratings obtained for literal plausibility, decomposability and transparency. Idioms can therefore be selected on the basis of this index, to verify whether idioms with a high PIA are indeed particularly suitable for occurrence in ambiguous contexts.
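The selection procedure described above can be sketched as follows. The example assumes that the three mean ratings of each idiom are already available; the dictionary keys, the placeholder idiom labels and the numeric values are hypothetical and do not come from the lexicon.

```python
def pia(mean_ratings):
    """PIA index: sum of the mean ratings for literal plausibility,
    decomposability and transparency (as defined in the text)."""
    return (
        mean_ratings["literal_plausibility"]
        + mean_ratings["decomposability"]
        + mean_ratings["transparency"]
    )


# Hypothetical idioms with invented mean ratings, for illustration only.
idioms = {
    "idiom_A": {"literal_plausibility": 4.5, "decomposability": 3.0, "transparency": 3.5},
    "idiom_B": {"literal_plausibility": 2.0, "decomposability": 1.5, "transparency": 2.5},
    "idiom_C": {"literal_plausibility": 3.0, "decomposability": 4.0, "transparency": 2.0},
}

# Rank idioms from highest to lowest PIA to pick candidates
# for experiments on ambiguous contexts.
ranked = sorted(idioms, key=lambda name: pia(idioms[name]), reverse=True)
for name in ranked:
    print(name, pia(idioms[name]))
```

Because the index is a plain sum, the three features contribute with equal weight; an idiom can reach a high PIA through different combinations of the three ratings.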

Second, the lexicon serves as a valuable research tool per se, representing the first step to bridge the gap in cross-linguistic research in idiom norming studies. Since the methodology has been explained in detail in the present paper, it can serve as a starting point for the creation of further cross-linguistic idiom datasets. In addition, the present lexicon allows for testing correlations between variables within each language and across languages. It also enables the application of statistical models to explore the relationships between variable sets, facilitating reliable cross-linguistic comparisons, as the definitions and operationalisations used are the same for both languages. Finally, these analyses will facilitate a more systematic understanding of both decomposability and transparency.

Data accessibility statement

The normed lexicon of cross-linguistic idioms is stored in the repository of the University of Göttingen, where it is searchable and downloadable. Its Digital Object Identifier (DOI) is 10.25625/EPSWDY, and the license applied is CC BY-NC 4.0. The data files are in the non-proprietary csv format.

Supplementary Files

Excerpts of the norming study in pdf format are available in the “supplementary files” folder in the data repository. These files include the consent form, instructions, data quality tracking strategies, Likert-scale examples and final debriefing for both sections of the study. The same folder includes the R script used to measure data reliability.