Data from ‘The Dative Alternation Revisited: Fresh Insights from Contemporary British Spoken Data’

(2) Methods
The EAS is composed of transcripts of spontaneous conversations recorded in the period 2012–2014 and contains over 4 million tokens. The corpus also contains rich metadata about the speakers and the context of the conversation [1]. The EAS thus provides a rich opportunity for studying linguistic phenomena in a deeper sociolinguistic context. The dataset presented here deals with the so-called English dative alternation.
To identify such constructions, we manually queried the EAS via the CQPweb interface, an online corpus query and analysis system [3]. The queries were carried out for six frequent verbs that occur with both dative alternation patterns [2]: give, lend, show, send, offer, and sell. Although these queries used the EAS, they can be reproduced in the full BNC2014 corpus via the metadata tag Sample release inclusion (available in CQPweb and the underlying XML). The queries produced six intermediate sets of results, each consisting of concordance lines with a limited surrounding context for every occurrence of the target verb in the corpus. These intermediate result sets were saved manually as separate spreadsheet files.
The raw files were manually examined, and the rows that did not correspond to either of the constructions in i. or ii. were filtered out. Examples of omitted results are phrasal verbs like “give up” and idioms such as “give a shit”. The remaining results were manually annotated for the two syntactic patterns exemplified in i. and ii. Additionally, the syntactic heads of the noun phrase arguments (e.g. “board” in the phrase “the board”) were manually identified. The corpus markup does not include annotation for the features we required. Experiments with automated syntactic annotation using tools trained on written English data did not yield good results, and for lack of appropriate training data a manual annotation process was followed. Moreover, we manually annotated the lemmas of recipients and themes with information about animacy.
The resulting file is available as a supporting file alongside the dataset.
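The filtering described above was carried out manually. Purely as an illustration, a pre-filter along the following lines could flag likely non-dative uses for a human annotator to check. It is a minimal sketch: the pattern list and the function name `flag_for_review` are hypothetical and are not part of the published scripts.

```python
import re

# Hypothetical exclusion patterns (assumptions, not the authors' method).
# They only flag candidate rows; the final decision stays with the annotator.
GIVE_FORMS = r"g(?:ive|ives|iving|ave|iven)"
EXCLUDE_PATTERNS = [
    re.compile(rf"\b{GIVE_FORMS} up\b", re.IGNORECASE),      # phrasal verb
    re.compile(rf"\b{GIVE_FORMS} a shit\b", re.IGNORECASE),  # idiom
]


def flag_for_review(concordance_line: str) -> bool:
    """Return True if the line matches one of the known non-dative uses."""
    return any(p.search(concordance_line) for p in EXCLUDE_PATTERNS)


# Invented example lines, not taken from the corpus.
lines = [
    "I gave him the book",
    "she gave the book to him",
    "don't give up now",
    "I don't give a shit",
]
flagged = [line for line in lines if flag_for_review(line)]
kept = [line for line in lines if not flag_for_review(line)]
```

A surface pattern of this kind cannot replace manual inspection, since a string such as "give up" can also occur in genuine dative contexts; it merely narrows down the rows that need attention.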

In a subsequent step, the concordance results were enriched with metadata annotation, downloaded separately from the corpus interface, and with speaker information, provided as a spreadsheet. This step was automated by means of a Python script which combines the semantic data exported from CQPweb with the data containing the manually annotated syntactic patterns from the corpus. Further data cleaning, primarily for increasing consistency in the annotation, was carried out using R [4]. Both the Python script and the R script are available as supporting files alongside the dataset.
The verb lemma, one of "give", "lend", "show", "send", "offer", and "sell".
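The enrichment step amounts to joining the annotated concordance rows with text-level and speaker-level metadata. The following is a minimal sketch of such a join using only the Python standard library; it is not the published script, and all column names (`text_id`, `speaker_id`, `pattern`, `n_speakers`, `age_range`) and values are hypothetical.

```python
# Hypothetical annotated rows; column names and values are assumptions.
annotated = [
    {"text_id": "T001", "speaker_id": "S01", "verb": "give",
     "pattern": "double_object"},
    {"text_id": "T002", "speaker_id": "S02", "verb": "send",
     "pattern": "prepositional"},
]

# Hypothetical lookup tables keyed by text and speaker identifiers.
text_metadata = {"T001": {"n_speakers": "2"}, "T002": {"n_speakers": "3"}}
speaker_info = {"S01": {"age_range": "19-29"}, "S02": {"age_range": "60-69"}}


def enrich(rows, texts, speakers):
    """Left-join text- and speaker-level metadata onto each annotated row."""
    enriched_rows = []
    for row in rows:
        merged = dict(row)  # copy, so the input rows are left unchanged
        merged.update(texts.get(row["text_id"], {}))
        merged.update(speakers.get(row["speaker_id"], {}))
        enriched_rows.append(merged)
    return enriched_rows


enriched = enrich(annotated, text_metadata, speaker_info)
```

Rows whose identifiers are missing from a lookup table are kept without the extra columns, so no annotated observation is silently dropped during the join.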

(4) Reuse potential
There is a growing trend in linguistics towards quantitative research, a trend which is not proceeding at the same pace in all branches of linguistics [6]. A natural corollary of this increase in quantitative research is a focus on replicable and reproducible research [7]. True replicability is difficult to achieve in many field-based disciplines and social sciences [7]. A more achievable goal is reproducibility. Reproducibility is clearly important for increasing scientific transparency and accountability. A move towards greater reliance on usage-based theory development can drive convergence in linguistic theory generally [8] as well as in specific sub-fields [6]. Despite some notable exceptions (such as second language acquisition), most linguistic sub-fields do not have a strong tradition for making research data available [7]. Publishing not only corpora and raw data, but also the annotated research datasets means that data can be compared quantitatively across research traditions, or pooled into meta-studies for greater theoretical insights.
For linguistics, and in particular corpus linguistics, the aim of reproducibility requires not only access to raw corpus data, but also to manually retrieved, annotated, and categorised data. Despite advances in computational linguistics, automatic annotation tools still fall short in theoretically important areas such as pragmatics and semantics. In the case of transcribed spoken text, the challenges are compounded by the nature of spoken language. Moreover, parsing tools for automatic syntactic analysis still do not perform as well on such data as on written text. As a result, manual annotation is in many cases inevitable.
Another effect of the required manual effort is that the annotated research datasets remain comparatively small. From this observation, two further use cases for shared data automatically follow. First, by pooling together different datasets, the resulting increase in statistical power may allow researchers to draw new conclusions based on correlations that remained obscure in smaller datasets. Second, despite great advances in the range of statistical NLP tools, there are still gaps when it comes to specialised but valuable tasks such as annotating linguistic data for a specific construction. The problem with training data for NLP tools is more commonly associated with historical linguistics [6,20]. However, much of the freely available NLP data stem from written, not spoken language. Furthermore, any specific task for which training data is required will require specific training data, and such data will often be scarce due to the cost involved in manual annotation.
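The first use case, the gain in statistical power from pooling, can be illustrated with a small sketch. It compares the standard error of an estimated proportion of one construction in a single dataset and in a pooled one. The counts and the label names `double_object` and `prepositional` are invented for illustration, and the sketch assumes that the pooled datasets use compatible annotation schemes.

```python
from math import sqrt


def proportion_with_se(records, pattern="double_object"):
    """Return the share of `pattern` and its binomial standard error."""
    n = len(records)
    p = sum(1 for r in records if r["pattern"] == pattern) / n
    return p, sqrt(p * (1 - p) / n)


# Invented counts: two datasets with the same underlying proportion.
dataset_a = [{"pattern": "double_object"}] * 30 + [{"pattern": "prepositional"}] * 10
dataset_b = [{"pattern": "double_object"}] * 90 + [{"pattern": "prepositional"}] * 30
pooled = dataset_a + dataset_b

p_a, se_a = proportion_with_se(dataset_a)
p_pooled, se_pooled = proportion_with_se(pooled)
```

In this constructed case the pooled sample is four times larger than the first dataset, so the standard error is halved while the estimated proportion stays the same. Real datasets will rarely align this neatly, and differences in sampling and annotation would need to be modelled explicitly.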
By publishing the dative alternation data, we contribute to all these reusability scenarios. The dative alternation is a topic of active research in linguistics, not least because it has been studied from different theoretical traditions. The dative alternation is a prominent example of the convergence of different theoretical and empirical research questions in linguistics, providing evidence for the motivations behind the linguistic decisions that speakers make [9]. It is well established that both syntactic and pragmatic factors (especially discourse-new versus discourse-old information) play a role in choosing between the two constructions i. and ii. as shown in [10] and [11]. Later studies have confirmed these findings while adding further nuance. The semantics of the verb arguments also plays a role [12], and there is agreement that, on the whole, the dative alternation is subject to broadly similar constraints across different macro-varieties of English [2,13–17].
Despite this activity, the dative alternation continues to draw theoretical and empirical attention in linguistics, with a number of relevant and underexplored questions remaining. These include questions of linguistic prototypicality [16], the role of probability in spoken grammar [17], and the role of individual-level sociolinguistic factors [2].
Despite the interest in the dative alternation, few datasets from the published literature have been made publicly available. One notable exception is the dataset from [11], which was made available in an R package in 2008 [18] and re-used for didactic purposes in [19]. Another recent exception is [17].
By publishing this dataset we contribute to advancing the awareness of the need for reproducibility in linguistics, and specifically the progress of empirical research on the English dative alternation.