These data were produced as part of the research for a publication on the English dative alternation in spoken data.
The EAS is composed of transcripts of spontaneous conversations recorded in the period 2012–2014, and it contains over 4 million tokens. The corpus also contains rich metadata about the speakers and the context of the conversation.
The EAS provides a rich opportunity for studying linguistic phenomena in a deeper sociolinguistic context. The dataset presented here deals with the so-called English dative alternation. To identify such constructions, we manually queried the EAS via the CQPweb interface, an online corpus query and analysis system. The queries were carried out for six frequent verbs that occur with both dative alternation patterns: give, lend, show, send, offer, and sell.
Although these queries used the EAS, they can be reproduced in the full BNC2014 corpus via the metadata tag Sample release inclusion (available in CQPweb and the underlying XML). The queries produced six intermediate sets of results with concordance lines containing a limited surrounding context for each occurrence of the target verb in the corpus. These intermediate result sets were saved manually as separate spreadsheet files.
The raw files were manually examined, and the rows that did not correspond to either of the constructions in i. or ii. were filtered out. Examples of omitted results are phrasal verbs like “give up” and idioms such as “give a shit”. The remaining results were manually annotated for the two syntactic patterns exemplified in i. and ii. Additionally, the syntactic heads of the noun phrase arguments (e.g. “board” in the phrase “the board”) were manually identified. The corpus markup does not include annotation for the features we required. Experiments with automated syntactic annotation using tools trained on written English data did not yield good results, and for lack of appropriate training data a manual annotation process was followed. Moreover, we manually annotated the lemmas of recipients and themes with information about animacy. The resulting file is available as a supporting file alongside the dataset.
In a subsequent step, the concordance results were enriched with metadata annotation, downloaded separately from the corpus interface, and with speaker information, provided as a spreadsheet. This step was automated by means of a Python script which combines the semantic data exported from CQPweb with the data containing the manually annotated syntactic patterns from the corpus. Further data cleaning, primarily to increase consistency in the annotation, was carried out using R. Both the Python script and the R script are available as supporting files alongside the dataset.
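The enrichment step described above amounts to a left join of the annotated concordance rows with conversation-level and speaker-level metadata. The following is a minimal sketch of that idea and not the authors' script: the key columns `text_id` and `speaker_id`, the helper names, and the toy rows are all assumptions made for illustration. Fields that cannot be matched are coded as “NA”.

```python
def index_by(rows, key):
    """Map each row's value in column `key` to the row itself."""
    return {row[key]: row for row in rows}


def enrich(concordance, text_meta, speaker_meta):
    """Left-join concordance rows with text and speaker metadata.

    `text_id` and `speaker_id` are hypothetical key columns; the real
    exports may use different identifiers.
    """
    texts = index_by(text_meta, "text_id")
    speakers = index_by(speaker_meta, "speaker_id")

    # Collect every column name so unmatched rows can be padded with "NA".
    columns = []
    for table in (concordance, text_meta, speaker_meta):
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)

    enriched = []
    for row in concordance:
        merged = dict(row)
        merged.update(texts.get(row["text_id"], {}))
        merged.update(speakers.get(row["speaker_id"], {}))
        enriched.append({name: merged.get(name, "NA") for name in columns})
    return enriched


# Toy rows, invented for illustration only.
concordance = [
    {"text_id": "T1", "speaker_id": "S1", "Verb": "send", "Pattern": "VNPP"},
    {"text_id": "T2", "speaker_id": "S9", "Verb": "give", "Pattern": "VNN"},
]
text_meta = [
    {"text_id": "T1", "NumSpeakers": "2"},
    {"text_id": "T2", "NumSpeakers": "3"},
]
speaker_meta = [{"speaker_id": "S1", "Gender": "F"}]

rows = enrich(concordance, text_meta, speaker_meta)
```

In this sketch the second concordance row has no matching speaker record, so its `Gender` field is filled with “NA” while its conversation-level metadata is still attached.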
(3) Dataset description
Format names and versions
Version 1, comma-separated (csv) file
- Jenset, Gard B. (data curation, investigation, formal analysis, conceptualisation, software, methodology)
- McGillivray, Barbara (data curation, investigation, formal analysis, conceptualisation, software, methodology)
- Rundell, Michael (data curation, methodology)
This dataset consists of 1840 observations of transcribed informal spoken British English, described by the following 44 variables. Each observation corresponds to an occurrence of the verbs give, lend, show, send, sell, and offer in the BNC Spoken 2014 corpus. Missing values are coded as “NA” for compatibility with R.
|Variable||Description||Example relative to the sentence just send Christmas cards … to people you don’t see from year to year|
|Verb||The verb lemma, one of “give”, “lend”, “show”, “send”, “offer”, and “sell”.||send|
|VerbSemTag||The semantic tag of the verb, obtained from the corpus semantic annotation, which is based on the UCREL semantic analysis system (USAS); tags are listed at http://ucrel.lancs.ac.uk/usas/semtags.txt.||M2 (‘Putting, taking, pulling, pushing, transporting &c.’)|
|Pattern||The observed dative construction, one of “VNPP” or “VNN”||VNPP|
|Recipient||The recipient’s noun phrase||people you don’t see|
|RecLen||The number of characters in the recipient||21|
|RecHead||The recipient’s syntactic head||people|
|RecPrn||Boolean defined programmatically based on the semantic tag of the recipient. If the semantic tag is ‘Z8’, the value is TRUE; otherwise, the value is FALSE.||NA|
|RecSemTag||String with the UCREL semantic tag of the recipient’s syntactic head||S2 (‘people’)|
|AnimateRec||Boolean indicating whether the recipient’s head is animate (TRUE) or inanimate (FALSE). This was manually annotated||FALSE|
|Theme||String with the theme’s noun phrase||Christmas cards|
|ThemeLen||The number of characters in the theme||15|
|ThemeHead||String with the theme’s syntactic head||cards|
|ThemePrn||Boolean defined programmatically based on the semantic tag of the theme. If the semantic tag is ‘Z8’, the value is TRUE; otherwise, the value is FALSE.||FALSE|
|ThemeSemTag||String with the UCREL semantic tag of the theme’s syntactic head||Q1 (‘LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION’)|
|ThemeField||First letter of the semantic tag of the theme’s syntactic head.||Q|
|DefTheme||Boolean indicating if the theme is expressed as a definite phrase (TRUE) or indefinite (FALSE)||FALSE|
|AnimateTheme||Boolean indicating whether the theme’s head is animate (TRUE) or inanimate (FALSE)||FALSE|
|NumSpeakers||Number of speakers in the conversation||Texts with 2 speakers|
|Location||Location where the conversation took place||Speakers’ home|
|Relation||Relationship between the speakers in the conversation||Close family, partners, very close friends|
|Subject||Subject of conversation||Mother and daughter talking about theatre|
|Topics||Topics covered in the conversation||Theatre, Disney films, websites, post, Christmas, jobs|
|ExactAge||Exact age of the main speaker in the conversation||44|
|AgeRange||The age range of the main speaker in the conversation||40_49|
|AgeRangeMid||Mid-point of the age range of the main speaker in the conversation. This variable is automatically calculated||45|
|AgeImputed||Equals the exact age of the main speaker in the conversation if it is recorded; the mid-point of the age range of the main speaker in the conversation if the age range is recorded but not the exact age; otherwise, NA. This variable is automatically calculated||44|
|Gender||Gender of the main speaker in the conversation (M or F)||F|
|Nationality||Nationality of the main speaker in the conversation||British|
|BirthCountry||Country of birth of the main speaker in the conversation||England|
|L1||First language of the main speaker in the conversation||English|
|LingOrigin||Country of linguistic origin of the main speaker in the conversation||England|
|Accent||Accent of the main speaker in the conversation||South East England|
|City||City where the conversation took place||High Wycombe|
|Country||Country where the conversation took place||England|
|Level1Dialect||First level of granularity in the categorization of the dialect of the main speaker in the conversation||uk|
|Level2Dialect||Second level of granularity in the categorization of the dialect of the main speaker in the conversation||english|
|Level3Dialect||Third level of granularity in the categorization of the dialect of the main speaker in the conversation||south|
|Level4Dialect||Fourth level of granularity in the categorization of the dialect of the main speaker in the conversation||southeast|
|SpeakerHighestQual||Highest qualification of the main speaker in the conversation||Graduate|
|Occupation||Occupation of the main speaker in the conversation||Team leader|
|SpeakerSocGrade||Social grade of the main speaker in the conversation, according to the classification developed by the National Readership Survey (https://web.archive.org/web/20110303033539/http://www.nrs.co.uk/lifestyle.htm)||E|
|ForeignLangs||Foreign languages spoken by the main speaker in the conversation||French–level unspecified; Spanish–level unspecified|
|NumUtterances||Number of utterances of the conversation’s main speaker in the whole corpus||99|
|NumWords||Number of words uttered by the conversation’s main speaker in the whole corpus||1622|
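Several variables in the table are derived programmatically from others (RecPrn, ThemePrn, ThemeField, AgeRangeMid, AgeImputed). The sketch below shows one way these rules could be expressed, following the descriptions in the table; it is not the authors' code. The function names are hypothetical, the exact-match test on ‘Z8’ and the treatment of a missing tag are assumptions, and the rounding rule for the mid-point is an assumption chosen so that the range 40_49 yields 45, as in the table.

```python
import re

NA = "NA"  # missing-value code used in the dataset


def parse_na(value):
    """Convert the dataset's "NA" string to None."""
    return None if value == NA else value


def is_pronoun(sem_tag):
    """RecPrn / ThemePrn rule: True if the semantic tag is Z8, else False.

    Returning None for a missing tag is an assumption.
    """
    if sem_tag is None:
        return None
    return sem_tag == "Z8"


def semantic_field(sem_tag):
    """ThemeField rule: first letter of the semantic tag."""
    return sem_tag[0] if sem_tag else None


def age_range_mid(age_range):
    """AgeRangeMid: mid-point of a range such as "40_49".

    Rounding half up is an assumption that reproduces the table's
    example (40_49 -> 45). Ranges in any other format give None.
    """
    match = re.fullmatch(r"(\d+)_(\d+)", age_range or "")
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    return (low + high + 1) // 2


def age_imputed(exact_age, age_range):
    """AgeImputed: exact age if recorded, else the range mid-point, else None."""
    if exact_age is not None:
        return int(exact_age)
    return age_range_mid(age_range)


# Values taken from the example column of the table.
example_mid = age_range_mid("40_49")
example_imputed = age_imputed(parse_na("44"), "40_49")
fallback_imputed = age_imputed(parse_na("NA"), "40_49")
```

Converting “NA” to `None` before applying the rules keeps the missing-value convention of the CSV separate from the logic of the derived variables.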
CC BY 4.0
(4) Reuse potential
Linguistics shows a growing trend towards quantitative research, although this trend is not proceeding at the same pace in all branches of the field. A natural corollary of this increase in quantitative research is a focus on replicable and reproducible research.
True replicability is difficult to achieve in many field-based disciplines and social sciences. A more achievable goal is reproducibility, which is clearly important for increasing scientific transparency and accountability. A move towards greater reliance on usage-based theory development can drive convergence in linguistic theory generally as well as in specific sub-fields. Despite some notable exceptions (such as second language acquisition), most linguistic sub-fields do not have a strong tradition of making research data available. Publishing not only corpora and raw data, but also the annotated research datasets, means that data can be compared quantitatively across research traditions, or pooled into meta-studies for greater theoretical insights.
For linguistics, and in particular corpus linguistics, the aim of reproducibility requires access not only to raw corpus data, but also to manually retrieved, annotated, and categorised data. Despite advances in computational linguistics, automatic annotation tools still fall short in theoretically important areas such as pragmatics and semantics. In the case of transcribed spoken text, the challenges are compounded by the nature of spoken language. Moreover, parsing tools for automatic syntactic analysis still do not perform as well on such data as on written text. As a result, manual annotation is in many cases inevitable.
Another effect of the required manual effort is that the annotated research datasets remain comparatively small. From this observation, two further use cases for shared data automatically follow. First, by pooling together different datasets, the resulting increase in statistical power may allow researchers to draw new conclusions based on correlations that remained obscure in smaller datasets. Second, despite great advances in the range of statistical NLP tools, there are still gaps when it comes to specialised but valuable tasks such as annotating linguistic data for a specific construction. The problem with training data for NLP tools is more commonly associated with historical linguistics [6, 20]. However, much of the freely available NLP data stem from written, not spoken language. Furthermore, any specific task for which training data is required will require specific training data, and such data will often be scarce due to the cost involved in manual annotation.
By publishing the dative alternation data, we contribute to all these reusability scenarios. The dative alternation is a topic of active research in linguistics, not least because it has been studied from different theoretical traditions. It is a prominent example of the convergence of different theoretical and empirical research questions in linguistics, providing evidence for the motivations behind the linguistic decisions that speakers make. It is well established that both syntactic and pragmatic factors (especially discourse-new versus discourse-old information) play a role in choosing between the two constructions i. and ii. Later studies have confirmed these findings while adding further nuance. The semantics of the verb arguments also plays a role, and there is agreement that, on the whole, the dative alternation is subject to broadly similar constraints across different macro-varieties of English [2, 13, 14, 15, 16, 17].
Despite this activity, the dative alternation continues to draw theoretical and empirical attention in linguistics, with a number of relevant and underexplored questions remaining. These include questions of linguistic prototypicality, the role of probability in spoken grammar, and the role of individual-level sociolinguistic factors.
Despite the interest in the dative alternation, few datasets from the published literature have been made publicly available. One notable exception is the dataset from , which was made available in an R package in 2008  and re-used for didactic purposes in . Another recent exception is .
By publishing this dataset, we aim to raise awareness of the need for reproducibility in linguistics and, specifically, to support the progress of empirical research on the English dative alternation.