(1) Overview

Context

This data was produced as part of the research for a publication on the English dative alternation for spoken data [].

(2) Methods

The Early Access Subset (EAS) of the Spoken BNC2014 is composed of transcripts of spontaneous conversations recorded in the period 2012–2014, and contains over 4 million tokens. The corpus also contains rich metadata about the speakers and the context of each conversation [].

The EAS provides a rich opportunity for studying linguistic phenomena in a deeper sociolinguistic context. The dataset presented here deals with the so-called English dative alternation. To identify such constructions, we manually queried the EAS via the CQPweb interface, an online corpus query and analysis system []. The queries were carried out for six frequent verbs that occur with both dative alternation patterns []: give, lend, show, send, offer, and sell.

Although these queries used the EAS, they can be reproduced in the full BNC2014 corpus via the metadata tag Sample release inclusion (available in CQPweb and the underlying XML). The queries produced six intermediate sets of results with concordance lines containing a limited surrounding context for each occurrence of the target verb in the corpus. These intermediate result sets were saved manually as separate spreadsheet files.

The raw files were manually examined, and the rows that did not correspond to either of the constructions in i. or ii. were filtered out. Examples of omitted results are phrasal verbs like “give up” and idioms such as “give a shit”. The remaining results were manually annotated for the two syntactic patterns exemplified in i. and ii. Additionally, the syntactic heads of the noun phrase arguments (e.g. “board” in the phrase “the board”) were manually identified. The corpus markup does not include annotation for the features we required. Experiments with automated syntactic annotation using tools trained on written English data did not yield good results, so for lack of appropriate training data we followed a manual annotation process. Moreover, we manually annotated the lemmas of recipients and themes with information about animacy. The resulting file is available as a supporting file alongside the dataset.

In a subsequent step, the concordance results were enriched with metadata annotation, downloaded separately from the corpus interface, and with speaker information, provided as a spreadsheet. This step was automated by means of a Python script which combines the semantic data exported from CQPweb with the data containing the manually annotated syntactic patterns from the corpus. Further data cleaning, primarily for increasing consistency in the annotation, was carried out using R []. Both the Python script and the R script are available as supporting files alongside the dataset.
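The enrichment step described above amounts to a join between the annotated concordance rows and the per-text metadata. The sketch below is not the authors' script (which is available with the dataset); it is a minimal illustration using only the Python standard library. The join key `text_id`, the function name `enrich`, and the sample rows are all hypothetical.

```python
import csv
import io


def enrich(concordance_rows, metadata_rows, key="text_id"):
    """Left-join annotated concordance rows with per-text metadata on `key`.

    Rows without matching metadata keep their own fields and receive "NA"
    for every metadata column. The key name is an assumption.
    """
    metadata_by_key = {row[key]: row for row in metadata_rows}
    # Collect metadata column names (excluding the join key) in first-seen order.
    meta_columns = []
    for row in metadata_rows:
        for name in row:
            if name != key and name not in meta_columns:
                meta_columns.append(name)
    enriched = []
    for row in concordance_rows:
        match = metadata_by_key.get(row[key], {})
        merged = dict(row)
        for name in meta_columns:
            merged[name] = match.get(name, "NA")
        enriched.append(merged)
    return enriched


# Hypothetical miniature inputs standing in for the two exported files.
concordance_csv = "text_id,Verb,Pattern\nT1,send,VNPP\nT2,give,VNN\n"
metadata_csv = "text_id,Location\nT1,Speakers' home\n"

concordance = list(csv.DictReader(io.StringIO(concordance_csv)))
metadata = list(csv.DictReader(io.StringIO(metadata_csv)))
result = enrich(concordance, metadata)
```

A left join keeps every annotated observation even when its metadata is missing, which matches the dataset's convention of coding missing values as “NA”.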

(3) Dataset description

Object name

BNCspoken2014_dative_dataset_v1.csv

Format names and versions

Version 1, comma-separated (csv) file

Creation dates

2016-08-21–2016-10-16

Dataset Creators

  1. Jenset, Gard B. (data curation, investigation, formal analysis, conceptualisation, software, methodology)
  2. McGillivray, Barbara (data curation, investigation, formal analysis, conceptualisation, software, methodology)
  3. Rundell, Michael (data curation, methodology)

Language

This dataset consists of 1840 observations of transcribed informal spoken British English, described by the 44 variables listed below. Each observation corresponds to an occurrence of one of the verbs give, lend, show, send, sell, and offer in the BNC Spoken 2014 corpus. Missing values are coded as “NA” for compatibility with R.
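Outside R, the “NA” coding has to be handled explicitly when the file is read. The following is a minimal sketch using the Python standard library; the helper name `read_rows` and the two-row sample are hypothetical, and the UTF-8 encoding mentioned in the comment is an assumption rather than something stated for the published file.

```python
import csv
import io


def read_rows(handle):
    """Read CSV rows as dictionaries, mapping the literal string "NA" to None."""
    return [
        {name: (None if value == "NA" else value) for name, value in row.items()}
        for row in csv.DictReader(handle)
    ]


# Hypothetical two-row excerpt. With the real file, one would pass something like
# open("BNCspoken2014_dative_dataset_v1.csv", newline="", encoding="utf-8"),
# where the encoding is an assumption.
sample = io.StringIO("Verb,Pattern,ExactAge\nsend,VNPP,44\ngive,VNN,NA\n")
rows = read_rows(sample)
```

All values remain strings after reading; numeric columns such as ExactAge still need to be converted before analysis.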

Linguistic variables

| Variable | Description | Example relative to the sentence just send Christmas cards … to people you don’t see from year to year |
| --- | --- | --- |
| Verb | The verb lemma, one of “give”, “lend”, “show”, “send”, “offer”, and “sell”. | send |
| VerbSemTag | The semantic tag of the verb, obtained from the corpus semantic annotation, which is based on the UCREL [] Semantic Analysis System (USAS); tags are available at http://ucrel.lancs.ac.uk/usas/semtags.txt. | M2 (‘Putting, taking, pulling, pushing, transporting &c.’) |
| Pattern | The observed dative construction, one of “VNPP” or “VNN” | VNPP |
| Recipient | The recipient’s noun phrase | people you don’t see |
| RecLen | The number of characters in the recipient | 21 |
| RecHead | The recipient’s syntactic head | people |
| RecPrn | Boolean defined programmatically based on the semantic tag of the recipient. If the semantic tag is ‘Z8’, the value is TRUE; otherwise, the value is FALSE. | NA |
| RecSemTag | String with the UCREL [] semantic tag of the recipient’s syntactic head | S2 (‘people’) |
| AnimateRec | Boolean indicating whether the recipient’s head is animate (TRUE) or inanimate (FALSE). This was manually annotated | FALSE |
| Theme | String with the theme’s noun phrase | Christmas cards |
| ThemeLen | The number of characters in the theme | 15 |
| ThemeHead | String with the theme’s syntactic head | cards |
| ThemePrn | Boolean defined programmatically based on the semantic tag of the theme. If the semantic tag is ‘Z8’, the value is TRUE; otherwise, the value is FALSE. | FALSE |
| ThemeSemTag | String with the UCREL semantic tag of the theme’s syntactic head | Q1 (‘LINGUISTIC ACTIONS, STATES AND PROCESSES; COMMUNICATION’) |
| ThemeField | First letter of the semantic tag of the theme’s syntactic head. | Q |
| DefTheme | Boolean indicating if the theme is expressed as a definite phrase (TRUE) or indefinite (FALSE) | FALSE |
| AnimateTheme | Boolean indicating whether the theme’s head is animate (TRUE) or inanimate (FALSE) | FALSE |
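The programmatically derived theme variables can be illustrated with a short sketch. This is not the authors' code: the function name `derive_theme_fields` is hypothetical, the treatment of a missing tag (returning None) is an assumption, and the pronoun test uses exact equality with ‘Z8’ as the description states, although tags carrying additional markers might need different handling.

```python
def derive_theme_fields(theme, sem_tag):
    """Return ThemeLen, ThemePrn and ThemeField for one observation.

    `theme` is the theme's noun phrase and `sem_tag` the semantic tag of its
    syntactic head (None when missing; that handling is an assumption).
    """
    if sem_tag is None:
        return {"ThemeLen": len(theme), "ThemePrn": None, "ThemeField": None}
    return {
        "ThemeLen": len(theme),       # number of characters in the phrase
        "ThemePrn": sem_tag == "Z8",  # exact match, as in the variable description
        "ThemeField": sem_tag[0],     # first letter of the semantic tag
    }


example = derive_theme_fields("Christmas cards", "Q1")
pronoun = derive_theme_fields("it", "Z8")
```

For the example sentence, this reproduces the values given for ThemeLen (15), ThemePrn (FALSE), and ThemeField (Q).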

Metadata

| Variable | Description | Example |
| --- | --- | --- |
| NumSpeakers | Number of speakers in the conversation | Texts with 2 speakers |
| Location | Location where the conversation took place | Speakers’ home |
| Relation | Relationship between the speakers in the conversation | Close family, partners, very close friends |
| Subject | Subject of conversation | Mother and daughter talking about theatre |
| Topics | Topics covered in the conversation | Theatre, Disney films, websites, post, Christmas, jobs |
| ExactAge | Exact age of the main speaker in the conversation | 44 |
| AgeRange | The age range of the main speaker in the conversation | 40_49 |
| AgeRangeMid | Mid-point of the age range of the main speaker in the conversation. This variable is automatically calculated | 45 |
| AgeImputed | Equals the exact age of the main speaker in the conversation if it is recorded; the mid-point of the age range of the main speaker if the age range is recorded but not the exact age; otherwise, NA. This variable is automatically calculated | 44 |
| Gender | Gender of the main speaker in the conversation (M or F) | F |
| Nationality | Nationality of the main speaker in the conversation | British |
| BirthCountry | Country of birth of the main speaker in the conversation | England |
| L1 | First language of the main speaker in the conversation | English |
| LingOrigin | Country of linguistic origin of the main speaker in the conversation | England |
| Accent | Accent of the main speaker in the conversation | South East England |
| City | City where the conversation took place | High Wycombe |
| Country | Country where the conversation took place | England |
| Level1Dialect | First level of granularity in the categorization of the dialect of the main speaker in the conversation | uk |
| Level2Dialect | Second level of granularity in the categorization of the dialect of the main speaker in the conversation | english |
| Level3Dialect | Third level of granularity in the categorization of the dialect of the main speaker in the conversation | south |
| Level4Dialect | Fourth level of granularity in the categorization of the dialect of the main speaker in the conversation | southeast |
| SpeakerHighestQual | Highest qualification of the main speaker in the conversation | Graduate |
| Occupation | Occupation of the main speaker in the conversation | Team leader |
| SpeakerSocGrade | Social grade of the main speaker in the conversation, according to the classification developed by the National Readership Survey (https://web.archive.org/web/20110303033539/http://www.nrs.co.uk/lifestyle.htm) | E |
| ForeignLangs | Foreign languages spoken by the main speaker in the conversation | French–level unspecified; Spanish–level unspecified |
| NumUtterances | Number of utterances of the conversation’s main speaker in the whole corpus | 99 |
| NumWords | Number of words uttered by the conversation’s main speaker in the whole corpus | 1622 |
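The two automatically calculated age variables can be sketched as follows. This is an illustration, not the authors' script: the function names are hypothetical, and rounding the mid-point half up is an assumption chosen because it reproduces the documented value of 45 for the range 40_49. Range labels in any other format are treated as missing here, which is also an assumption.

```python
def age_range_mid(age_range):
    """Mid-point of a range such as "40_49", rounded half up; None if unparseable."""
    if not age_range:
        return None
    parts = age_range.split("_")
    if len(parts) != 2 or not all(part.isdigit() for part in parts):
        return None
    low, high = int(parts[0]), int(parts[1])
    # Integer form of rounding (low + high) / 2 half up.
    return (low + high + 1) // 2


def age_imputed(exact_age, age_range):
    """Exact age if recorded, else the range mid-point, else None (i.e. NA)."""
    if exact_age is not None:
        return exact_age
    return age_range_mid(age_range)


mid = age_range_mid("40_49")
with_exact = age_imputed(44, "40_49")
from_range = age_imputed(None, "40_49")
missing = age_imputed(None, None)
```

With the example row, the exact age (44) takes precedence over the range mid-point (45), as in the AgeImputed description.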

License

CC BY 4.0

Repository name

Figshare

Publication date

2018-11-16

(4) Reuse potential

There is a growing trend towards quantitative research in linguistics, although it is not proceeding at the same pace in all branches of the field []. A natural corollary of this increase in quantitative research is a focus on replicable and reproducible research [].

True replicability is difficult to achieve in many field-based disciplines and social sciences []. A more achievable goal is reproducibility. Reproducibility is clearly important for increasing scientific transparency and accountability. A move towards greater reliance on usage-based theory development can drive convergence in linguistic theory generally [] as well as in specific sub-fields []. Despite some notable exceptions (such as second language acquisition), most linguistic sub-fields do not have a strong tradition for making research data available []. Publishing not only corpora and raw data, but also the annotated research datasets means that data can be compared quantitatively across research traditions, or pooled into meta-studies for greater theoretical insights.

For linguistics, and in particular corpus linguistics, the aim of reproducibility requires access not only to raw corpus data, but also to manually retrieved, annotated, and categorised data. Despite advances in computational linguistics, automatic annotation tools still fall short in theoretically important areas such as pragmatics and semantics. In the case of transcribed spoken text, the challenges are compounded by the nature of spoken language. Moreover, parsing tools for automatic syntactic analysis still do not perform as well on such data as on written text. As a result, manual annotation is in many cases inevitable.

Another effect of the required manual effort is that the annotated research datasets remain comparatively small. From this observation, two further use cases for shared data follow. First, by pooling together different datasets, the resulting increase in statistical power may allow researchers to draw new conclusions based on correlations that remained obscure in smaller datasets. Second, despite great advances in the range of statistical NLP tools, there are still gaps when it comes to specialised but valuable tasks such as annotating linguistic data for a specific construction. The problem of scarce training data for NLP tools is more commonly associated with historical linguistics [, ]. However, much of the freely available NLP data stems from written, not spoken, language. Furthermore, any specific task requires training data tailored to it, and such data will often be scarce due to the cost involved in manual annotation.

By publishing the dative alternation data, we contribute to all these reusability scenarios. The dative alternation is a topic of active research in linguistics, not least because it has been studied from different theoretical traditions. The dative alternation is a prominent example of the convergence of different theoretical and empirical research questions in linguistics, providing evidence for the motivations behind the linguistic decisions that speakers make []. It is well established that both syntactic and pragmatic factors (especially discourse-new versus discourse-old information) play a role in choosing between the two constructions i. and ii. as shown in [] and []. Later studies have confirmed these findings while adding further nuance. The semantics of the verb arguments also plays a role [], and there is agreement that, on the whole, the dative alternation is subject to broadly similar constraints across different macro-varieties of English [, , , , , ].

Despite this activity, the dative alternation continues to draw theoretical and empirical attention in linguistics, with a number of relevant and underexplored questions remaining. These include questions of linguistic prototypicality [], the role of probability in spoken grammar [], and the role of individual-level sociolinguistic factors [].

Despite the interest in the dative alternation, few datasets from the published literature have been made publicly available. One notable exception is the dataset from [], which was made available in an R package in 2008 [] and re-used for didactic purposes in []. Another recent exception is [].

By publishing this dataset we contribute to advancing the awareness of the need for reproducibility in linguistics, and specifically the progress of empirical research on the English dative alternation.