1 Overview

Repository location Zenodo: https://zenodo.org/records/10252341

Context

Making sense of face-to-face human-human interaction routinely involves more than just spoken or written language; other modalities include but are not limited to gesture, user-interaction modeling, gaze, joint attention, and markers of involvement/engagement. This is particularly true in the case of modeling human collaboration. For instance, having students engage in collaborative problem solving (CPS) has been shown to be an effective pedagogical technique that is correlated with positive learning outcomes (; ; ), and linguistic discourse alone does not reliably indicate effective collaboration.

The Weights Task Dataset (WTD) is a novel dataset of a situated, shared collaborative task, originally collected to study multimodal indicators of collaborative problem solving. This dataset complements other datasets for human-human interaction such as Anderson et al. (); Liu, Cai, Ji, and Liu (); Van Gemeren, Poppe, and Veltkamp (); Wang et al. (); Yun, Honorio, Chattopadhyay, Berg, and Samaras (), which lack at least one of: multimodal data, physical object manipulation, or multiparty interaction. Our data is novel in the joint presence of speech, gestures, and actions in a collaborative multiparty task. Annotation encodes many cross-cutting aspects of the situated and embodied involvement of the participants in joint activity.

2 Method

The Weights Task is completed by triads at a round table. A webcam captures the task equipment and participants. Kinect Azure cameras capture RGBD video from different angles. Task equipment includes 6 blocks (of varying weight, size, and color), a balance scale, a worksheet, and a computer with a survey where participants submit their answers.

2.1 Steps

Participants (English speakers, ≥18 years) were recruited from the student body of Colorado State University. Informed consent was obtained. Table 1 shows the breakdown of gender and ethnic background.

Table 1

Participant pool distribution of gender and ethnic background. The task was conducted in English. Native languages besides English included Assamese, Bengali, Gujarati, Hindi, Malayalam, Persian, Spanish, Telugu, and Urdu.


MALE    FEMALE    CAUCASIAN NON-HISPANIC    HISPANIC/LATINO    ASIAN

80%     20%       60%                       10%                30%

Participants are given a balance scale to determine the weights of five blocks. They are given the weight of one of the blocks (10g), and must determine the weights of the others. As the weight of each block is discovered, it is placed on the worksheet in the cell corresponding to the weight. Next, participants are given a new block and must identify its weight without the scale, by deducing it based on the pattern observed in the initial block weights. Finally, participants must infer the weight of the next hypothetical block in the set and explain how they determined it. After each stage, groups submit their answers in the survey form.

The dataset consists of 10 videos (~170 minutes). Table 2 provides descriptive statistics of the data. Figure 1 shows participants engaging with the objects on the table from the perspective of the main Kinect. Figure 2 shows different annotations (described below).

Table 2

Dataset descriptive statistics.


                          AVG.     SD      MIN.    MAX.

Participant age (yrs.)    24.58    4.58    19      35

Video length (mins.)      17.00    7.00    9       34

Figure 1 

Three participants engaged in the Weights Task. Participant #3 (on the right) is taking a block off the scale to try another configuration while Participant #2 (in the middle) wants to clarify the weight of the block under it. Multimodal information is required to make such a judgment.

Figure 2 

Multichannel (GAMR, NICE, speech transcription, and CPS) annotation “score” using ELAN ().

Utterance Segmentation and Transcription

Audio from all groups was segmented into utterances (a single person’s continuous speech, delimited by silence) and transcribed. Segmentation and transcription were conducted by humans, by Google Cloud ASR (), and by OpenAI’s Whisper model (). Human transcription was performed by listening to each manually-segmented utterance and transcribing what was said by each participant. Google and Whisper transcriptions were conducted over the utterances segmented by the same system (which may conflate overlapping speech by multiple people). Transcriptions are presented in .csv files.
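The transcript files can be inspected with standard tooling. The following is a minimal sketch in Python using pandas; the filename and the column names (“participant”, “text”) are illustrative assumptions rather than the released schema, so the actual headers should be checked first.

```python
# Minimal sketch: load an utterance-level transcript CSV and summarize it.
# The filename and the column names used below ("participant", "text") are
# assumptions for illustration; inspect the actual headers in the released files.
import pandas as pd

transcript = pd.read_csv("group01_transcript.csv")  # hypothetical filename
print(transcript.columns.tolist())                  # check the real schema first

# Count utterances and total words per participant (assuming these columns exist).
transcript["n_words"] = transcript["text"].fillna("").str.split().str.len()
print(transcript.groupby("participant")["n_words"].agg(["count", "sum"]))
```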

Collaborative Problem Solving (CPS) Facets

CPS coding is performed at the utterance level using the framework of Sun et al. (). Annotators watched the video and coded each utterance with potentially multiple labels based on its content, context, and position in the conversational sequence. Videos were annotated by two annotators (κ = 0.62) and adjudicated by an expert who underwent extensive training in the framework. CPS annotations are presented in .csv files.
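As a sketch of how agreement of this kind can be recomputed from per-annotator codes, the snippet below calculates Cohen’s κ per facet with scikit-learn. The filenames, the “utterance_id” key, the facet column names, and the assumption of binary (0/1) facet indicators per utterance are all illustrative, not the released schema.

```python
# Minimal sketch: per-facet agreement between two CPS annotators via Cohen's kappa.
# Filenames, the "utterance_id" key, and the binary facet columns are illustrative
# assumptions about per-annotator spreadsheets, not the released schema.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

ann1 = pd.read_csv("group01_cps_annotator1.csv")  # hypothetical filenames
ann2 = pd.read_csv("group01_cps_annotator2.csv")
merged = ann1.merge(ann2, on="utterance_id", suffixes=("_a1", "_a2"))

facets = ["shared_knowledge", "negotiation_coordination", "maintaining_team_function"]
for facet in facets:  # assumed 0/1 indicator columns, one per CPS facet
    kappa = cohen_kappa_score(merged[f"{facet}_a1"], merged[f"{facet}_a2"])
    print(f"{facet}: kappa = {kappa:.2f}")
```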

Gesture Abstract Meaning Representation (GAMR)

Participant gestures are annotated using the GAMR framework (). Most WTD gestures are deictic, indicating reference to an object or a location. Iconic gestures represent attributes of an action or object. The meaning of emblematic gestures is set by cultural convention. GAMR was dual annotated by annotators trained by authors of the framework (SMATCH F1-score = 0.75). This data is presented in PENMAN notation in .eaf files.
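Because the GAMR annotations are serialized in PENMAN notation, they can be parsed programmatically, for instance with the third-party penman Python package. The graph string below is invented for illustration and does not reproduce an actual GAMR from the dataset.

```python
# Minimal sketch: decode a PENMAN-serialized graph with the `penman` package
# (pip install penman). The graph string is invented for illustration and is
# not an actual GAMR from the dataset.
import penman

gamr = """
(g / gesture-unit
   :ARG0 (p / participant)
   :ARG1 (d / deixis
            :ARG1 (b / block)))
"""
graph = penman.decode(gamr)
for source, role, target in graph.triples:
    print(source, role, target)
```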

Nonverbal Indicators of Collaborative-Learning Environments (NICE)

The NICE coding scheme () captures nonverbal behaviors when people are working together in groups, such as the direction of gaze, posture (e.g., leaning toward or away from the activity area), and usage of tools (including pointing at or to the tool, as well as directly manipulating it). NICE was annotated by an author of the framework over Groups 1–3 and Group 5. This data is presented in .xlsx format.
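The .xlsx files can be read directly with pandas (plus openpyxl); a minimal, schema-agnostic sketch that simply lists the sheets and their columns is shown below (the filename is hypothetical).

```python
# Minimal sketch: list the sheets, columns, and row counts of a NICE spreadsheet.
# Requires openpyxl; the filename is hypothetical.
import pandas as pd

sheets = pd.read_excel("group01_nice.xlsx", sheet_name=None)  # dict of DataFrames
for name, frame in sheets.items():
    print(name, frame.columns.tolist(), len(frame))
```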

Azure Kinect Data

We extracted joint positions and orientations from each frame of the raw RGBD data, for all 32 joints on each body detected by Microsoft’s body tracking SDK. This information (stored as JSON) can be used to analyze body pose, to correlate gesture with other modalities, or on its own to classify gestures.
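As a starting point for such analyses, the sketch below reads one exported skeleton file and extracts the 3D trajectory of a single joint. The field names (frames, bodies, body_id, joint_positions) are assumptions about the JSON export and should be adjusted to the actual schema; index 15 corresponds to HAND_RIGHT in the Azure Kinect body tracking SDK’s joint enumeration.

```python
# Minimal sketch: pull one joint's 3D trajectory out of exported body-tracking JSON.
# The keys "frames", "bodies", "body_id", and "joint_positions" are assumptions
# about the export format; adjust to the actual schema of the released files.
import json

HAND_RIGHT = 15  # joint index in the Azure Kinect body tracking SDK enumeration

with open("group01_skeleton.json") as f:   # hypothetical filename
    data = json.load(f)

trajectory = []
for frame in data["frames"]:
    for body in frame["bodies"]:
        if body["body_id"] == 0:           # follow a single tracked body
            x, y, z = body["joint_positions"][HAND_RIGHT]
            trajectory.append((x, y, z))

print(f"{len(trajectory)} frames of right-hand positions")
```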

2.2 Quality control

By convention, participants are identified numerically from left (P1) to right (P3). Camera and microphone positioning are kept constant, and the cameras are calibrated using the standard Kinect SDK calibration procedure at the start of each session.

The raw data was recorded in .mkv format, including the depth channel, which is too large to include in the distributable dataset. We converted the RGB video to .mp4 and extracted the skeleton data from the Azure depth channel.

In addition to the individual .csv files, annotations have been merged into .eaf files, which can be opened in the ELAN environment ().
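The .eaf files can also be processed outside ELAN, e.g. with the third-party pympi-ling package; the sketch below simply lists the tiers and the first few annotations on each (the filename is hypothetical).

```python
# Minimal sketch: enumerate annotation tiers in an .eaf file with pympi-ling
# (pip install pympi-ling). The filename is hypothetical.
import pympi

eaf = pympi.Elan.Eaf("group01.eaf")
for tier in eaf.get_tier_names():
    annotations = eaf.get_annotation_data_for_tier(tier)
    print(tier, len(annotations))
    for ann in annotations[:3]:
        start_ms, end_ms, value = ann[:3]    # (begin, end, value) in milliseconds
        print(f"  {start_ms}-{end_ms}: {value}")
```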

3 Dataset Description

Object name Weights Task Dataset

Format names and versions MP4, CSV, EAF, Excel, JSON

Creation dates 2022-09-22 — 2022-10-26

Dataset creators Ibrahim Khebour, Richard Brutti, Indrani Dey, Rachel Dickler, Kelsey Sikes, Kenneth Lai, Mariah Bradford, Brittany Cates, Paige Hansen, Changsoo Jung, Brett Wisniewski, Corbyn Terpstra, Leanne Hirshfield, Sadhana Puntambekar, Nathaniel Blanchard, James Pustejovsky, Nikhil Krishnaswamy

Language English

License CC 4.0

Repository name Zenodo

Publication date 2023-09-27

4 Reuse Potential

This data was originally gathered to study multimodal indicators of CPS, but its rich multichannel nature also lends itself well to other lines of research. Researchers in education and learning sciences can use it to develop activities that support collaborative interaction and learning. Researchers in linguistics and psychology can use it to study interactive behavior and communication, including modeling the evolution of group common ground over time, à la Clark and Carlson (), and for natural language processing tasks such as assessing speech recognition fidelity (e.g., , which compared the effects of different segmentation methods). The rich multimodality will be of use to researchers in AI. For example, the Kinect data can be used to develop and train gesture recognition algorithms (e.g., ) or object and action detectors. The different modalities can serve as signals to an interactive AI agent that assists facilitators and scales up collaborative group activities by interpreting key multimodal aspects of collaborative group interaction in context (cf. ). The dataset will continue to be updated at the public repository as additional annotations are performed, including annotations of object positions, of actions taken with the different objects, and of the common ground constructed between participants as the task unfolds.

Potential limitations or issues with reuse include the following. In the Azure (skeleton) data, the body IDs in some frames do not align with participant IDs, because the Microsoft tracker assigns a new body ID whenever it loses and then regains a participant. Prosodic features, although useful in a number of applications, may also introduce noise when more than one voice is actually speaking during a single segmented utterance.
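One way to work around the body-ID issue is a per-frame heuristic that orders detected bodies by horizontal position and relabels them P1 to P3, following the left-to-right convention noted above. The sketch below assumes the same hypothetical JSON schema as earlier and assumes the camera-frame x coordinate increases to the right of the image.

```python
# Minimal sketch: heuristically remap tracker body IDs to participant numbers
# (P1 = leftmost, P3 = rightmost) per frame. The JSON keys and the assumption
# that x increases to the right of the camera view are illustrative, not
# guaranteed by the released format.
import json

PELVIS = 0  # joint index 0 (PELVIS) in the Azure Kinect body tracking SDK enumeration

def assign_participants(frame):
    """Return {body_id: participant_number} for one frame, ordered left to right."""
    bodies = sorted(frame["bodies"], key=lambda b: b["joint_positions"][PELVIS][0])
    return {body["body_id"]: i + 1 for i, body in enumerate(bodies)}

with open("group01_skeleton.json") as f:   # hypothetical filename
    data = json.load(f)

print(assign_participants(data["frames"][0]))  # e.g. {3: 1, 0: 2, 5: 3}
```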

Updates will be noted at the dataset link. The data is freely available for research purposes, as indicated in the consent form (also available at the dataset link).