Corpora
Check out the NLP Lab’s GitHub repository for our publicly released corpora.
Instructional corpus annotated with Rhetorical Relations.
This is a 5MB corpus on home repair, composed of 176 documents containing written English instructions. Texts were manually segmented into Elementary Discourse Units (EDUs). The corpus contains an average of 32.6 EDUs per document, for a total of 5744 EDUs and 53,250 words. The corpus was also manually annotated with 5172 rhetorical relations in the RST tradition (see the pointers below). It was originally developed under NSF Award IIS-0133123 and is further described in the following paper:
Rajen Subba and Barbara Di Eugenio. An effective Discourse Parser that uses Rich Linguistic Information. NAACL-HLT 2009. The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Boulder, Co. June 2009.
Some pointers to RST: [Mann & Thompson 88], in Text-Interdisciplinary Journal for the Study of Discourse, 8(3); [Moser and Moore 96], in Computational Linguistics 22(3); [Marcu 99], in ACL 1999.
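As a quick sanity check, the figures quoted above for the instructional corpus can be cross-checked with a few lines of Python. This is only an illustrative sketch: the variable names are ours, and the numbers are copied from the description rather than computed from the corpus files. The first ratio reproduces the reported average of 32.6 EDUs per document; the other two are derived here and are not reported in the paper.

```python
# Figures quoted in the corpus description (not read from the corpus itself).
documents = 176
edus = 5744
words = 53_250
relations = 5172

# Derived ratios; the variable names are illustrative only.
edus_per_document = edus / documents
words_per_edu = words / edus
relations_per_document = relations / documents

print(f"EDUs per document:      {edus_per_document:.1f}")
print(f"Words per EDU:          {words_per_edu:.1f}")
print(f"Relations per document: {relations_per_document:.1f}")
```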
The RoboHelper/FindTask corpus contains 19 transcribed dialogues between two subjects (a helper and an elderly person) performing Activities of Daily Living (walking, preparing dinner, getting up from a chair) in a realistic environment. The interactions were videotaped, and both subjects wore a microphone and a data glove to collect haptic data. The 19 dialogues were transcribed in their entirety and then coded with multimodal annotations. Annotations include co-reference, dialogue acts, and dialogue games; pointing gestures; haptic-ostensive actions; and haptic interactions. Within the 19 dialogues, most types of annotations were performed on 137 “FindTasks” (about 30% of the transcribed data: 1516 utterances, 6593 words, and almost 1000 pointing gestures and haptic actions). A FindTask is a continuous time span during which the two subjects collaborate on finding and retrieving objects needed to perform an Activity of Daily Living, such as preparing dinner. The data was annotated with Anvil (http://www.anvil-software.org/).
The released corpus contains only the transcribed dialogues and their annotations. We regret that we cannot release the original videos due to human subject protection constraints. The haptic data we collected is not itself included in this release, but its annotation is.
The RoboHelper/FindTask corpus was collected under NSF Award IIS-0905593. Further details can be found in the following paper:
Lin Chen, Maria Javaid, Barbara Di Eugenio, Miloš Žefran, “The roles and recognition of Haptic-Ostensive actions in collaborative multimodal human–human dialogues”, Computer Speech & Language, Volume 34, Issue 1, November 2015, Pages 201-231.
Cookie Theft Dialogue Act Corpus.
This corpus contains dialogue act labels for 100 transcripts from the Cookie Theft Picture Description Task portion of the Pitt Corpus (https://dementia.talkbank.org/access/English/Pitt.html), a subset of DementiaBank (https://dementia.talkbank.org/). Each of the 1616 utterances is labeled with one or more of 26 dialogue act labels, according to the schema defined in the annotation guidelines below. Here, we include public links to a zipped folder containing the corpus and its associated ReadMe file, as well as the annotation guidelines. For further details regarding the corpus itself, please refer to the following publication:
Shahla Farzana, Mina Valizadeh, and Natalie Parde. Modeling Dialogue in Conversational Cognitive Health Screening Interviews. To appear in Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020). Marseilles, France, May 11-16, 2020.
Annotation Guidelines: annotation_guidelines.pdf