Prosodic, Intonational and Voice Quality Correlates of Disfluency

Funded by the National Science Foundation, 2004-2007

Mark Hasegawa-Johnson, Jennifer S. Cole, and Chilin Shih.

Project Summary

In spontaneous speech, nearly one out of every twenty words is part of the reparandum or the edit of a disfluency. Previous studies report that disfluent words are marked by a special type of prosody: the "interruption point" often truncates the last word of the reparandum, leaving a word fragment followed by a filled or unfilled pause. Previous methods for the automatic recognition of disfluency have used observations including the pause duration, the correspondence between words and parts of speech in the reparandum and the alteration, and the pitch reset at the onset of the alteration. Using these methods, it is possible to automatically detect as many as 90% of all disfluencies, given a correct word-level transcription of the utterance; given error-filled automatic transcriptions, the precision of current automatic disfluency detectors drops considerably.

This project studies two aspects of the prosody of disfluency that have not been extensively studied in the past. First, we are studying glottalization. Other investigators have reported that the interruption point or edit segment of a disfluency is occasionally glottalized, but because glottalization is only occasional, and because it is difficult to characterize using standard speech signal processing methods, this acoustic feature of disfluency has apparently not been extensively studied or incorporated into automatic speech recognition. Second, we are studying the repetition, in the alteration, of the pitch and energy contours of the reparandum. Repetition of the intonational contour is a perceptually salient characteristic of many disfluencies, and seems to help human listeners identify and rapidly process disfluency, but to our knowledge, this characteristic has never been carefully studied or incorporated into a speech recognizer. Our results show that, while perceptually salient, the phenomenon of intonation repetition is difficult to characterize acoustically: the alteration seems to frequently repeat the dynamic contour of the reparandum pitch and energy, but the contour in the alteration is almost always shifted, scaled, and temporally compressed relative to the contour in the reparandum.
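
To make the "shifted, scaled, and temporally compressed" relationship concrete, the following is a minimal sketch, not the project's method (which relies on Stem-ML). It assumes the two contours are already available as arrays of pitch or energy samples, approximates temporal compression by uniform linear resampling, and fits the scale and shift by ordinary least squares. The function name match_contour and its return values are hypothetical, introduced here only for illustration.

```python
import numpy as np

def match_contour(reparandum, alteration):
    """Fit alteration ~ scale * warped(reparandum) + shift.

    Hypothetical helper: assumes uniform (linear) time compression,
    which is a simplification of real disfluent speech.
    Returns (scale, shift, rmse).
    """
    rep = np.asarray(reparandum, dtype=float)
    alt = np.asarray(alteration, dtype=float)

    # Undo temporal compression: resample the reparandum contour
    # onto the alteration's normalized time axis.
    t_rep = np.linspace(0.0, 1.0, len(rep))
    t_alt = np.linspace(0.0, 1.0, len(alt))
    rep_warped = np.interp(t_alt, t_rep, rep)

    # Least-squares fit of the affine map (scale and shift).
    design = np.column_stack([rep_warped, np.ones_like(rep_warped)])
    (scale, shift), *_ = np.linalg.lstsq(design, alt, rcond=None)

    # Root-mean-square error of the fit: small values suggest that
    # the alteration repeats the reparandum's contour shape.
    residual = alt - (scale * rep_warped + shift)
    rmse = float(np.sqrt(np.mean(residual ** 2)))
    return float(scale), float(shift), rmse
```

Under these assumptions, a low residual error after fitting would indicate that the alteration repeats the shape of the reparandum contour even though its raw values and duration differ. Real contours would likely need non-uniform time alignment and handling of unvoiced frames, which this sketch omits.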

We are studying these phenomena using the following methods:

  1. We have transcribed pitch accents, prosodic phrase boundaries, and disfluent intervals present in several hundred short spontaneous utterances from the Switchboard spontaneous telephone speech corpus; this transcription will continue.
  2. We are studying glottalization using semi-automatic methods for the design and selection of signal processing measurements, combined with an interactive phonetic and statistical analysis of lexical and contextual features of the utterance that may condition the presence of glottalization.
  3. We are studying intonation repetition in detail using the Stem-ML physiologically motivated intonational analysis system, developed by Dr. Shih.
  4. Acoustic models of intonational matching, and of glottalization near the interruption point, are being incorporated, together with our previously developed prosody-dependent acoustic and language models, into our telephone-band automatic speech recognition systems. The result is a system capable of simultaneously transcribing the words, intonational phrase boundaries and accents, and disfluency boundaries of an unknown spontaneous utterance.