RI: Collaborative Research: Landmark-based Robust Speech Recognition Using Prosody-Guided Models of Speech Variability

Funded by the National Science Foundation, 2007-2010

Carol Espy-Wilson, Abeer Alwan, Jennifer Cole, Louis Goldstein, Mary Harper, Mark Hasegawa-Johnson, and Elliot Saltzman

Project Summary

Despite great strides in the development of automatic speech recognition (ASR) technology, we are far from achieving the Holy Grail: an ASR system whose performance is comparable to that of humans in automatically transcribing unrestricted conversational speech, spoken by many speakers and dialects and recorded in adverse acoustic environments. Our approach to ASR applies new high-dimensional machine learning techniques, constrained by empirical and theoretical studies of speech production and perception, to learn from data the information structures that human listeners extract from speech. To do this, we will develop large-vocabulary, psychologically realistic models of speech acoustics, pronunciation variability, prosody, and syntax. These models derive knowledge representations that reflect those proposed for human speech production and perception, and use machine-learning techniques to adjust the parameters of all knowledge representations simultaneously in order to minimize the structural risk of the recognizer.
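Structural risk minimization is used here in its standard sense. As a rough illustration (the notation below is generic, not the proposal's own formulation), it selects a recognizer from a nested family of hypothesis classes H_1 ⊂ H_2 ⊂ ... by trading empirical error against model capacity:

    % Generic structural risk minimization objective (illustrative notation):
    % the class index k and the model f within that class are chosen jointly.
    \hat{f} \;=\; \arg\min_{k}\; \min_{f \in H_k}
        \Bigl[\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr)
        \;+\; \Omega(H_k, n) \;\Bigr]

Here the first term is the empirical risk of f on the n training examples under a loss ℓ, and Ω(H_k, n) is a capacity penalty that grows with the complexity of the class H_k, so that richer knowledge representations are admitted only when the data support them.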

Highlights of this project include:

  • developing nonlinear acoustic landmark detectors and pattern classifiers that a) integrate auditory-based signal processing and acoustic phonetic processing, b) are invariant to noise, changes in speaker characteristics, and reverberation, and c) can be learned in a semi-supervised fashion from labeled and unlabeled data;
  • using variable frame rate analysis to emphasize perceptually significant regions in the speech signal, allowing for multi-resolution and linguistically appropriate analysis (a sketch of one such scheme follows this list);
  • implementing lexical access based on gestural phonology using both acoustic and articulatory training data;
  • explicitly incorporating prosody at every level of recognition; and
  • adopting structured language models that combine prosody and syntax to handle disfluencies.
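To make the variable-frame-rate idea concrete, the following is a minimal sketch of one common thresholding scheme, written in Python with NumPy. The function name, the threshold value, and the use of Euclidean distance between consecutive feature frames are illustrative assumptions, not the project's actual algorithm.

    import numpy as np

    def variable_frame_rate(features, threshold=1.0):
        """Keep a frame only once the accumulated spectral change since the
        last kept frame exceeds a threshold, so rapidly changing regions
        (e.g., near acoustic landmarks) are sampled more densely than
        steady-state regions.

        features : (num_frames, dim) array computed at a fixed, high base
                   rate (e.g., MFCCs every few milliseconds).
        Returns the indices of the retained frames.
        """
        kept = [0]
        accumulated = 0.0
        for t in range(1, len(features)):
            # Distance between consecutive frames as a simple proxy for
            # spectral change.
            accumulated += np.linalg.norm(features[t] - features[t - 1])
            if accumulated >= threshold:
                kept.append(t)
                accumulated = 0.0
        return np.array(kept)

Downstream modeling would then operate on features[kept], yielding a multi-resolution analysis in which landmark regions contribute more frames than steady vowels or silence.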

Our approach to ASR has substantially improved communication and collaboration between people and machines by producing systems that handle variability due to noise, speaker differences, coarticulation, prosody, and dialect; future work will extend the approach to ASR in other languages. Our knowledge-based approach will also improve understanding of how humans produce and perceive speech. The ideas in this proposal were developed in collaboration with many colleagues from around the world. Among other sources, they grew out of three successive summer workshops at the Johns Hopkins Center for Language and Speech Processing (WS04, WS05, and WS06), where researchers investigated the use of landmarks, articulatory features, and prosody in ASR systems. The programs and databases developed at those workshops are publicly available and continue to be the focus of our research and of the research of our WS04-06 collaborators at MIT, ICSI/SRI, CMU, the University of Washington, the University of Edinburgh, and the Technical University of Copenhagen. In year 3 of this project, we ran a special session on landmark-based ASR at Interspeech, which brought together papers from sites around the world pursuing related research goals.

The ideas developed in this research have been, and will continue to be, incorporated into our regular university-level teaching. One of our goals is to change the fundamental design principles of automatic speech recognition. Instead of the phone, future courses in speech recognition should teach students about landmarks and gestures; instead of the hidden Markov model, they should teach students about Bayesian networks. Our goal is achievable because it is shared by much of the academic speech recognition community.
