Landmark-Based Speech Recognition in Music and Speech Backgrounds

Mark Hasegawa-Johnson

Funded by the National Science Foundation, 2002-2007

Project Summary

Humans recognize speech with error rates 10 to 100 times lower than the best automatic speech recognizers across a broad range of tasks. The discrepancy is even worse in the field of computational auditory scene analysis (CASA): in a recent test, a CASA algorithm failed to separate speech sources from any non-stationary broadband background, a task humans perform routinely. There is an interesting asymmetry in these two results. Automatic speech recognition is performed using a statistical automaton informed only minimally by psychological results; many believe that speech recognition may be improved by building more psychological structure into the recognizer. CASA, by contrast, is performed using a detailed discriminative model of well-documented psychological processes; many believe that CASA may be improved by adopting the methods of Bayesian classification.

The research communities active in both areas seem to be converging on a common space of possible solutions. The required solution in both cases seems to be a recognition model that is:

  • a probability distribution whose parameters can be trained from data, but with
  • internal structure capable of abstracting the perceptual response patterns of human listeners.

This research project addresses two broad themes:

Probabilistic Auditory Scene Analysis

Can probability models representing the pitch, envelope, and timing of an acoustic source be computed and integrated in a tractable manner? Auditory scene analysis depends on an accurate model of the spectral fine structure, or equivalently of the high-quefrency cepstrum. Exact probabilistic models of fine structure are untrainable; motivated by recent findings in acoustic phonetics, this research has therefore explored a class of approximate models based on the formalism of the factorial HMM, a specialized type of dynamic Bayesian network (DBN). We have constructed models of solo music with explicit spectral fine structure, and have combined them with speech models using MIXMAX-style analytical methods and sequential Monte Carlo methods.
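The MIXMAX approximation treats each observed log-spectral band as the maximum of the source log-spectra, which keeps the joint likelihood analytically tractable. The Python sketch below illustrates the idea for two independent Gaussian sources per band, scored over the product state space of a factorial HMM; the dimensions, state counts, and unit variances are illustrative assumptions, not the project's actual models.

    import numpy as np
    from scipy.stats import norm

    def mixmax_loglik(y, mu1, sig1, mu2, sig2):
        # Log-likelihood of y = max(x1, x2), elementwise over frequency
        # bands, where x1 ~ N(mu1, sig1^2) models (say) the speech
        # log-spectrum and x2 ~ N(mu2, sig2^2) the music log-spectrum.
        # For the max of two independent variables:
        #   pdf_Y(y) = pdf1(y) * cdf2(y) + pdf2(y) * cdf1(y)
        p = (norm.pdf(y, mu1, sig1) * norm.cdf(y, mu2, sig2)
             + norm.pdf(y, mu2, sig2) * norm.cdf(y, mu1, sig1))
        # Sum over independent bands; the floor avoids log(0).
        return np.log(np.maximum(p, 1e-300)).sum()

    # Factorial-HMM flavour: one frame is scored against every pair of
    # (speech state, music state); the joint state space is the product.
    rng = np.random.default_rng(0)
    y = rng.normal(size=64)               # one observed log-spectral frame
    mu_speech = rng.normal(size=(5, 64))  # toy means for 5 speech states
    mu_music = rng.normal(size=(3, 64))   # toy means for 3 music states
    scores = np.array([[mixmax_loglik(y, mu_speech[i], 1.0, mu_music[j], 1.0)
                        for j in range(3)] for i in range(5)])
    i_best, j_best = np.unravel_index(scores.argmax(), scores.shape)

Because the per-band likelihood is a closed-form expression, the factorial state space can be searched with standard Viterbi or forward-backward machinery; the sequential Monte Carlo alternative replaces the exhaustive product-space scoring with sampled state trajectories.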

Landmark-Based Speech Recognition

What are the theoretical and empirical requirements for the partitioning, training, and recognition scoring of probability models for landmark-based acoustic features? Landmark-based acoustic features are transcription-dependent, and the development of a Bayesian landmark-based speech recognizer was therefore impossible before the recent development of the theory of class-specific features. We have developed maximum likelihood training algorithms by adapting methods from the theory of class-specific features, resulting in a generalized nonlinear maximum likelihood transformation that maps a feature space onto a set of available PDF models. We have also developed regularized discriminative methods, especially support vector machines (SVMs), and have demonstrated landmark-based speech recognition using hybrid SVM-HMM and SVM-DBN architectures.
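As an illustration of the SVM half of such a hybrid, the sketch below trains a frame-level landmark detector and converts its margins into posterior estimates via Platt scaling, which a downstream HMM or DBN can consume as scaled likelihoods. The feature matrix, labels, and class prior are synthetic placeholders, and the conversion shown is the generic hybrid recipe rather than this project's specific pipeline.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 40))  # 500 frames x 40 acoustic features (synthetic)
    y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)  # toy landmark labels

    # probability=True fits a sigmoid (Platt scaling) to the SVM margins,
    # turning discriminant values into posterior estimates P(landmark | frame).
    svm = SVC(kernel="rbf", probability=True).fit(X, y)
    posteriors = svm.predict_proba(X)[:, 1]

    # Hybrid recipe: divide each posterior by the class prior to obtain a
    # scaled likelihood p(frame | landmark), usable as an HMM observation score.
    scaled_lik = posteriors / y.mean()

Dividing by the prior is what lets a discriminatively trained detector stand in for the generative observation density that the HMM or DBN decoder expects.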
