MP4: Multimodal Speech Recognition (Audio/Visual Speech Recognition)

Due date: March 17 before class starts.


In this machine problem, you will develop a simple bi-modal speech recognizer using hidden Markov models (HMMs). Bi-modal speech recognition uses both acoustic speech features and lip tracking to recognize speech. Feature extraction has already been done. The data are provided in av_data.tar; the features were extracted from video recordings of subjects speaking the digit 2 or 5.

  • Audio feature: sequences of cepstral coefficients.
  • Visual feature: lip-tracking results (width, height).

You are provided with code that can be used to learn the HMM. However, you are required to write the code that computes the likelihood of a sequence given the model parameters. That is, your code should compute the probability that a particular sequence came from a particular HMM (e.g. using the forward or backward algorithm).

The data set consists of 10 sequences of digit 2 and 10 sequences of digit 5. To evaluate the speech recognizer, we use a leave-one-out scheme: the 20 utterances are divided into two sets, one with 19 utterances for training and the other with the single remaining utterance for testing. Repeating this procedure 20 times, so that each utterance is held out once, gives the average accuracy of the speech recognizer.
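The likelihood computation and the leave-one-out loop described above can be sketched as follows. This is a minimal sketch, not the provided course code: it is written in Python purely for illustration (the submission itself must be MATLAB), the function names are hypothetical, and it assumes the per-frame log emission densities `log_B` have already been computed from whatever emission model the trained HMM uses. It works in the log domain because products of many small probabilities underflow.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-domain values."""
    m = max(values)
    if m == NEG_INF:          # every term is log(0)
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(log_pi, log_A, log_B):
    """Return log P(O | model) computed with the forward algorithm.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log emission density of frame t under state j
                  (assumed precomputed; the emission model is not specified here)
    """
    n_states = len(log_pi)
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, len(log_B)):
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n_states)]) + log_B[t][j]
            for j in range(n_states)
        ]
    # Termination: P(O | model) = sum_j alpha_T(j)
    return logsumexp(alpha)

def leave_one_out_splits(n_items):
    """Yield (train_indices, test_index) so that each item is held out once."""
    for held_out in range(n_items):
        train = [i for i in range(n_items) if i != held_out]
        yield train, held_out
```

With 20 utterances, `leave_one_out_splits(20)` yields 20 splits of 19 training indices and 1 test index. In each split, one HMM per digit would be trained on the training utterances of that digit, and the test utterance would be assigned to the digit whose model scores it higher.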


The MP has three parts:

  1. Use only the audio feature for speech recognition.
  2. Use only the visual feature for speech recognition.
  3. Concatenate the audio and visual features and use the joint feature for speech recognition.
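For part 3, one way to build the joint feature is to append each visual frame to the audio frame at the same time index. The sketch below is in Python for illustration only and makes an assumption that should be checked against the data: that the audio and visual streams of an utterance have the same number of frames and are already time-aligned. The function name is hypothetical.

```python
def concatenate_features(audio_frames, visual_frames):
    """Join per-frame audio and visual vectors into one joint vector per frame.

    audio_frames  : list of per-frame audio vectors (e.g. cepstral coefficients)
    visual_frames : list of per-frame visual vectors (e.g. lip width and height)
    Assumes both streams are aligned frame by frame.
    """
    if len(audio_frames) != len(visual_frames):
        raise ValueError("audio and visual streams must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_frames, visual_frames)]
```

The joint sequence can then be passed to the same training and scoring code as the single-modality sequences; only the feature dimension changes.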


For recognition, compute the likelihood of each test utterance with the Viterbi algorithm. In more detail: for each test utterance, find the best state sequence with the Viterbi algorithm, and compute the likelihood given that state sequence.
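The Viterbi scoring step can be sketched as below. This is a hedged illustration in Python, not the required MATLAB solution. The function names are hypothetical, and as an assumption the per-frame log emission densities `log_B` are taken as given. The returned score is the log joint probability of the observations and the single best state sequence.

```python
import math

def to_log(p):
    """log(p), with log(0) mapped to -inf so forbidden transitions stay forbidden."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi_log_likelihood(log_pi, log_A, log_B):
    """Return (best_log_prob, best_path) for one utterance.

    best_log_prob is log P(O, Q* | model), where Q* is the most likely
    state sequence; log_B[t][j] is the log emission density of frame t
    under state j (assumed precomputed).
    """
    n_states = len(log_pi)
    # Initialization: delta_1(j) = pi_j * b_j(o_1)
    delta = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    backptr = []
    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, len(log_B)):
        new_delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_A[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_A[best_i][j] + log_B[t][j])
        delta = new_delta
        backptr.append(ptr)
    # Termination and backtracking from the best final state
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

def classify(log_scores):
    """Pick the label whose model gives the highest log score."""
    return max(log_scores, key=log_scores.get)
```

To recognize a test utterance, it would be scored under the digit-2 HMM and the digit-5 HMM, and `classify({"2": score_2, "5": score_5})` would return the label with the larger score.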

Things to note:

  1. Use a left-to-right, no-skip HMM for speech recognition.
  2. Learn the HMM with 5 hidden states.
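A left-to-right, no-skip topology means each state may only stay where it is or move to the next state. One possible initialization with 5 states is sketched below in Python, for illustration only. The function names, the starting self-loop probability of 0.5, the choice to always start in the first state, and the absorbing last state are all assumptions, not values prescribed by the assignment.

```python
def left_to_right_transitions(n_states=5, stay_prob=0.5):
    """Build an initial left-to-right, no-skip transition matrix.

    Row i allows only i -> i (stay) and i -> i+1 (advance); the last
    state loops on itself. stay_prob is an assumed starting guess.
    """
    if not 0.0 < stay_prob < 1.0:
        raise ValueError("stay_prob must be strictly between 0 and 1")
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob            # remain in the current state
        A[i][i + 1] = 1.0 - stay_prob  # advance to the next state
    A[-1][-1] = 1.0                    # final state has nowhere else to go
    return A

def left_to_right_initial(n_states=5):
    """Initial state distribution that always starts in the first state."""
    pi = [0.0] * n_states
    pi[0] = 1.0
    return pi
```

Under standard Baum-Welch re-estimation, transitions initialized to zero remain zero, so starting from a matrix of this shape keeps the topology left-to-right during training. Whether the provided training code accepts an initial matrix in this form is an assumption to verify.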

What to submit:

  1. Results in tabular form.
  2. Explanation and analysis of the 3 methods.
  3. MATLAB files (email them to me).
  4. A README file explaining how to run your code to reproduce your results.

Compress your report, code, and README file as an or xxx.tar.gz file where xxx is your Net ID. For example, if your Net ID is chang87, then your compressed file should be named or chang87.tar.gz. Please send this file to the

ECE 417 (Multimedia Signal Processing) covers characteristics of speech and image signals; important analysis and synthesis tools for multimedia signal processing including subspace methods, Bayesian networks, hidden Markov models, and factor graphs; applications to biometrics (person identification), human-computer interaction (face and gesture recognition and synthesis), and audio-visual databases (indexing and retrieval). Emphasis on a set of MATLAB machine problems providing hands-on experience. Prerequisite: ECE 310 and ECE 313.