MP4: Multimodal Speech Recognition (Audio/Visual Speech Recognition)
Due date: March 17 before class starts.
Overview
In this machine problem, you will develop a simple bi-modal speech recognizer using a Hidden Markov Model (HMM). Bi-modal speech recognition uses both speech features and lip tracking to recognize speech. Feature extraction has already been done. The data is provided in av_data.tar; it was extracted from video recordings of subjects speaking the digit 2 or 5.
- Audio feature: sequences of cepstral coefficients.
- Visual feature: lip-tracking results (width, height).
You are provided with code that can be used to learn the HMM (learn_hmm.zip). However, you are required to write the code that computes the likelihood of a sequence given the model parameters. That is, your code should compute the probability that a particular sequence came from a particular HMM (e.g. using the forward or backward algorithm). The data set consists of 10 sequences of digit 2 and 10 sequences of digit 5. To evaluate the speech recognizer, use a leave-one-out scheme: split the 20 utterances into a training set of 19 utterances and a test set of 1 utterance. Repeat this procedure 20 times, holding out a different utterance each time, to obtain the average accuracy of the speech recognizer.
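As an illustration of the likelihood computation, here is a minimal sketch of the forward algorithm in the log domain. It is written in Python only to show the recursion; the submission itself is expected in Matlab. The function names, the argument layout, and the idea of passing precomputed per-frame emission log-likelihoods (log_B) are assumptions made for this sketch; they are not the interface of learn_hmm.zip, and the toy numbers are made up.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to -inf so forbidden transitions stay forbidden."""
    return math.log(p) if p > 0.0 else NEG_INF


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_loglik(log_pi, log_A, log_B):
    """Return log P(O | model) using the forward recursion.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log emission likelihood of frame t in state j
                  (assumed to be computed elsewhere from the learned model)
    """
    n_states = len(log_pi)
    # Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, len(log_B)):
        alpha = [
            logsumexp([alpha[i] + log_A[i][j] for i in range(n_states)]) + log_B[t][j]
            for j in range(n_states)
        ]
    # Termination: P(O | model) = sum_j alpha_T(j)
    return logsumexp(alpha)


# Toy 2-state, 2-frame example with made-up numbers (not the MP data).
toy_log_pi = [safe_log(p) for p in (0.6, 0.4)]
toy_log_A = [[safe_log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
toy_log_B = [[safe_log(p) for p in row] for row in ((0.1, 0.6), (0.4, 0.3))]
toy_loglik = forward_loglik(toy_log_pi, toy_log_A, toy_log_B)
```

Working with log-probabilities avoids numerical underflow on long sequences. To classify a test utterance, compute this value under the digit-2 model and under the digit-5 model and pick the larger one.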
Experiment
The MP has three parts.
- Use only the audio feature for speech recognition.
- Use only the visual feature for speech recognition.
- Concatenate the audio and visual features, and use the joint feature for speech recognition.
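The three experiments differ only in the feature vectors used; the evaluation loop is the same. The sketch below (Python for illustration only, not the required Matlab code) shows frame-wise concatenation and a leave-one-out loop. It assumes that the audio and visual sequences of an utterance have the same number of frames. The callables train_fn and score_fn are hypothetical placeholders for "learn an HMM from these sequences" and "log-likelihood of a sequence under a model".

```python
def concat_features(audio_seq, visual_seq):
    """Join audio and visual feature vectors frame by frame.

    Assumes both sequences have one feature vector per frame and
    the same number of frames.
    """
    if len(audio_seq) != len(visual_seq):
        raise ValueError("audio and visual sequences must have the same number of frames")
    return [list(a) + list(v) for a, v in zip(audio_seq, visual_seq)]


def leave_one_out_accuracy(sequences, labels, train_fn, score_fn):
    """Hold out each utterance once, train one model per digit on the rest,
    and classify the held-out utterance by the higher score."""
    correct = 0
    for held_out in range(len(sequences)):
        train = [
            (s, y)
            for k, (s, y) in enumerate(zip(sequences, labels))
            if k != held_out
        ]
        # One model per class, trained only on that class's remaining sequences.
        models = {
            label: train_fn([s for s, y in train if y == label])
            for label in sorted(set(labels))
        }
        scores = {label: score_fn(m, sequences[held_out]) for label, m in models.items()}
        predicted = max(scores, key=scores.get)
        correct += predicted == labels[held_out]
    return correct / len(sequences)
```

With 20 utterances the loop runs 20 times, and the returned value is the average accuracy over those runs.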
BONUS
Compute the likelihood of the test utterance via the Viterbi algorithm for speech recognition. In more detail, for each test utterance, find the best state sequence via the Viterbi algorithm, and compute the likelihood given that state sequence.
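A minimal sketch of the Viterbi recursion in the log domain follows (Python for illustration only). It returns the best state sequence together with the log of the joint probability of that sequence and the observations. The function name and the log_B argument (precomputed per-frame emission log-likelihoods) are assumptions of this sketch, and the toy numbers are made up.

```python
import math


def viterbi(log_pi, log_A, log_B):
    """Return (best_log_score, best_path) for one observation sequence.

    log_pi[i]   : log initial probability of state i
    log_A[i][j] : log transition probability from state i to state j
    log_B[t][j] : log emission likelihood of frame t in state j
    """
    n_states = len(log_pi)
    delta = [log_pi[j] + log_B[0][j] for j in range(n_states)]
    backptr = []  # backptr[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_B)):
        prev = delta
        step_ptr, delta = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: prev[i] + log_A[i][j])
            step_ptr.append(best_i)
            delta.append(prev[best_i] + log_A[best_i][j] + log_B[t][j])
        backptr.append(step_ptr)
    # Backtrack from the best final state.
    last = max(range(n_states), key=lambda j: delta[j])
    path = [last]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return delta[last], path


# Toy 2-state, 2-frame example with made-up numbers (not the MP data).
toy_log_pi = [math.log(p) for p in (0.6, 0.4)]
toy_log_A = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
toy_log_B = [[math.log(p) for p in row] for row in ((0.1, 0.6), (0.4, 0.3))]
toy_score, toy_path = viterbi(toy_log_pi, toy_log_A, toy_log_B)
```

The only change from the forward recursion is that the sum over predecessor states is replaced by a max, so the Viterbi score is never larger than the full likelihood.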
Things to note
- Use a left-to-right, no-skip HMM for speech recognition.
- Learn HMMs with 5 hidden states.
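In a left-to-right, no-skip HMM, each state can only stay where it is or move to the next state. The sketch below (Python for illustration only) builds such an initial transition matrix for 5 states. The function name and the stay_prob value of 0.5 are assumptions of this sketch; whether and how learn_hmm.zip accepts an initial transition matrix is not stated here, so check the provided code.

```python
def left_to_right_init(n_states=5, stay_prob=0.5):
    """Initial parameters for a left-to-right, no-skip HMM.

    Each state either stays put (stay_prob) or advances to the next
    state (1 - stay_prob); the last state loops on itself.
    stay_prob = 0.5 is an arbitrary starting value.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob
        A[i][i + 1] = 1.0 - stay_prob
    A[-1][-1] = 1.0
    # Always start in the first state.
    pi = [1.0] + [0.0] * (n_states - 1)
    return pi, A
```

Transitions initialized to zero remain zero under Baum-Welch re-estimation, so the left-to-right structure is preserved during training.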
What to submit:
- Results in tabular form.
- Explanation and analysis of the 3 methods.
- Matlab files (email them to me)
- A README file explaining how to run your code to reproduce your results.
Compress your report, code, and README file as an xxx.zip or xxx.tar.gz file, where xxx is your Net ID. For example, if your Net ID is chang87, then your compressed file should be named chang87.zip or chang87.tar.gz. Please send this file to ece.417.spring.15@gmail.com.
Office Hours:
Friday 4-5:30pm, 3001 ECEB