MP3: Audiovisual Person Identification

Due date: March 3 before class starts.


In this machine problem, you will combine the results of the first two MPs to develop a multi-modal person recognition system. The two modalities you will work with are speech and vision. In a noisy acoustic environment, vision is the more reliable modality. However, vision generally requires a frontal face, which is difficult to obtain, and in low-bandwidth operation audio is the better choice. When both modalities are noisy, we argue that it is beneficial to use both.


In this MP, you are going to use probabilistic models to do the fusion of the two modalities. There are three parts to the problem.

  1. Modify the speech-based person ID algorithm of MP1. This time, use a Gaussian Mixture Model (GMM) to estimate the feature distribution for each person, then compute the probability of the observed data under each GMM.
    1. Feature extraction: Compute the 12 cepstral coefficients according to the method in MP1. Use a window size of 500. Represent each audio file as a data matrix of dimensions 12 by N, where N is the number of windows. Note: do NOT stack the columns of the data matrix into a long vector this time. Save the matrix for each audio file.
    2. GMM estimation: Use the training feature matrices of each person (15 audio files per person) to estimate a GMM with two mixture components (code for learning the GMM is provided). Assume each component has a diagonal covariance matrix.
    3. Probability calculation: Compute the probability of each test audio file under each person's GMM.
  2. Modify the face recognition algorithm of MP2 to output the probability of each class. Use Principal Component Analysis (PCA) based features and the K-nearest neighbor (K-NN) algorithm with K=10 for face recognition. However, instead of just picking the winner, compute the probability of each class. For example, if 3 of the 10 nearest neighbors belong to class 1, then the probability of class 1 is 0.3.
  3. Fusion: Compute the probability of each class by multiplying the probabilities computed in the above two cases. The class with the highest probability is returned as the winner.
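
The probability calculation in step 1.3 can be sketched as follows. This is only an illustration under stated assumptions, not the provided gmm_eval code: the variable names (X, w, mu, sigma2) and all toy parameter values are hypothetical. It evaluates a two-component, diagonal-covariance GMM on a 12-by-N feature matrix and sums per-frame log-likelihoods, because multiplying many small per-frame probabilities directly tends to underflow.

```matlab
% Toy setup (every value below is an illustrative assumption).
D = 12;                                  % feature dimension (cepstral coefficients)
N = 50;                                  % number of frames (windows) in one file
M = 2;                                   % number of mixture components
X = reshape(sin(1:D*N), D, N);           % stand-in for a 12-by-N feature matrix

w      = [0.4; 0.6];                     % mixture weights, M-by-1, sum to 1
mu     = [zeros(D,1), 0.5*ones(D,1)];    % component means, D-by-M
sigma2 = [ones(D,1),  2*ones(D,1)];      % diagonal variances, D-by-M

% Log of (weight * Gaussian density) for every frame and component: M-by-N.
logp = zeros(M, N);
for m = 1:M
    d      = X - repmat(mu(:,m), 1, N);                       % D-by-N deviations
    quad   = sum((d.^2) ./ repmat(sigma2(:,m), 1, N), 1);     % 1-by-N
    logdet = sum(log(sigma2(:,m)));                           % log|diag covariance|
    logp(m,:) = log(w(m)) - 0.5 * (D*log(2*pi) + logdet + quad);
end

% Log-sum-exp over components gives the per-frame log-likelihood.
mx       = max(logp, [], 1);
frame_ll = mx + log(sum(exp(logp - repmat(mx, M, 1)), 1));

% Frames are treated as independent, so the file log-likelihood is a sum.
file_ll = sum(frame_ll);
```

Repeating this for each person's GMM gives one log-likelihood per person for a test file, and the largest value identifies the audio-only winner.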


The data are divided into two parts: training set and test set. There are 4 people in total. For each person, there are 10 face images and 15 audio files in the training set, and 10 face images and 10 audio files in the testing set.


The parts of the MP are

  1. Work with audio alone. Use a mixture of Gaussians with 2 Gaussians for each class. Report the percentage recognition rate for each person. Also report the overall average percentage recognition rate.
  2. Just work with images and report the results for the face recognition experiment for each person (percentage recognition rate). Also report the average percentage recognition rate of all the persons. Use K-NN with K=10 as the classification algorithm.
  3. Do the fusion of the audio and vision based recognition (without weights). You should test for each combination of audio-images in the test set of each person (10 face images x 10 audio files = 100 combinations per person). Report the percentage recognition rate for each person. Also report the average percentage recognition rate of all the persons.
  4. Try different weights during fusion: P(audio, class i | image) = P(audio | class i)^w * P(class i | image)^(1-w), for w ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Identify and report the weight that gives the best overall average recognition performance. For this best weight, report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
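
The K-NN probabilities and the weighted fusion can be sketched together as below. This is a minimal sketch under assumptions, not a reference solution: all variable names and toy numbers are hypothetical, and the audio term is taken to be a per-file log-likelihood, so the weighted product becomes a weighted sum of logs. For a fixed w, taking logs does not change which class wins, because the logarithm is monotonic.

```matlab
K = 10;                                   % number of nearest neighbours
C = 4;                                    % number of people (classes)

% Hypothetical class labels of the K nearest training faces for one test image.
nn_labels = [1 1 1 2 1 3 1 1 2 1];

% K-NN class probabilities: fraction of the K neighbours in each class.
p_img = zeros(1, C);
for c = 1:C
    p_img(c) = sum(nn_labels == c) / K;
end

% Hypothetical audio log-likelihoods of one test file under each person's GMM.
ll_audio = [-1200.5 -1195.2 -1300.8 -1250.0];

% Unweighted fusion: the product of probabilities is a sum in the log domain.
% A class with zero K-NN probability gets a score of -Inf and cannot win.
[~, winner_plain] = max(ll_audio + log(p_img));

% Weighted fusion for each candidate weight w.
ws = (1:9) / 10;
winners = zeros(size(ws));
for i = 1:numel(ws)
    w = ws(i);
    score = w * ll_audio + (1 - w) * log(p_img);
    [~, winners(i)] = max(score);
end
```

In the actual experiment, the same scoring would be repeated for all 100 audio-image pairs per person, and the recognition rate for each w is the fraction of pairs whose winner matches the true label. With these toy numbers, the audio log-likelihood differences are much larger than the K-NN log-probability differences, so the image term changes the winner only at the smallest w.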


Provided materials:

  1. Audio training set; image training set.
  2. Audio test set; image test set.
  3. Mixture of Gaussian code.

What to submit:

  1. Narrative explanation and analysis of the 3 methods, including the introduction, theoretical basis, and results and discussion as listed in the MP rubric. The results and discussion section should include:
    • Results in tabular form.
    • An explanation of why weighting is needed for fusion.
  2. Code including
    • Compress your report, code, and README file into a single archive named xxx.tar.gz, where xxx is your Net ID. For example, if your Net ID is chang87, the compressed file should be named chang87.tar.gz. Please upload this file to Compass.
    • A run.m file with the following format: function run(trainspeech_path, trainimg_path, testspeech_path, testimg_path), where each argument is a string telling your run.m file where to find the unpacked zip files downloaded from the web.
    • I recommend that you debug your run.m file without the first line ("function ..."), then when you have it debugged, add the first line to make it a function, and then test to make sure it works like that.
    • The gmm_eval and gmm_train functions that were in the file should be sent back to us as part of your own zipped archive, in the same directory with your run.m file, modified by you so that they run correctly.

Matlab related tips

Some commands that you may find useful are imread, imagesc, imresize, reshape, and double.
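
As an illustration of how reshape and double fit together with the PCA features used for face recognition, here is a minimal sketch. It uses synthetic matrices in place of images read with imread. The variable names, the toy image size, and the number of retained components are hypothetical choices, not values from the assignment.

```matlab
h = 8;  wd = 6;  n = 5;                 % toy image height, width, and image count

% Stand-ins for grayscale images; in practice each would come from imread.
imgs = cell(1, n);
for i = 1:n
    imgs{i} = uint8(mod(37 * i * (1:h)' * (1:wd), 256));
end

% Convert each image to double and stack it as one column of the data matrix.
A = zeros(h * wd, n);
for i = 1:n
    A(:, i) = reshape(double(imgs{i}), h * wd, 1);
end

% PCA: subtract the mean face, then take the leading left singular vectors.
mean_face = mean(A, 2);
A0 = A - repmat(mean_face, 1, n);
[U, S, V] = svd(A0, 0);                 % economy-size SVD
k = 3;                                  % number of principal components kept
basis = U(:, 1:k);                      % (h*wd)-by-k, orthonormal columns

% Project the training images; a test image is centered and projected the same way.
features = basis' * A0;                 % k-by-n PCA feature vectors
```

Converting to double before any arithmetic matters because uint8 values saturate at 0 and 255, which would distort the mean subtraction.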