MP3: Multimodal Person Identification (Face Recognition, Person ID)

Due date: March 3 before class starts.


In this machine problem, you will combine the results of the first two MPs to build a multimodal person recognition system. The two modalities are speech and vision. In an acoustically noisy environment, vision is the better choice. However, vision generally requires a frontal face, which is difficult to obtain; moreover, in low-bandwidth operation, audio seems to be the better choice. When both modalities are noisy, we argue that it is beneficial to use both.


In this MP, you are going to use probabilistic models to do the fusion of the two modalities. There are three parts to the problem.

  1. Modify the speech-based person ID algorithm of MP1. This time, use a Gaussian mixture model (GMM) to estimate the feature distribution of each person, and then compute the probability of the observed data under each GMM.
    1. Feature extraction: Compute the 12 cepstral coefficients using the method from MP1, with a window size of 500. Represent each audio file as a 12-by-N data matrix, where N is the number of windows. Note that this time you should NOT stack the columns of the data matrix into one long vector. Save the matrix for each audio file.
    2. GMM estimation: Use the training feature matrices of each person (15 audio files per person) to estimate a GMM with two mixture components (code for learning the GMM is provided). Assume that each component has a diagonal covariance matrix.
    3. Probability calculation: Compute the probability of each test audio file under each GMM.
  2. Modify the face recognition algorithm of MP2 to output the probability of each class. Use features based on principal component analysis (PCA) and the K-nearest-neighbor (K-NN) algorithm with K=10. Instead of just picking the winner, compute the probability of each class (e.g., if 3 of the 10 nearest neighbors belong to class 1, then the probability of class 1 is 0.3).
  3. Fusion: Compute the probability of each class by multiplying the probabilities computed in the above two cases. The class with the highest probability is returned as the winner.
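
The probability calculation in step 1.3 can be sketched as follows. This is only an illustration in plain Python (the MP itself is to be submitted in MATLAB), and it makes two assumptions that the handout does not state: the windows of a file are treated as independent, so the file's log-probability is the sum of the per-window log-densities, and the work is done in the log domain to avoid numerical underflow. The function names and the toy numbers are hypothetical; the parameter layout of the provided GMM code may differ.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at one feature vector x."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + maha)

def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log( sum_k weights[k] * N(frame; means[k], diag(variances[k])) ).

    frames is a list of feature vectors, i.e. the columns of the 12-by-N matrix.
    Assumption: frames are independent given the speaker.
    """
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gauss_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp trick for numerical stability
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total

def classify_audio(frames, gmms):
    """gmms maps a person label to (weights, means, variances); returns (winner, scores)."""
    scores = {p: gmm_log_likelihood(frames, *params) for p, params in gmms.items()}
    return max(scores, key=scores.get), scores

# Toy usage with 2-D features (the MP uses 12-D cepstra); all numbers are made up.
toy_gmms = {
    "person_1": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "person_2": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
toy_frames = [[5.2, 4.9], [5.8, 6.1], [5.5, 5.4]]
winner, scores = classify_audio(toy_frames, toy_gmms)
```

Because the logarithm is monotonic, the class with the highest log-likelihood is also the class with the highest likelihood.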


The data are divided into two parts: a training set and a test set. There are 4 people in total. For each person, there are 10 face images and 15 audio files in the training set, and 10 face images and 10 audio files in the test set.
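
With 4 people and 10 training faces each, the K-NN step searches 40 training vectors and keeps the 10 nearest. A minimal sketch of the vote-fraction probabilities is given below, in plain Python for illustration only (the MP itself is to be submitted in MATLAB). It assumes squared Euclidean distance between PCA feature vectors, which the handout does not specify; the function name and the toy data are hypothetical.

```python
from collections import Counter

def knn_class_probs(query, train_feats, train_labels, classes, k=10):
    """Fraction of the k nearest training vectors that belong to each class.

    Assumption: nearness is measured by squared Euclidean distance.
    """
    dists = sorted(
        ((sum((q - t) ** 2 for q, t in zip(query, feat)), label)
         for feat, label in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],  # sort by distance only; ties keep input order
    )
    votes = Counter(label for _, label in dists[:k])
    return {c: votes.get(c, 0) / k for c in classes}

# Toy usage with 1-D features; the numbers are made up.
toy_feats = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2], [1.3], [5.0], [5.1], [5.2], [9.0]]
toy_labels = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4]
probs = knn_class_probs([0.5], toy_feats, toy_labels, classes=[1, 2, 3, 4], k=10)
```

Note that a class with no votes among the 10 neighbors receives probability 0, which forces the product in the fusion step to 0 for that class regardless of the audio score.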


The parts of the MP are:

  1. Work with audio alone. Use a mixture of 2 Gaussians for each class. Report the percentage recognition rate for each person. Also report the overall average percentage recognition rate.
  2. Work with images alone. Use K-NN with K=10 as the classification algorithm. Report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
  3. Fuse the audio-based and vision-based recognition (without weights). Test every audio-image combination in the test set of each person (10 face images x 10 audio files = 100 combinations per person). Report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
  4. Try different weights during fusion: P(audio+image | class i) = P(audio | class i) ^ w * P(image | class i) ^ (1-w), for w ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. Identify and report the weight that gives the best overall average recognition performance. For this best weight, report the percentage recognition rate for each person. Also report the average percentage recognition rate over all persons.
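
The weighted fusion and the search over w can be sketched as follows, in plain Python for illustration only (the MP itself is to be submitted in MATLAB). The sketch assumes that the audio scores are available as log-probabilities and therefore evaluates the logarithm of the fusion formula, w * log P(audio | class i) + (1-w) * log P(image | class i), which selects the same winner as the formula itself. All function names and the data layout (one dictionary of class scores per test file) are hypothetical.

```python
import math

WEIGHTS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def fused_winner(audio_loglik, image_prob, w):
    """argmax over classes of w * log P(audio | c) + (1 - w) * log P(image | c)."""
    def score(c):
        p = image_prob[c]
        log_img = math.log(p) if p > 0 else -math.inf  # zero K-NN votes
        return w * audio_loglik[c] + (1.0 - w) * log_img
    return max(audio_loglik, key=score)

def recognition_rates(audio_scores, image_probs, w):
    """Per-person percentage of correct decisions over all audio-image pairs.

    audio_scores[person] and image_probs[person] are lists with one
    {class: score} dictionary per test file of that person.
    """
    rates = {}
    for person in audio_scores:
        pairs = [(a, i) for a in audio_scores[person] for i in image_probs[person]]
        correct = sum(fused_winner(a, i, w) == person for a, i in pairs)
        rates[person] = 100.0 * correct / len(pairs)
    return rates

def best_weight(audio_scores, image_probs, weights=WEIGHTS):
    """Return (w, per-person rates, average rate) for the best average rate.

    Ties are resolved in favor of the first weight tried (an arbitrary choice).
    """
    best = None
    for w in weights:
        rates = recognition_rates(audio_scores, image_probs, w)
        avg = sum(rates.values()) / len(rates)
        if best is None or avg > best[2]:
            best = (w, rates, avg)
    return best
```

With 10 test audio files and 10 test images per person, `recognition_rates` evaluates the 100 combinations per person described in part 3. Setting w = 0.5 selects the same winners as the unweighted product, because taking the square root of every fused score does not change their order.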


Provided materials:

  1. Audio training set; image training set
  2. Audio test set; image test set
  3. Mixture of Gaussians code

What to submit:

  1. Results in tabular form.
  2. Explanation and analysis of the 3 methods. Explain why weighting is needed for fusion.
  3. MATLAB files (email them to me).
  4. A README file that tells us how to run your code to reproduce your results.

Compress your report, code, and README file as an or xxx.tar.gz file where xxx is your Net ID. For example, if your Net ID is chang87, then your compressed file should be named or chang87.tar.gz. Please send this file to

MATLAB-related tips

Some commands that you may find useful are imread, imagesc, imresize, reshape, and double.

ECE 417 (Multimedia Signal Processing) covers characteristics of speech and image signals; important analysis and synthesis tools for multimedia signal processing including subspace methods, Bayesian networks, hidden Markov models, and factor graphs; applications to biometrics (person identification), human-computer interaction (face and gesture recognition and synthesis), and audio-visual databases (indexing and retrieval). Emphasis on a set of MATLAB machine problems providing hands-on experience. Prerequisite: ECE 310 and ECE 313.