MP5: Real-Time Speech-Driven Facial Animation

Due: Tuesday April 15, 2014

Overview

In this machine problem, you will become familiar with face modeling and animation techniques, learn to use artificial neural networks (ANNs) to map speech-signal features to facial animation parameters, and produce facial animation sequences from audio tracks.

Useful Files

The Audio-Visual Database

The pre-processed database will be provided as a MATLAB MAT-file, ECE417_MP5_AV_DATA.mat. This file contains the following four MATLAB variables:

  1. av_train

    av_train is a structure variable storing the audio-visual data for training the ANN. It has the following elements:

    • av_train.audio is a matrix of the audio features. The kth column, av_train.audio(:,k), is the audio feature vector of frame k of the audio.
    • av_train.visual is a matrix of the visual features. The kth column, av_train.visual(:,k), is the visual feature vector of frame k of the video. A visual feature vector contains three numbers: the change in lip width (Δw = w - w_0), the change in upper-lip height (Δh1 = h1 - h1_0), and the change in lower-lip height (Δh2 = h2 - h2_0), where w_0, h1_0, and h2_0 are the width and heights of the neutral lips (see Figure 1).

      Figure 1. The visual features.

  2. av_validate

    av_validate is a structure variable storing the audio-visual data for validation of the ANN. It has the following elements:

    • av_validate.audio is the audio feature matrix. Each column av_validate.audio(:,k) represents the audio feature vector of frame k.
    • av_validate.visual is the visual feature matrix. Each column av_validate.visual(:,k) represents the visual feature vector of frame k.

  3. testAudio

    testAudio is the test audio data matrix. Each column is an audio feature vector.

  4. silenceModel

    silenceModel will be used to decide if an audio frame corresponds to silence.
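The column-per-frame layout described above can be explored with a short script. This is only a sketch: the variable and field names come from the description above, but the placeholder branch (used when the MAT-file is not in the current folder) fills in made-up matrix sizes so that the script still runs. The real feature dimensions and frame counts may differ.

```matlab
% Sketch: load the database and pull out the features of one frame.
dataFile = 'ECE417_MP5_AV_DATA.mat';
if exist(dataFile, 'file') == 2
    % Loads av_train, av_validate, testAudio, and silenceModel as fields of data.
    data = load(dataFile);
else
    % Placeholder data (sizes are hypothetical) so the sketch runs without the file.
    data.av_train.audio  = rand(5, 8);   % one audio feature vector per column
    data.av_train.visual = rand(3, 8);   % one visual feature vector per column
end

k = 2;                                    % frame index (1-based)
audioVec  = data.av_train.audio(:, k);    % audio features of frame k
visualVec = data.av_train.visual(:, k);   % [dw; dh1; dh2] of frame k

dw  = visualVec(1);   % change in lip width
dh1 = visualVec(2);   % change in upper-lip height
dh2 = visualVec(3);   % change in lower-lip height
fprintf('Frame %d: dw = %.3f, dh1 = %.3f, dh2 = %.3f\n', k, dw, dh1, dh2);
fprintf('Audio feature dimension: %d\n', numel(audioVec));
```

Calling size(data.av_train.audio) and size(data.av_train.visual) on the real data is a quick way to check the feature dimensions and frame counts before training.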

Data for Producing Animation

Figure 2: Mouth image (left); Mouth image with triangular mesh (right)

In this machine problem, facial animation is achieved by image warping. Two files are provided for image warping. In addition, a waveform file is provided as the sound track corresponding to testAudio for making the final movie file.

  1. mouth.jpg

    A neutral mouth image will be provided (see Figure 2). You will use it to generate a mouth animation image sequence.

  2. mesh.txt

    A triangular mesh that triangulates the mouth area in mouth.jpg (see Figure 2). You will use this mesh and the mouth image to generate new mouth images through image warping. The format of the file mesh.txt is:

    1. Number of vertices
    2. x coordinate of vertex 1, y coordinate of vertex 1
    3. x coordinate of vertex 2, y coordinate of vertex 2
    4. ...
    5. Number of triangles
    6. vertex 1, vertex 2, vertex 3 (of the 1st triangle)
    7. ...

  3. test.wav

    The waveform file corresponding to the audio feature matrix testAudio.
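The mesh.txt layout listed above can be read with a few lines of MATLAB. The following is a minimal sketch, not the course's reference reader. It assumes that the numbers on a line are separated by commas and/or whitespace and that triangle entries are 1-based vertex indices; neither is stated explicitly above, so check both against the real file. A tiny sample file is written first so that the script is self-contained.

```matlab
% Write a small sample file in the described layout (hypothetical contents).
sampleFile = [tempname '.txt'];
fid = fopen(sampleFile, 'w');
fprintf(fid, '4\n0, 0\n10, 0\n10, 5\n0, 5\n2\n1, 2, 3\n1, 3, 4\n');
fclose(fid);

% Read every number in the file into one column vector.
txt  = fileread(sampleFile);
nums = sscanf(strrep(txt, ',', ' '), '%f');

% The first number is the vertex count, followed by x,y pairs.
nV    = nums(1);
verts = reshape(nums(2 : 1 + 2*nV), 2, nV)';      % nV-by-2, rows are [x y]

% Next comes the triangle count, followed by vertex-index triples.
nT   = nums(2 + 2*nV);
tris = reshape(nums(3 + 2*nV : 2 + 2*nV + 3*nT), 3, nT)';   % nT-by-3

delete(sampleFile);
disp(verts);
disp(tris);
```

With the real file, replacing sampleFile by 'mesh.txt' (and dropping the sample-writing and delete lines) should be enough. Plotting the result with triplot(tris, verts(:,1), verts(:,2)) is a quick sanity check of the index convention.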

Tasks

  1. Write your image warping code in MATLAB. The code takes the visual features as input and synthesizes new mouth images. (We recommend doing this part first.)
  2. Load pre-processed training data from ECE417_MP5_AV_DATA.mat.
  3. Use the training data set to train a set of ANNs as the mapping from audio features to visual features. The MATLAB code ECE417_MP5_AV_train is provided.
  4. Apply the mapping to the test audio features and obtain synthetic visual features. The MATLAB code ECE417_MP5_AV_test is provided.
  5. Produce an image sequence from the synthesized visual features.
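One common way to approach the warping in task 1 is a piecewise-affine warp: each pixel inside a deformed triangle is expressed in barycentric coordinates, and the same weights applied to the neutral triangle give the source pixel to copy. The sketch below illustrates that idea only. The image, the mesh, and the deformed vertex positions are all synthetic stand-ins, it handles a grayscale image with nearest-neighbor sampling, and it is not the provided interpVert or a required design for your own warping function.

```matlab
% Synthetic stand-ins (hypothetical): a 30x40 grayscale ramp and a two-triangle mesh.
img      = repmat(uint8(linspace(0, 255, 40)), 30, 1);
srcVerts = [5 5; 35 5; 35 25; 5 25];     % neutral vertices, rows are [x y]
tris     = [1 2 3; 1 3 4];               % 1-based vertex indices per triangle
dstVerts = srcVerts;
dstVerts(3, :) = [35 28];                % move the two lower vertices down
dstVerts(4, :) = [5 28];

[H, W] = size(img);
out = zeros(H, W, 'uint8');              % pixels outside the mesh stay black

for t = 1:size(tris, 1)
    d = dstVerts(tris(t, :), :);         % 3x2 deformed (destination) triangle
    s = srcVerts(tris(t, :), :);         % 3x2 neutral (source) triangle

    % Candidate pixels: bounding box of the destination triangle, clipped to the image.
    xMin = max(1, floor(min(d(:, 1))));  xMax = min(W, ceil(max(d(:, 1))));
    yMin = max(1, floor(min(d(:, 2))));  yMax = min(H, ceil(max(d(:, 2))));
    [X, Y] = meshgrid(xMin:xMax, yMin:yMax);
    X = X(:);  Y = Y(:);

    % Barycentric coordinates of every candidate pixel in the destination triangle.
    A      = [d'; 1 1 1];                          % columns are [x; y; 1] per vertex
    lambda = A \ [X'; Y'; ones(1, numel(X))];      % 3xN weights, each column sums to 1
    inside = all(lambda >= -1e-9, 1);              % pixel lies in the triangle

    % The same weights on the source triangle give the source coordinates.
    srcXY = s' * lambda(:, inside);                % 2xM
    sx = min(max(round(srcXY(1, :)), 1), W);       % nearest neighbor, clamped
    sy = min(max(round(srcXY(2, :)), 1), H);

    dstIdx = sub2ind([H W], Y(inside), X(inside));
    srcIdx = sub2ind([H W], sy(:), sx(:));
    out(dstIdx) = img(srcIdx);
end
```

Iterating over destination pixels and looking up the source (inverse mapping) avoids gaps in the output. Bilinear sampling and per-channel handling of a color image would be natural refinements.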

Detailed Description

  1. Image warping: First, deform the mesh according to the visual features. The mesh deformation is obtained by interpolation from the visual features; a MATLAB function, interpVert, which uses linear interpolation, will be provided. Then write a function, warpimg, that generates the deformed mouth images from the given mouth image (see Figure 2). Leave pixels outside the mesh, and pixels in the holes inside the lips, black.
  2. Load pre-processed data.
  3. ANNs training and testing:
    1. The MATLAB function ECE417_MP5_train will be provided. One parameter (the number of hidden units) can be adjusted to get good mapping results.
    2. The MATLAB function ECE417_MP5_test will be provided.
  4. Use the visual features estimated from the test data, the triangular mesh, and the mouth image to generate the mouth image sequence.
  5. Produce an animation movie file.
    1. First, generate face images from the visual features.
    2. Save the images in JPEG format and name them test_####.jpg, where #### is the four-digit frame number of the image, starting from 0. For example, the file name for the 15th frame is test_0014.jpg.
    3. Use the provided executable DxBMP.exe to convert the image sequence into a movie file. If DxBMP.exe is in the same directory as the images, the command line is "DxBMP -framerate 30 test_*.jpg test.avi". The output movie file is test.avi. More information about DxBMP.exe can be found in the provided DxBMP.htm.
    4. Open Windows Movie Maker in Windows XP. Click "import video" to import the test.avi created in step 3. Click "import audio or music" to import the provided audio track test.wav. Then drag test.avi onto the video track of the timeline, and drag test.wav onto the audio track. Finally, click "Finish Movie->save to my computer" to save the movie under the file name mp5.wmv.
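The zero-based, four-digit file naming in the movie-production steps above can be produced with sprintf. The sketch below uses three black placeholder frames and a temporary output folder, both hypothetical; in your solution the frames would be your warped mouth images and the folder would be wherever DxBMP.exe will look for them.

```matlab
% Placeholder frames (hypothetical): three black 30x40 grayscale images.
frames = zeros(30, 40, 3, 'uint8');

outDir = tempname;          % scratch folder for this demo only
mkdir(outDir);

names = cell(1, size(frames, 3));
for k = 1:size(frames, 3)
    % Frame numbers start at 0, so the k-th frame is numbered k-1.
    names{k} = sprintf('test_%04d.jpg', k - 1);
    imwrite(frames(:, :, k), fullfile(outDir, names{k}));   % format taken from .jpg
end

% Name of the 15th frame:
fprintf('%s\n', sprintf('test_%04d.jpg', 15 - 1));   % prints test_0014.jpg
```

Zero-padding keeps the files in frame order when they are matched by the wildcard test_*.jpg.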

What to submit:

Submit the movie file mp5.wmv that you generate, your code, a README file, and a report briefly describing how you completed this machine problem. Compress everything into a zip file named xxx.zip, where xxx is your NetID. Send your zip file to ece.ece.ece.417@gmail.com by 5pm on the due date.

