- When: Fall 2014, Thursdays 5:00-6:00pm
- What you gain: Knowledge of tools useful for audiovisual speech recognition and audiovisual speech synthesis.
- What you give: At least an hour of your time, every week, to help us process data for the Dr. Avatar research project.
Learning: Topics to be covered
- Speech and Language Technology
- OpenFst: Weighted Finite-State Transducers for Regular Languages
- kaldi: Speech Recognition based on OpenFst
- HTK/HTS: Speech Recognition and Synthesis
- Multimedia Annotation and Conversion Software
Research: Goals and timeline
We will be working with about two hours of speech data, recorded by an M.D. who volunteered to help with this research. The data are available at http://ifp-08.ifp.uiuc.edu/protected/avatar, accessible via sftp.
- Audio/Video Alignment and Segmentation at Silences will be done by one of the grad students, or Dr. Hasegawa-Johnson, in matlab. We'll then need everybody in the project to help with manual validation of the alignment, using ELAN.
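The silence-based segmentation step above can be pictured with a minimal energy-thresholding sketch (here in Python rather than matlab). The frame length and threshold are illustrative assumptions, not the project's actual parameters.

```python
def silence_segments(samples, frame_len=160, threshold=0.01):
    """Return (start, end) frame-index pairs for low-energy (silent) runs."""
    # Compute short-time energy per non-overlapping frame.
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    # Collect maximal runs of frames whose energy falls below the threshold.
    segments, start = [], None
    for idx, e in enumerate(energies):
        if e < threshold and start is None:
            start = idx
        elif e >= threshold and start is not None:
            segments.append((start, idx))
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments
```

A real pass would work on actual audio samples and would still need the manual validation in ELAN described above, since energy thresholds confuse quiet speech with silence.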
- Audio/Text Alignment.
- I propose to use the methods in "Automatic Long Audio Alignment and Confidence Scoring for Conversational Arabic" by Elmahdy et al.
- First step: Train "standard English" acoustic models, using the kaldi tutorial methods and the Resource Management corpus
- Second step: Train a corpus-dependent language model. Prof. Hasegawa-Johnson has always done this by writing original code in Ruby, so if nothing changes before we get to this step, that is what he will teach you to do. Prof. Elmahdy has told me that it's better to do this using IRSTLM or KenLM, so we could try doing it that way.
- Third step: Create a pronunciation model, possibly using the ISLEX dictionary
- Fourth step: Apply kaldi to decode the audio data
- Fifth step: adapt LM and AM for each segment, and perform second-pass alignment.
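As a concrete picture of what the corpus-dependent language model in step two estimates, here is a minimal add-one-smoothed bigram model in Python. The training sentences are hypothetical stand-ins for the project transcripts; a real run would use IRSTLM or KenLM (or the Ruby code mentioned above) rather than this sketch.

```python
from collections import Counter

def train_bigram(sentences):
    """Train an add-one-smoothed bigram LM; returns P(w2 | w1) as a function."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocab = len(unigrams)

    def prob(w1, w2):
        # Add-one (Laplace) smoothing over the observed vocabulary.
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab)

    return prob
```

The toolkit versions add proper discounting (e.g. Kneser-Ney) and ARPA-format export, but the counting step is the same idea.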
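The second-pass alignment idea in steps four and five can be illustrated with a small sketch: match the decoder's hypothesis words against the reference transcript and keep long exact matches as reliable anchor regions. This uses Python's difflib to show the concept only; it is not Elmahdy et al.'s actual algorithm.

```python
from difflib import SequenceMatcher

def find_anchors(hypothesis, reference, min_len=2):
    """Return (hyp_index, ref_index, length) blocks of exactly matching words.

    hypothesis and reference are lists of words; blocks shorter than
    min_len are discarded as unreliable anchors.
    """
    matcher = SequenceMatcher(a=hypothesis, b=reference, autojunk=False)
    return [(m.a, m.b, m.size)
            for m in matcher.get_matching_blocks()
            if m.size >= min_len]
```

The anchors partition the long recording into short stretches that can then be re-decoded with segment-adapted acoustic and language models.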
- Audio Synthesis
- Concatenative Synthesis: Prof. Hasegawa-Johnson has only ever done this by writing original code in Ruby, so if nothing changes before we reach this point, that's what we'll do. It may be easier to use Festival, but we're not sure.
- HMM-Based Synthesis: Yang Zhang has done this using original matlab code, so we might use his code, or we might use HTS.
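To make the concatenative option concrete, here is a minimal greedy unit-selection sketch: for each target phone, pick the candidate unit whose starting pitch best matches the end of the previously chosen unit. The unit inventory and the pitch-only join cost are illustrative assumptions; real systems also use target costs and spectral join costs.

```python
def select_units(target_phones, inventory):
    """Greedy unit selection.

    inventory maps phone -> list of (start_f0, end_f0) candidate units;
    returns a list of (phone, chosen_unit) pairs minimizing the pitch
    mismatch at each join, one unit at a time.
    """
    chosen, prev_end_f0 = [], None
    for phone in target_phones:
        candidates = inventory[phone]
        if prev_end_f0 is None:
            best = candidates[0]  # no join cost for the first unit
        else:
            best = min(candidates, key=lambda u: abs(u[0] - prev_end_f0))
        chosen.append((phone, best))
        prev_end_f0 = best[1]
    return chosen
```

A full system would search all unit sequences jointly (e.g. with Viterbi over the candidate lattice) rather than greedily, but the cost structure is the same.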
- Video Synthesis
- The most flexible way to do this is by animating a 3D model of the head. Kuangxiao Gu is working on this.
- HMM-based synthesis: features for the animated head could be generated using HTS
- Concatenative Synthesis: could be done using methods identical to concatenative synthesis of audio. This method may be useful if we have complete words or long audio segments in the training data, and if we can select them so as to get reasonable emotional variation.
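Whichever synthesis route we take, driving the animated head requires collapsing phone sequences to visemes (visually distinct mouth shapes). The mapping below is a small illustrative subset, not a standard viseme set.

```python
# Hypothetical phone-to-viseme table; real tables cover the full phone set.
VISEME_MAP = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AH": "open",
    "IY": "spread", "EY": "spread",
}

def phones_to_visemes(phones):
    """Map phones to visemes, merging adjacent repeats into one target."""
    visemes = []
    for p in phones:
        v = VISEME_MAP.get(p, "neutral")  # unknown phones get a neutral mouth
        if not visemes or visemes[-1] != v:
            visemes.append(v)
    return visemes
```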
- Emotion Synthesis
- Extractive synthesis: label each segment of the training corpus for its emotional nuance, automatically, using matlab or kaldi. Then re-train HTS to generate text-to-speech with correct emotional nuance.
- Generative synthesis: TTS a target sentence, then modify its F0, segment durations, and facial expressions in matlab.
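The F0 modification in the generative approach can be sketched as scaling a pitch contour about its mean and shifting it (here in Python rather than matlab). The scale and shift values are illustrative; unvoiced frames, conventionally coded as zero F0, are left untouched.

```python
def modify_f0(f0_track, scale=1.2, shift=10.0):
    """Expand an F0 contour about its voiced mean and shift it upward.

    f0_track is a list of per-frame F0 values in Hz, with 0 marking
    unvoiced frames; higher and wider F0 tends to sound more excited.
    """
    voiced = [f for f in f0_track if f > 0]
    mean = sum(voiced) / max(1, len(voiced))
    out = []
    for f in f0_track:
        if f <= 0:
            out.append(0.0)  # keep unvoiced frames unvoiced
        else:
            out.append((f - mean) * scale + mean + shift)
    return out
```

Segment durations and facial-expression parameters would be modified analogously, stretching or re-targeting each in proportion to the intended emotional nuance.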