Key Results of this Research

Star Challenge: language-independent spoken word detection

SST and IFP jointly entered a UIUC team in the 2008 A*STAR Star Challenge, a multimedia retrieval competition held in Singapore. The competition included image retrieval, video shot retrieval, and unknown-language spoken term detection components. UIUC was the only United States team to reach the finals, and took third place in the competition.

We wrote a CLIAWS paper based on our Star Challenge system. The system was trained on English, Russian and Spanish, then tested on Croatian. Acoustic models were either not adapted (AM0) or adapted (AMt) to the Croatian speech. The phoneme bigram language model was likewise either not adapted (LM0) or adapted (LMt). Retrieval queries were specified either in IPA notation or as an audio example. Resulting scores (MAP = mean average precision) are shown as a function of the degree of query expansion (the number of allowed phonological feature substitutions).
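For readers unfamiliar with the metric, the sketch below shows how mean average precision is conventionally computed from ranked retrieval results. It illustrates the standard definition only and is not the Star Challenge scoring code; the function names and the toy queries are assumptions made for this example.

```python
def average_precision(ranked, relevant):
    """Average precision for one query.

    ranked:   list of retrieved item ids, best first
    relevant: collection of item ids that are truly relevant
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            # precision measured at the rank of each relevant hit
            precision_sum += hits / rank
    # relevant items that were never retrieved contribute zero
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of the per-query average precisions.

    queries: list of (ranked, relevant) pairs, one per query
    """
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


# Toy data (hypothetical): two queries over made-up utterance ids.
toy_queries = [
    (["u1", "u2", "u3", "u4"], {"u1", "u3"}),  # AP = (1/1 + 2/3) / 2
    (["u5", "u6"], {"u6"}),                    # AP = 1/2
]
toy_map = mean_average_precision(toy_queries)
```

A higher MAP means that relevant utterances tend to appear nearer the top of each ranked list.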

Prosody reduces the word error rate of a speech recognizer

In our Speech Communication paper and several conference papers leading up to it, we demonstrated that prosodic tags can reduce the word error rate of a speech recognizer by about 13% relative, from 24.8% to 21.7% (table below). The most interesting finding was that the benefits of a prosody-dependent acoustic model and of a prosody-dependent language model are super-additive: together they reduce the error rate by more than the sum of their individual reductions. We believe that these two models serve as a sort of consistency check: if the prosody for a candidate transcription matches the acoustics but not the context, or vice versa, then that transcription can be ruled out.

Word Error Rates                 Acoustic Model has no Prosody   Acoustic Model has Prosody
Language Model has no Prosody    24.8%                           24.0%
Language Model has Prosody       24.3%                           21.7%
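The relative reduction and the super-additivity claim can be checked directly from the four table entries. The sketch below is only a worked arithmetic example; the variable and function names are illustrative assumptions and do not come from the papers.

```python
# Word error rates (percent) copied from the table above.
wer_baseline = 24.8  # no prosody in either model
wer_am_only = 24.0   # prosody in the acoustic model only
wer_lm_only = 24.3   # prosody in the language model only
wer_both = 21.7      # prosody in both models


def relative_reduction(baseline, new):
    """Fraction of the baseline error rate that was removed."""
    return (baseline - new) / baseline


def is_super_additive(baseline, a_only, b_only, both):
    """True if the joint gain exceeds the sum of the individual gains."""
    joint_gain = baseline - both
    summed_gain = (baseline - a_only) + (baseline - b_only)
    return joint_gain > summed_gain


rel = relative_reduction(wer_baseline, wer_both)
super_additive = is_super_additive(wer_baseline, wer_am_only, wer_lm_only, wer_both)
```

With these numbers the joint model removes 3.1 points of absolute error, against 0.8 + 0.5 = 1.3 points for the two single-model gains added together, and 3.1 / 24.8 is about 12.5% relative, which the text reports as 13%.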

Distinctive features reduce the word error rate of a speech recognizer

The word error rate of a speech recognizer can be reduced slightly if its observations include estimates of the distinctive features of acoustic phonetic landmarks, computed using support vector machines. Sarah Borys demonstrated an HMM-based system in which telephone-band phone error rates were reduced from 63.9% to 62.8%. Using a better baseline system, Hasegawa-Johnson et al. (Karen Livescu ran the best experiment) demonstrated a DBN-based system in which telephone-band word error rates were reduced from 27.7% to 27.2%.