Xuesong Yang

Automatic Pronunciation Error Identification based on TOEFL Junior Corpus

Xuesong Yang, 10/7/2014, 4:30-5:30pm, BI 2369

In this talk, I will report on investigations into pronunciation error detection at both the phone and word levels. Performance on this task is heavily affected by the imbalanced class distribution in a manually annotated data set of non-native English (Read Aloud responses from the TOEFL Junior Pilot assessment). To address the problems caused by this extreme class imbalance, cost-sensitive learning and over-sampling with synthetic instances are explored as ways to improve classification performance. Specifically, class weights inversely proportional to class frequencies and the synthetic minority over-sampling technique (SMOTE) were applied to a range of classifiers, using features that consisted of acoustic-phonetic information, linguistic knowledge (stress, syllable structure, word-initial or word-medial position, etc.), and word identity. Experiments demonstrate that both imbalanced-learning approaches improve F1-score over a baseline system trained on the extremely imbalanced data. In addition, feature selection was performed with a recursive feature elimination algorithm over the whole feature set; the best feature subsets provide good insight for further analysis of the nature of pronunciation errors for each phone.
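
The two imbalance remedies named above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula or SMOTE settings used in the experiments, so everything here is an assumption: the weights follow the common "balanced" heuristic, n_samples / (n_classes * class_count), the interpolation step shows only the core idea of SMOTE, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumed formula: rarer classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


# Toy data (hypothetical): 1 = mispronounced, 0 = correctly pronounced
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# Toy two-dimensional feature vectors for the minority (error) class
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority, n_new=5)
```

With a 90/10 split, the minority class is weighted nine times more heavily than the majority class, and each synthetic point lies on a line segment between two real minority samples rather than duplicating one of them.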
