SBS Audio with Matched and Mismatched Transcripts

This directory contains matched (native) and mismatched (non-native) transcriptions of podcast audio published by the Special Broadcasting Service (SBS) of Australia. We are indebted to SBS for permitting us to crowdsource their audio in this way. Funding for the crowdsourcing and the research was provided by Microsoft, Google, Amazon, Mitsubishi, and MERL, in the form of a grant to the 2015 Jelinek Speech and Language Technology workshop, organized by Johns Hopkins University and the University of Washington. Additional thanks to the DARPA LORELEI program for the impetus to redistribute these data.

For more information about these data, see Mark Hasegawa-Johnson, Preethi Jyothi, Daniel McCloy, Majid Mirbagheri, Giovanni di Liberto, Amit Das, Bradley Ekin, Chunxi Liu, Vimal Manohar, Hao Tang, Edmund C. Lalor, Nancy Chen, Paul Hager, Tyler Kekona, Rose Sloan, and Adrian KC Lee, "ASR for Under-Resourced Languages from Probabilistic Transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing 25(1):46-59, 2017, doi:10.1109/TASLP.2016.2621659.


The podcasts are available directly from SBS. Here is a list of their URLs:

Matched Transcripts

Matched transcripts are available for nine of the languages. Seven were collected in 2015. The Amharic and Dinka transcripts were collected by Amit Das for the following paper: Amit Das, Preethi Jyothi, and Mark Hasegawa-Johnson, "Automatic speech recognition using probabilistic transcriptions in Swahili, Amharic and Dinka," Interspeech 2016, pp. 3524-3527, doi:10.21437/Interspeech.2016-657.

Mismatched Transcripts

Mismatched transcripts are available for 23 languages.

All Transcripts as a Single JSON File

To make cross-language comparison easier, the SBS and ADSC mismatched transcripts were compiled in 2018 into a single JSON file. Funding for the ADSC transcripts (Cantonese, Hokkien, and Vietnamese) was provided by the Advanced Digital Sciences Center of Singapore. Funding for the cross-language compilation was provided by NSF IIS 15-50145.
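
As a starting point for exploring the combined JSON file, the following minimal Python sketch loads a JSON file and reports how many items sit under each top-level key. The layout of the file is not described here, so the sketch assumes nothing about it beyond being valid JSON; the function name `summarize_top_level` and the file name in the usage comment are hypothetical placeholders, not names taken from this distribution.

```python
import json


def summarize_top_level(path):
    """Return a dict mapping each top-level key to the number of items under it.

    Makes no assumption about the schema: containers (dict or list) report
    their length, and any other value counts as a single item.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    if isinstance(data, dict):
        return {
            key: len(value) if isinstance(value, (dict, list)) else 1
            for key, value in data.items()
        }
    if isinstance(data, list):
        # A top-level list has no keys, so report its length under a marker.
        return {"<list>": len(data)}
    return {"<scalar>": 1}


# Hypothetical usage; replace the path with the actual JSON file name:
# for key, count in summarize_top_level("transcripts.json").items():
#     print(key, count)
```

Inspecting the top-level keys first is a cautious way to learn how the file is organized (for example, by language or by source) before writing code that depends on a particular structure.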