Qatari Arabic Corpus

    Parts of the Qatari Arabic Corpus are now available for download at

    Content
    Speech was recorded from four Qatari television programs in 2009-2011:
    • Al-Jazeera interviews: 207 minutes (multi-dialect recorded in Qatar; relatively formal)
    • LAKOM: 240 minutes (Moroccan dialect; not yet transcribed)
    • Sabah El-Doha: 110 minutes (multi-dialect recorded in Qatar; relatively informal)
    • Tesaneef: 550 minutes (Qatari dialect; extremely informal)

    Distributed Files
    • Nineteen hours of monaural broadcast speech audio,
      • 16 bits/sample in WAV format,
      • recorded at 44.1 kHz sampling rate, but
      • downsampled to 16 kHz sampling rate for distribution.
    • Fifteen hours of phonetic transcription
      • Arabic script,
      • fully vowelized,
      • extended with Persian and Urdu characters in order to distinguish phonemes that are not part of the core Arabic orthography.
    • Fifteen hours of English gloss.
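
The audio format listed above (monaural, 16 bits/sample, 16 kHz WAV) can be checked on a downloaded file with a few lines of Python. This is only a minimal sketch using the standard-library wave module. The function names are illustrative and not part of the corpus distribution, and the sketch assumes the files are uncompressed PCM WAV, which the description suggests but does not state outright.

```python
import wave

# Expected properties, taken from the file description above.
EXPECTED_CHANNELS = 1       # monaural
EXPECTED_SAMPLE_WIDTH = 2   # 16 bits/sample = 2 bytes
EXPECTED_RATE_HZ = 16000    # sampling rate of the distributed files


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic properties."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "rate_hz": rate,
            "duration_s": frames / rate,
        }


def matches_distribution_format(info):
    """Return True if the properties match the format described above."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_width_bytes"] == EXPECTED_SAMPLE_WIDTH
        and info["rate_hz"] == EXPECTED_RATE_HZ
    )
```

For example, calling describe_wav("example.wav") on a corpus file (the file name here is a placeholder) and passing the result to matches_distribution_format should return True if the file follows the description. Summing duration_s over all files gives a rough check against the stated total of about nineteen hours.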