
Progress in Arabic Broadcast News Transcription at BBN



Presentation Transcript


  1. Progress in Arabic Broadcast News Transcription at BBN Mohamed Afify, Long Nguyen, John Makhoul STT Workshop, Philadelphia, PA, March 24, 2005

  2. Overview • Problems in Arabic speech recognition • Arabic Treebank and Buckwalter morphological analyzer • Building phonetic systems in Arabic • Comparison of phonetic and grapheme models • Experimental results • Summary and future work

  3. Problems in Arabic Speech Recognition • Lack of short vowels from existing corpora • Creates ambiguity for acoustic and language models • Most systems rely on grapheme-based acoustic models • No explicit models for short vowels • Therefore, no detailed phonetic acoustic models • Language models also ignore short vowels • Affixes create a large number of “words” • e.g., “and he will write it” is one word in Arabic • OOV rate is around 5% for 64K lexicon compared to around 0.5% for English • Morphological richness also adds to the large number of words
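To make the affix problem concrete, here is a minimal sketch (not from the talk) of how conjunctions, tense markers, and object pronouns attached to a single stem multiply the number of distinct surface "words" the recognizer must cover. Strings are in Buckwalter transliteration, and the affix lists are illustrative only.

```python
# Illustrative only: a handful of prefixes and suffixes on one stem already
# produces dozens of distinct surface forms, which is what drives the high
# OOV rate relative to English.
prefixes = ["", "w", "f", "ws", "fs"]      # -, and-, so-, and-will-, so-will-
stem     = "yktb"                          # "he writes" (unvowelized)
suffixes = ["", "hA", "h", "hm", "k"]      # -, -it/her, -it/him, -them, -you

surface_forms = {p + stem + s for p in prefixes for s in suffixes}
print(len(surface_forms), "surface forms from one stem")   # 25
# e.g. "wsyktbhA" ~ "and he will write it": five English words, one Arabic word
```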

  4. Possible Solutions • Short Vowels • Obtain vowelization of words in dictionary from Arabic Treebank and morphological analysis • Bootstrap acoustic-phonetic models for all phonemes, including short vowels • Expand vowelization process to language model • Affixes and morphological richness • Reduce OOV rate by increasing lexicon size • Use morphological analysis to decompose words into components • Current focus • Bootstrap acoustic models for short vowels • Build phonetic system • Available resources • No vowelized speech corpus • Arabic Treebank • Buckwalter morphological analyzer
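The decomposition idea listed above can be pictured as a greedy affix-stripping pass. The sketch below is only an illustration under assumed affix lists and a toy stem lexicon, not the BBN implementation.

```python
# Rough sketch of affix splitting for OOV reduction: strip a known prefix and
# suffix if the remaining stem is in the lexicon. All lists here are toy
# placeholders in Buckwalter transliteration.
PREFIXES = ["wAl", "Al", "wb", "w", "f", "b", "l", "k", ""]
SUFFIXES = ["hA", "hm", "h", "k", "y", ""]
STEM_LEXICON = {"ktAb", "yktb", "mdrs"}

def decompose(word):
    """Return (prefix, stem, suffix) if a known stem remains, else None."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if not word.startswith(p):
            continue
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if not word.endswith(s):
                continue
            stem = word[len(p):len(word) - len(s)] if s else word[len(p):]
            if stem in STEM_LEXICON:
                return p, stem, s
    return None

print(decompose("wAlktAbhA"))   # ('wAl', 'ktAb', 'hA')
```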

  5. LDC Arabic Treebank • Text only; no speech • Consists of three parts • The words in the articles in Parts 1 and 2 are vowelized in context • The unique words in Part 3 have multiple pronunciations based on the Buckwalter morphological analyzer

  6. Buckwalter Morphological Analyzer • Available from LDC • Uses a lexicon and a set of rules for affixes to • Assign parts of speech to a word • Produce different vowelizations for each word • Version 2.0 was recently released • Several additional new features • Produces all possible ending vowelizations for an input word • Can only analyze words whose stems are in its lexicon • Lexicon has about 40K stems • Does not include many foreign words • Does not deal with misspelled words
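Conceptually, the analyzer tries every prefix/stem/suffix segmentation of an unvowelized word and looks each piece up in its dictionaries, keeping combinations that are mutually compatible. The sketch below is a heavily simplified illustration with toy dictionaries; it omits the real tool's compatibility tables and uses none of its actual data files.

```python
# Greatly simplified sketch of the analyzer's segment-and-lookup idea.
# The toy dictionaries map unvowelized pieces to candidate vowelizations.
PREFIX_DICT = {"": [""], "w": ["wa"], "Al": ["Al"]}
STEM_DICT   = {"ktb": ["katab", "kutib", "kutub"], "ktAb": ["kitAb"]}
SUFFIX_DICT = {"": [""], "hA": ["hA"]}

def analyze(word):
    """Yield candidate vowelizations (prefix + stem + suffix) of an unvowelized word."""
    n = len(word)
    for i in range(0, min(4, n) + 1):          # prefix of up to 4 letters
        for j in range(n, i - 1, -1):          # suffix boundary
            pre, stem, suf = word[:i], word[i:j], word[j:]
            if pre in PREFIX_DICT and stem in STEM_DICT and suf in SUFFIX_DICT:
                for pv in PREFIX_DICT[pre]:
                    for sv in STEM_DICT[stem]:
                        for fv in SUFFIX_DICT[suf]:
                            yield pv + sv + fv

print(list(analyze("wktAbhA")))   # ['wakitAbhA']
```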

  7. Building an Arabic Phonetic System • Use Arabic Treebank and Buckwalter morphological analyzer to bootstrap short vowels for acoustic training data and recognition lexicon • Method 1 • Look up the word in the Treebank dictionary • If not found, pass it to the morphological analyzer • If both fail, discard the word or vowelize it manually • Method 2 • Pass the word to the morphological analyzer • If that fails, look it up in the Treebank dictionary • If both fail, discard the word or vowelize it manually • As a result, some acoustic training data and words in the recognition lexicon were discarded • We found Method 2 to give more consistent vowelizations than Method 1
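The two lookup orders amount to a simple cascade. The sketch below assumes hypothetical helpers treebank_vowelizations and buckwalter_vowelizations that return a word's candidate vowelizations (or an empty list); neither name comes from the talk.

```python
# Sketch of the two bootstrapping orders described on the slide.
def vowelize_method1(word, treebank_vowelizations, buckwalter_vowelizations):
    """Treebank dictionary first, morphological analyzer as fallback."""
    return (treebank_vowelizations(word)
            or buckwalter_vowelizations(word)
            or None)                 # None => discard or vowelize manually

def vowelize_method2(word, treebank_vowelizations, buckwalter_vowelizations):
    """Morphological analyzer first, Treebank dictionary as fallback."""
    return (buckwalter_vowelizations(word)
            or treebank_vowelizations(word)
            or None)
```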

  8. Arabic Phonetic System (cont’d) • Starting with 100 hrs of possible acoustic training data and a 64K recognition lexicon, we were able to keep: • 80 hrs (63K utterances) of data with short vowels • 62K recognition lexicon with short vowels • A 35-phoneme set (28 consonants + 6 vowels + “taa marbuuTa”) • Phonetic transcription rules are relatively straightforward starting from vowelized transcriptions • Built a conventional phonetic system and compared to grapheme system • No vowelization for language model
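Once a word carries its short vowels, the letter-to-phoneme mapping is nearly one-to-one, which is why the transcription rules are "relatively straightforward." The sketch below uses an illustrative symbol set over Buckwalter transliteration, not the actual 35-phoneme inventory used at BBN.

```python
# Hedged sketch: map a vowelized Buckwalter string to a phoneme sequence.
# Phone names here are illustrative placeholders.
LETTER_TO_PHONE = {
    "b": "B", "t": "T", "k": "K", "m": "M", "n": "N", "r": "R", "s": "S",
    "A": "AA", "w": "W", "y": "Y",            # long vowels / glides
    "a": "A", "i": "I", "u": "U",             # short vowels from vowelization
    "p": "TM",                                 # taa marbuuTa
}

def to_phones(vowelized_word):
    """Map a vowelized Buckwalter string to a phoneme sequence."""
    phones = []
    for ch in vowelized_word:
        if ch == "~":                # shadda: geminate the previous consonant
            if phones:
                phones.append(phones[-1])
        elif ch == "o":              # sukun: explicitly no vowel, nothing added
            continue
        else:
            phones.append(LETTER_TO_PHONE.get(ch, ch))
    return phones

print(to_phones("kitAb"))    # ['K', 'I', 'T', 'AA', 'B']
```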

  9. Initial Results • Dev03, unadapted results, Method 1 vowelization • Normalization I : Normalize “hamza” at beginning of the word • Normalization II : Normalize “hamza” at beginning of the word, after popular prefixes, and also frequent “Y” and “y” confusions at end of word • Text normalization is much more important for phonetic system
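Written over Buckwalter transliteration, the two normalizations might look like the following. The exact prefix list and rules are assumptions; the slide only names the classes of changes.

```python
import re

def normalize_I(word):
    # Normalization I: collapse hamza-on-alif variants at the start of a word.
    return re.sub(r"^[<>{|]", "A", word)

def normalize_II(word):
    # Normalization II: also collapse hamza after common prefixes
    # (w-, f-, b-, l-, k-, Al-), and the frequent word-final Y / y confusion.
    word = re.sub(r"^((?:w|f|b|l|k|Al)*)[<>{|]", r"\1A", word)
    word = re.sub(r"Y$", "y", word)
    return word

print(normalize_II("w<lY"))   # 'wAly'
```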

  10. Updated Development Results • Use Normalization II on acoustic and language training data, and for scoring • Use Method 2 to bootstrap short vowels • Expanded phonetic transcription rules to include assimilation of word-initial hamza and definite article • Dev03 test set, unadapted decoding • About 13% improvement for phonetic system
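One of the expanded rules, assimilation of the definite article before a "sun" (coronal) letter, can be sketched as a rewrite on the vowelized string whose output then feeds a letter-to-phone mapping like the one above. This is a textbook approximation, not necessarily the exact rule used.

```python
# Buckwalter coronal ("sun") consonants; the set is approximate.
SUN_LETTERS = set("tv*drzsS$DTZln")

def assimilate_article(vowelized_word):
    """Rewrite 'Al' + sun letter so the article's /l/ surfaces as gemination."""
    if (vowelized_word.startswith("Al")
            and len(vowelized_word) > 2
            and vowelized_word[2] in SUN_LETTERS):
        # Al + $ams -> A$~ams: drop the /l/, geminate the following consonant
        return "A" + vowelized_word[2] + "~" + vowelized_word[3:]
    return vowelized_word

print(assimilate_article("Al$amsu"))   # 'A$~amsu'
```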

  11. Experimental Results • About 80 hrs of net acoustic training data • ML models for unadapted decoding • ML SAT models for adapted decoding • About 300M words of language training data • 3-gram language models • 60K recognition lexicon • Adapted decoding on different test sets

  12. Next Immediate Steps • Use all 100 hrs for acoustic training • Phonetic models can automatically vowelize discarded sentences • Possibly manually vowelize missing words • Use 64K recognition lexicon • Manually vowelize missing words • Gain is about 1% absolute on Dev03 for grapheme system • Switch to MMI models for unadapted and adapted decoding
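Letting the phonetic models vowelize the discarded sentences amounts to forced alignment over candidate vowelizations. The sketch below uses hypothetical stand-ins (candidate_vowelizations, alignment_score) for the analyzer and the recognizer's alignment pass, and an exhaustive search that a real system would replace with lattice- or graph-based alignment inside the decoder.

```python
from itertools import product

def auto_vowelize(utterance_words, audio,
                  candidate_vowelizations, alignment_score):
    """Pick the per-word vowelizations that align best with the audio."""
    best, best_score = None, float("-inf")
    # Score every combination of per-word candidates against the audio.
    for combo in product(*(candidate_vowelizations(w) for w in utterance_words)):
        score = alignment_score(audio, combo)
        if score > best_score:
            best, best_score = combo, score
    return list(best) if best else None
```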

  13. Summary and Future Work • Quickly bootstrapped a phonetic system for Arabic • Text normalization and Buckwalter morphological analyzer Version 2.0 are key to success • 8% to 13.5% improvement over the grapheme system across different test sets • Further improvement can be obtained by straightforward upgrades • Future work • Use vowelization in the language model • Increase lexicon size to reduce OOV rate • Statistical vowelization for missed words, mainly foreign names
