Novel Speech Recognition Models for Arabic

Novel Speech Recognition Models for Arabic The Arabic Speech Recognition Team JHU Workshop Final Presentations August 21, 2002

Arabic ASR Workshop Team • Senior Participants: Katrin Kirchhoff, UW; Jeff Bilmes, UW; John Henderson, MITRE; Mohamed Noamany, BBN; Pat Schone, DoD; Rich Schwartz, BBN • Graduate Students: Sourin Das, JHU; Gang Ji, UW • Undergraduate Students: Melissa Egan, Pomona College; Feng He, Swarthmore College • Affiliates: Dimitra Vergyri, SRI; Daben Liu, BBN; Nicolae Duta, BBN; Ivan Bulyko, UW; Mari Ostendorf, UW

“Arabic” • Modern Standard Arabic (MSA): cross-regional standard, used for formal communication • Dialects (Gulf Arabic, Egyptian Arabic, Levantine Arabic, North African Arabic): used for informal conversation

Arabic ASR: Previous Work • dictation: IBM ViaVoice for Arabic • Broadcast News: BBN TIDESOnTap • conversational speech: 1996/1997 NIST CallHome Evaluations • little work compared to other languages • few standardized ASR resources

Arabic ASR: State of the Art (before WS02) • BBN TIDESOnTap: 15.3% WER • BBN CallHome system: 55.8% WER • WER on conversational speech noticeably higher than for other languages (e.g. 30% WER for English CallHome) → focus on recognition of conversational Arabic

Problems for Arabic ASR • language-external problems: • data sparsity, only 1 (!) standardized corpus of conversational Arabic available • language-internal problems: • complex morphology, large number of possible word forms (similar to Russian, German, Turkish,…) • differences between written and spoken representation: lack of short vowels and other pronunciation information (similar to Hebrew, Farsi, Urdu, Pashto,…)

Corpus: LDC ECA CallHome • phone conversations between family members/friends • Egyptian Colloquial Arabic (Cairene dialect) • high degree of disfluencies (9%), out-of-vocabulary words (9.6%), foreign words (1.6%) • noisy channels • training: 80 calls (14 hrs), dev: 20 calls (3.5 hrs), eval: 20 calls (1.5 hrs) • very small amount of data for language modeling (150K words)!

MSA - ECA differences • Phonology: • /th/ → /s/ or /t/: thalatha - talata (‘three’) • /dh/ → /z/ or /d/: dhahab - dahab (‘gold’) • /zh/ → /g/: zhadeed - gideed (‘new’) • /ay/ → /e:/: Sayf - Seef (‘summer’) • /aw/ → /o:/: lawn - loon (‘color’) • Morphology: • inflections: yatakallamu - yitkallim (‘he speaks’) • Vocabulary: • different terms: TAwila - tarabeeza (‘table’) • Syntax: • word order differences: SVO - VSO

Workshop Goals • improvements to Arabic ASR through • developing novel models to better exploit available data • developing techniques for using out-of-corpus data • Integration of MSA text data • Automatic romanization • Factored language modeling

Factored Language Models • complex morphological structure leads to large number of possible word forms • break up word into separate components • build statistical n-gram models over individual morphological components rather than complete word forms
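
The idea on this slide can be illustrated with a minimal Python sketch: each word is replaced by a bundle of factors, and bigram counts are collected over one factor stream instead of over complete word forms. The factor names (`stem`, `affix`) and the toy analyses are hypothetical and chosen only for illustration; this is not the SRILM extension developed at the workshop.

```python
from collections import Counter

# Hypothetical analyses: surface word -> factors. A real system would obtain
# these from a morphological analyzer; the values here are illustrative only.
ANALYSES = {
    "yitkallim": {"stem": "kallim", "affix": "yit-"},
    "titkallim": {"stem": "kallim", "affix": "tit-"},
    "yiktib": {"stem": "ktib", "affix": "yi-"},
}

def bigrams(stream):
    """Count bigrams over a token stream padded with sentence boundaries."""
    padded = ["<s>"] + list(stream) + ["</s>"]
    return Counter(zip(padded, padded[1:]))

def factor_bigrams(sentences, factor):
    """Count bigrams over one factor stream instead of over full word forms."""
    counts = Counter()
    for words in sentences:
        counts.update(bigrams(ANALYSES[w][factor] for w in words))
    return counts

def word_bigrams(sentences):
    """Count ordinary word-form bigrams for comparison."""
    counts = Counter()
    for words in sentences:
        counts.update(bigrams(words))
    return counts

sentences = [["yitkallim", "yiktib"], ["titkallim", "yiktib"]]
stem_counts = factor_bigrams(sentences, "stem")
word_counts = word_bigrams(sentences)
print(stem_counts[("kallim", "ktib")], word_counts[("yitkallim", "yiktib")])
```

In this toy data the two word-form bigrams are each seen once, while the stem bigram pools both observations, which is the sparsity argument the slide makes.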

Automatic Romanization • Arabic script lacks short vowels and other pronunciation markers • comparable English example: “th fsh stcks f th nrth tlntc hv bn dpltd” for “the fish stocks of the north atlantic have been depleted” • lack of vowels results in lexical ambiguity; affects acoustic and language model training • try to predict vowelization automatically from data and use result for recognizer training
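
The English analogy can be reproduced with a short Python sketch that strips vowels and then groups words that collapse to the same vowel-less form. The word list is an assumption made up for illustration, and the analogy is loose: Arabic script omits only short vowels, whereas this sketch removes every vowel letter.

```python
from collections import defaultdict

VOWELS = set("aeiou")

def strip_vowels(text):
    """Remove vowel letters from every word, keeping word boundaries."""
    return " ".join(
        "".join(c for c in word if c not in VOWELS) for word in text.split()
    )

def ambiguous_forms(words):
    """Group words that become identical once vowels are removed."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_vowels(word)].add(word)
    return {form: sorted(ws) for form, ws in groups.items() if len(ws) > 1}

sentence = "the fish stocks of the north atlantic have been depleted"
print(strip_vowels(sentence))
print(ambiguous_forms(["stocks", "sticks", "stacks", "fish"]))
```

The second call shows the lexical ambiguity the slide refers to: several distinct words share one written form, so both the acoustic model and the language model see a merged unit.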

Out-of-corpus text data • no corpora of transcribed conversational speech available • large amounts of written (Modern Standard Arabic) data available (e.g. Newspaper text) • Can MSA text data be used to improve language modeling for conversational speech? • Try to integrate data from newspapers, transcribed TV broadcasts, etc.

Recognition Infrastructure • baseline system: BBN recognition system • N-best list rescoring • Language model training: SRI LM toolkit with significant additions implemented during this workshop • Note: no work on acoustic modeling, speaker adaptation, noise robustness, etc. • two different recognition approaches: grapheme-based vs. phoneme-based

Summary of Results (WER) • Grapheme-based recognizer baseline: 59.0% • Phone-based recognizer baseline: 55.8% • Random: 62.7% • Additional CallHome data: 55.1% • Language modeling: 53.8% • Automatic romanization: 57.9% • True romanization: 54.9% • Oracle: 46%

Novel research • new strategies for language modeling based on morphological features • new graph-based backoff schemes allowing wider range of smoothing techniques in language modeling • new techniques for automatic vowel insertion • first investigation of use of automatically vowelized data for ASR • first attempt at using MSA data for language modeling for conversational Arabic • morphology induction for Arabic

Key Insights • Automatic romanization improves grapheme-based Arabic recognition systems • trend: morphological information helps in language modeling • needs to be confirmed on larger data set • Using MSA text data does not help • We need more data!

Resources • significant add-on to SRILM toolkit for general factored language modeling • techniques/software for automatic romanization of Arabic script • part-of-speech tagger for MSA & tagged text

Outline of Presentations • 1:30 - 1:45: Introduction (Katrin Kirchhoff) • 1:45 - 1:55: Baseline system (Rich Schwartz) • 1:55 - 2:20: Automatic romanization (John Henderson, Melissa Egan) • 2:20 - 2:35: Language modeling - overview (Katrin Kirchhoff) • 2:35 - 2:50: Factored language modeling (Jeff Bilmes) • 2:50 - 3:05: Coffee Break • 3:05 - 3:10: Automatic morphology learning (Pat Schone) • 3:15 - 3:30: Text selection (Feng He) • 3:30 - 4:00: Graduate student proposals (Gang Ji, Sourin Das) • 4:00 - 4:30: Discussion and Questions

Thank you! • Fred Jelinek, Sanjeev Khudanpur, Laura Graham • Jacob Laderman + assistants • Workshop sponsors • Mark Liberman, Chris Cieri, Tim Buckwalter • Kareem Darwish, Kathleen Egan • Bill Belfield & colleagues from BBN • Apptek

BBN Baseline System for Arabic Richard Schwartz, Mohamed Noamany, Daben Liu, Bill Belfield, Nicolae Duta JHU Workshop August 21, 2002

BBN BYBLOS System • Rough’n’Ready / OnTAP / OASIS system • Version of BYBLOS optimized for Broadcast News • OASIS system fielded in Bangkok and Amman • Real-Time operation with 1-minute delay • 10%-20% WER, depending on data

BYBLOS Configuration • 3 passes of recognition • Forward Fast-match uses PTM models and approximate bigram search • Backward pass uses SCTM models and approximate trigram search, creates N-best. • Rescoring pass uses cross-word SCTM models and trigram LM • All runs in real time • Minimal accuracy difference compared with running slower than real time

Use for Arabic Broadcast News • Transcriptions are in normal Arabic script, omitting short vowels and other diacritics. • We used each Arabic letter as if it were a phoneme. • This allowed addition of large text corpora for language modeling.
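
Treating each letter as a phoneme means the pronunciation lexicon can be generated directly from the word list. A minimal Python sketch, assuming transliterated script forms such as those shown elsewhere in this talk; the function name and the space-separated output format are hypothetical, not BBN's lexicon format.

```python
def grapheme_lexicon(words):
    """Map each word to a 'pronunciation' made of its own letters."""
    return {word: " ".join(word) for word in sorted(set(words))}

# Any new word from a text corpus gets an entry without manual work.
lexicon = grapheme_lexicon(["ktb", "ybqwA", "ktb"])
for word, phones in lexicon.items():
    print(word, phones)
```

Because no hand-written pronunciations are needed, every word in a large text corpus can enter the vocabulary, which is why this choice makes additional language model text usable.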

Initial BN Baseline • 37.5 hours of acoustic training • Acoustic training data (230K words) used for LM training • 64K-word vocabulary (4% OOV) • Initial word error rate (WER) = 31.2%

Speech Recognition Performance

Call Home Experiments • Modified OnTAP system to make it more appropriate for Call Home data. • Added features from LVCSR research to OnTAP system for Call Home data. • Experiments: • Acoustic training: 80 conversations (15 hours) • Transcribed with diacritics • Acoustic training data (150K words) used for LM • Real-time

Using OnTAP system for Call Home

Additions from LVCSR

Output Provided for Workshop • OASIS was run on various sets of training as needed • Systems were run either for Arabic script phonemes or ‘Romanized’ phonemes – with diacritics. • In addition to workshop participants, others at BBN provided assistance and worked on workshop problems. • Output provided for workshop was N-best sentences • with separate scores for HMM, LM, #words, #phones, #silences • Due to high error rate (56%), the oracle error rate for 100 N-best was about 46%. • Unigram lattices were also provided, with oracle error rate of 15%
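
N-best rescoring and the oracle error rate can be sketched in Python as follows. The score names follow the slide (HMM, LM, number of words); the hypotheses, score values, and weights are invented for illustration, and the linear combination is an assumed, commonly used form rather than a description of the BBN implementation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def rescore(nbest, weights):
    """Pick the hypothesis with the highest weighted sum of its scores."""
    return max(
        nbest, key=lambda h: sum(weights[k] * h["scores"][k] for k in weights)
    )

def oracle(nbest, ref):
    """Pick the hypothesis with the fewest word errors against the reference."""
    return min(nbest, key=lambda h: edit_distance(ref, h["words"]))

# Toy 3-best list; all numbers are hypothetical log-domain scores.
nbest = [
    {"words": ["a", "b", "c"], "scores": {"hmm": -10.0, "lm": -6.0, "n_words": 3}},
    {"words": ["a", "b", "d"], "scores": {"hmm": -10.5, "lm": -4.0, "n_words": 3}},
    {"words": ["a", "d"], "scores": {"hmm": -12.0, "lm": -3.0, "n_words": 2}},
]
reference = ["a", "b", "c"]

best = rescore(nbest, {"hmm": 1.0, "lm": 1.0, "n_words": 0.0})
print(best["words"], edit_distance(reference, best["words"]))
print(oracle(nbest, reference)["words"])
```

The oracle choice is the lower bound on the error any rescoring of the same list can reach, which is how the 46% figure on the slide should be read.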

Phoneme HMM Topology Experiment • The phoneme HMM topology was increased for the Arabic script system from 5 states to 10 states in order to accommodate a consonant and possible vowel. • The gain was small (0.3% WER)

OOV Problem • OOV Rate is 10% • 50% is morphological variants of words in the training set • 10% is Proper names • 40% is other unobserved words • Tried adding words from BN and from morphological transducer • Added too many words with too small gain

Use BN to Reduce OOV • Can we add words from BN to reduce OOV? • BN text contains 1.8M distinct words. • Adding entire 1.8M words reduces OOV from 10% to 3.9%. • Adding top 15K words reduces OOV to 8.9% • Adding top 25K words reduces OOV to 8.4%.
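
The experiment on this slide amounts to measuring the OOV rate before and after adding the most frequent words of a second corpus to the vocabulary. A minimal Python sketch, assuming the OOV rate is counted over running test tokens; the toy vocabulary and token lists are hypothetical.

```python
from collections import Counter

def oov_rate(vocab, test_tokens):
    """Fraction of running test tokens that are not in the vocabulary."""
    return sum(tok not in vocab for tok in test_tokens) / len(test_tokens)

def add_top_k(vocab, extra_tokens, k):
    """Extend the vocabulary with the k most frequent words of another corpus."""
    top = [word for word, _ in Counter(extra_tokens).most_common(k)]
    return set(vocab) | set(top)

vocab = {"a", "b"}
test_tokens = ["a", "b", "c", "d", "a"]
extra_tokens = ["c", "c", "e", "d"]  # stands in for the Broadcast News text

before = oov_rate(vocab, test_tokens)
after = oov_rate(add_top_k(vocab, extra_tokens, 1), test_tokens)
print(before, after)
```

Sweeping `k` in such a setup gives the trade-off the slide reports: each added block of words buys a smaller OOV reduction while enlarging the recognizer vocabulary.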

Use Morphological Transducer • Use LDC Arabic transducer to expand verbs to all forms • Produces > 1M words • Reduces OOV to 7%

Language Modeling Experiments Described in other talks • Searched for available dialect transcriptions • Combine BN (300M words) with CH (230K) • Use BN to define word classes • Constrained back-off for BN+CH

Autoromanization of Arabic Script Melissa Egan and John Henderson

Autoromanization (AR) goal • Expand Arabic script representation to include short vowels and other pronunciation information. • Phenomena not typically marked in non-diacritized script include: • Short vowels {a, i, u} • Repeated consonants (shadda) • Extra phonemes for Egyptian Arabic {f/v,j/g} • Grammatical marker that adds an ‘n’ to the pronunciation (tanween) • Example • Non-diacritized form: ktb (‘write’) • Expansions: kitab (‘book’), aktib (‘I write’), kataba (‘he wrote’), kattaba (‘he caused to write’)

AR motivation • Romanized text can be used to produce better output from an ASR system. • Acoustic models will be able to better disambiguate based on extra information in text. • Conditioning events in LM will contain more information. • Romanized ASR output can be converted to script for alternative WER measurement. • Eval96 results (BBN recognizer, 80 conv. train) • script recognizer: 61.1 WERG (grapheme) • romanized recognizer: 55.8 WERR (roman)

AR data • CallHome Arabic from LDC • Conversational speech transcripts (ECA) in both script and a roman specification that includes short vowels, repeats, etc.
set               conversations   words
asrtrain          80              135K
dev               20              35K
eval96 (asrtest)  20              15K
eval97            20              18K
h5_new            20              18K
Row-group labels: Romanizer Testing, Romanizer Training

Data format • Script without and with diacritics • CallHome in script and roman forms • Our task: script → roman • Script: AlHmd_llh kwIsB w AntI AzIk • Roman: ilHamdulillA kuwayyisaB~ wi inti izzayyik

Autoromanization (AR) WER baseline • Train on 32K words in eval97+h5_new • Test on 137K words in ASR_train+h5_new
Status in train   Portion in test   Error in test   % of total error
unambig.          68.0%             1.8%            6.2%
ambig.            15.5%             13.9%           10.8%
unknown           16.5%             99.8%           83.0%
total             100%              19.9%           100.0%
• Biggest potential error reduction would come from predicting romanized forms for unknown words.
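
The totals in the baseline table follow from weighting each category's error rate by its share of the test set. The Python sketch below redoes that arithmetic from the rounded figures on the slide, so the results agree with the published numbers only up to rounding; the variable names are illustrative.

```python
# (portion of test words, error rate within that portion), from the slide.
rows = {
    "unambiguous": (0.680, 0.018),
    "ambiguous": (0.155, 0.139),
    "unknown": (0.165, 0.998),
}

# Each category contributes portion * error to the overall error rate.
contribution = {name: portion * error for name, (portion, error) in rows.items()}
total_error = sum(contribution.values())

# Share of all errors that each category is responsible for.
share = {name: value / total_error for name, value in contribution.items()}

print(f"overall error: {100 * total_error:.1f}%")
for name, value in share.items():
    print(f"{name}: {100 * value:.1f}% of all errors")
```

Unknown words make up about a sixth of the test set but account for most of the errors, which is the basis for the slide's conclusion.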

AR “knitting” example • unknown: tbqwA • 1. Find close known word: known: ybqwA (y + bqwA), unknown: t + bqwA • 2. Record ops required to make roman from known: kn.roman: yibqu, ops: ciccrd • 3. Construct new roman using same ops: new roman: tibqu
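
The three steps of the knitting example can be sketched in Python with the standard `difflib` module. Reading the op letters c, i, r, d as copy, insert, replace, delete is an assumption that is consistent with the example; the use of `difflib` for alignment and for finding the close word, the equal-length restriction, and all function names are choices made for this sketch, not the authors' implementation.

```python
from difflib import SequenceMatcher, get_close_matches

def record_ops(script, roman):
    """Step 2: list the ops that turn a known script form into its roman form."""
    ops = []
    matcher = SequenceMatcher(None, script, roman, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        src_len, target = i2 - i1, roman[j1:j2]
        if tag == "equal":
            ops += [("c", None)] * src_len          # copy the script character
        else:
            n = min(src_len, len(target))
            ops += [("r", ch) for ch in target[:n]]  # replace with roman char
            ops += [("i", ch) for ch in target[n:]]  # insert extra roman chars
            ops += [("d", None)] * (src_len - n)     # delete leftover script
    return ops

def apply_ops(ops, script):
    """Step 3: replay the recorded ops on a different script form."""
    out, pos = [], 0
    for op, ch in ops:
        if op == "c":
            out.append(script[pos])
        elif op in ("r", "i"):
            out.append(ch)
        if op != "i":  # every op except insert consumes one script character
            pos += 1
    return "".join(out)

def knit(unknown, dictionary):
    """Step 1 plus steps 2 and 3: romanize via the closest known word."""
    same_length = [w for w in dictionary if len(w) == len(unknown)]
    close = get_close_matches(unknown, same_length, n=1, cutoff=0.0)
    if not close:
        return None
    known = close[0]
    return apply_ops(record_ops(known, dictionary[known]), unknown)

dictionary = {"ybqwA": "yibqu", "HAnsAhA": "HansAha"}
ops = record_ops("ybqwA", "yibqu")
print("".join(op for op, _ in ops))
print(knit("tbqwA", dictionary))
```

Restricting candidates to known words of the same length keeps the replay step well defined; a fuller version would need a policy for length mismatches.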

Experiment 1 (best match) • Observed patterns in the known short/long pairs: some characters in the short forms are consistently found with particular, non-identical characters in the long forms. • Example rule: A → a

Experiment 2 (rules) • Environments in which ‘w’ occurs in training dictionary long forms:
Env     Freq
C _ V   149
V _ #   8
# _ V   81
C _ #   5
V _ V   121
V _ C   118
• Environments in which ‘u’ occurs in training dictionary long forms:
Env     Freq
C _ C   1179
C _ #   301
# _ C   29
• Some output forms depend on output context. • Rule: ‘u’ occurs only between two non-vowels; ‘w’ occurs elsewhere. • Accurate for 99.7% of the instances of ‘u’ and ‘w’ in the training dictionary long forms. • Similar rule may be formulated for ‘i’ and ‘y’.
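
The stated accuracy of the ‘u’/‘w’ rule can be checked directly against the environment counts on the slide. The Python sketch below assumes that the word boundary `#` counts as a non-vowel, which is what makes the counts agree with the reported figure; the names are illustrative.

```python
# Environment counts from the slide: (left context, right context) -> frequency.
# C = consonant, V = vowel, # = word boundary.
W_ENVS = {("C", "V"): 149, ("V", "#"): 8, ("#", "V"): 81,
          ("C", "#"): 5, ("V", "V"): 121, ("V", "C"): 118}
U_ENVS = {("C", "C"): 1179, ("C", "#"): 301, ("#", "C"): 29}

def predict(left, right):
    """Rule from the slide: 'u' only between two non-vowels, 'w' elsewhere."""
    return "u" if left != "V" and right != "V" else "w"

correct = total = 0
for gold, envs in (("w", W_ENVS), ("u", U_ENVS)):
    for (left, right), freq in envs.items():
        total += freq
        if predict(left, right) == gold:
            correct += freq

accuracy = correct / total
print(f"{correct}/{total} = {100 * accuracy:.1f}%")
```

Under this reading the only instances the rule gets wrong are the five cases of ‘w’ in the environment C _ #.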

Experiment 3 (local model) • Move to more data-driven model • Found some rules manually. • Look for all of them, systematically. • Use best-scoring candidate for replacement • Environment likelihood score • Character alignment score • Example:
Known long:  H a n s A h a
Known short: H A n s A h A
input:       H A m D y h A
result:      H a m D I h a

Experiment 4 (n-best) • Instead of generating romanized form using the single best short form in the dictionary, generate romanized forms using top n best short forms. Example (n = 5)

Character error rate (CER) • Measurement of insertions, deletions, and substitutions in character strings should more closely track phoneme error rate. • More sensitive than WER • Stronger statistics from same data • Test set results • Baseline 49.89 character error rate (CER) • Best model 24.58 CER • Oracle 2-best list 17.60 CER suggests more room for gain.
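
Character error rate is the character-level edit distance between hypothesis and reference, summed over the test set and divided by the number of reference characters. A minimal Python sketch under that standard definition; the talk does not spell out its exact normalization, so treating it this way is an assumption, and the example strings are taken from earlier slides only for illustration.

```python
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]

def cer(references, hypotheses):
    """Total character errors divided by total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    chars = sum(len(r) for r in references)
    return errors / chars

def word_accuracy(references, hypotheses):
    """Fraction of words romanized exactly right, for comparison with CER."""
    pairs = list(zip(references, hypotheses))
    return sum(r == h for r, h in pairs) / len(pairs)

refs = ["kataba", "tibqu"]
hyps = ["kitab", "tibqu"]
print(cer(refs, hyps), word_accuracy(refs, hyps))
```

A word that is wrong by a single vowel counts as fully wrong for word accuracy but contributes only one error to CER, which is why the slide calls CER the more sensitive measure.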

Summary of performance (dev set)
Model                                       Accuracy   CER
Baseline                                    8.4%       41.4%
Knitting                                    16.9%      29.5%
Knitting + best match + rules               18.4%      28.6%
Knitting + local model                      19.4%      27.0%
Knitting + local model + n-best (n = 25)    30.0%      23.1%

Varying the number of dictionary matches
