The 1980’s • Collection of large standard corpora • Front ends: auditory models, dynamics • Engineering: scaling to large vocabulary continuous speech • Second major (D)ARPA ASR project • HMMs become ready for prime time
Standard Corpora Collection • Before 1984, chaos • TIMIT • RM (later WSJ) • ATIS • NIST, ARPA, LDC
Front Ends in the 1980’s • Mel cepstrum (Bridle, Mermelstein) • PLP (Hermansky) • Delta cepstrum (Furui) • Auditory models (Seneff, Ghitza, others)
Spectral vs Temporal Processing • [Figure: two time-frequency panels] • Spectral processing: analysis (e.g., cepstral) along the frequency axis • Temporal processing: processing (e.g., mean removal) along the time axis
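The mean-removal example of temporal processing can be sketched in a few lines. This is a minimal illustration, not the method of any particular system: it assumes an utterance is given as a list of frames, each a list of cepstral coefficients, and the function name `cepstral_mean_removal` is made up for this sketch. Each coefficient is treated as a sequence over time and its average over the utterance is subtracted.

```python
def cepstral_mean_removal(frames):
    """Subtract the per-coefficient mean over time from every frame.

    frames: list of frames, each a list of cepstral coefficients
            (assumed layout; all frames must have the same length).
    """
    num_frames = len(frames)
    dim = len(frames[0])
    # Average each coefficient along the time axis.
    means = [sum(frame[k] for frame in frames) / num_frames for k in range(dim)]
    # Remove that average from every frame.
    return [[frame[k] - means[k] for k in range(dim)] for frame in frames]


if __name__ == "__main__":
    utterance = [[1.0, 10.0], [3.0, 14.0]]
    print(cepstral_mean_removal(utterance))
```

The operation works along time only: the coefficients within a single frame are never mixed, which is what distinguishes it from the spectral (within-frame) analysis.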
Dynamic Speech Features • temporal dynamics useful for ASR • local time derivatives of cepstra • “delta” features estimated over multiple frames (typically 5) • usually augments static features • can be viewed as a temporal filter
“Delta” impulse response • [Figure: filter taps over frames -2 to 2, values .2, .1, 0, -.1, -.2]
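The delta computation can be sketched as a short regression over neighboring frames. This is a minimal sketch under stated assumptions: it uses the common linear-regression slope formula over 5 frames (weights n/10 for n = -2..2), repeats the edge frames at the utterance boundaries, and the function name `delta_features` is hypothetical. The slides do not specify which estimator or edge handling was used.

```python
def delta_features(frames, half_width=2):
    """Estimate local time derivatives of each coefficient.

    frames: list of frames, each a list of cepstral coefficients.
    half_width: frames on each side; 2 gives the typical 5-frame window.
    """
    # Normalizer: sum of n^2 for n = -half_width..half_width (10 when half_width=2).
    norm = 2 * sum(n * n for n in range(1, half_width + 1))
    num_frames = len(frames)
    dim = len(frames[0])
    deltas = []
    for t in range(num_frames):
        row = []
        for k in range(dim):
            acc = 0.0
            for n in range(-half_width, half_width + 1):
                # Assumption: repeat the first/last frame past the edges.
                idx = min(max(t + n, 0), num_frames - 1)
                acc += n * frames[idx][k]
            row.append(acc / norm)
        deltas.append(row)
    return deltas


if __name__ == "__main__":
    # Feeding a unit impulse shows the filter view of the delta operation.
    impulse = [[0.0], [0.0], [1.0], [0.0], [0.0]]
    print([row[0] for row in delta_features(impulse)])
```

Passing a unit impulse through the function returns the taps 0.2, 0.1, 0, -0.1, -0.2, which is the sense in which delta features can be viewed as a temporal filter. In practice the deltas are appended to the static coefficients of each frame.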
HMMs for Continuous Speech • Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, ...) • Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, ...) • Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, ATT) • Engineering development - coping with data, fast computers
2nd (D)ARPA Project • Common task • Frequent evaluations • Convergence to good, but similar, systems • Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error • Competition inspired others not in project - Cambridge did HTK, now widely distributed
Knowledge vs. Ignorance • Using acoustic-phonetic knowledge in explicit rules • Ignorance represented statistically • Ignorance-based approaches (HMMs) “won”, but • Knowledge (e.g., segments) becoming statistical • Statistics incorporating knowledge
Some 1990’s Issues • Independence from long-term spectrum • Adaptation • Effects of spontaneous speech • Information retrieval/extraction with broadcast material • Query-style systems (e.g., ATIS) • Applying ASR technology to related areas (language ID, speaker verification)
Where Pierce Letter Applies • We still need science • Need language, intelligence • Acoustic robustness still poor • Perceptual research, models • Fundamentals of statistical pattern recognition for sequences • Robustness to accent, stress, rate of speech, ...
Progress in 25 Years • From digits to 60,000 words • From single speakers to many • From isolated words to continuous speech • From no products to many products, some systems actually saving LOTS of money
Real Uses • Telephone: phone company services (collect versus credit card) • Telephone: call centers for query information (e.g., stock quotes, parcel tracking) • Dictation products: continuous recognition, speaker dependent/adaptive
But: • Still <97% accurate on “yes” for telephone • Unexpected rate of speech causes doubling or tripling of error rate • Unexpected accent hurts badly • Accuracy on unrestricted speech at 60% • Don’t know when we know • Few advances in basic understanding
Confusion Matrix for Digit Recognition
Class     1    2    3    4    5    6    7    8    9    0   Error rate (%)
1       191    0    0    5    1    0    1    0    2    0    4.5
2         0  188    2    0    0    1    3    0    0    6    6.0
3         0    3  191    0    1    0    2    0    3    0    4.5
4         8    0    0  187    4    0    1    0    0    0    6.5
5         0    0    0    0  193    0    0    0    7    0    3.5
6         0    0    0    0    1  196    0    2    0    1    2.0
7         2    2    0    2    0    1  190    0    1    2    5.0
8         0    1    0    0    1    2    2  196    0    0    2.0
9         5    0    2    0    8    0    3    0  179    3   10.5
0         1    4    0    0    0    1    1    0    1  192    4.5
Overall error rate: 4.85%
Large Vocabulary CSR • [Figure: error rate (%) by year, ‘88 to ‘94] • Dashed line: RM (1K words, PP 60) • Solid line: WSJ0, WSJ1 (5K, 20-60K words, PP 100)