Explore the nuances of human speech recognition versus ASR systems, including experiments, signal processing, probabilities, and accuracy assessments. Discover the evolution, challenges, and potential improvements in speech recognition technologies.
[Figure: ASR System Architecture. The speech signal passes through Signal Processing to produce cepstral features; a Probability Estimator converts each frame into phone probabilities (e.g., "z" = 0.81, "th" = 0.15, "t" = 0.03); the Decoder combines these with a Pronunciation Lexicon and a Grammar to output recognized words such as "zero", "three", "two". A toy code sketch of this pipeline follows.]
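To make the figure's dataflow concrete, here is a toy Python sketch of the three stages (front end, probability estimator, decoder). Everything in it (the phone set, the lexicon, the dummy feature and posterior computations) is illustrative, not any real toolkit's API; a real decoder searches over time alignments and the grammar rather than scoring averaged posteriors.

```python
import numpy as np

# Toy version of the pipeline in the figure. All names and numbers are
# illustrative; this is not any real ASR toolkit's API.

PHONES = ["z", "th", "t", "uw", "r", "iy", "ow"]        # toy phone set
LEXICON = {"zero":  ["z", "iy", "r", "ow"],             # pronunciation lexicon
           "two":   ["t", "uw"],
           "three": ["th", "r", "iy"]}

def signal_processing(waveform, frame_len=400, hop=160):
    """Front end: one (stand-in) spectral feature vector per 10 ms frame."""
    n = max(0, (len(waveform) - frame_len) // hop + 1)
    frames = np.stack([waveform[i*hop : i*hop + frame_len] for i in range(n)])
    return np.log(np.abs(np.fft.rfft(frames)) + 1e-8)   # stand-in for cepstra

def probability_estimator(features):
    """Acoustic model: per-frame posterior over PHONES (random stand-in)."""
    logits = np.random.default_rng(0).normal(size=(len(features), len(PHONES)))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)             # softmax per frame

def decode(posteriors):
    """Decoder: a real one searches over time alignments, the lexicon, and
    the grammar; this toy scores each word by its phones' mean posterior."""
    mean_post = posteriors.mean(axis=0)
    scores = {word: np.mean([mean_post[PHONES.index(p)] for p in pron])
              for word, pron in LEXICON.items()}
    return max(scores, key=scores.get)

waveform = np.random.default_rng(1).normal(size=16000)  # 1 s of "speech" at 16 kHz
print(decode(probability_estimator(signal_processing(waveform))))
```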
A Few Points about Human Speech Recognition (See Chapter 18 for much more on this)
Human Speech Recognition
• Experiments dating from 1918 on noise and reduced bandwidth (Fletcher)
• Statistics of CVC perception
• Comparisons between human and machine speech recognition
• A few thoughts
Assessing Recognition Accuracy
• Intelligibility
• Articulation: the Fletcher experiments
• CVC, VC, CV syllables in carrier sentences
• Tests over different SNRs and frequency bands
• Example: "The first group is `mav'" (forced choice between mav and nav)
• Used sharply lowpass- and/or highpass-filtered speech. For equal energy in the two bands the crossover is 450 Hz; for equal articulation, it is 1550 Hz (the band splitting is sketched below)
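A minimal sketch of this band splitting, assuming a 16 kHz sample rate and using white noise as a stand-in for a recorded CVC token (with white noise the 450 Hz crossover will not split the energy equally; that property holds for speech, whose energy is concentrated at low frequencies):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000                                    # sample rate in Hz (assumed)
rng = np.random.default_rng(0)
token = rng.normal(size=fs)                   # stand-in for a recorded CVC token

def split_at(signal, fc, fs, order=8):
    """Return (lowpass, highpass) versions of `signal`, split at fc Hz."""
    lo = sosfiltfilt(butter(order, fc, btype="low",  fs=fs, output="sos"), signal)
    hi = sosfiltfilt(butter(order, fc, btype="high", fs=fs, output="sos"), signal)
    return lo, hi

# 450 Hz: equal-energy crossover for speech; 1550 Hz: equal-articulation.
for fc in (450, 1550):
    lo, hi = split_at(token, fc, fs)
    e_lo, e_hi = np.sum(lo**2), np.sum(hi**2)
    print(f"fc = {fc:4d} Hz: low-band energy fraction = {e_lo / (e_lo + e_hi):.2f}")
```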
Results
• S = v·c² (the probability S of recognizing a CVC syllable correctly is the vowel accuracy v times the consonant accuracy c, squared)
• The Articulation Index (the original "AI")
• Errors are independent between bands
• An articulatory band spans roughly 1 mm along the basilar membrane
• 20 filters between 300 and 8000 Hz
• A single zero-error band -> no error overall (since band errors multiply)
• Robustness to a range of problems
• AI = (1/K) Σ_k (SNR_k / 30), where each band's SNR is clipped to lie between 0 and 30 dB (computed in the sketch below)
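The AI formula in the last bullet is straightforward to compute; a minimal sketch, with made-up per-band SNR values:

```python
import numpy as np

def articulation_index(snr_db):
    """AI = (1/K) * sum_k SNR_k / 30, with each band's SNR in dB first
    clipped to the range 0..30, as on the slide."""
    snr = np.clip(np.asarray(snr_db, dtype=float), 0.0, 30.0)
    return float(np.mean(snr / 30.0))

# 20 articulatory bands, all at 15 dB SNR: each contributes half its maximum.
print(articulation_index(np.full(20, 15.0)))        # 0.5
# Ten bands fully masked, ten bands clean: also AI = 0.5.
print(articulation_index([0.0] * 10 + [30.0] * 10))
```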
AI Additivity
• s(a,b) = phone accuracy for the band from a to b, with a < b < c
• Errors multiply across disjoint bands: 1 - s(a,c) = (1 - s(a,b)) (1 - s(b,c))
• Taking logs: log10(1 - s(a,c)) = log10(1 - s(a,b)) + log10(1 - s(b,c))
• Define AI(s) = log10(1 - s) / log10(1 - s_max)
• Then AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c)): the AI of a combined band is the sum of the AIs of its parts (checked numerically below)
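A quick numerical check of this additivity, assuming error independence between bands; the band accuracies and the full-band maximum s_max are made up:

```python
import numpy as np

s_max = 0.985                       # assumed phone accuracy with the full band

def AI(s):
    """AI(s) = log10(1 - s) / log10(1 - s_max), as defined above."""
    return np.log10(1.0 - s) / np.log10(1.0 - s_max)

s_ab, s_bc = 0.60, 0.80             # accuracies for two disjoint bands
s_ac = 1.0 - (1.0 - s_ab) * (1.0 - s_bc)    # errors multiply when combined

print(AI(s_ab) + AI(s_bc))          # sum of the two band AIs ...
print(AI(s_ac))                     # ... equals the AI of the combined band
```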
Jont Allen's Interpretation: The Big Idea
• Humans don't use frame-like spectral templates
• Instead, partial recognition takes place within bands
• Partial results are combined for phonetic (syllabic?) recognition
• Important for three reasons:
  • Based on decades of listening experiments
  • Based on a theoretical structure that matched the results
  • Different from what ASR systems do
Questions about AI
• Based on phones: are they the right unit for fluent speech?
• Are correlations between distant bands lost?
• Lippmann's experiments with disjoint bands
• Signal above 8 kHz helps a lot in combination with signal below 800 Hz
Human SR vs ASR: Quantitative Comparisons
• Lippmann's compilation (see book): typically about a factor of 10 difference in WER
• This hasn't changed much since his study
• Keep in mind a caveat: "human" scores are ideals; under sustained real conditions people don't pay perfect attention (especially after lunch)
Human SR vs ASR: Quantitative Comparisons (2)
[Table not reproduced: word error rates for a 5000-word Wall Street Journal read-speech task with additive automotive noise. These are old numbers; ASR would be somewhat better now.]
Human SR vs ASR: Qualitative Comparisons
• Signal processing
• Subword recognition
• Temporal integration
• Higher-level information
Human SR vs ASR: Signal Processing
• Many auditory maps vs. a single representation
• Sampled across time and frequency vs. sampled only in time
• Some hearing-based signal processing is already in ASR (e.g., mel-scale filterbanks; a sketch follows)
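As one concrete example of hearing-based processing in conventional ASR front ends, here is a compact sketch of mel-filterbank cepstra (MFCC-style features); the filter count, FFT size, and frame length below are typical choices, not prescriptions:

```python
import numpy as np
from scipy.fft import dct

# Mel-spaced triangular filterbank energies, then a DCT to get cepstra.

def hz_to_mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m):  return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=16000, fmin=300.0, fmax=8000.0):
    """Triangular filters spaced evenly on the mel (hearing-based) scale."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, fs=16000, n_ceps=13):
    """Cepstral coefficients for one windowed speech frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=512)) ** 2
    energies = mel_filterbank(fs=fs) @ spectrum
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_ceps]

print(mfcc(np.random.default_rng(0).normal(size=400)).shape)   # (13,)
```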
Human SR vs ASR: Subword Recognition
• Knowing what is important (from the maps)
• Combining it optimally (one possible combination rule is sketched below)
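A sketch of one simple combination rule: merging per-band phone posteriors log-linearly, which reduces to a pure independence (product) combination when all band weights are 1. All the numbers below are made up for illustration:

```python
import numpy as np

phones = ["z", "th", "t"]
band_posteriors = np.array([   # rows: frequency bands, cols: phones
    [0.70, 0.20, 0.10],        # low band is fairly sure it heard "z"
    [0.40, 0.35, 0.25],        # mid band is ambiguous
    [0.55, 0.15, 0.30],        # high band leans "z"
])
weights = np.array([1.0, 0.5, 1.0])     # e.g., down-weight a noisy band

log_comb = weights @ np.log(band_posteriors)   # weighted sum of log posteriors
combined = np.exp(log_comb - log_comb.max())
combined /= combined.sum()                     # renormalize to a distribution
print(dict(zip(phones, combined.round(3))))
```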
Human SR vs ASR: Temporal Integration
• Using or ignoring duration (e.g., voice onset time, VOT)
• Compensating for rapid speech
• Incorporating multiple time scales (a sketch follows)
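One conventional ASR device for temporal integration is regression-based delta features; computing them over two different window lengths gives a crude version of multiple time scales. A sketch (the window sizes are illustrative):

```python
import numpy as np

def deltas(features, N=2):
    """HTK-style regression deltas over +/- N frames:
    d_t = sum_n n * (x_{t+n} - x_{t-n}) / (2 * sum_n n^2)."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    num = sum(n * (padded[N + n : T + N + n] - padded[N - n : T + N - n])
              for n in range(1, N + 1))
    return num / (2.0 * sum(n * n for n in range(1, N + 1)))

feats = np.random.default_rng(0).normal(size=(100, 13))  # 100 frames of cepstra
short = deltas(feats, N=2)    # fast changes (~50 ms at a 10 ms frame hop)
long_ = deltas(feats, N=10)   # slower, more syllable-scale changes
print(short.shape, long_.shape)   # (100, 13) (100, 13)
```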
Human SR vs ASR: Higher Levels
• Syntax
• Semantics
• Pragmatics
• Getting the gist
• Dialog to learn more
Human SR vs ASR: Conclusions
• When we pay attention, human SR is much better than ASR
• Some aspects of human models are making their way into ASR
• Probably much more to do, once we learn how to do it right