From last time …

astra
Presentation Transcript


  1. From last time …

  2. ASR System Architecture (block diagram): Speech Signal → Signal Processing → Cepstrum → Probability Estimator → Probabilities (“z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero”, “three”, “two”). The Decoder also draws on the Grammar and the Pronunciation Lexicon.

  3. A Few Points about Human Speech Recognition (See Chapter 18 for much more on this)

  4. Human Speech Recognition
     • Experiments dating from 1918 dealing with noise, reduced bandwidth (Fletcher)
     • Statistics of CVC perception
     • Comparisons between human and machine speech recognition
     • A few thoughts

  5. The Ear

  6. The Cochlea

  7. Assessing Recognition Accuracy
     • Intelligibility
     • Articulation - Fletcher experiments
     • CVC, VC, CV, syllables in carrier sentences
     • Tests over different SNRs and frequency bands
     • Example: “The first group is ‘mav’” (forced choice between “mav” and “nav”)
     • Used sharp lowpass and/or highpass filtering. For equal energy, the crossover is 450 Hz; for equal articulation, 1550 Hz.

  8. Results
     • S = v·c² (CVC syllable articulation S from vowel articulation v and consonant articulation c)
     • Articulation Index (the original “AI”)
     • Error independence between bands
     • Articulatory band ~ 1 mm along basilar membrane
     • 20 filters between 300 and 8000 Hz
     • A single zero-error band -> no error!
     • Robustness to a range of problems
     • AI = (1/K) ∑_k (SNR_k / 30), where K is the number of bands and each SNR_k (in dB) saturates at 0 and 30
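The AI formula on this slide can be sketched in a few lines of Python. This is only an illustration of the averaging and saturation steps: the function name `articulation_index` and its parameters are assumptions made here, not part of the slides, and the per-band SNR values are taken as already measured in dB.

```python
from typing import Sequence


def articulation_index(band_snrs_db: Sequence[float],
                       max_snr_db: float = 30.0) -> float:
    """AI = (1/K) * sum_k (SNR_k / 30), with each SNR_k clipped to [0, 30] dB.

    `band_snrs_db` holds one SNR value (in dB) per articulatory band.
    The function name and signature are hypothetical, chosen for illustration.
    """
    if not band_snrs_db:
        raise ValueError("need at least one band")
    total = 0.0
    for snr in band_snrs_db:
        # Saturate: below 0 dB a band contributes nothing,
        # above 30 dB it contributes no more than a 30 dB band.
        clipped = min(max(snr, 0.0), max_snr_db)
        total += clipped / max_snr_db
    return total / len(band_snrs_db)


if __name__ == "__main__":
    # Two bands: one at 15 dB (contributes 0.5), one at 45 dB (saturates at 1.0).
    print(articulation_index([15.0, 45.0]))
```

With 20 bands all at or above 30 dB the index is 1; with all bands at or below 0 dB it is 0.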

  9. AI additivity
     • s(a,b) = phone accuracy for the band from a to b, with a < b < c
     • (1 - s(a,c)) = (1 - s(a,b)) (1 - s(b,c))
     • log10(1 - s(a,c)) = log10(1 - s(a,b)) + log10(1 - s(b,c))
     • AI(s) = log10(1 - s) / log10(1 - s_max)
     • AI(s(a,c)) = AI(s(a,b)) + AI(s(b,c))
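A small numerical check of the additivity argument, as a sketch. The value `S_MAX = 0.985` and the band accuracies 0.6 and 0.5 are illustrative assumptions, not numbers from the slides; the function names are likewise hypothetical.

```python
import math

# Assumed maximum phone accuracy, for illustration only.
S_MAX = 0.985


def combined_accuracy(s_ab: float, s_bc: float) -> float:
    """Accuracy for band (a, c) when errors in (a, b) and (b, c) are independent.

    Implements (1 - s(a,c)) = (1 - s(a,b)) * (1 - s(b,c)).
    """
    return 1.0 - (1.0 - s_ab) * (1.0 - s_bc)


def ai_from_accuracy(s: float, s_max: float = S_MAX) -> float:
    """AI(s) = log10(1 - s) / log10(1 - s_max); requires s < 1 and s_max < 1."""
    return math.log10(1.0 - s) / math.log10(1.0 - s_max)


if __name__ == "__main__":
    s_ab, s_bc = 0.6, 0.5  # hypothetical per-band phone accuracies
    s_ac = combined_accuracy(s_ab, s_bc)
    lhs = ai_from_accuracy(s_ac)
    rhs = ai_from_accuracy(s_ab) + ai_from_accuracy(s_bc)
    # The two sides agree up to floating-point rounding.
    print(s_ac, lhs, rhs, math.isclose(lhs, rhs))
```

Because the error probabilities multiply, their logarithms add, and dividing by the constant log10(1 - s_max) keeps the sum intact. The same product rule gives the "single zero-error band" result: `combined_accuracy(1.0, x)` is 1 for any `x`.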

  10. Jont Allen interpretation: The Big Idea
     • Humans don’t use frame-like spectral templates
     • Instead, partial recognition in bands
     • Combined for phonetic (syllabic?) recognition
     • Important for 3 reasons:
        • Based on decades of listening experiments
        • Based on a theoretical structure that matched the results
        • Different from what ASR systems do

  11. Questions about AI
     • Based on phones - the right unit for fluent speech?
     • Lost correlation between distant bands?
     • Lippmann experiments, disjoint bands
     • Signal above 8 kHz helps a lot in combination with signal below 800 Hz

  12. Human SR vs ASR: Quantitative Comparisons
     • Lippmann compilation (see book): machine WER is typically about a factor of 10 higher than human WER
     • Hasn’t changed too much since his study
     • Keep in mind this caveat: “human” scores are ideal; under sustained real conditions people don’t pay perfect attention (especially after lunch)

  13. Human SR vs ASR: Quantitative Comparisons (2)
     Word error rates for the 5000-word Wall Street Journal read-speech task with additive automotive noise (old numbers; ASR would be a bit better now)
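For readers who want the metric spelled out, word error rate is commonly computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below assumes that standard definition and plain whitespace tokenization; the slides do not say how their scores were obtained, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Uses a standard Levenshtein alignment over whitespace-separated words.
    This is an assumed, common definition, not taken from the slides.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # aligning i reference words to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + substitution_cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("three" -> "tree") and one insertion ("too"): 2 errors / 3 words.
    print(word_error_rate("zero three two", "zero tree two too"))
```

Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference.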

  14. Human SR vs ASR: Qualitative Comparisons
     • Signal processing
     • Subword recognition
     • Temporal integration
     • Higher level information

  15. Human SR vs ASR: Signal Processing
     • Many maps vs one
     • Sampled across time-frequency vs sampled in time
     • Some hearing-based signal processing already in ASR

  16. Human SR vs ASR: Subword Recognition
     • Knowing what is important (from the maps)
     • Combining it optimally

  17. Human SR vs ASR: Temporal Integration
     • Using or ignoring duration (e.g., VOT)
     • Compensating for rapid speech
     • Incorporating multiple time scales

  18. Human SR vs ASR: Higher levels
     • Syntax
     • Semantics
     • Pragmatics
     • Getting the gist
     • Dialog to learn more

  19. Human SR vs ASR: Conclusions
     • When we pay attention, human SR is much better than ASR
     • Some aspects of human models are going into ASR
     • Probably much more to do, when we learn how to do it right
