
Multimedia Communications (371) Speech and Image Communications (348)


Presentation Transcript


  1. Multimedia Communications (371) / Speech and Image Communications (348). John Mason, Engineering, Swansea University. EG-348_371_09

  2. Features in speech. Acquisition → Feature extraction → feature vectors X1 … Xi … over time (frame: 20/30 ms; sampling frequency: 8 kHz)

  3. Features in speech. Acquisition → Feature extraction → feature vectors X1 … Xi … (frame: 20/30 ms; sampling frequency: 8 kHz)

  4. Speech production: Air from the lungs → Vocal fold → Vocal tract → Speech

  5. Air from the lungs → Vocal fold → Vocal tract → Speech. Model: noise → H1(z) → H2(z) → synthesised speech (LPC, short and long term). The spectral envelope reflects morphological characteristics of the vocal tract.

  6. Features: building of statistical model [figure: feature points labelled T1 and T2]

  7. Vocal tract (VT) shape & some vowels - Ladefoged ’62

  8. Speech Processing - Applications
     • Why?
        • Communications
        • Synthesis
        • Recognition
           • Speech & Speaker
     • How?
        • Frame-based
        • Systems approach

  9. Some Books
     • Flanagan - ‘Speech Analysis, Synthesis and Perception’, Springer-Verlag - a classic!
     • Furui - several books on recognition
     • Parsons - ‘Voice and Speech Processing’, McGraw-Hill - one of the first textbooks on computer speech processing
     • O’Shaughnessy - ‘Speech Comms - human and machine’, Addison-Wesley
     • Rabiner & Juang - ‘Fundamentals of Speech Recognition’, Prentice Hall, 1993
     • Ramachandran & Mammone (eds) - ‘Modern Methods of Speech Processing’, Kluwer Academic, 1995

  10. Speech Communications
      • Person-to-Person
      • Person-to-Machine: speech/speaker recognition
      • Machine-to-Person: speech synthesis

  11. (Electronic) Speech Communications: perhaps separated by long distance (or in time)

  12. Telephony & Broadcasting: Acoustic Air Path → Transmission Path (Electronic Link) → Acoustic Air Path

  13. Speech Comms: Telephony. Transmission path (electronic link): Microphone → ADC → Analysis → Coding → Transmitter → Channel → Receiver → Decoding → (re-)Synthesis → DAC → Loudspeaker

  14. Speech Bit Rates. Message Creation → Language Coding → Human Acoustic generation → Transmission (Acoustic Space) → Human Hearing Extraction → Language decoding → Message Realisation. Approx. bit rate in bps (figure labels): tens, hundreds, thousands, tens of thousands

  15. Criteria in Speech Comms: Quality versus Bit-rate. [Figure: quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps), with CELP, GSM and ADPCM marked.] 4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening

  16. Low Bit Rate Speech Coding: Compandent, http://www.compandent.com/

  17. Speech Processing. The three main application areas are:
      • Speech Comms. (the ‘electronic link’)
      • Automatic Speech/Speaker recognition
      • Speech Synthesis
      Much of the underlying analysis is common, e.g. linear predictive coding

  18. What does speech look like?

  19. What does speech look like?
      • Dynamic Range - for flexibility and robustness
      • Time-varying - to convey information

  20. Frame-based Analysis
      • To capture time variations:
         • 20-30 ms frames - ‘centi-second’ labeling
      • spectral analysis
         • FFT
         • Filter-bank
         • Linear Predictive Coding
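The frame-based analysis above can be sketched as a simple framing routine. This is a minimal illustration, not the course's own code: the function name `split_into_frames` and the 100 frames/second default (the upper figure quoted later in the deck) are assumptions, and any trailing partial frame is simply dropped.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20, frames_per_second=100):
    """Split a sample sequence into fixed-length, possibly overlapping frames."""
    frame_len = sample_rate_hz * frame_ms // 1000   # 160 samples for 20 ms at 8 kHz
    hop = sample_rate_hz // frames_per_second       # 80 samples for 100 frames/s
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is discarded.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# One second of (silent) audio at 8 kHz, used only to show the frame count.
one_second = [0.0] * 8000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))
```

With a 20 ms frame and a 10 ms hop, consecutive frames overlap by half; each frame would then be passed to the spectral analysis of choice (FFT, filter-bank or LPC).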

  21. Speech Analysis/Coding. [Diagram: excitation e_n (voiced or unvoiced) → H(z) → speech s_n]
      • Two general cases:
         • Waveform coders
         • Source (voice) coders (vo-coders)
      • Source coders, e.g. linear predictive coding (LPC):
         • Model the source, i.e. the vocal tract (VT)
         • Linear, time-varying model of VT, plus excitation

  22. Systems Approach. Speech model: Excitation (voiced, with f0, or unvoiced) → Vocal Tract → Speech, with time-varying parameters
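As a rough sketch of the excitation side of this model, the snippet below generates either a pulse train at f0 (voiced) or random noise (unvoiced). The function name, the unit pulse amplitude, the Gaussian noise and the default f0 of 100 Hz are all illustrative assumptions, not values taken from the slides.

```python
import random


def make_excitation(num_samples, voiced, f0_hz=100.0, sample_rate_hz=8000, seed=0):
    """Return a crude excitation signal: pulse train if voiced, noise if unvoiced."""
    if voiced:
        period = round(sample_rate_hz / f0_hz)       # pitch period in samples
        return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)                        # seeded for repeatability
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


voiced_frame = make_excitation(160, voiced=True)     # pulses at n = 0 and n = 80
unvoiced_frame = make_excitation(160, voiced=False)
```

Passing such an excitation through a time-varying vocal tract filter gives the synthetic speech of the systems approach.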

  23. LPC Analysis/Synthesis
      • Synthesis: e_n, E(z) → H(z) (impulse response h_n) → s_n, S(z)
         • Input: Excitation
         • Output: Speech
      • Analysis: s_n, S(z) → 1/H(z) → e_n, E(z)
         • Input: Speech
         • Output: Excitation

  24. ‘Perfect’ Analysis/Synthesis. s_n → 1/H(z) → e_n → H(z) → s_n. Input s_n and output s_n are identical (within arithmetic limits)

  25. Practical Analysis/Synthesis

  26. Practical Analysis/Synthesis. Sending: s_n → 1/H(z) → e_n → Transmission → Receiving: e_n → H(z) → s_n
      • Parameters for Transmission:
         • Input / Excitation e_n
         • Source model H(z)
      • Thus Analysis must derive these parameters, and
      • Synthesis must use them to re-generate speech

  27. Linear Predictive Coding - LPC. Principle of linear prediction:
      • The next value (or sample) in a series, i.e. at time n, is predicted or estimated by a weighted sum of previous values, i.e. those at times n-1, n-2, ...
      • Thus for a predictor of order p, we have:
        $\hat{s}_n = a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3} + \dots + a_p s_{n-p}$
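A short sketch of the weighted-sum prediction may help here. The function name `predict_next` and the test signal are assumptions chosen for illustration; the example uses the fact that a pure sinusoid obeys s[n] = 2cos(w)·s[n-1] - s[n-2], so an order-2 predictor with those weights predicts it exactly.

```python
import math


def predict_next(history, coeffs):
    """Predict s[n] as a1*s[n-1] + a2*s[n-2] + ... + ap*s[n-p]."""
    p = len(coeffs)
    recent = history[-p:][::-1]          # [s[n-1], s[n-2], ..., s[n-p]]
    return sum(a * s for a, s in zip(coeffs, recent))


# A 200 Hz sinusoid sampled at 8 kHz (illustrative values).
w = 2.0 * math.pi * 200.0 / 8000.0
signal = [math.sin(w * n) for n in range(10)]
coeffs = [2.0 * math.cos(w), -1.0]       # a1, a2 for an ideal sinusoid

predicted = predict_next(signal[:9], coeffs)
actual = signal[9]
print(abs(predicted - actual))           # prediction error, essentially zero
```

Real speech is not a single sinusoid, so the weights have to be estimated per frame and a non-zero prediction error remains.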

  28. Linear Prediction. Transforming to the z-domain gives:
      $\hat{S}(z) = \left(a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}\right) S(z) = A'(z)\,S(z)$

  29. LPC Error Terms. Error is simply the difference between actual and predicted values:
      $e_n = s_n - \hat{s}_n$
      where $\hat{s}_n$ is the output of the predictor A’(z) driven by $s_n$

  30. Synthesis: e_n → H(z) → s_n. The predictor A’(z) output is added back to e_n to give s_n, i.e. H(z) = 1/(1 – A’(z)). Parameters updated at frame rate. NB ‘hat’ of approximation omitted for simplicity

  31. Analysis for Synthesis. Synthesis: e_n → H(z) → s_n. Analysis: s_n → 1/H(z) → e_n, with e_n = s_n minus the output of A’(z)
      • The Analysis and Synthesis must match
      • What is needed for the Synthesis?
      • Answer: e_n - the excitation, and H(z) - the system
      • Thus the Analysis must derive these terms (from s_n):
      • The speech signal s_n is analysed to give e_n and H(z), i.e. A’(z), parameters for transmission.

  32. Derivation of LPC Coefficients - A(z). Recall:
      $\hat{s}_n = \sum_{i=1}^{p} a_i s_{n-i}$
      where $a_i$ are the p prediction coefficients. The principle behind LPC is to find a set of p coefficients, a1, a2, a3, ... ap, which in some sense minimizes the error signal e_n over a frame of speech of N samples. This leads to a set of p coefficients for each frame.

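  33. Derivation of A(z) – (2). Minimisation of $E_n$, the squared error summed over the frame, is achieved by setting the p partial derivatives to zero:
      $\partial E_n / \partial a_i = 0$ for i = 1, 2, … p
      From which:
      $\sum_{k=1}^{p} a_k \, r_{|i-k|} = r_i$
      where (sum taken over the frame):
      $r_j = \sum_{n} s_n s_{n-j}$
      In matrix form:
      $[R][a] = [r]$ or $[a] = [R]^{-1}[r]$
      The matrix [R] is symmetric Toeplitz, offering numerically efficient inversion techniques - Durbin’s recursion algorithm being one of the most popular.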
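Durbin's recursion can be sketched as follows. This is the textbook Levinson-Durbin form for the predictor convention s_hat[n] = sum of a_i·s[n-i], taking autocorrelation values r[0..p] as input; the function name and return shape are assumptions, and a silent frame (r[0] = 0) would need guarding in practice.

```python
def levinson_durbin(r, order):
    """Solve the symmetric Toeplitz system R a = r for a_1..a_p.

    r     : autocorrelation values r[0], r[1], ..., r[order]
    return: (coeffs, error) with coeffs[i-1] = a_i and the final
            prediction error energy.
    """
    a = [0.0] * (order + 1)          # a[0] is unused padding
    error = r[0]                     # zero-order prediction error
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= (1.0 - k * k)
    return a[1:], error


# Autocorrelation of an ideal first-order process with a1 = 0.5 (r[k] = 0.5**k).
coeffs, err = levinson_durbin([1.0, 0.5, 0.25], order=2)
print(coeffs, err)
```

The recursion costs on the order of p² operations per frame, compared with p³ for a general matrix inversion, which is why it suits frame-rate updating.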

  34. Derivation of A(z) – (3)
      • When N is very large, r is the autocorrelation coefficients of s
      • s comes from e convolved with h (excitation & vocal tract)
      • we are interested here in separating e and h
      • the predictor order, p, is small to reflect the short-term periodicities (formants)
      • with higher predictor orders we will get the longer-term periodicities (pitch)
      • 2 practical problems with evaluating a:
         • matrix singularities in R^{-1}
         • unstable resultant H(z)
      • in practice both are solved by windowing - shaping the frame, e.g. with a Hamming window
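The windowing step can be illustrated with a small sketch that applies a Hamming window to a frame and then computes its autocorrelation values. The function names and the 300 Hz test tone are assumptions; the window uses the usual 0.54/0.46 Hamming coefficients.

```python
import math


def hamming(length):
    """Hamming window: 0.54 - 0.46*cos(2*pi*n/(length-1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def windowed_autocorrelation(frame, max_lag):
    """Window the frame, then return r[0..max_lag] of the windowed samples."""
    window = hamming(len(frame))
    x = [s * w for s, w in zip(frame, window)]
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


# 20 ms of a 300 Hz tone at 8 kHz, standing in for a speech frame.
frame = [math.sin(2.0 * math.pi * 300.0 * n / 8000.0) for n in range(160)]
r = windowed_autocorrelation(frame, max_lag=10)
```

The values r[0..p] are what a Toeplitz solver such as Durbin's recursion takes as input; tapering the frame edges is the shaping the slide refers to.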

  35. Speech Signal Characteristics
      • Duration
      • Dynamic Range
      • Periodicities:
         • vocal tract
         • pitch
      • Frame-based Analysis
         • frame size: quasi-stationary, yet able to capture transitions; typically 20 - 30 ms
         • frame rate: task dependent; more means more bandwidth/computation - up to 100 frames/second

  36. Harmonic Structures and Periodicities
      • Harmonic Structures & Periodicities give potential for data reduction
      • LPC is one way of gaining this compression
      • Speech has two obvious separate structures:
         • vocal tract resonances
         • pitch

  37. Harmonic Structures and Periodicities. Short-term prediction: excitation e_n (voiced or unvoiced) → H(z) (vocal tract, short term) → speech s_n. [Figure labels: Tp, p]

  38. Harmonic Structures and Periodicities. Long-term prediction: excitation ep_n (voiced or unvoiced) → Hlt(z) (pitch) → e_n → Hst(z) (vocal tract) → speech s_n. [Figure labels: Tp, P]

  39. Harmonic Structures and Periodicities. Two structures: short-term (formants) & long-term - pitch (excitation). ep_n → gain k → Hlt(z) (a_i, e.g. p = 3) → e_n → Hst(z) (a_i, e.g. p = 10) → s_n. E.g. a 20 ms frame is 160 samples @ 8 kHz. NB Representations of these parameters are transmitted
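One common way to find the long-term (pitch) structure in the excitation is an open-loop search for the lag that maximises the normalised correlation between e[n] and e[n-lag]. The sketch below is that generic approach with a single-tap gain, not necessarily the method used in the course; the function name and the lag range of 20 to 147 samples are assumptions.

```python
def estimate_pitch_lag(residual, min_lag=20, max_lag=147):
    """Return (lag, gain) for a one-tap long-term predictor e[n] ~ gain*e[n-lag]."""
    best_lag, best_score, best_gain = min_lag, float("-inf"), 0.0
    for lag in range(min_lag, max_lag + 1):
        cross = sum(residual[n] * residual[n - lag]
                    for n in range(lag, len(residual)))
        energy = sum(residual[n - lag] ** 2
                     for n in range(lag, len(residual)))
        if energy == 0.0:
            continue                          # nothing to correlate against
        # Only positive correlation counts as a pitch candidate.
        score = cross * cross / energy if cross > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score, best_gain = lag, score, cross / energy
    return best_lag, best_gain


# Idealised voiced excitation: unit pulses every 50 samples (160 Hz at 8 kHz).
pulses = [1.0 if n % 50 == 0 else 0.0 for n in range(320)]
lag, gain = estimate_pitch_lag(pulses)
print(lag, gain)
```

The search window here spans two 20 ms frames so that the longest lags still have enough samples to correlate.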

  40. Practical Coding Systems. ep_n → Hlt(z) → e_n → Hst(z) → s_n
      • Waveform & Source Coders (Vocoders)
      • 2 periodicities/redundancies in source:
         • short-term (formants)
         • long-term - pitch
      • Excitation e_n

  41. ‘Perfect’ Analysis/Synthesis (1). s_n → 1/H(z) → e_n → H(z) → s_n. Input s_n and output s_n are identical (within arithmetic limits)

  42. ‘Perfect’ Analysis/Synthesis (2). Analysis: s_n → 1/H(z) = 1 – A’(z) → e_n. Synthesis: e_n → H(z) = 1/(1 – A’(z)) → s_n

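  43. ‘Perfect’ Analysis/Synthesis (3). Analysis filter 1 – A’(z): the original speech s_n passes through a chain of Z^-1 delays giving s_{n-1}, ..., s_{n-i}, ..., s_{n-p}; these are weighted by a1, ..., ai, ..., ap and summed, and the sum is subtracted from s_n to give the residual e_n. Note the minus sign: in Matlab it is combined with ai. What determines p?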

  44. ‘Perfect’ Analysis/Synthesis (4). The analysis 1 – A’(z) (original speech → residual) is followed by re-synthesis 1/(1 – A’(z)) (residual → re-synthesised speech): the same Z^-1 delay chain and weights a1, ..., ai, ..., ap, but the weighted sum is added to e_n. Note: no minus
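The ‘perfect’ analysis/synthesis round trip can be checked numerically with a direct-form sketch of the two filters. The function names, the sample values and the coefficients are illustrative assumptions; samples before the start of the signal are taken as zero.

```python
def lpc_analysis(speech, coeffs):
    """Inverse filter 1 - A'(z): e[n] = s[n] - sum_i a_i * s[n-i]."""
    residual = []
    for n in range(len(speech)):
        prediction = sum(a * speech[n - i]
                         for i, a in enumerate(coeffs, start=1) if n - i >= 0)
        residual.append(speech[n] - prediction)     # note the minus sign
    return residual


def lpc_synthesis(residual, coeffs):
    """All-pole filter 1/(1 - A'(z)): s[n] = e[n] + sum_i a_i * s[n-i]."""
    speech = []
    for n in range(len(residual)):
        prediction = sum(a * speech[n - i]
                         for i, a in enumerate(coeffs, start=1) if n - i >= 0)
        speech.append(residual[n] + prediction)     # no minus here
    return speech


speech = [0.0, 1.0, 0.5, -0.3, 0.8, -1.2, 0.4, 0.1]
coeffs = [0.9, -0.5]                                # a1, a2 (stable example)
residual = lpc_analysis(speech, coeffs)
rebuilt = lpc_synthesis(residual, coeffs)
print(max(abs(x - y) for x, y in zip(speech, rebuilt)))
```

The reconstruction matches the input to within floating-point rounding, which is the "within arithmetic limits" caveat; once the residual is quantised for transmission the output is only similar to the input.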

  45. Practical System. s_n → 1/H(z) → e_n → Transmitted Data Frame → approximate e_n → H(z) → approximate s_n. Input s_n and output s_n are “similar”. What does the Transmitted Data Frame contain?

  46. Analysis-by-Synthesis: LPAS. Integrated encoder & decoder at the encoder. LPAS Encoder: [diagram: s_n is compared (+/-) with the output of the basic decoder; the weighted error drives the adaptive encoder]

  47. Log Spectral Estimates
      • Comparisons between frames are very important in many situations
      • Log spectral estimates are the most common (though in Comms. an approximation is used to reduce computation)
      In Comms, computation is expensive and parameter-vector approximations to D are used
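The expression for D is not reproduced in this transcript. As a hedged illustration, the sketch below assumes the common RMS log-spectral distance in dB between the power spectra of two frames; the function names, the direct DFT and the power floor are all assumptions made for a self-contained example.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum |X[k]|^2 for k = 0..N/2 via a direct DFT (fine for short frames)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def log_spectral_distance(frame_a, frame_b, floor=1e-12):
    """RMS difference, in dB, between the log power spectra of two frames."""
    pa, pb = power_spectrum(frame_a), power_spectrum(frame_b)
    diffs = [10.0 * math.log10(max(x, floor)) - 10.0 * math.log10(max(y, floor))
             for x, y in zip(pa, pb)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# A unit impulse has a flat spectrum; doubling its amplitude raises every bin by about 6 dB.
impulse = [1.0] + [0.0] * 15
louder = [2.0] + [0.0] * 15
print(log_spectral_distance(impulse, impulse), log_spectral_distance(impulse, louder))
```

Computing full spectra for every frame comparison is the expensive step that parameter-vector approximations are meant to avoid.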

  48. Some Standards
      Standard   Application          Coder      Bit rate (kb/s)
      GSM        European Cellular    RPE-LTP    13
      FS1016     Secure Voice         CELP       4.8
      IS54       NA Cellular          VSELP      7.95
      IS96       NA Cellular          QCELP      1-8
      JDC-FR     Japanese Cellular    VSELP      6.7
      JDC-HR     Japanese Cellular    PSI-CELP   3.67
      G.728      (terrestrial)        LD-CELP    16
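To put these rates in context, the arithmetic below compares each fixed-rate coder with 64 kb/s PCM, assuming 8 kHz sampling at 8 bits per sample as the reference (the top of the bit-rate axis on the quality slide). The reference choice and the names are assumptions; the variable-rate IS96 entry is left out.

```python
SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8          # assumed 8-bit PCM reference
pcm_rate_kbps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1000   # 64.0 kb/s

# Fixed bit rates in kb/s, as listed on the slide.
coder_rates_kbps = {
    "GSM RPE-LTP": 13.0,
    "FS1016 CELP": 4.8,
    "IS54 VSELP": 7.95,
    "JDC-FR VSELP": 6.7,
    "JDC-HR PSI-CELP": 3.67,
    "G.728 LD-CELP": 16.0,
}


def compression_ratio(rate_kbps):
    """How many times smaller the coded stream is than the PCM reference."""
    return pcm_rate_kbps / rate_kbps


for name, rate in coder_rates_kbps.items():
    print(f"{name}: {compression_ratio(rate):.1f}x")
```

The ratios show how much of the saving comes from transmitting model parameters and excitation representations instead of the waveform itself.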

  49. Low Bit Rate Speech Coding: Compandent, http://www.compandent.com/

  50. Criteria in Speech Comms: Quality versus Bit-rate. [Figure: quality (Poor, Fair, Good, Excellent) against bit rate (4, 8, 16, 32, 64 kbps), with CELP, GSM and ADPCM marked.] 4 Quality Measures: intelligibility, loudness, naturalness, ease-of-listening
