
AMSP : Advanced Methods for Speech Processing

An Expression of Interest to set up a Network of Excellence in FP6. Prepared by members of COST-277 and colleagues. Submitted by Marcos FAUNDEZ-ZANUY.


Presentation Transcript


  1. AMSP : Advanced Methods for Speech Processing • An Expression of Interest to set up a Network of Excellence in FP6 • Prepared by members of COST-277 and colleagues • Submitted by Marcos FAUNDEZ-ZANUY • Presented here by Gérard CHOLLET, chollet@tsi.enst.fr, GET-ENST/CNRS-LTCI, http://www.tsi.enst.fr/~chollet

  2. Outline • Rationale for the proposal • Objectives • Approaches • Modeling • Recognition by synthesis • Robustness to environmental conditions • Evaluation paradigm • Excellence • Integration and structuring effect

  3. Rationale for the NoE-AMSP • The areas of Automatic Speech Processing (recognition, synthesis, coding, language identification, speaker verification) should be better integrated • Better models of Speech Production and Perception • Investigate Nonlinear Speech Processing • Understanding, Semantic interpretation

  4. Integrated platform for Automatic Speech Processing

  5. Levels of representations

  6. Features of Speech Models • Reflect auditory properties of human perception • Explain articulatory movements • Surpass the limitations of the source-filter model • Capture the dynamics of speech • Allow natural-sounding speech to be reconstructed • Discriminate segmental information • Be robust to noise and channel distortions • Be adaptable to new speakers and new environments

  7. Time-frequency distributions • Short-Time Fourier Transform • Nonlinear frequency scale (PLP, WLP), mel-cepstrum • Wavelets, FAMlets • Bilinear distributions (Wigner-Ville, Choi-Williams, ...) • Instantaneous frequency, Teager operator • Time-dependent representations (parametric and non-parametric) • Vector quantisation • Matrix quantisation, nonlinear prediction
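As an illustration of the Teager operator listed on slide 7, here is a minimal sketch of its usual discrete form, psi[n] = x[n]^2 - x[n-1]*x[n+1]. The slides do not give a formula, so this form, the function name `teager_energy` and the test sinusoid are assumptions made for illustration. For a pure sinusoid A*cos(omega*n) the operator returns the constant A^2 * sin^2(omega), which is why it is used to track amplitude and instantaneous frequency.

```python
import math


def teager_energy(x):
    """Discrete Teager energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    The first and last samples have no two-sided neighbourhood,
    so the output is two samples shorter than the input.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


# Hypothetical test signal: a sinusoid of amplitude 2 and
# digital frequency 0.3 rad/sample.
amplitude, omega = 2.0, 0.3
signal = [amplitude * math.cos(omega * n) for n in range(50)]

psi = teager_energy(signal)

# Closed-form value for a pure sinusoid: A^2 * sin^2(omega).
expected = (amplitude * math.sin(omega)) ** 2
max_error = max(abs(p - expected) for p in psi)
```

On real speech the output is no longer constant; its variation over time is what carries the amplitude and frequency modulation information.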

  8. Time-dependent Spectral Models • Temporal Decomposition (B. Atal, 1983) • Vector autoregressive models with detection of model breaks (A. DeLima, Y. Grenier) • Segmental parameterisation using a time-dependent polynomial expansion (Y. Grenier)

  9. Modeling of segmental units • Hidden Markov Models • Markov Fields • Bayesian Networks, Graphical Models OR • Production models • Synthesis (concatenative or rule-based) with voice transformation AND / OR • Nonlinear predictor

  10. Expected achievements in Speech Coding and Synthesis • Modeling the nonlinearities in Speech Production and Perception will lead to more accurate and/or more compact parametric representations. • Integrate segmental recognition and synthesis techniques in the coding loop to achieve bit rates as low as a few hundred bit/s with natural quality • Develop voice transformation techniques in order to: • Adapt segmental coders to new speakers, • Modify the characteristics of synthetic voices

  11. Expected achievements in Speech Synthesis • Self-excited nonlinear feedback oscillators will allow synthetic voices to match human voices more closely. • Current concatenative techniques should be supplemented (or replaced) by (nonlinear) model-based generative techniques to improve quality, naturalness, flexibility, training and adaptation. • Model-based voice mimicry controlled by textual, phonetic and/or parametric input should improve not only synthesis but also coding, recognition and speaker characterisation.

  12. Automatic Speech Recognition • Limitations of the HMM and hybrid HMM-ANN approaches • Keyword spotting (detection with SVM), noise robustness, adaptation • Large Vocabulary Speech Recognition (SIROCCO) http://perso.enst.fr/~sirocco/index-en.html • Markov Random Fields, Bayesian Networks and Graphical Models

  13. Markov Random Fields, Bayesian Networks and Graphical Models • Speech modelling with state-constrained Markov Random Fields over frequency bands (Guillaume Gravier and Marc Sigelle) http://perso.enst.fr/~ggravier/recherche.html#these • Comparative framework to study MRF, Bayesian Networks and Graphical Models http://www.cs.berkeley.edu/~murphyk/Bayes/bayes.html

  14. Recognition by Synthesis • If we could drive a synthesizer with meaningful units (phone sequences, words, ...) to produce a speech signal that mimics the one to be recognized, we would come close to transcription. • Analysis by Synthesis (which is in fact modeling) is a powerful tool in recognition and coding. • A trivial implementation is indexing a labelled speech memory
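The "trivial implementation" mentioned on slide 14, indexing a labelled speech memory, can be sketched as a nearest-neighbour lookup: each incoming feature frame is matched to the closest stored frame, and that frame's label is emitted. This is only a toy sketch under stated assumptions. The two-dimensional features, the labels and the function names (`nearest_label`, `transcribe`) are hypothetical and do not come from the slides, and a real system would use cepstral vectors and match whole segments rather than single frames.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_label(memory, frame):
    """Return the label of the stored frame closest to `frame`."""
    _, label = min(memory, key=lambda entry: squared_distance(entry[0], frame))
    return label


def transcribe(memory, frames):
    """Label every frame, then collapse runs of identical labels."""
    labels = []
    for frame in frames:
        label = nearest_label(memory, frame)
        if not labels or labels[-1] != label:
            labels.append(label)
    return labels


# Hypothetical labelled memory: (feature vector, unit label) pairs.
memory = [
    ((0.0, 0.0), "sil"),
    ((1.0, 0.2), "a"),
    ((0.1, 1.0), "s"),
]

# Hypothetical utterance to be recognized, one feature vector per frame.
utterance = [(0.05, 0.0), (0.9, 0.3), (1.1, 0.1), (0.2, 0.9), (0.0, 0.1)]

result = transcribe(memory, utterance)
print(result)  # ['sil', 'a', 's', 'sil']
```

The same lookup yields a coder if the index of the matched entry is transmitted instead of the label, which is one way to read the link the slides draw between recognition and very low bit rate coding.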

  15. ALISP : Automatic Language Independent Speech Processing • Automatic discovery of segmental units for speech coding, synthesis, recognition, language identification and speaker verification.

  16. The robustness issue • Mismatch between training and testing conditions • Higher-order statistics are less sensitive to environment and transmission noise than the autocorrelation • Cepstral Mean Subtraction (CMS), RASTA filtering • Independent Component Analysis • From Speaker Independent to Speaker Dependent recognition (Personalisation)
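To make the CMS item on slide 16 concrete, here is a minimal sketch of cepstral mean subtraction. It relies on the fact that a fixed linear channel multiplies the spectrum and therefore adds a constant vector to every cepstral frame, so removing the per-utterance mean cancels it. The function name, the toy two-coefficient frames and the channel offset are assumptions chosen for illustration, not values from the slides.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean of each cepstral coefficient."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Hypothetical cepstral frames (2 coefficients per frame).
clean = [[1.0, 0.5], [2.0, -0.5], [3.0, 1.5]]

# A fixed channel appears as a constant offset in the cepstral domain.
channel = [0.7, -0.2]
distorted = [[c + h for c, h in zip(frame, channel)] for frame in clean]

cms_clean = cepstral_mean_subtraction(clean)
cms_distorted = cepstral_mean_subtraction(distorted)

# After CMS the channel offset is gone: both versions coincide.
max_difference = max(
    abs(a - b)
    for fa, fb in zip(cms_clean, cms_distorted)
    for a, b in zip(fa, fb)
)
```

This handles only a channel that stays constant over the utterance. Additive noise and time-varying channels are not removed by mean subtraction alone.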

  17. Expected achievements in Automatic Speech Recognition • Dynamic nonlinear models should allow feature extraction and classification to be merged under a common paradigm • Such models should be more robust to noise, channel distortions and missing data (transmission errors and packet losses) • Indexing a speech memory may help in the verification of hypotheses (a technique shared with Very Low Bit Rate Coders) • Statistical language models should be supplemented with adapted semantic information (conceptual graphs)

  18. Voice technology in Majordome • Server side background tasks: continuous speech recognition applied to voice messages upon reception • Detection of sender’s name and subject • User interaction: • Speaker identification and verification • Speech recognition (receiving user commands through voice interaction) • Text-to-speech synthesis (reading text summaries, E-mails or faxes)

  19. Collaboration with COST-278 • COST-278: Vocal Dialogue is a continuation of COST-249 • High interest in Robust Speech Recognition, Word spotting, Speech to actions, Speaker adaptation,... • Some members contribute to the Eureka-MAJORDOME project • Could be the seed for a Network of Excellence in FP6

  20. Evaluation paradigm • DARPA • NIST • http://www.nist.gov/speech/tests/spk/index.htm • Could we organize evaluation campaigns in Europe? • The EU's Sixth Framework Programme (FP6) is trying to promote Networks of Excellence. How should excellence be evaluated? Should financial support be correlated with evaluation results?
