
ASR and scalability


Presentation Transcript


  1. ASR and scalability
Dominique Vaufreydaz, ESSLLI 2002

  2. ASR and scalability
• State-of-the-art speech recognition
  • general overview
  • acoustic modelling
  • language modelling
• Web-trained language models
  • scalability of Web data?
  • Nespole! example
  • results
Dominique Vaufreydaz, ESSLLI 2002

  3. State-of-the-art speech recognition - general overview
Automatic speech recognition
[Block diagram: phonetically labelled signals and a text corpus are used during training to produce acoustic models and language model(s); at recognition time, acoustic parameters extracted from the speech feed the decoding step.]
• Acoustic parameters used:
  • Mel-scaled Frequency Cepstral Coefficients (MFCC)
  • Energy
  • Zero crossing
  • Linear Predictive Coding (LPC)
  • Perceptual Linear Predictive (PLP) and Rasta-PLP
  • etc.
  • Δ and ΔΔ (first and second derivatives) of these parameters
Dominique Vaufreydaz, ESSLLI 2002
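
A minimal sketch of such a front-end, assuming the modern librosa library (not the toolkit actually used in the talk); the audio file name is hypothetical:

```python
import librosa
import numpy as np

# Load a 16 kHz utterance (hypothetical file).
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # 13 cepstral coefficients per frame
energy = librosa.feature.rms(y=y)                     # per-frame energy
feats = np.vstack([mfcc, energy])

delta = librosa.feature.delta(feats)                  # first derivatives (Δ)
delta2 = librosa.feature.delta(feats, order=2)        # second derivatives (ΔΔ)

observation = np.vstack([feats, delta, delta2])       # (dimensions, frames) observation vectors
```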

  4. State-of-the-art speech recognition - acoustic modelling
Hidden Markov Models
• Two different stochastic processes
  • X: a first-order hidden Markov chain, for temporal variability
  • Y: an observable process, for spectral variability
• An HMM can be described with λ = (A, B, π):
  • Matrix A: transition probabilities from one state to another, a_ij = p(X_t = j | X_{t-1} = i)
  • Matrix B: distribution probabilities of the observations, b_ij(y) = p(Y_t = y | X_{t-1} = i, X_t = j). In continuous speech recognition, these probabilities are multi-Gaussian mixtures defined with:
    • the mean vectors
    • the covariance matrices
    • the weight of each Gaussian
  • Vector π: probabilities of starting in each state, π_i = p(X_0 = i)
Dominique Vaufreydaz, ESSLLI 2002
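
A minimal sketch of the mixture-of-Gaussians observation density described above, assuming diagonal covariances for simplicity (an illustration, not the talk's implementation):

```python
import numpy as np

def gmm_density(y, weights, means, variances):
    """p(y | state) = sum_k w_k * N(y; mu_k, diag(var_k)), diagonal covariances."""
    y = np.asarray(y, dtype=float)
    density = 0.0
    for w, mu, var in zip(weights, means, variances):
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))        # Gaussian normalisation term
        density += w * norm * np.exp(-0.5 * np.sum((y - mu) ** 2 / var))
    return density
```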

  5. State-of-the-art speech recognition - acoustic modelling
Acoustic units
[Diagram: 3-state left-to-right HMM with self-loops a11, a22, a33, forward transitions a12, a23 and a skip transition a13 between states S1, S2 and S3.]
• Different kinds of systems
  • context-independent systems: phonemes (or other units)
  • context-dependent systems: allophones, i.e. units in context. More robust, but they use more memory and CPU.
• The availability of enough training data determines the choice between context-dependent/independent models and the number of different allophones.
• HMM topology for each unit
  • usually a Bakis model (left/right first-order model) with a_ij = 0 if j < i
Dominique Vaufreydaz, ESSLLI 2002
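
A minimal sketch of the 3-state Bakis topology pictured above (transition values are illustrative only):

```python
import numpy as np

n_states = 3
A = np.zeros((n_states, n_states))      # a_ij = 0 if j < i: no backward transitions
A[0] = [0.6, 0.3, 0.1]                  # S1 -> S1 (self-loop), S2, S3 (skip)
A[1] = [0.0, 0.7, 0.3]                  # S2 -> S2, S3
A[2] = [0.0, 0.0, 1.0]                  # S3 -> S3 (final state)

assert np.allclose(A.sum(axis=1), 1.0)  # each row is a probability distribution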

  6. State-of-the-art speech recognition - acoustic modelling
Training acoustic models
• Estimation and iterative re-estimation of the model parameters
  • needs an acoustic corpus:
    • matching the future recognition conditions (speech quality, noise environment, etc.)
    • annotated in acoustic units, i.e. a sequence of acoustic observations O
  • uses the Baum-Welch or Expectation-Maximisation (EM) algorithms
  • finds λ = (A, B, π) that maximises P(O | λ)
Dominique Vaufreydaz, ESSLLI 2002
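
A minimal sketch of the forward algorithm used inside Baum-Welch to evaluate P(O | λ), assuming the per-frame state likelihoods b are precomputed (for example with the GMM density sketched earlier); scaling and log-space handling are omitted for brevity:

```python
import numpy as np

def forward_likelihood(A, pi, b):
    """A: (N, N) transition matrix, pi: (N,) initial probabilities,
    b: (N, T) state observation likelihoods b[i, t] = p(o_t | state i)."""
    N, T = b.shape
    alpha = pi * b[:, 0]                 # alpha_0(i) = pi_i * b_i(o_0)
    for t in range(1, T):
        alpha = (alpha @ A) * b[:, t]    # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(o_t)
    return alpha.sum()                   # P(O | lambda)
```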

  7. State-of-the-art speech recognition - acoustic modelling
Acoustic model adaptation
• Having enough training data for the new acoustic conditions
  • train a new model with these data
  • train a multicondition model with all your data
• Having a numerical way to simulate the new conditions (from clean speech to G.723 speech, for example)
  • transcode your data and train a new or multicondition model
• Having only a few adaptation data
  • use adaptation algorithms such as:
    • Maximum Likelihood Linear Regression (MLLR)
    • Maximum A Posteriori (MAP)
    • Bayesian Predictive Adaptation (BPA)
    • etc.
Dominique Vaufreydaz, ESSLLI 2002
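
As an illustration of the "few adaptation data" case, a minimal sketch of the standard MAP mean update for one Gaussian (shown only to give the flavour of these algorithms; the exact formulations used in the talk are not detailed):

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    """frames: (T, D) adaptation observations, gammas: (T,) occupation probabilities,
    tau: prior weight. With little data the result stays close to the prior mean;
    with lots of data it moves towards the adaptation-data average."""
    gamma_sum = gammas.sum()
    weighted_obs = (gammas[:, None] * frames).sum(axis=0)
    return (tau * prior_mean + weighted_obs) / (tau + gamma_sum)
```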

  8. State-of-the-art speech recognition - language modelling
Statistical language models
• Statistical language models
  • more robust than grammars for large-vocabulary and dialog systems
  • not only a yes/no answer
• n-gram models: consider the n-1 previous words as context
  • mostly n is 3: P(w_i | w_{i-2}, w_{i-1})
• need text corpora to compute these probabilities
Dominique Vaufreydaz, ESSLLI 2002
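
A minimal sketch of maximum-likelihood trigram estimation from a text corpus (real systems add smoothing and back-off, omitted here):

```python
from collections import Counter

def trigram_probs(sentences):
    """sentences: iterable of word lists. Returns P(w3 | w1, w2) as relative counts."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            tri[(w1, w2, w3)] += 1
            ctx[(w1, w2)] += 1
    return {ngram: count / ctx[ngram[:2]] for ngram, count in tri.items()}
```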

  9. State-of-the-art speech recognition - language modelling
Compute a language model
[Diagram showing word-frequency lists and the classical pipeline:
 1 – « Wizard of Oz » experiments produce transcriptions;
 2 – train a language model from them with LM tools, or adapt an existing language model with adaptation tools.]
• A third way: using all the available data on the Web???
Dominique Vaufreydaz, ESSLLI 2002
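
The "adaptation tools" are not detailed in the slide; as an illustration only, a minimal sketch of one common adaptation scheme, linear interpolation of a small in-domain model with a large background model (the weight lam is hypothetical and would be tuned on held-out data):

```python
def interpolated_prob(ngram, in_domain_lm, background_lm, lam=0.7):
    """Both LMs are dicts mapping n-gram tuples to probabilities; unseen n-grams get 0."""
    p_in = in_domain_lm.get(ngram, 0.0)
    p_bg = background_lm.get(ngram, 0.0)
    return lam * p_in + (1.0 - lam) * p_bg
```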

  10. ASR and scalability
• State-of-the-art speech recognition
  • general overview
  • acoustic modelling
  • language modelling
• Web-trained language models
  • scalability of Web data?
  • Nespole! example
  • results
Dominique Vaufreydaz, ESSLLI 2002

  11. Web-trained language models - scalability of Web data?
Scalability using the Web?
• Huge amount of data on many topics
  • ~200,000 different French lexical forms
• Different kinds of text
  • well-written text, in professional pages for example
  • pseudo-dialog forms in personal Web pages: « Euh... bonjour, euh... c'est l'Institut Macareux... euh... c'est pour un sondage (anonyme, quoi... hein) ! »
• The size of the training set grows steadily with the vocabulary size
Dominique Vaufreydaz, ESSLLI 2002

  12. Web-trained language models - Nespole! example
Specific vocabulary definition
• Recording real dialogs in real conditions (see « Data Collection in Nespole! »)
  • 5 different scenarios recorded through NetMeeting
  • 191 dialogs in 4 languages, including 31 French ones manually transcribed
  • the extracted French vocabulary contains 2,056 words
• Add the CStar-II vocabulary
  • a specific tourist vocabulary previously defined for the CStar-II project
  • the vocabulary grows to 2,500 words
Dominique Vaufreydaz, ESSLLI 2002

  13. Web-trained language models - Nespole! example
Increase vocabulary coverage - lexical OOV
[Diagram: word-frequency lists computed on BDLex, ABU and WebFr4 are combined with the specific vocabulary to build a 20K vocabulary.]
• 1 – compute word counts
• 2 – add the most frequent words to the specific vocabulary, up to a 20K vocabulary
Dominique Vaufreydaz, ESSLLI 2002
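
A minimal sketch of this coverage step: count word frequencies in the Web corpus and extend the specific vocabulary with the most frequent words until a ~20K-word list is reached (file names are hypothetical):

```python
from collections import Counter

counts = Counter()
with open("webfr4_tokens.txt", encoding="utf-8") as f:   # one tokenised line per text block
    for line in f:
        counts.update(line.split())

vocabulary = set(open("specific_vocab.txt", encoding="utf-8").read().split())
for word, _ in counts.most_common():                      # most frequent Web words first
    if len(vocabulary) >= 20000:
        break
    vocabulary.add(word)
```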

  14. Web-trained language models - Nespole! example
Increase vocabulary coverage - short words
[Diagram: frequent multi-word units extracted from WebFr4 are added to the 20K vocabulary to produce the final vocabulary.]
• 3 – compute 5-grams on short words (5 letters and 3 phonemes maximum)
• 4 – add the most frequent multi-words
Dominique Vaufreydaz, ESSLLI 2002
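
A minimal sketch of this multi-word step: scan the corpus for runs of short words, count the 2- to 5-grams inside each run, and keep the most frequent ones as compound vocabulary entries. A simple length test stands in for the talk's 5-letter / 3-phoneme criterion, which would need a pronunciation lexicon:

```python
from collections import Counter

def short_word_ngrams(tokens, max_len=5, max_n=5):
    """Count 2..max_n-grams over maximal runs of words of at most max_len letters."""
    grams = Counter()
    run = []
    for tok in tokens + [None]:                  # None flushes the final run
        if tok is not None and len(tok) <= max_len:
            run.append(tok)
        else:
            for n in range(2, max_n + 1):
                for i in range(len(run) - n + 1):
                    grams["_".join(run[i:i + n])] += 1
            run = []
    return grams
```

The most frequent entries (e.g. grams.most_common(500)) would then be added to the 20K vocabulary as multi-word units.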

  15. Web-trained language models - Nespole! example
Trigram language model
• 5 – compute 3-gram language models
  • WebFr4: a 1,587,142,200-word corpus
  • minimal block length filter (length = 5)
  • adapted LM tools
  • final vocabulary: 20,540 words
  • final LM: 1,960,813 bigrams, 6,413,376 trigrams
[Diagram: example French text passages before and after the minimal block length filter.]
Dominique Vaufreydaz, ESSLLI 2002
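
The filter itself is not spelled out on the slide; one plausible reading, sketched below, is to keep only contiguous runs of at least 5 in-vocabulary words inside each cleaned page, so that menus, isolated labels and very short fragments do not pollute the n-gram counts:

```python
def filter_blocks(tokens, vocabulary, min_len=5):
    """Keep only runs of at least min_len consecutive in-vocabulary words."""
    kept, run = [], []
    for tok in tokens + [None]:                  # None flushes the final run
        if tok is not None and tok in vocabulary:
            run.append(tok)
        else:
            if len(run) >= min_len:
                kept.extend(run)
            run = []
    return kept
```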

  16. Web-trained language models - results
Results
• On the CStar-II task (~3,000 words)
• On the Nespole! task (20,524 words)
Dominique Vaufreydaz, ESSLLI 2002

  17. <!DOCTYPE HTML SYSTEM "HTML.dcl" []> <HTML VERSION = "2.0"> <HEAD> <TITLE> Laboratoire CLIPS</TITLE> </HEAD> <!--changement de couleur de fond -- DS, 19 mai 1997 --> <BODY BGCOLOR = "#FFFFFF"> <TABLE LANG = "en_us" COLSPEC = "C158C215C170" UNITS = "EM" ALIGN = "CENTER" CLEAR = "no"> <TR LANG = "en_us" VALIGN = ""> <TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP"> <IMG SRC = "clip-arts/logos/logo.gif" ALIGN = "TOP" HEIGHT = "120"> </TD> <TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP"> <TABLE LANG = "en_us" COLSPEC = "C207" UNITS = "EM" ALIGN = "CENTER" CLEAR = "no"> <TR LANG = "en_us" VALIGN = "TOP"> <TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP"> <H1> <B><I>CLIPS</I> </B> </H1> </TD> </TR> <TR LANG = "en_us" VALIGN = "TOP"> <TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP"> Communication Langagi&egrave;re et<BR>Interaction Personne Syst&egrave;me </TD> </TR> <TR LANG = "en_us" VALIGN = "TOP"> <TD LANG = "en_us" COLSPAN = "1" ROWSPAN = "1" VALIGN = "TOP"> <I>F&eacute;d&eacute;ration IMAG</I> […] Dominique Vaufreydaz, ESSLLI 2002
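
Slide 17 shows raw HTML from the CLIPS home page as an example of what Web-collected data looks like before it can feed language-model training. As an illustration only (the actual toolchain used to build WebFr4 is not described here), a minimal sketch of extracting plain text from such pages with Python's standard html.parser:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, decoding entities like &egrave;."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

extractor = TextExtractor()
# Hypothetical file name; pages of that era were typically Latin-1 encoded.
extractor.feed(open("clips_home.html", encoding="latin-1").read())
plain_text = " ".join(extractor.chunks)
```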

  18. rue de la bibliothèque b </s> est un laboratoire de grenoble <s> le centre national de la <s> un laboratoire et un centre <s> vous pouvez également faire des de mots sur tout le <s> nous avons aussi un peu si vous ne trouvez pas ce que vous cherchez ici également la liste de nos organisée par le laboratoire clips est de plus en plus important </s> mais aussi à toute personne <s> tout savoir sur le programme <s> la sélection de la semaine </s> sur le site web de la sur le site de la <s> pour profiter de ce site il est <s> sinon vous pouvez visiter une de haut niveau dans les domaines <s> chaque année un programme est <s> pour accéder directement au programme et la chimie de la matière juillet à grenoble saint martin semaine de juillet à grenoble saint martin Dominique Vaufreydaz, ESSLLI 2002
