
Automatic Speech Recognition Introduction


Presentation Transcript


  1. Automatic Speech Recognition: Introduction Jan Odijk, Utrecht, Dec 9, 2010

  2. Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications

  3. Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications

  4. ASR • Automatic Speech Recognition is the process by which a computer maps an acoustic signal containing speech to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.

  5. ASR-related • Automatic speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Automatic speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples

  6. Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications

  7. Why is ASR difficult? • All occurrences of a speech sound differ from each other • even when they are part of the same word type • and even when pronounced by the same person • (the ‘b’ in ‘boom’ is never pronounced twice in exactly the same way) • Each speaker has his or her own voice characteristics

  8. Why is ASR difficult? • Other problems caused by: • Language: Dutch vs. English vs. … • Accent/Dialect: Flemish vs. NL Dutch, etc. • Gender: Male vs. female • Age: child vs. adult vs. senior • Health: cold, flu, sore throat, etc.

  9. Why is ASR difficult? • Other problems caused by: • Environment: home, office, in-car, in a station, etc. • Channel: fixed telephone, mobile phone, multimedia channel, etc. • Microphone(s): telephone mike, close-talk mike, far mike, array microphone, etc.; different mike qualities

  10. Why is ASR difficult? • Confusables: • Zeven vs. negen (Dutch ‘seven’ vs. ‘nine’) • Ambiguity • [sã] = cent, (je) sens, sans (French) • Variation • Yes, yeah, yep, ok, okido, fine, etc.

  11. Why is ASR difficult? • Assimilation, deletions, etc. • Een (Dutch ‘a(n)’) => [n], [m], [ŋ] (auto, boek, kast) • Natuurlijk => tuurlijk • Coarticulation • Pronunciation of a sound depends on its environment (the sounds preceding/following it) • Koel vs. kiel: [k] vs. [k’] • Filled pauses, stuttering, repetitions

  12. Why is ASR difficult? • Other sounds • Background noise, music, other people talking, channel noise • Reverberation, echo • Speaker of language X pronouncing words from language Y • Esp. with names (persons, places, …)

  13. How are these problems reduced? • Separate ASR system • For each language • For each accent/dialect (Dutch / Flemish) • For each environment • For each channel and microphone(s) • Use a close-talk mike to reduce other sounds and the influence of the environment • For each speaker (speaker-adaptive/dependent ASR)

  14. How are these problems reduced? • Restricted Vocabulary • Only a limited number of words can be ‘recognized’ by any specific system • Ranging from a dozen to 64k different word forms • Dozen: application in which digits, yes/no and simple commands are sufficient (banking applications, number dialing)

  15. How are these problems reduced? • Restricted Vocabulary • In between: reverse directory application • employee name => phone number • 64k: ‘large vocabulary systems’ • dictation, • (topographic) name recognition

  16. How are these problems reduced? • Small Vocabularies • Is that enough? No, generally not! • Use dialogue to change restricted vocabulary in each dialogue state (dynamic active vocabularies) • Yes/no answer is expected => activate yes/no vocabulary • Digit expected => activate digit vocabulary • Name expected => activate name vocabulary
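A minimal Python sketch of the dynamic-active-vocabulary idea from slide 16: the recognizer only activates the word list that the current dialogue state expects. The state names and word lists below are illustrative assumptions, not part of the slides.

    # Illustrative only: state names and word lists are assumptions.
    ACTIVE_VOCABULARIES = {
        "yes_no": ["yes", "yeah", "yep", "no", "no thanks", "no thank you"],
        "digit": [str(d) for d in range(10)],
        "name": ["jansen", "jones", "smith"],
    }

    def active_vocabulary(dialogue_state):
        """Return the word list the recognizer should activate in this state."""
        return ACTIVE_VOCABULARIES[dialogue_state]

    print(active_vocabulary("yes_no"))   # the recognizer only listens for yes/no words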

  17. How are these problems reduced? • 64k Vocabulary (“Large Vocabulary”) • Is that enough? • No, generally not • Languages with compounds • Languages with a lot of inflection • Agglutinative languages • => require special measures

  18. Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications

  19. How does ASR work? • Not possible (yet?) to characterize the different sounds by (hand-crafted) rules • Instead: • A large set of recordings of each sound is made • Using statistical methods a model for each sound is derived (acoustic model) • Incoming sound is compared, using statistics, with acoustic model of a sound

  20. Elements of a Recognizer

  21. Elements of a Recognizer [block diagram: Speech => Feature Extraction => Pattern Matching (fed by the Acoustic Model, the Language Model and Language Data) => Post Processing => text => Display, and => Natural Language Understanding => Meaning => Action]

  22. Feature Extraction • Turning the speech signal into something more manageable • Sampling of the signal: transforming it into digital form • For each short piece of speech (10 ms) • Compression

  23. Feature Extraction • Extract relevant parameters from the signal • Spectral information, energy, frequency,... • Eliminate undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)

  24. Feature Extraction: Vectors • The signal is chopped into small pieces (frames) • Spectral analysis of a speech frame produces a vector representing the signal properties (e.g. 10.3 1.2 -0.9 … 0.2) • => result = a stream of vectors
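A minimal Python/NumPy sketch of slides 22-24: chop the signal into 10 ms frames and turn each frame into a spectral feature vector. The 16 kHz sample rate and the plain log-magnitude spectrum are simplifying assumptions; real front ends add windowing, mel filter banks, cepstral coefficients and normalization.

    import numpy as np

    def extract_features(signal, sample_rate=16000, frame_ms=10):
        """Chop a 1-D signal into frames and return one spectral vector per frame."""
        frame_len = int(sample_rate * frame_ms / 1000)            # samples per 10 ms frame
        n_frames = len(signal) // frame_len
        frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
        spectra = np.abs(np.fft.rfft(frames, axis=1))             # spectral analysis per frame
        return np.log(spectra + 1e-10)                            # stream of feature vectors

    # Example: 1 second of random "audio" -> 100 vectors of 81 coefficients
    features = extract_features(np.random.randn(16000))
    print(features.shape)   # (100, 81)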

  25. Elements of a Recognizer [block diagram repeated from slide 21]

  26. Acoustic Model (AM) • Split the utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability (see earlier) • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: models are created from a large number of examples
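A toy Python sketch of the statistical idea behind the acoustic model: one diagonal Gaussian per phoneme, estimated from example vectors, returning a log-likelihood per incoming frame. This is a deliberate simplification (real acoustic models use HMMs with Gaussian mixtures or neural networks), and all names below are assumptions.

    import numpy as np

    class GaussianPhoneModel:
        """One diagonal Gaussian per phoneme: a toy stand-in for a real acoustic model."""

        def __init__(self):
            self.means = {}
            self.vars = {}

        def train(self, phoneme, example_vectors):
            """Estimate mean and variance from example feature vectors."""
            x = np.asarray(example_vectors)
            self.means[phoneme] = x.mean(axis=0)
            self.vars[phoneme] = x.var(axis=0) + 1e-6

        def log_score(self, phoneme, vector):
            """How well (log-likelihood) does this frame match the phoneme?"""
            mu, var = self.means[phoneme], self.vars[phoneme]
            return float(-0.5 * np.sum((vector - mu) ** 2 / var + np.log(2 * np.pi * var)))

    am = GaussianPhoneModel()
    am.train("b", np.random.randn(50, 12) + 1.0)   # fake examples of 'b' frames
    am.train("u", np.random.randn(50, 12) - 1.0)   # fake examples of 'u' frames
    frame = np.random.randn(12) + 1.0
    print(max(["b", "u"], key=lambda p: am.log_score(p, frame)))   # most likely: 'b'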

  27. Acoustic Model (AM) • Representation of the speech signal • Waveform • Horizontal: time • Vertical: amplitude • Spectrogram • Horizontal: time • Vertical: frequency • Color: amplitude of the frequency

  28. [Alignment diagram: the phones of “friendly computers” stretched over acoustic states/frames S1-S13]

  29. Acoustic Model: Units • Phoneme: words share the units that model the same sound • Word: a series of units specific to the word • [diagram: ‘Stop’ = S T O P over units S1-S4, ‘Start’ = S T A R T over units S6-S10]

  30. Acoustic Model: Units • Context-dependent phoneme: ‘stop’ = S|,|T T|S|O O|T|P P|O|, • Diphone: ‘stop’ = ,S ST TO OP P, • Other sub-word units, e.g. consonant clusters: ‘stop’ = ST O P

  31. Acoustic Model: Units • Other possible units • Words • Multi words: example: “it is”, “going to” • Combinations of all of the above
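The unit inventories from slides 29-31 can be generated from a word's phoneme string. A small Python sketch using the slides' ‘stop’ example and ‘,’ as the word-boundary symbol; the helper names are made up for illustration.

    def monophones(phones):
        return list(phones)

    def context_dependent(phones, boundary=","):
        """Each phone annotated with its left and right neighbour, e.g. T|S|O."""
        padded = [boundary] + list(phones) + [boundary]
        return [f"{c}|{l}|{r}" for l, c, r in zip(padded, padded[1:], padded[2:])]

    def diphones(phones, boundary=","):
        """Transitions between adjacent phones, e.g. ,S ST TO OP P,"""
        padded = [boundary] + list(phones) + [boundary]
        return [a + b for a, b in zip(padded, padded[1:])]

    word = ["S", "T", "O", "P"]
    print(monophones(word))          # ['S', 'T', 'O', 'P']
    print(context_dependent(word))   # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
    print(diphones(word))            # [',S', 'ST', 'TO', 'OP', 'P,']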

  32. Elements of a Recognizer [block diagram repeated from slide 21]

  33. Pattern matching • Acoustic Model: returns a score for each incoming feature vector, indicating how well the feature corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming features • Search algorithm: looks for the best-scoring word or word sequence
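A minimal Python sketch of the pattern-matching step: dynamic programming that accumulates the local acoustic-model scores into a word score over the incoming frame stream. Real recognizers run a Viterbi beam search over HMM states for all active words at once; the function names here are assumptions.

    import numpy as np

    def word_score(frames, phoneme_sequence, log_score):
        """Best alignment score of a frame stream against one word's phoneme string.

        Dynamic programming over (frame, phoneme) cells: each new frame either stays
        in the current phoneme or advances to the next one; log_score(p, frame) is
        the local score from the acoustic model.
        """
        n_frames, n_phones = len(frames), len(phoneme_sequence)
        dp = np.full((n_frames, n_phones), -np.inf)
        dp[0, 0] = log_score(phoneme_sequence[0], frames[0])
        for t in range(1, n_frames):
            for j in range(n_phones):
                stay = dp[t - 1, j]
                advance = dp[t - 1, j - 1] if j > 0 else -np.inf
                dp[t, j] = max(stay, advance) + log_score(phoneme_sequence[j], frames[t])
        return dp[-1, -1]   # best path that ends in the word's last phoneme

    # Recognition then picks the vocabulary word (or word sequence) with the
    # highest combined acoustic + language model score.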

  34. increase [, I n k R+ I s ,]

  35. include [, I n k l u: d ,]

  36. Elements of a Recognizer [block diagram repeated from slide 21]

  37. Language Model (LM) • Describes how words are connected to form a sentence • Limits the possible word sequences • Reduces the number of recognition errors by eliminating unlikely sequences • Increases the speed of the recognizer => real-time implementations

  38. Language Model (LM) • Two major types • Grammar-based: !start <sentence>; <sentence>: <yes> | <no>; <yes>: yes | yep | yes please ; <no>: no | no thanks | no thank you ; • Statistical • Probabilities of single words and 2-/3-word sequences • Derived from frequencies in a large text corpus
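A minimal Python sketch of the statistical variant: bigram probabilities estimated from relative frequencies in a (tiny) text corpus. Real language models are trained on far larger corpora and add smoothing for unseen word pairs.

    from collections import Counter

    def train_bigram_lm(sentences):
        """Estimate P(word | previous word) from relative frequencies in a corpus."""
        unigrams, bigrams = Counter(), Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            unigrams.update(words[:-1])
            bigrams.update(zip(words, words[1:]))
        return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

    lm = train_bigram_lm(["yes please", "no thank you", "no thanks"])
    print(lm[("<s>", "no")])      # 2/3: 'no' starts two of the three sentences
    print(lm[("no", "thank")])    # 1/2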

  39. Active Vocabulary • Lists the words that can be recognized by the acoustic model • and that are allowed to occur given the language model • Each word is associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module

  40. Result • N-Best List: • A list of word sequences, each with a score • Based on the AM and the LM • Sorted in descending order of this score • Contains at most N entries
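A small Python sketch of how such a list could be produced from scored hypotheses. The combination formula and the weight value are illustrative assumptions; real systems tune the LM weight on held-out data.

    def n_best(hypotheses, n=5, lm_weight=10.0):
        """Sort (words, am_score, lm_score) hypotheses by combined score, keep the top n."""
        scored = [(am + lm_weight * lm, words) for words, am, lm in hypotheses]
        return [words for score, words in sorted(scored, reverse=True)[:n]]

    print(n_best([("seven", -120.0, -2.1), ("nine", -118.0, -2.5)]))   # ['seven', 'nine']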

  41. Post Processing • Re-ordering of the N-best list using other criteria: e.g. credit card numbers, telephone numbers • If one result is needed, select the top element • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” => “$ 3.50” • “doctor Jones” => “Dr. Jones” • Etc.
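A tiny rule-based Python sketch in the spirit of the slide's examples; the rewrite rules and number words are illustrative assumptions, and real systems use much richer inverse-text-normalization grammars.

    import re

    NUMBER_WORDS = {"three": 3, "fifty": 50}     # illustrative only

    REWRITE_RULES = [
        (r"\bdoctor (\w+)", r"Dr. \1"),
        (r"\b(\w+) dollars? (\w+) cents?\b",
         lambda m: "$ %d.%02d" % (NUMBER_WORDS[m.group(1)], NUMBER_WORDS[m.group(2)])),
    ]

    def post_process(text):
        """Apply rewrite rules that fall outside the statistical language model."""
        for pattern, replacement in REWRITE_RULES:
            text = re.sub(pattern, replacement, text)
        return text

    print(post_process("three dollars fifty cents"))   # $ 3.50
    print(post_process("doctor jones"))                # Dr. jones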

  42. Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications

  43. How to get AM and LM • AM • Annotated speech database, and • Pronunciation dictionary • LM • Handwritten grammar, or • Large text corpus

  44. Training of Acoustic Models [diagram: Annotated Speech Database + Pronunciation Dictionary => Training Program => Acoustic Model]

  45. Annotated Speech Database • Must contain speech covering • all units: phonemes, context-dependent phonemes • the population (region, dialect, age, gender, …) • relevant environment(s): car, office, … • relevant channel(s): fixed phone, mobile phone, desktop computer, …

  46. Annotated Speech Database • Must contain transcription of speech • At least orthographic • Must include markers for • Speech by others • Other non-speech sounds • Unfinished words, mispronunciations, stuttering, etc.

  47. Pronunciation Dictionary • List of all words occurring in speech database • With one or more phonetic transcriptions • Or: Grapheme-To-Phoneme (G2P) module • Graphemes => phonemes • E.g. boek => [,b u k ,]
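A minimal Python sketch of the lookup-first, G2P-fallback idea from slide 47. The tiny letter-to-phoneme rules are only an illustration built around the slide's Dutch example; a real G2P module is trained or hand-written per language.

    # Illustrative only: lexicon entries and letter-to-phoneme rules are assumptions.
    LEXICON = {
        "boek": [",", "b", "u", "k", ","],    # slide example: boek => [, b u k ,]
    }
    LETTER_TO_PHONEME = {"b": ["b"], "oe": ["u"], "k": ["k"]}   # toy rules

    def transcribe(word):
        """Return a phonetic transcription: lexicon lookup, else naive G2P rules."""
        if word in LEXICON:
            return LEXICON[word]
        phones, i = [","], 0
        while i < len(word):
            # prefer the longest matching grapheme (two letters before one)
            for length in (2, 1):
                grapheme = word[i:i + length]
                if grapheme in LETTER_TO_PHONEME:
                    phones += LETTER_TO_PHONEME[grapheme]
                    i += length
                    break
            else:
                i += 1   # skip letters we have no rule for
        return phones + [","]

    print(transcribe("boek"))   # [',', 'b', 'u', 'k', ','] (from the lexicon)
    print(transcribe("koek"))   # [',', 'k', 'u', 'k', ','] (from the toy rules)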

  48. Training of Acoustic Models • For all utterances in the database: • Make a phonetic transcription of the utterance • Use the current models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models
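A compact Python sketch of this training loop: start from an equal-length segmentation, then repeatedly segment each utterance with the current models, collect per-phoneme statistics, and create new models. Everything below is heavily simplified (a single mean vector per phoneme, squared-distance scoring instead of real HMM re-estimation), and all names are assumptions.

    import numpy as np

    def segment(frames, phones, means):
        """Assign each frame to one phoneme of the transcription, left to right,
        using the current models (here: squared distance to each phoneme's mean)."""
        n_frames, n_phones = len(frames), len(phones)
        cost = np.array([[np.sum((f - means[p]) ** 2) for p in phones] for f in frames])
        dp = np.full((n_frames, n_phones), np.inf)
        back = np.zeros((n_frames, n_phones), dtype=int)   # 0 = stay, 1 = advance
        dp[0, 0] = cost[0, 0]
        for t in range(1, n_frames):
            for j in range(n_phones):
                choices = [dp[t - 1, j], dp[t - 1, j - 1] if j > 0 else np.inf]
                back[t, j] = int(np.argmin(choices))
                dp[t, j] = choices[back[t, j]] + cost[t, j]
        labels, j = [phones[-1]], n_phones - 1             # backtrace from the last phone
        for t in range(n_frames - 1, 0, -1):
            j -= back[t, j]
            labels.append(phones[j])
        return labels[::-1]

    def train_acoustic_models(utterances, n_iterations=3):
        """utterances: list of (frames, phonetic_transcription) pairs."""
        all_phones = sorted({p for _, phones in utterances for p in phones})
        dim = utterances[0][0].shape[1]
        # flat start: equal-length split of each utterance over its transcription
        alignments = [
            [p for chunk, p in zip(np.array_split(frames, len(phones)), phones)
               for _ in range(len(chunk))]
            for frames, phones in utterances
        ]
        for _ in range(n_iterations):
            sums = {p: np.zeros(dim) for p in all_phones}        # collect statistics
            counts = {p: 1e-9 for p in all_phones}
            for (frames, _), labels in zip(utterances, alignments):
                for frame, label in zip(frames, labels):
                    sums[label] += frame
                    counts[label] += 1
            means = {p: sums[p] / counts[p] for p in all_phones}  # create new models
            alignments = [segment(frames, phones, means)          # re-segment with them
                          for frames, phones in utterances]
        return means

    # Toy usage with random 12-dimensional "feature" frames
    rng = np.random.default_rng(0)
    data = [(rng.normal(size=(40, 12)), [",", "b", "u", "k", ","]),
            (rng.normal(size=(30, 12)), [",", "k", "u", "k", ","])]
    print(sorted(train_acoustic_models(data)))   # [',', 'b', 'k', 'u']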
