Automatic Speech Recognition: Introduction Jan Odijk Utrecht, Dec 9, 2010
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
ASR • Automatic Speech Recognition is the process by which a computer maps an acoustic signal containing speech to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
ASR-related • Automatic speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Automatic speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
Why is ASR difficult? • All occurrences of a speech sound differ from each other • even when part of the same word type • and even when pronounced by the same person • (the ‘b’ in ‘boom’ is never pronounced twice in exactly the same way) • Each speaker has his or her own voice characteristics
Why is ASR difficult? • Other problems caused by: • Language: Dutch vs. English vs. … • Accent/Dialect: Flemish vs. NL Dutch, etc. • Gender: Male vs. female • Age: child vs. adult vs. senior • Health: cold, flu, sore throat, etc.
Why is ASR difficult? • Other problems caused by: • Environment: home, office, in-car, in station, etc. • Channel : fixed telephone, mobile phone, multimedia channel, etc. • Microphone(s): telephone mike, close-talk mike, far mike, array microphone, etc.; different mike qualities
Why is ASR difficult? • Confusables: • Zeven vs. negen (Dutch ‘seven’ vs. ‘nine’) • Ambiguity • [sã] = cent, (je) sens, sans (French) • Variation • Yes, yeah, yep, ok, okido, fine, etc.
Why is ASR difficult? • Assimilation, deletions, etc. • Een (‘a/an’) => [n], [m], [ŋ] depending on the following word (auto, boek, kast) • Natuurlijk (‘of course’) => tuurlijk • Coarticulation • Pronunciation of a sound depends on its environment (the sounds preceding/following it) • Koel vs. kiel: [k] vs. [k’] • Filled pauses, stuttering, repetitions
Why is ASR difficult? • Other sounds • Background noise, music, other people talking, channel noise • Reverberation, echo • Speaker of language X pronouncing words from language Y • Esp. with names (persons, places, …)
How are these problems reduced? • Separate ASR system • for each language • for each accent/dialect (Dutch / Flemish) • for each environment • for each channel and microphone(s) • use a close-talk mike to reduce other sounds and the influence of the environment • for each speaker (speaker-dependent / speaker-adaptive ASR)
How are these problems reduced? • Restricted Vocabulary • Only a limited number of words can be ‘recognized’ by any specific system • Ranging from a dozen to 64k different word forms • Dozen: application in which digits, yes/no and simple commands are sufficient (banking applications, number dialing)
How are these problems reduced? • Restricted Vocabulary • In between: reverse directory application • employee name => phone number • 64k: ‘large vocabulary systems’ • dictation, • (topographic) name recognition
How are these problems reduced? • Small Vocabularies • Is that enough? No, generally not! • Use dialogue to change restricted vocabulary in each dialogue state (dynamic active vocabularies) • Yes/no answer is expected => activate yes/no vocabulary • Digit expected => activate digit vocabulary • Name expected => activate name vocabulary
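A minimal sketch of such dynamic active vocabularies; the dialogue states and word lists below are made-up assumptions, not taken from the slides:

```python
# Switch the recognizer's active vocabulary depending on the dialogue state.
ACTIVE_VOCABULARIES = {
    "confirm": ["yes", "yep", "yes please", "no", "no thanks", "no thank you"],
    "digits":  [str(d) for d in range(10)],
    "names":   ["jones", "smith", "de vries"],   # hypothetical name list
}

def activate_vocabulary(dialogue_state: str) -> list[str]:
    """Return the words the recognizer may hypothesize in this state."""
    return ACTIVE_VOCABULARIES.get(dialogue_state, [])

# After a yes/no question only the 'confirm' vocabulary is active, so a digit
# such as 'seven' cannot be (mis)recognized in that state.
print(activate_vocabulary("confirm"))
```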
How are these problems reduced? • 64k Vocabulary (“Large Vocabulary”) • Is that enough? • No, generally not • Languages with compounds • Languages with a lot of inflection • Agglutinative languages • => require special measures
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How does ASR work? • Not possible (yet?) to characterize the different sounds by (hand-crafted) rules • Instead: • A large set of recordings of each sound is made • Using statistical methods a model for each sound is derived (acoustic model) • Incoming sound is compared, using statistics, with acoustic model of a sound
Elements of a Recognizer (block diagram): Speech → Feature Extraction → Pattern Matching → Post Processing → text / Display; Pattern Matching uses the Acoustic Model, the Language Model and Language Data; Natural Language Understanding derives Meaning and an Action from the text
Feature Extraction • Turning the speech signal into something more manageable • Sampling of the signal: transforming it into digital form • Features are computed for each short piece of speech (10 ms) • Compression
Feature Extraction • Extract relevant parameters from the signal • Spectral information, energy, frequency,... • Eliminate undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)
Feature Extraction: Vectors • Signal is chopped into small pieces (frames) • Spectral analysis of a speech frame produces a vector representing the signal properties (e.g. [10.3, 1.2, -0.9, …, 0.2]) • => result = stream of vectors
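A minimal sketch of this framing step with NumPy; the 10 ms frame length follows the slide, but the Hamming window, the FFT magnitudes kept and the 13-dimensional vector are illustrative assumptions (real front-ends typically use ~25 ms windows and MFCC or PLP features):

```python
import numpy as np

def extract_features(signal: np.ndarray, sample_rate: int = 8000,
                     frame_ms: int = 10) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)        # samples per frame
    n_frames = len(signal) // frame_len
    vectors = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len)))
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        # keep log-energy plus a few coarse spectral magnitudes as the vector
        vectors.append(np.concatenate(([log_energy], spectrum[:12])))
    return np.array(vectors)                              # stream of vectors

# Example: one second of (random) 8 kHz audio -> 100 feature vectors
features = extract_features(np.random.randn(8000))
print(features.shape)   # (100, 13)
```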
Acoustic Model (AM) • Split the utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability (see earlier) • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: models are created from a large number of examples
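A minimal sketch of such a statistically trained acoustic model: one diagonal Gaussian per phoneme, scoring how well an incoming feature vector matches each unit. Real recognizers use HMM states with Gaussian mixtures or neural networks; the phoneme set, means and variances below are made up:

```python
import numpy as np

class GaussianPhonemeModel:
    def __init__(self, mean: np.ndarray, var: np.ndarray):
        self.mean, self.var = mean, var

    def log_score(self, x: np.ndarray) -> float:
        """Log-likelihood of one feature vector under this phoneme model."""
        return float(-0.5 * np.sum(np.log(2 * np.pi * self.var)
                                   + (x - self.mean) ** 2 / self.var))

def best_phoneme(models: dict, x: np.ndarray) -> str:
    """Which phoneme model matches this incoming frame best?"""
    return max(models, key=lambda p: models[p].log_score(x))

# Example with two toy phoneme models (means and variances invented here)
dim = 13
models = {"k": GaussianPhonemeModel(np.zeros(dim), np.ones(dim)),
          "u": GaussianPhonemeModel(np.ones(dim), np.ones(dim))}
print(best_phoneme(models, np.full(dim, 0.9)))   # -> 'u'
```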
Acoustic Model (AM) • Representation of the speech signal • Waveform • Horizontal: time • Vertical: amplitude • Spectrogram • Horizontal: time • Vertical: frequency • Color: amplitude of the frequency component
[Figure: the utterance “friendly computers” aligned to a sequence of acoustic states S1-S13]
Acoustic Model: Units • Phoneme: words share the units that model the same sound (e.g. ‘stop’ and ‘start’ share their S and T units) • Word: a series of units specific to the word (e.g. ‘stop’ = S1 S2 S3 S4, ‘start’ = S6 S7 S8 S9 S10)
Acoustic Model: Units • Context-dependent phoneme: e.g. ‘stop’ = S|,|T T|S|O O|T|P P|O|, • Diphone: e.g. ‘stop’ = ,S ST TO OP P, • Other sub-word units: consonant clusters, e.g. ‘stop’ = ST O P
Acoustic Model: Units • Other possible units • Words • Multi words: example: “it is”, “going to” • Combinations of all of the above
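A minimal sketch of deriving diphone and context-dependent units from a phoneme string; the boundary symbol ‘,’ and the centre|left|right notation follow the ‘stop’ example above, while the code itself is an illustrative assumption:

```python
def diphones(phones: list[str]) -> list[str]:
    """Diphones: transitions between successive phonemes, e.g. ',S', 'ST', ..."""
    padded = [","] + phones + [","]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

def context_dependent(phones: list[str]) -> list[str]:
    """Context-dependent phonemes: centre phone with its left/right context."""
    padded = [","] + phones + [","]
    return [f"{padded[i]}|{padded[i-1]}|{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(diphones(list("STOP")))           # [',S', 'ST', 'TO', 'OP', 'P,']
print(context_dependent(list("STOP")))  # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
```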
Pattern matching • Acoustic Model: returns a score for each incoming feature vector, indicating how well the feature corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming features • Search algorithm: looks for the best-scoring word or word sequence (see the sketch below)
Pattern matching example: increase [, I n k R+ I s ,] vs. include [, I n k l u: d ,]
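A minimal dynamic-programming sketch of the word score: the best left-to-right alignment of the incoming frames to the word’s phoneme sequence, built on per-frame local scores (assumed to come from an acoustic model such as the Gaussian sketch above):

```python
import numpy as np

def word_score(local_scores: np.ndarray) -> float:
    """local_scores: (n_frames, n_phonemes) log-likelihoods for one word.
    Left-to-right alignment: each phoneme covers one or more frames."""
    T, P = local_scores.shape
    dp = np.full((T, P), -np.inf)
    dp[0, 0] = local_scores[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = dp[t - 1, p]                                  # remain in same phoneme
            advance = dp[t - 1, p - 1] if p > 0 else -np.inf     # move to next phoneme
            dp[t, p] = max(stay, advance) + local_scores[t, p]
    return float(dp[T - 1, P - 1])

# The search algorithm then compares such scores across all active words
# (and word sequences) and keeps the best-scoring hypotheses.
scores = np.log(np.random.rand(20, 4))   # 20 frames, 4-phoneme word (toy data)
print(word_score(scores))
```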
Language Model (LM) • Describes how words are connected to form a sentence • Limit possible word sequences • Reduce number of recognition errors by eliminating unlikely sequences • Increase speed of recognizer => real time implementations
Language Model (LM) • Two major types • Grammar based:
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please;
<no>: no | no thanks | no thank you;
• Statistical • Probabilities of single words and of 2-/3-word sequences (bigrams/trigrams) • Derived from frequencies in a large text corpus
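A minimal sketch of the statistical variant: a bigram model estimated from counts in a text corpus; the toy corpus and the add-one smoothing are assumptions:

```python
from collections import Counter

corpus = ["yes please", "no thanks", "yes", "no thank you"]   # toy corpus

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

vocab_size = len(set(w for s in corpus for w in s.split())) + 1   # + </s>

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("<s>", "yes"))     # 'yes' is a likely sentence start
print(bigram_prob("yes", "thanks"))  # unseen bigram -> small probability
```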
Active Vocabulary • Lists words that can be recognized by the acoustic model • That are allowed to occur given the language model • Each word associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module
Result • N-Best List: • List of word sequences, each with a score • Based on AM and LM • Sorted in descending order of this score • Contains at most N entries
Post Processing • Re-ordering of the N-best list using other criteria: e.g. credit card numbers, telephone numbers • If one result is needed, select the top element • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” => “$ 3.50” • “doctor Jones” => “Dr. Jones” • Etc.
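A minimal sketch of two such post-processing steps: re-ranking an N-best list with an application-specific check (a Luhn card-number checksum is used here as one plausible criterion) and rewriting spoken forms for display. The hypotheses, scores and rewrite rule are made up:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum used for card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def rerank(nbest: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Prefer hypotheses whose digit string passes the checksum."""
    return sorted(nbest,
                  key=lambda h: (not luhn_valid(h[0].replace(" ", "")), -h[1]))

def to_display(text: str) -> str:
    """Rewrite spoken forms, e.g. 'doctor Jones' -> 'Dr. Jones'."""
    return re.sub(r"\bdoctor\b", "Dr.", text, flags=re.IGNORECASE)

nbest = [("4539 1488 0343 6466", -118.9),   # higher score, fails checksum
         ("4539 1488 0343 6467", -120.3)]   # lower score, passes checksum
print(rerank(nbest)[0][0])                  # -> '4539 1488 0343 6467'
print(to_display("doctor Jones"))
```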
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How to get AM and LM • AM • Annotated speech database, and • Pronunciation dictionary • LM • Handwritten grammar, or • Large text corpus
Training of Acoustic Models (diagram): Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model
Annotated Speech Database • Must contain speech covering • all units: phonemes, context-dependent phonemes • the population (region, dialect, age, gender, …) • relevant environment(s) (car, office, …) • relevant channel(s) (fixed phone, mobile phone, desktop computer, …)
Annotated Speech Database • Must contain transcription of speech • At least orthographic • Must include markers for • Speech by others • Other non-speech sounds • Unfinished words, mispronunciations, stuttering, etc.
Pronunciation Dictionary • List of all words occurring in speech database • With one or more phonetic transcriptions • Or: Grapheme-To-Phoneme (G2P) module • Graphemes => phonemes • E.g. boek => [,b u k ,]
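A minimal sketch combining both options: look a word up in an enumerated lexicon and fall back to a toy rule-based G2P conversion. The letter-to-phoneme rules below are illustrative assumptions, not a real Dutch G2P module:

```python
LEXICON = {
    "boek": [[",", "b", "u", "k", ","]],                   # enumerated transcription
    "zeven": [[",", "z", "e:", "v", "@", "n", ","]],
}

G2P_RULES = {"oe": "u", "b": "b", "k": "k", "t": "t", "a": "a"}   # toy rules

def transcribe(word: str) -> list[list[str]]:
    if word in LEXICON:
        return LEXICON[word]
    # greedy longest-match letter-to-phoneme conversion as a fallback
    phones, i = [","], 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in G2P_RULES:
                phones.append(G2P_RULES[chunk])
                i += length
                break
        else:
            i += 1          # skip letters the toy rules do not cover
    return [phones + [","]]

print(transcribe("boek"))   # from the lexicon
print(transcribe("bak"))    # generated by the fallback G2P rules
```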
Training of Acoustic Models • For all utterances in the database: • Make a phonetic transcription of the utterance • Use the current models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models (and repeat)
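A minimal sketch of this training loop, reusing the GaussianPhonemeModel and best_phoneme sketch from the acoustic-model section above. This toy version assigns each frame to the best-matching phoneme of the utterance’s transcription; a real trainer would use forced left-to-right alignment and EM (Baum-Welch) re-estimation:

```python
import numpy as np
from collections import defaultdict

def train(models: dict, database: list[tuple[np.ndarray, list[str]]],
          iterations: int = 5) -> dict:
    """database: (feature_vectors, phonetic_transcription) per utterance."""
    for _ in range(iterations):
        frames_per_phoneme = defaultdict(list)
        for features, transcription in database:             # all utterances
            allowed = {p: models[p] for p in transcription if p in models}
            for frame in features:                            # segment the utterance:
                phoneme = best_phoneme(allowed, frame)        # assign a phoneme per frame
                frames_per_phoneme[phoneme].append(frame)
        for phoneme, frames in frames_per_phoneme.items():    # create new models
            data = np.array(frames)
            models[phoneme] = GaussianPhonemeModel(data.mean(axis=0),
                                                   data.var(axis=0) + 1e-6)
    return models
```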