Automatic Speech Recognition: Introduction Jan Odijk Utrecht, Dec 9, 2010
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
ASR • Automatic Speech Recognition is the process by which a computer maps an acoustic signal containing speech to text. • Automatic speech understanding is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech.
ASR-related • Automatic speaker recognition is the process by which a computer recognizes the identity of the speaker based on speech samples. • Automatic speaker verification is the process by which a computer checks the claimed identity of the speaker based on speech samples
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
Why is ASR difficult? • All occurrences of a speech sound differ from each other • even when part of the same word type • and even when pronounced by the same person • (the ‘b’ in ‘boom’ is never pronounced twice in exactly the same way) • Each speaker has his or her own voice characteristics
Why is ASR difficult? • Other problems caused by: • Language: Dutch vs. English vs. … • Accent/Dialect: Flemish vs. NL Dutch, etc. • Gender: Male vs. female • Age: child vs. adult vs. senior • Health: cold, flu, sore throat, etc.
Why is ASR difficult? • Other problems caused by: • Environment: home, office, in-car, in station, etc. • Channel : fixed telephone, mobile phone, multimedia channel, etc. • Microphone(s): telephone mike, close-talk mike, far mike, array microphone, etc.; different mike qualities
Why is ASR difficult? • Confusables: • Zeven vs. negen (Dutch ‘seven’ vs. ‘nine’) • Ambiguity • [sã] = cent, (je) sens, sans (French) • Variation • Yes, yeah, yep, ok, okido, fine, etc.
Why is ASR difficult? • Assimilation, deletions, etc. • Een (‘a/an’) => [n], [m], [ŋ] depending on the following word (auto, boek, kast) • Natuurlijk (‘of course’) => tuurlijk • Coarticulation • Pronunciation of a sound depends on its environment (the sounds preceding/following it) • Koel vs. kiel: [k] vs. [k’] • Filled pauses, stuttering, repetitions
Why is ASR difficult? • Other sounds • Background noise, music, other people talking, channel noise • Reverberation, echo • Speaker of language X pronouncing words from language Y • Esp. with names (persons, places, …)
How are these problems reduced? • Separate ASR system • for each language • for each accent/dialect (Dutch / Flemish) • for each environment • for each channel and microphone(s) • use a close-talk mike to reduce other sounds and the influence of the environment • for each speaker (speaker-dependent / speaker-adaptive ASR)
How are these problems reduced? • Restricted Vocabulary • Only a limited number of words can be ‘recognized’ by any specific system • Ranging from a dozen to 64k different word forms • Dozen: application in which digits, yes/no and simple commands are sufficient (banking applications, number dialing)
How are these problems reduced? • Restricted Vocabulary • In between: reverse directory application • employee name => phone number • 64k: ‘large vocabulary systems’ • dictation, • (topographic) name recognition
How are these problems reduced? • Small Vocabularies • Is that enough? No, generally not! • Use dialogue to change restricted vocabulary in each dialogue state (dynamic active vocabularies) • Yes/no answer is expected => activate yes/no vocabulary • Digit expected => activate digit vocabulary • Name expected => activate name vocabulary
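A minimal sketch of such dynamic active vocabularies; the dialogue states and word lists below are made-up assumptions, not taken from the slides:

```python
# Switch the recognizer's active vocabulary depending on the dialogue state.
ACTIVE_VOCABULARIES = {
    "confirm": ["yes", "yep", "yes please", "no", "no thanks", "no thank you"],
    "digits":  [str(d) for d in range(10)],
    "names":   ["jones", "smith", "de vries"],   # hypothetical name list
}

def activate_vocabulary(dialogue_state: str) -> list[str]:
    """Return the words the recognizer may hypothesize in this state."""
    return ACTIVE_VOCABULARIES.get(dialogue_state, [])

# After a yes/no question only the 'confirm' vocabulary is active, so a digit
# such as 'seven' cannot be (mis)recognized in that state.
print(activate_vocabulary("confirm"))
```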
How are these problems reduced? • 64k Vocabulary (“Large Vocabulary”) • Is that enough? • No, generally not • Languages with compounds • Languages with a lot of inflection • Agglutinative languages • => require special measures
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How does ASR work? • Not possible (yet?) to characterize the different sounds by (hand-crafted) rules • Instead: • A large set of recordings of each sound is made • Using statistical methods a model for each sound is derived (acoustic model) • Incoming sound is compared, using statistics, with acoustic model of a sound
Elements of a Recognizer (block diagram): Speech → Feature Extraction → Pattern Matching → Post Processing → text / Display; Pattern Matching uses the Acoustic Model, the Language Model and Language Data; Natural Language Understanding derives Meaning and an Action from the text
Feature Extraction • Turning the speech signal into something more manageable • Sampling of the signal: transforming it into digital form • Features are computed for each short piece of speech (10 ms) • Compression
Feature Extraction • Extract relevant parameters from the signal • Spectral information, energy, frequency,... • Eliminate undesirable elements (normalization) • Noise • Channel properties • Speaker properties (gender)
Feature Extraction: Vectors • Signal is chopped into small pieces (frames) • Spectral analysis of a speech frame produces a vector representing the signal properties (e.g. [10.3, 1.2, -0.9, …, 0.2]) • => result = stream of vectors
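A minimal sketch of this framing step with NumPy; the 10 ms frame length follows the slide, but the Hamming window, the FFT magnitudes kept and the 13-dimensional vector are illustrative assumptions (real front-ends typically use ~25 ms windows and MFCC or PLP features):

```python
import numpy as np

def extract_features(signal: np.ndarray, sample_rate: int = 8000,
                     frame_ms: int = 10) -> np.ndarray:
    frame_len = int(sample_rate * frame_ms / 1000)        # samples per frame
    n_frames = len(signal) // frame_len
    vectors = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(frame_len)))
        log_energy = np.log(np.sum(frame ** 2) + 1e-10)
        # keep log-energy plus a few coarse spectral magnitudes as the vector
        vectors.append(np.concatenate(([log_energy], spectrum[:12])))
    return np.array(vectors)                              # stream of vectors

# Example: one second of (random) 8 kHz audio -> 100 feature vectors
features = extract_features(np.random.randn(8000))
print(features.shape)   # (100, 13)
```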
Acoustic Model (AM) • Split the utterance into basic units, e.g. phonemes • The acoustic model describes the typical spectral shape (or typical vectors) for each unit • For each incoming speech segment, the acoustic model tells us how well (or how badly) it matches each phoneme • Must cope with pronunciation variability (see earlier) • Utterances of the same word by the same speaker are never identical • Differences between speakers • Identical phonemes sound different in different words • => statistical techniques: models are created from a large number of examples
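A minimal sketch of such a statistically trained acoustic model: one diagonal Gaussian per phoneme, scoring how well an incoming feature vector matches each unit. Real recognizers use HMM states with Gaussian mixtures or neural networks; the phoneme set, means and variances below are made up:

```python
import numpy as np

class GaussianPhonemeModel:
    def __init__(self, mean: np.ndarray, var: np.ndarray):
        self.mean, self.var = mean, var

    def log_score(self, x: np.ndarray) -> float:
        """Log-likelihood of one feature vector under this phoneme model."""
        return float(-0.5 * np.sum(np.log(2 * np.pi * self.var)
                                   + (x - self.mean) ** 2 / self.var))

def best_phoneme(models: dict, x: np.ndarray) -> str:
    """Which phoneme model matches this incoming frame best?"""
    return max(models, key=lambda p: models[p].log_score(x))

# Example with two toy phoneme models (means and variances invented here)
dim = 13
models = {"k": GaussianPhonemeModel(np.zeros(dim), np.ones(dim)),
          "u": GaussianPhonemeModel(np.ones(dim), np.ones(dim))}
print(best_phoneme(models, np.full(dim, 0.9)))   # -> 'u'
```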
Acoustic Model (AM) • Representation of the speech signal • Waveform • Horizontal: time • Vertical: amplitude • Spectrogram • Horizontal: time • Vertical: frequency • Color: amplitude of the frequency component
[Figure: the utterance “friendly computers” aligned to a sequence of acoustic states S1-S13]
Acoustic Model: Units • Phoneme: words share the units that model the same sound (e.g. ‘stop’ and ‘start’ share their S and T units) • Word: a series of units specific to the word (e.g. ‘stop’ = S1 S2 S3 S4, ‘start’ = S6 S7 S8 S9 S10)
Acoustic Model: Units • Context-dependent phoneme: e.g. ‘stop’ = S|,|T T|S|O O|T|P P|O|, • Diphone: e.g. ‘stop’ = ,S ST TO OP P, • Other sub-word units: consonant clusters, e.g. ‘stop’ = ST O P
Acoustic Model: Units • Other possible units • Words • Multi words: example: “it is”, “going to” • Combinations of all of the above
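A minimal sketch of deriving diphone and context-dependent units from a phoneme string; the boundary symbol ‘,’ and the centre|left|right notation follow the ‘stop’ example above, while the code itself is an illustrative assumption:

```python
def diphones(phones: list[str]) -> list[str]:
    """Diphones: transitions between successive phonemes, e.g. ',S', 'ST', ..."""
    padded = [","] + phones + [","]
    return [padded[i] + padded[i + 1] for i in range(len(padded) - 1)]

def context_dependent(phones: list[str]) -> list[str]:
    """Context-dependent phonemes: centre phone with its left/right context."""
    padded = [","] + phones + [","]
    return [f"{padded[i]}|{padded[i-1]}|{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(diphones(list("STOP")))           # [',S', 'ST', 'TO', 'OP', 'P,']
print(context_dependent(list("STOP")))  # ['S|,|T', 'T|S|O', 'O|T|P', 'P|O|,']
```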
Pattern matching • Acoustic Model: returns a score for each incoming feature vector, indicating how well the feature corresponds to the model (= local score) • Calculate the score of a word, indicating how well the word matches the string of incoming features • Search algorithm: looks for the best-scoring word or word sequence (see the sketch below)
Pattern matching example: increase [, I n k R+ I s ,] vs. include [, I n k l u: d ,]
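A minimal dynamic-programming sketch of the word score: the best left-to-right alignment of the incoming frames to the word’s phoneme sequence, built on per-frame local scores (assumed to come from an acoustic model such as the Gaussian sketch above):

```python
import numpy as np

def word_score(local_scores: np.ndarray) -> float:
    """local_scores: (n_frames, n_phonemes) log-likelihoods for one word.
    Left-to-right alignment: each phoneme covers one or more frames."""
    T, P = local_scores.shape
    dp = np.full((T, P), -np.inf)
    dp[0, 0] = local_scores[0, 0]
    for t in range(1, T):
        for p in range(P):
            stay = dp[t - 1, p]                                  # remain in same phoneme
            advance = dp[t - 1, p - 1] if p > 0 else -np.inf     # move to next phoneme
            dp[t, p] = max(stay, advance) + local_scores[t, p]
    return float(dp[T - 1, P - 1])

# The search algorithm then compares such scores across all active words
# (and word sequences) and keeps the best-scoring hypotheses.
scores = np.log(np.random.rand(20, 4))   # 20 frames, 4-phoneme word (toy data)
print(word_score(scores))
```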
Language Model (LM) • Describes how words are connected to form a sentence • Limit possible word sequences • Reduce number of recognition errors by eliminating unlikely sequences • Increase speed of recognizer => real time implementations
Language Model (LM) • Two major types • Grammar based:
!start <sentence>;
<sentence>: <yes> | <no>;
<yes>: yes | yep | yes please;
<no>: no | no thanks | no thank you;
• Statistical • Probabilities of single words and of 2-/3-word sequences (bigrams/trigrams) • Derived from frequencies in a large text corpus
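A minimal sketch of the statistical variant: a bigram model estimated from counts in a text corpus; the toy corpus and the add-one smoothing are assumptions:

```python
from collections import Counter

corpus = ["yes please", "no thanks", "yes", "no thank you"]   # toy corpus

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words[:-1])
    bigrams.update(zip(words[:-1], words[1:]))

vocab_size = len(set(w for s in corpus for w in s.split())) + 1   # + </s>

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(bigram_prob("<s>", "yes"))     # 'yes' is a likely sentence start
print(bigram_prob("yes", "thanks"))  # unseen bigram -> small probability
```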
Active Vocabulary • Lists words that can be recognized by the acoustic model • That are allowed to occur given the language model • Each word associated with a phonetic transcription • Enumerated, and/or • Generated by a Grapheme-to-Phoneme (G2P) module
Result • N-Best List: • List of word sequences, each with a score • Based on AM and LM • Sorted in descending order of this score • Contains at most N entries
Post Processing • Re-ordering of the N-best list using other criteria: e.g. credit card numbers, telephone numbers • If one result is needed, select the top element • Applying NLP techniques that fall outside the scope of the statistical language model • E.g. “three dollars fifty cents” => “$ 3.50” • “doctor Jones” => “Dr. Jones” • Etc.
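A minimal sketch of two such post-processing steps: re-ranking an N-best list with an application-specific check (a Luhn card-number checksum is used here as one plausible criterion) and rewriting spoken forms for display. The hypotheses, scores and rewrite rule are made up:

```python
import re

def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum used for card numbers."""
    total, parity = 0, len(digits) % 2
    for i, ch in enumerate(digits):
        d = int(ch)
        if i % 2 == parity:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

def rerank(nbest: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Prefer hypotheses whose digit string passes the checksum."""
    return sorted(nbest,
                  key=lambda h: (not luhn_valid(h[0].replace(" ", "")), -h[1]))

def to_display(text: str) -> str:
    """Rewrite spoken forms, e.g. 'doctor Jones' -> 'Dr. Jones'."""
    return re.sub(r"\bdoctor\b", "Dr.", text, flags=re.IGNORECASE)

nbest = [("4539 1488 0343 6466", -118.9),   # higher score, fails checksum
         ("4539 1488 0343 6467", -120.3)]   # lower score, passes checksum
print(rerank(nbest)[0][0])                  # -> '4539 1488 0343 6467'
print(to_display("doctor Jones"))
```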
Overview • What is ASR? • Why is it difficult? • How does it work? • How to make a speech recognizer? • Example Applications
How to get AM and LM • AM • Annotated speech database, and • Pronunciation dictionary • LM • Handwritten grammar, or • Large text corpus
Training of Acoustic Models (diagram): Annotated Speech Database + Pronunciation Dictionary → Training Program → Acoustic Model
Annotated Speech Database • Must contain speech covering • all units: phonemes, context-dependent phonemes • the population (region, dialect, age, gender, …) • relevant environment(s) (car, office, …) • relevant channel(s) (fixed phone, mobile phone, desktop computer, …)
Annotated Speech Database • Must contain transcription of speech • At least orthographic • Must include markers for • Speech by others • Other non-speech sounds • Unfinished words, mispronunciations, stuttering, etc.
Pronunciation Dictionary • List of all words occurring in speech database • With one or more phonetic transcriptions • Or: Grapheme-To-Phoneme (G2P) module • Graphemes => phonemes • E.g. boek => [,b u k ,]
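A minimal sketch combining both options: look a word up in an enumerated lexicon and fall back to a toy rule-based G2P conversion. The letter-to-phoneme rules below are illustrative assumptions, not a real Dutch G2P module:

```python
LEXICON = {
    "boek": [[",", "b", "u", "k", ","]],                   # enumerated transcription
    "zeven": [[",", "z", "e:", "v", "@", "n", ","]],
}

G2P_RULES = {"oe": "u", "b": "b", "k": "k", "t": "t", "a": "a"}   # toy rules

def transcribe(word: str) -> list[list[str]]:
    if word in LEXICON:
        return LEXICON[word]
    # greedy longest-match letter-to-phoneme conversion as a fallback
    phones, i = [","], 0
    while i < len(word):
        for length in (2, 1):
            chunk = word[i:i + length]
            if chunk in G2P_RULES:
                phones.append(G2P_RULES[chunk])
                i += length
                break
        else:
            i += 1          # skip letters the toy rules do not cover
    return [phones + [","]]

print(transcribe("boek"))   # from the lexicon
print(transcribe("bak"))    # generated by the fallback G2P rules
```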
Training of Acoustic Models • For all utterances in the database: • Make a phonetic transcription of the utterance • Use the current models to segment the utterance file: assign a phoneme to each speech frame • Collect statistical information: count prototype-phoneme occurrences • Create new models (and repeat)
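A minimal sketch of this training loop, reusing the GaussianPhonemeModel and best_phoneme sketch from the acoustic-model section above. This toy version assigns each frame to the best-matching phoneme of the utterance’s transcription; a real trainer would use forced left-to-right alignment and EM (Baum-Welch) re-estimation:

```python
import numpy as np
from collections import defaultdict

def train(models: dict, database: list[tuple[np.ndarray, list[str]]],
          iterations: int = 5) -> dict:
    """database: (feature_vectors, phonetic_transcription) per utterance."""
    for _ in range(iterations):
        frames_per_phoneme = defaultdict(list)
        for features, transcription in database:             # all utterances
            allowed = {p: models[p] for p in transcription if p in models}
            for frame in features:                            # segment the utterance:
                phoneme = best_phoneme(allowed, frame)        # assign a phoneme per frame
                frames_per_phoneme[phoneme].append(frame)
        for phoneme, frames in frames_per_phoneme.items():    # create new models
            data = np.array(frames)
            models[phoneme] = GaussianPhonemeModel(data.mean(axis=0),
                                                   data.var(axis=0) + 1e-6)
    return models
```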