Spoken dialogs with computers

Spoken dialogs with computers Krzysztof Marasek

Man-machine communication: how can it work? • Man-machine interaction through graphics • Man-machine communication by speech (the lecture's topic) • Multimodal man-machine communication • Is that all? No: there are also haptics and structure changes (Logitech mouse)

Lecture overview • phonetics in a nutshell • sound propagation • acoustic properties of speech • basic signal forms and distinctive features of speech sounds • phonemes and allophones, grapheme-to-phoneme conversion • speech parameterization • speech data collection • speech modeling • language modeling and basics of natural language processing • speech recognition techniques: Viterbi decoding • model and pronunciation adaptation • speech synthesis methods and Text-To-Speech synthesis • speech understanding • dialog design • applications and challenges of speech technology

Architecture of spoken dialog (De Mori, 99) • Our topics: speech recognition • speech interpretation • dialog manager • text generation • speech synthesis

Human Language Technology worldwide • USA: Oregon, Carnegie Mellon, MIT • Japan: Kyoto, Tokyo • Europe: Germany, France, Great Britain, Scandinavia, Italy • wild East

Everything clear • (Shown on the slide in Polish, with the inner letters of every word scrambled:) Research at a certain English university has shown that it does not matter in what order we write the letters inside a word, as long as the first and last letters are in their places. The rest can be shuffled freely, and we will still be able to read the text without much trouble. This is because we do not read each letter separately, but the word as a whole. Eric Campbell
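The scrambling used on this slide (inner letters shuffled, first and last letters kept in place) can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and punctuation attached to a word is treated as part of the word.

```python
import random

def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the inner letters of a word, keeping the first and last in place."""
    if len(word) <= 3:
        return word  # at most one inner letter: nothing to shuffle
    inner = list(word[1:-1])
    rng.shuffle(inner)
    return word[0] + "".join(inner) + word[-1]

def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every whitespace-separated word of a text."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    return " ".join(scramble_word(w, rng) for w in text.split())
```

Calling `scramble_text("reading scrambled words is easy")` returns a sentence in which every word keeps its first letter, its last letter and its set of letters.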

Speech signal characterization: time signal • energy • spectrogram • pitch • duration

Man-machine communication by speech • Diagram labels: coding • phrases • sentences • words • letters • Cześć! • decoding • Man-machine communication: the exchange of information coded in a way suitable for transmission through a physical medium • Coding: the process of producing a representation of what has to be communicated • Knowledge sources (KSs): constraints for building a symbolic version of the message and transmitting it through a physical channel • Decoding: the computer uses models of the KSs, which are deterministic but often imprecise

Knowledge sources: acoustic models • Diagram labels: coding • phrases • words • syllables • phones • Cześć! • Coding of acoustic speech events: • phones: a representation of the basic speech units • coding alphabet: e.g. IPA codes, but others also exist (e.g. SAMPA)

Speech recognition: a simple decoder model • Diagram: information source • W • information channel • X • Modern systems are based on probabilistic scores for candidate hypotheses • Model of hypothesis scoring: let the sequence of acoustic observations X = x1 … xN be the output of the information channel. If the speaker intended the sequence of words W = w1 … wK, then X is a coded version of W • The objective of recognition is to reconstruct W from the observation of X • Utterance -> speech signal -> PCM -> window -> coefficients -> X
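The chain "PCM -> window -> coefficients -> X" on this slide can be sketched in plain Python. The frame length (400 samples), the hop (160 samples), the Hamming window and the single log-energy coefficient are assumptions chosen for illustration; real front ends compute richer coefficient vectors per frame.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split PCM samples into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window of length n (n must be at least 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def log_energy(frame, window):
    """Toy coefficient: log energy of the windowed frame."""
    energy = sum((s * w) ** 2 for s, w in zip(frame, window))
    return math.log(energy + 1e-10)  # small constant avoids log(0) on silence

def front_end(samples, frame_len=400, hop=160):
    """Return X: one (here one-dimensional) feature vector per frame."""
    window = hamming(frame_len)
    return [[log_energy(f, window)] for f in frame_signal(samples, frame_len, hop)]

# One second of a 440 Hz tone at 16 kHz stands in for a recorded utterance.
pcm = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
X = front_end(pcm)
```

With these settings each frame covers 25 ms and consecutive frames start 10 ms apart, so `X` is the equally spaced observation sequence that the decoder works on.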

Knowledge sources: acoustic models • How can phones be modeled? • Let's assume that the speech signal is parameterized as a sequence of feature vectors computed for equally spaced speech frames • Assumption: the parameters are statistically independent • Hidden Markov Models (HMMs): • easy training • robust modeling • Other models are also used (ANNs, kernel methods, hybrid systems), but HMMs are currently the state of the art

Knowledge sources: HMMs • Typical phone model topology • The nodes of the graph correspond to the states of a Markov chain, while the directed arcs correspond to the allowed transitions a_ij • A sequence of observations is regarded as an emission of a system which, at each time instant, makes a transition from one node to another, chosen randomly according to a node-specific probability distribution, and generates a random vector according to an arc-specific probability density. The number of states together with the set of arcs is usually called the model topology • In ASR it is common to use left-to-right topologies, in which a_ij = 0 for j < i • Usually the first and last states are also non-emitting, i.e. the source and final states serve only to set the initial and final probabilities
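A left-to-right topology and the search for the best state sequence can be illustrated with a tiny discrete HMM. Everything below is a simplified sketch with invented numbers: emissions are attached to states rather than arcs, observations are symbols instead of feature vectors, and the non-emitting entry and exit states are left out.

```python
import math

# Hypothetical 3-state left-to-right phone model (all probabilities made up).
# A[i][j] is the transition probability a_ij; a_ij = 0 for j < i.
A = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# B[i][o] is the probability that state i emits the discrete symbol o.
B = [
    {"x": 0.8, "y": 0.1, "z": 0.1},
    {"x": 0.1, "y": 0.8, "z": 0.1},
    {"x": 0.1, "y": 0.1, "z": 0.8},
]

def is_left_to_right(A):
    """True if no transition goes back to an earlier state."""
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))

def log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, A, B):
    """Best state path that starts in state 0 and ends in the last state."""
    n = len(A)
    score = [log(B[0][obs[0]])] + [float("-inf")] * (n - 1)
    back = []
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            score.append(prev[best] + log(A[best][j]) + log(B[j][o]))
            ptr.append(best)
        back.append(ptr)
    path = [n - 1]
    for ptr in reversed(back):  # follow the back-pointers from the final state
        path.append(ptr[path[-1]])
    return path[::-1], score[n - 1]

path, log_prob = viterbi(list("xxyyzz"), A, B)
```

Because the lower triangle of `A` is zero, the returned path can only stay in a state or move forward, which is exactly the left-to-right constraint.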

Knowledge sources: HMM linking

Knowledge sources: language model • Language model: a set of constraints on the sequences of words acceptable in a given language • The rules of a generative grammar G produce the sentences of a language LG(G) • G is a 4-tuple (V_T, V_N, s, P), where V_T is the set of all words of LG(G), V_N is a set of non-terminal symbols representing abstractions of language components (e.g. syntax), s is the category of all sentences in LG(G), and P is a set of rules a → b, where a is a sequence of symbols of which at least one belongs to V_N, and b is a sequence of symbols from V_T ∪ V_N • If a is always a single symbol of V_N, the grammar G is context-free • For natural languages it is impossible to conceive a grammar G capable of generating all and only the sentences of a language: there are no formal models of natural language • Heuristic solution: stochastic finite-state automata, i.e. an over-generating grammar for word pairs plus probabilities of the generated sentences (bigrams), i.e. HMMs • Integrated network: an automaton for each word, combining lexical models that describe pronunciation variants (phonemes) with acoustic models that describe the distribution of the acoustic parameters of phonemes
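The bigram idea, an over-generating word-pair grammar with probabilities attached, can be sketched as follows. The three-sentence corpus is invented, the estimates are plain relative frequencies without smoothing, and `<s>` and `</s>` are assumed sentence-boundary markers.

```python
from collections import Counter

def train_bigrams(sentences):
    """Maximum-likelihood estimates of P(w2 | w1) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])             # every word that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent word pairs
    return {pair: n / unigrams[pair[0]] for pair, n in bigrams.items()}

def sentence_probability(words, model):
    """Product of bigram probabilities; unseen pairs get probability 0."""
    padded = ["<s>"] + words + ["</s>"]
    p = 1.0
    for pair in zip(padded, padded[1:]):
        p *= model.get(pair, 0.0)
    return p

# Toy corpus, invented for illustration.
corpus = [
    ["to", "Kutno"],
    ["to", "Sopot"],
    ["from", "Kutno", "to", "Sopot"],
]
model = train_bigrams(corpus)
```

The model over-generates in the sense described above: `["from", "Kutno", "to", "Kutno"]` never occurs in the corpus, yet it receives a non-zero probability because each of its word pairs was seen.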

Statistical modeling approach for ASR • Bayesian approach: $\hat{W} = \arg\max_W P(W \mid A) = \arg\max_W \frac{P(A \mid W)\,P(W)}{P(A)}$ • $\hat{W}$: the most probable word sequence W given the acoustic input A • $P(W)$: a priori probability of the word string W • $P(A \mid W)$: probability of the acoustic sequence A given a word sequence W, computed as a distance to trained models • $P(A)$: a priori probability of the acoustic sequence A • Traditional HMM output: $P(A \mid W)$ (W: word model)
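A minimal numeric sketch of the Bayesian decision rule: each candidate word sequence W gets an acoustic score log P(A|W) and a language-model score log P(W), and the recognizer picks the candidate with the largest sum. The two candidates and all scores are invented for illustration.

```python
def decode(hypotheses):
    """Return the word sequence maximizing log P(A|W) + log P(W).

    P(A) is identical for every hypothesis, so it can be dropped from the argmax.
    """
    return max(hypotheses, key=lambda w: hypotheses[w][0] + hypotheses[w][1])

# word sequence -> (log P(A|W), log P(W)); made-up scores
hypotheses = {
    "Kutno": (-10.0, -2.0),
    "Sopot": (-9.5, -3.0),
}
best = decode(hypotheses)
acoustic_only = max(hypotheses, key=lambda w: hypotheses[w][0])
```

In this toy case the acoustic score alone prefers "Sopot", while the combined score prefers "Kutno", which shows how the prior P(W) can change the decision.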

What can be recognized?

Vocabulary:
(!ENTER{_SIL_}( Kutno | Sopot | Poznań | Lubin | Łuków | aleja Solidarności | Beskidy | Rzeszów )(!ENTER{_SIL_}

Lattice of models:
I=172 W=Jana
I=173 W=Jura
I=174 W=Kazimierz
J=0 S=1 E=0
J=1 S=1 E=1
J=2 S=2 E=0
J=3 S=2 E=1
J=4 S=3 E=0
J=5 S=3 E=1
J=6 S=4 E=0
J=7 S=4 E=1
J=8 S=5 E=0
J=9 S=5 E=1
J=10 S=6 E=0

Dictionary:
Kalisz        k a l i S
Kamienna      k a m j e n n a
Kaszuby       k a S u b I
Katowice      k a t o v i ts e
Kazimierz     k a zi i m j e Z
Kielce        k j e l ts e
Klakson       k l a k s o n
Kolor         k o l o r
Konopnickiej  k o n o p ni i ts k j e j
Konstytucji   k o n s t I t u ts j i
Koszalin      k o S a l i n
Kościuszki    k o si tsi u S k i
Krakowska     k r a k o f s k a
Krakowsko     k r a k o f s k o
Kraków        k r a k u f
Krzyki        k S I k i
Kujaw         k u j a f
Kutno         k u t n o
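To show how the dictionary links words to phone models, here is a small sketch that reads a few of the dictionary entries from this slide and expands a word sequence into the phone sequence whose models would be concatenated. The one-entry-per-line input format and the function names are assumptions; `_SIL_` is the silence symbol that appears in the grammar above.

```python
def parse_dictionary(text):
    """Map each word to its phone list; one entry per line: WORD PHONE PHONE ..."""
    lexicon = {}
    for line in text.strip().splitlines():
        word, *phones = line.split()
        lexicon[word] = phones
    return lexicon

DICTIONARY = """\
Kalisz k a l i S
Katowice k a t o v i ts e
Kraków k r a k u f
Kutno k u t n o
"""

lexicon = parse_dictionary(DICTIONARY)

def expand(words, lexicon):
    """Phone sequence for a word sequence, with silence at both ends."""
    phones = ["_SIL_"]
    for w in words:
        phones += lexicon[w]  # raises KeyError for out-of-vocabulary words
    return phones + ["_SIL_"]
```

A word that is missing from the dictionary cannot be expanded, which is one concrete sense in which the vocabulary limits what can be recognized.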

Speaker-independent, continuous-speech ASR is now possible • Digit recognition over the telephone with a word error rate of 0.3% • Error rate cut in half every two years for moderate-vocabulary tasks • Error rates for spontaneous speech are more than twice those for read speech • Conversational speech, involving multiple speakers and poor acoustic environments, remains a challenge • Tens of hours of training data are needed to port to a different domain • Statistical modelling with automatic training achieves significant advances • Figure (MIT, 2005): tasks shown are digits, 1k read, 2k spontaneous, 20k read, 64k broadcast, 10k conversational; logarithmic axis with ticks 100, 10, 1, 0.1
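The word error rate quoted on this slide is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the recognizer output and a reference transcript, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and a non-empty reference:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For the reference "one two three four" and the hypothesis "one too three" there is one substitution and one deletion, so the function returns 0.5, i.e. a word error rate of 50%.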

Text-To-Speech • Text preprocessing • Prosody generation • Acoustic output • Word descriptions • Festival Speech Synthesis: steps to synthesize a sentence • Text • Token_POS • Token • POS • Word • Phrasify • Pauses • Intonation • PostLex • Duration • Int_Targets • Wave_Synth

Speech synthesis • Acoustic output: • pre-recorded speech • articulatory synthesis (formant synthesis): tries to mimic human voice generation • Frankfurt • concatenative synthesis: builds utterances from stored units • phonemes • diphones: transitions between two phonemes • Festival • unit selection: units of different lengths, context-dependent selection (maximum length of natural speech sequence) • RealSpeak • ATR Japan
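For concatenative synthesis with diphones, the phone sequence of an utterance first has to be turned into the list of transitions to fetch from the unit inventory. A minimal sketch; the `a-b` unit naming and the `_` silence symbol are assumptions made for illustration, not the naming used by any particular synthesizer.

```python
def to_diphones(phones, silence="_"):
    """Turn a phone sequence into diphone unit names, padded with silence."""
    padded = [silence] + list(phones) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

units = to_diphones(["k", "u", "t", "n", "o"])
```

For the five phones of "Kutno" this yields six units, from `_-k` to `o-_`: a sequence of n phones always needs n + 1 diphones.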

Example: Tokens, syllables and phones

Phrasing and Intonation

Dialog system MIT 2005

Dialog systems • Application-dependent (dialog structure and content) • Finite-state dialog systems (usually domain-dependent) • Chatterbot systems (domain-independent?) • Initiative (machine, human, or mixed) • Concept detection and spotting (find the important stuff in the utterance and draw conclusions) • Concept and text generation (generate a context-dependent answer) • Examples: IVR and reservation systems, which are sometimes still not perfect...
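A finite-state, machine-initiative dialog with simple concept spotting might be sketched as below. The states, prompts and city list are hypothetical (the city names are borrowed from the recognition vocabulary shown earlier), and the user turns are plain strings standing in for recognizer output.

```python
# Hypothetical reservation dialog: state -> (prompt, slot to fill, next state)
STATES = {
    "ask_origin": ("Where are you leaving from?", "origin", "ask_destination"),
    "ask_destination": ("Where are you going?", "destination", "done"),
}
CITIES = {"Kutno", "Sopot", "Lubin"}

def spot_concept(utterance):
    """Concept spotting: return the first known city in the utterance, if any."""
    for word in utterance.replace(",", " ").split():
        if word in CITIES:
            return word
    return None

def run_dialog(user_turns):
    """Drive the state machine over a list of user utterances."""
    state, slots, prompts = "ask_origin", {}, []
    for utterance in user_turns:
        if state == "done":
            break
        prompt, slot, next_state = STATES[state]
        prompts.append(prompt)  # the machine holds the initiative
        city = spot_concept(utterance)
        if city is not None:
            slots[slot] = city
            state = next_state
        # otherwise stay in the same state and repeat the prompt
    return state, slots, prompts

state, slots, prompts = run_dialog(
    ["I want to leave from Kutno", "ehm", "to Sopot please"]
)
```

The second user turn contains no known concept, so the system stays in `ask_destination` and asks again. That is the typical behaviour, and the typical rigidity, of a finite-state design.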

Going beyond… • Add new dimensions to MMI (para- and extralinguistic features) • avatars • personality of the dialog partners • speaker's profile • reaction to the speaker's emotion and emotional synthesis (rad)

Generation of word hypotheses: Speech recognition De Mori, 99

Part II HUMAN TO HUMAN COMMUNICATION

Dialog architecture De Mori, 99

Domains of verbal communication

SPEECH             SPEAKER            LISTENER
Psycholinguistics  Utterance forming  Understanding
Physiology         Articulation       Hearing
Acoustics          Speech acoustics   Psychoacoustics

Speaker side: generation of speech. Listener side: perception of speech.

What and how • Eyes – visual information • Ears – sound information • Nose – smell information • Tongue – taste information • Skin, muscles, touch receptors – touch and proprio-kinesthetic information • Proprio-kinesthetic feedback mechanisms include awareness of the movement and location of the fingers in space, internal monitoring of rhythm and rate, and grip

Organs involved in the production of information by humans • Articulatory organs – sounds (speech) • Movement and action organs – gestures, writing, facial expressions, mechanical actions, etc.

Organs which may be involved in human-to-human communication • Articulatory organs – hearing: speech • Articulatory organs – seeing: lip reading • Movement and action organs – seeing: writing, gestures, sign language, Braille writing

Levels of communication • Linguistic information • Articulatory information (phonetics) • Emotional information • Personal information • Information on organic speech disorders • Information on neurogenic speech disorders • Culture, habitats, social information

Speech – spoken language • Writing – written language • Signs – sign language (Polish, German, English, etc.) • Language – a system of characters and of phonological, semantic and syntactic rules that allow these characters to be combined • Language is the basis of all human-to-human communication

Sentence generation scheme • Syntax component • Phrasing rules • Lexicon • Deep structure • Semantic component • Semantic sentence interpretation • Phonological sentence interpretation • Surface structure • Phonological component • Transformation rules

Beep • Keep • Leep: an example of how the meaning changes when the phonological structure of a word changes • Phonology, a part of natural language processing, describes phonemes and the relations between them

Articulation

Main articulatory elements of the vocal tract: nasal cavity • lips • tongue • glottis

Types of sounds • Sound classification is based on the manner and place of articulation, i.e. where the constriction in the vocal tract is and where the sound is generated • Manner of articulation: • Vowels • Plosives: /p/, /g/ • Nasals: /m/, /n/ • Taps or trills: /r/ • Fricatives: /s/, /f/, /v/ • Approximants: /j/, /w/ • Place of articulation: • Bilabial • Labiodental • Dental, Alveolar, Postalveolar • Retroflex • Palatal • Velar • Uvular • Pharyngeal • Glottal • Other classes: resonants, obstruents, affricates, diphthongs

Typical vocal tract configurations • Vowel articulation: front - back, high - low • Plosive articulation: front (of the tongue), back (of the tongue)

Typical vocal tract configurations • Front fricative • Front lateral approximant

Comparison of airflow over nose and mouth

Tongue profiles

Phonetic transcription

IPA code - vowels

IPA consonants

SAMPA American English

Consonants: 24
Symbol  Word     Transcription
p       pin      pIn
b       bin      bIn
t       tin      tIn
d       din      dIn
k       kin      kIn
g       give     gIv
tS      chin     tSIn
dZ      gin      dZIn
f       fin      fIn
v       vim      vIm
T       thin     TIn
D       this     DIs
s       sin      sIn
z       zing     zIN
S       shin     SIn
Z       measure  "mEZ@`
h       hit      hIt
m       mock     mAk
n       knock    nAk
N       thing    TIN
r       wrong    rON
l       long     lON
w       wasp     wAsp
j       yacht    jAt

Vowels: 17
Symbol  Word     Transcription
I       pit      pIt
E       pet      pEt
{       pat      p{t
A       pot      pAt
V       cut      kVt
U       put      pUt
i       ease     iz
e       raise    rez
u       lose     luz
o       nose     noz
O       cause    kOz
aI      rise     raIz
OI      noise    nOIz
aU      rouse    raUz
3`      furs     f3`z
@       allow    @"laU
@`      corner   "kOrn@`
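Because some SAMPA symbols are two characters long (tS, dZ, aI, OI, aU, 3`, @`), a transcription has to be split with care before the symbols can be looked up. The sketch below uses greedy longest-match over the symbol inventory listed on this slide; the function name and the handling of the stress mark `"` as its own token are assumptions.

```python
CONSONANTS = ["p", "b", "t", "d", "k", "g", "tS", "dZ", "f", "v", "T", "D",
              "s", "z", "S", "Z", "h", "m", "n", "N", "r", "l", "w", "j"]
VOWELS = ["I", "E", "{", "A", "V", "U", "i", "e", "u", "o", "O",
          "aI", "OI", "aU", "3`", "@", "@`"]
# Longest symbols first, so that "tS" is tried before "t" and "@`" before "@".
SYMBOLS = sorted(CONSONANTS + VOWELS + ['"'], key=len, reverse=True)

def tokenize(transcription):
    """Split a SAMPA string into symbols using greedy longest-match."""
    tokens, pos = [], 0
    while pos < len(transcription):
        for sym in SYMBOLS:
            if transcription.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:  # no symbol matched at this position
            raise ValueError(f"unknown SAMPA symbol at position {pos}")
    return tokens
```

For example, `tokenize("tSIn")` gives `["tS", "I", "n"]`. Greedy matching cannot tell an affricate tS from a t followed by S across a syllable boundary, so a production tool would need explicit separators for such cases.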

Transcription by hand
