
Introduction to Speech Signal Processing





Presentation Transcript


  1. Introduction to Speech Signal Processing Dr. Zhang Sen zhangsen@gscas.ac.cn Chinese Academy of Sciences Beijing, China 2014/9/22

  2. Introduction • Sampling and quantization • Speech coding • Features and Analysis • Main features • Some transformations • Speech-to-Text • State of the art • Main approaches • Text-to-Speech • State of the art • Main approaches • Applications • Human-machine dialogue systems

  3. Some useful websites for ASR tools http://htk.eng.cam.ac.uk: free, available since 2000, with ties to Microsoft; over 12,000 users; versions 2.1, 3.0, 3.1, 3.2; includes source code and the HTK Book; a set of tools for training, decoding, and evaluation (Steve Young, Cambridge University). http://www.cs.cmu.edu: free for research and education; Sphinx 2 and 3; tools, source code, speech databases (Raj Reddy, CMU)

  4. Research on speech recognition in the world

  5. Carnegie Mellon University: CMU SCS Speech Group, Interact Lab. Oregon Graduate Institute: Center for Spoken Language Understanding. MIT: Lab for Computer Science (Spoken Language Systems), Acoustics & Vibration Lab, AI Lab; Lincoln Lab: Speech Systems Technology Group. Stanford University: Center for Computer Research in Music and Acoustics, Center for the Study of Language and Information

  6. University of California: Berkeley, Santa Cruz, Los Angeles. Boston University: Signal Processing and Interpretation Lab. Georgia Institute of Technology: Digital Signal Processing Lab. Johns Hopkins University: Center for Language and Speech Processing. Brown University: Lab for Engineering Man-Machine Systems. Mississippi State University. Colorado University. Cornell University

  7. Cambridge University: Speech, Vision and Robotics Group. Edinburgh University: Human Communication Research Centre, Centre for Speech Technology Research. University College London: Phonetics and Linguistics. University of Essex: Dept. of Language and Linguistics

  8. LIMSI, France. INRIA: Institut National de Recherche en Informatique et Automatique. University of Karlsruhe, Germany: Interactive Systems Lab. DFKI: German Research Center for Artificial Intelligence. KTH: Speech Communication & Music Acoustics. CSELT, Italy: Centro Studi e Laboratori Telecomunicazioni, Torino. IRST: Istituto per la Ricerca Scientifica e Tecnologica, Trento. ATR, Japan

  9. AT&T, Advanced Speech Product Group. Lucent Technologies, Bell Laboratories. IBM, IBM VoiceType. Texas Instruments Incorporated. National Institute of Standards and Technology. Apple Computer Co. Digital Equipment Corporation (DEC). SRI International. Dragon Systems Co. Sun Microsystems Labs, speech applications. Microsoft Corporation, speech technology (SAPI). Entropic Research Laboratory, Inc.

  10. Important conferences and journals: IEEE Trans. on ASSP; ICASSP (every year); EUROSPEECH (every odd year); ICSLP (every even year); STAR (Speech Technology and Research at SRI)

  11. Brief history and state-of-the-art of the research on speech recognition

  12. ASR Progress Overview 50'S: ISOLATED DIGIT RECOGNITION (BELL LABS) 60'S: HARDWARE SPEECH SEGMENTER (JAPAN); DYNAMIC PROGRAMMING (U.S.S.R.) 70'S: CLUSTERING ALGORITHMS (SPEAKER INDEPENDENCE); DTW 80'S: HMM, DARPA, SPHINX 90'S: ADAPTATION, ROBUSTNESS

  13. 1952 Bell Labs Digits • First word (digit) recognizer • Approximates energy in formants (vocal tract resonances) over the word • Already had some robust ideas (insensitive to amplitude and timing variation) • Worked very well • Main weakness was technological (resistors and capacitors)

  14. The 60's • Better digit recognition • Breakthroughs: spectrum estimation (FFT, cepstra, LPC), Dynamic Time Warping (DTW), and Hidden Markov Model (HMM) theory • Hardware speech segmenter (Japan)

  15. 1971-76 ARPA Project • Focus on speech understanding • Main work at 3 sites: System Development Corporation, CMU and BBN • Other work at Lincoln, SRI, Berkeley • Goal was 1000-word ASR, a few speakers, connected speech, constrained grammar, less than 10% semantic error

  16. Results • Only CMU Harpy fulfilled the goals - used LPC, segments, lots of high-level knowledge, learned from Dragon* (Baker) • *The CMU system done in the early '70s, as opposed to the company formed in the '80s

  17. Achieved by 1976 • Spectral and cepstral features, LPC • Some work with phonetic features • Incorporating syntax and semantics • Initial Neural Network approaches • DTW-based systems (many) • HMM-based systems (Dragon, IBM)

  18. Dynamic Time Warp • Optimal time normalization with dynamic programming • Proposed by Sakoe and Chiba, circa 1970 • A similar proposal by Itakura around the same time • Probably Vintsyuk was first (1968)
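
To make the dynamic-programming idea concrete, here is a minimal DTW sketch in Python (not from the slides; it uses a 1-D absolute-difference local cost for readability, whereas real ASR front ends compare multidimensional feature vectors, e.g. with Euclidean distance):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW: cost of the optimal monotonic alignment of two sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # local distance (1-D here)
            # allow diagonal (match), vertical and horizontal steps
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# Two renditions of the same "word" at different speaking rates
slow = np.array([0, 0, 1, 2, 3, 3, 2, 1, 0, 0], dtype=float)
fast = np.array([0, 1, 2, 3, 2, 1, 0], dtype=float)
print(dtw_distance(slow, fast))  # small cost despite the length mismatch
```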

  19. HMMs for Speech • Math from Baum and others, 1966-1972 • Applied to speech by Baker in the original CMU Dragon System (1974) • Developed by IBM (Baker, Jelinek, Bahl, Mercer, ...) (1970-1993) • Extended by others in the mid-1980's

  20. The 1980’s • Collection of large standard corpora • Front ends: auditory models, dynamics • Engineering: scaling to large vocabulary continuous speech • Second major (D)ARPA ASR project • HMMs become ready for prime time

  21. Standard Corpora Collection • Before 1984, chaos • TIMIT • RM (later WSJ) • ATIS • NIST, ARPA, LDC

  22. Front Ends in the 1980’s • Mel cepstrum (Bridle, Mermelstein) • PLP (Hermansky) • Delta cepstrum (Furui) • Auditory models (Seneff, Ghitza, others)

  23. Dynamic Speech Features • Temporal dynamics are useful for ASR • Local time derivatives of cepstra • "Delta" features estimated over multiple frames (typically 5) • Usually augment the static features • Can be viewed as a temporal filter; a sketch of the standard computation follows
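
The regression formula behind delta features, in a minimal sketch (assuming the HTK-style form d[t] = Σₙ n·(c[t+n] − c[t−n]) / (2·Σₙ n²) over a window of ±N frames; N = 2 gives the typical 5-frame window):

```python
import numpy as np

def delta(features, N=2):
    """First-order delta features over a (T, D) array of static cepstra."""
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")  # repeat edge frames
    deltas = np.zeros_like(features)
    for t in range(T):
        deltas[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)
        ) / denom
    return deltas

# Typical use: stack static and dynamic features per frame
# feats = np.hstack([mfcc_frames, delta(mfcc_frames)])
```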

  24. HMMs for Continuous Speech • Using dynamic programming for continuous speech (Vintsyuk, Bridle, Sakoe, Ney, ...) • Application of Baker-Jelinek ideas to continuous speech (IBM, BBN, Philips, ...) • Multiple groups developing major HMM systems (CMU, SRI, Lincoln, BBN, ATT) • Engineering development - coping with data, fast computers

  25. 2nd (D)ARPA Project • Common task • Frequent evaluations • Convergence to good, but similar, systems • Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error • Competition inspired others not in the project - Cambridge did HTK, now widely distributed

  26. Some 1990's Issues • Independence from the long-term spectrum • Adaptation • Effects of spontaneous speech • Information retrieval/extraction with broadcast material • Query-style systems (e.g., ATIS) • Applying ASR technology to related areas (language ID, speaker verification)

  27. Real Uses • Telephone: phone company services (collect versus credit card) • Telephone: call centers for query information (e.g., stock quotes, parcel tracking) • Dictation products: continuous recognition, speaker dependent/adaptive

  28. State-of-the-art of ASR Tremendous technical advances in the last few years: • From small to large vocabularies: from 5,000-10,000 words to 10,000-60,000 words • From isolated words to spontaneous talk: continuous speech recognition; conversational and spontaneous speech recognition • From speaker-dependent to speaker-independent: modern ASR is fully speaker independent

  29. SOTA ASR Systems IBM ViaVoice: speaker-independent, continuous command recognition; large-vocabulary recognition; text-to-speech confirmation; barge-in (the ability to interrupt an audio prompt as it is playing). Microsoft: Whisper, Dr. Who

  30. SOTA ASR Systems DARPA 1982 GOALS: HIGH ACCURACY; REAL-TIME PERFORMANCE; UNDERSTANDING CAPABILITY; CONTINUOUS SPEECH RECOGNITION. DARPA DATABASES: 997 WORDS (RM); ABOVE 100 SPEAKERS; TIMIT

  31. SOTA ASR Systems SPHINX II (CMU): HMM-based speech recognition; bigram and word-pair language models; generalized triphones; DARPA database; 97% recognition (perplexity 20). SPHINX III: continuous-HMM (CHMM) based; WER about 15% on WSJ

  32. ASR Advances (user population; speech style; complexity; noise environment, by year)
1985: speaker-dependent; careful reading; application-specific speech and language, expert years to create an application-specific language model; quiet room, fixed high-quality mic
1995: speaker independent and adaptive; planned speech; some application-specific data and one engineer year; normal office, various microphones, telephone
2000: regional accents, native speakers, competent foreign speakers; natural human-machine dialog (user can adapt); vehicle noise, radio, cell phones
2005: all speakers of the language including foreign; all styles including human-human (unaware); application independent or adaptive; wherever speech occurs

  33. But • Still <97% accurate on "yes" for telephone • An unexpected rate of speech causes a doubling or tripling of the error rate • An unexpected accent hurts badly • Accuracy on unrestricted speech is at 60% • We don't know when we know • Few advances in basic understanding

  34. How to Measure the Performance? What benchmarks? DARPA, NIST (hub-4, hub-5, ...). What was the training set? What was the test set? Were they independent? What were the vocabulary and the sample size? Was the noise added or coincident with the speech? What kind of noise?

  35. ASR Performance [Chart: word error rate (WER) versus level of difficulty, rising from near 0% for digits, command and control, continuous digits, and letters and numbers, through roughly 10% for read speech and 20-30% for broadcast news, to about 40% for conversational speech] • Spontaneous telephone speech is still a "grand challenge". • Telephone-quality speech is still central to the problem. • Broadcast news is a very dynamic domain.
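
For reference, the WER plotted above is the word-level edit distance between the recognizer output and a reference transcript, normalized by the reference length. A minimal sketch (not tied to any particular scoring toolkit; NIST's sclite is the usual tool):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    via the usual Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("how are you today", "how warm you today"))  # 0.25
```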

  36. Machine vs Human Performance [Chart: word error rate versus speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB) on Wall Street Journal with additive noise; machine WER sits well above that of human listeners (a committee) at every SNR] • Human performance exceeds machine performance by a factor ranging from 4x to 10x depending on the task. • On some tasks, such as credit card number recognition, machine performance exceeds humans due to human memory retrieval capacity. • The nature of the noise is as important as the SNR (e.g., cellular phones). • A primary failure mode for humans is inattention. • A second major failure mode is lack of familiarity with the domain (i.e., business terms and corporation names).

  37. Core technology for ASR

  38. Why is ASR Hard? • Natural speech is continuous • Natural speech has disfluencies • Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts

  39. Why is ASR Hard? (continued) • Large vocabularies are confusable • Out-of-vocabulary words are inevitable • Recorded speech is variable over: room acoustics, channel characteristics, background noise • Large training times are not practical • User expectations are for performance equal to or greater than "human performance"

  40. Main Causes of Speech Variability • Environment: speech-correlated noise (reverberation, reflection); uncorrelated noise (additive noise, stationary or nonstationary) • Speaker: attributes of speakers (dialect, gender, age); manner of speaking (breath and lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness) • Input equipment: microphone (transmitter); distance from microphone; filter; transmission system (distortion, noise, echo); recording equipment

  41. ASR Dimensions • Speaker dependent, independent • Isolated, continuous, keywords • Lexicon size and difficulty • Task constraints, perplexity • Adverse or easy conditions • Natural or read speech

  42. Telephone Speech • Limited bandwidth (F vs S) • Large speaker variability • Large noise variability • Channel distortion • Different handset microphones • Mobile and handsfree acoustics

  43. What is Speech Recognition? Speech Signal → Speech Recognition → Words ("How are you?"). Related areas: Who is the talker? (speaker recognition, identification) What language did he speak? (language recognition) What is his meaning? (speech understanding)

  44. What is the problem? Find the most likely word sequence Ŵ among all possible sequences given acoustic evidence A. A tractable reformulation of the problem is Ŵ = argmax_W P(W|A) = argmax_W P(A|W)·P(W), where P(A|W) is the acoustic model, P(W) is the language model, and the argmax over all candidate word sequences is a daunting search task.
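
A toy rendering of that argmax (all probabilities below are made-up illustrative numbers, not from the slides), using the classic "recognize speech" / "wreck a nice beach" confusion:

```python
import math

acoustic = {"recognize speech": 0.0008, "wreck a nice beach": 0.0010}  # P(A|W)
language = {"recognize speech": 0.0200, "wreck a nice beach": 0.0002}  # P(W)

# argmax over W of log P(A|W) + log P(W); logs avoid underflow in practice
best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))
print(best)  # "recognize speech": the language model outweighs the acoustics
```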

  45. View ASR as Pattern Recognition Analog speech → front end → observation sequence → decoder → best word sequence. The decoder draws on an acoustic model, a language model, and a dictionary.

  46. View ASR in Hierarchy Speech waveform → feature extraction (signal processing) → spectral feature vectors → phone likelihood estimation (Gaussians or neural networks) → phone likelihoods P(o|q) → decoding (Viterbi or stack decoder, using an HMM lexicon and an N-gram grammar) → words. A sketch of the Viterbi step follows.
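
To make the decoding box concrete, a minimal Viterbi sketch in log space (an illustrative toy that assumes the per-frame phone likelihoods P(o|q) above have already been computed; real decoders also compose the HMM lexicon and N-gram grammar into one large state graph):

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Most likely HMM state path given per-frame observation log likelihoods.
    log_A:  (S, S) log transition matrix
    log_B:  (T, S) log P(o_t | q = s) for each frame t and state s
    log_pi: (S,)   log initial state probabilities"""
    T, S = log_B.shape
    delta = np.empty((T, S))           # best path score ending in state s at t
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # (previous state, current state)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = [int(delta[-1].argmax())]   # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())
```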

  47. Front-End Processing [Figure: front-end processing pipeline including dynamic features, after K.F. Lee]

  48. Feature Extraction GOALS: LESS COMPUTATION AND MEMORY; A SIMPLE REPRESENTATION OF THE SIGNAL. METHODS: FOURIER-SPECTRUM BASED: MFCC (mel-frequency cepstrum coefficients), LFCC (linear-frequency cepstrum coefficients), filter-bank energies. LINEAR-PREDICTION-SPECTRUM BASED: LPC (linear predictive coding), LPCC (linear predictive cepstrum coefficients). OTHERS: ZERO CROSSINGS, PITCH, FORMANTS, AMPLITUDE. An LPC sketch follows; cepstra and MFCCs are covered on the next slides.
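
Of the linear-prediction family above, the core computation is the LPC analysis itself. A minimal sketch via the autocorrelation method and the Levinson-Durbin recursion (the order of 12 is a common choice for 8-16 kHz speech, not a value from the slides):

```python
import numpy as np

def lpc(frame, order=12):
    """LPC coefficients a[0..order] (a[0] = 1) and the residual energy."""
    # autocorrelation r[0..order] of the windowed frame
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err  # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]                 # update earlier coefficients
        a[i] = k
        err *= 1.0 - k * k                          # shrink the prediction error
    return a, err
```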

  49. Cepstrum Computation The cepstrum is the inverse Fourier transform of the log spectrum: c[n] = IDFT(log |DFT(x[n])|). In computation the IDFT takes the form of a weighted DCT; see HTK.
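
A direct rendering of that definition (a sketch using NumPy; the 25 ms frame length, Hamming window, and log floor are illustrative choices, not values from the slides):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one windowed frame: IDFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # small floor avoids log(0)
    return np.fft.irfft(log_mag)

# Example on a synthetic 25 ms frame at 16 kHz
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.hamming(len(t)) * np.sin(2 * np.pi * 200 * t)
print(real_cepstrum(frame)[:5])  # low quefrencies carry the spectral envelope
```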

  50. Mel Cepstral Coefficients Construct the mel-frequency domain by applying a triangularly-shaped weighting function to mel-transformed log-magnitude spectral samples. The filter bank is linear below 1 kHz and logarithmic above 1 kHz, motivated by human auditory response characteristics. Pipeline: FFT, mel filter bank, log, DCT.
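
A compact sketch of that FFT → mel filter bank → log → DCT pipeline (assuming NumPy/SciPy; the filter count, FFT size, and 13 retained coefficients are common defaults rather than values from the slides; the mel warping itself is what makes the bank roughly linear below 1 kHz and logarithmic above):

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs=16000, n_filters=26, n_ceps=13, n_fft=512):
    """MFCCs for one windowed frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # triangular filters with centers equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energy = np.log(fbank @ power + 1e-10)   # log filter-bank energies
    return dct(log_energy, type=2, norm="ortho")[:n_ceps]
```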
