Design of Keyword Spotting System Using Segmental Time Warping

Speech Processing Laboratory, Temple University • Design of Keyword Spotting System Based on Segmental Time Warping of Quantized Features • Presented by: Piush Karmacharya • Thesis Advisor: Dr. R. Yantorno • Committee Members: Dr. Joseph Picone, Dr. Dennis Silage • Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Outline • Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

Introduction • Branch of a more sophisticated stream – Speech Recognition • Identify a keyword in a written document or an audio stream (recorded or real time) • Confusion Matrix • True Positive – Hits • True Negative • False Negative – Misses • False Positive – False Alarms (FA) • Results – keyword location, if present

Introduction.. • Speaker dependent • High accuracy, limited application • Speaker Independent • Lower accuracy, Wide application • Performance Evaluation – hits, misses and false alarms • Receiver Operating Characteristic – Hits vs FA • Design Objective – maximize hits while keeping false alarms low • Accuracy -
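The hit, miss and false-alarm counts above can be sketched as a small scoring routine. This is a minimal illustration, not the evaluation code used in the thesis: the function name `score_detections`, the idea of matching each detection to at most one true location, and the `tolerance` parameter are all assumptions made for the example.

```python
def score_detections(detected, truth, tolerance):
    """Count hits, misses and false alarms for one keyword.

    detected, truth: lists of keyword locations (e.g. frame indices).
    tolerance: hypothetical maximum location error for a detection to count as a hit.
    """
    matched = set()  # indices of true occurrences already claimed by a detection
    hits = 0
    false_alarms = 0
    for d in detected:
        # Look for a not-yet-matched true occurrence close enough to this detection
        match = next(
            (i for i, t in enumerate(truth)
             if i not in matched and abs(d - t) <= tolerance),
            None,
        )
        if match is None:
            false_alarms += 1
        else:
            matched.add(match)
            hits += 1
    misses = len(truth) - hits
    return hits, misses, false_alarms
```

Sweeping a detection threshold and plotting hits against false alarms from such counts gives one point per threshold on the Receiver Operating Characteristic.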

Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques • Related Research • System Design • Results • Future Work

Motivation • Speech – Most general form of human communication • Information - embedded in redundant words • I would like to have french toast for breakfast. • Non-intentional sounds – cough, exclamation, noise • Efficient human-machine interface • Applications: Audio Document retrieval, Surveillance Systems, Voice commands/dialing

Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work

Challenges • Similar (1, 3, 6; 13, 14) and different keywords • Variation in length (4 – around 1500 samples, 14 – 4100 samples) • Different instances of the keyword REALLY from different speakers • /KINGLISHA/ /Hey you have to speak English for some English/

Common Approaches • Template Based Approach • Hidden Markov Models • Neural Network • Hybrid Methods • Discriminative Methods

Template Matching • Started 1970’s • One (or more) keyword templates available • Search string – keyword; Search Space - the utterance • Flexible time search - Dynamic Time Warping • 1971, H Sakoe, S Chiba • Suitable for small scale applications • Drawback • Segment the utterance into isolated words • Fails to learn from the existing speech data

Dynamic Time Warping • Time stretch/compress one signal so that it aligns with the other signal • Extremely efficient time-series similarity measure • Minimizes the effects of shifting and distortion • Prototype of the test keyword stored as a template; compared to each word in the incoming utterance

Dynamic Time Warping • Reference and test keyword arranged along two sides of the grid • Template keyword – vertical axis, test keyword – horizontal axis • Each block in the grid – distance between corresponding feature vectors • Best match – path through the grid that minimizes cumulative distance • But the number of possible paths increases exponentially with length!

DTW • Constraints • Monotonic condition: no backward steps • Continuity condition: no break in path • Adjustment window: optimal path does not wander away from the diagonal • Boundary condition: starting/ending points fixed • Constraints can be manipulated as desired (e.g. for connected word recognition [Myers, C.; Rabiner, L.; Rosenberg, A.; 1980])
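The grid search and the four constraints above can be sketched with dynamic programming, which avoids enumerating the exponentially many paths. This is a generic textbook-style sketch, not the thesis implementation: the function name `dtw_distance`, the Euclidean local distance, and the three-way step pattern are assumptions chosen for illustration.

```python
import numpy as np

def dtw_distance(template, test, window=None):
    """Cumulative DTW distance between two sequences of feature vectors.

    window: half-width of the adjustment window around the diagonal
            (None means no window constraint).
    """
    n, m = len(template), len(test)
    if window is None:
        window = max(n, m)
    # The band must be wide enough to reach the fixed end point (n, m)
    window = max(window, abs(n - m))
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0  # boundary condition: path starts at the first frames
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            a = np.asarray(template[i - 1], dtype=float)
            b = np.asarray(test[j - 1], dtype=float)
            local = np.linalg.norm(a - b)
            # Monotonic and continuous: only come from the left, below, or diagonal
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]  # boundary condition: path ends at the last frames
```

The table has n x m cells and each cell looks at three neighbours, so the cost is proportional to the product of the two lengths (less when the window is narrow).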

Hidden Markov Models • 1988 – Lawrence R. Rabiner • Statistical model – Hidden States/Observable Outputs • Emission probability – p(x|q1) • Transition probability – p(q2|q1) • First order Markov Process – probability of next state depends only on current state • Infer output given the underlying system • Estimate most likely system from observed sequence of outputs

HMM • KWS Implementation • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models • Limitation • Large amount of training data required • Training data has to be transcribed at word level and/or phone level • Transcribed data costs time and money • Not available in all languages

Neural Networks • Late 90’s • Classifier – learns from existing data • Multi-layer of interconnected nodes (neurons) • Different weights assigned to inputs; updated in every iteration • Requires large amount of transcribed data for training • Hybrid Systems – HMM/NN • Discriminative Approaches – Support Vector Machines

Related Work • Segmental Dynamic Time Warping – Alex S. Park, James R. Glass, 2008 • Segmentation into isolated words not required • Choose starting point and adjustment window size • Proposed breaking words into smaller Acoustic Units • Speech – sequence of sounds • Acoustic units • Phonemes – Timothy J. Hazen, Wade Shen, Christopher White, 2009 • Gaussian Mixture Models (GMMs) – Yaodong Zhang, James R. Glass, 2009

Related Work • Phonetic Posteriorgrams • Phonemes as Acoustic unit • Gaussian Posteriorgrams • Acoustic unit - GMMs • Posteriorgram - Probability vector representing the posterior probabilities of a set of classes for a speech frame • Every speech frame associated with one or more phonemes /SH/ /AA/ /ER/

Research Methodology • Acoustic Unit – Mean of the cluster • Simple – K-means clustering • Likelihood – Euclidean distance from the cluster centroid • Segmental Dynamic Time Warping – Keyword Detection • Covariance information not required • Corpus • Call-Home Database • 30 minutes of stereo conversations over telephone channel • Switchboard Database • 2,400 two-sided telephone conversations among 543 speakers

Steps • Training • Keyword Template Processing • Keyword Detection • Training flow: Training Speech → Feature Extraction → K-means Clustering → Trained Cluster → Distance Matrix • Training Speech • Diverse sounds • Diverse speakers

Speech Processing • Feature vector – MFCC • Flow: Speech Signal → Pre-Emphasis → Windowing → FFT → Mel-Scaling → Log → DCT → MFCC • Pre-Emphasis • High Pass Filter • š[n] = s[n] - αs[n - 1]; α = 0.95 • Speech spectrum falls off at high frequencies • Emphasizes higher formants • Windowing • Short-time stationary • Divide speech into short frames (20 ms with 5 ms spacing) • Mel-Scaling • Model Human Perception • Multiplying with Filter banks
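The pre-emphasis filter and the framing step above can be sketched as follows. The filter coefficient (0.95), frame length (20 ms) and frame spacing (5 ms) come from the slide; the function names, the Hamming window, and the 8 kHz sample rate used in the usage note are assumptions for illustration only.

```python
import numpy as np

def pre_emphasize(signal, alpha=0.95):
    """High-pass filter: out[n] = s[n] - alpha * s[n - 1] (first sample kept as-is)."""
    signal = np.asarray(signal, dtype=float)
    out = signal.copy()
    out[1:] = signal[1:] - alpha * signal[:-1]
    return out

def frame_signal(signal, sample_rate, frame_ms=20, hop_ms=5):
    """Cut the signal into overlapping frames and apply a window to each frame.

    The Hamming window is an assumption; the slide does not name the window type.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row r holds samples r*hop_len ... r*hop_len + frame_len - 1
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)
```

For one second of speech at an assumed 8 kHz telephone sample rate, `frame_signal(pre_emphasize(x), 8000)` yields frames of 160 samples spaced 40 samples apart; each frame would then go through FFT, Mel filter banks, log and DCT to produce the MFCC vector.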

K-Means Clustering • Feature space populated by the entire training data • 1. Select k random cluster centers • 2. Each data point finds the center it is closest to and associates itself with it • 3. Each cluster finds the centroid of the points it owns • 4. Centroid updated with the new means • 5. Repeat steps 2 to 4 until convergence
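The clustering loop above can be sketched in a few lines of NumPy. This is a plain K-means sketch under stated assumptions, not the thesis code: the names `kmeans` and `assign`, the iteration cap `n_iter`, the fixed random `seed`, and the choice to leave an empty cluster's centroid unchanged are all illustrative.

```python
import numpy as np

def assign(data, centroids):
    """Index of the closest centroid (Euclidean distance) for every data point."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(data, k, n_iter=100, seed=0):
    """Return (centroids, labels) after K-means clustering of the rows of data."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial cluster centers
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(data, centroids)
        new_centroids = centroids.copy()
        for c in range(k):
            members = data[labels == c]
            if len(members):  # an empty cluster keeps its old centroid
                new_centroids[c] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assign(data, centroids)
```

In the setup described later in the deck, `data` would hold the 39-dimensional feature vectors of the training speech and `k` would be 64, 128 or 256.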

Distance Matrix • A feature far away from the centroid might fall into an adjacent cluster • Likelihood Measure – Euclidean Distance • Vectors in regions 3, 4 and 5 are closer to region 1 than to region 6 • 2-D distance matrix D optimizes the detection process
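One reading of the 2-D distance matrix is a lookup table of Euclidean distances between every pair of cluster centroids, computed once after training so that detection only needs table lookups. The sketch below follows that reading; the function name `centroid_distance_matrix` is hypothetical and the interpretation itself is an assumption, since the slide does not give the exact definition of D.

```python
import numpy as np

def centroid_distance_matrix(centroids):
    """D[i, j] = Euclidean distance between centroid i and centroid j."""
    centroids = np.asarray(centroids, dtype=float)
    diff = centroids[:, None, :] - centroids[None, :, :]
    return np.linalg.norm(diff, axis=2)
```

With k clusters, D is a symmetric k x k matrix with zeros on the diagonal, so comparing two quantized frames costs a single lookup `D[a, b]` instead of a 39-dimensional distance computation.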

Keyword Templates • MFCC Feature Vector • Each frame associated with a cluster • 1-D template(s) stored in a folder • Flow: Keywords → Feature Extraction → Vector Quantization → 1-D string of cluster indices
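The vector quantization step, which turns a sequence of feature vectors into a 1-D string of cluster indices, can be sketched as a nearest-centroid lookup. The function name `quantize` is an assumption; the centroids would come from the trained K-means codebook.

```python
import numpy as np

def quantize(features, centroids):
    """Map each feature vector (row) to the index of its nearest centroid."""
    features = np.asarray(features, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```

The same routine would apply to both keyword templates and test utterances, so that both are represented over the same codebook before warping.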

Keyword Detection • Speech utterance divided into overlapping segments (not isolated words) • Warping distance for each segment computed separately • Flow: Speech → Feature Extraction → Vector Quantization → 1-D cluster index → Segmental DTW → Decision Logic → Keyword Detection

Distance Plot • Kwd- C1-C2-C4-C6-C1 • Utterance – C2-C4-C5-C6-C1-C3-C4 • Keyword – vertical axis; utterance – horizontal axis • Each cell – distance measure • Grayscale • Dark – Low distance • Bright – Large distance • Minimum distance path – candidate keyword

Segmental DTW • Speech utterance divided into overlapping segments (Segment 1, Segment 2, Segment 3 in the figure) • Choose the starting point • Adjustment window constraint – Segment Span ± R (= 3) • Segment Width – 2R + 1 • Segment Overlap – R • Each segment has its own warping distance score • Candidate Keyword – ones with low warping distance • Precision Error – 2R • Distance • S1 = (0+0+0+7+9)/5 = 3.2 • S2 = (0+5+0+7+9)/5 = 4.2
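The segment-by-segment warping above can be sketched on quantized sequences as follows. Several details are assumptions made for the example and may differ from the thesis: segment start points are placed every R frames, the path must begin at the segment start and may end anywhere inside the band, and the score is the cumulative distance divided by the keyword length (as in the S1 and S2 averages, where the divisor 5 matches the five-frame keyword). The function name `segmental_dtw` is hypothetical.

```python
import numpy as np

def segmental_dtw(keyword, utterance, dist, R):
    """Warp a quantized keyword against overlapping diagonal bands of an utterance.

    keyword, utterance: sequences of cluster indices.
    dist: cluster-to-cluster distance matrix (lookup table).
    R: segment span, R >= 1; each band allows a deviation of +/- R from its diagonal.
    Returns a list of (start_frame, end_frame, score), one per usable segment.
    """
    n, m = len(keyword), len(utterance)
    # local[i, j] = distance between keyword frame i and utterance frame j
    local = np.asarray(dist, dtype=float)[np.ix_(keyword, utterance)]
    results = []
    for start in range(0, m, R):  # assumed spacing of segment start points
        cost = np.full((n, m), np.inf)
        for i in range(n):
            lo = max(start, start + i - R)
            hi = min(m - 1, start + i + R)
            for j in range(lo, hi + 1):
                if i == 0 and j == start:
                    prev = 0.0  # path begins at the segment start
                else:
                    prev = min(
                        cost[i - 1, j] if i > 0 else np.inf,
                        cost[i, j - 1] if j > start else np.inf,
                        cost[i - 1, j - 1] if i > 0 and j > start else np.inf,
                    )
                cost[i, j] = local[i, j] + prev
        end = int(np.argmin(cost[n - 1]))
        if np.isfinite(cost[n - 1, end]):  # skip bands that run off the utterance
            results.append((start, end, cost[n - 1, end] / n))
    return results
```

Segments with the lowest scores are the candidate keyword locations; because each band is only 2R + 1 frames wide, the reported location can be off by up to about 2R frames, which is the precision error noted above.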

Results • Some templates fail to produce low distance at the keyword location • Average score can be used with a threshold • Figure: Distortion score for keyword UNIVERSITY

Decision Logic • N/2 voting-based approach, N – number of templates available • Top ten lowest-distance segments for each keyword template • Frequency of occurrence for each segment • Segments among the top ten scorers for more than half the keyword templates – considered the keyword
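The N/2 voting rule above can be sketched with a counter over segment identifiers. The function name `vote_segments` and the input layout (one dictionary of segment scores per template) are assumptions for illustration; the top-ten cut-off and the more-than-half rule come from the slide.

```python
from collections import Counter

def vote_segments(scores_per_template, top_n=10):
    """Return segments that rank in the top_n for more than half of the templates.

    scores_per_template: list with one dict per keyword template,
                         mapping segment id -> warping distance.
    """
    votes = Counter()
    for scores in scores_per_template:
        # Segments with the lowest distances for this template
        best = sorted(scores, key=scores.get)[:top_n]
        votes.update(best)
    n_templates = len(scores_per_template)
    return sorted(seg for seg, count in votes.items() if count > n_templates / 2)
```

A segment that scores well for only one unusual template is discarded, which is how the voting step compensates for templates that fail to produce a low distance at the true keyword location.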

Experimental Setup • Feature Vector • 13 MFCC, 13 ∆ and 13 ∆∆ = 39 Features • 24 Filterbanks • 20 ms frame with ¾ overlap • Cluster Size – 64, 128, 256 • Training Data – 14 speakers (10 male, 4 female) × 5 min = 70 min • Segment span R = 2 to 20 • Number of keywords – 14 • Test Utterance – 10 sec to 2 min • Keyword Location – Cut-off Precision Error – 30%

Keyword Statistics • Long length keywords – easier to detect? • Higher variance – lower detection rate • Question: Syllable vs Phoneme [http://www.howmanysyllables.com/, http://www.speech.cs.cmu.edu/cgi-bin/cmudict]

Operation Characteristic • Hits vs Segment Span – R • Smaller R – Restrictive • Large R – More flexible • Larger R – Large precision error • Maximum Hits at R = 5-7 • Compared to result on S-DTW on Gaussian Posteriorgram for Speech Pattern discovery [Y. Zhang, J. R. Glass, 2010]

Operation Characteristic • Misses vs R • Small R – restrictive • Larger R – Flexible, more noise • Minimum misses at R = 5-7 • False Alarm vs R • Small R – Less false alarm • Large R – Flexible, more FA

Operation Characteristic • Speed vs R • No. of Segments = (UL - margin - 1)/R + 1 • Smaller R – More segments/More processing time • Larger R – Fewer segments/Less time • For R = 5, 1 minute of utterance – 5 secs per keyword template ≈ 12 templates possible in real time • 1 hr speech – 10 mins on 200 CPUs using GP and graph clustering on SDTW segments [Y. Zhang and J. R. Glass, 2010] • Figure: Execution time per keyword template per minute of utterance
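The segment-count formula and the real-time estimate above amount to simple arithmetic, sketched below. The function names are hypothetical, UL is taken to be the utterance length in frames, integer division is assumed, and the `margin` value in the usage note is a placeholder because the slide does not define it.

```python
def num_segments(utterance_len, margin, R):
    """No. of segments = (UL - margin - 1) / R + 1, using integer division."""
    return (utterance_len - margin - 1) // R + 1

def templates_in_real_time(utterance_secs, secs_per_template):
    """How many templates can be processed within the duration of the utterance."""
    return int(utterance_secs // secs_per_template)
```

For example, one minute of speech at 5 ms frame spacing is about 12000 frames, so with R = 5 and a placeholder margin of 10 the formula gives `num_segments(12000, 10, 5)` segments per template, and `templates_in_real_time(60, 5)` reproduces the figure of 12 templates quoted above.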

Results • Results vary for different keywords • Frequency of use of the word more important than length (University/Relationship vs. Something) • Pronunciation – context dependent [ H. Ketabdar, J. Vepa, S Bengio and H. Bourlard, 2006]

Future Work • Implement a relevance feedback technique so that generic templates are assigned higher weights after every iteration [Hazen T.J., Shen W., White C.M., 2009] • Retraining the clusters for different environments • Testing on more data with refined keyword templates (isolation of keywords from the speech data was time consuming and required several iterations) • Using a model keyword instead of several keyword templates [*Olakunle]

Thank You

Backup Slides

Model Keyword • Develop a model keyword from all available keyword templates • Implement Self Organizing Maps (SOMs) • Cluster grouping is random in K-means clustering • Data belonging to same clusters are grouped into one in SOM

System Design • Vector Quantization • Quantize data into a finite number of clusters – training data for populating the feature space need not be transcribed • Features for the same sound fall into the same cluster • Reduce dimension – feature vector reduced to a codebook index • Likelihood Estimation • Account for data that might fall just outside the cluster • Segmental Dynamic Time Warping • DTW requires fixed ends – utterance segmentation into isolated words • Divide the utterance into segments (not necessarily words) and compute the distortion score for each segment using DTW

Hidden Markov Models • HMM for Speech Recognition • Each word – sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to next state) • Estimate the model for each word in the training vocabulary • For each test keyword, the model that maximizes the likelihood is selected as a match – Viterbi Algorithm • Grammatical constraints are applied to improve recognition accuracy • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models
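The Viterbi step mentioned above, which finds the most likely hidden state sequence for an observed sequence, can be sketched in the log domain. This is a generic illustration, not part of the proposed system: the function name `viterbi` and the array layout (transition matrix indexed as from-state by to-state, emission scores precomputed per frame) are assumptions.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path and its log-probability.

    log_init:  (S,)   log initial state probabilities
    log_trans: (S, S) log transition probabilities, indexed [from, to]
    log_emit:  (T, S) log emission probability of each frame under each state
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans  # cand[from, to]
        back[t] = cand.argmax(axis=0)      # best predecessor of each state
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```

In a word-model setting, the same routine would be run against each word's model and the model giving the highest path score would be selected as the match.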

Phonetic Posteriorgrams • Each element represents the posterior probability of a specific phonetic class for a specific time frame. • Can be computed directly from the frame-based acoustic likelihood scores for each phonetic class at each time frame. • Time vs Class matrix representation

Gaussian Posteriorgrams • Each dimension of the feature vector approximated by a weighted sum of Gaussians – GMM • Parameterized by the mixture weights, mean vectors, and covariance matrices • A Gaussian posteriorgram is a probability vector representing the posterior probabilities of a set of Gaussian components for a speech frame • GMM can be computed over unsupervised training data instead of using a phonetic recognizer

GP • S = (s1, s2, …, sn) • GP(S) = (q1, q2, …, qn) • qi = (P(C1|si), P(C2|si), …, P(Cm|si)) • Ci – ith Gaussian component of a GMM • m – number of Gaussian components • Difference between two GPs • D(p, q) = -log(p · q) • DTW is used to find low distortion segments
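The posteriorgram distance D(p, q) = -log(p · q) above can be sketched directly. The function name `posteriorgram_distance` and the small floor `eps`, added to avoid taking the log of zero when two posterior vectors share no probability mass, are assumptions for the example.

```python
import numpy as np

def posteriorgram_distance(p, q, eps=1e-12):
    """D(p, q) = -log(p . q) for two posterior probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Floor the dot product so that orthogonal vectors give a large finite distance
    return float(-np.log(max(float(np.dot(p, q)), eps)))
```

The distance is zero when both frames put all their mass on the same component and grows as the two posterior vectors overlap less; it plays the role of the local cost in the DTW grid.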
