Speech Processing Laboratory, Temple University. Design of a Keyword Spotting System Based on Segmental Time Warping of Quantized Features. Presented by: Piush Karmacharya. Thesis Advisor: Dr. R. Yantorno. Committee Members: Dr. Joseph Picone, Dr. Dennis Silage.
Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY.
Outline • Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work
Introduction • A branch of the broader field of Speech Recognition • Identify a keyword in a written document or in audio (recorded or real time), and locate it if present • Confusion Matrix • True Positive – Hits • True Negative • False Negative – Misses • False Positive – False Alarms (FA)
Introduction • Speaker dependent – high accuracy, limited application • Speaker independent – lower accuracy, wide application • Performance evaluation – hits, misses and false alarms • Receiver Operating Characteristic – hits vs FA • Design objective – maximize hits while keeping false alarms low • Accuracy
Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques • Related Research • System Design • Results • Future Work
Motivation • Speech – Most general form of human communication • Information - embedded in redundant words • I would like to have french toast for breakfast. • Non-intentional sounds – cough, exclamation, noise • Efficient human-machine interface • Applications: Audio Document retrieval, Surveillance Systems, Voice commands/dialing
Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work
Challenges • Similar keywords (1, 3, 6; 13, 14) as well as different keywords • Variation in length (keyword 4 – around 1500 samples; keyword 14 – around 4100 samples) • [Figure: different instances of the keyword REALLY from different speakers; /KINGLISHA/ from the utterance "Hey you have to speak English for some English"]
Common Approaches • Template Based Approach • Hidden Markov Models • Neural Network • Hybrid Methods • Discriminative Methods
Template Matching • Started in the 1970s • One (or more) keyword templates available • Search string – the keyword; search space – the utterance • Flexible time search – Dynamic Time Warping • 1971, H. Sakoe, S. Chiba • Suitable for small-scale applications • Drawbacks • Requires segmenting the utterance into isolated words • Fails to learn from the existing speech data
Dynamic Time Warping • Time stretch/compress one signal so that it aligns with the other signal • Extremely efficient time-series similarity measure • Minimizes the effects of shifting and distortion • Prototype of the test keyword stored as a template; compared to each word in the incoming utterance
Dynamic Time Warping • Reference and test keyword arranged along two sides of a grid • Template keyword – vertical axis; test keyword – horizontal axis • Each block in the grid – distance between the corresponding feature vectors • Best match – the path through the grid that minimizes cumulative distance • But the number of possible paths increases exponentially with length!
DTW • Constraints • Monotonic condition: no backward steps • Continuity condition: no breaks in the path • Adjustment window: the optimal path does not wander far from the diagonal • Boundary condition: starting/ending points fixed • Constraints can be manipulated as desired (e.g., for connected word recognition [Myers, C.; Rabiner, L.; Rosenberg, A.; 1980])
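The grid search described on these slides can be sketched as a standard dynamic-programming recursion. This is a minimal version without the adjustment-window or fixed-boundary constraints, assuming a simple absolute-difference local distance:

```python
def dtw(ref, test, dist=lambda a, b: abs(a - b)):
    # D[i][j] = minimum cumulative distance aligning ref[:i] with test[:j],
    # allowing only diagonal, horizontal, and vertical (monotonic) moves
    n, m = len(ref), len(test)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(ref[i - 1], test[j - 1])
            D[i][j] = c + min(D[i - 1][j],      # stretch ref
                              D[i][j - 1],      # stretch test
                              D[i - 1][j - 1])  # match
    return D[n][m]
```

With a Sakoe-Chiba adjustment window, cells with |i − j| larger than the band width would simply be skipped, cutting the search to a diagonal band.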
Hidden Markov Models • 1988 – Lawrence R. Rabiner • Statistical model – hidden states / observable outputs • Emission probability – p(x|q1) • Transition probability – p(q2|q1) • First-order Markov process – the probability of the next state depends only on the current state • Infer the output given the underlying system • Estimate the most likely system from an observed sequence of outputs
HMM • KWS implementation • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models • Limitations • Large amount of training data required • Training data has to be transcribed at the word level and/or phone level • Transcribed data costs time and money • Not available in all languages
Neural Networks • Late 1990s • Classifier – learns from existing data • Multiple layers of interconnected nodes (neurons) • Different weights assigned to inputs; updated in every iteration • Requires a large amount of transcribed data for training • Hybrid systems – HMM/NN • Discriminative approaches – Support Vector Machines
Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work
Related Work • Segmental Dynamic Time Warping – Alex S. Park, James R. Glass, 2008 • Segmentation into isolated words not required • Choose starting point and adjustment window size • Proposed breaking words into smaller Acoustic Units • Speech – sequence of sounds • Acoustic units • Phonemes – Timothy J. Hazen, Wade Shen, Christopher White, 2009 • Gaussian Mixture Models (GMMs) – Yaodong Zhang, James R. Glass, 2009
Related Work • Phonetic Posteriorgrams • Phonemes as the acoustic unit • Gaussian Posteriorgrams • Acoustic unit – GMMs • Posteriorgram – probability vector representing the posterior probabilities of a set of classes for a speech frame • Every speech frame associated with one or more phonemes (e.g. /SH/, /AA/, /ER/)
Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work
Research Methodology • Acoustic unit – mean of the cluster • Simple – K-means clustering • Likelihood – Euclidean distance from the cluster centroid • Segmental Dynamic Time Warping – keyword detection • Covariance information not required • Corpus • CallHome database • 30 minutes of stereo conversations over a telephone channel • Switchboard database • 2,400 two-sided telephone conversations among 543 speakers
Steps • Training • Keyword Template Processing • Keyword Detection • Training pipeline: Training Speech → Feature Extraction → K-means Clustering → Trained Clusters → Distance Matrix • Training speech • Diverse sounds • Diverse speakers
Speech Processing • Feature vector – MFCC • Pipeline: Speech Signal → Pre-Emphasis → Windowing → FFT → Mel-Scaling → Log → DCT → MFCC • Pre-emphasis (high-pass filter): s̃[n] = s[n] − αs[n − 1]; α = 0.95 • The speech spectrum falls off at high frequencies; pre-emphasis emphasizes the higher formants • Mel-scaling models human perception by multiplying the spectrum with filter banks • Short-time stationarity: divide speech into short frames (20 ms with 5 ms spacing)
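The pre-emphasis filter and short-time framing just described can be sketched as follows. This is a minimal sketch; the 160-sample frame and 40-sample hop in the usage note assume 8 kHz telephone-band sampling, which is typical for this corpus but not stated on the slide:

```python
import numpy as np

def pre_emphasis(s, alpha=0.95):
    # s~[n] = s[n] - alpha * s[n-1]; first sample passed through unchanged
    return np.append(s[0], s[1:] - alpha * s[:-1])

def frame(signal, frame_len, hop):
    # slice the signal into overlapping short-time frames
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])
```

At 8 kHz, a 20 ms frame with 5 ms spacing is `frame(pre_emphasis(x), 160, 40)`; each row would then be windowed and passed to the FFT/mel/DCT stages.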
K-Means Clustering • 1. Feature space populated by the entire training data • 2. Select k random cluster centers • 3. Each data point finds the center it is closest to and associates itself with it • 4. Each cluster finds the centroid of the points it owns • 5. Centroids updated with the new means • 6. Repeat steps 3 to 5 until convergence
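The steps above can be sketched as a Lloyd-style K-means over feature rows; the convergence test and empty-cluster handling here are assumptions, not details from the slides:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k random data points as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # step 3: assign each point to its nearest center (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # steps 4-5: recompute each centroid from the points it owns
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):  # step 6: stop at convergence
            break
        centers = new
    return centers, labels
```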
Distance Matrix • A feature vector far from its centroid might fall into an adjacent cluster • Likelihood measure – Euclidean distance • Vectors in regions 3, 4 and 5 are closer to region 1 than to region 6 • A 2-D distance matrix D optimizes the detection process
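The inter-centroid distance matrix D can be precomputed once after clustering, so that the local distance between two cluster indices during detection becomes a table lookup; a minimal sketch:

```python
import numpy as np

def distance_matrix(centroids):
    # D[i][j] = Euclidean distance between centroid i and centroid j
    return np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
```

During DTW over quantized frames, the cost of matching cluster index a against index b is then just `D[a, b]`.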
Keyword Templates • Pipeline: Keywords → Feature Extraction → Vector Quantization → 1-D string of cluster indices • MFCC feature vectors extracted • Each frame associated with a cluster • 1-D template(s) stored in a folder
Keyword Detection • Speech utterance divided into overlapping segments (not isolated words) • Warping distance computed separately for each segment • Pipeline: Speech → Feature Extraction → Vector Quantization → 1-D cluster index → Segmental DTW → Keyword Detection → Decision Logic
Distance Plot • Keyword – C1-C2-C4-C6-C1 • Utterance – C2-C4-C5-C6-C1-C3-C4 • Keyword – vertical axis; utterance – horizontal axis • Each cell – distance measure • Grayscale • Dark – low distance • Bright – large distance • Minimum-distance path – candidate keyword
Segmental DTW • Speech utterance divided into overlapping segments (Segment 1, Segment 2, Segment 3, …) • Choose the starting point of each segment • Adjustment window constraint – segment span ± R (= 3) • Segment width – 2R + 1 • Segment overlap – R • Each segment has its own warping distance score • Candidate keyword – segments with low warping distance • Precision error – 2R • Example distances: S1 = (0+0+0+7+9)/5 = 3.2; S2 = (0+5+0+7+9)/5 = 4.2
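A loose sketch of the segmental scoring idea: slide fixed-width, overlapping windows over the quantized utterance and score each with a length-normalized DTW. The segment-width/overlap bookkeeping here is a simplification of the slide's 2R+1 / R scheme, and the 0/1 index distance stands in for the distance-matrix lookup:

```python
def dtw_cost(ref, seg):
    # length-normalized DTW between two cluster-index strings (0/1 local cost)
    INF = float("inf")
    n, m = len(ref), len(seg)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0.0 if ref[i - 1] == seg[j - 1] else 1.0
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)

def segmental_scores(keyword, utterance, R=3):
    # segments of width 2R+1, starting every R frames (overlap R);
    # each segment gets its own warping distance score
    scores = []
    step, width = R, 2 * R + 1
    for start in range(0, len(utterance) - width + 1, step):
        seg = utterance[start:start + width]
        scores.append((start, dtw_cost(keyword, seg)))
    return scores
```

Candidate keyword locations are the segment starts with the lowest scores; the start position is only known to within the segment step, which is the precision error the slide mentions.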
Introduction • What – Problem Definition • Why – Research Motivation • How – Techniques and Challenges • Related Research • System Design • Results • Future Work
Results • Some templates fail to produce a low distance at the keyword location • The average score can be used with a threshold • [Figure: distortion score for the keyword UNIVERSITY]
Decision Logic • N/2 voting-based approach, N – number of templates available • Top ten lowest-distance segments found for each keyword template • Frequency of occurrence computed for each segment • Segments among the top ten scorers for more than half the keyword templates are considered the keyword
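The N/2 voting rule can be sketched as follows; the function name and the (segment_start, distance) representation are illustrative, not from the thesis:

```python
from collections import Counter

def vote(per_template_scores, top=10):
    # per_template_scores: one list of (segment_start, distance) pairs
    # per keyword template
    n = len(per_template_scores)
    votes = Counter()
    for scores in per_template_scores:
        # take this template's `top` lowest-distance segments
        ranked = sorted(scores, key=lambda t: t[1])[:top]
        votes.update(start for start, _ in ranked)
    # keep segments that appear in the top list for more than half the templates
    return [seg for seg, c in votes.items() if c > n / 2]
```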
Experimental Setup • Feature vector – 13 MFCC, 13 Δ and 13 ΔΔ = 39 features • 24 filter banks • 20 ms frames with ¾ overlap • Cluster size – 64, 128, 256 • Training data – 14 speakers (10 male, 4 female) × 5 mins = 70 mins • Segment span R = 2 to 20 • Number of keywords – 14 • Test utterances – 10 sec to 2 min • Keyword location – cut-off precision error 30%
Keyword Statistics • Are longer keywords easier to detect? • Higher variance – lower detection rate • Question: syllables vs phonemes [http://www.howmanysyllables.com/, http://www.speech.cs.cmu.edu/cgi-bin/cmudict]
Operating Characteristic • Hits vs segment span R • Smaller R – restrictive • Larger R – more flexible, but larger precision error • Maximum hits at R = 5–7 • Compared to results of S-DTW on Gaussian posteriorgrams for speech pattern discovery [Y. Zhang, J. R. Glass, 2010]
Operating Characteristic • Misses vs R • Small R – restrictive • Larger R – flexible, more noise • Minimum misses at R = 5–7 • False alarms vs R • Small R – fewer false alarms • Large R – flexible, more FA
Operating Characteristic • Speed vs R • No. of segments = (UL − margin − 1)/R + 1, where UL is the utterance length in frames • Smaller R – more segments, more processing time • Larger R – fewer segments, less time • For R = 5, 1 minute of utterance takes about 5 secs per keyword template ≈ 12 templates possible in real time • For comparison, 1 hr of speech took 10 mins on 200 CPUs using GP and graph clustering on S-DTW segments [Y. Zhang and J. R. Glass, 2010] • [Figure: execution time per keyword template per minute of utterance]
Results • Results vary for different keywords • Frequency of use of the word matters more than its length (University/Relationship vs. Something) • Pronunciation is context dependent [H. Ketabdar, J. Vepa, S. Bengio and H. Bourlard, 2006]
Future Work • Implement a relevance feedback technique so that generic templates are assigned higher weights after every iteration [Hazen T.J., Shen W., White C.M., 2009] • Retrain the clusters for different environments • Test on more data with refined keyword templates (isolation of keywords from the speech data was time consuming and required several iterations) • Use a model keyword instead of several keyword templates [*Olakunle]
Model Keyword • Develop a model keyword from all available keyword templates • Implement Self-Organizing Maps (SOMs) • Cluster grouping is random in K-means clustering • Data belonging to the same clusters are grouped together in a SOM
System Design • Vector Quantization • Quantize data into finite clusters – the training data used to populate the feature space need not be transcribed • Features for the same sound fall into the same cluster • Reduces dimensionality – feature vectors reduced to codebook indices • Likelihood Estimation • Accounts for data that might fall just outside a cluster • Segmental Dynamic Time Warping • DTW requires fixed endpoints – i.e. segmentation of the utterance into isolated words • Instead, divide the utterance into segments (not necessarily words) and compute a distortion score for each segment using DTW
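The vector quantization step described above, mapping each feature frame to the index of its nearest trained centroid, can be sketched as:

```python
import numpy as np

def quantize(features, centroids):
    # features: (n_frames, dim) MFCC rows; centroids: (k, dim) trained means
    # returns a 1-D string of cluster indices, one per frame
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)
```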
Hidden Markov Models • HMM for Speech Recognition • Each word – a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state) • Estimate the model for each word in the training vocabulary • For each test keyword, the model that maximizes the likelihood is selected as a match – Viterbi Algorithm • Grammatical constraints are applied to improve recognition accuracy • Large Vocabulary Continuous Speech Recognizer (LVCSR) • Model non-keywords using Garbage/Filler Models
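The Viterbi decoding mentioned above can be sketched as a generic discrete-observation Viterbi (not the thesis's implementation); pi, A, and B are the usual initial, transition, and emission probability tables:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    # obs: sequence of observation symbol indices
    # pi: initial state probabilities (n,)
    # A: transition probabilities (n, n); B: emission probabilities (n, n_symbols)
    pi, A, B = np.asarray(pi), np.asarray(A), np.asarray(B)
    T, n = len(obs), len(pi)
    V = np.log(pi) + np.log(B[:, obs[0]])      # best log-likelihood per state
    back = np.zeros((T, n), dtype=int)         # backpointers
    for t in range(1, T):
        trans = V[:, None] + np.log(A)         # score of each (prev, cur) pair
        back[t] = trans.argmax(axis=0)
        V = trans.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(V.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack the best path
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```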
Phonetic Posteriorgrams • Each element represents the posterior probability of a specific phonetic class for a specific time frame. • Can be computed directly from the frame-based acoustic likelihood scores for each phonetic class at each time frame. • Time vs Class matrix representation
Gaussian Posteriorgrams • Each dimension of the feature vector approximated by a sum of weighted Gaussians – GMM • Parameterized by the mixture weights, mean vectors, and covariance matrices • A Gaussian posteriorgram is a probability vector representing the posterior probabilities of a set of Gaussian components for a speech frame • The GMM can be computed over unsupervised training data instead of using a phonetic recognizer
GP • S = (s1, s2, …, sn) • GP(S) = (q1, q2, …, qn) • qi = (P(C1|si), P(C2|si), …, P(Cm|si)) • Ci – the i-th Gaussian component of a GMM • m – number of Gaussian components • Distance between two GP frames: D(p, q) = −log(p · q) • DTW is used to find low-distortion segments
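The frame distance D(p, q) = −log(p · q) used between two Gaussian posteriorgram frames can be sketched as:

```python
import math

def gp_frame_distance(p, q):
    # D(p, q) = -log(p . q), where p and q are posterior probability vectors
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return -math.log(dot)
```

Identical, confident posteriors give a distance near 0; frames whose posterior mass sits on different Gaussian components give a large distance, which is what S-DTW then accumulates along candidate paths.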