
Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

Speech Processing Laboratory, Temple University. DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND MFCC DOMAIN. Presented by: Olakunle Anifowose. Thesis Advisor: Dr. Robert Yantorno. Committee Members: Dr. Joseph Picone, Dr. Dennis Silage.





Presentation Transcript


  1. Speech Processing Laboratory, Temple University DESIGN OF A KEYWORD SPOTTING SYSTEM USING MODIFIED CROSS-CORRELATION IN THE TIME AND MFCC DOMAIN Presented by: Olakunle Anifowose Thesis Advisor: Dr. Robert Yantorno Committee Members: Dr. Joseph Picone, Dr. Dennis Silage Research effort partially sponsored by the US Air Force Research Laboratories, Rome, NY

  2. Outline • Introduction to keyword spotting • Motivation for this work • Experimental Conditions • Common approach to Keyword Spotting • Method used • Time Domain • MFCC Domain • Conclusions • Future Work

  3. Keyword Spotting • Identify a keyword in a spoken utterance or written document • Determine if the keyword is present in the utterance • Locate the keyword in the utterance • Possible operational results • Hits • False Alarms • Misses

  4. Keyword Spotting • Speaker Dependent / Independent (speech recognition) • Speaker Dependent • Single speaker • Lacks flexibility and is not speaker adaptable • Easier to develop • Speaker Independent • Multiple speakers • Flexible • Harder to develop

  5. Applications • Monitor conversations for flag words. • Automated response system. • Security. • Automatically search through speeches for certain words or phrases. • Voice command/dialing.

  6. Motivation for this work • Typical Large Vocabulary Continuous Speech Recognizer (LVCSR) / Hidden Markov Model (HMM) based approaches require a garbage model • To train the system on non-keyword speech data. • The better the garbage model, the better the keyword spotting performance • Use of LVCSR techniques can introduce • Computational load and complexity. • The need for training data.

  7. Research Objectives • Development of a simple keyword spotting system based on cross-correlation. • Maximize hits while keeping false alarms and misses low.

  8. Speech Databases Used • Call Home Database • Contains more than 40 telephone conversations between male and female speakers. • Each conversation is 30 minutes long. • Switchboard Database • Two-sided conversations collected from various speakers in the United States.

  9. Experimental Setup • Conversations are split into single channels • Call Home database • 60 utterances ranging from 30 seconds to 2 minutes. • Keywords of interest were college, university, language, something, student, school, zero, relationship, necessarily, really, think, English, program, tomorrow, bizarre, conversation and circumstance. • Switchboard database • 30 utterances ranging from 30 seconds to 2 minutes. • Keywords of interest were always, money and something.

  10. Common Approaches • Hidden Markov Model • Statistical model – hidden states / observable outputs • Emission probability – p(x|q1) • Transition probability – p(q2|q1) • First order Markov Process – the probability of the next state depends only on the current state. • Infer the hidden states given the observed outputs.

  11. Hidden Markov Models • HMM for Speech Recognition • Each word – a sequence of unobservable states with certain emission probabilities (features) and transition probabilities (to the next state). • Estimate the model for each word in the training vocabulary. • For each test keyword, the model that maximizes the likelihood is selected as a match – Viterbi Algorithm. • KWS can be built directly using an HMM-based Large Vocabulary Continuous Speech Recognizer (LVCSR).
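Since the Viterbi algorithm performs the likelihood-maximizing model selection described above, a minimal sketch for a toy discrete HMM may help; the state and observation alphabets below are hypothetical illustrations, not taken from the thesis:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden-state path for an observation sequence.
    pi: initial state probabilities (S,), A: transition probabilities (S, S),
    B: emission probabilities (S, O). Toy discrete HMM, for illustration only."""
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))           # best path probability ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A      # (from-state, to-state)
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B[:, obs[t]]
    # backtrack from the most likely final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```

In a real KWS system the observations would be acoustic feature vectors and the emission model continuous, but the dynamic-programming recurrence is the same.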

  12. HMM • Limitations • Large amount of training data required. • Training data has to be transcribed at the word level and/or phone level. • Transcribing data costs time and money. • Not available in all languages.

  13. Various Keyword Spotting Systems • HMM • Context-dependent state-of-the-art phoneme recognizer • Keyword model. • Garbage model. • Evaluated on the Conversational Telephone Speech database. • Accuracy varies with keyword • 52.6% with keyword “because”. • 94.5% with keyword “zero”. Ketabdar et al., 2006

  14. Various Keyword Spotting Systems • Spoken Term Detection using Phonetic Posteriorgrams • Trained on acoustic phonetic models. • Compared using dynamic time warping. • Trained on the Switchboard cellular corpus. • Tested on the Fisher English development test set from NIST. • Average precision for the top 10 hits was 63.3%. Hazen et al., 2009

  15. Various Keyword Spotting Systems • S-DTW • Evaluated on the Switchboard corpus. • 75% accuracy for all keywords tested. Jansen et al., 2010

  16. Contributions • We have proposed a novel approach to keyword spotting in both the time and MFCC domains using cross-correlation. • The design of a global keyword for cross-correlation in the time domain.

  17. Cross-correlation • Measure of similarity between two signals. • Two signals are compared by: • Sliding one signal by a certain time lag • Multiplying the overlapping regions and taking the sum • Repeating the process and adding the products until there is no more overlap • If both signals are exactly the same, there is a maximum peak at zero lag (time lag = 0), and the rest of the correlation signal tapers off to zero.
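The slide-multiply-sum procedure above is exactly what NumPy's np.correlate computes; a short sketch:

```python
import numpy as np

def cross_correlate(x, y):
    """Full cross-correlation: slide y past x, multiply the overlap, sum."""
    return np.correlate(x, y, mode="full")

# A signal correlated with itself peaks at zero lag, which in "full"
# mode corresponds to the centre index len(x) - 1.
sig = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
corr = cross_correlate(sig, sig)
peak_lag = int(np.argmax(corr)) - (len(sig) - 1)  # 0 for identical signals
```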

  18. Research Using Cross-correlation • The identification of cover songs • Search a musical database and determine songs that are similar but performed by different artists with different instruments • Features of choice: chroma features • Representation for music • Entire spectrum projected onto 12 bins representing the 12 distinct semitones. • Method used is cross-correlation • Cross-correlation is used to determine similarities between songs based on their chroma features

  19. Cases Considered • Time Domain • Initial approach • Modified approach • MFCC Domain

  20. Time Domain Initial Approach 1. Let the length of the keyword or phrase be n. Compute the cross-correlation (xcorr) of the keyword and the first n samples of the utterance. 2. Observe the position of the peak to see if it is around the zero lag. Yes: keyword. No: not keyword. 3. Shift the observed portion by a small amount and repeat the process. If a portion is reached where the peak is close to the zero lag, then that is where the keyword is. If not, the utterance does not contain the keyword.

  21. ZRR – Zero lag to Rest Ratio • The power around the “zero” lag is obtained and compared to the power in the rest of the correlation signal. This ratio is referred to as the Zero lag to Rest Ratio (ZRR). • If the ZRR is greater than a certain threshold (2.5), then that segment of the utterance contains the keyword or phrase. • The test utterance is shifted and the process is repeated. • If there is no segment with a ZRR greater than 2.5, the utterance does not contain the keyword.
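The ZRR decision above might be sketched as follows; the slides fix only the threshold (2.5), so the width of the "around zero lag" window below is an assumption:

```python
import numpy as np

def zrr(corr, half_width=5):
    """Zero lag to Rest Ratio: mean power near lag 0 over mean power elsewhere.
    half_width (samples kept on each side of lag 0) is a hypothetical choice."""
    zero = len(corr) // 2                 # lag-0 index of a "full" correlation
    lo, hi = zero - half_width, zero + half_width + 1
    centre = corr[lo:hi] ** 2
    rest = np.concatenate([corr[:lo], corr[hi:]]) ** 2
    return centre.mean() / rest.mean()

def segment_contains_keyword(segment, keyword, threshold=2.5):
    """True when the correlation power is concentrated around zero lag."""
    corr = np.correlate(segment, keyword, mode="full")
    return zrr(corr) > threshold
```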

  22. Test Cases • Same Speaker • Keyword part of the utterance • Different Speaker • Keyword from different speaker

  23. Results (utterance: male 1, keyword: male 1)

  24. Results (utterance: male 1, keyword: male 2)

  25. Results (utterance: female 1, keyword: female 1)

  26. Results (utterance: female 1, keyword: female 2)

  27. Results: Speaker Dependent, Initial Time Domain Approach • Tested on 30 utterances • Single instances of the following keywords extracted from the same speaker: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow

  28. Results: Speaker Independent, Initial Time Domain Approach • Tested on 40 utterances • Multiple instances of the following keywords extracted from various speakers: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow

  29. Challenge • Keyword → REALLY • 13 and 14 → same gender (female)

  30. Time Domain Modified • Pitch-smooth the utterance • Form a global keyword from Quantized Dynamic Time Warping • Cross-correlate both signals and compute the Zero lag to Rest Ratio (ZRR) on a frame-by-frame basis • The highest ZRR is the location of the keyword

  31. Pitch • Perceived frequency level of speech • A change in pitch results in a change in the fundamental frequency of speech • A difference in pitch between keyword and utterance increases detection errors.

  32. Pitch Normalization • Pitch is a form of speaker information • Limit the effects pitch has on a speech system • Kawahara Algorithm • STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum) algorithm to modify pitch. • It reduces periodic variation in time caused by excitation

  33. Utterance Pitch Normalization • STRAIGHT Algorithm • Elimination of periodicity interference • Temporal interference around peaks can be removed by constructing a new timing window based on a cardinal B-spline basis function that is adaptive to the fundamental period. • F0 Extraction • Natural speech is not purely periodic • Speech resynthesis • The extracted F0 is then used to resynthesize the speech

  34. Modeling a Global Keyword • Compute MFCC features for each keyword • Perform Quantized Dynamic Time Warping (DTW) on several keyword templates.

  35. MFCC • Take the Fourier transform of (a windowed portion of) a signal. • Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows. • Take the log of the power at each of the mel frequencies. • Take the Discrete Cosine Transform (DCT) of the mel log powers, as if it were a signal. • The MFCCs are the amplitudes of the resulting spectrum.
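The five steps above can be sketched with plain NumPy; the filter count, frame length, and unscaled DCT-II here are simplifying assumptions rather than the thesis's exact configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular overlapping windows equally spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def dct_ii(x):
    """Type-II DCT (orthonormal scaling omitted for brevity)."""
    n = x.shape[-1]
    k = np.arange(n)
    basis = np.cos(np.pi * (np.arange(n)[:, None] + 0.5) * k[None, :] / n)
    return x @ basis

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs of one frame, following the slide's five steps."""
    windowed = frame * np.hamming(len(frame))                # windowed portion
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2            # 1. power spectrum
    fb = mel_filterbank(n_filters, len(frame), sr)           # 2. map to mel scale
    log_mel = np.log(fb @ spectrum + 1e-10)                  # 3. log mel powers
    return dct_ii(log_mel)[:n_ceps]                          # 4-5. DCT -> MFCCs
```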

  36. Dynamic Time Warping • Time-stretching and contracting one signal so that it aligns with the other signal. • Time-series similarity measure. • Reference and test keywords are arranged along two sides of a grid. • Template keyword – vertical axis; test keyword – horizontal axis. • Each block in the grid – distance between corresponding feature vectors. • Best match – the path through the grid that minimizes the cumulative distance.
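The grid search described above is the classic dynamic-programming recurrence; a compact sketch:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two feature sequences (rows = frames).
    Returns the minimum cumulative Euclidean distance through the grid."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # distance at this grid block
            cost[i, j] = d + min(cost[i - 1, j],      # stretch
                                 cost[i, j - 1],      # contract
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

A time-stretched copy of a sequence (e.g. a repeated frame) still aligns at zero cost, which is exactly why DTW suits variable-rate speech.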

  37. Quantized Dynamic Time Warping • The MFCC features extracted from various instances of a keyword are divided into 2 sets: A and B. Each reference template Ai is paired with exactly one Bi. • For each pair Ai and Bi, the optimal path is computed (using the classic DTW algorithm). • A new vector Ci = (c1, c2, … cNc) is generated. • Repeat the process, treating each pair (Ci, Ci+1) as a new Ai and Bi pair. • The result is a single reference vector Cy. • Invert the vector into a time domain signal.
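The slide does not fully specify how each pair (Ai, Bi) is combined into Ci, so the sketch below assumes frame-by-frame averaging along the optimal DTW path; take it as one plausible reading, not the thesis's exact procedure:

```python
import numpy as np

def dtw_path(a, b):
    """Optimal DTW alignment path between two feature sequences (rows = frames)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def merge_pair(a, b):
    """Average each aligned frame pair along the optimal path -> new template Ci."""
    return np.array([(a[i] + b[j]) / 2.0 for i, j in dtw_path(a, b)])

def global_template(templates):
    """Repeatedly merge pairs until a single reference template Cy remains."""
    while len(templates) > 1:
        merged = [merge_pair(templates[k], templates[k + 1])
                  for k in range(0, len(templates) - 1, 2)]
        if len(templates) % 2:           # carry an unpaired template forward
            merged.append(templates[-1])
        templates = merged
    return templates[0]
```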

  38. Sample Results

  39. Sample Results

  40. Results Using a Global Keyword and Pitch-Normalized Utterances and Keywords • Tested on 60 utterances • Used a global keyword • 10 utterances associated with each keyword • Keywords of interest: bizarre, conversation, something, really, necessarily, relationship, think, tomorrow, computer, college, university, zero, student, school, language, program.

  41. Results Analysis • Results differ from keyword to keyword • The best performing keyword was bizarre, which had a hit rate of 60% • The time domain is not suitable due to the uneven statistical behavior of speech signals.

  42. MFCC Domain • Steps for cross-correlating the keyword and utterance in the MFCC domain: • Step 1: Pitch-normalize the utterances and keywords • Using the STRAIGHT algorithm • Step 2: Estimate the length of the keyword (n) and compute its MFCC features • Step 3: Compute the MFCC features of the first n samples of the utterance • Step 4: Normalize the MFCC features of the utterance and keyword and cross-correlate them. • Step 5: Store a single value from the cross-correlation result in a matrix, shift along the utterance by a couple of samples, and repeat Steps 3-5 until the end of the utterance. • Step 6: Identify the maximum value in the matrix as the location of the keyword.
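Steps 3-5 amount to a sliding-window search over normalized features. The sketch below substitutes a simple framewise log-spectrum for real MFCCs to stay self-contained; the hop size and feature extractor are assumptions:

```python
import numpy as np

def unit_normalize(feats):
    """Divide each feature vector by its Euclidean norm (as on slide 43)."""
    return feats / (np.linalg.norm(feats, axis=-1, keepdims=True) + 1e-10)

def features(signal, frame=64):
    """Stand-in feature extractor (framewise log power spectrum);
    a real system would compute MFCC features here."""
    n = len(signal) // frame
    frames = signal[:n * frame].reshape(n, frame)
    return np.log(np.abs(np.fft.rfft(frames)) ** 2 + 1e-10)

def locate_keyword(utterance, keyword, hop=32):
    """Slide a keyword-length window along the utterance, correlate the
    normalized features, and return (offset, score) of the best match."""
    n = len(keyword)
    kf = unit_normalize(features(keyword)).ravel()
    kf /= np.linalg.norm(kf)
    best_score, best_pos = -np.inf, 0
    for start in range(0, len(utterance) - n + 1, hop):
        sf = unit_normalize(features(utterance[start:start + n])).ravel()
        sf /= np.linalg.norm(sf)
        score = float(kf @ sf)          # cosine-style matching score in [-1, 1]
        if score > best_score:
            best_score, best_pos = score, start
    return best_pos, best_score
```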

  43. Normalizing MFCC Features • Divide each feature vector by the square root of the sum of its squared elements • Similar to dividing a vector by its norm to obtain a unit vector • Reason: so the MFCC features range from zero to one

  44. Interpreting the cross-correlation of MFCC features • Similar to the cosine similarity measure. • If two vectors are exactly the same, the angle between them is zero and its cosine is one. • The closer the cross-correlation result of two vectors is to one, the more likely they are to be a match. • Vectors that are dissimilar will have a wider angle, and their cross-correlation results will be much less than one.
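The analogy to cosine similarity can be made concrete in a couple of lines:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors: 1 for parallel
    (identical direction), 0 for orthogonal (dissimilar) vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```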

  45. Distance Between MFCC Features for Different Keywords

  46. Speaker Dependent • Tests were conducted on 30 utterances • Keywords were extracted from the same speaker • College, university, student, school, bizarre • The maximum matching score corresponds to the location of the keyword.

  47. Speaker Independent Test • Test Samples • Average of 5 utterances associated with each keyword • Average of 5 versions of each keyword • 20-25 trials • 13 keywords

  48. More Results The maximum matching score corresponds to the location of the keyword.

  49. More Results • Maximum matching score → location of the keyword “University” • Second highest score is the word “Universities”
