This paper presents an algorithm for estimating pitch in speech signals using C++. It describes the frame/buffer approach and the implementation of a silent detector, employing correlation techniques to determine candidates for optimal pitch. The algorithm includes steps for assessing and improving pitch accuracy through bias correction and delay adjustments. Testing reveals potential enhancements for performance and real-time execution. By examining silent and non-silent frames, the approach aims to achieve reliable pitch estimation while addressing common computational challenges in speech recognition.
Pitch Estimation Speech Recognition Raymond Sastraputera
Outline • Introduction • Frame/Buffer • Algorithm • Silent Detector • Estimate Pitch • Correlation and Candidate • Optimal Candidate • Buffer Delay • Added Bias • Test and Result • Conclusion
Introduction • Estimates the pitch of a speech signal • Written in C++
Frame/Buffer • Frame segments are shifted with no overlap [Figure: frame segments within the buffer]
Silent Detector • Initial detection of silence • |max(x)| + |max(y)| + |max(z)| + |min(x)| + |min(y)| + |min(z)| • Threshold value (50 dB) [Figure: segments X, Y, Z]
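The silence measure above can be sketched as follows. This is a minimal sketch, not the author's code: the names `peak_measure`, `frame_activity` and `is_silent` are hypothetical, and it assumes a frame counts as silent when the sum falls below the threshold. The slide gives the threshold as 50 dB without saying how the sum is converted to dB, so the sketch compares raw sample units and leaves that conversion open.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// |max(seg)| + |min(seg)| for one segment (hypothetical helper).
float peak_measure(const std::vector<float>& seg) {
    assert(!seg.empty());
    const auto mm = std::minmax_element(seg.begin(), seg.end());
    return std::fabs(*mm.second) + std::fabs(*mm.first);
}

// Sum of the six terms on the slide, over the segments x, y and z.
float frame_activity(const std::vector<float>& x,
                     const std::vector<float>& y,
                     const std::vector<float>& z) {
    return peak_measure(x) + peak_measure(y) + peak_measure(z);
}

// Assumption: "silent" means the measure is below the threshold, and the
// threshold is expressed in the same units as the samples.
bool is_silent(const std::vector<float>& x,
               const std::vector<float>& y,
               const std::vector<float>& z,
               float threshold) {
    return frame_activity(x, y, z) < threshold;
}
```

Because only the extreme samples of each segment are inspected, the check costs one pass per segment and can run before any correlation is computed.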
Estimate Pitch (Correlation) • Correlation of two vectors
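The slide's formula did not survive in this transcript. A common choice for correlating two vectors in pitch estimation is the normalized cross-correlation, which is assumed in the sketch below; the function name `normalized_correlation` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalized cross-correlation of two equal-length vectors:
//   sum(x[i] * y[i]) / sqrt(sum(x[i]^2) * sum(y[i]^2))
// The result lies in [-1, 1]. Returns 0 when either vector has zero energy.
double normalized_correlation(const std::vector<float>& x,
                              const std::vector<float>& y) {
    assert(x.size() == y.size());
    double xy = 0.0, xx = 0.0, yy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double a = x[i];
        const double b = y[i];
        xy += a * b;
        xx += a * a;
        yy += b * b;
    }
    const double denom = std::sqrt(xx * yy);
    return denom > 0.0 ? xy / denom : 0.0;
}
```

Two segments with the same shape score 1 regardless of their amplitude, which is what makes a fixed threshold usable across loud and quiet frames.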
Estimate Pitch (Correlation) • Correlation P(x,y) • Calculated for different window sizes (nm) • The window size is the pitch period (in samples) • Correlation values above the threshold become candidates with score 1 [Figure: vectors x and y, each of length nm, over segments X, Y, Z]
Estimate Pitch (Correlation) • Correlation P(y,z) • Calculated for different nm • Only for the window sizes of candidates with score 1 • Correlation values above the threshold become candidates with score 2 [Figure: vectors y and z, each of length nm, over segments X, Y, Z]
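The two correlation passes can be sketched together. The sketch assumes that x, y and z are three adjacent segments of length nm starting at the beginning of the signal, and that P is a normalized cross-correlation; neither detail is spelled out on the slides. `Candidate`, `segment_correlation` and `find_candidates` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Candidate {
    std::size_t lag;  // window size nm, in samples
    double pxy;       // correlation of segments x and y
    int score;        // 1 after the P(x,y) pass, 2 after the P(y,z) pass
};

// Normalized correlation of s[a, a+n) with s[b, b+n).
double segment_correlation(const std::vector<float>& s,
                           std::size_t a, std::size_t b, std::size_t n) {
    assert(a + n <= s.size() && b + n <= s.size());
    double ab = 0.0, aa = 0.0, bb = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double u = s[a + i];
        const double v = s[b + i];
        ab += u * v;
        aa += u * u;
        bb += v * v;
    }
    const double denom = std::sqrt(aa * bb);
    return denom > 0.0 ? ab / denom : 0.0;
}

// Assumed layout: x = s[0, n), y = s[n, 2n), z = s[2n, 3n).
std::vector<Candidate> find_candidates(const std::vector<float>& s,
                                       std::size_t n_min, std::size_t n_max,
                                       double threshold) {
    assert(n_min > 0);
    std::vector<Candidate> out;
    for (std::size_t n = n_min; n <= n_max && 3 * n <= s.size(); ++n) {
        const double pxy = segment_correlation(s, 0, n, n);
        if (pxy <= threshold) continue;  // not a candidate at this window size
        Candidate c{n, pxy, 1};          // passed P(x,y): score 1
        // P(y,z) is evaluated only for window sizes that already scored 1.
        if (segment_correlation(s, n, 2 * n, n) > threshold) c.score = 2;
        out.push_back(c);
    }
    return out;
}
```

For a signal with a period of 4 samples, both 4 and 8 pass as candidates, which is why a later step is needed to choose between a period and its multiples.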
Estimate Pitch (Correlation) • Correlation Q(n,m) • Calculated for different nm • nMAX is the maximum nm among the candidates • Optimal Candidate • A candidate becomes optimal if its Qnm*0.77 is higher than the preceding candidate's Qnm [Figure: vectors x and z over segments X, Y, Z, with nm and nMAX marked]
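The selection rule can be sketched as a single pass over the candidates. Two assumptions are made here that the slide does not confirm: candidates are visited in order of increasing window size, and "preceding candidate" is read as the best candidate found so far. `Scored` and `pick_optimal` are hypothetical names, and the Q(n,m) values are taken as already computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Scored {
    std::size_t lag;  // window size nm, in samples
    double qnm;       // precomputed Q(n,m) for this candidate
};

// Returns the index of the chosen candidate, or -1 if the list is empty.
// A later candidate replaces the current choice only if its Q value,
// scaled by `multiplier` (0.77 on the slide), still exceeds the current one.
int pick_optimal(const std::vector<Scored>& candidates, double multiplier) {
    assert(multiplier > 0.0);
    if (candidates.empty()) return -1;
    std::size_t best = 0;
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        if (candidates[i].qnm * multiplier > candidates[best].qnm) best = i;
    }
    return static_cast<int>(best);
}
```

With a multiplier below 1, a later candidate must score clearly higher to win, so under the ordering assumed above the rule favors shorter window sizes when the scores are close.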
Estimate Pitch (Candidate) • Candidate score 1: correlation P(x,y) • No candidate → silence • Single candidate → compute P(y,z) • Score stays at 1 → hold • Score 2 → estimated pitch • Multiple candidates → compute P(y,z) • Candidate score 2: correlation P(y,z) • No candidate → compute Q(n,m) over the score-1 candidates • Single candidate → estimated pitch • Multiple candidates → compute Q(n,m) • Optimal pitch: correlation Q(n,m)
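The branching on this slide can be written as a small decision function driven by two counts: how many window sizes passed P(x,y), and how many of those also passed P(y,z). This is one reading of the slide, not the author's code; `Decision` and `decide` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

enum class Decision {
    Silence,            // no candidate passed P(x,y)
    Hold,               // one candidate, but it did not pass P(y,z)
    Pitch,              // exactly one candidate reached score 2
    OptimalFromScore1,  // several candidates, none at score 2: use Q(n,m)
    OptimalFromScore2   // several candidates at score 2: use Q(n,m)
};

// total  : number of candidates that passed P(x,y)
// score2 : number of those that also passed P(y,z)
Decision decide(std::size_t total, std::size_t score2) {
    assert(score2 <= total);
    if (total == 0) return Decision::Silence;
    if (total == 1) return score2 == 1 ? Decision::Pitch : Decision::Hold;
    if (score2 == 0) return Decision::OptimalFromScore1;
    if (score2 == 1) return Decision::Pitch;
    return Decision::OptimalFromScore2;
}
```

Keeping the decision separate from the correlation code makes each branch easy to test against the slide without needing audio data.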
Estimate Pitch (Optimal Candidate) • A single candidate with score 2 • Chosen by Q(n,m) among the candidates with score 2, or among the candidates with score 1 • A frame on hold whose next frame's estimated pitch is neither silence nor on hold
Buffer Delay • Delays the returned value of the estimated pitch • Needed to limit how long a frame can stay on hold
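One way to realize such a delay is a fixed-length queue of frame results. The policy below is an assumption made for illustration, since the slides do not give the details: a run of held frames is resolved by the first later frame that is not on hold, and a hold that reaches the output end unresolved is emitted as silence. `State`, `FrameResult` and `PitchDelay` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

enum class State { Silence, Hold, Voiced };

struct FrameResult {
    State state;
    float pitch_hz;  // candidate pitch; 0 for silence
};

class PitchDelay {
public:
    // A zero delay would turn every hold into silence, so it is rejected.
    explicit PitchDelay(std::size_t delay) : delay_(delay) { assert(delay > 0); }

    // Pushes the newest frame. Returns true and fills `out` with the frame
    // from `delay` pushes ago once enough frames have been buffered.
    bool push(FrameResult r, FrameResult& out) {
        if (r.state != State::Hold) {
            // Resolve the run of held frames directly before this one
            // (assumed policy): accept them if voiced, drop them if silent.
            for (auto it = queue_.rbegin();
                 it != queue_.rend() && it->state == State::Hold; ++it) {
                it->state = r.state;
                if (r.state == State::Silence) it->pitch_hz = 0.0f;
            }
        }
        queue_.push_back(r);
        if (queue_.size() <= delay_) return false;
        out = queue_.front();
        queue_.pop_front();
        if (out.state == State::Hold) {  // hold expired without resolution
            out.state = State::Silence;
            out.pitch_hz = 0.0f;
        }
        return true;
    }

private:
    std::size_t delay_;
    std::deque<FrameResult> queue_;
};
```

With this policy the queue length is an upper bound on how many frames a hold can last, at the cost of returning every estimate that many frames late.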
Bias • Conditions: • The two previous frames are not silent • The previous frame is not on hold • The previous frame's pitch is between 5/8 and 7/4 of the pitch of the frame before it
Bias • P(x,y) is doubled
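The bias conditions and the doubling can be sketched as two small functions. The sketch treats the 5/8 and 7/4 bounds as inclusive, which is an assumption, and it does not decide which window sizes receive the bias because the slides do not say. `State`, `FrameResult`, `bias_applies` and `biased_pxy` are hypothetical names.

```cpp
#include <cassert>

enum class State { Silence, Hold, Voiced };

struct FrameResult {
    State state;
    double pitch_hz;
};

// prev        : result of the previous frame
// before_prev : result of the frame before the previous one
bool bias_applies(const FrameResult& prev, const FrameResult& before_prev,
                  double min_ratio = 5.0 / 8.0, double max_ratio = 7.0 / 4.0) {
    // Neither of the two previous frames may be silent.
    if (prev.state == State::Silence || before_prev.state == State::Silence) {
        return false;
    }
    // The previous frame must not be on hold.
    if (prev.state == State::Hold) return false;
    assert(before_prev.pitch_hz > 0.0);
    // The previous pitch must stay within [5/8, 7/4] of the one before it.
    const double ratio = prev.pitch_hz / before_prev.pitch_hz;
    return ratio >= min_ratio && ratio <= max_ratio;
}

// Doubles P(x,y) when the bias conditions hold.
double biased_pxy(double pxy, bool apply_bias) {
    return apply_bias ? 2.0 * pxy : pxy;
}
```

Note that a doubled correlation can exceed 1, so any threshold comparison that follows has to be made on the biased value rather than on a clamped one.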
Test Parameter • correlation_threshold_silent(0.88) • Qnm_optimal_multiplier(0.77) • sample_rate(20000.0F) • max_pitch(400) • min_pitch(50) • pitch_buffer_size(20) • bias_max_frequency(7/4) • bias_min_frequency(5/8) • silent_threshold(50.0F)
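The parameters above can be collected in one struct, which also makes the search range explicit. The field names follow the slide; the types, the reading of `max_pitch` and `min_pitch` as frequencies in Hz, and the helpers `min_lag` and `max_lag` are assumptions.

```cpp
#include <cassert>
#include <cstddef>

struct PitchParams {
    float correlation_threshold_silent = 0.88f;
    float Qnm_optimal_multiplier = 0.77f;
    float sample_rate = 20000.0f;  // samples per second
    int max_pitch = 400;           // assumed to be in Hz
    int min_pitch = 50;            // assumed to be in Hz
    int pitch_buffer_size = 20;    // frames
    // Written as floating-point divisions: the integer expressions 7/4 and
    // 5/8 would evaluate to 1 and 0 in C++.
    float bias_max_frequency = 7.0f / 4.0f;
    float bias_min_frequency = 5.0f / 8.0f;
    float silent_threshold = 50.0f;
};

// Shortest window size to search: one period of the highest pitch.
std::size_t min_lag(const PitchParams& p) {
    assert(p.max_pitch > 0);
    return static_cast<std::size_t>(p.sample_rate / p.max_pitch);
}

// Longest window size to search: one period of the lowest pitch.
std::size_t max_lag(const PitchParams& p) {
    assert(p.min_pitch > 0);
    return static_cast<std::size_t>(p.sample_rate / p.min_pitch);
}
```

Under these assumptions the search covers window sizes from 50 to 400 samples, so each frame evaluates up to 351 window sizes in the first correlation pass.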
Conclusion • Several improvements could increase the performance of the pitch estimate • Reducing the search space • Adding the first-order derivative of the pitch • Filtering outliers / noise • The current algorithm might not be fast enough to run in real time
Reference • Bagshaw, Paul Christopher. Automatic Prosodic Analysis for Computer Aided Pronunciation Teaching. The University of Edinburgh (1994).