This presentation by Roy Wallace of Queensland University of Technology describes Dynamic Match Lattice Spotting, a phonetic indexing method that uses a lattice-spotting technique to achieve open-vocabulary spoken term detection. The system searches a two-tier database (sequences and hyper-sequences) using dynamic-match rules and several algorithmic optimisations, followed by keyword verification. Key techniques include mapping phones to broader phonetic classes and scoring candidate sequences by minimum edit distance. The presentation also covers advantages, limitations, and future work, including spoken document retrieval.
Dynamic Match Lattice Spotting Spoken Term Detection Evaluation Queensland University of Technology Roy Wallace, Robbie Vogt, Kishan Thambiratnam, Prof Sridha Sridharan Presented by Roy Wallace
Overview • Phonetic-based index, giving open-vocabulary search • Based on lattice-spotting technique • Two-tier database • Dynamic-match rules • Algorithmic optimisations NOTE: Patented technology
Concept [Figure: the term “greasy” goes through phone decomposition and must then be found among the observed phones (g r ax s iy th ay n r nx ow d m nx ae …)]
Concept [Figure: dynamic matching compares the target sequence with the observed sequences and assigns costs to mismatches such as ax / ih]
Indexing [Diagram components: Audio, Segmentation, Feature Extraction, Speech Recognition, Lattices, Sequence Generation, Sequence DB, Hyper-sequence Generation, Hyper-sequence DB]
Hyper-sequence Mapping • Map individual phones to “parent” classes • We use Vowels, Fricatives, Glides, Stops and Nasals • Simple example • Parent classes: Vowels, Consonants • Map each phone to parent class to create hyper-sequence [Diagram: Sequence DB, Hyper-sequence DB]
Hyper-sequence Mapping [Diagram: the search term is matched against the Sequence DB; its hyper-sequence is matched against the Hyper-sequence DB]
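The mapping described on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the phone-to-class table covers only a handful of phones and the class assignments are assumptions, the names `PHONE_CLASS`, `to_hyper_sequence` and `build_hyper_db` are hypothetical, and the exact-match lookup in the hyper-sequence tier is a simplification.

```python
from collections import defaultdict

# Illustrative phone-to-parent-class table (an assumption, not the full phone set).
# V = vowel, F = fricative, G = glide, S = stop, N = nasal.
PHONE_CLASS = {
    "ax": "V", "iy": "V", "ay": "V", "ow": "V", "ae": "V", "ih": "V",
    "s": "F", "z": "F", "th": "F",
    "r": "G", "l": "G", "w": "G", "y": "G",
    "g": "S", "d": "S", "k": "S", "t": "S", "p": "S", "b": "S",
    "n": "N", "m": "N", "nx": "N",
}

def to_hyper_sequence(phones):
    """Map each phone to its parent class to create the hyper-sequence."""
    return tuple(PHONE_CLASS[p] for p in phones)

def build_hyper_db(sequences):
    """Group observed phone sequences under their shared hyper-sequence."""
    hyper_db = defaultdict(list)
    for seq in sequences:
        hyper_db[to_hyper_sequence(seq)].append(tuple(seq))
    return hyper_db

# Hypothetical observed sequences standing in for the Sequence DB.
observed = [
    ("g", "r", "iy", "s", "iy"),
    ("g", "r", "iy", "z", "iy"),
    ("k", "r", "ax", "s", "iy"),
    ("m", "ae", "n", "ow", "d"),
]
hyper_db = build_hyper_db(observed)

# Searching: map the target to its hyper-sequence, then fetch only the
# sequences filed under it; these are the candidates for dynamic matching.
target = ("g", "r", "iy", "s", "iy")
candidates = hyper_db[to_hyper_sequence(target)]
```

The point of the second tier is that many phone sequences share one hyper-sequence, so a cheap lookup there narrows the set of sequences that need the more expensive phone-level comparison.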
Searching [Diagram components: Term, Phone decomposition, Split long terms, Hyper-mapping, Hyper-sequence DB, Dynamic Matching, Sequence DB, Merge long terms, Keyword Verification, Results]
Dynamic Matching • Minimum Edit Distance (MED) • i.e. Levenshtein Distance • Insertions, deletions, substitutions • Finds minimum cost of transformation
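For reference, the Minimum Edit Distance named on this slide is the standard Levenshtein dynamic programme. The sketch below is a textbook version over phone sequences with uniform costs; the function name and the default costs of 1.0 are assumptions for illustration, not the values used in the system.

```python
def min_edit_distance(target, observed, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Minimum cost of transforming `target` into `observed`."""
    # prev[j] = cost of turning an empty target prefix into observed[:j]
    prev = [j * ins_cost for j in range(len(observed) + 1)]
    for i, t in enumerate(target, start=1):
        # curr[0] = cost of deleting the first i target phones
        curr = [i * del_cost]
        for j, o in enumerate(observed, start=1):
            match_or_sub = prev[j - 1] + (0.0 if t == o else sub_cost)
            deletion = prev[j] + del_cost       # drop target phone t
            insertion = curr[j - 1] + ins_cost  # add observed phone o
            curr.append(min(match_or_sub, deletion, insertion))
        prev = curr
    return prev[-1]

# One substitution (iy -> ax) separates these two phone sequences.
cost = min_edit_distance(["g", "r", "iy", "s", "iy"], ["g", "r", "ax", "s", "iy"])
```

Only two rows of the table are kept at a time, which is enough when just the final cost is needed.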
Dynamic Matching • Substitution costs • Derived from phone confusion statistics
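The slide does not say how the confusion statistics become costs. One simple possibility, shown here purely as an assumption, is to set the cost of substituting an observed phone for a target phone to one minus the estimated confusion probability, so that frequently confused pairs are cheap. The counts below are invented for illustration, and `substitution_costs` and `lookup_cost` are hypothetical names.

```python
def substitution_costs(confusions):
    """Turn confusion counts into costs: cost = 1 - P(observed | target).

    `confusions[target][observed]` is how often `target` was recognised
    as `observed`. This formula is an assumed stand-in, not the authors' rule.
    """
    costs = {}
    for target, row in confusions.items():
        total = sum(row.values())
        for observed, count in row.items():
            if observed == target:
                costs[(target, observed)] = 0.0  # a correct phone costs nothing
            else:
                costs[(target, observed)] = 1.0 - count / total
    return costs

def lookup_cost(costs, target, observed, default=1.0):
    """Pairs never seen in the statistics fall back to the full cost."""
    if target == observed:
        return 0.0
    return costs.get((target, observed), default)

# Invented counts, for illustration only.
confusions = {
    "ax": {"ax": 70, "ih": 25, "s": 5},
    "ih": {"ih": 80, "ax": 15, "iy": 5},
}
costs = substitution_costs(confusions)
```

With these made-up counts, substituting ih for ax costs 0.75 while substituting s for ax costs 0.95, so a vowel-for-vowel mismatch is penalised less than a vowel-for-fricative one.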
Optimisations • Prefix sequence optimisation • Early stopping optimisation • Linearised MED search approximation
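The slide names these optimisations without describing them. As one plausible reading of early stopping (an assumption, not a description of the patented method), the edit-distance computation can be abandoned as soon as every entry in the current row exceeds the allowed cost: with non-negative costs the final distance can never be lower than the minimum of any row. The function name and unit costs are hypothetical.

```python
def bounded_edit_distance(target, observed, max_cost):
    """Unit-cost edit distance, or None once it must exceed `max_cost`."""
    prev = list(range(len(observed) + 1))
    for i, t in enumerate(target, start=1):
        curr = [i]
        for j, o in enumerate(observed, start=1):
            curr.append(min(
                prev[j - 1] + (t != o),  # match (0) or substitution (1)
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
            ))
        # Every path to the final cell passes through this row, so if the
        # cheapest entry is already too expensive the match cannot succeed.
        if min(curr) > max_cost:
            return None
        prev = curr
    return prev[-1] if prev[-1] <= max_cost else None

accepted = bounded_edit_distance(list("kitten"), list("sitting"), max_cost=3)
rejected = bounded_edit_distance(list("abc"), list("xyz"), max_cost=1)
```

In the second call the search stops after the second row rather than filling the whole table, which is where the saving comes from when most database sequences are poor matches.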
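Long Term Merging [Diagram: the long term “olympic sites” is split into parts; each part is searched separately, and the hits are merged into the final results]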
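A minimal sketch of the merge step follows, under stated assumptions: each hit carries a start time, end time and match cost, two hits are merged when the second part begins within a short gap after the first part ends, and the merged cost is the sum of the parts. The `Hit` type, the `max_gap` threshold and the additive cost are all hypothetical choices, not taken from the slides.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    cost: float   # lower is better

def merge_hits(first_hits, second_hits, max_gap=0.5):
    """Pair hits for the first part with hits for the second part that follow closely."""
    merged = []
    for a in first_hits:
        for b in second_hits:
            gap = b.start - a.end
            if 0.0 <= gap <= max_gap:
                merged.append(Hit(a.start, b.end, a.cost + b.cost))
    # Best (lowest-cost) merged occurrences first.
    return sorted(merged, key=lambda h: h.cost)

# Invented hits for the two parts of "olympic sites".
olympic = [Hit(1.0, 1.5, 0.2), Hit(10.0, 10.6, 0.1)]
sites = [Hit(1.6, 2.0, 0.3), Hit(20.0, 20.4, 0.0)]
results = merge_hits(olympic, sites)
```

Here only the pair around the one-second mark is close enough to merge; the isolated hits for each part are discarded.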
Keyword Verification • Acoustic • Use acoustic score from lattice to boost occurrences with high confidence • Neural Network • Produce a confidence score by fusing • MED score and Acoustic score • Term phone length • Term phone classes
Results Maximum Term-Weighted Value on EvalSet terms
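The slide names the metric without defining it. The sketch below assumes the definition from the NIST 2006 Spoken Term Detection evaluation plan, where the Term-Weighted Value is one minus the average over terms of the miss probability plus a weighted false-alarm probability, with a weight of 999.9 and one non-target trial per second of speech. The function name and input layout are hypothetical; the maximum value is obtained by evaluating this at every decision threshold and keeping the best.

```python
def term_weighted_value(per_term_counts, speech_seconds, beta=999.9):
    """TWV at one decision threshold.

    `per_term_counts` holds one (n_true, n_correct, n_false_alarm) tuple per term.
    """
    losses = []
    for n_true, n_correct, n_false_alarm in per_term_counts:
        if n_true == 0:
            continue  # terms with no true occurrences are left out of the average
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus the true occurrences.
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_false_alarm)
    if not losses:
        raise ValueError("no terms with true occurrences")
    return 1.0 - sum(losses) / len(losses)

# A perfect system scores 1.0; a system that returns nothing scores 0.0.
perfect = term_weighted_value([(10, 10, 0)], speech_seconds=3600.0)
silent = term_weighted_value([(10, 0, 0)], speech_seconds=3600.0)
```

Because of the large weight on false alarms, a system that returns many spurious hits can score below zero.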
Conclusion • Open-vocabulary and phone-based • Patented technology utilises • sequence and hyper-sequence databases • optimisations for rapid searches • Advantages • Other languages • Economy of scale
Conclusion • Limitations • Indexing speed and size • Need to split long sequences • Future work • Keyword Verification • Word-level information (e.g. LVCSR) • Acoustic features (e.g. prosody) • Indexing/searching frameworks • Spoken Document Retrieval and other semantic applications
References • A. J. K. Thambiratnam, “Acoustic keyword spotting in speech with applications to data mining”, Ph.D. dissertation, Queensland University of Technology, Qld, March 2005. • K. Thambiratnam and S. Sridharan, “Rapid Yet Accurate Speech Indexing Using Dynamic Match Lattice Spotting”, IEEE Transactions on Audio, Speech and Language Processing, accepted for future publication. • CMU Speech Group, “The Carnegie Mellon Pronouncing Dictionary”, 1998. [Online]. Available: http://www.speech.cs.cmu.edu/cgi-bin/cmudict • S. J. Young, P. C. Woodland and W. J. Byrne, “HTK: Hidden Markov Model Toolkit V3.2”, Cambridge University Engineering Department, Speech Group and Entropic Research Laboratories Inc., 2002. • V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals”, Soviet Physics Doklady, 10(8), 1966, pp. 707-710.