




Explorations in Zero-Resource Spoken Term Detection
Justin Chiu, Language Technologies Institute, Carnegie Mellon University

What is STD?
• Given a query (text or audio), detect the query within a target audio collection
• Common approach: recognize, then search
• Evaluation metric: ATWV (Actual Term-Weighted Value)

Why Zero Resource?
• The common approach relies on a high-quality recognizer
• For some languages there is not enough knowledge to construct a recognizer:
• Dictionary
• Language model

Preliminary Attempt
1. Extract MFCC features from the audio
2. Clustering
• Goal: obtain a better, higher-level representation of the audio
• K-means clustering
• GMM clustering
3. Representation
• Hard representation: each vector mapped to a fixed label (available for both clustering methods)
• Soft representation: each vector mapped to a different vector (available for GMM clustering)
4. Segmental Dynamic Time Warping
• Hard distance: mismatch = 1, match = 0; can use inverted-frequency weighting
• Soft distance: −log(a·q)
• A distance below a certain threshold is treated as a detection

Preliminary Results
• DET graph: development set (figure)
• DET graph: evaluation set (figure)

Proposed Approach
• Use the Successive State Splitting algorithm to train an HMM:
• Initially use a 1-state HMM to describe the training audio
• Split HMM states, then prune states according to maximum likelihood
• Decide the number of splitting/pruning iterations
• Train an "acoustic model" to model these sub-word units
• Represent queries and audio by decoding
• Index and search to perform term detection
• Advantage: increases the system's robustness by making stronger assumptions compared to the preliminary approach

Conclusions
• We present explorations in zero-resource STD
• Pattern matching does not work in the speaker-independent case
• We propose a modeling approach to address this issue
• Stronger assumptions might make the system more robust
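The clustering and representation steps of the preliminary attempt can be sketched as follows. This is a minimal illustration, not the poster's implementation: it assumes MFCC frames are already extracted as plain Python vectors, uses k-means for the hard representation, and approximates a soft (posterior-style) representation with a softmax over negative distances to the centroids; all function names and parameters here are illustrative.

```python
import math
import random

def kmeans(frames, k, iters=20, seed=0):
    """Cluster feature frames with k-means; return centroids and hard labels."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assignment step: each frame takes the label of its nearest centroid.
        for i, f in enumerate(frames):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(f, centroids[c])))
        # Update step: move each centroid to the mean of its assigned frames.
        for c in range(k):
            members = [frames[i] for i in range(len(frames)) if labels[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def soft_representation(frame, centroids, tau=1.0):
    """Posterior-like soft vector over clusters (softmax of negative distances)."""
    d = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
    e = [math.exp(-di / tau) for di in d]
    z = sum(e)
    return [ei / z for ei in e]
```

With a GMM instead of k-means, the soft representation would be the vector of component posteriors for each frame; the softmax above is a stand-in for that idea.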
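The frame-level distances used for matching can be sketched directly from the poster's definitions: a hard distance (mismatch = 1, match = 0) over cluster labels, and a soft distance −log(a·q) over posterior vectors. The exact form of the inverted-frequency weighting is not given on the poster; the IDF-style weight below is one plausible reading, labeled as an assumption.

```python
import math
from collections import Counter

def if_weights(labels):
    """Inverse-frequency weight per cluster label (assumed IDF-style form)."""
    counts = Counter(labels)
    n = len(labels)
    return {lab: math.log(n / c) for lab, c in counts.items()}

def hard_distance(a, b, w=None):
    """Match = 0; mismatch = 1, optionally scaled by inverse-frequency weights."""
    if a == b:
        return 0.0
    if w is None:
        return 1.0
    return 0.5 * (w.get(a, 0.0) + w.get(b, 0.0))

def soft_distance(a_vec, q_vec, eps=1e-12):
    """-log(a . q) between two posterior vectors; small when frames are similar."""
    dot = sum(x * y for x, y in zip(a_vec, q_vec))
    return -math.log(max(dot, eps))
```

A distance below a chosen threshold is then treated as a detection, as the poster states.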
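The segmental DTW matching step can be sketched as below. This is a simplified sliding-window variant (full segmental DTW constrains alignments to diagonal bands over the whole distance matrix); the frame distance is passed in so either the hard or the soft distance can be used, and the threshold is a free parameter.

```python
def dtw_cost(query, segment, dist):
    """Standard DTW alignment cost between two frame sequences, length-normalized."""
    n, m = len(query), len(segment)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(query[i - 1], segment[j - 1])
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m] / (n + m)

def segmental_dtw(query, audio, dist, threshold):
    """Slide a query-sized window over the audio; below-threshold cost => detection."""
    hits = []
    win = len(query)
    for start in range(0, len(audio) - win + 1):
        cost = dtw_cost(query, audio[start:start + win], dist)
        if cost < threshold:
            hits.append((start, cost))
    return hits
```

On hard labels, an exact occurrence of the query aligns with zero cost, so it falls under any positive threshold.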
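The ATWV evaluation metric named above can be computed as in the sketch below, which follows the NIST STD 2006 definition: ATWV = 1 − mean over query terms of (P_miss + β·P_FA), with β ≈ 999.9 and the number of non-target trials approximated by total speech duration in seconds minus the number of true occurrences. The dictionary field names are illustrative, not from the poster.

```python
def atwv(terms, beta=999.9):
    """Actual Term-Weighted Value over a list of per-term count dicts.

    Each dict holds: n_true (true occurrences), n_correct (correct detections),
    n_fa (false alarms), t_speech (total speech duration in seconds).
    """
    total = 0.0
    for t in terms:
        p_miss = 1.0 - t['n_correct'] / t['n_true']          # missed fraction
        p_fa = t['n_fa'] / (t['t_speech'] - t['n_true'])     # false-alarm rate
        total += p_miss + beta * p_fa
    return 1.0 - total / len(terms)
```

A perfect system scores 1.0; because β is large, even a few false alarms pull the score down sharply, which is what the DET graphs trade off.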
