Speech and Language Technologies for Audio Indexing and Retrieval

JOHN MAKHOUL, FELLOW, IEEE, FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN, RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE. PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000.


Presentation Transcript


  1. Speech and Language Technologies for Audio Indexing and Retrieval JOHN MAKHOUL, FELLOW, IEEE, FRANCIS KUBALA, TIMOTHY LEEK, DABEN LIU, LONG NGUYEN, RICHARD SCHWARTZ, AND AMIT SRIVASTAVA, MEMBER, IEEE PROCEEDINGS OF THE IEEE, VOL. 88, NO. 8, AUGUST 2000 Chin-Kai Wu, CS, NTHU

  2. Outline
     • Introduction
     • Indexing and Browsing with Rough’n’Ready
        • Rough’n’Ready System
        • Indexing and Browsing
     • Statistical Modeling Paradigm
     • Speech Recognition
     • Speaker Recognition
        • Segmentation
        • Clustering
        • Identification

  3. Introduction
     • Much information will be in the form of speech from various sources.
     • It is now possible to start building automatic content-based indexing and retrieval tools.
     • The Rough’n’Ready system provides a rough transcription of the speech that is ready for browsing.
     • The technologies incorporated in the system include speech/speaker recognition, name spotting, topic classification, story segmentation, and information retrieval.

  4. Rough’n’Ready system
     [Figure: system diagram. Recoverable labels: Dual P733-MHz; MP3; Collect/Manage Archive; Interact with browser; ActiveX controls]

  5. Indexing and Browsing

  6. Indexing and Browsing (Cont’d)
     [Figure: browser view with callouts: Place, People, Speaker, Topic Labels, Organization]

  7. Indexing and Browsing (Cont’d)
     • Selected from over 5500 topic labels

  8. Statistical Modeling Paradigm
     • The output is the desired recognized sequence of the data
     • Recognition search: maximize P(output | input, model)
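
The paradigm on this slide can be restated in symbols. This is only a compact rewriting of "maximize P(output|input, model)"; the names x (input), y (candidate output), and θ (model parameters) are notation assumed here, not taken from the slides.

```latex
\hat{y} = \arg\max_{y} \, P(y \mid x, \theta)
```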

  9. Speech Recognition
     • Statistical models: acoustic models and language models
     • Acoustic model
        • Describes the time-varying evolution of feature vectors for each sound or phoneme
        • Employs hidden Markov models (HMMs)
        • A Gaussian mixture models the feature vectors for each HMM state
        • Special acoustic models for nonspeech events: music, silence/noise, laughter, breath, and lip-smack
     • Language model: N-gram language model
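
To make the "N-gram language model" bullet concrete, here is a minimal sketch of a bigram (N = 2) model with plain maximum-likelihood estimates. The function name `train_bigram`, the sentence markers `<s>` and `</s>`, and the toy corpus are all assumptions for illustration; the slides do not say which N or which smoothing the system uses.

```python
from collections import Counter


def train_bigram(sentences):
    """Return prob(prev, word) with maximum-likelihood bigram estimates."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        # Hypothetical sentence-boundary markers, added for illustration
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(prev, word):
        # Unseen contexts get probability 0 because this sketch has no smoothing
        if context_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Toy corpus, made up for this example
toy_corpus = ["the news is on", "the news is late", "the weather is fine"]
bigram_prob = train_bigram(toy_corpus)
```

Without smoothing, any unseen word pair gets probability zero, so a practical recognizer would need some smoothing or backoff scheme on top of this.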

  10. Speech Recognition (Cont’d)
      • Multipass recognition search strategy
         • Fast-match pass
            • Narrows the search space
            • Followed by other passes with more accurate models that operate on the smaller search space
         • Backward pass
            • Generates the top-scoring N-best word sequences (100 <= N <= 300)
         • N-best rescoring pass: Tree Rescoring algorithm
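
The idea behind an N-best rescoring pass can be sketched as follows. This is plain list rescoring, not the Tree Rescoring algorithm named on the slide, whose details the transcript does not give. The function `rescore_nbest`, the `lm_weight` parameter, and all scores are hypothetical.

```python
def rescore_nbest(nbest, rescoring_model, lm_weight=1.0):
    """Re-rank hypotheses by acoustic log score plus a weighted model log score.

    nbest: list of (word_sequence, acoustic_log_score) pairs from an earlier pass.
    rescoring_model: callable mapping a word sequence to a log probability.
    """
    rescored = [
        (words, acoustic + lm_weight * rescoring_model(words))
        for words, acoustic in nbest
    ]
    # Highest combined log score first
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Made-up hypotheses and log scores, for illustration only
toy_nbest = [
    (("recognize", "speech"), -12.0),
    (("wreck", "a", "nice", "beach"), -11.5),
]
toy_lm = {
    ("recognize", "speech"): -2.0,
    ("wreck", "a", "nice", "beach"): -6.0,
}
best_words, best_score = rescore_nbest(toy_nbest, toy_lm.get)[0]
```

In this toy case the second hypothesis has the better acoustic score, but the more accurate model reverses the ranking, which is the point of rescoring a short list instead of searching the full space again.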

  11. Speech Recognition (Cont’d)
      • Speedup algorithms
         • Fast Gaussian Computation (FGC)
         • Grammar Spreading
         • N-Best Tree Rescoring
      • Word error rate at different speeds (x RT = multiples of real time)
         • PII 450-MHz processor, 60000-word vocabulary
         • 3 x RT => 21.4%
         • 10 x RT => 17.5%
         • 230 x RT => 14.8%
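
The two quantities traded off on this slide can be computed as in the sketch below. Word error rate is taken in its usual sense (substitutions, deletions, and insertions from a word-level edit distance, divided by the number of reference words), and the real-time factor is processing time divided by audio duration. The function names are assumptions; the slides do not describe the scoring tool that was used.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev holds the distances between the first i-1 reference words
    # and every prefix of the hypothesis
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time as a multiple of the audio duration (the 'x RT' figure)."""
    return processing_seconds / audio_seconds
```

For example, a recognizer that needs 30 seconds to process 10 seconds of audio runs at 3 x RT.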

  12. Speaker Recognition
      • Speaker segmentation
         • Segregates audio streams based on the speaker
      • Speaker clustering
         • Groups together audio segments that are from the same speaker
      • Speaker identification
         • Recognizes those speakers of interest whose voices are known to the system

  13. Speaker Segmentation
      • Two-stage approach to speaker change detection
         • First: detects speech/nonspeech boundaries
         • Second: performs the actual speaker segmentation within the speech segments
      • First stage
         • Collapses the phonemes into three broad classes (vowels, fricatives, and obstruents)
         • Includes five nonspeech models (music, silence/noise, laughter, breath, and lip-smack)
         • 5-state HMM
         • Detection is reliable over 90% of the time

  14. Speaker Segmentation (Cont’d)
      • Second stage
         • Hypothesizes a speaker change boundary at every phone boundary located in the first stage
         • The speaker change decision takes the form of a likelihood ratio (λ) test
      [Figure: decision diagram. Recoverable labels: nonspeech region: λ <= t, λ > t; speech region: λ <= t + α, λ > t + α; outcomes: same speaker, otherwise]

  15. Speaker Clustering
      • The likelihood ratio test is used repeatedly to group the cluster pairs that are deemed most similar, until all segments are grouped into one cluster and a complete cluster tree is generated
      • The cut of the tree that is optimal is then found based on a criterion
      [Equation: criterion shown as an image on the slide. Recoverable annotations: K: number of clusters for any particular cut of the tree; Nj: number of feature vectors in cluster j; one term is the log of the determinant of the within-cluster dispersion matrix; the other term is a compensation for the previous term]

  16. Speaker Clustering (Cont’d)
      • The algorithm performs well regardless of the true number of speakers, producing clusters of high purity
      • Purity is defined as the percentage of frames that are correctly clustered; it was measured as 95.8%
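
The purity measure on this slide can be sketched as follows. The slide does not say how "correctly clustered" is decided, so this sketch assumes the common convention that a frame is correct when its true speaker is the majority speaker of its cluster. The function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, speaker_ids):
    """Fraction of frames whose true speaker is the majority speaker of their cluster."""
    if len(cluster_ids) != len(speaker_ids) or not cluster_ids:
        raise ValueError("need two non-empty label lists of equal length")
    speakers_in_cluster = defaultdict(Counter)
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        speakers_in_cluster[cluster][speaker] += 1
    # Assumption: the majority speaker of each cluster defines "correct"
    correct = sum(max(counts.values()) for counts in speakers_in_cluster.values())
    return correct / len(cluster_ids)


# Toy per-frame labels, made up for this example
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_speakers = ["a", "a", "b", "b", "b", "b"]
toy_purity = cluster_purity(toy_clusters, toy_speakers)
```

Here one frame of speaker "b" ends up in the cluster dominated by speaker "a", so five of the six frames count as correctly clustered.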

  17. Speaker Identification
      • Every speaker cluster created in the speaker clustering stage is identified by gender
      • The gender of a speaker segment is determined by computing the log likelihood ratio between the male and female models
      • This approach resulted in a 2.3% error rate in gender detection
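
A log likelihood ratio decision of this kind can be sketched as below. The slides do not describe the form of the male and female models, so this sketch assumes one diagonal-covariance Gaussian per gender and a decision threshold of zero; both are simplifying assumptions, and the model parameters and frames are made up.

```python
import math


def diag_gaussian_log_likelihood(frames, mean, variance):
    """Sum of log densities of the frames under a diagonal-covariance Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, variance):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def classify_gender(frames, male_model, female_model):
    """Return (label, llr), where llr = log L(male) - log L(female)."""
    llr = (diag_gaussian_log_likelihood(frames, *male_model)
           - diag_gaussian_log_likelihood(frames, *female_model))
    # Assumption: threshold at zero, i.e. pick the more likely model
    return ("male" if llr > 0 else "female"), llr


# Hypothetical (mean, variance) models and feature frames
toy_male = ([0.0, 0.0], [1.0, 1.0])
toy_female = ([2.0, 2.0], [1.0, 1.0])
label, llr = classify_gender([[0.1, -0.2], [0.3, 0.0]], toy_male, toy_female)
```

Summing the per-frame log likelihoods means that longer segments give a more confident ratio, which is one reason to decide per segment or per cluster instead of per frame.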

  18. Speaker Identification (Cont’d)
      • In the DARPA Broadcast News corpus, 20% of the speaker segments are from 20 known speakers
      • This is an open-set problem: the data contain both known and unknown speakers, so the system has to determine the identity of the known-speaker segments and reject the unknown-speaker segments
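
One simple way to handle an open-set decision is sketched below: pick the best-scoring known speaker and reject the segment as unknown when that score falls under a threshold. The slides do not say how the system performs the rejection, so the thresholding rule, the function name `identify_open_set`, and the scores are all assumptions for illustration.

```python
def identify_open_set(scores, threshold):
    """Return the best-scoring known speaker, or None to reject as unknown.

    scores: dict mapping each known speaker to a score for the segment
    (for instance a log likelihood ratio; the score type is an assumption).
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # Hypothetical rejection rule: accept only if the best score clears the threshold
    return best if scores[best] >= threshold else None


# Made-up speaker names and scores
accepted = identify_open_set({"anchor_a": 1.2, "anchor_b": -0.4}, threshold=0.5)
rejected = identify_open_set({"anchor_a": 0.1, "anchor_b": -0.4}, threshold=0.5)
```

Raising the threshold would reject more unknown speakers at the cost of rejecting more known ones, so the threshold sets the balance between the two kinds of mistakes.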

  19. Speaker Identification (Cont’d)
      • The system made three types of errors
         • False identification rate of 0.1%, where a known-speaker segment was mistaken to be from another known speaker
         • False rejection rate of 3.0%, where a known-speaker segment was classified as unknown
         • False acceptance rate of 0.8%, where an unknown-speaker segment was classified as coming from one of the known speakers
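
The three error types can be counted as in the following sketch. The slide does not state the denominators, so this assumes that false identification and false rejection are normalized by the number of known-speaker segments and false acceptance by the number of unknown-speaker segments. The function name and the toy results are hypothetical.

```python
def open_set_error_rates(results):
    """Compute the three open-set error rates.

    results: list of (true_speaker, predicted_speaker) pairs, one per segment;
    None stands for an unknown speaker (truth) or a rejection (prediction).
    """
    known = [(t, p) for t, p in results if t is not None]
    unknown = [(t, p) for t, p in results if t is None]
    false_id = sum(1 for t, p in known if p is not None and p != t)
    false_rej = sum(1 for t, p in known if p is None)
    false_acc = sum(1 for t, p in unknown if p is not None)
    # Assumed denominators: known segments for the first two, unknown for the third
    return {
        "false_identification": false_id / len(known) if known else 0.0,
        "false_rejection": false_rej / len(known) if known else 0.0,
        "false_acceptance": false_acc / len(unknown) if unknown else 0.0,
    }


# Made-up segment results
toy_results = [
    ("a", "a"), ("a", "b"), ("b", None), ("b", "b"),
    (None, None), (None, "a"),
]
toy_rates = open_set_error_rates(toy_results)
```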
