The Automatic Speech Attribute Transcription (ASAT) project, which ran from October 2004 to September 2008, aimed to advance speech transcription by combining a bank of speech attribute detectors with event merging and evidence verification. A team drawn from several institutions set out to build prototype attribute detectors, including biologically motivated and perceptually motivated processors, and to combine temporal and spatial events into higher-level units for recognition. The work was intended to lead to prototype continuous speech recognition systems on a common, collaborative platform. The stated goal was to go beyond the state of the art, with an emphasis on team excellence.
Automatic Speech Attribute Transcription (ASAT)
• Project Period: 10/01/04 – 9/30/08
• The ASAT Team
• Mark Clements (clements@ece.gatech.edu)
• Sorin Dusan (sdusan@speech.rutgers.edu)
• Eric Fosler-Lussier (fosler@cse.ohio-state.edu)
• Keith Johnson (kjohnson@ling.ohio-state.edu)
• Fred Juang (juang@ece.gatech.edu)
• Larry Rabiner (lrr@caip.rutgers.edu)
• Chin Lee (Coordinator, chl@ece.gatech.edu)
• NSF HLC Program Director: (mharper@nsf.gov)
ASAT Paradigm and SoW
[Diagram: numbered components 1 to 5; component 5 is labeled "Overall System Prototypes and Common Platform"]
Bank of Speech Attribute Detectors
• Each detected attribute is represented by a time series (event)
• An example: frame-based detector (0-1 simulating posterior probability)
• ANN-based Attribute Detectors
• An example: nasal and stop detectors
• Sound-specific parameters and feature detectors
• An example: “VOT” for V/UV stop discrimination
• Biologically-motivated processors and detectors
• Analog detectors, short-term and long-term detectors
• Perceptually-motivated processors and detectors
• Converting speech into neural activity level functions
• Others?
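The frame-based detector idea above can be sketched as follows. This is only an illustration under stated assumptions, not the ASAT detectors themselves: the feature (per-frame log energy), the logistic mapping, and the names `frame_energies`, `detector_scores`, `weight`, and `bias` are all hypothetical choices made here to show how a detector can emit one score between 0 and 1 per frame, giving a time series.

```python
import math


def frame_energies(samples, frame_len=4, hop=2):
    """Split a signal into overlapping frames and return each frame's log energy."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Small constant avoids log(0) on silent frames.
        energies.append(math.log(energy + 1e-10))
    return energies


def detector_scores(features, weight=1.0, bias=0.0):
    """Map per-frame features to scores in (0, 1) with a logistic function.

    The weight and bias are hand-picked here (an assumption); a real
    detector would learn them, for example with a neural network.
    """
    return [1.0 / (1.0 + math.exp(-(weight * f + bias))) for f in features]


# Toy signal: silence, a burst of activity, then silence again.
signal = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8
scores = detector_scores(frame_energies(signal), weight=1.0, bias=5.0)
```

Here `scores` is the detector's time series: close to 0 on the silent frames and close to 1 on the active ones, which is the "0-1 simulating posterior probability" shape the slide describes.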
An Example: More Visible than Spectrogram?
[Figure: detector outputs labeled Stop, Nasal, and Vowel, with segment labels j+ve d+ing z+ii j+i g+ong h+e g+uo d+e m+ing +vn]
• Early acoustic to linguistic mapping!
Event Merger
• Merge multiple time series into another time series
• Maintaining the same detector output characteristics
• Combine temporal events
• An example: combining phones into words (word detectors)
• Combine spatial events
• An example: combining vowel and nasal features into nasalized vowels
• Extreme: Build a 20K-word recognizer by implementing 20K keyword detectors
• Others: OOV, partial recognition
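A minimal sketch of the spatial case, assuming two detectors that already produce frame-aligned scores between 0 and 1. The frame-wise product used here is an assumption chosen for illustration (it treats the two attributes as independent) and is not stated in the slides; `merge_spatial` and the sample values are hypothetical.

```python
def merge_spatial(series_a, series_b):
    """Merge two frame-aligned detector outputs into one time series.

    The frame-wise product keeps the result in [0, 1], so the merged
    series has the same output characteristics as its inputs.
    """
    if len(series_a) != len(series_b):
        raise ValueError("detector outputs must cover the same frames")
    return [a * b for a, b in zip(series_a, series_b)]


# Hypothetical per-frame scores from a vowel detector and a nasal detector.
vowel = [0.1, 0.8, 0.9, 0.9, 0.2]
nasal = [0.0, 0.1, 0.7, 0.9, 0.8]

# High only where both attributes are active at the same time.
nasalized_vowel = merge_spatial(vowel, nasal)
```

Because the merged output is itself a 0-1 time series, it can be fed to another merger in the same way. Temporal merging (phones into words) would additionally need ordering and duration constraints, which this sketch does not attempt.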
Evidence Verifier
• Provide confidence measures to events and evidences
• Utterance verification algorithms can be used
• Output recognized evidences (words and others)
• Hypothesis testing is needed in every stage
• Prune event and evidence lattices
• Pruning threshold decisions
• Minimum verification error (MVE) verifiers
• Many new theories can be developed
• Others?
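The hypothesis-testing and pruning steps can be illustrated with a log-likelihood ratio test, a common form of utterance verification. This is a sketch under assumptions rather than the project's verifier: the `Event` fields, the scores, and the threshold of 0.0 are hypothetical, and MVE training of the verifier is not shown.

```python
from dataclasses import dataclass


@dataclass
class Event:
    label: str
    target_loglik: float  # log-likelihood under the hypothesized attribute model
    anti_loglik: float    # log-likelihood under a competing (anti) model


def confidence(event):
    """Log-likelihood ratio: positive values favor the hypothesized event."""
    return event.target_loglik - event.anti_loglik


def prune(events, threshold=0.0):
    """Keep only events whose confidence reaches the pruning threshold."""
    return [e for e in events if confidence(e) >= threshold]


# Hypothetical candidate events with made-up scores.
lattice = [
    Event("nasal", -10.0, -14.0),  # ratio  4.0: accepted
    Event("stop", -12.0, -11.5),   # ratio -0.5: rejected
    Event("vowel", -8.0, -8.0),    # ratio  0.0: accepted at threshold 0.0
]
kept = prune(lattice, threshold=0.0)
```

Raising the threshold rejects more false events at the cost of discarding more correct ones, which is the trade-off behind the pruning threshold decisions listed above.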
Knowledge Sources: Definition & Evaluation
• Explore large body of speech science literature
• Define training, evaluation and testing databases
• Develop Objective Evaluation Methodology
• Defining detectors, mergers, verifiers, recognizers
• Defining/collecting evaluation data for all
• Document all pieces on the web
Prototype ASR Systems and Platform
• Continuous Phone Recognition: TIMIT?
• Continuous Speech Recognition
• Connected digit recognition
• Wall Street Journal
• Switchboard?
• Establishment of a collaborative platform
• Implementing divide-’n’-conquer strategy
• Developing a user community
Summary
• ASAT Goal: Go beyond state-of-the-art
• ASAT Spirit: Work for team excellence
• ASAT team member responsibilities
• MAC: Event Fusion
• SD: Perception-based processing
• EF: Knowledge Integration (Event Merger)
• KJ: Acoustic Phonetics
• BHJ: Evidence Verifier
• LRR: Attribute Detector
• CHL: Overall