This paper presents models that exploit the dynamics of speech production to enhance Automatic Speech Recognition (ASR). By using intermediate linear representations, we address challenges posed by noisy environments, conversational speech, and speaker adaptation. The study explores articulatory trajectories and multi-level segmental Hidden Markov Models (HMMs), demonstrating near-optimal recognition performance with linear mappings. We employ techniques such as radial basis function networks for the acoustic mapping and evaluate performance on the TIMIT and MOCHA datasets, paving the way for future research on unified models of speech production.
Models of speech dynamics for ASR, using intermediate linear representations
Philip Jackson, Boon-Hooi Lo and Martin Russell
Electronic, Electrical and Computer Engineering
http://web.bham.ac.uk/p.jackson/balthasar/
INTRODUCTION Abstract
INTRODUCTION Speech dynamics into ASR
• dynamics of speech production to constrain recognizer
  • noisy environments
  • conversational speech
  • speaker adaptation
• efficient, complete and trainable models
  • for recognition
  • for analysis
  • for synthesis
INTRODUCTION Articulatory trajectories from West (2000)
INTRODUCTION Articulatory-trajectory model
INTRODUCTION Articulatory-trajectory model
[Diagram: the levels of the model: finite-state, intermediate, and surface (source dependent)]
INTRODUCTION Multi-level Segmental HMM (structural sketch below)
• segmental finite-state process
• intermediate “articulatory” layer
  • linear trajectories
• mapping required
  • linear transformation
  • radial basis function network
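To make the layering concrete, here is a minimal structural sketch in Python, assuming a per-state linear trajectory in the intermediate layer and a linear intermediate-to-acoustic mapping with Gaussian noise. All names (SegmentState, midpoint, slope, W, Sigma) are illustrative choices for this sketch, not identifiers from the original work.

```python
# Minimal sketch of one state of a multi-level segmental HMM:
# a linear trajectory in the intermediate layer, mapped linearly
# to the acoustic layer with Gaussian observation noise.
from dataclasses import dataclass
import numpy as np

@dataclass
class SegmentState:
    midpoint: np.ndarray   # trajectory midpoint in the intermediate layer (p,)
    slope: np.ndarray      # trajectory slope per frame (p,)
    W: np.ndarray          # linear intermediate-to-acoustic mapping (d, p)
    Sigma: np.ndarray      # acoustic noise covariance (d, d)

    def intermediate_trajectory(self, tau: int) -> np.ndarray:
        """Linear trajectory over a segment of tau frames, centred on its midpoint."""
        t = np.arange(1, tau + 1) - (tau + 1) / 2.0
        return self.midpoint + np.outer(t, self.slope)        # (tau, p)

    def log_segment_prob(self, Y: np.ndarray) -> float:
        """Gaussian log-likelihood of acoustic frames Y (tau, d) given this state."""
        F = self.intermediate_trajectory(len(Y))               # (tau, p)
        resid = Y - F @ self.W.T                               # (tau, d)
        inv = np.linalg.inv(self.Sigma)
        _, logdet = np.linalg.slogdet(self.Sigma)
        d = Y.shape[1]
        quad = np.einsum('ti,ij,tj->t', resid, inv, resid)     # per-frame Mahalanobis
        return float(-0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi)))
```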
INTRODUCTION Linear-trajectory model
[Diagram: segmental HMM states 1-5 generate linear trajectories in the intermediate layer, which the articulatory-to-acoustic mapping projects onto the acoustic layer]
THEORY Linear-trajectory equations: definition of the within-segment trajectory and the resulting segment probability.
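For reference, a sketch of the within-segment trajectory and segment probability in one standard linear-trajectory segmental-HMM notation; the symbols (midpoint m, slope c, duration tau, mapping W, covariance Sigma) are assumptions of this sketch rather than the slide's own notation.

```latex
% Linear trajectory in the intermediate layer for a segment of duration \tau
f_t = m + c\left(t - \tfrac{\tau + 1}{2}\right), \qquad t = 1, \dots, \tau .

% Segment probability: duration model times a Gaussian at each frame,
% centred on the mapped trajectory point W(f_t)
p(y_1, \dots, y_\tau \mid s) \;=\; p(\tau \mid s) \prod_{t=1}^{\tau}
  \mathcal{N}\!\bigl(y_t ;\, W(f_t),\, \Sigma_s\bigr).
```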
THEORY Linear mapping: least-squares objective function over matched intermediate and acoustic sequences.
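A sketch of the least-squares objective for the linear mapping, assuming matched intermediate vectors f_t and acoustic vectors y_t stacked column-wise into matrices F and Y (notation assumed here, not the slide's):

```latex
% Sum-of-squares error between acoustic frames and linearly mapped
% intermediate frames
E(W) \;=\; \sum_{t=1}^{T} \bigl\| y_t - W f_t \bigr\|^2 ,

% with the usual closed-form least-squares solution
\hat{W} \;=\; \Bigl(\sum_{t} y_t f_t^{\mathsf{T}}\Bigr)
              \Bigl(\sum_{t} f_t f_t^{\mathsf{T}}\Bigr)^{-1}
        \;=\; Y F^{\mathsf{T}} \bigl(F F^{\mathsf{T}}\bigr)^{-1}.
```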
THEORY Trajectory parameters: utterance probability, and its form under the optimal (ML) state sequence.
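A sketch of the utterance probability under a segmental model and its approximation for the optimal (maximum-likelihood) state sequence; segmentation into N segments with states s_n and durations tau_n is assumed notation for this sketch:

```latex
% Utterance probability: sum over state sequences and segmentations,
% where t_n denotes the first frame of segment n
p(y_1^T \mid \lambda) \;=\; \sum_{N,\, s_1^N,\, \tau_1^N}
  \prod_{n=1}^{N} a_{s_{n-1} s_n}\; p(\tau_n \mid s_n)\;
  p\bigl(y_{t_n}^{\,t_n + \tau_n - 1} \mid s_n\bigr),

% and, for the optimal (ML) state sequence, the sum is replaced by a maximisation
\hat{p}(y_1^T \mid \lambda) \;=\; \max_{N,\, s_1^N,\, \tau_1^N}
  \prod_{n=1}^{N} a_{s_{n-1} s_n}\; p(\tau_n \mid s_n)\;
  p\bigl(y_{t_n}^{\,t_n + \tau_n - 1} \mid s_n\bigr).
```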
THEORY Non-linear (RBF) mapping
[Diagram: radial basis function network mapping formant trajectories to the acoustic layer]
THEORY Trajectory parameters: with the RBF mapping, the least-squares solution is sought by gradient descent.
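A sketch of an RBF mapping and a gradient-descent update on its output weights; the Gaussian basis functions, centres c_k, widths sigma_k and learning rate eta are assumptions of this sketch, not taken from the slide:

```latex
% RBF mapping from an intermediate vector f_t to a predicted acoustic vector
\hat{y}_t \;=\; \sum_{k=1}^{K} w_k\, \phi_k(f_t),
\qquad \phi_k(f_t) = \exp\!\Bigl(-\tfrac{\|f_t - c_k\|^2}{2\sigma_k^2}\Bigr),

% least-squares error and the gradient-descent step on each output weight
E \;=\; \sum_{t} \bigl\| y_t - \hat{y}_t \bigr\|^2,
\qquad
w_k \;\leftarrow\; w_k + 2\eta \sum_{t} \bigl(y_t - \hat{y}_t\bigr)\,\phi_k(f_t).
```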
METHOD Tests on TIMIT
• N. American English, at 8 kHz
• MFCC13 acoustic features (incl. zeroth coefficient; extraction sketched below)
• F1-3: formants F1, F2 and F3, estimated by the Holmes formant tracker
• F1-3+BE5: five band energies added
• PFS12: synthesiser control parameters
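For illustration only, a way to compute 13 MFCCs (including the zeroth coefficient) from 8 kHz audio using librosa; the original experiments used their own front end, so treat this as a modern stand-in rather than the authors' feature extractor, and the file name is hypothetical.

```python
# Compute 13 MFCCs (c0..c12) from an 8 kHz waveform with librosa.
import librosa

y, sr = librosa.load("utterance.wav", sr=8000)          # hypothetical file, resampled to 8 kHz
mfcc13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # shape: (13, n_frames)
```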
RESULTS TIMIT baseline performance
• Constant-trajectory SHMM (ID_0)
• Linear-trajectory SHMM (ID_1)
RESULTS Performance across feature sets
METHOD Phone categorisation
METHOD Discrete articulatory regions
RESULTS Performance across groupings
RESULTS Results across groupings
METHOD Tests on MOCHA
• S. British English, at 16 kHz
• MFCC13 acoustic features (incl. zeroth coefficient)
• articulatory x- & y-coordinates from 7 EMA coils
• PCA9+Lx: first nine articulatory modes plus the laryngograph log energy (sketched below)
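A minimal sketch of how the PCA9+Lx feature set could be assembled, assuming 7 EMA coils give 14 coordinate dimensions per frame and using scikit-learn's PCA; array names, shapes and the placeholder data are assumptions for this sketch.

```python
# Reduce 7 EMA coils' x/y coordinates (14 dims) to nine principal
# "articulatory modes" and append the laryngograph log energy.
import numpy as np
from sklearn.decomposition import PCA

ema = np.random.randn(1000, 14)        # frames x (7 coils * x,y); placeholder data
log_lx = np.random.randn(1000, 1)      # laryngograph log energy; placeholder data

pca9 = PCA(n_components=9).fit_transform(ema)   # first nine articulatory modes
features = np.hstack([pca9, log_lx])            # PCA9 + Lx, shape (frames, 10)
```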
RESULTS MOCHA baseline performance
RESULTS Performance across mappings
DISCUSSION Model visualisation
[Figure: PFS12 trajectories for the original acoustic data, the constant-trajectory model, and the linear-trajectory model]
SUMMARY Conclusions
• Theory of Multi-level Segmental HMMs
• Benefits of linear trajectories
• Results show near-optimal performance with linear mappings
• Progress towards unified models of the speech production process
• What next?
  • unsupervised (embedded) training, to derive pseudo-articulatory representations
  • implement non-linear mapping (i.e., RBF)
  • include biphone language model, and segment duration models