
Presentation Transcript


  1. Speech Processing František Hrdina

  2. Presentation parts: • Speech processing theory • Available commercial software • The SAPI 5.1 SDK

  3. Speech processing includes different technologies and applications: • Speech encoding • Speaker separation • Speech enhancement • Speaker identification (biometrics) • Language identification • Keyword spotting • Automatic speech recognition (ASR), the intelligent human-computer interface (IHCI)

  4. Intelligent human computer interface

  5. Theory of everything F(X)=0 ... for suitable values of F, and suitable interpretations of X. You can come up with another theory, but it will merely be a special case of this one.

  6. Automatic speech recognition (ASR) Systems can be divided by • Vocabulary size • From tens of words to hundreds of thousands of words • The speaking format of the system • Isolated or connected words (phone dialing), continuous speech • The degree of speaker dependence of the system • Speaker-dependent, speaker-independent • The constraints of the task • As the vocabulary size increases, the number of possible word combinations to be recognized grows exponentially. Some form of task constraint, such as a formal syntax and formal semantics, is required to make the task manageable.

  7. ASR Phases • Signal pre-processing – filtering, 16-bit/20 kHz sampling (65,536 values, one sample every 0.05 ms) • Signal processing – to reduce the data rate; 10–30 ms segments; windowing to prevent discontinuities and spectrum distortion • Pattern matching – matching the feature vector against stored ones and finding the best match. There are four major ways to do this: (1) template matching, (2) hidden Markov models, (3) neural networks, and (4) rule-based systems. • Time alignment – a sequence of vectors recognized over time is aligned to represent a meaningful linguistic unit (phoneme, word). Different methods can be applied, for example the Viterbi algorithm, dynamic programming, and fuzzy rules. • Language analysis – the language units recognized over time are further combined and interpreted in terms of the syntax, semantics, and concepts of the language used in the system.

  8. Automatic speech recognition

  9. Signal pre-processing • Filtering • 16-bit/20 kHz sampling (65,536 values, one sample every 0.05 ms) • The sound card can be accessed via DirectX on Windows • http://www.ymec.com/products/dssf3e/index.htm
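The sampling figures on this slide follow directly from the bit depth and sample rate; a quick sanity check (a small illustration, not part of the original slides):

```python
# Sampling parameters from the slide: 16-bit samples at 20 kHz.
BITS = 16
RATE_HZ = 20_000

levels = 2 ** BITS            # number of distinct amplitude values
period_ms = 1000 / RATE_HZ    # time between successive samples, in ms

print(levels)     # 65536 distinct values
print(period_ms)  # 0.05 ms per sample
```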

  10. Signal processing • To reduce the data rate and obtain feature vectors • 10–30 ms segments • Windowing – to prevent discontinuities and spectrum distortion • http://www.ymec.com/products/dssf3e/index.htm
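The segmentation and windowing step above can be sketched as follows (a minimal illustration, not from the original slides; the 20 ms frame with 10 ms hop is one common choice within the stated 10–30 ms range, and the Hamming window is one standard taper):

```python
import math

def frames(signal, frame_len, hop):
    """Split a signal into overlapping frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def hamming(n):
    """Hamming window coefficients for a frame of n samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]

def window(frame):
    """Taper a frame to reduce edge discontinuities before spectral analysis."""
    return [s * w for s, w in zip(frame, hamming(len(frame)))]

# 20 ms frames with a 10 ms hop at 20 kHz -> 400 samples, 200-sample hop
signal = [math.sin(2 * math.pi * 440 * t / 20_000) for t in range(2000)]
segs = frames(signal, 400, 200)
tapered = [window(f) for f in segs]
print(len(segs))  # 9 frames
```

The taper falls to 0.08 at the frame edges and reaches 1.0 in the middle, which is what suppresses the spectrum distortion the slide mentions.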

  11. Signal processing Speech can be represented on: • the time scale (waveform) • the frequency scale (spectrum) • both time and frequency scales (spectrogram) Features: loudness, pitch, cepstrum (Fourier analysis of the logarithmic amplitude spectrum of the signal) ~ autocorrelation, formants – frequencies with the highest energy. Digital filter-bank model – based on the human auditory system: quasi-linear up to about 1 kHz, quasi-logarithmic above 1 kHz. http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html
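The cepstrum defined above (Fourier analysis of the logarithmic amplitude spectrum) can be computed directly from that definition; a minimal sketch using a naive DFT rather than a fast library FFT, purely for illustration:

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2), fine for small frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def cepstrum(frame):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    n = len(frame)
    log_mag = [math.log(abs(c) + 1e-12) for c in dft(frame)]  # epsilon avoids log(0)
    return [sum(log_mag[f] * cmath.exp(2j * math.pi * f * t / n)
                for f in range(n)).real / n
            for t in range(n)]

frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
c = cepstrum(frame)
print(len(c))  # one cepstral coefficient per input sample
```

For voiced speech the cepstrum shows a peak at the lag corresponding to the pitch period, which is why the slide relates it to autocorrelation.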

  12. Pattern matching • Matching the feature vector against stored ones and finding the best match. There are four major ways to do this: • (1) template matching, • (2) hidden Markov models, • (3) neural networks, • (4) rule-based systems.
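The first of the four approaches, template matching, reduces to a nearest-neighbor search over stored feature vectors; a minimal sketch (the labels and toy feature values are invented for illustration):

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(vector, templates):
    """Return the label of the stored template closest to the input vector."""
    return min(templates, key=lambda label: distance(vector, templates[label]))

# Toy templates: one stored feature vector per word
templates = {"yes": [0.9, 0.1, 0.2], "no": [0.1, 0.8, 0.7]}
print(classify([0.8, 0.2, 0.3], templates))  # -> "yes"
```

The other three approaches replace this direct distance with a learned score: an HMM likelihood, a network output, or a rule firing.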

  13. Pattern matching: Word segmentation • The relationship between the segmentation of sensory input (e.g., the speech signal) into chunks and the recognition of those chunks. Two processes or one? – a chicken-and-egg problem • One approach to resolving the paradox is to assume that segmentation and recognition are two aspects of a single process – that tentative hypotheses about each issue are developed and tested simultaneously, and mutually consistent hypotheses are reinforced. • A second approach is to suppose that there are segmentation cues in the input that are used to give at least better-than-chance indications of which segments may correspond to identifiable words. • Bottom-up vs. top-down; word segmentation by children
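The first approach, developing segmentation and recognition hypotheses simultaneously, can be illustrated with the slide's own /greiteip/ example: enumerate every split of the phone string whose pieces are all recognizable words. A toy sketch (the phone spellings in the lexicon are simplified transcriptions, invented for illustration):

```python
def segmentations(phones, lexicon):
    """All ways to split a phone string into lexicon words, i.e. joint
    segmentation + recognition by exhaustive hypothesis testing."""
    if not phones:
        return [[]]  # empty input: one valid (empty) segmentation
    results = []
    for end in range(1, len(phones) + 1):
        word = phones[:end]
        if word in lexicon:  # recognition hypothesis for this chunk
            for rest in segmentations(phones[end:], lexicon):
                results.append([word] + rest)
    return results

# "gray", "tape", "great", "ape" in simplified phone spellings
lexicon = {"grei", "teip", "greit", "eip"}
print(segmentations("greiteip", lexicon))
# both hypotheses survive: [['grei', 'teip'], ['greit', 'eip']]
```

Both readings ("gray tape" and "great ape") come out as mutually consistent hypotheses; higher-level language analysis would then choose between them.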

  14. ASR with MLP

  15. Pattern matching: MLP features • Connectionism (brain-like) vs. the Turing machine • The main advantages of MLPs over other statistical modeling methods are: (1) MLP implementations typically require fewer assumptions and can be optimized in a data-driven fashion; (2) backpropagation training can be generalized to any optimization criterion, including maximum likelihood and all forms of discriminative training; and (3) MLP modules can easily be integrated into nonadaptive architectures. • Most industrial speech recognizers to date contain very few MLP components. • The main disadvantage: MLP training time is typically much greater than that of nonconnectionist models for which closed-form or fast iterative solutions can be derived.
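The MLP discussed above is, at inference time, just a stack of weighted sums passed through a nonlinearity. A minimal forward-pass sketch (layer sizes and random weights are arbitrary placeholders; a real recognizer would use trained weights and many more units):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w1, w2):
    """One-hidden-layer MLP: feature vector -> per-class scores in (0, 1)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]

random.seed(0)
w1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 3 inputs -> 4 hidden
w2 = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 4 hidden -> 2 classes
scores = forward([0.2, 0.5, 0.1], w1, w2)
print(scores)  # two per-class scores, one per output unit
```

Backpropagation, mentioned in advantage (2), adjusts `w1` and `w2` by gradient descent on whatever criterion is chosen; that training loop is what makes MLPs slow compared to closed-form models.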

  16. Pattern matching: MLP future • MLP-based models have outperformed state-of-the-art traditional recognition systems on some of the most challenging recognition tasks. • Despite the recent advances in multimodule architectures and gradient-based learning, several key questions are still unanswered, and many problems are still out of reach. How much has to be built into the system, and how much can be learned? How can one achieve true transformation-invariant perception with NNs? • New concepts (possibly inspired by biology) will be required for a complete solution. The accuracy of the best NN/HMM hybrids for written or spoken sentences cannot even be compared with human performance. Topics such as the recognition of three-dimensional objects in complex scenes are totally out of reach. Human-like accuracy on complex pattern-recognition tasks such as handwriting and speech recognition may not be achieved without a drastic increase in the available computing power. Several important questions may simply resolve themselves with the availability of more powerful hardware, allowing the use of brute-force methods and very large networks.

  17. Time alignment • A sequence of vectors recognized over time is aligned to represent a meaningful linguistic unit (phoneme, word). Different methods can be applied, for example the Viterbi algorithm, dynamic programming, and fuzzy rules.
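The dynamic-programming option mentioned above is classically dynamic time warping (DTW), which aligns two feature sequences spoken at different tempos. A minimal sketch over scalar features (real systems align vectors and usually constrain the warp path):

```python
def dtw(a, b):
    """Dynamic-programming time alignment: minimum-cost warp of a onto b."""
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local distance (scalar features)
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

print(dtw([1, 2, 3, 3, 4], [1, 2, 3, 4]))  # 0.0: same shape, different tempo
```

The Viterbi algorithm has the same dynamic-programming shape, but accumulates HMM state log-probabilities instead of distances.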

  18. Language analysis • The language units recognized over time are further combined and interpreted in terms of the syntax, semantics, and concepts of the language used in the system.

  19. ASR Problems The speech signal is highly variable according to speaker, speaking rate, context, and acoustic conditions. People use a huge vocabulary of over 300,000 words. Ambiguity of speech: • Homophones: "hear" and "here" • Word boundaries: /greiteip/ – "gray tape" or "great ape" • Syntactic ambiguity: "the boy jumped over the stream with the fish"

  20. Commercial Software • An extremely complicated task: no room for new players. • Nuance Communications dominates the server-based telephony and PC applications market. • IBM – command and control (grammar-constrained) and dictation. Claims it will reach human SR quality by 2010; Microsoft claims 2011. • Microsoft – Speech Server. • A growing market segment – mobile phones (operators such as Vodafone, et cetera).

  21. SAPI 5.1 • Allows speech synthesis (TTS – text to speech) and speech recognition to be used in custom applications • C&C (command and control) or natural speaking • Speaker-independent, can be trained • 60,000 English words • Free

  22. My experience: • Not based on .NET (maybe in Vista?); chaotic documentation (compared to .NET's), but still easy to use • Training greatly improves performance • Good accuracy on a constrained vocabulary; the grammar is defined as a state machine • Too sensitive (recognizes nearly any sound as a word); better results probably with a speaker-dependent system • Does not provide any way to view the recognition score – needed for setting a threshold • Useless for controlling Windows (because of the two previous points) • Some features are not implemented yet (though the interface is provided), for example adding the digit "3" to the vocabulary so it is recognized as "three" • Can be very useful when combined with an IR remote control (supplies parameters to commands) • Dictation mode not tested; used in MS Word • TTS sounds very artificial
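The "grammar defined as a state machine" constraint mentioned above can be sketched abstractly as follows. This is a hypothetical illustration of the idea only: the states, words, and dictionary encoding are invented, and this is not actual SAPI grammar syntax (SAPI grammars are declared in XML):

```python
# Each state maps an allowed word to the next state; "end" is accepting.
GRAMMAR = {
    "start":  {"open": "object", "close": "object"},
    "object": {"window": "end", "document": "end"},
}

def accepts(words):
    """True if the word sequence is a complete path through the grammar."""
    state = "start"
    for word in words:
        if word not in GRAMMAR.get(state, {}):
            return False  # word not allowed in this state
        state = GRAMMAR[state][word]
    return state == "end"

print(accepts(["open", "window"]))  # True:  a legal command
print(accepts(["window", "open"]))  # False: wrong order
```

Constraining the recognizer to such paths is what gives the good accuracy on a small vocabulary noted above: at each state only a handful of words compete, instead of the whole lexicon.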

  23. Literature • Nikola K. Kasabov, Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering, The MIT Press, 1998, 581 pp. • Michael A. Arbib (ed.), The Handbook of Brain Theory and Neural Networks, 2nd ed., The MIT Press, 2003, 1290 pp. • http://en.wikipedia.org/wiki/Speech_recognition/ • John-Paul Hosom, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong Yan, Wei Wei, Training Neural Networks for Speech Recognition, Center for Spoken Language Understanding (CSLU), Oregon Graduate Institute of Science and Technology, February 2, 1999 • http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html • http://www.phon.ucl.ac.uk/courses/spsci/matlab/lect10.html
