
Computer speech

This article explores the process of converting speech waves into computer-readable format through digitization. It covers topics such as sampling rates, Nyquist-Shannon theorem, anti-aliasing filters, and audio file formats like WAV, MP3, and AIFF. Additionally, it delves into the concepts and components of speech recognition, including acoustic modeling, hidden Markov models, lexicons, and language modeling. The challenges and considerations in automatic speech recognition (ASR) are also discussed.



Presentation Transcript


  1. Computer speech • Recording and sampling • Speech recognition

  2. Recording • Digital recording: The process of converting speech waves into computer-readable format is called digitization, or A/D conversion.

  3. Sampling • In order to transform sound into a digital format, you must sample it: the computer takes a snapshot of the sound level at small time intervals while you are recording. • The number of samples taken each second is called the sampling rate. The more samples that are taken, the better the sound quality, but higher-quality sound also requires more storage space. • For speech recordings, a sampling rate of 10 kHz is enough in most cases. Common sampling rates: 44100 Hz, 22050 Hz, 11025 Hz, 8000 Hz, 5000 Hz.
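The sampling-rate/storage trade-off above can be sketched with a small calculation, assuming uncompressed 16-bit mono PCM samples:

```python
# A minimal sketch: estimating raw (uncompressed) storage for a recording,
# assuming 16-bit mono PCM samples.

def raw_size_bytes(sampling_rate_hz, duration_s, bits_per_sample=16, channels=1):
    """Storage needed for uncompressed PCM audio of the given duration."""
    return sampling_rate_hz * duration_s * (bits_per_sample // 8) * channels

# One minute of CD-quality mono audio:
print(raw_size_bytes(44100, 60))  # 5292000 bytes, about 5 MB
# One minute at a 10 kHz rate, adequate for speech:
print(raw_size_bytes(10000, 60))  # 1200000 bytes, about 1.2 MB
```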

  4. Sampling • Nyquist-Shannon theorem: When sampling a signal (e.g., converting an analog signal to digital), the sampling frequency must be greater than twice the highest frequency in the input signal in order to reconstruct the original perfectly from the sampled version. • Aliasing: If the sampling frequency is less than twice the highest frequency component, then frequencies in the original signal above half the sampling rate will be "aliased" and will appear in the resulting signal as lower frequencies. • Anti-aliasing filter: typically a low-pass filter applied before sampling to ensure that no components with frequencies greater than half the sampling frequency remain.
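Aliasing can be demonstrated numerically. In this sketch (frequencies chosen for illustration), a 7 kHz cosine sampled at 10 kHz yields exactly the same sample values as a 3 kHz cosine, because 7 kHz is above the Nyquist frequency of 5 kHz and folds down to 10 − 7 = 3 kHz:

```python
import math

# Aliasing sketch: a 7 kHz tone sampled at 10 kHz is indistinguishable
# from a 3 kHz tone, since 7 kHz exceeds the Nyquist frequency (5 kHz).
fs = 10_000                 # sampling rate (Hz)
f_high, f_alias = 7_000, 3_000

for n in range(20):         # compare the first 20 sample instants
    t = n / fs
    s_high = math.cos(2 * math.pi * f_high * t)
    s_low = math.cos(2 * math.pi * f_alias * t)
    assert abs(s_high - s_low) < 1e-9   # identical after sampling

print("7 kHz and 3 kHz produce identical samples at fs = 10 kHz")
```

This is exactly the artifact an anti-aliasing filter prevents: it removes the 7 kHz component before sampling, so no spurious 3 kHz energy appears.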

  5. Audio file formats • There are a number of different types of audio files. • ".wav" files are commonly used for storing uncompressed sound, which means that they can be large in size - around 10 MB per minute of music. • ".mp3" files use the "MPEG Layer-3" codec (compressor-decompressor). ".mp3" files are compressed to roughly one-tenth the size of an equivalent .wav file while maintaining good audio quality. • ".aiff" is the standard audio file format used by Apple; it is like a .wav file for the Mac.
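An uncompressed .wav file can be written with Python's standard `wave` module. A minimal sketch (the filename and tone parameters are arbitrary choices for illustration):

```python
import math
import struct
import wave

# Write one second of a 440 Hz tone as 16-bit mono PCM at 8000 Hz.
fs = 8000
frames = bytearray()
for n in range(fs):                      # one second of samples
    sample = int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / fs))
    frames += struct.pack("<h", sample)  # 16-bit little-endian PCM

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)    # mono
    w.setsampwidth(2)    # 2 bytes = 16 bits per sample
    w.setframerate(fs)
    w.writeframes(frames)
```

The resulting file stores raw samples with a small header, which is why .wav sizes grow linearly with duration and sampling rate.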

  6. Praat: doing phonetics by computer

  7. Speech recognition • Goal: to convert an acoustic signal O into a word sequence W. • Statistics-based approach: What is the most likely sentence out of all sentences in the language L given some acoustic input O? • Treat acoustic input O as sequence of individual observations • O = o1,o2,o3,…,ot • Define a sentence as a sequence of words: • W = w1,w2,w3,…,wn

  8. Speech recognition architecture • Solution: search through all possible sentences and pick the one that is most probable given the waveform/observation. • By Bayes' rule: W* = argmax_W P(W|O) = argmax_W P(O|W) P(W) / P(O) = argmax_W P(O|W) P(W), since P(O) is the same for each W.
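The decision rule can be illustrated with a toy example (all probabilities below are invented): pick the candidate W maximizing P(O|W) * P(W), dropping P(O) because it is constant across candidates.

```python
# Toy noisy-channel decision: hypothetical candidate sentences with
# made-up acoustic likelihoods P(O|W) and language-model priors P(W).
candidates = {
    # W: (P(O|W), P(W))
    "recognize speech":   (0.0009, 0.0100),
    "wreck a nice beach": (0.0010, 0.0001),
}

best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
print(best)  # "recognize speech": the language-model prior outweighs
             # the second candidate's slightly better acoustic score
```

This is why both an acoustic model and a language model are needed: acoustically similar hypotheses are disambiguated by their prior plausibility.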

  9. Speech recognizer components • Acoustic modeling: Describes the acoustic patterns of phones in the language. 1. Feature extraction 2. Hidden Markov Model • Lexicon (pronouncing dictionary): Describes the sequences of phones that make up words in the language. • Language modeling: Describes the likelihood of various sequences of words being spoken in the language.

  10. Acoustic modeling • A vector of 39 features is extracted every 10 ms from a 20-25 ms window of speech. • Each phone is represented as a Hidden Markov Model (HMM) that consists of three states: the beginning part (s1), the middle part (s2), and the end part (s3). Each state is represented by a Gaussian model over the 39 features.
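Scoring a feature vector against a state's Gaussian can be sketched as follows. All numbers are invented, and only 3 feature dimensions are shown instead of the 39 used in practice; a diagonal covariance is assumed for simplicity:

```python
import math

def log_gaussian(x, mean, var):
    """Log density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

frame = [1.0, -0.5, 0.2]     # one toy feature vector (39-dim in practice)
states = {                   # (mean, variance) per HMM state, all invented
    "s1": ([1.1, -0.4, 0.0], [0.5, 0.5, 0.5]),
    "s2": ([0.0,  2.0, 1.0], [0.5, 0.5, 0.5]),
    "s3": ([-1.0, 0.0, 0.3], [0.5, 0.5, 0.5]),
}

best = max(states, key=lambda s: log_gaussian(frame, *states[s]))
print(best)  # s1: the state whose mean is closest to the frame
```

During recognition these per-state scores are combined with the HMM's transition probabilities to score whole phone and word sequences.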

  11. Lexicon • The CMU pronouncing dictionary: a pronunciation dictionary for American English that contains over 125,000 words and their phone transcriptions. http://www.speech.cs.cmu.edu/cgi-bin/cmudict • The CMU dictionary uses 39 phonemes (in ARPABET); word stress is labeled on vowels: 0 (no stress), 1 (primary stress), 2 (secondary stress).
PHONETICS F AH0 N EH1 T IH0 K S
COFFEE K AA1 F IY0
COFFEE(2) K AO1 F IY0
RESEARCH R IY0 S ER1 CH
RESEARCH(2) R IY1 S ER0 CH
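Entries in this format (word, optional "(n)" variant marker, then phones with stress digits on vowels) are easy to parse into a lexicon. A minimal sketch using the example entries above:

```python
import re

# Example entries in CMU-dictionary style, taken from the slide above.
lines = [
    "PHONETICS F AH0 N EH1 T IH0 K S",
    "COFFEE K AA1 F IY0",
    "COFFEE(2) K AO1 F IY0",
]

lexicon = {}
for line in lines:
    head, *phones = line.split()
    word = re.sub(r"\(\d+\)$", "", head)   # strip the "(2)" variant marker
    lexicon.setdefault(word, []).append(phones)

print(lexicon["COFFEE"])  # two pronunciation variants for COFFEE
```

A recognizer uses such a lexicon to expand each word hypothesis into the phone sequence its acoustic models must match.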

  12. Language Modeling • We want to compute the probability of a word sequence, p(w1,w2,w3,…,wn). • Using the Chain rule, we have, for example: p(speech, recognition, is, very, fun) = p(speech)*p(recognition|speech)*p(is|speech, recognition)*p(very|speech, recognition, is)*p(fun|speech, recognition, is, very) • Learn p(fun|speech, recognition, is, very) from data? - we’ll never be able to get enough data to compute the probabilities of long sentences. • Instead, we need to make some Markov assumptions: • Zeroth order: p(fun|speech, recognition, is, very) = p(fun) - unigram • First order: p(fun|speech, recognition, is, very) = p(fun|very) - bigram • Second order: p(fun|speech, recognition, is, very) = p(fun|is, very) - trigram …
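The bigram (first-order) case can be estimated by relative frequency: p(w2|w1) = count(w1, w2) / count(w1). A minimal sketch over a tiny invented corpus:

```python
from collections import Counter

# Tiny invented corpus for illustration only.
corpus = "speech recognition is fun and speech synthesis is fun".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1):
    """Relative-frequency estimate of p(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p_bigram("recognition", "speech"))  # 0.5: "speech" occurs twice,
                                          # followed by "recognition" once
```

Real language models are trained on far larger corpora and smoothed so that unseen bigrams do not get zero probability; that refinement is omitted here.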

  13. Some Challenges in ASR • Robustness and Adaptability – to changing conditions (different mic, background noise, new speaker, different speaking rate, etc.). • Out-of-Vocabulary (OOV) Words – systems must have some method of detecting OOV words and dealing with them in a sensible way. • Spontaneous Speech – disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem. • Prosody – stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). • Accent, dialect and mixed language – non-native speech is a huge problem, especially where code-switching is commonplace.
