Natural Language Processing A host of technologies touching many interdisciplinary areas Technologies and applications: • Automatic Speech Recognition • Speech Coding • Speaker identification • Speech Transformation • Speech Synthesis • Speech Mining • Dialog Systems • Talking Heads • Language Translation • Hearing aids • Speech Enhancements • Mobile devices • Gaming Related disciplines: • Signal Processing • Acoustics • Physics • Engineering • Linguistics • Psychology • Mathematics • Computer Science • Communication • Cognition
Human vs. Machine • Turing Test: A machine passes if we cannot distinguish whether we are communicating with a human or a machine • Telephone Automated Systems • How many of us are fooled? • Why? • Speech has many ambiguities • Humans change words on-the-fly while speaking • Colloquial speech does not follow a strict grammar • Co-articulations and sloppy pronunciation • Humans are good at filtering noise • Humans understand world view and context • Humans recognize individual characteristics • Prosody contains information like emotion and emphasis
One sentence, eight possible meanings I made her duck • I cooked waterfowl for her. • I stole her waterfowl and cooked it. • I used my abilities to create a living waterfowl for her. • I caused her to bid low in the game of bridge. • I created the plastic duck that she owns. • I caused her to quickly lower her head or body. • I waved my magic wand and turned her into waterfowl. • I caused her to avoid the test.
Ambiguities in speech • Ambiguities in pronunciation: “haya dun”; “ay d ih s h er d s ah m th in ng ah b aw m uh v ih ng r ih s en l ih” • Ambiguities in articulation (Coarticulation): tee, tree, city, beaten, steep; this car, this ship • Ambiguities in meaning: “We will review that in the near future” vs. “He lives near the station”; two, to, and too
Semantic Problems • “I called my mother on the television and did not understand the door. It was too breakfast, but they came from far to near. My mother is not too old for me to be young.” (Wernicke’s aphasia) • “we went up the river sunk.” • “John I believe Sally said Bill believed Sue saw.”
Could a computer infer the meaning? I cdnuolt blveiee that I cluod aulaclty uesdnatnrd what I was rdgnieg. The phaonmneal pweor of the hmuan mnid Aoccdrnig to rscheearch at Cmabridgde Uinervtisy, it deosn't mttaer in what oredr the ltteers in a word are, the olny iprmoatnt tihng is that the frist and lsat ltteer be in the rghit pclae. The rset can be a taotl mses and you can still raed it wouthit a problem. This is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the word as a wlohe. Amzanig huh? Yaeh and I awlyas thought slpeling was ipmorantt!
Robot-human dialog 99% accuracy Robo: “Hi, my name is Robo. I am looking for work to raise funds for Natural Language Processing research.” Person: “Do you know how to paint?” Robo: “I have successfully completed training in this skill.” Person: “Great! The porch needs painting. Here are the brushes and paint.” Robo rolls away efficiently. An hour later he returns. Robo: “The task is complete.” Person: “That was fast, here is your salary; good job, and come back again.” Robo speaks while rolling away with the payment. Robo: “The car was not a Porsche; it was a Mercedes.” Moral: You need a sense of humor to work in this field.
State-of-the-art • Recognition • Large vocabulary recognition with 98% accuracy • Difficulty: filtering background noise • Synthesis • Produce clear computer generated speech that is easily understood • Difficulty: incorporate prosody to achieve natural-sounding speech • Coding • Compress 256k bps (bits per second) audio to 4k bits per second • Research: compress to as low as 600 bps (Human brain: 50 bps) • Examples of on-going research • Match speech to talking heads • Transform speech from one person to sound like another • Authentication by voice • Automated analysis of audio • Human-machine dialog • Language independent algorithms
A bit of history • Talking machines go back to the middle ages • Vocoder (Based on Kempelen’s speaking machine) • Theremin (Precursor to digital synthesizers) • Signal Analysis research goes back to the 1700s • Fourier and Laplace • Telephone and speech coding • Rex talking dog in 1922 • More recently: Dudley, Levinson • Dramatic progress resulting from ARPA (Advanced Research Projects Agency) challenges
Sample Sound Waves (Sound Editor) • Download and install from ACORNS web-site • Figure, top: “this is a demo”; bottom: “A goat …. A coat” • Terms: time domain, frequency domain, cepstrals, windows, frames, formants, features, frequency, period, amplitude, pitch, quasi periodic, vowels, fricatives, plosives, energy, zero crossings, sampling rate, filter, Nyquist, onset, duration, phase
GraphCalc (Freeware – download) • I advise you to download and install it • Good for creating and visualizing signals.
Introduction to Sound • Sound results from vibrations in air pressure • Amplitude: The distance from zero to the maximum height • Period: The time it takes for a sine wave to complete one cycle • Wavelength (λ): The distance from one point to the same point on the next cycle • Frequency (Hz): The repetitions or cycles per second
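The definitions on this slide can be tried out numerically. The sketch below is illustrative only: the function names are made up for this example, and the speed of sound (343 m/s, roughly dry air at 20 °C) is an assumption that the slide does not state. It relates frequency to period and wavelength and generates the samples of a sine wave with a chosen amplitude.

```python
import math

# Assumed value, not from the slides: speed of sound in dry air near 20 °C.
SPEED_OF_SOUND_M_PER_S = 343.0


def describe_tone(frequency_hz, speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Return (period in seconds, wavelength in meters) for a pure tone."""
    period_s = 1.0 / frequency_hz              # time for one cycle
    wavelength_m = speed_m_per_s / frequency_hz  # distance covered in one cycle
    return period_s, wavelength_m


def sine_samples(amplitude, frequency_hz, sample_rate_hz, duration_s):
    """Sample amplitude * sin(2*pi*f*t) at evenly spaced times."""
    count = int(round(sample_rate_hz * duration_s))
    return [
        amplitude * math.sin(2.0 * math.pi * frequency_hz * i / sample_rate_hz)
        for i in range(count)
    ]


if __name__ == "__main__":
    period, wavelength = describe_tone(440.0)
    print(f"440 Hz: period {period:.5f} s, wavelength {wavelength:.3f} m")
    wave = sine_samples(0.5, 100.0, 8000, 0.01)  # one full cycle of 100 Hz
    print(f"{len(wave)} samples, peak {max(wave):.3f}")
```

Doubling the frequency halves both the period and the wavelength, while the amplitude (loudness) is independent of either.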
Sound • High pitched sounds vibrate fast • Loud sounds have large amplitudes • Sounds with different timbre (qualities) have subordinate frequencies attached • Complex sound waves are a series of waves added together • http://www.colorado.edu/physics/phet/ contains sound and other simulations
Understanding Sine Waves • In a right triangle, the sine of an angle is the ratio of the height (the side opposite the angle) to the hypotenuse • Many phenomena in nature occur in sine wave patterns
Complex Wave Patterns • Sound waves occupying the same space combine to form a new wave of a different shape. • Harmonically related waves add together and can create any complex wave pattern. • Harmonically related waves have frequencies that are multiples of a basic frequency. Fourier proposed that all sound signals can be decomposed into a group of sine waves
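As a rough illustration of harmonically related waves adding up to a complex shape, the sketch below sums sine waves whose frequencies are whole-number multiples of a fundamental. The helper names are hypothetical, and the choice of a square wave (odd harmonics with amplitude 4/(πk)) is just one standard textbook example, not something taken from the slides.

```python
import math


def harmonic_sum(t, fundamental_hz, amplitudes):
    """Add sine waves at 1x, 2x, 3x, ... the fundamental frequency.

    amplitudes[k] is the amplitude of harmonic number k + 1.
    """
    return sum(
        a * math.sin(2.0 * math.pi * (k + 1) * fundamental_hz * t)
        for k, a in enumerate(amplitudes)
    )


def square_wave_amplitudes(harmonic_count):
    """Fourier-series amplitudes of a unit square wave: odd harmonics only."""
    return [
        4.0 / (math.pi * k) if k % 2 == 1 else 0.0
        for k in range(1, harmonic_count + 1)
    ]


if __name__ == "__main__":
    # More harmonics bring the sum closer to a flat-topped square wave.
    for count in (1, 5, 49, 199):
        value = harmonic_sum(0.25, 1.0, square_wave_amplitudes(count))
        print(f"{count:3d} harmonics -> value at quarter period: {value:.4f}")
```

The same function can build other timbres by passing a different amplitude list, which is one way to hear how the "subordinate frequencies" change the quality of a sound.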
Nyquist Theorem How many cycles per second do we need? • Maximum Signal Frequency (fmax) = highest frequency present in the signal • Sampling Frequency (fs) = samples per second • Nyquist Frequency (fN) = fs / 2 = highest detectable frequency • Theorem: fs >= 2 * fmax (equivalently, fmax <= fN) • Figures: Inadequate Sampling vs. Adequate Sampling
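The sampling requirement can be written as a few one-line checks. This is a minimal sketch with made-up function names, and it assumes the common convention that the Nyquist frequency is half the sampling rate, so a signal is adequately sampled when the sampling rate is at least twice its highest frequency.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2.0


def minimum_sample_rate(max_signal_hz):
    """Lowest sampling rate that captures a signal up to max_signal_hz."""
    return 2.0 * max_signal_hz


def is_adequately_sampled(max_signal_hz, sample_rate_hz):
    """True when the sampling rate is at least twice the highest frequency."""
    return sample_rate_hz >= minimum_sample_rate(max_signal_hz)


if __name__ == "__main__":
    for rate in (8000, 16000, 44100):
        print(f"{rate} samples/s captures up to {nyquist_frequency(rate):.0f} Hz")
    print(is_adequately_sampled(4000, 16000))   # speech band, PC rate
    print(is_adequately_sampled(10000, 16000))  # too high for this rate
```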
Aliasing Different frequencies become indistinguishable • When does this occur? • Frequencies (f>N) are present that are above the Nyquist Frequency (fN) • If f∆ = f>N – fN, then fN+f∆ is indistinguishable from fN-f∆. • What do we do about it? • Place an anti-aliasing filter before sampling to eliminate high frequencies • This CANNOT be done in software after the signal has been sampled • Example of aliasing - Take a picture of the sun every 23 hours • Each picture shows the sun one hour earlier in its daily path • 24 pictures x 23 hours = 552 hours for one apparent day • Sun appears to move from west to east
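Both claims on this slide can be checked with a short sketch. The function names below are hypothetical, and the Nyquist frequency is taken to be half the sampling rate. The first helper computes the frequency that a tone appears to have after sampling; the second samples a cosine so that two frequencies mirrored around the Nyquist frequency can be compared sample by sample.

```python
import math


def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a tone after sampling at sample_rate_hz."""
    nearest_multiple = round(signal_hz / sample_rate_hz) * sample_rate_hz
    return abs(signal_hz - nearest_multiple)


def sampled_cosine(signal_hz, sample_rate_hz, count):
    """Return `count` samples of cos(2*pi*f*t) taken at the sampling rate."""
    return [
        math.cos(2.0 * math.pi * signal_hz * i / sample_rate_hz)
        for i in range(count)
    ]


if __name__ == "__main__":
    fs = 8000.0  # Nyquist frequency is 4000 Hz
    above = sampled_cosine(5000.0, fs, 16)  # 4000 + 1000
    below = sampled_cosine(3000.0, fs, 16)  # 4000 - 1000
    worst = max(abs(a - b) for a, b in zip(above, below))
    print(f"largest sample difference: {worst:.2e}")
    print(f"5000 Hz appears as {alias_frequency(5000.0, fs):.0f} Hz")

    # Sun example: a 24-hour cycle photographed once every 23 hours.
    apparent = alias_frequency(1.0 / 24.0, 1.0 / 23.0)  # cycles per hour
    print(f"apparent day length: {1.0 / apparent:.0f} hours")
```

The sun example works out to the same 552 hours as the slide: a cycle of 1/24 per hour sampled at 1/23 per hour aliases to 1/552 per hour.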
Time vs. Frequency Domain Time Domain: The signal is a composite wave of different frequencies, viewed as amplitude over time Frequency Domain: The composite wave split into its individual frequency components
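One way to move from the time domain to the frequency domain is the discrete Fourier transform. The sketch below is a deliberately naive, slow version written with the standard library only so that each step is visible; real systems would use an FFT routine instead. The function names and the two test tones (50 Hz and 120 Hz) are made up for this example.

```python
import cmath
import math


def synthesize(components, sample_rate_hz, count):
    """Time domain: add sine waves given as (frequency_hz, amplitude) pairs."""
    return [
        sum(a * math.sin(2.0 * math.pi * f * i / sample_rate_hz)
            for f, a in components)
        for i in range(count)
    ]


def dft_magnitudes(samples):
    """Frequency domain: amplitude of each bin from 0 up to the Nyquist bin.

    The 2/n scaling recovers sine amplitudes for bins strictly between
    0 and n/2; the two end bins would need 1/n instead.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        magnitudes.append(abs(total) * 2.0 / n)
    return magnitudes


def dominant_frequencies(samples, sample_rate_hz, count):
    """Return the `count` strongest frequencies in Hz, in ascending order."""
    magnitudes = dft_magnitudes(samples)
    n = len(samples)
    strongest = sorted(range(len(magnitudes)),
                       key=lambda k: magnitudes[k], reverse=True)[:count]
    return sorted(k * sample_rate_hz / n for k in strongest)


if __name__ == "__main__":
    signal = synthesize([(50.0, 1.0), (120.0, 0.5)], 400, 400)
    print(dominant_frequencies(signal, 400, 2))
```

With 400 samples at 400 samples per second, each bin is 1 Hz wide, so the two tones fall exactly on bins 50 and 120.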
Formants • F0: Fundamental frequency of sound production (the rate of vocal cord vibration) • Male average: 100 Hz, Female average: 200 Hz, Child average: 300 Hz • F1, F2, F3: Formants are resonances of the vocal tract that reinforce nearby multiples (harmonics) of the fundamental frequency; they vary depending on the shape of the vocal tract. • Moving the articulators to the back moves formants together • Moving the articulators to the front moves formants apart • Roundness impacts the complex relationship between F2 and F3 • Formants are an excellent feature for distinguishing vowels. They are less useful for distinguishing unvoiced sounds
Communication Create and receive information rather than passively extracting it • Form (message) • Meaning (semantics) • Signal (audio sound waves, written text) • Channel (medium): spoken, written, gestures
Semiotics The science of signs and symbols • Affective communication – Express primitive emotions. Meanings universal. • Iconic – Meaning easily inferred from the form of expression (slippery road signs). • Symbolic – Create arbitrary relationships between form and meaning. Each symbol or sound is clearly distinguished (colors) – limited set of meanings. • Natural – Add grammar, syntax, and sound combinations to express abstract concepts; express productively an unlimited number of messages.
Science of Language • Morphology: Word structure and formation • Acoustics: Study of sound • Phonology: Classification of linguistic sounds • Semantics: Study of meaning • Pragmatics: How language is used • Phonetics: Speech production and perception Natural Language Processing draws from these fields to engineer practical systems that work.
Speech Noisy channel • Encode – send – signal – receive – decode • Communication tends to be effective and efficient • Speech is as easy on the mouth as possible while still being understood • Speakers adjust according to implied knowledge they share with their listeners
Human Language • Verbal: Discrete message carried with continuous signal. • Prosodic: Continuous parallel intonation scale. • Affective: Instinctive, sudden expression • Augmentative: Varied by individual to clarify or inject personality. • Supra-segmental: Intonation patterns of a language • Null (neutral): Minimal use of prosody to accent words and phrases. • Text: Written channel • Punctuation and context hint at prosody • TTS infers prosody using language-specific knowledge.
Language Components • Phoneme: Smallest discrete unit of sound that distinguishes words (Minimal Pair Principle) • Syllable: Acoustic component perceived as a single unit • Morpheme: Smallest linguistic unit with meaning • Word: Speaker identifiable unit of meaning • Phrase: Sub-message of one or more words • Sentence: Self-contained message derived from a sequence of phrases and words
Natural Language Characteristics • Phones are the set of all possible sounds that humans can articulate. Each phone has unique characteristics. • Each language selects a set of phonemes from the larger set of phones (English – 40). Our hearing is tuned to respond to this smaller set. • Speech is a highly redundant sequence of sounds (phonemes), pitch (prosody), gestures, and expressions varying with time.
Audio Signal Redundancy • Continuous signal (virtually infinite) • Sampled • Mac: 44,100 2-byte samples per second (705 kbps) • PC: 16,000 2-byte samples per second (256 kbps) • Telephone: 4k 1-byte samples per second (32 kbps) • CELP Compression: 8 kbps • Research: 4 kbps, 2.4 kbps • Military applications: 600 bps • Human brain: 50 bps
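The uncompressed bit rates on this slide follow from samples per second × bytes per sample × 8 bits. The short sketch below reproduces that arithmetic and the compression ratios it implies; the function names are made up for this example, and the rates are the ones listed on the slide.

```python
def bit_rate_bps(samples_per_second, bytes_per_sample):
    """Uncompressed bit rate: samples/s x bytes/sample x 8 bits/byte."""
    return samples_per_second * bytes_per_sample * 8


def compression_ratio(original_bps, compressed_bps):
    """How many times smaller the compressed stream is."""
    return original_bps / compressed_bps


if __name__ == "__main__":
    pc_rate = bit_rate_bps(16_000, 2)
    print(f"Mac:       {bit_rate_bps(44_100, 2):>7} bps")
    print(f"PC:        {pc_rate:>7} bps")
    print(f"Telephone: {bit_rate_bps(4_000, 1):>7} bps")
    for label, target in (("CELP", 8_000), ("Research", 2_400),
                          ("Military", 600)):
        ratio = compression_ratio(pc_rate, target)
        print(f"{label}: {ratio:.1f}x smaller than PC audio")
```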
Course Goals • Introduce algorithms and techniques used in natural language processing • Explain how these techniques are useful outside of this specific field • Provide enough background so we, as a class, can begin to work towards significant contributions in the winter follow-up class • Discuss various areas, but focus on speech synthesis • Discuss topics in a manner accessible to students with diverse backgrounds
Projects • Pronunciation aid • Helps language learners grasp the phonemes that are not in their first language • Helps the hearing impaired learn to speak normally through visual feedback • Generate speech from a language-independent script • Design a language-independent script and identify possible problems. A future application is to analyze speech and translate it into this script. • Identify and codify speaker-dependent speech components • Future applications are computer-based games where audio can transform voices to a multitude of speakers
Pronunciation Lesson The sound “m” • Sound wave (program): “mmmmmm” • Sound wave (your sound) • Picture key: Tongue placement, width of windpipe, vocal cord vibration
Pronunciation Lesson The sound “h” • Sound wave (program): “hhhhhhhh” • Sound wave (your sound) • Picture key, changes in: Tongue placement, width of windpipe, vocal cord vibration