Speech Processing

Speech Processing Production and Classification of Speech Sounds

Introduction • Simplified view of Speech Production (see Figure 3.1 in the next slide) • Lungs – act as a power supply and provide airflow to the larynx stage. • Larynx – modulates airflow and provides either: • Periodic puff-like airflow, or • Noisy airflow to vocal tract. • Vocal-tract – gives the modulated airflow its “color” (spectrally shaping the source) with: • Oral, • Nasal, and • Pharynx cavities. Veton Këpuska

Figure 3.1

Introduction • Sound sources can also be generated by constrictions and boundaries made within the vocal tract itself: • Periodic source, • Noisy source, or • Impulsive airflow source. • Note that the speech production mechanism does not generate a perfectly periodic, impulsive, or noisy source. • Three general categories of source for speech sounds: • Periodic • Noisy • Impulsive • Illustration of each in the word “shop”: • “sh” – noisy • “o” – periodic • “p” – impulsive

Figure: example of the word “shop”, with the periodic source, noise-like signal, and impulsive source segments labeled.

Introduction • Distinguishable speech sounds are determined: • not only by the source, but • also by different vocal tract configurations, • and by the combination of both. • Speech sound classes are referred to as phonemes. • Phonemics is the discipline that studies phoneme realizations (e.g., in a language). • Each phoneme class provides a certain meaning in a word. • Within a phoneme class there exist many sound variations that provide the same meaning. The study of these sound variations is called phonetics. • Phonemes are the basic building blocks of a language: • They are concatenated (more or less) as discrete elements into words, • according to certain phonemic and grammatical rules.

Introduction • This chapter will cover: • A description of the speech production mechanism • The resulting variety of phonetic sound patterns • How these sounds differ among speakers.

Anatomy and Physiology of Speech Production

Anatomy and Physiology of Speech Production • The anatomy of speech production is shown in Figure 3.2. • Lungs: • Inhalation and exhalation of air. • Connected to the vocal tract through the trachea (“windpipe”) and epiglottis. • The trachea is a ~12-cm-long, ~1.5–2-cm-diameter pipe. • During speaking, the rhythmical cycle of inhalation and exhalation changes to accommodate speech production: • The duration of exhalation becomes roughly equal to the length of the sentence or phrase. • Lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Anatomy and Physiology of Speech Production • Larynx: • Complicated system of cartilages, flesh, muscles, and ligaments. • Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3. • Vocal fold length is: • ~15 mm in men • ~13 mm in women

Anatomy and Physiology of Speech Production • Three primary states of the vocal folds: • Breathing – arytenoid cartilages are held outward. • Voiced – arytenoid cartilages are held close together. • Unvoiced – arytenoid cartilages are held outward or partially closed. • The complex motion of the vocal folds is illustrated in Figure 3.4. • Nonlinear two-mass model of Flanagan et al. (Figure 3.5). • Arytenoid (ar·y·te·noid, \ˌa-rə-ˈtē-ˌnȯid, ə-ˈri-tən-ˌȯid\; adjective; from New Latin arytaenoides, from Greek arytainoeidēs, literally “ladle-shaped”; circa 1751): 1. relating to or being either of two small laryngeal cartilages to which the vocal cords are attached; 2. relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.

Anatomy and Physiology of Speech Production • If one were to measure the airflow velocity at the glottis as a function of time, the resulting waveform would be similar to that of Figure 3.6. • Closed phase: the folds are closed and no flow occurs. • Open phase: the folds are open and the flow increases up to a maximum. • Return phase: the time interval from the maximum airflow until glottal closure. • The specific flow shape can change with: • Speaker • Speaking style • Specific speech sound. • The airflow at the glottis is referred to as glottal flow. • The time duration of one glottal cycle is referred to as the pitch period. • The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.
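The reciprocal relation between pitch period and pitch can be written out in a few lines. This is only a minimal sketch: the function name `pitch_hz`, the 10 kHz sampling rate, and the 80-sample period are illustrative assumptions, not values from the slides.

```python
def pitch_hz(pitch_period_samples, sample_rate_hz):
    """Pitch (fundamental frequency) in Hz as the reciprocal of the pitch period."""
    if pitch_period_samples <= 0:
        raise ValueError("pitch period must be positive")
    # Convert the period from samples to seconds, then take the reciprocal.
    pitch_period_seconds = pitch_period_samples / sample_rate_hz
    return 1.0 / pitch_period_seconds


# Hypothetical example: one glottal cycle lasting 80 samples at 10 kHz (8 ms).
example_pitch = pitch_hz(80, 10_000.0)
print(example_pitch)
```

A shorter pitch period gives a higher pitch, which is why voices with shorter glottal cycles sound higher.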

Example 3.1 • Consider a glottal flow waveform model of the form: $u[n] = g[n] * p[n]$, where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$. Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window, or simply a window. The window, denoted by $w[n, \tau]$, is centered at time $\tau$, as illustrated in Figure 3.7 (next slide), and the resulting waveform segment is written as: $u[n, \tau] = w[n, \tau]\,(g[n] * p[n])$. Using the Multiplication and Convolution Theorems of Chapter 2, the following expression is obtained in the frequency domain: $U(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$

Example 3.1 • where: • $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, • $G(\omega)$ is the Fourier transform of $g[n]$, • $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch. • As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes. • Effect of the harmonics of the glottal waveform on the spectrum.

Example 3.1 • A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, $\omega_k = (2\pi/P)k$. • The first harmonic is also the fundamental frequency. • At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k, \tau)$ weighted by $G(\omega_k)$. • The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.
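The harmonic structure described in Example 3.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: the raised-cosine pulse standing in for g[n], the Hamming window, and the sizes (P = 16, N = 128) are hypothetical choices, not the author's. With these sizes the harmonics fall on DFT bins that are multiples of N/P.

```python
import cmath
import math


def glottal_pulse(length):
    """Hypothetical single-cycle pulse (raised cosine); a stand-in for g[n]."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / length)) for n in range(length)]


def periodic_source(g, period, num_samples):
    """u[n] = g[n] * p[n]: one copy of g placed at every multiple of the period."""
    u = [0.0] * num_samples
    for start in range(0, num_samples, period):
        for i, value in enumerate(g):
            if start + i < num_samples:
                u[start + i] += value
    return u


def hamming(num_samples):
    """Symmetric Hamming window used here as the analysis window w[n]."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def dft_magnitude(x):
    """Magnitude of the DFT for bins 0 .. len(x)/2 (direct O(N^2) evaluation)."""
    total = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / total) for n in range(total)))
            for k in range(total // 2 + 1)]


P = 16   # pitch period in samples (assumed)
N = 128  # analysis segment length: 8 pitch periods (assumed)
u = periodic_source(glottal_pulse(8), P, N)
segment = [w * s for w, s in zip(hamming(N), u)]
mags = dft_magnitude(segment)

harmonic_bin = N // P  # spacing of the harmonics in DFT bins
print("harmonic spacing in bins:", harmonic_bin)
print("magnitudes at the first harmonics:",
      [round(mags[k * harmonic_bin], 2) for k in range(1, 4)])
```

Halving `P` doubles `harmonic_bin`, which is the "shorter pitch period, wider harmonic spacing" relation stated above; the magnitudes at the harmonic bins trace out the envelope of the assumed pulse.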

Anatomy and Physiology of Speech Production • The Fourier transform of a periodic glottal waveform is characterized by harmonics. • Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has, on average, a -12 dB/octave rolloff. • The rolloff depends on the nature of the airflow and on speaker characteristics. • See Exercise 3.18 for further details. • The model in Example 3.1 is idealized: even for sustained voicing, a fixed pitch period is almost never maintained over time: • The pitch period can vary “randomly” over successive periods – pitch “jitter”. • The amplitude of the airflow velocity within a glottal cycle may differ across consecutive pitch periods – amplitude “shimmer”. • These variations are perhaps due to: • Time-varying characteristics of the vocal tract and vocal folds, • Nonlinear behavior in the speech anatomy, or • An underlying deterministic (chaotic) system whose output appears random. • Jitter and shimmer are among the factors that give vowels their naturalness. • In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. • Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).
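Jitter and shimmer can be illustrated by perturbing an otherwise ideal pulse train. The following is a minimal sketch, not a physiological model: the function `pulse_train`, the uniform random perturbation, and the 2% and 5% values are assumptions chosen purely for illustration.

```python
import random


def pulse_train(num_pulses, mean_period, jitter=0.0, shimmer=0.0, seed=0):
    """Return (onset times, amplitudes) of glottal pulses.

    jitter  -- maximum fractional deviation of each pitch period (assumed uniform)
    shimmer -- maximum fractional deviation of each pulse amplitude (assumed uniform)
    """
    rng = random.Random(seed)
    onsets, amplitudes = [], []
    t = 0.0
    for _ in range(num_pulses):
        onsets.append(t)
        amplitudes.append(1.0 + rng.uniform(-shimmer, shimmer))
        # The next pulse starts one (possibly perturbed) pitch period later.
        t += mean_period * (1.0 + rng.uniform(-jitter, jitter))
    return onsets, amplitudes


# Ideal "machine-like" source: fixed period, fixed amplitude.
ideal_onsets, ideal_amps = pulse_train(5, 8.0)

# Perturbed source: periods vary by up to 2%, amplitudes by up to 5% (arbitrary values).
onsets, amps = pulse_train(5, 8.0, jitter=0.02, shimmer=0.05, seed=1)
periods = [b - a for a, b in zip(onsets, onsets[1:])]
print(periods)
print(amps)
```

Setting both parameters to zero recovers the strictly periodic source of Example 3.1.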

Anatomy and Physiology of Speech Production • States of the vocal folds: • Breathing • Voicing • Unvoicing: • Turbulence at the vocal folds – aspiration • Example: “he” – whispered sounds • Aspiration also occurs with voiced sounds (breathy voice): • Part of the vocal folds vibrates while another part is nearly fixed.

Anatomy and Physiology of Speech Production • Other forms of atypical vocal fold movement: • Creaky voice – very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has: • High pitch, and • Irregular pitch. • Vocal fry – vocal folds are massy and relaxed, resulting in a voice with an abnormally: • Low pitch • Irregular pitch. • Characterized by secondary glottal pulses close to and overlapping the primary glottal pulse. • Result of coupling of the false vocal folds with the true vocal folds. • Diplophonic voice – secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).

Examples of atypical voice types

Vocal Tract • Comprises the oral cavity, from the larynx to the lips, and the nasal passage, which is coupled to the oral tract by way of the velum. • The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators: • Tongue • Teeth • Lips • Jaw. • Average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm². • The purpose of the vocal tract is to: • Spectrally “color” the source, and • Generate new sources for sound production.

Spectral Shaping • Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances. • The resonance frequencies of the vocal tract are called formant frequencies or simply formants. • Formants change with different vocal tract configurations, as depicted in Figure 3.10.

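Spectral Shaping • The peaks of the spectrum of the vocal tract response correspond approximately to its formants: • Consider a time-invariant all-pole linear system model of the vocal tract with a pole at $z_0 = r_0 e^{j\omega_0}$ that corresponds approximately to a vocal tract formant. • The frequency of the formant is $\omega_0$. • The bandwidth depends on the distance of the pole from the unit circle (i.e., on $r_0$). • Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed in either product or partial-fraction expansion form: $H(z) = \frac{A}{\prod_{k=1}^{N} (1 - c_k z^{-1})(1 - c_k^* z^{-1})} = \sum_{k=1}^{N} \left[ \frac{A_k}{1 - c_k z^{-1}} + \frac{A_k^*}{1 - c_k^* z^{-1}} \right]$, where $c_k$ and $c_k^*$ are distinct complex-conjugate pole pairs with $|c_k| < 1$.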
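The link between a pole and its formant can be made concrete with a short sketch. The frequency follows directly from the pole angle; for the bandwidth, the slides only say it depends on the pole radius, so the formula used here, B ≈ -(Fs/π) ln r0, is a commonly used approximation for poles near the unit circle and is an assumption added for illustration. The function name and the 10 kHz sampling rate are hypothetical.

```python
import cmath
import math


def pole_to_formant(pole, sample_rate_hz):
    """Map a pole z0 = r0 * exp(j * omega0) to (formant frequency, bandwidth) in Hz."""
    r0 = abs(pole)
    omega0 = abs(cmath.phase(pole))  # radians per sample, in [0, pi]
    if not 0.0 < r0 < 1.0:
        raise ValueError("a stable pole must lie strictly inside the unit circle")
    frequency_hz = omega0 * sample_rate_hz / (2.0 * math.pi)
    # Approximate 3-dB bandwidth; reasonable only when r0 is close to 1 (assumption).
    bandwidth_hz = -math.log(r0) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz


SAMPLE_RATE_HZ = 10_000.0  # assumed sampling rate
pole = cmath.rect(0.95, 2.0 * math.pi * 500.0 / SAMPLE_RATE_HZ)
frequency, bandwidth = pole_to_formant(pole, SAMPLE_RATE_HZ)
print(round(frequency, 1), round(bandwidth, 1))
```

Moving the pole closer to the unit circle (larger r0) narrows the bandwidth, giving a sharper resonance peak.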

Spectral Shaping • Formants of the vocal tract are numbered from low to high according to their frequency: $F_1$, $F_2$, etc. • In general, the formant frequencies decrease as the vocal tract length increases: • Male speakers tend to have lower formants than female speakers. • Female speakers tend to have lower formants than children. • Under the assumptions that: • the vocal tract is linear and time-invariant, and • the sound source occurs at the glottis, • the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.
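The inverse relation between vocal tract length and formant frequency noted above can be illustrated with a common textbook idealization that the slides do not state: a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are F_k = (2k - 1)c / (4L). The function name, the 340 m/s speed of sound, and the 14 cm comparison length are assumptions for this sketch.

```python
def uniform_tube_formants(length_m, count=3, speed_of_sound_m_s=340.0):
    """Resonances (Hz) of an idealized uniform tube, closed at one end and open at the other."""
    if length_m <= 0:
        raise ValueError("tube length must be positive")
    # Quarter-wavelength resonances: odd multiples of c / (4 L).
    return [(2 * k - 1) * speed_of_sound_m_s / (4.0 * length_m)
            for k in range(1, count + 1)]


adult_male = uniform_tube_formants(0.17)    # 17 cm tract, as quoted for an adult male
shorter_tract = uniform_tube_formants(0.14)  # hypothetical shorter tract
print([round(f) for f in adult_male])
print([round(f) for f in shorter_tract])
```

Real vocal tracts are far from uniform, so these numbers only indicate the trend: every resonance scales inversely with tube length.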

Example 3.2 • Consider a periodic glottal flow source of the form: $u[n] = g[n] * p[n]$, where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by: $x[n] = h[n] * (g[n] * p[n])$. A window centered at time $\tau$, $w[n, \tau]$, is applied to the vocal tract output to obtain the speech segment: $x[n, \tau] = w[n, \tau]\,\{h[n] * (g[n] * p[n])\}$. Using the Multiplication and Convolution Theorems, the Fourier transform of the speech segment is obtained: $X(\omega, \tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$

Example 3.2 • where $W(\omega, \tau)$ is the Fourier transform of $w[n, \tau]$, and • $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch. • Figure 3.11 (next slide) illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of: • Glottal and • Vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

Example 3.2

Example 3.2 • The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by: • The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing, and by • The manner in which formant tails add. • Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega, \tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. • This is especially the case for high-pitched speech.
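The source-filter chain of Example 3.2, x[n] = h[n] * (g[n] * p[n]), can be sketched numerically. Everything specific below is an assumption for illustration: the raised-cosine pulse for g[n], a vocal tract reduced to a single conjugate pole pair (one formant) placed between the second and third harmonics, a rectangular window spanning exactly eight pitch periods, and the sizes P = 16 and N = 256.

```python
import cmath
import math


def impulse_train(period, num_samples):
    """p[n]: unit samples spaced `period` samples apart."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def convolve(x, h):
    """Linear convolution of x with h, truncated to len(x) samples."""
    return [sum(h[k] * x[n - k] for k in range(min(n, len(h) - 1) + 1))
            for n in range(len(x))]


def resonator(x, r0, omega0):
    """All-pole filter with one conjugate pole pair at r0 * exp(+/- j * omega0)."""
    a1, a2 = 2.0 * r0 * math.cos(omega0), -r0 * r0
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y.append(out)
        y1, y2 = out, y1
    return y


def dft_magnitude(x):
    """Magnitude of the DFT for bins 0 .. len(x)/2 (direct O(N^2) evaluation)."""
    total = len(x)
    return [abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / total) for n in range(total)))
            for k in range(total // 2 + 1)]


P, N = 16, 256
g = [0.5 * (1.0 - math.cos(2.0 * math.pi * n / 8)) for n in range(8)]  # assumed g[n]
u = convolve(impulse_train(P, N), g)           # u[n] = g[n] * p[n]
x = resonator(u, 0.9, 2.5 * math.pi / 8.0)     # formant between harmonics 2 and 3
steady = x[-128:]                              # last 8 periods, start-up transient decayed
mags = dft_magnitude(steady)                   # harmonics fall on bins 8, 16, 24, ...
print([round(mags[8 * k], 1) for k in range(1, 8)])
```

Because the assumed formant lies between two harmonics, no harmonic sits exactly on the envelope peak, which is the sparse-sampling effect described above; raising the pitch (smaller `P`) spreads the harmonics further apart and makes the envelope harder to read.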

Spectral Shaping • The previous example is important because it illustrates the difference between: • A formant (a resonance frequency of the vocal tract), and • A harmonic frequency. • A formant corresponds to a vocal tract pole (resonant frequency). • Harmonics arise from the periodicity of the glottal source (pitch). • In signal processing algorithms that require formants, the sparsity of spectral information can be detrimental to formant estimation. • On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of sound (as in the singing voice).

Example 3.3 • A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency ($F_1$) of the vowel being sung. As shown in the next figure (Figure 3.12), when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments. • To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Nasal Sounds

Spectral Shaping • The nasal and oral components of the vocal tract are coupled by the velum. • When the velum is lowered, opening the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose. • The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds: • Examples: the “n” in “nose” and the “m” in “mouse”.

Spectral Shaping: Nose

Spectral Shaping: Mouse

Spectral Shaping • Because the nasal cavity (unlike the oral tract) is essentially constant, characteristics of nasal sounds may be particularly useful in speaker identification. • The velum can be lowered even when the oral tract is open: • When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). • Two dominant effects characterize nasalization: • Broadening of the formant bandwidths of the oral tract because of loss of energy through the nasal passage, • Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Plosives

Source Generation • The previous section discussed the effect of vocal tract shape on sound production. • Figure 3.10 (b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (plosive): • Build-up of pressure behind the closure, and • Abrupt release of the pressure.

Source Generation: Plosives “Drop”

Fricatives

Source Generation • Another sound source is created when the tongue is very close to the palate (without completely blocking the airflow), generating turbulence and thus a noise source (e.g., fricatives). • As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). • There is no harmonic structure with these types of inputs: the source spectrum is shaped at all frequencies by $|H(\omega)|$. • Note that the noise source was idealized as having a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

Source Generation: Fricatives “NASA”

Source Generation • There is another class of source generated within the vocal tract, although it is less well understood than the noisy and impulsive sources at oral tract constrictions. • This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract. • A vortex can be thought of as a tiny rotational airflow in the oral tract. • There is evidence that sources due to vortices influence the: • temporal, • spectral, and perhaps • perceptual characteristics of speech sounds.

Categorization of Sound By Source • Voiced: speech sounds generated with a periodic glottal source. • Unvoiced: speech sounds not generated with a periodic glottal source. There is a variety of unvoiced sounds: • Fricatives – sounds generated from the friction of the moving air against an oral tract constriction. Example: “thin”. • Plosives – created with an impulsive source within the oral tract. Example: “top”. • Whispers – a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: “he”. • However, these subcategories are not exclusive to unvoiced sounds: the vocal folds can vibrate simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. • Examples (voiced vs. unvoiced): • “zebra” vs. “sheba” – fricatives • “bin” vs. “pin” – plosives

Categorization of Sound By Source

Spectrographic Analysis of Speech

Spectrographic Analysis of Speech • A speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. • Example: the word “to”. • A single Fourier transform of the entire acoustic signal of the word “to” cannot capture this time-varying frequency content. • In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of each piece of the waveform under a sliding window, can capture this temporal variability.
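The sliding-window idea behind the short-time Fourier transform can be sketched as follows. This is a minimal illustration rather than a production implementation: the function name, the Hamming window, the frame length of 32 samples, the hop of 16 samples, and the two-tone test signal (standing in for a sound that changes over time) are all assumptions.

```python
import cmath
import math


def hamming(num_samples):
    """Symmetric Hamming window."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def stft_magnitude(x, frame_length, hop):
    """Magnitude spectra of successive windowed frames (one list of bins per frame)."""
    window = hamming(frame_length)
    frames = []
    for start in range(0, len(x) - frame_length + 1, hop):
        segment = [window[n] * x[start + n] for n in range(frame_length)]
        spectrum = [abs(sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_length)
                            for n in range(frame_length)))
                    for k in range(frame_length // 2 + 1)]
        frames.append(spectrum)
    return frames


FRAME = 32
# Test signal: a tone at bin 4 for the first half, then a tone at bin 12.
signal = ([math.cos(2.0 * math.pi * 4 * n / FRAME) for n in range(128)]
          + [math.cos(2.0 * math.pi * 12 * n / FRAME) for n in range(128)])

frames = stft_magnitude(signal, FRAME, hop=16)
peak_bins = [max(range(len(spectrum)), key=spectrum.__getitem__) for spectrum in frames]
print(peak_bins)
```

A single transform of the whole signal would show both tones at once with no indication of their order; the per-frame peaks show the change from the first tone to the second, which is the information a spectrogram displays.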
