
ASSESS: a descriptive scheme for speech in databases






Presentation Transcript


  1. ASSESS: a descriptive scheme for speech in databases • Roddy Cowie

  2. To refresh people’s memory… • ASSESS embodies an approach to processing the audio element of a database • It is about going beyond the raw audio signal, • providing processing that a lot of people might want • but not everyone can do.

  3. ASSESS covers several levels: • basic transformations of the signal; • key boundaries and the units that go with them; • properties of the units. • The system generates a lot of files, but many of the things you might want are there if you know where to look

  4. The processes ASSESS uses • A reasonable model: • Developed for inconsiderate inputs • Robust • Maximise availability • Systematic rather than selective

  5. ASSESS input characteristics • Input file: • reasonably long (up to 2.5 min) • 20 kHz sampling rate • no header (.raw, not .wav) • Messy, but conversion tools are readily available
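The slide leaves format conversion to the user. One way to strip a WAV header with Python's standard `wave` module is sketched below. It assumes the .raw format ASSESS wants is 16-bit signed mono PCM in the WAV file's own (little-endian) byte order, which the slides do not state, and it refuses to resample rather than guess at a filter. The function name `wav_to_raw` is hypothetical.

```python
import wave

EXPECTED_RATE = 20000             # Hz, from the slide
MAX_FRAMES = 150 * EXPECTED_RATE  # 2.5 min limit, from the slide


def wav_to_raw(wav_path: str, raw_path: str) -> int:
    """Copy the sample data of a WAV file into a headerless .raw file.

    Assumes (hypothetically) that ASSESS wants 16-bit mono PCM; the slides
    only specify 20 kHz and 'no header'. Returns the number of samples written.
    """
    with wave.open(wav_path, "rb") as src:
        if src.getnchannels() != 1 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        if src.getframerate() != EXPECTED_RATE:
            raise ValueError(
                f"expected {EXPECTED_RATE} Hz, got {src.getframerate()} Hz; resample first"
            )
        n = src.getnframes()
        if n > MAX_FRAMES:
            raise ValueError("file longer than 2.5 min")
        frames = src.readframes(n)  # sample bytes only, header excluded
    with open(raw_path, "wb") as dst:
        dst.write(frames)
    return n
```

Resampling to 20 kHz is deliberately left to a dedicated tool, since a careless resampler would introduce aliasing into the 1/3-octave spectrum.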

  6. Using ASSESS • Woefully undramatic • Supply 3 command lines, e.g. for a file called ‘test’ lasting x seconds: • filterbank test.raw test.spc 20000 • howard test.raw test.tx • stage2 test • Wait about x/2 seconds • Admire the outputs
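The three command lines can be wrapped in a small driver. This sketch assumes the `filterbank`, `howard` and `stage2` executables are on the PATH and that the trailing `20000` is the sampling rate; the helper names are hypothetical.

```python
import subprocess
from typing import List


def assess_commands(stem: str, rate: int = 20000) -> List[List[str]]:
    """Build the three ASSESS command lines shown on the slide for one file stem."""
    return [
        ["filterbank", f"{stem}.raw", f"{stem}.spc", str(rate)],
        ["howard", f"{stem}.raw", f"{stem}.tx"],
        ["stage2", stem],
    ]


def expected_runtime_s(duration_s: float) -> float:
    """Rule of thumb from the slide: processing takes about half the file's duration."""
    return duration_s / 2


def run_assess(stem: str) -> None:
    """Run the commands in order, stopping at the first non-zero exit status.

    Assumes the ASSESS executables are installed on PATH.
    """
    for cmd in assess_commands(stem):
        subprocess.run(cmd, check=True)
```

Passing the arguments as lists avoids shell quoting problems if file stems contain spaces.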

  7. Basic transformations and 1st-order output • Intensity • 1/3-octave spectrum • ‘pulses’ corresponding to vocal cord openings (the basis for estimating pitch) • 1st-order output consists of 2 files: • intensity & 1/3-octave spectrum • estimated ‘pulses’ • Everything else ASSESS calculates is derived from those
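For orientation, the 1/3-octave bands that fit below the 10 kHz Nyquist limit of a 20 kHz recording can be laid out as below. This uses the common base-2 convention (centres at 1000 * 2**(n/3) Hz, edges a sixth of an octave either side) and an arbitrary 50 Hz lower limit; which bands ASSESS actually uses is not stated on the slides.

```python
import math
from typing import List, Tuple


def third_octave_bands(fs: float = 20000.0, f_min: float = 50.0) -> List[Tuple[float, float, float]]:
    """Return (lower edge, centre, upper edge) in Hz for base-2 1/3-octave bands.

    Centres are 1000 * 2**(n/3); edges sit a sixth of an octave either side.
    Only bands lying entirely between f_min and the Nyquist frequency are kept.
    """
    nyquist = fs / 2
    bands = []
    n = math.floor(3 * math.log2(f_min / 1000))  # first candidate band index
    while True:
        centre = 1000 * 2 ** (n / 3)
        lo, hi = centre * 2 ** (-1 / 6), centre * 2 ** (1 / 6)
        if hi > nyquist:
            break  # every higher band would also exceed Nyquist
        if lo >= f_min:
            bands.append((lo, centre, hi))
        n += 1
    return bands
```

With the defaults this yields bands centred from 62.5 Hz up to 8 kHz, each band's upper edge coinciding with the next band's lower edge.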

  8. Conditioning 1st-order outputs in ASSESS • Raw intensity • scaled by a parameter derived from a ‘reference’ file • representing normal speaking level under the same recording conditions • Clumsy, but checks show it allows reasonable comparison across files • The same scaling is applied to the spectrum
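A minimal sketch of reference-based scaling, working in dB so that scaling becomes subtraction. The slides say only that the parameter comes from a reference file of normal speaking level; taking the median level of the reference file's non-silent frames, the `floor_db` cut-off, and the function names are all assumptions.

```python
import statistics
from typing import List, Sequence


def reference_level_db(reference_db: Sequence[float], floor_db: float) -> float:
    """Hypothetical scaling parameter: median level (dB) of reference frames above floor_db."""
    voiced = [x for x in reference_db if x > floor_db]
    if not voiced:
        raise ValueError("reference file has no frames above the floor")
    return statistics.median(voiced)


def scale_db(frames_db: Sequence[float], ref_db: float) -> List[float]:
    """Express intensity relative to the reference (subtraction in dB = division in linear units)."""
    return [x - ref_db for x in frames_db]


def scale_spectrum(spectrum_db: Sequence[Sequence[float]], ref_db: float) -> List[List[float]]:
    """Apply the same offset to every 1/3-octave band of every frame."""
    return [[band - ref_db for band in frame] for frame in spectrum_db]
```

Using the median rather than the mean keeps a few very loud or very quiet frames in the reference file from shifting the parameter.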

  9. Conditioning 1st-order outputs in ASSESS • Raw pulse estimates are cleaned • by selecting sequences where successive intervals are very close • Results (in pink) are comparable to standard autocorrelation-based estimates, but easier to clean further • High noise associated with frication is filtered out using the spectrum
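The cleaning rule (keep sequences whose successive intervals are very close) might look like this. The 10% tolerance and the minimum run of three intervals are illustrative guesses, not ASSESS's parameters; each kept run converts to F0 as the reciprocal of the pulse interval.

```python
from typing import List, Sequence, Tuple


def clean_pulses(times: Sequence[float], tol: float = 0.1, min_run: int = 3) -> List[List[float]]:
    """Keep runs of pulses whose successive intervals are close to each other.

    Neighbouring intervals count as 'very close' if they differ by less than
    tol times the earlier interval; a run is kept only if it spans at least
    min_run intervals. Both thresholds are illustrative assumptions.
    """
    runs: List[List[float]] = []
    if len(times) < 2:
        return runs
    intervals = [b - a for a, b in zip(times, times[1:])]
    start = 0  # index of the first interval in the current run
    for i in range(1, len(intervals) + 1):
        close = i < len(intervals) and abs(intervals[i] - intervals[i - 1]) < tol * intervals[i - 1]
        if not close:
            if i - start >= min_run:                   # intervals start..i-1
                runs.append(list(times[start:i + 1]))  # pulses start..i
            start = i
    return runs


def f0_contour(run: Sequence[float]) -> List[Tuple[float, float]]:
    """Convert a run of pulse times (s) to (time, F0 in Hz) pairs, F0 = 1 / interval."""
    return [((a + b) / 2, 1.0 / (b - a)) for a, b in zip(run, run[1:])]
```

For example, pulses at 0, 10, 20, 30 and 40 ms followed by stray pulses at 47 and 100 ms give one kept run of five pulses, i.e. a steady 100 Hz.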

  10. Conditioning 1st-order outputs in ASSESS • Fitting a flexible ‘rope’ filters out extremes and captures the broad shape • (zeroes mark pause boundaries, which are taken into account)
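The slides do not describe how the rope is fitted. As a stand-in that shows only the boundary handling, the sketch below median-smooths the F0 track within each voiced stretch, treating zeros as pause markers that no smoothing window crosses. The median filter is a swapped-in technique, not the rope itself.

```python
import statistics
from typing import List, Sequence


def smooth_by_segment(f0: Sequence[float], width: int = 5) -> List[float]:
    """Median-smooth an F0 track separately within each voiced stretch.

    Zeros are treated as pause markers: they stay zero and no window crosses
    them. A stand-in for illustration, not the ASSESS 'rope' algorithm.
    """
    out = list(f0)
    half = width // 2
    i = 0
    while i < len(f0):
        if f0[i] == 0:
            i += 1
            continue
        j = i
        while j < len(f0) and f0[j] != 0:
            j += 1
        seg = list(f0[i:j])  # one voiced stretch, bounded by zeros or the ends
        for k in range(len(seg)):
            window = seg[max(0, k - half):k + half + 1]
            out[i + k] = statistics.median(window)
        i = j
    return out
```

An isolated extreme such as a single 300 Hz value inside a 100 Hz stretch is removed, while values on the far side of a zero are never mixed into the estimate.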

  11. Conditioning 1st-order outputs in ASSESS • In contrast, standard methods try to correct for octave jumps, • with the kind of result shown in the lower panel
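For contrast, a simple octave-jump heuristic of the general kind the slide refers to: a value roughly double or half the previous one is assumed to be an octave error and is halved or doubled. The tolerance and function name are illustrative, and real pitch trackers differ in detail.

```python
from typing import List, Sequence


def correct_octave_jumps(f0: Sequence[float], tol: float = 0.15) -> List[float]:
    """Naive octave-jump correction.

    A value whose ratio to the previous corrected value is near 2 is halved;
    one whose ratio is near 0.5 is doubled. Zeros (unvoiced or pause) pass
    through unchanged and reset the reference, so no correction crosses them.
    """
    out: List[float] = []
    prev = 0.0
    for x in f0:
        y = x
        if x > 0 and prev > 0:
            ratio = x / prev
            if abs(ratio - 2) < 2 * tol:
                y = x / 2
            elif abs(ratio - 0.5) < 0.5 * tol:
                y = x * 2
        out.append(y)
        prev = y
    return out
```

Because each decision depends on the previous corrected value, one wrong decision propagates along the contour, which is one way such methods can go astray.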

  12. Boundary finding in ASSESS • Silences are found iteratively: • find an intensity level that separates a cluster of low-intensity samples (pauses) from a cluster of high-intensity samples (speech); • fine-tune using the spectrum of the definite pauses. • Again, robust: in a comparison sample • a phonetician identified 503 pauses • ASSESS identified 498 • the difference between the times of corresponding boundaries averaged • 10.4 ms for pause starts • -1.7 ms for pause ends • A similar approach is applied to frication
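The first step, finding a level that separates a low-intensity cluster from a high-intensity one, matches the iterative intermeans (Ridler-Calvard, or isodata) threshold, used here as a stand-in. The sketch covers only that step and the resulting pause list; the spectral fine-tuning is not shown, and the frame-based intensity input and function names are assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def two_cluster_threshold(levels: Sequence[float], max_iter: int = 100) -> float:
    """Iterative intermeans threshold: midpoint between the low and high cluster means."""
    t = (min(levels) + max(levels)) / 2
    for _ in range(max_iter):
        low = [x for x in levels if x <= t]
        high = [x for x in levels if x > t]
        if not low or not high:
            break
        new_t = (sum(low) / len(low) + sum(high) / len(high)) / 2
        converged = abs(new_t - t) < 1e-6
        t = new_t
        if converged:
            break
    return t


def find_pauses(levels: Sequence[float], frame_s: float, threshold: float) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of runs of frames at or below threshold."""
    pauses = []
    start: Optional[int] = None
    for i, x in enumerate(levels):
        if x <= threshold and start is None:
            start = i
        elif x > threshold and start is not None:
            pauses.append((start * frame_s, i * frame_s))
            start = None
    if start is not None:  # file ends in a pause
        pauses.append((start * frame_s, len(levels) * frame_s))
    return pauses
```

The threshold is re-estimated from the two cluster means until it stops moving, so it adapts to each recording's noise floor without a fixed dB setting.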

  13. 2nd-order output of ASSESS • .exm files specify • pitch and intensity contours, in terms of local maxima and minima • speech/silence boundaries • episodes with frication (boundaries & average spectra)
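Describing a contour by its local maxima and minima can be sketched as follows; flat stretches are collapsed so that a plateau yields one point at its first sample. The function name and tuple output are hypothetical, not the .exm layout.

```python
from typing import List, Sequence, Tuple


def turning_points(times: Sequence[float], values: Sequence[float]) -> List[Tuple[float, float, str]]:
    """Return (time, value, 'max'|'min') for each local extremum of a contour."""
    points: List[Tuple[float, float, str]] = []
    peak = 0      # index where the current (possibly flat) value began
    last_dir = 0  # +1 rising, -1 falling, 0 not yet known
    for i in range(1, len(values)):
        d = values[i] - values[i - 1]
        if d == 0:
            continue  # plateau: keep the index where this value began
        direction = 1 if d > 0 else -1
        if last_dir and direction != last_dir:
            kind = "max" if last_dir > 0 else "min"
            points.append((times[peak], values[peak], kind))
        last_dir = direction
        peak = i
    return points
```

For the contour 1, 3, 3, 2, 4 sampled at t = 0..4, this reports a maximum of 3 at t = 1 and a minimum of 2 at t = 3.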

  14. Describing units – 3rd-order outputs of ASSESS • Basic units: • pauses • tunes (structures between pauses lasting over 150 ms) • Pauses have only duration • Tunes have multiple attributes, and ASSESS covers them systematically
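Given a pause list, tunes can be cut out as below. The slide's wording could mean either that the pauses or that the tunes last over 150 ms; this sketch assumes only pauses longer than 150 ms separate tunes, and absorbs shorter ones into the surrounding tune. The function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def tunes_from_pauses(pauses: List[Interval], total_s: float, min_pause_s: float = 0.150) -> List[Interval]:
    """Split a file of length total_s into tunes at pauses longer than min_pause_s.

    Assumes the 150 ms figure is a minimum pause length (one reading of the slide).
    """
    tunes: List[Interval] = []
    start = 0.0
    for p_start, p_end in sorted(pauses):
        if p_end - p_start <= min_pause_s:
            continue  # too short to separate tunes; stays inside the current tune
        if p_start > start:
            tunes.append((start, p_start))
        start = p_end
    if total_s > start:
        tunes.append((start, total_s))
    return tunes
```

Pauses then carry only their duration, while each returned tune interval is the unit whose attributes are described next.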

  15. Describing units – 3rd-order outputs of ASSESS • Basic module of description (in the .psg file) • The pattern is repeated for pitch, & for each tune

  16. Describing units – structural properties • Tune properties include • global slope & curvature of pitch contour, • movement at start and end, • measures of spectral balance & change • Relations between tunes include • abruptness of change from last tune • ‘crescendo’ … • etc.
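One plausible reading of "global slope & curvature" is the linear and quadratic terms of a least-squares parabola fitted to a tune's pitch contour. The sketch assumes that reading; it centres time on the tune's midpoint so the slope is taken there, and returns the second derivative as the curvature. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def slope_and_curvature(t: Sequence[float], f0: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit f0 = a + b*u + c*u**2 with u = t - mean(t).

    Returns (b, 2*c): the slope at the tune's temporal midpoint and the
    second derivative of the fitted parabola. Needs at least 3 distinct times.
    """
    n = len(t)
    if n < 3:
        raise ValueError("need at least 3 points")
    mid = sum(t) / n
    u = [x - mid for x in t]
    s = [sum(x ** k for x in u) for k in range(5)]                  # sums of u**k
    r = [sum(y * x ** k for x, y in zip(u, f0)) for k in range(3)]  # sums of f0 * u**k
    # Augmented normal-equation matrix [A | r]
    m = [[s[0], s[1], s[2], r[0]],
         [s[1], s[2], s[3], r[1]],
         [s[2], s[3], s[4], r[2]]]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        piv = max(range(col, 3), key=lambda i: abs(m[i][col]))
        m[col], m[piv] = m[piv], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]
    coef = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):  # back substitution
        acc = sum(m[row][k] * coef[k] for k in range(row + 1, 3))
        coef[row] = (m[row][3] - acc) / m[row][row]
    _, b, c = coef
    return b, 2 * c
```

For pitch values 100, 107, 118, 133 and 152 Hz at t = 0..4 s (an exact parabola), the slope at the midpoint is 13 Hz/s and the curvature 4 Hz/s².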

  17. Summary • ASSESS is part system, part philosophy • The system delivers robust estimates of spectrum, F0 and intensity contours, key boundaries, and properties of the units they define • The philosophy is using signal processing expertise to make multiple alternatives at multiple levels available to others.
