ASR Front End Processing

Presentation Transcript


  1. ASR Front End Processing Implemented on Texas Instruments OMAP-L137 Jacob Zurasky – 12/12/11

  2. Project Goals • Create a front-end for embedded ASR • Extract feature vectors from speech data • Allow for many different specifications • Extract features in real time, while leaving enough CPU time for analysis

  3. Hardware Platform • Texas Instruments OMAP-L137, dual core • TMS320C6747 DSP • ARM9 • AIC3106 Audio Codec • 64 MB SDRAM

  4. Signal Flow Block Diagram • Audio → Framing → Pre-Emphasis → Window → FFT → Mel Filter → Log → DCT → Deltas • Output per frame: 13 MFCCs, 13 Deltas, 13 Delta-Deltas

  5. Data Streams • Streams are a way to transfer blocks of data efficiently • Uses enhanced direct memory access (EDMA) • Block of data can be accessed by SIO_reclaim(…) • Block of data can be sent by SIO_issue(…) • Diagram: Audio Codec → Input Stream → DSP → Output Stream → Audio Codec

  6. Stream Example • After SIO_reclaim, pIn points to input data and pOut points to output data • After SIO_issue, those buffers are reused by the audio codec
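
In C, the issue/reclaim pattern described above looks roughly like the sketch below. It assumes DSP/BIOS stream handles (inStream, outStream) that were created and primed with buffers elsewhere; names and error handling are illustrative, so exact SIO signatures should be checked against the DSP/BIOS API reference.

```c
#include <std.h>   /* DSP/BIOS base types (Int, Ptr, ...) */
#include <sio.h>   /* DSP/BIOS stream I/O (SIO) module    */

/* Assumed to be created (SIO_create) and primed with buffers at init,
 * bound to the AIC3106 codec driver. */
extern SIO_Handle inStream;
extern SIO_Handle outStream;

Void audioTask(Void)
{
    Int16 *pIn, *pOut;
    Int    nbytes;

    for (;;) {
        /* Reclaim a filled input buffer and a free output buffer. */
        nbytes = SIO_reclaim(inStream,  (Ptr *)&pIn,  NULL);
        SIO_reclaim(outStream, (Ptr *)&pOut, NULL);

        /* ... feature extraction on pIn; optionally copy audio to pOut ... */

        /* Hand the buffers back so the codec/EDMA can reuse them. */
        SIO_issue(inStream,  pIn,  nbytes, 0);
        SIO_issue(outStream, pOut, nbytes, 0);
    }
}
```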

  7. Pre-Emphasis • y[n] = x[n] – ax[n-1] • First order high-pass filter • Used to compensate for the higher frequency roll-off in human speech production
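
A minimal C sketch of this filter applied in place over one frame; the coefficient value below is a commonly used choice, not one stated on the slide, and the carry-over of the previous frame's last sample is an implementation detail added here.

```c
#define PRE_EMPH_A 0.97f   /* typical value; the slide leaves "a" unspecified */

/* y[n] = x[n] - a*x[n-1], computed backwards so the frame can be
 * overwritten in place. "prev" carries x[-1] in from the previous frame. */
void pre_emphasis(float *frame, int len, float *prev)
{
    int   n;
    float last = frame[len - 1];   /* save for the next frame's x[-1] */

    for (n = len - 1; n > 0; n--)
        frame[n] -= PRE_EMPH_A * frame[n - 1];
    frame[0] -= PRE_EMPH_A * (*prev);

    *prev = last;
}
```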

  8. Windowing Function • Rectangular, Hann, Hamming, Cosine, Gaussian… • Plot: Hamming window
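
For example, a Hamming window can be precomputed once and applied per frame, as in this sketch (frame size taken from slide 16; the exact window used in the project is not shown).

```c
#include <math.h>

#define FRAME_SIZE 256
#define PI_F       3.14159265f

static float hamming[FRAME_SIZE];

/* Precompute w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)) at startup. */
void init_hamming(void)
{
    int n;
    for (n = 0; n < FRAME_SIZE; n++)
        hamming[n] = 0.54f - 0.46f * cosf(2.0f * PI_F * n / (FRAME_SIZE - 1));
}

/* Multiply the pre-emphasized frame by the window, in place. */
void apply_window(float *frame)
{
    int n;
    for (n = 0; n < FRAME_SIZE; n++)
        frame[n] *= hamming[n];
}
```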

  9. FFT • Magnitude of Frequency Spectrum • Texas Instruments' DSPLIB for C67x
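
The specific DSPLIB routine and its calling convention aren't named on the slide, so as a platform-neutral illustration, here is how the magnitude spectrum can be taken from an interleaved complex FFT output.

```c
#include <math.h>

#define FRAME_SIZE 256
#define NUM_BINS   (FRAME_SIZE / 2 + 1)

/* "X" is the FFT output as interleaved complex values
 * {Re[0], Im[0], Re[1], Im[1], ...}; the FFT itself would be computed
 * beforehand with a DSPLIB routine (or any radix-2 FFT). */
void magnitude_spectrum(const float *X, float *mag)
{
    int k;
    for (k = 0; k < NUM_BINS; k++) {
        float re = X[2 * k];
        float im = X[2 * k + 1];
        mag[k] = sqrtf(re * re + im * im);
    }
}
```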

  10. Mel Filter • Triangular Bandpass Filters along Mel Frequency Scale • Mimics the logarithmic nature of human hearing
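
As a sketch of this stage, the triangular filters can be stored as precomputed per-bin weights and applied as weighted sums, followed by the log from the block diagram. The filter count here is an assumed value, not one given in the slides; the Mel scale itself is commonly defined as mel(f) = 2595·log10(1 + f/700).

```c
#include <math.h>

#define NUM_BINS    129   /* FRAME_SIZE/2 + 1 spectrum bins */
#define NUM_FILTERS 26    /* assumed filterbank size */

/* mel_weights[m][k]: weight of FFT bin k in triangular filter m,
 * built once at init from Mel-spaced center frequencies. */
static float mel_weights[NUM_FILTERS][NUM_BINS];

/* Hz -> Mel, used when placing the triangle centers. */
float hz_to_mel(float hz)
{
    return 2595.0f * log10f(1.0f + hz / 700.0f);
}

/* Weighted sum of the magnitude spectrum under each triangle, then log. */
void mel_filter_log(const float *mag, float *mel_log)
{
    int m, k;
    for (m = 0; m < NUM_FILTERS; m++) {
        float sum = 1e-10f;   /* small floor to avoid log(0) */
        for (k = 0; k < NUM_BINS; k++)
            sum += mel_weights[m][k] * mag[k];
        mel_log[m] = logf(sum);
    }
}
```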

  11. Discrete Cosine Transform (DCT) • Transforms back from the frequency domain • Typically the first 12 values are used as the Mel Frequency Cepstral Coefficients • Look-up table for efficiency
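
A sketch of this step using the table-based DCT the slide mentions: the DCT-II basis is precomputed once and each coefficient is then a dot product. The coefficient count follows slide 12 (13 per frame), while the filterbank size is the same assumption as above.

```c
#include <math.h>

#define NUM_FILTERS 26   /* log Mel energies per frame (assumed size) */
#define NUM_MFCC    13   /* cepstral coefficients kept per frame */

static float dct_table[NUM_MFCC][NUM_FILTERS];

/* Precompute the DCT-II basis: cos(pi * i * (m + 0.5) / NUM_FILTERS). */
void init_dct_table(void)
{
    int i, m;
    for (i = 0; i < NUM_MFCC; i++)
        for (m = 0; m < NUM_FILTERS; m++)
            dct_table[i][m] = cosf(3.14159265f * i * (m + 0.5f) / NUM_FILTERS);
}

/* mfcc[i] = sum over filters of the log Mel energy times the table entry. */
void dct_mfcc(const float *mel_log, float *mfcc)
{
    int i, m;
    for (i = 0; i < NUM_MFCC; i++) {
        float sum = 0.0f;
        for (m = 0; m < NUM_FILTERS; m++)
            sum += mel_log[m] * dct_table[i][m];
        mfcc[i] = sum;
    }
}
```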

  12. Deltas • Produce 13 MFCCs per frame • 13 more from the first derivative • 13 additional from the second derivative • 39-dimensional vector to represent the current frame
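
One common way to compute the deltas is a regression over neighbouring frames; the ±2-frame window in this sketch is an assumption, since the slide does not say how the derivatives are estimated. Delta-deltas come from running the same routine over the deltas.

```c
#define NUM_MFCC 13
#define DELTA_N  2   /* regression half-window (assumed) */

/* delta[i] = sum_{n=1..N} n * (c[t+n][i] - c[t-n][i]) / (2 * sum_{n=1..N} n^2)
 * "mfcc" is a history of static coefficients; frames t-N .. t+N must exist. */
void compute_deltas(float mfcc[][NUM_MFCC], int t, float *delta)
{
    int   i, n;
    float norm = 0.0f;

    for (n = 1; n <= DELTA_N; n++)
        norm += 2.0f * n * n;

    for (i = 0; i < NUM_MFCC; i++) {
        float sum = 0.0f;
        for (n = 1; n <= DELTA_N; n++)
            sum += n * (mfcc[t + n][i] - mfcc[t - n][i]);
        delta[i] = sum / norm;
    }
}
```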

  13. Observations • Pre-emphasis and windowing of an input frame • Plots: input frame; pre-emphasized and windowed frame

  14. Observations • FFT, Mel filtering, and log • Plots: magnitude of frequency spectrum; log Mel-filtered spectrum

  15. Observations • Discrete cosine transform to produce MFCCs • Plots: Mel frequency cepstral coefficients; full feature vector for one frame

  16. Observations • Frame Size = 256 samples @ 16 kHz Fs • 1 Frame = 16 ms • Feature Extraction Time • Debug – 1.55 ms • Release – 0.25 ms • Real-Time Feature Extraction • 0.25 ms / 16 ms = 1.56% usage

  17. Future Goals • Complete training code for DSP • Load training data to SDRAM • DSP calculates all feature vectors associated with a given phone • Calculate a Gaussian mixture model • Save acoustic model off-chip • Evaluate the acoustic model (digit recognition) • Complete embedded ASR on a limited vocabulary
