Sphinx on Handhelds
As handheld and embedded devices become increasingly capable, running Sphinx speech recognition on them becomes practical. However, limited CPU speed, the lack of hardware floating point, scarce memory, slow storage, and clumsy operating systems pose significant hurdles. This document outlines a plan for running Sphinx on devices like the Sharp Zaurus: integer mathematics in place of floating point, precomputed model files with memory-mapped I/O, and an optimized front end. Future work involves rationalizing the file formats and investigating the Sphinx 3.x architecture for embedded use.
Presentation Transcript
Sphinx on Handhelds
David Huggins-Daines
dhuggins@cs.cmu.edu
Sphinx on Handhelds?
• Handheld/embedded devices are pretty speedy these days
• LVCSR on them is not unreasonable
• An open-source one does not exist yet
• CALO’s new focus on mobility
• S2S translation projects could use it
• Sublime, smartphone applications, etc.
• ISL has it, so should we!
Handheld challenges
• CPU speed
  • Typically 200-400MHz ARM/XScale
  • Faster than the workstations Sphinx started out on
  • No hardware floating-point instructions
  • ARM has a very fast and sophisticated integer ISA
• Memory and storage capacity/speed
  • DRAM is very limited (32 or 64MB)
  • Storage is very slow (typically CF cards)
• Inefficient and clumsy operating systems
  • WinCE has no stdio, broken malloc, 32MB limit
  • PalmOS is much, much worse!
Plan for Sphinx on Handhelds
• Start out with Sphinx2
  • It’s fast
  • People use it already
• Convert “hot spots” to integer math
• Precompute model files
  • Avoid parsing (no stdio, remember)
  • Allow memory-mapped I/O (subvert the 32MB limit on WinCE)
• Disable non-useful features in libraries
  • e.g. flat lexicon search, CDHMM
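The "precompute model files" and "memory-mapped I/O" items can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Sphinx2 file format: the functions write_model() and map_model() are hypothetical, the file is assumed to be a bare array of native-endian 32-bit integers with no header, and the mapping uses POSIX mmap() as available on a Linux device such as the Zaurus. A WinCE port would need that platform's own file-mapping calls.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Offline step (hypothetical): dump an already-converted integer table
 * as raw native-endian int32 values, so the device never has to parse. */
static int write_model(const char *path, const int32_t *table, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(table, sizeof(int32_t), n, fp);
    if (fclose(fp) != 0 || written != n)
        return -1;
    return 0;
}

/* Runtime step (hypothetical): map the file read-only instead of
 * reading it into heap memory. Returns NULL on failure; on success
 * stores the number of int32 entries in *n_out. */
static const int32_t *map_model(const char *path, size_t *n_out)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return NULL;
    *n_out = (size_t)st.st_size / sizeof(int32_t);
    return (const int32_t *)p;
}
```

With this layout the table is usable directly through the returned pointer, and pages are brought in from storage only when they are touched. A real format would also need a header and a defined byte order.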
Current Status
• Sphinx2 on Sharp Zaurus
  • Linux, 40MB system RAM, 206MHz ARM
  • Performance on RM1: 1.7x realtime
  • No degradation in accuracy
• Integer front-end and GMM code complete
• Front end also has a “faster” mode
  • 10% faster, 10% degradation in accuracy
• Memory consumption is too high
  • WSJ5k can just barely run
  • Sphinx2 consumes about 16MB of heap space
  • Requires quantized mixture weights (-8bsen)
  • Sphinx3.x is much smaller … and slower
Implementation details
• FFT is done with 16:16 fixed point
  • Bits 31:16 are whole part and sign
  • Bits 15:0 are fractional part
  • I.e. all numbers scaled by 65536
  • Lossless multiplication done using 4 integer shift-multiply-accumulates (ARM is really good at this)
• Mel-spectrum calculated in log scale
  • Using base 1.0001 in order to exploit existing add-table implementation
• “Faster” mode uses 28:4 fixed point instead
  • Overflows saturated to INT_MAX
  • Zeroes floored to log(2^-4) - very important!
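The 16:16 multiplication can be illustrated with a short sketch. It is not taken from the Sphinx source: the names fixmul_split(), fixmul_ref(), FLOAT2FIX and FIX2FLOAT are hypothetical, and the code assumes that >> on a negative integer is an arithmetic shift and that converting an unsigned value to int32_t wraps in two's complement, as GCC does. Each operand is split into a signed high half and an unsigned low half, and the four partial products are accumulated with shifts, so the full 64-bit product is never needed.

```c
#include <stdint.h>

typedef int32_t fixed32; /* 16:16: bits 31:16 whole part and sign, 15:0 fraction */

/* Hypothetical conversion helpers: scale by 65536, rounding to nearest. */
#define FLOAT2FIX(x) ((fixed32)((x) * 65536.0 + ((x) < 0 ? -0.5 : 0.5)))
#define FIX2FLOAT(x) ((double)(x) / 65536.0)

/* Reference version using a 64-bit intermediate product. */
static fixed32 fixmul_ref(fixed32 a, fixed32 b)
{
    return (fixed32)(((int64_t)a * (int64_t)b) >> 16);
}

/* Split version using only 32-bit multiplies.
 * With a = ah*2^16 + al and b = bh*2^16 + bl:
 *   (a*b) >> 16 = (ah*bh << 16) + ah*bl + al*bh + ((al*bl) >> 16)
 * The sum is accumulated modulo 2^32, which gives the right answer
 * whenever the true result fits in 32 bits. */
static fixed32 fixmul_split(fixed32 a, fixed32 b)
{
    int32_t ah = a >> 16;                 /* signed high half */
    int32_t bh = b >> 16;
    uint32_t al = (uint32_t)a & 0xffffu;  /* unsigned low half */
    uint32_t bl = (uint32_t)b & 0xffffu;
    uint32_t acc;

    acc  = (uint32_t)(ah * bh) << 16;
    acc += (uint32_t)(ah * (int32_t)bl);
    acc += (uint32_t)((int32_t)al * bh);
    acc += (al * bl) >> 16;
    return (fixed32)acc;
}
```

For example, fixmul_split(FLOAT2FIX(1.5), FLOAT2FIX(2.0)) yields the 16:16 representation of 3.0, and the split version matches fixmul_ref() for any operands whose product fits in 16:16.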
Implementation details
• Abstract types for intermediate values
  • mfcc_t, powspec_t, mean_t, var_t
  • #define FIXED_POINT to make them ints
• Arithmetic macros (fixpoint.h)
  • fixed32 type analogous to float32
  • addition and subtraction work as expected
  • MFCCMUL(), MFCC2FLOAT(), FLOAT2MFCC() macros become no-ops in floating-point build
  • GMMADD(), GMMSUB() do saturating addition and subtraction
  • ARM has special instructions for this too! Wow!
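A header in the spirit of the one described above might look like the following sketch. The type and macro names come from the slide, but every definition here is an assumption made for illustration; the actual contents of fixpoint.h may differ. The helpers sat_add32() and sat_sub32() are hypothetical, and the 16:16 scaling for mfcc_t is assumed rather than confirmed.

```c
#include <stdint.h>

#define FIXED_POINT 1 /* remove this line for the floating-point build */

typedef float   float32;
typedef int32_t fixed32;

/* Hypothetical helpers: clamp to the int32 range instead of wrapping. */
static int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

static int32_t sat_sub32(int32_t a, int32_t b)
{
    int64_t d = (int64_t)a - (int64_t)b;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}

#ifdef FIXED_POINT
typedef fixed32 mfcc_t; /* assumed 16:16 scaling */
#define FLOAT2MFCC(x) ((mfcc_t)((x) * 65536.0 + ((x) < 0 ? -0.5 : 0.5)))
#define MFCC2FLOAT(x) ((double)(x) / 65536.0)
#define MFCCMUL(a, b) ((mfcc_t)(((int64_t)(a) * (int64_t)(b)) >> 16))
#define GMMADD(a, b)  sat_add32((a), (b))
#define GMMSUB(a, b)  sat_sub32((a), (b))
#else
typedef float32 mfcc_t;
#define FLOAT2MFCC(x) (x)          /* conversions do nothing */
#define MFCC2FLOAT(x) (x)
#define MFCCMUL(a, b) ((a) * (b))  /* plain multiplication */
#define GMMADD(a, b)  ((a) + (b))
#define GMMSUB(a, b)  ((a) - (b))
#endif
```

Code written against mfcc_t and these macros compiles unchanged in either build. Plain + and - need no macro because fixed-point values with the same scaling add and subtract like ordinary integers. The saturating helpers are written portably here; on ARM they could map to the saturating-arithmetic instructions the slide mentions.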
Future Work
• Rationalize the file formats
• General WinCE porting (Mohit)
• Front-end optimization
  • Implement fixed-point FHT
• Investigate Sphinx 3.x for embedded
  • SubVQ and GS can make it fast and cut memory consumption even more
  • Much nicer architecture
  • But not widely used, API not as stable