Brief Overview of Different Versions of Sphinx Arthur Chan
Introduction • The software aspect of the recognizer is very important • Research always requires correct use of the software • Sphinx II + III + IV + SphinxTrain • ~= 100k lines of code • Each of them is fairly complex
This presentation (30 pages) • Introduction (3 pages) • History of Sphinx (13 pages) • Sphinx I (2 pages) • Sphinx II (2 pages) • Sphinx III (3 pages) • SphinxTrain (3 pages) • Sphinx IV (3 pages) • How do I get the source code? (4 pages) • Versioning • Three rules for not getting lost among the different recognizers • Where can I get “official” information? (2 pages) • Outlook for each recognizer (3 pages) • Conclusion
Brief history of Sphinx • Largely adapted from • Rita’s “The Sphinx Speech Recognition Systems” • www.cs.cmu.edu/~rsingh/ • Kevin et al’s “Speech Recognition: Past, Present and Future” • www.cs.cmu.edu/~msiegler/ASR/futureofcmu-final.html
Before Sphinx • Dragon • One of the first uses of HMMs in speech recognition • One of the first uses of a “purely statistical model” in speech • Expresses the knowledge using an HMM network • Harpy • One of the first uses of beam search • Uses phonemes to represent words
Sphinx I • Before Sphinx … • From AT&T’s literature, the concept of speaker independence was proposed in 1979 • In 1979-1987, most systems were either • Speaker dependent • Speaker independent but in a very small domain (<100 words) • Sphinx I was therefore outstanding • Accuracy was 90% on Resource Management
Sphinx I (1987) • By Kai-Fu Lee and Roberto Bisiani • Key developers included Hsiao-wuen Hon and Fil Alleva • Written in C • Continuous speech recognizer using discrete HMMs with 3 codebooks of size 256 • Uses a simple word-pair grammar • Generalized triphones • Real-time on Sun3 or DEC 3000 • Where is the source code? Good antique!
Sphinx II (1992) • By Xuedong Huang • Hardwired to a 5-state Bakis topology • 3-gram language models • Decision-tree tying of HMMs (by Mei-Yuh Hwang) • 90% on the WSJ task (0 or 1?)
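As a rough illustration of what a "Bakis" (left-to-right) topology means, the sketch below lists the arcs such a model allows: self-loops, moves to the next state, and short forward skips. The function name `bakis_transitions` and the `max_skip` parameter are assumptions made for this example; the exact arc set hardwired into Sphinx II is not given in these slides and may differ.

```python
def bakis_transitions(num_states=5, max_skip=1):
    """Return the allowed (from_state, to_state) arcs of a generic
    left-to-right topology. Illustrative only, not the Sphinx II source."""
    arcs = []
    last = num_states - 1
    for i in range(num_states):
        # From state i we may stay (j == i), advance (j == i + 1),
        # or skip ahead by up to max_skip extra states, never backwards.
        for j in range(i, min(i + max_skip + 1, last) + 1):
            arcs.append((i, j))
    return arcs


if __name__ == "__main__":
    for src, dst in bakis_transitions():
        print(f"{src} -> {dst}")
```

Because no arc ever points backwards, the transition matrix is upper triangular, which is what lets such a model represent a sound unfolding in time.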
Fast Beam Search v. X • FBS-6: flat lexicon decoder • FBS-7: lexicon tree-based • FBS-8 decoder (written by Ravi Mosur, see his thesis from 96) • Supports multiple types of beam pruning • Lexical tree • Tricks in GMM computation • Machine optimization: loop unrolling • Predictive codebook computation • Phoneme lookahead • Best path search
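The idea behind beam pruning can be sketched in a few lines: at each frame, keep only the hypotheses whose log score is within a fixed width of the best one. This is a generic illustration, not code from the FBS decoders; `beam_prune`, the state names, and the widths are all hypothetical.

```python
def beam_prune(hyps, beam_width):
    """Keep hypotheses whose log score lies within beam_width of the best.

    hyps maps a hypothesis id (e.g. an HMM state) to its log score.
    Generic sketch of beam pruning, not the Sphinx implementation.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    threshold = best - beam_width
    return {h: score for h, score in hyps.items() if score >= threshold}


if __name__ == "__main__":
    scores = {"state_a": -10.0, "state_b": -12.5, "state_c": -40.0}
    # A wide beam for ordinary states, a narrower one for, say, word exits.
    print(beam_prune(scores, 20.0))
    print(beam_prune(scores, 2.0))
```

"Multiple types of beam pruning" could then amount to applying the same test with different widths at different points of the search (for example, a tighter beam at word exits), though the slides do not spell out which beams FBS-8 uses.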
Other facts about Sphinx II • It was licensed at the beginning (this seems to date back to around 95) • In 2000, it was open-sourced on SourceForge under a Berkeley-style license • You can incorporate Sphinx’s source code • You don’t need to open your own source code (no recursive legal binding) • Similar to LGPL • In 2001, a major alpha release by Kevin ensured portability across several platforms
Sphinx III flat lexicon decoder (“s3”, “s3flat”, “s3slow”) • Sphinx III (by Ravi Mosur) • Flat lexicon • Supports both CHMM and SCHMM • “Poor-man’s” trigram • Uses only the most likely first word; this avoids the D^2 expansion of the word lattice • Arbitrary topology • Very accurate, used in the evaluation of BN and others • Derivatives of the search include • N-best generator • Aligner • Phone recognizer
Sphinx III tree lexicon decoder (“s3.x”, “s3fast”, “s3inaccurate”) • What is s3.x actually? • A “spin-off” of the Sphinx III flat lexicon decoder’s source code • First used in the BN 10x RT evaluation in 1999 • From s3.0 -> s3.2 • Uses a tree lexicon with unigram lookahead • Lexical tree with approximation to avoid memory problems • One of the first in the world to use sub-vector quantization to speed up GMM computation
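To see why a tree lexicon is cheaper to search than a flat one, the sketch below builds a phone-prefix tree from a toy pronunciation dictionary. Words that share their first phones share tree nodes. The dictionary entries and the names `build_lexical_tree` and `count_nodes` are invented for illustration and are not taken from the s3.x source.

```python
def build_lexical_tree(lexicon):
    """Build a prefix tree over phone sequences.

    lexicon maps each word to its list of phones. Every node holds its
    children (keyed by phone) and the words that end at that node.
    """
    root = {"children": {}, "words": []}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node["children"].setdefault(
                phone, {"children": {}, "words": []}
            )
        node["words"].append(word)
    return root


def count_nodes(node):
    """Count all nodes in the tree, including the root."""
    return 1 + sum(count_nodes(child) for child in node["children"].values())


if __name__ == "__main__":
    toy_lexicon = {
        "start": ["S", "T", "AA", "R", "T"],
        "star": ["S", "T", "AA", "R"],
        "stop": ["S", "T", "AA", "P"],
    }
    tree = build_lexical_tree(toy_lexicon)
    flat_phone_count = sum(len(p) for p in toy_lexicon.values())
    print("flat lexicon phone nodes:", flat_phone_count)
    print("tree nodes (excluding root):", count_nodes(tree) - 1)
```

The price of the sharing is that the word identity is only known once a word-end node is reached, which is one reason a tree decoder falls back on approximations such as unigram lookahead for the language model score.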
(cont.) • From s3.2 -> s3.3 (Rita, Ricky) • Live-mode recognizer (livedecode) and simulator (livepretend) • From s3.3 -> s3.4 (Evandro, Arthur C, Jahanzeb) • 4 levels of speed-up of GMM computation, phoneme lookahead • Bug fixes in live mode • From s3.4 -> s3.5 (Evandro, Arthur C, Yitao) • (Tentative) Speaker adaptation + documentation
Facts about S3 • A Java version exists -> sphinx3j • Open-sourced around 2002 • Maintained by Evandro from 2001 to now • s3.5 is the current active branch in S3 development
SphinxTrain • Equally important and very complex • But not well understood • What is SphinxTrain? • A collection of ~40 tools for Sphinx 2, 3 and 4 acoustic model training • A set of Perl scripts to do training • Sphinx 2 and 3 have slightly different model formats
Mini-history • The Baum-Welch trainer and Viterbi trainer have existed for a very long time • The training tools in general were neither systematic nor structured • From the chaos, Eric Thayer first pulled everything together to create the package SphinxTrain • Rita did numerous bug fixes and modifications of the current trainer • Introduced the use of automatic question generation (make_quest) • Built a set of training scripts for RM (the 0*/ scripts) • Wrote the first systematic tutorial on training • Ricky refined the code and wrote the first set of Perl scripts for training • He made a PHD out of it too (PHD = Push Here Dummy!) • Alan and Kevin • Put the code on SourceForge • Alan built a set of training scripts that can “run through”
Sphinx IV • Why Sphinx IV? • Too many limitations in SphinxTrain and Sphinx III • Only N-gram • Approximation of triphones • Fast GMM computation can be very troublesome to understand • bw doesn’t skip silence; we rely heavily on forced alignment in training
Sphinx IV (cont.) • (By no means complete…) • Lead designer: Bhiksha (MERL) • Lead team developer: Willie Walker (Sun) • Key developers: Evandro, Rita, Phillip Kwok and Paul Lamere • Many heavyweight speech advisors: Evandro, Rita, Ravi, Bhiksha, Pedro Moreno …
Is Sphinx IV good? • A very accurate, very fast, very versatile and very nicely packaged Java-based speech recognizer • Internal benchmarks on RM and WSJ 5k show it to be faster and more accurate than s3.3 (under 1xRT and 10% better) • Supports N-gram, FSM and FSG • Will provide facilities like confidence scoring • Still under development (the first alpha release has just come out) • Trainer is not stable
Summary of the recognizers and trainers • Sphinx I -> obsolete • Sphinx II -> we are using the fast recognizer now • Sphinx III: the following coexist • S3 flat • S3 fast (s3.4 stable, s3.5 devel) • SphinxTrain (0.92 in CVS) • Sphinx IV • Recognizer is in alpha release • Trainer not yet stable
How can I get version X of Sphinx? • Official web page of Sphinx • http://cmusphinx.sourceforge.net • Gives announcements and news of development • Some documentation is there • For the tarballs • http://sourceforge.net/projects/cmusphinx • Releases: • sphinx2-0.4.tgz (s2) • sphinx3-0.1.tgz (s3.3) • sphinx3-0.4-rc2.tgz (s3.4 release candidate II) • sphinx4-0.1alpha-src.zip (s4)
Rule 2: If it doesn’t exist in CVS, officially it doesn’t exist • Simply speaking, no one actually supports and maintains it. Software that falls into this category: • CMU LM Toolkit (we haven’t touched it for a while) • We may do it in the future • Phoenix (distributed somewhere else) • Training scripts in csh • Rita still actively supports them
Rule 1: If there are no tarballs, they are in CVS • ANYONE can get the following modules through CVS by using the following command: • cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/cmusphinx co modulename • modulename = • SphinxTrain -> SphinxTrain • archive_s3 -> s3 + s3.0 + s3.2 + s3.3 • sphinx2 -> devel ver. of sphinx2 • sphinx3 =~ s3.4 -> we will base the development of s3.5 on this • share =~ cepview + lm3g2dmp • sphinx3j = the Java version of sphinx3 • Sphinx4 = development version of sphinx4
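For anyone checking out several modules at once, the anonymous checkout command from this slide can be assembled programmatically. This is only a convenience sketch: the helper name `cvs_checkout_command` is made up, it assumes the separator between the host and the repository root is a colon (the usual pserver syntax), and it only builds the argument list without contacting any server.

```python
def cvs_checkout_command(module,
                         host="cvs.sourceforge.net",
                         root="/cvsroot/cmusphinx"):
    """Build the argument list for an anonymous CVS checkout of one module.

    Hypothetical helper: it only constructs the command and does not run it.
    """
    cvsroot = f"-d:pserver:anonymous@{host}:{root}"
    return ["cvs", "-z3", cvsroot, "co", module]


if __name__ == "__main__":
    # Module names as listed on the slide.
    for name in ["SphinxTrain", "archive_s3", "sphinx2", "sphinx3", "share"]:
        print(" ".join(cvs_checkout_command(name)))
```

The resulting list could be handed to `subprocess.run` on a machine with a `cvs` client installed; whether the server named on the slide still answers is not something this sketch can confirm.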
Rule 3: You may need other modules to complete your task • SphinxTrain relies heavily on forced alignment, so you also need s3-align • Using any s3 recognizer requires the LM in DMP format, so you need the tool lm3g2dmp, which can be found in sphinx2 or share
Where can I get more information for the recognizer? • People to ask • s2: Evandro, Ravi • S3 flat: Evandro, Ravi, ArthurC • S3 tree: Evandro, Ravi, ArthurC • SphinxTrain: Rita, Evandro, Ravi, ArthurC, Rong, Ziad, Murali • S4: S4’s developers on SourceForge • Willie, Paul, Phillip, Bhiksha, Rita, Evandro
Web pages to look up • Rita’s web page • www.cs.cmu.edu/~rsingh • Contains the training manual • Twiki web page for Sphinx 4 design • www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/WebHome/ • ArthurC’s web page • Risked his life to write a manual for Sphinx 3.4 • Also collects some information on each Sphinx
Outlook for all recognizers • Sphinx II • Sorry, we won’t support it too much • Reason: s3.4 and s4 have proven to have very good speed and accuracy • Sphinx III • The only active branch is s3.5 • Moderate changes in s3flat • Motivated by project CALO • This quarter: make adaptation work • SphinxTrain • Write a set of scripts for continuous HMM training • The silence deletion problem will be fixed
(cont.) • sphinxDoc • Chapters 1 and 2 completed (*sigh*, still 7 left) • Only being written when Arthur C is procrastinating and doesn’t want to read or play video games • Will be there around Sep or Oct • Sphinx IV • Alpha release • Trainer will be fixed • Argus • Incorporates the advantages of many speech recognizers together • Not yet started
Conclusion • This presentation • Summarized the current code status of Sphinx and SphinxTrain • We still have a lot of work to do… • Next presentation • s3 or s3.4, from main to the search