Voice Recognition

Voice Recognition Lawrence Pan Syen Hassan Jamme Tan

Overview • History of voice recognition • Why voice recognition? • Technology behind voice recognition • Five major steps • Common applications • Current leaders • Demonstrations • Product Evaluation • Implementation of our own voice recognition system • Grade retrieval system for EE3414 • Future Challenges

History of Voice Recognition • Radio Rex (voice-activated toy dog), 1922 • U.S. Department of Defense, 1940s • Speech Understanding Research (SUR) program • Carnegie Mellon University & MIT • Automatic interception & translation of Russian radio transmissions (FAILURE) • Original message: “the spirit is willing but the flesh is weak” • Translated message: “the vodka is strong but the meat is disgusting”

History Cont’d • First major achievements • Bell Laboratories, 1952 • Successful recognition of the digits 0 to 9, spoken over the telephone • MIT, 1959 • Successful recognition of vowels with 93% accuracy • Carnegie Mellon University, 1970s • HARPY system: capable of recognizing complete sentences

History Cont’d • Obstacles • Computing power: over 50 computers needed for the HARPY system to perform • Ability to recognize speech from any person • Taking into account different accents, speech tones, etc. • Ability to recognize continuous speech • so…we…do…not…have…to…speak…like…this! • Commercialization of voice recognition systems

History Cont’d • Figure: computation required vs. computation available in processors over time • Figure: accuracy and task complexity progress over time

Why Voice Recognition? • Convenience • Natural user interface: human speech • Improved services for the disabled • Wider range of users • Future possibilities and improvements • Internet use over phones through voice portals • Advanced applications implementing voice control in all areas

Technology behind Voice Recognition • Five major steps used by a speech recognizer

Five major steps in voice recognition • Capture and Digitization • System interacts with the telephony device to capture voice input at 8000 samples/sec • Spectral Representation • Voice samples are converted to a graphical representation of their frequency content • Segmentation • Speech signal is broken down into short segments • Improves accuracy • Reduces computation: the entire signal cannot be processed at once in real time
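The capture, spectral representation, and segmentation steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the recognizer described in these slides: the capture step is replaced by a synthesized tone, and the 25 ms frame length, 10 ms hop, Hamming window, and all function names are choices made for the example. Only the 8000 samples/sec rate comes from the slide.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples/sec, the telephony rate quoted on the slide


def capture_tone(freq_hz, duration_s, rate=SAMPLE_RATE):
    """Stand-in for real capture: synthesize a sine tone as a list of samples."""
    n = int(duration_s * rate)
    return [math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n)]


def segment(samples, frame_len=200, hop=80):
    """Split the signal into overlapping frames (25 ms long, 10 ms apart at 8 kHz)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 after applying a Hamming window."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def peak_frequency(frame, rate=SAMPLE_RATE):
    """Frequency in Hz of the strongest bin in the frame's spectrum."""
    spectrum = magnitude_spectrum(frame)
    k = max(range(len(spectrum)), key=spectrum.__getitem__)
    return k * rate / len(frame)


signal = capture_tone(440.0, 0.1)   # 0.1 s of a 440 Hz tone
frames = segment(signal)            # short frames instead of the whole signal
print(len(frames), peak_frequency(frames[0]))
```

Processing one short frame at a time keeps the cost of each transform small, which is the computational benefit the Segmentation bullet points to. With 200-sample frames at 8000 samples/sec, each spectral bin spans 40 Hz.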

Graphical Representations

Acoustic Model • Phoneme – smallest phonetic unit in a language • Distinguishes one word from another • E.g. b in boy and t in toy • Allophone – one of the different pronunciations of a phoneme • E.g. t in tab, t in stab, tt in stutter • Database (Lexicon) of all words known to the system for a language • Should contain several recordings for certain words • E.g. “the” can be pronounced “duh” or “dee”
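A lexicon that stores several pronunciations per word could be represented as below. This is a hypothetical sketch: the ARPAbet-style phoneme symbols, the three-word vocabulary, and the function names are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical lexicon: each word maps to one or more phoneme sequences.
# The ARPAbet-style symbols are illustrative, not taken from the slides.
LEXICON = {
    "the": [("DH", "AH"), ("DH", "IY")],  # two pronunciations of one word
    "boy": [("B", "OY")],
    "toy": [("T", "OY")],
}


def words_matching(phonemes):
    """Return every word that has the given phoneme sequence as a pronunciation."""
    target = tuple(phonemes)
    return sorted(word for word, variants in LEXICON.items() if target in variants)


def differing_phonemes(word_a, word_b):
    """List the (a, b) phoneme pairs that differ, using each word's first variant."""
    first_a, first_b = LEXICON[word_a][0], LEXICON[word_b][0]
    return [(a, b) for a, b in zip(first_a, first_b) if a != b]


print(words_matching(["DH", "IY"]))       # both variants resolve to the same word
print(differing_phonemes("boy", "toy"))   # the single phoneme that separates them
```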

Acoustic Model Cont’d • Trellis • Data structure made up of all possible combinations of allophones • Training of acoustic models • For single-user systems • Text is read by the user and recognized by the system • For multi-user systems • Utterances spoken by many users are compiled into a database, then fed into a recognizer • Weights are assigned to certain allophones
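One common way to search a trellis of weighted units is the Viterbi algorithm, sketched below. The slides do not say which search method is used, so this is an assumption, as are the unit labels, the weights, the transition probabilities, and the function name.

```python
import math


def best_path(trellis, transitions, floor=1e-6):
    """Return the most probable unit sequence through the trellis.

    trellis: list of columns, each a dict {unit: weight for that time step}.
    transitions: dict {(previous_unit, unit): probability}; missing pairs
    fall back to a small floor value so every path stays possible.
    """
    # best[unit] = (log-score of the best path ending in unit, that path)
    best = {unit: (math.log(w), [unit]) for unit, w in trellis[0].items()}
    for column in trellis[1:]:
        updated = {}
        for unit, weight in column.items():
            candidates = [
                (score + math.log(transitions.get((prev, unit), floor)), path)
                for prev, (score, path) in best.items()
            ]
            score, path = max(candidates, key=lambda c: c[0])
            updated[unit] = (score + math.log(weight), path + [unit])
        best = updated
    return max(best.values(), key=lambda c: c[0])[1]


# Hypothetical two-step trellis: the first sound is ambiguous between b and t.
trellis = [{"b": 0.55, "t": 0.45}, {"oy": 1.0}]
transitions = {("b", "oy"): 0.2, ("t", "oy"): 0.8}
print(best_path(trellis, transitions))
```

Keeping only the best path into each unit at every step avoids enumerating all combinations explicitly, which matters because the number of combinations grows exponentially with the length of the utterance.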

Language Model • Languages have structure (i.e. grammar) • Two similar-sounding words can be difficult to tell apart • Can be distinguished using context • E.g. “ours” vs. “hours” can be resolved if the previous word is “two”
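The “ours” vs. “hours” example can be sketched with a simple bigram model, which scores each candidate by how often it follows the previous word. The counts below are invented for illustration, and the slides do not state that the system uses bigrams; both are assumptions.

```python
# Hypothetical bigram counts, as if tallied from a training corpus.
BIGRAM_COUNTS = {
    ("two", "hours"): 40,
    ("two", "ours"): 1,
    ("is", "ours"): 25,
    ("is", "hours"): 2,
}


def choose_word(previous_word, candidates):
    """Pick the candidate most often seen after previous_word (0 if unseen)."""
    return max(candidates, key=lambda w: BIGRAM_COUNTS.get((previous_word, w), 0))


print(choose_word("two", ["ours", "hours"]))
print(choose_word("is", ["ours", "hours"]))
```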

Common Applications • Call Center Automation • Widely used in all industries (consumer interface) • Airline companies: booking flights, general info, etc. • Banking companies: “pay by phone”, account balances, etc. • Delivery Services (FedEx): tracking orders, etc. • All general customer service systems • Computer Integration of voice recognition • Personal Computers • Speech to Text Dictation • Accessibility purposes: voice control of computers

Common Applications cont’d • Integrated into automobiles: • Visteon Voice Technology™ used in Infiniti Q45 • Controls: • Climate • CD player • Navigation system

Competing Standards • VoiceXML (Voice Extensible Markup Language) • Partners: AT&T, IBM, Motorola, Lucent Tech. • Used in implementation of most voice portals • Shifting target toward web developers • SALT (Speech Application Language Tags) • Partners: Microsoft, Intel, Cisco, SpeechWorks • Targeted toward web developers

Future Challenges • Speech Technology • VoiceXML vs. SALT • Voice-enabling web content • Real-time access to source data • Stock market, traffic, sports, etc. • Clear connection needed for effective use of voice portals • Security issues involved • Advertising-based revenue
