Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?
Joseph Picone, PhD, Professor, Electrical Engineering, Mississippi State University
Patti Price, PhD, VP Business Development, BravoBrava LLC
Outline
• Introduction and state of the art (Price)
• Research issues (Picone)
  • Evaluation metrics
  • Acoustic modeling
  • Language modeling
  • Practical issues
  • Technology demands
• Conclusion and future directions (Price)
Introduction: What is Speech Recognition?
[Diagram: Speech Signal → Speech Recognition → Words (“How are you?”)]
• Speech recognition does NOT determine
  • Who the talker is (speaker recognition, Heck and Reynolds)
  • Speech output (speech synthesis, Fruchterman and Ostendorf)
  • What the words mean (speech understanding)
Goal: Automatically extract the string of words spoken from the speech signal
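The slide states the goal but not how to score it. One widely used measure of how well a recognizer recovers the word string is word error rate (WER): the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below is illustrative only. The function name and the lowercase, whitespace-based tokenization are assumptions, not something the slides specify.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace and compared case-insensitively.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the slide's example utterance, a hypothesis of "who are you" against the reference "how are you" has one substitution in three words, so the WER is 1/3. Because insertions count as errors, WER can exceed 1.0.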
Introduction: Speech in the Information Age
• Speech & text were revolutionary because of information access
• New media and connectivity yield information overload
• Can speech technology help?
[Figure labels, plotted against time]
• Source of Information: speech; text; film, video, multimedia, voice mail, radio, television, conferences, web, on-line resources
• Access to Information: listen, remember; read books; computer typing; careful spoken, written input; conversational language
State of the Art: Initial and Current Applications
1997
• Database query
  • Resource management
  • Air travel information
  • Stock quote
• Command and control
  • Manufacturing
  • Consumer products
http://www.speech.be.philips.com/
Nuance, American Airlines: 1-800-433-7300, touch 1
• Dictation
  • http://www.dragonsys.com
  • http://www-4.ibm.com/software/speech
State of the Art: How Do You Measure?
USC, October 15, 1999: “the world's first machine system that can recognize spoken words better than humans can.”
“In benchmark testing using just a few spoken words, USC's Berger-Liaw … System not only bested all existing computer speech recognition systems but outperformed the keenest human ears.”
• What benchmarks? What was training? What was test? Were they independent? How large was the vocabulary and the sample size? Did they really test all existing systems?
“… functions at 60 percent recognition with a hubbub level 560 times the strength of the target stimulus.”
• Is that different from chance? Was the noise added or coincident with speech? What kind of noise? Was it independent of the speech?
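Two of the questions above can be made concrete with a little arithmetic. The sketch below is a rough illustration, not a reconstruction of the USC evaluation. It assumes each test item is an independent guess among equally likely vocabulary words, and it treats the quoted "560 times" first as a power ratio and then as an amplitude ratio, since the quote does not say which. All function names and example numbers are hypothetical.

```python
import math


def p_value_above_chance(n_correct: int, n_trials: int, vocabulary_size: int) -> float:
    """Probability of at least n_correct hits in n_trials by pure guessing.

    Assumption: every trial is an independent, uniform guess over the
    vocabulary, so the chance hit rate is 1 / vocabulary_size.
    """
    p = 1.0 / vocabulary_size
    return sum(
        math.comb(n_trials, k) * p**k * (1.0 - p) ** (n_trials - k)
        for k in range(n_correct, n_trials + 1)
    )


def snr_db(signal_to_noise_ratio: float, ratio_is_power: bool = True) -> float:
    """Convert a signal-to-noise ratio to decibels.

    Power ratios use 10*log10; amplitude ratios use 20*log10.
    """
    factor = 10.0 if ratio_is_power else 20.0
    return factor * math.log10(signal_to_noise_ratio)


# Hypothetical case: 6 of 10 correct with a two-word vocabulary.
two_word_p = p_value_above_chance(6, 10, 2)

# Noise 560 times the signal, under each reading of "strength".
snr_if_power = snr_db(1 / 560)
snr_if_amplitude = snr_db(1 / 560, ratio_is_power=False)
```

With a two-word vocabulary and only ten test items, six correct has a tail probability of about 0.38 under guessing, so "60 percent" would not be distinguishable from chance. The same accuracy over a larger vocabulary or sample would be far stronger evidence, which is why the slide asks for both numbers. The noise level works out to roughly -27.5 dB if "strength" means power and roughly -55 dB if it means amplitude.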
State of the Art: Factors that Affect Performance
[Figure: four dimensions of difficulty against a time axis marked 1985, 1995, 2000, 2005. Labels recovered from the figure are listed from the earliest stage to the latest.]
• USER POPULATION: speaker-dependent; speaker independent and adaptive; regional accents, native speakers, competent foreign speakers; all speakers of the language, including foreign
• NOISE ENVIRONMENT: quiet room, fixed high-quality mic; normal office, various microphones, telephone; vehicle noise, radio, cell phones; wherever speech occurs
• SPEECH STYLE: careful reading; planned speech; natural human-machine dialog (user can adapt); all styles, including human-human (unaware)
• COMPLEXITY: application-specific speech and language, expert years to create app-specific language model; some application-specific data and one engineer year; application independent or adaptive
Research Theory and Trends: Initial and Current Applications
• Insert Joe’s slides here
Conclusion and Future Directions: Trends
• Speech as Access: What are the words?
• Speech as Source: What does it mean?
• Information as Partner: Here’s what you need.
We need new technology to help with information overload
• Speech information sources are everywhere
  • Voice mail messages
  • Professional talk
  • Lectures, broadcasts
• Speech sources of information will increase
  • As devices shrink
  • As mobility increases
  • New uses: annotation, documentation
Conclusion and Future Directions: Limitations on Applications
• Recognition performance, especially in error recovery UI
• Natural language understanding (speech differs from text)
  • Speech unfolds linearly in time
  • Speech is more indeterminate than text
  • Speech has different syntax and semantics
  • Prosody differs from punctuation
• Cost to develop applications (too few experts)
• Cost to integrate/interoperate with other technologies
• New capabilities
  • “When did he say Y and was he angry?”
  • Scanning, refocusing quickly (browsing)
  • Match past pattern, find novel aspects
  • Proactive information
  • Gist, summarize, translate for different purposes
Conclusion and Future Directions: Applications on the Horizon
Why doesn’t belong in the classroom
• Beulah Arnott: also true of indoor plumbing
Beginnings of speech as source of information
• ISLIP http://www.mediasite.net/info/frames.htm
• Virage http://www.virage.com
• Speech technology in education and training
  • Cliff Stoll, High Tech Heretic
    • Good schools need no computers
    • Bad schools won’t be improved by them
  • BravoBrava: Co-evolving technology and people can
    • Dramatically reduce the cost of delivery of content
    • Increase its timeliness, quality and appropriateness
    • Target needs of individual and/or group
  • Reading Pal demo
Summary
Goal: Speech Better Than Text
Healthy loop between research and applications
• Research leads to applications, which lead to new research opportunities
We need collaboration
• Too much for one person, one site, one country
Humans will probably continue to be better than machines at many things
Can we learn to use technology and training to augment human-human and human-machine collaboration?
It’s not a solved problem
• Further technology development needed to enable the vision