

1. Next Generation Speech and Video: Support for Research in Advanced Speech Recognition Technologies
John Garofolo, IAD Speech Group

2. Overview
• Directions in automatic speech recognition
• DARPA EARS Program
• NIST RT-02 Evaluation
• NIST Meeting Data Collection Project

3. Our Vision of the Future
• Tightly couple ASR with higher-level language technologies
  • non-lexical information from the source signal: speaker ID, speaking rate, prosody, emotion, non-speech sounds, etc.
  • real-time integration of language processing technologies: semantic, syntactic, contextual, world knowledge, and ASR
  • integration with video input when available: face recognition, lip movement, gestures, people movement, object manipulation
• Improved resources for readability and automatic content-processing technologies
  • translation, detection, search, extraction, summarization, etc.

4. Possible Enriched ASR Output: Enriched Transcription (Broadcast News Example)
• Traditional ASR output:
  tonight this thursday big pressure on the clinton administration to do something about the latest killing in yugoslavia airline passengers and outrageous behavior at thirty thousand feet what can an airline do and now that el nino is virtually gone there is la nina to worry about from a. b. c. news world headquarters in new york this is world news tonight with peter jennings good evening
• Annotated word stream:
  <speaker name="Peter Jennings"> <sent> tonight this <proper_noun> thursday </proper_noun> big pressure on the <proper_noun> clinton </proper_noun> administration to do something about the latest killing in <proper_noun> yugoslavia </proper_noun> </sent> <sent> airline passengers and outrageous behavior at thirty thousand feet </sent> <sent type=interrogative> what can an airline do </sent> <sent type=interrogative> and now that <proper_noun> el nino </proper_noun> …
• Derived human-readable transcript (for human readers and other language processing):
  Peter Jennings: Tonight this Thursday, big pressure on the Clinton administration to do something about the latest killing in Yugoslavia. Airline passengers and outrageous behavior at thirty thousand feet. What can an airline do? And now that El Nino is virtually gone, there is La Nina to worry about.
  Announcer: From ABC News World Headquarters in New York, this is World News Tonight with Peter Jennings.
  Peter Jennings: Good evening.
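
To make the mapping from the annotated word stream to the readable transcript concrete, here is a minimal sketch in Python. The tag names (<speaker>, <sent>, <proper_noun>) come from the slide's example; the rendering rules, and the closing of tags so the fragment parses, are illustrative assumptions rather than the EARS specification.

```python
import re

# Annotated word stream in the style of the slide's example (tags closed here
# so the sketch is self-contained; the slide's fragment is truncated).
ANNOTATED = (
    '<speaker name="Peter Jennings">'
    '<sent> tonight this <proper_noun> thursday </proper_noun> big pressure on '
    'the <proper_noun> clinton </proper_noun> administration to do something '
    'about the latest killing in <proper_noun> yugoslavia </proper_noun> </sent>'
    '<sent type=interrogative> what can an airline do </sent>'
    '</speaker>'
)

def render(stream: str) -> str:
    """Derive a human-readable transcript from the annotated word stream."""
    lines = []
    for name, body in re.findall(
            r'<speaker name="([^"]+)">(.*?)</speaker>', stream, re.S):
        sents = []
        for attrs, text in re.findall(r'<sent([^>]*)>(.*?)</sent>', body, re.S):
            # Title-case proper nouns, then drop the word-level tags.
            text = re.sub(r'<proper_noun>\s*(.*?)\s*</proper_noun>',
                          lambda m: m.group(1).title(), text)
            words = text.split()
            words[0] = words[0].capitalize()
            # Sentence type drives punctuation: interrogatives get '?'.
            sents.append(' '.join(words) +
                         ('?' if 'interrogative' in attrs else '.'))
        lines.append(f'{name}: ' + ' '.join(sents))
    return '\n'.join(lines)

print(render(ANNOTATED))
# Peter Jennings: Tonight this Thursday big pressure on the Clinton
# administration to do something about the latest killing in Yugoslavia.
# What can an airline do?
```

A production renderer would need much more, e.g. comma placement and spoken-letter collapsing ("a. b. c." to "ABC"), which is exactly the kind of information the enriched output is meant to carry.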

5. DARPA EARS Program: Effective, Affordable, Reusable Speech-to-Text
• Multi-faceted program to improve the state of the art:
  • Accuracy: novel approaches such as perceptual/articulatory and prosodic features, more sophisticated search networks and language models, metadata feedback
  • Utility: usable interfaces; transcription enhanced with metadata (rich transcription)
  • Portability: new domains/training data, new languages, flexible language models
  • Speed: faster, more efficient processing algorithms
• NIST will provide evaluation infrastructure for EARS:
  • accuracy measurement of core STT and metadata recognition
  • usability measurement within the context of integrated systems

6. EARS Objective
• Powerful speech-to-text technology
• Input: human-human speech (broadcasts, conversations)
• Output: rich transcript (words + metadata), feeding multiple applications
• Accurate enough for:
  • humans to read and understand easily
  • machines to detect, extract, summarize, and translate

7. EARS Structure (diagram)
Human-human speech → core speech-to-text plus metadata extraction (rich transcription), supported by novel approaches and linguistic data → words + metadata in an EARS standard format (XML or DAML) → TIDES algorithms (detection, extraction, summarization, translation) and prototype system interfaces; adaptable to different languages and media.

8. Rich Transcription 2002 (RT-02)
• Evaluation effort and workshop pushing the envelope of existing automatic transcription technology
• Will also provide accuracy evaluation for the DARPA EARS Program
• Challenge test set to baseline current capabilities:
  • ~3+ hours of news broadcasts, telephone conversations, and meeting excerpts
  • evaluation of automatic transcription of orthography AND generation of metadata annotations
  • metadata annotation will require new evaluation infrastructure
• Dry run test: April 2002
• Workshop: May 2002
• http://www.nist.gov/speech/tests/rt/rt2002/
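
On the core transcription side, accuracy is conventionally scored as word error rate against a reference transcript. NIST's scoring tools (sclite and friends) do full alignment and reporting; the sketch below shows only the underlying word-level edit-distance computation:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate = (substitutions + deletions + insertions) / reference words."""
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over word tokens.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # deleting all remaining reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # inserting all remaining hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # match/substitution
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[-1][-1] / len(r)

print(wer("what can an airline do", "what can the airline do"))  # 0.2
```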

9. RT-02 Metadata
• Currently considered types (with example phenomena):
  • Speaker change detection/identification
  • Acronyms: "NIST is administering the EARS evaluations..."
  • Verbal edits: "General Dynamics', uh no, General Electric's stock soared yesterday..."
  • Named entities: "George Bush addressed the country..."
  • Numeric expressions: "The U S won thirty four medals in the Olympics"
  • Temporal expressions: "The U S was attacked on September eleventh"
• This list will almost certainly expand or change in the future.
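
One natural way to think about such metadata is as typed spans over the recognized word stream. The sketch below is purely illustrative of that idea; the field names and types are invented here and are not the RT-02 annotation format:

```python
from dataclasses import dataclass

@dataclass
class MetadataSpan:
    kind: str       # e.g. "acronym", "verbal_edit", "named_entity", "temporal_expr"
    start: int      # index of the first word covered
    end: int        # index one past the last word covered
    norm: str = ""  # optional normalized form for readable output

words = "the u s was attacked on september eleventh".split()
spans = [
    MetadataSpan("acronym", 1, 3, norm="U.S."),
    MetadataSpan("temporal_expr", 6, 8, norm="September 11"),
]
for s in spans:
    print(f'{s.kind}: "{" ".join(words[s.start:s.end])}" -> {s.norm}')
# acronym: "u s" -> U.S.
# temporal_expr: "september eleventh" -> September 11
```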

10. RT-02 Meeting Transcription
• Not part of EARS, but included as a look to the future; of interest to much of the community
  • more challenging than broadcast news or telephone conversations
• Will consist of eight 10-minute meeting excerpts collected at CMU, ICSI, LDC, and NIST
• Very broad test set with respect to microphones, speakers, forums, and noise
• Will provide a baseline for future meeting transcription research
• Focus on personal mics (head boom or lapel) and a central omnidirectional mic

11. NIST Meeting Data Collection Project
• Goals:
  • Provide a rich, diverse pool of audio and video corpora for advanced recognition research
    • multiple sensor types (more will be added over time)
    • varied meeting forums and vocabularies
    • varying number and types of participants
  • Explore research and integration issues
  • Help provide infrastructure for integration and evaluation
• www.nist.gov/speech/test_beds/mr_proj

12. Data Collection Infrastructure
• Typical meeting space and noise environment; standard meeting equipment
• Instrumented with:
  • 200 mics and 5 cameras, synchronized with SmartFlow across 13 processors
  • processors under the floor and in an adjacent room
  • several disk arrays
• Monitor workstation: operator can start/stop data streams, select video views and audio channels, and manipulate cameras
• Review workstation: participants can review meeting recordings and de-select excerpts from public distribution

13. NIST Pilot Corpus Design
• 20 hours of meetings will be collected
  • ~60 GB per hour uncompressed data rate
  • data distributed on large hard disks via the Linguistic Data Consortium
• Varied forums: focus groups, game playing, interacting with experts, real working-group meetings, event planning
• Varied meeting lengths: 15 minutes to 1 hour
• Varied number of participants: 3 to 8
• Subset to be annotated for RT-02
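
A quick worked number shows why hard disks are the distribution medium: at the stated ~60 GB per hour, the 20-hour pilot corpus is on the order of a terabyte, far beyond the CD-ROM media then typical for speech corpora (the 650 MB per disc below is that assumption):

```python
hours = 20
gb_per_hour = 60                           # uncompressed rate from the slide
total_gb = hours * gb_per_hour             # 1200 GB, i.e. about 1.2 TB
cds = total_gb * 1024 / 650                # assuming 650 MB CD-ROM media
print(f"{total_gb} GB total, roughly {cds:.0f} CD-ROMs")  # ~1890 CD-ROMs
```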

14. NIST Smart Flow: Distributed Processing
• Multi-modal sensor arrays; multi-channel data collection
• Large-grain data flow for distributed processing of sensor data
• Components and flows are used by name:
  • network transparent
  • component transparent
• The system handles the detail work: resource searching, data pushing, flow-visualization interface
• Time-tags flow buffers
• Data types: video, audio, vectors, matrices, opaque data
• Promotes a well-defined, public interface standard for component technologies
• Open source, documented, currently downloadable: http://www.nist.gov/smartspace/toolChest/nsfs/
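
The slide describes the programming model rather than its API. Purely as an illustration of "flows used by name" with time-tagged buffers, here is a hypothetical Python sketch; the real NSFS toolkit is a C library, and every name below (FlowBus, publish, subscribe) is invented for this example.

```python
import time
from collections import defaultdict

# Hypothetical sketch of the Smart Flow model: components publish and
# subscribe to flows *by name*, and the transport time-tags each buffer.
class FlowBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, flow_name, callback):
        # Components never address each other directly, only named flows.
        self.subscribers[flow_name].append(callback)

    def publish(self, flow_name, data):
        stamped = {"t": time.time(), "data": data}  # time-tagged buffer
        for cb in self.subscribers[flow_name]:
            cb(stamped)

bus = FlowBus()
bus.subscribe("audio/mic-07",
              lambda buf: print("got", len(buf["data"]), "samples at", buf["t"]))
bus.publish("audio/mic-07", [0] * 441)  # one 10 ms frame at 44.1 kHz
```

Naming the flows rather than the components is what buys the network and component transparency the slide lists: a consumer of audio/mic-07 neither knows nor cares which machine captures that channel.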

15. Smart Flow in the Meeting Room
• Creates and manages data flows
• Captures multi-channel, multi-modal meeting room sensors:
  • five camera views
  • twenty-three COTS close-talk and omni microphones
  • three 59-element microphone arrays
• Archives and time-stamps high-bandwidth sensor data flows in real time: about sixty gigabytes per hour
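
As a sanity check on that figure: the microphones alone come to 200 channels (23 close-talk/omni plus 3 × 59 array elements), and under an assumed 16-bit, 44.1 kHz sampling format (not stated on the slide), audio alone lands near the quoted rate, suggesting the video streams are compressed or the audio format differs somewhat:

```python
channels = 23 + 3 * 59                     # = 200 microphone channels
bytes_per_second = channels * 44_100 * 2   # assumed 16-bit samples at 44.1 kHz
gb_per_hour = bytes_per_second * 3600 / 1e9
print(f"{gb_per_hour:.1f} GB/hour of audio")  # ~63.5 GB/hour, before video
```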

16. What's Next?
• Addition of teleconference microphones and phone-channel recordings
  • collection of multi-site video/teleconferenced meetings
• Replacement of array microphones with next-generation models with onboard A/D
• Addition of an interactive electronic whiteboard
  • will log and timestamp interactions
  • synchronized with audio and video
• Exploration of other room sensors and interactive devices
  • e.g., location badges, handheld devices on wireless networks, collecting data streams to the screen
• Development of multi-modal/multi-channel annotation tools

17. Review Workstation Demo
• Multi-view, multi-audio-channel playback
• Permits subjects to review their meetings and request that excerpts be excluded from publication
• Linux-based
• Uses the SmartFlow architecture
• Sample meeting excerpt: group interaction with a domain expert on office furnishing
