Open-vocabulary speech indexing for voice and video mail retrieval.

ACM Multimedia 96 Best Paper Award Open-vocabulary speech indexing for voice and video mail retrieval. M. G. Brown, J. T. Foote, G. J. F. Jones, K. Spärck Jones, and S. J. Young. Presented by Zahid Anwar

Outline • Motivation • Overview of Audio Recognition and Word Spotting • Authors’ Infrastructure • Phonemes, lattices (construction & scanning) • Evaluation • VMR Architecture • Comments / Future Work • Related Work

Motivation • A “Google” for audio/video? • Browsing an answering machine full of messages on returning from a vacation. • Flexibility over video-on-demand, which permits only one way of viewing content. • Distance learning: search university and corporate archives of recorded audio and video, analyst calls, prepared lectures, and instructional material. • Automatic call routing: infer inbound caller intentions without resorting to IVR or finite-state grammar-based speech recognition. • Focus group analysis: find and extract relevant clips without slogging through hours of recorded material, reducing the time and expense of preparing summaries and recommendations. • Deposition and interview support: where expensive transcription isn’t available or used, search recorded depositions and interviews for key remarks.

Speech Recognition The complex speech production/perception Process

Speech Recognition • Traditional: Large Vocabulary Continuous Speech Recognition (LVCSR), performs time alignment, and produces an index of text content along with time stamps. • Recognizer tries to transcribe all input speech as a concatenation of words in its vocabulary. • Sufficiently mature that toolboxes are publicly available such as HTK (from Cambridge University, England) and ISIP (Mississippi State University, USA) as well as a host of commercial offerings.

Medusa Networked Multimedia Systems • Cambridge University, in collaboration with Olivetti Research Laboratory (ORL) • In regular use on a high speed ATM network • 1994, Message retrieval from known speakers (35 predefined keywords) • 1995, extended to unknown speakers • 1996, Open-keyword video document retrieval (arbitrary speakers) and a video mail browser

Word Spotting • Detecting fixed keywords in unconstrained speech • The recognizer is only concerned with occurrences of one keyword or phrase. • Based on HMMs: a state-based statistical representation of a speech event (typically a word or subword) • Phoneme: the smallest linguistically distinct unit of speech (one that can distinguish meaning) • Phone: the physical sound produced when a phoneme is uttered (the realization of a phoneme) • All words are built from phone sequences drawn from a set of 45 distinct phones

DARPA TIMIT Acoustic-phonetic Continuous Speech Corpus

Phones • Monophones: independent of context (used as fillers) • Biphones: used at the beginning and end of a keyword • Triphones: model the internal structure • Keyword “find”: f+ay f-ay+n ay-n+d n-d • Non-keyword speech is modeled by an unconstrained network of monophones • Phones vary somewhat with context (depending on which phones precede and which succeed them), e.g. the “t” in “attack” and “stain” • Sufficient training data can help model this variation • Speaker-independent biphone models are constructed by cloning the speaker-dependent monophones so that each possible biphone is represented
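The biphone/triphone naming on this slide can be illustrated with a small sketch. It assumes the HTK-style convention visible in the “find” example (`left-phone+right`); the function name `expand_keyword` is hypothetical and not taken from the paper.

```python
def expand_keyword(phones):
    """Expand a monophone sequence into context-dependent model names.

    Assumed convention (as in the "find" example): "l-p+r" means phone p
    with left context l and right context r. The first and last phones
    have only one neighbour, so they become biphones.
    """
    if len(phones) < 2:
        return list(phones)  # a single phone has no context to attach
    models = []
    for i, phone in enumerate(phones):
        left = f"{phones[i - 1]}-" if i > 0 else ""
        right = f"+{phones[i + 1]}" if i < len(phones) - 1 else ""
        models.append(f"{left}{phone}{right}")
    return models


print(expand_keyword(["f", "ay", "n", "d"]))
# ['f+ay', 'f-ay+n', 'ay-n+d', 'n-d']
```

The two end models are biphones and the inner ones triphones, matching the roles listed above.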

Lattice Generation • Lattice: a compact representation of the multiple best hypotheses generated by a phone or word recognition system • Depth: the number of active phone hypotheses • Too shallow: poor performance (too many misses) • Too deep: false alarms, increased storage and search • [Figure labels: spoken data, HMM model parameters, Viterbi algorithm]

Lattice Scanning (lattices are generated before they are needed) • Look at the probabilities of various phonemes as we listen: • In the corpus, “need” always starts with the "n" sound. • What are the possibilities for the next sound? With probability 1, the next sound will be "iy". • What are the possibilities for the sound after that? 11% of the time, the “d” sound will be omitted. • The probability of transitioning from "iy" to the "d" sound is 0.89. • Circles represent two things: states and observations. • In the real world, the state is hidden: for the sound [iy], we don't know whether we are at the second phoneme of the word “knee” or the second phoneme of the word “need”. • Overlapping word hypotheses are eliminated by keeping only the best-scoring one • The lattice can be searched extremely rapidly to find phone strings corresponding to the desired query words • Too many hypotheses may lead to false alarms • [Figure: pronunciation model for “need”, in which the “d” sound is sometimes left off. Is it “knee” or “need”?]
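A minimal sketch of lattice scanning, under stated assumptions: the lattice is a list of hypothetical `(start, end, phone, log_score)` edges with frame numbers as nodes, and overlapping hits are pruned greedily by score. The data layout, scores and function names are illustrative only, not the authors' implementation.

```python
from collections import defaultdict


def scan_lattice(edges, query):
    """Return (start, end, score) for every path whose phones spell `query`.

    edges: list of (start_node, end_node, phone, log_score) tuples.
    query: list of phone labels, e.g. ["n", "iy", "d"].
    """
    by_start = defaultdict(list)
    for start, end, phone, score in edges:
        by_start[start].append((end, phone, score))

    hits = []

    def extend(node, idx, first, total):
        if idx == len(query):  # whole query matched
            hits.append((first, node, total))
            return
        for end, phone, score in by_start[node]:
            if phone == query[idx]:
                extend(end, idx + 1, first, total + score)

    for start in list(by_start):  # try every node as a possible word start
        extend(start, 0, start, 0.0)
    return hits


def best_non_overlapping(hits):
    """Among hits that overlap in time, keep only the best-scoring one."""
    kept = []
    for hit in sorted(hits, key=lambda h: h[2], reverse=True):
        if all(hit[1] <= k[0] or hit[0] >= k[1] for k in kept):
            kept.append(hit)
    return sorted(kept)


# Toy lattice with depth 2: two competing phone hypotheses per segment.
lattice = [
    (0, 10, "n", -1.0), (0, 10, "m", -1.5),
    (10, 25, "iy", -0.5), (10, 25, "ih", -0.9),
    (25, 30, "d", -2.0), (25, 30, "t", -1.2),
]
print(scan_lattice(lattice, ["n", "iy", "d"]))  # [(0, 30, -3.5)]
print(scan_lattice(lattice, ["n", "iy"]))       # [(0, 25, -1.5)]
```

Both “need” and “knee” are found in the same stretch of the toy lattice, which is the ambiguity the slide describes; a deeper lattice yields more such competing hits and therefore more false alarms.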

Speech Recognition Results • The natural-speech VMR corpus is more difficult to recognize than read speech (60% phone accuracy). • The operating point of the recognition system may be adjusted by ignoring terms with a score below a given threshold. • An accepted figure of merit (FOM) for word spotting is the average percentage of correctly detected words as the threshold is varied to give from 1 to 10 false alarms per keyword per hour. • Not robust for short words.
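The figure of merit can be sketched as follows. This is a simplified version for a single keyword, assuming the operating points are 1 to 10 false alarms per hour and omitting the interpolation used in formal evaluations; the function name and the toy scores are hypothetical.

```python
def figure_of_merit(putative_hits, num_true, hours=1.0, max_fa_rate=10):
    """Average detection percentage over 1..max_fa_rate false alarms per hour.

    putative_hits: list of (score, is_correct) pairs from the word spotter.
    num_true: number of actual keyword occurrences in the test speech.
    """
    ranked = sorted(putative_hits, key=lambda h: h[0], reverse=True)
    rates = []
    for fa_rate in range(1, max_fa_rate + 1):
        allowed = int(fa_rate * hours)  # false alarms tolerated at this point
        found = false_alarms = 0
        for _, is_correct in ranked:
            if is_correct:
                found += 1
            else:
                false_alarms += 1
                if false_alarms > allowed:
                    break  # threshold reached: ignore lower-scoring hits
        rates.append(found / num_true)
    return 100.0 * sum(rates) / len(rates)


hits = [(0.9, True), (0.8, False), (0.7, True),
        (0.6, False), (0.5, True), (0.4, False)]
print(figure_of_merit(hits, num_true=4))  # 72.5
```

Raising the score threshold moves the operating point toward fewer false alarms and fewer detections; the FOM summarises that trade-off in one number.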

Information Retrieval • The user composes a search request • A set of actual search terms is derived from it • Documents are scored from the number or weights of matching query terms. • An inverted file structure is computed (at search time), in which documents are indexed by term. • Retrieval efficiency is improved by preserving lattice-scan results in the inverted file
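A minimal sketch of the inverted file idea, assuming the lattice scan has already produced per-message term counts. The message IDs, terms and function names are made up for illustration.

```python
from collections import defaultdict


def build_inverted_file(doc_hits):
    """Turn {doc_id: {term: count}} into {term: {doc_id: count}}."""
    inverted = defaultdict(dict)
    for doc_id, terms in doc_hits.items():
        for term, count in terms.items():
            inverted[term][doc_id] = count
    return dict(inverted)


def unweighted_scores(inverted, query_terms):
    """Score each document by the number of distinct query terms it matches."""
    scores = defaultdict(int)
    for term in set(query_terms):
        for doc_id in inverted.get(term, {}):
            scores[doc_id] += 1
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical lattice-scan results for three messages.
scan_results = {
    "m1": {"network": 2, "meeting": 1},
    "m2": {"meeting": 3},
    "m3": {"camera": 1},
}
index = build_inverted_file(scan_results)
print(unweighted_scores(index, ["meeting", "network"]))
# [('m1', 2), ('m2', 1)]
```

Because the index is keyed by term, a term that has been scanned for once does not need a second pass over the lattices when it appears in a later query.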

IR Performance • 10 users, 5 text requests per person • Out of a 300-message archive, a ‘suitable’ assessment subset was formed (30 messages from the category and 5 outside) • Preprocessing of the text transcription and written request (removal of function words) • Unweighted score (uw): counts the number of terms in common • Collection frequency weight (cfw): favours rarer terms • Combined weight (cw): also normalises for document length • Precision: the proportion of retrieved messages that are relevant • Average precision: precision values averaged over each relevant document (per query), with results then averaged across the query set
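Average precision as described above can be sketched like this. It assumes the common convention that relevant messages which are never retrieved contribute a precision of zero; the function names and message IDs are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = 0
    precisions = []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            found += 1
            precisions.append(found / rank)  # precision at this rank
    # Unretrieved relevant documents count as zero (assumed convention).
    return sum(precisions) / len(relevant)


def mean_average_precision(runs):
    """Average the per-query values across the query set."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


ap = average_precision(["a", "x", "b", "y"], {"a", "b"})
print(round(ap, 4))  # 0.8333
```

Here the relevant messages appear at ranks 1 and 3, giving precisions 1/1 and 2/3, whose mean is about 0.83.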

Overview of the VMR System • Individual words in each message are assigned weights depending on • the frequency of the word in the message, • the number of messages in which the word appears, • the length of the message.
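The three factors above correspond to the collection frequency weight and combined weight. The slides do not give the formulas, so the sketch below assumes the standard Robertson and Spärck Jones forms, cfw(i) = log(N / n(i)) and cw(i, j) = cfw(i) · tf(i, j) · (K + 1) / (K · ndl(j) + tf(i, j)), with K = 1 chosen purely for illustration.

```python
import math


def cfw(num_docs, docs_with_term):
    """Collection frequency weight: larger for terms found in fewer messages."""
    return math.log(num_docs / docs_with_term)


def cw(num_docs, docs_with_term, tf, doc_len, avg_doc_len, k=1.0):
    """Combined weight (assumed form; k is an illustrative tuning constant).

    tf: occurrences of the term in this message.
    doc_len / avg_doc_len: normalised document length (ndl).
    """
    ndl = doc_len / avg_doc_len
    return cfw(num_docs, docs_with_term) * tf * (k + 1) / (k * ndl + tf)


# A term found in 3 of 300 messages outweighs one found in 150 of them.
print(cfw(300, 3) > cfw(300, 150))  # True
# The same term counts for less in a message twice the average length.
print(cw(300, 3, tf=1, doc_len=200, avg_doc_len=100)
      < cw(300, 3, tf=1, doc_len=100, avg_doc_len=100))  # True
```

A message score would then be the sum of cw over the query terms it contains, in the same way that the unweighted score counts them.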

Enhancing Search Effectiveness • Concatenation of two words into one phone string • “netscape” → “nets+scape” • Word stems and shorter variants • “Managerial” → “manage” • Homophones (exact rhyme) • “Basque” → “bask” • Phonetic representation • “Yeltsin” → “#y+#eh+#l+#t+#s+#ih+#n” • Conjunction • “A+T+M eighty”

Comments/Critiques + Trades off precision for a lot of speed, giving rapid audio indexing - Mentions rates approaching 1000x real time but gives no analysis + Independent of fixed-vocabulary recognition or keywords + In fact, no vocabulary at all + Can even recognize proper nouns! + Low penalty for new words + Works for any language given its phonemes, e.g. German, French + Independent of spelling variants: “Qaddafi”, “Khaddafi”, “Quadafy”, “Kaddafi”, or “Kadoffee”

Comments/Critiques + User-determined depth of search. A badly uttered phrase or background noise interference causes wrong recognition by LVCSR, whereas phonetic searching returns multiple results, sorted by confidence level. + Architecture for phonetic searching implemented as a Software Developer Kit (SDK) in the Microsoft® Windows NT™ and Windows 2000™ environments + Playback of only the portion of video that is of interest - Title of paper is a little misleading (doesn’t use video cues)

Questions and Future Work • Storage requirements? • Incorporation of speech-based indexing? • Scalability? Mentions infeasibility/ineffectiveness of searching a 300-message archive. • Searching phonetic tracks may lend itself to parallel execution. • Encoding of the phonetic lattice.

Related Work • Large-vocabulary ASR and keyword spotting to investigate topic identification (tests on the Switchboard corpus by workers at BBN) • The Informedia Project at CMU uses a combination of video and audio analysis and text-based information retrieval techniques to align spoken words with the transcript of the video

Related Work • ETH Zurich: speech retrieval using vowel-consonant-vowel subwords • The MIT Spoken Language Systems (SLS) Group investigates a number of possible sub-word units, from sequences of phones to multiple-phone syllable units • [Figure: a conversation between a user and JUPITER, an SLS-based weather forecast system]

The End • Thank you

General Aspects • A phone is a sub-word unit, equivalent to a unit of pronunciation, larger than a letter but smaller than a word. • With phone recognition, there is no single stream, but rather a series of phones with probabilities
