Learn about the importance of speech-based information retrieval, speech recognition technology, spoken document retrieval, and ubiquitous information retrieval using spoken dialog. Explore how this technology is revolutionizing the way we search and access audio-visual content online.
Speech-based Information Retrieval Gary Geunbae Lee POSTECH Oct 15 2007, ICU
contents • Why speech-based IR? • Speech recognition technology • Spoken document retrieval • Ubiquitous IR using spoken dialog
Why speech IR? – SDR (back-end multimedia material) [ch-icassp07] • In the past decade there has been a dramatic increase in the availability of on-line audio-visual material… • More than 50% of IP traffic is video • …and this trend will only continue as the cost of producing audio-visual content continues to drop • Raw audio-visual material is difficult to search and browse • Keyword-driven Spoken Document Retrieval (SDR): • User provides a set of relevant query terms • Search engine needs to return relevant spoken documents and provide an easy way to navigate them [Figure labels: Broadcast News, Podcasts, Academic Lectures]
Why speech IR? - Ubiquitous computing (front-end query) • Ubiquitous computing: network + sensor + computing • Pervasive computing • Third paradigm computing • Calm technology • Invisible computing • Irobot style interface – human language + hologram
Ubiquitous computer interface? • Computer – robot, home appliances, audio, telephone, fax machine, toaster, coffee machine, etc. (every object) • VoiceBox (USA) • Telematics Dialog Interface (POSTECH, LG, DiQuest) • EPG guide (POSTECH) • Dialog Translation (POSTECH)
Example Domain for Ubiquitous IR • Home networking • Car-navigation • Tele-service • Robot interface
What’s hard – ambiguities, ambiguities, all different levels of ambiguities John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. [from J. Eisner] - donut: To get a donut (doughnut; spare tire) for his car? - Donut store: store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut? - From work: Well, actually, he stopped there from hunger and exhaustion, not just from work. - Every few hours: That’s how often he thought it? Or that’s for coffee? - it: the particular coffee that was good every few hours? the donut store? the situation - Too expensive: too expensive for what? what are we supposed to conclude about what John did?
contents • Why speech-based IR? • Speech Recognition Technology • Spoken document retrieval • Ubiquitous IR using spoken dialog
The Noisy Channel Model • Automatic speech recognition (ASR) is a process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993] • The noisy channel model [Lee et al., 1996] • Acoustic input is considered a noisy version of a source sentence [Figure: source sentence "버스 정류장이 어디에 있나요?" ("Where is the bus stop?") → noisy channel → noisy sentence → decoder → guess at original sentence "버스 정류장이 어디에 있나요?"]
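The Noisy Channel Model • What is the most likely sentence out of all sentences in the language L given some acoustic input O? • Treat acoustic input O as a sequence of individual observations • O = o_1, o_2, o_3, …, o_t • Define a sentence as a sequence of words: • W = w_1, w_2, w_3, …, w_n • Bayes rule: $\hat{W} = \arg\max_{W \in L} P(W \mid O) = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$ • Golden rule (P(O) does not depend on W): $\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$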
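The decision rule on this slide, choosing the W that maximizes P(O|W) P(W), can be illustrated with a toy example. This is only a sketch: the two candidate sentences and all probabilities are made-up values, and `decode` is a hypothetical helper name.

```python
import math

# Made-up scores for two candidate sentences given one acoustic input O.
# acoustic[W] stands in for log P(O|W); prior[W] stands in for log P(W).
acoustic = {"recognize speech": math.log(0.30), "wreck a nice beach": math.log(0.35)}
prior = {"recognize speech": math.log(0.010), "wreck a nice beach": math.log(0.001)}

def decode(candidates):
    """Return argmax_W [log P(O|W) + log P(W)]; P(O) is constant and dropped."""
    return max(candidates, key=lambda w: acoustic[w] + prior[w])

best = decode(acoustic.keys())
print(best)
```

The acoustically better candidate loses here because the language model prior is ten times smaller, which is the point of combining both terms.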
Speech Recognition Architecture Meets Noisy Channel [Figure: Speech Signals → Feature Extraction → Decoding → Word Sequence (example: "버스 정류장이 어디에 있나요?", "Where is the bus stop?"). Decoding runs over a network constructed from three models: Acoustic Model (HMM estimation from a speech DB), Pronunciation Model (G2P), and Language Model (LM estimation from text corpora)]
Feature Extraction • Mel-Frequency Cepstral Coefficients (MFCCs) are a popular choice [Paliwal, 1992] • Frame size: 25 ms / Frame shift: 10 ms • 39 features per 10 ms frame • Absolute: log frame energy (1) and MFCCs (12) • Delta: first-order derivatives of the 13 absolute coefficients • Delta-Delta: second-order derivatives of the 13 absolute coefficients • Pipeline: x(n) → pre-emphasis / Hamming window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) → MFCC (12 dimensions)
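The framing arithmetic above can be checked with a short sketch. The 16 kHz sample rate is an assumption (the slide gives none), and `frame_starts` is a hypothetical helper that only computes frame boundaries, not the MFCCs themselves.

```python
def frame_starts(num_samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Start indices of full analysis frames (no padding at the end)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    if num_samples < frame_len:
        return []
    return list(range(0, num_samples - frame_len + 1, shift))

starts = frame_starts(16000)      # one second of audio at the assumed rate
static_dim = 1 + 12               # log frame energy + 12 MFCCs
feature_dim = 3 * static_dim      # static + delta + delta-delta
print(len(starts), feature_dim)
```

Each frame overlaps its neighbour by 15 ms, so one second of audio yields close to 100 feature vectors of 39 dimensions.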
Acoustic Model • Provides P(O|Q) = P(features|phone) • Modeling units [Bahl et al., 1986] • Context-independent: phoneme • Context-dependent: diphone, triphone, quinphone • pL-p+pR: left-right context triphone • Typical acoustic model [Juang et al., 1986] • Continuous-density Hidden Markov Model • Distribution: Gaussian mixture • HMM topology: 3-state left-to-right model for each phone, 1-state for silence or pause [Figure labels: bj(x), codebook]
Pronunciation Model • Provides P(Q|W) = P(phone|word) • Word lexicon [Hazen et al., 2002] • Maps legal phone sequences into words according to phonotactic rules • G2P (grapheme to phoneme): generates a word lexicon automatically • A word may have multiple pronunciations • Example • Tomato • P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1 • P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4 [Figure: pronunciation network for "tomato": [t], then [ow] (0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] [ow]; all remaining arcs have probability 1.0]
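The four "tomato" probabilities follow from multiplying arc probabilities along each path of the pronunciation network. The sketch below assumes the arc layout implied by the numbers on the slide ([ow] 0.2 vs. [ah] 0.8, then [ey] 0.5 vs. [aa] 0.5); the state numbering and the `pronunciations` helper are illustrative only.

```python
# Arcs of the "tomato" network: state -> [(phone, next_state, probability)].
TOMATO = {
    0: [("t", 1, 1.0)],
    1: [("ow", 2, 0.2), ("ah", 2, 0.8)],
    2: [("m", 3, 1.0)],
    3: [("ey", 4, 0.5), ("aa", 4, 0.5)],
    4: [("t", 5, 1.0)],
    5: [("ow", 6, 1.0)],
}
FINAL = 6

def pronunciations(net, state=0, phones=(), prob=1.0):
    """Yield (phone tuple, probability) for every path to the final state."""
    if state == FINAL:
        yield phones, prob
        return
    for phone, nxt, p in net[state]:
        yield from pronunciations(net, nxt, phones + (phone,), prob * p)

table = {"".join(ph): round(p, 3) for ph, p in pronunciations(TOMATO)}
print(table)
```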
Training • Training process [Lee et al., 1996]: Speech DB → Feature Extraction → Baum-Welch Re-estimation → Converged? (no: re-estimate again; yes: End) → HMM • Network for training: a sentence HMM (e.g., ONE TWO ONE THREE) is built from word HMMs (ONE → W AH N), which are built from 3-state phone HMMs (W → states 1, 2, 3)
Language Model • Provides P(W), the probability of the sentence [Beaujard et al., 1999] • We saw this was also used in the decoding process as the probability of transitioning from one word to another. • Word sequence: W = w_1, w_2, w_3, …, w_n • The problem is that we cannot reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language • n-gram Language Model • n-gram language models use the previous n-1 words to represent the history • Bigrams are easily incorporated in a Viterbi search
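A bigram model (n = 2) approximates P(W) by a product of P(w_i | w_{i-1}) terms. The sketch below uses plain maximum-likelihood counts on a made-up three-sentence corpus; real systems add smoothing, which is omitted here, and the function names are illustrative.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram estimates P(w | prev) with <s> and </s> markers."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        histories.update(words[:-1])                  # counts of history words
        bigrams.update(zip(words[:-1], words[1:]))
    return {pair: c / histories[pair[0]] for pair, c in bigrams.items()}

def sentence_prob(model, sentence):
    """P(W) as a product of bigram probabilities; unseen bigrams give 0."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for pair in zip(words[:-1], words[1:]):
        prob *= model.get(pair, 0.0)
    return prob

corpus = ["the train departs", "the bus departs", "the train arrives"]
model = train_bigram(corpus)
print(sentence_prob(model, "the train departs"))
```

The zero probability for any unseen bigram is exactly the estimation problem the slide mentions, and is why smoothed or backed-off estimates are used in practice.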
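Language Model • Example (Korean travel-query tokens: 세시 "three o'clock", 네시 "four o'clock", 서울 "Seoul", 부산 "Busan", 대구 "Daegu", 대전 "Daejeon", 에서 "from", 기차 "train", 버스 "bus", 출발 "departure", 도착 "arrival", 하는 "that does") • Finite State Network (FSN): [Figure: word network over the tokens above] • Context Free Grammar (CFG): $time = 세시|네시; $city = 서울|부산|대구|대전; $trans = 기차|버스; $sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans • Bigram: P(에서|서울)=0.2 P(세시|에서)=0.5 P(출발|세시)=1.0 P(하는|출발)=0.5 P(출발|서울)=0.5 P(도착|대구)=0.9 …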
Network Construction • Expanding every word to state level, we get a search network [Demuynck et al., 1997] • Acoustic Model + Pronunciation Model + Language Model → Search Network [Figure: search network for the Korean digit words 일 "one" (I L), 이 "two" (I), 삼 "three" (S A M), 사 "four" (S A); intra-word transitions connect the phones within a word, word transitions connect start and end nodes, and the LM probability P(word|x) is applied on between-word transitions]
Decoding • Find the most likely word sequence: $\hat{W} = \arg\max_{W} P(O \mid W)\,P(W)$ • Viterbi Search: Dynamic Programming • Token Passing Algorithm [Young et al., 1989] • Initialize all states with a token with a null history and the likelihood that it’s a start state • For each frame a_k • For each token t in state s with probability P(t), history H • For each state r • Add new token to r with probability P(t) P_{s,r} P_r(a_k), and history s.H • Keep only the best-scoring token in each state
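A minimal sketch of token passing as a Viterbi search over a small HMM follows. It works in the log domain, keeps one token (score plus state history) per state, and assumes all probabilities are non-zero. The two-state model and its numbers are made up for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log score, state path) of the best path; one token survives per state."""
    # Initialise one token per state: (log likelihood, history of states).
    tokens = {
        s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), [s])
        for s in states
    }
    for obs in observations[1:]:
        new_tokens = {}
        for r in states:
            # Propagate every token into r and keep only the best one.
            score, history = max(
                (tokens[s][0] + math.log(trans_p[s][r]), tokens[s][1]) for s in states
            )
            new_tokens[r] = (score + math.log(emit_p[r][obs]), history + [r])
        tokens = new_tokens
    return max(tokens.values())

states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}
log_score, path = viterbi(["x", "x", "y"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_score))
```

Keeping only the best token per state at each frame is what makes the search linear in the number of frames rather than exponential in path count.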
Decoding • Pruning [Young et al., 1996] • Entire search space for Viterbi search is much too large • Solution is to prune tokens for paths whose score is too low • Typical methods: • histogram: keep at most n total hypotheses • beam: keep only hypotheses whose score is within a fixed fraction of the best score • N-best Hypotheses and Word Graphs • Keep multiple tokens and return n-best paths/scores • Can produce a packed word graph (lattice) • Multiple Pass Decoding • Perform multiple passes, applying successively more fine-grained language models
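Beam and histogram pruning can be sketched as two filters over the active tokens of one frame. In the log domain, "a fraction of the best score" becomes a fixed log-score margin below the best token. The `prune` helper, the margin of 10.0, and the cap of 3 are illustrative values, not settings from the slides.

```python
def prune(tokens, beam=10.0, max_active=3):
    """tokens: dict state -> log score. Apply beam pruning, then histogram pruning."""
    best = max(tokens.values())
    # Beam: drop tokens more than `beam` below the best log score.
    survivors = {s: v for s, v in tokens.items() if v >= best - beam}
    # Histogram: keep at most `max_active` of the remaining tokens.
    ranked = sorted(survivors.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_active])

active = prune({"s1": -5.0, "s2": -7.5, "s3": -30.0, "s4": -6.0, "s5": -9.0})
print(active)
```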
Large Vocabulary Continuous Speech Recognition (LVCSR) • Decoding continuous speech over a large vocabulary • Computationally complex because of the huge potential search space • Weighted Finite State Transducers (WFST) [Mohri et al., 2002] • Dynamic Decoding • On-demand network construction • Much lower memory requirements [Figure: WFSTs for State : HMM, HMM : Phone, Phone : Word, and Word : Sentence are combined and optimized into a single search network]
Out-of-Vocabulary Word Modeling [ch-icassp07] • How can out-of-vocabulary (OOV) words be handled? • Start with standard lexical network • Separate sub-word network is created to model OOVs • Add sub-word network to word network as new word, W_oov • OOV model used to detect OOV words and provide phonetic transcription (Bazzi & Glass, 2000)
Mixture Language Models[ch-icassp07] • When building a topic-specific language model: • Topic-specific material may be limited and sparse • Best results when combining with robust general model • May desire a model based on a combination of topics • …and with some topics weighted more heavily than others • Topic mixtures is one approach (Iyer & Ostendorf, 1996) • SRI Language Modeling Toolkit provides an open source implementation (http://www.speech.sri.com/projects/srilm) • A basic topic mixture-language model is defined as a weighted combination of N different topics T1 to TN :
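The slide's own equation is not reproduced in this transcript. One common form of such a weighted combination is word-level linear interpolation, P(w | h) = Σ_i λ_i P_{T_i}(w | h) with Σ_i λ_i = 1, which is assumed in the sketch below; the original formulation may combine topics differently (for example at the sentence level). The topic models here are toy dictionaries with made-up probabilities.

```python
def mixture_prob(word, history, topic_models, weights):
    """P(word | history) = sum_i weights[i] * P_i(word | history)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(w * m.get((history, word), 0.0) for w, m in zip(weights, topic_models))

# Toy bigram tables: (history, word) -> probability.
general = {("the", "model"): 0.02, ("the", "lecture"): 0.01}
speech_topic = {("the", "model"): 0.10, ("the", "lecture"): 0.05}

p = mixture_prob("model", "the", [general, speech_topic], [0.7, 0.3])
print(p)
```

Giving the general model the larger weight keeps estimates robust when topic-specific data is sparse, which matches the motivation listed on the slide.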
Automatic Alignment of Human Transcripts[ch-icassp07] • Goal: Align transcript w/o time markers to long audio file • Run recognizer over utterances to obtain word hypotheses • Use language model strongly adapted to reference transcript • Align reference transcript against word hypotheses • Identify matched words ( ) and mismatched words (X) • Treat multi-word matched sequences as anchor regions • Extract new segments starting and ending within anchors • Force align reference words within each new segment si
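The anchor-finding step (align the reference transcript against recognizer hypotheses and keep multi-word matched runs) can be sketched with Python's standard `difflib`. This stands in for whatever aligner the original system used; the example word sequences and the `anchor_regions` helper are made up.

```python
from difflib import SequenceMatcher

def anchor_regions(reference, hypothesis, min_words=2):
    """Return (ref_start, hyp_start, length) for matched runs of >= min_words words."""
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]

ref = "the fourier transform of the signal is computed first".split()
hyp = "the for your transform of the signal was computed first".split()
anchors = anchor_regions(ref, hyp)
print(anchors)
```

Segments for forced alignment would then be cut inside these anchor runs, where reference and recognizer agree.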
contents • Why speech-based IR? • Speech Recognition Technology • Spoken document retrieval • Ubiquitous IR using spoken dialog
Spoken Document Processing[ch-icassp07] • The goal is to enable users to: • Search for spoken documents as easily as they search for text • Accurately retrieve relevant spoken documents • Efficiently browse through returned hits • Quickly find segments of spoken documents they would most like to listen to or watch • Information (or meta-data) to enable search and retrieval: • Transcription of speech • Text summary of audio-visual material • Other relevant information: • speakers, time-aligned outline, etc. • slides, other relevant text meta-data: title, author, etc. • links pointing to spoken document from the www • collaborative filtering (who else watched it?)
Transcription of Spoken Documents [ch-icassp07] • Manual transcription of audio material is expensive • A basic text transcription of a one-hour lecture costs >$100 • Human-generated transcripts can contain many errors • MIT study on commercial transcripts of academic lectures • Transcripts show a 10% difference against true transcripts • Many differences are actually corrections of speaker errors • However, ~2.5% word substitution rate is observed (correct word → transcribed word): • Misspelled words: Furui → Frewey, Makhoul → McCool, Tukey → Tuki, Eigen → igan, Gaussian → galsian, cepstrum → capstrum • Substitution errors: Fourier → for your, Kullback → callback, a priori → old prairie, resonant → resident, affricates → aggregates, palatal → powerful
Rich Annotation of Spoken Documents[ch-icassp07] • Humans take 10 to 50 times real time to perform rich transcription of audio data including: • Full transcripts with proper punctuation and capitalization • Speaker identities, speaker changes, speaker overlaps • Spontaneous speech effects (false starts, partial words, etc.) • Non-speech events and background noise conditions • Topic segmentation and content summarization • Goal: Automatically generate rich annotations of audio • Transcription (What words were spoken?) • Speaker diarization (Who spoke and when?) • Segmentation (When did topic changes occur?) • Summarization (What are the primary topics?) • Indexing (Where were specific words spoken?) • Searching (How can the data be searched efficiently?)
Text Retrieval[ch-icassp07] • Collection of documents: • “large” N: 10k-1M documents or more (videos, lectures) • “small” N: < 1-10k documents (voice-mails, VoIP chats) • Query: • ordered set of words in a large vocabulary • restrict ourselves to keyword search; other query types are clearly possible: • Speech/audio queries (match waveforms) • Collaborative filtering (people who watched X also watched…) • Ontology (hierarchical clustering of documents, supervised or unsupervised)
Text Retrieval: Vector Space Model[ch-icassp07] • Build a term-document co-occurrence (LARGE) matrix (Baeza-Yates, 99) • rows indexed by word • columns indexed by documents • TF (term frequency): frequency of word in document • could be normalized to maximum frequency in a given document • IDF (inverse document frequency): if a word appears in all documents equally likely, it isn’t very useful for ranking • (Bellegarda, 2000) uses normalized entropy
Text Retrieval: Vector Space Model (2) [ch-icassp07] For retrieval/ranking one ranks the documents in decreasing order of relevance score: query weights have minimal impact since queries are very short, so one often uses a simplified relevance score:
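The slide's relevance formulas are not reproduced in this transcript. The sketch below shows one simple TF-IDF variant under stated assumptions: raw term counts for TF, IDF = log(N / document frequency), and a simplified score that sums TF × IDF over query terms while ignoring query weights. Documents and names are made up.

```python
import math
from collections import Counter

def build_index(docs):
    """Per-document term counts and IDF = log(N / document frequency)."""
    tfs = {name: Counter(text.lower().split()) for name, text in docs.items()}
    df = Counter(term for tf in tfs.values() for term in tf)
    idf = {term: math.log(len(docs) / n) for term, n in df.items()}
    return tfs, idf

def rank(query, tfs, idf):
    """Rank documents by the sum over query terms of TF * IDF."""
    scores = {
        name: sum(tf[t] * idf.get(t, 0.0) for t in query.lower().split())
        for name, tf in tfs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "speech recognition converts speech to text",
    "d2": "text retrieval ranks text documents",
    "d3": "spoken document retrieval indexes speech for retrieval",
}
tfs, idf = build_index(docs)
ranking = rank("speech retrieval", tfs, idf)
print(ranking)
```

A term that occurs in every document gets IDF log(1) = 0 and so contributes nothing to ranking, which is the behaviour the IDF bullet describes.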
Text Retrieval: TF-IDF Shortcomings [ch-icassp07] • Hit-or-Miss: • returns only documents containing the query words • query for Coca Cola will not return a document that reads: • “… its Coke brand is the most treasured asset of the soft drinks maker …” • Cannot do phrase search: “Coca Cola” • needs post-processing to filter out documents not matching the phrase • Ignores word order and proximity • query for Object Oriented Programming: • “… the object oriented paradigm makes programming a joy …” • “… TV network programming transforms the viewer in an object and it is oriented towards …”
Vector Space Model: Query/Document Expansion [ch-icassp07] • Correct the Hit-or-Miss problem by doing some form of expansion on the query and/or document side • add similar terms to the ones in the query/document to increase the number of terms matched on both sides • corpus-driven methods: TREC-7 (Singhal et al., 99) and TREC-8 (Singhal et al., 00) • Query-side expansion works well for long queries (10 words) • short queries are very ambiguous and expansion may not work well • Expansion works well for boosting Recall: • very important when working on small to medium-sized corpora • typically comes at a loss in Precision
Vector Space Model: Latent Semantic Indexing [ch-icassp07] • Correct the Hit-or-Miss problem by doing some form of dimensionality reduction on the TF-IDF matrix • Singular Value Decomposition (SVD) (Furnas et al., 1988) • Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) • Non-negative Matrix Factorization (NMF) • Matching of query vector and document vector is performed in the lower-dimensional space • Good as long as the magic works • Drawbacks: • still ignores WORD ORDER • users are no longer in full control over the search engine • Humans are very good at crafting queries that’ll get them the documents they want, and expansion methods impair full use of their natural language faculty
Probabilistic Models (Robertson, 1976) [ch-icassp07] • Assume one has a probability model for generating queries and documents • We would like to rank documents according to the point-wise mutual information between query and document • One can model P(Q|D) using a language model built from each document (Ponte, 1998) • Takes word order into account • models query N-grams but not more general proximity features • expensive to store
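Ranking by a per-document language model can be sketched as query likelihood: score each document D by log P(Q|D). For brevity the sketch uses a unigram model (so it ignores word order, unlike the N-gram models the slide refers to) and linear interpolation with a collection model as the smoothing scheme. Both choices are assumptions, not the cited paper's exact estimator.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(Q|D) with a unigram document model interpolated with the collection model."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for term in query:
        p = lam * doc_tf[term] / len(doc) + (1 - lam) * coll_tf[term] / len(collection)
        if p == 0.0:                      # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

d1 = "coke is a soft drink brand".split()
d2 = "object oriented programming is a joy".split()
collection = d1 + d2
query = ["soft", "drink"]
s1 = query_log_likelihood(query, d1, collection)
s2 = query_log_likelihood(query, d2, collection)
print(s1, s2)
```

The collection term keeps d2 from scoring minus infinity even though it contains neither query word.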
Text Retrieval: Scaling Up [ch-icassp07] • Linear scan of document collection is not an option for compiling the ranked list of relevant documents • Compiling a short list of relevant documents may allow for relevance score calculation on the document side • Inverted index is critical for scaling up to large collections of documents • think index at end of a book as opposed to leafing through it! • All methods are amenable to some form of indexing: • TF-IDF/SVD: compact index, drawbacks mentioned • LM-IR: storing all N-grams in each document is very expensive • significantly more storage than the original document collection • Early Google: compact index that maintains word order information and hit context • relevance calculation, phrase-based matching using only the index
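An inverted index that stores word positions supports phrase matching from the index alone, without rescanning documents. The sketch below is a minimal in-memory version with made-up documents; it is not a description of any particular search engine's index format.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} so word order survives in the index."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Doc ids where the words of the phrase occur at consecutive positions."""
    terms = phrase.lower().split()
    postings = [index.get(t, {}) for t in terms]
    candidates = set(postings[0]).intersection(*postings[1:])
    hits = []
    for doc_id in sorted(candidates):
        if any(
            all(p + i in postings[i][doc_id] for i in range(1, len(terms)))
            for p in postings[0][doc_id]
        ):
            hits.append(doc_id)
    return hits

docs = {
    "a": "the object oriented paradigm makes programming a joy",
    "b": "tv network programming transforms the viewer in an object",
    "c": "object oriented programming is popular",
}
index = build_inverted_index(docs)
print(phrase_search(index, "object oriented programming"))
```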
TREC SDR: “A Success Story” [ch-icassp07] • The Text Retrieval Conference (TREC) • pioneering work in spoken document retrieval (SDR) • SDR evaluations from 1997-2000 (TREC-6 to TREC-9) • TREC-8 evaluation: • focused on broadcast news data • 22,000 stories from 500 hours of audio • even fairly high ASR error rates produced document retrieval performance close to human-generated transcripts • key contributions: • Recognizer expansion using N-best lists • query expansion, and document expansion • conclusion: SDR is “A success story” (Garofolo et al., 2000) • Why don’t ASR errors hurt performance? • content words are often repeated, providing redundancy • semantically related words can offer support (Allan, 2003)
Broadcast News: SDR Best-case Scenario[ch-icassp07] • Broadcast news SDR is a best-case scenario for ASR: • primarily prepared speech read by professional speakers • spontaneous speech artifacts are largely absent • language usage is similar to written materials • new vocabulary can be learned from daily text news articles • state-of-the-art recognizers have word error rates <10% • comparable to the closed captioning WER (used as reference) • TREC queries were fairly long (10 words) and have low out-of-vocabulary (OOV) rate • impact of query OOV rate on retrieval performance is high (Woodland et al., 2000) • Vast amount of content is closed captioned
Beyond Broadcast News [ch-icassp07] • Many useful tasks are more difficult than broadcast news • Meeting annotation (e.g., Waibel et al., 2001) • Voice mail (e.g., SCANMail, Bacchiani et al., 2001) • Podcasts (e.g., Podzinger, www.podzinger.com) • Academic lectures • Primary difficulties due to limitations of ASR technology: • highly spontaneous, unprepared speech • topic-specific or person-specific vocabulary & language usage • unknown content and topics potentially lacking support in general language model • wide variety of accents and speaking styles • OOVs in queries: ASR vocabulary is not designed to recognize infrequent query terms, which are most useful for retrieval • General SDR still has many challenges to solve
Spoken Term Detection Task[ch-icassp07] • A new Spoken Term Detection evaluation initiative from NIST • Find all occurrences of a search term as fast as possible in heterogeneous audio sources • Objective of the evaluation • Understand speed/accuracy tradeoffs • Understand technology solution tradeoffs: e.g., word vs. phone recognition • Understand technology issues for the three STD languages: Arabic, English, and Mandarin
Text Retrieval: Evaluation[ch-icassp07] • trec_eval (NIST) package requires reference annotations for documents with binary relevance judgments for each query • Standard Precision/Recall and Precision@N documents • Mean Average Precision (MAP) • R-precision (R=number of relevant documents for the query) • Ranking on reference side is flat (ignored)
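The metrics listed above can be computed from a ranked list and a set of binary relevance judgments. The sketch below is an independent illustration with made-up document ids; it is not the trec_eval implementation and does not read its file formats.

```python
def precision_at(ranked, relevant, n):
    """Fraction of the top-n ranked documents that are relevant."""
    return sum(doc in relevant for doc in ranked[:n]) / n

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant document found."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision at R, where R is the number of relevant documents."""
    return precision_at(ranked, relevant, len(relevant)) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["d3", "d1", "d7", "d2", "d5"]
relevant = {"d3", "d2", "d9"}
print(precision_at(ranked, relevant, 2), average_precision(ranked, relevant))
```

Relevant documents that are never retrieved (d9 here) still count in the denominator of average precision, so missing them lowers the score.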
contents • Why speech-based IR? • Speech Recognition Technology • Spoken document retrieval • Ubiquitous IR using spoken dialog
Dialog System • A system to provide interface between the user and a computer-based application [Cole, 1997; McTear, 2004] • Interaction on turn-by-turn basis • Dialog manager • Control the flow of the dialog • Main flow • information gathering from user • communicating with external application • communicating information back to the user • Three types of dialog system • frame-based • agent-based • finite state- (or graph-) based (~ VoiceXML-based)
DARPA Communicator - Revisited • From the DARPA Communicator framework to the POSTECH Ubiquitous Natural Language Dialog System [Lee et al. 2006] • Architecture based on the Communicator hub-client structure • Adding back-end modules (contents DB assistance, dialog model building)
Spoken Language Understanding • Spoken language understanding maps natural language speech to a frame structure encoding its meaning [Wang et al., 2005] • What’s the difference between NLU and SLU? • Robustness: noise and ungrammatical spoken language • Domain-dependent: further deep-level semantics (e.g. Person vs. Cast) • Dialog: dialog-history dependent and utterance-by-utterance analysis • Traditional approach: natural language to SQL conversion [Figure: a typical ATIS system (from [Wang et al., 2005]): Speech → ASR → Text → SLU → Semantic Frame → SQL Generate → SQL → Database → Response]
Semantic Representation • Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002] • An intermediate semantic representation to serve as the interface between user and dialog system • Each frame contains several typed components called slots. The type of a slot specifies what kind of fillers it is expecting. • Example utterance: “Show me flights from Seattle to Boston” • XML format: <frame name='ShowFlight' type='void'> <slot type='Subject'>FLIGHT</slot> <slot type='Flight'> <slot type='DCity'>SEA</slot> <slot type='ACity'>BOS</slot> </slot> </frame> • Hierarchical representation: ShowFlight → Subject (FLIGHT), Flight → Departure_City (SEA), Arrival_City (BOS) • Semantic representation on the ATIS task, XML format and hierarchical representation [Wang et al., 2005]
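A semantic frame can be held as a small typed structure and turned into a database query, in the spirit of the frame-to-SQL step of an ATIS-style system. Everything below is a hypothetical sketch: the `Frame` class, the `flights` table, and its column names are assumptions for illustration and do not come from the slides or from [Wang et al., 2005].

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    name: str
    slots: dict = field(default_factory=dict)

def to_sql(frame):
    """Build a parameterized query from a ShowFlight frame (hypothetical schema)."""
    columns = {"DCity": "departure_city", "ACity": "arrival_city"}  # assumed columns
    filled = [(columns[k], v) for k, v in frame.slots.items() if k in columns]
    where = " AND ".join(f"{col} = ?" for col, _ in filled)
    sql = "SELECT * FROM flights" + (f" WHERE {where}" if where else "")
    return sql, [v for _, v in filled]

frame = Frame("ShowFlight", {"Subject": "FLIGHT", "DCity": "SEA", "ACity": "BOS"})
sql, params = to_sql(frame)
print(sql, params)
```

Using placeholders rather than string concatenation keeps recognized slot values from being interpreted as SQL.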
Semantic Frame Extraction • Semantic Frame Extraction (~ Information Extraction Approach) • 1) Dialog act / Main action Identification ~ Classification • 2) Frame-Slot Object Extraction ~ Named Entity Recognition • 3) Object-Attribute Attachment ~ Relation Extraction • 1) + 2) + 3) ~ Unification [Figure: overall architecture for semantic analyzer: Info. Source → Feature Extraction / Selection → Dialog Act Identification + Frame-Slot Extraction + Relation Extraction → Unification] [Figure: examples of semantic frame structure]
The Role of Dialog Management • For example, in the flight reservation system • System : Welcome to the Flight Information Service. Where would you like to travel to? • Caller : I would like to fly to London on Friday arriving around 9 in the morning. • System : There is a flight that departs at 7:45 a.m. and arrives at 8:50 a.m. ?????????? • In order to process this utterance, the system has to engage in the following processes: • 1) Recognize the words that the caller said. (Speech Recognition) • 2) Assign a meaning to these words. (Language Understanding) • 3) Determine how the utterance fits into the dialog so far and decide what to do next. (Dialog Management)
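Step 3 (deciding what to do next) can be sketched as a minimal frame-based policy: keep a frame of slots filled so far, ask for the first missing slot, and query the back end once everything is known. The slot names, prompts, and action labels below are hypothetical and chosen to mirror the flight example.

```python
REQUIRED = ["destination", "date", "arrival_time"]   # hypothetical slot names
PROMPTS = {
    "destination": "Where would you like to travel to?",
    "date": "On which day would you like to travel?",
    "arrival_time": "Around what time would you like to arrive?",
}

def next_action(frame):
    """Frame-based policy: ask for the first missing slot, else query the back end."""
    for slot in REQUIRED:
        if slot not in frame:
            return ("ask", PROMPTS[slot])
    return ("query", dict(frame))

history = {}                                   # dialog state carried across turns
first = next_action(history)                   # system opens the dialog
# Assumed SLU output for "I would like to fly to London on Friday arriving around 9"
history.update({"destination": "London", "date": "Friday", "arrival_time": "09:00"})
second = next_action(history)
print(first, second)
```

Because the caller supplied all three slots in one utterance, the manager skips the remaining prompts and moves straight to the database query.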
Overall Architecture [on-going research] [Figure: components shown: Speech Recognizer, Linguistic Analysis, Keyword Feature Extractor, Generic SLU, Agent / Domain Spotter, Dialog Management, System Utterance, Text-To-Speech; Task Agent with Domain-Specific SLU, Discourse History, Dialog Example Database, Domain-Specific Dialog Expert, and Domain Knowledge Database; Chat Agent with Domain-Specific SLU, Discourse History, Chat Dialog Example Database, and Domain-Specific Chat Expert]
References - Recognition • L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, ICASSP, pp.49–52. • C. Beaujard and M. Jardino, 1999. Language Modeling based on Automatic Word Concatenations, Proceedings of the 8th European Conference on Speech Communication and Technology, vol.4, pp.1563–1566. • K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp.143–146. • T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp.99–104. • M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol.16, no.1, pp.69–88.