Automatic Spoken Document Processing for Retrieval and Browsing. Tutorial Overview. Introduction (25 minutes) Speech Recognition for Spoken Documents (55 minutes) Intermission (20 minutes) Spoken Document Retrieval & Browsing (75 minutes) Summary and Questions (5 minutes). Motivation.
Automatic Spoken Document Processing for Retrieval and Browsing
Tutorial Overview • Introduction (25 minutes) • Speech Recognition for Spoken Documents (55 minutes) • Intermission (20 minutes) • Spoken Document Retrieval & Browsing (75 minutes) • Summary and Questions (5 minutes)
Motivation • In the past decade there has been a dramatic increase in the availability of on-line audio-visual material… • More than 50% of IP traffic is video • …and this trend will only continue as the cost of producing audio-visual content continues to drop • Raw audio-visual material is difficult to search and browse • Keyword driven Spoken Document Retrieval (SDR): • User provides a set of relevant query terms • Search engine needs to return relevant spoken documents and provide an easy way to navigate them • Examples: Broadcast News, Podcasts, Academic Lectures
Spoken Document Processing • The goal is to enable users to: • Search for spoken documents as easily as they search for text • Accurately retrieve relevant spoken documents • Efficiently browse through returned hits • Quickly find segments of spoken documents they would most like to listen to or watch • Information (or meta-data) to enable search and retrieval: • Transcription of speech • Text summary of audio-visual material • Other relevant information: • speakers, time-aligned outline, etc. • slides, other relevant text meta-data: title, author, etc. • links pointing to spoken document from the www • collaborative filtering (who else watched it?)
Transcription of Spoken Documents • Manual transcription of audio material is expensive • A basic text-transcription of a one hour lecture costs >$100 • Human generated transcripts can contain many errors • MIT study on commercial transcripts of academic lectures • Transcripts show a 10% difference against true transcripts • Many differences are actually corrections of speaker errors • However, ~2.5% word substitution rate is observed • Misspelled words (correct → transcribed): Furui → Frewey, Makhoul → McCool, Tukey → Tuki, Eigen → igan, Gaussian → galsian, cepstrum → capstrum • Substitution errors (correct → transcribed): Fourier → for your, Kullback → callback, a priori → old prairie, resonant → resident, affricates → aggregates, palatal → powerful
Rich Annotation of Spoken Documents • Humans take 10 to 50 times real time to perform rich transcription of audio data including: • Full transcripts with proper punctuation and capitalization • Speaker identities, speaker changes, speaker overlaps • Spontaneous speech effects (false starts, partial words, etc.) • Non-speech events and background noise conditions • Topic segmentation and content summarization • Goal: Automatically generate rich annotations of audio • Transcription (What words were spoken?) • Speaker diarization (Who spoke and when?) • Segmentation (When did topic changes occur?) • Summarization (What are the primary topics?) • Indexing (Where were specific words spoken?) • Searching (How can the data be searched efficiently?)
When Does Automatic Annotation Make Sense? • Scale: Some repositories are too large to manually annotate • Collections of lectures collected over many years (Microsoft) • WWW video stores (Apple, Google, MSN, Yahoo, YouTube) • TV: all “new” English language programming is required by the FCC to be closed captioned http://www.fcc.gov/cgb/consumerfacts/closedcaption.html • Cost: Some users have monetary restrictions • Amateur podcasters • Academic or non-profit organizations • Privacy: Some data needs to remain secure • corporate customer service telephone conversations • business and personal voice-mails • VoIP chats
The Research Challenge • I've been talking -- I've been multiplying matrices already, but certainly time for me to discuss the rules for matrix multiplication. • And the interesting part is the many ways you can do it, and they all give the same answer. • So it's -- and they're all important. • So matrix multiplication, and then, uh, come inverses. • So we're -- uh, we -- mentioned the inverse of a matrix, but there's -- that's a big deal. • Lots to do about inverses and how to find them. • Okay, so I'll begin with how to multiply two matrices. • First way, okay, so suppose I have a matrix A multiplying a matrix B and -- giving me a result -- well, I could call it C. • A times B. Okay. • Uh, so, l- let me just review the rule for w- for this entry. 8 Rules of Matrix Multiplication: The method for multiplying two matrices A and B to get C = AB can be summarized as follows: 1) Rule 8.1 To obtain the element in the rth row and cth column of C, multiply each element in the rth row of A by the corresponding… “I want to learn how to multiply matrices”
The Research Challenge • Lectures are very conversational (Glass et al, 2004) • More similar to human conversations than broadcast news • Fewer filled pauses than Switchboard (1% vs. 3%) • Similar amounts of partial words (1%) and contractions (4%) • I've been talking -- I've been multiplying matrices already, but certainly time for me to discuss the rules for matrix multiplication. • And the interesting part is the many ways you can do it, and they all give the same answer. • So it's -- and they're all important. • So matrix multiplication, and then, uh, come inverses. • So we're -- uh, we -- mentioned the inverse of a matrix, but there's -- that's a big deal. • Lots to do about inverses and how to find them. • Okay, so I'll begin with how to multiply two matrices. • First way, okay, so suppose I have a matrix A multiplying a matrix B and -- giving me a result -- well, I could call it C. • A times B. Okay. • Uh, so, l- let me just review the rule for w- for this entry.
Speech Recognition for Spoken Documents • Vocabulary Selection • Overview of Basic Speech Recognition Framework • Language Modeling & Adaptation • Acoustic Modeling & Adaptation • Experiments with Academic Lectures • Forced Alignment of Human Generated Transcripts
Defining a Vocabulary • Words not in a system’s vocabulary cannot be recognized • State-of-the-art recognizers attack the out-of-vocabulary (OOV) problem using (very) large vocabularies • LVCSR: Large vocabulary continuous speech recognition • Typical systems use lexicons of 30K to 60K words • Diminishing returns from larger vocabularies • Example from BBN’s 2003 EARS system (Matsoukas et al., 2003):
Analysis: Vocabulary Size of Academic Lectures • Average of 7,000 words/lecture in a set of 80 ~1hr lectures • Average of 800 unique words/lecture (~1/3 News Broadcasts)
Analysis: Vocabulary Usage in Academic Lectures • Rank of specific words from academic subjects in the Broadcast News (BN) and Switchboard (SB) corpora • Most frequent words not present in all three subjects • Difficult to cover content words w/o topic-specific material
Vocabulary Coverage Example • Out-of-vocabulary rate on computer science (CS) lectures using other sources of material to predict the lexicon • Best matching data is from subject-specific material • General lectures are a better fit than news or conversations
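The OOV rate discussed on this slide can be measured with a few lines of code. The following is a minimal sketch, assuming the lexicon is simply the k most frequent words of some source text; the function names `top_k_lexicon` and `oov_rate` are illustrative and not from the tutorial.

```python
from collections import Counter


def top_k_lexicon(training_tokens, k):
    """Pick the k most frequent training words as the recognizer lexicon."""
    counts = Counter(training_tokens)
    return [word for word, _ in counts.most_common(k)]


def oov_rate(lexicon_words, test_tokens):
    """Fraction of running test tokens that are not covered by the lexicon."""
    if not test_tokens:
        return 0.0
    lexicon = set(lexicon_words)
    misses = sum(1 for token in test_tokens if token not in lexicon)
    return misses / len(test_tokens)
```

Running `oov_rate` with lexicons drawn from different sources (news, conversations, general lectures, subject-specific text) against the same test lectures gives the kind of comparison the slide describes.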
Speech Recognition: Probabilistic Framework • Speech recognition is typically performed using a probabilistic modeling approach • Goal is to find the most likely string of words, W, given the acoustic observations, A: • The expression is rewritten using Bayes’ Rule:
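In standard notation, the two expressions named on this slide can be written as follows. This is the usual textbook form, given as a sketch; the symbol W* for the recognized word string is an assumption.

```latex
W^{*} = \arg\max_{W} P(W \mid A)
      = \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
      = \arg\max_{W} P(A \mid W)\, P(W)
```

The denominator P(A) can be dropped because it does not depend on W. P(A | W) is the acoustic model score and P(W) is the language model score.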
Speech Recognition: Probabilistic Framework • Words are represented as sequences of phonetic units • Using phonetic units, U, the expression expands to: • Search must efficiently find most likely U and W • Pronunciation and language models are typically encoded using weighted finite state networks • Weighted finite state transducers (FSTs) are also common
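One common way to write the expansion over phonetic units is sketched below. It assumes the acoustics A depend on the words W only through the phonetic sequence U, and it replaces the sum over U with a maximum, which matches the slide's statement that the search looks for the most likely U and W jointly. The exact form used in the tutorial may differ.

```latex
W^{*} = \arg\max_{W} \sum_{U} P(A \mid U)\, P(U \mid W)\, P(W)

(W^{*}, U^{*}) \approx \arg\max_{W,\,U} P(A \mid U)\, P(U \mid W)\, P(W)
```

Here P(A | U) is the acoustic model, P(U | W) the pronunciation model, and P(W) the language model.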
Finite State Transducer Example: Lexicon • Finite state transducers (FSTs) map input strings to new output strings • Lexicon maps /phonemes/ to ‘words’ • FSTs allow words to share parts of pronunciations • Sharing at beginning beneficial to recognition speed because search can prune many words at once
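The prefix sharing described above can be illustrated with a pronunciation prefix tree, which is a simplified stand-in for a lexicon FST (it emits words only at the end of a pronunciation and carries no weights). The phoneme symbols and the toy pronunciations are illustrative assumptions.

```python
# Hypothetical toy pronunciations; symbols are illustrative only.
TOY_PRONUNCIATIONS = {
    "boss": ["b", "ao", "s"],
    "boston": ["b", "ao", "s", "t", "ax", "n"],
    "bought": ["b", "ao", "t"],
}


def build_lexicon(pronunciations):
    """Build a prefix tree in which words share initial phoneme arcs."""
    root = {"next": {}, "words": []}
    for word, phones in pronunciations.items():
        node = root
        for phone in phones:
            node = node["next"].setdefault(phone, {"next": {}, "words": []})
        node["words"].append(word)
    return root


def lookup(root, phones):
    """Map a complete phoneme sequence to the words it spells (may be empty)."""
    node = root
    for phone in phones:
        node = node["next"].get(phone)
        if node is None:
            return []
    return node["words"]


def count_states(node):
    """Count tree nodes, to show how much sharing reduces the network."""
    return 1 + sum(count_states(child) for child in node["next"].values())
```

With the toy data, the three words share the initial /b ao/ arcs, so a search that rules out /b/ at the start discards all three words at once.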
FST Composition • Composition (o) combines two FSTs to produce a single FST that performs both mappings in single step
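A minimal sketch of composition for unweighted, epsilon-free transducers follows. The dictionary representation (`start`, `finals`, `arcs`) is an assumption made for illustration; production toolkits also handle weights, epsilon labels, and removal of unreachable states, none of which is shown here.

```python
def compose(t1, t2):
    """Compose two epsilon-free transducers.

    Each transducer is a dict with 'start', 'finals' (a set) and 'arcs',
    a list of (src, in_label, out_label, dst) tuples. An arc of the result
    pairs a t1 arc with a t2 arc whose input equals the t1 output.
    """
    arcs = []
    for (p, a, b, p_next) in t1["arcs"]:
        for (q, b2, c, q_next) in t2["arcs"]:
            if b == b2:
                arcs.append(((p, q), a, c, (p_next, q_next)))
    return {
        "start": (t1["start"], t2["start"]),
        "finals": {(f1, f2) for f1 in t1["finals"] for f2 in t2["finals"]},
        "arcs": arcs,
    }


def transduce(t, symbols):
    """Return every output sequence the transducer produces for the input."""
    results = []
    stack = [(t["start"], 0, [])]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols):
            if state in t["finals"]:
                results.append(out)
            continue
        for (src, a, b, dst) in t["arcs"]:
            if src == state and a == symbols[pos]:
                stack.append((dst, pos + 1, out + [b]))
    return results
```

Applying the composed transducer to an input gives the same result as applying the first transducer and feeding its output to the second, but in a single pass.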
Out-of-Vocabulary Word Modeling • How can out-of-vocabulary (OOV) words be handled? • Start with standard lexical network • Separate sub-word network is created to model OOVs • Add sub-word network to word network as new word, Woov • OOV model used to detect OOV words and provide phonetic transcription (Bazzi & Glass, 2000)
N-gram Language Modeling • An n-gram model is a statistical language model • Predicts current word based on previous n-1 words • Trigram model expression: P( wn | wn-2 , wn-1 ) • Examples P( boston | residing in ) P( seventeenth | tuesday march ) • An n-gram model allows any sequence of words… • …but prefers sequences common in training data.
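As a small illustration of the trigram expression, the sketch below estimates P(wn | wn-2, wn-1) by maximum likelihood from raw counts, with no smoothing. The function names are illustrative.

```python
from collections import Counter


def train_trigram(tokens):
    """Count trigrams and the two-word histories that precede them."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    histories = Counter()
    for (w2, w1, _), count in trigrams.items():
        histories[(w2, w1)] += count
    return trigrams, histories


def trigram_prob(trigrams, histories, w2, w1, w):
    """Maximum likelihood estimate of P(w | w2, w1); 0.0 for unseen histories."""
    seen = histories[(w2, w1)]
    return trigrams[(w2, w1, w)] / seen if seen else 0.0
```

Any word sequence is allowed by the model form, but an unsmoothed estimate like this one assigns zero probability to trigrams that never occurred in training, which is the problem smoothing addresses.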
N-gram Model Smoothing • For a bigram model, what if… • To avoid sparse training data problems, we can use an interpolated bigram: • One method for determining interpolation weight:
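A minimal sketch of an interpolated bigram follows. It mixes the maximum likelihood bigram with the unigram, and it assumes a count-based weight of the form λ = c(prev) / (c(prev) + K), which is one common choice; the slide's own weight formula did not survive, so this form and the constant `k` are assumptions.

```python
from collections import Counter


class InterpolatedBigram:
    """Bigram model interpolated with a unigram model."""

    def __init__(self, tokens, k=1.0):
        self.unigrams = Counter(tokens)
        # Histories exclude the final token, which has no successor.
        self.history = Counter(tokens[:-1])
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.total = len(tokens)
        self.k = k  # assumed tuning constant for the interpolation weight

    def prob(self, prev, word):
        p_uni = self.unigrams[word] / self.total
        c_prev = self.history[prev]
        if c_prev == 0:
            return p_uni  # unseen history: back off fully to the unigram
        lam = c_prev / (c_prev + self.k)  # trust the bigram more as counts grow
        p_bi = self.bigrams[(prev, word)] / c_prev
        return lam * p_bi + (1.0 - lam) * p_uni
```

With this weighting, a bigram that never occurred in training still receives a non-zero probability whenever the word itself was seen.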
Analysis: Language Usage in Academic Lectures • Language model perplexities on computer science lectures • Perplexity measures ability of a model to predict language usage • Small perplexity indicates good prediction of language usage • Written material is a poor predictor of spoken lectures • Style differences of written and spoken language must be handled
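Perplexity is the exponential of the average negative log probability that the model assigns to each word of the test text. A minimal sketch, assuming the per-word probabilities have already been computed by some language model:

```python
import math


def perplexity(word_probs):
    """Perplexity from the model probability of each word in a test text."""
    if not word_probs:
        raise ValueError("need at least one word probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))
```

A model that spreads its probability uniformly over V words has perplexity V, so perplexity can be read as the effective number of equally likely word choices at each position.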
Mixture Language Models • When building a topic-specific language model: • Topic-specific material may be limited and sparse • Best results when combining with robust general model • May desire a model based on a combination of topics • …and with some topics weighted more heavily than others • Topic mixtures is one approach (Iyer & Ostendorf, 1996) • SRI Language Modeling Toolkit provides an open source implementation (http://www.speech.sri.com/projects/srilm) • A basic topic mixture-language model is defined as a weighted combination of N different topics T1 to TN :
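One common form of such a weighted combination is linear interpolation at the word level, sketched below. The symbols λ_i for the topic weights and h for the word history are assumptions, and the tutorial's own definition may differ (for example, mixing at the sentence level).

```latex
P(w \mid h) = \sum_{i=1}^{N} \lambda_i \, P(w \mid h, T_i),
\qquad \lambda_i \ge 0, \quad \sum_{i=1}^{N} \lambda_i = 1
```

Raising λ_i weights topic T_i more heavily; including a robust general model as one of the components guards against sparse topic-specific data.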
Acoustic Feature Extraction for Recognition • Frame-based spectral feature vectors (typically every 10 milliseconds) • Efficiently represented with Mel-frequency cepstral coefficients (MFCCs) • Typically ~13 MFCCs used per frame
Acoustic Feature Scoring for Recognition • Feature vector scoring: • Each phonetic unit modeled w/ a mixture of Gaussians:
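Scoring a feature vector against a mixture of Gaussians can be sketched as below. The code assumes diagonal covariance matrices, a common simplification that the slide does not confirm, and works in the log domain for numerical stability.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_k w_k * N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

In a recognizer, each phonetic unit has its own set of weights, means, and variances, and every frame's feature vector (for example, 13 MFCCs) is scored against the units under consideration.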
Bayesian Adaptation • A method for direct adaptation of model parameters • Most useful with large amounts of adaptation data • A.k.a. maximum a posteriori probability (MAP) adaptation • General expression for MAP adaptation of mean vector of a single Gaussian density function: • Apply Bayes rule:
Bayesian Adaptation (cont) • Assume observations are independent: • Likelihood functions modeled with Gaussians: • Maximum likelihood estimate from Χ:
Bayesian Adaptation (cont) • The MAP estimate for a mean vector is found to be: • The MAP estimate is an interpolation of the ML estimate of the mean and the a priori mean: • MAP adaptation can be expanded to handle all mixture Gaussian parameters (Gauvain and Lee, 1994) • MAP adaptation learns slowly and is sensitive to errors
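The interpolation can be sketched for a single Gaussian mean as follows. With a Gaussian prior of variance σ_ap² on the mean and a Gaussian likelihood of variance σ², the MAP estimate weights the ML mean by N / (N + τ) with τ = σ² / σ_ap². Treating `tau` as a single scalar shared by all dimensions is a simplifying assumption, and the function name is illustrative.

```python
def map_adapt_mean(prior_mean, samples, tau):
    """MAP estimate of a Gaussian mean given adaptation samples.

    prior_mean: a priori mean vector (list of floats)
    samples:    list of adaptation feature vectors
    tau:        assumed prior weight (ratio of data to prior variance)
    """
    n = len(samples)
    if n == 0:
        return list(prior_mean)  # no data: keep the prior mean
    dim = len(prior_mean)
    ml_mean = [sum(x[d] for x in samples) / n for d in range(dim)]
    weight = n / (n + tau)  # approaches 1 as adaptation data grows
    return [weight * ml + (1.0 - weight) * ap for ml, ap in zip(ml_mean, prior_mean)]
```

With little data the estimate stays close to the a priori mean, and it approaches the ML estimate only as N grows, which is why MAP adaptation learns slowly.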
MLLR Adaptation • Maximum Likelihood Linear Regression (MLLR) is a common transformational adaptation technique (Leggetter & Woodland, 1995) • Idea: Adjust model parameters using a transformation shared globally or across different units within a class • Global mean vector translation: • Global mean vector scaling, rotation and translation:
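Both transformations are instances of one affine map applied to every mean vector: translation alone is μ' = μ + b, and scaling, rotation and translation together are μ' = Aμ + b. The sketch below only applies a given transform; estimating A and b by maximizing the likelihood of the adaptation data is the core of MLLR and is not shown. The function name is illustrative.

```python
def apply_mllr(means, A, b):
    """Apply the shared affine transform mu' = A @ mu + b to each mean vector.

    means: list of mean vectors (lists of floats)
    A:     square matrix as a list of rows
    b:     bias (translation) vector
    """
    return [
        [sum(a * m for a, m in zip(row, mean)) + bias for row, bias in zip(A, b)]
        for mean in means
    ]
```

Because one transform is shared by many Gaussians, all of them move even when only a few were observed in the adaptation data, which is the main contrast with MAP adaptation.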
Example Recognition Results • Recognition results over 5 academic lectures • Total of 6 hours from 5 different speakers • Language model adaptation based on supplemental material • Lecture slides for 3 lectures • Average: 1470 total words and 32 new words • Googled web documents for 2 lectures • Average: 11700 total words and 139 new words • Unsupervised MAP adaptation for acoustic models • No data filtering based on recognizer confidence
Importance of Adaptation • Experiment: Examine performance of recognizer on four physics lectures from a non-native speaker • Perform adaptation: • Adapt language model by adding 2 physics textbooks and 40 transcribed physics lectures to LM training data • Adapt acoustic model by adding 2 previous lectures (100 minutes) or 35 previous lectures (29 hours) to AM training data • Acoustic model adaptation helps much more than language model adaptation in this case
Example Recognition 1 • Example hypothesis from a recognizer: “…but rice several different k means clustering zen picks the one that if the the minimum distortion some sense…” • Here’s the true transcription: “…but try several different k means clusterings and pick the one that gives you the minimum distortion in some sense…” • Recognizer has 35% word error rate on this segment. • Full comprehension of segment is difficult… • …but determining topic of segment is easy!
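Word error rate is the minimum number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. A minimal sketch using the standard edit-distance recurrence (function names are illustrative):

```python
def word_errors(ref, hyp):
    """Minimum substitutions + deletions + insertions between word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference word
                cur[j - 1] + 1,          # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis string against a reference string."""
    ref = reference.split()
    hyp = hypothesis.split()
    return word_errors(ref, hyp) / len(ref)
```

Applied to the two segments quoted on this slide, the 20-word reference and the hypothesis differ by six substitutions and one deletion, which gives 7/20 = 35%.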
Example Recognition 2 • Another example hypothesis from a recognizer: “… and the u’s light years… which is the distance that light travels in one year… you’ve milliseconds we have microsecond so… we have days weeks ours centuries month… all derived units…” • Here’s the true transcription: “… and they use light years… which is the distance that light travels in one year… we have milliseconds we have microseconds… we have days weeks hours centuries months… all derived units…” • Recognizer has 26% word error rate on this segment • Comprehension is easy for most readers • Some recognition errors are easy for readers to correct
Automatic Alignment of Human Transcripts • Goal: Align transcript w/o time markers to long audio file • Run recognizer over utterances to obtain word hypotheses • Use language model strongly adapted to reference transcript • Align reference transcript against word hypotheses • Identify matched and mismatched words (mismatches marked X) • Treat multi-word matched sequences as anchor regions • Extract new segments starting and ending within anchors • Force align reference words within each new segment s_i
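The anchor-finding step can be sketched with Python's standard `difflib.SequenceMatcher`, which aligns two word lists and reports their matching blocks. Using this matcher and a minimum anchor length of two words are assumptions made for illustration; the tutorial does not specify the alignment algorithm.

```python
from difflib import SequenceMatcher


def find_anchors(reference, hypothesis, min_len=2):
    """Return (ref_start, hyp_start, length) for multi-word matched runs.

    reference, hypothesis: lists of words. Runs of at least min_len
    consecutive matching words are treated as anchor regions.
    """
    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    return [
        (block.a, block.b, block.size)
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]
```

Since the recognizer's hypothesis words carry time stamps, each anchor ties a stretch of the reference transcript to a known time span, and the reference words between two anchors can then be force aligned within that shorter segment.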
Aligning Approximate Transcriptions • Initialize FST lexical network G with words in transcription • Account for untranscribed words with OOV filler model • Allow words in transcription to be deleted • Allow substitution of OOV filler for words in transcription • Result: Alignment with transcription errors marked
Automatic Error Correction • Experiment: After automatic alignment re-run recognizer over regions marked as alignment errors • Allow any word sequence to replace marked insertions and substitutions • Allow word deletions to be reconsidered • Use trigram model to provide language constraint • Results over three lectures presented earlier:
Spoken Document Retrieval: Outline • Brief overview of text retrieval algorithms • Integration of IR and ASR using lattices • Query Processing • Relevance Scoring • Evaluation • User Interface • Try to balance overview of work in the area with experimental results from our own work • Active area of research: • emphasize known approaches as well as interesting research directions • no established way of solving these problems as of yet
Text Retrieval • Collection of documents: • “large” N: 10k-1M documents or more (videos, lectures) • “small” N: < 1-10k documents (voice-mails, VoIP chats) • Query: • ordered set of words in a large vocabulary • restrict ourselves to keyword search; other query types are clearly possible: • Speech/audio queries (match waveforms) • Collaborative filtering (people who watched X also watched…) • Ontology (hierarchical clustering of documents, supervised or unsupervised)
Text Retrieval: Vector Space Model • Build a term-document co-occurrence (LARGE) matrix (Baeza-Yates, 99) • rows indexed by word • columns indexed by documents • TF (term frequency): frequency of word in document • could be normalized to maximum frequency in a given document • IDF (inverse document frequency): if a word is equally likely to appear in all documents, it isn’t very useful for ranking • (Bellegarda, 2000) uses normalized entropy
Text Retrieval: Vector Space Model (2) • For retrieval/ranking one ranks the documents in decreasing order of relevance score: • query weights have minimal impact since queries are very short, so one often uses a simplified relevance score:
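A minimal sketch of the simplified relevance score follows: each document is scored by summing TF times IDF over the query words, with query weights ignored. It assumes TF normalized by the maximum term frequency in the document and IDF = log(N / df), which is one common weighting; the slide's exact formulas did not survive, so these choices are assumptions.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: dict of doc_id -> list of tokens. Returns (tf, idf) tables."""
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter(word for counts in tf.values() for word in counts)
    n_docs = len(docs)
    idf = {word: math.log(n_docs / count) for word, count in df.items()}
    return tf, idf


def rank(query, tf, idf):
    """Return doc ids in decreasing order of summed TF-IDF over query words."""
    scores = {}
    for doc_id, counts in tf.items():
        max_tf = max(counts.values(), default=1)
        score = sum((counts[w] / max_tf) * idf.get(w, 0.0) for w in query)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

A word that appears in every document gets IDF 0 and so contributes nothing to the ranking, and a document containing none of the query words is never returned.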
Text Retrieval: TF-IDF Shortcomings • Hit-or-Miss: • returns only documents containing the query words • query for Coca Cola will not return a document that reads: • “… its Coke brand is the most treasured asset of the soft drinks maker …” • Cannot do phrase search: “Coca Cola” • needs post processing to filter out documents not matching the phrase • Ignores word order and proximity • query for Object Oriented Programming: • “ … the object oriented paradigm makes programming a joy … “ • “ … TV network programming transforms the viewer in an object and it is oriented towards…”
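The post-processing step for phrase search can be sketched as a filter that keeps only documents in which the query words occur consecutively and in order. The function names are illustrative; the two test sentences are the ones quoted on this slide.

```python
def contains_all_words(tokens, query):
    """Bag-of-words match: every query word occurs somewhere in the document."""
    present = set(tokens)
    return all(word in present for word in query)


def contains_phrase(tokens, phrase):
    """Phrase match: the phrase words occur consecutively and in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))
```

Both example sentences pass the bag-of-words test for "object oriented programming", but only the first contains the phrase "object oriented", which is the distinction a word-order-blind TF-IDF score cannot make.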
Vector Space Model: Query/Document Expansion • Correct the Hit-or-Miss problem by doing some form of expansion on the query and/or document side • add similar terms to the ones in the query/document to increase number of terms matched on both sides • corpus driven methods: TREC-7 (Singhal et al., 99) and TREC-8 (Singhal et al., 00) • Query side expansion works well for long queries (10 words) • short queries are very ambiguous and expansion may not work well • Expansion works well for boosting Recall: • very important when working on small to medium sized corpora • typically comes at a loss in Precision
Vector Space Model: Latent Semantic Indexing • Correct the Hit-or-Miss problem by doing some form of dimensionality reduction on the TF-IDF matrix • Singular Value Decomposition (SVD) (Furnas et al., 1988) • Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) • Non-negative Matrix Factorization (NMF) • Matching of query vector and document vector is performed in the lower dimensional space • Good as long as the magic works • Drawbacks: • still ignores WORD ORDER • users are no longer in full control over the search engine • Humans are very good at crafting queries that’ll get them the documents they want, and expansion methods impair full use of their natural language faculty