Confidence Measures in Speech Recognition

Stephen Cox, School of Computing Sciences, University of East Anglia, Norwich, UK. sjc@cmp.uea.ac.uk

Talk Outline • Why do we need confidence measures in speech systems? • Motivation for recogniser-independent measures • PART 1: Two methods for estimating confidence measures, based on phone/word models • Phone correlation • Metamodels • PART 2: Using semantic information to estimate confidence measures • Discussion

Why Confidence Measures? • A confidence measure (CM) is a number between 0 and 1 indicating our degree of belief that a unit output by a recogniser (phrase, word, phone etc.) is correct • The most important application of CMs is in speech dialogue systems e.g. ticket booking, call-routing, information provision etc. • Uncorrected errors can be disastrous in a dialogue system, but confirmation of each content word is tedious • The system can use a CM to decide which words are correct and which need to be confirmed or corrected • Unsupervised speaker adaptation: use the CM in adaptation of the acoustic models (adapt only models of words that the system considers likely to be correct) • Aids selection among multiple hypotheses

Previous Work • Confidence measures (CMs) mostly based on deriving ad hoc features from “side-output” from the recogniser e.g. • number of competing hypotheses when a word is decoded • likelihood ratio of hypotheses • stability of word in the output lattice (N-best) • number of instances of the word, or of the phonemes in the word, in the training data • etc. • Problem: These are usually highly recogniser-specific

Example: Number of hypothesized word-ends as a confidence measure

PART 1: A General Approach I • Speech recognition relies on Bayes’ Theorem: Pr(A | W) Pr(W) = Pr(W | A) Pr(A), where W = word sequence and A = acoustics of speech signal • Pr(W): probability of a word sequence W (LANGUAGE MODEL) • Pr(A | W): probability of some acoustics A given a word sequence W (ACOUSTIC MODELS) • Pr(W | A): probability of a word sequence W given some acoustics A • Pr(A): probability of some acoustics A
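Written out in full, the theorem on this slide sits alongside the decoding rule it is usually taken to justify. The rule is standard but is not stated on the slide, and the symbol \hat{W} for the decoded word sequence is introduced here for illustration only:

```latex
\Pr(W \mid A) = \frac{\Pr(A \mid W)\,\Pr(W)}{\Pr(A)},
\qquad
\hat{W} = \arg\max_{W} \Pr(A \mid W)\,\Pr(W)
```

Pr(A) does not depend on W, so it can be dropped from the maximisation.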

A General Approach II • Errors occur when either Pr(W) (language model) is inaccurate or Pr(A|W) (acoustic models) is inaccurate • In decoding words in a recogniser, these two probabilities are integrated • We can attempt to disentangle their effects by using a parallel phone recogniser • Two approaches: • use correlation between phone recogniser string and word recogniser string as confidence measure • use phone recogniser to hypothesise word strings and correlate with word recogniser output

Use of a parallel phone recogniser

Pre-processing for phone correlation [Figure: the speech is passed to a word recogniser, whose output gives a phoneme transcription p1 p2 p3 p4 …, and to a phoneme recogniser, which gives q1 q2 q3 q4 …; a DP alignment of the two yields either aligned phonemes or tagged frames. A superscript k on p means that p is within word k]

Phone correlation: distance measure [Figure: aligned phoneme sequences or tagged frames p1 p2 p3 … and q1 q2 q3 … are compared using a confusion matrix]

Phone correlation: likelihood ratio [Figure panels: correctly decoded word; incorrectly decoded word]

Hypothesising words from phone strings • P* is the most likely phoneme sequence • Pr(P* | A) can be estimated from a parallel phoneme recogniser • Pr(W | P*) is estimated using two techniques: LexList and Metamodels

LexList: Constructing hypothetical word-sequences

Hypotheses Made by a Sliding Window of Length 3 Phonemes

Metamodels: candidate word lists built using phoneme confusions • Motivation: • LexList method requires some ad hoc decisions about window-length, short words etc. • Combinatorial explosion in candidate words when confusion-matrix is used • Metamodels use knowledge of phoneme confusions within an HMM framework to produce candidate word lists for CM estimation

Building a Set of Metamodels

Obtaining a confidence measure from a set of metamodels

Data and Models • Recogniser built using WSJCAM0 database • Acoustic model training: SI-TR data, ~10000 sentences, 92 speakers • Testing: SI-DT dataset, ~1900 sentences, 20 speakers • Models: 8-component Gaussian mixture triphone models with tying (~3500 states) • Bigram language model with backoff, 20000-word vocabulary, perplexity ~160 • Confidence measures: independent training and testing sets taken from the SI-DT dataset

Performance measurement • Use CM to tag each decoded word as ‘C’ (correct) or ‘I’ (incorrect) • Guessing measure (G) error-rate: • Confidence measure (CM) error-rate: • Improvement I:

Baseline: “N-best” Confidence • Confidences: can = 4/9, an = 5/9, increase = 8/9 etc.
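A minimal sketch of one way such N-best confidences could be computed, assuming the confidence of a word is simply the fraction of the N-best hypotheses that contain it. The 9-best list below is invented for illustration and is not taken from the talk; a real system would presumably also check that the word occupies the same time region in each hypothesis, which this sketch ignores.

```python
from collections import Counter


def nbest_word_confidence(nbest):
    """Return, for each word, the fraction of N-best hypotheses containing it."""
    counts = Counter()
    for hypothesis in nbest:
        # Count each word at most once per hypothesis.
        counts.update(set(hypothesis.split()))
    n = len(nbest)
    return {word: count / n for word, count in counts.items()}


# Hypothetical 9-best list (illustrative only, not from the slides).
nbest = [
    "we can increase it",
    "we can increase it now",
    "we can increase in",
    "we can increase",
    "we an increase it",
    "we an increase in",
    "we an increase",
    "we an increase it now",
    "we an in crease it",
]

confidence = nbest_word_confidence(nbest)
for word in ("can", "an", "increase"):
    print(word, round(confidence[word], 3))
```

With this list, "can" appears in 4 of the 9 hypotheses, "an" in 5 and "increase" in 8, matching the fractions quoted on the slide.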

Performance comparison

PART II: Use of semantic information in confidence measures • It is possible to identify incorrect words in an utterance on semantic grounds, e.g. “Exxon corporations said earlier this week that it replaced one hundred forty percent its violin gas production in nineteen eighty serve on.” (violin = “oil and”) • Clearly, only a small proportion of incorrect words can be identified on such grounds • However, this information is likely to be independent of measures based on decoder output, and so might be advantageously combined with other CMs • Also, it requires no recogniser side information at all

Preliminary Experiment • Examined decodings of about 600 sentences from our recogniser • Marked any word that we considered to be incorrect on grounds of semantics • Checked results against the transcriptions: • Only 470 words were marked as incorrect (Recall = 470/3141 = 15%) • Of these words, 421 were actually incorrect (Precision = 421/470 = 90%) • So human performance may be useful, but at low recall

Latent Semantic Analysis • We need a way of identifying words that are “semantically distant” from the other decoded words in a sentence • Clustering words only works up to a point because of data sparsity • Also, many semantically close word-pairs may rarely co-occur and so not cluster, e.g. • movie and film (synonyms) • striker and batsman (both sporting roles, but different games) • Latent Semantic Analysis (LSA) has been successfully used to associate semantically similar words

Co-occurrence matrix W [Figure: a matrix with M rows, one per word (a, about, access, account, …, you, you’ve, your), and N columns, one per document (Doc 1, Doc 2, Doc 3, …, Doc N); the entries shown are small counts such as 0, 1 and 2, and most are 0]

Singular Value Decomposition of W • W ≈ U S V^T • W is M x N (WORD/DOCUMENT SPACE): rows are words w_1 … w_M, columns are documents d_1 … d_N • U is M x R, S is R x R (diagonal, with zeros off the diagonal), V^T is R x N (LSA SPACE) • W = U S V^T when R = N • In this case, M ≈ 20000, N ≈ 20000, R = 100

Data and Representation • Use the Wall Street Journal corpus (same material as utterances for recognition experiments). • The “Documents” are the paragraphs: each paragraph is (pretty much) semantically coherent • 19396 documents and 19685 different words • Each word represented in the LSA space by a 100-d vector • Computed “semantic similarity” between two words as the dot-product of the vectors representing the words:
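The representation described above can be sketched with NumPy on a toy word-by-document count matrix. The words, counts and R = 2 are invented for illustration, and scaling the rows of U by the singular values to obtain word vectors is an assumption (the slides do not say which scaling was used). In this toy matrix "movie" and "film" never occur in the same document, yet they receive a positive similarity in the reduced space because both co-occur with "actor".

```python
import numpy as np

# Toy co-occurrence matrix: rows are words, columns are documents.
words = ["stock", "shares", "market", "movie", "film", "actor"]
W = np.array([
    [1, 1, 0, 0, 0],   # stock
    [1, 0, 1, 0, 0],   # shares
    [1, 1, 1, 0, 0],   # market
    [0, 0, 0, 1, 0],   # movie
    [0, 0, 0, 0, 1],   # film
    [0, 0, 0, 1, 1],   # actor
], dtype=float)

R = 2  # dimension of the LSA space (the talk uses R = 100)

# Thin SVD; singular values are returned in descending order.
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Assumption: each word is represented by its row of U scaled by the
# top R singular values.
word_vectors = U[:, :R] * s[:R]

# Semantic similarity between every pair of words as a dot product.
similarity = word_vectors @ word_vectors.T


def semantic_similarity(a, b):
    """Dot product of the LSA vectors of words a and b."""
    return float(similarity[words.index(a), words.index(b)])


raw_cooccurrence = float(W[words.index("movie")] @ W[words.index("film")])
print("raw movie/film overlap:", raw_cooccurrence)
print("LSA movie/film similarity:", semantic_similarity("movie", "film"))
print("LSA movie/stock similarity:", semantic_similarity("movie", "stock"))
```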

Semantic Score Distributions for Four Words [Figure panels: OUT (rank 1), CAUTIOUS (rank 6763), DENOMINATION (rank 19666), ABOARD (rank 13892)]

Confidence measures from LSA • Several confidence measures for a decoded word were evaluated: • 1. Mean semantic score to the other decoded words (MSS) • 2. Mean rank of semantic score to the other decoded words, given the complete distribution of scores to all words (MR) • 3. Probability of observing the set of scores to the other decoded words given the distribution of scores for the word (PSS). The score distribution was approximated by a five-component Gaussian mixture.
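The first two measures can be sketched as follows, assuming a precomputed word-by-word semantic score matrix. The 5-word matrix and the decoded sentence are invented for illustration, and treating rank 1 as the highest score (with the word's own score included in the ranking) is an assumption. PSS is not shown because it needs a fitted Gaussian mixture per word.

```python
import numpy as np


def mean_semantic_score(sim, word, others):
    """MSS: mean score between `word` and the other decoded words."""
    return float(np.mean([sim[word, o] for o in others]))


def mean_rank(sim, word, others):
    """MR: mean rank of the other decoded words among all scores for `word`.

    Rank 1 is the highest-scoring vocabulary word (an assumption).
    """
    order = np.argsort(-sim[word])             # indices by descending score
    rank_of = np.empty(len(order), dtype=int)
    rank_of[order] = np.arange(1, len(order) + 1)
    return float(np.mean([rank_of[o] for o in others]))


# Hypothetical symmetric score matrix over a 5-word vocabulary.
sim = np.array([
    [1.0, 0.8, 0.6, 0.1, 0.0],
    [0.8, 1.0, 0.5, 0.2, 0.1],
    [0.6, 0.5, 1.0, 0.3, 0.2],
    [0.1, 0.2, 0.3, 1.0, 0.7],
    [0.0, 0.1, 0.2, 0.7, 1.0],
])

decoded = [0, 1, 4]  # vocabulary indices of the words in one decoded sentence

mss, mr = {}, {}
for w in decoded:
    others = [o for o in decoded if o != w]
    mss[w] = mean_semantic_score(sim, w, others)
    mr[w] = mean_rank(sim, w, others)

print(mss)
print(mr)
```

In this toy sentence, word 4 has the lowest MSS and the worst (largest) mean rank, so it is the one these measures would flag as semantically distant from the rest.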

Use of a Stop List • Very commonly occurring words (e.g. function words) co-occur with most words, so have high scores to most words, and so contribute noise. • Hence words whose mean semantic score to all words in the training set was above a threshold LT were omitted. • Recogniser baseline performance increases when these words are omitted, and this is taken into account in the results.
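A minimal sketch of the stop-list rule above, assuming the mean is taken over each row of a word-by-word semantic score matrix (including the word's own score). The three words, the scores and the threshold value are invented for illustration; the slides do not give the value of LT.

```python
import numpy as np


def build_stop_list(sim, words, threshold):
    """Return the words whose mean semantic score to all words exceeds threshold."""
    mean_scores = sim.mean(axis=1)
    return {w for w, m in zip(words, mean_scores) if m > threshold}


# Hypothetical scores: "the" scores highly against every word.
words = ["the", "stock", "film"]
sim = np.array([
    [0.9, 0.8, 0.8],
    [0.8, 1.0, 0.1],
    [0.8, 0.1, 1.0],
])

LT = 0.7  # illustrative threshold only
stop_list = build_stop_list(sim, words, LT)

# Words on the stop list are left out before confidence scores are computed.
content_words = [w for w in words if w not in stop_list]
print(stop_list, content_words)
```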

Distribution of PSS scores [Figure: large difference in distributions for high scores; little difference in distributions for low scores]

Discussion I • We expected this technique to work by identifying as incorrectly decoded those words that were semantically distant from the other words. • However, PSS derives its discrimination by identifying the correctly decoded words. • Analysis revealed that the words associated with high values of PSS were predominantly words that commonly occurred in the WSJ data (numbers, financial terms etc.). These are highly cognate with each other.

Discussion II • Inspection of the decoded words that had very low values of PSS associated with them showed that some of these were very common words that had been correctly decoded. • It is possible that the corpus used for making the LSA analysis does not have enough material to capture the large set of words that these common words co-occur with. • Hence the decoded utterances in the test-set contain previously unseen co-occurrences that lead to a low semantic score for these words. • Some test-set words are also out-of-vocabulary

Performance of semantic CM

Final Comments • We have developed techniques for identifying incorrect words in the output of a speech recogniser that do not depend on “side-information” from the recogniser, which is highly recogniser-specific. • The most successful is the “metamodels” technique. This uses a parallel phone recogniser working with the word recogniser and then correlates the output of the word recogniser with possible words constructed using metamodels. • Using semantic information gives a small but significant confidence gain and requires no other recogniser. This may well be domain-dependent. • The final test of the utility of these measures comes when they are used in a real system.
