
Advanced Speech Application Tuning Topics






Presentation Transcript


  1. Advanced Speech Application Tuning Topics Yves Normandin, Nu Echo yves.normandin@nuecho.com SpeechTEK University August 2009

  2. Fundamental principles • Tuning is a data-driven process • It should be done on representative samples of user utterances • You can only tune what you can measure • And you must measure the right things • Tuning can be quite time-consuming, so it’s important to have efficient ways to: • Quickly identify where the significant problems are • Find and implement effective optimizations • Measure the impact of changes

  3. Activities in a tuning process • Produce application performance reports • Call analysis • Tuning corpus creation & transcription • Benchmark setup + produce baseline results • Grammar / dictionary / confidence feature tuning • Confidence threshold determination • Application changes (if required) • Integration of tuning results in application • Tests

  4. Call analysis • Goal: Analyze complete calls in order to identify and quantify problems with the application • Focus is on detecting problems that won’t be obvious from isolated utterances, e.g., usability, confusion, latency • This is the first thing that should be done after a deployment • For this, we need a call viewing tool that allows • Selecting calls that meet certain criteria (failures, etc.) • Stepping through a dialog • Listening to a user utterance • Seeing the recognition result • Annotating calls (to classify and quantify problems observed)

  5. About call analysis • Only using utterances recorded by the engine doesn’t provide a complete picture • We don’t hear everything the caller said • It is often difficult to interpret why the caller spoke in a certain way (e.g., why was there a restart?) • Having the ability to do full call recordings makes it possible to get key missing information and better understand user behavior • An interesting trick is to ask callers questions in order to understand their behavior

  6. Tuning corpus creation • Build a tuning corpus for each relevant recognition context • For each utterance, the corpus should contain: • The waveform logged by the recognition engine • The active grammars when the utterance was collected • The recognition result obtained in the field • Useful to provide an initial transcription • Allows comparing field results with lab results • Utterance attributes, e.g., • Interaction ID (“initial”, “reprompt-noinput”, “reprompt-nomatch”, etc.) • Date, language, etc.
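  As an illustration of what such a corpus entry can carry, here is a minimal Python sketch; the record layout and field names are assumptions based on the list above, not the format of any particular tool:

  from dataclasses import dataclass

  @dataclass
  class CorpusUtterance:
      # One tuning-corpus entry; all field names here are illustrative assumptions.
      waveform_path: str               # waveform logged by the recognition engine
      active_grammars: list[str]       # grammars active when the utterance was collected
      field_result: str                # recognition result obtained in the field
      transcription: str = ""          # manual transcription (the field result can serve as a pre-transcription)
      interaction_id: str = "initial"  # "initial", "reprompt-noinput", "reprompt-nomatch", ...
      date: str = ""
      language: str = ""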

  7. Corpus transcription • Our tuning process assumes that accurate orthographic transcriptions are available for all utterances • Transcription needs to be done manually, but the recognition result can be used as a pre-transcription • Transcriptions are used to compute reference semantic interpretations • The “reference semantic interpretation” is the semantic interpretation corresponding to the transcription • It is produced automatically by parsing the transcription with the grammar

  8. Benchmark setup + Produce baseline performance results • There are several goals to this phase: • Obtain a stable ING/OOG (in-grammar / out-of-grammar) classification for all utterances • Produce a reference semantic interpretation for all ING utterances • Clean up grammars, if required • Produce a first baseline result • This can be a significant effort, but: • Effective tools make this fairly efficient • It doesn’t require highly skilled resources

  9. High-level grammar tuning process

  10. Scoring recognition results: Basic definitions

  11. Remarks • We use the term “confidence feature” to designate any score that can be used to evaluate confidence in a recognition result • We often compute confidence scores that provide much better results than those provided by the recognition engine confidence score. • The terms “accept” and “reject” mean that the confidence feature is above or below the threshold being considered • The definition of “correct” should be configurable, e.g., • Semantic scoring vs. word-based scoring • 1-best vs. N-best scoring

  12. Scoring recognition results: Sufficient statistics

  13. Equivalence with commonly used symbols

  14. All metrics can be calculated based on these sufficient statistics. All metrics are clearly defined so that there is no ambiguity.

  15. Any metric can be calculated based on the sufficient statistics. Key metrics (note: false accepts include both incorrect recognitions and accepted OOG utterances).
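  As a rough illustration of how these metrics fall out of the sufficient statistics, here is a minimal sketch; it assumes one common convention (CA rate over all ING utterances, FA rate over all utterances), which may differ in detail from the exact definitions used in the deck:

  def ca_rate(correct_accepts, ing_total):
      # fraction of ING (valid) utterances accepted with the correct interpretation
      return correct_accepts / ing_total

  def fa_rate(false_accepts, total):
      # fraction of all utterances accepted with a wrong interpretation;
      # false accepts cover both incorrect recognitions and accepted OOG utterances
      return false_accepts / total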

  16. Fundamental performance plot: Correct Accept rate vs. False Accept rate, swept from a high confidence threshold to a low one.
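  A minimal sketch of how the points on such a plot can be produced, assuming each utterance is reduced to a (confidence feature, is-correct) pair and that the number of ING utterances is known:

  def ca_fa_curve(utterances, ing_total, thresholds):
      # utterances: list of (confidence, is_correct) pairs; is_correct follows the configured scoring
      total = len(utterances)
      points = []
      for t in sorted(thresholds, reverse=True):        # sweep from high threshold to low threshold
          accepted = [(c, ok) for c, ok in utterances if c >= t]
          ca = sum(1 for _, ok in accepted if ok)
          fa = len(accepted) - ca
          points.append((fa / total, ca / ing_total))   # (FA rate, CA rate) point for this threshold
      return points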

  17. The graphical view makes improvements immediately visible. That’s a very effective way of measuring progress.

  18. Problems with the basic tuning process

  19. Some reasons why transcriptions are not covered

  20. Examples (birth date grammar)

  21. What to do about such utterances? • We certainly can’t ignore them • They represent the reality of what users actually say • The application has to deal with that • We can’t just assume they should be rejected by the application • Many of these are actually perfectly well recognized, often with a high score • The “False Accept” rate becomes meaningless • Many of them should be recognized • We can’t score them because we have no reference interpretation

  22. Our approach:“Human perceived ING/OOG” A transcription is considered ING (valid) if a human can easily interpret it; It is OOG otherwise • Doesn’t depend on what the recognition grammar actually covers • Makes results comparisons meaningful since we always have the same sets of ING and OOG utterances • Provides accurate and realistic performance metrics • CA measured on all valid user utterances • Reliable FA measurement for precise high threshold setting

  23. Challenge: Computing the reference semantic interpretation. Two techniques: regex-based transcription transformations and semantic tagging.

  24. Sample regex transformations: Remove “as in” in postal codes

  25. Sample regex transformations: Remove repetition of the first letter
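  The two transformations above could look roughly like this in Python; the patterns are illustrative assumptions, not the rule syntax of the actual tool:

  import re

  def remove_as_in(text):
      # "m as in mary h two a four" -> "m h two a four"
      return re.sub(r"\b([a-z]) as in \w+", r"\1", text)

  def remove_repeated_first_letter(text):
      # "h h two a four" -> "h two a four" (caller repeating the first letter)
      return re.sub(r"^([a-z]) \1\b", r"\1", text)

  print(remove_as_in("m as in mary h two a four"))
  print(remove_repeated_first_letter("h h two a four"))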

  26. Focus on high-confidence OOG utterances. We want to avoid utterances incorrectly classified as false accepts. (The example shown is a transcription error; the word should be “one”.)

  27. Tool to add paraphrases. A paraphrase replaces a transcription by another one, with the same meaning, that parses. The tool aligns the paraphrase with the transcription and shows whether the paraphrase is in-grammar.

  28. Postal code example The advantage of supporting certain repeats, corrections, and the form “m as in mary” is clearly demonstrated

  29. Postal code example Impact of adding support for “p as in peter”

  30. Comments on the transformations-based approach • Advantages • Not dependent on a specific semantic representation • The transformation framework makes this very efficient • Single rules can deal with dozens of utterances • Problems • For really “natural language” utterances, transformed transcriptions end up bearing little resemblance to the original • Better to use semantic tagging in this case

  31. High-level grammar tuning process (revisited): (1) benchmark setup, then (2) tuning. Note: The reference grammar is often a good starting point for the recognition grammar.

  32. Key advantage: Meaningful performance comparisons. Comparison of an address grammar that supports apartment numbers vs. one that doesn’t. • Scoring done only on the address slots • The same set of ING and OOG utterances in both cases, despite significant grammar changes, ensures that comparisons are meaningful

  33. Key advantage: Better-tuned applications. At 0.5% FA: with transformations, threshold = 0.63 and CA = 83.0%; without transformations, threshold = 0.85 and CA = 78.4%.
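  A sketch of how an operating threshold like the ones above can be picked from the sweep, assuming (threshold, FA rate, CA rate) points and a target FA rate of 0.5%:

  def threshold_at_target_fa(points, target_fa=0.005):
      # points: list of (threshold, fa_rate, ca_rate);
      # pick the lowest threshold that still meets the FA target (maximizes CA)
      candidates = [p for p in points if p[1] <= target_fa]
      return min(candidates, key=lambda p: p[0]) if candidates else None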

  34. Other advantages • Lab results truly represent field performance • Better confidence in the results obtained • Few surprises when applications are deployed

  35. Techniques to identify problems

  36. Fundamental techniques • Listen to problem utterances • This includes incorrectly recognized utterances AND correctly recognized utterances with a low score • This cannot be emphasized enough • Identify the largest sources of errors • Frequent substitutions • Words with high error rate • Slot values with high error rate • Look at frequency patterns in the data • Analyze specific semantic slots • Certain slots cause more problems than others • Compare experiments

  37. Substitutions / word errors Are there words with unusually high error rates?

  38. Then examine all sentences with a specific substitution (using a substitution filter). In this case: “a” → “eight”

  39. Slot-specific scoring (month slot vs. day slot): Are there semantic slots that perform unusually badly?

  40. Tags and Tag Reports • In Atelier, we can use tags to create partitions based on any utterance attribute • Semantic interpretation patterns in the transcription or the recognition result • ING / OOG • Index of correct result in the N-best list • Scoring category • Confidence score ranges • Tags can be used to filter the utterances in powerful ways • Tag reports are used to compute selected metrics for any partition of the utterances
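  As a rough sketch of what a tag report computes (not Atelier’s actual API), assuming each utterance carries a set of tags and a scoring category:

  from collections import defaultdict

  def tag_report(utterances):
      # utterances: list of dicts with "tags" (set of strings) and "category" ("correct-accept", ...)
      per_tag = defaultdict(lambda: {"total": 0, "correct_accept": 0})
      for utt in utterances:
          for tag in utt["tags"]:
              per_tag[tag]["total"] += 1
              if utt["category"] == "correct-accept":
                  per_tag[tag]["correct_accept"] += 1
      # one row per partition, sorted by correct accept rate, worst first
      rows = [(tag, s["correct_accept"] / s["total"], s["total"]) for tag, s in per_tag.items()]
      return sorted(rows, key=lambda r: r[1])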

  41. Use tag reports to find out where the biggest problems are, e.g., partition by semantic tags and sort based on the correct accept rate.

  42. Filter utterances in order to focus on specific problem cases • The “saint-leonard” borough has a high error rate. Let’s look at these utterances

  43. Looking at semantic substitutions • What are the most frequent substitutions with “saint-leonard”?
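  A minimal sketch of this kind of semantic substitution count, assuming each utterance carries reference and recognized slot dictionaries (field names are assumptions):

  from collections import Counter

  def substitutions_for(utterances, slot, reference_value):
      # count what the slot gets recognized as when the reference value is, e.g., "saint-leonard"
      subs = Counter()
      for utt in utterances:
          if utt["reference"].get(slot) == reference_value and utt["recognized"].get(slot) != reference_value:
              subs[utt["recognized"].get(slot)] += 1
      return subs.most_common()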

  44. Comparing experiments • Can choose which fields to consider for comparison purposes • This shows precisely the impact of a change on a per-utterance basis

  45. Computing grammar weights for diagnostic purposes • There are many ways of saying a birth date. Which ones are worth covering?
  public $date = ($intro | $NULL)
    (   $month{month=month.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
      | $monthNumeric{month=monthNumeric.month} (the | $NULL) $dayOfMonth{day=dayOfMonth.day}
      | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
      | (the | $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
      | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
      | (the | $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
    )
    $year{year=year.year};

  46. Computing grammar weights for diagnostic purposes (same $date grammar as above). One example phrase per alternative, in order: • January the sixteenth eighty • zero one sixteen eighty • sixteen zero one eighty • sixteen of the zero one eighty • sixteen January eighty • the sixteenth of January eighty

  47. Compute frequency weights based on transcriptions
  public $date = (/0.00001/ $intro | /1/ $NULL)
    (   /0.9636/ $month{month=month.month} (/0.06352/ the | /0.9365/ $NULL) $dayOfMonth{day=dayOfMonth.day}
      | /0.001654/ $monthNumeric{month=monthNumeric.month} (/0.00001/ the | /1/ $NULL) $dayOfMonth{day=dayOfMonth.day}
      | /0.004962/ (/0.00001/ the | /1/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
      | /0.0008271/ (/1/ the | /0.00001/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
      | /0.012406/ (/0.00001/ the | /1/ $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
      | /0.01654/ (/0.25/ the | /0.75/ $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
    )
    $year{year=year.year};
  • Each weight is the probability of using that alternative
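  A minimal sketch of how such frequency weights can be estimated from transcriptions, assuming a hypothetical parse_alternative() helper that reports which top-level alternative a transcription matches:

  from collections import Counter

  def estimate_weights(transcriptions, alternatives, parse_alternative, floor=0.00001):
      # parse_alternative(text) returns the id of the top-level alternative the
      # transcription matches, or None when it does not parse (hypothetical helper)
      counts = Counter(parse_alternative(t) for t in transcriptions)
      total = sum(counts[alt] for alt in alternatives)
      # relative frequency per alternative, floored so unused alternatives keep a tiny weight
      return {alt: max(counts[alt] / total, floor) for alt in alternatives}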

  48. Discriminative grammar weights based on recognition results
  public $date = (/-110.743109/ $intro | /110.7751/ $NULL)
    (   /291.1/ $month{month=month.month} (/-104.318/ the | /395.418/ $NULL) $dayOfMonth{day=dayOfMonth.day}
      | /-265.0/ $monthNumeric{month=monthNumeric.month} (/-75.4683/ the | /-189.53/ $NULL) $dayOfMonth{day=dayOfMonth.day}
      | /-16.85/ (/-17.085/ the | /0.2347/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} $monthNumeric{month=monthNumeric.month}
      | /0.000035/ (/0.000035/ the | /0/ $NULL) $dayOfMonthThirteenAndOver{day=dayOfMonthThirteenAndOver.day} of the $monthNumeric{month=monthNumeric.month}
      | /-21.16/ (/-10.058/ the | /-11.01/ $NULL) $dayOfMonth{day=dayOfMonth.day} $month{month=month.month}
      | /11.94/ (/-2.211/ the | /14.15/ $NULL) $dayOfMonth{day=dayOfMonth.day} of $month{month=month.month}
    )
    $year{year=year.year};
  • Positive: the alternative should be favored • Negative: the alternative should be disfavored

  49. Looking at utterance distribution statistics (address date grammar): people move more on the first of the month. Note that 20 (“vingt”) has the lowest recognition rate.

  50. What are the substitutions for 20 (“vingt”, in French)?
