A Meeting Browser that Learns Patrick Ehlen * Matthew Purver * John Niekrasz Computational Semantics Laboratory Center for the Study of Language and Information Stanford University
The CALO Meeting Assistant • Observe human-human meetings • Audio recording & speech recognition • Video recording & processing • Process written and typed notes, & whiteboard sketches • Produce a useful record of the interaction
The CALO Meeting Assistant (cont’d) • Offload some cognitive effort from participants during meetings • Learn to do this better over time • For now, focus on identifying: • Action items people commit to during meeting • Topics discussed during meeting
Human Interpretation Problems Compare to a new temp taking notes during meetings of Spacely Sprockets
Human Interpretation Problems (cont’d) • Problems in recognizing and interpreting content: • Lack of lexical knowledge, specialized idioms, etc • Overhearer understanding problem (Schober & Clark, 1989)
Machine Interpretation Problems • Machine interpretation: • Similar problems with lexicon • Trying to do interpretation from messy, overlapping multi-party speech transcribed by ASR, with multiple word-level hypotheses
Human Interpretation Solution • Human temp can still do a good job (while understanding little) at identifying things like action items • When people commit to do things with others, they adhere to a rough dialogue pattern • Temp performs shallow discourse understanding • Then gets implicit or explicit “corrections” from meeting participants
How Do We Do Shallow Understanding of Action Items? • Four types of dialogue moves: • Description of task: “Somebody needs to do this TZ-3146!” • Owner: “I guess I could do it.” • Timeframe: “Can you do it by tomorrow?” • Agreement: “Sure.” “Sounds good to me!” “Sweet!” “Excellent!”
Machine Interpretation • Shallow understanding for action item detection: • Use our knowledge of this exemplary pattern • Skip over deep semantic processing • Create classifiers that identify those individual moves • Posit action items • Get feedback from participants after meeting
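The shallow approach on this slide can be sketched roughly as follows. This is a minimal illustration and not the CALO implementation: the keyword cues in `CUES` are a hypothetical stand-in for the trained move classifiers, and the function names, the window size, and the `min_moves` threshold are all assumptions made for the example.

```python
import re
from typing import Dict, List, Set

# Hypothetical cue patterns standing in for the four trained move classifiers.
CUES: Dict[str, "re.Pattern[str]"] = {
    "task": re.compile(r"\b(needs? to|have to|should)\b", re.I),
    "owner": re.compile(r"\b(I could|I can|I will|I'll|can you)\b", re.I),
    "timeframe": re.compile(r"\b(by|tomorrow|today|next week)\b", re.I),
    "agreement": re.compile(r"\b(sure|okay|sounds good|excellent)\b", re.I),
}


def classify_moves(utterance: str) -> Set[str]:
    """Return the dialogue-move types whose cue pattern matches the utterance."""
    return {move for move, pattern in CUES.items() if pattern.search(utterance)}


def posit_action_items(
    utterances: List[str], window: int = 4, min_moves: int = 3
) -> List[dict]:
    """Posit an action item when a task description is followed, within a
    short window, by enough of the other move types."""
    items = []
    i = 0
    while i < len(utterances):
        if "task" not in classify_moves(utterances[i]):
            i += 1
            continue
        end = min(i + window, len(utterances))
        found: Dict[str, int] = {}
        for j in range(i, end):
            for move in classify_moves(utterances[j]):
                found.setdefault(move, j)  # keep the first utterance per move
        if len(found) >= min_moves:
            items.append({"start": i, "end": end - 1, "moves": found})
            i = end  # continue after the posited item
        else:
            i += 1
    return items


meeting = [
    "So how was the weekend?",
    "Somebody needs to do this TZ-3146!",
    "I guess I could do it.",
    "Can you do it by tomorrow?",
    "Sure.",
]
items = posit_action_items(meeting)
```

No deep semantic processing is involved: the sketch only checks whether the expected pattern of moves appears, which is the point of the shallow approach. Real classifiers would replace `classify_moves` and would work on noisy ASR output rather than clean text.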
Challenge to Machine Learning and UI Design • Detection challenges: • classify 4 different types of dialogue moves • want classifiers to improve over time • thus, need differential feedback on interpretations of these different types of dialogue moves • Participants should see and evaluate our results while doing something that’s valuable to them • And, from those user actions, give us the feedback we need for learning
Feedback Proliferation Problem • To improve action item detection, need feedback on performance of five classifiers (4 utterance classes, plus overall “this is an action item” class) • All on noisy, human-human, multi-party ASR results • So, we could use a lot of feedback
Feedback Proliferation Problem (cont’d) • Need a system to obtain feedback from users that is: • light-weight and usable • valuable to users (so they will use it) • able to solicit different types of feedback in a non-intrusive, almost invisible way
Feedback Proliferation Solution • Meeting Rapporteur • a type of meeting browser used after the meeting
Feedback Proliferation Solution (cont’d) • Many “meeting browser” tools are developed for research, and focus on signal replay • Ours: • tool to commit action items from meeting to user’s to-do list • relies on implicit user supervision to gather feedback to retrain classification models
Action Items: Subclass hypotheses • Top hypothesis is highlighted • Mouse over hypotheses to change them • Click to edit them (confirm, reject, replace, create)
Action Items: Superclass hypothesis • delete = negative feedback • commit = positive feedback • merge, ignore
Feedback Loop • Each participant’s implicit feedback for a meeting is stored as an “overlay” to the original meeting data • Overlay is reapplied when participant views meeting data again • Same implicit feedback also retrains models • Creates a personalized representation of meeting for each participant, and personalized classification models
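One way to picture the overlay mechanism is sketched below. It is an illustration under assumptions, not the Meeting Rapporteur's actual data model: the `Overlay` class, its fields, and the `apply_overlay` and `training_labels` helpers are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Overlay:
    """One participant's feedback on one meeting, stored apart from the
    original meeting data (all field names are hypothetical)."""
    participant: str
    deleted: Set[str] = field(default_factory=set)    # negative feedback
    committed: Set[str] = field(default_factory=set)  # positive feedback
    # (item id, property name) -> replacement text typed by the participant
    edits: Dict[Tuple[str, str], str] = field(default_factory=dict)


def apply_overlay(hypotheses: Dict[str, dict], overlay: Overlay) -> Dict[str, dict]:
    """Build the participant's personalized view without touching the original."""
    view = {}
    for item_id, props in hypotheses.items():
        if item_id in overlay.deleted:
            continue  # the participant rejected this action item
        merged = dict(props)
        for (edited_id, prop), text in overlay.edits.items():
            if edited_id == item_id:
                merged[prop] = text
        view[item_id] = merged
    return view


def training_labels(overlay: Overlay) -> List[Tuple[str, int]]:
    """Turn the same implicit feedback into labels for retraining
    (1 = committed, 0 = deleted)."""
    labels = [(item_id, 1) for item_id in sorted(overlay.committed)]
    labels += [(item_id, 0) for item_id in sorted(overlay.deleted)]
    return labels


hypotheses = {
    "ai1": {"task": "do TZ-3146", "owner": "unknown", "timeframe": "tomorrow"},
    "ai2": {"task": "how was the weekend", "owner": "unknown", "timeframe": ""},
}
overlay = Overlay(participant="alice")
overlay.committed.add("ai1")
overlay.deleted.add("ai2")
overlay.edits[("ai1", "owner")] = "Alice"

view = apply_overlay(hypotheses, overlay)
labels = training_labels(overlay)
```

Keeping the overlay separate from the meeting data means each participant can see a different record of the same meeting, and the same stored actions can be replayed for display or reused as training signal.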
Problem • In practice (e.g., CALO Y3 CLP data): • seem to get a lot of feedback at the superclass level (i.e., people are willing to accept or delete an action item) • but not as much (other than implicit confirmation) at subclass level (i.e., people are not as willing to change descriptions, etc)
Questions • User feedback provides information along different dimensions: • Information about the time an event (like discussion of an action item) happened • Information about the text that describes aspects of the event (like the task description, owner, and timeframe)
Questions (cont’d) • Which of these dimensions contribute most to improving models during retraining? • Which dimensions require more cognitive effort for the user when giving feedback? • What is the best balance between getting feedback information and not bugging the user too much? • What is the best use of initiative in such a system (user- vs. system- initiative)? • During meeting? • After meeting?
Experiments • 2 evaluation experiments: • “Ideal feedback” experiment • Wizard-of-Oz experiment
Ideal Feedback Experiment • Turn gold-standard human annotations of meeting data into posited “ideal” human feedback • Using that ideal feedback to retrain, determine which dimensions (time, text, initiative) contribute most to improving classifiers
Ideal Feedback Experiment (cont’d) • Results: • both time and text dimensions alone improve accuracy over raw classifier • using both time and text together performs best • textual information is more useful than temporal • user initiative provides extra information not gained by system-initiative
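The conversion from gold-standard annotations to "ideal" feedback could be sketched as below. This is only a guess at the general shape, not the procedure used in the experiment: the annotation fields (`start`, `end`, `task`, `owner`, `timeframe`), the `ideal_feedback` function, and the record layout are assumptions, and the initiative dimension is not modeled.

```python
from typing import Dict, List


def ideal_feedback(gold: List[dict], use_time: bool, use_text: bool) -> List[dict]:
    """Simulate what a perfect user would tell the system, restricted to the
    chosen feedback dimensions (time, text, or both)."""
    feedback = []
    for annotation in gold:
        record: Dict[str, object] = {"is_action_item": True}
        if use_time:
            # When the action item was discussed.
            record["start"] = annotation["start"]
            record["end"] = annotation["end"]
        if use_text:
            # Text describing the action item's properties.
            record["text"] = {
                key: annotation[key]
                for key in ("task", "owner", "timeframe")
                if key in annotation
            }
        feedback.append(record)
    return feedback


# Hypothetical gold annotation for one action item.
gold = [
    {"start": 12.0, "end": 31.5, "task": "do TZ-3146",
     "owner": "Alice", "timeframe": "tomorrow"},
]
time_only = ideal_feedback(gold, use_time=True, use_text=False)
text_only = ideal_feedback(gold, use_time=False, use_text=True)
both = ideal_feedback(gold, use_time=True, use_text=True)
```

Retraining once on each of the three variants and comparing classifier accuracy is one way to ask which dimension contributes most.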
Wizard-of-Oz Experiment • Create different Meeting Assistant interfaces and feedback devices (including our Meeting Rapporteur) • See how real-world feedback data compares to the ideal feedback described above • Assess how the tools affect and change behavior during meetings
Action Item Identification [Figure: linearized utterances u1, u2, ..., uN; classifier labels: task, owner, timeframe, agreement; output: AI (action item)] • Use four classifiers to identify dialogue moves associated with action items in utterances of meeting participants • Then posit the existence of an action item, along with its semantic properties (what, who, when) using those utterances
Like hiring a new temp to take notes during meetings of Spacely Sprockets • Even if we say, “Just write down the action items people agree to, and the topics,” that temp will run up against a couple of problems in recognizing and interpreting content (rooted in the collaborative underpinnings of semantics): • Overhearer understanding problem (Schober & Clark, 1989) • Lack of vocabulary knowledge, etc. • A machine overhearer that uses noisy multi-party transcripts is even worse off • We do what a human overhearer might do: shallow discourse understanding • If you were to go into a meeting, you might not understand what was being talked about, but you could understand when somebody agreed to do something • Why? Because when people make commitments to do things with others, they typically adhere to a certain kind of dialogue pattern
Superclass Feedback Actions • Superclass hypothesis: • delete = negative feedback • commit = positive feedback (add to “to-do” list)