
Summarization and Personal Information Management

This presentation discusses the research process for summarizing spoken dialogues, including identifying goals, designing new features, choosing an evaluation approach, and comparing against baseline techniques. It also explores the strengths, weaknesses, and potential improvements of the approach.


Presentation Transcript


  1. Summarization and Personal Information Management Carolyn Penstein Rosé, Language Technologies Institute / Human-Computer Interaction Institute

  2. Announcements • Questions? • Plan for Today • Murray, Renals, Carletta, & Moore, 2006 • Student presentation by Matt Marge

  3. Continuing from last time… Summarizing “nasty spoken dialogues”

  4. Notes on research process • Identified the goal for summarization • Designed new features • Designed an evaluation approach • Compared new approach to baseline on traditional and new evaluation technique • New evaluation showed that new approach is better • Old evaluation provides less convincing evidence of the distinction • Strong points? • Weak points? • Skeptical about anything? • What would you have done differently?

  5. What I would have done differently • Understand what each of the features are a proxy for • Understand the relationship between features • Understand better the relationship between what the features prefer and the “ideal summary” • Do more error analysis on each approach

  6. Typical Speech Features from Prior Work • Prosodic cues can help with relevance ranking • Prior work related to broadcast news shows that prosody helps with identifying key portions of reports • Possible to see what is being emphasized • Maybe possible to identify points of tension in conversations, and then when the tension dissipates…

  7. Motivation for Identifying New Types of Features • Typical goal of summarization: locate regions that are high in content • New Goal: identify regions of high activity, decision making, and planning • Based on your experience with meetings, what would you expect would be the difference between these in terms of what would get picked?

  8. New Features • Speaker activity: how many speakers were active • Discourse cues • Decide, discuss, conclude, agree • Listener feedback: regions of high levels of interaction • Keyword spotting • Frequent content words • Meeting location • Dialogue act length • How do these fit the two goals? • What would be the same or different between these and the LSA based approach? • Are these independent? • Do these prefer the same or different regions?

  9. Why were these sentences picked?

  10. ROUGE-2: Bigram Overlap • First collect one or more “gold standard” summaries • Count percentage of overlap of bigrams between generated summary and each gold standard • You’ll get a score for each gold standard summary • Average across these scores • Some prior work indicated that this might not work well for meeting summarization. Why would that be the case?
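
A minimal sketch of the ROUGE-2 procedure described on this slide, assuming summaries arrive as lists of lowercased tokens; the function names are illustrative and not part of the official ROUGE toolkit.

    from collections import Counter

    def bigrams(tokens):
        """Multiset of adjacent word pairs in a token list."""
        return Counter(zip(tokens, tokens[1:]))

    def rouge_2(candidate, reference):
        """Bigram overlap of a generated summary against one gold summary
        (recall-style: overlapping bigrams / bigrams in the gold summary)."""
        cand, ref = bigrams(candidate), bigrams(reference)
        overlap = sum(min(cand[bg], ref[bg]) for bg in ref)
        return overlap / max(sum(ref.values()), 1)

    def rouge_2_averaged(candidate, gold_summaries):
        """Score against each gold standard, then average across the scores."""
        return sum(rouge_2(candidate, g) for g in gold_summaries) / len(gold_summaries)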

  11. Weighted Precision • Each gold standard summary is constructed through a process • Write a summary that answers specific questions • Draw links between sentences in this summary and sentences in the transcript • Each line in the transcript is weighted based on how many links point to it • Summaries evaluated based on how many high weight sentences are included
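
A rough sketch of the weighted-precision idea from this slide, assuming one annotator's links are given as (summary_sentence_id, transcript_line_id) pairs; the exact normalization used in the paper may differ.

    def line_weights(links):
        """links: (summary_sentence_id, transcript_line_id) pairs drawn by one
        annotator. Each transcript line is weighted by how many links point to it."""
        weights = {}
        for _, line_id in links:
            weights[line_id] = weights.get(line_id, 0) + 1
        return weights

    def weighted_precision(extracted_lines, weights):
        """Average link weight of the transcript lines the summarizer extracted."""
        if not extracted_lines:
            return 0.0
        return sum(weights.get(i, 0) for i in extracted_lines) / len(extracted_lines)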

  12. Specific Questions Guiding Gold Standard Formation • Why are they meeting and what are they talking about? • Decisions made by the group? • Progress and achievements • Problems described

  13. What would you expect to be different? • First collect one or more “gold standard” summaries • Count percentage of overlap of bigrams between generated summary and each gold standard • You’ll get a score for each “gold” summary • Average across • Each gold standard summary is constructed through a process • Summary answers specific questions • Draw links • Weight by links • How many high weight sentences are included

  14. Speaker and Discourse Features in Summarization Gabriel Murray, Steve Renals, Jean Carletta, Johanna Moore, University of Edinburgh

  15. Research Question Can speech summarizers be improved by using speech and discourse features?

  16. Problems with Spoken Dialogue • Sparse information • Speech is hard to recognize • Speech not as fluent as text

  17. Possible Solution #1 • Incorporate speech and discourse features before dimensionality reduction • Features incorporated: • Speaker activity • Discourse cues • Listener feedback • Simple keyword spotting • Meeting location • Dialogue act length in words (using LSA) • Use intuition when considering features

  18. Speaker Activity Features • Identify the speaker of each dialogue act • Check if that speaker just spoke before or afterwards • Identify the speakers on both sides of a dialogue act • Check how many speakers talked in the past and next 5 dialogue acts • Identify areas of high speaker activity • Intuition: Key utterances will be discussed
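
A sketch of the speaker-activity features under the five-act window named on the slide; the exact feature encoding is an assumption.

    def speaker_activity_features(dialogue_acts, i, window=5):
        """dialogue_acts: list of (speaker_id, text) tuples in meeting order.
        Returns simple activity features for the dialogue act at position i."""
        speaker = dialogue_acts[i][0]
        prev_acts = dialogue_acts[max(0, i - window):i]
        next_acts = dialogue_acts[i + 1:i + 1 + window]
        return {
            # did the same speaker also hold the floor just before / after?
            "same_speaker_before": i > 0 and dialogue_acts[i - 1][0] == speaker,
            "same_speaker_after": i + 1 < len(dialogue_acts)
                                  and dialogue_acts[i + 1][0] == speaker,
            # how many distinct speakers were active in the surrounding windows
            "speakers_in_prev_window": len({s for s, _ in prev_acts}),
            "speakers_in_next_window": len({s for s, _ in next_acts}),
        }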

  19. Superficial Features • Discourse cues • Look for words like “decide”, “discuss”, “We should…” that may indicate action items • Listener feedback • Look for ACKs after dialogue acts (e.g., “okay”, “yeah”, “right”) that may indicate feedback • Keyword spotting • Look for top 20 frequent words (not stopwords)
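
A sketch of the three superficial cues on this slide; the cue words, acknowledgment words, and stoplist below are illustrative stand-ins rather than the authors' actual lists.

    from collections import Counter

    DISCOURSE_CUES = {"decide", "decided", "discuss", "conclude", "agree"}   # assumed
    ACK_WORDS = {"okay", "yeah", "right", "uh-huh"}                          # assumed
    STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "it", "we", "i"} # toy list

    def top_keywords(dialogue_acts, n=20):
        """Top-n frequent non-stopword words over the whole meeting."""
        counts = Counter(w for _, text in dialogue_acts
                         for w in text.lower().split() if w not in STOPWORDS)
        return {w for w, _ in counts.most_common(n)}

    def superficial_features(dialogue_acts, i, keywords):
        words = set(dialogue_acts[i][1].lower().split())
        next_words = (set(dialogue_acts[i + 1][1].lower().split())
                      if i + 1 < len(dialogue_acts) else set())
        return {
            "has_discourse_cue": bool(words & DISCOURSE_CUES),
            "followed_by_ack": bool(next_words & ACK_WORDS),   # listener feedback
            "keyword_hits": len(words & keywords),             # keyword spotting
        }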

  20. Structural Feature • Weigh middle and later dialogue acts higher than early and ending ones • Intuition: Less focus on small talk • Problem: Is this a bit too harsh? What about end action items to conclude a meeting session?
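
One plausible realization of the structural feature; the slide only says that middle and later dialogue acts are weighted above the opening and the very end, so the shape and thresholds below are assumptions.

    def position_weight(i, n_acts):
        """Down-weight opening small talk and the very end of the meeting."""
        pos = i / max(n_acts - 1, 1)   # relative position in [0, 1]
        if pos < 0.10:                 # opening small talk
            return 0.5
        if pos > 0.95:                 # closing remarks
            return 0.75
        return 1.0                     # middle and later acts kept at full weight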

  21. Dialogue Act Length Feature • Idea: Relevant dialogue acts will be longer than others • Rank sentences using a Latent Semantic Analysis (LSA) sentence score

  22. Build a Matrix of Features • Then perform dimensionality reduction on the matrix • Weigh discourse and listener feedback cues higher • [Slide diagram: overall feature matrix reduced to an LSA sentence score]
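
A sketch of the matrix step using NumPy, assuming one row per dialogue act and one column per feature; the boost factor for the cue columns and the use of the reduced-rank row magnitude as the "LSA sentence score" are assumptions, not the paper's exact formulation.

    import numpy as np

    def lsa_sentence_scores(feature_matrix, cue_columns, boost=2.0, rank=2):
        """feature_matrix: (n_dialogue_acts x n_features) array.
        cue_columns: indices of the discourse-cue and listener-feedback columns,
        which the slide says are weighted higher (the boost value is assumed)."""
        m = feature_matrix.astype(float).copy()
        m[:, cue_columns] *= boost
        # dimensionality reduction via a reduced-rank SVD reconstruction
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        rank = min(rank, len(s))
        reduced = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
        # score each dialogue act by the magnitude of its row in the reduced space
        return np.linalg.norm(reduced, axis=1)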

  23. Possible Solution #2: LSA Centroid • Base a centroid (pseudo-document) on the top 20 keywords from a tf-idf score calculation on all words in the meeting • Perform LSA on a huge corpus (Broadcast news and ICSI data) • LSA Centroid: Average of constituent keyword vectors

  24. LSA Centroid (cont’d) • For dialogue acts: • Fetch the LSA vectors of the constituent words of each dialogue act • Each dialogue act is represented as the average of its word vectors • Task: Find dialogue acts with the highest cosine similarity with the centroid and add these acts to the summary as necessary
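
A sketch of the centroid scoring described on these two slides, assuming the LSA word vectors are available as a dict from word to NumPy array; the 350-word budget from the experiment slides is not enforced here.

    import numpy as np

    def average_vector(words, word_vectors):
        """Average the LSA vectors of the given words (unknown words skipped)."""
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def rank_by_centroid(dialogue_acts, keywords, word_vectors):
        """Centroid = average of the top tf-idf keyword vectors; dialogue acts
        are ranked by cosine similarity of their average word vector to it."""
        centroid = average_vector(keywords, word_vectors)
        scored = []
        for i, (_, text) in enumerate(dialogue_acts):
            v = average_vector(text.lower().split(), word_vectors)
            if v is None:
                continue
            cos = float(v @ centroid) / (np.linalg.norm(v) * np.linalg.norm(centroid))
            scored.append((cos, i))
        return sorted(scored, reverse=True)   # best candidates first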

  25. Combined Approach • Idea: Combine Speaker/Discourse features with LSA Centroid • System devises two ranks for the two methods • Could they be combined more intuitively?
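
The slide only says the system "devises two ranks for the two methods"; one simple way to merge them is to sum rank positions, sketched below as a possibility rather than the authors' actual scheme.

    def combine_ranks(rank_a, rank_b):
        """rank_a, rank_b: lists of dialogue-act indices, best first.
        Items missing from one list are penalized with that list's length."""
        pos_a = {idx: p for p, idx in enumerate(rank_a)}
        pos_b = {idx: p for p, idx in enumerate(rank_b)}
        combined = {idx: pos_a.get(idx, len(rank_a)) + pos_b.get(idx, len(rank_b))
                    for idx in set(rank_a) | set(rank_b)}
        return sorted(combined, key=combined.get)   # lower combined rank = better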

  26. Experiment with ICSI corpus • Used 6 meetings of about one hour each • Each about 10K words • Goal: Form 350-word automatic summaries • Summaries formed by extracting dialogue acts from both manual transcripts and ASR output

  27. Experiment (cont’d) • First Task: Textual summary • Human corpus annotators built textual summaries of the meetings • Audience: An interested researcher • Parameters to be filled: • Paper-like abstract • Decisions made during the meeting • Progress and goals achieved • Problems discussed • 200 word limit per parameter

  28. Experiment (cont’d) • Second Task: Extractive summary • Human annotators extracted dialogue acts that would yield the information formed in their textual summaries • Sub-task: For each extracted dialogue act • annotators selected the sentences from the textual summary that supported the dialogue act

  29. First Evaluation • Calculate weighted precision • (the fraction of extracted dialogue acts that are relevant, weighted by link counts) • For each auto-extracted dialogue act, count the number of times each annotator links the dialogue act to a summary sentence • Used 3 annotators and averaged this count among all three
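
Continuing the weighted-precision sketch from the earlier evaluation slide: score against each of the three annotators' link sets, then average (a straightforward reading of the slide, not necessarily the paper's exact formula; reuses line_weights and weighted_precision from above).

    def weighted_precision_avg(extracted_lines, annotator_links):
        """annotator_links: one list of (summary_sentence_id, transcript_line_id)
        pairs per annotator. Score against each annotator, then average."""
        scores = [weighted_precision(extracted_lines, line_weights(links))
                  for links in annotator_links]
        return sum(scores) / len(scores)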

  30. Second Evaluation: ROUGE • Used ROUGE-2 • Calculates bigram overlap between human-generated abstracts and automated dialogue extracts

  31. Weighted Precision Results • Best performer: Speech features • Used ANOVA (significant, p<0.05) • Best feature: Dialogue act length, for both manual & ASR! • Why was “Combined” so poor?

  32. ROUGE-2 Results • Best performer: LSA Centroid • But… random summaries approach (manual) not significantly worse than other approaches • Raises questions about ROUGE-2 reliability • Both did really poorly with ASR output!

  33. Example Text Snippet • Speech features approach

  34. Conclusions and Future Work • No significant correlation between weighted precision and ROUGE scores on the ICSI test set • For both ASR and manual methods • Improve upon current combination approach? • Use ML techniques (e.g., SVMs) • Good to see summarization approaches not heavily affected by the errors in ASR

  35. Questions Unanswered • Interesting that length was so important, but what was the length distribution in the corpus? And which kinds of sentences ended up being longer? • Is it possible that longer sentences contain more important content by chance? • What was the real qualitative distinction between summaries generated with the two techniques? Does combining really make sense? • What is your take away message from this?

  36. Questions?
