Unraveling Corpus Data with InfoMagnets: Topic Segmentation Insights

InfoMagnets: Making Sense of Corpus Data Jaime Arguello Language Technologies Institute

Outline • InfoMagnets • Applications • Topic Segmentation • Conclusions • Q/A

Defining Exploratory Corpus Analysis • Getting a “sense” of your data • How does it relate to: • Information retrieval • Need to understand the whole corpus • Data mining • Need rich interface to support serendipitous search • Text classification • Need to find the “interesting” classes

InfoMagnets

InfoMagnets Applications • Behavioral Research • 2 publishable results (submitted to CHI) • CycleTalk Project, LTI • New findings on mechanisms at work in guided exploratory learning • Robert Kraut’s Netscan Group, HCII • Conversational Interfaces • Corpus organization makes authoring conversational agents less intimidating. Rose, Pai, & Arguello (2005); Gweon et al. (2005)

Authoring Conversational Interfaces • Goal: Make Authoring CI’s easier • Solution: • Guide development with pre-processed sample human-human conversations • Addresses different issues • Accessible to non-computational linguists • Developers ≠ domain experts • Consistent with user-centered design: “The user is not like me!”

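Authoring Conversational Interfaces • Transcribed human-human conversations • Topic Segmentation • Constructing a Master Template [Figure: transcribed conversations are segmented into topics A, B, and C, which are combined into a master template]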

Topic Segmentation • Preprocess for InfoMagnets • But, an important computational linguistics problem in its own right! • Previous Work • Marti Hearst’s TextTiling (1994) • Beeferman, Berger, and Lafferty (1997) • Barzilay and Lee (2004) NAACL best paper award! • ….. • But, should it all fall under “topic segmentation”?

Topic Segmentation of Dialogue • Dialogue is Different: • Very little training data • Linguistic Phenomena • Ellipsis • Telegraphic Content • Coherence is organized around a shared task, not primarily around a single flow of information

Coherence Defined Over Shared Task • Lots of places where there is no overlap in “meaningful” content

Coherence Defined Over Shared Task Multiple topic shifts in regions w/ zero lexical cohesion

Experimental Condition • 22 student-tutor pairs • Conversation captured through mainstream chat client • Thermodynamics domain • Training and test data coded by one coder • Results shown in terms of P_k (Lafferty & Beeferman, 1999) • Significance tests: two-tailed t-tests
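
The P_k metric used for the results can be illustrated with a minimal sketch. It assumes each segmentation is given as one segment id per utterance, and that the probe distance k defaults to half the mean reference segment length (a common convention, not something the slides state). The function name `p_k` and its parameters are hypothetical.

```python
def p_k(reference, hypothesis, k=None):
    """Fraction of probe pairs (i, i + k) on which the two segmentations
    disagree about whether both utterances lie in the same segment.
    Lower is better; 0.0 means the segmentations agree on every probe."""
    n = len(reference)
    if n != len(hypothesis):
        raise ValueError("segmentations must cover the same utterances")
    if k is None:
        # Assumed default: half the mean reference segment length.
        num_segments = 1 + sum(
            1 for a, b in zip(reference, reference[1:]) if a != b
        )
        k = max(1, round(n / (2 * num_segments)))
    if n <= k:
        raise ValueError("need more utterances than the probe distance k")
    errors = 0
    for i in range(n - k):
        same_in_reference = reference[i] == reference[i + k]
        same_in_hypothesis = hypothesis[i] == hypothesis[i + k]
        if same_in_reference != same_in_hypothesis:
            errors += 1
    return errors / (n - k)
```

For example, a hypothesis that places no boundary at all in a six-utterance dialogue with one true boundary in the middle is penalized on every probe that straddles that boundary, while a perfect hypothesis scores 0.0.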

1st Attempt: TextTiling • TextTiling (Hearst, 1997) • Slide two adjacent “windows” (w1, w2) down the text • At each step calculate the cosine correlation between the windows • Use correlation values to calculate “depth” • “Depth” values higher than a threshold correspond to topic shifts
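
The windowing steps above can be sketched roughly as follows. This is a simplified illustration, not Hearst's full algorithm: it assumes utterances are already tokenized, uses raw term counts, and omits smoothing of the similarity curve. All function names, the `window` size, and the threshold are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def gap_similarities(utterances, window=2):
    """Similarity of the windows on either side of each gap.
    Entry i describes the gap between utterance i and utterance i + 1."""
    sims = []
    for gap in range(1, len(utterances)):
        left = Counter(t for u in utterances[max(0, gap - window):gap] for t in u)
        right = Counter(t for u in utterances[gap:gap + window] for t in u)
        sims.append(cosine(left, right))
    return sims

def depth_scores(sims):
    """Depth of each valley: how far similarity rises on both sides."""
    depths = []
    for i, s in enumerate(sims):
        left = i
        while left > 0 and sims[left - 1] >= sims[left]:
            left -= 1
        right = i
        while right < len(sims) - 1 and sims[right + 1] >= sims[right]:
            right += 1
        depths.append((sims[left] - s) + (sims[right] - s))
    return depths

def boundaries(depths, threshold):
    """Indices of utterances that start a new segment."""
    return [i + 1 for i, d in enumerate(depths) if d > threshold]
```

Note that when two windows share no terms the similarity is exactly 0, so in sparse dialogue many neighboring gaps tie at the bottom of a valley and the depth score alone cannot choose among them.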

TextTiling Results • Trend for TextTiling to perform worse than degenerate baselines • Difference not statistically significant • Why doesn’t it work?

TextTiling Results • Lots of gaps where the correlation = 0 • Must select boundary heuristically • And, still a heuristic improvement on the original

TextTiling Results • But, topic shifts tend NOT to occur where corr > 0.

2nd Attempt: Barzilay and Lee (2004) • Cluster utterances • Treat each cluster as a “state” • Construct HMM • Emission probabilities: state-specific language models • Transition probabilities: based on location and cluster-membership of the utterances • Viterbi re-estimation until convergence
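
The re-estimation loop above can be sketched in simplified form. This is an illustration under stated assumptions, not Barzilay and Lee's implementation: it swaps in add-alpha smoothed unigram language models for the state emissions, estimates transitions from counts of adjacent labels, assumes a uniform start distribution, and takes the initial clustering as given. All names are hypothetical.

```python
import math
from collections import Counter

def train_unigram(utterances, vocab_size, alpha=1.0):
    """Return a log-probability function for an add-alpha unigram model."""
    counts = Counter(t for u in utterances for t in u)
    total = sum(counts.values())
    def logprob(utterance):
        return sum(
            math.log((counts[t] + alpha) / (total + alpha * vocab_size))
            for t in utterance
        )
    return logprob

def estimate_transitions(labels, num_states, alpha=1.0):
    """Smoothed log transition matrix from adjacent state labels."""
    counts = [[alpha] * num_states for _ in range(num_states)]
    for a, b in zip(labels, labels[1:]):
        counts[a][b] += 1
    return [[math.log(c / sum(row)) for c in row] for row in counts]

def viterbi(utterances, emit, trans):
    """Most likely state sequence (uniform start distribution assumed)."""
    states = range(len(emit))
    score = [emit[s](utterances[0]) for s in states]
    back = []
    for utterance in utterances[1:]:
        new_score, pointers = [], []
        for s in states:
            prev = max(states, key=lambda p: score[p] + trans[p][s])
            pointers.append(prev)
            new_score.append(score[prev] + trans[prev][s] + emit[s](utterance))
        score = new_score
        back.append(pointers)
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def reestimate(utterances, labels, num_states, max_iters=20):
    """Alternate model estimation and Viterbi decoding until labels stop changing."""
    vocab_size = len({t for u in utterances for t in u})
    for _ in range(max_iters):
        emit = [
            train_unigram(
                [u for u, l in zip(utterances, labels) if l == s], vocab_size
            )
            for s in range(num_states)
        ]
        trans = estimate_transitions(labels, num_states)
        new_labels = viterbi(utterances, emit, trans)
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Topic boundaries are then read off wherever the decoded state changes between adjacent utterances.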

B&L Results • B&L statistically better than TT, but not better than degenerate algorithms

B&L Results • Topic boundaries are too fine-grained • Most clusters based on “fixed expressions” (e.g. “ok”, “yeah”, “sure”) • Remember: cohesion based on shared task • Are state-based language models sufficiently different?

Incorporating Dialogue Dynamics • Dialogue Act coding scheme • Not originally developed for segmentation, but for discourse analysis of human-tutor dialogues • 4 main dimensions: • Action: open question, closed question, negation, etc. • Depth: (yes/no) is the utterance accompanied by an explanation or elaboration • Focus: (binary) is focus on speaker or other agent • Control: Initiation, Response, Feedback • Dialogue Exchange (Sinclair and Coulthard, 1975)

3rd Attempt: Cross-Dimensional Learning • (Donmez, 2004) • Use estimated labels on some dimensions to learn other dimensions • 3 types of Features: • Text (discourse cues) • Lexical coherence (binary) • Dialogue Acts labels • 10-fold cross-validation • Topic Boundaries learned on estimated labels, not hand coded ones!
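
The three feature types above might be combined per utterance along these lines. This is only a sketch: the cue-word list, the feature names, and the function `utterance_features` are hypothetical, and the slides do not specify the classifier or the exact feature definitions. The dialogue act labels passed in are assumed to be the estimated (predicted) ones, not hand-coded ones.

```python
# Hypothetical discourse cues; the actual cue set is not given in the slides.
CUE_WORDS = ("ok", "so", "now", "next")

def utterance_features(tokens, prev_tokens, estimated_acts):
    """Build a feature dict for deciding whether an utterance starts a new topic.

    tokens, prev_tokens: token lists for this utterance and the previous one.
    estimated_acts: mapping from dialogue act dimension to its predicted label,
                    e.g. {"control": "Initiation", "depth": "no"}.
    """
    features = {}
    # Text features: does the utterance open with a discourse cue?
    for cue in CUE_WORDS:
        features["cue=" + cue] = int(bool(tokens) and tokens[0] == cue)
    # Lexical coherence (binary): any word shared with the previous utterance?
    features["lexical_overlap"] = int(bool(set(tokens) & set(prev_tokens)))
    # Dialogue act features: one indicator per estimated dimension label.
    for dimension, label in estimated_acts.items():
        features[dimension + "=" + label] = 1
    return features
```

Feeding estimated labels to the boundary learner at training time keeps the training and test conditions matched, since hand-coded labels would not be available for unseen dialogues.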

X-Dimensional Learning Results • X-DIM statistically better than TT and degenerate algorithms!

Statistically Significant Improvement

Future Directions • Merge cross-dimensional learning (w/ dialogue act features) with B&L content modeling HMM approach • Explore other work in topic segmentation of dialogue

Recap • InfoMagnets and applications • Corpus exploration and authoring of CI’s • Challenges of topic segmentation of dialogue • Comparison of TextTiling, Barzilay & Lee, and X-DIM against degenerate methods and each other

Q/A Thank you!
