A Generative Model for Statistical Determination of Information Content from Conversation Threads

This paper presents a generative model for determining the information content of conversation threads, helping to identify interesting topics and remove irrelevant messages.

Presentation Transcript


  1. A Generative Model for Statistical Determination of Information Content from Conversation Threads Malik Magdon-Ismail Rensselaer Polytechnic Institute magdon@cs.rpi.edu

  2. Motivation
     • Vast online communication
     • Blogs, emails, message boards, newsgroups, …
     • Text analysis is challenging
     • What are the interesting message threads?
     • Detect interesting topics
     • Remove uninformative, spam, or broadcast messages

     Example email:
     Time: January 12, 2005, 09:35
     From: joe@xyz.com
     To: sue@abc.com
     Subject: Hello
     Message: Where have you been?

     Example chat log:
     [16:06:31] <FreeTrade> Republicans were the worst pacifists before ww1 and ww2
     [16:06:43] <SweetLeaf> France Fries
     [16:06:50] <FreeTrade> As a generality, of course their were Republican Hawks.
     [16:07:13] <FreeTrade> Sweet, good pun but bad story!
     [16:07:18] <SweetLeaf> yup
     [16:07:23] <Lupine> anyways, he's perpetually tormented by presidential actions
     [16:07:25] <SweetLeaf> it aint good for no one
     [16:07:47] <SweetLeaf> I think they knew it was commiing
     [16:07:51] <FreeTrade> Rossevelt met monthly in New York with mostly trusted Republicans to talk about how to get america into the war.
     [16:08:10] <FreeTrade> and he spent 2 year with Churchill meeting him sometimes secretly in the ocean to discuss the same topic.
     [16:08:22] <FreeTrade> Exchanging a lot of letters.
     [16:08:25] <FreeTrade> telegrams
     [16:08:28] <Lupine> There really is nothing like a shorn scrotum. It's breathtaking, I suggest you try it.
     [16:08:55] <FreeTrade> Well they didnt literally meet in the ocean, they were on ships.

     IEEE ISI SOCO 2008, Taipei

  3. Emails
     From: stephanie.panus@enron.com
     To: 80 recipients
     Subject: Legal out of office
     FYI - The legal team will be out of town at the Enron Corp. Legal Conference beginning at 12:00 noon on Wednesday, May 2 through Friday, May 4. We will return to the office on Monday, May 7.

     From: lindy.donoho@enron.com
     To: steven.harris@enron.com
     Subject: Gina's Slide
     Did you have any comments? Want anything changed/added/deleted?

     [Figure labels: No thread; H D H D H D]

  4. Forums
     • Garmin Nuvi 650 (a review)
       • 71340 views, 738 replies
       • First post: 11/05/2007
       • Last post: 02/23/2008
       • Talks about functions, price and function comparison with other products and merchants
     • Garmin Nuvi 760 Amazon.com $495.25 w/free shipping (an ad)
       • http://www.fatwallet.com/forums/hot-deals/808394?highlight_key=y&keyword1=garmin
       • 1933 reviews, 5 replies
       • First post: 02/12/2008
       • Last post: 02/12/2008
       • Gives price only

  5. Message Boards
     • THE GOOGLE CHRONICLES
       159 replies. First post: 03/02/2006. Last post: 04/12/2006.
       "Basically, the game has changed. That's about it. It's changed. Many think the game is over. But it is not. It has merely changed. The change is that GOOG stock is no longer the invincible Titan that it was prior to Jan 20 this year. It is no longer impervious to extreme calamities…"
     • CNBC on GOOG right now...
       0 replies. First post: 06/26/2007.
       "Good stock to be in" "We wont even see slowing growth until 4 or 5 years from now"
       NO THREAD

  6. Information Content Factor (ICF)
     • ICF: a statistical measure of information content based only on the structure of the conversation thread, with no text analysis.
     • "Interesting" messages generate long conversation threads.

  7. Toy Example: 4 conversation threads on the same topic. Which one should an analyst look at?

  8. The Generative Model
     • ICF: a message's ability to generate conversation.
     • Let 0 ≤ b ≤ 1 be the ICF of a message.
     • Two essential parts of the generative model:
       • P[reply] = b
       • ICF propagation decay factor f: ICF[reply] = f·b, with 0 < f < 1
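
The two ingredients on this slide can be simulated directly. The following is a minimal sketch, assuming a single chain in which every message receives at most one reply; the names `simulate_thread` and `mean_replies`, the `max_depth` guard, and the parameter values are hypothetical and do not come from the presentation.

```python
import random


def simulate_thread(b, f, rng, max_depth=1000):
    """Sample the number of replies in one chain-shaped thread.

    Assumption (not from the slides): every message gets at most one reply.
    A message with ICF `icf` is answered with probability `icf`, and the
    reply inherits ICF f * icf.
    """
    if not (0.0 <= b <= 1.0 and 0.0 < f < 1.0):
        raise ValueError("need 0 <= b <= 1 and 0 < f < 1")
    replies = 0
    icf = b
    while replies < max_depth and rng.random() < icf:
        replies += 1
        icf *= f  # ICF decays by the factor f at every hop
    return replies


def mean_replies(b, f, trials=20000, seed=0):
    """Monte Carlo estimate of the expected number of replies."""
    rng = random.Random(seed)
    return sum(simulate_thread(b, f, rng) for _ in range(trials)) / trials
```

Calling `mean_replies(0.8, 0.5)` averages 20000 simulated threads; a larger b or a larger f both lengthen the typical thread.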

  9. Single Recipient Process
     [Figure: sender S and recipient R alternate; successive messages carry ICF b, fb, f²b]
     • P[reply] = b, then P[reply] = fb, then P[no reply] = 1 - f²b
     • Number of replies = 2
     • P[thread] = b · fb · (1 - f²b) = b²f(1 - f²b)
     • Let E(b) be the expected number of replies in the thread: E(b) = b(1 + E(fb))
     • b can be computed either to maximize P[thread] or so that E(b) = 2.
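
The quantities on this slide can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, and the closed form in `max_likelihood_icf` is obtained by setting the derivative of bⁿ(1 - fⁿb) to zero and clipping to [0, 1], which is a derivation under the single-recipient assumptions rather than a formula quoted from the slides.

```python
def expected_replies(b, f, tol=1e-15, max_terms=10000):
    """E(b) = b * (1 + E(f*b)), unrolled as a sum over thread depth."""
    total = 0.0
    reach = 1.0  # probability that the thread reaches the current depth
    icf = b
    for _ in range(max_terms):
        reach *= icf  # one more reply arrives with probability icf
        if reach < tol:
            break
        total += reach
        icf *= f
    return total


def thread_probability(b, f, n):
    """P[thread] for exactly n replies followed by no further reply."""
    p = 1.0
    for i in range(n):
        p *= (f ** i) * b  # i-th reply arrives with probability f^i * b
    return p * (1.0 - (f ** n) * b)


def max_likelihood_icf(f, n):
    """b in [0, 1] that maximizes thread_probability(b, f, n)."""
    return min(1.0, n / ((n + 1) * f ** n))
```

For the two-reply thread on this slide, `thread_probability(b, f, 2)` equals b²f(1 - f²b), and with an assumed f = 0.9 the maximizing value is 2 / (3 · 0.81) ≈ 0.823.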

  10. Reply All Process Expected Number of Replies:

  11. Mixed Reply Process Expected Number of Replies:

  12. Toy Example

  13. Selecting f
      [Figure: sender S and recipient R alternate; P[reply] = b, then P[reply] = fb, then P[no reply] = 1 - f²b]
      • 0.1 ≤ f ≤ 0.3 seems appropriate.
      • To select f, choose an informative thread and require that its ICF be large.

  14. ICF for Broadcasts
      • Enron data: 50 training and 50 test email threads.
      • Threads were classified by humans as Broadcast (B) or Non-Broadcast (NB).

  15. ICF for Broadcasts
      • Broadcast if ICF > T.
      • Determine the optimal value of T from the training set.
      • The optimal T depends on the propagation decay factor f.
      • For f < 0.9, training error is minimal at 2.
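
Choosing T from a training set amounts to a one-dimensional search. The sketch below is one possible way to do it, assuming each training thread has already been given an ICF score and a human label; the function name `best_threshold` and any data passed to it are hypothetical, and the decision rule "broadcast if score > T" is taken as stated on the slide.

```python
def best_threshold(scores, is_broadcast):
    """Return (T, training_error) for the rule: broadcast if score > T."""
    if not scores or len(scores) != len(is_broadcast):
        raise ValueError("need equally long, non-empty inputs")
    distinct = sorted(set(scores))
    # Candidate cuts: below all scores, between neighbours, above all scores.
    cuts = [distinct[0] - 1.0]
    cuts += [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    cuts.append(distinct[-1] + 1.0)

    best_t, best_errors = cuts[0], len(scores) + 1
    for t in cuts:
        # Count training threads whose predicted label disagrees with the human label.
        errors = sum((s > t) != label for s, label in zip(scores, is_broadcast))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors / len(scores)
```

Because the optimal T depends on f, a search like this would be repeated for each candidate f after recomputing the scores.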

  16. ICF for Broadcasts
      • Enron data: 50 training and 50 test email threads.
      • 94% test accuracy.
      • [Figures: confusion matrix (columns: human label B / Non-B; rows: ICF label B / Non-B) and ROC curve]
      • Classification of a larger data set with information content is under way.

  17. Conclusions and Future Work
      • A statistical method to evaluate how informative a message is by the conversation it triggered.
      • The method is effective and robust in detecting broadcast messages.
      • Future research: applying the methodology to detect interesting topics from more conversation threads.

  18. Thank You! http://www.cs.rpi.edu/~magdon
