
On Frequent Chatters Mining

On Frequent Chatters Mining. Claudio Lucchese, 1st HPC Lab Workshop. Frequent Patterns Mining. Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010. How many patterns do you see in the following dataset?


Presentation Transcript


  1. On Frequent Chatters Mining. Claudio Lucchese, 1st HPC Lab Workshop. 1st HPC Workshop - Claudio Lucchese

  2. Frequent Patterns Mining. Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010. How many patterns do you see in the following dataset?

  3. Frequent Patterns Mining

  4. Frequent Patterns Mining • Usually, rows and columns are not in a “good-looking” order

  5. State of the art • Most recent approaches try to discover the top-k patterns that optimize different cost functions: • Minimize noise (“holes”), or • Minimize MDL: encoding(Patterns) + encoding(Data|Patterns) • Maximize information ratio: number of bits of information w.r.t. the Maximum Entropy Model built on the row and column marginal distributions • Minimize the length of the patterns and the amount of noise (our approach)
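
The last cost function above (pattern length plus noise) can be illustrated with a minimal sketch. It assumes that a pattern is a pair (set of rows, set of columns), that its length is the number of rows plus the number of columns it lists, and that noise is the number of cells where the union of the patterns disagrees with the data. The names `Pattern` and `pattern_cost` are hypothetical; this is not the authors' implementation.

```python
from typing import List, Set, Tuple

# Assumption: a pattern is a (row ids, column ids) pair claiming a block of 1s.
Pattern = Tuple[Set[int], Set[int]]


def pattern_cost(data: List[List[int]], patterns: List[Pattern]) -> int:
    """Return (total pattern length) + (number of noisy cells)."""
    n_rows, n_cols = len(data), len(data[0])
    # Cells that at least one pattern claims to be 1.
    covered = {(r, c) for rows, cols in patterns for r in rows for c in cols}
    # Length: cost of listing the rows and columns of every pattern.
    length = sum(len(rows) + len(cols) for rows, cols in patterns)
    # Noise: cells where the pattern-based reconstruction differs from the data.
    noise = sum(
        1
        for r in range(n_rows)
        for c in range(n_cols)
        if (data[r][c] == 1) != ((r, c) in covered)
    )
    return length + noise


data = [
    [1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0],  # one "hole" inside the block
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],  # one extra 1 outside the block
]
block = ({0, 1, 2, 3}, {0, 1, 2, 3})
cost_with_pattern = pattern_cost(data, [block])
cost_without = pattern_cost(data, [])
```

Under these assumptions, describing the data with the single 4x4 block costs less than describing it with no patterns at all, because the block's length and its two noisy cells together are smaller than the number of 1s left unexplained.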

  6. Evaluation • Unsupervised: • Measure how well the proposed algorithm optimizes the proposed cost function • What is the best cost function? • We are investigating supervised measures: • Unsupervised extraction: extract patterns from a classification/clustering dataset without using class/cluster label information • Supervised evaluation: measure how well the patterns can predict/match classes/clusters • Preliminary result: • Fancy cost functions might not be the best ones

  7. Information Overload in News. Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.

  8. Can we exploit Twitter? • Timeliness • Personalization [Figure: number of mentions of “Osama Bin Laden”]

  9. News Get Old Soon • 90% of the clicks happen within 2 days of publication • Only a few occur early!

  10. T.Rex (Twitter-based news recommendation system) • Builds a user model from Twitter • Signals from user-generated content, social neighbors, and popularity across Twitter and news • Entity-based representation (overcomes vocabulary mismatch) • Learns a personalized news ranking function: • Picks candidates from a pool of related or popular fresh news, ranks them, and presents the top-k to the user

  11. Recommendation Model • The ranking function is user and time dependent • Social model + Content model + Popularity model • The popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting) • The content model measures the relatedness of the bag-of-entities representations of a user's tweet stream and of a news article • The social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network
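
A minimal sketch of how the three components above could be combined. Several details are assumptions not stated on the slide: the half-life parametrization of the exponential forgetting, cosine similarity as the relatedness measure, the mixing weights `alpha`, `beta`, `gamma`, and neighbor weights that are taken as given rather than computed by truncated PageRank. All names (`DecayedCounter`, `cosine`, `score`) are hypothetical.

```python
import math
from typing import Dict

Bag = Dict[str, float]  # bag of entities: entity -> weight


class DecayedCounter:
    """Per-entity mention counter with exponential forgetting."""

    def __init__(self, half_life: float) -> None:
        self.rate = math.log(2.0) / half_life  # assumed parametrization
        self.value: Dict[str, float] = {}
        self.updated: Dict[str, float] = {}

    def get(self, entity: str, now: float) -> float:
        if entity not in self.value:
            return 0.0
        elapsed = now - self.updated[entity]
        return self.value[entity] * math.exp(-self.rate * elapsed)

    def add(self, entity: str, now: float) -> None:
        # Decay the old count to the current time, then count the new mention.
        self.value[entity] = self.get(entity, now) + 1.0
        self.updated[entity] = now


def cosine(a: Bag, b: Bag) -> float:
    """Relatedness of two bags of entities (cosine is an assumption)."""
    dot = sum(w * b.get(e, 0.0) for e, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def score(user: Bag, neighbors: Dict[str, Bag], neighbor_weight: Dict[str, float],
          article: Bag, popularity: DecayedCounter, now: float,
          alpha: float, beta: float, gamma: float) -> float:
    """Weighted sum of social, content and popularity components."""
    content = cosine(user, article)
    # Neighbor weights stand in for the truncated PageRank values.
    social = sum(neighbor_weight[v] * cosine(bag, article)
                 for v, bag in neighbors.items())
    popular = sum(popularity.get(e, now) for e in article)
    return alpha * social + beta * content + gamma * popular


popularity = DecayedCounter(half_life=1.0)
popularity.add("Monti", now=0.0)
popularity.add("Monti", now=1.0)
s = score(
    user={"Monti": 1.0, "Merkel": 1.0},
    neighbors={"v": {"Monti": 2.0}},
    neighbor_weight={"v": 0.5},
    article={"Monti": 1.0},
    popularity=popularity,
    now=2.0,
    alpha=1.0, beta=1.0, gamma=1.0,
)
```

Because the counter stores only one decayed value and one timestamp per entity, updating it is a constant-time operation, which fits the "just counting" streaming design described on the next slide.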

  12. System Overview • Designed to be streaming and lightweight (just counting) • User model is updated continuously

  13. Learning the Weights • Learning-to-rank approach with SVM • Each time the user clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news) • Prune the number of constraints for scalability: • only news published in the last 2 days • only the top-k news for each ranking component • Can optionally include additional features for news articles: • click count, age, etc. (T.Rex+)
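
The constraint generation and pruning described above can be sketched as follows. The sketch assumes each news article has one score per ranking component and that a preference is encoded as the difference of the two feature vectors, a common input format for pairwise learning to rank with a linear SVM. The function names and the toy data are hypothetical, and the SVM training step itself is omitted.

```python
from typing import Dict, List, Set

# Assumption: news id -> [social, content, popularity] component scores.
Features = Dict[str, List[float]]


def candidate_pool(features: Features, published_at: Dict[str, float],
                   now: float, max_age: float, k: int) -> Set[str]:
    """Keep only fresh news that is in the top-k of some ranking component."""
    fresh = [n for n in features if now - published_at[n] <= max_age]
    pool: Set[str] = set()
    if not fresh:
        return pool
    for i in range(len(features[fresh[0]])):
        ranked = sorted(fresh, key=lambda n: features[n][i], reverse=True)
        pool.update(ranked[:k])
    return pool


def preference_pairs(clicked: str, features: Features,
                     pool: Set[str]) -> List[List[float]]:
    """One difference vector per preference clicked_news > non_clicked_news."""
    x_clicked = features[clicked]
    pairs = []
    for other in sorted(pool):
        if other == clicked:
            continue
        pairs.append([a - b for a, b in zip(x_clicked, features[other])])
    return pairs


features = {
    "a": [0.9, 0.1, 0.2],
    "b": [0.2, 0.8, 0.1],
    "c": [0.1, 0.2, 0.9],
    "d": [0.0, 0.0, 0.0],    # fresh, but never in any top-k
    "old": [1.0, 1.0, 1.0],  # high scores, but too old
}
published_at = {"a": 9.5, "b": 9.0, "c": 8.5, "d": 9.9, "old": 1.0}  # in days
pool = candidate_pool(features, published_at, now=10.0, max_age=2.0, k=1)
pairs = preference_pairs("a", features, pool)
```

Each difference vector would be a positive training example for a linear model whose learned weights play the role of the per-component ranking weights; without pruning, every click would generate one constraint per non-clicked article.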

  14. Predicting Clicked News • User-generated content is a very good predictor, albeit very sparse • Click count is a strong baseline but does not help T.Rex+

  15. Predicting Clicked Entities

  16. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time.

  17. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time. • Example: European sovereign-debt crisis [Figure: topics plotted over a time axis; labels: Fiscal Compact, EuroBond, Berlusconi, Obama, New Italian government, Monti, Merkel, Loan, EU, France, Greece]

  18. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time. • Applications: • Given the news the user is currently reading, provide an explanation of the related facts that precede that news • Given a query, provide an explanation of the documents related to that query • Given a set of topics, explain their relations over time • Browse a collection of news by changing the topics of interest, the time window, and the granularity

  19. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time. • A topic is a named entity relevant over time • An interaction is a cluster of news related to some event and relevant in a small time window • It might be important to cover the given time window, but recent events might be more interesting

  20. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time. • Given a maximum number of main topics and interactions, maximize: • Topic coverage and diversity • Time coverage of the events • Cluster similarity • Connectivity of the main topics

  21. Future works (?) • Explain a set of news by showing how the main topics interacted with each other over time. • It is different from news clustering: • Even with a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user • There is some interesting related work aimed at finding chains of news; we are more interested in topic evolution

  22. Thank you!
