SNOW Workshop, 8th April 2014

Real-time topic detection with bursty ngrams: RGU participation in SNOW 2014 challenge. Carlos Martin and Ayse Goker (Robert Gordon University). SNOW Workshop, 8th April 2014.

Presentation Transcript


  1. Real-time topic detection with bursty ngrams: RGU participation in SNOW 2014 challenge. Carlos Martin and Ayse Goker (Robert Gordon University). SNOW Workshop, 8th April 2014

  2. Outline • Architecture diagram • Results • Future work

  3. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor; data labels: Tweets, Tweets (English), Tweets (with Entities)]

  4. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder, Topic Labeller; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label)]

  5. Entities Extractor • Extracts entities per tweet using Stanford NER (http://nlp.stanford.edu/software/CRF-NER.shtml). • 3-class model → identifies Person, Location and Organization. • Efficient enough for a real-time system.

  6. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics]

  7. BNgram approach • Detection of bursty ngrams based on the df-idf score → bursty entities, hashtags and urls are also included in the approach. Regarding ngrams, 2- and 3-grams are considered (no unigrams anymore). • Variant of tf-idf → penalization of frequent terms in previous time slots. • Terms containing hashtags, entities or urls are boosted. • Two previous time slots (s = 2) were considered in our experiments.
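
A minimal Python sketch of a df-idf style score, assuming the form commonly given for BNgram: the current document frequency divided by a log-damped average of the frequencies in the s previous time slots. The function name, the sample counts and the boost value of 1.5 are illustrative assumptions, not values from the slides.

```python
import math

def df_idf(df_current, df_previous, boost=1.0):
    """Burstiness of a term: its document frequency in the current slot,
    penalized by its average document frequency in the previous slots."""
    s = len(df_previous)
    avg_prev = sum(df_previous) / s if s else 0.0
    return boost * (df_current + 1) / (math.log(avg_prev + 1) + 1)

# Hypothetical counts of tweets containing the term, with s = 2 previous slots.
bursty = df_idf(50, [2, 3])               # rare before, frequent now
steady = df_idf(50, [48, 52])             # frequent all along
boosted = df_idf(50, [2, 3], boost=1.5)   # e.g. the term contains an entity
```

A term that was already frequent in earlier slots gets a larger denominator, so `steady` scores lower than `bursty` even though both have the same current frequency.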

  8. BNgram approach • A “partial” membership clustering approach is an interesting alternative, as one term could belong to different clusters (for example, the entity “Obama” for the stories “Obama wins in Ohio” and “Obama wins in Illinois”). • The Apriori clustering algorithm has been used in the experiments of the SNOW challenge. • It explores maximal associations between terms based on the number of shared tweets.
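
The idea of maximal associations based on shared tweets can be sketched as a small Apriori-style search. This is an illustration under assumptions, not the authors' implementation: the function name, the `min_shared` threshold and the toy data are hypothetical.

```python
from itertools import combinations

def maximal_term_sets(term_tweets, min_shared=2):
    """term_tweets maps a term to the set of tweet ids containing it.
    Returns the maximal term sets whose terms share >= min_shared tweets."""
    def shared(terms):
        return set.intersection(*(term_tweets[t] for t in terms))

    # Level 1: single terms that appear in enough tweets.
    level = [frozenset([t]) for t in term_tweets
             if len(term_tweets[t]) >= min_shared]
    frequent = list(level)
    while level:
        # Apriori step: join sets of size k into candidates of size k + 1.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if len(shared(c)) >= min_shared]
        frequent.extend(level)
    # Keep only the sets that are not contained in a larger frequent set.
    return [s for s in frequent if not any(s < other for other in frequent)]

term_tweets = {
    "obama": {1, 2, 3, 4},
    "wins in ohio": {1, 2},
    "wins in illinois": {3, 4},
}
clusters = maximal_term_sets(term_tweets)
```

With this toy input, "obama" ends up in two clusters, one with each story, which is the partial membership behaviour described on the slide.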

  9. BNgram approach • Output: clusters of trending terms with tweets from the last time slot associated to them. • A tweet should contain a minimum number of cluster terms to be included. • Clusters are ranked by their bursty scores (maximum df-idf value of topic terms).
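
The output step above can be sketched as follows. The function name, the `min_terms` default and the substring matching are simplifying assumptions for illustration; the slides do not say how terms are matched against tweets.

```python
def build_topics(clusters, scores, tweets, min_terms=2):
    """clusters: list of term sets; scores: term -> df-idf score;
    tweets: tweet id -> text from the latest time slot."""
    topics = []
    for terms in clusters:
        # A tweet joins the topic if it contains enough of the cluster terms.
        matched = [tid for tid, text in tweets.items()
                   if sum(term in text.lower() for term in terms) >= min_terms]
        topics.append({
            "terms": set(terms),
            "tweets": matched,
            "score": max(scores[t] for t in terms),  # bursty score of the cluster
        })
    return sorted(topics, key=lambda t: t["score"], reverse=True)

clusters = [{"snow", "workshop"}, {"obama", "ohio"}]
scores = {"obama": 5.0, "ohio": 9.0, "snow": 3.0, "workshop": 2.0}
tweets = {1: "Obama wins in Ohio", 2: "Obama speaks", 3: "SNOW workshop today"}
ranked = build_topics(clusters, scores, tweets)
```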

  10. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram, Topic Aggregator, Keyword Extractor; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls)]

  11. Keyword Extractor and Topic Aggregator modules • Topic Aggregator module: ◦ Aggregates entities, hashtags and urls per topic (coming from the topic tweets of the corresponding time slot), keeping their frequencies. ◦ Keeps those whose frequency is higher than a threshold. • Keyword Extractor module: ◦ Extracts the main keywords (including ngrams) per topic (those not extracted by the Topic Aggregator) using bursty terms from the clusters. ◦ Removal of urls, hashtags, user mentions, entities and acronyms. ◦ Overlaps are also removed. ◦ Keeps df-idf scores as their weights.
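
A minimal sketch of the Topic Aggregator step, assuming each topic tweet is a dict with `entities`, `hashtags` and `urls` lists. That tweet layout, the function name and the threshold value are hypothetical.

```python
from collections import Counter

def aggregate_features(topic_tweets, threshold=1):
    """Count entities, hashtags and urls over a topic's tweets and keep
    those whose frequency is higher than the threshold."""
    aggregated = {}
    for kind in ("entities", "hashtags", "urls"):
        counts = Counter(item for tweet in topic_tweets
                         for item in tweet.get(kind, []))
        aggregated[kind] = {item: n for item, n in counts.items()
                            if n > threshold}
    return aggregated

topic_tweets = [
    {"entities": ["Obama", "Ohio"], "hashtags": ["#election"], "urls": []},
    {"entities": ["Obama"], "hashtags": ["#election"]},
    {"entities": ["Illinois"], "hashtags": ["#vote"]},
]
features = aggregate_features(topic_tweets)
```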

  12. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics]

  13. Topic Combiner module • Merges similar topics from the same time slot. • Based on the co-occurrence of keywords (unigrams), entities, hashtags and urls from the compared topics. • According to preliminary results, the Apriori algorithm makes this module more accurate, as one term could belong to different topics.
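
One way to read the merging rule is sketched below: two topics are merged when they share at least `min_shared` features. The threshold, the function name and the transitive grouping (via a small union-find) are assumptions; the slides only state that merging is based on co-occurrence.

```python
def merge_topics(topics, min_shared=2):
    """topics: list of feature sets (keywords, entities, hashtags, urls).
    Topics sharing at least min_shared features end up in the same group."""
    parent = list(range(len(topics)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            if len(topics[i] & topics[j]) >= min_shared:
                parent[find(j)] = find(i)

    # Union the features of all topics that fall into the same group.
    merged = {}
    for i, features in enumerate(topics):
        merged.setdefault(find(i), set()).update(features)
    return list(merged.values())

topics = [
    {"obama", "ohio", "#election"},
    {"obama", "#election", "wins"},
    {"snow", "workshop"},
]
merged = merge_topics(topics)
```

Grouping transitively means a chain of pairwise-similar topics collapses into one, which may be too aggressive for some data; a stricter variant would compare each topic against the merged feature set instead.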

  14. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets)]

  15. Query Builder module • Creation of final queries (Solr queries) to retrieve all the tweets related to the topic, also filtering by time (simulating a real-time scenario). • 3 types of queries: ◦ Keywords ◦ Entities and hashtags ◦ Urls • If a topic has both keywords and entities, the keywords closer to the entities are the ones selected. • Image population: if tweets contain links to images (metadata), the images are added to the topic.
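
The three query types and the time filter could be assembled roughly as below. The Solr field names (`text`, `tags`, `urls`, `created_at`) and the topic layout are hypothetical, since the slides do not describe the index schema, and the keyword selection by proximity to entities is left out.

```python
from datetime import datetime, timezone

def quote(value):
    """Quote a value as a Solr phrase, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

def build_queries(topic, until):
    """Return one Solr parameter dict per non-empty query type."""
    # Only tweets up to `until` (a UTC datetime) are visible, which
    # simulates the real-time scenario. Field name is an assumption.
    stamp = until.strftime("%Y-%m-%dT%H:%M:%SZ")
    time_filter = f"created_at:[* TO {stamp}]"
    groups = {
        "text": topic.get("keywords", []),
        "tags": topic.get("entities", []) + topic.get("hashtags", []),
        "urls": topic.get("urls", []),
    }
    queries = []
    for field, values in groups.items():
        if values:
            clause = " OR ".join(quote(v) for v in values)
            queries.append({"q": f"{field}:({clause})", "fq": time_filter})
    return queries

topic = {"keywords": ["wins in ohio"], "entities": ["Obama"],
         "hashtags": ["#election"], "urls": []}
queries = build_queries(topic, datetime(2014, 4, 8, 12, 0, tzinfo=timezone.utc))
```

Each dict holds the `q` and `fq` parameters of a Solr select request; quoting every value as a phrase avoids having to escape characters such as `:` and `/` in urls.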

  16. Query Builder module • Replies are also considered. Be careful with spam replies. • Replies are not text-query dependent. More diversity? • Sentiment analysis, extraction of relevant keywords.

  17. Query Builder module • Diverse tweets are computed based on cosine similarity. • This approach could be more or less strict depending on the selected threshold.
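
A minimal sketch of threshold-based diversity selection, assuming bag-of-words vectors and a greedy pass over the tweets. The vectorization, the greedy order and the default threshold are assumptions; the slides only name cosine similarity and a threshold.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def diverse_tweets(tweets, threshold=0.5):
    """Keep a tweet only if it is less similar than the threshold
    to every tweet kept so far."""
    kept = []
    for text in tweets:
        vec = Counter(text.lower().split())
        if all(cosine(vec, other) < threshold for _, other in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

tweets = [
    "Obama wins in Ohio",
    "obama wins in ohio",
    "Obama wins Ohio tonight",
    "Snow expected in Aberdeen",
]
strict = diverse_tweets(tweets, threshold=0.5)
loose = diverse_tweets(tweets, threshold=0.8)
```

A lower threshold is stricter: at 0.5 the near-duplicate "Obama wins Ohio tonight" is dropped, while at 0.8 it is kept.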

  18. Architecture diagram [figure; components: Crawler, Solr, Entities Extractor, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder, Topic Labeller; data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label)]

  19. Topic Labeller module • BuzzFeed editor-in-chief Ben Smith: “Headlines sure look a lot like tweets these days.” (http://perryhewitt.com/5-lessons-buzzfeed-harvard/) • For each topic tweet, a score is computed based on a formula (not preserved in this transcript), where α = 0.8. The tweet with the highest score is selected as the topic label after cleaning it.

  20. Topic Labeller module • Example of tweets after cleaning them. • Granularity is still an issue: some topic labels are too general or too specific.

  21. Results - Examples of topics

  22. Future work • Improve Topic Combiner module – use of similarity measures. • Further research on the use of replies and diverse tweets per topic. • Improve Topic Labeller module – granularity issue. • Modifications in Query Builder module – use of term weights (Solr).

  23. Thank you! E-mail address: c.j.martin-dancausa@rgu.ac.uk Twitter account: @martincarloscit