
CSCE 561 Social Media Projects

Ryan Benton, October 8, 2012



Presentation Transcript

  1. CSCE 561 Social Media Projects • Ryan Benton • October 8, 2012

  2. Social Media • 30 billion pieces of content shared each month • 140 million daily tweets • 153 billion US SMS messages in 2009 • Sources: Facebook; Twitter; CTIA

  3. Social Media Processing

  4. Twitter • Tweets • User • Sender information • Name • Display name • Location • Follower and friend counts • Whether it is directed to other users • If a retweet, who it is from • Tweet • The message • Hashtags • Date and time • Media information
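The tweet fields listed on this slide could be modeled as a small record type. The following is a minimal sketch using Python dataclasses; the class names, field names, and sample values are illustrative assumptions and are not the Twitter API's own names.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Sender:
    # Sender information from the slide (names here are hypothetical).
    name: str
    display_name: str
    location: Optional[str]
    follower_count: int
    friend_count: int


@dataclass
class Tweet:
    sender: Sender
    message: str
    created_at: datetime                                    # date and time
    hashtags: List[str] = field(default_factory=list)
    directed_to: List[str] = field(default_factory=list)    # users the tweet addresses
    retweet_of: Optional[str] = None                        # original author, if a retweet
    media_urls: List[str] = field(default_factory=list)     # media information


# Made-up sample record for illustration only.
example = Tweet(
    sender=Sender("user_a", "User A", "Austin, TX", 120, 80),
    message="#UTAustin evacuated due to #Bombthreat",
    created_at=datetime(2012, 9, 14, 9, 30),
    hashtags=["UTAustin", "Bombthreat"],
)
print(example.sender.display_name, example.hashtags)
```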

  5. What are Hashtags? • The # symbol, called a hashtag, is used to mark keywords or topics in a Tweet. It was created organically by Twitter users as a way to categorize messages.
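Pulling hashtags out of tweet text can be sketched with a regular expression. This is a simplified assumption (a `#` followed by word characters); Twitter's real tokenization rules are more involved, and the function name is illustrative.

```python
import re

# Simplified pattern: '#' followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the hashtags in `text`, lowercased and without the '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]


tags = extract_hashtags("#UTAustin evacuated due to #Bombthreat")
print(tags)
```

Lowercasing is a design choice so that `#Hurricane` and `#hurricane` count as the same keyword.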

  6. Representation • Social media data can be converted into graphs • Homogeneous • One node type • One link type • Heterogeneous • One or more node types • One or more link types • Requirement • Either the links or the nodes (or both) must have more than one type.

  7. Nodes • A node represents an object • Examples • Users • Concepts • Hashtags • Locations • May have multiple attributes describing the object

  8. Links • Relationships between nodes • May have more than one attribute
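The graph representation described on the Representation, Nodes, and Links slides can be sketched with plain Python containers. The class, its method names, and the sample nodes are assumptions made for illustration.

```python
class Graph:
    """Typed nodes and links, each carrying optional attributes."""

    def __init__(self):
        self.nodes = {}   # node id -> {"type": ..., other attributes}
        self.links = []   # (source id, target id, link type, attributes)

    def add_node(self, node_id, node_type, **attrs):
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_link(self, source, target, link_type, **attrs):
        self.links.append((source, target, link_type, attrs))

    def is_heterogeneous(self):
        # Heterogeneous if nodes or links (or both) have more than one type.
        node_types = {node["type"] for node in self.nodes.values()}
        link_types = {link[2] for link in self.links}
        return len(node_types) > 1 or len(link_types) > 1


g = Graph()
g.add_node("user_a", "user", location="New Orleans")
g.add_node("#hurricane", "hashtag")
g.add_link("user_a", "#hurricane", "used", count=3)
print(g.is_heterogeneous())
```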

  9. Visualize

  10. Visualize

  11. Problems • Identifying relationships between hashtags in Twitter Data • Identify (Generate) Important Keywords from Tweets

  12. Identifying relationships between hashtags in Twitter Data

  13. The idea • Suppose we have a collection of normal hashtag associations, that is, hashtags that are usually used together. • Can we then identify a developing situation by detecting a “strange” association?

  14. Research Problem • The main goal of the project is to find common associations of entities or groups of “real world” concepts, using a graph structure of hashtags • Cluster the hashtags to form groups of entities and find inter-cluster associations. • Given a collection of hashtags with frequency and user information, can we identify a change in the underlying structure from time t1 to time t2?

  15. Project 1: Cluster Hashtags into Entities • Can we use an underlying graph structure to identify normal associations? • If so, can it be used to identify an association that is not normal? • e.g., #UTAustin evacuated due to #Bombthreat

  16. Project 2: Analyze the transition between events • Suppose we have a collection of hashtags from an emergency event, e.g., a hurricane or forest fire • Suppose we also have a collection of hashtags from before the event happened • Can we identify the transition in the hashtags, such as changes in frequency or associations?
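One simple way to approximate "normal associations" is to count how often pairs of hashtags co-occur and flag pairs that the history has rarely or never seen. This is only a sketch under that assumption; it builds the weighted edges of a hashtag graph but does not perform the clustering the project asks for. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tweets_hashtags):
    """Count how many tweets contain each unordered pair of hashtags."""
    counts = Counter()
    for tags in tweets_hashtags:
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts


def unusual_pairs(baseline, new_tags, min_count=1):
    """Return pairs in a new tweet seen fewer than `min_count` times before."""
    pairs = combinations(sorted(set(new_tags)), 2)
    return [pair for pair in pairs if baseline[pair] < min_count]


# Made-up history of hashtag sets, one list per tweet.
history = [
    ["utaustin", "longhorns"],
    ["utaustin", "longhorns", "football"],
    ["bombthreat", "news"],
]
baseline = cooccurrence_counts(history)
flagged = unusual_pairs(baseline, ["utaustin", "bombthreat"])
print(flagged)
```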
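A first look at the transition could compare each hashtag's relative frequency before and during the event. The sketch below assumes relative frequency is a reasonable measure; the function name and the sample hashtags are made up for illustration.

```python
from collections import Counter


def frequency_shift(before, after):
    """Map each hashtag to its change in relative frequency (after - before)."""
    before_counts, after_counts = Counter(before), Counter(after)
    before_total = sum(before_counts.values()) or 1   # avoid division by zero
    after_total = sum(after_counts.values()) or 1
    tags = set(before_counts) | set(after_counts)
    return {
        tag: after_counts[tag] / after_total - before_counts[tag] / before_total
        for tag in tags
    }


before = ["beach", "beach", "weather", "nola"]
after = ["hurricane", "hurricane", "weather", "nola"]
shift = frequency_shift(before, after)

# Largest movers first, regardless of direction.
for tag, delta in sorted(shift.items(), key=lambda item: -abs(item[1])):
    print(tag, delta)
```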

  17. Identify (Generate) Important Keywords

  18. Why? • Hashtags not sufficient • Example • A tree just flew into my house during #hurricane Isaac

  19. Employ Keyword Selection Methods to Find “Good” Keywords • Multiple methods exist • You can choose or research a method of your choice. • Two are discussed here • “CMore Approach” • “Shixian Chu Approach”

  20. CMore • “NSF CMORE” Filter Approach • Generated as part of NSF • Concept Candidate List • First, a candidate list is generated that contains all phrases of one, two, three, and four words. • Phrases are not allowed to span from one sentence to another.

  21. CMore, cont. • Filter Steps • The probabilistic filter uses various concept frequencies to determine whether or not a concept is of interest. • The filters that it uses are iterative in nature. • Concepts of length one are filtered first, then concepts of length two, and so on. • Several functions are defined that measure the frequency of a concept relative to its prefix and suffix. • Utilizes thresholds • Filtering rules are formed by applying minimum thresholds to the values of these functions. • Once concepts of all lengths have been processed using these rules, the remaining concepts are the relevant ones according to the probabilistic filter.

  22. CMore, cont. • Filter Steps • Stop words filter • If a phrase contains a word in the stop-word list, that concept is removed. • Entity type concepts filter • Concepts that do not parse to a noun phrase are discarded • Commonality filter • Applied only to candidate concepts of length one and two words. • Compares the frequency with which a concept appears in a document to the frequency with which that concept appears in the Reuters [5] corpus.
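The candidate list described on this slide can be sketched as follows. The sentence splitter and tokenizer here are deliberately naive assumptions, and the function name is hypothetical.

```python
import re


def candidate_phrases(text, max_len=4):
    """Return all phrases of 1..max_len words that stay inside one sentence."""
    candidates = []
    # Naive sentence split on '.', '!' and '?'; phrases never cross a split.
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[A-Za-z0-9']+", sentence.lower())
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                candidates.append(" ".join(words[i:i + n]))
    return candidates


phrases = candidate_phrases("Used Jaguar car sale. New Jaguar car.")
print(len(phrases), phrases[:5])
```

Because each sentence is processed separately, a phrase such as "sale new" is never produced.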
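Two of the filters on this slide can be sketched in a few lines. The slide does not give the stop-word list or the exact comparison rule for the commonality filter, so the tiny stop-word set, the ratio test, the `min_ratio` parameter, and all frequencies below are assumptions for illustration, not the CMore implementation.

```python
# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "into", "my", "just", "during"}


def stop_word_filter(concepts):
    """Drop any concept that contains a stop word."""
    return [c for c in concepts if not any(w in STOP_WORDS for w in c.split())]


def commonality_filter(concepts, doc_freq, ref_freq, min_ratio=2.0):
    """Keep short concepts that are relatively more frequent in the document
    than in a reference corpus (assumed rule: ratio >= min_ratio)."""
    kept = []
    for concept in concepts:
        if len(concept.split()) > 2:
            kept.append(concept)   # filter applies only to 1- and 2-word concepts
            continue
        ref = ref_freq.get(concept, 0.0)
        if ref == 0.0 or doc_freq.get(concept, 0.0) / ref >= min_ratio:
            kept.append(concept)
    return kept


concepts = ["tree", "the tree", "house", "hurricane isaac", "time"]
doc_freq = {"tree": 0.02, "house": 0.02, "hurricane isaac": 0.03, "time": 0.01}
ref_freq = {"tree": 0.001, "house": 0.005, "time": 0.02}   # made-up reference values

remaining = commonality_filter(stop_word_filter(concepts), doc_freq, ref_freq)
print(remaining)
```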

  23. Shixian Chu’s Approach

  24. Parent-Network (figure) • Four-word phrase: Used Jaguar car sale (0,0) • Three-word phrases: New Jaguar car (3,0); Used Jaguar car (0,0); Jaguar car sale (0,1) (1,1); Jaguar car model (2,0) • Two-word phrases: Used Jaguar (0,0); Jaguar car (0,1) (1,0) (2,0) (3,1); Car sale (0,2) (1,1); Car model (2,1); New Jaguar (3,0) • Single words: new (3,0); used (0,0); Jaguar (0,1) (1,0) (2,0) (3,1); car (0,2) (1,1) (2,1) (3,2); sale (0,3) (1,2); model (2,2) • Links between phrases are labeled L or R • The figure also marks a root node

  25. Simplified Parent-Network (figure) • Used Jaguar car sale (0,0) • Jaguar car sale (0,1) (1,1) • Jaguar car model (2,0) • Jaguar car (0,1) (1,0) (2,0) • The figure also marks a root node

  26. Parent-Network-based Key Phrase Extraction • Step 1: Document pruning. • Sentence boundaries are marked and non-word tokens are stripped. • Step 2: Document stemming. • Step 3: Creating the Parent-Network. • Step 4: Computing logical frequency. • Logical frequency = physical frequency - the logical frequencies of all its ancestors that have been accepted as key phrases. • If a phrase has no parents, its logical frequency = its physical frequency. • A phrase is a key phrase if its logical frequency >= the frequency threshold of its level. • The order of computation is from higher level to lower level (parent to child).

  27. Phrase Extraction: the catch • Designed to work on documents and/or collections of documents • Tweets are very small
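Step 4 can be sketched as below. The sketch assumes that a phrase's ancestors are the longer phrases that contain it as a contiguous word sequence, that "higher level" means longer phrases, and that the ancestors' logical frequencies are summed before subtracting. The function names, thresholds, and frequencies are illustrative and are not taken from the original implementation.

```python
def is_ancestor(longer, shorter):
    """True if `shorter` occurs as a contiguous word sequence inside `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def extract_key_phrases(physical_freq, thresholds):
    """physical_freq: {tuple of words: count}; thresholds: {phrase length: minimum}."""
    logical = {}
    accepted = []
    # Longest phrases first, so ancestors are decided before their children.
    for phrase in sorted(physical_freq, key=len, reverse=True):
        inherited = sum(logical[a] for a in accepted if is_ancestor(a, phrase))
        logical[phrase] = physical_freq[phrase] - inherited
        if logical[phrase] >= thresholds[len(phrase)]:
            accepted.append(phrase)
    return accepted, logical


# Hypothetical counts loosely modeled on the Jaguar example.
physical = {
    ("used", "jaguar", "car", "sale"): 1,
    ("jaguar", "car", "sale"): 2,
    ("jaguar", "car", "model"): 1,
    ("jaguar", "car"): 4,
}
accepted, logical = extract_key_phrases(physical, thresholds={4: 2, 3: 2, 2: 2})
print(accepted)
print(logical[("jaguar", "car")])
```

In this run, "jaguar car sale" is accepted with logical frequency 2, so "jaguar car" keeps only the 2 occurrences not already explained by that ancestor.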

  28. Logical Frequency • Arithmetic Logical Frequency • Entropy-based Logical Frequency

  29. Solution • Create “tweet” collections • Randomly select X hashtags • For each hashtag, group tweets by time • Hour, day or week • Each hashtag/time group is now a collection
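Grouping tweets into per-hashtag, per-period collections can be sketched as follows. The tuple layout `(timestamp, text, hashtags)`, the function names, and the sample tweets are assumptions; the random selection of X hashtags is left to the caller (for example with `random.sample`).

```python
from collections import defaultdict
from datetime import datetime


def bucket_key(ts, period):
    """Label the hour, day, or ISO week that timestamp `ts` falls in."""
    if period == "hour":
        return ts.strftime("%Y-%m-%d %H:00")
    if period == "day":
        return ts.strftime("%Y-%m-%d")
    if period == "week":
        year, week, _ = ts.isocalendar()
        return f"{year}-W{week:02d}"
    raise ValueError(f"unknown period: {period}")


def build_collections(tweets, selected_hashtags, period="day"):
    """Group tweet texts by (hashtag, time bucket); each group is one collection."""
    collections = defaultdict(list)
    for ts, text, hashtags in tweets:
        for tag in hashtags:
            if tag in selected_hashtags:
                collections[(tag, bucket_key(ts, period))].append(text)
    return dict(collections)


# Made-up tweets for illustration.
tweets = [
    (datetime(2012, 8, 28, 9, 15), "A tree just flew into my house during #hurricane Isaac", ["hurricane"]),
    (datetime(2012, 8, 28, 17, 40), "Power is out #hurricane", ["hurricane"]),
    (datetime(2012, 8, 29, 8, 5), "Streets flooded #hurricane #nola", ["hurricane", "nola"]),
]
daily = build_collections(tweets, {"hurricane"}, period="day")
weekly = build_collections(tweets, {"hurricane"}, period="week")
print(sorted(daily))
```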

  30. Evaluation • Test the impact of changing • Number of hashtags • Time period used to group • Threshold values • What is the impact on the number of keywords? • How much overlap is there? • Do the results look reasonable?
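The overlap question could be measured with the Jaccard index between the keyword sets produced under two settings. Choosing Jaccard is an assumption, since the slide does not name a measure, and the keyword sets below are made up.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union of two keyword sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0   # two empty sets are treated as identical
    return len(a & b) / len(a | b)


hourly_keywords = {"tree", "house", "isaac"}
daily_keywords = {"house", "isaac", "flood", "power"}

print(len(hourly_keywords), len(daily_keywords))   # impact on number of keywords
print(jaccard(hourly_keywords, daily_keywords))    # overlap between the two runs
```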

  31. Resources • Twitter Collection Code • Need to check availability • If not, fairly straightforward to implement. • Database Schema • MySQL

  32. Thank you • Questions?