
Mining Social Media Communities and Content Akshay Java Ph.D. Dissertation Defense



Presentation Transcript


  1. Mining Social Media Communities and Content Akshay Java Ph.D. Dissertation Defense October 16th 2008

  2. “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.” Thesis Statement

  3. Key Observations • Understanding communication in social media requires identifying and modeling communities • Communities are a result of collective, social interactions and usage.

  4. Contributions • Developed and evaluated innovative approaches for community detection • A new algorithm for finding communities in social datasets • SimCUT, a novel algorithm for combining structural and semantic information • First to comprehensively analyze two important, new social media forms • Feed Readership • Microblogging Usage and Communities • Built systems, infrastructure and datasets for the social media research community

  5. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  6. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  7. Social Media Describes the online technologies and practices that people use to share opinions, insights, experiences, and perspectives and engage with each other. UGC + Social Network ~Wikipedia

  8. It’s about YOU! What you… • Think: blogs • Say: podcasts • See: Flickr, YouTube • Hear: Pandora, Last.fm • Do: Twitter, Jaiku, Pownce

  9. Also about US Who are our... • Friends: Facebook • Colleagues: LinkedIn • Virtual Avatars: Second Life

  10. How We Share What we share • Knowledge: Wikipedia • Links: del.icio.us, StumbleUpon • Love/Hate: Yelp, Upcoming • Location: FireEagle, BrightKite • Spaces: Ustream, Qik

  11. Communities • Social interactions build communities • Shared Interests • Common Beliefs • Events • Organization/Location

  12. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  13. What is a Community • A community in the real world is represented in a graph as a set of nodes that have more links within the set than outside it. • Graph • Citation Network • Affiliation Network • Sentiment Information • Shared Resource (tags, videos, ...) [Figures: Political Blogs, Twitter Network, Facebook Network]

  14. Existing Approaches Clustering Approach • Agglomerative/Hierarchical Incrementally, group similar nodes to form clusters Communities in Football League (Hierarchical Clustering) Football Teams

  15. Existing Approaches Clustering Approach • Agglomerative/Hierarchical Topological Overlap: Similarity is measured in terms of the number of nodes that both i and j link to. (Ravasz et al.)
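As a rough illustration of the topological overlap idea on this slide, the sketch below counts the neighbours that two nodes share. The normalisation by the smaller degree and the extra +1 for a direct link are one common formulation and are assumed here, not taken from the slide. The function name and the toy graph are hypothetical.

```python
def topological_overlap(adj, i, j):
    """Overlap between nodes i and j in an undirected graph.

    adj maps each node to the set of its neighbours. The score is the
    number of shared neighbours (plus 1 if i and j are directly linked),
    divided by the smaller of the two degrees (an assumed normalisation).
    """
    smaller_degree = min(len(adj[i]), len(adj[j]))
    if smaller_degree == 0:
        return 0.0  # an isolated node overlaps with nothing
    common = len(adj[i] & adj[j])
    direct = 1 if j in adj[i] else 0
    return (common + direct) / smaller_degree


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
same_triangle = topological_overlap(adj, 0, 1)
across_bridge = topological_overlap(adj, 2, 3)
```

Nodes inside the same triangle score higher than the two bridge endpoints, which is the signal an agglomerative method would use to group them first.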

  16. Existing Approaches Clustering Approach • Agglomerative/Hierarchical • Divisive/Partition based (Girvan Newman) Normalized Cut (NCut) (Shi, Malik) Political Books

  17. Existing Approaches Graph Laplacian Normalized Cuts • cut(A, B): cost of edges deleted to disconnect the graph • assoc(B, V): total cost of all edges that start from B • The graph is partitioned using the eigenspectrum of the Laplacian (Shi and Malik). • The eigenvector of the graph Laplacian corresponding to its second smallest eigenvalue is the Fiedler vector. • The graph can be recursively partitioned using the sign of the values in its Fiedler vector.
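The normalized cut of a two-way split is usually written NCut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V). The sketch below computes that value and approximates the Fiedler-vector split with plain power iteration, so it needs no numerical library. The iteration count, the start vector and the function names are assumptions for illustration. A practical implementation would use a sparse eigensolver instead.

```python
import math


def ncut(W, A, B):
    """Normalized cut value of the split (A, B) for weight matrix W."""
    nodes = range(len(W))
    cut = sum(W[i][j] for i in A for j in B)
    assoc_a = sum(W[i][j] for i in A for j in nodes)
    assoc_b = sum(W[i][j] for i in B for j in nodes)
    return cut / assoc_a + cut / assoc_b


def fiedler_partition(W, iters=500):
    """Split a connected graph by the sign of an approximate Fiedler vector.

    W is a symmetric weight matrix (list of lists) with no isolated nodes.
    """
    n = len(W)
    deg = [sum(row) for row in W]
    # S = D^-1/2 W D^-1/2, shifted to (S + I) / 2 so that every eigenvalue
    # lies in [0, 1] and power iteration converges to the largest one.
    M = [[0.5 * (W[i][j] / math.sqrt(deg[i] * deg[j]) + (1.0 if i == j else 0.0))
          for j in range(n)] for i in range(n)]
    # The leading eigenvector of S is known in closed form: sqrt(degree).
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Deflate: remove the component along the trivial eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Map back to the generalized eigenvector y = D^-1/2 z used by NCut.
    y = [v[i] / math.sqrt(deg[i]) for i in range(n)]
    A = [i for i in range(n) if y[i] >= 0]
    B = [i for i in range(n) if y[i] < 0]
    return A, B


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = [[0.0] * 6 for _ in range(6)]
for u, w in edges:
    W[u][w] = W[w][u] = 1.0
A, B = fiedler_partition(W)
value = ncut(W, A, B)
```

Applying `fiedler_partition` again to each side gives the recursive partitioning described on the slide.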

  18. Existing Approaches • Modularity Score (Newman et al.) • Measure of quality of clustering: $Q = \sum_i (e_{ii} - a_i^2)$ • $e_{ii}$: fraction of intra-community edges • $a_i^2$: expected value of $e_{ii}$ disregarding communities • Q = 0: Communities are random • Q > 0: Higher values are better • Optimizing modularity is NP-Hard* • Spectral Methods • Heuristics * (Brandes et al.)
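A minimal sketch of the modularity score for an undirected, unweighted edge list, using the standard form Q = sum over communities of (e_ii - a_i^2), where a_i is the fraction of edge ends attached to community i. The function name and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges is a list of (u, v) pairs; community maps each node to a label.
    """
    m = len(edges)
    e_in = {}  # e_ii: fraction of edges with both ends in community i
    a = {}     # a_i: fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)


# Toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
two_groups = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {node: "x" for node in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
```

Putting every node in a single community gives Q = 0, while the natural two-triangle split scores higher, matching the "higher values are better" reading on the slide.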

  19. Limitations Existing methods: • Do not scale well for Web graphs • Fail to exploit the underlying graph’s distributions • Are unable to use available meta-data and semantic features.

  20. “It is possible to develop effective algorithms to detect Web-scale communities using their inherent properties, structure, content.” Thesis Statement

  21. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  22. Special Properties of Social Datasets • The Long Tail • 80/20 Rule or Pareto distribution • Few blogs get most attention/links • Most are sparsely connected • Motivation • Web graphs are large, but sparse • Expensive to compute community structure over the entire graph • Goal • Approximate the membership of the nodes using only a small portion of the entire graph.

  23. Special Properties of Social Datasets • Intuition • communities are defined by the core (A) and the membership of the rest of the network (B) can be approximated by how they link to the core. • Direct Method • NCut (Baseline) • Approximation • Singular Value Decomposition (SVD) • Sampling • Heuristic

  24. Approximating Communities ICWSM ‘08 Nodes ordered by degree • SVD (low rank) • Sampling based Approach • Communities can be extracted by sampling only r columns from the head (Drineas et al.) • Heuristic: Cluster the head to find initial communities. Assign each tail node to the cluster it most frequently links to.
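The heuristic step can be sketched as a majority vote: once the high-degree head nodes have been clustered by any method, each tail node takes the label it links to most often. This is an illustrative reading of the slide, not the dissertation's code. The function name, the toy data and the handling of nodes with no head links are assumptions.

```python
from collections import Counter


def assign_tail(adj, head_labels):
    """Extend community labels from head nodes to tail nodes.

    adj maps each node to a list of the nodes it links to.
    head_labels maps each already clustered head node to its community.
    Tail nodes with no links into the head are left unlabelled.
    """
    labels = dict(head_labels)
    for node, neighbours in adj.items():
        if node in head_labels:
            continue
        votes = Counter(head_labels[n] for n in neighbours if n in head_labels)
        if votes:
            labels[node] = votes.most_common(1)[0][0]
    return labels


# Hypothetical example: four head blogs in two communities, three tail blogs.
head_labels = {"h1": 0, "h2": 0, "h3": 1, "h4": 1}
adj = {
    "h1": ["h2"], "h2": ["h1"], "h3": ["h4"], "h4": ["h3"],
    "t1": ["h1", "h2", "h3"],  # two links into community 0, one into 1
    "t2": ["h3", "h4"],        # links only into community 1
    "t3": ["t1"],              # no links into the head
}
labels = assign_tail(adj, head_labels)
```

Only the head needs an expensive clustering pass, which is where the time and memory savings claimed for the approximation come from.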

  25. Approximating Communities ICWSM ‘08 • Dataset: A blog dataset of 6000 blogs. Heuristic Approximation Original Adjacency Modularity = 0.51

  26. Approximating Communities ICWSM ‘08 • Advantage: Faster detection using a small portion of the graph, less memory. • Complexity: SVD O(n^3), NCut O(nk), Sampling O(r^3), Heuristic O(rk), where n = number of blogs, k = number of clusters, r = number of columns [Figure labels: Similar Modularity, Lower Time; More Time, Low Modularity]

  27. Approximating Communities ICWSM ‘08 Blog Dataset Social network datasets: Additional evaluations using Variation of Information score
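For reference, the variation of information between two clusterings A and B is commonly defined as VI(A, B) = H(A) + H(B) - 2 I(A; B): it is zero when the clusterings agree and grows as they diverge. The sketch below uses natural logarithms. The function name and the sample labelings are hypothetical.

```python
import math
from collections import Counter


def variation_of_information(labels_a, labels_b):
    """Variation of information (in nats) between two labelings.

    labels_a[i] and labels_b[i] are the clusters of item i in each result.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (x, y), c in joint.items():
        # c / count_a[x] equals p(x, y) / p(x); likewise for y.
        vi -= (c / n) * (math.log(c / count_a[x]) + math.log(c / count_b[y]))
    return vi


same = variation_of_information([0, 0, 1, 1], [5, 5, 7, 7])
unrelated = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Cluster names do not matter, only the grouping, so two results that differ by a relabelling have a distance of zero.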

  28. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  29. Tags are free meta-data! • Other semantic features: • Sentiments • Named Entities • Readership information • Geolocation information • etc. • How to combine this for detecting communities?

  30. Social Media Graphs Links Between Nodes and Tags Links Between Nodes Simultaneous Cuts

  31. Communities in Social Media A community in the real world is identified in a graph as a set of nodes that have more links within the set than outside it and share similar tags.

  32. SimCUT: Simultaneously Clustering Tags and Graphs WebKDD ‘08 [Figure: block matrix with Nodes × Nodes, Nodes × Tags, Tags × Nodes and Tags × Tags blocks; Fiedler Vector Polarity] • β = 0: Entirely ignore link information • β = 1: Equal importance to blog-blog and blog-tag • β >> 1: NCut

  33. SimCUT: Simultaneously Clustering Tags and Graphs WebKDD ‘08 [Figures: Clustering Only Links; Clustering Links + Tags] • β = 0: Entirely ignore link information • β = 1: Equal importance to blog-blog and blog-tag • β >> 1: NCut
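One way to read the β settings above is as a weight on the node-node block of a combined nodes-plus-tags matrix. The sketch below builds such a matrix. The block layout [[βG, C], [Cᵀ, 0]], the function name and the toy data are assumptions made for illustration and are not confirmed by the slides. The result would then go to an NCut-style spectral partitioner.

```python
def simcut_matrix(G, C, beta):
    """Combine a node-node matrix G and a node-tag matrix C into one graph.

    G is n x n (links between nodes), C is n x t (links from nodes to tags).
    Returns an (n + t) x (n + t) symmetric matrix with blocks
    [[beta * G, C], [C transposed, 0]] (an assumed layout).
    """
    n, t = len(G), len(C[0])
    top = [[beta * G[i][j] for j in range(n)] + list(C[i]) for i in range(n)]
    bottom = [[C[i][k] for i in range(n)] + [0.0] * t for k in range(t)]
    return top + bottom


# Hypothetical data: two linked blogs and three tags.
G = [[0.0, 1.0],
     [1.0, 0.0]]
C = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
links_ignored = simcut_matrix(G, C, beta=0.0)
balanced = simcut_matrix(G, C, beta=1.0)
```

With β = 0 the node-node block vanishes, so only tag co-occurrence drives the cut. As β grows the tag blocks matter less and the result approaches plain NCut on the link graph.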

  34. Datasets • Citeseer (Getoor et al.) • Agents, AI, DB, HCI, IR, ML • Words used in place of tags • Blog data • derived from the WWE/Buzzmetrics dataset • Tags associated with Blogs derived from del.icio.us • For dimensionality reduction 100 topics derived from blog homepages using LDA (Latent Dirichlet Allocation) • Pairwise similarity computed • RBF Kernel for Citeseer • Cosine for blogs

  35. Clustering Tags and Graphs Clustering Only Links Clustering Links + Tags

  36. Clustering Tags and Graphs Accuracy = 36% Accuracy = 62% Higher accuracy by adding ‘tag’ information

  37. Varying Scaling Parameter β • β >> 1 (Only Graph): Accuracy = 36% • β = 0 (Only Tags): Accuracy = 39% • β = 1 (Graphs & Tags): Accuracy = 62% • Higher accuracy by adding ‘tag’ information Simple Kmeans ~23% Content only, binary Content only ~52% (Getoor et al. 2004)

  38. Effect of Number of Tags, Clusters • Mutual Information • Measures the dependence between two random variables. • Compares results with ground truth Link only has lower MI More Semantics helps Citeseer Similar results for real, blog datasets
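A minimal sketch of mutual information between a predicted clustering and ground-truth labels, using the standard definition I(A; B) = sum of p(a, b) log(p(a, b) / (p(a) p(b))) with natural logarithms. The function name and the sample labelings are hypothetical.

```python
import math
from collections import Counter


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * math.log(p_xy / ((count_a[x] / n) * (count_b[y] / n)))
    return mi


truth = ["AI", "AI", "DB", "DB"]
perfect = mutual_information(truth, [0, 0, 1, 1])
uninformative = mutual_information(truth, [0, 1, 0, 1])
```

A clustering that reproduces the ground truth reaches the entropy of the truth labels, while one that is independent of them scores zero.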

  39. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  40. Tags are one type of meta-data! • Other semantic information: • Sentiments • Named Entities • Readership information • Geolocation information • etc. • How do we get additional semantics?

  41. Additional Semantics • BlogVox (TREC 06, IJCAI/AND 07): • Sentiments and Opinions • SemNews (AAAI SS 05, HICSS 06, IJSWIS): • Named Entities, beliefs, facts • Link Polarity (ICWSM 07): • Sentiment from anchor text • Readership (ICWSM 07): • Feed subscriptions and usage

  42. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  43. Key Observations • Understanding communication in social media requires identifying and modeling communities • Communities are a result of collective, social interactions and usage.

  44. Feeds Readership http://ftm.umbc.edu ICWSM ‘07 Folders Use folder label as topics/tags. Group similar folders together. Rank Feeds under a “topic”

  45. Feed Subscription Statistics ICWSM ‘07 • 83K publicly listed subscribers • 2.8M feeds, 500K are unique • 26K users (35%) use folders to organize subscriptions • Data collected in May 2006 • Although there may be ~50M+ blogs, only a small fraction get continued user attention in the form of subscriptions

  46. Feeds That Matter http://ftm.umbc.edu ICWSM ‘07 • Communities from Feed Subscriptions • A common vocabulary emerges from folder names • Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity. [Figure: Folder Usage; axes: # of Users Using a Folder vs. Rank of a Folder (by number of feeds in it)]

  47. Tag Cloud After Merging Folder names are used as topics. Lower-ranked folders are merged into a higher-ranked folder if there is an overlap and a high cosine similarity.
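The merge rule can be sketched as follows, treating each folder as a vector of feed counts and ranking folders by the number of feeds they hold. The 0.5 threshold, the function names and the toy counts are made up for illustration. The slides only say the similarity must be "high".

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_folders(folders, threshold=0.5):
    """Map each lower-ranked folder to the higher-ranked folder it joins.

    folders maps a folder name to {feed: number of users filing it there}.
    A positive cosine score already implies that the two folders overlap.
    """
    ranked = sorted(folders, key=lambda name: len(folders[name]), reverse=True)
    kept, merged_into = [], {}
    for name in ranked:
        target = next(
            (k for k in kept if cosine(folders[name], folders[k]) >= threshold),
            None,
        )
        if target is None:
            kept.append(name)
        else:
            merged_into[name] = target
    return merged_into


# Hypothetical folders and counts.
folders = {
    "politics": {"feed_a": 5, "feed_b": 4, "feed_c": 3},
    "political blogs": {"feed_a": 2, "feed_b": 2},
    "knitting": {"feed_x": 3, "feed_y": 2},
}
merged = merge_folders(folders)
```

Here "political blogs" folds into "politics" while "knitting" stays separate, since it shares no feeds with the higher-ranked folder.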

  48. Feed Recommendations http://ftm.umbc.edu ICWSM ‘07 • Two feeds are similar if they are categorized under similar folders • If you like X you will like… • Feed Distillation for “Politics” • Merged folders: “political”, “political blogs” • Talking Points Memo: by Joshua Micah Marshall • Daily Kos: State of the Nation • Eschaton • The Washington Monthly • Wonkette, Politics for People with Dirty Minds • http://instapundit.com/ • Informed Comment • Power Line • AMERICAblog: Because a great nation deserves the truth • Crooks and Liars Tech Knitting

  49. Outline • Introduction • Detecting Communities in Social Media • Combining Semantic Information • Case Studies • Feed Usage and Distillation • Microblogging Communities • Future Work • Conclusions

  50. Wikipedia is our collective wisdom Twitter is our collective consciousness
