
Community-Centric Information Exploration on Social Content Sites

Sihem Amer-Yahia and Cong Yu, Yahoo! Research New York. M3SN, Shanghai, China, March 29th, 2009. Acknowledgement (thanks to Facebook): Michael Benedikt, Oxford; Alban Galland, INRIA; Jian Huang, Penn State.


Presentation Transcript


  1. Community-Centric Information Exploration on Social Content Sites • Sihem Amer-Yahia and Cong Yu, Yahoo! Research New York • M3SN, Shanghai, China, March 29th, 2009

  2. Acknowledgement (thanks to Facebook) • Michael Benedikt, Oxford • Alban Galland, INRIA • Jian Huang, Penn State • Laks Lakshmanan, UBC • Julia Stoyanovich, Columbia

  3. Social Networking Sites Are Popular!

  4. An Emerging Trend: Social Content Integration

  5. Social Content Sites (SCS) • Web destinations that let users: • Consume content-oriented information: videos, photos, news articles, etc. • Engage in social activities with their friends and people of similar interests • Two major driving factors: • Incorporating social activities improves the attractiveness of traditional content sites • The “similar traveler” feature significantly increases the time users spend on Yahoo! Travel • Incorporating content is critical to the value of social networking sites • A significant amount of user time is spent browsing other people’s photos, notes, etc. • Privacy: important but not the focus here

  6. Main Challenges in SCS • Social Content Management • Data must be maintained and analyzed at web scale: the underlying social content graph can be enormous (see Edward Chang’s keynote later) • Information has to be integrated from various places • Information Discovery • Current: purely content-based search or simple popularity/rating-based browsing • Future: effectively and efficiently leverage the underlying social graph in more sophisticated ways • Information Presentation • Current: a single ranked list based on relevance • Future: metrics other than relevance / groups and clusters / explanations • Information Discovery and Information Presentation together make up Information Exploration

  7. Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion

  8. Information Exploration on SCS • Major Paradigms • User-based: browsing the content of your friends or of other users you simply follow (Facebook, Twitter) • Search: content-based keyword matching (most traditional content sites) • Recommendations: hotlists, tag-based hotlists (Amazon, many collaborative tagging sites) • Query: database-style querying with complex conditions (not many so far) • But which ones are more frequent?

  9. Search vs. Recommendation • The majority of those searches are in fact recommendations in disguise! • Users are usually NOT looking for specific items • The majority of the sessions are seeking recommendations with geographical and topical constraints • Ranking the paradigms (for neurotic people who insist on orders): “Recommendation” ~ User-based > Search > Query

  10. Recommendation 2.0 (or 1.5) • Users demand a lot more from us • Be able to specify the topic: “what are the good destinations for a romantic trip in Asia?” • Be able to specify the community: “what’s hot among my tech-savvy friends?” • Be able to present the results in ways that can maximize the understanding of those results • But we do know a lot more about the users and the contents (through the users) • Richer user profiles • Friendships and relationships • Content activities (rating movies, tagging travel destinations, etc.) • Communication activities (IM, email, twitter follows, Facebook wall messages, etc.)

  11. Data Model: Social Content Graph • A social content graph is a logical graph structure where the labeled nodes represent users and objects, and the labeled edges represent relations between users and items, as well as activities users perform on items or other users. Both nodes and edges can have structural attributes. • [Figure: example graph with people nodes (id 101 John, id 102 Amy, id 103 Joe), destination nodes (id 301, state: NY; id 302, state: MI, college: umich), edges labeled “friend” and “visit”, and tag labels such as “Yankees”, “NYC”, “Football”, “fun, club”, and “AA”.]
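
The data model on this slide can be sketched as a small in-memory structure. This is only an illustrative sketch: the class names `Node`, `Link`, and `SocialContentGraph` are hypothetical, and the edges between the example nodes are assumptions, since the figure's exact connections are not recoverable from the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    type: str                                   # e.g. "people" or "destination"
    attrs: dict = field(default_factory=dict)   # structural attributes


@dataclass
class Link:
    src: int
    dst: int
    label: str                                  # e.g. "friend", "visit"
    attrs: dict = field(default_factory=dict)   # e.g. tags attached to an activity


class SocialContentGraph:
    def __init__(self):
        self.nodes = {}
        self.links = []

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_link(self, link):
        self.links.append(link)

    def neighbors(self, src, label):
        """Ids reachable from `src` over links with the given label."""
        return [l.dst for l in self.links if l.src == src and l.label == label]


g = SocialContentGraph()
g.add_node(Node(101, "people", {"name": "John"}))
g.add_node(Node(102, "people", {"name": "Amy"}))
g.add_node(Node(301, "destination", {"state": "NY"}))
# Assumed edges, for illustration only.
g.add_link(Link(101, 102, "friend"))
g.add_link(Link(102, 301, "visit", {"tags": ["fun"]}))
```

Keeping attributes on both nodes and links mirrors the statement that both can carry structural attributes.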

  12. A Community-Centric Model • Communities of users as the central concept • Community: a group of “similar” or “related” users • Recommendations are performed per community and results from each community are assembled together in the end • A user can belong to multiple communities: • A vector-based representation: user:<c1:w1, c2:w2, ...> • Challenge I: Community Generation • Topic based • Relationship based • Activity based • Challenge II: Community Selection • Given a user session, determine the appropriate set of communities to work with

  13. A Case Study in Yahoo! Travel on Topic-Based Communities

  14. Data in Yahoo! Travel • Users • Yahoo! users • Destinations • cities, attractions, etc. • Tagging activities: • Destination tags: editorial + custom • Self tags: mostly custom • Goal: • Given a user session (user + tags + region), recommend travel destinations to the user • Approach: • Discover topics for users, tags and destinations based on tagging actions • Construct communities based on topics

  15. Topic Discovery through LDA • Joint work with Jian Huang • Intuition • Each destination is modeled as a document consisting of all the tags used on it • Tags are treated as words in the document • Apply Latent Dirichlet Allocation [BNJ03] • A probabilistic generative model for document topic discovery • Produces the following conditional probabilities • Prob(w|t): given a latent topic, the probability that the tag word belongs to that topic • Prob(t|d): given a document, the probability of the document being about a certain latent topic • Prob(t|u) = (Prob(t|di)), where u tagged di
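
The last bullet derives a user's topic distribution from the destinations the user tagged, but the aggregation operator is not legible in the transcript. The sketch below assumes a plain average over the tagged destinations, which keeps the result a probability distribution; the function name and the example probabilities are hypothetical.

```python
def user_topic_profile(doc_topics, tagged_docs):
    """Aggregate Prob(t|d_i) over the destinations d_i a user tagged.

    Assumption: the aggregate is an unweighted mean.
    """
    if not tagged_docs:
        return {}
    totals = {}
    for d in tagged_docs:
        for topic, p in doc_topics[d].items():
            totals[topic] = totals.get(topic, 0.0) + p
    n = len(tagged_docs)
    return {topic: s / n for topic, s in totals.items()}


# Hypothetical Prob(t|d) values, as an LDA run might produce them.
doc_topics = {
    "Paris": {"arts": 0.75, "romantic": 0.25},
    "Cancun": {"seaside": 0.5, "romantic": 0.5},
}
profile = user_topic_profile(doc_topics, ["Paris", "Cancun"])
```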

  16. Examples of Topic Discovery • Topic 0: Romantic/Luxury • Tokens: romantic, 4-star, shopping, spa, golf, luxury, honeymoon … • Top cities: Cancun, San Juan, Grand Canyon, Bangkok … • Topic 1: Seaside/Water related • Tokens: beach, scuba, summer, diving, fishing, snorkeling, island, lake … • Top cities: San Diego, Honolulu, Chicago, Miami, Seattle … • Topic 2: Arts/Historical/Cultural • Tokens: architecture, sightseeing, art, history, culture, cathedral, castle … • Top cities: Paris, London, Boston, Istanbul, Amsterdam, Rome, Hong Kong … • Topic 3: Nightlife, City-life • Tokens: nightlife, gambling, wine, drinking, exciting, casino … • Top cities: Las Vegas, New York City, Los Angeles, San Francisco … • [Figure: photos labeled Paris, Cancun, San Diego, and Las Vegas]

  17. Community Generation • Identify destinations that belong to the same topic • For each topic t, identify the destinations d that belong to t: D(t) = { d | Prob(t|d) > threshold } • Generate topical communities • D(u) = { d | user u visited destination d } • Interests-based topical community Cinterest(t) = { u | … > threshold } • Expertise-based topical community Cexpert(t) = { u | … > threshold } • Both membership conditions are expressed in terms of D(u) and D(t)
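
The membership conditions for the two community types did not survive in the transcript, so the following sketch fills them with an explicit assumption: interest is measured as the share of a user's destinations that fall in D(t), and expertise as the share of D(t) that the user has visited. These ratios and all names are hypothetical stand-ins, not the formulas from the talk.

```python
def destinations_for_topic(doc_topics, topic, threshold):
    """D(t) = { d | Prob(t|d) > threshold }."""
    return {d for d, probs in doc_topics.items() if probs.get(topic, 0.0) > threshold}


def topical_communities(visited, d_t, threshold):
    """Return (interest_community, expert_community) for one topic.

    Assumed conditions (not from the slides):
      interest:  |D(u) & D(t)| / |D(u)| > threshold
      expertise: |D(u) & D(t)| / |D(t)| > threshold
    """
    interest, expert = set(), set()
    for user, d_u in visited.items():
        overlap = len(d_u & d_t)
        if d_u and overlap / len(d_u) > threshold:
            interest.add(user)
        if d_t and overlap / len(d_t) > threshold:
            expert.add(user)
    return interest, expert


topic_probs = {
    "Miami": {"seaside": 0.8},
    "Honolulu": {"seaside": 0.9},
    "San Diego": {"seaside": 0.7},
    "Paris": {"seaside": 0.1},
}
seaside = destinations_for_topic(topic_probs, "seaside", 0.5)
visited = {
    "amy": {"Miami"},
    "joe": {"Miami", "Honolulu", "Paris"},
    "john": {"Paris"},
}
interest, expert = topical_communities(visited, seaside, 0.5)
```

Under these assumptions Amy counts as interested in seaside destinations without being an expert, because she has covered only one of the three destinations in D(t).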

  18. Community Selection • Each user session consists of three pieces of information • (user, tag, region) • Region is treated as a Boolean constraint • We combine and convert user and tag into a single topical vector Prob(t|(u,w)) = w1·Prob(t|u) + w2·(Prob(w|t)·Prob(t)/Prob(w)) • A community C(t) is selected if Prob(t|(u,w)) is greater than a threshold • The choice between interests-based and expertise-based communities is decided heuristically • User is an expert: leverage interests-based community • User is not an expert: leverage experts-based community
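
The selection formula combines the user's topic profile with a Bayes-rule estimate of Prob(t|w) for the session tag. A minimal sketch follows; computing Prob(w) by summing Prob(w|t)·Prob(t) over topics, the default weights, and the example numbers are all assumptions made for illustration.

```python
def session_topic_scores(p_t_given_u, p_w_given_t, p_t, w1=0.5, w2=0.5):
    """Prob(t|(u,w)) = w1*Prob(t|u) + w2*(Prob(w|t)*Prob(t)/Prob(w))."""
    # Assumption: Prob(w) is obtained by the law of total probability.
    p_w = sum(p_w_given_t[t] * p_t[t] for t in p_t)
    scores = {}
    for t in p_t:
        p_t_given_w = p_w_given_t[t] * p_t[t] / p_w if p_w > 0 else 0.0
        scores[t] = w1 * p_t_given_u.get(t, 0.0) + w2 * p_t_given_w
    return scores


def select_communities(scores, threshold):
    """Topics whose community C(t) is selected for the session."""
    return sorted(t for t, s in scores.items() if s > threshold)


# Hypothetical session: a user who leans romantic searches with the tag "beach".
p_t = {"romantic": 0.5, "seaside": 0.5}
p_beach_given_t = {"romantic": 0.1, "seaside": 0.3}
p_t_given_user = {"romantic": 0.6, "seaside": 0.4}

scores = session_topic_scores(p_t_given_user, p_beach_given_t, p_t)
selected = select_communities(scores, threshold=0.5)
```

With equal weights the tag outweighs the user's profile here, so only the seaside community passes the threshold.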

  19. Example Results

  20. Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion

  21. Querying Items within Social Network • Problem Definition: • Given a user and a query, retrieve items relevant to the query based only on the user’s network • The score of the item depends on keyword and user • E.g., score(u,q,i) = cnt(v), where tag(v,i,q) and friend(u,v) • A common scenario on social tagging sites like del.icio.us • Naïve Approach • One inverted index for each (user,keyword) • Apply top-k threshold algorithms at query time • Optimal efficiency • But, high storage overhead
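
The example scoring function counts the friends of the seeker who tagged an item with the query keyword. A brute-force sketch of that definition is shown below; it scans all tagging actions instead of using inverted indexes or a top-k threshold algorithm, and the data and function names are hypothetical.

```python
def network_scores(user, keyword, friends, taggings):
    """score(u, q, i) = number of v with tag(v, i, q) and friend(u, v)."""
    network = friends.get(user, set())
    taggers = {}
    for v, item, tag in taggings:
        if tag == keyword and v in network:
            taggers.setdefault(item, set()).add(v)
    return {item: len(vs) for item, vs in taggers.items()}


def top_k(scores, k):
    """Highest-scoring items first; ties broken by item id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


friends = {"jane": {"ann", "chris"}}
taggings = [
    ("ann", "i1", "music"),
    ("chris", "i1", "music"),
    ("ann", "i2", "music"),
    ("sarah", "i2", "music"),   # not in Jane's network, so not counted
    ("chris", "i3", "news"),
]
jane_music = network_scores("jane", "music", friends, taggings)
best = top_k(jane_music, 1)
```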

  22. Space Overhead of Per-User Indexing • [Figure: per-user inverted lists of (item, score) entries for tag = music and tag = news, shown for user Jane and user Ann] • Conservative Example: • 100K users, 1M items, 1K tags • 20 tags/item from 5% of the taggers • 10 bytes per inverted list entry • 1 Terabyte index!
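
The 1 Terabyte figure can be reproduced with simple arithmetic. The reading below is an assumption about how the numbers combine (one list entry per item, per tag on that item, per user in the 5%); the slide itself does not spell out the calculation.

```python
users = 100_000
items = 1_000_000
tags_per_item = 20
tagger_percent = 5          # "5% of the taggers"
bytes_per_entry = 10

# Assumed reading: every (item, tag) pair contributes one inverted-list
# entry for each of the 5% of users.
users_per_item_tag = users * tagger_percent // 100        # 5,000
entries = items * tags_per_item * users_per_item_tag      # 10**11
index_bytes = entries * bytes_per_entry                   # 10**12 bytes = 1 TB
```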

  23. Clustering to the Rescue! • Joint work with Michael Benedikt, Laks Lakshmanan, Julia Stoyanovich • Clustered Approach • Group users based on shared behavior into communities • Shared items • Shared items with same tags • One inverted index for each (group,keyword) • Replace item score with score upper bound (among all users in the group) • Sacrifice a bit of query performance for indexing space efficiency • Community-based (seeker clustering): • Groups are keyword-independent • E.g., shared items, shared tags • Behavior-based (tagger clustering): • Groups are keyword-specific • E.g., Jane may be similar to John on “travel”, but very different from him on “sports”.
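
The clustered approach replaces per-user lists with one list per (group, keyword) that stores, for each item, the maximum score over the group's members. A minimal sketch of building such a list follows; the input layout and names are hypothetical, and the query-time step that recomputes exact per-user scores is omitted.

```python
def build_cluster_index(cluster_members, per_user_scores, keyword):
    """One inverted list per cluster, holding score upper bounds.

    per_user_scores maps (user, keyword) -> {item: score}.
    """
    index = {}
    for cluster, members in cluster_members.items():
        bounds = {}
        for user in members:
            for item, score in per_user_scores.get((user, keyword), {}).items():
                # The upper bound is the best score any member would see.
                bounds[item] = max(bounds.get(item, 0), score)
        # Sort by descending bound, as top-k threshold algorithms expect.
        index[cluster] = sorted(bounds.items(), key=lambda kv: (-kv[1], kv[0]))
    return index


per_user_scores = {
    ("jane", "music"): {"i1": 30, "i2": 10},
    ("ann", "music"): {"i1": 5, "i2": 27, "i3": 15},
}
clusters = {"c1": ["jane", "ann"]}
music_index = build_cluster_index(clusters, per_user_scores, "music")
```

Two per-user lists with five entries collapse into one list with three, which is the space saving; the price is that a bound such as 27 for `i2` overstates Jane's true score of 10, so queries may examine more entries before they can stop.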

  24. User Clustering • Two variants: Community-based and Behavior-based • One user, one cluster: one inverted index per cluster, upper-bound score per item • One user, many clusters: one inverted index per cluster, upper-bound score per item • [Figure: example users Sarah, Ann, Jane, and Chris assigned to clusters]

  25. Community-based Clustering: Space

  26. Community-based Clustering: Performance

  27. Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion

  28. Information Presentation • A broad definition: Given the set of relevant items, identify the appropriate subset of items to be shown to the user(s) in the right organization and with the right amount of information. • Appropriate Subset: • Timeliness • Geographical closeness • Diversity • Etc. • Right Organization: • Ranking • Grouping • Faceted Navigation • Right Amount of Information • Explanation

  29. A Case Study in Recommendation • Joint work with Laks Lakshmanan • While relevance is important to recommendation, others are critical too: • Novelty: avoid returning results that users are likely to know already. • Serendipity: aim to return less relevant results that might give users a pleasant surprise. • Diversity: avoid returning results that are too similar to each other. • Recommendation Diversification: From the pool of candidate items, identify a list of items that are dissimilar to each other while maintaining a high cumulative relevance, i.e., strike a good balance between relevance and diversity.

  30. Existing Solutions for Diversification • Attribute-Based Diversification • Diversity semantics: pair-wise distance functions based on item attributes (e.g., movie attributes). • Combining with relevance: • Threshold either relevance or distance, maximize the other • Optimize an overall score as a weighted combination of relevance and distance • Algorithm • Perform traditional recommendation • Obtain the attributes of each candidate item and compute the pair-wise distance • Ad-hoc methods follow how diversity and relevance are combined • A Major Problem: • Lack of attributes suitable for estimating distance between pairs of items: e.g., URLs in del.icio.us, photos on Flickr, videos on Vimeo.
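
One common way to optimize a weighted combination of relevance and distance is a greedy, MMR-style selection. The sketch below is that generic heuristic, not an algorithm taken from the talk; the weight `lam`, the genre-based distance, and the example scores are assumptions.

```python
def diversify(candidates, relevance, distance, k, lam=0.5):
    """Greedily pick k items, trading relevance against distance.

    Each step adds the item maximizing
    lam * relevance + (1 - lam) * (min distance to items already picked).
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(item):
            # With nothing selected yet, treat every item as maximally distant.
            min_dist = min((distance(item, s) for s in selected), default=1.0)
            return lam * relevance[item] + (1 - lam) * min_dist

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical movies with a single attribute used for distance.
genre = {"a": "action", "b": "action", "c": "drama"}
relevance = {"a": 0.9, "b": 0.85, "c": 0.5}


def genre_distance(x, y):
    return 0.0 if genre[x] == genre[y] else 1.0


picked = diversify(["a", "b", "c"], relevance, genre_distance, k=2)
```

The second pick is the drama title even though another action title is more relevant, which is the relevance and diversity balance the slide describes. The approach depends entirely on having an attribute like `genre` to compare, which is the limitation raised in the final bullet.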

  31. Explanation-Based Diversification • Intuition • Explanation is the set of objects because of which a particular item is recommended to the user. • Two items that share similar explanations are likely to be similar to each other. • Explanation for Item-Based Strategies • Explanation for Collaborative Filtering Strategies (social!)

  32. Explanation-Based Diversity • Pair-wise diversity distance between two recommended items • Standard similarity measures like Jaccard similarity and cosine similarity • E.g. (Distance based on Jaccard similarity) • Diversity for the set of recommended results (S)
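
The formulas on this slide are missing from the transcript. The sketch below uses the standard Jaccard distance between two explanation sets and, as an assumption, defines the diversity of a result set as the average pairwise distance; the example explanations (users who led to each recommendation) are hypothetical.

```python
from itertools import combinations


def jaccard_distance(expl_a, expl_b):
    """1 - |A & B| / |A | B| over two explanation sets."""
    union = expl_a | expl_b
    if not union:
        return 0.0
    return 1.0 - len(expl_a & expl_b) / len(union)


def set_diversity(items, explanations):
    """Assumed definition: mean pairwise explanation distance within the set."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = sum(jaccard_distance(explanations[a], explanations[b]) for a, b in pairs)
    return total / len(pairs)


# In collaborative filtering, an explanation can be the set of users
# whose activity caused the item to be recommended.
explanations = {
    "i1": {"ann", "jane"},
    "i2": {"ann", "jane"},
    "i3": {"chris"},
}
d12 = jaccard_distance(explanations["i1"], explanations["i2"])
d13 = jaccard_distance(explanations["i1"], explanations["i3"])
diversity = set_diversity(["i1", "i2", "i3"], explanations)
```

No item attributes are needed: `i1` and `i2` are judged similar only because the same users explain both.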

  33. Benefits and Practicality of Explanation-Based Diversification • Applicable to items without attributes or whose attributes are difficult to analyze • Common on social content sites • Explanations are by-products of many recommendation processes • They can be maintained with little overhead • See our poster on Monday for details

  34. Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion

  35. SocialScope Platform • [Architecture figure] • Actors and sources: User, Social Content Admin; Facebook, Y! IM, Y! Sports (via OpenSocial API) • Information Presentation: Social Content Grouper, Social Content Ranker, User Interface • Information Discovery: Social Content Analyzer, Social Query Evaluator (Query / Result) • Content Management: Activity Manager, Data Manager, Content Integrator, Social Content Graph, Activities • Important Information Flow: (1) Raw/derived social nodes and links (2) Social content sub-graph (3) Relevant social content sub-graph (4) Relevant social nodes and links

  36. A Graph Based Logical Algebra Framework • Designed for information discovery on social content sites • Aim to provide a declarative way of specifying analysis and query tasks • Uniformity and flexibility • Opportunities for performance optimization • Basic operators: • Node Selection (σN), Link Selection (σL) • Composition, Semi-Join • Node Aggregation, Link Aggregation • Details in [CIDR09]

  37. A Simple Search Task • John’s friends • People who visited Denver • John’s friends who visited Denver • Their activities
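
The four steps of this search task can be imitated over a list of labeled links. This is a rough analogue using plain Python sets, not the SocialScope operators themselves (their exact semantics are in [CIDR09]); the helper `select_links` and the example links are hypothetical.

```python
def select_links(links, label=None, src=None, dst=None):
    """Keep (src, label, dst) links matching every condition that is given."""
    return [
        (s, l, d)
        for (s, l, d) in links
        if (label is None or l == label)
        and (src is None or s == src)
        and (dst is None or d == dst)
    ]


links = [
    ("John", "friend", "Amy"),
    ("John", "friend", "Joe"),
    ("Amy", "visit", "Denver"),
    ("Joe", "visit", "NYC"),
    ("Bob", "visit", "Denver"),
    ("Amy", "tag", "Denver"),
]

# Step 1: John's friends.
friends = {d for (_, _, d) in select_links(links, label="friend", src="John")}
# Step 2: people who visited Denver.
visitors = {s for (s, _, _) in select_links(links, label="visit", dst="Denver")}
# Step 3: John's friends who visited Denver (plays the role of a semi-join).
friends_in_denver = friends & visitors
# Step 4: their activities, taken here as their non-friendship links.
activities = [l for l in links if l[0] in friends_in_denver and l[1] != "friend"]
```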

  38. Jelly: A Language Over Social Content Sites • Designed with a focus on community-centric information exploration applications • Most useful applications • A restricted implementation of the SocialScope algebra • Based on nested relation model, instead of full graph model • Built-in primitives for topic and community generation • Topic Generation, Community Extraction • Recommendation Generation • Group Generation, Explanation Generation

  39. A Simple (Incomplete) Example
     • Topic Generation and Community Extraction
       generate topics for item into topics from tagging R using LDA (seed, th=0.8) seed R.item group R.tag weight-with count()
       generate communities into experts from topics T, tagging R where T.*.item = R.item using jaccard-similarity (seed, th=0.7) seed (R.user, T.topic) list R.item
     • Recommendation Generation
       generate recommendations into candidates given user u, query q from experts T where Selected (T.topic, u, q) using count-users (seed) seed T.*.item list T.user
     • Information Presentation
       generate explanations into results from candidates C, tagging T where C.item = T.item using identity (seed) seed C.item list T.user weight-with count()

  40. Talk Outline • Emerging Trend of Social Content Sites (SCS) • Information Exploration on SCS • Community Centric Model • Information Discovery: Improving Search Performance through Clustering • Information Presentation: Diversification via Explanation • SocialScope and Jelly • Further Challenges • Conclusion

  41. Scalability Challenges • Social content graphs are large • Tens of millions of users • millions (movies, destinations) or billions (web links) of items • Challenge 1: being able to analyze the graph efficiently • We need to design Map-Reduce frameworks that are suitable for those analyses • Activities are being generated at a very high rate • millions of status updates per day (Facebook, Twitter) • A similar amount of content activities (browsing, tagging, etc.) • Challenge 2: being able to incrementally incorporate new activities into the existing model

  42. Semantic Challenges and Questions • Connections are multi-dimensional • Two main categories on the surface: explicit (friendship) vs. implicit (shared activities) • However: not all friendship links are the same and not all activities are the same • Challenge: for a given user on a given topic, identify the appropriate set of explicit and implicit links • Expert- versus interest-based communities are one step toward that goal • Temporal, geographical, and other external influences • How does a community evolve as time goes on and as members move around? • Social graph interactions • How does one social graph (say Facebook) impact another (say MySpace)?

  43. Conclusion • The emerging trend of social content sites presents many challenges • Social information integration • Information discovery • Information presentation • The notions of communities and topics are crucial for effective and efficient information exploration on those sites • Community discovery and analysis • Community-based information exploration • Lots more! • At Yahoo!, we are focusing on building systems that can address those challenges

  44. Questions?
