Efficient Monitoring, Mining and Analysis of User-generated Content

Ka Cheung “Richard” Sia (UCLA) kcsia@cs.ucla.edu Sept 8 2008

Explosion of user-generated content • Doubling every 5 months – by Technorati

Characteristics of content About 97%-98% of daily content is new (50-word shingles) 62% of weekly content is new on the web (“What's New on the Web? The Evolution of the Web from a Search Engine Perspective”, by Ntoulas et al., WWW 2004)

Characteristics of content Mostly consist of current event chatter Politics Technology Entertainment Sports

The Yahoo! Buzz service

Agenda Introduction: growth and characteristics of user-generated content Three aspects Monitoring: How to deliver fresh content to users Aggregation: How to efficiently deliver personalized results to users Analysis of tagging data: Making tagging data useful for advertisers

Framework Pull model: a central server monitors data source changes and provides digested content to users Push model: data sources notify the server of updates

Overview New challenges: content is updated more frequently, with recurring patterns; requirements are more time-sensitive Modeling of post updates Definition of delay Strategies for allocation and scheduling

How are updates generated? Homogeneous Poisson model: λ(t) = λ at any t Periodic inhomogeneous Poisson model: λ(t) = λ(t-nT), n=1,2,…
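
The two posting models can be simulated to get a feel for them. The sketch below draws post times by thinning, a standard way to sample an inhomogeneous Poisson process; the function names and the on/off example source are assumptions made for illustration, not part of the talk.

```python
import random


def sample_posts(rate, rate_max, horizon, rng):
    """Sample post times on [0, horizon) from a Poisson process whose
    intensity rate(t) never exceeds rate_max, using thinning."""
    times = []
    t = 0.0
    while True:
        # Candidate arrivals come from a homogeneous process at rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep each candidate with probability rate(t) / rate_max.
        if rng.random() * rate_max < rate(t):
            times.append(t)


def periodic(base_rate, period):
    """Extend a rate defined on [0, period) to all t >= 0."""
    return lambda t: base_rate(t % period)


# Hypothetical source: posts only during the first half of each period T=2.
on_off = periodic(lambda t: 1.0 if t < 1.0 else 0.0, 2.0)
posts = sample_posts(on_off, rate_max=1.0, horizon=1000.0,
                     rng=random.Random(0))
```

A homogeneous source is the special case where rate(t) is constant and equal to rate_max, so every candidate is kept.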

Definition of metrics Delay of a data source: sum of elapsed time for every post Delay experienced by the aggregator

Approach Resource allocation: how often to contact data sources? O1 is more active than O2; how much more often should we contact O1 than O2? Retrieval scheduling: when to contact a data source? 2 retrievals are allocated for O1; when should these 2 retrievals be scheduled?

Single retrieval per period example λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2] Periodicity T=2 τ = 0.5, expected delay = 0.75; τ = 1, expected delay = 0.5; τ = 2, expected delay = 1.5
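
A minimal numerical check of this example, assuming the delay of a post is the time from posting until the next retrieval, so that the expected delay per period is the integral of λ(t)(τ - t) over the period ending at τ. The function name and the midpoint-rule integration are illustrative choices; the sketch reproduces the values above for τ = 1 and τ = 2.

```python
def expected_delay_single(rate, period, tau, steps=2000):
    """Expected total delay per period for one retrieval per period at tau.

    A post made at time t in (tau - period, tau] waits tau - t, so the
    delay is the integral of rate(t) * (tau - t) over that interval.
    """
    h = period / steps
    total = 0.0
    for k in range(steps):
        t = tau - period + (k + 0.5) * h       # midpoint of the k-th cell
        total += rate(t % period) * (tau - t) * h
    return total


def on_off(t):
    """Posting rate of the example: 1 on [0,1), 0 on [1,2)."""
    return 1.0 if t < 1.0 else 0.0


delay_at_1 = expected_delay_single(on_off, 2.0, 1.0)
delay_at_2 = expected_delay_single(on_off, 2.0, 2.0)
```

Retrieving right after the active interval (τ = 1) catches every post with the shortest wait, which is why it beats retrieving at the end of the period.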

Multiple retrievals per period When m retrievals per period are allocated and scheduled at times τ1, …, τm, the expected delay is given by:

Example: 6 retrievals for λ(t)=2+2sin(2πt)
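
A sketch of how a schedule for this example could be evaluated and tuned. It assumes the expected delay of a schedule is the sum, over consecutive retrievals, of the integral of λ(t)(τj - t) between them, with period T = 1. The local search is a simple stand-in written for illustration, not the scheduling algorithm from the talk.

```python
import math


def expected_delay(rate, period, schedule, steps=400):
    """Expected delay per period for retrievals at the given times."""
    taus = sorted(schedule)
    total = 0.0
    prev = taus[-1] - period        # last retrieval of the previous period
    for tau in taus:
        h = (tau - prev) / steps
        for k in range(steps):
            t = prev + (k + 0.5) * h            # midpoint rule
            total += rate(t % period) * (tau - t) * h
        prev = tau
    return total


def improve(rate, period, schedule, step=0.02, rounds=20):
    """Greedy local search: nudge one retrieval at a time and keep the
    move only if it lowers the expected delay."""
    best = sorted(schedule)
    best_delay = expected_delay(rate, period, best)
    for _ in range(rounds):
        for j in range(len(best)):
            for delta in (-step, step):
                trial = list(best)
                trial[j] = min(max(trial[j] + delta, 0.0), period)
                trial.sort()
                d = expected_delay(rate, period, trial)
                if d < best_delay:
                    best, best_delay = trial, d
    return best, best_delay


def rate(t):
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)


uniform = [(j + 1) / 6 for j in range(6)]       # evenly spaced baseline
uniform_delay = expected_delay(rate, 1.0, uniform)
tuned, tuned_delay = improve(rate, 1.0, uniform)
```

The search accepts only moves that lower the delay, so the tuned schedule is never worse than the evenly spaced one it starts from.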

Experiment Data – 10k RSS feeds from syndic8.com collected during Oct – Dec 2004 Typical power law distribution – good for resource allocation

Performance CGM03 (“Effective page refresh policy for Web crawlers”, by Cho and Garcia-Molina in ACM TODS 2003): homogeneous Poisson model, optimizes for the “age” metric Ours: both resource allocation and retrieval scheduling

Size of estimation window Resource constraint: 4 retrievals per day per feed on average A 2-week window seems an appropriate choice

Consistency of posting rate 90% of the RSS feeds post consistently

Summary Resource allocation is aggressive Retrieval scheduling optimizes within individual data source Significantly improved freshness of content Also considered user browsing pattern “Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho, Hyun-Kyu Cho, in IEEE TKDE 2007 “Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007

Agenda • Introduction: growth and characteristics of user-generated content • Three aspects • Monitoring: How to deliver fresh content to users • Aggregation: How to efficiently deliver personalized results to users • Analysis of tagging data: Making tagging data useful for advertisers

Aggregate query over blogs User-generated content in the Blogosphere and Web 2.0 services contains rich information about recent events Aggregation of individual user opinions shows current popular trends

Motivation Global aggregation (examples from blogpulse.com): recent news gets picked up quickly, e.g. “Dark Knight” in the week of July 18 and “Olympics”-related phrases in the week of August 8 Potential drawbacks: what if a user is not interested in entertainment at all? Groups of bloggers collaborate to promote advertisement videos Personal aggregation: users selectively aggregate from different sources; needs an efficient strategy to handle a large number of users and sources

From global to personal aggregation
Example posts:
“Finished watching Michael Phelps in Olympics, let me try the WALL-E DVD...”
“Michael Phelps performance in Olympics is awesome...”
“Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
“Um.. it will be good if there is a free show of Dark Knight and WALL-E”
[Figure: bloggers linked to the items (phrases) they mention: Olympics, Dark Knight, Las Vegas, Michael Phelps, WALL-E]

Matrix formulation
Endorsement matrix (E) - e.g. the number of times a blogger mentions an object (keyword / link) in his posts
Trust matrix (T) - e.g. how often a user reads from a blog
Personalized score (TE) - endorsement scores weighted by a user’s trust vector
E       o1    o2    o3
b1       3     2     0
b2       0     3     0
b3       1     0     1
b4       1     2     3
Total    5     7     4
T       b1    b2    b3    b4
u1     0.8   0.8     0     0
u2     0.2   0.2   0.6   0.6
u3       0     0   0.5   0.5
TE      o1    o2    o3
u1     2.4   4.0   0.0
u2     1.8   2.2   2.4
u3     1.0   1.0   2.0
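
The example matrices can be checked with a few lines of plain Python. This is only a sketch of the arithmetic; the helper name matmul is an illustrative choice.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


# Endorsement matrix E: rows are blogs b1..b4, columns are objects o1..o3.
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Trust matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]

TE = matmul(T, E)                          # personalized scores per user
global_total = [sum(col) for col in zip(*E)]   # global aggregation
```

Globally o2 has the highest total, but for u2 and u3 the top object is o3, which is the difference between global and personal aggregation that the example illustrates.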

Baseline implementations
Tables: Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)
Personal aggregate query:
  SELECT e.item, SUM(t.score * e.score) AS p_score
  FROM Endorsement e, Trust t
  WHERE e.blog_id = t.blog_id AND t.user_id = <user id>
  GROUP BY e.item
  ORDER BY p_score DESC LIMIT 20
Implementations: On-the-fly (OTF), View
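
A small on-the-fly run of the personal aggregate query, sketched with Python's built-in sqlite3 module instead of the MySQL server used in the experiments. The rows are the non-zero entries of the toy E and T matrices from the matrix formulation slide; the function name personal_top is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3), ("b3", "o1", 1),
    ("b3", "o3", 1), ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])


def personal_top(user_id, limit=20):
    """Run the personal aggregate query for one user (OTF baseline)."""
    return conn.execute(
        """
        SELECT e.item, SUM(t.score * e.score) AS p_score
        FROM Endorsement e JOIN Trust t ON e.blog_id = t.blog_id
        WHERE t.user_id = ?
        GROUP BY e.item
        ORDER BY p_score DESC LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()


u2_ranking = [item for item, _ in personal_top("u2")]
```

Items that none of a user's trusted blogs mention simply do not appear in the result, which is why u1 gets only two rows here.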

Optimizing the query Identify “template” users Typical users interested in sports / politics / technology / ... Results of template users are pre-computed Results of individual users are combined from partially computed results

Using NMF to discover user groups Factorize trust matrix Decompose T into two sub-matrices W and H Non-negative matrix factorization W: <individual users : template users> relationship H: <template users : blogs> relationship User 2’s trust vector is expressed as linear combination of the trust vectors of template user 1 and 2 NMF as an approximation of original trust matrix
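
A minimal sketch of the factorization step, using the classic multiplicative update rules for NMF in plain Python. The rank, iteration count, seed and toy trust matrix are assumptions for illustration; the experiments in the talk use a much larger matrix and the talk does not say which NMF solver was used.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(t, w, h):
    """Frobenius norm of T - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rt, rw in zip(t, wh)
               for x, y in zip(rt, rw)) ** 0.5


def nmf(t, rank, iters=200, seed=0, eps=1e-9):
    """Factorize T (users x blogs) into W (users x templates) and
    H (templates x blogs) with non-negative entries."""
    rng = random.Random(seed)
    n, m = len(t), len(t[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, t), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(rank)]
        ht = transpose(h)
        num, den = matmul(t, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(n)]
    return w, h


# Toy trust matrix: user 2 is a mix of the interests of users 1 and 3.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
W0, H0 = nmf(T, rank=2, iters=0)     # random starting point
W, H = nmf(T, rank=2, iters=200)
```

Each row of H plays the role of a template user's trust vector, and each row of W says how an individual user combines the templates.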

Reconstruction of results Personalized endorsement scores of template users are pre-computed; results of individual users are computed on request (HE) is maintained as sorted lists for all template users W * (HE) is the personal aggregation result, computed using the Threshold Algorithm (by Fagin et al., PODS 2001) Top-K list: (HE) are sorted lists, W * (HE) is a weighted linear combination
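
A sketch of the Threshold Algorithm applied to this setting, assuming non-negative weights and scores and treating an item missing from a list as having score 0. The two template lists and the weights are hand-picked for illustration (they reproduce user u2 of the toy example), not output of an actual factorization.

```python
import heapq


def threshold_topk(sorted_lists, weights, k):
    """Top-k items by weighted sum over lists sorted by descending score."""
    lookup = [dict(lst) for lst in sorted_lists]     # random access
    seen = {}
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for lst, w in zip(sorted_lists, weights):
            if depth < len(lst):
                item, score = lst[depth]             # sorted access
                threshold += w * score
                if item not in seen:
                    seen[item] = sum(wj * lk.get(item, 0.0)
                                     for wj, lk in zip(weights, lookup))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen item can score above the threshold, so stop early
        # once the k-th best seen item already reaches it.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Rows of (HE) for two hypothetical template users, sorted by score.
template_lists = [
    [("o2", 4.0), ("o1", 2.4)],
    [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)],
]
user_weights = [0.25, 1.2]          # this user's row of W
top2 = threshold_topk(template_lists, user_weights, 2)
```

For this input the scan stops after reading two entries from each list, before the tail of the longer list is touched; on long lists that early stop is where the savings come from.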

Partition of trust matrix Decomposition is useful when the matrix is dense Real-life data is often skewed (by Akshay et al., ICWSM 2007) Hybrid method: uses decomposition only when it is effective [Figure: trust matrix with blogs ordered by number of subscribers and users ordered by number of subscriptions, partitioned into three regions: 1. OTF, 2. VIEW, 3. NMF. Labels: 2.7M subscription pairs; users with >30 subscriptions and feeds with >30 subscribers: 10k feeds, 24k users, ~1M subscription pairs]

Experiments Bloglines.com: online RSS reader Trust matrix T (1-0 version): subscription profile, 91K users, 487K RSS feeds Endorsement matrix E: blog-keyword occurrences Feed content collected between Nov 2006 and Jul 2007 Keywords filtered to nouns with high tf-idf values Platform: Python implementation of the proposed scheme, MySQL server on Linux with data stored on RAID

How different is personalization?
Week 2007 Jan 7 – 2007 Jan 13, major event: iPhone announced
Personal aggregation results differ from global aggregation
2007-01-07 to 2007-01-13
Global        User 90439   User 90550   User 91017
sales         cattle       brazil       yorker
iphone        beef         iguazu       iraq
apple         iphone       reuters      bush
manager       chicago      search       president
iraq          iraq         vegas        views
management    bush         argentina    avenue
development   apple        kibbutz      dept
software      companies    video        troops
business      prices       cathartik    saddam
phone         quarter      google       iran

How different is personalization? Overlap comparison of global aggregation and personal aggregation LG: global top 20 items; Li: individual top 20 items of user i Personal aggregation results also differ among users [Figures: overlap degree with the global aggregation result; pair-wise overlap among users]

Approximation accuracy
Dense region of subscription matrix: >30 subscribers: 10152 feeds; >30 subscriptions: 24340 users
L2 norm comparison
Sparsity of W (23%), H (13%)
NMF approximation is close to SVD, with the advantage of sparseness
Rank   SVD     NMF
80     848.5   856.9
90     841.6   850.1
100    835.1   844.6
110    829.0   837.9
120    823.2   833.0

Approximation accuracy • How many items are approximated by NMF in top 20 list? • Ti – top 20 items of user i computed by OTF • Ai – top 20 items of user i computed by NMF • 70% approximation and more accurate for higher rank items Correlation with rank

Efficiency of proposed method
Update cost (for 1 week of data): OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time, averaged over the 1000 users with the highest number of subscriptions
OTF: executes the SQL query on the MySQL server
NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server
Average query response time reduced by 75%; outliers with significant delay eliminated
Method   avg     std     max      min
OTF      2.05s   3.60s   84.42s   0.037s
NMF      0.46s   0.53s   2.84s    0.007s

Summary Deliver tailored results to users by personal aggregation Proposed a model for personal aggregate queries Optimization by NMF & Threshold Algorithm Real-life dataset study shows query response time can be reduced significantly with acceptable approximation accuracy “Efficient Computation of Personal Aggregation Queries on Blogs”, with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008 “Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007

Agenda • Introduction: growth and characteristics of user-generated content • Three aspects • Monitoring: How to deliver fresh content to users • Aggregation: How to efficiently deliver personalized results to users • Analysis of tagging data: Making tagging data useful for advertisers

More than just tag-cloud

LDA of tagging data • d – bookmark, w – tag (merged all users), z – topics • Sample of topics

Change of entropy • Tags with increasing popularity in a period Correspond to well-established topics where users have common consensus Correspond to developing topics where users are willing to explore new pages
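
One way to make "change of entropy" concrete, assuming the entropy of a tag is the Shannon entropy of how its uses are spread over bookmarked pages. That definition, the function name and the counts below are assumptions for illustration; the talk does not spell out the exact formula.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# Hypothetical counts: how often one tag was applied to each page.
# Period 1: usage concentrated on one page. Period 2: spread over four.
period_1 = [4]
period_2 = [1, 1, 1, 1]

entropy_change = entropy(period_2) - entropy(period_1)
```

Under this reading, a tag whose entropy rises while it gains popularity is being attached to new pages, and a tag whose entropy stays low keeps pointing at the same few pages.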

Change of topic association • “Programmers” at Oct 2005 • programming, development, code, patterns, dev, coding, algorithms, scheme, software, ... • “Programmers” at Jan 2006 • programming, development, code, lisp, dev, coding, algorithms, scheme, software, cs, ... • work, jobs, career, job, shell, sleep, uml, regex, scripting, bash, ...

Specificity of word semantics • Entropy vs idf metrics

Summary Features can be combined to build a classifier for words Tag entropy change rate KL-divergence of topic distribution Entropy of semantic Assist advertisers to select better keywords for advertisement “Exploring Social Annotations for Word Usage Evolution” – work in progress
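
A sketch of the KL-divergence feature, assuming it compares a tag's topic distribution in two periods. The smoothing constant and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Zero entries of q are floored at eps to keep the result finite.
    """
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical topic distributions of one tag in two periods.
topics_before = [0.7, 0.2, 0.1]
topics_after = [0.4, 0.2, 0.4]

drift = kl_divergence(topics_after, topics_before)
```

A value near zero means the tag is associated with the same topics as before; a larger value signals that its topic association has shifted.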

Thank you!

Definition of metrics τj – retrieval time; λ(t) – posting rate Expected delay Homogeneous Poisson model Inhomogeneous Poisson model

Resource allocation Consider n data sources O1, …, On λi – posting rate of Oi wi – weight of Oi N – total number of retrievals per day mi – number of retrievals per day allocated to Oi Optimal allocation
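
A sketch of one allocation that follows from these definitions under simplifying assumptions: homogeneous Poisson posting and evenly spaced retrievals, so that source Oi contributes a weighted delay of wi·λi/(2·mi) per day. Minimizing the sum subject to m1 + … + mn = N with a Lagrange multiplier gives mi proportional to sqrt(wi·λi). The function names are illustrative, and fractional retrieval counts are left unrounded.

```python
import math


def allocate(rates, weights, total):
    """Split `total` retrievals per day so that m_i ∝ sqrt(w_i * λ_i)."""
    roots = [math.sqrt(w * lam) for lam, w in zip(rates, weights)]
    scale = total / sum(roots)
    return [scale * r for r in roots]


def weighted_delay(rates, weights, alloc):
    """Total weighted expected delay per day for evenly spaced retrievals."""
    return sum(w * lam / (2.0 * m)
               for lam, w, m in zip(rates, weights, alloc))


# Hypothetical case: O1 posts four times as often as O2, equal weights.
rates, weights = [4.0, 1.0], [1.0, 1.0]
sqrt_alloc = allocate(rates, weights, 6)
uniform_alloc = [3.0, 3.0]
proportional_alloc = [4.8, 1.2]
```

In this case O1 posts four times as often as O2 but is contacted only twice as often, and both the uniform and the rate-proportional splits give a higher total delay than the square-root split.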

Single retrieval per period For a data source with posting rate λ(t) and period T, the expected delay when retrieved at time τ is given by:

Blogs becoming inactive Detection of abandoned blogs to save resources [2] D.R. Cox, “Regression models and life-tables (with discussion)”, Journal of the Royal Statistical Society, B(34), 1972 [3] Gina Venolia, “A Matter of Life or Death: Modeling Blog Mortality”, Technical report, Microsoft Research

More examples

Major posting patterns K-means clustering
