This paper presents a comprehensive framework for extracting and updating user profiles from large-scale datasets using a novel Kullback-Leibler (KL) divergence approach. By analyzing user interactions and community snapshots, we identify meaningful keywords that characterize users' interests. Our methodology incorporates MapReduce for scalability, facilitating efficient processing of extensive data. Experimental results demonstrate significant improvements in user profiling accuracy, showcasing how tailored user profiles can inform targeted advertising and enhance user engagement in digital platforms.
Extracting User Profiles from Large Scale Data
David Konopnicki, IBM Haifa Research Lab
Joint work with Michal Shmueli-Scheuer, Haggai Roitman, David Carmel and Yosi Mass
Motivating Example
• User browsing → keywords modeling: for each user, report the most meaningful keywords that describe her profile (e.g., san-francisco, peer, michael jackson alive, analysis)
• Large scale content analysis for a mass amount of users
• Update user profiles in a profiles database
• Dashboard and advertisement system: track statistics about readers' interests
Contributions
• User profiling framework:
  • User profile model
  • KL approach to weight user profiles
• Large scale implementation:
  • MapReduce flow
• Experiments:
  • Quality analysis
  • Scalability analysis
User Profiling Framework – Setting
• Logging: ⟨userID, docID⟩ pairs, e.g., ⟨u1,d1⟩, ⟨u1,d2⟩, ⟨u2,d2⟩
• Content: ⟨docID, content⟩ pairs, e.g., ⟨d1,{bla,bla,bla}⟩, ⟨d2,{foo,foo}⟩
• Targeting: based on the resulting profiles
User Profiling - Definitions • Bag of words model (BOW) • Profile maintenance • User snapshot • Community snapshot
User Profiling - Intuition
• Find terms that are highly frequent in the user snapshot and that separate the most between the user and the community snapshots, e.g., {Travel, Tennis, Sport}
User Profiling – Naïve approach
• Term frequency: the number of times a term t appears in document d – tf(t,d)
• Document frequency: the number of documents containing the term t – df(t,D)
• "Frequent": average tf over the user snapshot, i.e., the probability to find a term in the user snapshot
• "Separating": inverse document frequency (idf) of a term in the community snapshot
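A minimal sketch of the naïve scoring described above: average term frequency over the user's documents, multiplied by inverse document frequency over the community snapshot. The exact idf formula (log base, +1 smoothing in the denominator) is an assumption, not taken from the slides.

```python
import math
from collections import Counter

def naive_scores(user_docs, community_docs):
    """Score each term by its average tf in the user snapshot
    times its idf in the community snapshot (naïve tf-idf approach)."""
    # Average term frequency over the user's documents
    tf_sums = Counter()
    for doc in user_docs:
        tf_sums.update(doc)
    avg_tf = {t: c / len(user_docs) for t, c in tf_sums.items()}

    # Inverse document frequency over the community snapshot
    n = len(community_docs)
    df = Counter()
    for doc in community_docs:
        df.update(set(doc))  # count each document at most once per term
    return {t: avg_tf[t] * math.log(n / (1 + df[t])) for t in avg_tf}

user = [["travel", "tennis"], ["travel", "sport"]]
community = [["travel", "tennis"], ["travel", "sport"],
             ["news", "weather"], ["news"]]
scores = naive_scores(user, community)
print(sorted(scores, key=scores.get, reverse=True))
```

Note how "travel", though most frequent for the user, scores low because it is also common in the community.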
Kullback-Leibler (KL) Divergence
• Measures the difference between two probability distributions P1 and P2: KL(P1 ‖ P2) = Σt P1(t)·log(P1(t)/P2(t))
• KL measures the distance between the community distribution and the user distribution
• Each term is scored according to its contribution to the KL distance between the community and the user distributions
• The top-scored terms are then selected as the user's important terms
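The per-term scoring can be sketched as follows: each term's score is its contribution to the KL sum, and the top-scored terms form the profile. The direction of the divergence shown here (user relative to community) is an assumption based on the intuition slide; the slides do not spell out the orientation.

```python
import math

def kl_term_scores(p_user, p_community):
    """Per-term contribution to KL(user || community).
    Terms over-represented for the user relative to the
    community get high positive scores."""
    return {t: p * math.log(p / p_community[t])
            for t, p in p_user.items() if p > 0}

# Toy marginal term distributions (hypothetical numbers)
p_user = {"travel": 0.5, "tennis": 0.3, "news": 0.2}
p_comm = {"travel": 0.2, "tennis": 0.05, "news": 0.75}
scores = kl_term_scores(p_user, p_comm)
top = sorted(scores, key=scores.get, reverse=True)
print(top)  # → ['tennis', 'travel', 'news']
```

"news" gets a negative contribution because the user reads it less than the community does, so it is the last term selected.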
User Profiling – KL method
• Community marginal term distribution: the probability to find a term t in the community snapshot, based on the average tf over the community snapshot with a probability normalization factor
• User marginal term distribution: the relative initial weight of term t, smoothed with the community snapshot (λ = 0.001)
MapReduce Flow (each job reads its input from and writes its output to HDFS)
1. TF — Mapper: input (d, text), output ({t,d}, 1); Reducer: output ({t,d}, tf(t,d)) // sum
2. |Dj(u)| — Mapper: input (u, d), output (u, 1); Reducer: output (u, |Dj(u)|) // sum
3. Average TF — Mapper: input (t, tf(t,d), |Dj|), output (t, {tf(t,d), |Dj|, 1}); Reducer: output (t, tf(t,Dj)) // avg
4. UDF — Mapper: input ({u,t,d}, {tf(t,Dj(u)), |Dj(u)|}), output ({u,t,|Dj(u)|}, {1}); Reducer: output ({u,t}, {udf(t,Dj(u))})
5. DF — Mapper: input ({t,d}, tf(t,Dj)), output (t, 1); Reducer: output (t, {df(t,Dj), idf(t,Dj), cdf(t,Dj)})
6. Nj — Mapper: input ({t}, {tf(t,Dj), cdf(t,Dj)}), output (t, Nj); Reducer: identity
7. P(t|Dj) — Mapper: input ({t}, {tf(t,Dj), |Dj|, cdf(t,Dj), Nj}), output (t, P(t|Dj)); Reducer: identity
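The first job of the flow (TF) can be illustrated with a tiny in-memory map/shuffle/reduce simulation; the `run_job` helper is only an illustration of MapReduce semantics, not part of the paper's Hadoop implementation.

```python
from collections import defaultdict

def tf_map(doc_id, text):
    """Mapper: input (d, text), output ({t,d}, 1) per occurrence."""
    for term in text.split():
        yield (term, doc_id), 1

def tf_reduce(key, counts):
    """Reducer: output ({t,d}, tf(t,d)) by summing the ones."""
    return key, sum(counts)

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job:
    map every record, shuffle by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    return [reducer(k, vs) for k, vs in groups.items()]

docs = [("d1", "bla bla bla"), ("d2", "foo foo")]
print(run_job(docs, tf_map, tf_reduce))
```

The later jobs of the flow chain in the same way, each consuming the previous job's output from HDFS.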
MapReduce Flow – cont.
[Diagram: further HDFS-backed jobs combining the per-term weights w, their sum Σw, and P(t|Dj(u)) into the final user profile scores]
Experimental Data – quality analysis
• Open Directory Project (ODP):
  • Categories are associated with manual labels
  • Considered as "ground truth" in this work
  • Examples:
    • ODP: Science/Technology/Electronics — manual label: "Electronics"
    • ODP: Society/Religion/and/Spirituality/Buddhism — manual label: "Buddhism"
• Data collection:
  • 100 different categories randomly selected from ODP
  • 100 documents randomly selected per category
  • A total collection size of about 10,000 Web pages
• Evaluation:
  • A match is counted if the suggested label is identical to, an inflection of, or a WordNet synonym of the manual label
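The matching criterion above can be sketched as a small predicate. The inflection and synonym sets are supplied by the caller (hooking in NLTK's WordNet via `wordnet.synsets` would be one way to obtain synonyms); this helper is an illustration of the criterion, not the authors' evaluation code.

```python
def is_match(suggested, manual, inflections=(), synonyms=()):
    """True if the suggested label is identical to, an inflection of,
    or a synonym of the manual label (case-insensitive), mirroring
    the slide's evaluation rule."""
    s = suggested.lower()
    candidates = {manual.lower()}
    candidates.update(w.lower() for w in inflections)
    candidates.update(w.lower() for w in synonyms)
    return s in candidates

print(is_match("electronics", "Electronics"))                     # identical
print(is_match("buddhist", "Buddhism", inflections=["Buddhist"])) # inflection
print(is_match("physics", "Electronics"))                         # no match
```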
Results
• Measured: in how many cases at least one correct term appears among the top-K terms
• KL outperforms all other feature-selection approaches
Experimental Data – scalability analysis
• Blogger.com
• Data collection:
  • 973,518 blog posts crawled from March 2007 to January 2009
  • Total collection size of 5.45 GB, with ~120,000 users
• Cluster setting:
  • 4-node cluster of commodity machines (each with 4 GB RAM, 60 GB HD, 4 cores)
  • Hadoop 0.20.1
• (Example blog entry: http://grannyalong.blogspot.com/)
Number of User Profiles
[Chart: document ratio, user profile ratio, and time ratio]
• Runtime ratio is correlated with the user profile ratio
Data Size
• Fixed number of users: 18,000 users chosen from March–April 2007
• Runtime increases linearly with data size
Related Work
• Content-based user profiling:
  • A profile containing a taxonomic hierarchy for the long-term model; the taxonomy is taken from the ODP, and short-term activities update the hierarchy
  • Adaptive user profiles: use words appearing in browsed Web pages and combine them with tf-idf over a time window, weighting terms by the recency of browsing
• KL approach to user tasks:
  • Filter out new documents that are not related to the user, based on her profile
  • Annotate a URL with the most descriptive query term for a given user, based on her profile
• User targeting in large-scale systems:
  • A behavioral targeting system over Hadoop MapReduce
  • A large-scale collaborative filtering technique for movie recommendations
  • An incremental algorithm that constructs a user profile from monitoring and user feedback, trading off complexity against profile quality
Conclusions & Future Work
• We proposed a scalable user profiling solution, implemented on top of Hadoop MapReduce
• We showed quality and scalability results
• We plan to extend the user model into a semantic model
• We plan to extend the user profile to include structured data