Research Trends in Multimedia Content Services
Data Mining and Web Search Group
Computer and Automation Research Institute, Hungarian Academy of Sciences
András A. Benczúr
Web 2.0, 3.0 …?
• Platform convergence (Web, PC, mobile, television) – information vs. recreation
• Emphasis on social content (blogs, Wikipedia, photo and video sharing)
• From search towards recommendation (query-free, profile-based, personalized)
• From text towards multimedia
• Glocalization (language, geography)
• Spam
A sample service
[Diagram: RSS, Web 2.0 content, recommender engine, client software]
• Small-screen browsing
• Recommendation based on the user profile (avoids query typing)
• Read blogs, view media, …
The user profile
• History stored for each user:
  • Known ratings, preferences, opinions – scarce!
  • Items read, weighted by time spent (details viewed, scrolling, back button)
  • tf.idf-weighted top list of terms in the documents read (see the sketch below)
• User language, region, current location and known sociodemographic data
• Multimedia!
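As an illustration of the tf.idf-weighted term list above, here is a minimal Python sketch. The function name, the exact weighting formula and the corpus statistics passed in are assumptions for illustration, not the system's actual implementation.

```python
import math
from collections import Counter

def tfidf_top_terms(read_docs, corpus_doc_freq, n_corpus_docs, top_k=20):
    """Hypothetical sketch: tf.idf-weighted top term list for a user profile.

    read_docs       -- list of token lists, one per document the user has read
    corpus_doc_freq -- dict: term -> number of corpus documents containing it
    n_corpus_docs   -- total number of documents in the corpus
    """
    tf = Counter()
    for doc in read_docs:
        tf.update(doc)  # term frequency over everything the user has read
    scores = {
        term: count * math.log(n_corpus_docs / (1 + corpus_doc_freq.get(term, 0)))
        for term, count in tf.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

The time spent per document (mentioned above) could be folded in by weighting each document's term counts before summing.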
Distribution of categories
  Unknown       0.4%
  Alias         0.3%
  Empty         0.4%
  Non-existent  7.9%
  Ad            3.7%
  Weborg        0.8%
  Spam         16.5%
  Reputable    70.0%
Effect of search result position
[Charts: time spent viewing each result position; time until the user reaches a result]
Segmentation
[Image: segmenting similar objects]
ImageCLEF Object Retrieval Task
[Diagram: class of query image, pre-classified images, VOC2007 query images, original training set]
Networked relations
• spam
• social network analysis
• churn
Social networks
[Network diagram: ADSL, home and business nodes]
Stacked Graphical Learning
• Predict the churn probability p(v) of each node v
• For a target node u, aggregate p(v) over its neighbors to form a new feature f(u)
• Rerun classification with feature f(·) added (see the sketch below)
• Iterate
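A minimal sketch of the stacked graphical learning loop described above, assuming node features X, churn labels y and an adjacency list are available; scikit-learn's logistic regression stands in for whichever base classifier was actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_graphical_learning(X, y, neighbors, n_iter=3):
    """Sketch of stacked graphical learning for churn prediction.

    X         -- node feature matrix, shape (n_nodes, n_features)
    y         -- churn labels for the nodes (all nodes labeled here, for brevity)
    neighbors -- dict: node index -> list of neighbor indices in the social graph
    """
    features = X
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_iter):
        clf.fit(features, y)
        p = clf.predict_proba(features)[:, 1]            # churn probability p(v) per node
        # aggregate neighbor predictions into the new feature f(u)
        f = np.array([p[neighbors[u]].mean() if neighbors[u] else 0.0
                      for u in range(len(p))])
        features = np.hstack([X, f.reshape(-1, 1)])      # rerun classification with f added
    return clf, features
```

In practice the neighbor predictions p(v) would come from held-out (cross-validated) models so that a node's own training label does not leak into its neighbors' features.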
Why social networks are hard to analyze
• Subgraphs of social networks
• Tentacles induce noise
• Medium-size dense communities attract much algorithmic work
Mapping into the 2D plane
• spectral
• semidefinite
Research Highlights
• Recommenders: KDD Cup 2007 Task 1, First Prize – predict the probability that a user rated a movie in 2006, based on training data up to 2005
• Spam filtering: Web Spam Challenge 1, first place
• Churn prediction: method presented at the KDD Cup 2009 Workshop, Task XXXX
Netflix: lessons and differences learned
• Ratings: 1–5 stars
• Predict an unseen rating
• Evaluation: RMSE
  • 0.8572: $1,000,000
  • Current leader: 0.8650
  • Oct/07: 0.8712
KDD Cup 2007
• same data set
• predict the existence of a rating
Results of two separate tasks
BellKor team report [Bell, Koren 2007]:
• Low-rank approximation
• Restricted Boltzmann Machine
• Nearest neighbor
KDD Cup 2007: predict the probability that a user rated a movie in 2006:
• Given a list of 100,000 user–movie pairs
• Users and movies drawn from the Netflix Prize data set
Winner report [K, B, and our colleagues 2007]
Evaluation and Issue 1
• For a given user i and movie j: $\mathrm{RMSE} = \sqrt{\tfrac{1}{N}\sum_{(i,j)} (\hat{p}_{ij} - p_{ij})^2}$, where $\hat{p}_{ij}$ is the predicted value
• KDD Cup example:
  • Our RMSE: 0.256
  • First runner-up: 0.263
  • All-zeroes prediction: 0.279 (places 10–13)
• But why do we use RMSE and not precision/recall?
  • RMSE prefers correct probability guesses for the majority of infrequently visited items
  • The presence of the recommender changes usage
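For concreteness, a small sketch of the RMSE computation above, including the all-zeroes baseline mentioned on the slide; the example numbers are made up.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error over user-movie pairs."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((predicted - actual) ** 2))

# Hypothetical labels: 1 if the user rated the movie in 2006, else 0.
actual = np.array([0, 1, 0, 0, 1])
predicted = np.array([0.1, 0.7, 0.2, 0.05, 0.6])
print(rmse(predicted, actual))               # model prediction
print(rmse(np.zeros_like(actual), actual))   # all-zeroes baseline, as on the slide
```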
Method Overview
• Probability by naive user–movie independence
  • Item frequency estimation (time series)
  • User frequency estimation
  • Reaches RMSE 0.260 by itself (still first place)
• Data mining
  • SVD
  • Item–item similarities
  • Association rules
• Combination (we used linear regression; see the sketch below)
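A hedged sketch of the final combination step. The slide only says linear regression was used, so the inputs below (one column of predictions per component model) and the clipping to [0, 1] are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def blend_predictors(component_preds_train, y_train, component_preds_test):
    """Combine component predictors (frequency model, SVD, item-item
    similarity, association rules, ...) by linear regression.

    component_preds_train -- shape (n_train_pairs, n_models): each column is one
                             model's predicted probability on a held-out set
    y_train               -- 0/1 labels: did the user rate the movie in 2006?
    component_preds_test  -- same columns, on the pairs to be predicted
    """
    blender = LinearRegression()
    blender.fit(component_preds_train, y_train)
    blended = blender.predict(component_preds_test)
    return np.clip(blended, 0.0, 1.0)   # keep combined scores in [0, 1]
```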
Time series prediction
• Interest persists over a long time range (several years)
Short lifetime of online items
• News articles behave very differently in time: usage peaks the day after publication and is gone by the third day
• Example (Origo): http://www.origo.hu/filmklub/20060124kiolte.html
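The item frequency estimation in the Method Overview has to cope with this short lifetime. One simple, purely illustrative way (not claimed to be the method on the slides) is an exponentially decayed usage count per item:

```python
import math
from collections import defaultdict

def decayed_item_counts(events, now, half_life_hours=24.0):
    """Hypothetical time-decayed popularity estimate per item.

    events -- iterable of (item_id, timestamp_hours) click/read events
    Recent events dominate, so short-lived news items fade quickly while
    long-lived items such as movies keep a stable score.
    """
    lam = math.log(2) / half_life_hours
    score = defaultdict(float)
    for item, t in events:
        score[item] += math.exp(-lam * (now - t))
    return score
```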
SVD
[Diagram: user × (movie / news item) rating matrix]
• K-dimensional SVD: noise filtering – the essence of the matrix – optimizes $\min_{\mathrm{rank}(X)=K} \lVert R - X \rVert_F$ (see the sketch below)
• SVD explains ratings as the effect of a few linear factors
• RMSE (ℓ2 error) with 10–30 dimensions: 0.93
• Issue: too many news items
  • 18K Netflix movies vs. a potentially infinite set of items
  • → may recommend the data source but not the item
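A minimal sketch of a rank-K SVD on the rating matrix. Treating missing entries as zeros is a simplifying assumption; the slides do not specify the actual factorization details.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def low_rank_factors(rows, cols, values, n_users, n_items, k=20):
    """Rank-k SVD approximation of the user-item rating matrix.

    rows, cols, values -- observed (user, item, rating) triples
    Missing entries are treated as zeros here for simplicity.
    """
    R = csr_matrix((values, (rows, cols)), shape=(n_users, n_items), dtype=float)
    U, s, Vt = svds(R, k=k)      # best rank-k fit in the l2 (Frobenius) sense
    return U * s, Vt             # user factors, item factors

def predict(user_factors, item_factors, u, i):
    """Predicted score for one user-item pair from the k linear factors."""
    return float(user_factors[u] @ item_factors[:, i])
```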
Lessons learned
• Content similarity might be the key feature
• Relative success of trivial estimates in the KDD Cup!
• Data mining techniques overlap and apparently catch similar patterns
• Precision/recall is more important than RMSE
• The solution must make heavy use of time
Future plans and ideas
• New partners and application fields: network infrastructure, new generation services, bioinformatics, …?
• Scaling our solutions to multi-core architectures
• Using our search (cross-lingual, multimedia, etc.) and recommender system capabilities in major solutions: mobile, new generation platforms, etc.
• Expanding our European-level collaboration, e.g. KIC participation
Questions?
benczur@sztaki.hu
http://datamining.sztaki.hu
András A. Benczúr