Data Mining: Knowledge Discovery in Databases Peter van der Putten

Data Mining: Knowledge Discovery in Databases Peter van der Putten ALP Group, LIACS Pre-University College LAPP-Top Computer Science February 2005

Overview • Lecture: data mining applications and internals • Collaborative Filtering & Recommender Systems • Decision Management Demo (optional) • Research @LIACS • Lab session: • Review lab work • Data mining projects • Data mining project presentations

To Repeat: What did we do in the first lecture? • Definitions of data mining • Data mining tasks • Predictive data mining • Descriptive data mining • Algorithms for classification • Algorithm for association rules

To Repeat: Some working definitions… • ‘Data Mining’ and ‘Knowledge Discovery in Databases’ (KDD) are used interchangeably • Data mining = • The process of discovery of interesting, meaningful and actionable patterns hidden in large amounts of data • Multidisciplinary field originating from artificial intelligence, pattern recognition, statistics, machine learning, bioinformatics, econometrics, …

To Repeat: Some working definitions… • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Example: the relation between patient characteristics and the probability of being diabetic • Instances: the individual, independent examples of a concept • Example: a patient, candidate drug etc. • Attributes: measurable aspects of an instance • Example: age, weight, lab tests, microarray data etc. • Pattern or attribute space

To Repeat: Data mining tasks • Predictive data mining • Classification: classify an instance into a category • Regression: estimate some continuous value • Descriptive data mining • Matching & search: finding instances similar to x • Clustering: discovering groups of similar instances • Association rule extraction: if a & b then c • Summarization: summarizing group descriptions • Link detection: finding relationships • …

Case Data Mining in Practice: Recommender Systems

What is a recommender system? • Books, music etc. • Amazon.com, BOL (nl.bol.com), Proxis.nl, CDR.nl, gnod.net, Romanadvies.bibliotheek.nl • Digital Video Recorders • TIVO.com • Movies • IMDB.com, Movielens (http://movielens.umn.edu), reel.com, gnod.net • … • Down to recommending cafés in Utrecht

What is a recommender system? • Recommender systems provide personalised recommendations to users about products, services or content based on their preferences • Preferences are generated from feedback • Explicit feedback: ratings, ‘I own this book’, etc. • Implicit feedback: browsing, buying etc. • The general attributes of the recommended object are generally not used to make the recommendation

Data Mining Tasks Revisited: Search • Finding best matching instances • Every instance is a point in pattern space • Attributes are the dimensions of an instance, e.g. age, weight, gender etc. • Pattern spaces may be high dimensional (10 to thousands of dimensions) • [Figure: instances plotted in a pattern space with axes such as age and weight]
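Finding the best matching instances can be sketched as a nearest-neighbour search over points in pattern space. This is a minimal sketch, not the method used in the lecture: it assumes Euclidean distance as the similarity measure, and the names `euclidean`, `nearest` and the `people` data are hypothetical. In practice attributes are usually rescaled first so that one attribute does not dominate the distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points in pattern space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, instances, k=1):
    """Return the k instances closest to the query point, closest first."""
    return sorted(instances, key=lambda inst: euclidean(query, inst))[:k]


# Hypothetical instances as (age, weight) points.
people = [(25, 70), (31, 82), (58, 64), (60, 90)]

print(nearest((30, 80), people, k=2))  # [(31, 82), (25, 70)]
```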

Paradox • How can we recommend objects if we don’t know the attributes? • What should be the dimensions? • How can we recommend books if we don’t know or can’t use genre, number of pages, etc.? • Collaborative Filtering: • Recommending objects without knowing intrinsic attributes • Recommend objects that are bought (viewed etc.) together

Simple Collaborative Filtering • Given a person with a profile of items • Find the nearest neighbor persons that have bought similar items (matching/search) • Recommend the products that are bought by these nearest neighbors • Blackboard example
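The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the blackboard example itself: profiles are plain sets of bought items, similarity between persons is assumed to be the Jaccard overlap of those sets, and the function names and the `profiles` data are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two item sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_based(target, profiles, k=2):
    """Recommend items bought by the k persons most similar to the target."""
    own = profiles[target]
    others = [user for user in profiles if user != target]
    # Step 1 and 2: find the k nearest neighbour persons.
    neighbours = sorted(
        others, key=lambda user: jaccard(own, profiles[user]), reverse=True
    )[:k]
    # Step 3: count how many neighbours bought each item the target lacks.
    votes = {}
    for user in neighbours:
        for item in profiles[user] - own:
            votes[item] = votes.get(item, 0) + 1
    # Most-voted items first; ties are broken alphabetically.
    return sorted(votes, key=lambda item: (-votes[item], item))


# Hypothetical purchase profiles.
profiles = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cas": {"book_a", "book_c", "book_d", "book_e"},
    "dee": {"book_f"},
}

print(recommend_user_based("ann", profiles, k=2))  # ['book_d', 'book_e']
```

Note that no book attribute (genre, number of pages) appears anywhere in the sketch: the only input is who bought what.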

Challenges • Large numbers of products and users (millions) • Recommendations have to be made in real time • Many users have rated only a few items • Some products are very popular, others are very rare • User profiles change very dynamically

Solutions • Quick fixes • Remove the most popular and most rare products • Remove users with few ratings • Weight products by their popularity • Abstract from the user profiles: model based collaborative filtering • Clustering • Item to Item recommendations rather than User to Item recommendation

Data Mining Tasks Revisited: Clustering • Clustering is the discovery of groups in a set of instances • Groups are different, instances in a group are similar • In a 2 to 3 dimensional pattern space you could just visualise the data and leave the recognition to a human end user • In more than 3 dimensions this is not possible • [Figure: instances in a pattern space with axes such as age and weight]

Data Mining Tasks Revisited: Clustering K-Means, a simple clustering algorithm • Randomly distribute k ‘prototype vectors’ into pattern space • Allocate all instances to the nearest prototype vector • Move each prototype vector in the direction of the mean of all instances allocated to it • Repeat the process until convergence
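The loop described above can be sketched in a few lines of Python. Two choices here are assumptions, not taken from the slide: the initial prototypes are k randomly chosen instances (a common way to place them inside pattern space), and each prototype jumps directly onto the mean of its instances, as in standard k-means, rather than moving only part of the way. The names `kmeans`, `mean` and the `points` data are hypothetical.

```python
import math
import random


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(points, k, seed=0, max_iter=100):
    """Return (prototypes, clusters) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: start from k randomly chosen instances.
    prototypes = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Allocate every instance to its nearest prototype.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, prototypes[i]))
            clusters[idx].append(p)
        # Move each prototype onto the mean of its allocated instances;
        # a prototype with no instances stays where it is.
        updated = [
            mean(c) if c else prototypes[i] for i, c in enumerate(clusters)
        ]
        if updated == prototypes:  # nothing moved: converged
            break
        prototypes = updated
    return prototypes, clusters


# Hypothetical data: two well-separated groups in a 2D pattern space.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
prototypes, clusters = kmeans(points, k=2)
print(sorted(prototypes))
```

`math.dist` requires Python 3.8 or later. On data that is less cleanly separated than this toy set, the result can depend on the random starting prototypes.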

Clustering for recommender systems • Perform a clustering on the pattern space of user profiles down to a smaller number of profile prototypes • When making recommendations, search for the nearest prototype / cluster and generate recommendations from the cluster • Problem: how many clusters to use?
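As a rough sketch of the lookup step, assume the clustering has already been done offline and produced the prototypes below. Everything in this example is hypothetical: the item list, the prototype values (read as the fraction of users in a cluster who bought each item) and the function name `recommend_from_cluster`.

```python
import math

ITEMS = ["book_a", "book_b", "book_c", "book_d"]

# Hypothetical profile prototypes, one per cluster. Each value is the
# fraction of users in that cluster who bought the corresponding item.
PROTOTYPES = [
    (0.9, 0.8, 0.1, 0.0),
    (0.1, 0.0, 0.9, 0.7),
]


def recommend_from_cluster(profile, prototypes=PROTOTYPES, items=ITEMS, top_n=2):
    """Recommend the most popular items of the nearest cluster.

    `profile` is a 0/1 tuple marking which items the user already has.
    """
    # Search for the nearest prototype instead of the nearest users.
    nearest = min(prototypes, key=lambda proto: math.dist(profile, proto))
    # Candidate items are those the user does not own yet.
    candidates = [
        (weight, item)
        for weight, item, owned in zip(nearest, items, profile)
        if not owned
    ]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [item for weight, item in candidates[:top_n] if weight > 0]


print(recommend_from_cluster((1, 0, 0, 0)))  # ['book_b', 'book_c']
```

The online cost now depends on the number of prototypes rather than the number of users, which is the point of abstracting from individual profiles.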

Alternative approach: item to item filtering • Record pairs of items bought by the same person • This computation is done offline for all items. • Use this information to recommend similar or popular books bought by others. • Rather than finding similar persons, find similar items for each item in the profile • This computation is fast and done online.
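The split between an offline and an online step can be sketched as below. This is a minimal illustration, assuming that raw co-occurrence counts serve as the item similarity; real systems may normalise these counts, and the function names and the `baskets` data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Offline: count how often each pair of items is bought by the same person."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets:
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend_item_based(profile, counts, top_n=3):
    """Online: for each item in the profile, look up the items bought with it."""
    scores = defaultdict(int)
    for item in profile:
        for other, n in counts.get(item, {}).items():
            if other not in profile:
                scores[other] += n
    # Highest total count first; ties are broken alphabetically.
    return sorted(scores, key=lambda it: (-scores[it], it))[:top_n]


# Hypothetical purchase histories, one set of items per person.
baskets = [
    {"book_a", "book_b", "book_c"},
    {"book_a", "book_b"},
    {"book_b", "book_c", "book_d"},
    {"book_a", "book_d"},
]

counts = build_cooccurrence(baskets)
print(recommend_item_based({"book_a", "book_b"}, counts))  # ['book_c', 'book_d']
```

The expensive pass over all purchase histories happens once in `build_cooccurrence`; the online step only does dictionary lookups for the items in one profile.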

Questions about recommender systems?

Research @LIACS • Studies • Computer Science, Bio Informatics, Mediatechnology, ICT in Business • Research groups • Algorithms and Programmethodology • Digital Life Technologies • Imaging & BioInformatics • High Performance Computing • Leiden Embedded Research Center • Software Engineering & Information Systems • Theoretical Computer Science

Some examples of my research areas (jointly with students) • Mix between applications and new algorithms • Video mining: recognize settings, porn filtering • Artificial Immune Systems: copying learning ability of immune systems • Predicting Survival Rate for Throat Cancer Patients • Crime Data Mining • Fusing Data from Multiple Sources • Decisioning: offering the right product to the right customer using predictions • Bias variance evaluation: distinguish between different sources of error for a classifier

What have we learned so far? • Day 1 • Data mining fundamentals • Basic hands on experience using WEKA • Day 2 • Delving deeper into selected applications & algorithms • Zoom in on a data mining case using WEKA
