A Data Mining Course for Computer Science: Primary Sources and Implementations

A Data Mining Course for Computer Science: Primary Sources and Implementations • Dave Musicant • Saturday, March 4, 2006

Overview • What is data mining? • Why offer a course in data mining? • Why focus on research papers in an undergraduate class? • What topics do I cover? • What research papers do I use in class? • What assignments do I use? • Does it work?

What is data mining? • “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al.) • Data mining and machine learning are two sides of the same coin • Data mining focuses more on larger datasets • Machine learning focuses more on connections with artificial intelligence • ... but there is much overlap between the two areas. • My course is titled “Machine Learning and Data Mining” • The combined title boosts student enthusiasm

Why offer a course in data mining? • Interesting applied area of CS that uses theoretical techniques • Reinforces and introduces data structures and algorithms • heaps, R-trees, graphs • Privacy and ethics • Personal ownership in assignments • Students choose datasets in areas that interest them • New field, yet accessible • Can be done with only Data Structures as a prereq • It’s my research area

Why research papers? Can it be done? • One approach to the course is to use data mining software • Lopez & Ludwig, University of Minnesota-Morris • I wanted students to implement data mining algorithms • Textbook support with a computer science focus is limited • (I use Margaret Dunham’s text as a side reference) • Primary sources provide a rich experience • With proper selection, papers are accessible to undergraduates • Papers must be supplemented in the classroom • e.g. specific topics in linear algebra, statistics • This directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading

Topics, Papers, Assignments • Each topic consists of one or more papers that students read before class discussion. • Students post to Caucus (electronic message board): • something they didn’t understand, or something they found interesting • a potential exam question • An assignment follows class discussion • Detailed references for all papers and datasets can be found in the paper

Topic 0: What is Data Mining? • Paper: J. Friedman. “Data Mining and Statistics: What’s the Connection?” • Entertaining and controversial • Pokes fun at flaws on all sides • Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course) • Assignment: For the “census-income” dataset, determine: • Number of records and features • How many features are continuous, how many are nominal • For continuous features: average, median, minimum, maximum, standard deviation • 2-dimensional scatter plots of two features at a time • Interesting patterns
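The per-feature statistics asked for in the Topic 0 assignment can be sketched with the Python standard library. This is a minimal illustration, not the course's reference solution: the function name `summarize` and the sample values are hypothetical and are not taken from the census-income dataset.

```python
import statistics

def summarize(values):
    """Return basic summary statistics for one continuous feature."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

# Hypothetical values for illustration only, not real census-income records.
ages = [23, 35, 35, 47, 60]
summary = summarize(ages)
for name, value in summary.items():
    print(name, value)
```

Whether to report the sample or the population standard deviation is a choice the slide leaves open; `statistics.pstdev` would give the population version.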

Topic 1: Classification and Regression • Example: First Trimester Screening • Use a training set to learn how to classify patients whose diagnosis is not known • The input data is often easily obtained, whereas the classification is not. • [Figure: training set and testing set; labels: Input Data, Classification]

Technique: Nearest Neighbor • Envision each example as a point in n-dimensional space • Classify a test point the same as its nearest training point • [Figure label: “What am I?”]

Topic 1: Classification and Regression • Focus on scalable nearest neighbor algorithms • Paper: Roussopoulos et al. “Nearest Neighbor Queries” • How to do NN efficiently when data doesn’t fit in core • Requires R-trees (I cover in class) • Assignment: Code up the traditional k-nearest neighbor algorithm, apply to census-income data • Experiment with different distance metrics (1-norm, 2-norm, cosine) • Experiment with different values of k • Produce plots showing training and test set accuracies • Interpret results
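The traditional k-nearest neighbor assignment above can be sketched as a brute-force search with a pluggable distance function. This is a minimal sketch under stated assumptions: it does not use R-trees or any out-of-core technique from the paper, all function names are hypothetical, and the tiny two-feature dataset is invented for illustration rather than drawn from census-income.

```python
import math
from collections import Counter

def dist_1norm(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def dist_2norm(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dist_cosine(a, b):
    # Cosine distance = 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def knn_classify(train, query, k, dist):
    """train is a list of (features, label); majority vote among the k closest."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k, dist):
    hits = sum(knn_classify(train, x, k, dist) == y for x, y in test)
    return hits / len(test)

# Invented data for illustration only.
train = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
         ((8.0, 8.0), "high"), ((9.0, 7.5), "high"), ((8.5, 9.0), "high")]
test = [((1.2, 1.1), "low"), ((8.8, 8.1), "high")]

for k in (1, 3):
    print(k, accuracy(train, test, k, dist_2norm))
```

Passing the metric as a parameter keeps the experiments over distance functions and values of k down to a pair of nested loops, which is convenient when producing the accuracy plots.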

Topic 2: Clustering • Sometimes referred to as unsupervised learning • Goal: find clusters of similar data • Less accurate than supervised learning, but quite useful when no training set is available • Where are the clusters below? How many are there? • [Figure: scatter plots; axis labels: tissue (cm), chemical 1, chemical 2]

Topic 2: Clustering • Assignment: Find dataset of interest from UCI Repository • iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo • this dataset is used for most remaining assignments • if dataset has a class label, discard it for this assignment • Implement basic clustering algorithm (k-means) • Try varying number of clusters • Try two different techniques for initializing clusters • Report and interpret results found
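A basic k-means loop with two interchangeable initialization strategies might look like the sketch below. The slides do not say which two initialization techniques were intended, so the two shown here (random data points as centers, and means of a random partition) are assumptions, as are all function names and the sample points.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def init_random_points(data, k, rng):
    """Initialization 1: use k distinct data points as starting centers."""
    return rng.sample(data, k)

def init_random_partition(data, k, rng):
    """Initialization 2: assign points to random groups and use the group means."""
    groups = [[] for _ in range(k)]
    for i, p in enumerate(data):
        # The first k points seed one group each so that no group is empty.
        groups[i if i < k else rng.randrange(k)].append(p)
    return [mean_point(g) for g in groups]

def kmeans(data, k, init, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = init(data, k, rng)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        new_centers = [mean_point(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return centers, clusters

# Invented points for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, 2, init_random_points)
print(centers)
```

Because k-means only finds a local optimum, different seeds or initializers can give different clusterings, which is part of what the assignment asks students to report on.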

Topic 2: Clustering • Paper: Bradley et al, “Scaling Clustering Algorithms to Large Databases” • Describes “Scalable K-means” algorithm • Class discussion around “data mining desiderata” • Paper: Guha et al, “CURE: An Efficient Clustering Algorithm for Large Databases” • Agglomerative clustering algorithm • completely different approach • Requires use of a heap (as I pose the assignment) • Assignment: Implement stripped-down version of CURE • Run on dataset, interpret results
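The role of the heap in an agglomerative algorithm can be illustrated with a much simpler relative of CURE. The sketch below is not CURE: it merges clusters by centroid distance and omits the paper's representative points, shrinking, and sampling. It only shows how a heap with lazily discarded stale entries can drive the repeated "merge the two closest clusters" step. The function names and sample points are hypothetical.

```python
import heapq
from itertools import count

def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def agglomerate(data, k):
    """Merge the two closest clusters until only k remain (assumes k >= 1)."""
    ids = count()
    clusters = {next(ids): [p] for p in data}  # cluster id -> member points
    heap = [(sq_dist(centroid(a), centroid(b)), i, j)
            for i, a in clusters.items()
            for j, b in clusters.items() if i < j]
    heapq.heapify(heap)
    while len(clusters) > k:
        _, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(i) + clusters.pop(j)
        new_id = next(ids)
        c_new = centroid(merged)
        # Queue the distance from the new cluster to every surviving cluster.
        for other_id, members in clusters.items():
            heapq.heappush(heap, (sq_dist(c_new, centroid(members)), other_id, new_id))
        clusters[new_id] = merged
    return list(clusters.values())

# Invented points for illustration only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = agglomerate(points, 2)
print(result)
```

Leaving outdated pairs in the heap and skipping them on pop avoids needing a heap that supports deletion, at the cost of some extra memory.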

Topic 3: Association Rules • “Supermarket basket analysis” • What items do people tend to buy together at the same time? • Paper: Agrawal et al, “Fast Algorithms for Mining Association Rules” • presents classic Apriori algorithm (skim other portions of paper) • Assignment: Implement the Apriori algorithm and run it on own dataset
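The frequent-itemset phase of Apriori can be sketched as a level-wise search: count single items, then repeatedly build size-k candidates from frequent itemsets of size k - 1, prune any candidate with an infrequent subset, and count the rest. This sketch is a simplification: the join step unions arbitrary pairs instead of using the paper's prefix-based join, rule generation is left out, and the function name and basket data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        candidates = set()
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# Invented baskets for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The pruning step is what makes the approach practical: a candidate is never counted against the data if one of its subsets has already been found infrequent.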

Topic 4: Web Mining • How does Google rank importance of web pages? • Every page has a PageRank • PageRank of a page is determined by the PageRank of the pages that link to it • manifests itself as an eigenvalue problem • Paper: Page et al, “The PageRank Citation Ranking: Bringing Order to the Web” • describes basic version of Google PageRank algorithm • cover eigenvalues in class • exposure to linear algebra, numerical analysis

Topic 4: Web Mining • Paper: Chakrabarti et al, “Mining the Link Structure of the World Wide Web” • describes HITS algorithm for ranking web pages • Google isn’t the only way to do it • uses Latent Semantic Analysis, which requires singular value decomposition (cover in class) • Assignment: Implement PageRank algorithm • try it on archive of department website • crawling for an assignment is dangerous • sparse data representation • hashing or other form of map for efficiency • interpret results • [Figure: hubs and authorities]
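A basic PageRank computation over a sparse, map-based link structure can be sketched with power iteration. The assumptions here are mine rather than the slides': a damping factor of 0.85 (a commonly used value), uniform redistribution of rank from pages with no out-links, and a hypothetical four-page site in place of a department website archive.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """links maps each page to the list of pages it links to (sparse form)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {p: base for p in pages}
        for p, targets in links.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank

# Hypothetical site structure for illustration only.
site = {
    "index": ["courses", "research"],
    "courses": ["index"],
    "research": ["index", "courses"],
    "cv": ["index"],
}
ranks = pagerank(site)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Storing only the links that exist, in a dictionary keyed by page, keeps each iteration proportional to the number of links instead of the square of the number of pages.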

Topic 5: Collaborative Filtering • a.k.a. Recommender Systems • “I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?” • Amazon.com, Yahoo! Launchcast • Paper: Breese et al, “Empirical Analysis of Predictive Algorithms for Collaborative Filtering” • Algorithms are nearest neighbor-like in flavor • Involve averaging numerical scores • Need to normalize for individual biases • Students already working on final project, so no assignment
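The nearest-neighbor flavor, score averaging, and bias normalization mentioned above can be illustrated with a small memory-based predictor: the active user's mean rating plus a correlation-weighted average of other users' deviations from their own means. This is a sketch in the spirit of the correlation-based method, not a reproduction of the paper's experiments; the function names and all ratings are hypothetical.

```python
import math

def mean_rating(ratings):
    return sum(ratings.values()) / len(ratings)

def pearson(a, b):
    """Correlation between two users over the items both have rated."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    ma, mb = mean_rating(a), mean_rating(b)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = math.sqrt(sum((a[i] - ma) ** 2 for i in common)
                    * sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict(users, active, item):
    """Predict the active user's rating for an item they have not rated."""
    target = users[active]
    base = mean_rating(target)
    num = norm = 0.0
    for name, ratings in users.items():
        if name == active or item not in ratings:
            continue
        w = pearson(target, ratings)
        # Subtracting each user's own mean normalizes for individual bias.
        num += w * (ratings[item] - mean_rating(ratings))
        norm += abs(w)
    return base + num / norm if norm else base

# Invented ratings on a 1-5 scale, for illustration only.
users = {
    "ann": {"Pink Floyd": 5, "Dream Theater": 4, "Evanescence": 1},
    "bob": {"Pink Floyd": 4, "Dream Theater": 5, "Evanescence": 2, "Rush": 5},
    "cat": {"Pink Floyd": 1, "Dream Theater": 2, "Evanescence": 5, "Rush": 1},
}
prediction = predict(users, "ann", "Rush")
print(prediction)
```

A negatively correlated user still contributes: a low rating from someone with opposite tastes pushes the prediction up.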

Topic 6: Ethical Issues in Data Mining • Privacy concerns • Good vs. evil uses of data mining • Video: Ramakrishnan et al, “Data Mining: Good, Bad, or Just a Tool?” • Panel discussion from KDD 2004 • Before watching video, students post to Caucus: • how data mining could be exploited • how this could be prevented (if possible) • After watching video • followup commentary Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/

Topic 6: Ethical Issues in Data Mining • Students’ response to the video was more engaged than I expected • More problems than solutions are raised in the video • Students were frustrated that solutions weren’t clear • Many students were interested in the issue of accountability • If someone’s privacy is violated, who is responsible? • “Who do I sue?” • Lively class discussion

Final Project • “Do almost anything you want regarding data mining, so long as I approve it” • Find a paper and implement the algorithm within • Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term • Quantitative association rules • Poker association rules • Collaborative filtering (music, art) • Attack KDD Cup problems • KDD Cup 2005: identify categories for web search queries • tried this once: tended to be too big for them in the time that I had • could perhaps be done with right level of support

Conclusions • Papers are most memorable part of course • Students speak very positively about this in evaluations • Significant prep time for me to fill in gaps • Caucus motivates reading papers • Students find this a pain, but are thankful afterwards in evals • Important to set deadline for posting a few hours before class so I have time to read • Programming assignments work (mostly) well • Allow students to work in pairs if they wish • Grading is difficult: unspecified details in algorithms, differing datasets • All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05
