Data Mining Assignments Erik Zeitler Uppsala Database Laboratory
Oral exam
• Two parts:
  • Validation
    • Your solution is validated using a script
    • If your implementation does not work
      • the examination ends immediately (“fail” grade is given)
      • you may re-do the examination later
  • Discussion
    • Bring one form per student
    • The instructor will ask questions
      • about your implementation
      • about the method
    • All group members must be able to answer
    • Group members can get different grades on the same assignment
Grades
What you need to do
• Sign up for labs and examination
  • Groups of 2–4 students
  • Forms are on the board outside 1321
• Implement a solution
  • Deadline: Submit by e-mail no later than 24h before your examination
    • Assignment 1: erik.zeitler@it.uu.se
    • Assignments 2, 3, 4: gyozo.gidofalvi@it.uu.se
• Answer the questions on the form
  • Bring one printed form per student to the examination
• Prepare for the discussion
  • Understand the theory
The K Nearest Neighbor Algorithm (kNN)
kNN use cases
• GIS queries:
  • “Tell me where the 5 nearest restaurants are”
• Classifier queries:
  • “To classify a data point, look at the 5 nearest data points of known class”
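The GIS-style query can be sketched in a few lines of Python. This is only an illustration: the restaurant names and coordinates are made up, `nearest_k` is a hypothetical helper name, and plain Euclidean distance is assumed.

```python
import heapq
import math

def nearest_k(query, points, k):
    """Return the k (name, coordinate) pairs closest to query (Euclidean distance)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(query, p[1]))

# Made-up restaurant locations (x, y), for illustration only.
restaurants = [
    ("A", (0.0, 1.0)),
    ("B", (2.0, 2.0)),
    ("C", (5.0, 5.0)),
    ("D", (0.5, 0.5)),
    ("E", (9.0, 0.0)),
    ("F", (1.5, 0.0)),
]

nearest = nearest_k((0.0, 0.0), restaurants, 3)
print([name for name, _ in nearest])
```

A classifier query uses the same nearest-neighbor search, but then looks at the class labels of the points it finds.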
kNN classifier intuition
• If you don’t know what you are
  • Look at the nearest ones around you
  • You are probably of the same kind
Classifying you using kNN
• Each one of you belongs to a group:
  • [F|STS|Int Masters|Exchange Students|Other]
• Classify yourself using 1-NN and 3-NN
  • Look at your k nearest neighbors!
• How do we select our distance measure?
• How do we decide which of 1-NN and 3-NN is best?
A basic kNN classifier implementation
• Distance function
  • euclid
  • minkowski
  • maxnorm
• Select a set with least distance
  • leastk
• Count each class occurrence
  • groupby, count
• Select the top voted class
  • topk
• Input:
  • A test point x
  • The set of known points P
  • Number of neighbors k
• Output:
  • Class belonging c
• Implementation:
  • Find the set of k points N ⊆ P that are nearest to x
  • Count the number of occurrences of each class in N
  • c = class to which the most points in N belong
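The three implementation steps can be sketched in Python as follows. This is a minimal sketch, not the AmosQL skeleton handed out for the assignment: the function name `knn_classify`, the default Euclidean distance, the tie-breaking behavior, and the sample points are all assumptions made for illustration.

```python
from collections import Counter
import math

def knn_classify(x, known, k, distance=math.dist):
    """Classify x by majority vote among its k nearest known points.

    known is a list of (point, class_label) pairs.
    """
    # Step 1: find the k known points nearest to x.
    neighbors = sorted(known, key=lambda pc: distance(x, pc[0]))[:k]
    # Step 2: count the occurrences of each class among those neighbors.
    votes = Counter(label for _, label in neighbors)
    # Step 3: pick the class with the most votes
    # (on a tie, the class encountered first, i.e. the nearer one, wins).
    return votes.most_common(1)[0][0]

# Made-up training points, for illustration only.
known = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_classify((1.1, 1.0), known, 3))
print(knn_classify((4.9, 5.0), known, 3))
```

Passing the distance function as a parameter makes it easy to compare metrics without touching the voting logic.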
kNN classifier in AmosQL
• Find the set of k points N ⊆ P that are nearest to x

    create function k_nearest(Vector of Number x, Integer k,
                              Bag of <Vector of Number, Number> P)
                            -> Bag of <Object, Object>
      as leastk(classdistances(x, P), k);

• Count the number of occurrences of each class in N

    create function k_votes(Vector of Number x, Integer k,
                            Bag of <Vector of Number, Number> P)
                          -> Bag of <Integer class, Number count>
      as groupby((select cl
                  from Number cl, Real dist
                  where <dist, cl> in k_nearest(x, k, P)),
                 #'count');

• c = class to which the most points in N belong

    create function knnclassify(Vector of Number x, Integer k,
                                Bag of <Vector of Number, Number> P)
                              -> Number class
      as...
kNN classifier in a DBMS
• kNN classifier requires no stored procedure
  • Can be defined entirely through queries
  • Should be defined entirely through queries
• Benefits:
  • Less code
  • Less error-prone
  • No side effects!
  • Optimizable
    • DBMS decides how to execute the queries
    • Best possible use of indexes
More kNN classifier intuition
• If it walks and sounds like a duck
  Then it must be a duck
• If it walks and sounds like a cow
  Then it must be a cow
Walking and talking
• Assume that a duck
  • has step length 5…15 cm
  • quacks at 600…700 Hz
• Assume that a cow
  • has step length 30…60 cm
  • moos at 100…200 Hz
Cows and ducks in a plot
[Plot: cows and ducks; annotations: stdev 30, stdev 300]
Normalize:
• subtract mean, divide by stdev
• subtract min, divide by (max – min)
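Both normalizations can be sketched per dimension in Python. The sample values below are made up (chosen inside the duck and cow ranges given earlier), the function names `zscore` and `minmax` are hypothetical, and the sample standard deviation is an assumption; the slides do not say which variant is meant.

```python
from statistics import mean, stdev

def zscore(points):
    """Per dimension: subtract the mean, divide by the sample standard deviation."""
    columns = list(zip(*points))
    means = [mean(c) for c in columns]
    stdevs = [stdev(c) for c in columns]  # fails if a dimension is constant
    return [tuple((v - m) / s for v, m, s in zip(p, means, stdevs)) for p in points]

def minmax(points):
    """Per dimension: subtract the minimum, divide by (max - min)."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]  # fails if a dimension is constant
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
            for p in points]

# Made-up (step length in cm, sound frequency in Hz) samples.
animals = [(5.0, 600.0), (15.0, 700.0), (30.0, 100.0), (60.0, 200.0)]

print(zscore(animals))
print(minmax(animals))
```

Without normalization, the frequency dimension (hundreds of Hz) dominates any distance computed together with step length (tens of cm).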
Enter the chicken
Normalize in AmosQL
• Use aggv: aggv(bag of vector, function) -> vector
  • “function”!?
  • Aggv applies any function on each dimension in the entire bag!

    aggv(knowncoordinates(), #'avg');

• Normalization is easy:

    create function normalnorm(bag of vector of number b)
                             -> bag of vector of number
      as select (x - a) ./ s
         from vector of number a, vector of number s, vector of number x
         where a = aggv(b, #'avg')
           and s = aggv(b, #'stdev')
           and x in b;

• Free Advice: Think twice when normalizing in kNN.
  • What should be normalized, and when?
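For readers unfamiliar with AmosQL, the idea behind aggv can be mimicked in Python. This is only a rough analogue written for illustration, not the AmosQL implementation; the Python function name `aggv` and the sample vectors are assumptions.

```python
from statistics import mean

def aggv(vectors, fn):
    """Apply fn to each dimension across all vectors; return one value per dimension."""
    return [fn(column) for column in zip(*vectors)]

# Made-up vectors, for illustration only.
vectors = [(1, 10), (3, 30), (5, 50)]

print(aggv(vectors, mean))
print(aggv(vectors, max))
```

Any function that reduces a sequence of numbers to one number (mean, standard deviation, min, max) can be passed in, which is what makes the per-dimension normalization a short query.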
Assignment 1 in a nutshell
• You will get
  • 163 data points with known class belonging
  • 30 data points with unknown class belonging
  • A kNN implementation skeleton in AmosQL (you have to add some functionality)
• Experiment using the 163 data points:
  • Find the best k
  • Investigate the impact of normalization
  • Investigate different metrics
• Classify the 30 data points
Things to consider: Know your data!
• Normalize / pre-process
  • What are the ranges of the different measurements?
  • Is one characteristic more important than another?
    • If so, what should we do?
    • If not, should we do anything else?
  • You can assume: no missing points, no noise.
• Select training + testing data
  • Is the data sorted? If so, is this good or bad?
  • We suggest you try leave-one-out cross-validation.
    • Are there any alternatives?
• Choose k
  • How do you know if the value of k is good?
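Leave-one-out cross-validation can be sketched as follows: each known point is held out in turn, classified using all the others, and the fraction of correct answers is recorded for each candidate k. This is a minimal Python sketch under assumptions: the helper names, the Euclidean distance, and the tiny made-up data set are illustrative only and are not the assignment data.

```python
from collections import Counter
import math

def knn_classify(x, known, k):
    """Majority vote among the k known points nearest to x (Euclidean distance)."""
    neighbors = sorted(known, key=lambda pc: math.dist(x, pc[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Fraction of points classified correctly when each is held out in turn."""
    correct = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]  # every point except the held-out one
        if knn_classify(point, rest, k) == label:
            correct += 1
    return correct / len(data)

# Made-up data: two well-separated classes with three points each.
data = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

for k in (1, 3, 5):
    print(k, loo_accuracy(data, k))
```

In this toy data set, k = 5 performs badly because, once a point is held out, its own class has only two remaining members and is always outvoted by the other three.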
Know your data!
• How many points of each class are there?
  • Should this observation affect k?
• Choose distance measure
  • What distance measure is suitable? Why?
  • Euclid, Minkowski, and maxnorm are available in AmosII.
  • You can implement other distance measures, similarity measures, etc.
• Classify unknown data
  • Should the unknown data be normalized?
  • Which data set should be used to classify the unknown data?
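To compare the three distance measures named above, here are their textbook definitions in Python. These are illustrative re-implementations only; the signatures of the AmosII functions are not shown in the slides, so the parameter lists here are assumptions.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p: (sum |a_i - b_i|^p)^(1/p)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclid(a, b):
    """Euclidean distance: the Minkowski distance of order 2."""
    return minkowski(a, b, 2)

def maxnorm(a, b):
    """Maximum norm: the largest absolute difference in any single dimension."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclid(a, b))        # straight-line distance
print(minkowski(a, b, 1))  # order 1: sum of absolute differences
print(maxnorm(a, b))       # largest single-dimension difference
```

The measures can rank neighbors differently: the maximum norm considers only the dimension with the largest difference, while a low-order Minkowski distance adds up the differences across all dimensions.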