Data Mining Assignments

Presentation Transcript


1. Data Mining Assignments
   Erik Zeitler, Uppsala Database Laboratory

2. Oral exam
• Two parts: validation and discussion.
• Validation
  • Your solution is validated using a script.
  • If your implementation does not work:
    • the examination ends immediately (a “fail” grade is given)
    • you may re-do the examination later
• Discussion
  • Bring one form per student.
  • The instructor will ask questions
    • about your implementation
    • about the method
  • All group members must be able to answer.
  • Group members can get different grades on the same assignment.

3. Grades

4. What you need to do
• Sign up for labs and examination
  • Groups of 2–4 students
  • Forms are on the board outside 1321
• Implement a solution
  • Deadline: submit by e-mail no later than 24 h before your examination
    • Assignment 1: erik.zeitler@it.uu.se
    • Assignments 2, 3, 4: gyozo.gidofalvi@it.uu.se
• Answer the questions on the form
  • Bring one printed form per student to the examination
• Prepare for the discussion
  • Understand the theory

5. The K Nearest Neighbor Algorithm (kNN)
   Erik Zeitler, Uppsala Database Laboratory

6. kNN use cases
• GIS queries:
  • “Tell me where the 5 nearest restaurants are.”
• Classifier queries:
  • “To classify a data point, look at the 5 nearest data points whose class belonging is known.”

7. kNN classifier intuition
• If you don’t know what you are,
• look at the nearest ones around you:
• you are probably of the same kind.

8. Classifying you using kNN
• Each one of you belongs to a group:
  • [F | STS | Int Masters | Exchange Students | Other]
• Classify yourself using 1-NN and 3-NN
  • Look at your k nearest neighbors!
• How do we select our distance measure?
• How do we decide which of 1-NN and 3-NN is best?

9. A basic kNN classifier implementation
• Input:
  • A test point x
  • The set of known points P
  • Number of neighbors k
• Output:
  • Class belonging c
• Implementation:
  • Find the set of k points N ⊆ P that are nearest to x
  • Count the number of occurrences of each class in N
  • c = the class to which the most points in N belong
• AmosQL building blocks for each step:
  • Distance function: euclid, minkowski, maxnorm
  • Select a set with least distance: leastk
  • Count each class occurrence: groupby, count
  • Select the top voted class: topk
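
A minimal sketch of the same three steps in Python, for orientation only (the assignment itself is solved in AmosQL; the (vector, class) tuple layout of P is an assumption):

    from collections import Counter
    import math

    def euclid(x, y):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn_classify(x, P, k, dist=euclid):
        # P: sequence of (vector, class) pairs with known class belonging.
        # Step 1: find the k points in P nearest to x (leastk).
        nearest = sorted(P, key=lambda point: dist(x, point[0]))[:k]
        # Step 2: count the occurrences of each class among them (groupby, count).
        votes = Counter(cls for _, cls in nearest)
        # Step 3: return the class with the most votes (topk).
        return votes.most_common(1)[0][0]

For example, knn_classify((0, 0), [((1, 1), 'A'), ((5, 5), 'B'), ((6, 5), 'B')], k=1) returns 'A', while k=3 returns 'B' by majority vote, which is exactly why the choice of k matters.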

10. kNN classifier in AmosQL
• Find the set of k points N ⊆ P that are nearest to x:

    create function k_nearest(Vector of Number x, Integer k,
                              Bag of <Vector of Number, Number> P)
      -> Bag of <Object, Object>
      as leastk(classdistances(x, P), k);

• Count the number of occurrences of each class in N:

    create function k_votes(Vector of Number x, Integer k,
                            Bag of <Vector of Number, Number> P)
      -> Bag of <Integer class, Number count>
      as groupby((select cl
                  from Number cl, Real dist
                  where <dist, cl> in k_nearest(x, k, P)),
                 #'count');

• c = class to which the most points in N belong:

    create function knnclassify(Vector of Number x, Integer k,
                                Bag of <Vector of Number, Number> P)
      -> Number class
      as ...

11. kNN classifier in a DBMS
• The kNN classifier requires no stored procedure
  • Can be defined entirely through queries
  • Should be defined entirely through queries
• Benefits:
  • Less code
  • Less error-prone
  • No side effects!
  • Optimizable
    • The DBMS decides how to execute the queries
    • Best possible use of indexes

12. More kNN classifier intuition
• If it walks and sounds like a duck ⇒ then it must be a duck
• If it walks and sounds like a cow ⇒ then it must be a cow

13. Walking and talking
• Assume that a duck
  • has a step length of 5…15 cm
  • quacks at 600…700 Hz
• Assume that a cow
  • has a step length of 30…60 cm
  • moos at 100…200 Hz
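
To see why the normalization on the next slide matters, take a hypothetical duck at (10 cm, 650 Hz) and a cow at (45 cm, 150 Hz): the unnormalized squared Euclidean distance is (45 − 10)² + (650 − 150)² = 1 225 + 250 000, so the frequency dimension contributes over 99% of the distance and step length is effectively ignored.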

14. Cows and ducks in a plot
[Scatter plot of the cows and ducks in the step-length / frequency plane; the axis spreads differ by an order of magnitude (stdev ≈ 30 vs. stdev ≈ 300), so the wider dimension dominates unnormalized distances.]
Normalize:
• subtract mean, divide by stdev
• subtract min, divide by (max – min)
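
A minimal Python sketch of the two normalization schemes named on the slide; "column" here means one dimension of the data, e.g. all step lengths:

    def z_normalize(column):
        # Subtract the mean, divide by the stdev (z-score normalization).
        m = sum(column) / len(column)
        s = (sum((v - m) ** 2 for v in column) / (len(column) - 1)) ** 0.5
        return [(v - m) / s for v in column]

    def minmax_normalize(column):
        # Subtract the min, divide by (max - min); maps the column onto [0, 1].
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]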

15. Enter the chicken

16. Normalize in AmosQL
• Use aggv: aggv(bag of vector, function) -> vector
  • “function”!?
  • aggv applies any function on each dimension over the entire bag:

    aggv(knowncoordinates(), #'avg');

• Normalization is easy:

    create function normalnorm(bag of vector of number b)
      -> bag of vector of number
      as select (x - a) ./ s
         from vector of number a, vector of number s, vector of number x
         where a = aggv(b, #'avg')
           and s = aggv(b, #'stdev')
           and x in b;

• Free advice: think twice when normalizing in kNN.
  • What should be normalized, and when?
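
A rough Python analogue of the aggv semantics, as an illustration only (not AmosII code; the helper names are hypothetical):

    from statistics import mean, stdev

    def aggv(bag, fn):
        # Apply fn to each dimension (column) across the whole bag of vectors.
        return [fn(column) for column in zip(*bag)]

    def normalnorm(bag):
        # AmosQL's (x - a) ./ s, element-wise: subtract the per-dimension
        # mean, divide by the per-dimension stdev.
        a = aggv(bag, mean)
        s = aggv(bag, stdev)
        return [[(v - m) / sd for v, m, sd in zip(x, a, s)] for x in bag]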

17. Assignment 1 in a nutshell
• You will get:
  • 163 data points with known class belonging
  • 30 data points with unknown class belonging
  • A kNN implementation skeleton in AmosQL (you have to add some functionality)
• Experiment using the 163 data points:
  • Find the best k
  • Investigate the impact of normalization
  • Investigate different metrics
• Classify the 30 data points

18. Things to consider: know your data!
• Normalize ⇒ pre-process
  • What are the ranges of the different measurements?
  • Is one characteristic more important than another?
  • If so, what should we do? If not, should we do anything else?
  • You can assume: no missing points, no noise.
• Select training + testing data
  • Is the data sorted? If so, is this good or bad?
  • We suggest you try leave-one-out cross-validation (see the sketch after this slide).
  • Are there any alternatives?
• Choose k
  • How do you know if the value of k is good?
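
A sketch of leave-one-out cross-validation in Python, reusing knn_classify from the slide 9 sketch; the name "known" (a list holding the 163 labeled points) is an assumption:

    def loo_accuracy(known, k):
        # Leave-one-out: classify each labeled point against all the
        # others and report the fraction classified correctly.
        hits = 0
        for i, (x, cls) in enumerate(known):
            rest = known[:i] + known[i + 1:]   # everything except point i
            if knn_classify(x, rest, k) == cls:
                hits += 1
        return hits / len(known)

    # Sweep odd k values (odd k avoids two-way ties) and keep the best one.
    best_k = max(range(1, 16, 2), key=lambda k: loo_accuracy(known, k))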

19. Know your data!
• How many points of each class are there?
  • Should this observation affect k?
• Choose distance measure
  • What distance measure is suitable? Why?
  • Euclid, Minkowski, and maxnorm are available in AmosII.
  • You can implement other distance measures, similarity measures, etc.
• Classify unknown data
  • Should the unknown data be normalized?
  • Which data set should be used to classify the unknown data?
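
For comparison, standard textbook definitions of the other two metrics named on the slide, in Python (AmosII's built-ins may be parameterized differently):

    def minkowski(x, y, p=3):
        # Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean.
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    def maxnorm(x, y):
        # Max-norm (Chebyshev) distance: the largest per-dimension difference.
        return max(abs(a - b) for a, b in zip(x, y))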
