Data Mining Assignments

Presentation Transcript


1. Data Mining Assignments
   Erik Zeitler, Uppsala Database Laboratory

2. Oral exam
• Two parts: validation and discussion.
• Validation
  • Your solution is validated using a script.
  • If your implementation does not work:
    • the examination ends immediately (a “fail” grade is given)
    • you may re-do the examination later
• Discussion
  • Bring one form per student.
  • The instructor will ask questions
    • about your implementation
    • about the method
  • All group members must be able to answer.
  • Group members can get different grades on the same assignment.

3. Grades

4. What you need to do
• Sign up for labs and examination
  • Groups of 2–4 students
  • Forms are on the board outside 1321
• Implement a solution
  • Deadline: submit by e-mail no later than 24 h before your examination
    • Assignment 1: erik.zeitler@it.uu.se
    • Assignments 2, 3, 4: gyozo.gidofalvi@it.uu.se
• Answer the questions on the form
  • Bring one printed form per student to the examination
• Prepare for the discussion
  • Understand the theory

5. The K Nearest Neighbor Algorithm (kNN)
   Erik Zeitler, Uppsala Database Laboratory

6. kNN use cases
• GIS queries:
  • “Tell me where the 5 nearest restaurants are.”
• Classifier queries:
  • “To classify a data point, look at the 5 nearest data points whose class belonging is known.”

7. kNN classifier intuition
• If you don’t know what you are,
• look at the nearest ones around you:
• you are probably of the same kind.

8. Classifying you using kNN
• Each one of you belongs to a group:
  • [F | STS | Int Masters | Exchange Students | Other]
• Classify yourself using 1-NN and 3-NN
  • Look at your k nearest neighbors!
• How do we select our distance measure?
• How do we decide which of 1-NN and 3-NN is best?

9. A basic kNN classifier implementation
• Input:
  • A test point x
  • The set of known points P
  • Number of neighbors k
• Output:
  • Class belonging c
• Implementation:
  • Find the set of k points N ⊆ P that are nearest to x
  • Count the number of occurrences of each class in N
  • c = the class to which the most points in N belong
• AmosQL building blocks for each step:
  • Distance function: euclid, minkowski, maxnorm
  • Select a set with least distance: leastk
  • Count each class occurrence: groupby, count
  • Select the top voted class: topk
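
A minimal sketch of the same three steps in Python, for orientation only (the assignment itself is solved in AmosQL; the (vector, class) tuple layout of P is an assumption):

    from collections import Counter
    import math

    def euclid(x, y):
        # Euclidean distance between two equal-length vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn_classify(x, P, k, dist=euclid):
        # P: sequence of (vector, class) pairs with known class belonging.
        # Step 1: find the k points in P nearest to x (leastk).
        nearest = sorted(P, key=lambda point: dist(x, point[0]))[:k]
        # Step 2: count the occurrences of each class among them (groupby, count).
        votes = Counter(cls for _, cls in nearest)
        # Step 3: return the class with the most votes (topk).
        return votes.most_common(1)[0][0]

For example, knn_classify((0, 0), [((1, 1), 'A'), ((5, 5), 'B'), ((6, 5), 'B')], k=1) returns 'A', while k=3 returns 'B' by majority vote, which is exactly why the choice of k matters.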

10. kNN classifier in AmosQL
• Find the set of k points N ⊆ P that are nearest to x:

    create function k_nearest(Vector of Number x, Integer k,
                              Bag of <Vector of Number, Number> P)
      -> Bag of <Object, Object>
      as leastk(classdistances(x, P), k);

• Count the number of occurrences of each class in N:

    create function k_votes(Vector of Number x, Integer k,
                            Bag of <Vector of Number, Number> P)
      -> Bag of <Integer class, Number count>
      as groupby((select cl
                  from Number cl, Real dist
                  where <dist, cl> in k_nearest(x, k, P)),
                 #'count');

• c = class to which the most points in N belong:

    create function knnclassify(Vector of Number x, Integer k,
                                Bag of <Vector of Number, Number> P)
      -> Number class
      as ...

11. kNN classifier in a DBMS
• The kNN classifier requires no stored procedure
  • Can be defined entirely through queries
  • Should be defined entirely through queries
• Benefits:
  • Less code
  • Less error-prone
  • No side effects!
  • Optimizable
    • The DBMS decides how to execute the queries
    • Best possible use of indexes

12. More kNN classifier intuition
• If it walks and sounds like a duck ⇒ then it must be a duck
• If it walks and sounds like a cow ⇒ then it must be a cow

13. Walking and talking
• Assume that a duck
  • has a step length of 5…15 cm
  • quacks at 600…700 Hz
• Assume that a cow
  • has a step length of 30…60 cm
  • moos at 100…200 Hz
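
To see why the normalization on the next slide matters, take a hypothetical duck at (10 cm, 650 Hz) and a cow at (45 cm, 150 Hz): the unnormalized squared Euclidean distance is (45 − 10)² + (650 − 150)² = 1 225 + 250 000, so the frequency dimension contributes over 99% of the distance and step length is effectively ignored.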

14. Cows and ducks in a plot
[Scatter plot of the cows and ducks in the step-length / frequency plane; the axis spreads differ by an order of magnitude (stdev ≈ 30 vs. stdev ≈ 300), so the wider dimension dominates unnormalized distances.]
Normalize:
• subtract mean, divide by stdev
• subtract min, divide by (max – min)
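
A minimal Python sketch of the two normalization schemes named on the slide; "column" here means one dimension of the data, e.g. all step lengths:

    def z_normalize(column):
        # Subtract the mean, divide by the stdev (z-score normalization).
        m = sum(column) / len(column)
        s = (sum((v - m) ** 2 for v in column) / (len(column) - 1)) ** 0.5
        return [(v - m) / s for v in column]

    def minmax_normalize(column):
        # Subtract the min, divide by (max - min); maps the column onto [0, 1].
        lo, hi = min(column), max(column)
        return [(v - lo) / (hi - lo) for v in column]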

15. Enter the chicken

16. Normalize in AmosQL
• Use aggv: aggv(bag of vector, function) -> vector
  • “function”!?
  • aggv applies any function on each dimension over the entire bag:

    aggv(knowncoordinates(), #'avg');

• Normalization is easy:

    create function normalnorm(bag of vector of number b)
      -> bag of vector of number
      as select (x - a) ./ s
         from vector of number a, vector of number s, vector of number x
         where a = aggv(b, #'avg')
           and s = aggv(b, #'stdev')
           and x in b;

• Free advice: think twice when normalizing in kNN.
  • What should be normalized, and when?
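
A rough Python analogue of the aggv semantics, as an illustration only (not AmosII code; the helper names are hypothetical):

    from statistics import mean, stdev

    def aggv(bag, fn):
        # Apply fn to each dimension (column) across the whole bag of vectors.
        return [fn(column) for column in zip(*bag)]

    def normalnorm(bag):
        # AmosQL's (x - a) ./ s, element-wise: subtract the per-dimension
        # mean, divide by the per-dimension stdev.
        a = aggv(bag, mean)
        s = aggv(bag, stdev)
        return [[(v - m) / sd for v, m, sd in zip(x, a, s)] for x in bag]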

17. Assignment 1 in a nutshell
• You will get:
  • 163 data points with known class belonging
  • 30 data points with unknown class belonging
  • A kNN implementation skeleton in AmosQL (you have to add some functionality)
• Experiment using the 163 data points:
  • Find the best k
  • Investigate the impact of normalization
  • Investigate different metrics
• Classify the 30 data points

18. Things to consider: know your data!
• Normalize ⇒ pre-process
  • What are the ranges of the different measurements?
  • Is one characteristic more important than another?
  • If so, what should we do? If not, should we do anything else?
  • You can assume: no missing points, no noise.
• Select training + testing data
  • Is the data sorted? If so, is this good or bad?
  • We suggest you try leave-one-out cross-validation (see the sketch after this slide).
  • Are there any alternatives?
• Choose k
  • How do you know if the value of k is good?
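
A sketch of leave-one-out cross-validation in Python, reusing knn_classify from the slide 9 sketch; the name "known" (a list holding the 163 labeled points) is an assumption:

    def loo_accuracy(known, k):
        # Leave-one-out: classify each labeled point against all the
        # others and report the fraction classified correctly.
        hits = 0
        for i, (x, cls) in enumerate(known):
            rest = known[:i] + known[i + 1:]   # everything except point i
            if knn_classify(x, rest, k) == cls:
                hits += 1
        return hits / len(known)

    # Sweep odd k values (odd k avoids two-way ties) and keep the best one.
    best_k = max(range(1, 16, 2), key=lambda k: loo_accuracy(known, k))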

19. Know your data!
• How many points of each class are there?
  • Should this observation affect k?
• Choose distance measure
  • What distance measure is suitable? Why?
  • Euclid, Minkowski, and maxnorm are available in AmosII.
  • You can implement other distance measures, similarity measures, etc.
• Classify unknown data
  • Should the unknown data be normalized?
  • Which data set should be used to classify the unknown data?
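
For comparison, standard textbook definitions of the other two metrics named on the slide, in Python (AmosII's built-ins may be parameterized differently):

    def minkowski(x, y, p=3):
        # Minkowski distance of order p; p=1 is Manhattan, p=2 is Euclidean.
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

    def maxnorm(x, y):
        # Max-norm (Chebyshev) distance: the largest per-dimension difference.
        return max(abs(a - b) for a, b in zip(x, y))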
