
Data Mining – Algorithms: Linear Models


Presentation Transcript


1. Data Mining – Algorithms: Linear Models
Chapter 4, Section 4.6

2. Numeric Attributes
• Numeric prediction and/or numeric attributes as predictors
• Linear regression is a well-established statistical technique
• Designed to predict a numeric value based on numeric attributes
• Determines the optimal set of coefficients for a linear equation:
• pred = w0 + w1a1 + w2a2 + … + wnan
• Optimal means the sum of squared prediction errors is minimized
• For data mining, the fit is done on training data so that it can be tested on separate test data
• I hope that a CSC major could read a statistics book and then write the code to do this
• However, there is no need to, since implementations are so widely available, unless you are seeking to create an improved version (a minimal sketch follows below)
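
Not the book's code, just a minimal sketch of the least-squares fit using NumPy; the attribute matrix X and targets y are made-up example values:

```python
import numpy as np

# Made-up training data: rows are instances, columns are numeric attributes.
X = np.array([[3.0, 1.2],
              [4.5, 0.8],
              [2.1, 1.9],
              [5.0, 1.1]])
y = np.array([10.2, 12.9, 9.4, 14.1])   # numeric values to predict

# Prepend a column of 1s so the intercept w0 is fitted along with w1..wn.
X1 = np.hstack([np.ones((len(X), 1)), X])

# lstsq returns the weights that minimize the sum of squared errors.
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

pred = X1 @ w    # pred = w0 + w1*a1 + w2*a2 for every training instance
print(w, pred)
```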

3. Example
• Show the Basketball spreadsheet – Baskball sheet
• NOTE – input values, weights, prediction vs. actual
• Show the testReg sheet – test on separate instances
• NOTE – how it did – prediction vs. actual – difference, correlation (a sketch of this comparison follows below)
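
A hedged sketch of the kind of comparison the testReg sheet performs; the training and test instances here are invented, not the basketball data:

```python
import numpy as np

# Refit on the made-up training data from the previous sketch.
X = np.array([[3.0, 1.2], [4.5, 0.8], [2.1, 1.9], [5.0, 1.1]])
y = np.array([10.2, 12.9, 9.4, 14.1])
X1 = np.hstack([np.ones((len(X), 1)), X])
w, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Separate (invented) test instances, never seen during the fit.
X_test = np.array([[3.8, 1.0], [2.5, 1.6], [4.2, 1.3]])
y_test = np.array([11.5, 9.8, 13.0])

pred = np.hstack([np.ones((len(X_test), 1)), X_test]) @ w
print(pred - y_test)                    # difference per test instance
print(np.corrcoef(pred, y_test)[0, 1])  # correlation of prediction vs. actual
```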

4. Using Regression for Classification
• Perform a regression for each class
• Set the output to be predicted = 1 for training instances that belong to the class
• Set the output to be predicted = 0 for training instances that do NOT belong to the class
• Do this for each class, and you will have a "membership function" equation for each class
• On test, plug the new instance into each equation; the highest value produced is the prediction to make (see the sketch below)
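
A minimal sketch of this one-equation-per-class scheme; the attribute data and class names are made up, and fit_membership and predict are hypothetical helper names:

```python
import numpy as np

def fit_membership(X, labels, cls):
    """Fit one regression whose target is 1 for cls, 0 for every other class."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    y = (labels == cls).astype(float)
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict(X, weights_by_class):
    """Plug each instance into every membership equation and take the highest."""
    X1 = np.hstack([np.ones((len(X), 1)), X])
    classes = list(weights_by_class)
    scores = np.column_stack([X1 @ weights_by_class[c] for c in classes])
    return [classes[i] for i in scores.argmax(axis=1)]

# Made-up attributes and class labels.
X = np.array([[3.0, 1.2], [4.5, 0.8], [2.1, 1.9], [5.0, 1.1]])
labels = np.array(["high", "med", "low", "high"])

ws = {c: fit_membership(X, labels, c) for c in np.unique(labels)}
print(predict(X, ws))
```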

5. Example
• Show the discretized sheet
• NOTE – prep of the data – into low, medium, high (a binning sketch follows below)
• NOTE – weights for the 3 regressions: high, medium, low
• Show the Test sheet
• NOTE – calculations for High, Medium, Low
• (It doesn't do that well; I suspect the data may not all be from the same source (NBA), and that the discretization was a bit of a problem – very few lows)
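
A hedged sketch of the discretization step, assuming arbitrary cut points rather than the ones actually used on the sheet:

```python
import numpy as np

# Made-up points-per-minute values and assumed cut points.
ppm = np.array([0.31, 0.45, 0.12, 0.52, 0.38])
cuts = [0.30, 0.45]                     # below 0.30 -> low, 0.45 and up -> high
names = np.array(["low", "med", "high"])

labels = names[np.digitize(ppm, cuts)]  # bin index 0, 1, or 2 per value
print(labels)                           # ['med' 'high' 'low' 'high' 'med']
```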

6. More Sophisticated
• Do as many pairwise competitions as necessary
• Training – two classes against each other:
• temporarily toss training instances that belong to neither of the two classes
• Set output = 1 for the class to be predicted and –1 for the other
• Test – do all pairwise competitions; the winner of each gets a vote (see the sketch below)
• E.g., say –
• Medium beats High
• Medium beats Low
• High beats Low
• Medium wins with 2 votes, to High's 1 and Low's 0
• A conservative approach would be to predict nothing if no prediction dominates
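
A minimal sketch of pairwise voting on the same made-up data; fit_pair and predict_pairwise are hypothetical names:

```python
from itertools import combinations
from collections import Counter

import numpy as np

def fit_pair(X, labels, a, b):
    """Train on instances of classes a and b only: target +1 for a, -1 for b."""
    keep = (labels == a) | (labels == b)          # temporarily toss the rest
    X1 = np.hstack([np.ones((keep.sum(), 1)), X[keep]])
    y = np.where(labels[keep] == a, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict_pairwise(x, X, labels):
    """Run every pairwise competition; the winner of each gets a vote."""
    votes = Counter()
    for a, b in combinations(np.unique(labels), 2):
        w = fit_pair(X, labels, a, b)
        votes[a if w[0] + w[1:] @ x > 0 else b] += 1
    # A conservative variant would return None when no class dominates.
    return votes.most_common(1)[0][0]

# Made-up data with three classes.
X = np.array([[3.0, 1.2], [4.5, 0.8], [2.1, 1.9], [5.0, 1.1], [3.7, 1.4]])
labels = np.array(["high", "med", "low", "high", "med"])
print(predict_pairwise(np.array([3.2, 1.3]), X, labels))
```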

7. In Context
• Has been used for decades in various applications (e.g., social science research)
• Bias – only searches for linear equations – no squares, cubes, etc.
• To work well, the data must fit a linear model – e.g., for classification, the classes must be "linearly separable" – divisible by a line in 2D, a plane in 3D, a hyperplane in higher dimensions
• To work well, attributes should not be highly correlated with each other
• Depends on numeric attributes

8. Let's Look at WEKA
• Linear Regression with the Basketball data
• No correctness measures
• Correlations
• Error
• Discretize points per minute
• Try logistic regression – a categorical prediction approach

  9. End Section 4.6
