Data Mining: Concepts and Techniques — Chapter 2 — Dr. Maher Abuhamdeh Statistical

Learn about descriptive data characteristics, including measures of central tendency, variation, and spread. Explore numerical dimensions, dispersion analysis, boxplot analysis, and more.

mgerald

Presentation Transcript


  1. Data Mining:Concepts and Techniques— Chapter 2 —Dr. Maher Abuhamdeh Statistical Data Mining: Concepts and Techniques

  2. Mining Data Descriptive Characteristics
     • Motivation
       • To better understand the data: central tendency, variation and spread
     • Data dispersion characteristics
       • median, max, min, quantiles, outliers, variance, etc.
     • Numerical dimensions correspond to sorted intervals
       • Data dispersion: analyzed with multiple granularities of precision
       • Boxplot or quantile analysis on sorted intervals
     • Dispersion analysis on computed measures
       • Folding measures into numerical dimensions
       • Boxplot or quantile analysis on the transformed cube

  3. Mean
     • Consider a sample of 6 values: 34, 43, 81, 106, 106 and 115
     • To compute the mean, add the values and divide by 6: (34 + 43 + 81 + 106 + 106 + 115) / 6 = 485 / 6 ≈ 80.83
     • The population mean is the average of the entire population and is usually hard to compute. We use the Greek letter μ for the population mean.

  4. Mode
     • The mode of a set of data is the value with the highest frequency.
     • In the sample above (34, 43, 81, 106, 106, 115), 106 is the mode, since it occurs twice and the other values occur only once.
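
As a minimal sketch, the mean and mode of the six-value sample can be checked with Python's standard `statistics` module. The variable names are illustrative and are not part of the slides.

```python
import statistics

# Sample of six values from the Mean and Mode slides.
sample = [34, 43, 81, 106, 106, 115]

# Arithmetic mean: sum of the values divided by how many there are.
sample_mean = statistics.mean(sample)

# Mode: the most frequent value. multimode() also reports ties, if any.
sample_mode = statistics.mode(sample)
all_modes = statistics.multimode(sample)

print(f"mean = {sample_mean:.2f}")  # about 80.83
print(f"mode = {sample_mode}")      # 106
```

`multimode` is worth preferring when a data set may have several equally frequent values, because `mode` returns only one of them.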

  5. Median
     • A problem with the mean is that a single outcome very far from the rest of the data can distort it.
     • The median is the middle value of the sorted data. With an even number of values, we take the average of the two middle values.
     • Assume a sample of 10 house prices. In units of $100,000, the prices are: 2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8
     • mean = 74.4 / 10 = 7.44, that is $744,000. It does not reflect prices in the area.
     • The value 40.8 × $100,000 = $4.08 million skews the data; it is an outlier.
     • median = (3.7 + 4.1) / 2 = 3.9, that is $390,000.
     • This is a better representative of the data.
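
The effect of the single extreme house price can be illustrated with a short sketch using Python's standard `statistics` module. The variable names, and the extra step of recomputing the mean without the extreme value, are illustrative additions.

```python
import statistics

# House prices in units of $100,000, from the Median slide.
prices = [2.7, 2.9, 3.1, 3.4, 3.7, 4.1, 4.3, 4.7, 4.7, 40.8]

mean_price = statistics.mean(prices)
# With 10 values, the median is the average of the 5th and 6th sorted values.
median_price = statistics.median(prices)

# Dropping the single extreme value shows how strongly it pulls the mean.
mean_without_extreme = statistics.mean(prices[:-1])

print(f"mean              = {mean_price:.2f} (x $100,000)")
print(f"median            = {median_price:.2f} (x $100,000)")
print(f"mean without 40.8 = {mean_without_extreme:.2f} (x $100,000)")
```

The median barely moves when the extreme value is removed, while the mean roughly halves, which is why the median is the more representative figure here.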

  6. Variance and Standard Deviation
     • Variance of a sample: \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \)
     • Standard deviation of a sample: \( s = \sqrt{s^2} \)

  7. Example
     • Data: 44, 50, 38, 96, 42, 47, 40, 39, 46, 50
     • Calculate the mean: x̄ = 492 / 10 = 49.2
     • Write a table that subtracts the mean from each observed value.
     • Square each of the differences.
     • Add this column.
     • Divide by n − 1, where n is the number of items in the sample. This is the variance.
     • To get the standard deviation, take the square root of the variance.

  8. Example Cont.
     • Variance = 2599.6 / (10 − 1) ≈ 288.8
     • Standard deviation = √288.8 ≈ 17 = s
     • This means that most of the numbers probably fall within one standard deviation of the mean, between $32.20 and $66.20 (49.2 ± 17).
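
The steps of this example can be followed in code. The sketch below spells out each step by hand and then cross-checks the result against `statistics.variance` and `statistics.stdev`, which also divide by n − 1. The variable names are illustrative.

```python
import math
import statistics

# Ten observed values from the example.
data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]
n = len(data)

# Step 1: the mean.
mean = sum(data) / n

# Steps 2-4: subtract the mean, square each difference, add the column.
squared_diffs = [(x - mean) ** 2 for x in data]
sum_squared_diffs = sum(squared_diffs)

# Step 5: divide by n - 1 to get the sample variance.
variance = sum_squared_diffs / (n - 1)

# Step 6: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# Cross-check against the standard library (both use the n - 1 divisor).
assert math.isclose(variance, statistics.variance(data))
assert math.isclose(std_dev, statistics.stdev(data))

print(f"mean = {mean:.1f}, variance = {variance:.1f}, std dev = {std_dev:.1f}")
print(f"one-sigma range: {mean - std_dev:.1f} to {mean + std_dev:.1f}")
```

Small differences from hand-computed figures are expected when intermediate values are rounded on paper.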

  9. Properties of Normal Distribution Curve
     • The normal (distribution) curve
       • From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
       • From μ−2σ to μ+2σ: contains about 95% of the measurements
       • From μ−3σ to μ+3σ: contains about 99.7% of the measurements
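
These three percentages can be reproduced from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.1%}")
```

Because the result depends only on k, the same fractions hold for any mean μ and standard deviation σ.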

  10. Symmetric vs. Skewed Data
      • Median, mean and mode of symmetric, positively skewed and negatively skewed data
      • [Figure: three distribution curves labeled symmetric, negatively skewed and positively skewed]

  11. Measuring the Dispersion of Data
      • Quartiles, outliers and boxplots
        • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
        • Inter-quartile range: IQR = Q3 − Q1
        • Five-number summary: min, Q1, M (median), Q3, max
        • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
      • Variance and standard deviation (sample: s, population: σ)
        • Variance (algebraic, scalable computation): \( s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right] \)
        • Standard deviation s (or σ) is the square root of the variance s² (or σ²)

  12. Boxplot Analysis
      • Five-number summary of a distribution: Minimum, Q1, M, Q3, Maximum
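
A five-number summary and the common 1.5 × IQR outlier rule can be sketched with the standard library. The data are the ten values from the variance example; the variable names are illustrative. Quartile conventions differ between textbooks and tools, so the exact Q1 and Q3 values depend on the method used (here the default of `statistics.quantiles`).

```python
import statistics

data = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

# With n=4, quantiles() returns the three cut points Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# The five numbers a boxplot is drawn from.
five_number_summary = (min(data), q1, median, q3, max(data))

# Fences: values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("outliers:", outliers)
```

In a boxplot, the box spans Q1 to Q3, the line inside marks the median, and points beyond the fences are usually drawn individually.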

  13. Relation between Mean and Standard Deviation
      • The heights of the students (in cm): 200, 147, 173, 185, 160
      • The mean equals 865 / 5 = 173

  14. Relation between Mean and Standard Deviation

  15. (No text on this slide.)

  16. Calculate the difference between each height and the mean

  17. • Calculate the variance, which equals 1718 / 5 = 343.60 (dividing by n = 5, i.e. the population variance)
      • Calculate the standard deviation, which equals √343.60 ≈ 18.54

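  18. The first student (200 cm) is unusually tall: more than one standard deviation above the mean
      • The second student (147 cm) is unusually short: more than one standard deviation below the mean
      • The others are considered normal heights: within one standard deviation of the mean
      • If the values lie close to the mean, the standard deviation is small and the data are more homogeneous
      • If the values lie far from the mean, the standard deviation is large and the data are less homogeneous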
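
The student-height example can be reproduced with a short sketch. It assumes, as the figures in the slides imply, that the variance is computed by dividing by n (population variance), and it reads "unusual" as more than one standard deviation from the mean. The variable names are illustrative.

```python
import statistics

# Student heights in cm.
heights = [200, 147, 173, 185, 160]

mean_height = statistics.mean(heights)
# pvariance/pstdev divide by n, matching the 343.60 figure in the slides.
variance = statistics.pvariance(heights)
std_dev = statistics.pstdev(heights)

# Classify each height relative to the one-standard-deviation band.
unusually_tall = [h for h in heights if h > mean_height + std_dev]
unusually_short = [h for h in heights if h < mean_height - std_dev]
normal = [h for h in heights if abs(h - mean_height) <= std_dev]

print(f"mean = {mean_height}, variance = {variance:.2f}, std dev = {std_dev:.2f}")
print("unusually tall:", unusually_tall)
print("unusually short:", unusually_short)
print("normal:", normal)
```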

  19. How to Handle Noisy Data?
      • Binning
        • first sort the data and partition it into (equal-frequency) bins
        • then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

  20. Simple Discretization Methods: Binning
      • Equal-width (distance) partitioning
        • Divides the range into N intervals of equal size: uniform grid
        • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A) / N
        • The most straightforward, but outliers may dominate the presentation
        • Skewed data is not handled well
      • Equal-depth (frequency) partitioning
        • Divides the range into N intervals, each containing approximately the same number of samples
        • Good data scaling
        • Managing categorical attributes can be tricky
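
The two partitioning schemes can be compared with a minimal sketch. The function names `equal_width_bins` and `equal_depth_bins` are hypothetical, and the data are the sorted prices used in the binning example of this deck. Placing the maximum value in the last interval is an assumption about how the upper edge is handled.

```python
def equal_width_bins(values, n_bins):
    """Split the value range [A, B] into n_bins intervals of width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # Assumption: the maximum value is placed in the last interval.
        index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

print("equal-width:", equal_width_bins(prices, 3))
print("equal-depth:", equal_depth_bins(prices, 3))
```

With W = (34 − 4) / 3 = 10, the equal-width bins hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each, which shows how uneven equal-width bins become when the data are not spread uniformly.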

  21. Binning Methods for Data Smoothing
      • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
      • Partition into equal-frequency (equi-depth) bins:
        • Bin 1: 4, 8, 9, 15
        • Bin 2: 21, 21, 24, 25
        • Bin 3: 26, 28, 29, 34
      • Smoothing by bin means (rounded to the nearest dollar):
        • Bin 1: 9, 9, 9, 9
        • Bin 2: 23, 23, 23, 23
        • Bin 3: 29, 29, 29, 29
      • Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
        • Bin 1: 4, 4, 4, 15
        • Bin 2: 21, 21, 25, 25
        • Bin 3: 26, 26, 26, 34
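
Both smoothing rules can be sketched in a few lines. The function names are hypothetical; rounding the bin mean to the nearest integer and sending ties to the lower boundary are assumptions chosen to match the figures in the example.

```python
def smooth_by_means(bin_values):
    """Replace every value in the bin with the (rounded) bin mean."""
    bin_mean = round(sum(bin_values) / len(bin_values))
    return [bin_mean] * len(bin_values)

def smooth_by_boundaries(bin_values):
    """Replace every value with the closer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Assumption: a value equally far from both boundaries goes to the lower one.
    return [low if v - low <= high - v else high for v in bin_values]

# Equal-frequency bins from the price example.
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

by_means = [smooth_by_means(b) for b in bins]
by_boundaries = [smooth_by_boundaries(b) for b in bins]

for i, (m, b) in enumerate(zip(by_means, by_boundaries), start=1):
    print(f"Bin {i}: means {m}, boundaries {b}")
```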

  22. How to Handle Noisy Data? 2. Regression
      • Smooth by fitting the data to regression functions
      • Regression is a technique that conforms data values to a function. Linear regression involves finding the "best" line to fit two attributes (or variables) so that one attribute can be used to predict the other.

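  23. Regression: Error of Prediction
      • To get the best-fitting line, we find the line that minimizes the sum of the squared errors of prediction
      • [Figure: the fitted line y = x + 1; for a data point at X1, the error of prediction is the vertical distance between the observed value Y1 and the value Y1' on the line]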
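
A minimal sketch of least-squares line fitting follows. It uses the standard closed-form solution for simple linear regression; the function names and the five data points (made-up values scattered around y = x + 1) are hypothetical and chosen only for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared prediction errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]

slope, intercept = fit_line(xs, ys)
smoothed = [slope * x + intercept for x in xs]  # values replaced by the fitted line

print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")
print(f"sum of squared errors: {sum_squared_errors(xs, ys, slope, intercept):.3f}")
```

Replacing each observed y with the value on the fitted line is what "smoothing by regression" amounts to for two attributes.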

  24. How to Handle Noisy Data? 3. Clustering
      • Outliers may be detected by clustering, where similar values are organized into groups, or "clusters." Intuitively, values that fall outside of the set of clusters may be considered outliers, which can then be removed.
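
To illustrate the idea only, the sketch below uses a deliberately simple one-dimensional grouping rule rather than a full clustering algorithm such as k-means: sorted values are placed in the same cluster while the gap to the previous value stays within a threshold, and clusters with a single member are treated as outliers. The function name `gap_clusters`, the threshold of 5 and the singleton rule are all assumptions made for this example.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; start a new cluster when the gap exceeds max_gap."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

values = [44, 50, 38, 96, 42, 47, 40, 39, 46, 50]

clusters = gap_clusters(values, max_gap=5)
# Assumption: a cluster with a single member is considered an outlier.
outliers = [c[0] for c in clusters if len(c) == 1]
cleaned = [v for v in values if v not in outliers]

print("clusters:", clusters)
print("outliers:", outliers)
print("cleaned data:", cleaned)
```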

  25. Cluster Analysis

  26. Normalization
      • Normalization: values are scaled to fall within a small, specified range
        • min-max normalization
        • z-score normalization
        • normalization by decimal scaling
      • Attribute/feature construction
        • New attributes are constructed from the given ones

  27. Data Transformation: Normalization
      • Normalization: the attribute data are scaled so as to fall within a small specified range, such as [−1.0, 1.0] or [0.0, 1.0]
      • We study three methods for normalization:
        • Min-max normalization
        • Z-score normalization
        • Decimal scaling

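  28. Data Transformation: Normalization
      • Min-max normalization to [new_min_A, new_max_A]: \( v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \)
        • Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 ≈ 0.716
      • Z-score normalization (μ: mean, σ: standard deviation): \( v' = \frac{v - \mu}{\sigma} \)
        • Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225
      • Normalization by decimal scaling: \( v' = \frac{v}{10^j} \), where j is the smallest integer such that max(|v'|) < 1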
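
The min-max and z-score formulas can be sketched as two small functions and checked against the income example. The function and parameter names are hypothetical, and the $120,000 input is a made-up value used only to show what happens outside the original range.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score_normalize(v, mean, std_dev):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std_dev

income = 73_600

scaled = min_max_normalize(income, min_a=12_000, max_a=98_000)
z = z_score_normalize(income, mean=54_000, std_dev=16_000)

# A value outside the original range falls outside [0, 1] after min-max scaling.
out_of_range = min_max_normalize(120_000, min_a=12_000, max_a=98_000)

print(f"min-max: {scaled:.3f}")
print(f"z-score: {z:.3f}")
print(f"min-max of 120,000: {out_of_range:.3f}")
```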

  29. Normalization by Decimal Scaling
      • Example: Suppose the values of A range from −986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we divide each value by 1,000 (j = 3), so that −986 normalizes to −0.986 and 917 normalizes to 0.917.
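
Finding j can be sketched as a small loop that keeps increasing the exponent until the largest absolute value drops below 1. The function name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Divide all values by 10**j, with j the smallest integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# The extreme values of attribute A from the example.
values = [-986, 917]

scaled, j = decimal_scale(values)

print(f"j = {j}")            # 3
print("scaled:", scaled)
```

Because the condition is a strict inequality, a maximum absolute value that is exactly a power of ten (for example 1,000) needs one more decimal shift than its digit count minus one would suggest.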

  30. Remarks on the Three Normalization Methods
      • Min-max normalization has a drawback: an out-of-bounds error occurs if a future input case for normalization falls outside the original data range.
      • Z-score normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.
