
Data Mining: Concepts and Techniques — Chapter 2 —






Presentation Transcript


  1. Data Mining: Concepts and Techniques — Chapter 2 —
  • Assignment 1 (programming) is to be submitted on 10 April 2010; work in groups of two.

  2. Chapter 2: Data Preprocessing
  • General data characteristics
  • Basic data description and exploration
  • Measuring data similarity
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Summary

  3. Types of Attribute Values
  • Nominal: e.g., profession, ID numbers, eye color, zip codes
  • Ordinal: e.g., rankings (e.g., army, professions), grades, height in {tall, medium, short}
  • Binary: e.g., medical test (positive vs. negative)
  • Interval: e.g., calendar dates, body temperatures
  • Ratio: e.g., temperature in Kelvin, length, time, counts

  4. Discrete vs. Continuous Attributes
  • Discrete attribute
    • Has only a finite or countably infinite set of values
    • E.g., zip codes, profession, or the set of words in a collection of documents
    • Sometimes represented as integer variables
    • Note: binary attributes are a special case of discrete attributes
  • Continuous attribute
    • Has real numbers as attribute values, e.g., temperature, height, or weight
    • In practice, real values can only be measured and represented using a finite number of digits
    • Typically represented as floating-point variables

  5. Chapter 2: Data Preprocessing
  • General data characteristics
  • Basic data description and exploration
  • Measuring data similarity
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Summary

  6. Mining Data Descriptive Characteristics
  • Motivation
    • To understand the data: dispersion, central tendency, and variation
  • Data dispersion characteristics
    • Median, max, min, quartiles, outliers, variance
  • Numerical dimensions correspond to sorted intervals
    • Boxplot or quantile analysis on the sorted intervals

  7. Measuring the Central Tendency
  • Mean (sample vs. population); weighted arithmetic mean; trimmed mean (chopping extreme values)
  • Median: a holistic measure
    • Middle value if there is an odd number of values; average of the middle two values otherwise
    • Estimated by interpolation for grouped data
  • Mode
    • Value that occurs most frequently in the data
    • Unimodal, bimodal, trimodal
    • Empirical formula relating mean, median, and mode
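A minimal Python sketch (not part of the original slides) of the measures listed above; the salary values and weights below are illustrative assumptions.

```python
# Illustrative only: central-tendency measures on a made-up salary sample.
import numpy as np
from collections import Counter
from scipy import stats

x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110], dtype=float)
w = np.linspace(1.0, 2.0, len(x))                 # hypothetical weights

mean          = x.mean()                          # arithmetic mean
weighted_mean = np.average(x, weights=w)          # weighted arithmetic mean
trimmed_mean  = stats.trim_mean(x, 0.1)           # chop 10% of values at each end
median        = np.median(x)                      # middle value (or mean of the middle two)
mode          = Counter(x.tolist()).most_common(1)[0][0]  # most frequent value

print(mean, weighted_mean, trimmed_mean, median, mode)
```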

  8. Symmetric vs. Skewed Data
  • Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
  [Figure: three distribution curves labeled symmetric, positively skewed, and negatively skewed]

  9. Example: Employee wages at PT. Satria Semarang (frequency table shown on the slide)
  • Total frequency n = 82, so the median position is 82 : 2 = 41
  • Median class: 260 – 279

  10. Median formula for grouped data: one version uses the lower class boundary, the alternative uses the upper class boundary (the standard lower-boundary form is reproduced below).
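The formulas themselves appeared only as images on the slide. A standard textbook form of the grouped-data median using the lower class boundary L of the median class (the upper-boundary variant is analogous) is:

```latex
% n = total frequency, F = cumulative frequency below the median class,
% f = frequency of the median class, c = class width.
\[
\mathrm{Me} \;=\; L \;+\; \left( \frac{\tfrac{n}{2} - F}{f} \right) c
\]
```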

  11. Measuring the Dispersion of Data
  • Quartiles, outliers, and boxplots
    • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
    • Inter-quartile range: IQR = Q3 − Q1
    • Five-number summary: min, Q1, median, Q3, max
    • Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend to the extremes, and outliers are plotted individually
    • Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
  • Variance and standard deviation (sample: s, population: σ)
    • Variance: algebraic, scalable to compute
    • Standard deviation s (or σ) is the square root of the variance s² (or σ²)
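A minimal Python sketch (not from the slides) of these dispersion measures, including the 1.5 × IQR outlier rule; the sample values are made up.

```python
# Illustrative only: quartiles, IQR, five-number summary, variance, and outliers.
import numpy as np

x = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 90.0])  # hypothetical data

q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_number = (x.min(), q1, med, q3, x.max())

sample_var = x.var(ddof=1)     # sample variance s^2 (n - 1 in the denominator)
sample_std = x.std(ddof=1)     # sample standard deviation s

outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(five_number, iqr, sample_var, sample_std, outliers)
```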

  12. Properties of the Normal Distribution Curve
  • From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
  • From μ−2σ to μ+2σ: contains about 95% of the measurements
  • From μ−3σ to μ+3σ: contains about 99.7% of the measurements
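A quick, illustrative check of the 68–95–99.7 rule using SciPy's standard normal CDF.

```python
# P(mu - k*sigma < X < mu + k*sigma) for a normal distribution, k = 1, 2, 3.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {coverage:.4f}")
# -> approximately 0.6827, 0.9545, 0.9973
```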

  13. Graphic Displays of Basic Statistical Descriptions
  • Boxplot: graphic display of the five-number summary
  • Histogram: the x-axis shows values, the y-axis represents frequencies
  • Scatter plot: each pair of values is a pair of coordinates plotted as a point in the plane
  • Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
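An illustrative Matplotlib sketch of these four displays on synthetic data (the original slides showed ready-made figures); the lowess smoother from statsmodels stands in for the loess curve.

```python
# Illustrative only: boxplot, histogram, scatter plot, and a loess-style curve.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 0.5 * x + rng.normal(0, 5, 200)

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].boxplot(x);          ax[0, 0].set_title("Boxplot")
ax[0, 1].hist(x, bins=20);    ax[0, 1].set_title("Histogram")
ax[1, 0].scatter(x, y, s=10); ax[1, 0].set_title("Scatter plot")

smoothed = lowess(y, x, frac=0.3)          # columns: sorted x, smoothed y
ax[1, 1].scatter(x, y, s=10)
ax[1, 1].plot(smoothed[:, 0], smoothed[:, 1], color="red")
ax[1, 1].set_title("Scatter plot with loess curve")
plt.tight_layout(); plt.show()
```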

  14. Histogram Analysis
  • Graphical display of basic statistical class descriptions
  • Frequency histograms
    • A univariate graphical method
    • Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

  15. Histograms Often Tell More than Boxplots
  • The two histograms shown on the left may have the same boxplot representation
    • The same values for min, Q1, median, Q3, and max
  • But they have rather different data distributions

  16. Scatter Plot
  • Provides a first look at bivariate data to see clusters of points, outliers, etc.
  • Each pair of values is treated as a pair of coordinates and plotted as a point in the plane

  17. Loess Curve
  • Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
  • A loess curve is fitted by setting two parameters: a smoothing parameter and the degree of the polynomials fitted by the regression

  18. Positively and Negatively Correlated Data
  • The left half fragment is positively correlated
  • The right half is negatively correlated

  19. Not Correlated Data

  20. Scatterplot Matrices
  • Matrix of scatterplots (x–y diagrams) of the k-dimensional data [a total of C(k, 2) = (k² − k)/2 scatterplots]
  • Used by permission of M. Ward, Worcester Polytechnic Institute

  21. Chapter 2: Data Preprocessing
  • General data characteristics
  • Basic data description and exploration
  • Measuring data similarity (Sec. 7.2)
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Summary

  22. Similarity and Dissimilarity
  • Similarity
    • Numerical measure of how alike two data objects are
    • Value is higher when objects are more alike
    • Often falls in the range [0, 1]
  • Dissimilarity (i.e., distance)
    • Numerical measure of how different two data objects are
    • Lower when objects are more alike
    • Minimum dissimilarity is often 0
    • Upper limit varies
  • Proximity refers to either a similarity or a dissimilarity

  23. Data Matrix and Dissimilarity Matrix
  • Data matrix
    • n data points with p dimensions
    • Two modes
  • Dissimilarity matrix
    • n data points, but registers only the distance between each pair
    • A triangular matrix
    • Single mode

  24. Example: Data Matrix and Distance Matrix
  • Data matrix (shown on the slide as a figure)
  • Distance matrix (i.e., dissimilarity matrix) for Euclidean distance (shown on the slide as a figure)
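An illustrative sketch of building a Euclidean dissimilarity matrix with SciPy; the four two-dimensional points below are made up, since the slide's own example appeared only as a figure.

```python
# Illustrative only: data matrix -> symmetric Euclidean distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1.0, 2.0],    # hypothetical data matrix: n = 4 points, p = 2 dimensions
              [3.0, 5.0],
              [2.0, 0.0],
              [4.0, 5.0]])

D = squareform(pdist(X, metric="euclidean"))  # n x n dissimilarity matrix
print(np.round(D, 2))
```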

  25. Minkowski Distance
  • Minkowski distance: a popular distance measure (formula reproduced below), where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects and q is the order
  • Properties
    • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
    • d(i, j) = d(j, i) (symmetry)
    • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
  • A distance that satisfies these properties is a metric
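The distance formula itself was shown as an image; in the notation defined above, the standard Minkowski distance is:

```latex
\[
d(i, j) \;=\; \left( \sum_{k=1}^{p} \lvert x_{ik} - x_{jk} \rvert^{\,q} \right)^{1/q}
\]
```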

  26. Special Cases of Minkowski Distance
  • q = 1: Manhattan (city block, L1 norm) distance
    • E.g., the Hamming distance: the number of bits that are different between two binary vectors
  • q = 2: Euclidean (L2 norm) distance
  • q → ∞: "supremum" (Lmax norm, L∞ norm) distance
    • This is the maximum difference between any component of the vectors
  • Do not confuse q with n; all these distances are defined for any number of dimensions
  • One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
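An illustrative sketch of the special cases on two made-up vectors, using the corresponding SciPy distance functions (cityblock = L1, euclidean = L2, chebyshev = supremum).

```python
# Illustrative only: the Minkowski special cases on two small vectors.
import numpy as np
from scipy.spatial.distance import cityblock, euclidean, chebyshev, minkowski

x = np.array([1.0, 7.0, 2.0])
y = np.array([4.0, 3.0, 2.0])

print(cityblock(x, y))       # L1   (Manhattan): |1-4| + |7-3| + |2-2| = 7
print(euclidean(x, y))       # L2   (Euclidean): sqrt(9 + 16 + 0) = 5
print(chebyshev(x, y))       # Lmax (supremum):  max(3, 4, 0) = 4
print(minkowski(x, y, p=3))  # general Minkowski distance with order q = 3
```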

  27. Example: Minkowski Distance (the example points and their distance matrices were shown on the slide as a figure)

  28. Interval-Valued Variables
  • Standardize the data
    • Calculate the mean absolute deviation (formula reproduced below)
    • Calculate the standardized measurement (z-score)
    • Using the mean absolute deviation is more robust than using the standard deviation
  • Then compute the Euclidean distance or another Minkowski distance
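The two formulas referenced above were shown as images; their standard textbook forms are:

```latex
% Mean m_f and mean absolute deviation s_f of variable f over n objects:
\[
m_f = \frac{1}{n}\left( x_{1f} + x_{2f} + \cdots + x_{nf} \right),
\qquad
s_f = \frac{1}{n}\left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)
\]
% Standardized measurement (z-score):
\[
z_{if} = \frac{x_{if} - m_f}{s_f}
\]
```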

  29. Binary Variables
  • A contingency table for binary data (objects i and j; shown on the slide as a table)
  • Distance measure for symmetric binary variables
  • Distance measure for asymmetric binary variables
  • Jaccard coefficient (similarity measure for asymmetric binary variables)
  • Note: the Jaccard coefficient is the same as "coherence"
  (The formulas are reproduced below.)
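The measures above appeared as images; their standard forms use the usual contingency-table counts (q: variables equal to 1 for both i and j; r: 1 for i, 0 for j; s: 0 for i, 1 for j; t: 0 for both):

```latex
\[
d_{\text{symmetric}}(i, j) = \frac{r + s}{q + r + s + t},
\qquad
d_{\text{asymmetric}}(i, j) = \frac{r + s}{q + r + s},
\qquad
\mathrm{sim}_{\text{Jaccard}}(i, j) = \frac{q}{q + r + s}
\]
```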

  30. Dissimilarity between Binary Variables
  • Example (data table shown on the slide as a figure)
    • Gender is a symmetric attribute
    • The remaining attributes are asymmetric binary
    • Let the values Y and P be set to 1, and the value N be set to 0

  31. Nominal Variables
  • A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green
  • Method 1: simple matching (formula reproduced below)
    • m: number of matches, p: total number of variables
  • Method 2: use a large number of binary variables
    • Create a new binary variable for each of the M nominal states
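The simple-matching dissimilarity was shown as an image; its standard form, with m and p as defined above, is:

```latex
\[
d(i, j) = \frac{p - m}{p}
\]
```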

  32. Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled variables
    • Replace xif by its rank rif
    • Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by its normalized rank (formula reproduced below)
    • Compute the dissimilarity using methods for interval-scaled variables
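The rank-normalization formula was shown as an image; its standard form, with rif the rank of xif and Mf the number of ordered states of variable f, is:

```latex
\[
z_{if} = \frac{r_{if} - 1}{M_f - 1},
\qquad r_{if} \in \{1, \dots, M_f\}
\]
```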

  33. Ratio-Scaled Variables
  • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at an exponential scale, such as Ae^(Bt) or Ae^(−Bt)
  • Methods
    • Treat them like interval-scaled variables (not a good choice: the scale can be distorted)
    • Apply a logarithmic transformation: yif = log(xif)
    • Treat them as continuous ordinal data and treat their ranks as interval-scaled

  34. Variables of Mixed Types
  • A database may contain all six types of variables
    • Symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio
  • One may use a weighted formula to combine their effects (a standard form is sketched below)
    • f is binary or nominal: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise
    • f is interval-based: use the normalized distance
    • f is ordinal or ratio-scaled: compute the ranks rif and zif, then treat zif as interval-scaled
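The weighted combination formula was shown as an image; a standard form, where the indicator excludes variables on which the pair cannot be compared (e.g., a missing value, or a 0–0 match on an asymmetric binary variable), is:

```latex
\[
d(i, j) \;=\; \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}
                   {\sum_{f=1}^{p} \delta_{ij}^{(f)}},
\qquad \delta_{ij}^{(f)} \in \{0, 1\}
\]
```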

  35. Vector Objects: Cosine Similarity
  • Vector objects: keywords in documents, gene features in micro-arrays, …
  • Applications: information retrieval, biological taxonomy, ...
  • Cosine measure: if d1 and d2 are two vectors, then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d
  • Example:
    d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
    d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
    ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 ≈ 6.481
    ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 ≈ 2.449
    cos(d1, d2) ≈ 5 / (6.481 × 2.449) ≈ 0.3150
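A quick NumPy check of the worked example above.

```python
# Verifies the slide's cosine-similarity example.
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))   # -> 0.315
```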

  36. Chapter 2: Data Preprocessing
  • General data characteristics
  • Basic data description and exploration
  • Measuring data similarity
  • Data cleaning
  • Data integration and transformation
  • Data reduction
  • Summary

  37. Major Tasks in Data Preprocessing
  • Data cleaning
    • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data integration
    • Integrate multiple databases, data cubes, or files
  • Data transformation
    • Normalization and aggregation
  • Data reduction
    • Obtain a reduced representation of the data volume that produces the same or similar analytical results
  • Data discretization: part of data reduction, particularly important for numerical data

  38. Data Cleaning
  • No quality data, no quality mining results!
    • Quality decisions must be based on quality data
    • E.g., duplicate or missing data may cause incorrect or even misleading results
  • Data extraction, cleaning, and transformation make up the bulk of the work of building a data warehouse
  • Data cleaning tasks
    • Fill in missing values
    • Identify outliers and smooth out noisy data
    • Correct inconsistent data
    • Resolve redundancy caused by data integration

  39. Data in the Real World Is Dirty
  • Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    • E.g., children = " " (missing data)
  • Noisy: containing noise, errors, or outliers
    • E.g., Salary = "−10" (an error)
  • Inconsistent: containing discrepancies in codes or names, e.g.,
    • Age = "42", Birthday = "03/07/1997"
    • Was rating "1, 2, 3", now rating "A, B, C"
    • Discrepancy between duplicate records

  40. Why Is Data Dirty?
  • Incomplete data may come from
    • Different considerations between the time when the data was collected and when it is analyzed
    • Human, hardware, or software problems
  • Noisy data (incorrect values) may come from
    • Faulty data collection instruments
    • Human or computer error at data entry
    • Errors in data transmission
  • Inconsistent data may come from
    • Different data sources
  • Duplicate records also need data cleaning

  41. Missing Data
  • Data is not always available
    • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
  • Missing data may be due to
    • Equipment malfunction
    • Inconsistency with other recorded data, leading to deletion
    • Data not entered due to misunderstanding
    • Certain data not being considered important at the time of entry
    • Failure to register the history or changes of the data
  • Missing data may need to be inferred

  42. How to Handle Missing Values?
  • Ignore the record: usually done when the class label is missing (not effective when the percentage of missing values per attribute varies considerably)
  • Fill in the missing value manually
  • Fill in automatically with
    • A global constant, e.g., "unknown" (which may form a new class!)
    • The attribute mean
    • The attribute mean for all samples belonging to the same class (smarter)
    • The most probable value, e.g., inferred with a Bayesian method
  (A pandas sketch of the automatic strategies follows below.)
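A minimal pandas sketch (column names and values are made up) of the automatic fill-in strategies listed above.

```python
# Illustrative only: three simple fill-in strategies for missing values.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50.0, None, 70.0, None, 90.0],
})

# Global constant (e.g., a sentinel standing in for "unknown")
df["income_const"] = df["income"].fillna(-1)
# Attribute mean
df["income_mean"] = df["income"].fillna(df["income"].mean())
# Attribute mean over samples of the same class
df["income_class_mean"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
print(df)
```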

  43. Noisy Data
  • Noise: random error or variance in a measured variable
  • Incorrect attribute values may be due to
    • Faulty data collection instruments
    • Data entry problems
    • Data transmission problems
    • Technology limitations
  • Other data problems that require data cleaning
    • Duplicate records
    • Incomplete data
    • Inconsistent data

  44. How to Handle Noisy Data?
  • Binning
    • First sort the data and partition it into (equal-frequency) bins
    • Then smooth by bin means, bin medians, bin boundaries, etc.
  • Regression
    • Smooth by fitting the data to regression functions
  • Clustering
    • Detect and remove outliers
  • Combined computer and human inspection
    • Detect suspicious values and check them by hand (e.g., deal with possible outliers)

  45. Simple Discretization Methods: Binning
  • Equal-width (distance) partitioning
    • Divides the range into N intervals of equal size: a uniform grid
    • If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A) / N
    • The most straightforward approach, but outliers may dominate the presentation
    • Skewed data is not handled well
  • Equal-depth (frequency) partitioning
    • Divides the range into N intervals, each containing approximately the same number of samples
    • Good data scaling
    • Managing categorical attributes can be tricky

  46. Binning Methods for Data Smoothing
  • Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
  • Partition into equal-frequency (equi-depth) bins:
    Bin 1: 4, 8, 9, 15
    Bin 2: 21, 21, 24, 25
    Bin 3: 26, 28, 29, 34
  • Smoothing by bin means:
    Bin 1: 9, 9, 9, 9
    Bin 2: 23, 23, 23, 23
    Bin 3: 29, 29, 29, 29
  • Smoothing by bin boundaries:
    Bin 1: 4, 4, 4, 15
    Bin 2: 21, 21, 25, 25
    Bin 3: 26, 26, 26, 34
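An illustrative NumPy sketch that reproduces the slide's equal-frequency binning, bin-mean smoothing, and bin-boundary smoothing.

```python
# Illustrative only: equal-frequency binning and smoothing of the price data.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = np.array_split(np.sort(prices), 3)        # three equi-depth bins of 4 values

# Replace every value by its (rounded) bin mean, as on the slide.
smoothed_by_means = [np.full(len(b), round(b.mean())) for b in bins]
# Replace every value by the nearer of the two bin boundaries.
smoothed_by_bounds = [np.where(b - b.min() <= b.max() - b, b.min(), b.max())
                      for b in bins]

print([b.tolist() for b in bins])
print([s.tolist() for s in smoothed_by_means])
print([s.tolist() for s in smoothed_by_bounds])
```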

  47. Regression
  [Figure: data points smoothed by fitting a regression line y = x + 1; a point (X1, Y1) is mapped to (X1, Y1′) on the line]

  48. Cluster Analysis
