2. Data Preparation and Preprocessing

• Data and Its Forms
• Preparation
• Preprocessing and Data Reduction

Data Types and Forms
• Attribute-vector data:
  • Data types
    • numeric, categorical (see the hierarchy for its relationship)
    • static, dynamic (temporal)
• Other data forms
  • distributed data
  • text, Web, metadata
  • images, audio/video
• You have seen most of them after the invited talks.
CSE 575 Data Mining by H. Liu

Data Preparation
• An important and time-consuming task in KDD
• High-dimensional data (20, 100, 1000)
• Huge data sets
• Missing data
• Outliers
• Erroneous data (inconsistent, misrecorded, distorted)
• Raw data

Data Preparation Methods
• Data annotation, as in driving data analysis
• Data normalization
• Another example is image mining
• Dealing with sequential or temporal data
  • Transform it to tabular form
• Removing outliers
  • Different types

Normalization
• Decimal scaling
  • v'(i) = v(i) / 10^k for the smallest k such that max(|v'(i)|) < 1.
  • For the range between -991 and 99, k is 3 (divide by 10^3 = 1000): -991 → -0.991
• Min-max normalization into the new max/min range:
  • v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  • v = 73600 in [12000, 98000] → v' = 0.716 in [0, 1] (new range)
• Zero-mean normalization:
  • v' = (v - mean_A) / std_dev_A
  • For (1, 2, 3), the mean and std_dev are 2 and 1, giving (-1, 0, 1)
  • If mean_Income = 54000 and std_dev_Income = 16000, then v = 73600 → v' = 1.225
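The three normalization formulas can be checked with a short Python sketch. The function names are illustrative assumptions, and the sample standard deviation is assumed for the (1, 2, 3) example because it gives the value 1 quoted on the slide.

```python
import statistics

def decimal_scale(values):
    """Divide by 10^k for the smallest k such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    k = 0
    while max_abs / 10 ** k >= 1:
        k += 1
    return [v / 10 ** k for v in values], k

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] into [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Zero-mean normalization."""
    return (v - mean_a) / std_a

scaled, k = decimal_scale([-991, 99])
print(k, scaled)                                   # k is 3 for this range
print(round(min_max(73600, 12000, 98000), 3))      # income example, min-max
print(round(z_score(73600, 54000, 16000), 3))      # income example, zero-mean

data = [1, 2, 3]
mean, std = statistics.mean(data), statistics.stdev(data)  # sample std (assumption)
print([z_score(v, mean, std) for v in data])
```

Running it reproduces the slide's numbers: k = 3, 0.716 and 1.225.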

Temporal Data
• The goal is to forecast t(n+1) from previous values
  • X = {t(1), t(2), …, t(n)}
• An example with two features and window size 3
• How to determine the window size?
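Turning a series into tabular form with a sliding window can be sketched as follows. The slide's own example (two features, window size 3) is not reproduced in the transcript, so this sketch assumes a single made-up series and a hypothetical helper name.

```python
def make_windows(series, window):
    """Return (inputs, target) rows: `window` consecutive values, then the next one."""
    rows = []
    for i in range(len(series) - window):
        rows.append((series[i:i + window], series[i + window]))
    return rows

# Hypothetical series, for illustration only.
series = [10, 12, 11, 13, 15, 14]
for inputs, target in make_windows(series, 3):
    print(inputs, "->", target)
```

Each row is an ordinary attribute vector, so any tabular learner can be applied. A larger window gives each row more context but leaves fewer rows.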

Outlier Removal
• Data points inconsistent with the majority of data
• Different outliers
  • Valid: a CEO's salary
  • Noisy: one's age = 200, widely deviated points
• Removal methods
  • Clustering
  • Curve-fitting
  • Hypothesis-testing with a given model

Data Preprocessing
• Data cleaning
  • missing data
  • noisy data
  • inconsistent data
• Data reduction
  • Dimensionality reduction
  • Instance selection
  • Value discretization

Missing Data
• Many types of missing data
  • not measured
  • truly missed
  • wrongly placed, and ?
• Some methods
  • leave as is
  • ignore/remove the instance with the missing value
  • manual fix (assign a value for implicit meaning)
  • statistical methods (majority, most likely, mean, nearest neighbor, …)
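Two of the statistical fixes listed above, mean and majority, can be sketched in a few lines. The use of `None` as the missing-value marker and the function names are assumptions made for this example.

```python
from collections import Counter
from statistics import mean

def impute_mean(values):
    """Replace missing numeric values (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def impute_majority(values):
    """Replace missing categorical values (None) with the most frequent known value."""
    known = [v for v in values if v is not None]
    fill = Counter(known).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

print(impute_mean([20, None, 40, 30]))
print(impute_majority(["red", None, "blue", "red"]))
```

Both functions assume at least one known value. A nearest-neighbor fix would instead borrow the value from the most similar complete instance.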

Noisy Data
• Random error or variance in a measured variable
  • inconsistent values for features or classes (process)
  • measuring errors (source)
• Noise is normally a minority in the data set
  • Why?
• Removing noise
  • Clustering/merging
  • Smoothing (rounding, averaging within a window)
  • Outlier detection (deviation-based or distance-based)
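Smoothing by averaging within a window might look like the sketch below. The centered window and the shrinking of the window at the ends of the list are design assumptions, not something the slide specifies.

```python
def smooth(values, window=3):
    """Replace each value by the mean of the values in a centered window."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)                 # window is clipped at the edges
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

# Hypothetical readings with one spike at index 2.
print(smooth([1, 2, 30, 4, 5]))
```

The spike is damped but also leaks into its neighbors, which is one reason outlier detection is listed as a separate option.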

Inconsistent Data
• Inconsistent with our models or common sense
• Examples
  • The same name occurs differently in an application
  • Different names appear the same (Dennis vs. Denis)
  • Inappropriate values (Male-Pregnant, negative age)
  • One bank's database shows that 5% of its customers were born on 11/11/11
  • …

Dimensionality Reduction
• Feature selection
  • select m from n features, m ≤ n
  • remove irrelevant, redundant features
  • the saving in search space
• Feature transformation (PCA)
  • form new features (a) in a new domain from original features (f)
  • many uses, but it does not reduce the original dimensionality
  • often used in visualization of data

Feature Selection
• Problem illustration
  • Full set
  • Empty set
  • Enumeration
• Search
  • Exhaustive/Complete (Enumeration/BAA)
  • Heuristic (Sequential forward/backward)
  • Stochastic (generate/evaluate)
• Individual features or subsets generation/evaluation

Feature Selection (2)
• Goodness metrics
  • Dependency: depending on classes
  • Distance: separating classes
  • Information: entropy
  • Consistency: 1 - #inconsistencies/N
    • Example: (F1, F2, F3) and (F1, F3)
    • Both sets have a 2/6 inconsistency rate
  • Accuracy (classifier based): 1 - errorRate
• Their comparisons
  • Time complexity, number of features, removing redundancy
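The consistency metric can be made concrete with a small sketch. The slide's data table is not in the transcript, so the six instances below are hypothetical, chosen so that (F1, F2, F3) and (F1, F3) both give a 2/6 inconsistency rate. The counting rule is also an assumption: for each group of instances with identical feature values, every instance outside the majority class counts as one inconsistency.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N for the projection of rows onto `features`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[f] for f in features)
        groups[pattern][label] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

# Hypothetical data: columns are F1, F2, F3.
rows = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
labels = ["A", "B", "A", "B", "A", "B"]

print(inconsistency_rate(rows, labels, [0, 1, 2]))  # (F1, F2, F3)
print(inconsistency_rate(rows, labels, [0, 2]))     # (F1, F3)
print(inconsistency_rate(rows, labels, [2]))        # F3 alone
```

On this data, dropping F2 costs nothing (the rate stays at 2/6), while keeping only F3 raises it to 3/6.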

Feature Selection (3)
• Filter vs. Wrapper Model
  • Pros and cons
    • time
    • generality
    • performance such as accuracy
• Stopping criteria
  • thresholding (number of iterations, some accuracy, …)
  • anytime algorithms
    • providing approximate solutions
    • solutions improve over time

Feature Selection (Examples)
• SFS using consistency (cRate)
  • select 1 from n, then 1 from n-1, n-2, … features
  • increase the number of selected features until the pre-specified cRate is reached
• LVF using consistency (cRate)
  1. randomly generate a subset S from the full set
  2. if it satisfies the pre-specified cRate, keep the S with the minimum #S
  3. go back to 1 until a stopping criterion is met
  • LVF is an anytime algorithm
• Many other algorithms: SBS, B&B, ...
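The LVF loop described above can be sketched as follows. This is a minimal illustration under assumptions, not the published algorithm verbatim: the threshold is expressed as a maximum inconsistency rate, the stopping criterion is a fixed number of tries, and the data set is hypothetical.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, features):
    """#inconsistencies / N for the projection of rows onto `features`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[f] for f in features)][label] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)

def lvf(rows, labels, max_rate, tries=200, seed=0):
    """Random search for a small feature subset within the allowed inconsistency."""
    rng = random.Random(seed)
    n = len(rows[0])
    best = list(range(n))                       # start from the full set
    for _ in range(tries):                      # stopping criterion: try budget
        size = rng.randint(1, len(best))
        subset = sorted(rng.sample(range(n), size))
        smaller = len(subset) < len(best)
        if smaller and inconsistency_rate(rows, labels, subset) <= max_rate:
            best = subset                       # keep the smallest acceptable subset
    return best

# Hypothetical data: columns are F1, F2, F3.
rows = [(0, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
labels = ["A", "B", "A", "B", "A", "B"]
print(lvf(rows, labels, max_rate=2 / 6))
```

Because `best` is a valid answer at every iteration and can only shrink, the loop can be interrupted at any time, which is what makes it an anytime algorithm.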

Transformation: PCA
• D' = DA, where D is mean-centered (N × n)
• Calculate and rank the eigenvalues of the covariance matrix
• Select the m largest λ's such that r > threshold (e.g., .95), where
  $r = \left(\sum_{i=1}^{m} \lambda_i\right) / \left(\sum_{i=1}^{n} \lambda_i\right)$
• The corresponding eigenvectors form A (n × m)
• Example of Iris data
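The choice of m from the ranked eigenvalues can be sketched without any linear algebra library. The eigenvalues below are rounded values often quoted for the covariance matrix of the four Iris measurements. They are an assumption added for illustration and are not taken from the slide.

```python
def choose_m(eigenvalues, threshold=0.95):
    """Smallest m whose top-m eigenvalues explain more than `threshold` of the total."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    running = 0.0
    for m, lam in enumerate(ranked, start=1):
        running += lam
        if running / total > threshold:
            return m, running / total
    return len(ranked), 1.0

# Assumed, rounded eigenvalues (illustrative only).
eigenvalues = [4.228, 0.243, 0.078, 0.024]
m, r = choose_m(eigenvalues)
print(m, round(r, 3))
```

With these values, the first eigenvalue alone gives r of about 0.925, so two components are needed to pass the 0.95 threshold.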

Instance Selection
• Sampling methods
  • random sampling
  • stratified sampling
• Search-based methods
  • Representatives
  • Prototypes
  • Sufficient statistics (N, mean, stdDev)
  • Support vectors
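Stratified sampling, as opposed to plain random sampling, draws the same fraction from every class so that the class proportions survive. A minimal sketch follows, with a hypothetical helper name and the assumption that every class keeps at least one instance.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, labels, fraction, seed=0):
    """Sample `fraction` of the instances from each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for label, members in by_class.items():
        k = max(1, round(len(members) * fraction))   # at least one per class (assumption)
        sample.extend((row, label) for row in rng.sample(members, k))
    return sample

# Hypothetical data set: 80 instances of class P, 20 of class N.
rows = list(range(100))
labels = ["P"] * 80 + ["N"] * 20
sample = stratified_sample(rows, labels, 0.1)
print(Counter(label for _, label in sample))
```

A plain random sample of 10 instances could easily contain no N instances at all, while the stratified one always keeps the 8:2 ratio here.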

Value Discretization
• Binning methods
  • Equal-width
  • Equal-frequency
  • Class information is not used
• Entropy-based
• ChiMerge
• Chi2

Binning
• Attribute values (for one attribute, e.g., age):
  • 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning, for a bin width of e.g. 10:
  • Bin 1: 0, 4 in the (-∞, 10) bin
  • Bin 2: 12, 16, 16, 18 in the [10, 20) bin
  • Bin 3: 24, 26, 28 in the [20, +∞) bin
  • We use -∞ to denote negative infinity and +∞ for positive infinity
• Equi-frequency binning, for a bin density of e.g. 3:
  • Bin 1: 0, 4, 12 in the (-∞, 14) bin
  • Bin 2: 16, 16, 18 in the [14, 21) bin
  • Bin 3: 24, 26, 28 in the [21, +∞) bin
• Any problems with the above methods?
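Both binning schemes can be reproduced on the age values with a short sketch. The function names are illustrative, and the equal-width version assumes bins aligned at multiples of the width, as in the slide.

```python
def equal_width_bins(values, width):
    """Group values by the interval [k*width, (k+1)*width) they fall into."""
    bins = {}
    for v in values:
        bins.setdefault(int(v // width), []).append(v)
    return [bins[k] for k in sorted(bins)]

def equal_frequency_bins(values, per_bin):
    """Sort the values and cut them into consecutive groups of `per_bin` items."""
    ordered = sorted(values)
    return [ordered[i:i + per_bin] for i in range(0, len(ordered), per_bin)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(ages, 10))      # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(ages, 3))   # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

Neither function looks at class labels, and the equal-frequency cut can separate identical values into different bins when a boundary falls inside a run of duplicates.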

Entropy-based
• Given attribute-value/class pairs:
  • (0,P), (4,P), (12,P), (16,N), (16,N), (18,P), (24,N), (26,N), (28,N)
• Entropy-based binning via binarization:
  • Intuitively, find the best split so that the bins are as pure as possible
  • Formally characterized by maximal information gain.
• Let S denote the above 9 pairs, p = 4/9 be the fraction of P pairs, and n = 5/9 be the fraction of N pairs.
• Entropy(S) = - p log2 p - n log2 n.
  • Smaller entropy: the set is relatively pure; the smallest is 0.
  • Larger entropy: the set is mixed; the largest is 1.

Entropy-based (2)
• Let v be a possible split. Then S is divided into two sets:
  • S1: value <= v and S2: value > v
• Information of the split:
  • I(S1,S2) = (|S1|/|S|) Entropy(S1) + (|S2|/|S|) Entropy(S2)
• Information gain of the split:
  • Gain(v,S) = Entropy(S) - I(S1,S2)
• Goal: split with maximal information gain.
• Possible splits: midpoints between any two consecutive values.
• For v = 14, I(S1,S2) = 3/9 * 0 + 6/9 * Entropy(S2) = 6/9 * 0.65 = 0.433
• Gain(14,S) = Entropy(S) - 0.433
• Maximum Gain means minimum I.
• The best split is found after examining all possible split points.
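The split search can be checked on the nine pairs with a short sketch. Base-2 logarithms are assumed (consistent with a maximum entropy of 1 for two classes), and candidate splits are taken as midpoints between consecutive distinct values.

```python
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    result = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        result -= p * log2(p)
    return result

def best_split(pairs):
    """Return (v, gain, info) for the midpoint split with maximal information gain."""
    values = sorted({x for x, _ in pairs})
    total = entropy([c for _, c in pairs])
    best = None
    for lo, hi in zip(values, values[1:]):
        v = (lo + hi) / 2
        s1 = [c for x, c in pairs if x <= v]
        s2 = [c for x, c in pairs if x > v]
        info = (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(pairs)
        gain = total - info
        if best is None or gain > best[1]:
            best = (v, gain, info)
    return best

pairs = [(0, "P"), (4, "P"), (12, "P"), (16, "N"), (16, "N"),
         (18, "P"), (24, "N"), (26, "N"), (28, "N")]
v, gain, info = best_split(pairs)
print(v, round(info, 3), round(gain, 3))
```

On this data the search selects v = 14 with I(S1,S2) of about 0.433, matching the worked value above. Applying the same search recursively to each side yields more than two bins.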

ChiMerge and Chi2
• Given attribute-value/class pairs
• Build a contingency table for every pair of intervals (I)
• Chi-Squared Test (goodness-of-fit):
  $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{k} (A_{ij} - E_{ij})^2 / E_{ij}$
• Parameters: df = k-1 and p% level of significance
• Chi2 algorithm provides an automatic way to adjust p
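The statistic for one pair of adjacent intervals can be sketched as below. Here A_ij is the number of instances of class j in interval i, and E_ij is assumed to be the usual expected count (row total × column total / N). Skipping cells whose expected count is zero is a simplification chosen for this sketch.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given per-class counts."""
    table = [counts_a, counts_b]
    row_totals = [sum(row) for row in table]
    col_totals = [a + b for a, b in zip(counts_a, counts_b)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(len(col_totals)):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:                   # skip classes absent from both intervals
                chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Class counts are [P, N]; the intervals are hypothetical.
print(chi_square([1, 0], [0, 2]))   # different class distributions
print(chi_square([2, 2], [1, 1]))   # same class distribution
```

A low value means the two intervals have similar class distributions, so they are candidates for merging. A value above the critical value for df = k-1 at the chosen significance level keeps them apart.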

Summary
• Data have many forms
  • Attribute-vector data is the most common form
• Raw data need to be prepared and preprocessed for data mining
  • Data miners have to work on the data provided
  • Domain expertise is important in DPP
• Data preparation: Normalization, Transformation
• Data preprocessing: Cleaning and Reduction
• DPP is a critical and time-consuming task
  • Why?

Bibliography
• H. Liu & H. Motoda, 1998. Feature Selection for Knowledge Discovery and Data Mining. Kluwer.
• M. Kantardzic, 2003. Data Mining - Concepts, Models, Methods, and Algorithms. IEEE and Wiley Inter-Science.
• H. Liu & H. Motoda, edited, 2001. Instance Selection and Construction for Data Mining. Kluwer.
• H. Liu, F. Hussain, C.L. Tan, and M. Dash, 2002. Discretization: An Enabling Technique. DMKD 6:393-423.
