Data Preparation: Patterns, Models & Transformations

COMP 3503Data Preparation and Meta Data with Daniel L. Silver

The KDD Process Interpretation and Evaluation Data Mining Knowledge Selection and Pre-processing p(x)=0.02 Data Consolidation Patterns & Models Prepared Data Warehouse Consolidated Data Data Sources

Selection and Pre-processing Core Problems & Approaches • Problems: • identification of relevant data • representation of data • search for valid pattern or model • Approaches: • top-down verification by expert • interactive visualization of data/models • * bottom-up inductionfrom data * Probability of sale Age Income OLAP Data Mining

Selection and Pre-processing • As much effort is expended preparing data as applying a data mining tool • Iterative approach: prepare develop data model • Data Mining phase will benefit from any insight that leads to improved set of attributes • Representation can facilitate or frustrate the Search for the most accurate model (hypothesis) Spreadsheet, OLAP and visualization tools are very helpful

Selection and Pre-processing Data and Variable Characteristics • Three basic variable data types: • Nominal (catagorical) qualitative values marital status = single, married, divorced, widowed • Ordinal (ranked) values have rank order grade = A,B,C • Interval values have order plus a metric scale for comparisons and arithmetic operations temperature = 2, 10, 20 or 10.5, 15.2, 19.3 date = 12Aug99, 13Feb02

Selection and Pre-processing Data and Variable Characteristics • Variables can be either discrete or continuous • Only interval numeric values are continuous • In addition data can be of various formats: • Text, numeric, logical, binary, date, money, … • Data mining software will vary in its ability to accept these types and formats

Selection and Pre-processing Data Selection and Sampling • Select response (dependent) variable • determine prior probability of categories • deal with volume bias issues • Select predictor (independent) attributes • Generate a set of examples • choose sampling method (random, stratified) • consider sample complexity: How many examples do I need to develop a reliable model? • Handle outliers (obvious exceptions) • Remove the row or replace with imputed value

Selection and Pre-processing Data Reduction • The curse of dimensionality • number of attributes / number of values • Reduce number of attributes • remove redundant and correlating attributes • combine attributes (logically, arithmetically, statistically (Principal Components Analysis) • Reduce attribute value ranges • group symbolic discrete values • quantize continuous numeric values

pos r neg r ? Y X X X Selection and Pre-processing Preliminary Statistical Analysis • Coefficient of correlation , r, measures the linear dependence of two variables X and Y, -1 < r < +1; r shows magnitude of r. • Select attributes which correlate strongly with the response variable and pertain to problem 2

Selection and Pre-processing Preliminary Statistical Analysis • Remove or combine attributes that correlate with each other or try to de-correlate through transformation • Factor Analysis - ANOVA can be used to compare relative contribution of each attribute to outcomes • Principal Component Analysis - generates variates - linear combinations of original attributes Tools such as Minitab, SAS, SPSS can be used

Selection and Pre-processing • Transform data • de-correlate and normalize values • map time-series data to static representation • Encode data • representation must be appropriate for the Data Mining tool which will be used • continue to reduce attribute dimensionality where possible without loss of information Use spreadsheet functions or transformation and encoding software within DM tool

Selection and Pre-processing Transformation and Encoding Discrete variable values • If necessary transform to discrete numeric values • Example, encode the value 4 as follows: • Nominal: one-of-N code (0 1 0 0 0) - five inputs • Ordinal: thermometer code ( 1 1 1 1 0) - five inputs • Interval: real value (0.4)* - one input • Consider relationship between values • (single, married, divorce) vs. (youth, adult, senior)

Selection and Pre-processing Transformation and Encoding Continuous numeric values • De-correlate via normalization of values: • Min-max: x’ = [(newmax – newmin) (x – min) / (max – min)] + newmin • Euclidean: x’ = x / sqrt(sum of all x^2) • Percentage: x’ = x/(sum of all x) • Variance based: x’ = (x - (mean of all x))/variance • Scale values using a linear transform if data is uniformly distributed or use non-linear (log, power) if skewed distribution

Selection and Pre-processing Transformation and Encoding Other encodings for continuous numeric values Example: 1.6 meters could be encoded as: • Single real-valued number (0.16)* • OK! But what if data is skewed • Bits of a binary number (010000) • BAD! Modeling system must now learning binary encoding • One-of-N quantized intervals (0 1 0 0 0) • NOT GREAT! Presents discontinuties • Distributed (fuzzy) overlapping intervals ( 0.3 0.8 0.1 0.0 0.0) • BEST! Deals well with skewing but no discontinuities

Selection and Pre-processing Extracting Features from a Single Variable • From dates: • Day, week, month, quarter, holiday, weekend day • From time: • Hour, minute, morning, afternoon, evening • From address: • Postal Code components mean something • Telephone number: • NPA-NNX-9999

Selection and Pre-processing Time Series Data • Of great interest to business, science, medicine • Time series data has high dimensional • T1, T2, … , Tn • Approaches to summary/characterization • Current value = Ti • Moving average = MAi = ( Ti + Ti-1 + Ti-2) / 3 • Trends = Ti - MAi or = MAi - MAi-k

Selection and Pre-processing Textual Data • A difficult data type • freeform, open-ended, syntax -vs- semantics • Can have very high dimensions • thousands of potential values • Approaches to summary/characterization • define a fixed set of N word classes based on frequency analysis • map word combinations to one of the N classes • automate via specialty software

Data Warehousing and Preparation Access to Recent Information • www.datawarehousing.com • DWI - Data Warehouse Institute www.dw-institute.com • Wikipedia http://en.wikipedia.org/wiki/Data_warehouse • DW Information Centre http://www.dwinfocenter.org • A DW Tutorial: http://www.planet-source-code.com/vb/scripts/ShowCode.asp?lngWId=5&txtCodeId=378 • Text Books: Data Warehousing texts by W.H. Inmon, Claudia Imhoff, Ralph Kimball D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.

THE ENDdanny.silver@acadiau.ca

Data Preparation: Patterns, Models & Transformations

Data Preparation: Patterns, Models & Transformations

Presentation Transcript

Exploratory Data Mining and Data Preparation

Meta Data Architectures

COMP 3503 Deductive Modeling and Visualization

Data preparation and data capturing

Comp 3503 Web Mining

DATA PREPARATION AND SCREENING

Meta data - data that provides information about data.

DATA PREPARATION and storage

IMS Meta-Data

Meta Data

COMP 3503 Deductive Modeling and Visualization

COMP 3503 Data Warehousing

Data Preparation

Chapter 3 Meta Data and Electronic Data Interchange

Meta Data Services

Exploratory Data Mining and Data Preparation

Lecture 1-2 Data and Data Preparation

Data Preparation and Reduction