IPIAC
Multidimensional data processing

Multivariate data
  Multivariate data consist of several variables for each observation.
  In practice, any serious data set is multivariate.
  Some variables are usually left out to simplify collection and processing.
  Removing variables before data analysis leads to information loss.
  Information that was never collected cannot be recovered.
  One of the most common tasks is clustering or classification.

Classification × Clustering
  classification
    target classes are known
    properties of the target classes are usually unknown
    goal: find rules which separate the observed data into the target classes
  clustering
    target classes are unknown
    goal: find observations with common properties which may (or may not) represent classes in the real world
    a difficult situation

Classification × Clustering

Data mining
  we are trying to extract information from data
    measurements, observations, surveys
  data preparation
    data adjustment: removal of invalid or incomplete observations/measurements
    normalization? Best handled when collecting the data
  extracting information
    we know what we are looking for: testing a hypothesis
    we are trying to discover something new: data exploration
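The data preparation step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the course: the function names `drop_incomplete` and `min_max_normalize` and the sample rows are made up for the example, missing values are assumed to be marked with `None`, and min-max scaling is only one of several possible normalizations.

```python
def drop_incomplete(rows):
    """Keep only observations in which every variable has a value."""
    return [row for row in rows if all(v is not None for v in row)]


def min_max_normalize(rows):
    """Rescale each variable (column) linearly to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        # A constant column has no spread, so it is mapped to 0.0.
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical measurements: three variables per observation.
raw = [
    [5.1, 3.5, 1.4],
    [4.9, None, 1.4],  # incomplete observation, will be removed
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
]

clean = drop_incomplete(raw)
scaled = min_max_normalize(clean)
for row in scaled:
    print(row)
```

Dropping a whole observation is the bluntest way to handle a missing value; it is used here only because it matches the "removal of incomplete observations" wording of the slide.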

Data exploration
  preliminary analysis of the data
  gives a better understanding of its characteristics
  allows us to select the right tools for preprocessing or analysis
    wrong tools may yield invalid information or hide important patterns
  also known as Exploratory Data Analysis (EDA)
    a different approach: a shift in mindset is required
    concentrates on the larger view
    since 1977
  aka visual data mining

The purpose of computing is insight, not numbers
Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962

Exploratory data analysis steps
  maximize insight into a data set
  uncover underlying structure
  extract important variables
  detect outliers and anomalies
  test underlying assumptions
  develop minimalistic models
  determine optimal property settings
  relies heavily on graphics
    numbers are very abstract

Example
  Characteristics:
    N = 11
    Mean of X = 9.0
    Mean of Y = 7.5
    Intercept = 3
    Slope = 0.5
    Residual standard deviation = 1.237
    Correlation = 0.816
  Have we realized something important?
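The summary values on this slide coincide with those of Anscombe's quartet, four small data sets that share nearly identical summary statistics but look entirely different when plotted. Assuming that is the data the example refers to (the slide itself does not name its data set), the sketch below recomputes the characteristics for all four sets. The helper name `summarize` is made up for the illustration.

```python
from math import sqrt

# Anscombe's quartet: four (x, y) data sets with 11 observations each.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return the summary characteristics listed on the slide."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx                      # least-squares line y = intercept + slope * x
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": sqrt(sse / (n - 2)),  # two fitted parameters
        "correlation": sxy / sqrt(sxx * syy),
    }


for name, (x, y) in ANSCOMBE.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The numbers agree across the four sets to about two decimal places, which is the point of the example: the differences only become visible once the data are plotted.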

Run-sequence plot, Histogram
  Run-sequence plot
    similar to a line chart in Excel
    shows shifts in variation, shifts in location, and outliers
  Histogram
    shows center, spread, skew, multimodality, and outliers
    very useful: know how to create it!
    nice presentations (e.g. word cloud, tag cloud)
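Creating a histogram comes down to counting values per interval. A minimal sketch, assuming equal-width bins; the function name `histogram` and the sample values are invented for the illustration:

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("all values are identical; no interval width to use")
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, not to a bin past the end.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements with one value far away from the rest.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram(data, bins=8)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

In the text rendering the center and spread of the data show up as the block of filled bins, and the outlier as a lone bar separated from it by empty bins.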

Lag plot
  checks whether the data set is random or not
    random data should have no observable structure
  lag = fixed time displacement
    can be arbitrary
    most common is 1
  what to observe
    weak autocorrelation
    strong autocorrelation
    sinusoidal model
    outliers
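A lag plot draws each value against the value that follows it after the chosen lag. The sketch below builds those point pairs and computes the usual sample autocorrelation as a numeric companion to the picture. The names `lag_pairs` and `autocorrelation` and both sample series are assumptions made for the example.

```python
from math import pi, sin


def lag_pairs(series, lag=1):
    """Points of the lag plot: (value at time i, value at time i + lag), lag >= 1."""
    return list(zip(series[:-lag], series[lag:]))


def autocorrelation(series, lag=1):
    """Sample autocorrelation of the series at the given lag (lag >= 1)."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((v - mean) ** 2 for v in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum


# A sinusoidal series: consecutive values are close to each other.
sinusoid = [sin(2 * pi * i / 20) for i in range(200)]
# An alternating series: each value is the opposite of the previous one.
alternating = [1, -1] * 5

print(lag_pairs([1, 2, 3, 4], lag=1))
print(autocorrelation(sinusoid, lag=1))
print(autocorrelation(alternating, lag=1))
```

For random data the pairs scatter without structure and the autocorrelation stays near zero; a value close to 1 or -1 corresponds to the tight shapes seen in the lag plot.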

Lag plot – both X and Y

Scatter plot

Lag plot – same data

Visualization of data
  1 dimension: piece of cake (pie)
  2 dimensions: still easy, Cartesian coordinate system
  3 dimensions: still doable in a Cartesian system
  4 and more dimensions: only Chuck Norris can do that in a Cartesian system
    other types of visualization are required
    some may be useful only for some types of data

Multidimensional visualization
  understanding the data is very important
  good visualization can help us understand the contained information
  results need to be presented to other people
  sanity check, intuition: people catch patterns which automated methods miss
  some options:
    bubble chart (3-dimensional scatter plot)
    scatter plot array
    star plot, Radviz, Polyviz
    parallel coordinates

Bubble chart
  also called a 3-dimensional scatter plot
  2 data dimensions: graph X and Y
  3rd dimension: point size
  optional 4th dimension: point color
  advantages
    helps to uncover clusters and variable dependencies
    easy to understand
  disadvantages
    different combinations need to be tried

Scatter plot array
  extension of the common scatter plot
  2-dimensional array of scatter plots
    each combination of variables is drawn (twice)
    the diagonal holds the variable descriptions
  easy to create
  messy
  dependencies between more than two variables are still hidden
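The layout of a scatter plot array can be written down without any plotting library. A minimal sketch, using four variable names as sample input; the function name `scatter_array_layout` and the tuple format of the cells are invented for the illustration:

```python
VARIABLES = ["sepal length", "sepal width", "petal length", "petal width"]


def scatter_array_layout(variables):
    """Return the grid of panels; diagonal cells hold the variable description."""
    grid = []
    for row_var in variables:
        row = []
        for col_var in variables:
            if row_var == col_var:
                row.append(("label", row_var))
            else:
                # x axis comes from the column variable, y axis from the row variable
                row.append(("scatter", col_var, row_var))
        grid.append(row)
    return grid


grid = scatter_array_layout(VARIABLES)
scatter_panels = [cell for row in grid for cell in row if cell[0] == "scatter"]
distinct_pairs = {frozenset(cell[1:]) for cell in scatter_panels}

print(len(scatter_panels), "scatter panels for", len(distinct_pairs), "distinct pairs")
```

Every pair of variables appears twice, once above and once below the diagonal with the axes swapped, so the number of panels grows quadratically with the number of variables.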

[Figure: scatter plot array of the variables sepal length, sepal width, petal length, petal width]

Star plot, Radviz, Polyviz
  axes radiate from a central point
  Star plot
    values of a data point are connected to form a polygon
    can display only a small number of points
    order of variables may be important
  Radviz
    values of a data point act as spring stiffness
    values are normalized into the interval <0, 1>
    the object is placed at the equilibrium of all forces
    order of variables becomes very important
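The Radviz placement rule can be made concrete. The sketch below assumes the common formulation in which one anchor per variable is spread evenly on the unit circle and each anchor pulls the point with a spring whose stiffness equals the normalized value; the forces cancel at the stiffness-weighted average of the anchor positions. The function names are made up for the example.

```python
from math import cos, pi, sin


def radviz_anchors(count):
    """Anchor points of the variables, spread evenly on the unit circle."""
    return [(cos(2 * pi * j / count), sin(2 * pi * j / count)) for j in range(count)]


def radviz_point(values):
    """Place one observation; `values` must already be normalized to [0, 1]."""
    anchors = radviz_anchors(len(values))
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls, so the point stays in the center
    # Equilibrium of the spring forces: stiffness-weighted mean of the anchors.
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)


print(radviz_point([1.0, 0.0, 0.0, 0.0]))  # pulled by the first anchor only
print(radviz_point([0.5, 0.5, 0.5, 0.5]))  # equal pulls from all sides
print(radviz_point([1.0, 1.0, 0.0, 0.0]))  # two neighbouring anchors
print(radviz_point([1.0, 0.0, 1.0, 0.0]))  # same values, different variable order
```

The last two calls use the same set of values in a different variable order and land in different places, which illustrates why the order of variables matters so much in Radviz.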

[Figure legend: Iris-virginica, Iris-versicolor, Iris-setosa]

Polyviz
  similar principle to Radviz
    data points are not attracted to a single point
    data points are attracted to an axis
    circle becomes polygon → Polyviz
  order of variables is less important
  polygon edges become very important
    candidates for classification rules
    different combinations of variables
  exact position of a point is displayed: no information loss

Parallel Coordinates
  an orthogonal system uses up the plane very fast
  a geometrical transformation, unlike the previously mentioned methods
    has other uses than just visualization
  low representational complexity, lower than that of a scatter plot array
  equidistant parallel axes
    same positive orientation
  a point C in the multidimensional space is represented by a polygonal line
  a plane in that space is represented by lines
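The transformation of a point into a polygonal line is short enough to write out. A minimal sketch, assuming the axes are vertical, numbered from 0 and placed `spacing` apart; the name `to_polyline`, the `spacing` parameter and the sample observation are assumptions of the example:

```python
def to_polyline(point, spacing=1.0):
    """Map a point with N coordinates to the N vertices of its polygonal line.

    Vertex i lies on the i-th parallel axis (horizontal position i * spacing)
    at the height given by the i-th coordinate of the point.
    """
    return [(i * spacing, value) for i, value in enumerate(point)]


# Hypothetical observation with four variables.
observation = [5.1, 3.5, 1.4, 0.2]
vertices = to_polyline(observation)
segments = list(zip(vertices[:-1], vertices[1:]))

print(vertices)
print(len(segments), "segments")
```

Each additional variable costs one more axis and one more segment per observation, whereas a scatter plot array needs a new row and column of panels.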

Parallel Coordinates
  advantages
    determine correlation between variables
      both positive and negative
    determine partial correlations
      only some values of one variable are correlated with some values of another variable
      very important
  disadvantages
    dependent on variable ordering
    not that useful without interactive software
    may be hard to understand for newbies
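One way to see how correlation shows up between two neighbouring axes: with positive correlation the line segments run roughly in parallel, with negative correlation they cross. The sketch below counts crossings as a rough numeric stand-in for that visual impression. It is an illustration only; the name `crossing_fraction` and the sample values are made up, and both axes are assumed to share the same orientation.

```python
from itertools import combinations


def crossing_fraction(left, right):
    """Fraction of observation pairs whose segments cross between two axes.

    `left` and `right` hold the values of the same observations on two
    neighbouring parallel axes. Two segments cross exactly when the two
    observations are ordered one way on the left axis and the other way
    on the right axis.
    """
    pairs = list(combinations(range(len(left)), 2))
    crossings = sum(
        1 for i, j in pairs
        if (left[i] - left[j]) * (right[i] - right[j]) < 0
    )
    return crossings / len(pairs)


first_axis = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # positively correlated with first_axis
falling = [10, 8, 6, 4, 2]   # negatively correlated with first_axis

print(crossing_fraction(first_axis, rising))
print(crossing_fraction(first_axis, falling))
```

Because only neighbouring axes are compared this way, reordering the variables changes which correlations are visible, which is the ordering dependence listed among the disadvantages.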

References
  Exploratory data analysis: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
    Have a look at the graphical techniques: http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
  Orange Canvas, open-source data mining: http://orange.biolab.si/
    interface similar to IBM Clementine (SPSS Modeler)
    widget documentation: http://orange.biolab.si/doc/widgets/
  Sample data
    http://archive.ics.uci.edu/ml/index.html
    http://www-958.ibm.com/software/data/cognos/manyeyes/

