
IPIAC


varden

Presentation Transcript


  1. IPIAC

    Multidimensional data processing
  2. Multivariate data

    Multivariate data consist of several variables for each observation. In practice, real-world data are always multivariate; some variables are simply not collected, to simplify collection and processing. Removing variables before data analysis leads to information loss, and information that was never recorded cannot be recovered. One of the most common tasks is clustering or classification.
  3. Classification × Clustering

    - classification
      - target classes are known
      - properties of the target classes are usually unknown
      - goal: find rules that separate the observed data into the target classes
    - clustering
      - target classes are unknown
      - goal: find observations with common properties, which may (or may not) represent classes in the real world
      - a difficult situation
  4. Classification × Clustering
  5. Classification × Clustering
  6. Data mining

    - we are trying to extract information from data
      - measurements, observations, surveys
    - data preparation
      - data adjustment: removal of invalid or incomplete observations/measurements
      - normalization? Best handled while collecting the data
    - extracting information
      - we know what we are looking for: testing a hypothesis
      - we are trying to discover something new: data exploration
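The data preparation step can be sketched in plain Python. This is only a minimal illustration, assuming that observations are stored as dictionaries and that a missing measurement is recorded as `None`. The field names, the sample values, and the helper names `drop_incomplete` and `min_max_normalize` are hypothetical, and min-max scaling is just one possible normalization.

```python
def drop_incomplete(rows):
    """Keep only the observations in which every variable has a value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def min_max_normalize(rows, key):
    """Rescale one variable linearly into the interval [0, 1]."""
    values = [row[key] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    # A constant variable has no spread; map it to 0.0 to avoid dividing by zero.
    return [{**row, key: (row[key] - lo) / span if span else 0.0} for row in rows]

# Made-up observations; the second one is incomplete.
raw = [
    {"height": 150.0, "weight": 50.0},
    {"height": 200.0, "weight": None},
    {"height": 175.0, "weight": 70.0},
    {"height": 200.0, "weight": 90.0},
]

clean = drop_incomplete(raw)                 # three complete observations remain
scaled = min_max_normalize(clean, "height")  # heights rescaled to 0.0, 0.5, 1.0
```

Dropping a whole observation is the simplest adjustment; whether that is acceptable depends on how much data would be lost.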
  7. Data exploration

    - preliminary analysis of the data
      - better understanding of its characteristics
      - allows selecting the right tools for preprocessing or analysis
      - wrong tools may yield invalid information or hide important patterns
    - also known as Exploratory Data Analysis (EDA)
      - a different approach: a mind shift is required
      - concentrates on the larger view
      - from 1977 onward
      - aka visual data mining
  8. The purpose of computing is insight, not numbers

    Richard Wesley Hamming, Numerical Methods for Scientists and Engineers, 1962
  9. Exploratory data analysis

    - steps
      - maximize insight into a data set
      - uncover underlying structure
      - extract important variables
      - detect outliers and anomalies
      - test underlying assumptions
      - develop minimalistic models
      - determine optimal property settings
    - heavily relies on graphics
      - numbers are very abstract
  10. Example

    - Characteristics:
      - N = 11
      - Mean of X = 9.0
      - Mean of Y = 7.5
      - Intercept = 3
      - Slope = 0.5
      - Residual standard deviation = 1.237
      - Correlation = 0.816
    - Have we learned anything important?
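The listed characteristics match the figures commonly reported for Anscombe's quartet, so the sketch below assumes that this is the data behind the slide (the transcript does not name its source). It computes the same summary for the first two sets of the quartet; the function name `summarize` is illustrative. Both sets give practically identical numbers although their Y values differ, which is why the numbers alone tell us little.

```python
import math

# First two sets of Anscombe's quartet (assumed source of the slide's numbers).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_first = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_second = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def summarize(x, y):
    """Return the summary statistics of a simple least-squares line fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": math.sqrt(sse / (n - 2)),
        "correlation": sxy / math.sqrt(sxx * syy),
    }

first = summarize(x, y_first)
second = summarize(x, y_second)
for name in first:
    print(f"{name:12s} {first[name]:8.3f} {second[name]:8.3f}")
```

Plotting the two sets as scatter plots shows the difference that the summary hides.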
  11. Run-sequence plot, Histogram

    - Run-sequence plot
      - similar to a line chart in Excel
      - shows shifts in variation
      - shows shifts in location
      - shows outliers
    - Histogram
      - shows center, spread, skew, multimodality
      - shows outliers
      - very useful: know how to create it!
      - nice presentations (e.g. word cloud, tag cloud)
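As a rough sketch of how a histogram is created, the following code counts values in equal-width bins and prints a text version. The function name `histogram`, the number of bins, and the data are made up for illustration; plotting libraries offer more elaborate binning rules.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is put into the last bin instead of a bin of its own.
        index = min(int((v - lo) / width), bins - 1) if width else 0
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up values; 9 is an outlier
counts, edges = histogram(data, bins=4)
for count, left, right in zip(counts, edges, edges[1:]):
    print(f"{left:4.1f} to {right:4.1f}: {'#' * count}")
```

The isolated bar at the right end is how an outlier typically shows up in a histogram.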
  12. Lag plot

    - checks whether the data set is random or not
      - random data should have no observable structure
    - lag = fixed time displacement
      - can be arbitrary
      - most common is 1
    - observe
      - weak autocorrelation
      - strong autocorrelation
      - sinusoidal model
      - outliers
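A lag plot draws each value against the value that precedes it by the lag. The sketch below builds those point pairs and also computes the sample autocorrelation coefficient as a numeric companion to the picture. The function names and the test series are illustrative assumptions, not taken from the slides.

```python
import math
import random

def lag_pairs(series, lag=1):
    """Return the points (y[i], y[i + lag]) that a lag plot would draw (lag >= 1)."""
    return list(zip(series[:-lag], series[lag:]))

def autocorrelation(series, lag=1):
    """Sample autocorrelation coefficient of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((v - mean) ** 2 for v in series)
    numerator = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return numerator / denominator

# Illustrative series: a slow sine wave, a strictly alternating signal, and noise.
sine = [math.sin(2 * math.pi * t / 20) for t in range(100)]
alternating = [(-1) ** t for t in range(100)]
rng = random.Random(0)
noise = [rng.random() for _ in range(500)]

print(autocorrelation(sine))         # close to +1: strong positive autocorrelation
print(autocorrelation(alternating))  # close to -1: strong negative autocorrelation
print(autocorrelation(noise))        # expected to be near 0 for random data
```

In the lag plot itself, the sine series forms an ellipse, while random data form a structureless cloud.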
  13. Lag plot – both X and Y
  14. Scatter plot
  15. Lag plot – same data
  16. Visualization of data

    - 1 dimension: piece of cake (pie)
    - 2 dimensions: still easy, Cartesian coordinate system
    - 3 dimensions: still doable in a Cartesian system
    - 4 and more dimensions: only Chuck Norris can do that in a Cartesian system
      - other types of visualization are required
      - some may be useful only for some types of data
  17. Multidimensional visualization

    - understanding the data is very important
      - good visualization can help us understand the contained information
      - results need to be presented to other people
      - sanity check, intuition: people capture patterns that are missed by automated methods
    - some options:
      - bubble chart (3-dimensional scatter plot)
      - scatter plot array
      - star plot, Radviz, Polyviz
      - parallel coordinates
  18. Bubble chart

    - also called a 3-dimensional scatter plot
      - 2 data dimensions: graph X and Y
      - 3rd dimension: point size
      - optional 4th dimension: point color
    - advantages
      - allows uncovering clusters and variable dependencies
      - easy to understand
    - disadvantages
      - different combinations need to be tried
  19. Scatter plot array

    - extension of the common scatter plot
      - 2-dimensional array of scatter plots
      - each combination of variables is drawn (twice)
      - variable descriptions on the diagonal
    - easy to create
    - messy
    - dependencies between more than two variables are still hidden
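The layout of a scatter plot array can be sketched without any plotting library. The code below only describes what each cell of the grid contains; the function name `scatter_plot_array_layout` is hypothetical, and the variable names are the four Iris measurements.

```python
def scatter_plot_array_layout(variables):
    """Describe each cell of the k x k grid.

    Diagonal cells hold the variable name; every other cell holds the
    (x variable, y variable) pair of the scatter plot drawn there.
    """
    layout = {}
    for row, y_var in enumerate(variables):
        for col, x_var in enumerate(variables):
            layout[(row, col)] = y_var if row == col else (x_var, y_var)
    return layout

variables = ["sepal length", "sepal width", "petal length", "petal width"]
layout = scatter_plot_array_layout(variables)

# Every unordered pair of variables appears twice, mirrored across the diagonal.
panels = [cell for cell in layout.values() if isinstance(cell, tuple)]
unique_pairs = {frozenset(cell) for cell in panels}
print(len(layout), len(panels), len(unique_pairs))
```

For k variables the grid has k * k cells but only k * (k - 1) / 2 distinct pairs, which is one reason the display gets messy quickly.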
  20. Figure: scatter plot array of the Iris data (variables: sepal length, sepal width, petal length, petal width)
  21. Star plot, Radviz, Polyviz

    - axes radiate from a central point
    - Star plot
      - values of a data point are connected to form a polygon
      - can display only a small number of points
      - order of variables may be important
    - Radviz
      - values of a data point act as spring stiffnesses
      - values are normalized into the interval [0, 1]
      - the object is placed at the equilibrium of all forces
      - order of variables becomes very important
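The Radviz placement can be written down directly: if each variable has an anchor on a circle and pulls the point with a spring whose stiffness is the normalized value, the forces balance at the stiffness-weighted average of the anchors. The sketch below assumes anchors spaced evenly on the unit circle and values already normalized into [0, 1]; the function name `radviz_position` is illustrative.

```python
import math

def radviz_position(values):
    """Return the (x, y) equilibrium position of one data point."""
    n = len(values)
    # One anchor per variable, spaced evenly on the unit circle (an assumption).
    anchors = [
        (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    total = sum(values)
    if total == 0:
        # No spring pulls at all; by convention keep the point in the center.
        return (0.0, 0.0)
    x = sum(v * ax for v, (ax, _) in zip(values, anchors)) / total
    y = sum(v * ay for v, (_, ay) in zip(values, anchors)) / total
    return (x, y)

# The same two high values, assigned to neighbouring versus opposite anchors.
print(radviz_position([1.0, 1.0, 0.0, 0.0]))  # pulled between two neighbouring anchors
print(radviz_position([1.0, 0.0, 1.0, 0.0]))  # opposite pulls cancel near the center
```

The two calls use the same values in a different variable order and land in different places, which illustrates why the order of variables matters so much in Radviz.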
  22. Iris-virginica
  23. Iris-versicolor
  24. Iris-setosa
  25. Polyviz

    - similar principle to Radviz
      - data points are not attracted to a single point
      - data points are attracted to an axis
      - the circle becomes a polygon → Polyviz
    - order of variables is less important
    - polygon edges become very important
      - candidates for classification rules
      - different combinations of variables
    - the exact position of a point is displayed: no information loss
  26. Parallel Coordinates

    - an orthogonal system uses up the plane very fast
    - a geometrical transformation
      - unlike the previously mentioned methods
      - has other uses than just visualization
    - low representational complexity, lower than that of the scatter plot array
    - equidistant parallel axes
      - same positive orientation
    - a point C in N-dimensional space is represented by a polygonal line
    - a plane in that space is represented by lines
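The mapping of a point to its polygonal line can be sketched as follows. The axes are placed at x = 0, 1, 2, ... and each coordinate is rescaled to a common height using per-variable minima and maxima. That rescaling, the function name `to_polyline`, and the sample values are assumptions made for illustration.

```python
def to_polyline(point, minima, maxima):
    """Map an N-dimensional point to the N vertices of its polygonal line."""
    vertices = []
    for axis, (value, lo, hi) in enumerate(zip(point, minima, maxima)):
        # Rescale to [0, 1] so that all axes share one height; a constant
        # variable is drawn at mid-height.
        height = (value - lo) / (hi - lo) if hi > lo else 0.5
        # Axes are equidistant and parallel: axis i stands at x = i.
        vertices.append((float(axis), height))
    return vertices

# Made-up 3-dimensional point with made-up variable ranges.
polyline = to_polyline(point=(1.0, 30.0, 0.5),
                       minima=(0.0, 10.0, 0.0),
                       maxima=(4.0, 50.0, 1.0))
print(polyline)
```

Connecting consecutive vertices with straight segments gives the line for one observation; a data set becomes a bundle of such lines.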
  27. Parallel Coordinates

    - advantages
      - determine correlation between variables
        - both positive and negative
      - determine partial correlations
        - only some values of one variable are correlated with some values of another variable
        - very important
    - disadvantages
      - dependent on variable ordering
      - not that useful without interactive software
      - may be hard to understand for newbies
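One way to see why correlation is visible in parallel coordinates: between two adjacent axes with the same orientation, the segments of two observations cross exactly when the observations are ordered one way on the left axis and the opposite way on the right axis. The sketch below counts such crossings; the function name `crossing_fraction` and the data are illustrative, and the measure is offered as an intuition aid rather than as a method from the slides.

```python
from itertools import combinations

def crossing_fraction(left, right):
    """Fraction of observation pairs whose segments cross between two axes.

    Assumes both axes share the same orientation and that there are
    at least two observations.
    """
    pairs = list(combinations(zip(left, right), 2))
    crossings = sum(
        1 for (a1, b1), (a2, b2) in pairs if (a1 - a2) * (b1 - b2) < 0
    )
    return crossings / len(pairs)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # positively correlated with x
falling = [10, 8, 6, 4, 2]   # negatively correlated with x

print(crossing_fraction(x, rising))   # 0.0: no segments cross
print(crossing_fraction(x, falling))  # 1.0: every pair of segments crosses
```

Only adjacent axes can be compared this way, which is the reason the display depends on the variable ordering.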
  28. References

    - Exploratory data analysis: http://www.itl.nist.gov/div898/handbook/eda/eda.htm
      - have a look at the graphical techniques: http://www.itl.nist.gov/div898/handbook/eda/section3/eda33.htm
    - Orange Canvas, open-source data mining: http://orange.biolab.si/
      - interface similar to IBM Clementine (SPSS Modeler)
      - widget documentation: http://orange.biolab.si/doc/widgets/
    - Sample data
      - http://archive.ics.uci.edu/ml/index.html
      - http://www-958.ibm.com/software/data/cognos/manyeyes/