
Lecture 07: Data Transform II




Presentation Transcript


  1. Lecture 07: Data Transform II September 28, 2010 COMP 150-12 Topics in Visual Analytics

  2. Lecture Outline • Data Retrieval • Methods for increasing retrieval speed: • Pre-computation • Pre-fetching and Caching • Levels of Detail (LOD) • Hardware support • Data transform (pre-processing) • Aggregate (clustering) • Sampling (sub-sampling, re-sampling) • Simplification (dimension reduction) • Appropriate representation (finding underlying mathematical representation)

  3. Dimension Reduction • Lots of possibilities, but can be roughly categorized into two groups: • Linear dimension reduction • Non-linear dimension reduction • Related to machine learning…

  4. Dimension Reduction • Can think of clustering as a dimension reduction mechanism: • Assume the dataset has n dimensions • Using clustering, reduce the data to k dimensions (k < n) • Instead of representing the data as an n-dimensional vector • Represent the data using the k dimensions
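As a concrete illustration of this idea, here is a minimal Python sketch (assuming NumPy and scikit-learn are available; the data, dimension count, and cluster count are made-up values) that re-expresses n-dimensional points as their distances to k cluster centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: 500 points in n = 20 dimensions (values are synthetic).
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 20))

# Cluster into k = 5 groups, then describe each point by its distance to
# each of the k cluster centers: a k-dimensional representation (k < n).
k = 5
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
reduced = kmeans.transform(data)   # shape (500, 5)

print(data.shape, "->", reduced.shape)
```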

  5. Some Common Techniques • Principal Component Analysis • demo • Multi-Dimensional Scaling • draw • Kohonen Maps / Self-Organizing Maps • demo • Isomap • draw

  6. Principal Component Analysis • Quick refresher of PCA • Find the most dominant eigenvectors as principal components • Data points are re-projected into the new coordinate system • For reducing dimensionality • For finding clusters • Problem: PCA is easy to understand mathematically, but difficult to understand “semantically” (e.g., what does 0.5*GPA + 0.2*age + 0.3*height mean?) • [Figure: data points plotted against GPA, age, and height axes]

  7. Principal Component Analysis • Pseudo code • Arrange the data so that each column is a dimension and each row is a data entry (an n×m matrix, n = rows, m = columns) • Subtract the mean of each dimension from its values • Compute the covariance matrix M • Compute the eigenvectors and eigenvalues of M • Use singular value decomposition (SVD): M = U Σ Vᵀ • where U and V are m×m matrices • and Σ is an m×n diagonal matrix (of positive real numbers) • Sort the eigenvectors in U based on their associated eigenvalues in Σ, from highest eigenvalue to lowest • Project your original data onto the first (highest-eigenvalue) eigenvectors
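A minimal NumPy sketch of the steps above (the function name pca and the example data are illustrative, not from the lecture):

```python
import numpy as np

def pca(data, n_components):
    """Follow the slide's steps: center, covariance, eigendecomposition,
    sort by eigenvalue, project. `data` is an n x m array
    (rows = data entries, columns = dimensions)."""
    centered = data - data.mean(axis=0)             # subtract each dimension's mean
    cov = np.cov(centered, rowvar=False)            # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenpairs of the symmetric matrix
    order = np.argsort(eigvals)[::-1]               # highest eigenvalue first
    components = eigvecs[:, order[:n_components]]   # top principal components
    return centered @ components                    # re-project onto the components

# Example: reduce synthetic 10-dimensional data to 2 dimensions.
points = np.random.default_rng(1).normal(size=(200, 10))
print(pca(points, 2).shape)   # (200, 2)
```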

  8. Multi-Dimensional Scaling • Minimize the difference between distances in the low-d and high-d representations: minimize Σ_{i&lt;j} ( ||x_i − x_j|| − δ_ij )² • where x_i is the position of point i in low-dimensional space, and δ_ij is the distance between points i and j in n dimensions
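A short sketch of the same idea, assuming scikit-learn and SciPy are available (the dataset and parameter values are illustrative): compute the pairwise distances δ_ij in the original space, then ask MDS for 2-D positions x_i whose pairwise distances approximate them.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

# Synthetic high-dimensional data: 100 points in 8 dimensions.
high_d = np.random.default_rng(2).normal(size=(100, 8))
delta = squareform(pdist(high_d))        # n-dimensional pairwise distances d_ij

# Find 2-D positions x_i whose pairwise distances best match delta.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
low_d = mds.fit_transform(delta)

print(low_d.shape, mds.stress_)          # residual stress after the fit
```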

  9. Multi-Dimensional Scaling Image courtesy of Jing Yang

  10. Multidimensional Scaling

  11. Multidimensional Scaling

  12. Multidimensional Scaling

  13. Self-Organizing Maps • Pseudo code • Assume input of n rows of m-dimensional data • Define some number of nodes (e.g. a 40x40 grid) • Give each node m values (a vector of size m) • Randomize those values • Loop k number of times: • Select one of the n rows of data as the “input vector” V(t) • Find within the 40x40 grid the node most similar to the input vector (call this node the Best Matching Unit – BMU) • Find the neighbors of the BMU on the grid • Update the BMU and its neighbors using: W(t+1) = W(t) + θ(t) · α(t) · (V(t) − W(t)) • where θ(t) is the Gaussian function of grid distance from the BMU (decays over time) • α(t) is the learning rate (decays over time) • V(t) is the input vector, and W(t) is the grid node’s vector
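A minimal NumPy sketch of this pseudocode (the grid size, decay schedules, and function name train_som are illustrative choices, not from the lecture):

```python
import numpy as np

def train_som(data, grid=(40, 40), iterations=1000, seed=0):
    """Train a self-organizing map on an n x m array `data`."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    m = data.shape[1]
    nodes = rng.random((rows, cols, m))          # each node gets a random m-vector
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing='ij'), axis=-1)
    sigma0, alpha0 = max(grid) / 2.0, 0.5        # initial radius and learning rate

    for t in range(iterations):
        frac = t / iterations
        sigma = sigma0 * np.exp(-3.0 * frac)     # neighborhood radius decays over time
        alpha = alpha0 * np.exp(-3.0 * frac)     # learning rate decays over time

        v = data[rng.integers(len(data))]        # pick an input vector V(t)
        dists = np.linalg.norm(nodes - v, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best matching unit

        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        theta = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))    # Gaussian neighborhood
        # W(t+1) = W(t) + theta(t) * alpha(t) * (V(t) - W(t))
        nodes += (theta * alpha)[..., None] * (v - nodes)
    return nodes

som = train_som(np.random.default_rng(3).random((500, 3)), grid=(20, 20))
print(som.shape)   # (20, 20, 3)
```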

  14. Isomap Image courtesy of Wikipedia: Nonlinear Dimensionality Reduction

  15. Many Others! • To name a few: • Latent Semantic Indexing • Support Vector Machine • Linear Discriminant Analysis (LDA) • Locally Linear Embedding • “manifold learning” • Etc. • Consider the characteristics of the data, and choose the appropriate method. • e.g. are the data labeled? If so, supervised methods apply; otherwise use unsupervised methods.

  16. Support Vector Machine

  17. Questions?
