Biometrics and High Dimensional Data
Václav Snášel
VŠB-Technical University of Ostrava, Czech Republic
Motto
Outline
• Curse of dimensionality, dimensionality reduction
• Singular Value Decomposition (SVD): definition, interpreting an SVD, factor interpretation, geometric interpretation, component interpretation, algorithm issues, algorithms and complexity
• Semidiscrete Decomposition (SDD): interpreting an SDD, factor interpretation, geometric interpretation, component interpretation, graph interpretation, algorithm issues, algorithms and complexity
• Applications of SVD and SDD together
• Tensor decomposition: basic tensor concepts, tensor SVD, approximating a tensor by HOSVD
• Data mining applications: classification of handwritten digits, a simple algorithm for handwritten digits, text mining, ontology, face recognition using tensor SVD
Modern data
• Facts: computers make it easy to collect and store data, and storage costs are very low and dropping fast (most laptops now have a storage capacity of more than 500 GB). When it comes to storing data, the current policy is typically "store everything in case it is needed later" instead of deciding what could be deleted.
• Data mining: extract useful information from the massive amount of available data.
Why linear (or multilinear) algebra?
• Data are very often represented by matrices; numerous modern datasets are in matrix form.
• Data are also represented by tensors: data in the form of tensors (multi-mode arrays) have become very common in the data mining and information retrieval literature in the last few years.
Why matrix decompositions?
• Matrix decompositions = spectral analysis (e.g., SVD, SDD, CX and CUR, NMF, MMMF, etc.)
• They use the relationships between the available data to identify components of the underlying physical system generating the data.
• Some assumptions on the relationships between the underlying components are necessary.
• Very active area of research; some matrix decompositions are more than a century old, whereas others are very recent.
Spectral analysis: a simple analog illustration
• Hidden components in light are separated by a prism.
• Our purpose: finding hidden components by data analysis.
Image matrices
A collection of images is represented by an m-by-n matrix: m pixels (points/features) by n pictures, with Aij = color value of the i-th pixel in the j-th image.
• Data mining tasks
• Cluster or classify images
• Find "nearest neighbors"
• Feature selection: find a subset of features that (accurately) clusters or classifies images
Data representation
For example, take the following typical case: a face recognition/classification system based on m x n greyscale images which, by row concatenation, can be transformed into mn-dimensional real vectors. In practice, one could have images with m = n = 256, i.e., 65536-dimensional vectors: p = (p1, p2, p3, …, p65536).
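A minimal sketch of this row concatenation (assuming NumPy; random arrays stand in for real images):

```python
# Row-concatenate 256x256 greyscale images into 65536-dimensional vectors
# and stack them as columns of a pixel-by-image matrix A.
import numpy as np

rng = np.random.default_rng(0)
images = [rng.random((256, 256)) for _ in range(10)]  # stand-ins for real images

vectors = [img.reshape(-1) for img in images]  # each a 65536-dimensional vector
A = np.column_stack(vectors)                   # A[i, j] = i-th pixel of j-th image
print(A.shape)                                 # (65536, 10)
```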
Retrieval model
• Similarity between two points (two documents, or a document and a query) is usually calculated as the normalized scalar product of their vectors (the cosine measure).
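The formula itself did not survive extraction from the slide; the cosine measure referred to is presumably the standard one. For vectors d and q,

sim(d, q) = (d · q) / (‖d‖ ‖q‖) = cos θ,

where θ is the angle between d and q; the similarity is 1 for parallel vectors and 0 for orthogonal ones.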
Document-term matrices
A collection of documents is represented by an m-by-n matrix: m terms (words) by n documents, with Aij = frequency of the i-th term in the j-th document.
• Data mining tasks
• Cluster or classify documents
• Find "nearest neighbors"
• Feature selection: find a subset of terms that (accurately) clusters or classifies documents
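A toy sketch (my example, not from the slides) of building such a term-by-document frequency matrix in plain Python:

```python
# Term-by-document matrix: A[i][j] = frequency of i-th term in j-th document.
from collections import Counter

docs = ["milk bread milk", "bread wine", "milk wine wine"]  # toy corpus
terms = sorted({t for d in docs for t in d.split()})

counts = [Counter(d.split()) for d in docs]
A = [[counts[j][t] for j in range(len(docs))] for t in terms]
for t, row in zip(terms, A):
    print(t, row)
# bread [1, 1, 0]
# milk  [2, 0, 1]
# wine  [0, 1, 2]
```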
Market basket matrices
An m-by-n matrix: m customers by n products (e.g., milk, bread, wine, etc.), with Aij = quantity of the j-th product purchased by the i-th customer. This is a common representation for association rule mining.
• Data mining tasks
• Find association rules, e.g., customers who buy product x buy product y with probability 89%
• Such rules are used to make item display decisions, advertising decisions, etc.
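To make the rule probability concrete, a toy sketch (my example, assuming NumPy) that estimates the confidence of a rule "buys x ⇒ buys y" from a binarized customer-by-product matrix:

```python
# Confidence of "x => y" = P(customer buys y | customer buys x).
import numpy as np

A = np.array([[1, 1, 0],   # rows: customers; columns: products x, y, z
              [1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]])
x, y = 0, 1
buys_x = A[:, x] == 1
confidence = (A[buys_x, y] == 1).mean()
print(confidence)          # 2/3: of the 3 customers who buy x, 2 also buy y
```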
Social networks (e-mail graph, Facebook, MySpace, etc.)
An m-by-n matrix over users, representing the e-mail communications (relationships) between groups of users, with Aij = number of e-mails exchanged between users i and j during a certain time period.
• Data mining tasks
• Cluster the users
• Identify "dense" networks of users (dense subgraphs)
Recommendation systems
The m-by-n matrix A represents m customers and n products, with Aij = utility of the j-th product to the i-th customer.
Data mining task: given a few samples from A, recommend high-utility products to customers.
Intrusion detection
The m-by-n matrix A represents m records and n attributes, with Aij = value of the j-th attribute in the i-th record. The data for our experiments was prepared by MIT Lincoln Labs for the 1998 DARPA intrusion detection evaluation program.
Data mining task: reduce noise in the data.
Tensors: recommendation systems
• Economics: utility is an ordinal, not a cardinal, concept; compare products rather than assigning utility values.
• Recommendation model revisited: every customer has an n-by-n matrix (whose entries are +1/-1) representing pair-wise product comparisons. There are m such matrices (m customers, n products), forming an n-by-n-by-m 3-mode tensor A; a sketch of its construction follows below.
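A minimal sketch (assuming NumPy; the ranking setup is my illustration) of assembling such a 3-mode tensor of pairwise comparisons:

```python
# A[i, j, c] = +1 if customer c prefers product i to product j, -1 otherwise.
import numpy as np

m, n = 4, 3                                        # customers, products
rng = np.random.default_rng(0)
rankings = [rng.permutation(n) for _ in range(m)]  # each a best-to-worst order

A = np.zeros((n, n, m))
for c, order in enumerate(rankings):
    pos = np.argsort(order)                        # pos[p] = rank of product p
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j, c] = 1 if pos[i] < pos[j] else -1
print(A[:, :, 0])                                  # comparisons of customer 0
```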
Curse of Dimensionality
N - Dimensions
• Sommerville, D. M. Y. 1929. An Introduction to the Geometry of N Dimensions. New York: Dover Publications.
The graph of n-ball volume as a function of dimension was plotted more than 100 years ago by Paul Renno Heyl, who was then a graduate student at the University of Pennsylvania. The volume graph is the lower curve, labeled "content"; the upper curve gives the ball's surface area, for which Heyl used the term "boundary." The illustration is from Heyl's 1897 thesis, "Properties of the locus r = constant in space of n dimensions."
• Brian Hayes: An Adventure in the Nth Dimension. American Scientist, Vol. 99, No. 6, November-December 2011, pages 442-446.
N - Dimensions
Beyond the fifth dimension, the volume of the unit n-ball decreases as n increases. Computing a few larger values of n, we find that V(20,1) ≈ 0.0258 and V(100,1) ≈ 10^-40.
• Brian Hayes: An Adventure in the Nth Dimension. American Scientist, Vol. 99, No. 6, November-December 2011, pages 442-446.
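These values follow from the closed form V(n, r) = π^(n/2) r^n / Γ(n/2 + 1); a short check in Python:

```python
# Volume of the n-ball of radius r via the Gamma function.
import math

def ball_volume(n, r=1.0):
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

print(ball_volume(5))    # ~5.2638, the maximum over integer dimensions
print(ball_volume(20))   # ~0.0258
print(ball_volume(100))  # ~2.37e-40
```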
Curse of Dimensionality
• The curse of dimensionality is a term coined by Richard Bellman to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to a space.
• Bellman, R.E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
Curse of Dimensionality
• When dimensionality increases, data become increasingly sparse in the space they occupy.
• Definitions of density and of distance between points, which are critical for data mining, become less meaningful.
Experiment (figure): randomly generate 500 points and compute the difference between the maximum and minimum distance over any pair of points; a sketch follows below.
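A minimal sketch of this experiment (assuming NumPy and SciPy; the exact dimensions plotted on the original slide are not recoverable):

```python
# As the dimension grows, log10((max - min) / min) over all pairwise
# distances shrinks toward 0: distances become nearly uniform.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    pts = rng.random((500, d))
    dist = pdist(pts)               # all pairwise Euclidean distances
    print(d, np.log10((dist.max() - dist.min()) / dist.min()))
```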
Curse of Dimensionality
The volume of an n-dimensional sphere with radius r is
V(n, r) = π^(n/2) r^n / Γ(n/2 + 1).
Figure: ratio of the volumes of the unit sphere and its embedding hypercube of side length 2, plotted up to dimension 14 (horizontal axis: dimension).
Curse of Dimensionality
Figure: the ratio of the volume of the n-dimensional sphere with radius 20 to the volume of the "circular ring" (the outer shell of width 1 at its surface).
Curse of Dimensionality
Figure: the 2-dimensional case, a circular ring of width 1.
Curse of Dimensionality
Figure: the 20-dimensional case, a circular ring of width 1; most of the volume now lies in the ring.
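A quick check of what these two figures illustrate: for a ball of radius 20, the fraction of its volume lying in the outer shell of width 1 is 1 - (19/20)^n, which grows quickly with the dimension n:

```python
# Fraction of a radius-20 ball's volume inside the outer shell of width 1.
for n in (2, 20, 100):
    print(n, 1 - (19 / 20) ** n)
# 2   0.0975   -> under 10% of the area in 2 dimensions
# 20  0.6415   -> already most of the volume in 20 dimensions
# 100 0.9941
```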
Curse of Dimensionality
• The model space is EMPTY! (in high dimensions, almost all of the volume lies near the surface)
• The distribution of data looks uniform! (in high dimensions, all pairwise distances become nearly uniform)
N-dimensional cube
For convenience, let α(n, i) denote the maximum area of an i-dimensional cross-section of the unit cube I^n. As for the shapes of the cross-sections of I^n, our knowledge is very limited.
Motivation
• An important feature of modern science and engineering is that data of various kinds is being produced at an unprecedented rate.
• The nature of this data differs significantly from that of classical datasets.
Introduction
• Since the volume (and dimensionality) of data is typically large, the emphasis of new algorithms must be on efficiency and scalability to large data sets.
• Analysis of continuous attribute data generally takes the form of eigenvalue/singular value problems (PCA/rank reduction), clustering, least squares problems, etc.
Introduction
The design of approximation algorithms for hard optimization problems can be viewed as a two-step process (a toy sketch follows below):
(i) find a relaxation of the given problem, i.e., a problem whose feasible set encloses that of the original problem and which can be solved in polynomial time, and
(ii) find a rounding, i.e., a mapping from a solution of the relaxation to a solution of the original problem.
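An illustrative sketch of this two-step scheme (my example, not from the slides, assuming NumPy and SciPy): the LP relaxation of minimum vertex cover, followed by the classical threshold rounding at 1/2, which yields a 2-approximation:

```python
# Step (i): LP relaxation of vertex cover; step (ii): round x_v >= 1/2 to 1.
import numpy as np
from scipy.optimize import linprog

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
n = 4

# Relaxation: minimize sum x_v  s.t.  x_u + x_v >= 1 per edge, 0 <= x <= 1.
A_ub = np.zeros((len(edges), n))
for k, (u, v) in enumerate(edges):
    A_ub[k, u] = A_ub[k, v] = -1      # encodes -x_u - x_v <= -1
res = linprog(c=np.ones(n), A_ub=A_ub, b_ub=-np.ones(len(edges)),
              bounds=[(0, 1)] * n)

# Rounding: take every vertex whose fractional value is at least 1/2;
# every edge constraint forces at least one endpoint above the threshold.
cover = [v for v in range(n) if res.x[v] >= 0.5]
print(res.x, cover)
```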
Dimensionality Reduction
Matrix Decomposition
• Lars Eldén, Matrix Methods in Data Mining and Pattern Recognition, SIAM, 2007
• David Skillicorn, Understanding Complex Datasets: Data Mining with Matrix Decompositions, Chapman & Hall/CRC, 2007
Dimensionality Reduction
• Purpose:
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise
• Techniques:
• Principal Component Analysis
• Singular Value Decomposition
• Others: supervised and non-linear techniques
Dimensionality Reduction = optimization problem
• For any set of n points X in R^d, dimensionality reduction is a map f: R^d -> R^k (where d >> k).
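One concrete such map f (my example: a Gaussian random projection in the spirit of the Johnson-Lindenstrauss lemma, assuming NumPy), which roughly preserves pairwise distances with high probability:

```python
# f: R^d -> R^k via multiplication by a scaled Gaussian random matrix.
import numpy as np

d, k, n = 10_000, 100, 50
rng = np.random.default_rng(0)
X = rng.random((n, d))                     # n points in R^d

R = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection matrix
Y = X @ R                                  # f(x) = x R, each row now in R^k
print(X.shape, "->", Y.shape)              # (50, 10000) -> (50, 100)
```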
Singular value decomposition
For an m × n matrix A (document-term) of rank r there exists a factorization (Singular Value Decomposition = SVD)
A = U Σ V^T,
where U is m × m, Σ is m × n, and V is n × n.
• The columns of U are orthogonal eigenvectors of AA^T.
• The columns of V are orthogonal eigenvectors of A^T A.
• The eigenvalues λ1 … λr of AA^T are also the eigenvalues of A^T A, and the singular values are σi = √λi.
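A minimal sketch (assuming NumPy) verifying the definition and the eigenvector relationship above on a random matrix:

```python
# Check A = U S V^T and that the eigenvalues of A^T A equal sigma_i^2.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

U, s, Vt = np.linalg.svd(A)                  # U: 6x6, s: 4 values, Vt: 4x4
S = np.zeros_like(A)
np.fill_diagonal(S, s)
print(np.allclose(A, U @ S @ Vt))            # True: A = U S V^T

eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]  # eigenvalues of A^T A, descending
print(np.allclose(eigvals, s ** 2))          # True: lambda_i = sigma_i^2
```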
Singular value decomposition
Figure: the rank-k (truncated) SVD of a documents-by-terms matrix,
A_k (n × m) = U_k (n × k) · S_k (k × k) · V_k^T (k × m), with S_k = diag(s1, s2, …, sk).
Latent semantic indexing
• LSI: the k-reduced singular value decomposition of the term-by-document matrix
• Latent semantics: hidden connections between both terms and documents, determined by the documents' content
• Document matrix D_k = S_k V_k^T (or D_k' = V_k^T)
• Term matrix T_k = U_k S_k (or T_k' = U_k)
• Query in reduced dimension: q_k = U_k^T q (or q_k' = S_k^-1 U_k^T q)
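A minimal sketch (assuming NumPy; random data stands in for a real term-by-document matrix) of computing D_k and folding a query into the reduced space as defined above:

```python
# LSI: rank-k truncated SVD, document matrix D_k = S_k V_k^T, query q_k = U_k^T q.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 30))                 # 100 terms x 30 documents
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]

Dk = Sk @ Vkt                             # documents in the k-dimensional space
q = rng.random(100)                       # a query over terms
qk = Uk.T @ q                             # query folded into the same space

# Rank documents by cosine similarity in the reduced space.
sims = (Dk.T @ qk) / (np.linalg.norm(Dk, axis=0) * np.linalg.norm(qk))
print(np.argsort(-sims)[:3])              # indices of the 3 best-matching documents
```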
Latent semantic indexing
In other words, documents are represented as linear combinations of metaterms:
d_1 = Σ_i w_1i m_i
d_2 = Σ_i w_2i m_i
…
d_n = Σ_i w_ni m_i
Retrieval in LSI
• Similarity between two documents, or a document and a query, is usually calculated as the normalized scalar product of their metaterm vectors.
Picture matrix
Figure: a collection of images arranged as a matrix.
Retrieval in LSI
Figure: a query image and its two best matches, Fig08 (similarity 0.9769) and Fig09 (0.9165).
Retrieval in LSI
Figure: a query image and its matches, Fig02 (similarity 0.9740) and Fig06 (0.3011).
Retrieval in LSI
Figure: a query image and its matches, Fig14 (similarity 0.3640) and Fig11 (0.3482).
Latent semantic indexing
What is a metaterm? A metaterm is a linear combination of terms. But what does a metaterm mean? Is there an interpretation of metaterms?
Building the collection (figure)
Meta point (figure)
DCT (discrete cosine transform) (figure)