
Dimension reduction techniques for distributional data






  1. Antonio Irpino and Rosanna Verde Dipartimento di Scienze Politiche «Jean Monnet» Seconda Università di Napoli Dimension reduction techniques for distributional data

  2. Outline We present: • Distributional data (sources and definitions) • Current strategies for the dimension reduction of distributional data • Our proposal: a univariate PCA (in the sense that we analyze a single distributional variable) based on quantiles • Interpretation using location-scale-shape characteristics of the data • Interpretation using only scale-shape characteristics • Some results on simulated data

  3. Main sources of distributional data • Results of clustering procedures (from surveys, from large databases) • Data from sensors (temperatures, pollutant concentrations, network activity) • Data streams (descriptions of time windows) • Image analysis • etc.

  4. The exploratory data analysis of distributions (Pearson, Fisher and co. are really turning in their graves) • Practically: • My sensor network can return to my server only the distributions of the sensed phenomenon • The bank can return only the distributions of the cash flows of a set of individuals • Etc. We need to explore such data as in the classic case.

  5. Symbolic Data Analysis • SDA allows us to describe observations using: • Lists of categories (possibly weighted) • Intervals of values • Histograms, distributions • Models • And so on… [Table omitted: objects described by symbolic variables SV 1 - SV 4, including distributional variables]

  6. An example of distributional data: histogram data • For a generic variable, the i-th histogram datum is a model representing an empirical distribution as a set of H ordered pairs Y(i) = {(I_h, π_h), h = 1, …, H}, where the I_h are ordered intervals (bins) and the π_h are the associated relative frequencies, summing to one. For example: Y(i) = [([0-10], 0.2); ([10-20], 0.5); ([20-30], 0.1); ([30-40], 0.2)]
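A minimal sketch of this representation (illustrative names, not the authors' code): a histogram datum stored as (interval, weight) pairs, with its quantile function obtained by inverting the piecewise-linear CDF under a uniform-within-bin assumption.

```python
import numpy as np

def histogram_quantile(bins, weights, p):
    """Quantile of a histogram datum for probability p in [0, 1].

    bins    : list of (lower, upper) interval bounds
    weights : relative frequencies summing to 1
    """
    cum = np.concatenate([[0.0], np.cumsum(weights)])  # cumulative frequencies
    for h, (lo, hi) in enumerate(bins):
        if p <= cum[h + 1]:
            # Linear interpolation inside the bin (uniform-within-bin model)
            frac = (p - cum[h]) / (cum[h + 1] - cum[h])
            return lo + frac * (hi - lo)
    return bins[-1][1]

# The slide's example: Y(i) = [([0-10],0.2); ([10-20],0.5); ([20-30],0.1); ([30-40],0.2)]
bins = [(0, 10), (10, 20), (20, 30), (30, 40)]
weights = [0.2, 0.5, 0.1, 0.2]
print(histogram_quantile(bins, weights, 0.5))  # median = 16.0
```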

  7. Comparing distributions (or empirical PDFs) • According to their means (LOCATION) • According to their standard deviations (SCALE) • According to their shapes (SHAPE)

  8. Dimension reduction of histogram data • Multidimensional scaling • MDS based on majorization (Groenen): but this is for histograms of distances (a bit different). • Different PCA approaches for histogram data: • (2000) Rodriguez, O., Diday, E., Winsberg, S.: Generalization of the Principal Components Analysis to Histogram Data, PKDD 2000, Lyon (2000), extending Cazes, P., Chouakria, A., Diday, E., Schektman, Y.: Extension de l'analyse en composantes principales à des données de type intervalle, Revue de Statistique Appliquée XLV(3), 5-24 (1997) • (2002) Cazes, P.: Analyse factorielle d'un tableau de lois de probabilité, Revue de Statistique Appliquée 50(3), 5-24 (2002) • (2007) Nagabhushan, P., Pradeep Kumar, R.: Histogram PCA. In: Lecture Notes in Computer Science Volume 4492, Advances in Neural Networks - ISNN 2007, pp. 1012-1021 (2007) • (2011) Ichino, M.: The quantile method for symbolic principal component analysis. Stat Anal Data Min 4(2), 184-198 (2011) • (2012) Makosso-Kallyth, S., Diday, E.: Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification 6(2), pp. 147-159 (2012)

  9. 2000 PCA Rodriguez et al. • Rodriguez, O., Diday, E., Winsberg, S.: Generalization of the Principal Components Analysis to Histogram Data, PKDD 2000, Lyon (2000), extending Cazes, P., Chouakria, A., Diday, E., Schektman, Y.: Extension de l'analyse en composantes principales à des données de type intervalle, Revue de Statistique Appliquée XLV(3), 5-24 (1997). All the histograms share the same partition of the domain. The analysis is performed on the intervals of cumulative frequencies. The decomposed matrix is related only to the frequencies; the domain is lost (or is treated as a set of ordered categories), thus the inertia of the data is the inertia of a set of frequencies.

  10. 2002 Cazes • Cazes, P.: Analyse factorielle d'un tableau de lois de probabilité. Revue de Statistique Appliquée 50(3), 5-24 (2002). The decomposed matrix is the covariance (or correlation) matrix of the multiple distributions (one for each unit). But, having only the marginal distributions (one for each variable), it reduces to the covariance matrix of the expected values (the means) of the distributions. The multivariate distribution is visually reconstructed on the factorial planes by the vertices of the bins projected as supplementary points. Thus, the decomposed inertia is the inertia of the means of the distributions. (The principal components are linear combinations of the means.)

  11. 2007 Nagabhushan et al. • Nagabhushan, P., Pradeep Kumar, R.: Histogram PCA. In: LNCS Volume 4492, Advances in Neural Networks - ISNN 2007, pp. 1012-1021 (2007). All the histograms share the same number N of bins, for all the variables. The method works only on frequencies (the domain is lost or is treated as a set of categories). After defining some basic operations on histograms, the decomposed matrix is A, a sort of covariance matrix of the frequencies. The PCs are linear combinations of frequencies.

  12. 2011 PCA Ichino • Ichino, M.: The quantile method for symbolic principal component analysis. Stat Anal Data Min 4(2), 184-198 (2011). The method works on the quantiles of the distributions. Each histogram is represented by a set of m (an integer) quantiles. The decomposed matrix is a sort of correlation matrix of the corresponding quantiles. The PCs are linear combinations of quantiles (the coefficients pertain to the whole variable, i.e., the quantiles of each variable share the same coefficient). Trajectories are obtained by connecting the quantiles projected as supplementary points.

  13. 2012 Makosso & Diday • Makosso-Kallyth, S., Diday, E.: Adaptation of interval PCA to symbolic histogram variables. Advances in Data Analysis and Classification 6(2), pp. 147-159 (2012). In this approach, distributions or histograms are binned into the same number of bins and a PCA is performed on a transformation of the frequencies (the domain is treated as the domain of a compositional variable). The steps are: coding of the bins; angular transformation of the frequencies using arcsin(sqrt(H_ijk)); PCA on the centers of the variables; transformation of the data into intervals (using the Chebyshev inequality); construction of hypercubes and projection onto the factorial planes.
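A one-line illustration of the angular transformation step (a sketch; H is a hypothetical units x bins table of relative frequencies):

```python
import numpy as np

# Angular (arcsine square-root) transform of bin frequencies,
# as in the Makosso-Kallyth & Diday approach.
H = np.array([[0.2, 0.5, 0.1, 0.2],
              [0.1, 0.3, 0.4, 0.2]])
H_angular = np.arcsin(np.sqrt(H))
print(H_angular.round(3))
```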

  14. Summarizing

  15. Some questions not clearly resolved • What is a linear combination of distributional variables? (This is different from a linear combination of distributions, i.e., a convolution.) • Does it always make sense? • Can we define the concept of the variability of a distributional variable?

  16. The variability of a distributional variable • Each measure of variability should be non-negative • A measure of variability is null when all the observed values are the same. TWO MAIN APPROACHES in SDA: 1) Billard-Diday (2007) and Goupil (2000); 2) Irpino-Verde (2008)

  17. Billard-Diday (2007) • Starting from the two-level paradigm • Given two uniforms x1 ~ U(a,b) and x2 ~ U(c,d) • the mean is M = ((a+b)/2 + (c+d)/2)/2 • the variance is V = 1/2·((b-a)²/12 + (d-c)²/12) • If the two uniforms are the same, the variance is not null! • The correlation of two identical vectors of uniforms is not 1

  18. Irpino-Verde (2008) • Starting from a probabilistic distance between descriptions, the L2 Wasserstein metric • Given two uniforms x1 ~ U(a,b) and x2 ~ U(c,d) • the mean is a uniform M ~ U((a+c)/2, (b+d)/2) • the variance is V = d²_W(x1,x2) = [(a+b)/2 - (c+d)/2]² + (1/3)·[(b-a)/2 - (d-c)/2]² • The variance of two equal uniforms is null • The correlation of two identical vectors of uniforms is ONE
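A minimal numeric check of the two definitions on two identical uniforms U(a,b) = U(c,d) = U(0,1), using the formulas as stated on the two slides (variable names are illustrative):

```python
# Billard-Diday style variance: average of the within-uniform variances
a, b = 0.0, 1.0
c, d = 0.0, 1.0
v_bd = 0.5 * ((b - a)**2 / 12 + (d - c)**2 / 12)

# Irpino-Verde variance via the squared L2 Wasserstein distance
v_iv = ((a + b) / 2 - (c + d) / 2)**2 + (1 / 3) * ((b - a) / 2 - (d - c) / 2)**2

print(v_bd)  # 1/12 ~ 0.0833: not null for identical uniforms
print(v_iv)  # 0.0: null for identical uniforms
```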

  19. But what measure can be used as a distance between histograms? • In order to compare two histograms we propose to use the Wasserstein-Kantorovich metric: in particular, the derived L2 Wasserstein distance between two quantile functions • The L2 Wasserstein metric is a natural extension of the Euclidean metric, so it can be applied to different distributions and is easily interpretable in terms of location, scale and shape
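As a sketch (not the authors' code), the L2 Wasserstein distance between two empirical distributions can be approximated by discretizing their quantile functions on a common probability grid:

```python
import numpy as np

def wasserstein_l2(sample1, sample2, m=1000):
    """Approximate L2 Wasserstein distance: the L2 norm of the difference
    of the two empirical quantile functions on m probability levels."""
    p = (np.arange(m) + 0.5) / m           # probability grid in (0, 1)
    q1 = np.quantile(sample1, p)           # empirical quantile functions
    q2 = np.quantile(sample2, p)
    return np.sqrt(np.mean((q1 - q2)**2))  # sqrt of integral of (Q1 - Q2)^2

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 10_000)             # ~ U(0,1)
x2 = rng.uniform(2, 5, 10_000)             # ~ U(2,5)
# Closed form for uniforms: (0.5-3.5)^2 + (1/3)(0.5-1.5)^2 = 9.333..., sqrt ~ 3.055
print(wasserstein_l2(x1, x2))
```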

  20. An interpretative decomposition of the L2 Wasserstein metric (Irpino, Romano 2006)
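Writing $\mu_i$ and $\sigma_i$ for the mean and standard deviation of distribution $i$, and $\rho_{12}$ for the correlation between the two quantile functions, the decomposition can be written as (a reconstruction consistent with the cited work):

$$
d_W^2(x_1,x_2) \;=\; \underbrace{(\mu_1-\mu_2)^2}_{\text{location}} \;+\; \underbrace{(\sigma_1-\sigma_2)^2}_{\text{scale}} \;+\; \underbrace{2\,\sigma_1\sigma_2\,(1-\rho_{12})}_{\text{shape}}
$$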

  21. Location - Scale - Shape [Figure omitted: geometric illustration of the location, scale and shape components of the distance]

  22. In general: the L2 Wasserstein metric for multivalued data • Single-valued data (Dirac deltas, or impulse functions) • Interval data (considered as uniforms) • Data described by empirical PDFs • Sometimes closed forms exist (Normal distributions, uniforms)
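For instance, between two Gaussians the metric has the well-known closed form:

$$
d_W^2\big(\mathcal{N}(\mu_1,\sigma_1^2),\,\mathcal{N}(\mu_2,\sigma_2^2)\big) = (\mu_1-\mu_2)^2 + (\sigma_1-\sigma_2)^2
$$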

  23. A second point • A distribution is inherently a multidimensional (multivariate) description: it is a weighted set-valued description.

  24. A proposal • Is it possible to perform a PCA on a single distributional variable? • Can we use a functional data approach? • Yes, but we would have to define a set of basis functions, etc., and these choices can be extremely subjective. • We therefore propose a non-parametric approach, i.e., like Ichino, we propose to work on quantiles.

  25. Why quantiles? (Gilchrist, 2000) • Quantiles do not require strong hypotheses on the underlying distribution • Quantiles are natural to understand (the Tukey boxplot is the representation of the five main order statistics [min, Q1, Me, Q3, max]) • Quantiles are expressed in the same unit of measure as the phenomenon (an advantage for the interpretation of results) • Quantile functions, being the inverses of cdfs, are always defined on [0,1] (while the domain of a cdf is not)

  26. The proposal: a PCA on quantiles • Considering the complex nature of the data, we first studied a PCA for a single distributional variable using a fixed set of quantiles • We have a set of N realizations (N empirical distributions given by N histograms) of a single distributional variable X

  27. The matrix X • We fix a set of m quantiles • Each individual is represented by a sequence of m+1 (including the minimum value) ordered values
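A minimal sketch of this construction (illustrative names; the paper builds the matrix from histogram quantiles, here approximated from raw samples): each row holds the minimum plus m quantiles of one distribution.

```python
import numpy as np

def quantile_matrix(samples, m=10):
    """Build the N x (m+1) matrix X: row i holds the m+1 ordered values
    [min, Q(1/m), Q(2/m), ..., max] of the i-th distribution."""
    p = np.linspace(0, 1, m + 1)   # 0, 0.1, ..., 1.0 for m = 10 (deciles)
    return np.vstack([np.quantile(s, p) for s in samples])

rng = np.random.default_rng(1)
samples = [rng.normal(rng.uniform(-2, 2), rng.uniform(0.5, 2), 500)
           for _ in range(50)]     # 50 empirical distributions
X = quantile_matrix(samples, m=10)
print(X.shape)                     # (50, 11)
```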

  28. Average quantiles vector • The m+1 quantile column variables are centered: Xc = X - 1_N·x̄ᵀ, where 1_N is the unitary vector of N elements and x̄ is the vector of average quantiles.

  29. Quantile covariance matrix • Let us define the quantile covariance matrix as S = Xcᵀ·W·Xc, with W a diagonal matrix of weights for the N units • Its diagonal elements equal the variances of the quantiles and its off-diagonal elements equal the covariances between different quantiles.

  30. PCA of a single distributional variable • PC analysis in R^(m+1) (the quantile space) to represent the N distributions • The factorial axes u_α are obtained as solutions of the characteristic PCA equation S·u_α = λ_α·u_α, for α = 1, …, m+1, which corresponds to the diagonalization of the matrix S • Because we look for the axes of maximum inertia, we order the eigenvalues λ_α in decreasing order and take the eigenvector u_1 associated with the largest eigenvalue λ_1; the second eigenvector u_2, orthogonal to u_1, is the one associated with the next largest eigenvalue λ_2, and so on (for a number of eigenvalues that explains a given percentage of the total variability of the quantile variables).
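A compact sketch of this step (assuming uniform weights W = I/N; illustrative code, not the authors'):

```python
import numpy as np

def quantile_pca(X):
    """PCA of the quantile matrix X (N x (m+1)), uniform weights 1/N."""
    N = X.shape[0]
    Xc = X - X.mean(axis=0)              # center on the average quantile vector
    S = (Xc.T @ Xc) / N                  # quantile covariance matrix S = Xc' W Xc
    eigval, eigvec = np.linalg.eigh(S)   # eigh: S is symmetric
    order = np.argsort(eigval)[::-1]     # eigenvalues in decreasing order
    eigval, eigvec = eigval[order], eigvec[:, order]
    Y = Xc @ eigvec                      # coordinates of the N distributions
    return eigval, eigvec, Y

eigval, U, Y = quantile_pca(X)           # X from the previous sketch
print(eigval / eigval.sum())             # explained-inertia ratios
```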

  31. The trace of S • The trace of S is an approximation of the variance of a histogram variable as defined by Irpino and Verde (2008), which is based on the Wasserstein distance between quantile functions • So we decompose a quantity that can be expressed in three components, related to the differences in position, scale and shape of the set of N distributions.

  32. Coordinates of the distributions on the factorial axes • The N distributions are represented on the factorial planes by Y_α = Xc·u_α • This corresponds to a linear combination of the m+1 quantile variables. According to the meaning of the distance decomposed in the criterion function, the variability explained on the different axes refers to the location, scale and shape components of the distributions.

  33. Representation of the quantile variables • The PC analysis in R^N of the quantile variables, in the space of the individuals (the distributions), is obtained as the solution of the dual problem, for α = 1, …, m+1 • The coordinates of the quantile variables on the factorial axes are then obtained as F_α = Xcᵀ·v_α, where the v_α are the dual eigenvectors • It is possible to connect the sequence of quantiles on the factorial plane in order to analyse the structure of the global variability
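In practice both problems can be solved at once from the SVD of the weighted centered matrix (a sketch, assuming uniform weights 1/N):

```python
import numpy as np

# With Xc the centered quantile matrix, Xc/sqrt(N) = U diag(s) V' gives:
# - columns of V: axes u_alpha in the quantile space
# - columns of U: dual axes v_alpha in the space of the individuals
Xc = X - X.mean(axis=0)        # X: the N x (m+1) quantile matrix above
N = Xc.shape[0]
U, s, Vt = np.linalg.svd(Xc / np.sqrt(N), full_matrices=False)
Y = Xc @ Vt.T                  # coordinates of the N distributions
F = Xc.T @ U                   # coordinates of the m+1 quantile variables
```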

  34. A simulation study • We generate 50 distributions from the Pearson family of distributions • A Pearson distribution is uniquely defined by four parameters: the mean, the standard deviation, the skewness and the kurtosis (the 3rd and 4th standardized moments) • The Normal distribution is obtained by choosing only the mean and the std • The four parameters are randomly generated • We extract 10 quantiles and set up the data matrix

  35. Simulation • We choose m = 10 (so we have 11 quantiles, the deciles) • We center the matrix (the barycenter is a distribution with "average" quantiles) • We do not standardize the data (differently from Ichino), because this way the trace is the variance of the distributional variable • We perform the PCA on the covariance matrix of the quantiles.
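A sketch of a comparable simulation (hypothetical parameter ranges; scipy's pearson3 covers mean, std and skewness only, so it is just a three-parameter stand-in for the full four-parameter Pearson system):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N, m = 50, 10
p = np.linspace(0, 1, m + 1)

# Pearson type III: location, scale and skewness randomly drawn
samples = [stats.pearson3.rvs(skew=rng.uniform(-1, 1),
                              loc=rng.uniform(-2, 2),
                              scale=rng.uniform(0.5, 2),
                              size=1000, random_state=rng)
           for _ in range(N)]
X = np.vstack([np.quantile(s, p) for s in samples])  # N x (m+1) deciles

Xc = X - X.mean(axis=0)                  # centered, not standardized
S = (Xc.T @ Xc) / N
eigval = np.linalg.eigvalsh(S)[::-1]
print("total inertia:", S.trace())
print("explained %:", 100 * eigval / eigval.sum())
```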

  36. Eigenvalues and explained inertia [Table/plot omitted; total inertia: 2.076]

  37. Plot of individuals, 1-2 plane; variable plot [Plots omitted; annotations: High Std / Low Std, Low Mean / High Mean]

  38. [Plot omitted; annotations: Low Std / High Std, Low Mean / High Mean]

  39. The first factorial plane (the “variable” quantile representation)

  40. The second factorial plane [Plot omitted; annotations: platykurtic / leptokurtic, left-skewed / right-skewed]

  41. Plot of individuals 3-4 plane

  42. Plot of “variables” (quantiles) 3-4 plane

  43. Was it the same as performing a PCA on moments? • Not exactly, because the covariance of moments is something strange (the four parameters are expressed on different scales) • There exist known relationships among the raw moments: • M1 = mean • Variance = M2 - M1² • Skewness = (M3 - 3·M1·M2 + 2·M1³) / (M2 - M1²)^(3/2) • Kurtosis = (M4 - 4·M1·M3 + 6·M1²·M2 - 3·M1⁴) / (M2 - M1²)² • But let's visualize it!
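As a quick numeric check of these identities (a sketch; a Gamma(k=2) sample has theoretical skewness 2/sqrt(2) ~ 1.414 and kurtosis 3 + 6/2 = 6):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.gamma(2.0, 1.5, 100_000)
M1, M2, M3, M4 = (np.mean(x**k) for k in range(1, 5))

var = M2 - M1**2
skew = (M3 - 3*M1*M2 + 2*M1**3) / var**1.5
kurt = (M4 - 4*M1*M3 + 6*M1**2*M2 - 3*M1**4) / var**2

print(var, np.var(x))  # the two variance computations match
print(skew)            # ~ 1.414
print(kurt)            # ~ 6
```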

  44. PCA on moments first factorial plane

  45. PCA on moments 3-4 factorial plane

  46. A further example: multi-modal distributions

  47. Open questions • More than one distributional variable. • What is the meaning of the covariance between two distributional variables? (We defined a covariance between two distributional variables, but its practical meaning is difficult to understand.) • We are verifying whether a Multiple Factor Analysis approach is useful (integrating the Makosso-Diday approach on quantiles, where each block is a distributional variable)

  48. Thank you very much!
