Functional Data Classification via Subspace Projection
Pai-Ling Li
Department of Statistics, Tamkang University
plli@stat.tku.edu.tw
29 August 2011
Joint work with Che-Chiu Wang
Outline
• Introduction
• Subspace Projected Functional Classification (SPFC)
• Simulations
• Data Application
• Concluding Remarks
Introduction Functional Data Classification
Data – : the th recording time of the th subject; : the measurement of the th subject observed at time .
Functional Modeling – Assume that the data are realizations of independent random functions
Functional Data Classification – Find homogeneous subgroups of the n curves according to the patterns of the curves.
Introduction Conventional Multivariate Data Approach
• Multivariate data approach
• Conventional heuristic approaches (Ward, 1963; Ball & Hall, 1976; MacQueen, 1967)
• Coupled with dimension reduction techniques, e.g. PCA or SVD (Jolliffe, 2002), when each individual has a large number of observations
• Model-based approaches that require certain probability model assumptions (Yeung et al., 2002; Fraley & Raftery, 2002; Li, 2005)
• Some disadvantages of the multivariate data approach: irregular design of recording times, large number of measurements in index sets, measurement errors (Abraham et al., 2003)
Introduction Functional Data Approach
• Unsupervised Classification / Clustering
• Cluster the finite-dimensional coefficients of basis function expansions using a classical multivariate clustering algorithm. (Abraham et al., 2003; García-Escudero & Gordaliza, 2005; Serban & Wasserman, 2005; Tarpey, 2007)
• Model-based techniques (Luan & Li, 2003; James & Sugar, 2003; Ray & Mallick, 2006; Ma & Zhong, 2008)
• Nonparametric kernel approach (Ferraty & Vieu, 2006)
• Group curves based on shape similarity via nonparametric rank correlation. (Heckman & Zamar, 2000)
Introduction
• Supervised Classification
• Functional linear discriminant analysis (James & Hastie, 2001)
• Generalized linear model or functional logistic regression (James, 2002; Müller & Stadtmüller, 2005; Leng & Müller, 2005; Aguilera et al., 2008)
• Nonparametric kernel method (Ferraty & Vieu, 2006)
• Support vector machine (SVM) (Rossi & Villa, 2006; Park et al., 2008)
• Bayesian analysis (Mallick et al., 2005; Morris et al., 2008)
• Depth-based methods (Cuevas et al., 2007; López-Pintado & Romo, 2006)
SPFC Subspace Projected Functional Classification (SPFC)
Clusters are defined via projection onto cluster subspaces, and each cluster subspace comprises the mean function and the eigenfunctions of the covariance kernel.
SPFC discovers homogeneous subgroups of curve data according to differences in the means and in the modes of variation, through projections onto the FPC subspaces. (Ref: Chiou & Li, 2007; Chiou & Li, 2008)
SPFC Functional Random-Effects Model • Suppose that independent random functions are sampled from a stochastic process in , where represents a Hilbert space of square integrable functions w.r.t. the measure on a real interval , and is a constant weight function. ( ) • Assume that the process has a smooth mean function and a smooth covariance function (twice continuously differentiable)
SPFC
• We consider a Karhunen-Loève expansion of the random function such that
• The functions are orthonormal eigenfunctions associated with the corresponding nonnegative eigenvalues such that
• The random coefficients are uncorrelated random variables with zero mean and variance with the property
• We assume that the mean function does not belong to the space spanned by the eigenfunctions for identifiability concerns.
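For reference, a standard form of the Karhunen-Loève expansion is sketched below. The symbols on this slide did not survive extraction, so every name used here ($Y_i$, $\mu$, $\phi_j$, $\xi_{ij}$, $\lambda_j$, $w$, $\mathcal{T}$) is an assumption that follows common FPCA notation and may differ from the authors' own.

```latex
\begin{align*}
Y_i(t) &= \mu(t) + \sum_{j=1}^{\infty} \xi_{ij}\,\phi_j(t), \qquad t \in \mathcal{T}, \\
\int_{\mathcal{T}} \phi_j(t)\,\phi_k(t)\,w(t)\,dt &= \delta_{jk}, \qquad
\lambda_1 \ge \lambda_2 \ge \cdots \ge 0, \\
E(\xi_{ij}) &= 0, \qquad \operatorname{var}(\xi_{ij}) = \lambda_j, \qquad
\sum_{j=1}^{\infty} \lambda_j < \infty .
\end{align*}
```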
SPFC
• The additive measurement error model of the observed trajectory of is given by where are random measurement errors that are i.i.d. with and . The random errors are assumed to be independent of . It follows that the covariance function of is given by
• In practical applications, can be spanned effectively by the leading principal components through a truncated expansion of model (1) as A common practice is to choose according to the proportion of total variance explained by the first few leading principal components.
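The variance-based truncation rule can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function name `choose_num_components`, the 90% default threshold, and the example eigenvalues are all hypothetical.

```python
def choose_num_components(eigenvalues, threshold=0.9):
    """Return the smallest M whose leading eigenvalues explain at least
    `threshold` of the total variance (eigenvalues sorted in decreasing order)."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    cumulative = 0.0
    for m, lam in enumerate(eigenvalues, start=1):
        cumulative += lam
        if cumulative / total >= threshold:
            return m
    # Fallback for rounding effects when threshold is very close to 1.
    return len(eigenvalues)


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
num_components = choose_num_components(example_eigenvalues, threshold=0.9)
print(num_components)  # 4, since (4 + 2 + 1 + 0.5) / 8 = 0.9375 >= 0.9
```

The threshold is a tuning choice. The data application in these slides instead selects the number of components by cross-validation on classification accuracy.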
SPFC Subspace Projected Functional Classification • If the data curves considered comprise clusters, then can be viewed as a mixture of subprocesses in . • We assume that each subprocess has the mean and covariance structures associated with cluster in which are conditionally defined by and
SPFC • Much like the marginal model (1), the realization of the random function corresponding to cluster is given by a conditional model where are the orthonormal bases associated with , are uncorrelated random effects with zero mean and variance . The uncorrelated random errors have zero mean and constant variance and are assumed to be independent of .
SPFC
• Let denote the FPC subspace of cluster that comprises the components and
• Consider the projection of onto , a truncated nonparametric random effect model , to approximate such that where the number of has to be chosen to expand the random process effectively.
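One way to write the projection of a curve onto the FPC subspace of cluster $c$ is sketched below. The notation is assumed because the slide's symbols were lost: $\mu^{(c)}$ and $\phi_j^{(c)}$ stand for the cluster mean and eigenfunctions, $M_c$ for the number of retained components, and $\tilde{Y}^{(c)}$ for the projected curve.

```latex
\begin{align*}
\tilde{Y}^{(c)}(t) &= \mu^{(c)}(t) + \sum_{j=1}^{M_c} \xi_j^{(c)}\,\phi_j^{(c)}(t), \\
\xi_j^{(c)} &= \int_{\mathcal{T}} \bigl\{ Y(t) - \mu^{(c)}(t) \bigr\}\,\phi_j^{(c)}(t)\,w(t)\,dt .
\end{align*}
```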
SPFC
• Under the SPFC framework, the best cluster membership of an observed curve is determined by a metric that properly measures the distance between and its projection onto the FPC subspaces.
Metric:
Projection: ( )
Criterion: (SPFC) (k-means)
The criterion (7) suggests that each individual is associated with the cluster centered on the corresponding mean and eigenfunctions via projection.
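The assignment rule can be illustrated with a short sketch. It assumes an L2 distance approximated by trapezoidal quadrature on a common grid, and it takes the cluster means and orthonormal eigenfunctions as given. The helper names (`trapezoid_weights`, `projection_distance`, `spfc_assign`) and the toy clusters are hypothetical, not the authors' implementation.

```python
import numpy as np


def trapezoid_weights(grid):
    """Trapezoidal quadrature weights for a one-dimensional grid."""
    grid = np.asarray(grid, dtype=float)
    w = np.empty(grid.size)
    w[1:-1] = (grid[2:] - grid[:-2]) / 2.0
    w[0] = (grid[1] - grid[0]) / 2.0
    w[-1] = (grid[-1] - grid[-2]) / 2.0
    return w


def projection_distance(curve, mean, eigenfunctions, weights):
    """Squared L2 distance between a curve and its projection onto the
    subspace defined by `mean` (m,) and `eigenfunctions` (m, K)."""
    centered = curve - mean
    scores = eigenfunctions.T @ (weights * centered)  # FPC scores by quadrature
    residual = centered - eigenfunctions @ scores
    return float(np.sum(weights * residual ** 2))


def spfc_assign(curve, cluster_models, weights):
    """Assign the curve to the cluster with the smallest projection distance."""
    distances = [projection_distance(curve, mu, phi, weights)
                 for mu, phi in cluster_models]
    return int(np.argmin(distances)), distances


# Toy example: two clusters with identical means but different eigenfunctions.
grid = np.linspace(0.0, 1.0, 101)
weights = trapezoid_weights(grid)
phi_sin = (np.sqrt(2.0) * np.sin(2.0 * np.pi * grid))[:, None]
phi_cos = (np.sqrt(2.0) * np.cos(2.0 * np.pi * grid))[:, None]
zero_mean = np.zeros(grid.size)
cluster_models = [(zero_mean, phi_sin), (zero_mean, phi_cos)]

curve = 1.5 * phi_sin[:, 0]
label, distances = spfc_assign(curve, cluster_models, weights)
print(label, distances)
```

In this toy case both clusters share the same mean, so a k-means style rule based only on the distance to the cluster mean cannot separate them, whereas the projection distance can.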
SPFC Estimation of Model Components
Mean Function – Apply local linear regression to the pooled data of curves .
Covariance Function – Use two-dimensional scatterplot smoothing by local polynomial fitting to the raw covariance where
Eigenvalues and Eigenfunctions – Solve the equation under the orthonormal constraint through discrete approximations.
FPC scores – Approximate the FPC scores by the numerical integral where are quadrature weights. (Shrinkage estimator (Yao et al., 2003))
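The estimation steps can be sketched for densely observed curves on a common grid. Note the simplifications, which are assumptions of this sketch and not the authors' method: the cross-sectional mean and the sample covariance replace the local linear and local polynomial smoothers, measurement error is ignored, and no shrinkage is applied to the scores. The names `fpca_dense` and `trapezoid_weights` are hypothetical.

```python
import numpy as np


def trapezoid_weights(grid):
    """Trapezoidal quadrature weights for a one-dimensional grid."""
    grid = np.asarray(grid, dtype=float)
    w = np.empty(grid.size)
    w[1:-1] = (grid[2:] - grid[:-2]) / 2.0
    w[0] = (grid[1] - grid[0]) / 2.0
    w[-1] = (grid[-1] - grid[-2]) / 2.0
    return w


def fpca_dense(curves, grid, num_components):
    """FPCA for curves of shape (n, m) observed on a common grid of length m.

    Returns the mean (m,), eigenvalues (K,), eigenfunctions (m, K), scores (n, K).
    """
    curves = np.asarray(curves, dtype=float)
    weights = trapezoid_weights(grid)
    mean = curves.mean(axis=0)            # unsmoothed cross-sectional mean
    centered = curves - mean
    cov = centered.T @ centered / curves.shape[0]  # unsmoothed covariance surface

    # Discretized eigen-equation, symmetrized with the quadrature weights so
    # that the eigenfunctions are orthonormal with respect to those weights.
    sqrt_w = np.sqrt(weights)
    sym = sqrt_w[:, None] * cov * sqrt_w[None, :]
    evals, evecs = np.linalg.eigh(sym)
    order = np.argsort(evals)[::-1][:num_components]
    eigenvalues = evals[order]
    eigenfunctions = evecs[:, order] / sqrt_w[:, None]

    # FPC scores by numerical integration of (curve - mean) * eigenfunction.
    scores = centered @ (weights[:, None] * eigenfunctions)
    return mean, eigenvalues, eigenfunctions, scores


# Synthetic rank-two example (illustrative data only).
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 51)
basis = np.column_stack([np.sqrt(2.0) * np.sin(2.0 * np.pi * grid),
                         np.sqrt(2.0) * np.cos(2.0 * np.pi * grid)])
coefs = rng.normal(size=(40, 2)) * np.array([2.0, 1.0])
curves = (1.0 + grid) + coefs @ basis.T
mean, eigenvalues, eigenfunctions, scores = fpca_dense(curves, grid, num_components=2)
```

For sparse or noisy designs the smoothing steps listed above matter, and this sketch would not be appropriate without them.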
Simulations
• Two clusters : Each cluster has 2n_c curves (training and test data), c = 1, 2, with n_1 + n_2 = n.
• Equally spaced time points on [0,1]:
• Random effects and measurement errors :
• Synthetic curves : where
• Methods for comparison :
LR (Multivariate logistic regression)
FLR (Functional logistic regression based on FPCA model) (Leng & Müller, 2005)
SPFC (Subspace projected FC)
Simulations Table 1 : Six cases of simulation designs.
Simulations Figure 1. A sample of simulated curves for the simulation designs (without measurement errors). Panels: (A1), (A2), (A3), (B1), (B2), (B3).
Simulations Figure 2. Average test-data accuracy of the SPFC, FLR, and LR methods for m = 11 and different sample sizes n, based on 1000 synthetic samples. Legend: SPFC, FLR, LR. Panel label: (B3).
Simulations Table 2: Averages (SD) of accuracy for the test data obtained by the LR, FLR, and SPFC methods based on 1000 simulation replications for Cases A1-B3 under an unbalanced design of cluster sizes .
MALDI MS data Data Application: Mass Spectrometry Proteomic Data
• Yildiz et al. (2007) used matrix-assisted laser desorption ionization mass spectrometry (MALDI MS) to analyze 288 serum proteomic profiles.
• This study aims to distinguish lung cancer cases from matched controls through MALDI MS analysis of the most abundant peptides in the serum. (142 cases and 146 controls)
• The cases and controls were matched to avoid confounding variables such as age, sex, and total pack-year history.
• Yildiz et al. selected seven MS features based on the preprocessed spectra and applied the selected features to multivariate class-prediction models.
MALDI MS data
• In our study, we are interested in applying functional classification methods, especially the FPCA-based approaches, to the densely collected serum proteomic profiles.
• We consider the 288 MALDI MS serum spectra after preprocessing, in which each spectrum represents the ion current intensities measured at 184 m/z locations for a subject.
• For classification purposes, we treat the m/z values as equally spaced, with the aim of capturing the major patterns of the proteomic profiles. This should not affect the classification results since all the spectra are treated the same on the realigned m/z values.
• Local linear smoothing methods are applied in estimation, but the measurement errors are not taken into account. The bandwidths of the one- and two-dimensional smoothing methods are chosen by cross-validation.
MALDI MS Data Figure 3. Raw trajectories of MALDI MS data (upper panels), the marginal mean function (lower right panel), and the conditional mean functions (lower left panel) of the Case and Control groups. The notation ‘’ denotes the seven discriminating features selected by Yildiz et al.
MALDI MS Data Figure 4. The first four marginal eigenfunctions (diagonal panels) and the scatter plots of pairwise FPC scores based on the marginal model. The notations ‘◦’ and ‘•’ represent the case and control groups, respectively. The percentages in parentheses indicate the proportion of total variance explained by the principal components.
MALDI MS Data Figure 5. The first six conditional eigenfunctions of two groups based on the conditional model. The percentages in parentheses indicate the proportions of total variance explained by the principal components.
MALDI MS Data Table 3. Accuracy, sensitivity, and specificity of the blinded test set of MALDI MS data selected by Yildiz et al. (2007).
• WFCCM (weighted flexible compound covariate method, Shyr and Kim, 2003)
• For the FLR and SPFC methods, the number of principal components is automatically chosen by the CV method based on maximizing the classification accuracy of the training data.
MALDI MS Data Table 4. Averages (SD) of accuracy, sensitivity, and specificity based on 100 four-fold CV replicates of MALDI MS data.
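The three performance measures reported in Tables 3 and 4 can be computed as in the sketch below. The function name `classification_metrics`, the coding of cases as 1 and controls as 0, and the example labels are hypothetical and unrelated to the MALDI MS results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (true positive rate), and specificity
    (true negative rate) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up labels: 1 = case, 0 = control.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a repeated four-fold cross-validation, these measures would be computed on each held-out fold and then averaged over folds and replicates.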
Remarks Concluding Remarks
• The SPFC discovers homogeneous subgroups of curve data according to differences in the mean structure as well as in the modes of variation, through projections onto the FPC subspaces.
• The distance measure can be chosen according to the data characteristics and the target of interest in clustering.
• One extension of SPFC allows for multiplicative random scaling. (Chiou & Li, 2008, 2011)
• Classifying MS data via a functional approach avoids reliance on feature detection and improves the accuracy, sensitivity, and specificity by considering the whole profile of each spectrum.
• The SPFC method takes the within-spectrum correlation into account, which facilitates classifying the proteomic MS data.
• In future work, it would be interesting to extend the proposed method by adding informative clinical covariates.
Remarks ACKNOWLEDGEMENTS We thank Dr. Yu Shyr of the Cancer Biostatistics Center, Vanderbilt University, for allowing the use of the MALDI serum data set.
SPFC Identifiability Property of SPFC Assumption: (C) Under some regularity conditions, there exist and as such that and for fixed Ref : Yao et al. (2005), Hall et al. (2006)
SPFC
• Let be the function drawn from cluster and let be the function defined as Let be the projection function of on the subspace of cluster such that
• The correct cluster membership prediction for depends on the distance , which unravels the predictability of cluster membership. If the distance is large, it is anticipated that the curve can be easily classified into its true cluster c; conversely, the curve can be arbitrarily classified into cluster c or d.
SPFC Lemma. Under (C), and given the values and , the squared distance between and can be expressed as where and If the eigenvalues decay rapidly for such that as , then the remainder term converges to 0 in probability.
SPFC Theorem. Let and be the spaces spanned by the orthonormal basis functions and respectively, for given values and . Conditions (A1) and (A2) are defined as follows: (A1) ; (A2) Either , or and Under (C) , if both (A1) and (A2) hold (non-identifiability conditions), then as and
Simulations Figure A1. Average test-data accuracy of the SPFC, FLR, and LR methods for n = 100 and different numbers of time points m, based on 1000 synthetic samples. Legend: SPFC, FLR, LR.