Classification of Microarray Gene Expression Data

Geoff McLachlan, Department of Mathematics & Institute for Molecular Bioscience, University of Queensland


“A wide range of supervised and unsupervised learning methods have been considered to better organize data, be it to infer coordinated patterns of gene expression, to discover molecular signatures of disease subtypes, or to derive various predictions.” Statistical Methods for Gene Expression: Microarrays and Proteomics

Outline of Talk
• Introduction
• Supervised classification of tissue samples – selection bias
• Unsupervised classification (clustering) of tissues – mixture model-based approach

Vital Statistics by C. Tilstone, Nature 424, 610-612, 2003. “DNA microarrays have given geneticists and molecular biologists access to more data than ever before. But do these researchers have the statistical know-how to cope?” Branching out: cluster analysis can group samples that show similar patterns of gene expression.

Two Groups in Two Dimensions. All cluster information would be lost by collapsing to the first principal component. The principal ellipses of the two groups are shown as solid curves.

bioArray News (2, no. 35, 2002): Arrays Hold Promise for Cancer Diagnostics. Oncologists would like to use arrays to predict whether or not a cancer is going to spread in the body, how likely it is to respond to a certain type of treatment, and how long the patient will probably survive. It would be useful if the gene expression signatures could distinguish between subtypes of tumours that standard methods, such as histological pathology from a biopsy, fail to discriminate, and that require different treatments.

van’t Veer & De Jong (2002, Nature Medicine 8): The microarray way to tailored cancer treatment. In principle, gene activities that determine the biological behaviour of a tumour are more likely to reflect its aggressiveness than general parameters such as tumour size and age of the patient. (Indistinguishable disease states in diffuse large B-cell lymphoma unravelled by microarray expression profiles – Shipp et al., 2002, Nature Med. 8)

Microarray to be used as routine clinical screen by C. M. Schubert Nature Medicine 9, 9, 2003. The Netherlands Cancer Institute in Amsterdam is to become the first institution in the world to use microarray techniques for the routine prognostic screening of cancer patients. Aiming for a June 2003 start date, the center will use a panoply of 70 genes to assess the tumor profile of breast cancer patients and to determine which women will receive adjuvant treatment after surgery.

Microarrays also to be used in the prediction of breast cancer by Mike West (Duke University) and the Koo Foundation Sun Yat-Sen Cancer Centre, Taipei Huang et al. (2003, The Lancet, Gene expression predictors of breast cancer).

CLASSIFICATION OF TISSUES: SUPERVISED CLASSIFICATION (DISCRIMINANT ANALYSIS) We OBSERVE the CLASS LABELS y1, …, yn, where yj = i if the jth tissue sample comes from the ith class (i = 1, …, g). AIM: TO CONSTRUCT A CLASSIFIER C(x) FOR PREDICTING THE UNKNOWN CLASS LABEL y OF A TISSUE SAMPLE x. e.g. g = 2 classes: G1 - DISEASE-FREE, G2 - METASTASES

LINEAR CLASSIFIER FORM $C(x) = \beta_0 + \beta^T x = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$ for the prediction of the group label y of a future entity with feature vector x.

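FISHER’S LINEAR DISCRIMINANT FUNCTION $C(x) = \beta_0 + \beta^T x$ with $\beta = S^{-1}(\bar{x}_1 - \bar{x}_2)$ and $\beta_0 = -\tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)^T S^{-1}(\bar{x}_1 - \bar{x}_2)$, where $\bar{x}_1$, $\bar{x}_2$, and $S$ are the sample means and pooled sample covariance matrix found from the training data.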
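As a minimal sketch of Fisher's rule, the following computes the weights from the two class means and the pooled covariance matrix, then allocates a new sample by the sign of the discriminant. It assumes p = 2 genes (so the 2 x 2 inverse can be written out by hand), equal prior probabilities, and made-up expression values; the helper names are illustrative only.

```python
# Sketch of Fisher's linear discriminant for two classes and p = 2 genes.
# The toy expression values below are made up for illustration only.

def mean_vec(rows):
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]

def pooled_cov(g1, g2, m1, m2):
    """Pooled sample covariance matrix S (2 x 2), divisor n1 + n2 - 2."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows, m in ((g1, m1), (g2, m2)):
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    dof = len(g1) + len(g2) - 2
    return [[s[a][b] / dof for b in range(2)] for a in range(2)]

def fisher_rule(g1, g2):
    """Return (beta0, beta) with beta = S^{-1}(m1 - m2)."""
    m1, m2 = mean_vec(g1), mean_vec(g2)
    s = pooled_cov(g1, g2, m1, m2)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    beta = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m1[0] + m2[0]) / 2, (m1[1] + m2[1]) / 2]
    beta0 = -(beta[0] * mid[0] + beta[1] * mid[1])
    return beta0, beta

def classify(x, beta0, beta):
    """Assign to G1 if C(x) > 0, otherwise to G2 (equal priors assumed)."""
    c = beta0 + beta[0] * x[0] + beta[1] * x[1]
    return 1 if c > 0 else 2

g1 = [[2.0, 1.0], [3.0, 2.0], [2.5, 0.5], [3.5, 1.5]]   # class G1 samples
g2 = [[0.0, 1.0], [1.0, 2.0], [0.5, 0.0], [1.5, 2.5]]   # class G2 samples
beta0, beta = fisher_rule(g1, g2)
print(classify([3.0, 1.0], beta0, beta))
print(classify([0.5, 1.5], beta0, beta))
```

With microarray data, p is far larger than n and S is singular, which is one reason gene selection is needed before a rule of this form can be fitted.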

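SUPPORT VECTOR CLASSIFIER Vapnik (1995) $C(x) = \beta_0 + \beta^T x$, where $\beta_0$ and $\beta$ are obtained as follows (with the class labels coded as $y_j = \pm 1$): $\min_{\beta, \beta_0} \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j$ subject to $\xi_j \ge 0$ and $y_j C(x_j) \ge 1 - \xi_j$ $(j = 1, \ldots, n)$. Here $\xi_1, \ldots, \xi_n$ are the slack variables, and $\gamma = \infty$ corresponds to the separable case.

The solution has the form $\hat{\beta} = \sum_{j=1}^{n} \hat{\alpha}_j y_j x_j$, with $\hat{\alpha}_j$ non-zero only for those observations j for which the constraints are exactly met (the support vectors), so that $C(x) = \sum_{j=1}^{n} \hat{\alpha}_j y_j x_j^T x + \hat{\beta}_0$.

Support Vector Machine (SVM): REPLACE x by h(x), giving $C(x) = \sum_{j=1}^{n} \hat{\alpha}_j y_j \langle h(x_j), h(x) \rangle + \hat{\beta}_0 = \sum_{j=1}^{n} \hat{\alpha}_j y_j K(x_j, x) + \hat{\beta}_0$, where the kernel function K is the inner product in the transformed feature space.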

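HASTIE et al. (2001, Chapter 12) With the class labels coded as $y_j = \pm 1$ and $\gamma$ the cost parameter on the slack variables $\xi_j$, the Lagrange (primal) function is

$L_P = \tfrac{1}{2}\|\beta\|^2 + \gamma \sum_{j=1}^{n} \xi_j - \sum_{j=1}^{n} \alpha_j \left[ y_j(\beta_0 + \beta^T x_j) - (1 - \xi_j) \right] - \sum_{j=1}^{n} \lambda_j \xi_j$  (1)

which we minimize w.r.t. β, β0, and ξj. Setting the respective derivatives to zero, we get

$\beta = \sum_{j=1}^{n} \alpha_j y_j x_j$  (2)
$0 = \sum_{j=1}^{n} \alpha_j y_j$  (3)
$\alpha_j = \gamma - \lambda_j \quad (j = 1, \ldots, n)$  (4)

with $\alpha_j \ge 0$, $\lambda_j \ge 0$, and $\xi_j \ge 0$.

By substituting (2) to (4) into (1), we obtain the Lagrangian dual function

$L_D = \sum_{j=1}^{n} \alpha_j - \tfrac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \alpha_j \alpha_k y_j y_k x_j^T x_k$  (5)

We maximize (5) subject to $0 \le \alpha_j \le \gamma$ and $\sum_{j=1}^{n} \alpha_j y_j = 0$. In addition to (2) to (4), the constraints include

$\alpha_j \left[ y_j(\beta_0 + \beta^T x_j) - (1 - \xi_j) \right] = 0$  (6)
$\lambda_j \xi_j = 0$  (7)
$y_j(\beta_0 + \beta^T x_j) - (1 - \xi_j) \ge 0$  (8)

for j = 1, …, n. Together these equations (2) to (8) uniquely characterize the solution to the primal and dual problem.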
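The conditions characterizing the support vector solution can be checked by hand on a toy problem. The sketch below takes two one-dimensional points, x = -1 with label -1 and x = +1 with label +1, and a candidate solution with both dual weights equal to 0.5. It then verifies that β is the weighted sum of the observations, that the weighted labels sum to zero, that the margin constraints hold with equality at the support vectors, and that the primal and dual objectives agree. The data, the candidate values, and the assumption that the cost parameter is at least 0.5 (so the dual weights are feasible) are all illustrative choices.

```python
# Toy check of the support vector conditions on two points in one dimension.
# The data and the candidate solution are illustrative choices.
x = [-1.0, 1.0]          # feature values
y = [-1, 1]              # class labels coded as -1 / +1
alpha = [0.5, 0.5]       # candidate dual weights (feasible if cost >= 0.5)
beta0 = 0.0
xi = [0.0, 0.0]          # slack variables (separable data, so all zero)

# beta is a weighted sum of the observations.
beta = sum(a * yj * xj for a, yj, xj in zip(alpha, y, x))

# The weighted labels sum to zero.
balance = sum(a * yj for a, yj in zip(alpha, y))

# Margin constraints: y_j (x_j beta + beta0) - (1 - xi_j) >= 0.
margins = [yj * (xj * beta + beta0) - (1 - s) for yj, xj, s in zip(y, x, xi)]

# Complementary slackness: alpha_j times the margin slack is zero.
comp = [a * m for a, m in zip(alpha, margins)]

# Primal objective (the slack term vanishes here) and dual objective.
primal = 0.5 * beta ** 2
dual = sum(alpha) - 0.5 * sum(
    alpha[j] * alpha[k] * y[j] * y[k] * x[j] * x[k]
    for j in range(2) for k in range(2)
)
print(beta, balance, margins, comp, primal, dual)
```

Both points are support vectors here, since each lies exactly on its margin; equality of the primal and dual objectives confirms that the candidate is optimal for this toy problem.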

Leo Breiman (2001). Statistical modeling: the two cultures (with discussion). Statistical Science 16, 199-231. Discussants include Brad Efron and David Cox.

Selection bias in gene extraction on the basis of microarray gene-expression data Ambroise and McLachlan Proceedings of the National Academy of Sciences Vol. 99, Issue 10, 6562-6566, May 14, 2002 http://www.pnas.org/cgi/content/full/99/10/6562

GUYON, WESTON, BARNHILL & VAPNIK (2002, Machine Learning) • COLON Data (Alon et al., 1999) • LEUKAEMIA Data (Golub et al., 1999)

Since p >> n, consideration is given to the selection of suitable genes.
SVM: FORWARD or BACKWARD (in terms of magnitude of weight βi), RECURSIVE FEATURE ELIMINATION (RFE)
FISHER: FORWARD ONLY (in terms of CVE)

GUYON et al. (2002) LEUKAEMIA DATA: Only 2 genes are needed to obtain a zero CVE (cross-validated error rate) COLON DATA: Using only 4 genes, CVE is 2%

GUYON et al. (2002) “The success of the RFE indicates that RFE has a built in regularization mechanism that we do not understand yet that prevents overfitting the training data in its selection of gene subsets.”

Figure 1: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of colon tissue samples

Figure 2: Error rates of the SVM rule with RFE procedure averaged over 50 random splits of leukemia tissue samples

Figure 3: Error rates of Fisher’s rule with stepwise forward selection procedure using all the colon data

Figure 4: Error rates of Fisher’s rule with stepwise forward selection procedure using all the leukemia data

Figure 5: Error rates of the SVM rule averaged over 20 noninformative samples generated by random permutations of the class labels of the colon tumor tissues

Error Rate Estimation Suppose there are two groups G1 and G2. C(x) is a classifier formed from the data set (x1, x2, x3, …, xn). The apparent error is the proportion of the data set misallocated by C(x).

Cross-Validation From the original data set, remove x1 to give the reduced set (x2, x3, …, xn). Then form the classifier C(1)(x) from this reduced set. Use C(1)(x1) to allocate x1 to either G1 or G2.

Repeat this process for the second data point, x2, so that this point is assigned to either G1 or G2 on the basis of the classifier C(2)(x2). And so on up to xn. The cross-validated error rate (CVE) is the proportion of the n points misallocated in this way.
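The leave-one-out procedure can be sketched as follows, with the gene-selection step placed either outside the loop (genes chosen once from all n samples) or inside it (genes re-chosen from each reduced set). The data are pure noise with arbitrary labels, so an honest error estimate should be near 50%. The nearest-centroid rule and the mean-difference ranking are simple stand-ins chosen for brevity, not the SVM/RFE or Fisher procedures discussed in the talk, and all names and settings are assumptions.

```python
import random

def top_genes(X, y, k):
    """Rank genes by absolute difference of class means; keep the top k."""
    p = len(X[0])
    scores = []
    for g in range(p):
        a = [X[j][g] for j in range(len(X)) if y[j] == 1]
        b = [X[j][g] for j in range(len(X)) if y[j] == 2]
        scores.append((abs(sum(a) / len(a) - sum(b) / len(b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]

def nearest_centroid(X, y, genes, x_new):
    """Allocate x_new to the class with the closer centroid on `genes`."""
    dist = {}
    for c in (1, 2):
        rows = [X[j] for j in range(len(X)) if y[j] == c]
        cen = [sum(r[g] for r in rows) / len(rows) for g in genes]
        dist[c] = sum((x_new[g] - m) ** 2 for g, m in zip(genes, cen))
    return 1 if dist[1] <= dist[2] else 2

def loo_error(X, y, k, select_inside):
    """Leave-one-out error; gene selection inside or outside the loop."""
    n = len(X)
    fixed = None if select_inside else top_genes(X, y, k)
    errors = 0
    for j in range(n):
        Xr, yr = X[:j] + X[j + 1:], y[:j] + y[j + 1:]   # reduced set
        genes = top_genes(Xr, yr, k) if select_inside else fixed
        errors += nearest_centroid(Xr, yr, genes, X[j]) != y[j]
    return errors / n

rng = random.Random(0)
n, p, k = 20, 500, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [1] * 10 + [2] * 10    # labels carry no information about X
biased = loo_error(X, y, k, select_inside=False)
unbiased = loo_error(X, y, k, select_inside=True)
print(biased, unbiased)
```

Selecting genes once from the full data set lets the left-out sample influence the rule that is then tested on it, which is the source of the selection bias; repeating the selection inside the loop removes that leak.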


ADDITIONAL REFERENCES
Selection bias ignored: XIONG et al. (2001, Molecular Genetics and Metabolism); XIONG et al. (2001, Genome Research); ZHANG et al. (2001, PNAS)
Aware of selection bias: SPANG et al. (2001, In Silico Biology); WEST et al. (2001, PNAS); NGUYEN and ROCKE (2002)

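BOOTSTRAP APPROACH Efron’s (1983, JASA) .632 estimator $B.632 = .368 \times AE + .632 \times B1$, where AE is the apparent error and B1 is the bootstrap error when the rule is applied to a point not in the training sample. A Monte Carlo estimate of B1 is $B1 = \sum_{j=1}^{n} E_j / n$, where $E_j = \sum_{k=1}^{K} I_{jk} Q_{jk} / \sum_{k=1}^{K} I_{jk}$, with $I_{jk} = 1$ if $x_j$ is not in the kth bootstrap sample and 0 otherwise, and $Q_{jk} = 1$ if the rule formed from the kth bootstrap sample misallocates $x_j$ and 0 otherwise.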
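A rough sketch of the Monte Carlo estimate of B1: draw bootstrap samples, fit the rule to each, and score it only on the points that the sample left out. The one-dimensional nearest-class-mean rule, the simulated data, and the number of bootstrap samples are assumptions made to keep the example short; the .632 estimate then weights the apparent error by .368 and B1 by .632.

```python
import random

def nearest_mean_rule(train):
    """Fit a 1-D nearest-class-mean rule from (x, label) pairs."""
    means = {}
    for c in (1, 2):
        vals = [x for x, lab in train if lab == c]
        means[c] = sum(vals) / len(vals)
    return lambda x: 1 if abs(x - means[1]) <= abs(x - means[2]) else 2

def bootstrap_b1(data, n_boot, rng):
    """Average error on points left out of each bootstrap sample."""
    n = len(data)
    left_out = [0] * n     # times point j was absent from a bootstrap sample
    wrong = [0] * n        # times point j was absent and misallocated
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        sample = [data[i] for i in idx]
        if len({lab for _, lab in sample}) < 2:
            continue       # skip samples that miss a class entirely
        rule = nearest_mean_rule(sample)
        chosen = set(idx)
        for j in range(n):
            if j not in chosen:
                left_out[j] += 1
                wrong[j] += rule(data[j][0]) != data[j][1]
    # Per-point error rates, averaged over points left out at least once.
    e = [w / m for w, m in zip(wrong, left_out) if m > 0]
    return sum(e) / len(e)

rng = random.Random(1)
data = ([(rng.gauss(0, 1), 1) for _ in range(15)]
        + [(rng.gauss(2, 1), 2) for _ in range(15)])
b1 = bootstrap_b1(data, 200, rng)
rule = nearest_mean_rule(data)
ae = sum(rule(x) != lab for x, lab in data) / len(data)   # apparent error
b632 = 0.368 * ae + 0.632 * b1
print(ae, b1, b632)
```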

Toussaint & Sharpe (1975) proposed the ERROR RATE ESTIMATOR where McLachlan (1977) proposed w = wo, where wo is chosen to minimize the asymptotic bias of A(w) in the case of two homoscedastic normal groups. The value of wo was found to range between 0.6 and 0.7, depending on the values of

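.632+ estimate of Efron & Tibshirani (1997, JASA) $B.632^{+} = (1 - w)\,AE + w\,B1$, where AE is the apparent error, $w = .632 / (1 - .368\,r)$, $r = (B1 - AE)/(\gamma - AE)$ (relative overfitting rate), and $\gamma = \sum_{i=1}^{g} p_i (1 - q_i)$ (estimate of no information error rate), with $p_i$ the proportion of samples observed in class i and $q_i$ the proportion allocated to class i. If r = 0, w = .632, and so B.632+ = B.632; if r = 1, w = 1, and so B.632+ = B1.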
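The .632 and .632+ weightings are simple enough to write out directly. The sketch below takes the apparent error, the bootstrap error B1, and the no-information rate as plain numbers; clipping r to the interval [0, 1] is a guard added here for numerical safety, and the function names are illustrative.

```python
def no_information_rate(labels, predictions):
    """gamma = sum over classes of p_i * (1 - q_i)."""
    n = len(labels)
    gamma = 0.0
    for c in set(labels):
        p_c = sum(1 for lab in labels if lab == c) / n       # observed share
        q_c = sum(1 for pr in predictions if pr == c) / n    # allocated share
        gamma += p_c * (1 - q_c)
    return gamma

def b632(ae, b1):
    """Efron's .632 estimator."""
    return 0.368 * ae + 0.632 * b1

def b632_plus(ae, b1, gamma):
    """.632+ estimator with weight w depending on the overfitting rate r."""
    r = (b1 - ae) / (gamma - ae) if gamma > ae else 0.0
    r = min(max(r, 0.0), 1.0)            # guard: keep r within [0, 1]
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * ae + w * b1

print(b632(0.05, 0.30), b632_plus(0.05, 0.30, 0.50))
```

When the rule overfits heavily (apparent error near zero while B1 approaches the no-information rate), w moves towards 1 and the estimate relies almost entirely on B1.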

One concern is the heterogeneity of the tumours themselves, which consist of a mixture of normal and malignant cells, with blood vessels in between. Even if one pulled out some cancer cells from a tumour, there is no guarantee that those are the cells that are going to metastasize, just because tumours are heterogeneous. “What we really need are expression profiles from hundreds or thousands of tumours linked to relevant, and appropriate, clinical data.” John Quackenbush

UNSUPERVISED CLASSIFICATION (CLUSTER ANALYSIS) INFER CLASS LABELS y1, …, yn of x1, …, xn. Initially, hierarchical distance-based methods of cluster analysis were used to cluster the tissues and the genes: Eisen, Spellman, Brown, & Botstein (1998, PNAS)

Hierarchical (agglomerative) clustering algorithms are largely heuristically motivated and there exist a number of unresolved issues associated with their use, including how to determine the number of clusters. “in the absence of a well-grounded statistical model, it seems difficult to define what is meant by a ‘good’ clustering algorithm or the ‘right’ number of clusters.” (Yeung et al., 2001, Model-Based Clustering and Data Transformations for Gene Expression Data, Bioinformatics 17)

Attention is now turning towards a model-based approach to the analysis of microarray data. For example:
• Broet, Richardson, and Radvanyi (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9
• Ghosh and Chinnaiyan (2002). Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18
• Liu, Zhang, Palumbo, and Lawrence (2003). Bayesian clustering with variable and transformation selection. In Bayesian Statistics 7
• Pan, Lin, and Le (2002). Model-based cluster analysis of microarray gene expression data. Genome Biology 3
• Yeung et al. (2001). Model-based clustering and data transformations for gene expression data. Bioinformatics 17

The notion of a cluster is not easy to define. There is a very large literature devoted to clustering when there is a metric known in advance; e.g. k-means. Usually, there is no a priori metric (or equivalently a user-defined distance matrix) for a cluster analysis. That is, the difficulty is that the shape of the clusters is not known until the clusters have been identified, and the clusters cannot be effectively identified unless the shapes are known.

In this case, one attractive feature of adopting mixture models with elliptically symmetric components, such as the normal or t densities, is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.
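The invariance can be illustrated numerically with the squared Mahalanobis distance that drives a normal component density. In the sketch below, the data point, mean, and covariance matrix are transformed by the same affine map (a change of units on each axis plus a shift); the Mahalanobis distance is unchanged while the Euclidean distance is not. The two-dimensional setting and all numerical values are illustrative assumptions.

```python
def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]

def mat_vec(a, v):
    return [a[i][0] * v[0] + a[i][1] * v[1] for i in range(2)]

def mahalanobis_sq(x, mu, sigma):
    """(x - mu)^T sigma^{-1} (x - mu) for a 2 x 2 covariance matrix."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    inv = [[sigma[1][1] / det, -sigma[0][1] / det],
           [-sigma[1][0] / det, sigma[0][0] / det]]
    w = mat_vec(inv, d)
    return d[0] * w[0] + d[1] * w[1]

def euclidean_sq(x, mu):
    return (x[0] - mu[0]) ** 2 + (x[1] - mu[1]) ** 2

x, mu = [3.0, 1.0], [1.0, 2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]

# Affine map z -> A z + b: rescale each axis (change of units) and shift.
A = [[10.0, 0.0], [0.0, 0.1]]
b = [5.0, -3.0]
x2 = [u + s for u, s in zip(mat_vec(A, x), b)]
mu2 = [u + s for u, s in zip(mat_vec(A, mu), b)]
sigma2 = mat_mul(mat_mul(A, sigma), transpose(A))   # covariance becomes A S A^T

print(mahalanobis_sq(x, mu, sigma), mahalanobis_sq(x2, mu2, sigma2))
print(euclidean_sq(x, mu), euclidean_sq(x2, mu2))
```

A method based on Euclidean distance, such as k-means, would therefore give different clusters after a change of units, whereas a normal mixture with unrestricted component covariance matrices would not.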

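MIXTURE OF g NORMAL COMPONENTS $f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$, where $-2 \log \phi(x; \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \text{constant}$ (MAHALANOBIS DISTANCE), and where, for a spherical component with $\Sigma = \sigma^2 I$, $-2 \log \phi(x; \mu, \sigma^2 I) = (x - \mu)^T (x - \mu) / \sigma^2 + \text{constant}$ (EUCLIDEAN DISTANCE).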

MIXTURE OF g NORMAL COMPONENTS with equal spherical covariance matrices, $\Sigma_1 = \cdots = \Sigma_g = \sigma^2 I$: k-means, SPHERICAL CLUSTERS
