
Presentation Transcript


  1. Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data. By L. Shen and E.C. Tan. Name of student: Kung-Hua Chang. Date: July 8, 2005. SoCalBSI, California State University at Los Angeles.

  2. Background • Microarray data have the characteristic that the number of samples is much smaller than the number of variables (genes). • This causes the “curse of dimensionality” problem. • To address this problem, many dimension reduction methods are used, such as Singular Value Decomposition (SVD) and Partial Least Squares (PLS).

  3. Background (cont’d) • Singular Value Decomposition and Partial Least Squares. • Given an m × n matrix X that stores all of the gene expression data, X can be approximated as shown below.
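
A standard way to write this approximation is the rank-k truncated SVD (the textbook form, not necessarily the exact notation used on the original slide):

$$
X \approx U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}, \qquad k \ll \min(m, n),
$$

where U_k and V_k hold the first k left and right singular vectors, Σ_k is the diagonal matrix of the k largest singular values, and the resulting k components (or, analogously, the first k PLS components) serve as the reduced-dimension inputs to the classifier.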

  4. Background (cont’d)

  5. Background (cont’d) • Logistic regression and least squares regression. • Both fit a linear combination of the variables to a set of data points: least squares predicts the response directly, while logistic regression predicts the probability that a sample belongs to a class.
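
In their standard textbook forms (written here for illustration, not quoted from the slides), least squares chooses the coefficients that minimize the squared error,

$$
\hat{\beta}_{\mathrm{LS}} = \arg\min_{\beta} \; \| y - X\beta \|^{2},
$$

while logistic regression models the class probability

$$
p(y = 1 \mid x) = \frac{1}{1 + e^{-x^{T}\beta}},
$$

with β chosen to maximize the likelihood of the observed class labels.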

  6. Background (cont’d) • The difference is that the logistic regression equations are solved iteratively: a trial fit is adjusted repeatedly to improve the fit, and the iterations stop when the improvement from one step to the next is suitably small. • Least squares regression can be solved explicitly, in a single step (a sketch of both appears below).
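
The following is a minimal sketch of that difference, using NumPy on synthetic data; it is illustrative only, not the code or data used in the paper. Least squares is one linear solve, while logistic regression uses iteratively reweighted least squares (IRLS, i.e. Newton’s method) until the step size is small.

```python
import numpy as np

# Illustrative data: 50 samples described by 5 (reduced) components.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
beta_true = rng.normal(size=5)
y_cont = X @ beta_true + 0.1 * rng.normal(size=50)               # continuous response
y_bin = (X @ beta_true + rng.normal(size=50) > 0).astype(float)  # noisy binary labels

# Least squares regression: solved explicitly in a single step (normal equations).
beta_ls = np.linalg.solve(X.T @ X, X.T @ y_cont)

# Logistic regression: iteratively reweighted least squares (Newton's method).
beta = np.zeros(X.shape[1])
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-X @ beta))          # current fitted probabilities
    W = p * (1.0 - p)                            # IRLS weights
    grad = X.T @ (y_bin - p)                     # gradient of the log-likelihood
    hess = X.T @ (W[:, None] * X) + 1e-8 * np.eye(X.shape[1])  # tiny ridge for stability
    step = np.linalg.solve(hess, grad)
    beta += step
    if np.linalg.norm(step) < 1e-8:              # stop when the improvement is small
        break
```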

  7. Background (cont’d) • Penalized logistic regression is ordinary logistic regression except that a penalty term is added to the cost function, which keeps the coefficient estimates stable when there are many variables.
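
In its common ridge form (the specific penalty used by Shen and Tan may differ in detail), the fit maximizes a penalized log-likelihood:

$$
\ell_{\lambda}(\beta) = \sum_{i=1}^{m} \Big[ y_i \, x_i^{T}\beta - \log\!\big(1 + e^{x_i^{T}\beta}\big) \Big] \;-\; \frac{\lambda}{2}\,\|\beta\|^{2},
$$

where λ > 0 controls how strongly large coefficients are penalized.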

  8. Background (cont’d) • Support Vector Machine (SVM) • SVM tries to find a hyperplane that separates the different classes of data. • With a nonlinear kernel, it is not a linear model.
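
A minimal usage sketch, assuming scikit-learn is available (the paper does not specify which SVM implementation was used, so this is illustrative only):

```python
import numpy as np
from sklearn.svm import SVC

# Illustrative data: 40 training and 10 test samples with 5 (reduced) components.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(size=(10, 5))

# An RBF kernel gives a nonlinear decision boundary;
# kernel="linear" would instead fit a linear separating hyperplane.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```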

  9. Hypothesis • Dimension reduction combined with penalized logistic regression performs better than the support vector machine and least squares regression.

  10. Data Analysis The table on this slide gives the number of training/testing cases in each of the seven publicly available cancer data sets.

  11. Data Analysis (cont’d)

  12. Data Analysis (cont’d)

  13. Data Analysis

  14. Data Analysis • In general, the partial least squares-based classifier uses less time than the singular value decomposition-based classifier.

  15. Data Analysis (cont’d) • Penalized logistic regression training requires solving a set of linear equations iteratively until convergence, while least squares regression training requires solving a set of linear equations only once, so it is expected that penalized logistic regression uses more time than least squares regression.

  16. Data Analysis (cont’d) • The overall time required by the partial least squares-based and SVD-based regression methods is much less than that of the support vector machine.

  17. Data Analysis

  18. Conclusion Dimension reduction combined with penalized logistic regression gives the best performance, compared to the support vector machine and least squares regression.

  19. References • [1] L. Shen and E.C. Tan, “Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, to appear June 2005. • [2] SoCalBSI: http://instructional1.calstatela.edu/jmomand2/ • [3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York, 2001.
