
Gene Expression Profiling Illustrated Using BRB-ArrayTools

Presentation Transcript


  1. Gene Expression Profiling Illustrated Using BRB-ArrayTools

  2. Good Microarray Studies Have Clear Objectives • Class Comparison (gene finding) • Find genes whose expression differs among predetermined classes • Class Prediction • Prediction of predetermined class using information from gene expression profile • Response vs no response • Class Discovery • Discover clusters of specimens having similar expression profiles • Discover clusters of genes having similar expression profiles

  3. Class Comparison and Class Prediction • Not clustering problems • Supervised methods

  4. Levels of Replication • Technical replicates • RNA sample divided into multiple aliquots • Biological replicates • Multiple subjects • Multiple animals • Replication of the tissue culture experiment

  5. Comparing classes or developing classifiers requires independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates

  6. Microarray Platforms • Single label arrays • Affymetrix GeneChips • Dual label arrays • Common reference design • Other designs

  7. Common Reference Design

             Array 1   Array 2   Array 3   Array 4
     RED       A1        A2        B1        B2
     GREEN     R         R         R         R

  Ai = ith specimen from class A, Bi = ith specimen from class B, R = aliquot from reference pool

  8. The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. • The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. • The reference is not the object of comparison.

  9. The relative measure of expression (e.g. log ratio) is compared among biologically independent samples from different classes for class comparison, and serves as the predictor variable for class prediction • Dye-swap technical replicates of the same two RNA samples are not necessary with the common reference design

  10. For two-label direct comparison designs for comparing two classes, dye bias is of concern. • It is more efficient to balance the dye-class assignments for independent biological specimens than to do dye swap technical replicates

  11. Balanced Block Design

             Array 1   Array 2   Array 3   Array 4
     RED       A1        B2        A3        B4
     GREEN     B1        A2        B3        A4

  Ai = ith specimen from class A, Bi = ith specimen from class B

  12. Class Comparison: Gene Finding

  13. t-test Comparison of Gene Expression for Gene j

  tj = (x̄1j − x̄2j) / √( sj² (1/n1 + 1/n2) )

  where x̄kj is the mean log-expression of gene j in class k, sj² is the pooled within-class variance of gene j, and nk is the number of samples in class k
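The per-gene two-sample t statistic can be sketched in pure Python (a minimal illustration for one gene, not BRB-ArrayTools code; the function name is mine):

```python
import math

def two_sample_t(x, y):
    """Two-sample t statistic with a pooled within-class variance estimate.

    x, y: log-expression values of one gene in class 1 and class 2.
    """
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    # pooled within-class variance with n1 + n2 - 2 degrees of freedom
    ss = sum((v - m1) ** 2 for v in x) + sum((v - m2) ** 2 for v in y)
    s2 = ss / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(s2 * (1 / n1 + 1 / n2))
```

In a microarray analysis this statistic is computed once per gene, which is what motivates the multiple-comparison procedures discussed later.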

  14. Estimation of Within-Class Variance • Estimate separately for each gene • Limited degrees of freedom (low precision) unless the number of samples is large • Gene list dominated by genes with small fold changes and small variances • Assume all genes have the same variance • Poor assumption • Random (hierarchical) variance model • Wright G.W. and Simon R., Bioinformatics 19:2448-2455, 2003 • Variances are independent samples from a common distribution; an inverse gamma distribution is used • Results in an exact F (or t) distribution of the test statistics, with increased degrees of freedom for the error variance • Applies to any normal linear model

  15. Randomized Variance t-test • The gene-specific precision σ⁻² is modeled as a draw from a gamma distribution: Pr(σ⁻² = x) = x^(a−1) exp(−x/b) / (Γ(a) b^a) • Equivalently, each gene's variance σ² follows an inverse gamma distribution
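The random variance model can be illustrated by simulation: draw each gene's precision from a gamma distribution, so the variance is the reciprocal of the draw (a sketch under assumed shape/scale values a and b; not the model-fitting code in BRB-ArrayTools):

```python
import random

def sample_gene_variances(a, b, n_genes, seed=0):
    """Draw per-gene variances under the random (hierarchical) variance model.

    The precision sigma^-2 is Gamma(a, b)-distributed, with density
    x^(a-1) exp(-x/b) / (Gamma(a) b^a), so each variance 1/x is
    inverse-gamma distributed.
    """
    rng = random.Random(seed)
    return [1.0 / rng.gammavariate(a, b) for _ in range(n_genes)]
```

With a = 3 and b = 0.5 the mean variance is 1/((a−1)b) = 1, while individual genes still get their own variance, which is what lends the extra degrees of freedom to the t-test.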

  16. Controlling for Multiple Comparisons • Bonferroni type procedures control the probability of making any false positive errors • Overly conservative for the context of DNA microarray studies

  17. Simple Procedures • Control the expected number of false discoveries to be at most u • Conduct each of k tests at level u/k • e.g. to limit the expected number of false discoveries to 10 in 10,000 comparisons, conduct each test at the p<0.001 level • Control the FDR • Benjamini-Hochberg procedure

  18. False Discovery Rate (FDR) • FDR = Expected proportion of false discoveries among the genes declared differentially expressed

  19. If you analyze n probe sets and select as “significant” the k genes with p ≤ p*, then FDR = (number of false positives) / (number of positives) ≈ n p* / k
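The FDR estimate above and the Benjamini-Hochberg procedure mentioned on slide 17 can both be sketched in a few lines (function names are mine; this is an illustration, not the BRB-ArrayTools implementation):

```python
def estimated_fdr(pvalues, p_star):
    """FDR estimate from the slide: n * p_star / k, where k genes pass p <= p_star."""
    n = len(pvalues)
    k = sum(1 for p in pvalues if p <= p_star)
    return (n * p_star / k) if k else 0.0

def benjamini_hochberg(pvalues, q):
    """Indices declared significant by the BH procedure at FDR level q.

    Find the largest rank r with p_(r) <= r*q/n; everything at or below
    that p-value is declared differentially expressed.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    cutoff = 0.0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / n:
            cutoff = pvalues[i]
    return [i for i in range(n) if pvalues[i] <= cutoff]
```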

  20. Multivariate Permutation Procedures (Korn et al., Stat Med 26:4428, 2007) • Allow statements like: • FD Procedure: We are 95% confident that the (actual) number of false discoveries is no greater than 5 • FDP Procedure: We are 95% confident that the (actual) proportion of false discoveries does not exceed 0.10

  21. Multivariate Permutation Tests • Distribution-free • even if they use t statistics • Preserve/exploit correlation among tests by permuting each profile as a unit • More effective than univariate permutation tests especially with limited number of samples
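The key idea of permuting each profile as a unit can be sketched as follows (a simplified illustration with hypothetical function names, not the Korn et al. procedure itself): shuffling the class labels rather than individual gene values keeps every gene-gene correlation intact under the null.

```python
import random

def max_stat_null_distribution(profiles, labels, gene_stat, n_perm=200, seed=0):
    """Null distribution of the maximum |statistic| over genes.

    profiles: one expression profile (list of gene values) per sample.
    labels:   0/1 class label per sample.
    gene_stat(x, y): per-gene statistic, e.g. a t statistic.
    Labels are shuffled, so each profile moves between classes as a
    unit, preserving the correlation among genes.
    """
    rng = random.Random(seed)
    n_genes = len(profiles[0])
    perm = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(perm)  # reassign class labels across samples
        best = 0.0
        for j in range(n_genes):
            x = [p[j] for p, c in zip(profiles, perm) if c == 0]
            y = [p[j] for p, c in zip(profiles, perm) if c == 1]
            best = max(best, abs(gene_stat(x, y)))
        null.append(best)
    return null
```

Comparing each gene's observed statistic against this maximum-based null distribution is distribution-free even when t statistics are used.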

  22. Additional Procedures • “SAM” - Significance Analysis of Microarrays

  23. Class Prediction

  24. Components of Class Prediction • Feature (gene) selection • Which genes will be included in the model • Select model type • E.g. Diagonal linear discriminant analysis, Nearest-Neighbor, … • Fitting parameters (regression coefficients) for model • Selecting value of tuning parameters

  25. Feature Selection • Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01) • The α level is selected only to control the number of genes in the model • For class comparison, the false discovery rate is important • For class prediction, predictive accuracy is important

  26. Complex Gene Selection • Search for a small subset of genes that together give the most accurate predictions • Genetic algorithms • Little evidence that complex feature selection is useful in microarray problems

  27. Linear Classifiers for Two Classes

  28. Linear Classifiers for Two Classes • Fisher linear discriminant analysis • Requires estimating correlations among all genes selected for model • Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated • Compound covariate predictor (Radmacher) and Golub’s method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers
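Diagonal linear discriminant analysis can be sketched directly from its definition (a minimal pure-Python illustration with my own function names; by ignoring gene-gene correlations it needs only per-gene means and variances):

```python
def dlda_train(X0, X1):
    """Diagonal LDA: per-gene means and pooled variances, no correlations.

    X0, X1: lists of sample profiles (lists of gene values) for the two classes.
    """
    p = len(X0[0])
    n0, n1 = len(X0), len(X1)
    m0 = [sum(x[j] for x in X0) / n0 for j in range(p)]
    m1 = [sum(x[j] for x in X1) / n1 for j in range(p)]
    var = []
    for j in range(p):
        ss = sum((x[j] - m0[j]) ** 2 for x in X0) \
           + sum((x[j] - m1[j]) ** 2 for x in X1)
        var.append(ss / (n0 + n1 - 2))  # pooled within-class variance
    return m0, m1, var

def dlda_classify(sample, m0, m1, var):
    """Assign to the class whose centroid is nearer in variance-scaled distance."""
    d0 = sum((sample[j] - m0[j]) ** 2 / var[j] for j in range(len(sample)))
    d1 = sum((sample[j] - m1[j]) ** 2 / var[j] for j in range(len(sample)))
    return 0 if d0 <= d1 else 1
```

Because only p means and p variances are estimated, DLDA remains usable when the number of genes far exceeds the number of samples.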

  29. Linear Classifiers for Two Classes • The compound covariate predictor weights gene j by its t-statistic tj, instead of the DLDA weight (x̄1j − x̄2j)/sj²

  30. Linear Classifiers for Two Classes • Support vector machines with inner product kernel are linear classifiers with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector

  31. Support Vector Machine

  32. Other Linear Methods • Perceptrons • Principal component regression • Supervised principal component regression • Partial least squares • Stepwise logistic regression

  33. Other Simple Methods • Nearest neighbor classification • Nearest k-neighbors • Nearest centroid classification • Shrunken centroid classification

  34. Nearest Neighbor Classifier • To classify a sample in the validation set, determine its nearest neighbor in the training set; i.e. the training sample whose gene expression profile is most similar to it • The similarity measure is based on genes selected as being univariately differentially expressed between the classes • Correlation similarity or Euclidean distance is generally used • Classify the sample as being in the same class as its nearest neighbor in the training set
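With correlation similarity, the nearest neighbor rule reduces to a few lines (a sketch with my own function names; assumes non-constant profiles so the correlation is defined):

```python
import math

def pearson(u, v):
    """Pearson correlation between two expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u)
                    * sum((b - mv) ** 2 for b in v))
    return num / den

def nearest_neighbor_classify(sample, train_profiles, train_labels):
    """Label the sample by its most correlated profile in the training set."""
    best = max(range(len(train_profiles)),
               key=lambda i: pearson(sample, train_profiles[i]))
    return train_labels[best]
```

In practice the profiles here would already be restricted to the univariately selected genes.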

  35. Nearest Centroid Classifier • From the training data, select the genes that are informative for distinguishing the classes • Compute the average expression profile (centroid) of the informative genes in each class • Classify a sample in the validation set according to the training centroid its gene expression profile is most similar to
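The centroid rule above can be sketched for any number of classes (a minimal illustration using Euclidean distance; the data layout and function name are mine):

```python
def nearest_centroid_classify(sample, profiles_by_class):
    """Classify by Euclidean distance to each class's mean profile.

    profiles_by_class: dict mapping class label -> list of training
    profiles, restricted to the informative genes selected on the
    training set.
    """
    best_label, best_d = None, float('inf')
    for label, profiles in profiles_by_class.items():
        p = len(profiles[0])
        # average expression profile (centroid) of this class
        centroid = [sum(x[j] for x in profiles) / len(profiles) for j in range(p)]
        d = sum((sample[j] - centroid[j]) ** 2 for j in range(p))
        if d < best_d:
            best_label, best_d = label, d
    return best_label
```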

  36. When p>>n The Linear Model is Too Complex • It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. • Why consider more complex models?

  37. Other Methods • Top-scoring pairs • Claim that it gives accurate prediction with few pairs because pairs of genes are selected to work well together • Random Forest • Very popular in machine learning community • Complex classifier

  38. Comparative studies indicate that linear methods and nearest neighbor type methods often work as well or better than more complex methods for microarray problems because they avoid over-fitting the data.

  39. When There Are More Than 2 Classes • Nearest neighbor type methods • Decision tree of binary classifiers

  40. Decision Tree of Binary Classifiers • Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and S2 • Develop a binary classifier for distinguishing the composite classes S1 and S2 • Compute the cross-validated classification error for distinguishing S1 and S2 • Repeat the above steps for all possible partitions in order to find the partition S1 and S2 for which the cross-validated classification error is minimized • If S1 and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1 and S2 to optimally partition each of them
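The search over "all possible partitions" in the steps above is over the 2^(K−1) − 1 unordered splits of the class set into two nonempty subsets, which can be enumerated as follows (a helper sketch; the cross-validation step itself is omitted):

```python
from itertools import combinations

def all_binary_partitions(classes):
    """Unordered partitions of the class set into two nonempty subsets.

    Fixing the first class in S1 avoids counting each split twice,
    giving 2**(K-1) - 1 candidate partitions for K classes.
    """
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    parts = []
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            s1 = {first, *combo}
            s2 = set(classes) - s1
            if s2:
                parts.append((s1, s2))
    return parts
```

Each candidate partition would then be scored by the cross-validated error of its binary classifier, and the best split recursed on.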

  41. Evaluating a Classifier • Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data • Goodness of fit vs prediction accuracy

  42. Class Prediction • A set of genes is not a classifier • Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier
