
JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001

Classification and Diagnostic Prediction of Cancers using Gene Expression Profiling and Artificial Neural Networks. JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001. The Small, Round Blue Cell Tumors (SRBCTs) of Childhood.


Presentation Transcript


  1. Classification and Diagnostic Prediction of Cancers using Gene Expression Profiling and Artificial Neural Networks JAVED KHAN ET AL. NATURE MEDICINE – Volume 7 – Number 6 – JUNE 2001

  2. The Small, Round Blue Cell Tumors (SRBCTs) of Childhood • Four categories: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS). • Similar in appearance on routine histology. • Accurate diagnosis is nevertheless essential, because treatment options, response to therapy, etc. vary with the diagnosis. • No single test can precisely distinguish SRBCTs; several techniques are combined: immunohistochemistry, cytogenetics, interphase fluorescence in situ hybridization and reverse transcription PCR.

  3. Gene Expression Profiling using cDNA Microarrays • Microarrays measure the activity of several thousand genes simultaneously. • Can be used for cancer classification. • Diagnosing cancer types with improved accuracy allows patients to be given better-targeted therapy. • The approach is extended here to cancers that span several diagnostic categories: the SRBCTs.

  4. Artificial Neural Networks (ANNs) put to the task • Modeled on the structure and behavior of neurons in the human brain. • Can be trained to recognize and categorize complex patterns. • Pattern recognition is achieved by adjusting the parameters of the ANN through a process of error minimization, that is, by learning from experience. • ANNs were applied to decipher gene-expression signatures of SRBCTs and then used for diagnostic classification.

  5. Error Minimization • Mean Squared Error • Summed Square Error
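
The formulas on this slide did not survive in the transcript. As a rough sketch only, the two error measures are commonly computed as below. Normalization conventions differ between texts (some add a factor of 1/2 or also divide by the number of outputs), so the exact forms here are assumptions, and the sample values are invented for illustration.

```python
def summed_square_error(outputs, targets):
    """Sum of squared differences over all samples and output nodes."""
    return sum(
        (o - t) ** 2
        for out_row, tgt_row in zip(outputs, targets)
        for o, t in zip(out_row, tgt_row)
    )


def mean_squared_error(outputs, targets):
    """Summed square error divided by the number of samples (assumed convention)."""
    return summed_square_error(outputs, targets) / len(outputs)


# Two hypothetical samples, four output nodes (EWS, RMS, NB, BL).
outputs = [[0.9, 0.1, 0.0, 0.0], [0.2, 0.7, 0.1, 0.0]]
targets = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
sse = summed_square_error(outputs, targets)
mse = mean_squared_error(outputs, targets)
```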

  6. Network Architecture and Parameters • Because the amount of calibration data is limited and four output nodes are needed, the network architecture was limited to linear perceptrons. • 10 input nodes were used, representing the 10 PCA components described later on. • 4 output nodes, modeled by the sigmoid function. • Calibration is performed using JETNET, with learning rate η = 0.7 and momentum coefficient p = 0.3. • The learning rate is decreased by a factor of 0.99 after each iteration. • Initial weight values are chosen randomly from [-r, r], where r = 0.1/max[Fi] and Fi is the number of nodes connecting to node i. • Weight values are updated after every 10 samples.
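
The initialization and learning-rate schedule listed above could be sketched as follows. This is not JETNET code. The function names are hypothetical, and the sketch assumes that with no hidden layer every output node is fed by all 10 inputs, so that max[Fi] = 10 (threshold terms are ignored).

```python
import random


def init_weights(n_inputs, n_outputs, seed=0):
    """Draw weights uniformly from [-r, r] with r = 0.1 / max_i F_i.

    Assumption: each output node is connected to all n_inputs input
    nodes, so max_i F_i is taken to be n_inputs.
    """
    rng = random.Random(seed)
    r = 0.1 / n_inputs
    weights = [[rng.uniform(-r, r) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    return weights, r


def learning_rate(iteration, eta0=0.7, decay=0.99):
    """Learning rate after `iteration` multiplications by the decay factor."""
    return eta0 * decay ** iteration


weights, r = init_weights(n_inputs=10, n_outputs=4)
```

Under these assumptions r is 0.01, and after 100 iterations the learning rate has fallen to roughly a third of its starting value.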

  7. Back-propagation • Minimizing the error by gradient descent is the least sophisticated method but is nevertheless sufficient in many cases. • It amounts to updating the weights according to the back-propagation learning rule $w_{t+1} = w_t + \Delta w_t$, where $\Delta w_t = -\eta\,\partial E_t/\partial w$ (the delta rule) and $\eta$ is the learning rate. • The partial derivative $\partial E_t/\partial w$ represents a sensitivity factor, determining the direction of search in weight space for the synaptic weights.

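  8. ….continue • A momentum term is often added to stabilize the learning: $\Delta w_{t+1} = -\eta\,\partial E_t/\partial w + \alpha\,\Delta w_t$, where the momentum coefficient $\alpha < 1$.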
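
A minimal sketch of gradient descent with momentum for a single sigmoid output node is given below. It is an illustration, not the JETNET implementation. The error function E = 0.5 (o - target)^2, the toy input and the per-sample update are assumptions (slide 6 states that weights are updated after every 10 samples), and the defaults eta = 0.7 and alpha = 0.3 are taken from slide 6.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w, dw_prev, x, target, eta=0.7, alpha=0.3):
    """One gradient-descent update with momentum for a single sigmoid node.

    With E = 0.5 * (o - target)^2, dE/dw_k = (o - target) * o * (1 - o) * x_k.
    """
    o = sigmoid(sum(wk * xk for wk, xk in zip(w, x)))
    delta = (o - target) * o * (1.0 - o)
    # New step = gradient step plus a fraction alpha of the previous step.
    dw = [-eta * delta * xk + alpha * dprev for xk, dprev in zip(x, dw_prev)]
    w_new = [wk + d for wk, d in zip(w, dw)]
    return w_new, dw, 0.5 * (o - target) ** 2


w, dw = [0.0, 0.0], [0.0, 0.0]
errors = []
for _ in range(50):
    w, dw, err = train_step(w, dw, x=[1.0, 0.5], target=1.0)
    errors.append(err)
```

The recorded error is the one measured before each update, so `errors[0]` corresponds to the untrained node.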

  9. Calibration and Validation of the ANN Models • cDNA microarrays containing 6567 genes. • 63 training samples: 13 EWS and 10 RMS from tumor biopsies, plus 10 EWS, 10 RMS, 12 NB and 8 BL (Burkitt lymphoma, a subset of NHL) from cell lines. • 25 test samples: 5 EWS, 5 RMS and 4 NB from tumors, plus 1 EWS, 2 NB and 3 BL from cell lines, and 5 non-SRBCT samples (to test the ability to reject a diagnosis). • Filtering for a minimal level of expression reduced the number of genes to 2308. • Principal Component Analysis (PCA) further reduced the dimensionality.
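
The PCA step can be illustrated with NumPy. This is only a sketch on synthetic data. The random matrix stands in for the 63 x 2308 filtered expression matrix, and the SVD-based projection is one standard way to compute principal components, not necessarily the procedure the authors used.

```python
import numpy as np


def pca_project(X, n_components=10):
    """Project samples (rows of X) onto the leading principal components."""
    Xc = X - X.mean(axis=0)  # center each gene across samples
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(63, 2308))  # synthetic stand-in for the expression data
Z = pca_project(X)               # shape (63, 10): one row of ANN inputs per sample
```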

  10. ….continue • The 10 dominant PCA components per sample were used as inputs. • There were four outputs (EWS, RMS, NB, BL). • A three-fold cross-validation procedure was used and 3750 ANN models were produced (Figure 1). • There was no sign of "over-training" of the models, which would show up as a rise in the summed square error for the validation set with increasing iterations (epochs); see Figure 2.

  11. The Artificial Neural Network • Quality filtering • PCA • The 25 test samples are set aside and the 63 training samples are randomly partitioned into 3 groups. • One group is reserved for validation and the other two are used for calibration. • For each model the calibration was optimized with 100 iterative cycles (epochs). • This was repeated using each of the three groups for validation. • The samples were then randomly partitioned again and the entire training process repeated, 1250 times in all. For each selection of a validation group one model was calibrated, resulting in a total of 3 × 1250 = 3750 trained models. • Once the models were calibrated they were used to rank the genes according to their importance for classification. • The entire process was repeated using only the top-ranked genes.
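
The partitioning scheme above can be sketched in a few lines. The function name and the seed are hypothetical. The figure of 1250 random partitions follows from the 3750 models divided by the 3 validation groups, and no network is actually trained here.

```python
import random


def three_fold_rounds(n_samples=63, n_shuffles=1250, seed=0):
    """Yield (calibration, validation) index lists, one pair per model."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_shuffles):
        rng.shuffle(indices)
        folds = [indices[i::3] for i in range(3)]  # three groups of 21
        for v in range(3):
            validation = folds[v]
            calibration = [i for f, fold in enumerate(folds) if f != v
                           for i in fold]
            yield calibration, validation


splits = list(three_fold_rounds())
```

Each of the 63 samples ends up in the validation group of exactly 1250 of the 3750 splits, which matches the 1250 predictions per validation sample mentioned in the transcript.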

  12. ….continue Validation • Each validation sample is then passed through the 1250 models for which it was in the validation group, producing 1250 predictions per sample. • Each ANN model gives a number between 0 (not this cancer type) and 1 (this cancer type) as an output for each cancer type. • The average of all model outputs for every validation sample is then computed (denoted the average committee vote). • Each sample is classified as belonging to the cancer type with the largest committee vote. • Using these ANN models, all 63 training samples were correctly classified to their respective categories.
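
The committee vote described above amounts to averaging and taking the maximum. A small sketch follows, with three invented model outputs standing in for the 1250 real ones. The function names are hypothetical.

```python
CLASSES = ["EWS", "RMS", "NB", "BL"]


def committee_vote(model_outputs):
    """Average the per-class outputs of all models for one sample."""
    n_models = len(model_outputs)
    return [sum(out[i] for out in model_outputs) / n_models
            for i in range(len(CLASSES))]


def classify(model_outputs):
    """Return the class with the largest average committee vote."""
    votes = committee_vote(model_outputs)
    best = max(range(len(CLASSES)), key=lambda i: votes[i])
    return CLASSES[best], votes


# Three hypothetical model outputs for one validation sample.
outputs = [[0.9, 0.1, 0.0, 0.1],
           [0.7, 0.3, 0.1, 0.0],
           [0.8, 0.2, 0.2, 0.2]]
label, votes = classify(outputs)
```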

  13. Optimization of Genes used for Classification • The contribution of each gene to the classification by the ANN models was then assessed. • Feature extraction was performed in a model-dependent way because of the relatively small number of samples. • This was achieved by monitoring the sensitivity of the classification to a change in the expression level of each gene, using the 3750 previously calibrated models.

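  14. Sensitivity ($S_k$) of the outputs ($o$) with respect to any of the 2308 input variables ($x_k$) is defined as: $S_k = \frac{1}{N_s}\,\frac{1}{N_o}\sum_{s=1}^{N_s}\sum_{i=1}^{N_o}\left|\frac{\partial o_i}{\partial x_k}\right|$ • where $N_s$ is the number of samples (63) and $N_o$ is the number of outputs (4). The procedure for computing $S_k$ involves a committee of 3750 models.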
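
As a hedged illustration, a sensitivity of this kind can be computed as the mean absolute derivative of each output with respect to one input, averaged over samples and outputs. The sketch assumes a single model with sigmoid outputs and no hidden layer, acting directly on its inputs. The chain through the PCA projection back to individual genes and the averaging over the 3750-model committee are left out, and the weights and samples are invented.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model_outputs(weights, x):
    """Outputs of a network with no hidden layer: one sigmoid per class."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def sensitivity(weights, samples, k):
    """Mean of |d o_i / d x_k| over all samples and output nodes.

    For a sigmoid node, d o_i / d x_k = o_i * (1 - o_i) * w_ik.
    """
    total = 0.0
    for x in samples:
        for o, row in zip(model_outputs(weights, x), weights):
            total += abs(o * (1.0 - o) * row[k])
    return total / (len(samples) * len(weights))


# Hypothetical two-output model on three inputs; input 2 carries no weight.
weights = [[2.0, -0.5, 0.0], [-1.0, 0.5, 0.0]]
samples = [[0.1, 0.2, 0.3], [0.5, -0.4, 0.9]]
ranking = sorted(range(3), key=lambda k: sensitivity(weights, samples, k),
                 reverse=True)
```

Sorting the inputs by decreasing sensitivity gives the kind of ranking the transcript then uses to select the top genes.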

  15. ….continue • In this way genes were ranked according to their significance for classification, and the classification error rate was determined using increasing numbers of these ranked genes. • The classification error rate reached its minimum of 0% at 96 genes. • Using only these 96 genes, the ANN models were recalibrated and again all 63 samples were correctly classified.

  16. Assessing the Quality of Classification: Diagnosis • The aim of diagnosis is to be able to reject test samples that do not belong to any of the four categories. • To do this, a distance $d_c$ from a sample to the ideal vote for each cancer type was calculated: $d_c = \frac{1}{2}\sum_{i=1}^{4}\left(o_i - \delta_{i,c}\right)^2$

  17. ….continue • Here c is the cancer type, $o_i$ is the average committee vote for cancer type i, and $\delta_{i,c}$ is unity if i corresponds to cancer type c and zero otherwise. • The distance is normalized such that the distance between two ideal samples belonging to different disease categories is unity. • Based on the validation set, an empirical probability distribution of distances for each cancer type was generated. • The empirical probability distributions are built using each ANN model independently. • Thus, the number of entries in each distribution is 1250 multiplied by the number of samples belonging to the cancer type.

  18. ….continue • For a given test sample it is thus possible to reject candidate classifications based on these probability distributions. • Hence, for each disease category, a cutoff distance from the ideal sample was defined, within which a sample of that category is expected to fall. • The distance given by the 95th percentile of the probability distribution was chosen. • This is the basis of diagnosis: a sample that falls outside the cutoff distance cannot be confidently diagnosed.
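
The distance and cutoff described on slides 16 to 18 can be sketched as follows. The factor of 1/2 on the squared Euclidean distance is an assumption. It is one normalization that makes two ideal samples of different types exactly one unit apart, as the transcript requires. The nearest-rank percentile and the evenly spaced validation distances are likewise illustrative choices.

```python
CLASSES = ["EWS", "RMS", "NB", "BL"]


def distance_to_ideal(votes, c):
    """Half the squared Euclidean distance from votes to the ideal output of class c."""
    return 0.5 * sum((o - (1.0 if i == c else 0.0)) ** 2
                     for i, o in enumerate(votes))


def percentile_cutoff(distances, q=95):
    """Nearest-rank q-th percentile of an empirical distance distribution."""
    ordered = sorted(distances)
    rank = max(1, -(-q * len(ordered) // 100))  # ceiling of q * n / 100
    return ordered[rank - 1]


# Two ideal samples of different types are one unit apart.
ideal_rms = [0.0, 1.0, 0.0, 0.0]
unit = distance_to_ideal(ideal_rms, c=0)

# Hypothetical validation distances for one cancer type: 0.01, 0.02, ..., 1.00.
validation_distances = [i / 100 for i in range(1, 101)]
cutoff = percentile_cutoff(validation_distances)
```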

  19. Diagnostic Classification and Hierarchical Clustering • The diagnostic capabilities of all 3750 ANN models were then tested using the 25 blinded test samples. • A sample is classified to a diagnostic category if it receives the highest vote for that category; because this classifier has only four possible outputs, every sample will be classified to one of the four categories. • If a sample falls outside the 95th percentile of the probability distribution of distances between samples and their ideal output (for example, for EWS the ideal output is EWS = 1, RMS = NB = BL = 0), its diagnosis is rejected. • Using the 3750 ANN models calibrated with the 96 genes, 100% classification was achieved for the 20 SRBCT test samples; furthermore, all 5 non-SRBCT samples were excluded from the four diagnostic categories, since they fell outside the 95th percentile.
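
The two-stage decision on this slide (classify by the highest vote, then reject when the sample lies too far from the ideal output) might look like the sketch below. The cutoff values and votes are invented, the function name is hypothetical, and the distance uses the same assumed normalization of half the squared Euclidean distance.

```python
CLASSES = ["EWS", "RMS", "NB", "BL"]


def diagnose(votes, cutoffs):
    """Classify by the largest vote, then reject if too far from the ideal."""
    c = max(range(len(CLASSES)), key=lambda i: votes[i])
    d = 0.5 * sum((o - (1.0 if i == c else 0.0)) ** 2
                  for i, o in enumerate(votes))
    # The diagnosis is None when the distance exceeds the class cutoff.
    diagnosis = CLASSES[c] if d <= cutoffs[CLASSES[c]] else None
    return CLASSES[c], diagnosis, d


# Hypothetical per-class cutoffs (95th percentiles from validation).
cutoffs = {"EWS": 0.10, "RMS": 0.12, "NB": 0.08, "BL": 0.09}

confident = diagnose([0.92, 0.05, 0.02, 0.03], cutoffs)
ambiguous = diagnose([0.45, 0.40, 0.30, 0.20], cutoffs)
```

In this toy case the second sample is still classified as EWS, since that vote is the largest, but its diagnosis is rejected, which is the behavior reported for the non-SRBCT samples.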

  20. ….continue • Hierarchical clustering using the 96 genes identified from the ANN models correctly clustered all 20 of the test samples.
