
Spanish Inquisition



Presentation Transcript


  1. Spanish Inquisition • Final Project Week 2 - 4/29/09 • Breast Cancer Gene Expression Data • Leon Kay, Yan Tran, Chris Thomas

  2. Weka Filtering • Used CFS with BestFirst Search • Reduced the number of attributes from 1544 to 125 • CFS stands for Correlation-based Feature Selection. Basic hypothesis: “A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other.” [1]
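
The hypothesis quoted above is usually expressed as a "merit" score for a feature subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The following is a minimal Python sketch of that score, not Weka's implementation. The function name `cfs_merit` and its inputs are illustrative assumptions; the correlations are taken as precomputed numbers, whatever measure produced them.

```python
import math


def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset (hypothetical helper, not a Weka API).

    feature_class_corr: one correlation per feature in the subset
        (feature vs. class).
    feature_feature_corr: one correlation per pair of features in the
        subset (empty when the subset has a single feature).
    """
    k = len(feature_class_corr)
    if k == 0:
        return 0.0
    r_cf = sum(feature_class_corr) / k
    if feature_feature_corr:
        r_ff = sum(feature_feature_corr) / len(feature_feature_corr)
    else:
        r_ff = 0.0
    # High class correlation raises the merit; high inter-feature
    # correlation (redundancy) lowers it.
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)


# Two equally predictive features: uncorrelated vs. fully redundant.
independent = cfs_merit([0.5, 0.5], [0.0])
redundant = cfs_merit([0.5, 0.5], [1.0])
```

In this toy case the redundant pair scores no better than a single feature with correlation 0.5, while the uncorrelated pair scores higher, which is the behaviour the hypothesis describes.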

  3. CFS Algorithm - Searching • Any search algorithm can be plugged into CFS; the author describes three: forward selection, backward elimination, and best first. All three are essentially greedy heuristic searches, which avoids the cost of evaluating every possible feature subset. • “Best first can start with either no features or all features. In the former, the search progresses forward through the search space adding single features; in the latter the search moves backward through the search space deleting single features. To prevent the best first search from exploring the entire feature subset search space, a stopping criterion is imposed. The search will terminate if five consecutive fully expanded subsets show no improvement over the current best subset.” [1]
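
The forward variant of best first search and its stopping criterion can be sketched in Python as follows. This is an illustration under stated assumptions, not Weka's BestFirst code: `best_first_forward`, `max_stale`, and the toy scoring function `toy_score` are hypothetical names, and any subset-scoring function (such as a CFS merit) could be passed as `evaluate`.

```python
import heapq
import math
from itertools import count


def best_first_forward(n_features, evaluate, max_stale=5):
    """Forward best first search over feature subsets (illustrative).

    evaluate: callable taking a frozenset of feature indices and
        returning a score (higher is better).
    max_stale: stop after this many consecutive expansions that do not
        improve on the best subset found so far.
    """
    start = frozenset()
    best_subset, best_score = start, evaluate(start)
    tie = count()  # tie-breaker so the heap never compares frozensets
    open_heap = [(-best_score, next(tie), start)]
    visited = {start}
    stale = 0
    while open_heap and stale < max_stale:
        # Expand the most promising unexpanded subset.
        _, _, subset = heapq.heappop(open_heap)
        improved = False
        for f in range(n_features):
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = evaluate(child)
            heapq.heappush(open_heap, (-score, next(tie), child))
            if score > best_score:
                best_subset, best_score = child, score
                improved = True
        stale = 0 if improved else stale + 1
    return best_subset, best_score


# Toy scoring function (an assumption for this example): features 0 and 1
# are informative, features 2 and 3 are noise that only dilutes the score.
relevance = {0: 0.6, 1: 0.5, 2: 0.0, 3: 0.0}


def toy_score(subset):
    if not subset:
        return 0.0
    return sum(relevance[f] for f in subset) / math.sqrt(len(subset))


subset, score = best_first_forward(4, toy_score)
```

Unlike plain forward selection, the open list lets the search back up to an earlier subset when the current path stops improving; the `max_stale` counter plays the role of the "five consecutive fully expanded subsets" rule in the quote.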

  4. CFS Algorithm Visual Diagram [1]

  5. Accuracy (Error Rate) of algorithms before and after applying CFS/BestFirst filtering

  6. ROC: Receiver Operating Characteristic • ROC graphs “depict the tradeoff between hit rates and false alarm rates of classifiers” [2] • “one point in ROC space is better than another if it is to the northwest (tp rate is higher, fp rate is lower, or both) of the first” [2] • Therefore, the area under the ROC curve (AUC) summarizes a classifier's ROC performance as a single number that can be used to compare classifiers.
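
As a rough illustration of how ROC points and the AUC can be computed from classifier scores, here is a small Python sketch. The function names `roc_points` and `auc` and the example data are assumptions made for this illustration; the sketch also assumes labels are 0/1 and that both classes are present.

```python
def roc_points(labels, scores):
    """Return (fp_rate, tp_rate) points, lowering the threshold step by step.

    labels: 1 for positive, 0 for negative (both classes must be present).
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so tied scores
        # are handled as a single threshold.
        if prev_score is not None and score != prev_score:
            points.append((fp / neg, tp / pos))
        if label == 1:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / neg, tp / pos))  # final point is (1.0, 1.0)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: one negative is ranked above one positive.
example_labels = [1, 1, 0, 1, 0, 0]
example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
example_auc = auc(roc_points(example_labels, example_scores))
```

An AUC of 1.0 corresponds to a ranking that places every positive above every negative, and 0.5 to a ranking no better than chance, which is why a single AUC value is convenient for comparing classifiers.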

  7. ROC Data – Area under Curve

  8. Example ROC – Random Forests

  9. MeV Analysis • Initial Hierarchical Clustering

  10. Analyze the Cluster

  11. FLJ13710 and GATA3 • Lowly expressed in basal-like samples. • Highly expressed in luminal samples.

  12. GATA3 • GATA3 levels are a known indicator of breast cancer prognosis (basal-like is worse than luminal). • GATA3 is associated with estrogen receptor alpha, which is often highly expressed in the early stages of breast cancer.

  13. FLJ13710 • Mentioned in a paper on finding prognostic signatures for breast cancer. • Couldn’t find any in-depth studies on this gene.

  14. References
     [1] Mark Hall, “Correlation-based Feature Selection for Machine Learning”, http://www.cs.waikato.ac.nz/~mhall/thesis.pdf
     [2] Tom Fawcett, “An introduction to ROC analysis”, doi:10.1016/j.patrec.2005.10.010 (enter into http://dx.doi.org/)
     [3] Wilson, Brian J., Giguère, Vincent, “Meta-analysis of human cancer microarrays reveals GATA3 is integral to the estrogen receptor alpha pathway”, Molecular Cancer 2008, 7:49, http://www.molecular-cancer.com/content/7/1/49
     [4] Hayashi, SI., et al., “The expression and function of estrogen receptor alpha and beta in human breast cancer and its clinical application”, http://erc.endocrinology-journals.org/cgi/content/abstract/10/2/193
     [5] “Suppl. Table 2: List of probe sets significantly differentially expressed between luminal cell lines and basal cell lines. Probe sets are ordered according to decreasing DS (discriminating score).”, www.nature.com/onc/journal/v25/n15/extref/1209254x4.xls
     [6] Carrivick, L., et al., “Identification of Prognostic Signatures in Breast Cancer Microarray Data using Bayesian Techniques”, http://www.enm.bris.ac.uk/cig/pubs/2005/rs4.pdf
