Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases


  1. Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases Limsoon Wong Institute for Infocomm Research

  2. Plan • Some accomplishments and challenges in knowledge discovery from biological and clinical data • Data mining in microarray analysis • diagnosis of disease state and subtype • derivation of treatment plan • understanding of gene interaction network

  3. Knowledge Discovery from Biological and Clinical Data: MOTIVATION

  4. Driving Forces: Genes, Proteins, Interactions, Diagnosis, & Cures • Complete genomes are now available • Knowing the genes is not enough to understand how biology functions • Proteins, not genes, are responsible for many cellular activities • Proteins function by interacting with other proteins and biomolecules [Figure labels: GENOME, PROTEOME, INTERACTOME]

  5. If we figure out how these work, we get these Benefits • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science

  6. To figure these out, we bet on... • “solution” = Data Mgmt + Knowledge Discovery • Data Mgmt = Integration + Transformation + Cleansing • Knowledge Discovery = Statistics + Algorithms + Databases

  7. Knowledge Discovery from Biological and Clinical Data: ACCOMPLISHMENT

  8. 8 years of bioinformatics R&D in Singapore [Timeline figure, 1994 to 2002; organisations shown: ISS, LIT/I2R, KRDL] • Integration Technology (Kleisli) • Cleansing & Warehousing (FIMM) • MHC-Peptide Binding (PREDICT) • Protein Interactions Extraction (PIES) • Gene Feature Recognition (Dragon) • Gene Expression & Medical Record Datamining (PCL) • Venom Informatics • Molecular Connections • Biobase • GeneticXchange

  9. Predict Epitopes, Find Vaccine Targets • Vaccines are often the only solution for viral diseases • Finding & developing effective vaccine targets is a slow and expensive process • Develop systems to recognize protein peptides that bind MHC molecules • Develop systems to recognize hot spots in viral antigens

  10. Recognize Functional Sites, Help Scientists • Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments • Data mining of bio seqs to find rules for recognizing & understanding functional sites • Dragon’s 10x reduction of TSS recognition false positives

  11. Diagnose Leukaemia, Benefit Children • Childhood leukaemia is a heterogeneous disease • Treatment is based on subtype • 3 different tests and 4 different experts are needed for accurate diagnosis • Curable in USA, fatal in Indonesia • A single platform diagnosis based on gene expression • Data mining to discover rules that are easy for doctors to understand

  12. Understand Proteins, Fight Diseases • Understanding the function and role of a protein needs organised info on interaction pathways • Such info is often reported in scientific papers but is seldom found in structured databases • Knowledge extraction system to process free text • extract protein names • extract interactions [Figure example: Jak1]

  13. Data Mining in Microarray Analysis: MICROARRAY BACKGROUND

  14. What’s a Microarray? • Contains a large number of DNA molecules spotted on glass slides, nylon membranes, or silicon wafers • Measures expression of thousands of genes simultaneously

  15. Affymetrix GeneChip Array

  16. Making Affymetrix GeneChip • Quartz is washed to ensure uniform hydroxylation across its surface and to attach linker molecules • Exposed linkers become deprotected and are available for nucleotide coupling

  17. Gene Expression Measurement by GeneChip

  18. A Sample Affymetrix GeneChip File (U95A)

  19. Data Mining in Microarray Analysis: DISEASE SUBTYPE DIAGNOSIS

  20. Pediatric Acute Lymphoblastic Leukemia • A heterogeneous disease with more than 12 subtypes, e.g., T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, and Hyperdip>50. • Treatment response is subtype dependent • 80% continuous remission if subtype is correctly diagnosed and the corresponding treatment plan is applied

  21. Subtype Diagnosis • Require different tests: • immunophenotyping • cytogenetics • molecular diagnostics • Require different experts: • hematologist • oncologist • pathologist • cytogeneticist

  22. Difficulties and Implications • The different tests and experts are not commonly available within a single hospital, especially in less advanced countries • An 80%-curable disease in USA can be a fatal disease in Indonesia! • Is there a single diagnostic platform that does not need multiple human specialists?

  23. A Potential Solution by Microarrays (Yeoh et al., Cancer Cell 1:133--143, 2002) [Figure: diagnostic ALL BM samples (n=327) against genes for class distinction (n=271), grouped by subtype: E2A-PBX1, MLL, T-ALL, Hyperdiploid >50, BCR-ABL, Novel, TEL-AML1; scale from -3σ to 3σ, where σ = std deviation from mean]

  24. Some Caveats • Study was performed on Americans • May not be applicable to Singaporeans, Malaysians, Indonesians, etc. • Large-scale study on local populations currently in the works

  25. Typical Procedure in Analysing Gene Expression for Diagnosis • Gene expression data collection • Gene selection • Classifier training • Classifier tuning (optional for some machine learning methods) • Apply classifier for diagnosis of future cases

  26. Feature Selection Methods A refresher of feature selection methods

  27. Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance

  28. Signal Selection (e.g., t-statistics)
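
The formula on this slide did not survive in the transcript. As a hedged illustration of the basic idea (low intra-class distance, high inter-class distance), the sketch below assumes a Welch-style t-statistic, i.e. the difference of class means divided by the pooled standard error. The function names and the dict-of-lists data layout are assumptions made for this example, not taken from the presentation.

```python
import math
from statistics import mean, variance

def t_statistic(class_a, class_b):
    """Welch-style t-statistic of one gene over two classes of samples."""
    n_a, n_b = len(class_a), len(class_b)
    # Sample variances capture the intra-class spread of the signal.
    spread = variance(class_a) / n_a + variance(class_b) / n_b
    # The difference of class means captures the inter-class distance.
    gap = abs(mean(class_a) - mean(class_b))
    if spread == 0:
        return math.inf if gap > 0 else 0.0
    return gap / math.sqrt(spread)

def rank_genes_by_t(expr_a, expr_b):
    """expr_a, expr_b: dict mapping gene name -> expression values per class.

    Returns gene names ordered from most to least discriminating.
    """
    scores = {g: t_statistic(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)
```

A "top20-t" style selection would then simply keep the first 20 names returned by `rank_genes_by_t`.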

  29. Signal Selection (e.g., χ2)
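
The slide body is missing from the transcript. A minimal sketch of the chi-squared measure, assuming the signal has already been discretized into intervals (the discretization step and all names here are assumptions): the statistic sums (observed - expected)^2 / expected over every interval/class cell, so a signal whose intervals line up with the classes scores high.

```python
from collections import Counter

def chi_squared(feature_bins, labels):
    """Chi-squared statistic between a discretized signal and class labels.

    feature_bins: interval id of each sample; labels: class of each sample.
    """
    n = len(labels)
    bin_totals = Counter(feature_bins)
    class_totals = Counter(labels)
    observed = Counter(zip(feature_bins, labels))
    stat = 0.0
    for b, b_total in bin_totals.items():
        for c, c_total in class_totals.items():
            # Count expected in this cell if signal and class were independent.
            expected = b_total * c_total / n
            stat += (observed[(b, c)] - expected) ** 2 / expected
    return stat
```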

  30. Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS: Correlation-based Feature Selection • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
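
To make the "good group" criterion concrete, here is a hedged sketch of a CFS-style merit score: k * r_cf / sqrt(k + k(k-1) * r_ff), where r_cf is the mean signal-class correlation and r_ff the mean signal-signal correlation. Using absolute Pearson correlation on numeric class labels is an assumption made for this example; CFS implementations may use other correlation measures.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_merit(features, labels):
    """Merit of a group of signals; features is a list of k value columns."""
    k = len(features)
    # Average correlation of each signal with the class (want high).
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    # Average correlation between pairs of signals (want low).
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(f, g)) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)
```

Under this score, adding an exact copy of a signal already in the group does not raise the merit, which is the redundancy penalty the slide describes.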

  31. Gene Expression Profile Classification An introduction to gene expression profile classification by the example on ALL subtype diagnosis

  32. Subtype Classification of ALL A tree-structured diagnostic workflow was recommended by the doctors, as per Yeoh et al., Cancer Cell 1:133--143, 2002

  33. Training and Testing Sets

  34. Our procedure for ALL subtype diagnosis • Gene expression data collection • Gene selection by entropy • Classifier training by emerging pattern • Classifier tuning (optional for some machine learning methods) • Apply classifier for diagnosis of future cases by PCL

  35. Signal Selection (e.g., entropy)
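
The details of the entropy method are not in the transcript. A minimal sketch, assuming the usual idea of searching for the cut point on a gene's expression values that leaves the two sides as class-pure as possible (lower weighted entropy is better). Any stopping rule that decides whether a gene "has a cut point" at all is omitted here; function names are assumptions.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in labels."""
    n = len(labels)
    return sum(c / n * math.log2(n / c) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (weighted entropy, cut point) of the best binary split.

    The cut point is None when no split improves on the unsplit entropy.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (class_entropy(labels), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between identical expression values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * class_entropy(left) + len(right) / n * class_entropy(right)
        if e < best[0]:
            best = (e, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best
```

Ranking genes by the entropy returned from `best_cut` and keeping the 20 lowest would correspond to the "20 genes of lowest entropy" selection used later in the deck.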

  36. Emerging Patterns (EPs) • An EP is a set of conditions, usually involving several features, that most members of a class satisfy but none or few of the other class satisfy • A jumping EP is an EP that some members of a class satisfy but no members of the other class satisfy • We use only the most general jumping EPs
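
The definitions above can be sketched as follows, assuming each sample has been turned into a set of discretized items such as "g1=low" (this encoding and the function names are assumptions for illustration). "Most general" is read here as: no proper subset of the pattern is itself a jumping EP. Checking only the subsets with one item removed is enough, because any superset of a jumping EP that still occurs in the home class is also a jumping EP.

```python
def support(pattern, samples):
    """Fraction of samples (sets of items) containing every item of pattern."""
    return sum(pattern <= s for s in samples) / len(samples)

def is_jumping_ep(pattern, home, other):
    """True if some home-class samples match and no other-class sample does."""
    return support(pattern, home) > 0 and support(pattern, other) == 0

def is_most_general(pattern, home, other):
    """True if pattern is a jumping EP and dropping any one item breaks that."""
    if not is_jumping_ep(pattern, home, other):
        return False
    return not any(is_jumping_ep(pattern - {item}, home, other) for item in pattern)
```

This brute-force check is only meant to pin down the definitions; mining all EPs from thousands of genes needs a dedicated algorithm.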

  37. PCL: Prediction by Collective Likelihood
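
The scoring rule itself is not in the transcript. The sketch below assumes the rule commonly described for PCL: for each class, take the k most frequent EPs of that class that the test sample satisfies, divide each one's frequency by the frequency of the class's k most frequent EPs overall (rank by rank), sum the ratios, and predict the class with the highest sum. The data layout, function names, and default k are assumptions.

```python
def pcl_score(sample, eps, k):
    """Collective-likelihood score of one class for one sample.

    sample: set of items; eps: list of (pattern, frequency) for the class.
    """
    ranked = sorted(eps, key=lambda pf: pf[1], reverse=True)
    # EPs of this class that the sample actually satisfies, most frequent first.
    matched = [pf for pf in ranked if pf[0] <= sample]
    # Compare the i-th matched EP against the i-th best EP of the class.
    return sum(m[1] / top[1] for m, top in zip(matched[:k], ranked[:k]))

def pcl_predict(sample, eps_by_class, k=20):
    """Return (predicted class, per-class scores)."""
    scores = {c: pcl_score(sample, eps, k) for c, eps in eps_by_class.items()}
    return max(scores, key=scores.get), scores
```

A sample that matches a class's strongest EPs scores close to k for that class, while a sample that matches only weak EPs, or none, scores near zero.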

  38. Accuracy (using 20 genes of lowest entropy) [Table; row and column labels are missing from the transcript] PCL 0:1 0:2 0:1 0:1 1:1 1:1 1:1 2:2 1:1 0:0 1:6 0:2 4 14 5 7 5

  39. Comprehensibility

  40. Gene Expression Profile Classification: How about other feature selection and classification methods?

  41. Some gene selection heuristics • all-CFS: all features from CFS • top20-χ2: 20 features w/ highest χ2 stats • top20-t: 20 features w/ highest t-stats • top20-mit: 20 features w/ highest MIT stats • entropy: 20 features w/ lowest entropy • all-χ2: all features meeting 5% significance level of χ2 stats

  42. Some other classification methods • k-NN (k=1) • majority votes of the k nearest neighbours determined by Euclidean distance • C4.5 • widely used decision tree method. • Naïve Bayes (NB) • probabilistic prediction using Bayes’ rule • SVM • (linear) discriminant function that maximizes separation of boundary samples
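
Of the methods listed, k-NN is the simplest to show in full. A minimal sketch, assuming each training case is a pair of (expression vector over the selected genes, subtype label); the function name and data layout are assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k=1):
    """Majority vote among the k training cases nearest to query.

    train: list of (vector, label); distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

With k=1, as on the slide, the prediction is simply the label of the single closest training case.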

  43. Accuracy • Feature selection improves performance • Entropy+PCL has consistent high performance

  44. When 20 genes are selected randomly • Average over 100 experiments • Cf. 7-15 mistakes total with good feature selection

  45. Data Mining in Microarray Analysis: TREATMENT PLAN DERIVATION A pure speculation!

  46. Can we do more with EPs? • Detect gene groups that are significantly related to a disease • Derive coordinated gene expression patterns from these groups • Derive “treatment plan” based on these patterns

  47. Colon Tumour Dataset (Alon et al., PNAS 96:6745--6750, 1999) • We use the colon tumour dataset above to illustrate our ideas • 22 normal samples • 40 colon tumour samples

  48. Detect Gene Groups • Feature Selection • Use entropy method • 35 genes have cut points • Generate EPs • 19501 EPs in normals • 2165 EPs in tumours • EPs with largest support are gene groups significantly correlated with disease

  49. Top 20 EPs

  50. Observation 1 • Some EPs contain a large number of genes and still have high freq • E.g., {2, 3, 6, 7, 13, 17, 33} has freq 90.91% in normal and 0% in cancer samples • Nearly all normal samples’ gene expr. values satisfy all conds. implied by these 7 items
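
As a quick check of the 90.91% figure, assuming the frequency is taken over the 22 normal samples of the colon tumour dataset, the only sample count that rounds to that percentage is 20 of 22:

```python
normal_samples = 22  # normal samples in the colon tumour dataset

# Which counts m out of 22 give a frequency that rounds to 90.91%?
matches = [m for m in range(normal_samples + 1)
           if round(100 * m / normal_samples, 2) == 90.91]
# matches -> [20]
```

Under that assumption, the 7-item pattern is satisfied by all but 2 normal samples and by none of the 40 tumour samples.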
