Discover the journey of bioinformatics from data mining to advanced medical insights and applications. Learn about gene expression analysis, medical record analysis, and epitope prediction. Explore the benefits and technology behind this field.
From Datamining to Bioinformatics Limsoon Wong Laboratories for Information Technology Singapore
Themes of Bioinformatics • Bioinformatics = Data Mgmt + Knowledge Discovery • Data Mgmt = Integration + Transformation + Cleansing • Knowledge Discovery = Statistics + Algorithms + Databases
Benefits of Bioinformatics • To the patient: • Better drug, better treatment • To the pharma: • Save time, save cost, make more $ • To the scientist: • Better science
From Informatics to Bioinformatics: 8 years of bioinformatics R&D in Singapore • [Timeline figure: 1994, 1996, 1998, 2000, 2002; organisations ISS, LIT, KRDL] • Projects shown: Integration Technology (Kleisli) • Cleansing & Warehousing (FIMM) • MHC-Peptide Binding (PREDICT) • Gene Feature Recognition (Dragon) • Venom Informatics • Protein Interactions Extraction (PIES) • Gene Expression & Medical Record Datamining (PCL)
Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN
Epitope Prediction Results • Prediction by our ANN model for HLA-A11 • 29 predictions • 22 epitopes • 76% specificity • Prediction by BIMAS matrix for HLA-A*1101 • [Figure: number of experimental binders vs. rank by BIMAS; axis marks 1, 66, 100; binder counts 19 (52.8%), 5 (13.9%), 12 (33.3%)]
Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable
Gene Expression Analysis • Classifying gene expression profiles • find stable differentially expressed genes • find significant gene groups • derive coordinated gene expression
Medical Record & Gene Expression Analysis Results • PCL, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works well for gene expression data (Cancer Cell, March 2002, 1(2))
Behind the Scene • Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang • and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators…
What is Datamining? • [Figure: Jonathan’s blocks, Jessica’s blocks] • Whose block is this? • Jonathan’s rules: Blue or Circle • Jessica’s rules: All the rest
What is Datamining? Question: Can you explain how?
The Steps of Data Mining • Training data gathering • Signal generation • k-grams, colour, texture, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how, ... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...
A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
Signal Generation • K-grams (i.e., k consecutive letters) • K = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Upstream, downstream vs. anywhere in window • In-frame vs. any frame
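To make the k-gram idea concrete, here is a minimal Python sketch of one possible way to turn a candidate ATG into signals. The function names, the 99-base window, and the ("up"/"down", k-gram) feature keys are illustrative assumptions, not the layout used in the original work.

```python
from collections import Counter

def kgram_counts(seq, k, frame=None):
    """Count k-grams in seq. If frame is 0, 1 or 2, only k-grams that start
    at an index i with i % 3 == frame are counted; None counts any frame."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        if frame is None or i % 3 == frame:
            counts[seq[i:i + k]] += 1
    return counts

def atg_signals(seq, atg_pos, k=3, window=99):
    """Hypothetical feature layout: in-frame k-gram counts in a window
    upstream and downstream of the candidate ATG starting at atg_pos."""
    start = max(0, atg_pos - window)
    # Trim the upstream window so it stays in frame with the ATG.
    start += (atg_pos - start) % 3
    upstream = seq[start:atg_pos]
    downstream = seq[atg_pos + 3:atg_pos + 3 + window]
    features = {}
    for side, sub in (("up", upstream), ("down", downstream)):
        for gram, n in kgram_counts(sub, k, frame=0).items():
            features[(side, gram)] = n
    return features

example = atg_signals("TAAGGGATGCCC", atg_pos=6)
print(example)
```

Switching `frame=0` to `frame=None` gives the "any frame" variant, and dropping the "up"/"down" split gives the "anywhere in window" variant.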
Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms
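The count can be checked with a few lines of Python that evaluate the slide's formula 4^k * 3 * 2 for each k. The factor 3 * 2 is taken from the slide as given; the function name and parameters are illustrative.

```python
def n_kgram_signals(k, alphabet=4, multiplier=3 * 2):
    """Number of k-gram signals for one k: alphabet**k distinct k-grams,
    times the slide's factor of 3 * 2."""
    return alphabet ** k * multiplier

per_k = [n_kgram_signals(k) for k in range(1, 6)]
print(per_k)       # [24, 96, 384, 1536, 6144]
print(sum(per_k))
```

The count grows by a factor of four with each extra letter, so k = 5 alone contributes about three quarters of the total.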
Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?
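One simple way to turn "low intra-class distance, high inter-class distance" into a number is a Fisher-style ratio: squared difference of class means over the sum of class variances. This is a sketch of that one choice, not necessarily the score used in the talk; the function name and sample values are made up for illustration.

```python
from statistics import mean, pvariance

def separation_score(values_a, values_b):
    """Squared gap between class means (inter-class distance) divided by
    the summed within-class variances (intra-class spread)."""
    between = (mean(values_a) - mean(values_b)) ** 2
    within = pvariance(values_a) + pvariance(values_b)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

# A signal whose values are tight within each class and far apart between classes...
good = separation_score([1.0, 1.2, 0.8], [5.0, 5.1, 4.9])
# ...versus one whose classes overlap heavily.
poor = separation_score([1, 5, 3], [2, 4, 6])
print(good > poor)  # True
```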
Signal Selection (e.g., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other • Homework: find a formula that captures the key idea of CFS above
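One commonly cited answer to the homework is the CFS merit heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the group size, r_cf the mean signal-class correlation, and r_ff the mean signal-signal correlation. The sketch below uses absolute Pearson correlation as the correlation measure, which is an assumption made for simplicity; other measures can be substituted.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cfs_merit(features, labels):
    """Merit of a group of signals: rewards correlation with the class,
    penalises correlation among the signals themselves."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(f, g)) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

labels = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = [0, 0, 0, 1, 1, 1, 1, 0]   # correlated with the class
f2 = [0, 1, 0, 0, 0, 1, 1, 1]   # equally correlated with the class, uncorrelated with f1
print(cfs_merit([f1, f2], labels) > cfs_merit([f1, f1], labels))  # True
```

Adding a redundant copy of `f1` leaves the merit unchanged, while adding the complementary `f2` raises it, which is the behaviour the slide describes.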
Sample k-grams Selected • Position –3 (Kozak consensus) • In-frame upstream ATG (leaky scanning) • In-frame downstream TAA, TAG, TGA (stop codon) • In-frame downstream CTG, GAC, GAG, and GCC (codon bias)
Signal Integration • kNN Given a test sample, find the k training samples that are most similar to it. Let the majority class win. • SVM Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes. • Naïve Bayes, ANN, C4.5, ...
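The kNN rule on this slide is short enough to write out in full. The sketch below assumes numeric feature vectors and Euclidean distance as the similarity measure; both are illustrative choices, and the toy data is invented.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, x, k=3):
    """Find the k training samples closest to x and let the majority class win."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_x, train_y, (0.5, 0.5)))  # a
print(knn_predict(train_x, train_y, (5.5, 5.5)))  # b
```

Using an odd k avoids tied votes when there are two classes.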
Acknowledgements • Roland Yap • Zeng Fanfan • A.G. Pedersen • H. Nielsen
Self-fulfilling Oracle • Consider this scenario • Given classes C1 and C2 w/ explicit signals • Use χ² on C1 and C2 to select signals s1, s2, s3 • Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90% • Is the accuracy really 90%? • What can be wrong with this?
Phil Long’s Experiment • Let there be classes C1 and C2 w/ 100000 features having randomly generated values • Use χ² to select 20 features • Run k-fold x-validation on C1 and C2 w/ these 20 features • Expect: 50% accuracy • Get: 90% accuracy! • Lesson: choose features within each fold, using only that fold's training samples
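The effect is easy to reproduce at a smaller scale. The sketch below is not the original experiment: it assumes 40 samples with 2000 noise features (instead of 100000), ranks features by absolute difference of class means as a simple stand-in for χ², and uses a nearest-centroid classifier. All names and sizes are illustrative.

```python
import random

def make_random_data(n_per_class, n_features, rng):
    """Pure-noise features: no honest classifier should beat about 50%."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(2 * n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return X, y

def class_means(X, y, rows, j):
    """Mean of feature j in each class, computed over the given rows only."""
    means = []
    for c in (0, 1):
        vals = [X[i][j] for i in rows if y[i] == c]
        means.append(sum(vals) / len(vals))
    return means

def top_features(X, y, rows, n_keep):
    """Rank features by |mean difference| between classes (stand-in for chi-square)."""
    def score(j):
        m0, m1 = class_means(X, y, rows, j)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=score, reverse=True)[:n_keep]

def predict(X, y, train, feats, i):
    """Nearest-centroid prediction for sample i using the chosen features."""
    d = [0.0, 0.0]
    for j in feats:
        m0, m1 = class_means(X, y, train, j)
        d[0] += (X[i][j] - m0) ** 2
        d[1] += (X[i][j] - m1) ** 2
    return 0 if d[0] <= d[1] else 1

def cv_accuracy(X, y, n_folds, n_keep, select_inside_fold, rng):
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    # Leaky variant: features are chosen once, using ALL samples (test ones included).
    leaky_feats = None if select_inside_fold else top_features(X, y, idx, n_keep)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Honest variant: features are re-chosen from this fold's training samples only.
        feats = top_features(X, y, train, n_keep) if select_inside_fold else leaky_feats
        correct += sum(predict(X, y, train, feats, i) == y[i] for i in test)
    return correct / len(X)

rng = random.Random(0)
X, y = make_random_data(n_per_class=20, n_features=2000, rng=rng)
leaky = cv_accuracy(X, y, n_folds=5, n_keep=20, select_inside_fold=False, rng=rng)
honest = cv_accuracy(X, y, n_folds=5, n_keep=20, select_inside_fold=True, rng=rng)
print(f"selection outside the folds: {leaky:.2f}")
print(f"selection inside each fold:  {honest:.2f}")
```

With selection done outside the folds, the held-out samples have already influenced which features were kept, so the reported accuracy is inflated even though the data is noise. Re-selecting inside each fold removes that leak.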
Apples vs Oranges • Consider this scenario: • Fanfan reported 89% accuracy on his TIS prediction method • Hatzigeorgiou reported 94% accuracy on her TIS prediction method • So Hatzigeorgiou’s method is better • What is wrong with this conclusion?
Apples vs Oranges • Differences in datasets used: • Fanfan’s expt used Pedersen’s dataset • Hatzigeorgiou’s used her own dataset • Differences in counting: • Fanfan’s expt was on a per ATG basis • Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis • When Fanfan ran on the same dataset and counted the same way as Hatzigeorgiou, he got 94% also!