Comprehensive Overview of Bioinformatics: Data Mining to Medical Insights

From Datamining to Bioinformatics
Limsoon Wong, Laboratories for Information Technology, Singapore

What is Bioinformatics?

Themes of Bioinformatics
• Bioinformatics = Data Mgmt + Knowledge Discovery
• Data Mgmt = Integration + Transformation + Cleansing
• Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
• To the patient: better drug, better treatment
• To the pharma: save time, save cost, make more $
• To the scientist: better science

From Informatics to Bioinformatics: 8 years of bioinformatics R&D in Singapore
[Timeline figure. Year axis: 1994, 1996, 1998, 2000, 2002. Lab labels: ISS, LIT, KRDL.]
• MHC-Peptide Binding (PREDICT)
• Protein Interactions Extraction (PIES)
• Gene Expression & Medical Record Datamining (PCL)
• Cleansing & Warehousing (FIMM)
• Gene Feature Recognition (Dragon)
• Integration Technology (Kleisli)
• Venom Informatics

Quick Samplings

Epitope Prediction TRAP-559AA MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
• Prediction by our ANN model for HLA-A11: 29 predictions, 22 epitopes, 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by BIMAS rank. Rank axis marked 1, 66, 100; binder counts 19 (52.8%), 5 (13.9%), 12 (33.3%).]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis • Looking for patterns that are • valid • novel • useful • understandable

Gene Expression Analysis • Classifying gene expression profiles • find stable differentially expressed genes • find significant gene groups • derive coordinated gene expression

Medical Record & Gene Expression Analysis Results • PCL, a novel “emerging pattern” method • Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks • Works well for gene expressions (Cancer Cell, March 2002, 1(2))

Behind the Scene: Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang, Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang, and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators….

Questions?

A More Detailed Account

What is Datamining?
[Figure: Jonathan’s blocks and Jessica’s blocks. Whose block is this?]
• Jonathan’s rules: Blue or Circle
• Jessica’s rules: All the rest

What is Datamining? Question: Can you explain how?

The Steps of Data Mining • Training data gathering • Signal generation • k-grams, colour, texture, domain know-how, ... • Signal selection • Entropy, χ², CFS, t-test, domain know-how... • Signal integration • SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG 80
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA 160
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA 240
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................ 80
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE 160
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE 240
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?
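To make the question concrete, the candidate ATGs can be listed programmatically. The sketch below is only an illustration: `find_atgs` is a hypothetical helper name, and the sequence is the first two 60-nt rows of the sample cDNA. In the annotation rows, the `i` marker sits under the second candidate found.

```python
def find_atgs(cdna):
    """Return the 0-based start index of every ATG in the sequence."""
    return [i for i in range(len(cdna) - 2) if cdna[i:i + 3] == "ATG"]

# First two 60-nt rows of the sample cDNA shown on the slide.
cdna = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
)

candidates = find_atgs(cdna)
print(candidates)  # every candidate is a possible TIS; a classifier must pick one
```

Each candidate ATG then becomes one sample (TIS or not TIS) for the data mining steps.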

Signal Generation • k-grams (i.e., k consecutive letters) • k = 1, 2, 3, 4, 5, … • Window size vs. fixed position • Up-stream, downstream vs. anywhere in window • In-frame vs. any frame
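A minimal sketch of k-gram signal generation around one candidate ATG, assuming a fixed window on each side. The function names, the window size of 99 nt, and the feature-key layout are assumptions made for illustration, not the original implementation.

```python
from collections import Counter

def kgrams(cdna, atg_pos, k, start, end, in_frame):
    """Count k-grams starting in [start, end); if in_frame, keep only
    starts whose offset from the ATG is a multiple of 3."""
    counts = Counter()
    for i in range(max(0, start), min(end, len(cdna) - k + 1)):
        if in_frame and (i - atg_pos) % 3 != 0:
            continue
        counts[cdna[i:i + k]] += 1
    return counts

def atg_features(cdna, atg_pos, k=3, window=99):
    """Up-stream and downstream k-gram counts, in-frame and any-frame."""
    regions = {
        "up": (atg_pos - window, atg_pos - k + 1),      # k-gram ends before the ATG
        "down": (atg_pos + 3, atg_pos + 3 + window),    # k-gram starts after the ATG
    }
    feats = {}
    for side, (start, end) in regions.items():
        for in_frame in (True, False):
            frame = "in-frame" if in_frame else "any-frame"
            for gram, n in kgrams(cdna, atg_pos, k, start, end, in_frame).items():
                feats[(side, frame, gram)] = n
    return feats

# Toy sequence (hypothetical): ATG at index 3, stop codon TAA in frame downstream.
features = atg_features("GCCATGGCTTAA", atg_pos=3)
print(features[("down", "in-frame", "TAA")])
```

Each key such as `("down", "in-frame", "TAA")` is one signal; its count is the value fed to signal selection.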

Too Many Signals • For each value of k, there are 4^k * 3 * 2 k-grams • If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features! • This is too many for most machine learning algorithms
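The count can be checked by evaluating the formula directly. In this sketch the factor 3 is read as up-stream / downstream / anywhere in window and the factor 2 as in-frame / any frame; that reading is an assumption based on the Signal Generation slide.

```python
def n_kgram_features(k):
    """4**k possible k-grams x 3 region variants x 2 frame variants."""
    return 4 ** k * 3 * 2

counts = [n_kgram_features(k) for k in range(1, 6)]
total = sum(counts)
print(counts, total)
```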

Signal Selection (Basic Idea) • Choose a signal w/ low intra-class distance • Choose a signal w/ high inter-class distance • Which of the following 3 signals is good?

Signal Selection (eg., t-statistics)
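The slide's formula did not survive in this transcript. As a stand-in, here is a sketch of one common form, the unequal-variance two-sample t-statistic, computed for a single signal; a larger value suggests better class separation. The function name and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(xs, ys):
    """|mean difference| scaled by the pooled standard error.
    xs, ys: one signal's values in class 1 and class 2 (at least 2 each)."""
    se = sqrt(variance(xs) / len(xs) + variance(ys) / len(ys))
    return abs(mean(xs) - mean(ys)) / se

# Toy values (hypothetical): the signal is clearly higher in the second class.
print(t_statistic([1, 2, 3], [4, 5, 6]))
```

Signals are then ranked by this score and the top few are kept.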

Signal Selection (eg., MIT-correlation)
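The slide's formula is missing from this transcript. The score usually called MIT correlation (the signal-to-noise statistic used in the MIT leukemia study) is (μ1 − μ2) / (σ1 + σ2) for one signal. A sketch, with a hypothetical function name and toy values:

```python
from statistics import mean, stdev

def mit_correlation(xs, ys):
    """Signal-to-noise score of one signal: (mu1 - mu2) / (sigma1 + sigma2).
    xs, ys: the signal's values in class 1 and class 2 (at least 2 each)."""
    return (mean(xs) - mean(ys)) / (stdev(xs) + stdev(ys))

# Toy values (hypothetical).
print(mit_correlation([1, 2, 3], [4, 5, 6]))
```

The sign says which class has the higher mean; the magnitude is what matters for ranking.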

Signal Selection (eg., 2)

Signal Selection (eg., CFS) • Instead of scoring individual signals, how about scoring a group of signals as a whole? • CFS • A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other • Homework: find a formula that captures the key idea of CFS above
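One published formula with this shape is the merit score from Hall's correlation-based feature selection. It is offered as a possible answer rather than the one the slide had in mind. Here S is a group of k signals, \bar{r}_{cf} is the mean signal-to-class correlation, and \bar{r}_{ff} is the mean signal-to-signal correlation within S:

```latex
\mathrm{Merit}(S) = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}
```

The numerator rewards groups whose signals correlate with the class; the denominator penalises groups whose signals correlate with each other.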

Sample k-grams Selected
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream TAA, TAG, TGA (stop codon)
• in-frame downstream CTG, GAC, GAG, and GCC (codon bias)

Signal Integration • kNN: given a test sample, find the k training samples that are most similar to it, and let the majority class win. • SVM: given training samples from two classes, determine a separating hyperplane that maximises the margin between the classes. • Naïve Bayes, ANN, C4.5, ...
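The kNN rule is short enough to sketch directly. Euclidean distance is assumed as the similarity measure, and the function name and toy training set are hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: dist(item[0], test_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical): two well-separated classes.
train = [
    ((0, 0), "neg"), ((0, 1), "neg"), ((1, 0), "neg"),
    ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos"),
]
print(knn_predict(train, (5.2, 5.1)))
```

An odd k avoids tied votes when there are two classes.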

Results (on Pedersen & Nielsen’s mRNA)

Acknowledgements • Roland Yap • Zeng Fanfan • A.G. Pedersen • H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle • Consider this scenario • Given classes C1 and C2 w/ explicit signals • Use χ² on C1 and C2 to select signals s1, s2, s3 • Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90% • Is the accuracy really 90%? • What can be wrong with this?

Phil Long’s Experiment • Let there be classes C1 and C2 w/ 100000 features having randomly generated values • Use χ² to select 20 features • Run k-fold x-validation on C1 and C2 w/ these 20 features • Expect: 50% accuracy • Get: 90% accuracy! • Lesson: choose features within each fold, using only that fold's training data
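A scaled-down sketch of this experiment, under several assumptions: 40 samples and 2000 random features instead of 100000, a t-score in place of the slide's selection statistic, and a nearest-centroid classifier. All names are hypothetical. The only difference between the two runs is whether the 20 features are chosen once on all data or again inside each fold.

```python
import random

def t_score(a, b):
    """Absolute two-sample t-statistic of one feature (class 0 vs class 1)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / (va / len(a) + vb / len(b)) ** 0.5

def top_features(X, y, rows, n_keep):
    """Indices of the n_keep best-scoring features, using only the given rows."""
    scored = []
    for j in range(len(X[0])):
        a = [X[r][j] for r in rows if y[r] == 0]
        b = [X[r][j] for r in rows if y[r] == 1]
        scored.append((t_score(a, b), j))
    return [j for _, j in sorted(scored, reverse=True)[:n_keep]]

def predict(X, y, train_rows, feats, row):
    """Nearest-centroid prediction for one row on the selected features."""
    dists = []
    for c in (0, 1):
        members = [r for r in train_rows if y[r] == c]
        d = 0.0
        for j in feats:
            centroid = sum(X[r][j] for r in members) / len(members)
            d += (X[row][j] - centroid) ** 2
        dists.append(d)
    return 0 if dists[0] <= dists[1] else 1

def cv_accuracy(X, y, n_folds, n_keep, select_per_fold):
    rows = list(range(len(X)))
    # The flawed protocol selects features once, with the test rows included.
    all_data_feats = None if select_per_fold else top_features(X, y, rows, n_keep)
    correct = 0
    for f in range(n_folds):
        test_rows = rows[f::n_folds]
        train_rows = [r for r in rows if r not in test_rows]
        feats = (top_features(X, y, train_rows, n_keep)
                 if select_per_fold else all_data_feats)
        correct += sum(predict(X, y, train_rows, feats, r) == y[r] for r in test_rows)
    return correct / len(rows)

random.seed(0)
n_samples, n_features = 40, 2000
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [i % 2 for i in range(n_samples)]          # labels carry no real signal
biased = cv_accuracy(X, y, n_folds=5, n_keep=20, select_per_fold=False)
honest = cv_accuracy(X, y, n_folds=5, n_keep=20, select_per_fold=True)
print(f"select once on all data: {biased:.2f}; select inside each fold: {honest:.2f}")
```

Since the data are pure noise, only the per-fold protocol should land near chance; any extra accuracy in the other run comes from the test samples having influenced feature selection.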

Apples vs Oranges • Consider this scenario: • Fanfan reported 89% accuracy on his TIS prediction method • Hatzigeorgiou reported 94% accuracy on her TIS prediction method • So Hatzigeorgiou’s method is better • What is wrong with this conclusion?

Apples vs Oranges • Differences in datasets used: • Fanfan’s expt used Pedersen’s dataset • Hatzigeorgiou’s used her own dataset • Differences in counting: • Fanfan’s expt was on a per ATG basis • Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis • When Fanfan ran his method on the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!
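A sketch of why the two counting schemes are not comparable. It assumes the scanning rule means: scan from the 5' end and take the first ATG predicted positive as the TIS. The data are invented toy predictions, not results from either study, and the function names are hypothetical.

```python
def per_atg_accuracy(cdnas):
    """Every candidate ATG counts as one prediction (TIS or not TIS)."""
    total = correct = 0
    for preds, true_idx in cdnas:
        for i, predicted_tis in enumerate(preds):
            correct += predicted_tis == (i == true_idx)
            total += 1
    return correct / total

def per_cdna_accuracy(cdnas):
    """Scanning rule: the first ATG predicted positive is taken as the TIS,
    and each cDNA counts as one prediction."""
    correct = 0
    for preds, true_idx in cdnas:
        first = next((i for i, p in enumerate(preds) if p), None)
        correct += first == true_idx
    return correct / len(cdnas)

# Toy predictions (hypothetical): per-ATG flags in 5'-to-3' order,
# paired with the index of the true TIS among the candidates.
cdnas = [
    ([False, True, False, True], 1),   # late false positive is ignored by scanning
    ([True, False, False], 1),         # early false positive hides the true TIS
]
print(per_atg_accuracy(cdnas), per_cdna_accuracy(cdnas))
```

The same predictions give different accuracy figures under the two schemes, so numbers reported under different schemes cannot be ranked against each other.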

Questions?
