Optimization Models and Methods for Computational Biology and Medicine

Optimization Models and Methods for Computational Biology and Medicine. Giuseppe Lancia, Università di Udine. Human Genome Project (1990): read and understand the human (and other species') genome (DNA)

  1. Optimization Models and Methods for Computational Biology and Medicine. Giuseppe Lancia, Università di Udine

  2. Human Genome Project (1990): read and understand the human (and other species') genome (DNA)
      Scientific and practical applications (medicine, agriculture, forensics, ...)
      Multi-million project involving several countries
      MOLECULAR BIOLOGY

  3. COMPUTER SCIENCE
      To meet the goal we need computers and programs to deal with problems such as
      • Huge data sets (billions of pieces of information to be analyzed)
      • Data interpretation (contradictory, erroneous or inconsistent data)
      • Data sharing and networks (online genomic data banks)

  4. Computational (molecular) Biology
      “Optimization problems arising in the analysis, interpretation and management of large sets of genomic data”
      • Combinatorics, Discrete Math
      • Combinatorial Optimization
      • Integer Programming
      • Complexity theory (Approximations and Hardness)
      • Graph theory
      • (but also Stringology, Databases, Neural Networks, ...)

  5. FIRST PHASE COMPLETION: HUMAN GENOME SEQUENCE - 2001
      World Consortium (universities and labs)
      Celera Genomics (Craig Venter)

  7. Computational Biology was born around the '80s-'90s
      • Algorithmic approaches (e.g. Dynamic Programming for alignment)
      • Computational complexity (e.g. NP-hardness of folding)
      • String-related problems, information retrieval, genomic databases, ...
      • ...
      → the field was dominated mostly by computer scientists

  8. Some NP-hard problems in C.B. are OPTIMIZATION PROBLEMS
      • These can be solved via mathematical programming techniques
      • INTEGER LINEAR PROGRAMMING
      • LAGRANGIAN RELAXATION
      • SEMIDEFINITE PROGRAMMING
      • QUADRATIC PROGRAMMING
      → O.R. people (and O.R. techniques) entered the field

  9. The most important approach of this type is probably Integer Linear Programming
      It allows the solution of NP-hard problems via specialized Branch and Bound algorithms
      For a minimization problem, the lower bound comes from the Linear Programming relaxation of the model (and can be computed in polynomial time)

  16. The I.P. approach
      1. Define integer (usually binary) variables
         There can be an exponential number of variables. We need to make them implicit → BRANCH-AND-PRICE
      2. Define linear constraints that feasible solutions must satisfy
         There can be an exponential number of constraints. We need to make them implicit → BRANCH-AND-CUT
      3. Define a linear objective function that an optimal solution must minimize (or maximize)
      4. Solve by Branch and Bound. The LP relaxation (remove the integrality requirements on the variables) gives the bound
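As a minimal sketch of step 4, the following Python code runs Branch and Bound on a tiny 0/1 knapsack instance. The knapsack model, the function name `knapsack_branch_and_bound` and the data are illustrative assumptions, not taken from the talk. Knapsack is convenient here because its LP relaxation can be solved greedily without an LP solver; since this example maximizes, the relaxation gives an upper bound (for a minimization model it would be a lower bound).

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximize sum(values[i] * x[i]) subject to
    sum(weights[i] * x[i]) <= capacity and x[i] in {0, 1}.
    Assumes all weights are positive."""
    # Sort by value/weight ratio so the LP relaxation can be solved greedily.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)
    best_value = 0
    best_choice = [0] * n

    def lp_bound(k, value, room):
        # LP relaxation on the undecided items k..n-1: take whole items
        # while they fit, then a fraction of the first one that does not.
        bound = value
        for j in range(k, n):
            if w[j] <= room:
                room -= w[j]
                bound += v[j]
            else:
                bound += v[j] * room / w[j]
                break
        return bound

    def branch(k, value, room, choice):
        nonlocal best_value, best_choice
        if value > best_value:
            # A partial choice padded with zeros is always feasible.
            best_value, best_choice = value, choice + [0] * (n - k)
        if k == n:
            return
        if lp_bound(k, value, room) <= best_value:
            return  # prune: the relaxation cannot beat the incumbent
        if w[k] <= room:
            branch(k + 1, value + v[k], room - w[k], choice + [1])  # x_k = 1
        branch(k + 1, value, room, choice + [0])                    # x_k = 0

    branch(0, 0, capacity, [])
    chosen = sorted(order[j] for j in range(n) if best_choice[j])
    return best_value, chosen


print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))
# -> (220, [1, 2])
```

Real I.P. solvers follow the same pattern, but obtain the bound by solving a general LP at each node rather than using a problem-specific shortcut.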

  17. Some Integer Programming models in C.B.
      Haplotyping
      • Clark's rule (Gusfield)
      • Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower)
      • Fragment assembly (Lancia+Bafna+Schwartz+Istrail)
      Protein folding
      • Energy potentials (Wagner+Meller+Elber)
      • Folding ab initio (Carr+Hart+Greenberg)
      • Threading (RAPTOR, Xu+Li+Kim+Xu)
      • Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller)
      • Fold comparison (Carr+Lancia+Istrail+Walenz, Caprara+Lancia)
      Sequence alignment and consensus
      • Lenhof+Vingron+Reinert
      • Althaus+Caprara+Lenhof+Reinert
      • Fischetti+Lancia+Serafini
      • Meneses+Lu+Oliveira+Pardalos
      Physical mapping
      • Alizadeh+Karp+Weisser+Zweig
      Genome rearrangements
      • Caprara+Lancia

  18. We'll see some examples (same list of models as on the previous slide)

  19. BIOLOGY 101

  20. A genome is a long string over the DNA alphabet {A,C,G,T}
      In humans it is some 3,000,000,000 letters long
      DNA is responsible for our diversity as well as our similarity
      Small changes in a genome can make a big difference, like from... to...

  21. Eukaryotic diploid organisms
      Figure labels: CELL, Nucleus, Chromosomes (pairs), with the two paired strands TCATCGA / AGTAGCT

  24. THE CENTRAL DOGMA OF MOLECULAR BIOLOGY
      A GENE (introns and exons):
        attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta
      exons:
        attag|ctatatcgttgatg|tatatgcta|cga|aatta
      codon triplets:
        att agc tat atc gtt gat gta tat gct acg aaa tta
      amino acids (a PROTEIN):
        R N C A S S F C W Y Q V
      The protein folds to a 3D shape to perform its function
      CENTRAL DOGMA: 1 gene → 1 protein
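The path from exons to codons to amino acids can be sketched in a few lines of Python. This is an illustration under stated assumptions: the helper names `splice` and `translate` are hypothetical, the standard genetic code is applied directly to the DNA coding strand (skipping the mRNA step), and the amino-acid letters printed on the slide appear to be schematic, so the output below is not expected to match them.

```python
# Standard genetic code, in the conventional T, C, A, G ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"   # codons starting with T ('*' = stop)
               "LLLLPPPPHHQQRRRR"   # codons starting with C
               "IIIMTTTTNNKKSSRR"   # codons starting with A
               "VVVVAAAADDEEGGGG")  # codons starting with G
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(exons_with_bars):
    """Join exons written as 'exon1|exon2|...' into one coding sequence."""
    return exons_with_bars.replace("|", "")


def translate(dna):
    """Translate a coding sequence codon by codon (one-letter amino acids).
    Trailing bases that do not fill a whole codon are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


coding = splice("attag|ctatatcgttgatg|tatatgcta|cga|aatta")
print(coding)             # attagctatatcgttgatgtatatgctacgaaatta
print(translate(coding))  # ISYIVDVYATKL
```

The spliced sequence splits into exactly the twelve codon triplets listed on the slide.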

  29. From DNA to strings (i.e. reading our genome): shotgun sequencing
      ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
      amplification
      ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
      ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
      ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
      ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
      fragmentation
      ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA
      sequencing
      TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA
      assembly

  30. assembly
      ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
      ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
      ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
      ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
      We need to merge the fragments in order to retrieve the original DNA sequence
      - 50,000,000 fragments
      - 1000 characters each...
      We had better use computers!
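The amplification and fragmentation steps that produce the input of the assembly problem can be mimicked with a toy simulation. This is only a sketch: the function name `shotgun_fragments`, the fragment-length range and the use of a seeded pseudo-random generator are assumptions made for illustration, and real sequencing also introduces read errors and reverse-complement reads, which are ignored here.

```python
import random


def shotgun_fragments(genome, copies=4, min_len=5, max_len=10, seed=0):
    """Cut `copies` identical copies of `genome` into consecutive pieces of
    random length and return the pieces in shuffled order."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    fragments = []
    for _ in range(copies):    # "amplification": several identical copies
        pos = 0
        while pos < len(genome):  # "fragmentation": random cut points
            length = rng.randint(min_len, max_len)
            fragments.append(genome[pos:pos + length])  # last piece may be shorter
            pos += length
    rng.shuffle(fragments)     # the original positions are lost
    return fragments


GENOME = "ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA"
reads = shotgun_fragments(GENOME)
print(len(reads), "fragments, for example:", reads[:5])
```

Because each copy is cut at different points, fragments coming from different copies overlap, and these overlaps are what assembly exploits.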

  31. Understanding the problem
      - Take 10 copies of «Corriere della Sera»
      - Cut each into tiny pieces (1 cm²)
      - Put the pieces in a bag and shuffle
      - Grab five handfuls of pieces and throw them away
      Problem: retrieve the newspaper from the remaining pieces
      Difficulties: repeated words (e.g., «ha», «dopo», «quando», «governo»...)
      • Let's make the problem harder:
      - Ultra-tiny pieces (1 mm²)
      - It is not the Corriere della Sera, but the Treccani encyclopedia (20 volumes)
      - It is written in Chinese!!
      Still, the problem would be easier than sequencing the human genome

  32. Assembly
      • Repeated words and missing words create problems that must be solved by sophisticated programs, based on statistics and mathematical models.
      • The basic underlying problem (notwithstanding the above complications) is called the Shortest Superstring Problem (SSP)

  33. Shortest superstring:
      • Given a set of strings s1, ..., sn, find a shortest string s that contains each si as a substring.

  34. Shortest superstring:
      caneneroneonerereneneon
      reneroneoncaneonere -> 19
      caneronereneonere -> 17
      caneroneonerene -> 15
      acct cattgt gtgcca cctg
      cattgtgccacctg

  42. Shortest superstring:
      caneneroneonerereneneon
      reneroneoncaneonere -> 19
      caneronereneonere -> 17
      caneroneonerene -> 15
      acctcattgtgtgccacctg
      cattgtgccacctg

  43. Shortest superstring:
      caneneroneonerereneneon
      reneroneoncaneonere -> 19
      caneronereneonere -> 17
      caneroneonerene -> 15
      acct cattgt gtgcca cctg
      cattgtgccacctg
      The space of potential (non-redundant) solutions has size O(n!)
      The problem is NP-hard
      There is an effective greedy algorithm
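The O(n!) solution space can be made concrete with a brute-force search: try every ordering of the strings, merge consecutive strings with maximum overlap, and keep the shortest result. This is a sketch for tiny inputs only; the function names are hypothetical, and it assumes that no input string is a substring of another (true for the four DNA strings of the slide).

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_in_order(strings):
    """Concatenate the strings in the given order, reusing overlaps."""
    result = strings[0]
    for s in strings[1:]:
        result += s[overlap(result, s):]
    return result


def shortest_superstring_bruteforce(strings):
    """Enumerate all n! orderings and return a shortest merged string."""
    best = None
    for order in permutations(strings):
        candidate = merge_in_order(order)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best


fragments = ["acct", "cattgt", "gtgcca", "cctg"]
solution = shortest_superstring_bruteforce(fragments)
print(solution, len(solution))  # a superstring of length 14
```

With 4 strings there are only 24 orderings; with the 50,000,000 fragments mentioned earlier, exhaustive enumeration is out of the question.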

  48. Greedy algorithm:
      • Merge the two strings with maximum overlap. Replace them with their fusion.
      • Repeat until only one string is left.
      caneneroneonerereneneon
      neronere
      neronerene
      neronereneon
      caneronereneon

  49. Greedy algorithm:
      • In this case it found the optimum, but there is no guarantee
      • There is, however, a guarantee on how far it can be from the optimum: GREEDY has been proved to be a 3.5-APPROX ALGORITHM, i.e. v(GREEDY) <= 3.5 v(OPT) (Kaplan+Shafrir); the 2.5-APPROX ALGORITHM of Sweedyk is a different algorithm
      • OPEN PROBLEM (a conjecture that has stood for more than 20 years): prove that GREEDY is a 2-APPROX ALGORITHM
      For more info see http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf
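A minimal Python sketch of the greedy merging rule follows. The function names are hypothetical, ties between equally good pairs are broken by scan order (an arbitrary choice of this sketch), and strings contained in other strings are dropped up front. It is run on the four DNA strings of the shortest-superstring example.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair with maximum overlap until one string is left."""
    unique = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # Strings contained in another string need no separate treatment.
    pool = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # first pair found wins ties
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


print(greedy_superstring(["acct", "cattgt", "gtgcca", "cctg"]))
# -> cattgtgccacctg
```

On this instance the greedy result has length 14, which is optimal here; on other instances it can be longer than the optimum.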

  50. Sequence Alignments
      • Sequences evolve and change
      • E.g.: deletion, insertion, mutation
        attcgattgat  attcggat  (deletion)
        attcgattgat  attcggat  (insertion)
        attcgatgcg  (mutation)
      Given two genomic sequences (e.g., man and mouse) we want to compare them
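One simple way to compare two sequences under exactly these three events is the unit-cost edit distance, computed by dynamic programming. The sketch below is an assumption about how the comparison could be set up, not the alignment model used in the talk: every deletion, insertion and mutation (substitution) costs 1, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions and substitutions
    needed to turn string a into string b."""
    # prev[j] = distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("attcgattgat", "attcggat"))  # -> 3 (delete "att")
```

The table has (len(a) + 1) × (len(b) + 1) cells, so the running time is proportional to the product of the two lengths.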
