Optimization Models and Methods for Computational Biology and Medicine
Giuseppe Lancia, Università di Udine

MOLECULAR BIOLOGY
Human Genome Project (1990): read and understand the human (and other species') genome (DNA)
• Scientific and practical applications (medicine, agriculture, forensics, ...)
• Multi-million project involving several countries

COMPUTER SCIENCE
To meet the goal we need computers and programs to deal with problems such as
• Huge data sets (billions of data items to be analyzed)
• Data interpretation (contradictory, erroneous or inconsistent data)
• Data sharing and networks (online genomic data banks)

Computational (molecular) Biology
"Optimization problems arising in the analysis, interpretation and management of large sets of genomic data"
• Combinatorics, Discrete Math
• Combinatorial Optimization
• Integer Programming
• Complexity theory (Approximations and Hardness)
• Graph theory
• (but also Stringology, Databases, Neural Networks, ...)

FIRST PHASE COMPLETION: HUMAN GENOME SEQUENCE - 2001
• World consortium of universities and labs
• Celera Genomics (Craig Venter)

Computational Biology was born around the '80s-'90s
• Algorithmic approaches (e.g., Dynamic Programming for alignment)
• Computational complexity (e.g., NP-hardness of folding)
• String-related problems, information retrieval, genomic databases, ...
-> the field was dominated mostly by computer scientists

Some NP-hard problems in C.B. are OPTIMIZATION PROBLEMS
These can be solved via mathematical programming techniques:
• INTEGER LINEAR PROGRAMMING
• LAGRANGIAN RELAXATION
• SEMIDEFINITE PROGRAMMING
• QUADRATIC PROGRAMMING
-> O.R. people (and O.R. techniques) entered the field

The most important approach of this type is probably Integer Linear Programming.
It allows the solution of NP-hard problems via specialized Branch and Bound algorithms.
For a minimization problem, the lower bound comes from the Linear Programming relaxation of the model (and can be computed in polynomial time).

The I.P. approach
1. Define integer (usually binary) variables.
   There can be an exponential number of variables. We need to make them implicit -> BRANCH-AND-PRICE
2. Define linear constraints that feasible solutions must satisfy.
   There can be an exponential number of constraints. We need to make them implicit -> BRANCH-AND-CUT
3. Define a linear objective function that an optimal solution must minimize (or maximize).
4. Solve by Branch and Bound. The LP relaxation (remove the integrality requirements on the variables) gives the bound.
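The four steps above can be sketched on a toy problem. The example below is not from the slides: it assumes a small 0/1 knapsack (binary variables, one linear constraint, linear objective to maximize), chosen only because its LP relaxation can be solved by a simple greedy fill, so no solver library is needed. Since this is a maximization problem, the relaxation gives an upper bound rather than a lower bound. The names `lp_relaxation` and `branch_and_bound` are hypothetical.

```python
def lp_relaxation(values, weights, capacity, fixed):
    """LP relaxation of the knapsack with some variables fixed by branching.

    Returns (bound, fractional_index, taken). The bound is None if the
    fixed choices are infeasible; fractional_index is None when the LP
    optimum happens to be integral (then `taken` lists the chosen items).
    Assumes strictly positive weights.
    """
    total, room, taken = 0, capacity, []
    for i, x in fixed.items():
        if x == 1:
            total += values[i]
            room -= weights[i]
            taken.append(i)
    if room < 0:
        return None, None, None
    # Greedy by value/weight ratio is optimal for the relaxed knapsack.
    free = sorted((i for i in range(len(values)) if i not in fixed),
                  key=lambda i: values[i] / weights[i], reverse=True)
    for i in free:
        if weights[i] <= room:
            total += values[i]
            room -= weights[i]
            taken.append(i)
        elif room > 0:
            # Item i enters only fractionally: the relaxed optimum is not integral.
            return total + values[i] * room / weights[i], i, None
        else:
            break
    return total, None, taken


def branch_and_bound(values, weights, capacity):
    best_value, best_items = 0, []
    stack = [{}]  # each node is a dict {variable index: fixed value}
    while stack:
        fixed = stack.pop()
        bound, frac, taken = lp_relaxation(values, weights, capacity, fixed)
        if bound is None or bound <= best_value:
            continue  # prune: infeasible, or cannot beat the incumbent
        if frac is None:
            best_value, best_items = bound, taken  # integral LP optimum
            continue
        stack.append({**fixed, frac: 0})  # branch on the fractional variable
        stack.append({**fixed, frac: 1})
    return best_value, sorted(best_items)


print(branch_and_bound([60, 100, 120], [10, 20, 30], 50))
```

Branch-and-price and branch-and-cut follow the same scheme, but generate variables or constraints on demand while solving each relaxation.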

Some Integer Programming models in C.B.
• Haplotyping
  - Clark's rule (Gusfield)
  - Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower)
  - Fragment assembly (Lancia+Bafna+Schwartz+Istrail)
• Protein folding
  - Energy potentials (Wagner+Meller+Elber)
  - Folding ab initio (Carr+Hart+Greenberg)
  - Threading (RAPTOR, Xu+Li+Kim+Xu)
  - Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller)
  - Fold comparison (Carr+Lancia+Istrail+Walenz, Caprara+Lancia)
• Sequence alignment and consensus
  - Lenhof+Vingron+Reinert
  - Althaus+Caprara+Lenhof+Reinert
  - Fischetti+Lancia+Serafini
  - Meneses+Lu+Oliveira+Pardalos
• Physical mapping
  - Alizadeh+Karp+Weisser+Zweig
• Genome rearrangements
  - Caprara+Lancia

We'll see some examples of these models.

BIOLOGY 101

• A genome is a long string over the DNA alphabet {A,C,G,T}
• In humans it is some 3,000,000,000 letters long
• DNA is responsible for our diversity as well as our similarity
• Small changes in a genome can make a big difference, like from... to...

Eukaryotic diploid organisms
[Figure labels: cell, nucleus, chromosomes (pairs), strands TCATCGA / AGTAGCT]

THE CENTRAL DOGMA OF MOLECULAR BIOLOGY
A GENE (introns and exons):
  attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta
  attag|ctatatcgttgatg|tatatgcta|cga|aatta
codon triplets:
  att agc tat atc gtt gat gta tat gct acg aaa tta
amino acids (a PROTEIN):
  R N C A S S F C W Y Q V
The protein folds to a 3D shape to perform its function
CENTRAL DOGMA: 1 gene -> 1 protein
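As a small illustration of the gene-to-codon step, the sketch below joins the exon pieces exactly as they appear on the slide and splits the result into triplets. The function names `splice` and `codons` are hypothetical; mapping each codon to an amino acid would additionally need the genetic code table, which is omitted here.

```python
def splice(exons):
    """Join the exon pieces (introns already removed) into one coding string."""
    return "".join(exons)


def codons(coding):
    """Split a coding string into consecutive triplets, dropping any incomplete tail."""
    usable = len(coding) - len(coding) % 3
    return [coding[i:i + 3] for i in range(0, usable, 3)]


# Exon pieces as written on the slide, separated there by '|'.
exons = "attag|ctatatcgttgatg|tatatgcta|cga|aatta".split("|")
coding = splice(exons)
print(coding)
print(" ".join(codons(coding)))
```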

From DNA to strings (i.e., reading our genome): Shotgun sequencing
  ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
amplification
  ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
  ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
  ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
  ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA
fragmentation
  ACTGA GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC TGAGCCTAG TAGAGATTTC TCCTAAAGAT CGCATAGATA
sequencing
  TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTC TAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA
assembly
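The amplification and fragmentation steps can be mimicked with a toy simulation. This is only a sketch under simplifying assumptions that are not in the slides: reads are error-free, come from one strand only, and fragment lengths are drawn uniformly at random. The function name `shotgun` and its parameters are hypothetical.

```python
import random


def shotgun(genome, copies=4, min_len=5, max_len=10, seed=0):
    """Cut several copies of the genome at random points and shuffle the pieces."""
    rng = random.Random(seed)
    fragments = []
    for _ in range(copies):              # amplification: work on identical copies
        pos = 0
        while pos < len(genome):         # fragmentation: random cut points
            length = rng.randint(min_len, max_len)
            fragments.append(genome[pos:pos + length])
            pos += length
    rng.shuffle(fragments)               # reads come back in no particular order
    return fragments


genome = "ACTGAGCCTAGAGATTTCTAGGCGTATCTATCTTACACTGCATCGATCGATCGATCGA"
reads = shotgun(genome)
print(len(reads), reads[:5])
```

Because each copy is cut at different points, fragments from different copies overlap, and these overlaps are what the assembly step exploits.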

assembly
  ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
  ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
  ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
  ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG
We need to merge the fragments in order to retrieve the original DNA sequence
- 50,000,000 fragments
- 1000 characters each...
We'd better use computers!

Understanding the problem
- Take 10 copies of «Corriere della Sera»
- Cut each into tiny pieces (1 cm²)
- Put the pieces in a bag and shuffle
- Grab five handfuls of pieces and throw them away
Problem: reconstruct the newspaper from the remaining pieces
Difficulties: repeated words (e.g., «ha», «dopo», «quando», «governo»...)
• Let's make the problem harder:
- Ultra-tiny pieces (1 mm²)
- It is not the Corriere della Sera, but the Treccani encyclopedia (20 volumes)
- It is written in Chinese!!
Still, the problem would be easier than sequencing the human genome

Assembly
• Repeated words and missing words create problems that must be solved by sophisticated programs, based on statistics and mathematical models.
• The basic underlying problem (setting aside the above complications) is called the Shortest Superstring Problem (SSP)

Shortest superstring:
• Given a set of strings s1, ..., sn, find a shortest string s that contains each si as a substring.

Shortest superstring:
Strings: cane, nerone, onere, rene, neon
  reneroneoncaneonere -> 19
  caneronereneonere -> 17
  caneroneonerene -> 15
Strings: acct, cattgt, gtgcca, cctg
  cattgtgccacctg
• The space of potential (non-redundant) solutions has size O(n!)
• The problem is NP-hard
• There is an effective greedy algorithm
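The O(n!) solution space corresponds to the possible orderings of the strings: once an order is fixed, consecutive strings are merged with the largest possible overlap. A brute-force sketch that tries every ordering is shown below. It is only practical for a handful of strings, and the function names are hypothetical.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a, b):
    """Shortest string that starts with a and contains b (appended with maximal overlap)."""
    if b in a:
        return a
    return a + b[overlap(a, b):]


def shortest_superstring(strings):
    """Try all n! orderings and keep the shortest merged string."""
    best = None
    for order in permutations(strings):
        s = ""
        for t in order:
            s = merge(s, t)
        if best is None or len(s) < len(best):
            best = s
    return best


words = ["cane", "nerone", "onere", "rene", "neon"]
result = shortest_superstring(words)
print(result, len(result))
```

On the five Italian words the shortest superstring has 14 characters, one fewer than the 15-character example listed on the slide.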

Greedy algorithm:
• Merge the two strings with maximum overlap. Replace them with their fusion.
• Repeat until only one string is left.
  cane, nerone, onere, rene, neon
  neronere
  neronerene
  neronereneon
  caneronereneon
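The two greedy steps can be sketched as follows. The tie-breaking rule used here (the newest merged string goes to the front of the pool, and the first pair found with maximum overlap wins) is an arbitrary assumption, not something stated in the slides. With it, the merges on the strings cane, nerone, onere, rene, neon come out as neronere, neronerene, neronereneon, caneronereneon. A different tie-breaking rule can produce a different, possibly longer, result.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly fuse the pair with maximum overlap.

    Assumes no input string is a substring of another.
    Returns the final superstring and the list of intermediate fusions.
    """
    pool = list(strings)
    trace = []
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:  # strict: the first maximum found wins ties
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        rest = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool = [merged] + rest  # newest fusion goes to the front
        trace.append(merged)
    return pool[0], trace


final, steps = greedy_superstring(["cane", "nerone", "onere", "rene", "neon"])
print(steps)
print(final, len(final))
```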

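Greedy algorithm:
• In this case it found the optimum, but there is no guarantee in general
• There are, however, worst-case guarantees: a 2.5-APPROX ALGORITHM for the problem is known (Sweedyk), although it is not GREEDY; for GREEDY itself the proven bounds are weaker, v(GREEDY) <= 4 v(OPT) (Blum et al.), later improved to 3.5 v(OPT) (Kaplan and Shafrir)
• OPEN PROBLEM (conjecture standing for more than 20 years): prove that GREEDY is a 2-APPROX ALGORITHM
For more info see http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf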

Sequence alignments
• Sequences evolve and change
• E.g.: deletion, insertion, mutation
  attcgattgat  attcggat  (deletion)
  attcgattgat  attcggat  (insertion)
  attcgatgcg             (mutation)
Given two genomic sequences (e.g., man and mouse) we want to compare them
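One simple way to compare two sequences in terms of these three operations is the edit distance, computed by dynamic programming. The sketch below assumes unit cost for every deletion, insertion and mutation. Real alignment scoring schemes are richer, and this is not presented as the method used later in the course.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions and mutations turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                          # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match, or mutate ca into cb
        prev = curr
    return prev[-1]


# Sequences from the slide: the second is the first with "att" deleted.
print(edit_distance("attcgattgat", "attcggat"))
```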
