Towards a model for -1 frameshift sites
Alain Denise 1,2, Michaël Bekaert 1, Laure Bidou 1, Guillemette Duchateau-Nguyen 1, Jean-Paul Forest 2, Christine Froidevaux 2, Isabelle Hatin 1, Jean-Pierre Rousset 1, Michel Termier 1
1 IGM (Institut de Génétique et Microbiologie), 2 LRI (Laboratoire de Recherche en Informatique), Université Paris-Sud, Orsay
Translation
mRNA: 5’ CAUAUGGAUUACAUGGUCUAAGAU 3’
Translation
The ribosome reads bases by triplets (codons), starting from a START codon.
5’ CAU AUG GAU UAC AUG GUC UAA GAU 3’
Translation
The ribosome synthesizes one amino acid per codon.
Translation
Synthesis continues until a STOP codon is read.
1 mRNA gives 1 protein.
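The reading process on the translation slides can be sketched in a few lines of Python. This is only an illustration: the function name `translate` is hypothetical, the codon table is the standard genetic code, and the scan simply starts at the first AUG, which is a simplification of real initiation.

```python
# Standard genetic code, built from the usual 64-letter table
# (bases ordered U, C, A, G at each codon position; "*" marks a STOP codon).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a STOP codon (hypothetical helper)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no START codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # STOP codon: synthesis ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


# The mRNA from the slides: AUG GAU UAC AUG GUC, then the STOP codon UAA.
print(translate("CAUAUGGAUUACAUGGUCUAAGAU"))  # MDYMV
```

One mRNA read in one phase yields exactly one protein here, which is the default behaviour that frameshifting departs from.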
Experimental fact • Some mRNAs encode two distinct proteins with the same 5’ end
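Programmed -1 frameshifting
Figure: one mRNA with two overlapping reading frames. ORF1a lies in the 0 phase (START0 to STOP0); ORF1b lies in the -1 phase and ends at STOP-1. Usual translation produces ORF1a; a -1 frameshift continues into ORF1b.
• Non-deterministic event
• 1 mRNA gives 2 distinct proteins with an accurate ratio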
Typical -1 frameshift site [Brierley, 1989]
Figure: 5’ AUG ... NNXXXY YYZ (slippery sequence), followed by a secondary structure made of stems S1 and S2 and loops L1, L’1 and L2 ... 3’. Other labels on the figure: P, SP.
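IBV frameshift site
Figure: 5’ AUG ... UAU UUA AAC (slippery sequence) GGGUAC, followed by a pseudoknot formed by stems S1 and S2 ... 3’.
S1 strands as drawn: UGACGAUGGGG and GCUG AUACCCC. S2 strands as drawn: UCCGAGC and AGGCUCGG. Other segments as drawn: GAAA, UUGC.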
Translation with frameshift
Figure: the ribosome translates the IBV site and reaches the pseudoknot.
0 phase: 5’ UAU UUA AAC GGG UAC ... 3’
-1 shift
-1 phase: 5’ UA UUU AAA CGG GUA CGG GGU AGC AGU 3’
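The two readings of the IBV site can be reproduced by cutting the same sequence into codons from two different offsets. This is a minimal sketch, not a model of the slipping ribosome: the helper name `codons_in_frame` is hypothetical, and offset 2 is used for the -1 phase only because the fragment shown on the slides starts with the two leftover bases "UA".

```python
from typing import List


def codons_in_frame(seq: str, offset: int) -> List[str]:
    """Cut seq into complete codons, starting at position offset (hypothetical helper)."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


# IBV fragment from the slides: slippery sequence, spacer, 5' side of stem 1.
IBV_SITE = "UAUUUAAACGGGUACGGGGUAGCAGU"

zero_phase = codons_in_frame(IBV_SITE, 0)
minus_one_phase = codons_in_frame(IBV_SITE, 2)

print(" ".join(zero_phase))       # UAU UUA AAC GGG UAC GGG GUA GCA
print(" ".join(minus_one_phase))  # UUU AAA CGG GUA CGG GGU AGC AGU
```

The second line matches the codons shown after the -1 shift, minus the leading "UA" that does not form a complete codon in this fragment.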
Goals • To improve the known model for viral frameshift sites • To identify new frameshift sites in viral and non-viral genomes
Our approach
Figure: diagram linking biological sequences, formal models, prediction tools, and in silico and in vivo validation, with arrows labelled "represent", "explain" and "predict". Applications to other genomes.
IBV frameshift site: spacer
5’ GGGUAC 3’
Spacer consensus
Virus           Spacer
HAST-1          UAC AAA
BEV             UGU UG
EAV             UGA GAG
HCV             GAG UC
IBV             GGG UAC
MHV             GGG UU
TGEV            GAG
RCNMV           UAG GC
BWYV            GGA GUG
PLRV            GGG CAA
BLV             UAA UAG A
FIV             UGG AAG GC
HIV-1           GGG AAG AU
HTLV-2          UCC UUA A
JSR             UGG GUG A
MMTV gag-pro    UUG UAA A
MMTV pro-pol    UGA U
RSV             UAG GGA
SRV-1           GGA CUG A
Consensus       UGG UAG A
                GAA GUA
Lab experiments
Figure: two reporter constructs.
• Test construct (-1 phase): pSV40, lacZ, FS signal, luc
• Control construct (0 phase): pSV40, lacZ, FS signal + N, luc
Labels on the figure: FS reporter, expression reporter.
Spacer: lab experiments
Construct        Spacer   Relative FS rate
wild-type IBV    GGGUA    100
U mutant         UGGUA    100
A mutant         AGGUA    55
C mutant         CGGUA    32
CC mutant        CCGUA    70
CCU mutant       CCUUA    49
Refining the model: Machine learning • To identify relevant properties that characterize FS sites • Disjunctive learning: not all sequences frameshift for the same reasons [Giedroc et al., 2000]
Annotating data: spacer
5’ GGGUAC 3’
Example of data: SP • SP = GGGUAC • number of A = 1; C = 1; G = 3; U = 1; • % of A = 17; C = 17; G = 50; U = 17; • first = G; • last = C;
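The spacer annotation can be derived mechanically from the sequence. The sketch below is an assumption about how such attributes might be computed, not the authors' tool: the function name `annotate` and the dictionary keys are hypothetical, and percentages are taken relative to the spacer length and rounded to the nearest integer.

```python
from collections import Counter
from typing import Any, Dict


def annotate(seq: str) -> Dict[str, Any]:
    """Compute simple composition attributes of an RNA segment (hypothetical helper)."""
    counts = Counter(seq)
    length = len(seq)
    return {
        "length": length,
        "count": {base: counts[base] for base in "ACGU"},
        "percent": {base: round(100 * counts[base] / length) for base in "ACGU"},
        "first": seq[0],
        "last": seq[-1],
    }


spacer = annotate("GGGUAC")
print(spacer["count"])    # {'A': 1, 'C': 1, 'G': 3, 'U': 1}
print(spacer["percent"])  # {'A': 17, 'C': 17, 'G': 50, 'U': 17}
print(spacer["first"], spacer["last"])  # G C
```

The same function could be applied to any other annotated region, such as one side of a stem.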
Annotating data: stem 1
Figure: the two strands of stem 1 as drawn on the IBV site: UGACGAUGGGG and GCUG AUACCCC.
Example of data: stem 1 • 5' side: GGGGUAGCAGU • 3' side: CCCCAUAGUCG • stability: -20.7 kcal/mol
Annotating data: full sequence
Figure: the full IBV frameshift site, 5’ U UUA AAC GGGUAC followed by the pseudoknot (stems S1 and S2) 3’.
Example of data: FS rate • FS rate = 22%
GloBo • Disjunctive learning algorithm • Suited to small amounts of data • Won the PTE challenge on analogous data
Example of rules
• If SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70 then FS rate 5%
• If %G in S1.5' bottom half 80 and %C in L1 45 then FS rate 5%
• If SP length 5 and S1.3' length 6 and %C in S1.3' 45 then FS rate 5%
• ...
Covering and prediction
If SP length 5 and number of G in S1.5’ bottom half 3 and number of G in S1.5’ 4 and %T in S2.5’ 30 and %G in S2.5’ 70 then FS rate 5%
• Covering of examples: 70%
• Examples predicted in test set: 80%
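The notion of covering can be illustrated with a small rule evaluator. Everything specific in this sketch is an assumption: the comparison operators are placeholders (the rules on the slides are shown here without their operators), the feature names are invented, the example data are made up for illustration, and this is not GloBo itself, only a way to check which examples satisfy a conjunction of conditions.

```python
import operator
from typing import Callable, Dict, List, Tuple

# A condition is (feature name, comparison, threshold); a rule is a conjunction of them.
Condition = Tuple[str, Callable[[float, float], bool], float]
Rule = List[Condition]

# Hypothetical rule: the operators (<=, >=) and feature names are assumptions.
RULE: Rule = [
    ("sp_length", operator.le, 5),
    ("s1_3p_length", operator.ge, 6),
    ("pct_c_s1_3p", operator.ge, 45),
]


def rule_covers(rule: Rule, example: Dict[str, float]) -> bool:
    """A rule covers an example when every one of its conditions holds."""
    return all(compare(example[name], threshold) for name, compare, threshold in rule)


def covering(rules: List[Rule], examples: List[Dict[str, float]]) -> float:
    """Percentage of examples covered by at least one rule (disjunction of rules)."""
    covered = sum(any(rule_covers(rule, ex) for rule in rules) for ex in examples)
    return 100.0 * covered / len(examples)


# Made-up examples, for illustration only (not experimental data).
EXAMPLES = [
    {"sp_length": 5, "s1_3p_length": 6, "pct_c_s1_3p": 50},  # covered
    {"sp_length": 6, "s1_3p_length": 6, "pct_c_s1_3p": 50},  # spacer too long
    {"sp_length": 5, "s1_3p_length": 7, "pct_c_s1_3p": 45},  # covered
    {"sp_length": 4, "s1_3p_length": 5, "pct_c_s1_3p": 60},  # stem side too short
]

print(covering([RULE], EXAMPLES))  # 50.0
```

Passing several rules to `covering` reflects the disjunctive setting: an example counts as covered as soon as any single rule applies to it.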
Is R1 relevant for frameshift?
Construct        Stem 1 5’-side   Relative FS rate   R1
wild-type IBV    GGGGU AUCAGU     100                yes
mutant 1         GGUCG AUCAGU     41                 yes
mutant 2         GGGGUUCUACA      55                 yes
mutant 3         GCUCG AUCAGU     36                 no
mutant 4         GCCCUAUCAGU      73                 no
Covering and prediction
If SP length 5 and S1.3' length 6 and %C in S1.3' 45 then FS rate 5%
• Covering of examples: 45%
• Examples predicted in test set: 40%
Conclusion • Spacer: • a correlation between primary sequence and FS rate has been established • systematic experimentation is ongoing
Conclusion
Figure: diagram linking biological sequences, formal models, prediction tools, and in silico and in vivo validation. Applications to other genomes.