
Gene Finding


Presentation Transcript


  1. Gene Finding Charles Yan

  2. Gene Finding Content sensors • Extrinsic content sensors • Compare with protein sequences • Compare with cDNA and ESTs • Genomic comparisons • Intrinsic content sensors • Prediction methods Signal sensors

  3. Intrinsic content sensors • Originally, intrinsic content sensors were defined for prokaryotic genomes. • In such genomes, only two types of regions are usually considered: the regions that code for a protein and will be translated, and intergenic regions.

  4. Intrinsic content sensors • Since coding regions will be translated, they are characterized by the fact that three successive bases in the correct frame define a codon which, using the genetic code rules, will be translated into a specific amino acid in the final protein.

  5. The Genetic Code

  6. Intrinsic content sensors • In prokaryotic sequences, genes define (long) uninterrupted coding regions that must not contain stop codons. • Therefore, the simplest approach for finding potential coding sequences is to look for sufficiently long open reading frames (ORFs), defined as sequences not containing stops, i.e. as sequences between a start and a stop codon.
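
The ORF search described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_orfs` and the `min_codons` threshold are hypothetical, only the forward strand is scanned, ATG is taken as the only start codon, and the standard stop codons TAA, TAG and TGA are assumed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"                  # alternative starts are ignored here


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the ATG, end is the index just past the stop
    codon, and min_codons is the minimum number of codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("CCATGAAATTTGGGTAACC"):
        print(start, end, frame)
```

A real prokaryotic gene finder would also scan the reverse complement and use a much larger length threshold; both are left out to keep the sketch short.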

  7. Intrinsic Content Sensors

  8. Intrinsic Content Sensors In eukaryotic sequences, however, the translated regions may be very short and the absence of stop codons becomes meaningless.

  9. Intrinsic Content Sensors Several other measures have therefore been defined that try to more finely characterize the fact that a sequence is "coding" for a protein: • Nucleotide composition and especially (G+C) content (introns being more A/T-rich than exons, especially in plants) • Codon composition • Hexamer frequency
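
As a small illustration of the first measure, (G+C) content can be computed over a sliding window. The function names and the window parameters below are assumptions made for this sketch, not part of the original slides.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sliding_gc(seq, window, step=1):
    """(G+C) content of each window of the given length along seq."""
    return [
        gc_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(sliding_gc("ATATATGCGCGCGCATATAT", window=6, step=2))
```

Windows with a markedly different (G+C) content from their neighbours are only a weak hint of a change in region type, which is why this measure is normally combined with the other content sensors.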

  10. Codon Composition In random DNA, Leucine : Alanine : Tryptophan = 6 : 4 : 1 (the number of codons that encode each amino acid in the standard genetic code)
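
The 6 : 4 : 1 ratio follows directly from the standard genetic code, and can be checked by counting codons per amino acid. The sketch below builds the standard table from the usual compact 64-letter string (bases ordered T, C, A, G, with `*` for stop); the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each codon to its one-letter amino acid ('*' marks a stop codon).
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each amino acid.
codons_per_aa = Counter(GENETIC_CODE.values())

if __name__ == "__main__":
    for aa in "LAW":  # leucine, alanine, tryptophan
        print(aa, codons_per_aa[aa])
```

If all 64 codons were equally likely, each amino acid would appear in proportion to these counts, which is the baseline that real coding sequences are compared against.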

  11. Codon Composition

  12. Codon Composition • Compare to the background frequency
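
One common way to compare codon composition to a background is a log-odds score. The following is a minimal sketch, assuming codon frequencies are estimated from a set of known in-frame coding sequences and that the default background is uniform (1/64 per codon); the function names, the pseudocount and the uniform background are all assumptions of this example.

```python
import math
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = {codon: pseudocount for codon in ALL_CODONS}
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_log_odds(seq, coding_freq, background_freq=None):
    """Sum of log2(coding / background) over the codons of seq in frame 0."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            bg = background_freq[codon] if background_freq else 1.0 / 64
            score += math.log2(coding_freq[codon] / bg)
    return score


if __name__ == "__main__":
    freq = codon_frequencies(["GCTGCTGCTGCT"])  # toy training set
    print(codon_log_odds("GCTGCT", freq), codon_log_odds("TTTTTT", freq))
```

A positive score means the codons in the window are more typical of the training set than of the background; scoring all three frames and comparing them is a natural extension.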

  13. Hexamer Frequency Among the large variety of coding measures that have been tested, hexamer usage (i.e. usage of 6 nt long words) was shown in 1992 to be the most discriminative variable between coding and non-coding sequences.
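
A hexamer-based content score can be sketched as a log-likelihood ratio between a coding and a non-coding hexamer table. This is a simplified, frame-independent illustration: the function names, the pseudocount and the toy training sequences are assumptions, and real programs usually count hexamers per reading frame and train on large annotated sets.

```python
import math
from collections import Counter


def hexamer_counts(seqs):
    """Count all overlapping 6-nt words in a collection of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - 5):
            counts[seq[i:i + 6]] += 1
    return counts


def hexamer_score(seq, coding_counts, noncoding_counts, pseudocount=1.0):
    """Sum of log(P_coding / P_noncoding) over the hexamers of seq."""
    n_words = 4 ** 6  # number of possible hexamers over A, C, G, T
    coding_total = sum(coding_counts.values()) + pseudocount * n_words
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * n_words
    seq = seq.upper()
    score = 0.0
    for i in range(len(seq) - 5):
        word = seq[i:i + 6]
        # Counter returns 0 for unseen words, so the pseudocount applies.
        p_coding = (coding_counts[word] + pseudocount) / coding_total
        p_noncoding = (noncoding_counts[word] + pseudocount) / noncoding_total
        score += math.log(p_coding / p_noncoding)
    return score


if __name__ == "__main__":
    coding = hexamer_counts(["ATGGCCATGGCCATGGCC"])       # toy data
    noncoding = hexamer_counts(["TTTTTTTTTTTTAAAAAA"])    # toy data
    print(hexamer_score("ATGGCC", coding, noncoding))
```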

  14. Intrinsic Content Sensors • In general, most currently existing programs use two types of content sensors: one for coding sequences and one for non-coding sequences, i.e. introns, UTRs and intergenic regions. A few programs refine this by using a different model for the different types of non-coding regions (e.g. one model for introns, one for intergenic regions and an optional specific 3’- and 5’-UTR model in EuGene).

  15. Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors

  16. Signals • Transcription (transcription factor binding sites and TATA boxes) • Splicing (donor and acceptor sites and branch points) • Polyadenylation [poly(A) site] • Translation (initiation site, generally ATG with exceptions, and stop codons)

  17. Signal Sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction

  18. Splice site prediction • The basic and natural approach to finding a signal that may represent the presence of a functional site is to search for a match with a consensus sequence (with possible variations allowed), the consensus being determined from a multiple alignment of functionally related documented sequences. • e.g. for splice site prediction: SPLICEVIEW and SplicePredictor
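
Consensus matching with a limited number of allowed mismatches can be sketched as follows. The function name `consensus_hits`, the subset of IUPAC ambiguity codes, and the example donor-site consensus `MAGGTRAGT` are illustrative assumptions; this is not how SPLICEVIEW or SplicePredictor are implemented.

```python
# Subset of IUPAC nucleotide codes: each code lists the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "N": "ACGT",
}


def consensus_hits(seq, consensus, max_mismatches=0):
    """Return (position, mismatches) for windows matching the consensus."""
    seq = seq.upper()
    k = len(consensus)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Count positions where the base is not allowed by the code.
        mismatches = sum(
            1 for base, code in zip(window, consensus)
            if base not in IUPAC[code]
        )
        if mismatches <= max_mismatches:
            hits.append((i, mismatches))
    return hits


if __name__ == "__main__":
    # Illustrative donor-site consensus; exon/intron boundary after "MAG".
    print(consensus_hits("TTCAGGTAAGTCC", "MAGGTRAGT", max_mismatches=1))
```

The weakness of this approach is that every mismatch counts the same regardless of position, which is what weight matrices address.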

  19. Splice site prediction • A more flexible representation of signals is offered by the so-called positional weight matrices (PWMs), which indicate the probability that a given base appears at each position of the signal (again computed from a multiple alignment of functionally related sequences). • The PWM weights can also be optimized by a neural network method, e.g. NetPlantGene and NetGene2
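
A PWM can be estimated from aligned sites and used to score candidate windows. The sketch below assumes ungapped, equal-length, upper- or lower-case A/C/G/T sites, a pseudocount of 1 and a uniform background of 0.25; the function names and the four toy donor-like sites are made up for the example, and no neural-network optimization is attempted.

```python
import math


def build_pwm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Per-position base probabilities from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        total = len(column) + pseudocount * len(alphabet)
        pwm.append({
            base: (column.count(base) + pseudocount) / total
            for base in alphabet
        })
    return pwm


def pwm_log_odds(window, pwm, background=0.25):
    """Log2-odds of the window under the PWM versus a uniform background."""
    return sum(
        math.log2(column[base] / background)
        for base, column in zip(window.upper(), pwm)
    )


def best_pwm_hit(seq, pwm):
    """Return (score, position) of the best-scoring window in seq."""
    k = len(pwm)
    scores = [
        (pwm_log_odds(seq[i:i + k], pwm), i)
        for i in range(len(seq) - k + 1)
    ]
    return max(scores)


if __name__ == "__main__":
    sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]  # toy data
    pwm = build_pwm(sites)
    print(best_pwm_hit("CCCAGGTAAGTCCC", pwm))
```

In practice one would report every window above a score threshold rather than only the best one, since a sequence may contain several sites.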

  20. Splice site prediction • In order to capture possible dependencies between adjacent positions of a signal, one may use higher order Markov models or hidden Markov models. • e.g. VEIL, MORGAN, and NetGene2
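
The simplest model that captures dependencies between adjacent positions is a position-specific first-order Markov model, in which each base is conditioned on the previous one. The sketch below is an illustration of that idea only (the function names, pseudocount and toy sites are assumptions), and it is not the model used by VEIL, MORGAN or NetGene2.

```python
import math


def train_markov_signal(sites, pseudocount=1.0, alphabet="ACGT"):
    """Train a position-specific first-order Markov model on aligned sites.

    Returns (first, transitions): first[b] is P(b) at position 0, and
    transitions[i][(prev, cur)] is P(cur at i+1 | prev at i).
    """
    n_sites = len(sites)
    n_bases = len(alphabet)
    first = {
        b: (sum(1 for s in sites if s[0] == b) + pseudocount)
        / (n_sites + pseudocount * n_bases)
        for b in alphabet
    }
    transitions = []
    for pos in range(1, len(sites[0])):
        table = {}
        for prev in alphabet:
            prev_count = sum(1 for s in sites if s[pos - 1] == prev)
            for cur in alphabet:
                pair_count = sum(
                    1 for s in sites if s[pos - 1] == prev and s[pos] == cur
                )
                table[(prev, cur)] = (pair_count + pseudocount) / (
                    prev_count + pseudocount * n_bases
                )
        transitions.append(table)
    return first, transitions


def markov_log_prob(window, model):
    """Natural-log probability of the window under the trained model."""
    first, transitions = model
    log_p = math.log(first[window[0]])
    for pos, table in enumerate(transitions, start=1):
        log_p += math.log(table[(window[pos - 1], window[pos])])
    return log_p


if __name__ == "__main__":
    model = train_markov_signal(["AC", "AC", "GT", "GT"])  # toy data
    print(markov_log_prob("AC", model), markov_log_prob("AT", model))
```

With the toy sites above, a PWM would score "AC" and "AT" identically because each column is 50/50, whereas the first-order model prefers "AC", the combination actually observed.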

  21. Splice site prediction

  22. Splice site prediction When using splice site prediction programs, one ends up with a list of potential splice sites, from which various gene structures may be built. The main purpose of such programs is not to find the gene structure but to try to find the correct exon boundaries. They are thus very useful in addition to an exon or gene predictor in order to refine an existing gene structure.

  23. Signal Sensors HMMs have also been used to represent other types of signals, such as poly(A) sites and promoters. Promoter predictions deserve another chapter.

  24. Signal Sensors Another important signal to identify when trying to predict a coding sequence is the translation initiation codon. A few programs exist that are specifically dedicated to this problem, but most of them have rather limited accuracy, which may be related to the lack of proper learning sets for eukaryotic genomes.

  25. Gene Finding Content sensors • Extrinsic content sensors • Intrinsic content sensors Signal sensors • Splice site prediction • Promoter prediction • Poly(A) sites prediction • Translation initiation codon prediction Combining the evidence to predict gene structures
