
Gone Fishing: An Introduction to Gene Finding Methods

Gone Fishing: An Introduction to Gene Finding Methods. Jarek Meller, Biomedical Informatics, CHRF. Additional materials for those who missed the Intro to Functional Genomics course.



  1. Gone Fishing: An Introduction to Gene Finding Methods. Jarek Meller, Biomedical Informatics, CHRF. Additional materials for those who missed the Intro to Functional Genomics course. Introduction to bioinformatics

  2. A couple of definitions:
     • A short history of genes: from “hereditary basis for traits” to “one gene – one polypeptide”
     • Modern definition of the gene: “a complete chromosomal segment responsible for making a functional product”
     • Codon: a triplet of nucleotides encoding an amino acid
     • Open Reading Frame (ORF): a string of codons bounded by start and stop signals (codons)
     • Pseudogene: a potential gene with an impaired ability to make a viable transcription (or translation) product

  3. The Canonical Structure of Eukaryotic Genes:
     [Figure: 5’ – Pr (…TATA…) – 5’UTR – E1 – I1 – E2 – I2 – E3 – 3’UTR (…AATAAA…) – polyA – 3’. Detail: ATG… GT… (donor) – intron – …AG (acceptor) – E2]
     Eukaryotic genes are in general neither contiguous nor continuous: coding regions are typically split into a number of coding fragments (exons), separated by non-coding intervening fragments known as introns.

  4. Motifs and Processes:
     TATA – TBP – transcription initiation
     AATAAA – poly-A polymerase – poly-A tail attachment (pre-mRNA processing)
     GT … AG – spliceosome complex – splicing (pre-mRNA processing)
     ATG – ribosome complex – translation initiation
     TGA, TAA, TAG – ribosome complex – translation termination

  5. From Transcription to pre-mRNA Processing to Splicing to Translation:
     [Figure: processed mRNA with 5’ CAP, 5’ UTR, 3’ UTR and poly-A tail (AAA…AA)]

  6. Finding Genes in Prokaryotic Genomes:
          AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
     f0   AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAa
     f1   aAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
     f2   aaAATGGGGGTGGGTGATGAGAGACTTAGATGAATaa
     • A simple algorithm – for each reading frame on both strands:
        • Find start (ATG) and stop (TGA, TAG, TAA) codons
        • Find sufficiently long (threshold) Open Reading Frames (ORFs)
        • For each ORF compute a “coding potential”, e.g. using codon usage
        • ORFs with a sufficiently high score become candidate genes
     • Refinements: alternative coding measures, homology, regulatory motifs
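The first two steps of the simple algorithm (scanning all six reading frames for sufficiently long ORFs) can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: the function names and the `min_codons` threshold are assumptions made for the example, the first in-frame ATG is taken as the start, and the coding-potential scoring step is left out.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Example sequence from the slide.
SEQ = "AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan three frames on each strand for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, orf); start and end are
    0-based, end-exclusive coordinates on the scanned strand.
    min_codons is a hypothetical length threshold (stop codon included).
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first in-frame ATG, if any
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


for orf in find_orfs(SEQ):
    print(orf)
```

Each reported ORF would then be passed to a coding measure, and only ORFs scoring above a second threshold would be kept as candidate genes.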

  7. Finding Genes in Eukaryotic Genomes:
          AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA
     g1   aaaATGGGGgtgggtgatgagagACTTAGatgaataa   Met Gly Thr
     g2   aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA   Met Gly Val Asp Leu Asp Glu (the Asp codon GAC is split by the intron: G|AC)
     A legal parse (candidate gene) must have a single ORF spanning all coding regions from the start to the stop codon.
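The legality condition can be checked mechanically. The sketch below assumes the slide's notation, in which upper-case letters mark coding (exon) bases and lower-case letters mark introns and flanking sequence; the function names are hypothetical. It splices out the lower-case bases, checks that what remains is a single ORF, and translates it with the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}

# The two candidate parses from the slide.
G1 = "aaaATGGGGgtgggtgatgagagACTTAGatgaataa"
G2 = "aaaATGGGGGTGGgtgatgagagACTTAGATGAATAA"


def spliced_cds(parse):
    """Concatenate the upper-case (coding) bases of a parse."""
    return "".join(ch for ch in parse if ch.isupper())


def is_legal_parse(parse):
    """True if the spliced coding sequence is one ORF: ATG ... stop,
    length a multiple of three, no internal in-frame stop codon."""
    cds = spliced_cds(parse)
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (codons[0] == "ATG"
            and codons[-1] in STOPS
            and not any(c in STOPS for c in codons[1:-1]))


def translate(parse):
    """Translate the spliced coding sequence up to the first stop codon."""
    cds = spliced_cds(parse)
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


for name, parse in (("g1", G1), ("g2", G2)):
    print(name, is_legal_parse(parse), translate(parse))
```

A fuller check would also require each intron to begin with GT and end with AG, as both introns on the slide do; that test is omitted here to keep the sketch short.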

  8. Further complications:
     • Alternative splicing
     • Alternative transcription initiation sites and start codons
     • Overlapping (and embedded) genes
     • Regulatory sites often separated by long intervening non-coding sequences
     • Pseudogenes

  9. Signals: a simple approach by using Weight Matrices
     Aligned example sites: GAGGTAAGC CAGGTCAGT TCGGTAATT ATGGTAACT TAGGTCATT
     [Equation not preserved in the transcript]
     where P(b,i) is the probability of observing base b at position i, derived from a set of true examples used for the training, and P(b) is the prior (background) probability of observing b in the data. The signal score is then a sum over individual scores for a window around the splice site.
     Refinements: conditional probabilities and Markov models
     Further refinements: supervised machine learning approaches, e.g. neural networks (NN)
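The transcript does not preserve the slide's formula. A common form for such a weight matrix is the log-odds score log(P(b,i) / P(b)), and the sketch below assumes that form. The uniform background of 0.25 per base and the pseudocount of 1 are further assumptions made so that the five example sites give finite scores; the function names are hypothetical.

```python
import math

BASES = "ACGT"

# The five aligned example sites from the slide (GT at positions 4-5).
DONOR_SITES = ["GAGGTAAGC", "CAGGTCAGT", "TCGGTAATT", "ATGGTAACT", "TAGGTCATT"]


def weight_matrix(sites, background=None, pseudocount=1.0):
    """Build one {base: log-odds score} dictionary per aligned position.

    P(b,i) is estimated from the training sites with a pseudocount
    (an assumption, to avoid log(0)); P(b) defaults to a uniform
    background, which is also an assumption.
    """
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    matrix = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        scores = {}
        for b in BASES:
            p = (column.count(b) + pseudocount) / (n + pseudocount * len(BASES))
            scores[b] = math.log(p / background[b])
        matrix.append(scores)
    return matrix


def signal_score(matrix, window):
    """Sum the per-position scores over a window of the same length."""
    return sum(column[b] for column, b in zip(matrix, window))


wm = weight_matrix(DONOR_SITES)
for candidate in ("GAGGTAAGC", "GAGCCAAGC"):
    print(candidate, round(signal_score(wm, candidate), 3))
```

A window containing the invariant GT scores higher than one without it. The Markov-model refinement mentioned on the slide would replace P(b,i) with a probability conditioned on the preceding base.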

  10. Coding measures: a simple codon usage model
      The decomposition of sequence S into codons C_k is reading-frame dependent, and all reading frames are considered for prediction (that is, the maximum score over all reading frames, with a sufficiently long sliding window, is taken). However, only the true reading frames are used to generate the probability of each codon (see Codon Usage Table) in the training set of true exons. The background probabilities, in turn, may be computed from all the sequences (including introns) in the training set, taking into account all the reading frames.
      Refinement: use homology and splice alignments
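The slide's scoring formula is not preserved in the transcript, so the sketch below assumes a log-likelihood ratio of codon probabilities, averaged per codon and maximised over the three reading frames of a window. The toy exon and intron sequences, the pseudocount and all function names are hypothetical and serve only to make the example runnable.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

# Hypothetical toy training data; real training sets are far larger.
TOY_EXONS = ["ATGGCCGCCGCCTAA", "ATGGCCGAGGCCTGA"]  # given in their true frame
TOY_INTRONS = ["GTAAGTTTTTTTTTTCAG"]


def codons(seq, frame=0):
    """Split seq into complete codons starting at the given frame offset."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_probs(seqs, frames, pseudocount=1.0):
    """Estimate codon probabilities from seqs, counting the given frames."""
    counts = Counter()
    for s in seqs:
        for f in frames:
            counts.update(codons(s, f))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def coding_score(window, coding, background):
    """Maximum over the three frames of the mean per-codon log-odds score."""
    best = -math.inf
    for frame in range(3):
        cs = codons(window, frame)
        if cs:
            score = sum(math.log(coding[c] / background[c]) for c in cs) / len(cs)
            best = max(best, score)
    return best


# Coding model: true frame of the exons only.
coding = codon_probs(TOY_EXONS, frames=(0,))
# Background model: all sequences, all three frames.
background = codon_probs(TOY_EXONS + TOY_INTRONS, frames=(0, 1, 2))

for window in ("GCCGCCGCC", "TTTTTTTTT"):
    print(window, round(coding_score(window, coding, background), 3))
```

Averaging per codon is a design choice that keeps scores comparable across windows of different length; summing the log ratios over a fixed-length sliding window would serve equally well.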

  11. [Figure, after R. Guigó: sliding window of length 120 bp, human beta-globin gene]

  12. Combining Sites and Coding Statistics
      • Variety of approaches proposed, e.g. MORGAN, FGENES, GeneID, GRAIL
      • The dynamic programming framework: find the best legal parse up to position n, given the best scoring and consistent parses up to position n-1 (analogy to sequence alignment)
      • Hidden Markov Model statistical learning framework for gene finding
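The dynamic programming idea can be illustrated with a deliberately simplified exon-chaining sketch. It assumes that candidate exons have already been given a combined site-plus-coding score, and it only enforces that chosen exons do not overlap; reading-frame consistency between exons, which a real legal parse requires, is ignored here. This is not the algorithm of any of the tools named above, and the function name and input format are hypothetical.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Pick the highest-scoring set of non-overlapping candidate exons.

    exons: list of (start, end, score), with end exclusive.
    Returns (total_score, chain), chain sorted by position.
    best[k] holds the best score achievable using only the first k
    exons in order of increasing end coordinate.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = [0.0] * (len(exons) + 1)
    prev = [0] * (len(exons) + 1)      # index to jump back to if exon k is taken
    taken = [False] * (len(exons) + 1)
    for k, (start, end, score) in enumerate(exons, 1):
        # Number of earlier exons that end at or before this exon's start.
        j = bisect_right(ends, start, 0, k - 1)
        prev[k] = j
        if best[j] + score > best[k - 1]:
            best[k] = best[j] + score
            taken[k] = True
        else:
            best[k] = best[k - 1]
    # Trace back the chosen exons.
    chain = []
    k = len(exons)
    while k > 0:
        if taken[k]:
            chain.append(exons[k - 1])
            k = prev[k]
        else:
            k -= 1
    return best[-1], chain[::-1]


candidates = [(0, 10, 2.0), (5, 20, 5.0), (12, 30, 2.5), (22, 40, 3.0)]
print(best_exon_chain(candidates))
```

As in sequence alignment, the best solution for a prefix is built from the best solutions for shorter prefixes, so each candidate exon is examined once after sorting.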

  13. Problems and assignments:
      • Use a eukaryotic genomic sequence from GenBank, longer than 20 kb, to estimate the frequency of putative donor and acceptor sites
      • Use true splice sites in your sequence to derive 0-th and first order Markov models (weight matrices)
      • Compare the results of the two models for false sites in the sequence
      • Consider spliced alignments against protein (or cDNA) sequence databases as a method to detect coding sequences. What would be the role of the six reading frames in such an exercise?
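As a possible starting point for the first assignment, putative donor (GT) and acceptor (AG) sites can be counted as plain dinucleotide occurrences. The sketch below scans the forward strand only and uses the short example sequence from the earlier slides in place of a real GenBank record; the function names are hypothetical.

```python
# Short example sequence from the earlier slides; a real exercise would
# load a genomic record of more than 20 kb instead.
SEQ = "AAAATGGGGGTGGGTGATGAGAGACTTAGATGAATAA"


def find_sites(seq, motif):
    """Return 0-based start positions of all (overlapping) motif matches."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]


def site_frequency(seq, motif):
    """Fraction of possible positions at which the motif occurs."""
    positions = len(seq) - len(motif) + 1
    return len(find_sites(seq, motif)) / positions if positions > 0 else 0.0


for label, motif in (("donor GT", "GT"), ("acceptor AG", "AG")):
    print(label, find_sites(SEQ, motif), round(site_frequency(SEQ, motif), 3))
```

Comparing these raw counts with the number of true splice sites shows how many false sites a weight matrix or Markov model has to reject.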
