
BIOINFORMATICS

BIOINFORMATICS. Ayesha M. Khan, Spring 2013. Gene prediction/gene finding. The vast amount of raw sequence data generated by advances in sequencing technology needs biological interpretation, known as ‘annotation’, to find genes and determine their functions. Protein coding genes


Presentation Transcript


  1. BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8

  2. GENE PREDICTION/GENE FINDING
     • The vast amount of raw sequence data generated by advances in sequencing technology needs biological interpretation.
     • This interpretation is known as ‘annotation’.
     • Its goal is to find genes and determine their functions.

  3. Protein coding genes
     • Prokaryotic
       • No introns, simpler regulatory features
     • Eukaryotic
       • Exon-intron structure
       • Complex regulatory features

  4. Coding sequence (CDS)
     • The actual region of DNA that is translated to form a protein.
     • In eukaryotes, the region between a gene's start and stop codons in genomic DNA may be interrupted by introns; the CDS refers only to those nucleotides that are divided into codons and actually translated into amino acids by the ribosomal translation machinery.
     • In prokaryotes the ORF and the CDS are the same.
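
The relationship between genomic DNA, exons and the CDS can be illustrated with a short Python sketch. The toy sequence, the exon coordinates and the function names `splice_cds` and `translate` are hypothetical and chosen only for illustration; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def splice_cds(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) into one CDS string."""
    return "".join(genomic[start:end] for start, end in exons)


def translate(cds):
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy eukaryotic gene (hypothetical): exon 1, an intron (GT...AG), exon 2.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAATAG"
cds = splice_cds(genomic, [(0, 6), (18, 24)])
print(cds, translate(cds))
```

In a prokaryotic gene the exon list would contain a single interval, so the ORF and the CDS coincide.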

  5. What is gene prediction?
     • Which region codes for a protein?
     • Which DNA strand is used to encode the gene?
     • Where does the gene start and end?
     • Where are the exon-intron boundaries in eukaryotes?
     • Where (optionally) are the regulatory sequences for that gene?
     The characterization of genomic features using computational and experimental methods is called gene prediction or annotation.

  6. Computational methods of gene prediction
     Computational gene finding is a process of:
     • Identifying common phenomena in known genes
     • Building a computational framework/model that can accurately describe these common phenomena
     • Using the model to scan uncharacterized sequence and identify regions that match the model, which become putative genes
     • Testing and validating the predictions
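
The final step, testing and validating, is often done by comparing predicted coding intervals against a trusted annotation at the nucleotide level. The following is a minimal sketch under that assumption; the function name `nucleotide_accuracy` and the interval convention (0-based, end-exclusive) are hypothetical, and the second value returned is precision, which the gene-finding literature often reports as "specificity".

```python
def nucleotide_accuracy(predicted, annotated):
    """Compare predicted and annotated coding intervals base by base.

    Both arguments are lists of (start, end) intervals, 0-based and
    end-exclusive. Returns (sensitivity, precision).
    """
    predicted_bases = set()
    for start, end in predicted:
        predicted_bases.update(range(start, end))

    annotated_bases = set()
    for start, end in annotated:
        annotated_bases.update(range(start, end))

    true_pos = len(predicted_bases & annotated_bases)
    false_pos = len(predicted_bases - annotated_bases)
    false_neg = len(annotated_bases - predicted_bases)

    # Guard against empty inputs to avoid division by zero.
    sensitivity = true_pos / (true_pos + false_neg) if annotated_bases else 0.0
    precision = true_pos / (true_pos + false_pos) if predicted_bases else 0.0
    return sensitivity, precision


# Toy example (hypothetical coordinates): the prediction overlaps half the gene.
print(nucleotide_accuracy([(0, 10)], [(5, 15)]))
```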

  7. Biological overview of ‘gene’
     • Gene: a segment of DNA that contains the information necessary to produce a functional product, usually a protein
     • Made of DNA (or RNA in some viruses)
     • Promoter: controls the activity of a gene
       • Core promoter: the minimal portion of the promoter required to initiate transcription properly
       • Proximal promoter: tends to contain the primary regulatory elements; serves as a binding site for specific transcription factors
     • Coding sequence: determines what the gene produces
     • ORF (open reading frame)
       • Usually starts with ATG (start codon), though not always
       • Terminates with TAA, TAG or TGA (stop codons)
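
The ORF definition above translates directly into a simple scan of all six reading frames. This is a minimal sketch, not a production gene finder: it assumes that every ORF starts with ATG (which, as noted, is not always true), and the function names and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Find ATG...stop ORFs in all six reading frames.

    Returns (strand, start, end, orf) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned. min_codons
    counts the codons before the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATGACC"))
```

Scanning both strands addresses the question of which strand encodes the gene; in eukaryotes such a scan would have to be applied to spliced sequence, since introns interrupt the reading frame in genomic DNA.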

  8. Structure of eukaryotic gene

  9. Methods of gene prediction

  10. Extrinsic/Homology Method
     • Based on sequence similarity between the query sequence and annotated genes present in databases.
     • Only about half of all genes can be found by homology to other known genes or proteins.
     • Based on the following principles:
       • Coding regions evolve more slowly than non-coding regions, so local sequence similarity can be used as a gene finder.
       • Homologous sequences reflect a common evolutionary origin and possibly a common gene structure.
     • Standard pairwise comparison methods can be used (BLAST or Smith-Waterman).
     • Gene syntax information (start/stop codons etc.) can be included.
     • Useful for confirming predictions inferred by other methods.
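
To make the pairwise comparison step concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. The scoring parameters (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions for illustration; real searches use substitution matrices and affine gap penalties, and the function returns only the best score, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero, which is what makes the alignment local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# Toy query and database sequences (hypothetical) that share the motif "ACG".
print(smith_waterman("TTACGTTT", "GGACGGG"))
```

A high local score between a genomic region and a known gene or protein-coding sequence is the kind of evidence the homology method relies on.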

  11. Intrinsic/Ab initio Method
     • Predicts genes based on statistical properties of the given DNA sequence.
     • Uses statistical patterns inside and outside gene regions, as well as typical patterns at their boundaries.
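
One simple way to capture statistical patterns inside and outside gene regions is to train one Markov chain on known coding sequence and another on non-coding sequence, then score a candidate region by its log-odds ratio. The sketch below assumes a first-order chain and tiny toy training sets; the function names, the pseudocount and the training sequences are all hypothetical, and real gene finders use higher-order, frame-aware models.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | current base)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in seqs:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq, coding, noncoding):
    """Sum of log2 ratios; positive values favour the coding model."""
    return sum(
        math.log2(coding[x][y] / noncoding[x][y]) for x, y in zip(seq, seq[1:])
    )


# Toy training data (hypothetical), not real coding or non-coding sequence.
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])
print(log_odds("GCGCGC", coding_model, noncoding_model))
print(log_odds("ATATAT", coding_model, noncoding_model))
```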

  12. Features for gene prediction in eukaryotes
     Two kinds of sensors are used: signal sensors and content sensors (extrinsic and intrinsic content sensors).
     • Signal sensors
       • Evaluate fixed-length features in DNA.
       • Signals: splice sites, start/stop codons, branch points, promoters and terminators of transcription, polyadenylation sites, ribosomal-binding sites, topoisomerase II binding sites, various transcription factor-binding sites etc.
       • These are measures that try to detect the presence of functional sites specific to a gene.
       • The basic signal sensor is a simple consensus sequence, or an expression that describes a consensus sequence along with its allowable variations.
       • Weight matrices are also used.
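
A weight matrix generalizes a consensus sequence by scoring each base at each position of a signal. The following is a minimal sketch of a log-odds position weight matrix, assuming a uniform background of 0.25 per base and a pseudocount of 1; the aligned example sites and the function names are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for position in range(len(sites[0])):
        column = [site[position] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1), key=lambda i: score(pwm, seq[i:i + width]))


# Toy aligned sites (hypothetical) resembling a TATAAT-like signal.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
pwm = build_pwm(sites)
print(best_hit(pwm, "GGCGTATAATGCG"))
```

Unlike a strict consensus match, the matrix still gives a reasonable score to sites that deviate at one or two positions.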

  13. Features for gene prediction in eukaryotes (contd.)
     • Content sensors
       • Evaluate variable-length features which extend from one signal to another.
       • They classify a DNA region into different types, e.g. coding vs non-coding.
     • Extrinsic content sensors
       • These sensors perform similarity searches between a region of the genomic sequence and a protein or DNA sequence present in a database.
       • Basic tools for similarity searching are needed, e.g. BLAST, FASTA etc.
       • Intragenomic and intergenomic comparisons

  14. Features for gene prediction in eukaryotes (contd.)
     • Intrinsic content sensors
       • Based on statistical models of the nucleotide frequencies and dependencies present in codon structure.
       • Use of Markov models (MM)
       • CpG islands (regions which often mark the beginning of genes, where the frequency of CG is not as low as it is in the rest of the genome)
       • Sensors for repetitive DNA (e.g. Alu sequences)
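
The CpG island idea can be sketched as a window statistic: the observed number of CG dinucleotides divided by the number expected from the C and G counts. The thresholds used below (GC fraction above 0.5 and observed/expected ratio above 0.6) follow a commonly cited heuristic that is not part of the slides and is normally applied to windows of at least 200 bp; the function names are hypothetical.

```python
def cpg_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    n = len(window)
    if n == 0:
        return 0.0, 0.0
    c_count = window.count("C")
    g_count = window.count("G")
    observed = window.count("CG")
    expected = c_count * g_count / n  # expected CpG count if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c_count + g_count) / n, ratio


def looks_like_cpg_island(window, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed GC-content and observed/expected thresholds."""
    gc_fraction, ratio = cpg_stats(window)
    return gc_fraction > min_gc and ratio > min_ratio


# Toy windows (hypothetical and far shorter than a real island).
print(cpg_stats("CGCGCGCG"), looks_like_cpg_island("CGCGCGCG"))
print(cpg_stats("ATATATAT"), looks_like_cpg_island("ATATATAT"))
```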

  15. Gene prediction tools
     • Software based on ab initio methods: GENSCAN, FGENESH, GeneMark.hmm, Glimmer, Genie, GeneID
     • Software based on similarity-based methods: GeneWise, SYNCOD, ORFgene2, EbEST
