Finding genes in the genome Lecture 7 Global Sequence
Introduction • The open reading frame (ORF) • Finding genes in prokaryotes • Finding genes in eukaryotes • ESTs (cDNA) and their role • Finding promoters
Introduction • Homologous approach: • Sequence similarity [discussed in the next lecture] • Has proven to be useful, but only if similar sequences already exist in the database. • However, if there are no similar sequences, we must apply the general properties of genes (start codon … stop codon) to analyse our sequence.
Open Reading Frames (ORF) • If the homologous approach is not successful, then look for ORFs. • An ORF is a region of the DNA which could be the coding sequence (CDS) of a gene [not the promoter, untranslated region (UTR)…] • It has a start codon (ATG) and a stop codon [one of three] (TAA, TAG, TGA). • If you have a novel sequence you would look for all ORFs in all 6 reading frames, 3 reading frames per strand, as a
Finding potential ORFs • Translate each reading frame beginning at: • Base 1: 5’→3’ frame 1 • Base 2: 5’→3’ frame 2 • Base 3: 5’→3’ frame 3 • Why no need for frame 4? • Get the “reverse complement” of the given strand and repeat the process: 3’→5’ frame 1…. • Look for start codons (Met) and stop codons in the translated sequences. • Note: in a FASTA file of a gene, the gene will be on the given strand, so there is no need to get the reverse complement.
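The six-frame procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a gene predictor: the function names (`reverse_complement`, `orfs_in_frame`, `six_frame_orfs`) are made up for this example, the input is assumed to be plain A/C/G/T, and an "ORF" here simply means the first ATG in a frame up to the next in-frame stop codon.

```python
# Minimal six-frame ORF scan (illustrative sketch under the assumptions above).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop stretch in one reading frame.

    Coordinates are 0-based on `seq`; `end` is exclusive and includes the stop.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                      # open an ORF at the first ATG
        elif start is not None and codon in STOPS:
            yield start, i + 3             # close it at the first in-frame stop
            start = None


def six_frame_orfs(seq):
    """Return (strand, frame, start, end, orf_sequence) for all six frames.

    For the "-" strand, coordinates refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):            # frames 1, 2 and 3 on this strand
            for start, end in orfs_in_frame(s, offset):
                found.append((strand, offset + 1, start, end, s[start:end]))
    return found


if __name__ == "__main__":
    for hit in six_frame_orfs("CCATGAAATAGCC"):
        print(hit)
```

Starting at base 4 would read the same codons as frame 1 minus the first one, which is why three offsets per strand are enough.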
Is the ORF a gene? • First check the length of the ORF [consider that the smallest protein is about 20 aa in length]. • Check for the presence of promoters upstream of the ORF (TATAAT sequence…). • Search for genes with similar aa sequences to the candidate gene. • Prokaryotes and eukaryotes need different approaches, which take into account the differences in their gene structure.
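The first two checks on this slide can be sketched as simple filters. This is only a rough illustration: the helper names are hypothetical, the 50-base upstream window is an assumption chosen for the example, and real promoters rarely match the TATAAT consensus exactly, so an exact string search will miss many true sites.

```python
MIN_CODONS = 20          # from the slide: smallest proteins are about 20 aa
PRIBNOW = "TATAAT"       # promoter sequence quoted on the slide
UPSTREAM_WINDOW = 50     # assumption: how far upstream of the ORF to look


def is_long_enough(orf_seq, min_codons=MIN_CODONS):
    """True if the ORF encodes at least `min_codons` amino acids.

    `orf_seq` is expected to run from the start codon through the stop codon,
    so one codon is subtracted for the stop.
    """
    return (len(orf_seq) // 3) - 1 >= min_codons


def has_upstream_box(genome, orf_start, box=PRIBNOW, window=UPSTREAM_WINDOW):
    """True if an exact copy of `box` lies within `window` bases upstream."""
    upstream = genome[max(0, orf_start - window):orf_start]
    return box in upstream


if __name__ == "__main__":
    orf = "ATG" + "AAA" * 19 + "TAA"            # toy ORF encoding 20 aa
    genome = "GG" + PRIBNOW + "C" * 10 + orf    # toy sequence, ORF starts at 18
    print(is_long_enough(orf), has_upstream_box(genome, 18))
```

An ORF that passes both filters is still only a candidate; the third check (similarity to known proteins) needs a database search.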
ORFs in prokaryotic genes • In prokaryotic genes the ORF, or protein coding sequence, begins with a start codon and ends with a stop codon. • Gene density is about 1 gene per kilobase, i.e. roughly one ORF every 1000 bases. In some cases the gene density can cause the stop codon of one gene to overlap with the promoter of another [Zvelebil chapter 9]. • E.g. within the lac operon there are 3 genes (CDS), all in close proximity, so the ATG of lacY is close to the stop codon of lacZ….
Review of Eukaryotic gene expression • [Figure: eukaryotic gene expression showing exons/introns…, adapted from Zhang 2002]
ORFs in Eukaryotes • Gene density is much lower; genes are further apart, and density can vary significantly between chromosomes (~1.5% of human DNA is CDS). • Genes contain introns between the coding sequences (CDS) of the exons. Further detail can be found in Klug 2010. • An added problem in interpreting the data: if an intron contains a stop codon sequence, e.g. “TAA”, it is only a sequence and not a functional stop codon, because introns are spliced out before translation. • Further details on finding and predicting exons can be found in Baxevanis 2005.
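A toy example makes the intron problem concrete. The sequences below are invented for illustration (they are not from any real gene), and the helper names are hypothetical. Reading the genomic DNA straight through hits a "TAA" inside the intron, while the spliced exons give the intended reading frame. Note that neither exon length (7 and 8 nt) is a multiple of 3.

```python
STOPS = {"TAA", "TAG", "TGA"}


def codons_until_stop(seq):
    """Split `seq` into codons from position 0, ending at the first stop codon."""
    codons = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        codons.append(codon)
        if codon in STOPS:
            break
    return codons


def splice(genomic, exons):
    """Join exon intervals (0-based, end-exclusive) of `genomic` into a CDS."""
    return "".join(genomic[start:end] for start, end in exons)


# Hypothetical toy gene: exon1 + intron (starts GT, ends AG) + exon2.
EXON1, INTRON, EXON2 = "ATGGCTG", "GTAAGTAATTTCAG", "CTAAATGA"
GENOMIC = EXON1 + INTRON + EXON2
EXONS = [(0, 7), (21, 29)]

if __name__ == "__main__":
    # Unspliced read: stops early at the TAA that lies inside the intron.
    print(codons_until_stop(GENOMIC))                  # ATG GCT GGT AAG TAA
    # Spliced read: the intron is removed and the real stop (TGA) is reached.
    print(codons_until_stop(splice(GENOMIC, EXONS)))   # ATG GCT GCT AAA TGA
```

A naive ORF scan on eukaryotic genomic DNA would therefore report a truncated, partly wrong coding sequence unless exon boundaries are predicted first.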
Finding Coding Regions in Eukaryotes • Identify the TSS and the untranslated regions: • Like coding regions, they can also contain exons and introns. • There are untranslated regions (UTRs) on both sides of the CDS (at both the 5’ and 3’ ends of the mRNA), and they play a part in regulating translation via mRNA degradation, attachment to the ribosome, and promoting or inhibiting translation. • Identify start and stop signals (Zhang 2002; Chasin 2007): • Initial exon (start codon and 5’ splice site) • Internal exon (3’ and 5’ splice sites) • Terminal exon (3’ splice site and stop codon) • There is compositional bias in the coding regions and also at splice sites. • Database pattern searches can also be used, on the assumption that coding regions are more highly conserved than non-coding regions. • It is important to be aware that the lengths of exons and introns may not be multiples of 3 [Zvelebil chapter 9; Baxevanis chapter 5].
Promoter Analysis • The existence of a “potential” ORF suggests the presence of a nearby promoter. • Promoters are elements upstream of the protein coding sequence that are essential for transcription; they exist in both eukaryotic and prokaryotic organisms. • The figure [Klug 7th ed.] illustrates a number of eukaryotic promoters and their variability, but also their common features: TATA box…
Promoter Analysis • In Prokaryotes: • The TATAAT region (Pribnow box) lies just upstream of the TSS (transcription start site), at about -10 bp. • A further marker, TTGACA, may also be found about 25 bp further upstream (-35 bp). • In Eukaryotes there are 3 subsections of the promoter: • The core/basal promoter (~80 bp from the TSS) (Klug p. 321) • In most cases it contains a TATA box (25 bp upstream of the TSS). • Many contain a CAAT box and GC-rich elements. • The proximal/upstream promoter (~250 bp from the TSS) • There is wide variation in this region from one gene to another. • The distal promoter (much further upstream)
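The prokaryotic signals above can be searched for with a regular expression. This is a deliberately crude sketch: `find_promoters` is a hypothetical helper, it only finds exact copies of both consensus hexamers, and the allowed gap of 15 to 19 bases between them is an assumption made for this example (the slide only gives the approximate -35 and -10 positions). Real tools score partial matches, for instance with position weight matrices.

```python
import re

MINUS_35 = "TTGACA"   # -35 consensus from the slide
MINUS_10 = "TATAAT"   # -10 consensus (Pribnow box) from the slide

# Assumption: 15-19 bases between the two hexamers. The lookahead makes the
# match zero-width, so overlapping candidate sites are all reported.
PROMOTER = re.compile(f"(?={MINUS_35}[ACGT]{{15,19}}{MINUS_10})")


def find_promoters(seq):
    """Return 0-based start positions of exact -35/-10 consensus pairs."""
    return [m.start() for m in PROMOTER.finditer(seq.upper())]


if __name__ == "__main__":
    toy = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GGGGGGG" + "ATG"
    print(find_promoters(toy))
```

A hit upstream of a candidate ORF strengthens the case that the ORF is a real gene, but the absence of an exact match says very little.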
Promoter Analysis • The identification of a core promoter indicates the presence of a gene and vice versa, so the predictions of both complement each other to an extent. • Promoter characterisation (discovering transcription factor binding patterns) takes two basic approaches (Baxevanis 2005, chapter 5): • Pattern-driven algorithms: depend on existing annotated data on binding sites in bioinformatics databases. • Sequence-driven algorithms: assume that common promoter functionality can be inferred from underlying conserved sequences. Genes that are co-regulated or co-expressed are good candidates for obtaining data for this approach.
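The sequence-driven idea can be illustrated with a very simple stand-in: look for short words (k-mers) that occur in the upstream region of every gene in a co-regulated set. The function names and the toy sequences are invented for this sketch, and exact k-mer sharing is far cruder than the motif-discovery algorithms described in Baxevanis 2005; it only shows the underlying assumption that shared function leaves a shared sequence signature.

```python
from collections import Counter


def shared_kmers(sequences, k=6):
    """Count, for each k-mer, how many of the input sequences contain it."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)   # a set, so each sequence counts once per k-mer
    return counts


def conserved_kmers(sequences, k=6):
    """Return the k-mers present in every input sequence, sorted alphabetically."""
    counts = shared_kmers(sequences, k)
    return sorted(kmer for kmer, n in counts.items() if n == len(sequences))


if __name__ == "__main__":
    # Hypothetical upstream regions of three co-regulated genes.
    upstream = ["CCGTATAATGC", "AATATAATCCG", "GGGTATAATAA"]
    print(conserved_kmers(upstream))
```

A pattern-driven search works the other way round: it starts from a known, annotated binding-site pattern and scans new sequences for it.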
Potential exam questions • Open reading frames (ORFs) are an essential part of finding genes in genomes. Discuss how you would attempt to find ORFs and why such ORFs are a more accurate prediction of protein structure in bacterial cells as opposed to animal cells. • A critical part of finding the protein coding regions of DNA sequences is the discovery of open reading frames (ORFs). Discuss the difficulties associated with finding such sequences in eukaryotic cells.
References • Baxevanis, A.D. 2005. Bioinformatics: a practical guide to the analysis of genes and proteins. Wiley; chapter 5. [book is in the library] • Kel, A.E. et al. 2003. MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31(13): 3576–3579. • Klug, W.A. et al. 2010. Concepts of Genetics. Pearson Education; p. 596–597. • Zhang, M.Q. 2002. Computational prediction of eukaryotic coding genes. Nat Rev Genet. 3: 698–709. • Chasin, L.A. 2007. Searching for splicing motifs. Adv Exp Med Biol. 623: 85–106. • Zvelebil, M. Understanding Bioinformatics; chapter 9. [book is in the library]