The modern RNA world: computational screens for noncoding RNA genes Eddy lab HHMI/Washington University, Saint Louis
The genome, famously, is digital 1892: Miescher postulates that genetic information may be encoded in a linear form using a few different chemical units: “...just as all the words and concepts in all languages can find expression in twenty-four to thirty letters of the alphabet.”
Symbolic texts can be cracked Michael Ventris and John Chadwick, 1953 “Cryptography has contributed a new weapon to the student of unknown scripts.... the basic principle is the analysis and indexing of coded texts, so that underlying patterns and regularities can be discovered. If a number of instances can be collected, it may appear that a certain group of signs in the coded text has a particular function....” - John Chadwick, The Decipherment of Linear B, Cambridge Univ. Press, 1958
Comparative genome analysis VISTA plot; I. Dubchak, E. Rubin, et al. human, mouse, dog genomes
Estimates of human gene number www.ensembl.org/Genesweep/ mean: 61,710 low: 27,462 high: 153,478 Want to place a bet? The book is held by the bartender at Cold Spring Harbor Laboratory.
The yeast genome completed
Science 274:546, 1996: Life with 6000 Genes
A. Goffeau, B.G. Barrell, H. Bussey, R.W. Davis, B. Dujon, H. Feldmann, F. Galibert, J.D. Hoheisel, C. Jacq, M. Johnston, E.J. Louis, H.W. Mewes, Y. Murakami, P. Philippsen, H. Tettelin, S.G. Oliver
where “gene” = ORF of 100 amino acids or more.
But besides the ~6000 large protein-coding genes, there are also:
• 140 ribosomal RNA genes
• 275 transfer RNA genes
• ~40 small nuclear RNA genes
• ~100 small nucleolar RNA genes
• ... and ... ?
Structure of the large ribosomal subunit Haloarcula marismortui Ban et al., Science 289:905, 2000
inside-out genes
Tycowski, Shu, and Steitz, Nature 379:464, 1996
Human UHG (U22 host gene): no significant ORFs; not conserved with mouse; rapidly degraded
Eight intron-encoded snoRNAs: conserved with mouse; stable
An RNA motor Simpson et al., Nature 408:745, 2000 “Structure of the bacteriophage φ29 DNA packaging motor”
Cartilage-hair hypoplasia mapped to an RNA M. Ridanpaa et al. Cell 104:195, 2001 RMRP: Human RNase MRP, 267 nt
microRNAs (miRNAs) in metazoa
T. Tuschl; D. Bartel; V. Ambros
• lin-4 acts as a translational repressor by binding the 3’ UTR
• ~22-mer processed from a ~70-mer precursor by the RNAi pathway
RNA genes can be hard to detect
UGAGGUAGUAGGUUGUAUAGU
C. elegans let-7; 21 nt; Pasquinelli et al., Nature 408:86, 2000
• often small
• sometimes multicopy and redundant
• often not polyadenylated (and remember EST libraries are poly-A selected)
• immune to frameshift and nonsense mutation
• no open reading frame or codon bias
• relatively little information in primary sequence consensus
Two computational analysis problems
1. Similarity search (e.g. BLAST):
• I give you a query; you find sequences in a database that look like the query.
• For RNA, you want to take the secondary structure of the query into account.
2. Genefinding (e.g. GENSCAN):
• Based solely on a priori knowledge of what a “gene” looks like, find genes in a genome sequence.
• For RNA, with no open reading frame and no codon bias, what do you look for?
Context-free grammars Noam Chomsky, 1956 Basic CFG “production rules” a CFG “derivation”
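The production rules and derivation pictured on this slide did not survive extraction, so what follows is only a minimal sketch of the idea, not the slide's own example. The grammar, the `RULES` table, and the `derive` helper are all hypothetical names chosen for illustration: the nonterminal `S` emits a base-paired column on both sides of itself, which is what lets a context-free grammar describe the nested pairing of an RNA stem.

```python
# Toy context-free grammar for an RNA stem-loop (illustrative, not from the slide).
# Each S rule writes one base on the left and its partner on the right.
RULES = {
    "S": [("a", "S", "u"), ("u", "S", "a"), ("g", "S", "c"), ("c", "S", "g"), ("L",)],
    "L": [("g", "a", "a", "a")],  # a fixed four-base loop
}


def derive(choices, start="S"):
    """Rewrite the leftmost nonterminal once per choice; return every sentential form."""
    form = [start]
    steps = ["".join(form)]
    for choice in choices:
        # Locate the leftmost symbol that still has production rules.
        idx = next(i for i, sym in enumerate(form) if sym in RULES)
        form[idx:idx + 1] = RULES[form[idx]][choice]
        steps.append("".join(form))
    return steps


steps = derive([2, 3, 0, 4, 0])
print(" => ".join(steps))
```

Each step of the printed derivation adds one base pair around `S` until the loop rule terminates it, so the two halves of the stem are generated together rather than left to right.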
Sequence vs. secondary structure alignment
R Durbin, SR Eddy, GJ Mitchison, A Krogh, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge Univ. Press, 1998

Goal                        HMM algorithm (sequence)   SCFG algorithm (structure)
optimal alignment           Viterbi                    CYK
P(sequence | model)         Forward                    Inside
EM parameter estimation     Forward-Backward           Inside-Outside
memory complexity           O(MN)                      O(MN^2)
time complexity (general)   O(M^2 N)                   O(M^3 N^3)
time complexity (as used)   O(MN)                      O(MN^3)

• we can analyze target sequences with secondary structure models;
• but the algorithms are computationally expensive.
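To make the cost of structure-aware algorithms concrete, here is a hedged sketch of the simplest dynamic program of this family. It is Nussinov-style base-pair maximization, swapped in for illustration; it is not CYK over a full SCFG and not the code behind the slide. Reading N as the sequence length (an assumption about the slide's notation), it shows where the N^2 memory and N^3 time come from: a table over all (i, j) subsequences, with an inner loop over bifurcation points k. The function name and the `min_loop` parameter are hypothetical.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop unpaired bases per hairpin."""
    pairs = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}
    seq = seq.lower()
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = best score for subsequence i..j: O(N^2) memory.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])  # leave i or j unpaired
            if (seq[i], seq[j]) in pairs:
                score = max(score, best[i + 1][j - 1] + 1)  # pair i with j
            for k in range(i + 1, j):  # bifurcation: this loop makes the time O(N^3)
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("gcagaaaugc"))
```

A sequence HMM, by contrast, only needs one pass along the sequence per state, which is why the slide's HMM column scales linearly in N.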
SCFG-based RNA similarity search C/D methylation guide snoRNA consensus: Graphical model, prior to conversion to probabilistic model: the program snoscan was used to detect C/D snoRNA homologues in Archaea; Omer et al., Science 288:517-522, 2000
SCFGs for RNA folding Elena Rivas and S.R. Eddy, Bioinformatics 16:573, 2000 Full SCFG analogue of Michael Zuker’s minimum energy RNA folding. This means we can apply statistical models to any RNA structure (e.g., what’s the probability that this is a plausible RNA structure?)
Genefinding by comparative analysis
Jonathan Badger, Gary Olsen: CRITICA, Mol. Biol. Evol. 16:512, 1999
Most comparative analysis relies just on differential rates of evolution. However, the pattern of mutation is also informative.
• The OTHER model: score with terms P(a,b | OTH); models divergence only
• The CODING model: score with terms P(aaa,bbb | COD); models divergence, constrained by amino acid substitution matrix and codon bias
add: a comparative model of structural RNAs
Elena Rivas, S.R. Eddy: QRNA, BMC Bioinformatics 2:8, 2001
• The RNA model: score with terms P(a-a’, b-b’ | RNA); models DNA divergence constrained by a secondary structure
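The idea that the pattern of mutation is informative can be illustrated with a toy log-odds score for one pair of aligned, base-paired columns. This is only a sketch under loud assumptions: every probability below is invented for illustration, the pairing is taken as known (QRNA does not assume this), and the functions `p_oth`, `p_rna`, and `log_odds_bits` are hypothetical names, not QRNA's parameters or code. Here a pairs with a' in the first sequence, b pairs with b' in the second, and a is aligned to b.

```python
import math

# Watson-Crick and G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def p_oth(x, y, identity=0.7):
    """Toy OTH term P(x, y): uniform base frequencies and a fixed chance of identity."""
    return 0.25 * (identity if x == y else (1.0 - identity) / 3.0)


def p_rna(a, a2, b, b2, pair_mass=0.9):
    """Toy RNA term P(a-a', b-b'): most probability on quadruples paired in both sequences."""
    both_paired = (a, a2) in PAIRS and (b, b2) in PAIRS
    # 36 of the 256 quadruples are paired in both sequences; 220 are not.
    return pair_mass / 36.0 if both_paired else (1.0 - pair_mass) / 220.0


def log_odds_bits(a, a2, b, b2):
    """log2 of the RNA model over the OTH model for two aligned columns."""
    return math.log2(p_rna(a, a2, b, b2) / (p_oth(a, b) * p_oth(a2, b2)))


print(round(log_odds_bits("G", "C", "A", "U"), 2))  # compensatory change: G-C becomes A-U
print(round(log_odds_bits("G", "C", "G", "A"), 2))  # single change that breaks the pair
```

With these made-up numbers the compensatory change scores positive and the pair-breaking change scores negative, even though both alignments contain mutations. A rate-only comparison would not separate the two cases.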
Some technical issues
• The structure is unknown; must do ensemble averaging.
• The model must deal with gapped alignments.
• Bounds of conservation or alignment don’t correspond to bounds of RNA.
• Evolutionary divergence times of the three models must be the same.
• We use a form of probabilistic model called “pair-SCFGs”.
A screen for novel ncRNAs in E. coli
Elena Rivas et al., Curr Biol 11:1369, 2001
• 2367 E. coli intergenic sequences >50 nt in length
• WUBLASTN vs. S. typhi, S. paratyphi, S. enteritidis, K. pneumoniae gave 23,674 WUBLASTN alignments w/ E<0.01, length >50 nt, >65% identity
• QRNA classified: 556 candidate RNA loci; 160 candidate small ORFs (not examined further)
• 281 candidate loci are explainable: cis-regulatory RNA structures (terminators, attenuators, etc.) and certain inverted repeat elements
• leaves 275 candidate ncRNA gene loci
• Northerns on 49 candidates: 11/49 are expressed as small stable RNAs in exponentially growing E. coli in rich media
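The three alignment cutoffs quoted on this slide (E < 0.01, length > 50 nt, > 65% identity) amount to a simple predicate applied to each alignment before QRNA classification. The sketch below assumes a hypothetical `Alignment` record and made-up example hits; the field names are not WU-BLAST's output format, and this is not the lab's pipeline code.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record; field names are illustrative only.
    subject: str
    evalue: float
    length_nt: int
    percent_identity: float


def passes_screen(aln: Alignment) -> bool:
    """Apply the three cutoffs quoted on the slide."""
    return aln.evalue < 0.01 and aln.length_nt > 50 and aln.percent_identity > 65.0


# Made-up hits, purely to exercise the filter.
hits = [
    Alignment("S. typhi", 1e-20, 120, 88.0),       # passes all three cutoffs
    Alignment("K. pneumoniae", 0.5, 200, 90.0),    # E-value too high
    Alignment("S. paratyphi", 1e-5, 40, 95.0),     # too short
    Alignment("S. enteritidis", 1e-8, 150, 60.0),  # identity too low
]
kept = [aln.subject for aln in hits if passes_screen(aln)]
print(kept)
```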
The Altuvia screen
Argaman et al., Current Biology 11:941, 2001, “Novel small RNA-encoding genes in the intergenic regions of E. coli”
“Over a period of about 30 years, only four bona fide regulatory RNAs have been discovered in E. coli. Here we report on the discovery of 14 novel small RNA-encoding genes....”
sraA 120 nt; sraB 149-168 nt; rprA 105 nt; sraC 234-249 nt; sraD 70 nt; gcvB 205 nt; sraE 88 nt;
sraF 189 nt; sraG 146-174 nt; sraH 88-108 nt; sraI 91-94 nt; sraJ 172 nt; sraK 245 nt; sraL 140 nt
• start w/ “intergenic” regions
• computational identification of putative promoter and terminator, 50-400 nt apart
• select regions conserved with other bacteria by BLAST
The Gottesman screen
Wassarman et al., Genes Dev. 15:1637, 2001, “Identification of novel small RNAs using comparative genomics and microarrays”
“... a multifaceted search strategy to predict sRNA genes was validated by our discovery of 17 novel sRNAs....”
rydB 60 nt; ryeE 86 nt; ryfA 320 nt; ryhA 45 nt (sraH); ryhB 90 nt (sraI); ryiA 210 nt;
ryjA 92 nt; rybB 80 nt; ryiB 270 nt (sraK, csrC); rybA 205 nt; rygA 89 nt (sraE); rygB 83 nt;
ryeA 275 nt; ryeB 100 nt; ryeC 107,143 nt; ryeD 102,137 nt; rygC 107,139 nt
• intergenic regions >= 180 nt
• conserved w/ other bacteria by BLAST
• manual inspection of location & sequence
• expression detected on high-density oligo probe array
Summary of three E. coli screens
31 different new RNAs found and confirmed by the three screens:
• Altuvia: 14
• Gottesman: 19 (1 showed no expression; 1 untested)
• Rivas: 22 (1 showed no expression; 10 untested)
Conclusions:
• Sensitivity of QRNA is respectable; most E. coli ncRNAs conserve secondary structure.
• Only 4/11 of our confirmed ncRNAs are among the Altuvia or Gottesman genes.
• These screens have not saturated E. coli for new ncRNAs.
• We have >200 other candidates in testing.
• We have confirmed transcripts as short as 40 nt.
• The functions of these RNAs are unknown.
Pyrococcus: three hyperthermophile genomes
• P. horikoshii: 1.8 Mb, complete; isolated off Okinawa, 1400m depth; Kawarabayasi et al. (NITE, Tokyo)
• P. furiosus: 1.9 Mb, complete; from Vulcano Island, Italy; Robb et al. (Utah Genome Center)
• P. abyssi: 1.8 Mb, complete; from South Pacific vent, 3500m depth; Genoscope (France)
Photo: a “black smoker”, a deep sea hydrothermal vent (American Museum of Natural History)
RNAs stand out in AT-rich hyperthermophiles

Organism        growth temp (C)   % GC (genome)   % GC (RNA)   %RNA-%genome   % known RNAs detected
Methanococcus   85                31%             67%          36%            97%
Pyrococcus      98                42%             71%          29%            52%
Borrelia        37                29%             54%          25%            29%
Aquifex         90                44%             68%          24%            14%
Archaeoglobus   83                48%             68%          20%            2%
S. cerevisiae   30                38%             54%          16%            0
E. coli         37                51%             59%          8%             0
The G/C computational screen
Robbie Klein et al., manuscript submitted
Implemented as a 2-state hidden Markov model, using Viterbi or posterior decoding algorithms.
Methanococcus jannaschii (Viterbi parse alone):
• 43 regions detected (some span multiple RNAs)
• includes 36/37 tRNAs; SSU and LSU rRNA; 5S, 7S, RNase P
• 9 unassigned candidates; 4/9 express small RNAs detectable on Northern
Pyrococcus furiosus (posterior decoding, plus conservation with P. abyssi and P. horikoshii):
• 51 regions detected (some span multiple RNAs)
• includes 46/46 tRNAs; SSU and LSU rRNA; 2 5S, 7S, and RNase P
• 8 unassigned candidates; 4/8 express small RNAs detectable on Northern
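A 2-state HMM decoded with Viterbi, as named on this slide, can be sketched in a few lines. Everything numeric below is an assumption made for illustration: the state names, the GC fractions per state, and the `STAY` self-transition probability are hypothetical and are not the parameters of the actual screen. The sketch only shows the mechanism: one state emits AT-rich background, the other emits GC-rich sequence, and a switching penalty keeps short GC runs from being called.

```python
import math

STATES = ("background", "rna")
# Hypothetical parameters, chosen only so the example behaves sensibly.
GC_FRACTION = {"background": 0.30, "rna": 0.70}
STAY = 0.95  # probability of remaining in the same state


def emit(state, base):
    """Emission probability: G and C share the GC mass, every other symbol shares the rest."""
    p_gc = GC_FRACTION[state]
    return p_gc / 2 if base in "GC" else (1 - p_gc) / 2


def viterbi(seq):
    """Most probable state path for seq under the toy 2-state model."""
    seq = seq.upper()
    if not seq:
        return []
    score = {s: math.log(0.5) + math.log(emit(s, seq[0])) for s in STATES}
    back = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            cands = {p: score[p] + math.log(STAY if p == s else 1 - STAY) for p in STATES}
            prev = max(cands, key=cands.get)
            pointer[s] = prev
            new_score[s] = cands[prev] + math.log(emit(s, base))
        back.append(pointer)
        score = new_score
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointer in reversed(back):
        state = pointer[state]
        path.append(state)
    return path[::-1]


path = viterbi("AT" * 15 + "GC" * 15 + "AT" * 15)
print(path.count("rna"))
```

Posterior decoding, the alternative mentioned on the slide, would instead label each position by its marginal state probability from the Forward-Backward algorithm; it is not shown here.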
Comparison of G/C to QRNA screen
Robbie Klein et al., PNAS, in press
P. furiosus, screened by QRNA by comparison to P. horikoshii, P. abyssi

                                G/C screen   Both   QRNA screen
Candidate loci:                 51           n.d.   73
known tRNAs detected (of 46):   46           45     45
novel loci:                     8            4      17
Confirmed by Northern:          4            3      4

• Like the E. coli screen, about 25% of QRNA candidates were confirmed by Northern (again in a single growth condition only).
• QRNA is detecting most novel structural RNA genes.
human/mouse ncRNA detection the cartilage-hair hypoplasia region: QRNA is a general genefinder for structural ncRNA genes.
The ancient RNA World Gesteland, Cech, Atkins: The RNA World, CSHL Press, 1999
RNA is very good at recognizing RNA Ha, Wightman, Ruvkun; Genes Dev. 10:3041, 1996
A closing idea: The modern RNA world Hypothesis: When a cell needs a molecule that specifically recognizes a target RNA molecule, and the function is either: - catalytically unsophisticated - something that can be abstracted onto a shared protein (e.g. many guide snoRNAs, one methylase) then RNA may be the material of choice. Specific RNA-binding proteins are big, expensive, and more difficult to evolve.
In fact, an old idea... Jacob and Monod, JMB 3:318, 1961
Summary
• There appear to be many noncoding RNA genes.
• Methods to find homologous RNAs by structural similarity have been greatly improved, using stochastic context free grammar algorithms.
• Methods to find novel RNAs by de novo genefinding have finally become possible, for instance by using comparative genome analysis.
[SR Eddy, Nature Reviews Genetics 2:919, 2001]
[R Durbin et al., Biological Sequence Analysis, Cambridge U. Press, 1998]
[E Rivas, RJ Klein, TA Jones, SR Eddy, Curr Biol 11:1369, 2001; E Rivas, SR Eddy, BMC Bioinformatics 2:8, 2001]
Acknowledgements
the Eddy lab: http://www.genetics.wustl.edu/eddy/
senior scientist: Elena Rivas
systems: Goran Ceric
webmaster: Ajay Khanna
wet lab: Ziva Misulovin
secret agent man: Tom Jones
students: Zhirong Bao, Christian Zmasek, Robin Dowell, Robbie Klein, Steve Johnson, Shawn Stricklin, John McCutcheon
funding: HHMI NIH NHGRI NSF Monsanto