Genome Wide Searches for RNA Secondary Structure Motifs

Russell S. Hamilton
Davis Lab, Wellcome Trust Centre for Cell Biology
[Figure: Drosophila melanogaster]

Introduction: RNA Localization
[Figure labels: mRNA with cis-acting signal; trans-acting factors; dynein; microtubules (- and + ends)]
• RNA localization is a mode of targeting various proteins to their site of function
• Cis-acting signals in the mRNA are recognised by trans-acting factors bound to the dynein motor
• Translation of the mRNA into protein is blocked during transport
• The mRNA is anchored at the site of function before being translated into protein
(Delanoue & Davis, 2005, Cell, in press)

Introduction: gurken
[Figure: localizing mRNAs in the oocyte (grk, osk, bcd), with axis labels A, P, D and V]
• gurken encodes a TGFα homologue
• gurken is localized to the dorso/anterior corner, forming a cap around the oocyte nucleus, and establishes the dorso/ventral axis
• gurken localization has been shown to be dynein dependent (MacDougall et al., 2003, Dev. Cell, 4, 307-19)
• The gurken localization signal has been mapped to 64 nt that are necessary and sufficient for localization (Van De Bor & Davis, 2004, Curr. Opin. Cell Biol., 16, 300-7)
• gurken also localizes in the embryo

Introduction: I Factor
[Figure: localized I Factor RNA and the nucleus (image credit: Van De Bor)]
• I Factor is a retrotransposon (or transposable element), which inserts itself into the genome of an organism
• I Factor has been found to localize in a similar manner to gurken (Van De Bor, Hartswood, Jones, Finnegan & Davis)
• The localization signal has been mapped to 58 nt that are necessary and sufficient for localization

Introduction: gurken and I Factor
[Figure: gurken 64nt stem loop and I Factor 58nt stem loop, each with labelled elements H, St, I1, B and I2]
Sequence Similarity: %ID = 34%
gurken  AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT 64
Ifactor ---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA-- 58
           *      *   ***   **  ***      ***       * * *****  *      *
Structural Similarity
• Are there more examples in the Drosophila genome using a similar mechanism of localization?
• Search by secondary structure, not sequence
V. Van De Bor, D. Finnegan, E. Hartswood and C. Jones
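The 34% figure can be checked directly from the two aligned rows. The sketch below is only an illustration: it assumes identity is counted as identical non-gap columns divided by the full alignment length (64 columns), which is one of several conventions and happens to reproduce the value on the slide. The function names are hypothetical.

```python
# Aligned rows copied from the slide; "-" marks an alignment gap.
GURKEN = "AAGTAATTTTCGTGCTCTCAACAATTGTCGCCGTCACAGATTGTTGTTCGAGCCGAATCTTACT"
IFACTOR = "---TGCACACCTCCCTCGTCACTCTTGATTTT-TCAAGAGCCTTCGATCGAGTAGGTGTGCA--"


def identical_columns(a, b):
    """Count alignment columns where both rows carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y and x != "-")


def percent_identity(a, b):
    """Identity relative to the alignment length (an assumed convention)."""
    return 100.0 * identical_columns(a, b) / len(a)


if __name__ == "__main__":
    print(identical_columns(GURKEN, IFACTOR))        # 22
    print(round(percent_identity(GURKEN, IFACTOR)))  # 34
```

With so few identical columns, and those scattered rather than clustered, a sequence search alone would be unlikely to link the two elements, which is the motivation for searching by structure.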

Method Outline
[Flow diagram labels: Database; Genome sequences; Folded genome sequences; Comparison with grk & I Factor structures]

Method: RNALfold
• Folds large genomic sequences, outputting stable structures of a given size
• Similar to mfold, but optimised for folding on a genome-wide scale
• Window length is user defined
• Use 64 and 58 (grk & I Factor LEs)
[Figure: 2L chromosome arm genomic sequence → RNALfold → stable structures]
Hofacker et al (2004) Bioinformatics 20, 191-198
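As a rough illustration of how the stable structures could be collected after folding, the sketch below parses RNALfold-style text output. It assumes each hit is printed as a dot-bracket structure, an energy in parentheses and a start position, and that the window was set when folding (for example with RNALfold's `-L` option). The sample lines, energies and the `parse_hits` helper are made up for illustration and are not taken from the talk.

```python
import re

# Assumed hit layout: "<dot-bracket> (<energy>) <start position>"
HIT_RE = re.compile(r"^([.()]+)\s+\(\s*(-?\d+(?:\.\d+)?)\)\s+(\d+)\s*$")


def parse_hits(lines, min_len=0, max_energy=0.0):
    """Return (start, length, energy, structure) for lines that look like hits."""
    hits = []
    for line in lines:
        match = HIT_RE.match(line.strip())
        if match is None:
            continue  # sequence line, summary line, or anything unrecognised
        structure = match.group(1)
        energy = float(match.group(2))
        start = int(match.group(3))
        if len(structure) >= min_len and energy <= max_energy:
            hits.append((start, len(structure), energy, structure))
    return hits


# Illustrative text only; not real output from the 2L chromosome arm.
EXAMPLE_OUTPUT = """\
.((((....)))). ( -2.10)   12
((((((......)))))) ( -5.30)   40
ACGUACGUACGU
 ( -5.30)
"""

if __name__ == "__main__":
    for hit in parse_hits(EXAMPLE_OUTPUT.splitlines(), min_len=15, max_energy=-3.0):
        print(hit)
```

Filtering on length and energy at this stage keeps the number of structures passed on to the comparison steps manageable.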

Method: RNAdistance & RNAforester
• Structures are represented in bracket format
• Minimal representation maintaining all structural characteristics
• Structures are then aligned (not by sequence) with the query structure, e.g. the gurken LE
• Scores can be weighted by sequence length and total number of base pairs
Example alignment:
  ..(((((.....))))).
  .-(-(((-....))))-.
  Matches = + score, mismatches = - score
  Key: ( = base pair, . = unpaired base, - = gap
• RNAdistance: global structure comparison. Hofacker (1994) Monatsh. Chem. 125, 167-188
• RNAforester: local structure comparison. Hochsmann (2003) Proc. Comp. Sys. Bioinf. (CSB 2003)
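To make the bracket notation and the match/mismatch idea concrete, here is a minimal sketch. It recovers the base pairs from a bracket string and scores two already-aligned structures column by column. This is not the tree-based algorithm used by RNAdistance or RNAforester; the weights (+1 match, -1 mismatch, -1 gap) and function names are assumptions chosen for illustration.

```python
def base_pairs(structure):
    """Return sorted (open, close) index pairs from a bracket string.

    Unpaired bases (".") and alignment gaps ("-") are skipped.
    """
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return sorted(pairs)


def score_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Score two aligned structures column by column (assumed weights)."""
    if len(a) != len(b):
        raise ValueError("aligned structures must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    query = "..(((((.....)))))."
    hit = ".-(-(((-....))))-."
    print(base_pairs(query))            # five nested pairs
    print(score_alignment(query, hit))  # 14 matches, 4 gaps -> 10
```

Dividing such a raw score by the sequence length or by the number of base pairs gives the kind of weighting mentioned on the slide.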

Method: RNAMotif
• Flexible secondary structure definition and searching algorithm
• Two-step process
  Step 1. Create a structure description
  Step 2. Use the description to find matching structures in a sequence database
• Uses Mfold (and pknots) for secondary structure predictions
• Output can be ranked by thermodynamic stability
• User-defined scoring, based on if/then/else statements, e.g. if loop has 6-8 bases then score += 10 else score -= 10
Algorithm summary:
• The description is converted to a tree structure
• The secondary structure of the sequence being matched is converted to a tree structure
• Matching can then take place
Macke, T.J. et al. (2001) Nucl. Acids Res., 29, 4724-4735
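The if/then/else scoring rule quoted above can be mimicked in a few lines. The sketch below is written in Python rather than in RNAMotif's own scoring language, and it assumes the hit is a single simple hairpin given in bracket notation; the function names are hypothetical.

```python
def hairpin_loop_length(structure):
    """Number of unpaired bases enclosed by the innermost base pair.

    Assumes a single simple hairpin in bracket notation.
    """
    first_close = structure.index(")")
    last_open = structure.rindex("(", 0, first_close)
    return first_close - last_open - 1


def score_hit(structure, score=0):
    """Apply the rule from the slide: reward loops of 6-8 bases."""
    if 6 <= hairpin_loop_length(structure) <= 8:
        score += 10
    else:
        score -= 10
    return score


if __name__ == "__main__":
    print(score_hit("((((......))))"))  # loop of 6 -> 10
    print(score_hit("(((....)))"))      # loop of 4 -> -10
```

Several such rules can be chained, each adding to or subtracting from the running score, so that hits resembling the query in more respects rank higher.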

Method: RNAMotif
• Define the base pairings allowed (in addition to Watson-Crick)
• Define stems, loops, and bulges
  • Including number of nucleotides
  • Setting a range of 0-N means the element can be either present or absent
• Can also put in sequence constraints
  • Including tolerated mismatches
• Can search for pseudoknots, triplexes & quadruplexes
• Very flexible method of describing secondary structures
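The idea of a structure description (allowed pairings, a stem of a given length, a loop with a length range) can be illustrated with a toy hairpin search. This is a brute-force Python sketch, not RNAMotif's descriptor language or matching algorithm, and it handles only a single perfect stem with no bulges or mismatches. All names and default values are assumptions.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}  # extra pairing allowed on top of Watson-Crick


def find_hairpins(seq, stem_len=4, loop_min=4, loop_max=8,
                  pairs=WATSON_CRICK | WOBBLE):
    """Return (start, loop_length, subsequence) for every matching hairpin.

    A match is stem_len consecutive allowed pairs enclosing a loop whose
    length lies between loop_min and loop_max.
    """
    seq = seq.upper().replace("T", "U")
    hits = []
    for start in range(len(seq)):
        for loop_len in range(loop_min, loop_max + 1):
            end = start + 2 * stem_len + loop_len  # exclusive end of the hairpin
            if end > len(seq):
                break
            if all((seq[start + k], seq[end - 1 - k]) in pairs
                   for k in range(stem_len)):
                hits.append((start, loop_len, seq[start:end]))
    return hits


if __name__ == "__main__":
    print(find_hairpins("GGGGAAAACCCC"))
    # GU pairs are accepted only when wobble pairing is allowed
    print(find_hairpins("GGGGAAAAUUUU", pairs=WATSON_CRICK))
```

A real description would add bulges, internal loops and sequence constraints as further elements, each with its own length range.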

Method: RNAMotif
4 description files so far…
1. Basic: 2900 hits
   Matches both gurken and I factor LEs
2. Basic + score: 2900 hits
   Scores nearer gurken as positive; scores nearer I factor as negative
3. Basic + score + sequence constraint UU: 394 hits
   UU in bulge present in both gurken and I factor
4. Basic + score + sequence constraint UU + CAA/AAC: 151+ hits
   CAA/AAC in stem 1 present in both gurken and I factor

Method: Overview
• Take all available sequence databases
• Predict all stable secondary structures
• Calculate similarity between grk/I factor and stable structures
• Pattern match structures against an RNAMotif description
• Results put in database and accessed via web interface

Computational Infrastructure
• Computational requirements are beyond desktop PCs
• Main requirements are processing power and enough storage space for the sequences being searched and the database of matching structures
• Processing
  • 6 processing nodes
  • Pentium 4 HT
  • 1GB RAM
[Diagram labels: Development Platform; Data Storage; RAID Array; File Server; Tape Backup Robot; Web Server Linked to Database]

Web Interface: Searching
• To stop your browser crashing, you can limit the number of hits displayed
• Filter by the percentage of the sequence deemed to have low complexity
• Narrow down the search by CG, TE, CR or individual identifiers
• Select the RNAMotif structure description used in the searches
http://wcbweb.icmb.ed.ac.uk/~ilan/bioinformatics.html
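The two filters described above (low-complexity percentage and a cap on displayed hits) might look something like the sketch below. The talk does not say how low complexity is measured; this sketch assumes soft-masked sequences, where lower-case bases mark low-complexity or repeat regions. The hit dictionaries, identifiers and thresholds are hypothetical.

```python
def low_complexity_percent(seq):
    """Percentage of bases that are soft-masked (lower case); an assumed measure."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower())
    return 100.0 * masked / len(seq)


def filter_hits(hits, max_low_complexity=50.0, limit=None):
    """Drop hits above the low-complexity threshold, then cap the number shown."""
    kept = [hit for hit in hits
            if low_complexity_percent(hit["sequence"]) <= max_low_complexity]
    return kept if limit is None else kept[:limit]


if __name__ == "__main__":
    # Hypothetical hits; the identifiers are placeholders.
    hits = [
        {"id": "hit1", "sequence": "ACGUACGUACGU"},
        {"id": "hit2", "sequence": "acguacguACGU"},
        {"id": "hit3", "sequence": "ACGUACGUacgu"},
    ]
    for hit in filter_hits(hits, max_low_complexity=40.0, limit=10):
        print(hit["id"], round(low_complexity_percent(hit["sequence"]), 1))
```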

Web Interface: Search Results
• Custom RNAMotif score
• RNAdistance scores displayed
• RNAMotif raw output showing how the sequence matches the structure description
• Indicates if the sequence has regions of low complexity/repeat regions (option to filter these out)

Web Interface: Gene Mapping

Web Interface: Conservation Assessment

Results: Candidate Injections
• We are currently injecting candidates from the database into oocytes and embryos to determine whether the RNA is localized
• Results of candidate injections are stored in the database
• There have been suggestions that up to 20% of Drosophila genes may localize in the oocyte and/or embryo
• We therefore want to show that our method is able to enrich for localizing genes

Future Work: Expanding Searches
• Depending on the success of the experimental localization assays…
• Expand the searches to:
  • Other Drosophilid genomes
    • 12 will be sequenced in the near future
  • Mammalian genomes (particularly human)
    • Will require considerable computational power
  • Search for LINE/SINE elements in human (transposon equivalents)
• Develop the web interface to enable real time searches to be performed on genes/genomes of interest
  • Requires massive computational power…

Future Work: Tertiary Structure
Squid Protein
• gurken mRNA is known to bind Squid protein
• Used homology modelling to predict Squid tertiary structure (~2.5Å) (Hamilton & Soares)
[Figure: Squid homology model, showing RNA binding sites and flexible linker region]
RNA tertiary structure prediction
• Secondary structure alone may not be sufficient for finding similar structures
Experimental Structure Determination
• RNA + protein: X-ray and/or NMR
• RNA only: NMR
[Figure: RNA + protein 3D structure, Staufen + RNA. Ramos et al., 2000, EMBO J., 19, 997-1009]

Future Work: Machine Learning
Long Term Future…
• Support Vector Machines (SVMs)
• Take sequence & structure for localizing and non-localizing matches (+ other data)
• Algorithm learns how to separate localizing from non-localizing
• The problem is that we don’t have enough data at the moment
• However, the candidate injections should hopefully generate enough data for localizing and non-localizing genes

Acknowledgements
Davis Lab: Ilan Davis, Veronique Van De Bor, Georgia Vendra, Hille Tekotte, Renald Delanoue, Carine Meignin, Alejandra Clark, Isabelle Kos, Richard Parton
Finnegan Lab: David Finnegan, Eve Hartswood, Cheryl Jones
Bioinformatics Discussions: Alastair Kerr
Systems Administration: Paul Taylor
Homology Modelling: Dinesh Soares
[Logos: Funding; Software]
