150 likes | 323 Vues
Using RNA secondary structure to guide sequence motif finding toward single-stranded regions. Michael Hiller, Rainer Pudimat, Anke Busch and Rolf Backofen Rachel Brower-Sinning. Motivation:.
E N D
Using RNA secondary structure to guide sequence motif finding toward single-stranded regions Michael Hiller, Rainer Pudimat, Anke Busch and Rolf Backofen Rachel Brower-Sinning
Motivation: RNA binding proteins are an integral part of pre-mRNA processing (splicing, etc.) and regulate mRNA processes (transport, stability, etc) On the mRNA molecule, there might be multiple binding sites according to the sequence specificity, but the structure of the strand will determine which site is accessible to the BP Current programs, such as MEME, search for motifs using a PSPM to look at the probability of each letter at each position in the pattern, but do not look at the secondary structure of the sequence
Approach The approach taken by MEMERIS to to first pre-compute the “single-strandedness” of a substring in the RNA sequence from position a to b; allowing for the choice of two measurements- The first being the probability that all bases in the string are unpaired (PUab) And the second being the expected fraction of bases in the substring [a,b] that do not form base pairs (EFab)
Approach con’t Next, this secondary structure information is integrated in with MEME MEME is a program for finding motifs in a set of unaligned sequences (X = X1, X2, … ,Xn), where the motif is defined as a PSPM, Θ1 = (P1, P2, … ,PW) where W is the length of the motif and the vector Pi is the probability distribution of the letters at position i. A given sequence Xi is modeled as consisting of two different parts: - a non-negative number of non-overlapping motif occurrences sampled from the matrix - random samples from a background probability distribution for the remaining sequence positions
Approach con’t MEME considers three different models- - exactly one motif occurrence per sequence (OOPS model) - zero or one motif occurrences per sequence (ZOOPS model) - zero or more motif occurrences per sequence (TCM model) To find the motif(s) an expectation maximization algorithm is used to perform a maximum likelihood estimation of the model given the data
Approach con’t OOPs model -MEME uses where -which changes to in MEMERIS, where
Approach con’t The expectation of the hidden variables is computed as And the ML estimation remains unchanged
Results • PU values are stricter than the EF values; using PU values will favour single strandedness more than using EF values • Trade off: PU values are dependent on motif length: an increase in the length of the motif results in a decrease in PU While EF values are independent of motif length
Results con’t Using the SELEX data, which contains 33 TCAT or ACAT repeats in hairpin loops, binding sites of the neuron-specific splicing factor Nova-1 MEMERIS correctly identifies these described 33 TCAT and ACAT his MEME identifies the correct motifs, but also motifs outside the hairpins (not binding sites)
Discussion RNA binding proteins bind in a sequence specific manner, but have a preferred structural characteristic to the binding site (with the motif occurring in dsRNA having been found to eliminate protein binding) MEMERIS can both look for sequence motif and incorporate secondary structure information MEMERIS has been shown to identify ssRNA motifs that often are the protein binding motif