
Finding regulatory modules: A statistical approach

Mikhail Velikanov, Linnaeus Centre for Bioinformatics


Presentation Transcript


  1. Finding regulatory modules: A statistical approach
     Mikhail Velikanov, Linnaeus Centre for Bioinformatics

  2. Introduction
     • Regulatory modules (RMs): sets of regulatory sites that work cooperatively
       • TF binding sites and promoter elements
       • Splicing enhancers and suppressors
     • “Site clusters”
       • site A AND (site B OR site C) AND (NOT site D)
     • “Beads on a spring”
       • site B is 20 ± 3 bp downstream of site A
       • Distance distributions have a short range and a well-defined peak

  3. Searching for RMs: Setup of the Problem
     [Figure: panels “Motifs” and “Annotations”]
     • Sequence length is constant and small (~0.5 kb)
     • Number of sites: ~20
     • No overlapping sites
     • Sites are characterized by:
       • Identity
       • P-values (≤ $p_t$), shown by width
     Look for annotation patterns that occur consistently in all or some of the sequences.
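The setup above can be sketched as a small data structure. This is only an illustration: the names `Site` and `build_annotation` are hypothetical, and the rule used to resolve overlaps (keep the hit with the smaller p-value) is an assumption, since the slide only states that sites do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    motif: str       # identity of the motif that matched
    start: int       # 0-based start position in the sequence
    end: int         # end position (exclusive)
    p_value: float   # significance of the match


def build_annotation(hits, p_threshold):
    """Keep hits with p-value <= p_threshold and drop overlapping hits.

    Assumption: when two hits overlap, the one with the smaller p-value wins.
    """
    kept = []
    significant = [h for h in hits if h.p_value <= p_threshold]
    for hit in sorted(significant, key=lambda h: h.p_value):
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    # An annotation is the list of surviving sites in sequence order.
    return sorted(kept, key=lambda s: s.start)


hits = [
    Site("A", 0, 10, 1e-4),
    Site("B", 5, 15, 1e-3),    # overlaps A and has a larger p-value
    Site("C", 30, 40, 0.5),    # fails the threshold
    Site("D", 20, 28, 1e-2),
]
annotation = build_annotation(hits, p_threshold=0.05)
```

Here `annotation` contains the sites for motifs A and D, in that order.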

  4. RMs as Annotation Alignments
     • Align sites by identity
     • Find sequences of 2 or more sites shared across a number of annotations (common annotations)
     Conditions:
       1. Distances between sites are similar
       2. P-values of aligned sites are similar
       3. P-values of aligned sites are small
     Need a function that measures how well conditions (1-3) are satisfied (strength of common annotation).
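As a rough illustration of aligning sites by identity, the sketch below finds pairs of sites that two annotations share with similar spacing, which covers condition 1 only. The function name `shared_site_pairs`, the `(motif, position)` tuple layout, and the tolerance `max_dist_diff` are assumptions made for the example, not part of the presented method.

```python
from itertools import combinations


def shared_site_pairs(ann1, ann2, max_dist_diff=3):
    """Find ordered motif pairs present in both annotations with similar spacing.

    Each annotation is a list of (motif, position) tuples sorted by position.
    Returns a list of ((motif_a, motif_b), distance_in_ann1, distance_in_ann2).
    """
    def ordered_pairs(ann):
        pairs = {}
        for (m1, x1), (m2, x2) in combinations(ann, 2):
            pairs.setdefault((m1, m2), []).append(x2 - x1)
        return pairs

    pairs1, pairs2 = ordered_pairs(ann1), ordered_pairs(ann2)
    shared = []
    for key in sorted(pairs1.keys() & pairs2.keys()):
        for d1 in pairs1[key]:
            for d2 in pairs2[key]:
                if abs(d1 - d2) <= max_dist_diff:
                    shared.append((key, d1, d2))
    return shared


ann1 = [("A", 10), ("B", 30), ("C", 100)]
ann2 = [("A", 50), ("B", 72), ("C", 300)]
pairs = shared_site_pairs(ann1, ann2)
```

In this toy input only the A, B pair survives: its spacing is 20 bp in one annotation and 22 bp in the other, while the pairs involving C differ by far more than the tolerance.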

  5. Strength of common annotation: site p-values
     • Assume a common annotation of S sites supported by N sequences
     • For the i-th site, let $p_i^{\min}$, $p_i^{\max}$ be the smallest and the largest of the N p-values
       • $p_i^{\max}$: measure of how small the p-values are
       • $R_i = p_i^{\max} / p_i^{\min}$: measure of similarity
     • Probability $\pi_i$ of observing p-values as similar and as small in N random annotations
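The slide does not give the formula for $\pi_i$. One possible null model, stated here purely as an assumption, is that the N p-values are independent and uniform on [0, 1]. Under that model the probability that the largest p-value is at most $p^{\max}$ and the ratio of largest to smallest is at most $R$ works out to $(p^{\max})^N (1 - 1/R)^{N-1}$. The sketch below evaluates this expression; the function name is hypothetical, and a p-value threshold $p_t$ would change the normalization.

```python
def site_similarity_probability(p_values):
    """Probability of p-values 'as similar and as small' under a uniform null.

    Assumption (not from the slides): the N p-values are i.i.d. uniform on [0, 1].
    Then P(max <= p_max and max/min <= R) = p_max**N * (1 - 1/R)**(N - 1).
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    n = len(p_values)
    p_min, p_max = min(p_values), max(p_values)
    ratio = p_max / p_min  # R_i from the slide
    return p_max ** n * (1.0 - 1.0 / ratio) ** (n - 1)


pi_two = site_similarity_probability([0.01, 0.02])
pi_one = site_similarity_probability([0.05])
```

With a single sequence the expression reduces to the p-value itself. With identical p-values across several sequences it evaluates to zero, so any real implementation would need to handle that limit before taking logarithms.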

  6. Strength of common annotation: distances between sites
     • Account for no overlaps between sites
       • renormalization of $\pi_i$ for each site, giving $\tilde{\pi}_i$
       • $\tilde{\pi}_0 = 1 - \sum_{i=1}^{S} \tilde{\pi}_i$: positions between sites
     • Compute approximately the probability of the common annotation, $P_{CA}$, as a function of $\tilde{\pi}_0, \tilde{\pi}_1, \ldots, \tilde{\pi}_S$
     • Strength of common annotation: $Z = -\ln P_{CA}$

  7. Searching for the strongest common annotations
     • Given an input set of annotations, define groups of annotations such that
       • each group has at least one common annotation
       • the strongest common annotation of each group is distinct
     • NB: Groups may fully or partially overlap, so standard clustering algorithms cannot be used.

  8. Classification Algorithm
     • Find pairs of annotations with at least one common annotation
     • Each pair is a nucleus of a potential group
     • Each group grows by adding annotations one at a time
       • the group retains its strongest common annotation at each step
       • each addition maximizes the group strength
       • an annotation added to one group remains available for addition to other groups
     Where does the growth stop? (strength = group strength, $Z_g$)
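The growth procedure above can be sketched with heavy simplifications. Here an annotation is reduced to a dictionary mapping motif identity to p-value, a common annotation is just the set of shared motifs (distances and order are ignored), and `strength` is a toy stand-in for the group strength, not the statistic defined in the talk. Growth stops only when no annotation can be added; no strength threshold is applied.

```python
import math
from itertools import combinations


def strength(annotations, members, core):
    # Toy stand-in for Z_g: sum over core motifs of -ln(largest p-value in the group).
    return sum(
        -math.log(max(annotations[m][motif] for m in members)) for motif in core
    )


def grow_groups(annotations, min_sites=2):
    """Grow groups from every pair of annotations that shares a common annotation.

    annotations: dict mapping a sequence name to {motif: p_value}.
    Returns {frozenset(members): (core_motifs, toy_strength)}.
    """
    names = sorted(annotations)
    groups = {}
    for a, b in combinations(names, 2):
        core = frozenset(annotations[a]) & frozenset(annotations[b])
        if len(core) < min_sites:
            continue  # this pair is not a nucleus
        members = [a, b]
        while True:
            # Only annotations that preserve the core can join the group.
            candidates = [
                n for n in names
                if n not in members and core.issubset(annotations[n])
            ]
            if not candidates:
                break
            best = max(
                candidates, key=lambda n: strength(annotations, members + [n], core)
            )
            members.append(best)
        # Groups reached from different nuclei collapse onto one key ("pruning").
        groups[frozenset(members)] = (core, strength(annotations, members, core))
    return groups


annotations = {
    "s1": {"A": 1e-4, "B": 1e-3},
    "s2": {"A": 2e-4, "B": 1e-3, "C": 1e-2},
    "s3": {"A": 1e-4, "B": 2e-3, "C": 1e-2},
    "s4": {"D": 1e-3, "C": 1e-2},
}
groups = grow_groups(annotations)
```

On this toy input two groups emerge, {s1, s2, s3} sharing motifs A and B and {s2, s3} sharing A, B and C. The second group lies inside the first, which illustrates why groups are allowed to overlap.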

  9. Stopping criterion
     • No more annotations can be added
       • group contains all annotations in the input set
       • change in the strongest common annotation
     • Formed during growth of another group
       • ignore current group (“pruning”)
     • Group strength is too small
       • adding an unrelated annotation
       • group strength $Z_g$ is a score ($Z_g > 0$)
       • can be computed for groups of random annotations
       • by the extremal types theorem, $\lim_{n \to \infty} \mathrm{Prob}(Z_g^{\mathrm{rand}} > Z_g) = 1 - \exp[-(Z_g/b)^{-a}]$
       • threshold on $Z_g$
       • numerical calibration of a, b for all possible N, S
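The limiting distribution on this slide can be evaluated directly and inverted to obtain a threshold on the group strength. In the sketch below the function names and the example values of a and b are hypothetical; according to the slide, a and b are calibrated numerically for each N and S.

```python
import math


def random_group_tail_probability(z_g, a, b):
    """Asymptotic Prob(Z_g^rand > z_g) = 1 - exp[-(z_g / b)^(-a)], for z_g > 0."""
    return -math.expm1(-((z_g / b) ** (-a)))


def strength_threshold(alpha, a, b):
    """Group strength whose tail probability equals alpha (inverse of the formula)."""
    return b * (-math.log1p(-alpha)) ** (-1.0 / a)


# Hypothetical calibration constants, for illustration only.
a_example, b_example = 3.0, 2.0
z_cut = strength_threshold(0.05, a_example, b_example)
tail_at_cut = random_group_tail_probability(z_cut, a_example, b_example)
```

A group would then be kept only while its strength stays above `z_cut`, meaning that a group of random annotations reaches that strength with probability at most 0.05 under the fitted distribution.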

  10. From annotation groups to RMs
      • Need a way to:
        • account for optional sites
        • search for homologous RMs

  11. RMs as generalized HMMs
      • Generalized (duration) HMMs (gHMMs or dHMMs) consist of 2 types of states
        • motif states (PSSMs): annotation sites
        • spacer states (distance distributions): gaps between sites
      • States are connected according to a certain topology
      • Transition probabilities depend only on the present state
      • Common annotations of groups are simple gHMMs
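To make the two state types concrete, here is a minimal generative sketch of a gHMM with motif states and spacer states. The representation (dictionaries of states and transitions), the function names, and the toy two-site model are all assumptions for illustration; spacer states emit uniform background bases, which is a simplification.

```python
import random

BASES = "ACGT"


def draw(dist, rng):
    """Draw a key from a {value: probability} dictionary."""
    values = list(dist)
    return rng.choices(values, weights=[dist[v] for v in values], k=1)[0]


def sample_module(states, transitions, rng, start="S", end="E"):
    """Sample one sequence and its state path from a gHMM.

    states: name -> ("motif", list of per-column {base: prob}) or
            name -> ("spacer", {length: prob})
    transitions: name -> {next_state: prob}; depends only on the present state.
    """
    pieces, path = [], []
    state = draw(transitions[start], rng)
    while state != end:
        kind, params = states[state]
        if kind == "motif":
            emitted = "".join(draw(column, rng) for column in params)
        else:  # spacer: draw a duration, then emit background bases
            length = draw(params, rng)
            emitted = "".join(rng.choice(BASES) for _ in range(length))
        pieces.append(emitted)
        path.append(state)
        state = draw(transitions[state], rng)
    return "".join(pieces), path


# Toy model: site A, a 20 or 21 bp gap, then site B.
states = {
    "A": ("motif", [{"T": 1.0}, {"A": 1.0}, {"T": 1.0}, {"A": 1.0}]),
    "gap": ("spacer", {20: 0.5, 21: 0.5}),
    "B": ("motif", [{"G": 1.0}, {"G": 1.0}, {"C": 1.0}, {"C": 1.0}]),
}
transitions = {"S": {"A": 1.0}, "A": {"gap": 1.0}, "gap": {"B": 1.0}, "B": {"E": 1.0}}
sequence, path = sample_module(states, transitions, random.Random(0))
```

An optional site would correspond to a state with more than one outgoing transition, for example a spacer that can lead either to another motif state or directly to the end state.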

  12. RMs as generalized HMMs
      [Figure: gHMM state diagrams with states S, S0, S1, S2, S3 and E]
      • Common annotations define gHMM states
      • Overlaps define topology and provide estimates of transition probabilities
      • Multiple matches to the model
      Can make a single model because of the overlap!

  13. From annotations to RMs
      [Figure: panels “Annotations” and “RMs”; gHMM with states S, S0, S1, S2, S3 and E]

  14. Testing the Method: Test 1
      [Figure: motifs m0, m1]
      • 25 random DNA sequences, 20 are “seeded” with an RM
        • 2 sites with low p-value ($< 10^{-3}$) separated by 20–25 bp
      • Scan sequences with an unrelated motif subject to a p-value threshold
        • 3rd site (random noise in annotations)
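A test set of this shape could be generated along the following lines. This is a sketch under several assumptions: the two site strings are arbitrary placeholders, exact copies are implanted instead of sites with a controlled p-value, the 20 to 25 bp separation is measured from the end of the first site to the start of the second, and the sequence length of 500 follows the ~0.5 kb figure from the problem setup.

```python
import random


def seeded_sequences(n_total=25, n_seeded=20, length=500,
                     site_a="TTGACA", site_b="TATAAT",
                     gap_range=(20, 25), seed=0):
    """Generate random DNA; implant site_a, a gap, and site_b in the first n_seeded.

    Returns a list of (sequence, a_start, b_start); the starts are None for
    unseeded sequences. Site strings are placeholders, not from the slides.
    """
    rng = random.Random(seed)
    data = []
    for i in range(n_total):
        bases = [rng.choice("ACGT") for _ in range(length)]
        a_start = b_start = None
        if i < n_seeded:
            gap = rng.randint(*gap_range)  # inclusive on both ends
            span = len(site_a) + gap + len(site_b)
            a_start = rng.randint(0, length - span)
            b_start = a_start + len(site_a) + gap
            bases[a_start:a_start + len(site_a)] = site_a
            bases[b_start:b_start + len(site_b)] = site_b
        data.append(("".join(bases), a_start, b_start))
    return data


test_set = seeded_sequences()
```

The third, unrelated site from the slide would come from scanning these sequences with another motif, which is outside the scope of this sketch.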

  15. Testing the Method: Test 2
      [Figure: motifs m0, m1, m3, m4]
      • 25 random DNA sequences, 2 non-overlapping groups of 10 and 11 sequences
        • each group is “seeded” with a distinct RM (2 sites)
        • distance between sites is 20–25 bp or 52–55 bp
      • Extra site added as before

  16. Testing the Method: Test 3
      [Figure: motifs m0, m1, m3, m4]
      • 25 random DNA sequences, 2 overlapping groups of 12 and 14 sequences
        • same RMs as in previous test
        • groups overlap by 5 sequences
      • Extra site added as before

  17. Summary
      • A method for discovery of regulatory modules given a set of annotated sequences
      • Builds RMs from recurrent annotation patterns
      • Treats site p-values and distances in a consistent statistical framework
      • Can use prior information on RMs (Bayesian approach)
      • RMs are output as gHMMs
        • flexibility of RM structure (topology)
        • searching for homologous RMs

  18. Future developments
      • Testing the method on real data
        • upstream regions of bacterial operons
        • bacterial Fe-regulons
        • other benchmark sets?
      • Algorithm improvements
        • better stopping criterion (use properties of distance distributions)
        • more precise computation of common annotation strength
        • better similarity measure for site p-values (reduce compensation)

  19. Acknowledgements
      Thanks to David Ardell (LCB, Uppsala) and Georgiy Sofronov (Univ. of Queensland, Brisbane) for many fruitful discussions.
