Picking Alignments from (Steiner) Trees

Lior Pachter, Fumei Lam, Marina Alexandersson

Presentation Transcript


  1. Picking Alignments from (Steiner) Trees. Lior Pachter, Fumei Lam, Marina Alexandersson

  2. [diagram linking three topics] Alignment (ATCG--G / A-CGTCA); Steiner Networks; Pair Hidden Markov Models (states X, M, Y). Labels: biologically meaningful; fast alignments based on HMM structure.

  3. Some basic definitions: Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S, the length of the shortest path between u and v in G' is at most k times the distance between u and v in G. Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2. Vertices in the Manhattan network that are not in S are called Steiner points.
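The 1-spanner condition can be checked mechanically. The following is a minimal sketch, not code from the talk: it assumes integer coordinates and a network given as a list of axis-parallel segments, and the function names (`build_graph`, `network_distance`, `is_manhattan_network`) are hypothetical. It compares the shortest path inside the network with the rectilinear (L1) distance for every pair of points in S.

```python
import heapq
from itertools import combinations

def build_graph(segments, terminals):
    """Cut axis-parallel segments at every x/y coordinate in use and
    return an adjacency map {point: {neighbour: length}}."""
    xs = sorted({p[0] for seg in segments for p in seg} | {t[0] for t in terminals})
    ys = sorted({p[1] for seg in segments for p in seg} | {t[1] for t in terminals})
    graph = {}
    for (x1, y1), (x2, y2) in segments:
        if y1 == y2:                      # horizontal segment
            lo, hi = sorted((x1, x2))
            pts = [(x, y1) for x in xs if lo <= x <= hi]
        elif x1 == x2:                    # vertical segment
            lo, hi = sorted((y1, y2))
            pts = [(x1, y) for y in ys if lo <= y <= hi]
        else:
            raise ValueError("segments must be horizontal or vertical")
        for a, b in zip(pts, pts[1:]):
            w = abs(a[0] - b[0]) + abs(a[1] - b[1])
            graph.setdefault(a, {})[b] = w
            graph.setdefault(b, {})[a] = w
    return graph

def network_distance(graph, src, dst):
    """Dijkstra shortest-path length from src to dst (inf if unreachable)."""
    inf = float("inf")
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, inf):
            continue
        for v, w in graph.get(u, {}).items():
            if d + w < dist.get(v, inf):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return inf

def is_manhattan_network(segments, terminals):
    """True if every pair of terminals is joined by a path of L1 length."""
    graph = build_graph(segments, terminals)
    return all(
        network_distance(graph, p, q) == abs(p[0] - q[0]) + abs(p[1] - q[1])
        for p, q in combinations(terminals, 2)
    )

S = [(0, 0), (2, 1), (1, 2)]
network = [((0, 0), (2, 0)), ((2, 0), (2, 1)), ((1, 0), (1, 2))]
print(is_manhattan_network(network, S))                         # (2,1)-(1,2) needs a detour
print(is_manhattan_network(network + [((1, 1), (2, 1))], S))    # extra segment through (1,1)
```

In the second call the point (1, 1) is in the network but not in S, so it acts as a Steiner point in the sense defined above.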

  4. Example: [figure: a Manhattan network with a Steiner point marked] S: red points

  5. [Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points: a 4-approximation in O(n^3) and an 8-approximation in O(n log n)

  6. [Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
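For reference, the Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and keeping the crossings. The sketch below is illustrative only (the function name `hanan_grid` is hypothetical and is not taken from SLIM); it returns the grid vertices and the edges between adjacent crossings.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of a set of 2D points."""
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    for y in ys:                                  # horizontal grid edges
        for a, b in zip(xs, xs[1:]):
            edges.append(((a, y), (b, y)))
    for x in xs:                                  # vertical grid edges
        for a, b in zip(ys, ys[1:]):
            edges.append(((x, a), (x, b)))
    return vertices, edges

vertices, edges = hanan_grid([(0, 0), (2, 1), (1, 2)])
print(len(vertices), len(edges))
```

With n points in general position the grid has n^2 vertices, so restricting the search to it turns the continuous problem into a finite one.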

  7. [Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations). Slide A(v) = {u : v is the topmost node below and to the left of u} [figure: slide for a node v]

  8. [Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide. The minimum slide arborescence problem: Lingas-Pinter-Rivest-Shamir 1982, O(n^3) optimal solution using dynamic programming

  9. [Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness [figure: points u, v, a, b]

  10. What is an alignment? ATCG--GACATTACC-AC AC-GTCA-GATTA-CAAC

  11. Pair HMMs. Simple sequence-alignment PHMM with states X, M, Y: M = (mis)match, X = insert seq1, Y = insert seq2
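To make the three state labels concrete, here is a small sketch that reads a gapped alignment column by column and emits M, X, or Y as defined on this slide. The helper name `alignment_to_states` is hypothetical, and "-" is assumed to be the gap character.

```python
def alignment_to_states(row1, row2):
    """Map each alignment column to a PHMM state:
    M = both rows have a base, X = base only in seq1, Y = base only in seq2."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    states = []
    for c1, c2 in zip(row1, row2):
        if c1 == "-" and c2 == "-":
            raise ValueError("a column cannot be all gaps")
        if c2 == "-":
            states.append("X")
        elif c1 == "-":
            states.append("Y")
        else:
            states.append("M")
    return "".join(states)

print(alignment_to_states("ATCG--G", "AC-GTCA"))  # prints MMXMYYM
```

Removing the gap characters from each row recovers the two observed sequences, ATCGG and ACGTCA.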

  12. Pair HMMs. Hidden alignment: ATCG--G / AC-GTCA. Observed sequences: ATCGG and ACGTCA. Hidden sequence: M M X M Y Y M. Parameters: transition probabilities (between hidden states) and output probabilities (for the emitted bases).

  13. Using the Pair HMM. In practice, we have observed sequences ATCGG and ACGTCA, for which we wish to infer the underlying hidden states (for example MMXMYYM, i.e. ATCG--G / AC-GTCA). One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
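A minimal Viterbi sketch for the three-state pair HMM follows. It is not the SLAM implementation: every probability below (start, transition, and output values) is an assumption chosen only so that the example runs, there is no explicit end state, and the names `viterbi_pair_hmm`, `LOG_START`, `LOG_TRANS`, `log_emit_match`, and `LOG_EMIT_GAP` are hypothetical.

```python
import math

NEG_INF = float("-inf")
STATES = "MXY"
STEP = {"M": (1, 1), "X": (1, 0), "Y": (0, 1)}   # bases consumed from (seq1, seq2)

# Assumed parameters, for illustration only.
LOG_START = {s: math.log(1 / 3) for s in STATES}
LOG_TRANS = {
    "M": {"M": math.log(0.8), "X": math.log(0.1), "Y": math.log(0.1)},
    "X": {"M": math.log(0.6), "X": math.log(0.3), "Y": math.log(0.1)},
    "Y": {"M": math.log(0.6), "X": math.log(0.1), "Y": math.log(0.3)},
}
LOG_EMIT_GAP = math.log(0.25)                    # one base emitted by X or Y

def log_emit_match(x, y):
    """Output log-probability of the pair (x, y) in state M."""
    return math.log(0.2) if x == y else math.log(0.2 / 12)

def viterbi_pair_hmm(a, b):
    """Return (best log-probability, most likely state path) for sequences a, b."""
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 0.0, ""
    V = {s: [[NEG_INF] * (m + 1) for _ in range(n + 1)] for s in STATES}
    back = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in STATES}
    for i in range(n + 1):
        for j in range(m + 1):
            for s in STATES:
                pi, pj = i - STEP[s][0], j - STEP[s][1]
                if pi < 0 or pj < 0:
                    continue
                e = log_emit_match(a[pi], b[pj]) if s == "M" else LOG_EMIT_GAP
                if pi == 0 and pj == 0:          # first column of the alignment
                    V[s][i][j] = LOG_START[s] + e
                    continue
                best, best_prev = NEG_INF, None
                for r in STATES:
                    cand = V[r][pi][pj] + LOG_TRANS[r][s]
                    if cand > best:
                        best, best_prev = cand, r
                if best_prev is not None:
                    V[s][i][j] = best + e
                    back[s][i][j] = best_prev
    # Trace back from the best final state.
    s = max(STATES, key=lambda t: V[t][n][m])
    score, path, i, j = V[s][n][m], [], n, m
    while s is not None:
        path.append(s)
        prev = back[s][i][j]
        i, j = i - STEP[s][0], j - STEP[s][1]
        s = prev
    return score, "".join(reversed(path))

score, path = viterbi_pair_hmm("ATCGG", "ACGTCA")
print(path, score)
```

This version fills the full (n+1) by (m+1) table, so it costs O(nm) time; the returned path depends entirely on the assumed parameters.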

  14. Needleman-Wunsch: Viterbi in PHMM [figure: state diagram with states X, M, Y; edge labels not recoverable]. Match prob: pm; mismatch prob: pr; gap prob: pg. Match score: log(pm); mismatch score: log(pr); gap score: log(pg)
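The correspondence can be illustrated with a plain Needleman-Wunsch recursion whose scores are log-probabilities, so that the best score is the log of the product of per-column probabilities. This is a sketch under assumptions: the values 0.6, 0.1, and 0.3 for pm, pr, and pg are made up for the example, gaps are scored linearly, and the function name `needleman_wunsch` is hypothetical.

```python
import math

def needleman_wunsch(a, b, match, mismatch, gap):
    """Global alignment score with linear gap cost (scores are added per column)."""
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    return S[n][m]

# Assumed probabilities, for illustration only.
pm, pr, pg = 0.6, 0.1, 0.3
score = needleman_wunsch("ATCGG", "ACGTCA", math.log(pm), math.log(pr), math.log(pg))
print(score, math.exp(score))   # log-score and the corresponding probability product
```

Because the log is monotone, maximizing the summed scores selects the same alignment as maximizing the product of the probabilities.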

  15. Want to take into account that the sequences are genomic sequences: Example: a pair of syntenic genomic regions

  16. PHMM [figure: state diagram with states X, M, Y; the states X and Y singled out]

  17. PHMM, states X and Y. A property of "single sequence" states is that all paths in the Viterbi graph between two vertices have the same weight.

  18. Strategy for Alignment [figure: example sequences, including GATTACATTGATCAGACAGGTGAAGA]

  19. The CD4 region [figure: plot with axes human and mouse, each running from 0 to 50000]

  20. [figure: gene structure, 5' to 3'] Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4. Signals: translation initiation ATG; splice site GGTGAG; branchpoint CTGAC; splice site CAG; stop codon TAG/TGA/TAA

  21. Suggests a new Steiner problem: find the shortest 1-spanner connecting reds to blues

  22. Generalizes the Manhattan network problem (all points red and blue). Generalizes the Rectilinear Steiner Arborescence problem.

  23. History of the Rectilinear Steiner Arborescence Problem: 1985, Trubin: polynomial time algorithm; 1992, Rao-Sadayappan-Hwang-Shor: error in Trubin; 2000, Shi and Su: NP-complete!

  24. Results for unlabeled problem • An O(n^3) 2-approximation algorithm (implemented) • An O(n log n) 4-approximation algorithm • Testing on CD4 region in human/mouse • Implementation (SLIM): http://bio.math.berkeley.edu/slim/ • SLIM for SLAM (in progress): http://bio.math.berkeley.edu/slam/

  25. [figure: alignment PHMM with states M, X, Y, I, D, CNS, drawn over example DNA sequences]

  26. The Viterbi graph for a more complicated alignment PHMM

  27. Comparison and Analysis of Performance • Our method has two main steps (L = length of seqs, n = #HSP): building the network, O(n^3) or O(n log n); and running the Viterbi algorithm for the HMM on the network, O(nL) worst case. • Banding algorithms are O(L^2) worst case for step 2. • Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences. • These strategies do not generalize well for more sophisticated HMMs.

  28. Summary [diagram: alignment ATCG--G / A-CGTCA; PHMM states X, M, Y]. Software: SLIM (network build): http://bio.math.berkeley.edu/slim/ and SLAM (alignment): http://bio.math.berkeley.edu/slam/. Thanks: Nick Bray and Simon Cawley