Picking Alignments from (Steiner) Trees
Lior Pachter, Fumei Lam, Marina Alexandersson
Alignment:
ATCG--G
A-CGTCA
Pair Hidden Markov Models (states X, M, Y); Steiner Networks
Goals: biologically meaningful alignments; fast alignments based on HMM structure
Some basic definitions:
Let G be a graph and S ⊆ V(G). A k-spanner for S is a subgraph G' ⊆ G such that for any u, v ∈ S the length of the shortest path between u and v in G' is at most k times the distance between u and v in G.
Let V(G) = R^2 and E(G) = horizontal and vertical line segments. A Manhattan network is a 1-spanner for a set S of points in R^2.
Vertices in the Manhattan network that are not in S are called Steiner points.
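The 1-spanner condition can be checked directly: every pair of points in S must be joined inside the network by a path whose length equals their L1 (Manhattan) distance. The sketch below is only an illustration and assumes integer coordinates so that segments can be cut into unit edges; the function name `is_manhattan_network` is hypothetical and not taken from the talk.

```python
from collections import deque


def is_manhattan_network(segments, terminals):
    """Return True if the network is a 1-spanner for `terminals` (the set S).

    segments:  iterable of ((x1, y1), (x2, y2)), axis-parallel, integer endpoints
               (integer coordinates are an assumption of this sketch).
    terminals: iterable of (x, y) integer points.
    """
    # Cut every segment into unit edges so that crossings become shared vertices.
    adj = {}
    for (x1, y1), (x2, y2) in segments:
        if x1 != x2 and y1 != y2:
            raise ValueError("segments must be horizontal or vertical")
        dx = (x2 > x1) - (x2 < x1)
        dy = (y2 > y1) - (y2 < y1)
        x, y = x1, y1
        while (x, y) != (x2, y2):
            nxt = (x + dx, y + dy)
            adj.setdefault((x, y), set()).add(nxt)
            adj.setdefault(nxt, set()).add((x, y))
            x, y = nxt

    def bfs(src):
        # Breadth-first search gives shortest path lengths on unit edges.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            p = queue.popleft()
            for q in adj.get(p, ()):
                if q not in dist:
                    dist[q] = dist[p] + 1
                    queue.append(q)
        return dist

    terminals = list(terminals)
    for u in terminals:
        dist = bfs(u)
        for v in terminals:
            l1 = abs(u[0] - v[0]) + abs(u[1] - v[1])
            if dist.get(v) != l1:
                return False
    return True
```

For example, two crossing segments ((0, 1), (2, 1)) and ((1, 0), (1, 2)) form a Manhattan network for their four endpoints, and the crossing (1, 1) is a Steiner point.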
Example: a Manhattan network for S (red points), with a Steiner point marked. [Figure]
[Gudmundsson-Levcopoulos-Narasimhan 2001] Find the shortest Manhattan network connecting the points: 4-approximation in O(n^3) and 8-approximation in O(n log n)
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 1. it suffices to work on the Hanan grid
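The Hanan grid of a point set is obtained by drawing a horizontal and a vertical line through every point and keeping the intersections. A minimal sketch of that construction follows; the function name `hanan_grid` and the vertex/edge representation are illustrative choices, not part of the talk.

```python
def hanan_grid(points):
    """Return (vertices, edges) of the Hanan grid of `points`.

    Vertices are all (x, y) with x an x-coordinate and y a y-coordinate
    occurring in `points`; edges join grid-adjacent vertices.
    """
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    vertices = [(x, y) for x in xs for y in ys]
    edges = []
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if i + 1 < len(xs):  # horizontal edge to the next grid column
                edges.append(((x, y), (xs[i + 1], y)))
            if j + 1 < len(ys):  # vertical edge to the next grid row
                edges.append(((x, y), (x, ys[j + 1])))
    return vertices, edges
```

For n points the grid has at most n^2 vertices, which is why restricting the search to it keeps the problem finite.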
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 2. Construct local slides (for all four orientations). Slide A(v) = {u : v is the topmost node below and to the left of u}
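The definition of A(v) can be computed by brute force for one orientation. The sketch below makes two assumptions that the slide does not state: "below and to the left" is read with weak inequalities, and ties in height are broken toward the rightmost candidate. The name `local_slides` is hypothetical.

```python
def local_slides(points):
    """Group points into slides A(v) for the 'below and to the left' orientation.

    Returns a dict mapping v to the list of points u for which v is the
    topmost point below and to the left of u. Quadratic time, for clarity only.
    """
    slides = {}
    for u in points:
        # Assumption: weak inequalities; the slide does not say strict or weak.
        below_left = [v for v in points
                      if v != u and v[0] <= u[0] and v[1] <= u[1]]
        if not below_left:
            continue  # u has no point below and to the left
        # Topmost candidate; ties broken toward the rightmost (an assumption).
        v = max(below_left, key=lambda p: (p[1], p[0]))
        slides.setdefault(v, []).append(u)
    return slides
```

The other three orientations can be obtained by reflecting the coordinates before calling the same function.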
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 3. Solve each slide. The minimum slide arborescence problem: Lingas-Pinter-Rivest-Shamir 1982, O(n^3) optimal solution using dynamic programming
[Gudmundsson-Levcopoulos-Narasimhan 2001] proof outline: 4. Proof of correctness [Figure: points a, b, u, v]
What is an alignment?
ATCG--GACATTACC-AC
AC-GTCA-GATTA-CAAC
Pair HMMs: simple sequence-alignment PHMM with states X, M, Y
M = (mis)match
X = insert seq1
Y = insert seq2
Pair HMMs
Hidden alignment:
ATCG--G
AC-GTCA
Hidden sequence: M M X M Y Y M (governed by transition probabilities)
Observed sequences: ATCGG and ACGTCA (generated by output probabilities)
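The correspondence between an alignment and its hidden state sequence is mechanical: each column is M if both rows carry a letter, X if only seq1 does, and Y if only seq2 does. A small sketch, with the hypothetical helper name `hidden_states`:

```python
def hidden_states(row1, row2, gap="-"):
    """Map the two rows of a pairwise alignment to a string over {M, X, Y}."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    states = []
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot be gapped in both rows")
        if b == gap:
            states.append("X")  # letter only in seq1
        elif a == gap:
            states.append("Y")  # letter only in seq2
        else:
            states.append("M")  # match or mismatch
    return "".join(states)
```

Removing the gap characters from each row recovers the observed sequences, here ATCGG and ACGTCA.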
Using the Pair HMM
MMXMYYM
ATCG--G
AC-GTCA
In practice, we have observed sequences ATCGG and ACGTCA, for which we wish to infer the underlying hidden states. One solution: among all possible sequences of hidden states, determine the most likely (Viterbi algorithm).
Needleman-Wunsch / Viterbi in PHMM
Match prob: pm, Match score: log(pm)
Mismatch prob: pr, Mismatch score: log(pr)
Gap prob: pg, Gap score: log(pg)
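With scores taken as log-probabilities, the Needleman-Wunsch recurrence picks the highest-probability path through the alignment grid. The sketch below is a simplified illustration: it uses only the three output probabilities and ignores transition probabilities between states, and the default values of `pm`, `pr` and `pg` are arbitrary placeholders rather than values from the talk.

```python
import math


def needleman_wunsch_log(seq1, seq2, pm=0.6, pr=0.1, pg=0.05):
    """Global alignment with log-probability scores.

    Returns (best_log_score, state_string) where the states are
    M = (mis)match, X = insert seq1, Y = insert seq2.
    """
    sm, sr, sg = math.log(pm), math.log(pr), math.log(pg)
    n, m = len(seq1), len(seq2)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # first column: only seq1 letters emitted
        score[i][0], back[i][0] = i * sg, "X"
    for j in range(1, m + 1):  # first row: only seq2 letters emitted
        score[0][j], back[0][j] = j * sg, "Y"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = sm if seq1[i - 1] == seq2[j - 1] else sr
            candidates = [
                (score[i - 1][j - 1] + sub, "M"),
                (score[i - 1][j] + sg, "X"),
                (score[i][j - 1] + sg, "Y"),
            ]
            score[i][j], back[i][j] = max(candidates)
    # Trace back from the bottom-right corner to recover the state path.
    states = []
    i, j = n, m
    while i > 0 or j > 0:
        s = back[i][j]
        states.append(s)
        if s == "M":
            i, j = i - 1, j - 1
        elif s == "X":
            i -= 1
        else:
            j -= 1
    return score[n][m], "".join(reversed(states))
```

Calling `needleman_wunsch_log("ATCGG", "ACGTCA")` returns a log score together with a state string over M, X, Y; which path wins depends on the chosen probabilities.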
We want to take into account that the sequences are genomic sequences. Example: a pair of syntenic genomic regions.
PHMM [Figure: state diagram with states X, M, Y, and a second diagram with states X, Y]
PHMM (states X, Y)
• A property of "single sequence" states is that all paths in the Viterbi graph between two vertices have the same weight
Strategy for Alignment [Figure: example sequences, including GATTACATTGATCAGACAGGTGAAGA]
The CD4 region [Figure: plot with axes human (0 to 50000) and mouse (0 to 50000)]
[Figure: gene structure, 5' to 3': Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, Intron 3, Exon 4]
Translation initiation: ATG
Splice site: GGTGAG
Branchpoint: CTGAC
Splice site: CAG
Stop codon: TAG/TGA/TAA
Suggests a new Steiner problem: find the shortest 1-spanner connecting reds to blues
Generalizes the Manhattan network problem (all points red and blue). Generalizes the Rectilinear Steiner Arborescence problem.
History of the Rectilinear Steiner Arborescence Problem
1985, Trubin: polynomial time algorithm
1992, Rao-Sadayappan-Hwang-Shor: error in Trubin
2000, Shi and Su: NP-complete!
Results for unlabeled problem
• An O(n^3) 2-approximation algorithm (implemented)
• An O(n log n) 4-approximation algorithm
• Testing on CD4 region in human/mouse
• Implementation (SLIM)
• http://bio.math.berkeley.edu/slim/
• SLIM for SLAM (in progress)
• http://bio.math.berkeley.edu/slam/
[Figure: HMM with states CNS, D, X, Y, M, I, shown over two example sequences]
Comparison and Analysis of Performance
• Our method has two main steps (L = length of seqs, n = #HSP):
  1. Building the network: O(n^3) or O(n log n)
  2. Running the Viterbi algorithm for the HMM on the network: O(nL) worst case
• Banding algorithms are O(L^2) worst case for step 2.
• Chaining algorithms are O(n^2) in the case where gap penalties can depend on the sequences.
• These strategies do not generalize well for more sophisticated HMMs.
Summary
ATCG--G
A-CGTCA
(PHMM states X, M, Y)
Software:
SLIM (network build): http://bio.math.berkeley.edu/slim/
SLAM (alignment): http://bio.math.berkeley.edu/slam/
Thanks: Nick Bray and Simon Cawley