Physical Mapping -- An Algorithm and An Approximation for Hybridization Mapping Shi Chen CSE497

Physical Mapping --An Algorithm and An Approximation for Hybridization Mapping Shi Chen CSE497 04Mar2004

Introduction
Why physical mapping?
- Physical mapping is a central problem in molecular biology.
- DNA is cut into small fragments for replication and study, and the information on their ordering is lost.
- The goal of physical mapping is to reconstruct the relative ordering of the clones.

Introduction
Two popular ways of obtaining fingerprints:
• Restriction site analysis. Measure the fragment's length, which serves as its fingerprint.
• Hybridization. Check whether a small sequence, known as a probe, binds (hybridizes) to the clone, which is a DNA fragment. Most often a probe is an STS (sequence tagged site): a DNA string of 200-300 bp whose ends occur only once in the entire genome.

Models for Hybridization Mapping
- Interval graph models: vertices represent clones and edges represent overlap information between clones.
- Disadvantage: the resulting problems are NP-hard.

Models for Hybridization Mapping – C1P definition
Definition: A binary matrix has the consecutive ones property (C1P) if its columns can be permuted so that all 1s in each row are consecutive.

Models for Hybridization Mapping – C1P
Assumptions of the consecutive ones property (C1P) model:
a. Probes are unique: a probe can bind to a clone in at most one place (use STSs, sequence tagged sites).
b. No errors, so a C1P permutation exists.
c. All clones × probes hybridization experiments have been done, which is difficult to achieve.
Advantage: solvable in polynomial time.

Models for Hybridization Mapping – C1P model
n clones and m probes.
An n × m binary matrix M is built from experimental data:
$M_{ij} = 1$ if probe j hybridized to clone i
$M_{ij} = 0$ if probe j did not hybridize to clone i
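As an illustration of the matrix model, the sketch below builds M from a list of observed hybridizations. The function name `build_matrix` and the 0-based `(clone, probe)` pair format are assumptions made for this example only.

```python
def build_matrix(n, m, hits):
    """Build the n x m binary matrix M from observed hybridizations.

    `hits` is a list of (clone, probe) index pairs, 0-based (an assumed
    input format, chosen for this sketch).
    """
    matrix = [[0] * m for _ in range(n)]
    for clone, probe in hits:
        matrix[clone][probe] = 1  # probe hybridized to clone
    return matrix


if __name__ == "__main__":
    for row in build_matrix(2, 3, [(0, 1), (1, 2)]):
        print(row)
```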

Algorithm for C1P – Introduction
Goal: find a permutation of the columns such that in each row all 1s are consecutive.
Assumptions:
- All rows are different, i.e. no two clones have the same fingerprint.
- No row is all zeros, i.e. every clone is hybridized by at least one probe.
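To make the goal concrete, the sketch below checks whether a given column order puts all 1s of every row in one block, and searches all orders by brute force. This exhaustive search is only an illustration for tiny matrices; it is not the polynomial-time algorithm described on the following slides, and the function names are chosen for this example.

```python
from itertools import permutations


def rows_consecutive(matrix, order):
    """True if, with columns arranged as `order`, each row's 1s are consecutive."""
    for row in matrix:
        ones = [pos for pos, col in enumerate(order) if row[col] == 1]
        # The 1s are consecutive when they span exactly len(ones) positions.
        if ones and ones[-1] - ones[0] + 1 != len(ones):
            return False
    return True


def brute_force_c1p(matrix):
    """Return some column order with consecutive 1s in every row, or None."""
    m = len(matrix[0])
    for order in permutations(range(m)):
        if rows_consecutive(matrix, order):
            return list(order)
    return None


if __name__ == "__main__":
    print(brute_force_c1p([[1, 0, 1], [0, 1, 1]]))
    # Every pair of the three columns must be adjacent: impossible on a line.
    print(brute_force_c1p([[1, 1, 0], [0, 1, 1], [1, 0, 1]]))
```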

Algorithm for C1P – Algorithm sketch
• Separate the rows into components (subsets of rows).
• Permute the columns of each component.
• Join the components together.

Algorithm for C1P – Row relations
Definition: for every row $i \in M$, $S_i = \{\text{columns } k \mid M_{i,k} = 1\}$.
Given two rows $i$ and $j$:
• $S_i \cap S_j = \emptyset$, or
• $S_i \subseteq S_j$ or $S_j \subseteq S_i$, or
• $S_i \cap S_j \neq \emptyset$ and neither is a subset of the other.
First case: i and j have no conflicts; they can be dealt with separately.
Second case: i and j are compatible; any solution for the row with fewer 1s is acceptable.
Third case: i and j have to be treated simultaneously; they are connected.
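The three cases can be told apart directly with set operations. A minimal sketch follows; the function name `relation` and the returned labels are illustrative choices, not names from the textbook.

```python
def relation(s_i, s_j):
    """Classify two rows by their column sets S_i and S_j."""
    if not (s_i & s_j):
        return "disjoint"      # first case: no conflicts
    if s_i <= s_j or s_j <= s_i:
        return "contained"     # second case: compatible
    return "overlapping"       # third case: connected


if __name__ == "__main__":
    print(relation({1, 2}, {3}))
    print(relation({1, 2}, {1, 2, 3}))
    print(relation({2, 7, 8}, {1, 4, 7, 8}))
```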

Algorithm for C1P – Taking care of a component
[Table 5.1: A binary matrix.]
[Figure 5.7: Graph Gc corresponding to the matrix of Table 5.1, with rows l1 to l8 grouped into components α, β, γ and δ.]
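A sketch of how the components of Gc could be computed: two rows are joined by an edge when they fall into the third case (they overlap and neither contains the other), and components are found by a depth-first search. The row sets below are an assumption: they are read off the final arrangement shown on the later "Joining Components Together" slides, since Table 5.1 itself is not reproduced here. The pairwise comparison is the naive approach, not the O(nm) construction quoted on the complexity slide.

```python
def components(row_sets):
    """Group rows into connected components of the overlap graph Gc."""
    names = list(row_sets)
    adjacent = {name: [] for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = row_sets[a], row_sets[b]
            # Edge only in the third case: overlap without containment.
            if sa & sb and not sa <= sb and not sb <= sa:
                adjacent[a].append(b)
                adjacent[b].append(a)
    seen, result = set(), []
    for start in names:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adjacent[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        result.append(sorted(comp))
    return result


# Row sets reconstructed from the later slides (assumption, see above).
ROWS = {
    "l1": {1, 2, 4, 5, 7, 9},
    "l2": {2, 3, 4, 5, 6, 7, 8, 9},
    "l3": {2, 4, 5, 7, 9},
    "l4": {3, 8},
    "l5": {3, 6},
    "l6": {4, 7},
    "l7": {2, 7},
    "l8": {4, 5, 9},
}

if __name__ == "__main__":
    print(components(ROWS))
```

With these sets the search yields four groups, matching the components α = {l1, l2}, β = {l3}, γ = {l4, l5} and δ = {l6, l7, l8} used on the later slides.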

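Algorithm for C1P – Example Matrix
A section of a binary matrix with rows l1, l2, l3.
Placing l1, whose 1s are in columns {2,7,8}:
cols:           {2,7,8}  {2,7,8}  {2,7,8}
l1 → … 0        1        1        1        0 …
Placing l2, with column 5 on the left:
cols:         {5}    {2,7}  {2,7}  {8}
l1 → … 0      0      1      1      1      0 …
l2 → … 0      1      1      1      0      0 …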

Algorithm for C1P – Example Matrix
What will happen if we place 5 on the right?
cols:         {8}    {7,2}  {7,2}  {5}
l1 → … 0      1      1      1      0      0 …
l2 → … 0      0      1      1      1      0 …

Algorithm for C1P – Example Matrix
How to place l3? Consider the number of elements in the intersections between $S_1$, $S_2$ and $S_3$.
Definition: let $x * y = |S_x \cap S_y|$ be the internal product of rows x and y.
- If $l_1 * l_3 < \min(l_1 * l_2,\ l_2 * l_3)$, place l3 in the same direction in which l2 was placed with respect to l1.
- If $l_1 * l_3 > \min(l_1 * l_2,\ l_2 * l_3)$, place l3 in the direction opposite to the one in which l2 was placed with respect to l1.

Algorithm for C1P – Example Matrix
In our case $S_3 = \{1,4,7,8\}$, so $l_1 * l_3 = 2$, $l_1 * l_2 = 2$ and $l_3 * l_2 = 1$. So, place l3 to the right of l2.
cols:         {5}    {2}    {7}    {8}    {1,4}  {1,4}
l1 → … 0      0      1      1      1      0      0      0 …
l2 → … 0      1      1      1      0      0      0      0 …
l3 → … 0      0      0      1      1      1      1      0 …
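The placement rule for a third row can be sketched as a small function. It is applied here to the example, with S1 = {2,7,8} and S3 = {1,4,7,8} as given on the slides; S2 = {2,5,7} is an assumption inferred from the column labels of the arrangement of l1 and l2. The function and argument names are illustrative, and the tie case is left undecided because the rule as stated only covers strict inequalities.

```python
def inner(s_x, s_y):
    """Internal product of two rows: size of the intersection of their column sets."""
    return len(s_x & s_y)


def direction_for_third(s1, s2, s3, direction_of_l2):
    """Direction ('left' or 'right') in which to place the third row.

    `direction_of_l2` is the side on which l2 was placed with respect to l1.
    Returns None when the two quantities are equal (case not covered here).
    """
    opposite = {"left": "right", "right": "left"}
    a = inner(s1, s3)
    b = min(inner(s1, s2), inner(s2, s3))
    if a < b:
        return direction_of_l2
    if a > b:
        return opposite[direction_of_l2]
    return None


if __name__ == "__main__":
    S1, S2, S3 = {2, 7, 8}, {2, 5, 7}, {1, 4, 7, 8}
    # l2 was placed to the left of l1 (column 5 on the left).
    print(direction_for_third(S1, S2, S3, "left"))
```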

Algorithm for C1P – Complexity
Building graph Gc takes O(nm) time.
Process n rows, spending O(m) per row to check consistency of column sets.
Total time is O(nm).

Algorithm for C1P – Joining Components Together
[Table 5.1: A binary matrix.]
[Figure 5.9: Graph GM corresponding to the components α, β, γ and δ of the matrix from Table 5.1.]

Algorithm for C1P – Joining Components Together
Process GM in topological order:
- Process first the components that have sets that are not contained anywhere else.
- When following an edge (α, β), find a “reference column” in component α that tells us how to place the rows of β:
a. Choose the row l from β that has the leftmost 1, and call the column where this 1 is cβ.
b. Find all rows from α that contain $S_l$, and find the leftmost column where all such rows have 1s. This column, cα, is the reference column.

Algorithm for C1P – Joining Components Together
Component α:
cols:   {1}  {2,4,5,7,9}  {3,6,8}
l1 → …  1    1 1 1 1 1    0 0 0  …
l2 → …  0    1 1 1 1 1    1 1 1  …
Component β:
cols:   {2,4,5,7,9}
l3 → …  1 1 1 1 1  …
After joining β to α:
cols:   {1}  {2,4,5,7,9}  {3,6,8}
l1 → …  1    1 1 1 1 1    0 0 0  …
l2 → …  0    1 1 1 1 1    1 1 1  …
l3 → …  0    1 1 1 1 1    0 0 0  …

Algorithm for C1P – Joining Components Together
Component δ:
cols:   {9,5}  {4}  {7}  {2}
l6 → …  0 0    1    1    0  …
l7 → …  0 0    0    1    1  …
l8 → …  1 1    1    0    0  …
After joining δ:
cols:   {1}  {9,5}  {4}  {7}  {2}  {3,6,8}
l1 → …  1    1 1    1    1    1    0 0 0  …
l2 → …  0    1 1    1    1    1    1 1 1  …
l3 → …  0    1 1    1    1    1    0 0 0  …
l6 → …  0    0 0    1    1    0    0 0 0  …
l7 → …  0    0 0    0    1    1    0 0 0  …
l8 → …  0    1 1    1    0    0    0 0 0  …

Algorithm for C1P – Joining Components Together
Component γ:
cols:   {6}  {3}  {8}
l4 → …  0    1    1  …
l5 → …  1    1    0  …
After joining γ:
cols:   {1}  {9,5}  {4}  {7}  {2}  {6}  {3}  {8}
l1 → …  1    1 1    1    1    1    0    0    0  …
l2 → …  0    1 1    1    1    1    1    1    1  …
l3 → …  0    1 1    1    1    1    0    0    0  …
l6 → …  0    0 0    1    1    0    0    0    0  …
l7 → …  0    0 0    0    1    1    0    0    0  …
l8 → …  0    1 1    1    0    0    0    0    0  …
l4 → …  0    0 0    0    0    0    0    1    1  …
l5 → …  0    0 0    0    0    0    1    1    0  …
Components in the order shown: α, β, δ, γ.

Algorithm for C1P – Joining Components Together
Complexity:
- Topological sorting takes O(n+m).
- Preprocessing takes at most O(nm), e.g. storing for each row the column where its leftmost 1 is.
- Total time is O(nm).

Approximation for Hybridization Mapping with Errors
0 1 1 0 1 1 1 1 0 0
A false negative separates two blocks of 1s, creating another gap.
Approach: find a permutation for which the total number of gaps in the matrix is minimum.
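Counting gaps is straightforward: a row with k blocks of consecutive 1s has k - 1 gaps. A minimal sketch follows, applied to the digits shown on the slide, which are read here as a single row (an assumption about the original figure). The function names are illustrative.

```python
def gaps_in_row(row):
    """Number of gaps in a row: blocks of consecutive 1s, minus one."""
    blocks = 0
    previous = 0
    for value in row:
        if value == 1 and previous == 0:
            blocks += 1  # a new block of 1s starts here
        previous = value
    return max(blocks - 1, 0)


def total_gaps(matrix):
    """Total number of gaps over all rows of the matrix."""
    return sum(gaps_in_row(row) for row in matrix)


if __name__ == "__main__":
    print(gaps_in_row([0, 1, 1, 0, 1, 1, 1, 1, 0, 0]))
```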

Approximation - Graph Model
Gap minimization is equivalent to solving a traveling salesman problem (TSP).
[Table 5.3: A clones × probes matrix with added column p6*.]

Approximation - Graph Model
The weight on each edge of G is the number of rows where the two corresponding columns differ.
[Figure 5.10: TSP graph for the matrix of Table 5.3, with vertices p1, p2, p3, p4, p5 and p6* and weighted edges.]

Approximation - Graph Model
• A gap: a transition from 1 to 0 followed, further on, by a transition from 0 to 1.
- There are two transitions for each gap, so each gap contributes 2 to the weight of the cycle.
• Extremal transitions: transitions between elements in an extremal column (1 or m).
- An extra column of zeros is included as column m+1 to ensure that every row has a pair of extremal transitions. It also prevents the consecutive 1s of a row from wrapping around.

Approximation - Graph Model
- Relationship between cycles and permutations: cycle weight = number of gap transitions + 2n. For a given n, minimizing the cycle weight is the same as minimizing the number of gaps.
- Drawback: one or a few rows may have many gaps, while others may have none. This would mean that one clone was subject to many more errors than the other clones, which contradicts laboratory experience.
- Solution: minimize the number of gaps per row.
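The relationship between cycle weight and gaps can be checked numerically. The sketch below computes the weight of the cycle given by a column order (with the extra all-zero column appended) and compares it with twice the number of gaps plus 2n, since each gap accounts for two transitions. The small matrix is a made-up example, not Table 5.3, and it is assumed that no row is all zeros.

```python
def gaps_in_row(row):
    """Number of gaps in a row: blocks of consecutive 1s, minus one."""
    blocks, previous = 0, 0
    for value in row:
        if value == 1 and previous == 0:
            blocks += 1
        previous = value
    return max(blocks - 1, 0)


def cycle_weight(matrix, order):
    """Weight of the TSP cycle visiting the columns in `order`, then the zero column."""
    columns = [[row[c] for row in matrix] for c in order]
    columns.append([0] * len(matrix))  # the added all-zero column m+1
    weight = 0
    for k in range(len(columns)):
        a, b = columns[k], columns[(k + 1) % len(columns)]
        # Edge weight: number of rows in which the two columns differ.
        weight += sum(x != y for x, y in zip(a, b))
    return weight


if __name__ == "__main__":
    matrix = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0]]  # hypothetical data
    order = [0, 1, 2, 3]
    permuted = [[row[c] for c in order] for row in matrix]
    gaps = sum(gaps_in_row(row) for row in permuted)
    print(cycle_weight(matrix, order), 2 * gaps + 2 * len(matrix))
```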

Approximation - Guarantee
- Assumptions:
a. The number of probes is sufficiently large.
b. The mapping process obeys a certain mathematical model.
- Features:
a. Each clone’s position is an independent random variable; clone locations are distributed uniformly over [0, N-1].
b. Occurrences of a given probe obey a Poisson process with rate λ: Pr{a given probe occurs k times in a given clone} = $e^{-\lambda} \lambda^k / k!$.
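The Poisson probability above is easy to evaluate directly. A minimal sketch follows, with the rate value chosen arbitrarily for illustration.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a given probe occurs k times in a given clone."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


if __name__ == "__main__":
    lam = 0.5  # hypothetical rate, for illustration only
    for k in range(4):
        print(k, poisson_pmf(k, lam))
```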

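Approximation - Guarantee
The TSP permutation is a good approximation to the true permutation. The proof is stated in terms of graph weights, or clone distances.
$t_{ij} = |l_j - l_i| + |r_j - r_i| = 2|l_j - l_i|$
Here $t_{ij}$ is the true distance between clones i and j, l and r are a clone’s left and right coordinates, and $h_{ij}$ is the Hamming distance between clones i and j. The last equality holds when all clones have the same length.
Given any four clones i, j, r, and s:
$h_{ij} < h_{rs}$ implies $t_{ij} < t_{rs}$, and
$t_{ij} < t_{rs}$ implies $h_{ij} < h_{rs}$.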

Approximation – Computational Practice
Define the hybridization graph H as a bipartite graph (U, V, E):
- Clones are the vertices of the U partition.
- Probes are the vertices of the V partition.
- There is an edge between two vertices if the corresponding probe hybridized to the corresponding clone.

Approximation – Computational Practice
[Table 5.3: A clones × probes matrix with added column p6*.]
[Figure 5.11: Hybridization graph H corresponding to the hybridization matrix from Table 5.3, without the added column; its vertices are the probes p1 to p5 and the clones c1 to c4.]

Approximation – Computational Practice
Observations:
a. H may not be connected, in which case we cannot tell the relative order between probes that belong to different components.
b. A connected component may be as simple as a singleton vertex: a probe with no hybridization gives an all-zero column.
c. There may be redundant probes, i.e. probes that hybridize to exactly the same set of clones and so have the same 1s and 0s in their columns.
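These observations suggest two simple preprocessing checks: finding the connected components of H and grouping redundant probes. The sketch below does both on a small made-up matrix (not Table 5.3); the vertex labels `("c", i)` and `("p", j)` and the function names are illustrative choices.

```python
from collections import defaultdict


def hybridization_components(matrix):
    """Connected components of the bipartite clone/probe graph H."""
    n, m = len(matrix), len(matrix[0])
    adjacent = defaultdict(set)
    for i in range(n):
        for j in range(m):
            if matrix[i][j]:
                adjacent[("c", i)].add(("p", j))
                adjacent[("p", j)].add(("c", i))
    vertices = [("c", i) for i in range(n)] + [("p", j) for j in range(m)]
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in adjacent[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(sorted(comp))
    return comps


def redundant_probes(matrix):
    """Groups of probe indices whose columns are identical."""
    groups = defaultdict(list)
    for j in range(len(matrix[0])):
        groups[tuple(row[j] for row in matrix)].append(j)
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    M = [[1, 1, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 1, 0]]  # hypothetical data
    print(hybridization_components(M))
    print(redundant_probes(M))
```

In this example the last probe hybridizes to no clone and forms a singleton component, and probes 2 and 3 are redundant.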

Approximation – Computational Practice
Evaluating a mapping algorithm is a difficult task. The fraction of strong adjacencies is used to measure a mapping algorithm.
- Blocks: let b be the number of blocks of consecutive 1s present in a hybridization matrix with a given probe permutation π = p1, p2, …, pm.
- Translocations: operations that reverse the order of a set of consecutive probes.
- Strong adjacencies: two adjacent probes pi and pi+1 represent a strong adjacency if placing these probes apart by any translocation increases b in each row.

Approximation – Computational Practice
Strong adjacency cost: $100 \cdot \frac{1}{m-1} \sum_{i=1}^{m-1} \delta_i$
$\delta_i = 1$ if $p_i$ and $p_{i+1}$ form a strong adjacency in the true permutation but these probes are not adjacent in the proposed permutation;
$\delta_i = 0$ otherwise.
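The cost can be computed as sketched below. Deciding which adjacencies are strong is not implemented here: the list `strong` is taken as given input, and the probe names and flags in the example are hypothetical.

```python
def strong_adjacency_cost(true_perm, strong, proposed):
    """Percentage of strong adjacencies of the true permutation that are broken.

    strong[i] is True when (true_perm[i], true_perm[i + 1]) is a strong
    adjacency; how that is decided is outside this sketch.
    """
    m = len(true_perm)
    position = {probe: k for k, probe in enumerate(proposed)}
    broken = 0
    for i in range(m - 1):
        a, b = true_perm[i], true_perm[i + 1]
        # delta_i = 1: strong in the true order, not adjacent in the proposal.
        if strong[i] and abs(position[a] - position[b]) != 1:
            broken += 1
    return 100.0 * broken / (m - 1)


if __name__ == "__main__":
    true_perm = ["p1", "p2", "p3", "p4", "p5"]
    strong = [True, True, False, True]          # hypothetical flags
    proposed = ["p2", "p1", "p3", "p5", "p4"]   # hypothetical algorithm output
    print(strong_adjacency_cost(true_perm, strong, proposed))
```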

Approximation – Computational Practice
TABLE 5.4 Strong adjacency costs for two algorithms on matrices with different kinds of errors. Error rates are indicated in the heading of each column (only one type of error per column). Coverage in all cases is 10, where coverage is the ratio between the total length of all clones and the target DNA length.

REFERENCES
1. Sections 5.3 and 5.4 in our textbook: Introduction to Computational Molecular Biology, Setubal/Meidanis, 1997.
2. On the Complexity of DNA Physical Mapping, Martin Charles Golumbic, Haim Kaplan and Ron Shamir, Advances in Applied Mathematics 15, 251-261 (1994).
