Identification of large-scale genomic rearrangements between closely related organisms

Bob Mau (1,2), Aaron Darling (1,3), Fred Blattner (4,5), Nicole Perna (1,5)
Departments of Animal Health and Biomedical Sciences (1), Oncology (2), Computer Science (3), Laboratory of Genetics (4), Genome Center (5), University of Wisconsin – Madison

The Amazing Variety of Diseases Caused by E. coli Strains (in Bacterial Pathogenesis: A Molecular Approach) “… is due to the fact different strains have acquired different sets of virulence genes. Most strains of E. coli are avirulent because they lack these virulence genes. E. coli is an excellent example of the maxim that it is the set of virulence genes carried by an organism that makes it a pathogen, not its species or genus designation.”

Categories of Bacterial Genome Evolution
• Local: single base mutations; indels (small insertions and deletions)
• Global (large-scale) rearrangements: inversions, translocations, inverted translocations
• Gene gain and loss:
  • Horizontal or lateral transfer: transformation, transduction, and conjugation
  • Phage integration
  • Mobile elements: transposons and insertion sequences
  • Gene duplication (mediated by mobile elements)

From the two E. coli genomes sequenced at the Blattner lab, we’ve identified: • ~3900 genes common to both K-12 and O157:H7 • 528 genes unique to K-12 • 1387 genes unique to O157:H7 • 40 % of these genes are of unknown function. Culprits for these wholesale differences: lateral transfer and phage integration

Strategy of Global Alignment of Two Highly Related Genomes:
STEP 1 Quickly find all 16-mer matches between the genomes: (K1,O1), …, (Ki,Oi), …, (Kn,On).
STEP 2 Collapse consecutive pairs to form a collection of maximal exact matches (MEMs). Use an LIS (longest increasing subsequence) algorithm to construct a maximal collinear set of ordered matches.
STEP 3 Extend across intervening regions via anchored alignments from individual MEM endpoints.
[Figure labels: K, O, unique insert, substitution, partially sorted suffix arrays]
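The first two steps can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the implementation behind the slides: a dictionary index stands in for the partially sorted suffix arrays, only k-mers that are unique in both genomes are kept, and the reverse strand is ignored. The function names (`kmer_matches`, `collapse`, `collinear_subset`) are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def kmer_matches(k_seq, o_seq, k=16):
    """Sorted (pos_in_K, pos_in_O) pairs for k-mers unique in both sequences.
    Assumption: a hash index replaces the suffix arrays named on the slide."""
    def index(seq):
        table = defaultdict(list)
        for i in range(len(seq) - k + 1):
            table[seq[i:i + k]].append(i)
        return table
    k_idx, o_idx = index(k_seq), index(o_seq)
    return sorted((pos[0], o_idx[word][0]) for word, pos in k_idx.items()
                  if len(pos) == 1 and len(o_idx.get(word, ())) == 1)

def collapse(matches, k=16):
    """Merge runs of consecutive k-mer matches on one diagonal
    into (k_start, o_start, length) exact matches."""
    mems = []
    for kp, op in matches:
        if mems:
            ks, os_, ln = mems[-1]
            same_diagonal = (kp - ks) == (op - os_)
            next_start = ks + ln - k + 1  # start of the k-mer that extends the run by one base
            if same_diagonal and kp == next_start:
                mems[-1] = (ks, os_, ln + 1)
                continue
        mems.append((kp, op, k))
    return mems

def collinear_subset(mems):
    """Longest chain of matches (already sorted by K start) whose O starts
    strictly increase: a standard O(n log n) LIS with back-pointers."""
    tails, tail_idx, prev = [], [], [None] * len(mems)
    for i, (_, op, _) in enumerate(mems):
        pos = bisect_left(tails, op)
        if pos == len(tails):
            tails.append(op)
            tail_idx.append(i)
        else:
            tails[pos] = op
            tail_idx[pos] = i
        prev[i] = tail_idx[pos - 1] if pos > 0 else None
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(mems[i])
        i = prev[i]
    return chain[::-1]
```

Matches left out of the collinear chain are the candidates for rearranged or lineage-specific sequence; step 3 (anchored alignment between chain neighbours) is not sketched here.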

K-12 vs O157:H7 MEM Stats • 43,235 total MEMs (24 bps) • 31,640 form maximal collinear subset • The largest exact match is 2,632 bases • 62 MEMs exceed 1000 bps • Over 11,000 exceed 100 bps • 18,212 single base differences (SNPs) • Resulted in a segmentation of O157:H7 into 357 intervals of backbone or unique insert.

A Three-way Genomic Comparison (Parkhill et al., Nature): E. coli K-12 MG1655, S. Typhi CT18, S. Typhimurium LT2

The “Traditional” Way to View MEMs: {(a0,b0), (a1,b1), …, (aK,bK)} for K+1 genomes. For the reference genome G0, a0 < b0 by convention. For the non-reference genomes, ak < bk means the match has the same orientation as in G0; ak > bk means the match occurs on the opposite strand (reverse complement).

A novel approach, wherein: • Extensibility: works just as well for N as it does for 2 genomes, provided there is sufficient sequence similarity. • Automatically identifies inversions, translocations, and inverted translocations • Determines a maximal collinear subset within each locally collinear region, without recourse to an LIS step • Extremely space efficient and fast

Multiple Oriented Offset: For each non-reference genome, determine the polarity with respect to G0, as well as the offset. The Multiple Oriented Offset is the N vector:

Canonical MEM Equivalence Classes By appending the interval in reference genome coordinates: (a0, b0) to the Moo, the MEM is completely specified. We aggregate MEMs by their generalized offset, inducing a partition on the set of MEMs. This defines a CMemEC: {Moo,{(a01, b01), (a02, b02),…, (a0M, b0M)}}
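A minimal sketch of the grouping step follows. The exact polarity and offset formulas are not given in the text, so the code assumes the following: polarity is +1 when ak < bk and -1 otherwise, and the offset is ak - polarity * a0, a quantity that stays constant as a match is extended along its diagonal. The names `oriented_offset` and `group_by_offset` are hypothetical.

```python
from collections import defaultdict

def oriented_offset(mem):
    """mem is [(a0, b0), (a1, b1), ..., (aK, bK)].
    Returns one (polarity, offset) pair per non-reference genome.
    Assumed definition: offset = a_k - polarity * a_0."""
    a0, _ = mem[0]
    result = []
    for a, b in mem[1:]:
        polarity = 1 if a < b else -1
        result.append((polarity, a - polarity * a0))
    return tuple(result)

def group_by_offset(mems):
    """Partition MEMs by their oriented offset, keeping only the
    reference-genome interval (a0, b0) of each member."""
    classes = defaultdict(list)
    for mem in mems:
        classes[oriented_offset(mem)].append(mem[0])
    return dict(classes)

mems = [
    [(100, 120), (150, 170)],   # forward match, offset +50
    [(300, 330), (350, 380)],   # forward match, same offset
    [(500, 520), (900, 880)],   # reverse-complement match
]
classes = group_by_offset(mems)
```

Under this assumed definition, the two forward matches fall into one class and the reverse-complement match into another, which mirrors the partition described above.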

In this example, it’s clear from the plot that there are two large rearrangements, one around the origin and the other about the terminus of replication. So we could probably get by with modest extensions of existing methods (e.g. MUMmer or our earlier algorithm) to account for laterally transferred, lineage-specific sequence. But biology is never that accommodating...

Hmmmm… You hear the one about the biologist, the statistician, and the mathematician volunteering for the psychology experiment?

Approach: Our strategy is to take a multidimensional identification and rearrangement problem and recast it as a segmentation problem. The rationale is multifaceted: sequence relationships among genomes are easily quantified in this context, established statistical techniques can be used to differentiate signal from noise, and two-dimensional segmentation graphs are intuitive and visually appealing. The framework leads to a simple and direct solution.

Simplest Block and Strip Diagram
G0: Reference Strip   1  2  3  4  5  6  7
G1: Strip 1           1 -3 -2  4  5  6  7
G2: Strip 2           1 -7  5  6  4  3  2
G3: Strip 3          -3 -2 -1 -7  5  6  4
G4: Strip 4          -7  4  5  6 -3 -2 -1
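The signed block notation can be made concrete with a short sketch: an inversion reverses the order of a run of blocks and flips their signs, and a breakpoint is an adjacency in a strip that is not present in the reference. The helper names `invert` and `breakpoints` are illustrative only.

```python
def invert(strip, i, j):
    """Invert the blocks strip[i:j]: reverse their order and flip their signs."""
    return strip[:i] + [-b for b in reversed(strip[i:j])] + strip[j:]

def breakpoints(strip):
    """Count adjacencies (x, y) with y != x + 1, after framing the strip
    with 0 and n + 1. The reference 1..n has zero breakpoints."""
    framed = [0] + list(strip) + [len(strip) + 1]
    return sum(1 for x, y in zip(framed, framed[1:]) if y != x + 1)

reference = [1, 2, 3, 4, 5, 6, 7]
strip1 = invert(reference, 1, 3)   # invert blocks 2 and 3
```

Here `strip1` equals the Strip 1 order `1 -3 -2 4 5 6 7`, and it has two breakpoints relative to the reference, one at each end of the inverted segment.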

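Example with Variable Block Lengths
[Figure: strips for G1: Genome 1, G2: Genome 2, G3: Genome 3, G4: Genome 4, G5: Genome 5, and G0: Reference, with the cut point, terminus, and origin marked.]
Signed block orders shown, in slide order:
1 -3 -2 4 5 6 7
1 2 3 4 5 6 7
1 2 -3 4 -6 -5 7
1 2 -3 4 5 -6 7
1 2 3 4 6 -5 7
1 2 3 4 5 6 7 (G0: Reference)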

Large-scale Genomic Rearrangements in Evolutionary Context
[Figure: species tree descending from the MRCA, with leaves Genome 3, Genome 4, Genome 2, Genome 1, Genome 5; zero point, terminus, and origin marked.]

Segmentation Graph S(G0)

Sorted Merge Lists of Six Enterobacterial Strains
Escherichia coli K-12: MG1655, W3110
Escherichia coli O157:H7: EDL933, Sakai
Salmonella enterica: Typhi CT18, Typhimurium LT2
Six SMLs of bimers, one for each genome. A bimer is the lexicographically lesser of an n-mer (we use n=23) and its reverse complement, together with an orientation flag.
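The bimer definition translates directly into code. The sketch below follows that definition; treating an SML as a plain sorted list of (bimer, orientation flag, position) tuples is an assumption made for illustration, as are the function names.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]

def bimers(seq, n=23):
    """Yield (bimer, forward_flag, position) for every n-mer of seq.
    The bimer is the lexicographically lesser of the n-mer and its
    reverse complement; forward_flag records which one was kept."""
    for i in range(len(seq) - n + 1):
        fwd = seq[i:i + n]
        rev = reverse_complement(fwd)
        if fwd <= rev:
            yield (fwd, True, i)
        else:
            yield (rev, False, i)

def sorted_merge_list(seq, n=23):
    """Assumed SML layout: all bimers of one genome in sorted order."""
    return sorted(bimers(seq, n))

example = list(bimers("AAACGT", n=3))
```

Because the bimer is strand-independent, a sequence and its reverse complement produce the same set of bimer strings, so matches on either strand can be found by merging the sorted lists of different genomes.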

For non-reference genome D:\Perna_Land\Genomes\Ecoli\W3110_inv.fas
Block #  St BLK  End BLK  # of BLKS     NRst    NRend     RFst    RFend  BUMsize   NRsize  Refsize        Diff
      1       1     1445       1445      741  1760428     1077  1755621    56789  1759688  1754545   0.0323668
      2   -1454    -1446          9  1767955  1776641  1759789  1768475      332     8687     8687    0.038218
      3    1455     2966       1512  1788751  3418425  1783919  3421292    62466  1629675  1637374   0.0381501
      4   -4054    -2967       1088  3422076  4205822  3429068  4212864    52579   783747   783797   0.0670824
      5    4055     4468        414  4211754  4635113  4214671  4638119    16454   423360   423449   0.0388571

For non-reference genome D:\Perna_Land\Genomes\Ecoli\EDL933.fas
Block #  St BLK  End BLK  # of BLKS     NRst    NRend     RFst    RFend  BUMsize   NRsize  Refsize        Diff
      1       1     1314       1314     1094  1844555     1077  1311796    51773  1843462  1310720   0.0394997
      2   -1386    -1315         72  1931014  2266139  1314021  1625564     2700   335126   311544  0.00866651
      3    1387     4468       3082  2327642  5523342  1654579  4638119   134147  3195701  2983541   0.0449623

For non-reference genome D:\Perna_Land\Genomes\Salmonella\stmur.fas
Block #  St BLK  End BLK  # of BLKS     NRst    NRend     RFst    RFend  BUMsize   NRsize  Refsize        Diff
      1       1     1108       1108     1077  1170908     1077  1028164    43436  1169832  1027088   0.0422904
      2    1113     1228        116  1199860  1325305  1062939  1195266     4705   125446   132328   0.0355556
      3   -1532    -1404        129  1362183  1543579  1684989  1866491     5686   181397   181503   0.0313273
      5   -1403    -1379         25  1554999  1648363  1551273  1676600      850    93365   125328   0.0067822
      6   -1376    -1265        112  1651762  1873399  1265229  1547132     4492   221638   281904   0.0159345
      8   -1264    -1241         24  1874358  1881818  1257015  1264270     1006     7461     7256    0.138644
     10   -1240    -1229         12  1903740  1913781  1223601  1237021      423    10042    13421   0.0315178
     11    1533     1661        129  1914663  2082370  1885529  2041692     4931   167708   156164   0.0315758
     12    1664     1774        111  2146022  2362751  2083921  2306981     4179   216730   223061   0.0187348
     13    1777     2164        388  2366550  2695516  2310740  2684112    15747   328967   373373    0.042175
     15    2165     2466        302  2703974  3200986  2685405  3033444    12610   497013   348040   0.0362315
     16    2468     3266        799  3201694  3732474  3034152  3594163    39636   530781   560012   0.0707771
     19    3269     3459        191  3734913  4010332  3596597  3865052     7611   275420   268456    0.028351
     21    3460     3799        340  4039599  4244302  3876077  4079896    14872   204704   203820   0.0729663
     25    3803     4158        356  4245478  4547866  4081072  4345741    15148   302389   264670   0.0572335
     26    4164     4402        239  4566598  4724335  4360148  4485767     9959   157738   125620   0.0792788
     28    4403     4468         66  4786919  4856330  4586098  4638119     2715    69412    52022   0.0521895

For non-reference genome D:\Perna_Land\Genomes\Salmonella\styphii.fas
Block #  St BLK  End BLK  # of BLKS     NRst    NRend     RFst    RFend  BUMsize   NRsize  Refsize        Diff
      1       1     1108       1108     1077  1079944     1077  1028164    43436  1078868  1027088   0.0422904
      2    1113     1228        116  1109949  1234790  1062939  1195266     4705   124842   132328   0.0355556
      3    1270     1376        107  1236923  1450281  1271007  1547132     4325   213359   276126   0.0156631
      4    1379     1403         25  1450793  1537922  1551273  1676600      850    87130   125328   0.0067822
      6    1404     1532        129  1583546  1751043  1684989  1866491     5686   167498   181503   0.0313273
      9   -1264    -1241         24  1795960  1803420  1257015  1264270     1006     7461     7256    0.138644
     11   -1240    -1229         12  1825342  1835382  1223601  1237021      423    10041    13421   0.0315178
     12    1533     1661        129  1836264  2045959  1885529  2041692     4931   209696   156164   0.0315758
     13    1664     1774        111  2102812  2319702  2083921  2306981     4179   216891   223061   0.0187348
     14    1777     2164        388  2323501  2659616  2310740  2684112    15747   336116   373373    0.042175
     16    2165     2466        302  2668061  3063508  2685405  3033444    12610   395448   348040   0.0362315
     17    2468     2966        499  3064216  3418171  3034152  3421292    21210   353956   387141   0.0547862
     18   -3746    -3622        125  3425772  3552616  3944538  4030773     5074   126845    86236   0.0588385
     19   -4051    -3803        249  3558079  3699838  4081072  4205480    11418   141760   124409   0.0917779
     23   -3799    -3747         53  3701014  3742547  4039570  4079896     2126    41534    40327    0.052719
     24   -3621    -3460        162  3750285  3810582  3876077  3938202     7672    60298    62126    0.123491
     26   -3459    -3269        191  3838373  4111064  3596597  3865052     7611   272692   268456    0.028351
     29   -3266    -2967        300  4113504  4254805  3429068  3594163    18426   141302   165096    0.111608
     30    4052     4158        107  4263480  4392983  4211800  4345741     3730   129504   133942   0.0278479
     31    4164     4402        239  4543092  4680423  4360148  4485767     9959   137332   125620   0.0792788
     33    4403     4468         66  4751272  4807935  4586098  4638119     2715    56664    52022   0.0521895
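The block rows can be read programmatically with a small parser. The sketch below is illustrative: the field names are hypothetical renderings of the column headers, and reading a negative St BLK as an inverted block is an assumption based on the sign convention used for the strips. The three arithmetic relations it checks hold for the rows spot-checked here; they are not documented properties of the output.

```python
from typing import NamedTuple

class BlockRow(NamedTuple):
    block: int
    st_blk: int
    end_blk: int
    n_blks: int
    nr_st: int
    nr_end: int
    rf_st: int
    rf_end: int
    bum_size: int
    nr_size: int
    ref_size: int
    diff: float

def parse_row(line):
    """Split one whitespace-separated row into its twelve columns."""
    fields = line.split()
    return BlockRow(*(int(x) for x in fields[:11]), float(fields[11]))

def check_row(row):
    """Orientation (assumed from the sign of St BLK) and size consistency."""
    return {
        "inverted": row.st_blk < 0,
        "n_blks_ok": row.n_blks == row.end_blk - row.st_blk + 1,
        "nr_size_ok": row.nr_size == row.nr_end - row.nr_st + 1,
        "ref_size_ok": row.ref_size == row.rf_end - row.rf_st + 1,
    }

# Second block of the W3110 comparison.
row = parse_row("2 -1454 -1446 9 1767955 1776641 1759789 1768475 332 8687 8687 0.038218")
report = check_row(row)
```

The same relation separates the fused value in the second EDL933 row: RFend - RFst + 1 = 311544, which leaves 0.00866651 as the Diff column.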

A Transformation of CO92 to KIM by Inversions Near the Origin
[Figure: block positions 0 1 2 3 4 5 6 7 8 9 10 11 12 0]
K labels: K5 K4 K3 K2 K1 K25 K24 K23 K22 K21 K20.5 K20 K19 K18
C labels: C20 C21 C22 C22.5 C23 C24 C25 C1 C2 C3 C4 C5 C6 C7
Successive block orders:
0 12 3 10 7 1 5 4 2 11 6 9 8 0
0 8 9 6 11 2 4 5 1 7 10 3 12 0
0 1 5 4 2 11 6 9 8 7 10 3 12 0
0 1 11 2 4 5 6 9 8 7 10 3 12 0
0 1 3 10 7 8 9 6 5 4 2 11 12 0
0 1 3 2 4 5 6 9 8 7 10 11 12 0
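One way to read the successive block orders is as a chain in which each row differs from the previous one by a single inversion (here an unsigned reversal). The sketch below checks that reading for two adjacent rows. It assumes fused tokens such as "69" and "96" stand for "6 9" and "9 6"; `find_reversal` is a hypothetical helper.

```python
def find_reversal(a, b):
    """Return (i, j) such that reversing a[i:j] turns a into b.
    Returns None if the rows are equal or differ by more than one reversal."""
    if len(a) != len(b) or a == b:
        return None
    i = 0
    while a[i] == b[i]:
        i += 1
    j = len(a)
    while a[j - 1] == b[j - 1]:
        j -= 1
    return (i, j) if a[i:j][::-1] == b[i:j] else None

# Two adjacent rows, with fused tokens split as assumed above.
row_a = [0, 1, 3, 10, 7, 8, 9, 6, 5, 4, 2, 11, 12, 0]
row_b = [0, 1, 3, 2, 4, 5, 6, 9, 8, 7, 10, 11, 12, 0]
span = find_reversal(row_a, row_b)
```

For these two rows the reversed span covers positions 3 through 10, that is, the blocks `10 7 8 9 6 5 4 2`.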
