DNA Sequence Compression using the Burrows-Wheeler Transform

DNA Sequence Compression using the Burrows-Wheeler Transform Don Adjeroh, Yong Zhang Computer Science and Electrical Engineering West Virginia University, Morgantown, USA Amar Mukherjee Computer Science University of Central Florida, Orlando USA Matt Powell and Tim Bell Computer Science University of Canterbury, Christchurch New Zealand August 16, 2002

Outline • Introduction and The Problem • Background • Overview of Approach • BWT and Repeat Analysis • Parsing Strategies • Results • Conclusion

Introduction • DNA as an information storage medium • Draft sequence of the human genome now available • Complete genomes available for many other organisms • Implications and Possibilities • Genome-wide analysis of entire genomes • Cross-genome analysis with complete genomes • Drug discovery and medical science • Potential cure for gene-related diseases, such as sickle-cell anemia, Huntington’s disease, Fragile-X mental retardation syndrome, cancer …

The problem Size !

The problem of size... • Some important genomes are in the order of billions of base pairs • Exponential growth in the number of complete genomes • Exponential growth in the size of available DNA sequence data Source: Genbank website • We need ... • Efficient and effective algorithms for sequence analysis and interpretation • Efficient techniques for management, organization, and distribution of huge-volume sequence data

Nature of DNA Sequences • Four types of nucleotide bases • A - adenine, C - cytosine, G - guanine, T - thymine • Various forms of repetitions • Direct and tandem repeats • Reverse repeats, complimented repeats, palindromes • Combinations • Approximate repeats • Introns and Exons • Coding areas - generally less repetitive • Non-coding areas - generally more redundant • Non-coding areas make up > 50% of genomes • DNA - only part of the whole story

S S . S X X X X S . . . Y Y Y . . . . S Z Z Z Z Y . . . S Y Dictionary DNA Sequence Compression Example: • General Data Compression • Symbol-wise substitution (Huffman codes) • Dictionary-based (LZ family) • Context-based (BWT and PPM) • DNA Sequence Compression • 4 symbols (A,C,G,T)  at most 2 bpcon average • Generally dictionary-based • Exploit the different forms of repeat • Key Issues • Speedy identification of the repeats • Parsing with the repeats • Representation of the parsed sequence • Encoding the results On-line dictionary Off-line dictionary

S . X X . S Y . S . Z Z Y . S Y Dictionary Overview of Approach • Off-line dictionary based • Repeat analysis via BWT • Variable-length parsing • Integer encoding using a simple Huffman code • Compression using the BWT compression pipeline Repeat Analysis input BWT MTF VLC output Repeat Analysis input BWT MTF VLC output

The Burrows-Wheeler Transform • Forward Transform Let T be the input sequence, T=t1,t2,…, tu. • Form u permutations of T by cyclic rotations of the characters in T. These form a uu matrix M’. Each row represents one permutation of T; • Sort the rows of M' lexicographically to form another matrix M. • Record L, the last column of M, and id, the row number for the row in M that corresponds to T. The BWT output is the pair (L, 2).

input BWT MTF VLC output The BWT … • Inverse Transform Given only the pair (L,id) • Sort L to produce F, the array of first characters • Compute V, provides 1-1 mapping between the L and F. F[V[j]]=L[j]. • Generate original sequence T, the symbol L[j] cyclically precedes the symbol F[j] in T, that is, L[V [j]] cyclically precedes the symbol L[j] in T. • BWT Compression Pipeline

BWT - Auxiliary Arrays • From inverse BWT, F[j] = L[V [j]]. We generate FT mapping: • Let Hr=reverse(H). we have • Define Hrs as the index vector to Hr: • Hr &Hrs can be obtained in linear time from L, V and F. • Example T=ACTAGA

BWT, Suffix Trees and Repeats • The BWT provides a lexicographic sorting of the contexts • The BWT is closely related to the suffix tree & suffix array • The suffix array corresponds exactly to our auxiliary array Hrs ! • Repetition analysis as is done on suffix trees can now be done on the BWT output !

Parsing Strategy I -- VPS1 • Off-line dictionary with pointers in the dictionary • Repetition codes: • 1- direct repeat; 2 - reverse; 3 palindrome; 4 - compliment; 5 - complimented palindrome; 6- reverse compliment • General Dictionary structure • Example: Dictionary for example

Analysis of VPS1 • Costs (in bits) • Original sequence, S: • Parsed sequence: • Vocabulary: • Positions • Dictionary: • Compression gain: • With , we underestimate the gain: , = + G(S) =

Parsing Strategy II -- VPS2 • Off-line dictionary with pointers in the sequence • Fixed-length versus variable-length parsing • Parsing schema1: <reference index> <rI> • Parsing schema2: <reference index, repeat type > <rI, rT> • Parsing schema3: <index, repeat type, range> <rI, rT, sP, nP> • Example • With fixed length parsing: • Parse(S): x1<1,1>x2 <3,1>x3<2,1>x4<4,1>x5<2,2>x6<1,5>x7<3,1> • With variable-length parsing • Parse(S): x1<1>x2<3>x3<2>x4<4>x5<1,1,1,5>x6<1,5>x7<3> Dictionary

Analysis of VPS2 • Parsing • ParsePart1: x1x2x3x4x5x6x7x8x9 • ParsePart1 lengths: sk1, sk2, sk3 sk4, sk5, sk6 sk7, sk8, sk9 ; ski = length of xi • ParsePart2: <1>0<3>0<2>0<4>0<1,1,1:5>0<1,5>0<3>0<4>0 • Costs (per occurrence costs for each repeat) • ParsePart1 • Vocabulary • Total cost • Per-replacement cost • Compression gain • Constraints on l(r),n(r), and rI(r) ParsePart2: + =

Encoding the Integers • 1st order Fibonacci codes (FK1) • 2nd order Fibonacci codes (AF1) • Simple Huffman codes (H1) • F - 00 • 1 - 010 • 2 - 011 • 0 - 1000 • 3 - 1001 • 4 - 1010 • 5 - 1011 • 6 - 1100 • 7 - 1101 • 8 - 1110 • 9 - 1111

Results • Compressed file size (bytes)

Results • Compressed file size (bits per character)

Results… Size (bytes) Method index 1 - gzip 2 - arith 3 - BWT 4 - vps 5- vpsH1a 6 - vpsH1b 7 - vpsFK1a 8 - vpsFK1b 9 - vpsAF1a 10 - vpsAF1b Size (bpc)

Results … • Computation Time (H1 versus AF1 and KF1) (Time in seconds)

Conclusion • BWT presents an effective mechanism for repeat analysis • Not all repeats can lead to compression • Off-line dictionaries are effective for DNA sequence compression • Performance depends on the representation of the parsed sequence, type, length, and number of repeats. • What next ? • Variable-length coding • More rigorous theoretical analysis • Further testing and comparative study (approximate repeats, palindromes, speeding up the process, etc.) • DNA sequence entropy estimation • Towards compressed domain DNA sequence analysis

The End Thank You

DNA Sequence Compression using the Burrows-Wheeler Transform