Hash Algorithm and SSAHA Implementations

S Hash Algorithm and SSAHA Implementations Zemin Ning Production Software Group Informatics

Outline of the Talk: • The need for fast search engine; • SSAHA – Sequence Search and Alignment using Hashing Algorithm; • Hash table; • Sequence search based on the hash table; • Search speed; • Memory requirement; • How to use the package.

Algorithms and Software Tools • Algorithms - Dynamic programming; - Hash method; - Suffix tree; - … • Software tools - FASTA; - BLAST; - Cross_Match; - Mummer; - … • CPU vs Memory

Only works effectively when gap penalties are used In example shown match = +1 mismatch = -1/3 gap = -1+1/3k (k=extent of gap) Start with all cell values = 0 Looks in subcolumn and subrow shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account Smith-Waterman Algorithm Hij=max{Hi-1, j-1 +s(ai,bj), max{Hi-k,j -Wk}, max{Hi, j-l -Wl}, 0}

Suffix Tree Example Mapping the string ababc into a suffix tree. root ab c b abc c abc c

Motivation for sequence indexing • faster (economy) • remove reliance on the external service and network delays (user independence) • integrate fully with a database engine (convenience) • exhaustive instead of heuristics (quality) • enable different statistics in sequence evaluation (flexibility)

Objectives: With SSAHA algorithm, we aim to achieve the following objectives: (i) To develop a sequence search engine to search genomic sequences with a fast speed and acceptable accuracy; (ii) To explore applications such as large scale sequence assembly and single nucleotide polymorphism (SNP) detection; (iii) To provide possible tools for sequence analysis based on the search engine.

Sequence S: (s1s2, …, si, …, sm) i =1,2, …, m K-tuple: (sisi+1...si+k-1) “A” =00; “C” = 01; “G” = 10; “T” = 11 Sequence Representation Using two binary digits for each base, we may have the following representations: For any of the m/k no-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way where bi = 0 or 1, depending on the value of the sequence base and Emax is the maximum value of the possible E values.

ATGGCGTGCAGTCCATGTTCGGATCA ATGGCGTGCAGT TGGCGTGCAGTC GGCGTGCAGTCC GCGTGCAGTCCA CGTGCAGTCCAT Non-overlap hashing W = N-k+1 (k = 12) ATGGCGTGCAGTCCATGTTCGGATCATTACGTAAGC ATGGGCAGATGT CCATGTTCGGAT CATTACGTAAGC Overlap Hashing W = N/k Non-overlap Hashing v Overlap Hashing

E k-tuple Ni Indices and Offsets 0 AA 1 2, 19 1 AC 3 1, 9 2, 5 2, 11 2 AG 2 1, 15 2, 35 3 AT 2 2, 13 3, 3 4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23 5 CC 4 1, 21 2, 31 3, 5 3, 7 6 CG 1 1, 5 7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17 8 GA 4 1, 3 1, 17 2, 15 2, 25 9 GC 0 10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1 11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19 12 TA 1 3, 25 13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11 14 TG 3 1, 13 2, 7 3, 9 15 TT Hash Table: A 2-tuple hashing table of S1, S2 and S3 S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG) S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT) S3=(GGATCCCCTGTCCTCTCTGTCACATA)

E k-tuple Ni Indices and Offsets 0 AA 1 2, 19 1 AC 3 1, 9 2, 5 2, 11 2 AG 2 1, 15 2, 35 3 AT 2 2, 13 3, 3 4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23 5 CC 4 1, 21 2, 31 3, 5 3, 7 6 CG 1 1, 5 7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17 8 GA 4 1, 3 1, 17 2, 15 2, 25 9 GC 0 10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1 11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19 12 TA 1 3, 25 13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11 14 TG 3 1, 13 2, 7 3, 9 15 TT Query sequence: Sq = (TGCAACAT)

Query sequence: k-tuples f(t) F(t) -(t-1) Fs(t) TG 1, 13 1, 13 0 1, 5 2, 7 2, 7 0 1, 13 3, 9 3, 9 0 2, -2 GC -1 CA 2, 3 2, 1 -2 2, 1 2, 9 2, 7 -2 2, 1 2, 21 2, 19 -2 2, 4 2, 27 2, 25 -2 2, 7 2, 33 2, 31 -2 2, 7 3, 21 3, 19 -2 2, 7 3, 23 3, 21 -2 2, 7 AA 2, 19 2, 16 -3 2, 16 AC 1, 9 1, 5 -4 2, 16 2, 5 2, 1 -4 2, 19 2, 11 2, 7 -4 2, 21 CA 2, 3 2, -2 -5 2, 25 2, 9 2, 4 -5 2, 28 2, 21 2, 16 -5 2, 31 2, 27 2, 22 -5 3, -3 2, 33 2, 28 -5 3, 9 3, 21 3, 16 -5 3, 16 3, 23 3, 18 -5 3, 18 AT 2, 13 2, 7 -6 3, 19 3, 3 3, -3 -6 3, 21 Array of index and offset data Sq = (TGCAACAT)

E k-tuple Ni Indices and Offsets 0 AA 1 2, 19 1 AC 3 1, 9 2, 5 2, 11 2 AG 2 1, 15 2, 35 3 AT 2 2, 13 3, 3 4 CA 7 2, 3 2, 9 2, 21 2, 27 2, 33 3, 21 3, 23 5 CC 4 1, 21 2, 31 3, 5 3, 7 6 CG 1 1, 5 7 CT 6 1, 23 2, 39 2, 43 3, 13 3, 15 3, 17 8 GA 4 1, 3 1, 17 2, 15 2, 25 9 GC 0 10 GG 5 1, 25 1, 31 2, 17 2, 29 3, 1 11 GT 6 1, 1 1, 27 1, 29 2, 1 2, 37 3, 19 12 TA 1 3, 25 13 TC 6 1, 7 1, 11 1, 19 2, 23 2, 41 3, 11 14 TG 3 1, 13 2, 7 3, 9 15 TT Query sequence: Sq = (TGCAACAT)

E(t) = (E1 , E2 , …, Et , … En+1-k ) f1(t) = {H1(E(t),1), H1(E(t),2), …, H1(E(t),Nt,)} f2(t,g) = {H2(E(t),1), H2(E(t),2), …, H2(E(t),Nt,)} SequenceSearch Sequence search is carried out using the generated hash table. Suppose we have a query sequence with length n, Sq = (s1, s2, s3,...,sn), and we want to find whether this sequence is one of the sequences in the database or a small segment of the sequence. Based on Sq, we have an integer array using where t = 1, 2, …, n+1-k. Note that overlapping for the query sequence is allowed while making the above array. For each element E(t), there are two arrays of sequence index and offset data with a length of entry repeats Nt in the hash table:

Match Start Match Start Query t-1 Subject t-1 Reference Point Reference Point Frequency Array F1(t) = f1(t) F2(t) = {H2’(E(t),1), H2’(E(t),2), …, H2’(E(t),Nt)} with H2’(E(t),i) = H2(E(t),2)-(t-1) i = 1,2,…, Nt The above calculation to adjust offsets should be done for every element in the array.

Index Offset 64 Bit Machines In order to carry out search quickly and effectively, it would be helpful in the computer code to combine these two integer arrays into a single long integer array. We are targeting implementations on 64 bit machines. The long integer array can be expressed as F(t) = {H(E(t),1), H(E(t),2),…, H(E(t),Nt)} with H(E(t),i) = 232H1(E(t),i) + H2’(E(t),i) i = 1,2,…, Nt It is seen from the above equation that the offset value takes the low bits while the index part takes high orders of bits in the long integer.

Array Sorting For the query sequence, there are n+1-k arrays in total and it is necessary that we combine all the arrays into one single arrays and F = {F(1), F(2),…, F(t),…, F(n+1-k)} Finally when the array is sorted into an ascending order, i.e. F -> Fs with Fs,1 < Fs,2 < … < Fs,i < … the search results can be determined by the number of the data repeats in the array. In a section within the Fs array, if the found repeat level is higher than a given threshold level, this means that there is a match between the query sequence and sequences in the database.

Averaged length of frequency array: where Ni is the average length of the entruy repeats. ^ Fig. 1 Normalized CPU time plotted against the number of k-tuples in query (k=12) using Quicksort. Power Law: CPU time v query length

k Emax+1 CPU (Get hash table) T1 (s) CPU (Search only) T2 (s)* 8 65,536 378 47702 9 262,144 382 8225 10 1,048,576 388 1793 11 4,194,304 408 387 12 16,777,216 427 102 13 67,108,864 454 57 14 268,435,456 477 49 Speed and Resolution – Effects of k Subject file: 1.5 Gbp of human DNA Query file: 39,000 reads

SSAHA Memory Memory for subject: Ms = 4*Ns/k+ 4*22k Memory for query: Mq = Nq House keeping: 10-20% total Total memory: Ms = 1.2*(Ms+Mq)

SSAHA Memory: One array combined read index and offset Ri+j Ri Ri+1

Match Start Match Start Query t-1 Subject t-1 Reference Point Reference Point Matching Positions Found by SSAHA

SSAHA seeds Edge length Edge length Sequence for cross_match SSAHA2 = SSAHA + Cross_Match SSAHA for matching seeds, cross_match for sequence alignment.

SSAHA2 Command Line ./ssaha2 query_file subject_file Options: -kmer: length of kmer words; default kmer=12 -seeds: number of exact kmer words; default seeds=10 -align: '1' - show full alignment; '0' - no alignment; default '1' -sense: '1' - search with higher sensitivity; '0' - normal; default '0' -tags: '1' - show a tag of 'ALIGNMENT'; '0' - no tag; default '0' -depth: number of reported hits with best alignment; default depth=50 -score: minimum score of smith-waterman; default score=30 -cut: number of word occurrence in the dataset; default cut=200 -memory: memory assigned in MBs for cross_match; default memory=2000 -array: memory assigned in MBs for storing frequence arrays; default memory=4 -edge: extension of both ends on the subject; default edge=200 -best: report the best alignment from the hit list; default '0' -start: start read from the query file; default start=0 -end: end read from the query file; default start= Total number of the reads in the query file;

COOKBOOKBACends placement - find the best hit in the database: -seeds 14 -kmer 13 -align 0 -tags 1 -depth 5 -score 200 -cut 50000;EST/cDNA alignment - produce splice on the subject sequence: -seeds 4 -kmer 13 -align 0 -tags 1 -depth 5 -score 20 -edge 20000;Primer/gene Marks alignment - find the matches of short motifs to the database: -seeds 1 -kmer 13 -tags 1 -score 12 -skip 1 -sense 1 -cut 50000;Search with higher sensitivity: -seeds 2 -kmer 13 -tags 1 -score 20 -sense 1 -cut 50000;Both query and subject are large (q: 100Kb < query < 1MB; s: no limit): -seeds 50 -kmer 13 -tags 1 -score 2000 -array 40 -memory 10000;

Summary: • Speed - Fast enough to perform genomic scale searches between large genomes; • Memory – linear; • Sensitivity – not as good as BLAST, but applicable in assembly and SNP detection;

Hash Algorithm and SSAHA Implementations