Hashing Algorithm and its Applications in Bioinformatics

By Zemin Ning
Informatics Division, The Wellcome Trust Sanger Institute

Outline of the Talk:
• Research Background
• SSAHA: The Fastest Sequence Search Engine
  - Hash table;
  - Sequence search based on the hash table;
  - Various applications.
• Euler Path: consensus generation
  - Euler Path;
  - Consensus generation;
  - SNP calling.
• Phusion: the WGS assembler
  - Phusion pipeline;
  - Reads grouping;
  - Applications.
• Current Research

Powder Simulation

Hair Dynamics: Genetics and Human Hair Structure
[Figure: hair structure panels labelled East Asian, Caucasian, African]

Sequence Search and Alignment
• Algorithms
  - Dynamic programming;
  - Suffix tree;
  - Hash method;
  - …
• Software tools
  - FASTA;
  - BLAST;
  - Cross_Match;
  - Blat;
  - …
• CPU vs Memory

Objectives:
With the SSAHA algorithm, we aim to:
(i) develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy;
(ii) explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;
(iii) provide tools for sequence analysis built on the search engine.

Automatic Sequencing ATGCAGGTCC …….

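SSAHA Index: Sequence Representation

Sequence S: $(s_1 s_2 \ldots s_i \ldots s_m)$, $i = 1, 2, \ldots, m$
K-tuple: $(s_i s_{i+1} \ldots s_{i+k-1})$

Using two binary digits for each base, we may have the following representations:
"A" = 00; "C" = 01; "G" = 10; "T" = 11

For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple in a unique way. The equation itself is not preserved in this transcript; the form consistent with the 2-tuple hash table (AA = 0, AC = 1, …, TT = 15) is

$$E = \sum_{i=1}^{2k} 2^{2k-i}\, b_i, \qquad E_{max} = 2^{2k} - 1 = 4^k - 1$$

where $b_i$ = 0 or 1, depending on the value of the sequence base, and $E_{max}$ is the maximum of the possible E values.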
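The two-bit encoding can be sketched in a few lines of Python. This is a minimal illustration, not the SSAHA source code: the names `BASE_CODE`, `encode_ktuple` and `max_code` are made up for this example, and the bit order (first base in the most significant bits) is an assumption chosen to agree with the 2-tuple table, where AC = 1 and CA = 4.

```python
# Two bits per base, as on the slide: A=00, C=01, G=10, T=11.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(ktuple):
    """Pack a k-tuple into one integer E, first base most significant (assumed order)."""
    value = 0
    for base in ktuple:
        value = (value << 2) | BASE_CODE[base]
    return value


def max_code(k):
    """Largest possible E for a k-tuple: every bit set, i.e. 4**k - 1."""
    return 4 ** k - 1


print(encode_ktuple("AC"), encode_ktuple("CA"), encode_ktuple("TT"), max_code(2))
# prints: 1 4 15 15
```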

Hash Table: A 2-tuple hashing table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  Ni   Indices and Offsets
0    AA       1    (2,19)
1    AC       3    (1,9) (2,5) (2,11)
2    AG       2    (1,15) (2,35)
3    AT       2    (2,13) (3,3)
4    CA       7    (2,3) (2,9) (2,21) (2,27) (2,33) (3,21) (3,23)
5    CC       4    (1,21) (2,31) (3,5) (3,7)
6    CG       1    (1,5)
7    CT       6    (1,23) (2,39) (2,43) (3,13) (3,15) (3,17)
8    GA       4    (1,3) (1,17) (2,15) (2,25)
9    GC       0
10   GG       5    (1,25) (1,31) (2,17) (2,29) (3,1)
11   GT       6    (1,1) (1,27) (1,29) (2,1) (2,37) (3,19)
12   TA       1    (3,25)
13   TC       6    (1,7) (1,11) (1,19) (2,23) (2,41) (3,11)
14   TG       3    (1,13) (2,7) (3,9)
15   TT       0
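A minimal Python sketch of how such a table could be built from the three subject sequences. It assumes, as the offsets in the table suggest, that the subject is cut into non-overlapping k-tuples with 1-based offsets. For readability the dictionary is keyed by the k-tuple string rather than by the integer E. `build_hash_table` and `SUBJECTS` are illustrative names only.

```python
from collections import defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",              # S1
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",  # S2
    "GGATCCCCTGTCCTCTCTGTCACATA",                    # S3
]


def build_hash_table(sequences, k):
    """Map each non-overlapping k-tuple to its list of (index, offset) pairs (both 1-based)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        # Step by k so that the k-tuples do not overlap.
        for pos in range(0, len(seq) - k + 1, k):
            table[seq[pos:pos + k]].append((index, pos + 1))
    return table


table = build_hash_table(SUBJECTS, 2)
print(table["CA"])
# prints: [(2, 3), (2, 9), (2, 21), (2, 27), (2, 33), (3, 21), (3, 23)]
```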

Query sequence: Sq = (TGCAACAT)

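Array of index and offset data for the query sequence Sq = (TGCAACAT)

The query is read one base at a time, giving overlapping k-tuples at t = 1, 2, …. For each k-tuple, f(t) lists the (index, offset) hits from the hash table and F(t) lists the same hits with the offset shifted by -(t-1).

k-tuple   f(t)      F(t)      -(t-1)
TG        (1,13)    (1,13)     0
          (2,7)     (2,7)      0
          (3,9)     (3,9)      0
GC        (no hits)           -1
CA        (2,3)     (2,1)     -2
          (2,9)     (2,7)     -2
          (2,21)    (2,19)    -2
          (2,27)    (2,25)    -2
          (2,33)    (2,31)    -2
          (3,21)    (3,19)    -2
          (3,23)    (3,21)    -2
AA        (2,19)    (2,16)    -3
AC        (1,9)     (1,5)     -4
          (2,5)     (2,1)     -4
          (2,11)    (2,7)     -4
CA        (2,3)     (2,-2)    -5
          (2,9)     (2,4)     -5
          (2,21)    (2,16)    -5
          (2,27)    (2,22)    -5
          (2,33)    (2,28)    -5
          (3,21)    (3,16)    -5
          (3,23)    (3,18)    -5
AT        (2,13)    (2,7)     -6
          (3,3)     (3,-3)    -6

Fs(t), the shifted hits in sorted order:
(1,5) (1,13) (2,-2) (2,1) (2,1) (2,4) (2,7) (2,7) (2,7) (2,7) (2,16) (2,16) (2,19) (2,22) (2,25) (2,28) (2,31) (3,-3) (3,9) (3,16) (3,18) (3,19) (3,21)

The run of four identical entries (2,7) in the sorted list marks the match: Sq occurs in S2 starting at offset 7.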
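The search on this slide can be reproduced with a short Python sketch. It is a simplified illustration under stated assumptions rather than the SSAHA implementation: the table is keyed by k-tuple strings, the loop variable `t` starts at 0 (so the shift `-t` equals the slide's -(t-1)), and the match is reported as the most frequent shifted hit. All function names are hypothetical.

```python
from collections import Counter, defaultdict

SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",              # S1
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",  # S2
    "GGATCCCCTGTCCTCTCTGTCACATA",                    # S3
]


def build_hash_table(sequences, k):
    """Non-overlapping k-tuples of the subjects -> list of 1-based (index, offset)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for pos in range(0, len(seq) - k + 1, k):
            table[seq[pos:pos + k]].append((index, pos + 1))
    return dict(table)


def shifted_sorted_hits(query, table, k):
    """Collect hits for every overlapping query k-tuple, shift the offsets, then sort."""
    hits = []
    for t in range(len(query) - k + 1):
        for index, offset in table.get(query[t:t + k], []):
            hits.append((index, offset - t))
    hits.sort()
    return hits


def best_match(hits):
    """Return (index, offset, run length) of the longest run of equal sorted hits."""
    (index, offset), count = Counter(hits).most_common(1)[0]
    return index, offset, count


hits = shifted_sorted_hits("TGCAACAT", build_hash_table(SUBJECTS, 2), 2)
print(best_match(hits))
# prints: (2, 7, 4)
```

Sorting brings hits with the same index and the same shifted offset next to each other, so a long run of equal entries signals a candidate match without any base-by-base comparison.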

Index and Offset on 64-Bit Machines

To carry out the search quickly and effectively, it helps to combine the two integer arrays (index and offset) into a single long integer array in the code. We are targeting implementations on 64-bit machines. The long integer array can be expressed as

$$F(t) = \{H(E(t),1),\ H(E(t),2),\ \ldots,\ H(E(t),N_t)\}$$

with

$$H(E(t),i) = 2^{32}\, H_1(E(t),i) + H_2'(E(t),i), \qquad i = 1, 2, \ldots, N_t$$

The equation shows that the offset value takes the low-order bits of the long integer, while the index takes the high-order bits.
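A small Python sketch of the packing idea, with the index in the high 32 bits and the offset in the low 32 bits. It assumes non-negative offsets that fit in 32 bits. How the original code treats the negative shifted offsets seen in the worked example is not stated on the slide, so that case is left out. `pack` and `unpack` are illustrative names.

```python
OFFSET_BITS = 32
OFFSET_MASK = (1 << OFFSET_BITS) - 1


def pack(index, offset):
    """Combine index (high bits) and a non-negative offset (low 32 bits) into one integer."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 32 unsigned bits in this sketch")
    return (index << OFFSET_BITS) | offset


def unpack(packed):
    """Recover (index, offset) from a packed integer."""
    return packed >> OFFSET_BITS, packed & OFFSET_MASK


pairs = [(2, 7), (1, 13), (3, 9), (2, 1)]
packed = sorted(pack(i, o) for i, o in pairs)
print([unpack(p) for p in packed])
# prints: [(1, 13), (2, 1), (2, 7), (3, 9)]
```

Because the index sits in the high-order bits, one sort of the packed integers orders the hits by index first and by offset second, which is the order the search step needs.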

Power Law: CPU time vs query length
Fig. 1 Normalized CPU time plotted against the number of k-tuples in the query (k = 12), using Quicksort.

SSAHA Memory
Memory for subject: $M_s = 4 N_s / k + 4 \cdot 2^{2k}$
Memory for query: $M_q = N_q$
Housekeeping: 10-20% of the total
Total memory: $M_{total} = 1.2\,(M_s + M_q)$
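The memory estimate can be written as a small helper for trying different values of k. This is only a direct transcription of the formulas on this slide. The function name, the parameter names and the default 20% housekeeping factor (the slide's 1.2) are choices made for this sketch, and the units are whatever the slide intends (presumably bytes, which is an assumption).

```python
def ssaha_memory(n_subject, n_query, k, housekeeping=0.2):
    """Estimate total memory from subject length, query length and k-tuple size."""
    m_subject = 4 * n_subject / k + 4 * 2 ** (2 * k)  # position list + table of 4**k entries
    m_query = n_query
    return (1 + housekeeping) * (m_subject + m_query)


# The table term grows as 4**k while the position-list term shrinks as 1/k.
for k in (10, 12, 14):
    print(k, 4 * 2 ** (2 * k))
```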

The SSAHA Trace Server
[Diagram: clients sending queries to SSAHA2 servers]
It aims to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution is extensible by plugging in extra appliances.

The Seven Bridges of Königsberg
[Figure: map of the Pregel River with the four land regions a, b, c, d and the corresponding graph]
• During the 18th century, the city of Königsberg (in East Prussia) was divided into four sections (a, b, c and d) by the Pregel River. Seven bridges connected these regions.
• Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?

Vertex Degree, Euler Circuit and Euler Path
[Figure: example graph with vertices a, b, c, d, e, f]
• Vertex degree: For an undirected graph G, the degree of a vertex is the number of edges incident to that vertex.
• Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit.
• Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
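These definitions lead to Euler's classical degree test: a connected undirected graph has an Euler circuit when every vertex has even degree, and an Euler path when exactly two vertices have odd degree. The Python sketch below applies that test. It assumes the graph is connected and does not check this. The bridge list is the standard textbook layout with the island arbitrarily called "a", which may not match the labelling on the slide.

```python
from collections import Counter


def euler_status(edges):
    """Classify a connected undirected multigraph as 'circuit', 'path' or 'none'."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = [vertex for vertex, d in degree.items() if d % 2 == 1]
    if not odd:
        return "circuit"
    if len(odd) == 2:
        return "path"
    return "none"


# Seven bridges; the assignment of letters to regions is an assumption.
KONIGSBERG = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "c"),
              ("a", "d"), ("b", "d"), ("c", "d")]

print(euler_status(KONIGSBERG))
# prints: none
```

All four regions have odd degree (5, 3, 3, 3), so neither a closed nor an open walk can cross each bridge exactly once.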

Sequence Reconstruction - Hamiltonian path approach
S = (ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
[Figure: graph with the k-tuples ATG, AGG, TGC, TCC, GTC, GGT, GCA, CAG as vertices]
• Vertices: k-tuples from the spectrum, shown in red on the slide (8);
• Edges: overlapping k-tuples (7);
• Path: visiting all vertices corresponding to the sequence.

Sequence Reconstruction - Euler path approach
[Figure: graph with the (k-1)-tuples CG, GT, GC, AT, TG, CA, GG as vertices]
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA.
• Vertices: correspond to (k-1)-tuples (7);
• Edges: correspond to k-tuples from the spectrum (8);
• Path: visiting all EDGES corresponding to the sequence.
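The Euler path view can be tried out on this slide's spectrum with a short Python sketch. For clarity it enumerates every Euler path by plain backtracking, which is fine for a toy spectrum but exponential in general. It is not an efficient Euler path algorithm and is not taken from the talk. `euler_reconstructions` is an illustrative name.

```python
from collections import Counter


def euler_reconstructions(spectrum):
    """Return every sequence whose k-tuples, used once each, are exactly the spectrum."""
    out_deg = Counter(kmer[:-1] for kmer in spectrum)  # edges leaving each (k-1)-tuple
    in_deg = Counter(kmer[1:] for kmer in spectrum)    # edges entering each (k-1)-tuple
    # An open Euler path starts where out-degree exceeds in-degree by one.
    starts = [v for v in out_deg if out_deg[v] - in_deg[v] == 1] or list(out_deg)
    found = set()

    def walk(node, used, seq):
        if len(used) == len(spectrum):
            found.add(seq)
            return
        for i, kmer in enumerate(spectrum):
            if i not in used and kmer[:-1] == node:
                walk(kmer[1:], used | {i}, seq + kmer[-1])

    for start in starts:
        walk(start, frozenset(), start)
    return sorted(found)


SPECTRUM = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(euler_reconstructions(SPECTRUM))
# prints: ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both reconstructions use each of the eight edges exactly once, which is why the spectrum alone cannot tell the two sequences apart.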

SSAHA Type Hash Table

S1 = (ATGCAGGTCC), S2 = (ATCAGGTCC)
S3 = (ATGCTAGGTCC), S4 = (ATGCAGTTCC)

E    k-tuple   Indices, offsets and links to the next
7    ATG       (1,1,28) (3,1,28) (4,1,28)
8    ATC       (2,1,29)
10   AGT       (4,5,38)
11   AGG       (1,5,42) (2,4,42) (3,6,42)
19   TAG       (3,5,11)
24   TTC       (4,7,32)
28   TGC       (1,2,45) (3,2,46) (4,2,45)
29   TCA       (2,2,51)
32   TCC       (1,8,-1) (2,7,-1) (3,9,-1) (4,8,-1)
38   GTT       (4,6,24)
40   GTC       (1,7,32) (2,6,32) (3,8,32)
42   GGT       (1,6,40) (2,5,40) (3,7,40)
45   GCA       (1,3,51) (4,3,51)
46   GCT       (3,3,53)
51   CAG       (1,4,11) (2,3,11) (4,4,10)
53   CTA       (3,4,19)

Point to the Next - Hash Table Links
In the table above, the third number of each entry is the E value of the k-tuple that follows in the same sequence; -1 marks the last k-tuple of a sequence. For example, ATG in S1 links to 28 (TGC), which links to 45 (GCA).

Consensus

S1:    ATGC-AGGTCC
S2:    AT-C-AGGTCC
S3:    ATGCTAGGTCC
S4:    ATGC-AGTTCC
Cons:  ATGC-AGGTCC

ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
CONS = (ATGCAGGTCC)
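One simple way to see how the links yield this consensus is a majority-vote walk: start from the most common first k-tuple and repeatedly follow the most frequent "next" link. The Python sketch below is a simplified stand-in under that assumption, not the Euler path consensus code from the talk. It stores successor counts per k-tuple instead of the E-value links.

```python
from collections import Counter, defaultdict

READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]  # S1..S4


def majority_consensus(reads, k):
    """Follow the most frequent successor link from the most frequent starting k-tuple."""
    successors = defaultdict(Counter)
    starts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        starts[kmers[0]] += 1
        for current, following in zip(kmers, kmers[1:]):
            successors[current][following] += 1

    kmer = starts.most_common(1)[0][0]
    sequence, seen = kmer, {kmer}
    while successors[kmer]:                      # an empty counter means end of sequence
        kmer = successors[kmer].most_common(1)[0][0]
        if kmer in seen:                         # guard against cycles in this sketch
            break
        seen.add(kmer)
        sequence += kmer[-1]
    return sequence


print(majority_consensus(READS, 3))
# prints: ATGCAGGTCC
```

At TGC the links vote 2 to 1 for GCA over GCT, and at CAG 2 to 1 for AGG over AGT, so the minority variants in S3 and S4 drop out of the consensus.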

eulerSNP

Example reads from the slide figure:
ATGC-AGGTCC
ATGC-AGGTCC
ATTCCAGGTCC
ATTC-AGCTCC
ATGCTAGGTCC
ATGCTAGGTCC
ATGC-AGGTCC
ATGC-AGGTCC
ATGCTAGGTCC
ATGC-AGGTCC
ATGCTAGGTCC
ATGCTAGGTCC

In polymorphic datasets of shotgun reads, eulerSNP uses a combined Euler path and hashing algorithm to detect SNPs and replace each of them with the base that occurs most often at that position.

Phusion Assembler Pipeline
[Flow diagram; box labels: Data Process, Shotgun Reads, Reads Group, RPphrap - Contig, RPjoin - Merge, Read-pair Tracker, PRono, Supercontig, FPC Mapping, Assembly]

Kmer Word Hashing

Contiguous Base Hash, K = 12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
 TGGCGTGCAGTC
  GGCGTGCAGTCC
   GCGTGCAGTCCA
    CGTGCAGTCCAT

Gap-Hash 4x3
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG
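Both word types can be generated with a few lines of Python. The gap pattern is inferred from the example words on this slide: three blocks of four bases, with three bases skipped between blocks. That reading of "4x3" is an assumption, although it reproduces the four gapped words shown. The function and parameter names are illustrative.

```python
SEQ = "ATGGCGTGCAGTCCATGTTCGGATCA"


def contiguous_kmers(seq, k):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def gapped_kmers(seq, block=4, blocks=3, gap=3):
    """Words made of `blocks` runs of `block` bases, skipping `gap` bases between runs."""
    stride = block + gap
    span = blocks * block + (blocks - 1) * gap   # total bases covered by one word
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + j * stride:start + j * stride + block] for j in range(blocks)]
        words.append("".join(parts))
    return words


print(contiguous_kmers(SEQ, 12)[:2])
# prints: ['ATGGCGTGCAGT', 'TGGCGTGCAGTC']
print(gapped_kmers(SEQ)[:4])
# prints: ['ATGGGCAGATGT', 'TGGCCAGTTGTT', 'GGCGAGTCGTTC', 'GCGTGTCCTTCG']
```

Each gapped word still has 12 bases but spans 18 positions of the read, so a single mismatch inside a skipped region does not change the word.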

Zebrafish as a model organism
• Danio rerio
• Fish length: 3 cm
• Estimated genome size: 1.55 Gb
• Easy to maintain: short generation time; can be kept at high densities
• Easy to manipulate: external fertilisation and development; transparent embryos

Sanger Institute WGS project (started in spring 2001)
• DNA source: Tuebingen embryos;
• WGS read insert sizes: 2 - 10 kb;
• BAC-end insert sizes: 165 - 175 kb;
• Polymorphism: ~1000 5-day-old embryos;
• SNP density: one in every 200 bp;
• Indel density: one in every 1500 bp;
• Indel length: 2 - 30 bp.

Acknowledgements:
• Jim Mullikin
• Yong Gu
• Adam Spargo
• Richard Durbin
• Kerstin Jekosch
• Sean Humphray
• Jane Rogers
• Sanger Systems Support
• Sanger Sequencing Facilities
