Mapping Genomes onto each other – Synteny detection

CS 374, Aswath Manohar

Necessity is the mother of invention • Genome sequencing has produced voluminous amounts of genomic data. • The human genome has been completely sequenced. The rat and mouse genomes have also been completed. • What do we do with all this data?

Necessity… • We need to analyze all this data meaningfully. • This need has given rise to the field of comparative genomics. • Goal: identification of functional DNA through comparative methods. • A large set of functional elements in the rat/human/mouse genomes remains uncharacterized. (Pash: Kalafus et al.)

Analysis Methods • Standard dynamic programming alignment algorithms – Needleman-Wunsch, Smith-Waterman. • Highly sensitive aligners. • Computationally prohibitive: the cost grows with the product of the two sequence lengths, so they are impractical for the analysis of multiple mammalian genomes.
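To make the cost concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not parameters from these slides. The two nested loops fill one cell per pair of positions, which is where the quadratic cost comes from.

```python
def needleman_wunsch_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score of s and t with a linear gap penalty.

    Illustrative scoring values only. Runs in O(len(s) * len(t)) time
    and keeps just two rows of the scoring matrix in memory.
    """
    # Row 0: aligning the empty prefix of s against prefixes of t costs only gaps.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # column 0: prefix of s against the empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # gap in t
            left = curr[j - 1] + gap  # gap in s
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("AC", "AGC"))  # 1: two matches and one gap
```

For two sequences of a few hundred megabases each, the number of cells filled by this loop is on the order of 10^16 or more, which is why the slides call the approach prohibitive at genome scale.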

Methods… • Faster implementations of dynamic programming such as LAGAN (Brudno et al. 2003). • Works well on a megabase level, but requires prior information (‘anchors’) on a genomic scale. • Seed-and-extend methods – a ‘seed’ (hotspot) is determined, then extended on either side. • Again, the extension step is computationally expensive.

Pash • So what is the solution? • Use Positional Hashing!!! • Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing • Authors: Ken Kalafus, Andrew Jackson and Aleksandar Milosavljevic

Pash in figures

More formally… • The sequences S and T are conceptually divided into subsequences of length L (positions are 1-based and inclusive): • Si = S[i*L+1, ..., (i+1)*L] • Ti’ = T[i’*L+1, ..., (i’+1)*L]
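The division into length-L subsequences can be sketched as follows. The function name is hypothetical, and dropping a trailing partial block is an assumption made here for simplicity, since the slides do not say how a remainder is handled. The 1-based inclusive range [i*L+1, ..., (i+1)*L] corresponds to the 0-based Python slice `seq[i*L:(i+1)*L]`.

```python
def split_into_blocks(seq, L):
    """Split seq into consecutive blocks of length L.

    Block i covers 1-based positions i*L+1 .. (i+1)*L, which is the
    0-based half-open slice seq[i*L:(i+1)*L]. A trailing block shorter
    than L is dropped (an assumption of this sketch).
    """
    return [seq[i * L:(i + 1) * L] for i in range(len(seq) // L)]


print(split_into_blocks("ACGTACGTAC", 4))  # ['ACGT', 'ACGT']
```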

Hashing • The single scoring matrix is divided into L diagonal matrices. • Each of these is further divided into L ‘diagonal segment’ matrices. • This gives L² ‘diagonal segment’ matrices in total. • We use one hash table for each ‘diagonal segment’ matrix. • Therefore, total number of hash tables = L²

Hashing… • Each k-mer is mapped to a bin in the hash table. • The positions of the k-mer are stored in one of the bin's two linked lists (one list for each sequence). • We assume an efficient hash function.

Hashing… • If both lists in a bin are non-empty, then the k-mer corresponding to that bin is a matching k-mer! • Collating the matching k-mers involves a single traversal of each list.
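The bin-with-two-lists idea can be sketched as below. This is a simplified illustration, not Pash's implementation: it uses a single table rather than L² positional tables, a Python `dict` stands in for the hash table, Python lists stand in for the linked lists, and the function name is hypothetical.

```python
from collections import defaultdict


def matching_kmers(s, t, k):
    """Return {kmer: (positions_in_s, positions_in_t)} for k-mers found in both.

    Each bin holds two position lists, one per sequence. A bin whose two
    lists are both non-empty corresponds to a matching k-mer.
    """
    bins = defaultdict(lambda: ([], []))
    for i in range(len(s) - k + 1):
        bins[s[i:i + k]][0].append(i)  # list for sequence S
    for j in range(len(t) - k + 1):
        bins[t[j:j + k]][1].append(j)  # list for sequence T
    # Keep only bins where both lists are non-empty.
    return {kmer: lists for kmer, lists in bins.items() if lists[0] and lists[1]}


hits = matching_kmers("ACGTAC", "TTACGT", 3)
print(sorted(hits))  # ['ACG', 'CGT', 'TAC']
print(hits["ACG"])   # ([0], [2])
```

Filling the table takes one pass over each sequence, and every match is read off directly from its bin, so no pairwise comparison of positions is needed to find the matching k-mers.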

Running time • Worst case? • When you have to perform an all-against-all comparison • O(M*N) for sequences of lengths M and N • Highly unrealistic

Running time… • In practical applications, output size is O(M+N). • If k-mers of sufficient length are used, each of the L² hash tables is populated with (M+N)/L k-mers. • Hence running time = L² * (M+N)/L = O((M+N)*L) • If the work is spread across L nodes, running time = O(M+N).

Significance of Similarities • For each similarity found, Pash reports both the number of matching bases and a bit score that indicates significance. • The bit score is calculated according to the Algorithmic Significance method.

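Significance of Similarities… • Based on the number of bits saved in a minimal encoding of the target sequence X=T given that the source is known. • d = Io(X) – I(X), where Io(X) is the encoding length without the source and I(X) the encoding length given the source • Io(X) = 2 * n bits (2 bits per base for the n bases of X)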

Kmer encoding… • To compute I(X), one of two encodings is used on a case-by-case basis. • A 1-bit flag denotes which encoding is used. • Let w be the number of matching kmers. • Let W be the maximum possible number of kmers in a match. • Conceptually, W corresponds to the length of the diagonal and is constant.

Kmer encoding… • There are C(W,w) possible lists of matching kmers. • To uniquely identify a kmer set we need log2 C(W,w) bits. • Therefore the kmer encoding length is: Iw(X) = 1 + log2 W + log2 C(W,w) bits

Base encoding • Base encoding is very similar to kmer encoding. • Let b be the number of bases defined in a match. • Let B be the maximum possible number of bases contained in a match. • Ib(X) = 1 + log2 B + log2 C(B,b) bits.

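Significance of Similarities • Therefore Imin(X) = min(Iw(X), Ib(X)) • I(X) = Imin(X) + 2*(n-b) bits (the n-b bases not covered by the match still cost 2 bits each) • Substituting into d = Io(X) – I(X): d = 2*n – Imin(X) – 2*(n-b) • Therefore, after simplifying, d = 2 * b – Imin(X)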
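As a worked example of the formulas on these slides, the sketch below computes both encoding lengths and the resulting bit score. The function names and the input values (w=16, W=16, b=32, B=32) are hypothetical, chosen only so that the logarithms come out as whole numbers. They are not figures reported for Pash.

```python
from math import comb, log2


def encoding_bits(maximum, observed):
    """1 + log2(maximum) + log2 C(maximum, observed), as in Iw(X) and Ib(X)."""
    return 1 + log2(maximum) + log2(comb(maximum, observed))


def bit_score(w, W, b, B):
    """d = 2*b - Imin(X), with Imin(X) = min(Iw(X), Ib(X))."""
    i_w = encoding_bits(W, w)  # kmer encoding
    i_b = encoding_bits(B, b)  # base encoding
    return 2 * b - min(i_w, i_b)


# Hypothetical match: all 16 possible kmers and all 32 possible bases match.
# Iw = 1 + 4 + 0 = 5 bits, Ib = 1 + 5 + 0 = 6 bits, so Imin = 5.
print(bit_score(w=16, W=16, b=32, B=32))  # 59.0
```

The cheaper of the two encodings is always the one used, so a match described compactly by either its kmers or its bases receives the higher score.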

Results • Used to compare the latest assembly of the rat genome to the human and mouse genomes. • Each pair-wise comparison took 4 days on 6 CPUs = 24 CPU days • The machines used 750 MHz Pentium III processors • Peak RAM usage = 500 MB (approx.)

Results…

Discussion • In contrast to seed-and-extend methods, Pash represents sequences as short kmers rather than bases. • Efficiently parallelizable. • For applications requiring base-pair-level alignments, Pash can be used as an anchoring module. • Its output can in turn be post-processed by programs like LAGAN, AVID or BLASTZ.

Availability • Available free of charge for academic use. • http://www.br1.bcm.tmc.edu

Thanks!
