From Kolmogorov and Shannon to Bioinformatics and Grid Computing
Raffaele Giancarlo, Dipartimento di Matematica, Università di Palermo
Aim
• Give a flavour of fundamental novel discoveries about indexing and compression: a string, and any compact encoding of it, is the best index for itself
• Give a flavour of some fundamental novel discoveries about distance functions and classification, particularly relevant for Bioinformatics
• On the way, mention uses of: suffix trees, suffix arrays, Burrows-Wheeler Transform, Move to Front…
• In 30 min., an incredibly long journey: from Kolmogorov and Shannon to Grid Computing
• References: available on-line
What do we mean by “Indexing”?
• Types of data: DNA sequences, audio-video files, executables; in general, a raw sequence of characters or bytes
• Types of query: character-based query, arbitrary substring
• Indexing approaches: full-text indexes (Suffix Array, Suffix Tree, …)
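To make the "arbitrary substring" query concrete, here is a minimal sketch of a full-text index: a suffix array built naively and searched by binary search. The function names `build_suffix_array` and `find_occurrences` are illustrative assumptions, not taken from the slides, and the quadratic construction is only meant for small inputs.

```python
def build_suffix_array(text):
    """Return the 0-based start positions of all suffixes, in lexicographic order.

    Naive construction (sorts full suffixes); fine for a sketch, too slow for real data.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 0-based positions at which pattern occurs in text."""
    m = len(pattern)

    # First row whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First row whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "mississippi#"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ssi"))  # [2, 5]
print(find_occurrences(text, sa, "i"))    # [1, 4, 7, 10]
```

All suffixes that start with the pattern are contiguous in the sorted order, which is why two binary searches are enough to locate every occurrence.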
What do we mean by “Compression”?
• Any algorithm that squeezes data: lossless, lossy
• CPU speed nowadays makes (de)compression “costless”!!
• From March 2001 the Memory eXpansion Technology (MXT) is available on IBM eServers x330MXT
• Same performance as a PC with twice the memory, at half the cost
Moral: it is more economical to store data in compressed form than uncompressed
What do we mean by “Classification”?
• Any tool that can group “related” objects together, e.g. the unaligned mitochondrial genomes
[Figure: NCBI classification]
Compression and Indexing: Two sides of the same coin!
• Do we witness a paradoxical situation?
  • An index injects redundant data, in order to speed up the pattern searches
  • Compression removes redundancy, in order to squeeze the space occupancy
• NO, new results proved a mutual reinforcement behaviour!
  • Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)
  • Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)
• Classification is the “third side” of the coin: Kolmogorov Complexity, Information Theory, Compression and Indexing
Our journey, today...
• Starting points: Index design (Weiner ’73), Compressor design (Shannon ’48), Suffix Array (1990), Burrows-Wheeler Transform (1994)
• Compressed Index: space close to gzip, bzip; query time close to O(|P|)
• Compression Booster: a tool to transform a poor compressor into a better compression algorithm
• Kolmogorov Universal Distances and Classification
First Lap… in record time!!!
Investigate: Indexing ideas → Compressor design (Booster)
Key Idea 1: Suffix Tree [Weiner 73, McCreight 76, Ukkonen 92]
• String: mississippi#
[Figure: suffix tree of mississippi#, with edges labelled by substrings and leaves labelled by the starting positions 1 to 12 of the suffixes]
Key Idea 2: Burrows-Wheeler Compression (1994)
Let us be given a string s = mississippi#
• Write down all the rotations of s:
    mississippi#
    ississippi#m
    ssissippi#mi
    sissippi#mis
    issippi#miss
    ssippi#missi
    sippi#missis
    ippi#mississ
    ppi#mississi
    pi#mississip
    i#mississipp
    #mississippi
• Sort the rows; the last column of the sorted matrix is bwt(s):
    #mississipp i
    i#mississip p
    ippi#missis s
    issippi#mis s
    ississippi# m
    mississippi #
    pi#mississi p
    ppi#mississ i
    sippi#missi s
    sissippi#mi s
    ssippi#miss i
    ssissippi#m i
• bwt(s) = ipssm#pissii
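The construction on this slide can be sketched directly: build every rotation, sort them, and read off the last column. This is a naive illustration under the assumption that the string ends with a unique sentinel (`#`) that sorts before every other character; the function names `bwt` and `inverse_bwt` are illustrative, and practical implementations go through a suffix array instead.

```python
def bwt(s):
    """Burrows-Wheeler transform of s via sorted rotations (naive sketch).

    Assumes s ends with a unique sentinel that sorts before all other characters.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last, sentinel="#"):
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    # The original string is the rotation that ends with the sentinel.
    return next(row for row in table if row.endswith(sentinel))


print(bwt("mississippi#"))          # ipssm#pissii
print(inverse_bwt("ipssm#pissii"))  # mississippi#
```

The inverse shows that the transform loses no information, which is what lets it serve as a preprocessing step for a lossless compressor.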
Burrows and Wheeler Compression
• Why it works:
  • BWT creates a locally homogeneous string: abaababa → bbbaaaaa
  • MTF transforms it into a globally homogeneous sequence of integers: bbbaaaaa → 00010000
  • The final string is “easy” to compress
• Experimentally: compressibility is proportional to the % of zeros
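A minimal sketch of Move to Front may help here. Each symbol is replaced by its current position in a list of symbols, and then moved to the head of that list, so runs of equal symbols become runs of zeros. The initial list order is an assumption: the slide's output 00010000 corresponds to starting with b ahead of a.

```python
def move_to_front(text, alphabet):
    """Encode text as a list of ranks, moving each seen symbol to the front.

    alphabet gives the initial order of the symbol list (an assumption of this sketch).
    """
    symbols = list(alphabet)
    ranks = []
    for ch in text:
        rank = symbols.index(ch)
        ranks.append(rank)
        # Move the symbol just seen to the front of the list.
        symbols.insert(0, symbols.pop(rank))
    return ranks


print(move_to_front("bbbaaaaa", "ba"))  # [0, 0, 0, 1, 0, 0, 0, 0]
print(move_to_front("bbbaaaaa", "ab"))  # [1, 0, 0, 1, 0, 0, 0, 0]
```

Whatever the initial order, a nonzero rank appears only where a run starts, so a locally homogeneous input yields mostly zeros.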
Boosting [Ferragina, Giancarlo, Manzini, Sciortino, 03, 04, 05]
The technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee
[Figure: A compresses s into c; the Booster, built around A, compresses s into c’]
• The better A is, the better Aboost is
• The more compressible s is, the better Aboost is
Qualitatively, it can be shown:
• c’ is shorter than c, if s is compressible
• Time(Aboost) = Time(A), i.e. no slowdown
• A is used as a black-box
Second Lap… Even faster
We investigated: Indexing ideas → Compressor design
Let’s now turn to the other direction: Compression ideas → Index design (Compressed Indexes)
Suffix Array vs. BW-transform
T = mississippi#
    SA   Rotated text   L
    12   #mississipp    i
    11   i#mississip    p
     8   ippi#missis    s
     5   issippi#mis    s
     2   ississippi#    m
     1   mississippi    #
    10   pi#mississi    p
     9   ppi#mississ    i
     7   sippi#missi    s
     4   sissippi#mi    s
     6   ssippi#miss    i
     3   ssissippi#m    i
L includes SA and T. Can we search within L?
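The correspondence in this table can be checked with a short sketch: each entry of L is the character that precedes the matching suffix in T, wrapping around for the suffix that starts at position 1. The helper names are illustrative, and the suffix array is built naively.

```python
def suffix_array(text):
    """0-based suffix array built by sorting the suffixes (naive sketch)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """Read the BWT off the suffix array: L[j] is the character preceding suffix sa[j].

    For the suffix starting at 0, text[-1] wraps around to the final sentinel.
    """
    return "".join(text[i - 1] for i in sa)


text = "mississippi#"
sa = suffix_array(text)
print([i + 1 for i in sa])              # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(bwt_from_suffix_array(text, sa))  # ipssm#pissii
```

Sorting suffixes gives the same row order as sorting rotations here because the sentinel `#` is unique and smaller than every other character.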
A compressed index [Ferragina-Manzini, IEEE FOCS 2000]
The theoretical result:
• Query complexity: O(p + occ log^ε N) time
• Space occupancy: O(N H_k(T)) + o(N) bits, where H_k(T) is the k-th order empirical entropy
In practice, the index is very appealing:
• Space close to the best known compressors, e.g. bzip
• Query time of a few milliseconds on hundreds of MBs
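Searching within L is usually done by backward search, which processes the pattern from its last character to its first and narrows a range of rows of the sorted matrix. The sketch below stores its tables uncompressed, so it illustrates only the search logic and not the space bound stated above. The names `build_index` and `count_occurrences` are illustrative assumptions.

```python
def build_index(text):
    """Build the tables needed for backward search (uncompressed sketch).

    Returns (C, occ, n) where C[c] counts characters of text smaller than c,
    and occ[c][i] counts occurrences of c in the first i characters of L.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)

    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1

    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    occ = {ch: [0] * (len(last) + 1) for ch in counts}
    for i, ch in enumerate(last):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)

    return C, occ, len(text)


def count_occurrences(index, pattern):
    """Count occurrences of pattern, scanning it right to left."""
    C, occ, n = index
    lo, hi = 0, n  # half-open range of rows prefixed by the suffix read so far
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_index("mississippi#")
print(count_occurrences(index, "ssi"))  # 2
print(count_occurrences(index, "i"))    # 4
```

Counting takes one step per pattern character, independent of the text length, which matches the O(p) term of the query bound; reporting positions needs extra sampled suffix array information that this sketch omits.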
Third Lap… Universal Distances and Classification
Large Data Sets
• Classification of Sequences on a Genome-wide Scale
  • Distances based on alignments are either not applicable or too slow
  • Fast and reliable alignment-free methods are badly needed
• Classification of Proteins, both for Function and Structure: lagging behind sequence data
Proteins and Their String Representations • Amino acid sequence (FASTA format); • Atomic coordinates (Atom lines);
Protein Representations • Topological Models (TOPS Diagrams)
Kolmogorov Complexity
• The Kolmogorov Complexity K(x) of a string x is defined as the length of the shortest binary program that produces x.
• The conditional Kolmogorov Complexity K(x|y) represents the minimum amount of information required to generate x by an effective computation when y is given as an input to the computation.
• The Kolmogorov Complexity K(x,y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart.
Universal Similarity Metric (USM)
• Problem:
  • USM(x,y) is based on Kolmogorov Complexity, which is non-computable in the Turing sense.
• Solution:
  • K(x) can be approximated via data compression by using its relationship with Shannon Information Theory.
  • USM is a methodology rather than a formula quantifying the similarity of two strings.
Approximations of USM
• K(x) can be approximated by C(x), K(x,y) by C(xy) and K(x|y*) by C(yx) – C(y), where C(x) is the length of x compressed by a compressor C.
• We obtain three approximations to USM: UCD, NCD and CD.
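The slide's formulas did not survive, so the sketch below uses the definitions commonly given for these three compression-based dissimilarities; treat them as an assumption rather than as the slide's exact wording: UCD(x,y) = max{C(xy) − C(x), C(yx) − C(y)} / max{C(x), C(y)}, NCD(x,y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}, and CD(x,y) = C(xy) / (C(x) + C(y)). Python's standard `zlib` stands in for the compressor C; it is not necessarily one of the compressors used in the experiments.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): length in bytes of x compressed with zlib (a stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """Assumed UCD: conditional complexities replaced by compressed-size differences."""
    cx, cy = compressed_size(x), compressed_size(y)
    return max(compressed_size(x + y) - cx, compressed_size(y + x) - cy) / max(cx, cy)


def ncd(x: bytes, y: bytes) -> float:
    """Assumed NCD (normalized compression distance)."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def cd(x: bytes, y: bytes) -> float:
    """Assumed CD: compressed size of the concatenation over the sum of the sizes."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


x = b"ACGT" * 200             # highly repetitive toy "sequence"
y = bytes(range(256)) * 4     # unrelated toy data
for name, dist in (("UCD", ucd), ("NCD", ncd), ("CD", cd)):
    print(name, round(dist(x, x), 3), round(dist(x, y), 3))
```

For each function the value for the pair (x, x) should come out smaller than for the unrelated pair (x, y), since compressing a string after a copy of itself costs little extra.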
Experiments [Ferragina, Giancarlo, Greco, Manzini, Valiente, 2007]
• Experimental setup:
  • Five benchmark datasets of proteins (several alternative representations);
  • A benchmark dataset of genomic sequences (complete unaligned mitochondrial genomes);
  • Twenty-five compression algorithms;
  • Three dissimilarity functions based on USM.
• Two sets of experiments to compare USM with both alignment-based and alignment-free methods:
  • via ROC Analysis;
  • via UPGMA and NJ.
An example • Unaligned mitochondrial DNA complete Genomes
Results and Conclusions
• Useful guidelines for the use of the USM methodology for biological investigation:
  • Which compressor to use
  • Which among UCD, NCD and CD to use
  • Which data representation is best
  • Etc…
Software
• Kolmogorov Library: http://www.math.unipa.it/~raffaele/kolmogorov/
• Sequential processing is too slow even for relatively small data sets, e.g. classifying 278 files (1.5 Mb) takes 12 hours on a state-of-the-art PC, versus half an hour on the Grid
• Soon available as a Grid-aware Web Service on the COMETA Portal
Advertisement 2
• 20th Edition of the Lipari International Summer School for Computer Scientists
• TOPIC: Algorithms, Science and Engineering
• See the Lipari School Website