28. 07. 2011
Informatics view of determining the relationship between organisms
Lenka Kovářová
Supervisor: Milena Kovářová
Imagine the situation
  You have found a new, so far unknown organism, and you want to know which species are its relatives. What to do now?
  Ask the unknown organism? It would probably not answer.
  Find some signs similar to known organisms? That is not a proof of their relationship.
  Find genomic similarity.
How to compare genomes
  Compare the whole DNA code? That is a big amount of data.
  Compare a selected chromosome? Related species can have the same gene at different locations on different chromosomes.
  Compare one special gene? You don't know where to find the selected gene in the genome of a so far unknown organism.
  Almost all eukaryotic organisms have mitochondrial DNA.
Mitochondrion
  Semi-autonomous organelle
  Found in most eukaryotic cells
  Described as the cell's power plant
  Has its own independent genome
  Believed to be originally derived from endosymbiotic prokaryotes
Mitochondrial DNA
  Mostly a circular DNA molecule
Mitochondrial inheritance
  Mitochondria are normally inherited exclusively from the mother
  Mitochondrial DNA is not highly conserved and has a rapid mutation rate
  It is useful for studying the evolutionary relationships of organisms
  Model of human migration based on mitochondrial DNA
Project
  Try to determine the relationship between organisms based on mitochondrial DNA
  Download the mitochondrial DNA of as many different organisms as possible
  Write a program with a suitable algorithm to determine the similarity of DNA codes
  Analyse the results
Organisms
  Streptophyta: Arabidopsis thaliana, Physcomitrella patens
  Fungi: Aspergillus niger, Ashbya gossypii, Saccharomyces cerevisiae
  Insecta: Apis mellifera, Acyrthosiphon pisum, Tribolium castaneum
  Deuterostomia: Ciona intestinalis, Saccoglossus kowalevskii, Strongylocentrotus purpuratus
  Vertebrata: Anolis carolinensis, Xenopus (Silurana) tropicalis, Danio rerio
  Aves: Gallus gallus, Meleagris gallopavo, Taeniopygia guttata
  Mammalia: Ornithorhynchus anatinus, Monodelphis domestica
  Carnivora: Ailuropoda melanoleuca, Canis familiaris
  Cetartiodactyla: Sus scrofa, Bos taurus
  Perissodactyla: Equus caballus
  Glires: Oryctolagus cuniculus, Mus musculus, Rattus norvegicus
  Primates: Macaca mulatta, Pongo abelii, Pan troglodytes, Homo sapiens
Methodology
  Get the FASTA format of the mitochondrial DNA
  Compare the similarity of the genomes
  Find relative organisms
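The first step above, reading mitochondrial DNA from FASTA, can be sketched in a few lines. This is a minimal illustration, not the program used in the project: the function name `parse_fasta` and the two toy records are hypothetical, and only the basic FASTA convention is assumed (a header line starting with `>` followed by one or more sequence lines).

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into {header: sequence}, joining wrapped sequence lines."""
    records: Dict[str, str] = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)  # close the previous record
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line.upper())  # normalise case so acgt == ACGT
    if name is not None:
        records[name] = "".join(chunks)  # close the last record
    return records


# Hypothetical toy input; real mitochondrial genomes are thousands of bases long.
example = """>toy_mito_1 hypothetical record
ACGTACGT
acgt
>toy_mito_2
GGGCCC
""".splitlines()

records = parse_fasta(example)
print(records)
```

Normalising case before comparison is a design choice of this sketch: a byte-level compressor would otherwise treat `a` and `A` as different symbols.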
Methodology
  Compress one DNA code and compress two DNA codes together
  Compare the lengths of the compressed files
  Compute the coefficient of similarity from these lengths, where Compress(x) is the compression algorithm applied to file x, |x| is the length of file x, and DNA1+DNA2 is the concatenation of DNA1 and DNA2 in this order
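The slide's own coefficient is not reproduced in this transcript, so the sketch below swaps in a well-known coefficient built from the same three quantities, the normalized compression distance: NCD = (|Compress(DNA1+DNA2)| - min(|Compress(DNA1)|, |Compress(DNA2)|)) / max(|Compress(DNA1)|, |Compress(DNA2)|). Treat it as a stand-in, not as the formula used in the project. Python's `zlib` module is assumed as the Deflate implementation, and the helper names and toy sequences are hypothetical.

```python
import hashlib
import zlib


def compressed_len(data: bytes) -> int:
    """|Compress(x)|: size in bytes after Deflate compression (zlib container, level 9)."""
    return len(zlib.compress(data, 9))


def ncd(dna1: bytes, dna2: bytes) -> float:
    """Normalized compression distance: small for similar inputs, close to 1 for unrelated ones."""
    c1 = compressed_len(dna1)
    c2 = compressed_len(dna2)
    c12 = compressed_len(dna1 + dna2)  # concatenation in this order
    return (c12 - min(c1, c2)) / max(c1, c2)


def toy_sequence(label: str, length: int) -> bytes:
    """Deterministic pseudo-random DNA-like bytes derived from SHA-256 (illustration only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        digest = hashlib.sha256(f"{label}:{counter}".encode()).digest()
        out.extend(b"ACGT"[b & 3] for b in digest)
        counter += 1
    return bytes(out[:length])


def mutate(seq: bytes, step: int) -> bytes:
    """Change every step-th base to the next letter of ACGT, imitating point mutations."""
    out = bytearray(seq)
    for i in range(0, len(out), step):
        out[i] = b"ACGT"[(b"ACGT".index(out[i]) + 1) % 4]
    return bytes(out)


ancestor = toy_sequence("toy-a", 4000)
close_relative = mutate(ancestor, 50)
unrelated = toy_sequence("toy-b", 4000)

print("close relative:", ncd(ancestor, close_relative))
print("unrelated:     ", ncd(ancestor, unrelated))
```

One caveat of this sketch: Deflate can only refer back 32,768 bytes, so when the two sequences together are much longer than that, similarities between distant parts of DNA1 and DNA2 may go undetected.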
Compression algorithm
  Deflate stream
  Combination of the LZ77 algorithm and Huffman coding
  Compression is achieved through two steps
    Matching duplicate strings and replacing them with pointers
    Replacing symbols with new, weighted symbols based on their frequency of use
Compression algorithm
  Deflate stream
  Lossless data compression algorithm
  Combination of the LZ77 algorithm and Huffman coding
  Series of blocks, each block preceded by a 3-bit header
    1 bit: last-block-in-stream marker
      1: this is the last block in the stream
      0: there are more blocks to process after this one
    2 bits: encoding method used for this block
      00: a raw (stored) section follows, between 0 and 65,535 bytes in length
      01: a static Huffman compressed block, using a pre-agreed Huffman tree
      10: a compressed block complete with the Huffman table supplied
      11: reserved, don't use
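The 3-bit block header can be inspected directly. The sketch below assumes Python's `zlib` module and asks it for a raw Deflate stream (negative `wbits`, so no zlib wrapper precedes the first block); the helper names are hypothetical. Deflate packs header bits starting from the least significant bit of the first byte, so the last-block marker is bit 0 and the block type occupies bits 1 and 2.

```python
import zlib


def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress to a raw Deflate stream (wbits=-15 omits the zlib header and checksum)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def first_block_header(stream: bytes):
    """Return (is_last_block, block_type) decoded from the first 3 bits of the stream."""
    first = stream[0]
    is_last = bool(first & 1)       # bit 0: last-block marker
    block_type = (first >> 1) & 3   # bits 1-2: 0 stored, 1 static Huffman, 2 dynamic Huffman
    return is_last, block_type


dna = b"ACGT" * 100

stored = raw_deflate(dna, 0)   # level 0 stores the data without compressing it
packed = raw_deflate(dna, 9)   # level 9 uses LZ77 plus Huffman coding

print("level 0:", first_block_header(stored), len(stored), "bytes")
print("level 9:", first_block_header(packed), len(packed), "bytes")
```

Both streams decompress back to the same input with `zlib.decompress(stream, -15)`, which is what "lossless" means here.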
Algorithm LZ77
  Duplicate string elimination within compressed blocks
  If a duplicate series of bytes (a repeated string) is spotted, a back-reference to the previous location of that identical string is inserted instead
  An encoded match to an earlier string consists of a length (3–258 bytes) and a distance (1–32,768 bytes)
  Relative back-references can be made across any number of blocks, as long as the distance stays within the last 32 KiB of uncompressed data
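The duplicate string elimination described above can be illustrated with a toy encoder. This is a simplified greedy sketch, not the matcher a real Deflate implementation uses: it scans the whole window by brute force, and the function names and token format (an `int` for a literal byte, a `(length, distance)` tuple for a back-reference) are assumptions made for illustration. Only the limits come from the slide: lengths of 3–258 bytes and distances of 1–32,768 bytes.

```python
MIN_MATCH, MAX_MATCH, WINDOW = 3, 258, 32768


def lz77_tokens(data: bytes) -> list:
    """Greedy LZ77: emit literal bytes (int) or (length, distance) back-references."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Try every earlier start position inside the sliding window.
        for start in range(max(0, i - WINDOW), i):
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_dist))
            i += best_len
        else:
            tokens.append(data[i])  # too short to be worth a back-reference
            i += 1
    return tokens


def lz77_decode(tokens: list) -> bytes:
    """Rebuild the data; copying byte by byte lets a match overlap its own output."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            length, distance = tok
            for _ in range(length):
                out.append(out[-distance])
        else:
            out.append(tok)
    return bytes(out)


sample = b"ACGTACGTACGTTT"
tokens = lz77_tokens(sample)
print(tokens)
```

In the sample, the match `(8, 4)` is longer than its distance: the copy overlaps the bytes it is producing, which is how a short repeating unit such as `ACGT` is encoded compactly.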
Huffman coding
  Replaces commonly used symbols with shorter representations and less commonly used symbols with longer representations
  Builds a prefix-free code tree: no code is the prefix of another
  The more probable a symbol, the shorter its code (length of roughly -log2 of the symbol's probability)
  A tree is created which contains space for 288 symbols
    0–255: represent the literal bytes/symbols 0–255
    256: end of block
    257–285: combined with extra bits, a match length of 3–258 bytes
    286, 287: not used
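How frequent symbols end up with shorter codes can be seen in a small, generic Huffman construction. The sketch below is the textbook algorithm, assuming nothing about how Deflate serialises its trees; the function name `huffman_code` and the toy input are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Build a prefix-free binary code in which rarer symbols get longer codewords."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}  # a single symbol still needs one bit
    # Heap entries: (weight, unique tie-breaker, {symbol: codeword so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)    # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


# Toy input: A is the most frequent symbol, T and N the rarest.
toy = "A" * 8 + "C" * 4 + "G" * 2 + "T" + "N"
code = huffman_code(toy)
lengths = {s: len(c) for s, c in sorted(code.items())}
print(lengths)
```

With these frequencies the code lengths are 1, 2, 3, 4 and 4 bits, so the 16 symbols need 30 bits in total instead of the 48 bits a fixed 3-bit code would take.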
Huffman coding
  A match length code is always followed by a distance code
  Based on the distance code read, further "extra" bits may be read in order to produce the final distance
  The distance tree contains space for 32 symbols
    0–3: distances 1–4
    4–5: distances 5–8, 1 extra bit
    6–7: distances 9–16, 2 extra bits
    8–9: distances 17–32, 3 extra bits
    ...
    26–27: distances 8,193–16,384, 12 extra bits
    28–29: distances 16,385–32,768, 13 extra bits
    30–31: not used
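The pattern in the distance table can be generated rather than written out: from code 4 onward, each pair of codes uses one more extra bit than the pair before, and each code covers 2^extra consecutive distances. The sketch below rebuilds the table under that rule and maps a distance to its code; the function names are hypothetical.

```python
def build_distance_table() -> list:
    """Return [(base_distance, extra_bits)] for the distance codes 0..29."""
    table = []
    base = 1
    for code in range(30):
        extra = max(0, code // 2 - 1)  # 0,0,0,0,1,1,2,2,...,13,13
        table.append((base, extra))
        base += 1 << extra             # each code covers 2**extra distances
    return table


DIST_TABLE = build_distance_table()


def distance_code(distance: int):
    """Map a distance in 1..32768 to (code, extra_bits, value_of_extra_bits)."""
    if not 1 <= distance <= 32768:
        raise ValueError("distance out of range")
    for code in range(29, -1, -1):     # highest code whose base fits
        base, extra = DIST_TABLE[code]
        if distance >= base:
            return code, extra, distance - base


for d in (1, 5, 8, 17, 16385, 32768):
    print(d, distance_code(d))
```

For example, distance 8 falls in the range 5–8: it is code 5 (which covers 7–8) with 1 extra bit set to 1.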
Results
  Homininae relationship
  Aves relationship
  Insecta relationship
  Mammalia relationship
  Imagine this is your unknown animal
Conclusion
  Many mitochondrial DNA codes were downloaded
  The taxonomic relationship between them was found
  Suitable algorithms for determining the relationship between organisms were researched
  The demonstrated algorithm was chosen
  The algorithm was implemented
  The ability to determine the relationship between organisms using this algorithm was proven
Thank you for your attention