University of Belgrade, School of Electrical Engineering
Implementation of the BLAST Algorithm Using Hadoop MapReduce
Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic
Introduction
- Sequence alignment
  • A way of arranging DNA, RNA, or protein sequences to identify regions of similarity
  • Such regions may reflect relationships between the sequences:
    • functional
    • structural
    • evolutionary
- How do we know that two genes, often in different organisms, are in fact two versions of the same gene? Similarity!
Siniša Ivković - sinisa.ivkovic@gmail.com
Introduction
• A number of algorithms solve the sequence alignment problem and guarantee the best solution
• As the amount of data to be processed grows, the execution time of these algorithms becomes unacceptable
• Therefore, we must turn to heuristic methods such as BLAST
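To make the cost of an exact method concrete, here is a minimal sketch of Smith-Waterman local alignment scoring, one well-known algorithm of the "guaranteed best solution" kind. The slides do not name a specific algorithm, so this choice is an assumption, as are the function name and the match, mismatch, and gap values. The nested loops fill one table cell per pair of positions, so the work grows with the product of the two sequence lengths.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the diagonal with a match or a mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Comparing one query against every sequence in a large database repeats this quadratic computation once per database sequence, which is the scaling problem that motivates a heuristic.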
BLAST - Basic Local Alignment Search Tool
• A fast local sequence alignment algorithm
• BLAST is efficient because it looks for regions of high similarity instead of trying to find and check every local alignment
KRKLQRNRTSFTQEQIEALEKEFERTHYPDVFARERLAAKIDLPEARIQVWFSNRRAKWRREEKL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
KKKHRRNRTTFTTYQLHQLERAFEASHYPDVYSREELAAKVHLPEVRVQVWFQNRRAKWRRQERL
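The idea of looking only at regions of high similarity can be sketched as a simplified seed-and-extend search. This is not the full BLAST algorithm: real BLAST scores neighbouring words with a substitution matrix and also performs gapped extension, while this sketch uses exact word matches and ungapped extension. All function names, the word size `k`, the scores, and the `drop` threshold are assumptions made for illustration.

```python
from collections import defaultdict


def index_words(subject, k):
    """Map every length-k word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend_one_way(query, subject, q, s, step, match, mismatch, drop):
    """Walk from (q, s) in direction `step`; return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:
            break  # the score fell too far below the best seen so far
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, k=3, match=1, mismatch=-1, drop=3):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    index = index_words(subject, k)
    hits = set()  # different seeds can extend to the same segment
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], ()):
            left, left_len = extend_one_way(
                query, subject, q - 1, s - 1, -1, match, mismatch, drop)
            right, right_len = extend_one_way(
                query, subject, q + k, s + k, 1, match, mismatch, drop)
            hits.add((k * match + left + right,
                      q - left_len, s - left_len,
                      left_len + k + right_len))
    return sorted(hits, reverse=True)
```

For example, `seed_and_extend("HYPDVF", "XXHYPDVYXX")` finds one segment of length 5 that starts at query position 0 and subject position 2. Only positions that share an exact word are ever examined, which is where the speed-up over filling a full alignment table comes from.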
Parallel BLAST
- Most bioinformatics algorithms are designed as sequential
  • This stems from the very nature of bioinformatics processing
  • The rapid spread of knowledge in biology causes the constant emergence of new concepts and significant changes to already known ones
- The declining price of genome sequencing requires faster execution of these algorithms
- Implementations of parallel BLAST
  • PThread
  • MPI
ETF Hadoop BLAST
- Big Data: a collection of data sets so large and complex that it becomes difficult to process using standard database tools or traditional data processing applications
- Parallel computing: a form of computation in which many calculations are carried out simultaneously. It brings problems of its own:
  • communication and synchronization between processes
  • hardware failure
- MapReduce: a programming model that frees programmers from thinking about these problems
- Apache Hadoop: a free implementation of the MapReduce paradigm
MapReduce
[Diagram: three MAP tasks emit values, a SORT stage groups them, and two REDUCE tasks produce the output values]
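The MAP, SORT, and REDUCE stages of the diagram can be imitated in a single process to show the data flow. This is a sketch in plain Python, not the Hadoop API: the `map_reduce` helper and the residue-counting example are hypothetical, and a real cluster would run the map and reduce calls on different machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then reduce each group."""
    # MAP: every record yields zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    # SORT: order the pairs by key and collect the values for each key.
    groups = defaultdict(list)
    for key, value in sorted(pairs, key=lambda pair: pair[0]):
        groups[key].append(value)
    # REDUCE: collapse each group of values into one result per key.
    return {key: reducer(key, values) for key, values in groups.items()}


def count_residues(sequence):
    """Mapper: emit (residue, 1) for every character of a sequence."""
    return [(residue, 1) for residue in sequence]


def total(key, values):
    """Reducer: sum the counts collected for one residue."""
    return sum(values)


counts = map_reduce(["KRKL", "KKKH"], count_residues, total)
```

Here `counts` maps each residue to its number of occurrences across both input strings. The mapper and reducer contain no code for communication, synchronization, or failure handling, which is the point of the programming model.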
ETF Hadoop BLAST - Implementation
[Diagram: the query mySequence {q1} and the database parts {db1}, {db2}, {db3} are distributed to three MAP tasks, which emit the hits {hit1} to {hit6}]
ETF Hadoop BLAST - Implementation
[Diagram: the query mySequence {q1} and the database parts {db1}, {db2}, {db3} are distributed to three MAP tasks, which emit the hits {hit1} to {hit6}; two REDUCE tasks then output the final hits {hit1}, {hit3}, {hit6}]
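The data flow in the diagram can be sketched as follows: each MAP task compares the query with one part of the database and emits hits, and the REDUCE step keeps a subset of them. The slides do not say how hits are scored or how the reducers choose which ones to keep, so the shared-word score and the top-n selection below are assumptions, and all names are hypothetical. The sketch runs in one process rather than on Hadoop.

```python
import heapq


def words(sequence, k):
    """Return the set of length-k words in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def toy_score(query, subject, k=3):
    """Stand-in for a BLAST score: number of distinct shared length-k words."""
    return len(words(query, k) & words(subject, k))


def map_chunk(query_id, query, chunk):
    """MAP: compare the query with every sequence in one database part."""
    for subject_id, subject in chunk:
        score = toy_score(query, subject)
        if score > 0:
            yield query_id, (score, subject_id)


def reduce_hits(query_id, hits, top_n):
    """REDUCE: keep the top_n best hits for one query (assumed selection rule)."""
    return heapq.nlargest(top_n, hits)


def run(query_id, query, chunks, top_n=3):
    """Simulate the job: one MAP call per database part, then one REDUCE per query."""
    grouped = {}
    for chunk in chunks:
        for key, hit in map_chunk(query_id, query, chunk):
            grouped.setdefault(key, []).append(hit)
    return {key: reduce_hits(key, hits, top_n) for key, hits in grouped.items()}


# Made-up database parts standing in for {db1}, {db2}, {db3}.
db1 = [("s1", "AAHYPDVFAA"), ("s2", "CCCCCC")]
db2 = [("s3", "GGHYPDGG"), ("s4", "TTTTTT")]
db3 = [("s5", "PDVFKK"), ("s6", "YPDVLL")]
result = run("q1", "HYPDVF", [db1, db2, db3])
```

Keying every hit by the query identifier means all hits for one query reach the same reducer, which lets the selection consider results from every database part at once.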
ETF Hadoop BLAST
>GENSCAN00000000013 pep:genscan chromosome:GRCh37:18:4755977:4807982:1 transcript:GENSCAN00000000013 transcript_biotype:protein_coding
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLLAASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFPFG
TANTGLLAVKVEVIILVSLTHAQLSRAGQHAGCTTCLQDECAVAAGEEEETQQGELADVIYPSLLAASTSSVLEDGAGPHKGLQKLSRLIRFVDVVGGFRREKGYMAWIKPRYSEFPKVNSWTESSFPFG
HSP: 661
E-value: 0.001446314485823671
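The slides do not explain how the reported E-value is derived. A common definition, assumed here, is the Karlin-Altschul estimate E = K·m·n·e^(−λS), where S is the raw score of the segment pair, m the query length, n the database length, and K and λ are statistical parameters of the scoring system. The sketch below uses hypothetical function names and does not attempt to reproduce the value shown above, since the parameters and lengths behind it are not given.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score (Karlin-Altschul)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = query_len * db_len * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Under this definition n describes the whole search space. If each MAP task sees only one part of the database, it would presumably need the total database size to report E-values that are comparable across parts.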
Conclusion
- Bioinformatics has become an important part of many areas of biology
  • Sequencing and annotating genomes and their observed mutations
  • Data mining of biological literature and the development of gene ontologies
  • Understanding the evolutionary aspects of molecular biology
- Personalized medicine
  • A medical model that proposes the customization of healthcare
  • We need to consider the whole spectrum of clinical information
    • Electronic health care records
    • Clinical trials
    • etc.
Conclusion
• We need to collect information from the real world
• Develop analytics that can actually extract causal relationships and generate predictive models
• Future steps:
  • Specialized hardware (FPGA)
Thank you for your attention
Siniša Ivković
sinisa.ivkovic@gmail.com