
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data. Changjun Wu, Ananth Kalyanaraman, School of Electrical Engineering and Computer Science, Washington State University.


Presentation Transcript


  1. An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data. Changjun Wu, Ananth Kalyanaraman. School of Electrical Engineering and Computer Science, Washington State University.

  2. Outline • Problem Introduction • Related Work • Our Parallel Approach for Protein Family Identification • Experimental Results • Conclusions & Future Work • Acknowledgments SC08, Austin, TX

  3. Outline • Problem Introduction • Related Work • Our Parallel Approach for Protein Family Identification • Experimental Results • Conclusions & Future Work • Acknowledgments

  4. Metagenomics • Application of genomics techniques to the study of microbial communities in their natural environments. • Without isolation and lab cultivation of individual species.

  5. Protein Family Identification Problem
      • Motivation
        • Family identification → functional annotation
        • Diversity of the protein family universe
      [Figure labels: known proteins; family 1, family 2, ..., family i; functional annotation; new metagenomic proteins; new protein family]

  6. What is a Protein Family?
      • A protein family is a group of evolutionarily (and thus functionally) related proteins.
      [Figure labels: sequence similarity; structure similarity; domain similarity]

  7. Outline • Problem Introduction • Related Work • Our Parallel Approach for Protein Family Identification • Experimental Results • Conclusions & Future Work • Acknowledgments

  8. Related Work
      • General approach
        • Perform all-against-all sequence comparison (BLAST)
        • Group proteins based on pairwise similarity
      • Related work
        • Kriventseva et al. (2001)
        • Enright et al. (2002)
        • Pipenbacher et al. (2002)
        • Kelil et al. (2007)
        • Yooseph et al. (2007)
        • …
      [Annotation: sequential approach]

  9. GOS Approach
      • Yooseph et al. (2007)
      [Figure: three-step pipeline: (1) redundancy removal, (2) graph generation, (3) dense subgraph detection; annotations: Θ(n^2) space, Ω(n^2) time]

  10. Limitations of Current Approaches
      • Constructing large graphs can be time-consuming
        • ~10^6 CPU hours for ~28.6 million proteins (GOS approach)
      • Quadratic space requirement
      • Brute-force parallel approach

  11. Outline • Problem Introduction • Related Work • Our Parallel Approach for Protein Family Identification • Experimental Results • Conclusions & Future Work • Acknowledgments

  12. Main Ideas of Our Approach
      • Idea #1: A dense subgraph (DS) cannot span two connected components (CC)
      • Challenge: find connected components without generating the whole graph
      • Use divide and conquer to drastically reduce problem size!
      [Figure: graph regions labeled CC and DS]

  13. Main Ideas of Our Approach
      • Idea #2: Exact-match based filtering technique
      • Eliminate unnecessary all-against-all comparisons!
      [Figure: 100 bp alignment at 98% sequence similarity; exact-match segment of >= 33 bp]
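
The numbers on this slide (98% similarity, 100 bp, >= 33 bp) are consistent with a pigeonhole argument: a few mismatches can only break an alignment into a few exact-match runs, so one run must be long. The sketch below is an illustration under stated assumptions, not the authors' filter: it assumes an ungapped alignment with substitutions only, and `min_exact_match` is a hypothetical helper name.

```python
import math

def min_exact_match(align_len: int, similarity_pct: int) -> int:
    """Lower bound on the longest exact-match run in an ungapped alignment.

    Assumption: the only differences are substitutions. With k mismatches,
    the align_len - k matching positions fall into at most k + 1 runs, so
    the longest run has at least ceil((align_len - k) / (k + 1)) positions.
    """
    # Maximum number of mismatches allowed at this similarity level.
    k = align_len * (100 - similarity_pct) // 100
    return math.ceil((align_len - k) / (k + 1))

# 100 positions at 98% similarity allow 2 mismatches, giving 3 runs
# that together hold 98 matching positions.
print(min_exact_match(100, 98))  # 33
```

A pair of sequences that shares no exact match of at least this length cannot meet the similarity threshold, so it can be skipped without running an alignment.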

  14. Main Ideas of Our Approach
      • Idea #3: High overlap of outlinks → dense subgraph
      • Use outlink comparison to group vertices into dense subgraphs!
      [Figure: vertices u and v with overlapping outlinks, as in a web community]
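
Idea #3 can be made concrete with a set-overlap score. The slide does not say which overlap measure is used, so the Jaccard coefficient below is an assumption chosen for illustration, and `outlink_overlap` is a hypothetical name.

```python
def outlink_overlap(out_u: set, out_v: set) -> float:
    """Jaccard overlap of two outlink sets (assumed measure, not from the slides)."""
    union = out_u | out_v
    if not union:
        return 0.0  # two vertices without outlinks share nothing
    return len(out_u & out_v) / len(union)

# u and v share three of the five distinct targets they point to.
u_links = {"a", "b", "c", "d"}
v_links = {"b", "c", "d", "e"}
print(outlink_overlap(u_links, v_links))  # 0.6
```

Vertices whose outlink sets score highly against each other are candidates for the same dense subgraph.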

  15. Our Parallel Approach for Protein Family Identification
      [Figure: pipeline from input protein sequences to dense subgraphs: (1) redundancy removal, (2) connected component detection, (3) bipartite graph generation, (4) dense subgraph detection. Other labels: protein sequence; pairwise sequence homology; connected components]

  16. Redundancy Removal
      • Criteria
        • Similarity of the match is >= 98%
        • >= 95% of the shorter sequence is covered by the match
      [Figure labels: generalized suffix tree (GST); cutoff; Idea #2; >= 98%; >= 95%; sequences p1 to p5]
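
The two criteria on this slide can be written as a single predicate. This is a minimal sketch: the slide does not define how similarity or coverage is measured, so the code assumes similarity is the number of identical positions divided by the match length, and coverage is the match length divided by the length of the shorter sequence. `is_redundant` and its parameters are hypothetical names.

```python
def is_redundant(len_a: int, len_b: int, match_len: int, identical: int,
                 min_similarity_pct: int = 98, min_coverage_pct: int = 95) -> bool:
    """Check the two redundancy criteria for one pairwise match.

    match_len: number of aligned positions in the match (assumed definition).
    identical: number of those positions where both residues agree.
    Integer arithmetic avoids floating-point issues at the thresholds.
    """
    if match_len == 0:
        return False
    shorter = min(len_a, len_b)
    similar_enough = identical * 100 >= min_similarity_pct * match_len
    covered_enough = match_len * 100 >= min_coverage_pct * shorter
    return similar_enough and covered_enough

# A 96-position match with 95 identical positions against a 100-residue sequence.
print(is_redundant(200, 100, match_len=96, identical=95))  # True
# The same similarity, but the match covers only 90% of the shorter sequence.
print(is_redundant(200, 100, match_len=90, identical=90))  # False
```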

  17. Connected Component Detection
      • M: master node
        • Manages connected components (CC) using a union-find data structure
        • Distributes work in a load-balancing way
      • W: worker node
        • Generates pairs (from GST1, GST2, ..., GSTp)
        • Performs sequence alignment
      [Figure: workers send pairs and alignment results to the master; the master sends work to the workers]
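
The union-find bookkeeping named on this slide can be sketched as follows. This is a single-process illustration, not the authors' master-worker implementation; the class and function names, path halving, and union by size are assumptions chosen for the example.

```python
class UnionFind:
    """Disjoint-set structure over vertices 0..n-1."""

    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        # Path halving: point every other node on the path at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        """Merge the sets of a and b; return False if they were already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra          # attach the smaller tree under the larger
        self.size[ra] += self.size[rb]
        return True


def connected_components(n: int, homologous_pairs):
    """Group n sequences into components given pairs judged homologous."""
    uf = UnionFind(n)
    for a, b in homologous_pairs:
        uf.union(a, b)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values())


print(connected_components(5, [(0, 1), (1, 2), (3, 4)]))  # [[0, 1, 2], [3, 4]]
```

Because only the parent and size arrays are stored, the components are tracked in linear space and the full graph never has to be materialized, which matches the challenge stated under Idea #1.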

  18. Bipartite Graph Generation
      [Figure: a connected component G(V, E) is converted into a bipartite graph B(V, V, E)]
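
One plausible reading of B(V, V, E) is that every vertex appears once on the left and once on the right, and each undirected edge {u, v} of G becomes the links u → v and v → u between the two copies. The sketch below follows that reading; it is an assumption, as is the choice not to link a vertex to its own copy. `to_bipartite` is a hypothetical name.

```python
def to_bipartite(edges):
    """Return the outlink set of every left-side vertex in B(V, V, E).

    edges: iterable of undirected edges (u, v) from one connected component.
    The result maps each left vertex to the right-side vertices it links to.
    """
    outlinks = {}
    for u, v in edges:
        outlinks.setdefault(u, set()).add(v)
        outlinks.setdefault(v, set()).add(u)
    return outlinks

# A path 1 - 2 - 3: vertex 2 links to both neighbours.
print(to_bipartite([(1, 2), (2, 3)]))  # {1: {2}, 2: {1, 3}, 3: {2}}
```

The outlink sets produced this way are the input that outlink comparison (Idea #3) operates on.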

  19. Dense Subgraph Detection
      • Shingle algorithm (s, c: parameters)
      [Figure: outlinks(u) and outlinks(v) are each permuted and a shingle of s elements is taken from each; this is repeated c times and the shingles of u and v are compared]
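
The shingling step can be sketched with min-wise hashing: each of the c trials applies a fixed pseudo-random ordering to the vertex universe and keeps the s outlinks that come first. This is an illustrative sketch, not the authors' code; using SHA-256 as the permutation, tagging each shingle with its trial index, and the names `shingles` and `shared_shingles` are all assumptions.

```python
import hashlib

def _rank(element, trial: int, seed: int) -> str:
    """Position of an element under the pseudo-random ordering of one trial."""
    data = f"{seed}:{trial}:{element}".encode()
    return hashlib.sha256(data).hexdigest()

def shingles(outlinks, s: int, c: int, seed: int = 0) -> set:
    """Return c shingles (one per trial), each made of s outlinks.

    A vertex with fewer than s outlinks yields no shingles.
    """
    if len(outlinks) < s:
        return set()
    result = set()
    for trial in range(c):
        ordered = sorted(outlinks, key=lambda e: _rank(e, trial, seed))
        result.add((trial, tuple(ordered[:s])))
    return result

def shared_shingles(out_u, out_v, s: int, c: int, seed: int = 0) -> int:
    """Number of trials in which u and v produce the same shingle."""
    return len(shingles(out_u, s, c, seed) & shingles(out_v, s, c, seed))

# Identical outlink sets agree in every trial; disjoint sets never agree.
print(shared_shingles({1, 2, 3, 4}, {1, 2, 3, 4}, s=2, c=5))  # 5
print(shared_shingles({1, 2, 3}, {4, 5, 6}, s=2, c=5))        # 0
```

The more two outlink sets overlap, the more likely they are to share a shingle in a given trial, so shared shingles act as a cheap proxy for high outlink overlap.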

  20. Dense Subgraph Detection
      [Figure: three panels (1, 2, 3) on B(V, V, E): first shingling pass, second shingling pass, and the resulting dense subgraph; labels: A, B, A~B, shingle, dense subgraph]

  21. Outline • Problem Introduction • Related Work • Our Parallel Approach for Protein Family Identification • Experimental Results • Conclusions & Future Work • Acknowledgments

  22. Qualitative Validation with GOS Data
      • 160k data set
      • Our results vs. GOS results
        • Precision Rate (PR) = 95.75%
        • Sensitivity (SE) = 56.89%
        • Overlap Quality (OQ) = 55.49%
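
The slide does not define the three measures. A common choice, assumed here, counts pairs of sequences: PR = TP / (TP + FP), SE = TP / (TP + FN), and OQ = TP / (TP + FP + FN), where TP is the number of pairs grouped together in both results. Under that assumption 1/OQ = 1/PR + 1/SE - 1, so OQ follows from the other two. `overlap_quality` is a hypothetical helper name.

```python
def overlap_quality(pr: float, se: float) -> float:
    """OQ implied by PR and SE under the assumed pair-counting definitions.

    From PR = TP/(TP+FP) and SE = TP/(TP+FN):
        (TP+FP+FN)/TP = 1/PR + 1/SE - 1
    """
    return 1.0 / (1.0 / pr + 1.0 / se - 1.0)

oq = overlap_quality(0.9575, 0.5689)
print(f"{oq:.2%}")
```

With the reported PR and SE this evaluates to about 55.49%, which agrees with the OQ on the slide and suggests the assumed definitions are at least consistent with the reported numbers.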

  23. Drastic Work Reduction
      • 40k input data
      • Number of sequence alignments: ~800 million (all-against-all BLAST) vs. ~8 million (our parallel approach)
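
The ~800 million figure matches the number of unordered pairs among 40,000 sequences. The short check below assumes that "40k" means exactly 40,000 and that all-against-all work is counted as one alignment per unordered pair; `all_pairs` is a hypothetical helper name.

```python
def all_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences: n * (n - 1) / 2."""
    return n * (n - 1) // 2

baseline = all_pairs(40_000)
reduced = 8_000_000  # approximate count reported for the parallel approach

print(baseline)                   # 799980000
print(round(baseline / reduced))  # 100
```

Under these assumptions the filtering removes roughly 99% of the alignment work on this input.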

  24. Run Time as a Function of Input Size

  25. Performance Evaluation

  26. Conclusions & Future Work
      • Presented a parallel approach for protein family identification
      • Quality testing: better “benchmark”
      • Parallelization of Shingle algorithm: potential memory problem
      • Large-scale application: 28.6 million

  27. Acknowledgments • Prof. Srinivas Aluru at Iowa State University for BlueGene/L access • Anonymous reviewers • Funding: Washington State University Foundation and the Office of Research

  28. Thanks! Questions?
