Changjun Wu, Ananth Kalyanaraman School of Electrical Engineering and Computer Science

An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data
Changjun Wu, Ananth Kalyanaraman
School of Electrical Engineering and Computer Science
Washington State University

Outline
• Problem Introduction
• Related Work
• Our Parallel Approach for Protein Family Identification
• Experimental Results
• Conclusions & Future Work
• Acknowledgments
SC08, Austin, TX

Metagenomics
• Application of genomics techniques to the study of microbial communities in their natural environments.
• Without isolation and lab cultivation of individual species.

Protein Family Identification Problem
• Motivation
  • Family identification → functional annotation
  • Diversity of protein family universe
[Figure: known proteins grouped into family 1, family 2, ..., family i; new metagenomic proteins either receive a functional annotation from a known family or form a new protein family]

What is a Protein Family?
• A protein family is a group of evolutionarily (thus functionally) related proteins.
[Figure labels: sequence similarity, structure similarity, domain similarity]

Related Work
• General approach
  • Perform all-against-all sequence comparison (BLAST)
  • Group proteins based on pairwise similarity
• Related work (sequential approaches)
  • Kriventseva et al. (2001)
  • Enright et al. (2002)
  • Pipenbacher et al. (2002)
  • Kelil et al. (2007)
  • Yooseph et al. (2007)
  • …

GOS Approach
• Yooseph et al. (2007)
  1. Redundancy removal
  2. Graph generation
  3. Dense subgraph detection
• Cost: Θ(n^2) space, Ω(n^2) time

Limitations of Current Approaches
• Constructing large graphs can be time-consuming
  • ~10^6 CPU hours for ~28.6 million proteins (GOS approach)
• Quadratic space requirement
• Brute-force parallel approach

Main Ideas of Our Approach
• Idea #1: A dense subgraph cannot span two connected components
• Challenge: find connected components without generating the whole graph
[Figure: dense subgraphs (DS) nested inside connected components (CC)]
• Use divide and conquer to drastically reduce problem size!

Main Ideas of Our Approach
• Idea #2: Exact-match based filtering technique
[Figure: two sequences with 98% similarity over 100 bp share an exact match of >= 33 bp]
• Eliminate unnecessary all-against-all comparisons!
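The 33 in the figure follows from a pigeonhole argument. The sketch below reproduces that arithmetic under the simplifying assumption that all differences are substitutions (no gaps). The function name and signature are hypothetical and are not taken from the talk.

```python
def min_exact_match_length(window: int, similarity_pct: int) -> int:
    """Lower bound on the longest exact match inside an aligned window.

    Assumption (illustrative only): mismatches are substitutions, no gaps.
    """
    # Largest number of mismatching positions the similarity cutoff allows.
    max_mismatches = window * (100 - similarity_pct) // 100
    matching_positions = window - max_mismatches
    # k mismatches split the matching positions into at most k + 1 runs,
    # so the longest run has at least ceil(matching / (k + 1)) positions.
    runs = max_mismatches + 1
    return -(-matching_positions // runs)  # ceiling division


print(min_exact_match_length(100, 98))  # 33
```

Any pair that lacks an exact match of this length cannot meet the similarity cutoff, so it never needs to be aligned.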

Main Ideas of Our Approach
• Idea #3: High overlap of outlinks → dense subgraph
[Figure: vertices u and v with largely shared outlinks, as in a web community]
• Use outlink comparison to group vertices into dense subgraphs!

Our Parallel Approach for Protein Family Identification
[Figure: four-step pipeline. Labels: input protein sequences, pairwise sequence homology, connected components, dense subgraphs]
1. Redundancy removal
2. Connected component detection
3. Bipartite graph generation
4. Dense subgraph detection

Redundancy Removal
• Criteria
  • Similarity of the match is >= 98%
  • >= 95% of the shorter sequence is covered by the match
[Figure: a generalized suffix tree (GST) with a cutoff supplies candidate pairs among proteins p1 to p5 (Idea #2); matches are checked against the >= 98% and >= 95% criteria]
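The two criteria above can be written as a small predicate. This is a minimal sketch: the function name, its parameters, and the way similarity and coverage are computed from an alignment summary are assumptions made for illustration, not details given in the talk.

```python
def is_redundant(identities: int, match_len: int, len_a: int, len_b: int,
                 min_similarity: float = 0.98,
                 min_coverage: float = 0.95) -> bool:
    """Apply the two redundancy criteria to one pairwise match.

    identities: number of identical positions in the match (assumed input)
    match_len:  length of the matched region (assumed input)
    len_a, len_b: full lengths of the two sequences
    """
    if match_len <= 0:
        return False
    shorter = min(len_a, len_b)
    similarity = identities / match_len   # criterion 1: >= 98%
    coverage = match_len / shorter        # criterion 2: >= 95% of shorter
    return similarity >= min_similarity and coverage >= min_coverage


# A 100-residue protein matched end to end with 99 identities is redundant
# with respect to a longer protein that contains it.
print(is_redundant(identities=99, match_len=100, len_a=100, len_b=300))
```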

Connected Component Detection
• M: master node; W: worker node
• Master: manages connected components (CC) using a union-find data structure and distributes work in a load-balancing way
• Workers: generate pairs and perform sequence alignment
[Figure: workers W generate pairs from GST1, GST2, …, GSTp, send pairs plus alignment results to master M, and receive work from M]
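The master's bookkeeping can be sketched serially as follows. Two things here are assumptions rather than details from the slide: the worker's alignment is replaced by a hypothetical `is_homologous` callback, and candidate pairs whose endpoints already share a component are skipped without alignment.

```python
class UnionFind:
    """Disjoint sets with path compression and union by size."""

    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:      # compress the path to the root
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def connected_components(n, candidate_pairs, is_homologous):
    """Return (components, number of alignments actually performed)."""
    uf = UnionFind(n)
    alignments = 0
    for u, v in candidate_pairs:
        if uf.find(u) == uf.find(v):
            continue                       # assumed filter: already connected
        alignments += 1
        if is_homologous(u, v):            # stands in for a worker's alignment
            uf.union(u, v)
    groups = {}
    for v in range(n):
        groups.setdefault(uf.find(v), []).append(v)
    return sorted(groups.values()), alignments


true_edges = {(0, 1), (1, 2), (3, 4)}
pairs = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
print(connected_components(5, pairs, lambda u, v: (u, v) in true_edges))
```

In this toy run the pair (0, 2) is never aligned, because 0 and 2 are already in the same component by the time it arrives.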

Bipartite Graph Generation
[Figure: each connected component G(V, E) is converted into a bipartite graph B(V, V, E)]
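One plausible reading of B(V, V, E) is that every vertex appears once on each side and every edge of G links the left copy of one endpoint to the right copy of the other. The sketch below builds the resulting outlink sets. The function name and the `include_self` option are assumptions, since the slide does not say whether a vertex links to its own copy.

```python
from collections import defaultdict


def bipartite_outlinks(edges, include_self=True):
    """Map each left-side vertex to the set of right-side vertices it links to."""
    out = defaultdict(set)
    for u, v in edges:
        out[u].add(v)      # left copy of u -> right copy of v
        out[v].add(u)      # left copy of v -> right copy of u
    if include_self:       # assumed option, not stated in the talk
        for v in list(out):
            out[v].add(v)
    return dict(out)


print(bipartite_outlinks([(0, 1), (1, 2)]))
```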

Dense Subgraph Detection
• Shingle algorithm (parameters: s, c)
[Figure: outlinks(u) and outlinks(v) are each randomly permuted c times; each permutation yields a shingle of s elements, and the shingles of u and v are compared]
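A single shingling pass as described above can be sketched like this. It is a simplified illustration: explicit random permutations are used where a practical implementation would more likely use hash functions, and the names `shingles` and `shared_shingles` are hypothetical.

```python
import random


def shingles(outlinks, s, c, seed=0):
    """For each vertex, compute c shingles of s elements from its outlinks.

    outlinks: dict mapping vertex -> set of linked vertices
    Vertices with fewer than s outlinks get None for every trial.
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*outlinks.values()))
    result = {v: [] for v in outlinks}
    for _ in range(c):
        # One random permutation per trial, shared by all vertices.
        order = universe[:]
        rng.shuffle(order)
        rank = {x: i for i, x in enumerate(order)}
        for v, links in outlinks.items():
            if len(links) < s:
                result[v].append(None)
                continue
            # The s elements that come first under this permutation.
            smallest = sorted(links, key=rank.__getitem__)[:s]
            result[v].append(tuple(smallest))
    return result


def shared_shingles(result, u, v):
    """Count trials in which u and v produced the same shingle."""
    return sum(a is not None and a == b
               for a, b in zip(result[u], result[v]))


links = {"u": {1, 2, 3, 4}, "v": {1, 2, 3, 4}, "w": {5, 6, 7, 8}}
sh = shingles(links, s=2, c=3)
print(shared_shingles(sh, "u", "v"), shared_shingles(sh, "u", "w"))
```

Vertices with identical outlinks agree in every trial and vertices with disjoint outlinks never agree; in between, the number of shared shingles grows with the overlap of the two outlink sets.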

Dense Subgraph Detection
[Figure: three panels over B(V, V, E). Panel 1: the 1st pass generates shingles. Panel 2: the 2nd pass is applied to the shingles. Panel 3: vertex sets A and B that are related through shared shingles (A ~ B) are reported as a dense subgraph]

Qualitative Validation with GOS Data
• 160k data set
• Our results vs. GOS results
  • Precision Rate (PR) = 95.75%
  • Sensitivity (SE) = 56.89%
  • Overlap Quality (OQ) = 55.49%
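The slide does not define the three measures. Assuming the common pair-based definitions PR = TP/(TP+FP), SE = TP/(TP+FN), and OQ = TP/(TP+FP+FN), OQ is determined by the other two, and the reported numbers agree with that assumption to within rounding. The function name below is hypothetical.

```python
def overlap_quality(precision: float, sensitivity: float) -> float:
    """OQ implied by PR and SE under the assumed definitions.

    1/OQ = (TP + FP + FN) / TP = 1/PR + 1/SE - 1
    """
    return 1.0 / (1.0 / precision + 1.0 / sensitivity - 1.0)


# Reported: PR = 95.75%, SE = 56.89%, OQ = 55.49%
print(round(100 * overlap_quality(0.9575, 0.5689), 2))
```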

Drastic Work Reduction
• 40k input data
• Number of sequence alignments performed
  • All-against-all BLAST: ~800 million
  • Our parallel approach: ~8 million
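The ~800 million figure matches the number of unordered pairs among 40,000 sequences, as the short check below shows. It assumes that "40k" means exactly 40,000 inputs and that each pair is counted once. The function name is hypothetical.

```python
from math import comb


def all_against_all_pairs(n: int) -> int:
    """Number of unordered sequence pairs among n inputs."""
    return comb(n, 2)


pairs_40k = all_against_all_pairs(40_000)
print(pairs_40k)                # 799980000, i.e. ~800 million
print(pairs_40k / 8_000_000)    # roughly a 100-fold reduction
```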

Run Time as a Function of Input Size

Performance Evaluation

Conclusions & Future Work
• Presented a parallel approach for protein family identification
• Quality testing: better “benchmark”
• Parallelization of the Shingle algorithm: potential memory problem
• Large-scale application: 28.6 million proteins

Acknowledgments
• Prof. Srinivas Aluru at Iowa State University for BlueGene/L access
• Anonymous reviewers
• Funding: Washington State University Foundation and the Office of Research

Thanks! Questions?
