
Case Study #4: The Encyclopedia of Life: A Toolkit for High-Throughput Proteomics

Mark A. Miller, Principal Investigator, Integrative BioSciences.


Presentation Transcript


  1. Case Study #4: The Encyclopedia of Life: A Toolkit for High-Throughput Proteomics • Mark A. Miller, Principal Investigator, Integrative BioSciences

  2. Output from Genome Sequencing Project: BASE COUNT 531 a 506 c 464 g 496 t ORIGIN 1 aaacacaact ggtatctctt ccaaggcttg acagaccatt tcgctgtcat ggctctcatc 61 accagtttgc aggatgtccg gctcgacatg ctggcatatt ttgttgcctt tctcgtcgta 121 gtatccgtcg tacgaaagaa gctggccccg caacccagcg catacttgct caatcccaga 181 cgctggtacg agtttaccga tgctcgtgca gtctcagaag tccttcacac cacccgccaa 241 accctcgaag aatggttcca caagcaccca acaacccctg tccgcctgac aaccgatttc 301 ggtgaaatga cctttttgcc tcccactctg gccgatgaaa tcaagagtga taagcgtctc 361 agcttcatca aggcagctaa tgattcggta tgtggaaccc gttaaataca tggcaggccc 421 atatagctaa acaaccaata ggccttccac actgaaatcc ccggttttga gcctttccgc 481 gagggcggaa gaaatgaggc agcactgatc aaggaggtta ttcacggtca attgaagaaa 541 actctgagta agccgggaac ccatccagat tataacatgc catgtcgatg ctaattgttt 601 ttgtttagac aagatgacct ttccattggc tcaagaaacc cagctggctg ttgaacacta 661 cctcggtgct aacaagggta aggaccatcg cgacacgttg ttgactttaa gctgattgtc 721 tatagaatgg cacaagattc gactcagaga cgcactgcta cccctggtca ctagaatctc 781 aacacgtatc tttttgggtg aagatctatg ccaaaatgac aaatggatta gcatcacttc 841 ggaatacgct gccaacagtc tcgaggtcgc aaaccgcctg cgcgtctggc ccaagtacat 901 gcgttacgtc gtttcatact tctctccagg atgcggaatt ctacgaaacc aggtcaagaa 961 tgctcgcgaa ctaatcactc ccattgttga acgccgtcga tccgaggaaa agggtaagga 1021 atacaatgat tctctgggct ggtttgagaa gactgccaaa caagcgtaca accctgctgc 1081 tacccaacta ttcctttctg ctgtatctgt ccacaccacc accgatctca tctgccaatg 1141 tttggaagat attgccgctc accctgaaat catcaagccc ctgcaggaag agatcaggag 1201 agttattgcc gaagaagggt ggaacacgaa ggctatgtac aagatgttcc ttctcgacag 1261 cgtatttaag gaaacccaac gattgaaacc cattcaagtt ggtaagttga acaattatcg 1321 cttttgtaaa gtcgcttgct tacctcaaat cagcttcaat ggtgcgagaa gcgcagtccg 1381 acatcacact
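
The record on this slide follows the GenBank flat-file layout: a BASE COUNT summary, then an ORIGIN block whose chunks of ten bases are preceded by position indices. As a rough illustration of how the BASE COUNT line relates to the sequence, here is a minimal sketch that recounts bases from ORIGIN text. The function name `count_bases` is hypothetical, and the sample is only the first ORIGIN line from the slide. The sequence in this transcript appears to be cut off, so recounting the full text here should not be expected to reproduce the BASE COUNT line.

```python
from collections import Counter


def count_bases(origin_block: str) -> Counter:
    """Count a/c/g/t in ORIGIN text, skipping position indices and whitespace."""
    return Counter(ch for ch in origin_block.lower() if ch in "acgt")


# First ORIGIN line of the record shown on the slide (positions 1-60).
sample = "1 aaacacaact ggtatctctt ccaaggcttg acagaccatt tcgctgtcat ggctctcatc"
counts = count_bases(sample)
total = sum(counts.values())
print(dict(counts), total)
```

Filtering on the four base letters is what lets the position indices (1, 61, 121, ...) drop out without any explicit parsing.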

  3. Goal of Genome Sequencing Project: 30,000   107 (shown with the sequence record from slide 2)

  4. (Sequence record from slide 2, shown again.)

  5. Goal of EOL Project: sequence record (as on slide 2) • Software pipeline

  6. The EOL project has three goals: • Putative functional and 3-D structure assignment through a large computation. ENCYCLOPEDIA of LIFE: a collaborative project based at the San Diego Supercomputer Center (SDSC), open to scientists and biological resources worldwide. http://eol.sdsc.edu

  7. The EOL project has three goals: • Putative functional and 3-D structure assignment through a large computation • Integration of annotations with output from key biological resources. ENCYCLOPEDIA of LIFE: a collaborative project based at the San Diego Supercomputer Center (SDSC), open to scientists and biological resources worldwide. http://eol.sdsc.edu

  8. The EOL project has three goals: • Putative functional and 3-D structure assignment through the largest computation ever attempted in biology • Integration of annotations with output from key biological resources • Innovative information delivery systems, including a peer-to-peer EOL Notebook. ENCYCLOPEDIA of LIFE: a collaborative project based at the San Diego Supercomputer Center (SDSC), open to scientists and biological resources worldwide. http://eol.sdsc.edu

  9. Step 1. Create a genome annotation pipeline. Diagram: individual genome sequences pass through annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4), each producing its own output.

  10. The EOL project 1) incorporates the best sequence analysis tools, 2) analyses all genomes through an automated annotation process, and 3) uses web tools to serve the results to the community. Diagram labels: ALL genome sequences; annotation tools from the community (Tool 1, Tool 2, Tool 3, Tool 4); international data providers; federated databases; SOAP services; P2P; HTTP pages; third-party providers. • Features: • annotation of all genomes by an automated program portfolio • all runs stored in a federated database • federation of local and public databases at the API level • results served via a SOAP server • interface facilitates novel queries • interface facilitates data management and exchange

  11. Annotation Strategy: Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline • High cpu requirements • So far "embarrassingly parallel" • Show value in the annotation pipeline in a manual 1-genome run • Port the pipeline to local and partner resources • Run the pipeline remotely on distributed local resources • Running on EOL Cluster, Sun Ultra, 4 Sun E10's • Run the pipeline remotely on partner resources using APST • APST: Globus- and Condor-friendly, but also Globus- and Condor-independent
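
"Embarrassingly parallel" here means that each sequence can be annotated without reference to any other, so the work splits into independent tasks. The sketch below illustrates only that property. The `annotate` function is a hypothetical stand-in for one pipeline run, and a local thread pool stands in for the distributed resources that the slide says are reached through APST. It does not use APST, Globus, or Condor.

```python
from concurrent.futures import ThreadPoolExecutor


def annotate(sequence: str) -> dict:
    """Hypothetical stand-in for one pipeline run on a single sequence."""
    return {"length": len(sequence), "gc": sum(sequence.count(b) for b in "gc")}


def annotate_all(sequences: list[str], workers: int = 4) -> list[dict]:
    # Tasks share no state and need no ordering, so they can run on any worker.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(annotate, sequences))


results = annotate_all(["aaacacaact", "ggtatctctt", "ccaaggcttg"])
print(results)
```

Because no task depends on another, adding workers (or whole sites) changes throughput but not the results, which is what makes running on partner resources practical.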

  12. The iGAP Pipeline Today
     • Input: Arabidopsis protein sequences
     • Structure info: SCOP, PDB. Sequence info: NR, PFAM
     • Building FOLDLIB: PDB chains; SCOP domains; PDP domains; CE matches PDB vs. SCOP; 90% sequence non-identical; minimum size 25 aa; coverage (90%, gaps <30, ends <30)
     • Prediction of: signal peptides (SignalP, PSORT); transmembrane (TMHMM, PSORT); coiled coils (COILS); low complexity regions (SEG)
     • Create PSI-BLAST profiles for protein sequences
     • Structural assignment of domains by PSI-BLAST on FOLDLIB
     • Structural assignment of domains by 123D on FOLDLIB (only sequences w/out A-prediction)
     • Functional assignment by PFAM, NR, PSI-Pred assignments (only sequences w/out A-prediction)
     • Domain location prediction by sequence
     • Store assigned regions in DB
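
The slide shows the assignment steps as a cascade: a later method runs only on sequences that an earlier method left without a prediction. That reading of "only sequences w/out A-prediction" is an assumption. The sketch below shows the control flow only. `assign_cascade`, `fast_method`, and `slow_method` are hypothetical names, and the toy rules inside them are not PSI-BLAST or 123D.

```python
from typing import Callable, Optional

Assigner = Callable[[str], Optional[str]]


def assign_cascade(
    sequences: dict[str, str], methods: list[tuple[str, Assigner]]
) -> dict[str, tuple[str, str]]:
    """Run methods in order; each one sees only still-unassigned sequences."""
    assigned: dict[str, tuple[str, str]] = {}
    remaining = dict(sequences)
    for name, method in methods:
        for seq_id, seq in list(remaining.items()):
            fold = method(seq)
            if fold is not None:
                assigned[seq_id] = (name, fold)
                del remaining[seq_id]
    return assigned


# Toy stand-ins with made-up rules, for illustration only.
def fast_method(seq: str) -> Optional[str]:
    return "fold-A" if seq.startswith("M") else None


def slow_method(seq: str) -> Optional[str]:
    return "fold-B" if len(seq) >= 25 else None


proteins = {"p1": "MKT", "p2": "A" * 30, "p3": "AAA"}
assignments = assign_cascade(proteins, [("first", fast_method), ("second", slow_method)])
print(assignments)
```

Ordering the cheaper method first means the more expensive one is spent only on the sequences that still need it.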

  13. Where are we now? Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline • High cpu requirements • So far embarrassingly parallel • ~800 genomes (and growing) ≈ 10^7 ORFs (and growing) • Current pipeline rate: about 1 cpu hour/sequence • ~10^7 cpu hours* (> 1000 cpu years, and growing) *for one pass through the pipeline! • Allocated DataStar cpu hours in an NRAC year: 4 × 10^6
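
The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the flattened "107" and "106" in the transcript are the exponents 10^7 and 10^6, and it uses a 365-day year. The variable names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8760

orfs = 10**7                      # ~10^7 ORFs across ~800 genomes
cpu_hours_per_sequence = 1        # "about 1 cpu hour/sequence"
allocation_per_year = 4 * 10**6   # allocated cpu hours in one NRAC year

total_cpu_hours = orfs * cpu_hours_per_sequence
cpu_years = total_cpu_hours / HOURS_PER_YEAR
years_on_allocation = total_cpu_hours / allocation_per_year

print(f"{total_cpu_hours:.0e} cpu hours = {cpu_years:.0f} cpu years")
print(f"{years_on_allocation} allocation-years for one pass")
```

Under these assumptions one pass costs roughly 1,140 cpu years, which agrees with the slide's "> 1000 cpu years", and it would consume about two and a half years of the stated allocation.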

  14. Annotation Strategy: Putative Functional and 3D Fold Assignment of All Genomes by Automated Pipeline • High cpu requirements • So far embarrassingly parallel • Run the pipeline remotely on partner resources using APST • In production: SDSC NPACI Grid; BII, Singapore; Monash University, Australia; TITEC, Japan; Belfast E-Science Center, Ireland; UFCG, Brazil; University of Michigan, USA; Teragrid, USA

  15. Annotation Strategy. Diagram labels: User; USER Interface; Tomcat; Struts; Axis; All WEB Services; Workflow Manager; Job Status Database; APST; home resource (dagger.sdsc.edu); distributed resources; staged jobs.

  16. OK, we CAN do this annotation, but should we? • The biologists are investigating first-pass annotations of the human and fly genomes. • Before embarking upon a calculation of this magnitude, we must show a clear value in the results!

  17. In the meantime... • First, we can allow investigators to deploy the pipeline concept. • We can make our IT accomplishments valuable by putting them in the hands of investigators.

  18. Make the tools generic: The original EOL pipeline consisted of a 2,000-line Bourne shell script. It is rigid, fragile, and cumbersome.

  19. Kepler provides a flexible environment for workflow creation
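
One way to see what a workflow environment offers over a single long shell script is to describe the steps and their dependencies as data and let a scheduler derive the run order. The sketch below does this with Python's standard `graphlib` module. It is not Kepler and does not use Kepler's interfaces, and the step names are illustrative labels loosely based on the pipeline slide.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (names are illustrative).
workflow = {
    "build_profiles": {"load_sequences"},
    "structure_assignment": {"build_profiles"},
    "function_assignment": {"load_sequences"},
    "store_results": {"structure_assignment", "function_assignment"},
}

# A valid execution order is derived from the dependencies, not hard-coded.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Adding, removing, or swapping a tool then means editing one entry in the table rather than reworking a linear script.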

  20. Step 2. Create a federated database from a collection of independent data projects
     • Projects: Encyclopedia of Life; PDB; JCSG; Protein Kinase Resource
     • Layers: Portals; Interface; Web Services (SOAP); Data Sources
     • Portals and interfaces: ADIT; CE Portal; Tracking/Analysis Portal; Data Entry/Analysis Portal
     • Data sources: Protein Data Bank (15,000 structures); Protein Kinase Resource (12,000 kinases); Joint Center for Structural Genomics (4,000 targets); Encyclopedia of Life (55 genomes annotated)
     • Database systems shown: DB2, DB2, Oracle, MySQL

  21. Unite databases with Information Integrator middleware. Phase II: interface allows scientists access to all federated databases
     • Layers: Portals; Interface (query interface); Web Services (SOAP); Federation; Data Sources
     • Portals: CE Portal; GAMESS Portal; Encyclopedia of Life Notebook
     • Data sources: Encyclopedia of Life (55 annotated genomes); Small Molecule QM Database (all PDB molecules); Alliance for Cell Signaling (3,000 mol. pages); Protein Kinase Resource (12,000 kinases); Joint Center for Structural Genomics (4,000 targets); Protein Data Bank (15,000 structures)
     • The 6 databases can be queried by the entire community through a user interface, and portals will create new functionalities for data manipulation.
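
The idea of federation on this slide is that one query interface fans out to several independently maintained databases and merges the answers. The sketch below shows that pattern in miniature, with two in-memory SQLite databases standing in for the real sources. It does not use Information Integrator, DB2, Oracle, MySQL, or SOAP. The table layout, source names, and rows are all made up for illustration.

```python
import sqlite3


def make_source(rows: list[tuple[str, str]]) -> sqlite3.Connection:
    """Create one independent toy database with a shared minimal schema."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entries (protein_id TEXT, note TEXT)")
    db.executemany("INSERT INTO entries VALUES (?, ?)", rows)
    return db


# Made-up rows; each connection plays the role of a separate data project.
sources = {
    "structures": make_source([("P1", "structure solved")]),
    "annotations": make_source([("P1", "putative kinase"), ("P2", "unknown")]),
}


def federated_lookup(protein_id: str) -> list[tuple[str, str]]:
    """Ask every source the same question and tag each hit with its origin."""
    hits = []
    for name, db in sources.items():
        query = "SELECT note FROM entries WHERE protein_id = ?"
        for (note,) in db.execute(query, (protein_id,)):
            hits.append((name, note))
    return hits


print(federated_lookup("P1"))
```

Tagging each hit with its source keeps provenance visible to the user, which matters when the underlying projects are curated independently.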

  22. Step 3. Create an interface to the database….

  23. Summary • The EOL project is a community-based effort to serve the academic community. • EOL has created tools to deploy jobs across distributed resources. The tools can be adapted to a variety of applications. • EOL adds value by adding structure information and confidence predictions to conventional sequence tools. • The EOL project has begun to federate data among public biology data resources, starting with local SDSC resources. • The EOL project has an open interface for sequence querying, and provides structure viewers for analyzing homologous structures. • Continued growth will be spurred by user input.

  24. The Next Goal: The Virtual Cell. Diagram labels: DNA; RNA; lipids; modeling software.
