
CAMERA- Metagenomics meets the Cyberinfrastructure

CAMERA- Metagenomics meets the Cyberinfrastructure. David T. Kingsbury, Gordon and Betty Moore Foundation. BERAC - October 16, 2006. The CAMERA Partnership: Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis.


Presentation Transcript


  1. CAMERA- Metagenomics meets the Cyberinfrastructure David T. Kingsbury Gordon and Betty Moore Foundation BERAC - October 16, 2006 Presentation Title April 4, 2002

  2. The CAMERA Partnership: Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis

  3. Genomic Data Is Growing Rapidly, But Metagenomics Will Vastly Increase The Scale…
     • GenBank: 100 Billion Bases! (www.ncbi.nlm.nih.gov/Genbank)
     • Protein Data Bank: 35,000 Structures (www.rcsb.org/pdb/holdings.html)
     • Total Data < 1TB

  4. The Sargasso Sea Experiment: The Power of Environmental Metagenomics
     • Yielded a Total of Over 1 billion Base Pairs of Non-Redundant Sequence
     • Displayed the Gene Content, Diversity, & Relative Abundance of the Organisms
     • Sequences from at Least 1800 Genomic Species, including 148 Previously Unknown
     • Identified over 1.2 Million Unknown Genes
     Source: J. Craig Venter, et al. Science 2 April 2004: Vol. 304. pp. 66 - 74
     Image: MODIS-Aqua satellite image of ocean chlorophyll in the Sargasso Sea grid about the BATS site from 22 February 2003

  5. Marine Genome Sequencing Project: Measuring the Genetic Diversity of Ocean Microbes

  6. Moore Foundation Funded the Venter Institute to Provide the Full Genome Sequence of 155 Marine Microbes. www.moore.org/microgenome/trees_main.asp

  7. Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute

  8. Moore Microbial Genome Sequencing Project: Cyanobacteria Being Sequenced by Venter Institute

  9. GOS Sequences are Largely Bacterial
     • ~3 Million Previously Known Sequences
     • ~5.6 Million GOS Sequences
     Source: Shibu Yooseph, et al. (PLOS Biology in press 2006)

  10. Metagenomics Will Couple to Earth Observations Which Add Several TBs/Day. Source: Glenn Iona, EOSDIS Element Evolution Technical Working Group January 6-7, 2005

  11. Driven by User Needs
      • CAMERA serves as one representation of a specific research community’s need for a system to:
        • Collect and reference increasing metadata relevant to environmental metagenome datasets
        • Exploit the power of querying on metadata across multiple geospatial locations
        • Have access to a diverse and customizable set of easy-to-use tools to analyze their data in the context of collected metagenomic and whole genomic datasets
        • Have the ability to update and propagate improvements to annotations
        • Have a pre-publication, pre-submission collaborative workspace
        • Serve a diverse informatics-literate community

  12. Services Provided
      • Data and Application Services
      • Tools and Workflows
      • Computational Data, Visualization and Collaborative environment
      • Outreach and Training in Environmental Genomics

  13. Data and Application Services
      • Primary Data
        • Sargasso Sea and Sorcerer II expedition data
        • JGI marine & terrestrial environmental datasets
        • Moore Microbial Genomes
        • JGI and other relevant whole genomes
        • Research community submitted datasets
        • Submitted 454-based metagenomic datasets
        • Publicly available NR protein and DNA sequence datasets
      • Derived Data
        • Annotations of datasets
        • Assemblies
        • Alignments
        • Pre-computed clusters

  14. Sample Metadata from GOS
      • Site Metadata
        • Location (lat/long, water depth)
        • Site characterization (finite list of types plus “other”)
        • Site description (free text)
        • Country
      • Sampling Metadata
        • Sample collection date/time
        • Sampling depth
        • Conditions at time of sampling (e.g., stormy, surface temperature)
        • Sample physical/chemical measurements (T (°C), S (ppt), chl a (mg m⁻³), etc.)
        • “author”
      • Experimental Parameters
        • Filter size
        • Insert size
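The metadata fields listed on this slide can be pictured as a small record type, which also makes the cross-location metadata query mentioned on slide 11 concrete. The following is a minimal sketch only: the class names, field names, the units chosen for filter and insert size, and both example records are assumptions made for illustration, not CAMERA's schema or real GOS data.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Site:
    """Site metadata (hypothetical field names)."""
    latitude: float
    longitude: float
    water_depth_m: float
    site_type: str          # one of a finite list of types, or "other"
    description: str        # free text
    country: str


@dataclass
class Sample:
    """Sampling metadata plus experimental parameters (hypothetical)."""
    sample_id: str
    site: Site
    collected_at: datetime
    sampling_depth_m: float
    conditions: str
    temperature_c: Optional[float]
    salinity_ppt: Optional[float]
    chlorophyll_a_mg_m3: Optional[float]
    author: str
    filter_size_um: float   # unit is an assumption
    insert_size_kb: float   # unit is an assumption


def samples_in_box(samples: List[Sample], lat_min: float, lat_max: float,
                   lon_min: float, lon_max: float) -> List[Sample]:
    """Return the samples whose site lies inside a lat/long bounding box."""
    return [
        s for s in samples
        if lat_min <= s.site.latitude <= lat_max
        and lon_min <= s.site.longitude <= lon_max
    ]


# Placeholder records with made-up values, for illustration only.
EXAMPLE_SAMPLES = [
    Sample("demo-1",
           Site(32.0, -64.0, 4000.0, "open ocean", "placeholder site", "n/a"),
           datetime(2003, 2, 22, 10, 0), 5.0, "calm",
           20.0, 36.0, 0.1, "placeholder", 0.1, 2.0),
    Sample("demo-2",
           Site(0.0, -90.0, 100.0, "coastal", "placeholder site", "n/a"),
           datetime(2004, 2, 1, 9, 0), 2.0, "stormy",
           25.0, 34.0, 0.4, "placeholder", 0.1, 2.0),
]

if __name__ == "__main__":
    hits = samples_in_box(EXAMPLE_SAMPLES, 30.0, 35.0, -70.0, -60.0)
    print([s.sample_id for s in hits])
```

A real system would more likely hold these records in a database with a spatial index; the in-memory list is used here only to keep the sketch self-contained.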

  15. Tools and Workflows
      • Initial set
        • BLAST Server
        • Clustering
        • HMM/Profile
        • Neighborhood analysis
        • Multiple sequence alignments
        • Assembly
      • Proposed New Tools
        • Multiple Auto Annotation pipelines
        • Fast Sequence lookup
        • Customized Assembly
        • Phylogenetic Analysis
        • Clustering Tools

  16. CAMERA Outreach Modes
      • Scientific Advisory Board
      • Early Adopters – OptIPortal End Points
      • Targeted Workshops
      • User Forums
      • User Software Testing
      • Viz Tool Brainstorming
      • Presentations at Scientific Meetings
        • e.g. Demonstration Booth at JCVI Genomes, Medicine, and the Environment Conference October 2006
      • Partnerships With Metagenomics Projects
        • e.g. DoE’s Joint Genome Institute (JGI)
      • Training and User Services Team

  17. Guiding Philosophy for Development
      • Sprint Q4 2006
        • Propagate JCVI toolkit and data ASAP
        • Mechanism for publication of Sorcerer II data
        • Enabler for community
        • Defined deliverables, project management approach
      • Marathon Q4 2006 onward
        • Additional Datasets
        • Additional tools
        • Community drives prioritization for ongoing releases
        • Advisory Board, Community Outreach
      • Keys to success:
        • Tight integration of science, bioinformatics, software, and IT
        • Matched to Community Needs

  18. The Future Home of the Moore Foundation Funded Marine Microbial Ecology Metagenomics Complex
      • First Implementation of the CAMERA Complex
      • Major Buildout of Calit2 Server Room Underway
      http://calit2-1101-1.ucsd.edu/
      Photo Courtesy Joe Keefe, Calit2

  19. Moore CAMERA Production Environment
      • Creation of Initial Production Environment – September 2006
      • Hardware
        • Compute Nodes: ~200 4 CPU Nodes = ~800 Processing Cores
        • Storage Servers: 10 systems = ¼ Petabyte raw storage
        • Database Servers: Larger 20-40TB; Smaller 5-10TB
        • Network Management: Force10 E1200 Router w/12 10GigE Interfaces to Each System Ports
      • User Access to Compute Cycles
        • Bulk of free cycles available to external users
        • Proposal mechanism
      Source: Greg Hidley, Calit2; Phil Papadopoulos, SDSC, Calit2
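The hardware figures on this slide reduce to simple arithmetic, sketched below as a quick sanity check. The function and constant names are hypothetical, and the sketch assumes one core per CPU (as the slide's "~200 4 CPU Nodes = ~800 Processing Cores" implies), decimal units (1 PB = 1000 TB), and raw storage split evenly across the 10 storage systems; the slide itself states none of these explicitly.

```python
# Approximate figures quoted on the slide.
NODES = 200              # "~200" compute nodes
CPUS_PER_NODE = 4        # one core per CPU is assumed
STORAGE_SYSTEMS = 10
RAW_STORAGE_TB = 250.0   # 1/4 petabyte, assuming 1 PB = 1000 TB


def total_cores(nodes: int, cpus_per_node: int) -> int:
    """Total processing cores, assuming one core per CPU."""
    return nodes * cpus_per_node


def storage_per_system_tb(raw_tb: float, systems: int) -> float:
    """Raw storage per server, assuming an even split (an assumption)."""
    return raw_tb / systems


if __name__ == "__main__":
    print(total_cores(NODES, CPUS_PER_NODE))                       # 800
    print(storage_per_system_tb(RAW_STORAGE_TB, STORAGE_SYSTEMS))  # 25.0
```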

  20. Countries are Aggressively Creating Gigabit Services: Interactive Access to CAMERA and LOOKING Systems. www.glif.is, Created in Reykjavik, Iceland 2003. Visualization courtesy of Bob Patterson, NCSA.

  21. Scale

  22. Calit2’s Direct Access Core Architecture Will Create Next Generation Metagenomics Server
      [Architecture diagram; recoverable labels:]
      • Traditional User
      • Request / Response
      • Web Portal
      • + Web Services
      • Dedicated Compute Farm (1000 CPUs)
      • Database Farm
      • Flat File Server Farm
      • 10 GigE Fabric
      • Local Environment: Web (other service), Local Cluster
      • Direct Access Lambda Cnxns
      • TeraGrid: Cyberinfrastructure Backplane (scheduled activities, e.g. all by all comparison) (10000s of CPUs)
      • Data: Sargasso Sea Data; Sorcerer II Expedition (GOS); JGI Community Sequencing Project; Moore Marine Microbial Project; NASA Goddard Satellite Data; Community Microbial Metagenomics Data
      Source: Phil Papadopoulos, SDSC, Calit2

  23. OptIPortal: Termination Device for the OptIPuter Global Backplane. OptIPuter Scalable Adaptive Graphics Environment (SAGE) Allows Integration of HD Streams

  24. OptIPortal: Termination Device for the OptIPuter Global Backplane
      • 20 Dual CPU Nodes, 20 24” Monitors, ~$50,000
      • 1/4 Teraflop, 5 Terabyte Storage, 45 Mega Pixels--Nice PC!
      • Scalable Adaptive Graphics Environment (SAGE), Jason Leigh, EVL-UIC
      Source: Phil Papadopoulos, SDSC, Calit2
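The display figure on this slide can be cross-checked with a short calculation. The sketch below divides the quoted 45 megapixels across 20 monitors, then compares the result with a panel resolution of 1920x1200. That resolution is an assumption (a common format for 24-inch monitors of the period) and is not stated on the slide; the function names are likewise hypothetical.

```python
MONITORS = 20
TOTAL_MEGAPIXELS = 45.0           # figure quoted on the slide
ASSUMED_RESOLUTION = (1920, 1200)  # assumption, not from the slide


def megapixels_per_monitor(total_mp: float, monitors: int) -> float:
    """Average megapixels each monitor contributes to the wall."""
    return total_mp / monitors


def wall_megapixels(width: int, height: int, monitors: int) -> float:
    """Total megapixels of a wall built from identical panels."""
    return width * height * monitors / 1e6


if __name__ == "__main__":
    print(megapixels_per_monitor(TOTAL_MEGAPIXELS, MONITORS))
    print(wall_megapixels(*ASSUMED_RESOLUTION, MONITORS))
```

Under the assumed resolution the wall comes to about 46 megapixels, which is consistent with the slide's rounded figure of 45.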

  25. UIC/UCSD 10GE CAVEWave on the National LambdaRail
      • Emerging OptIPortal Sites (map labels; two entries are marked “NEW!”): UW, UIC EVL, MIT, JCVI, UCI, UCSD, SIO, SunLight, SDSU, CICESE
      • CAVEWave Connects Chicago to Seattle to San Diego…and Washington D.C. as of 4/1/06 and JCVI as of 5/15/06

  26. First Remote Interactive High Definition Video Exploration of Deep Sea Vents. Canadian-U.S. Collaboration. Source: John Delaney & Deborah Kelley, UWash

  27. High Definition Still Frame of Hydrothermal Vent Ecology 2.3 Km Deep. White Filamentous Bacteria on 'Pill Bug' Outer Carapace. Image label: 1 cm. Source: John Delaney and Research Channel, U Washington

  28. A Near Future Metagenomics Fiber Optic-Enabled Data Generator. Source: John Delaney, UWash
