Large-Scale Microbial Ecology Cyberinfrastructure (CAMERA)
Designing the Microbial Research Commons: An International Symposium
National Academy of Sciences, Washington, DC, 8-9 October 2009
Paul Gilna, B.Sc., Ph.D.
California Institute for Telecommunications & Information Technology (Calit2)
University of California, San Diego
Global Scientific Research Cyber-Community
• 3100 users
• 70 countries
CAMERA 2.0 Objectives
CAMERA serves as one representation of a specific research community’s need for a system to:
• Provide a metadata-rich family of scalable databases and make them available to the community
• Collect and reference a growing body of metadata relevant to environmental metagenome datasets
• Exploit the power of querying on metadata across multiple geospatial locations
• Provide a facility that allows a diversity of software tools to be easily integrated into the system (with sufficient compute resources to support these analyses)
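To make "querying on metadata across multiple geospatial locations" concrete, here is a minimal sketch in Python. It is not CAMERA's query interface: the `Sample` class, the `query` function, the metadata keys, and the sample values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """A sampling site with free-form metadata (all names are hypothetical)."""
    sample_id: str
    lat: float
    lon: float
    metadata: dict = field(default_factory=dict)


def in_box(sample, south, west, north, east):
    """True if the sample lies inside the latitude/longitude bounding box."""
    return south <= sample.lat <= north and west <= sample.lon <= east


def query(samples, box=None, **ranges):
    """Return samples inside `box` whose metadata fall within each (low, high) range.

    A sample that lacks a requested metadata key is excluded.
    """
    hits = []
    for s in samples:
        if box is not None and not in_box(s, *box):
            continue
        if all(
            s.metadata.get(key) is not None and low <= s.metadata[key] <= high
            for key, (low, high) in ranges.items()
        ):
            hits.append(s)
    return hits


# Made-up example data, not real CAMERA records.
samples = [
    Sample("S1", 32.9, -117.3, {"temperature_c": 18.5, "depth_m": 5}),
    Sample("S2", 9.2, -79.9, {"temperature_c": 28.0, "depth_m": 2}),
    Sample("S3", -1.2, -90.4, {"temperature_c": 24.1}),
]

# Warm-water samples inside a box given as (south, west, north, east).
warm = query(samples, box=(-10, -100, 15, -70), temperature_c=(20, 30))
warm_ids = [s.sample_id for s in warm]

# Shallow samples anywhere; S3 has no depth recorded and is skipped.
shallow_ids = [s.sample_id for s in query(samples, depth_m=(0, 3))]
```

The same filter applies to any metadata key, which is the point of keeping the metadata generic rather than fixing a set of columns.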
The Semantically Aware DB Schema
Some key features of the semantically aware DB schema:
• Environmental parameters: Modeled more generally, to accommodate any environment and any parameter within an environment
• Sequence: Separate “registries” for DNA, rRNA, mRNA, viral segments, reference genomes, etc. Sequence annotations are independently searchable.
• Workflow Connection: Every computed property is associated with the workflow instance that created it.
• Associated Data: Data not produced in CAMERA but often used for analysis and comparison
• Ontologies: All metadata, measured and observed parameters are connected to ontologies whenever possible.
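Two of these features lend themselves to a small illustration: environmental parameters stored generically (one row per parameter rather than one column per parameter), and computed properties that point back to the workflow run that produced them. The sketch below uses Python's built-in `sqlite3` module. The table and column names are assumptions made for this example and are not CAMERA's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One row per (sample, parameter): any environment, any parameter.
CREATE TABLE env_parameter (
    sample_id     INTEGER NOT NULL REFERENCES sample(id),
    name          TEXT NOT NULL,
    value         REAL,
    unit          TEXT,
    ontology_term TEXT          -- filled in whenever a term is available
);

CREATE TABLE workflow_run (id INTEGER PRIMARY KEY, workflow TEXT NOT NULL);

-- Every computed property records the workflow run that created it.
CREATE TABLE computed_property (
    sample_id       INTEGER NOT NULL REFERENCES sample(id),
    name            TEXT NOT NULL,
    value           TEXT,
    workflow_run_id INTEGER NOT NULL REFERENCES workflow_run(id)
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO sample VALUES (1, 'example-site')")
conn.executemany(
    "INSERT INTO env_parameter VALUES (?, ?, ?, ?, ?)",
    [(1, "temperature", 18.5, "degC", None),
     (1, "salinity", 33.7, "psu", None)],
)
conn.execute("INSERT INTO workflow_run VALUES (10, 'annotation')")
conn.execute("INSERT INTO computed_property VALUES (1, 'orf_count', '1200', 10)")

# Which workflow produced each computed property of sample 1?
rows = conn.execute("""
    SELECT p.name, p.value, w.workflow
    FROM computed_property AS p
    JOIN workflow_run AS w ON w.id = p.workflow_run_id
    WHERE p.sample_id = 1
""").fetchall()

param_names = [r[0] for r in conn.execute(
    "SELECT name FROM env_parameter WHERE sample_id = 1 ORDER BY name")]
```

Adding a new kind of measurement then means inserting rows, not altering the table definition.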
Integration of External Data
Warehousing
• Reference genomes
• Homologs, CoG clusters
• Raster data from slow/complex servers
Remote Data
• KEGG pathways
• NASA MODIS data
• World Ocean Atlas
• Other data that come as “data sets” that do not conform to the schema
NASA Aqua-MODIS satellite data
Metadata: beyond data collected at sampling site
[Figure: MODIS images covering GOS sites #8-12, mid-November 2003. Panels: Sea Surface Temp, Chlorophyll]
Integrate and browse additional sources of microbial data
CAMERA 2.0 (Data Submission) Growing the CAMERA Community and Resource…
GBMF Data Acquisition Pipeline: A New Data Submission Paradigm: Metadata First!
• Investigator submits proposal to GBMF
• Investigator submits metadata to CAMERA
• CAMERA sends acknowledgement to Investigator, Seq. Group, GBMF
• Seq. Group sends barcoded sample “kit” to investigators
• Sample is received and sequenced
• Seq. Group uploads data to CAMERA (& Investigator)
• Data & metadata released in six months
Metadata now collected before sequence data: GSC-compliant
Project-ID serves as acceptance-proof
Webb Miller and Stephan C. Schuster, and Roche / 454 Genome Sequencer
Data Standards
• Minimum Information about a (Meta)Genome Sequence: MIGS/MIMS
• A metadata standard developed by the Genomic Standards Consortium
• Controlled vocabularies, e.g. EnvO, PATO
• Common language: GCDML
• Submissions shall comply with a MIMS/MIGS core, but any metadata can be entered via keywords and free text
• Different metadata submission forms for different habitats (water, soil, air, hosts)
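The rule that a submission must satisfy a required core while still accepting arbitrary extra metadata, with habitat-specific forms on top, can be sketched as a simple check. The field names in `CORE_FIELDS` and `HABITAT_FIELDS` below are placeholders invented for this example. They are not the actual MIGS/MIMS checklist items.

```python
# Hypothetical required fields; the real MIGS/MIMS core is defined by the GSC.
CORE_FIELDS = {"project_name", "collection_date", "latitude", "longitude", "habitat"}

# Hypothetical extra requirements per habitat-specific submission form.
HABITAT_FIELDS = {
    "water": {"depth_m"},
    "soil": {"depth_m"},
    "air": {"altitude_m"},
    "host": {"host_taxon"},
}


def missing_fields(submission):
    """Return a sorted list of required fields absent from the submission.

    Fields beyond the required ones (keywords, free text) are always accepted.
    """
    present = set(submission)
    missing = CORE_FIELDS - present
    habitat = submission.get("habitat")
    if habitat in HABITAT_FIELDS:
        missing |= HABITAT_FIELDS[habitat] - present
    return sorted(missing)


complete = {
    "project_name": "example project",
    "collection_date": "2009-10-08",
    "latitude": 32.9,
    "longitude": -117.3,
    "habitat": "water",
    "depth_m": 5,
    "keywords": ["coastal", "surface"],      # extra metadata is allowed
    "notes": "free-text description here",
}

incomplete = {"project_name": "example project", "habitat": "air"}

complete_missing = missing_fields(complete)
incomplete_missing = missing_fields(incomplete)
```

Returning the list of missing fields, rather than a bare pass/fail, lets a submission form tell the investigator exactly what is still needed.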
CAMERA 2.0 (Computation) From simple job submission to community developed and published workflows…
RAMMCAP: Rapid clustering and functional annotation for metagenomic sequences
Features
• Very fast (10-100x) as compared to BLAST-based methods
• Effective tools: CD-HIT, HMMERHEAD, meta_RNA, and RPS-BLAST
• Focused functional annotation via curated protein families (Pfam, Tigrfam, COG, etc.)
Pipeline stages
• RNA finding/filtering
• DNA clustering: unique sequences, taxonomy / population analysis
• ORF calling
• ORF clustering: unique sequences, protein families
• ORF and cluster annotation
[Figure: pipeline diagram, starting from metagenomic raw reads. Labels:]
• RNA finding: 1. tRNA scan 2. rRNA scan 3. meta_RNA; outputs: tRNAs, rRNAs
• DNA clustering: CD-HIT-EST, 95%; outputs: DNA clusters, unique DNA sequences
• ORF calling: 1. ORF_finder 2. Metagene; output: ORFs
• ORF clustering: CD-HIT, 90-95%; output: non-redundant ORFs
• Protein clustering: CD-HIT, 60 or 30%; outputs: protein clusters, representative sequences
• Annotation: HMMER, HMMERHEAD, RPS-BLAST against Pfam, Tigrfam, COG; outputs: ORF annotation, cluster annotation
• More in-depth analysis and further annotation
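The clustering steps reduce a large read or ORF set to representative sequences at a chosen identity threshold. CD-HIT is generally described as a greedy incremental method that processes sequences from longest to shortest. The sketch below illustrates only that general idea. The `identity` function is a deliberately crude stand-in (ungapped, left-aligned position matching), not the word-filtered alignment CD-HIT performs, and the sequences are made up.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter sequence.

    Assumption for illustration only: ungapped and left-aligned, which is far
    simpler than the comparison a real clustering tool performs.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_cluster(seqs, threshold):
    """Greedy incremental clustering, longest sequence first.

    Each sequence joins the first cluster whose representative it matches at
    or above `threshold`; otherwise it becomes a new representative.
    Returns a list of (representative, members) pairs.
    """
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
clusters = greedy_cluster(reads, threshold=0.9)
representatives = [rep for rep, _ in clusters]
sizes = [len(members) for _, members in clusters]
```

Only the representatives need to go through the slower annotation steps, which is one plausible reading of where the speedup over all-against-all BLAST comes from.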
Annotation workflow
[Figure: workflow diagram with the following callouts]
• A green box is called an ‘actor’, which performs a task.
• Data flow is divided.
• This special actor represents an annotation component, such as BLAST search.
• Workflow parameters, which can be specified by users in the portal, are passed to workflow components.
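The actor idea can be sketched in a few lines of Python: each actor wraps one task, the output of an upstream actor is sent to several downstream actors, and one set of user-supplied parameters is passed to every component. This is an illustrative model, not the workflow engine CAMERA uses. The `Actor` class, `run_workflow`, and the three tasks are hypothetical, and "data flow is divided" is interpreted here as fanning the same data out to each branch.

```python
class Actor:
    """A named workflow step that performs one task on its input."""

    def __init__(self, name, task):
        self.name = name
        self.task = task

    def fire(self, data, params):
        """Run the task with the shared workflow parameters."""
        return self.task(data, params)


def run_workflow(data, params, source, branches):
    """Run `source`, then send its output to every branch actor.

    Returns a dict mapping each branch actor's name to its result.
    """
    prepared = source.fire(data, params)
    return {actor.name: actor.fire(prepared, params) for actor in branches}


# Hypothetical tasks standing in for real annotation components.
def drop_short(seqs, params):
    return [s for s in seqs if len(s) >= params["min_length"]]


def count_sequences(seqs, params):
    return len(seqs)


def gc_fraction(seqs, params):
    total = sum(len(s) for s in seqs)
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    return gc / total if total else 0.0


# Parameters a user might set in a portal form.
params = {"min_length": 4}

results = run_workflow(
    ["ACGT", "GG", "GGCC"],
    params,
    source=Actor("filter", drop_short),
    branches=[Actor("count", count_sequences), Actor("gc", gc_fraction)],
)
```

Because every actor has the same `fire(data, params)` shape, a new tool can be added as another branch without changing the runner.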
Provenance of Workflow-Related Data
• Provenance: a concept from art history and library science
• Collected information: inputs, outputs, intermediate results, workflow design, workflow run
• Can be used in a number of ways: validation, reproducibility, fault tolerance, etc.
• Linked to the semantic database
• Viewable and searchable from CAMERA 2.0
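As a rough illustration of how collected provenance supports reproducibility checks, the sketch below stores the items listed above in one record and compares runs by content hash. The `ProvenanceRecord` class, its field names, and the hashing scheme are assumptions made for this example and do not describe CAMERA's provenance store.

```python
import hashlib
import json
from dataclasses import dataclass, field


def digest(obj):
    """Stable SHA-256 hash of a JSON-serializable object."""
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceRecord:
    """What was run, on what, and what it produced (hypothetical layout)."""
    workflow_design: str            # name or version of the workflow definition
    run_id: str                     # identifier of this particular execution
    inputs: dict
    outputs: dict = field(default_factory=dict)
    intermediates: dict = field(default_factory=dict)

    def fingerprint(self):
        """Hash of the design plus inputs: equal for repeats of one experiment."""
        return digest({"design": self.workflow_design, "inputs": self.inputs})


def reproduced(a, b):
    """True if two runs share design and inputs and produced equal outputs."""
    return a.fingerprint() == b.fingerprint() and digest(a.outputs) == digest(b.outputs)


# Made-up runs for illustration.
first = ProvenanceRecord("annotation-v1", "run-001",
                         inputs={"reads": "sample-A", "e_value": 1e-5},
                         outputs={"orf_count": 1200})
repeat = ProvenanceRecord("annotation-v1", "run-002",
                          inputs={"e_value": 1e-5, "reads": "sample-A"},
                          outputs={"orf_count": 1200})
changed = ProvenanceRecord("annotation-v1", "run-003",
                           inputs={"reads": "sample-A", "e_value": 1e-3},
                           outputs={"orf_count": 1350})

repeat_ok = reproduced(first, repeat)
changed_ok = reproduced(first, changed)
```

The run identifier is deliberately left out of the fingerprint, so two executions of the same design on the same inputs compare as the same experiment.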