Building CryptoDB using GUS

Presentation Transcript

  1. Building CryptoDB using GUS • Mark Heiges • Center for Tropical and Emerging Global Diseases, University of Georgia • mheiges@uga.edu

  2. (Architecture diagram; labels: Genomic Data, Analysis Results, GUS, Plugins, Tomcat, WDK, Apache)

  3. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Results

  4. Site Design Considerations • data types we wanted to warehouse • additional analyses desired • how to load data into GUS • how to visualize data • tables • text • graphics (interactive, static) • what types of questions will be asked of the data

  5. Deciding Factors • What data was available. • What the research community needed. • What we could accomplish by the contractual deadline for our first release.

  6. Crypto External Resource Data • Genomic sequence and gene annotations for two species (GenBank) • sequence • CDS translations • gene product descriptions • exon coordinates • RNA type (mRNA, tRNA, snoRNA, rRNA) • other features • EST/mRNA (GenBank)

  7. Auxiliary Data Required • NRDB • NCBI Taxonomy Reference • Sequence Ontology Definitions

  8. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Results

  9. GUS Plugins • Perl modules for loading data into GUS • facilities to connect to the GUS Perl object layer and the database • process command line arguments • create tracking information in the database • log and handle errors

  10. GUS Plugins • Supported and Community plugins bundled with GUS • Plugins are versioned • Each plugin version must be registered with GUS before use • records CVS version and MD5 checksum • auditing
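
The registration step on this slide records an MD5 checksum for each plugin version. The following is a minimal Python sketch of that idea only, not GUS's own registration code: the function names, the in-memory `registry` dictionary, and the file name used in the usage note are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5_checksum(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_registered(path: Path, registry: dict) -> bool:
    """True if the file's current checksum matches the one recorded under its name.

    `registry` is a hypothetical stand-in (file name -> checksum) for whatever
    the database actually stores about registered plugin versions.
    """
    return registry.get(path.name) == md5_checksum(path)
```

With a registry such as `{"LoadExample.pm": "<recorded digest>"}` (a made-up file name), `is_registered` returns False as soon as the file on disk no longer matches the recorded digest, which is the kind of check that makes auditing possible.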

  11. Data Loading at CryptoDB • Install GUS • Register selected plugins • Load Controlled Vocabularies • NCBI Taxonomy • Sequence Ontology Definitions • Load Crypto annotated sequences from GenBank records • Load NRDB from FASTA file

  12. Data Loading at CryptoDB • Load Crypto mRNA GenBank records • Load ESTs from U Penn's database of NCBI's dbEST

  13. CryptoDB Analyses • BLASTP - compare annotated proteins to NRDB • BLASTX - compare whole genome to NRDB • BLASTN - synteny comparison of the two Crypto species we host • EST/mRNA clustering and alignment • signal peptide predictions • transmembrane predictions

  14. Analysis Workflow • Load Source Data into GUS (NRDB, genomic seqs) • Dump same data from GUS with GUS Ids • Perform analysis with this data (BLASTX) • Load results into GUS • GUS Ids allow results to be linked back to analysis input data
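
The round trip on this slide (dump with Ids, analyze, load results keyed by the same Ids) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GUS helper scripts: the `loaded` dictionary stands in for sequences already in the database, the second sequence and the hit scores in the usage note are invented, and the function names are hypothetical.

```python
from io import StringIO

# Hypothetical stand-in for sequences already loaded into the database,
# keyed by the Id assigned at load time. Sequence 337 is invented.
loaded = {336: "TIGGGDDSFNTFFSETGAGKHVP", 337: "MKVLAAGIVG"}


def dump_fasta(sequences: dict, handle) -> None:
    """Write each sequence with its database Id as the first defline token."""
    for seq_id, residues in sorted(sequences.items()):
        handle.write(f">{seq_id}\n{residues}\n")


def link_results(hits, sequences: dict) -> list:
    """Attach the stored sequence to each (subject_id, score) hit by Id.

    Hits whose Id is not in `sequences` are dropped, since they cannot be
    linked back to the analysis input.
    """
    return [(sid, score, sequences[sid]) for sid, score in hits if sid in sequences]
```

For example, `dump_fasta(loaded, StringIO())` produces the analysis input, and `link_results([(336, 57.0)], loaded)` joins a made-up hit back to the stored sequence. Because the Id travels with the sequence through the analysis, no fuzzy matching on names or descriptions is needed when results are loaded.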

  15. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Analysis Results

  16. Data Analysis - BLASTP • Dump NRDB records from GUS to FASTA file - with GUS Ids

      >336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411
      TIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAA
      NNYARGHYTIGKEIIDLVLDRIRKLADQCTGLQGFSVFHSFGGGTGSGFTSLLMERLSVD
      YGKKSKLEFSIYPARQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
      RQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE

      • Dump annotated protein sequences from GUS to FASTA file - with GUS Ids
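
To make the defline layout on this slide concrete, here is a small Python sketch that splits such a line into its parts. It assumes, based only on the single example shown, that the first token is the GUS Id, that `key=value` tokens are attributes, and that the remaining words form the description. The function name and the `gus_id` key are hypothetical.

```python
def parse_defline(defline: str) -> dict:
    """Split a FASTA defline into Id, key=value attributes, and description."""
    tokens = defline.lstrip(">").split()
    record = {"gus_id": int(tokens[0])}
    words = []
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep and key and value:
            record[key] = value      # e.g. source_id=0703290B
        else:
            words.append(token)      # free-text description word
    record["description"] = " ".join(words)
    return record
```

Applied to the defline on the slide, this yields `gus_id` 336, `source_id` "0703290B", `secondary_identifier` "223280", `length` "411", and the description "tubulin alpha". Attribute values are kept as strings; only the leading Id is converted to an integer.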

  17. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Analysis Results

  18. Data Analysis - BLASTP • Run BLASTP algorithm with these two GUS Id labeled datasets • used a Perl wrapper around the BLAST executable, included with GUS... plugin-compatible output • Load BLAST results with plugin

      ga GUS::Common::Plugin::LoadBlastSimFast --file blastSimilarity.out --restartAlgInvs "" --queryTable DoTS::ExternalNASequence --subjectTable DoTS::ExternalAASequence --commit

  19. Post Data Loading • Find where the results were loaded • read documentation • ga GUS::Common::LoadBLAST --help • looked in plugin source code • asked other users • gusdb.org schema browser • fishing expeditions in GUS tables

  20. Getting Our Database On Line

  21. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Results

  22. Web Development Kit (WDK) • provides accelerated development of database driven web sites • define questions and records in model XML file • default JavaServer Pages (JSP) views provided • not specific to GUS • can be used with any RDBMS

  23. WDK Question - Summary - Record Paradigm • Users supply parameter values to a canned question on the website • "Which genes have at least __ exons?" • The result is returned in summary pages that list links to the record pages • Record page - detailed view of data object • text • graphics • tables
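
The canned question on this slide, "Which genes have at least __ exons?", amounts to a fixed query with one user-supplied parameter. The Python sketch below illustrates that idea with an in-memory SQLite database. It is not WDK code, and the `gene` and `exon` tables, their columns, and the sample rows are invented for the example and are not the GUS schema.

```python
import sqlite3

# The "question": fixed SQL with a single bound parameter (minimum exon count).
QUESTION_SQL = """
    SELECT g.source_id, COUNT(e.exon_id) AS exon_count
    FROM gene g JOIN exon e ON e.gene_id = g.gene_id
    GROUP BY g.gene_id, g.source_id
    HAVING COUNT(e.exon_id) >= ?
    ORDER BY g.source_id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny hypothetical schema with three genes and six exons."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, source_id TEXT);
        CREATE TABLE exon (exon_id INTEGER PRIMARY KEY, gene_id INTEGER);
        INSERT INTO gene VALUES (1, 'geneA'), (2, 'geneB'), (3, 'geneC');
        INSERT INTO exon (gene_id) VALUES (1), (2), (2), (3), (3), (3);
    """)
    return conn


def genes_with_min_exons(conn: sqlite3.Connection, min_exons: int) -> list:
    """Run the canned question and return the summary rows."""
    return conn.execute(QUESTION_SQL, (min_exons,)).fetchall()
```

Calling `genes_with_min_exons(build_demo_db(), 2)` returns one summary row per matching gene. In the paradigm described on the slide, each such row would link to that gene's record page.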

  24. Questions → Summary → Record

  25. WDK Model - View - Controller architecture • Model • XML configuration defines • questions • answer summaries • records • View • displays the model • defined in customizable JavaServer Pages • Controller • internal, not configurable

  26. WDK Setup • build • write WDK model (WDK comes with Toy site - spent some time with that beforehand) • test model from command line • install WDK into Tomcat • customize the view (JSP) pages • integrate Tomcat with Apache - personal preference

  27. WDK Model: Defining Questions

      <question name="GeneByContig"
                displayName="Genes by Contig"
                queryRef="GeneFeatureIds.GeneByContig"
                summaryAttributesRef="source_id,product,organism,contig"
                recordClassRef="GeneRecordClasses.GeneRecordClass">
        <description>Find gene located on a given contig</description>
      </question>

  28. WDK Model: the query behind the question

      <sqlQuery name="GeneByContig" displayName="By Contig" isCacheable='true'>
        <description> Find Genes By Contig ID. </description>
        <paramRef ref="params.contig"/>
        <column name="source_id" isInternal="false"/>
        <sql>
          <!-- use CDATA because query includes angle brackets -->
          <![CDATA[
            select g.source_id
            from dots.genefeature g,
                 dots.naentry nae,
                 dots.sequencetype st,
                 dots.externalNAsequence enas
            where nae.na_sequence_id = g.na_sequence_id
              and enas.sequence_type_id = st.sequence_type_id
              and enas.na_sequence_id = nae.na_sequence_id
              and st.name = 'contig'
              and nae.source_id = '$$contig$$'
            ORDER BY g.source_id
          ]]>
        </sql>
      </sqlQuery>

  29. WDK Model - Record

      <recordClass idPrefix="" name="GeneRecordClass" type="Gene"
                   attributeOrdering="source_id,exoncount,overview,
                   product,linkout,dnaContext,genomeCompare,tmdata,blastpgraphic,
                   translation,sequence,reference">
        <attributeQueryRef ref="GeneAttributes.GeneAttrs"/>
        <attributeQueryRef ref="GeneAttributes.ExonCount"/>
        <attributeQueryRef ref="GeneAttributes.TMCount"/>
        <tableQueryRef ref="GeneTables.BlastP"/>
        <textAttribute name="overview" displayName="Overview">
          <text>
            <![CDATA[
              This <b><i>$$organism$$</i></b> gene spans positions
              <b>$$start_max$$</b> - <b>$$end_min$$</b> of contig
              <a href="showRecord.do?id=$$contig$$"><b>$$contig$$</b></a>
              which maps to chromosome <b>$$chromosome$$</b>
            ]]>
          </text>
        </textAttribute>
      </recordClass>
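
The `$$name$$` placeholders in the overview text are filled with attribute values when a record is rendered. The Python sketch below shows one way such substitution could work. It is an illustration only, not WDK's implementation: the function name, the regular expression, and the choice to raise an error on a missing value are all assumptions.

```python
import re

# Matches placeholders of the form $$name$$ as they appear on the slide.
MACRO = re.compile(r"\$\$(\w+)\$\$")


def fill_macros(template: str, values: dict) -> str:
    """Replace each $$name$$ in the template with str(values[name])."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            # Assumption: fail loudly rather than leave a raw macro in the page.
            raise KeyError(f"no value supplied for macro {name!r}")
        return str(values[name])
    return MACRO.sub(replace, template)
```

For example, `fill_macros("contig <b>$$contig$$</b>", {"contig": "c1"})` fills in a hypothetical contig name. Plain text substitution like this suits display text. When a value goes into SQL, bound parameters are the safer general practice.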

  30. Testing the Model: command line tools • wdkXml - check XML syntax • wdkSummary - test a summary • wdkQuery - run specific query • wdkRecord - test a record • wdkSanityTest - exercises all queries and records • wdkCache

  31. Install WDK into Tomcat • follow the installation instructions carefully • relies on symbolic links from Tomcat webapp to $GUS_HOME • disallowed by default Tomcat configuration • keep an eye on Tomcat logs for troubleshooting • reload the webapp when model changes • retest on command line • don't forget about the cache

  32. WDK Default View

  33. CryptoDB Custom View • Made style changes, added site branding • Added additional form elements • radio buttons, check boxes • 'Flattened out' the questions

  34. CryptoDB Custom View • Record pages - alterations to achieve the desired ordering and placement of text, tables and graphics • Standard JSP tags to embed external objects • GBrowse graphic

  35. Integrate Tomcat with Apache • Apache front end answers all web requests • Serves the static pages and CGI tools • BLAST interface • motif search • BLASTX keyword search • Calls to the WDK are passed to Tomcat

  36. External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Results

  37. Pipeline • External Resources: • NCBI Taxonomy (SRes) • SO (SRes) • NRDB (DoTS) • Our data (DoTS) Plugins Web Development Kit GUS helper scripts • Analysis Input: • contigs • proteins • NRDB Plugins Analysis Results