190 likes | 299 Vues
Learn about the schema porting, data migration, and integration process of dictyBase tools from CHADO and GMOD to Oracle, by Eric Just from Northwestern University. Explore challenges faced, solutions implemented, and tuning strategies applied.
E N D
Porting CHADO and GMOD Tools to Oracle and Integration with dictyBase Eric Just dictyBase http://dictybase.org Center for Genetic Medicine Northwestern University
WHY? • dictyBase based on SGD • Increase flexibility in feature storage • Want to use CHADO for feature data, but ‘dicty’ SGD schema for the rest • ‘dicty’ SGD (Oracle) needs to link to CHADO Eric Just - dictyBase – Northwestern University
Schema porting • SQL Fairy did most of this, but • Had to tweak Oracle Producer • Object name limited to 30 characters, systematically truncate names • Unique/primary keys on CLOBs (text) not allowed, changed to varchar2(4000) • ‘SYNONYM’ reserved name in Oracle, changed name to ‘SYNONYM _’ Eric Just - dictyBase – Northwestern University
Class::DBI • Class::DBI provides nice ‘table level’ abstraction • CRUD, follow references WITHOUT WRITING SQL • Excellent tool for portability • GMOD ships with Class::DBI configured for CHADO • Had to fix/customize Oracle Driver Eric Just - dictyBase – Northwestern University
AutoDBI • Package which loads Class::DBI classes for each table • Keep class name Chado::Synonym but call set_up_table( ‘synonym_’ ) • Made ‘residues’ a ‘lazy’ column of Chado::Feature • No other Significant porting needed Eric Just - dictyBase – Northwestern University
Data Migration GFF3 CHADO ‘dicty’ SGD • Export chromosome sequences and locations in GFF3 • Load GFF3 into CHADO schema • Update references to features with new tables and id’s Eric Just - dictyBase – Northwestern University
GBrowse porting • ‘rows’ method does not exist in Oracle DBI Driver if ($sth->rows() == 0) {…} my $rows_returned = @{$sth->fetchall_arrayref()}; $sth->execute or Bio::Root::Root->throw(); if ( $rows_returned== 0) {…} • Oracle fetchrow_hashref() is case sensitive $sth->fetchrow_hashref() $sth->fetchrow_hashref("NAME_lc") Eric Just - dictyBase – Northwestern University
GBrowse porting - Queries • Oracle does not like anything in a ‘using’ clause to also be in the ‘where’ clause select f.feature_id, f.name, fl.fmin,fl.fmax from feature f join featureloc fl using (feature_id) where f.feature_id = 221659 and fl.rank=0; select f.feature_id, f.name, fl.fmin,fl.fmax from feature f join featureloc fl on f.feature_id = fl.feature_id where f.feature_id = 221659 and fl.rank=0; • ‘substring’ becomes ‘substr’ • Any SQL containing synonym table must be modified • Any procedural SQL must be reproduced, in some cases this can be avoided Eric Just - dictyBase – Northwestern University
Tuning • Added is_deleted flag to feature table • Added some audit columns • Added audit table and triggers • Created Indexes Heuristically • Added hints to some difficult queries Eric Just - dictyBase – Northwestern University
Integrating into dictyBase I Various middleware and presentation objects dictyBase Presentation Layer dictyBase Object Model ‘Dbtable’ database abstraction layer ‘dicty’ SGD Eric Just - dictyBase – Northwestern University
Integrating into dictyBase II dictyBase Presentation Layer Various middleware and presentation objects dictyBase Object Model ‘Dbtable’ layer Class::DBI layer ‘dicty’ SGD CHADO Eric Just - dictyBase – Northwestern University
dictyBase Objects • Retrieve, insert, update, delete • Interface ignorant of schema • No presentation in data classes • Easy to use interfaces • Tuned with lazy evaluation most accessors • 75 – 80% unit test coverage Eric Just - dictyBase – Northwestern University
Use BioPerl • Use Bio::Seq to represent sequences • Use Bio::SeqFeatures to represent transcript and alignment locations • Harness the power of BioPerl for sequence tasks, file generation • NOTE: BioPerl only used for sequence and location Eric Just - dictyBase – Northwestern University
Class Diagram Feature Aligned mRNA Contig Chromosome getOverlappingFeatures() getOverlappingAlignments() Bio::SeqFeature::Generic Bio::SeqFeature::Gene::Transcript Bio::SeqFeature::Generic Bio::Seq Eric Just - dictyBase – Northwestern University
Object use case: Add an Exon, dbxref, and Description #!perl use dicty::Feature; my $transcript = new dicty::Feature( -feature_no => 218420 ); $transcript->description( ‘Gene model derived from AU12345' ); $transcript->add_external_id( -source => ‘GenBank Accession Number', -id => 'AU12345' ); $bioperl = $transcript->bioperl(); [$bioperl->exons()]->[2]->start( 281050 ); my $exon = Bio::SeqFeature::Gene::Exon->new( -start => 280921, -end => 280959, -strand => -1 ); $exon->is_coding(1); $bioperl->add_exon($exon); $transcript->update(); Eric Just - dictyBase – Northwestern University
Using Apollo GenBank file GenBank file GenBank file GenBank file Object layer Chado • Request segment through SOAP message over HTTP • Object layer generates GenBank File • Send GenBank File via SOAP message • Modify, in Apollo send changed gene models back via SOAP • Adaptor changes gene models and updates the database Eric Just - dictyBase – Northwestern University
New Curation Tools • Gene and Feature curation had to be rewritten • ‘Gene centric’ curation • Added more evidence qualifiers • Presentation classes that manipulate Object Layer Eric Just - dictyBase – Northwestern University
Where Are We Going • Utilize the flexibility – New Feature Types, feature relations, and SO. • Contribute back to GMOD • Gradually port different areas into CHADO • Provide feedback and testing ground for database independence Eric Just - dictyBase – Northwestern University
Acknowlegments Other Groups • Funding • NIH (NIGMS and NHGRI) • SGD • GMOD • CHADO • GBrowse • Apollo • BioPerl dictyBase • PIs • Rex Chisholm, PhD • Warren Kibbe, PhD • Programmer • Sohel Merchant • Curators • Petra Fey • Pascale Gaudet, PhD • Karen Pilcher • Bioinformatics Core at Northwestern Eric Just - dictyBase – Northwestern University