GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree crabtree@pcbi.upenn

GUS 3.0: Implementationand DependenciesJune 19, 2002Jonathan Crabtreecrabtree@pcbi.upenn.edu

Outline • Schema "implementation" • what's done, what's not • Dependencies • data migration and testing • other tasks • Database implementation details • design decisions and implications • production & development dbs • Future work/current schema issues

Implementation • Implemented so far: • The schema itself (Pinney) • Updated Perl object layer (Brunk) • Revised GUS Application (GA) code (Schug) • Preliminary version of allgenes interface (Fischer) • Not yet implemented: • Extensive testing of schema, interface, objects, etc. • Data migration from GUSdev to GUS 3.0

Migration Dependencies • Instantiate/finalize GUS 3.0 schema (Pinney) • Upgrade database server operating system • Install and configure new RAID device • Write scripts to migrate existing data • Resolve any remaining inconsistencies • Freeze access to database • Annotator's interface (Diskin, Mazzarelli) • Current allgenes update (Pinney, Fischer)

Migration Dependencies cont. • Run scripts to migrate all existing data • Fix any problems that arise • Begin to "certify" plugins as 3.0-compliant • Discuss: how much does the GUS 3.0 schema "implementation" depend on our data migration? • In other words, the 3.0 schema can be viewed as implemented but untested. • Conflict with PlasmoDB final release date?

Migration Highlights • Two "namespaces" (Oracle schemas) to five: • GUSdev,RADdev => Core,DoTS,SRes,RAD3,TESS • Certain tables are now shared in Core, SRes • Avoid primary key conflicts by reloading RAD3 • Restructuring of DoTS "central dogma" tables: • GeneInstance, RNAInstance, ProteinInstance • Also GO terms, new LOE and Complex tables • Other pervasive changes: • e.g. ExternalDatabase => ExternalDatabaseRelease

Other Tasks • Script(s) to automate schema creation: • schemas (in the Oracle sense) • tables • sequences (to generate primary key values) • views • "bootstrap" rows • populate other tables as desired? (Anatomy, etc.) • constraints • indexes • GRANT permissions as desired

Other Tasks II • Complete schema documentation • Convert plugins to new schema as needed • Remove site-specific dependencies or standardize • e.g., hard-coded references to specific external_db_ids • Particularly for data loading plugins • make it easier to load and display sample dataset • Formalize schema development process • Which changes are "major" or "minor"? • Automatically determine which plugins are affected?

Database Design Decisions • GUS vs. other "plain" relational databases: • 1. subclassing (extra views) • 2. [blame] tracking/access control (extra columns) • 3. versioning (extra tables) • Minimal reliance on database-specific features • no stored procedures • no server-side Java • no object-relational tables • Generic links and naming conventions

1. Subclassing With Views • Advantages • conceptual clarity • straightforward to query the superclasses • schema evolution; views are easier to change • Implications • large tables (number of columns and rows) • complicates query optimization (number of rows) • slows row accesses (number of columns)

Subclassing cont. • Query optimization issues • Cost-based query optimization requires statistics • Confounded by coexistence of subclasses in table • Bigger tables make the worst case worse • Physical I/O issues • Any row access must read the entire row, including a potentially large set of irrelevant column values • Also increases the likelihood of chaining

Subclassing - alternatives • Use views for the superclass not the subclasses? • Isolates subclasses from one another more • Requires changing tables rather than views • Superclass view will be a large SQL UNION • Queries likely less efficient over superclasses • Keep existing system, but use partitions to specify physical placement of subclass rows • Solves some, but not all of the problems

"Large" Tables I • GUS: indexes=25G tables=~100G • NASequenceImp = 11G • AssemblySequenceVer = 8.6 G • SimilaritySpan = 8G / 74 million rows • Similarity = 4.8 G / 38 million rows • AssemblySequence = 3.6G • Evidence = 3G / 36 million rows • SimilarityVer = 2.8G • Approximately 10-20 quite large tables

"Large" Tables II • Tables with largest average row length: • GeneMapVer = 811 bytes • NASequenceImp = 747 bytes • AssemblySequence = 521 bytes • Tables with the most chained rows: • NASequenceImp = 384,524 rows • AssemblySequenceVer = 56,873 rows • AssemblySequence = 21,458 rows

2. Tracking/Access Control • Advantages: • Enables DBA to disburse wrath appropriately • Aids in correcting errors • Disadvantages: • Extra columns have foreign key constraints • Several small tables become bottlenecks for certain DDL and database update operations • Access controls not fully implemented • where and how should they be implemented?

Tracking II

3. Versioning • Advantages: • required for complete tracking • Disadvantages: • space overhead, results in slower updates • requires application-level code to implement • may be unnecessary in some DBMSs • currently not used uniformly • Different versions coexisting e.g., PlasmoDB

Development => Production • nemesis/8i (GUS) and erebus/9i (GUSdev) • Release cycle based on whole-database copy • Uses Oracle IMPORT and EXPORT utilities: • EXPORT over network to flat files • change owner/schema name • change physical placement of tables, indexes • Alternatives: • transportable tablespaces, SQL-based copy

Future Work • Issues with current schema from PlasmoDB • free text searching (and use of CLOB values) • more sophisticated schema for tracking session-oriented data (more on this tomorrow) • supporting queries for genome browser(s)

GUS 3.0: Implementation and Dependencies June 19, 2002 Jonathan Crabtree crabtree@pcbi.upenn