
Database Technology in Bioinformatics



Presentation Transcript


  1. Database Technology in Bioinformatics. Philip McNeil, European Bioinformatics Institute.

  2. The Information Challenge; Database Technologies; Which Do You Choose?; Data Modelling; Some Database Features in Use at EBI

  3. The Information Challenge

  4. Today’s research generates ever-increasing amounts of data. Many new data-intensive methodologies: combinatorial chemistry; genomics (including structural genomics); high-throughput screening; proteomics; transgenics; microarrays.

  5. Sequence Information

  6. Structure Information

  7. Genome Information (in megabases)

  8. ‘Data Waves’: size; complexity; integration

  9. Molecular biology has become information intensive. Two example records: G R Y S P L E M with CAGTAGTGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC; G R Y S P L N M with CAGTAAAGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC.
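
The two example records on this slide differ only slightly, which is easier to see programmatically than by eye. The following is a minimal sketch that lists the positions at which two equal-length sequences differ; the function name `mismatches` is an illustrative choice, not something from the presentation, and the sketch makes no claim about how the nucleotide and peptide strings relate to each other.

```python
def mismatches(a: str, b: str) -> list:
    """Return the 0-based positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Strings copied from the slide (spaces removed from the peptide letters).
SEQ_1 = "CAGTAGTGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC"
SEQ_2 = "CAGTAAAGCACATCATTCGTCAATGCATACTGCACTAACCACACAGTAC"
PEPTIDE_1 = "GRYSPLEM"
PEPTIDE_2 = "GRYSPLNM"

nucleotide_diffs = mismatches(SEQ_1, SEQ_2)
peptide_diffs = mismatches(PEPTIDE_1, PEPTIDE_2)
print("nucleotide positions that differ:", nucleotide_diffs)
print("peptide positions that differ:", peptide_diffs)
```

A comparison like this only works for substitutions; insertions or deletions would need a proper alignment.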

  10. Macromolecular Information (data complexity): nucleotide sequence; protein sequence; protein structure; protein function

  11. Proliferation of Databases: biological information is all interrelated (DNA sequence, protein sequence, structure, function); specialist databases for organisms (HIV, Drosophila, C. elegans); specialist databases for functions (eukaryotic promoters, transcription factors); specialist databases for diseases and genes (p53, haemophilia B)

  12. But…: artificial boundaries between databases; coarse links between databases; multitude of exchange formats; lack of robustness; varied quality

  13. Meeting the Information Challenge: improved quality and integrity of data; data need to be well structured and robustly defined; flexible infrastructure to meet rapidly changing requirements; open frameworks, management and analysis tools; integrate diverse data sources

  14. Database Technologies

  15. Evolution of DBMS Technology (adapted from Barry, The Object Database Handbook, 1996)

  16. Database Systems in Use Today. Essentially four different types: file system (flat files); relational database (RDBMS); object-oriented database (ODBMS); object-relational database (ORDBMS). Coming? XML.

  17. File Systems: all computers have them! Most of the world’s data still consists of old file systems and legacy data. Many bioinformatics databases are still distributed as flat files: EMBL-Bank; MSD/PDB; SWISS-PROT/TrEMBL.

  18. Relational Databases: well understood, mature technology; most widely used DBMS. Standards: SQL92 (although all vendors used proprietary extensions); SQL99 (support for objects and other extensions); SQL2003 (XML).

  19. Object-Relational Databases. Extended the SQL92 data model: user-defined, complex data types; types, subtypes, inheritance; references (‘OIDs’). Now supported by the SQL99 standard. Many major relational databases now have object extensions.

  20. Object-Oriented Databases: persistent data store for objects created by object-oriented programming languages. Language bindings: C++, Smalltalk, Java. Standard: Object Query Language (OQL), not implemented by many vendors. Now mainly used in niche areas: CAD/CAM, AI, telecoms.

  21. Database Revenues

  22. XML Databases. Three different types defined by the XML:DB initiative: native XML database; XML-enabled database; hybrid XML database.

  23. Native XML Database: defines a model for an XML document and stores and retrieves documents according to that model; the XML document is the fundamental unit of storage (cf. a row in a relational database); can be built on various underlying storage models (RDBMS, OODBMS, indexed compressed files).

  24. XML-Enabled Database: has an added XML mapping layer; original XML metadata and structure may be lost; data retrieved as XML may not have originated in XML; data manipulation via e.g. DOM or SAX, or via SQL; Oracle, Microsoft and IBM use this approach.
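
To illustrate the point that data retrieved as XML need not have originated as XML, here is a minimal sketch of a mapping layer over a relational table. It uses Python's built-in `sqlite3` and `xml.etree.ElementTree` modules as stand-ins; it is not how Oracle, Microsoft or IBM implement their XML support. The table `entry`, its columns, the sample row and the helper `rows_to_xml` are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn: sqlite3.Connection, table: str) -> str:
    """Serialise every row of `table` as XML, one <row> element per row.

    `table` is interpolated into the SQL text, so it must come from
    trusted code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        record = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            ET.SubElement(record, column).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical relational data: it was never an XML document.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entry (accession TEXT, organism TEXT)")
conn.execute("INSERT INTO entry VALUES ('X00001', 'Drosophila melanogaster')")

xml_text = rows_to_xml(conn, "entry")
print(xml_text)
```

The reverse direction shows the limitation named on the slide: shredding an XML document into such a table would discard comments, attribute ordering and any structure the table has no column for.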

  25. Hybrid XML Database: can be treated as either a native XML database or an XML-enabled database; an example is Ozone.

  26. Which Do You Choose?

  27. Flat Files: can store complex data (e.g. PDB); can be indexed (at least for simple datatypes); can be made publicly available in simple form; well suited for human browsing; avoid cost of database software; platform independent; easy to prepare for WWW; cheap.

  28. Limitations of Flat File ‘Database’. Low data reliability, security and integrity (decentralised data and therefore decentralised control). Inadequate data structuring (difficult to provide an adequate model of ‘the real world’; variety of formats, lack of robustness). Difficult to get answers to ad-hoc queries (no query language; data files are distinct; sophisticated query tools have been developed, e.g. SRS). Low responsiveness to change (data and programs are not independent; hard to integrate).
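
The lack of a query language means every question asked of a flat file needs its own parsing code. The sketch below shows this on a deliberately simplified, hypothetical record layout (a two-letter line code, a value, and `//` as record terminator, loosely in the spirit of EMBL-style flat files); the sample entries, accession numbers and function names are invented for illustration.

```python
from __future__ import annotations

from typing import Iterator

# Hypothetical, heavily simplified flat-file content.
SAMPLE = """\
ID   X00001
DE   Hypothetical example entry one
OS   Drosophila melanogaster
//
ID   X00002
DE   Hypothetical example entry two
OS   Caenorhabditis elegans
//
"""


def parse_records(text: str) -> Iterator[dict[str, str]]:
    """Yield one dict per record, keyed by the two-letter line code."""
    record: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            if record:
                yield record
            record = {}
            continue
        code, _, value = line.partition(" ")
        value = value.strip()
        # Repeated line codes are joined into one value.
        record[code] = f"{record[code]} {value}" if code in record else value


def find_by_organism(text: str, organism: str) -> list[str]:
    """An 'ad-hoc query': scan every record and compare the OS line."""
    return [r["ID"] for r in parse_records(text) if r.get("OS") == organism]


print(find_by_organism(SAMPLE, "Drosophila melanogaster"))
```

Note how the program depends on the file layout: any change to the format means changing the parser, which is the "data and programs are not independent" limitation in practice.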

  29. Benefits of Relational Databases: store large amounts of relatively simple data as tables of rows and columns; scalability; sound theoretical basis; high security and reliability; performance (query optimization, parallel processing); strong support, tools, etc.

  30. Limitations of Relational Databases. Cannot adequately support complex data (complex data are stored as ‘BLOBs’; BLOBs can be retrieved, but not searched, indexed or manipulated). Restricted set of data types even for less complex data (numbers, character strings, dates). An inadequate model of ‘the real world’ (entity/relationship model; loss of semantics). Expensive.
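
The BLOB limitation can be made concrete with a small experiment. The sketch below stores the same hypothetical atom list twice in an in-memory SQLite database: once as an opaque BLOB and once as ordinary rows. SQLite stands in for a commercial RDBMS here, and the table names, column names and coordinates are invented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE structure_blob (entry_id TEXT PRIMARY KEY, payload BLOB)")
conn.execute("CREATE TABLE atom (entry_id TEXT, name TEXT, x REAL, y REAL, z REAL)")

# Hypothetical atoms: (name, x, y, z).
atoms = [("CA", 1.0, 2.0, 3.0), ("CB", 1.5, 2.5, 3.5), ("N", 0.0, 0.5, 1.0)]

# Variant 1: the whole structure as one opaque BLOB.
payload = json.dumps(atoms).encode("utf-8")
conn.execute("INSERT INTO structure_blob VALUES (?, ?)", ("demo", payload))

# Variant 2: one row per atom, using only simple column types.
conn.executemany("INSERT INTO atom VALUES ('demo', ?, ?, ?, ?)", atoms)

# With the BLOB, the database can only hand back the bytes; the
# filtering has to happen in application code after decoding.
(stored,) = conn.execute(
    "SELECT payload FROM structure_blob WHERE entry_id = 'demo'"
).fetchone()
from_blob = [a[0] for a in json.loads(stored.decode("utf-8")) if a[1] > 0.5]

# With rows, the same question is a declarative query the DBMS can optimise.
from_sql = [
    row[0]
    for row in conn.execute("SELECT name FROM atom WHERE x > 0.5 ORDER BY rowid")
]

print(from_blob, from_sql)
```

Both variants return the same names, but only the second lets the database index, search or update individual atoms.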

  31. Benefits of Object-Relational Databases: close to the relational model, but benefit from some OO concepts; can handle large amounts of complex data (‘smart BLOBs’); plug-in extensibility (cartridges and datablades); good ad-hoc query capability (SQL99); high security and reliability.

  32. Limitations of Object-Relational Databases: a compromise solution, merging two paradigms (underlying model still relational); less than perfect support for object extensions; even more expensive.

  33. Benefits of ODBMSs: support complex data structures; provide a much better implementation of ‘the real world’ model (object-oriented models map well, with support for OO concepts; little loss of semantics); vendor-specific: good performance? scalability?; ad-hoc queries are possible with OQL; closely integrated with programming languages.

  34. Limitations of ODBMSs: hard to learn?; vendor-specific; difficult to query? (few currently support OQL, a few support SQL; queries may have to be written in a 3GL, e.g. C++); performance? security? reliability? scalability? backup and recovery?; also expensive; few tools.

  35. Which do you choose? It depends on the data and what you want to do with them. (Michael Stonebraker: “Object-relational DBMS - The Next Wave”, Illustra whitepaper, http://www.informix.com/informix/corpinfo/zines/whitpprs)

  36. DBMS to store, flat files to communicate: most major data repositories in molecular biology have moved to using commercial RDBMS packages to manage their collections; most groups still collect and deliver the information using flat file protocols and formats, and XML is becoming dominant here.

  37. Data Modelling

  38. Modelling Comes First: start with a conceptual model; this can be done using different approaches (objects, entity relationship); the model can be implemented using different physical database systems and programming languages (not always without difficulty!). Remember: your database will only be as good as the data model it supports.

  39. UML: UML stands for Unified Modeling Language. The UML combines elements from data modelling concepts (entity relationship diagrams), business modelling (work flow), object modelling and component modelling. UML is the OMG standard language for visualizing, specifying, constructing, and documenting the artifacts of a software-intensive system.

  40. UML: Basics: classes; attributes; links; operations; set and get (implicit); checks and constraints. [Class diagram with the classes Study, Experiment, Conditions and ExpDim, joined by 1-to-many links; attributes shown: +details: String, +serial: Int, +name: String, +ndim: Int, +temperature: Float, +pH: Float, +dim: Int; operation shown: +__init__()]
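
The UML notions listed on this slide (classes, attributes, links, operations, checks and constraints) map fairly directly onto code. The following is a rough sketch using Python dataclasses with the class and attribute names from the diagram. The transcript does not preserve which attribute belongs to which class or which classes the links join, so the assignment below is a guess, and the `add_dim` method and its constraint are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExpDim:
    dim: int


@dataclass
class Conditions:
    # Assumed attribute placement; the original diagram layout is lost.
    serial: int
    temperature: float
    pH: float


@dataclass
class Experiment:
    serial: int
    name: str
    ndim: int
    details: str = ""
    # Links: one Experiment to many ExpDim / Conditions (assumed direction).
    exp_dims: list[ExpDim] = field(default_factory=list)
    conditions: list[Conditions] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Check/constraint: an experiment needs at least one dimension.
        if self.ndim < 1:
            raise ValueError("ndim must be at least 1")

    def add_dim(self, dim: ExpDim) -> None:
        # Hypothetical constraint: no more ExpDim objects than ndim allows.
        if len(self.exp_dims) >= self.ndim:
            raise ValueError("experiment already has ndim dimensions")
        self.exp_dims.append(dim)


@dataclass
class Study:
    details: str = ""
    experiments: list[Experiment] = field(default_factory=list)


study = Study(details="hypothetical study")
experiment = Experiment(serial=1, name="demo", ndim=2)
experiment.add_dim(ExpDim(dim=1))
experiment.add_dim(ExpDim(dim=2))
study.experiments.append(experiment)
print(study)
```

The generated `__init__` plays the role of the `+__init__()` operation, and attribute access gives the implicit set and get.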

  41. CCPN: Part of data model

  42. ArrayExpress Object Model

  43. Sequence Schema

  44. Some Advanced Database Features at the EBI

  45. The MSD Data Warehouse: database designed for queries and analysis; facilitates synchronisation with other databases; repository for derived data; modular.

  46. From Deposition to Distribution [diagram labels, in order: Deposition; Deposition Stage1; Warehouse; replication; transformation; Search-Warehouse; replication; distribution]

  47. Representing Macromolecular Structures [diagram labels: Exp. Result; Assembly; Chains; Residues; Atoms; upper-case labels, in order: CHAIN ENTRY ASSEMBLY ALT ASSEMBLY DATA RESIDUE ATOM DATA ATOM MODEL]
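
The hierarchy named on this slide (entry, assembly, chains, residues, atoms) can be expressed relationally as a chain of foreign keys. The sketch below is only an illustration of that idea in SQLite; it does not reproduce the actual MSD schema, and every column name and data value in it is hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE entry    (entry_id TEXT PRIMARY KEY);
CREATE TABLE assembly (assembly_id INTEGER PRIMARY KEY,
                       entry_id TEXT NOT NULL REFERENCES entry(entry_id));
CREATE TABLE chain    (chain_id INTEGER PRIMARY KEY,
                       assembly_id INTEGER NOT NULL REFERENCES assembly(assembly_id),
                       code TEXT);
CREATE TABLE residue  (residue_id INTEGER PRIMARY KEY,
                       chain_id INTEGER NOT NULL REFERENCES chain(chain_id),
                       name TEXT, serial INTEGER);
CREATE TABLE atom     (atom_id INTEGER PRIMARY KEY,
                       residue_id INTEGER NOT NULL REFERENCES residue(residue_id),
                       name TEXT, x REAL, y REAL, z REAL);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce them by default
conn.executescript(SCHEMA)

# A tiny hypothetical structure: one chain, two residues, three atoms.
conn.execute("INSERT INTO entry VALUES ('demo')")
conn.execute("INSERT INTO assembly VALUES (1, 'demo')")
conn.execute("INSERT INTO chain VALUES (1, 1, 'A')")
conn.executemany(
    "INSERT INTO residue VALUES (?, 1, ?, ?)", [(1, "GLY", 1), (2, "ARG", 2)]
)
conn.executemany(
    "INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?)",
    [(1, 1, "N", 0.0, 0.0, 0.0), (2, 1, "CA", 1.4, 0.0, 0.0), (3, 2, "N", 2.0, 1.0, 0.0)],
)

# Walk the hierarchy from entry down to atoms and count atoms per residue.
QUERY = """
SELECT r.name, COUNT(a.atom_id)
FROM entry e
JOIN assembly s ON s.entry_id = e.entry_id
JOIN chain c    ON c.assembly_id = s.assembly_id
JOIN residue r  ON r.chain_id = c.chain_id
JOIN atom a     ON a.residue_id = r.residue_id
WHERE e.entry_id = ?
GROUP BY r.residue_id
ORDER BY r.serial
"""
atoms_per_residue = conn.execute(QUERY, ("demo",)).fetchall()
print(atoms_per_residue)
```

Each level adds a join, which is one reason a warehouse designed for queries may store derived or denormalised data alongside such a normalised core.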

  48. Technical Details: need for staging databases; transformation mechanism; deposition-to-warehouse refresh; replication/distribution; query optimisation; interfacing with the warehouse (API, web, management tools).

  49. Warehouse Schema: Coordinates / Secondary Structure
