
The Representation of Scientific Data


Presentation Transcript


  1. The Representation of Scientific Data • Frank.Gibson@ncl.ac.uk

  2. Overview • Recording, archiving and sharing the process and the results of experiments is a challenge • What to store? • How to store it? • Why?

  3. Science is complicated

  4. Technology • Complex experimental workflow • Advances in instrumentation • High-throughput methods

  5. Analysis is complicated
     12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
     12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
     12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
     12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
     12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
     12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
     12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
     12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
     12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
     12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
     12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa

  6. Analysis • New algorithms and software • Data integration • From multiple sources • Genomics • Proteomics • Metabolomics • Neuroscience • Systems biology

  7. 2D Image analysis

  8. Problems • “In the standard model, one collects data, publishes a paper or papers and then gradually loses the original dataset.” • The New Knowledge Economy and Science and Technology Policy, Geoffrey Bowker, University of California, San Diego

  9. Problems • Large, complex datasets are commonplace • Heterogeneous data formats • Vendor specific, lab specific • Multitude of analysis methods • Proprietary, open source

  10. Benefits • Knowledge discovery – results • Sharing of best practice • Evaluation of results • Sharing of data • Re-use

  11. Re-use of neuroscience datasets • Data that are shared and can be interpreted can often be used to address multiple questions. • Data that have been collected with one question in mind often turn out to be highly valuable for addressing other questions. • (1) Hippocampus recordings for mapping place fields were the basis for high-profile papers addressing questions concerning the temporal organization of neural codes (PMID: 12891358). • (2) Paired recordings using extracellular and intracellular electrodes, originally collected for detecting dendritically generated action potentials, provide ground truth for testing and comparing spike-sorting techniques (PMID: 10899214).

  12. Engineering and Physical Sciences Research Council • CARMEN: Code, Analysis, Repository and Modelling for e-Neuroscience • www.carmen.org.uk

  13. Virtual Laboratory for Neurophysiology • Enabling sharing and collaborative exploitation of data, analysis code and expertise that are not physically collocated

  14. Cost • Infrastructure • Acquisition – data and metadata • Developing a common representation • Potential benefits are not always experienced by data producers • Lab experimenter vs bioinformatician

  15. Data pyramid • Results • Processing • Derived data • Raw data

  16. Mass Spectrometry Data pyramid • Results • Processing • Derived data • Raw data

  17. How do we store the data? • Dictated by form of access • Raw data: typically vendor-specific formats for vendor-specific analysis software • Derived data: unlimited formats; a higher level of access is required to determine results • Results: often queries over derived data • Problematic if derived data are represented in inconsistent structures, so a consistent representation is valuable
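
The point about inconsistent structures can be illustrated with a minimal sketch. It assumes two hypothetical labs that record the same kind of derived data under different field names and units; all names and values below are invented for illustration. Once both are mapped onto one common representation, a "result" becomes a single query.

```python
# Two hypothetical labs record the same kind of derived data differently.
lab_a = [{"protein": "P1", "mass_da": 15200.0}]
lab_b = [{"id": "P2", "mass_kda": 42.5}]


def from_lab_a(row):
    """Map a lab A row onto the common representation (mass in daltons)."""
    return {"protein_id": row["protein"], "mass_da": row["mass_da"]}


def from_lab_b(row):
    """Map a lab B row onto the common representation, converting kDa to Da."""
    return {"protein_id": row["id"], "mass_da": row["mass_kda"] * 1000.0}


# One consistent structure for all derived data.
common = [from_lab_a(r) for r in lab_a] + [from_lab_b(r) for r in lab_b]

# A result is then a single query over the common representation.
heavy = [r["protein_id"] for r in common if r["mass_da"] > 20000.0]
print(heavy)
```

Without the two mapping functions, the same query would have to be written once per lab format.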

  18. Metadata • Description of results • Sample • How it was generated • Equipment • Processing steps • Expensive to capture • Important to validate results • [Figure: lab-books]

  19. Standards • Science is a challenge • Scientific data is complex • Different data representations add further complexity to complex science • We need a common representation of data to get back to just complex science • Lots of individuals have created formats in isolation – only works for their data in their lab

  20. What is a standard? • “established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context” • BSI: http://www.bsi-global.com/en/Standards-and-Publications/About-standards/Glossary/

  21. Community standards development

  22. Standards allow working together for knowledge discovery

  23. Standards bodies • W3C – World Wide Web Consortium • IEEE – Institute of Electrical and Electronics Engineers • OMG – Object Management Group

  24. Life science communities

  25. Technologies for data standards • Important to adopt a technology that provides a clear representation of the domain • The model and the model documentation capture a shared understanding of the domain • Many technologies exist which support modelling • Each focuses on a different use, such as validation, code generation and data transmission

  26. Technologies being used • Simple text documents or spreadsheets • XML – Extensible Markup Language • RDF – Resource Description Framework • UML – Unified Modeling Language • OWL – Web Ontology Language • OBO – Open Biomedical Ontologies format

  27. Simple documents • A list of what is required • MIxxx: Minimum Information XXX • MIAME • Minimum Information About a Microarray Experiment • MIAPE • Minimum Information About a Proteomics Experiment

  28. MIAPE:GE • Identifies the minimum information required to report the use of n-dimensional gel electrophoresis in a proteomics experiment
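
A "minimum information" document is in effect a checklist, so a report can be checked against it mechanically. The sketch below assumes a made-up set of required items; the actual items are defined by the relevant minimum-information document and are not reproduced here.

```python
# Hypothetical checklist items, for illustration only; a real
# minimum-information document defines its own list.
REQUIRED_ITEMS = {"sample", "gel_dimensions", "protocol", "equipment"}


def missing_items(report):
    """Return the checklist items that are absent or empty in a report."""
    return {item for item in REQUIRED_ITEMS if not report.get(item)}


# An invented, incomplete report: "protocol" is empty, "equipment" is absent.
report = {"sample": "hypothetical lysate", "gel_dimensions": 2, "protocol": ""}
print(sorted(missing_items(report)))
```

A check like this only says whether something was reported, not whether what was reported is correct.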

  29. XML • Widely used for representing biological information • Mark up sections with elements • Validates against a schema
      <lecture>
        <to>Bioinformatics students</to>
        <from>Frank Gibson</from>
        <title>Representation of scientific data</title>
        <feedback>Students all fell asleep</feedback>
      </lecture>
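
As a rough sketch of how marked-up elements become machine-readable, the lecture example can be parsed with Python's standard library. Note the assumption: `xml.etree.ElementTree` does not perform schema validation, so the `EXPECTED_CHILDREN` list below is a hand-written stand-in for a schema, not a real one.

```python
import xml.etree.ElementTree as ET

LECTURE_XML = """<lecture>
  <to>Bioinformatics students</to>
  <from>Frank Gibson</from>
  <title>Representation of scientific data</title>
  <feedback>Students all fell asleep</feedback>
</lecture>"""

# Assumed structural rule standing in for a schema: the child elements
# a <lecture> must contain, in order.
EXPECTED_CHILDREN = ["to", "from", "title", "feedback"]

root = ET.fromstring(LECTURE_XML)
children = [child.tag for child in root]

# The document "validates" here if the root and its children match the rule.
is_valid = root.tag == "lecture" and children == EXPECTED_CHILDREN
print(is_valid)
print(root.findtext("title"))
```

Real schema validation (for example against an XML Schema document) needs a schema-aware library; the structural check above only conveys the idea.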

  30. UML • An implementation-independent model • Allows multiple technology implementations of the same model • Such as • XML, Java, relational tables

  31. • The numbers indicate the multiplicity of the relationship, with * meaning “many”: one or more instances of JetEngine can be associated with one or more instances of Aeroplane (1..* at each end). • A filled diamond indicates containment: an Aeroplane cannot exist without a JetEngine. • An arrow shows the direction of the relationship. • An open-headed arrow indicates inheritance: Pilot and Passenger are both kinds of Person, inheriting the attributes “name” and “DOB”.
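
Because a UML model is implementation independent, the diagram described above could be realised in many ways. One possible rendering in Python is sketched below; the `serial` attribute and the example values are assumptions added for illustration, and only the "at least one engine" end of the multiplicity is enforced.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Person:
    name: str
    dob: str


@dataclass
class Pilot(Person):
    """Inherits name and dob from Person (open-headed arrow)."""


@dataclass
class Passenger(Person):
    """Inherits name and dob from Person (open-headed arrow)."""


@dataclass
class JetEngine:
    serial: str  # hypothetical attribute, not taken from the diagram


@dataclass
class Aeroplane:
    engines: List[JetEngine]

    def __post_init__(self):
        # Multiplicity 1..*: an Aeroplane must have at least one JetEngine.
        if len(self.engines) < 1:
            raise ValueError("an Aeroplane needs at least one JetEngine")


pilot = Pilot(name="Example Pilot", dob="1970-01-01")
plane = Aeroplane(engines=[JetEngine(serial="E1")])
print(isinstance(pilot, Person), len(plane.engines))
```

The same model could equally be generated as an XML schema or as relational tables.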

  32. Functional Genomics Experiment (FuGE) • Model of common components in science investigations, such as materials, data, protocols, equipment and software. • Provides a framework for capturing complete laboratory workflows, enabling the integration of pre-existing data formats.

  33. GelML

  34. RDF • Overcomes limited expressivity of XML • Allows the semantic meaning of statements to be captured
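
RDF expresses each statement as a subject, predicate, object triple. The sketch below models that idea with plain Python tuples rather than an RDF library; the `http://example.org/` URIs and the statements themselves are hypothetical placeholders.

```python
# Each statement is a (subject, predicate, object) triple.
# The URIs are hypothetical placeholders, not real identifiers.
EX = "http://example.org/"

triples = {
    (EX + "proteinA", EX + "encodedBy", EX + "geneA"),
    (EX + "proteinA", EX + "locatedIn", EX + "nucleus"),
    (EX + "geneA", EX + "foundIn", EX + "human"),
}


def objects(subject, predicate):
    """Return all objects stated for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


print(objects(EX + "proteinA", EX + "encodedBy"))
```

Because the predicate is itself an identifier, the meaning of the link between two things is stated explicitly. XML element nesting does not do this on its own.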

  35. Uniprot(beta) in RDF

  36. Ontologies for life science • Emergence has occurred for two reasons • Consistent annotation of data • To add meaning and understanding that can be interpreted computationally • Bio-ontologies registered on the OBO Foundry

  37. Bio-ontologies • OBO format • Flat file format, more suited to controlled vocabularies, made popular by GO • OWL • W3C recommendation, designed for computers not humans
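
To show why the OBO flat-file format suits controlled vocabularies, here is a minimal reader for `[Term]` stanzas. It is a sketch that handles only simple `tag: value` lines and trailing `! comment` text; the `EX:` identifiers and term names are invented for illustration and do not come from any real ontology.

```python
OBO_TEXT = """\
[Term]
id: EX:0000001
name: separation method

[Term]
id: EX:0000002
name: gel electrophoresis
is_a: EX:0000001 ! separation method
"""


def parse_terms(text):
    """Collect the tag/value pairs of each [Term] stanza into a dict of lists."""
    terms, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Start a new stanza; ignore stanza types other than [Term].
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop the trailing "! comment" allowed after a value.
            value = value.split(" ! ")[0]
            current.setdefault(tag, []).append(value)
    return terms


terms = parse_terms(OBO_TEXT)
print(terms[1]["name"], terms[1]["is_a"])
```

A full OBO parser has to deal with more than this (header tags, escapes, qualifiers), which this sketch deliberately leaves out.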

  38. sepCV in OBO

  39. OBI • An ontology for all investigations in the life sciences • Implemented in OWL • Large community involvement • sepCV to be integrated within OBI

  40. Tools • Tools are important • Biologists don’t want to look at XML • Need data entry tools – a website… • Direct export of data and metadata from instruments • Equipment vendors and manufacturers need to be involved in the “community” of standards development • Tools lag behind development of the standard

  41. Symba - data entry and storage

  42. The Representation of Scientific Data • The Road Map

  43. Patience • Standards development is slow; it requires • A measure of technical and political consensus • An organisational framework • Individuals who are willing to contribute time and expertise, both domain experts and knowledge engineers (modellers)

  44. The Problem • Identify the problem • Identify the users that need the problem solved • Requirements gathering – what do the users need? • See if someone else has already done it! • If so, use it and go to the pub
