
MOLGENIS – the cyber infrastructure with a “dial” to tune it to your research.

Morris Swertz (m.a.swertz@rug.nl), Utrecht, BioAssist meeting, September 19, 2008.


Presentation Transcript


  1. MOLGENIS – the cyber infrastructure with a “dial” to tune it to your research. Morris Swertz (m.a.swertz@rug.nl) Utrecht, BioAssist meeting, September 19, 2008

  2. Where do I come from • MSc technology management, specialized in IT • Thesis on federated databases • One-man company • Information systems • PhD bioinformatics • “dynamic software infrastructures for the life sciences” • Now at Medical Genetics, University Medical Center Groningen

  3. Ongoing work • BioBank platform leader • Cohort data, clinical phenotypes • Locus-specific databases • Molecular data (overlap with other platforms) • HTP genotype and phenotype experiments • QTL, GWAS • EU projects CASIMIR and GEN2PHEN • NPC work package 1 • Developing a platform for proteomics research • See Martijn. http://www.molgenis.org

  4. Outline of talk • What is MOLGENIS? • Concept • Simple example • Practical examples • How to create a proper data model • Existing databases + Taverna • Hands-on session • Generate your first MOLGENIS • Plug-ins • Import/export

  5. MOLGENIS Concepts and methods

  6. What is MOLGENIS • Quotes: • “It is a holistic bio-database in a box, which can fit any data” • “It is the database that comes with a dial, to tune it to your research” • “It is the database where you program one feature, and then get many features for free”

  7. Cyber infrastructure? [Figure: researchers and bioinformaticians around four components: user interaction infrastructure, communication infrastructure, data infrastructure, processing infrastructure] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688

  8. Sharing data and reusing tools • I still want to generate my own flavor (including existing software) • Have free access to more resources via standard interfaces • … IS “my” + ontology tools • … IS “my” + processing tools • … IS “my” + workflows

  9. What does this mean in practice…

  10. Large-scale biology needs IT support • Large datasets • Dozens of samples • Processing • Complex relationships

  11. A website for experiments Swertz et al (2004) Bioinformatics 20, 2075-83

  12. A website for experiments Swertz et al (2004) Bioinformatics 20, 2075-83

  13. A website for experiments Swertz et al (2004) Bioinformatics 20, 2075-83

  14. Biosoftware is hard • The challenge: [Figure: workflow diagram connecting biologists, bioinformaticians and software engineers, with a scale annotation on each node: strains, inbreed, individuals, genome, markers, genotype, genotypes, map, QTL profiles, microarrays, probes, hybridize, expressions, preprocess, norm. exprs., correlate, network] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

  15. … biologist needs change • Then “we” need to: … reinvent the wheel … but we were lazy [Figure: workflow diagram connecting biologists, bioinformaticians and software engineers, with a scale annotation on each node: strains, inbreed, individuals, genome, SNP arrays, genotype, genotypes, map, QTL profiles, LC/MS, mass peaks, preprocess, aligned peaks, correlate, network] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

  16. Strategy: a flexible platform • 1x, ∞x • What? How? • Bioinformatician and software engineer: little language, blueprint model • Software factory + biologists • Model fragment (truncated on the slide): <!-- entity organization --> <entity name="Experiment" label="Experiment"> <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/> <field name="Medium" type="xref" xref_field="Medium.name"/> <field name="Protocol" label="Experiment Protocol"/> <field name="Temperature" type="int" … [Figure: workflow diagram with a scale annotation on each node: strains, inbreed, individuals, genome, markers, genotype, genotypes, map, QTL profiles, microarrays, probes, hybridize, expressions, preprocess, norm. exprs., correlate, network] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
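
As a rough illustration of the "little language" idea on this slide, the sketch below reads a small model file into a plain Python structure that a software factory could then generate code from. It is not MOLGENIS code: the `<molgenis>` root element, the closing tags, the `read_model` helper and the default field type `string` are assumptions added to make the truncated fragment well-formed and runnable; only the `entity` and `field` attributes come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed model in the style of the slide's fragment.
# The <molgenis> root and the closing tags are assumptions.
MODEL_XML = """
<molgenis>
  <entity name="Experiment" label="Experiment">
    <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/>
    <field name="Medium" type="xref" xref_field="Medium.name"/>
    <field name="Protocol" label="Experiment Protocol"/>
    <field name="Temperature" type="int"/>
  </entity>
</molgenis>
"""


def read_model(xml_text):
    """Return {entity name: [(field name, field type), ...]}."""
    root = ET.fromstring(xml_text)
    model = {}
    for entity in root.iter("entity"):
        fields = []
        for field in entity.iter("field"):
            # Assumption: a field without an explicit type is a string.
            fields.append((field.get("name"), field.get("type", "string")))
        model[entity.get("name")] = fields
    return model


model = read_model(MODEL_XML)
for entity_name, entity_fields in model.items():
    print(entity_name, entity_fields)
```

The point of the design is that the model is written once ("1x") and everything downstream works from the parsed structure, so changing the research question means editing the model rather than the generated software.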

  17. “Dial” to new research • Bioinformatician and software engineer: little language, blueprint model • Software factory + biologists • Model fragment (truncated on the slide): <!-- entity organization --> <entity name="Experiment" label="Experiment"> <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/> <field name="Medium" type="xref" xref_field="Medium.name"/> <field name="Protocol" label="Experiment Protocol"/> <field name="Temperature" type="int" … [Figure: workflow diagram with a scale annotation on each node: strains, inbreed, individuals, genome, SNP arrays, genotype, genotypes, map, QTL profiles, LC/MS, mass peaks, preprocess, aligned peaks, correlate, network] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

  18. Upgrade to new software tools • Bioinformatician and software engineer: little language, blueprint model • Software factory + biologists • Model fragment (truncated on the slide): <!-- entity organization --> <entity name="Experiment" label="Experiment"> <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/> <field name="Medium" type="xref" xref_field="Medium.name"/> <field name="Protocol" label="Experiment Protocol"/> <field name="Temperature" type="int" … [Figure: workflow diagram with a scale annotation on each node: strains, inbreed, individuals, genome, SNP arrays, genotype, genotypes, map, QTL profiles, LC/MS, mass peaks, preprocess, aligned peaks, correlate, network] http://www.molgenis.org Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

  19. Sharing data and reusing tools • I still want to generate my own flavor (including existing software) • Have free access to more resources via standard interfaces • … IS “my” + ontology tools • … IS “my” + processing tools • … IS “my” + workflows

  20. MOLGENIS: the software family … • Proteomics • Genetical Genomics • Microarray • Illumina arrays on mouse • Affy arrays on mouse • Qiagen arrays on C. elegans • LC-MS on A. thaliana • Software factory + sharing components; easier to integrate because all MOLGENIS instances have standard generated interfaces [Figure: microarray data model. Legend: Process, Biomaterial, Data file. Array Production (array design & production): Genemap (*gbk), Amplicon design, Plate synthesis, Plate, Well desc., Spotter settings, Spot desc., Array batch, Array, Chip Layout, Control Scans. Array Experiments: Organism, Media, Sampling, Sample, RNA Extraction, Sample RNA, Labeling kit, cDNA Labeling, Labeled cDNA, Hybrid. Protocol, Hybridisation, Hybrid. Array, Measurement, Hires Scan, Grid file, Quant. file, Control Scans]

  21. More projects use these concepts • for processing • for data and UI Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243

  22. Basics

  23. Open source biobase generator • Download for free at http://www.molgenis.org • Works with Java, MySQL, Tomcat and Eclipse on Windows, Linux and Mac

  24. Example 1: a MOLGENIS from scratch [Figure labels: probes, individuals, expressions]

  25. Little OM language • Object model language: • Entities • Fields • Xrefs • It is a ‘contract’, as Machiel explained
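
One way to read the ‘contract’ remark is that the object model can be checked before anything is generated from it. The sketch below, written under that assumption, represents entities, fields and xrefs as nested dictionaries and reports xrefs that point at a missing entity or field. The dictionary layout and the `dangling_xrefs` helper are hypothetical; only the `xref_field="Medium.name"` convention is taken from the model fragment shown on the strategy slide.

```python
# Hypothetical in-memory form of an object model: entity -> field -> spec.
MODEL = {
    "Medium": {
        "name": {"type": "string"},
    },
    "Experiment": {
        "ExperimentID": {"type": "int"},  # assumption: autonumber stored as int
        "Medium": {"type": "xref", "xref_field": "Medium.name"},
    },
}


def dangling_xrefs(model):
    """List xref fields whose target entity or field is not in the model."""
    problems = []
    for entity, fields in model.items():
        for field_name, spec in fields.items():
            if spec.get("type") != "xref":
                continue
            target_entity, _, target_field = spec["xref_field"].partition(".")
            if target_field not in model.get(target_entity, {}):
                problems.append(f"{entity}.{field_name} -> {spec['xref_field']}")
    return problems


print(dangling_xrefs(MODEL))
```

Checking the model up front means every generated artifact can rely on the same guarantees, which is what makes the model usable as a shared contract.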

  26. Little OM language

  27. Little UI language

  28. Generate online or in Eclipse http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase

  29. Result: Java code http://gbic.biol.rug.nl/supplementary/2007/molgenis_showcase

  30. Under the hood [Figure: generator architecture. Steps: DSL file, Customizing..., Generate, MyScript. Layers: GUI; APIs in Java, R, Web services and HTTP; DB in MySQL or HSQL. Generators: FormGen, TreeGen, MenuGen, PluginGen, MatrixGen, JDBCMapGen, JTypeGen, JReadCsvGen, JListGen, RListGen, JDatabaseGen, RMatrixGen, WSGen, HSQLGen, MySQLGen. Simple: Marker.Find(). Complex: Select id, name, type from Item natural join Trait natural join …]
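
The generator names on this slide suggest that each artifact is produced by its own small generator working from the same model. A minimal sketch of that pattern follows, assuming a dictionary model with two field types. The functions `generate_sql` and `generate_tab_header` are hypothetical stand-ins, not the real MySQLGen or JReadCsvGen, and SQLite is used here only to show that the generated statement runs; the slide itself names MySQL and HSQL.

```python
import sqlite3

# One model, several generators (all names here are illustrative).
MODEL = {
    "Marker": [("id", "int"), ("name", "string"), ("chromosome", "string")],
}

SQL_TYPES = {"int": "INTEGER", "string": "VARCHAR(255)"}


def generate_sql(model):
    """Emit one CREATE TABLE statement per entity."""
    statements = []
    for entity, fields in model.items():
        columns = ", ".join(f"{name} {SQL_TYPES[ftype]}" for name, ftype in fields)
        statements.append(f"CREATE TABLE {entity} ({columns});")
    return statements


def generate_tab_header(model):
    """Emit the header line of a tab-delimited file per entity."""
    return {entity: "\t".join(name for name, _ in fields)
            for entity, fields in model.items()}


ddl = generate_sql(MODEL)
headers = generate_tab_header(MODEL)

# Run the generated DDL against an in-memory database as a smoke test.
connection = sqlite3.connect(":memory:")
for statement in ddl:
    connection.execute(statement)
connection.execute("INSERT INTO Marker VALUES (1, 'm1', '1')")
rows = connection.execute("SELECT name FROM Marker").fetchall()
connection.close()
```

Adding a new output format then means adding one more function over the same model, which is the sense in which one programmed feature yields many features.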

  31. Result: many features for free • These are the ‘implementations of contracts’ that Machiel talked about: • Java API • SOAP API • R API • Tab-delimited API • Database tables

  32. Software: interface to R
      source("http://localhost:8080/molgenis4gg/R")
      # download data
      use.experiment(name="metanetwork")  # set default
      traits <- get.metabolitedata(name="mytraits")
      genotypes <- get.markerdata(name="mygenotypes")
      # calculate mQTLs
      library("MetaNetwork")
      qtls <- qtlMapTwoPart(genotypes=genotypes, traits=traits, spike=4)
      # upload results for others to use and inspect
      add.mqtldata(qtls, name="myqtls")
      MetaNetwork protocol: Fu, Swertz, Keurentjes, Jansen, Nature Protocols, 2007.

  33. Including documentation

  34. Applications

  35. Ongoing projects • Long projects: • Microarray experiments: MOLGEN group, Groningen • Genotypes and phenotypes: Rudi Alberts, Braunschweig • CILAIR, first NPC pilot: Martijn Dijkstra, Isthiaq Ahmad, Groningen • Animal Observatory: Ate Boerema, Groningen • Peptide and pathways: Arjen Strijkstra, Groningen • Recently started: • FINDIS database: Juha Muilu, Finland • MAGE-TAB: Helen Parkinson, EBI • Human variome + BioSQL: Gudmundur Thorisson, Leicester • More soon? • Metabolomics, Floris Sluiter • Chado, Victor de Jager • NPC pilot 2, Don de Lange • HTP sequencing…

  36. Case 1: A realistic model • xGaP, the extensible genotype and phenotype database

  37. Objective: integrated genetic study of (molecular) phenotypes • Challenge: various experimental designs • Flavors of QTL, GWAS, knockouts, etc. • Array, MassSpec, Markers, SNPs, etc. • Human, Mouse, Worm, Plant, etc. • “Standard” and extensible data representation • Ontology enabled, and in collaboration with other organizations like FuGE, MIQAS, PaGE-OM, OBO • “Standard” cyber infrastructure • Format for exchange, e.g. XML or TAB formatted • Data management and searching, e.g. using MySQL • Communication, e.g. using web services • Processing, e.g. using R

  38. Integration of data, reuse of algorithms • It’s a genotype and phenotype database ‘in-a-box’ • xGaP + ontology tools • xGaP + processing tools • xGaP + workflows

  39. Cyber infrastructure [Figure: xGaP placed between researchers and bioinformaticians, across four components: user interaction infrastructure, communication infrastructure, data infrastructure, processing infrastructure] Components of cyber infrastructure, Stein (2008) Nature Reviews Genetics 9: 678-688

  40. Towards a real model:

  41. Basic data? • Raw and processed data in matrix form [Figure: genotype data matrix with subjects (STRAINS) on one axis, traits (MARKERS) on the other, and DATA ELEMENTS in the cells: TRAIT × SUBJECT]

  42. Minimal and simple data model: TRAIT × SUBJECT [Figure: DATA ELEMENT linked to SUBJECT (columns) and TRAIT (rows)]

  43. Too simple? • What about QTL data? • Probe association data? • Interaction network data? [Figure: data matrix with traits (MARKERS) on one axis and traits (PROBES) on the other: TRAIT × TRAIT! SUBJECT × SUBJECT?]

  44. Minimal and simple data model: TRAIT × SUBJECT, TRAIT × TRAIT, SUBJECT × SUBJECT [Figure: DATA ELEMENT whose rows and columns both link to DIMENSION ELEMENT, next to the earlier model in which DATA ELEMENT links to SUBJECT (columns) and TRAIT (rows)]
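
The step from TRAIT × SUBJECT to a generic dimension element on both axes can be sketched as a small data structure. This is an illustrative reading of the slide rather than the xGaP schema: the class names `DimensionElement` and `DataMatrix`, the `kind` attribute and the example element names are assumptions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DimensionElement:
    kind: str  # "trait" or "subject"
    name: str


@dataclass
class DataMatrix:
    rows: list      # DimensionElement instances
    columns: list   # DimensionElement instances
    values: dict = field(default_factory=dict)  # (row name, column name) -> value

    def set(self, row, column, value):
        if row not in self.rows or column not in self.columns:
            raise KeyError("unknown row or column")
        self.values[(row.name, column.name)] = value

    def get(self, row, column):
        return self.values.get((row.name, column.name))

    def kinds(self):
        """Return the (row kind, column kind) pair, e.g. ('trait', 'subject')."""
        return (self.rows[0].kind, self.columns[0].kind)


strain_a = DimensionElement("subject", "strainA")
marker_1 = DimensionElement("trait", "marker1")
probe_1 = DimensionElement("trait", "probe1")

# Genotype data: TRAIT x SUBJECT
genotypes = DataMatrix(rows=[marker_1], columns=[strain_a])
genotypes.set(marker_1, strain_a, "A")

# QTL data: TRAIT x TRAIT, stored in the same structure
qtl = DataMatrix(rows=[marker_1], columns=[probe_1])
qtl.set(marker_1, probe_1, 3.2)
```

Because both axes accept any dimension element, QTL profiles and interaction networks need no new table type, only a different combination of row and column kinds.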

  45. Annotation information… of many types? [Figure: workflow diagram with a scale annotation on each node: strains, inbreed, individuals, genome, markers, genotype, genotypes, map, QTL profiles, microarrays, probes, hybridize, expressions, preprocess, norm. exprs., correlate, network. Legend: main work flow, data dependency, biomaterial/result (material), lab/analysis process (process), scale of information, associated data files]

  46. Systematic extension mechanism • SUBJECT extensions: • STRAIN: Name, Type (CSS, RIL..), Parent Strains • INDIVIDUAL: Name, Strain, Mother, Father, Sex • SAMPLE: Name, Individual, Tissue • And so on … • TRAIT extensions: • PROBE: Name, Gene, Chromosome, Locus • MARKER: Name, Allele, Chromosome, Locus • MASSPEAK: Name, MZ, RetentionTime • And so on … [Figure labels: dimension ELEMENT, column, row, DATA ELEMENT]
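
The extension mechanism listed on this slide maps naturally onto subclassing: generic code works with SUBJECT and TRAIT, while each extension adds its own attributes. The sketch below is a loose illustration using Python dataclasses. The field names follow the slide, but the Python types, the defaults and the `extra_fields` helper are assumptions, and SAMPLE and PROBE are left out for brevity.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class Subject:
    name: str


@dataclass
class Trait:
    name: str


@dataclass
class Strain(Subject):
    type: str = ""  # e.g. "CSS" or "RIL"
    parent_strains: List[str] = field(default_factory=list)


@dataclass
class Individual(Subject):
    strain: Optional[str] = None
    mother: Optional[str] = None
    father: Optional[str] = None
    sex: Optional[str] = None


@dataclass
class Marker(Trait):
    allele: Optional[str] = None
    chromosome: Optional[str] = None
    locus: Optional[int] = None  # assumption: locus stored as an integer position


@dataclass
class MassPeak(Trait):
    mz: Optional[float] = None
    retention_time: Optional[float] = None


def extra_fields(element):
    """Names of the fields an extension adds on top of 'name'."""
    return [f.name for f in fields(element) if f.name != "name"]


peak = MassPeak("peak1", mz=100.5, retention_time=12.0)
marker = Marker("marker1", chromosome="1")
print(extra_fields(peak), extra_fields(marker))
```

Code written against `Subject` and `Trait` keeps working when a new technology adds another subclass, which is the property the slide calls systematic extension.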

  47. What about experimental design? • Using FuGE data elements: [Figure labels: genotype data, expression data, QTL data, DATA, SNP Array, Affy Array, QTL Mapping, Illumina Protocol, Illumina Bead Studio, Affy M430 Protocol, Affy M430 platform, Bioconductor Norm., Mapping Protocol, R Software. Legend: Protocol application, Protocol, Equipment, Software] FuGE: Jones et al Nature Biotech 25, 1127-1133

  48. Ontology enabled • Standard descriptions (semantics) are also essential for integration, next to standard structure (syntax) • Incompatible naming: GENE Name = Mip1alpha versus GENE Name = Mip1a • Compatible identifiers via database references • Map mouse on human ontologies [Figure: INVESTIGATION 1 and INVESTIGATION 2 connected by “Hyperlink …”; ONTOLOGY ENTRY (Id = 0005615, Term = ABC, Ontology = GO); ONTOLOGY ENTRY (Id = MP:0005385, Term = cardiovascular, Ontology = MP); DATABASE REFERENCE (Id = ENSMUS098, Db = ENSEMBL); DATABASE REFERENCE (Id = ENSMU0S98, Db = ENSEMBL); DATABASE REFERENCE (Id = ENSMUS98, Db = ENSEMBL); DATABASE REFERENCE (Id = 1419561_AT, Db = AFFY 430)] FuGE: Jones et al Nature Biotech 25, 1127-1133
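
The naming problem on this slide (Mip1alpha versus Mip1a) can be illustrated with a small matching routine that ignores names and joins gene records on their database reference instead. This is only a sketch of the idea: the record layout, the `match_by_reference` helper and the identifier `REF-0001` are hypothetical placeholders, not values from the slide.

```python
# Two investigations describe the same gene under different names.
# "REF-0001" is a made-up identifier used purely for illustration.
investigation_1 = [
    {"name": "Mip1alpha", "db": "ENSEMBL", "ref_id": "REF-0001"},
]
investigation_2 = [
    {"name": "Mip1a", "db": "ENSEMBL", "ref_id": "REF-0001"},
    {"name": "OtherGene", "db": "ENSEMBL", "ref_id": "REF-0002"},
]


def match_by_reference(genes_a, genes_b):
    """Pair up gene names that share the same (database, identifier)."""
    index = {(gene["db"], gene["ref_id"]): gene for gene in genes_b}
    matches = []
    for gene in genes_a:
        key = (gene["db"], gene["ref_id"])
        if key in index:
            matches.append((gene["name"], index[key]["name"]))
    return matches


matches = match_by_reference(investigation_1, investigation_2)
print(matches)
```

Matching on names alone would find nothing here, whereas the shared reference links the two records.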

  49. Standard extension mechanism for new research • Standard structure to ease sharing of data and tools • Standard extension mechanism for new research

  50. Using the generator again… • Bioinformatician and software engineer: little language, blueprint model • Software factory + biologists • Model fragment (truncated on the slide): <!-- entity organization --> <entity name="Experiment" label="Experiment"> <field name="ExperimentID" key="1" readonly="true" label="ExperimentID(autonum)"/> <field name="Medium" type="xref" xref_field="Medium.name"/> <field name="Protocol" label="Experiment Protocol"/> <field name="Temperature" type="int" … [Figure: workflow diagram with a scale annotation on each node: strains, inbreed, individuals, genome, markers, genotype, genotypes, map, QTL profiles, microarrays, probes, hybridize, expressions, preprocess, norm. exprs., correlate, network] Swertz & Jansen (2007) Nature Reviews Genetics 8, 235-243
