Development of Array Data Standards and Prototype Repository: DESPRAD Initiative

DESPRAD subproject Alvis Brazma EMBL-EBI Hinxton, October 20, 2003

DESPRAD – Development and Establishment of Standards and Prototype Repository for Array Data

Participants • EBI • UMC Utrecht • University of Bergen • RZPD • Cambridge University • EMBL Heidelberg • University of Marseille (CIML) • University of Madrid (CMB)

Three major sets of WPs: • Developing standards and an international infrastructure for microarray data sharing (WP1 – WP4) • Establishing a public repository for microarray data – ArrayExpress (WP4 – WP9) • Research in gene expression data analysis and gene networks (WP9 – WP12)

ArrayExpress goals • Serving as an archival repository for microarray data supporting publications • Providing easy access to microarray data in a structured and standardised format for the research community • Facilitating the sharing of microarray designs and protocols

ArrayExpress approach • To collect the information a user needs in order to interpret the data • To represent the information in a structured way that could allow automated analysis and mining • To work towards a community agreement on representing microarray data in a standard way – founding of the MGED Society

1. Standards • Founding the Microarray Gene Expression Data (MGED) Society • Development of the standards • MIAME • MAGE • MGED Ontology

Sharing microarray data – which data? (Figure with panels A to D; labels: array scans, quantitations, spots, genes, samples)

Gene expression matrix (figure: genes by samples, with sample annotations and gene annotations) • Sample annotations – problem 1 • Gene expression levels – problem 2
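The three parts named on this slide (an expression matrix plus annotations for its rows and columns) can be sketched as a small Python container. This is only an illustration of the idea, not the ArrayExpress or MAGE data model; the class name, field names and all values below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDataset:
    """Toy container: genes are rows, samples are columns (hypothetical layout)."""
    genes: list[str]
    samples: list[str]
    matrix: list[list[float]]  # matrix[i][j] is the level of gene i in sample j
    gene_annotations: dict[str, dict] = field(default_factory=dict)
    sample_annotations: dict[str, dict] = field(default_factory=dict)

    def __post_init__(self):
        # The matrix needs one row per gene and one column per sample.
        if len(self.matrix) != len(self.genes):
            raise ValueError("one matrix row is needed per gene")
        if any(len(row) != len(self.samples) for row in self.matrix):
            raise ValueError("one matrix column is needed per sample")

    def level(self, gene: str, sample: str) -> float:
        """Look up one expression level by gene name and sample name."""
        return self.matrix[self.genes.index(gene)][self.samples.index(sample)]


# Made-up values, for illustration only.
dataset = ExpressionDataset(
    genes=["geneA", "geneB"],
    samples=["sample1", "sample2", "sample3"],
    matrix=[[1.0, 2.5, 0.3], [0.0, 1.2, 4.1]],
    gene_annotations={"geneA": {"description": "hypothetical kinase"}},
    sample_annotations={"sample1": {"treatment": "control"}},
)
```

Keeping the annotations next to the matrix is what lets a reader interpret a number such as `dataset.level("geneB", "sample3")`: without the sample and gene annotations the value has no context.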

MGED Society • The Microarray Gene Expression Data Society is an international organisation for facilitating the sharing of functional genomics and proteomics array data • Meetings: MGED 1, Hinxton, November 1999; MGED 2, Heidelberg, May 2000; MGED 3, Stanford University, April 2001; MGED 4, Boston, February 2002; MGED 5, Tokyo, September 2002; MGED 6, Aix-en-Provence, September 2003; MGED 7, Toronto, September 2004 • Board of directors – EBI, Stanford, UCB, TIGR, Affymetrix, Rosetta,…

Microarray experiment (figure): sample, RNA extract, labelled nucleic acid, hybridisation to an array (array design), with a protocol at each step; normalization and integration give the gene expression data matrix (genes by samples) for the experiment

The first database model - developed in collaboration with DKFZ in 1999

MGED standards - MIAME

Nature editorial

MGED standards – MAGE-ML

The organisations and software supporting MAGE-ML include • Affymetrix • Agilent • Biodiscovery (Imagene5.5) • BASE (open source project coordinated at Lund) • Iobion (Gene Traffic) • Manchester University (MAXDB) • Molmine (J-Express) • NCI • NIEHS • Rosetta Biosoftware (Rosetta Resolver) • RZPD • Sanger Institute LIMS (MIDAS) • Silicon Genetics (GeneNet) • Stanford University (SMD) • TIGR (MADAM) • UC at Berkeley • University of Pennsylvania (RAD) • UMC Utrecht

Data in ArrayExpress (chart: hybs over time, 2002 to 2004; y-axis 1000 to 3000; data labels 6, ~100, ~250, 1172, ~3000; x-axis labels November, February, September, September, April)

ArrayExpress content (experiments), by experiment; +1 Drosophila experiment

Submissions by labs (in hybs)

Submissions by country (in experiments)

Expression Profiler (component interface): 1 SUBSELECT, 2 CLUSTER

ArrayExpress web-page hits • 2002 – 49 245 • 2003 – 274 983 (by 12 September)

ArrayExpress components (figure) • Submissions: large-scale microarray facilities (MAGE-ML); smaller labs over the Internet (www) through MIAMExpress, the online submission tool • Queries, analysis: Expression Profiler, the online analysis tool; export to local analysis tools (MAGE-ML)

MIAMExpress • Online since December 1, 2002 • 2002 – 15 951 hits • 2003 – 112 871 hits by 12 September • So far ~20 submissions completed through MIAMExpress, i.e., about 25% of all experiments in ArrayExpress • MIAMExpress is open source software - installed in at least 15 labs (EMBL, RZPD, Leipzig, Leuven, Vancouver, VIB) • Tox-MIAMExpress – a specialised version for Toxicology

ArrayExpress infrastructure (figure) • Submissions: MIAMExpress (MySQL) over www; MIAMExpress local installations (Cambridge,…); MAGE-ML pipelines from local databases and LIMS (EMBL, TIGR), local databases (RZPD, Stanford) and array manufacturers (Affymetrix, Agilent) • ArrayExpress: repository (Oracle) • Access: queries through the query interface (Tomcat) over www; MAGE-ML retrieval; Expression Profiler over www; desktop data analysis software

Submissions by pipeline (in hybs)

ArrayExpress development (figure) • Submissions, with curation • Repository (MAGE-OM model): simple queries (species, author, lab, array types, etc) • Warehouse (simple gene-centric model): more complex queries (genes, expression levels, etc), with links back to the evidence • Database integration: Ensmart, hyperlinks to other databases

Gene expression data matrix (figure: genes by samples) • Sample annotations • Gene expression levels • Gene annotations

ArrayExpress development (figure) • Submissions, with curation • Repository (MAGE-OM model): simple queries (species, author, lab, array types, etc) • Warehouse (simple gene-centric model): more complex queries (genes, expression levels, etc), with links back to the evidence • Gene Expression Atlas: summarised information about which gene is expressed where • Database integration: Ensmart, hyperlinks to other databases

New in ArrayExpress • Password protected logins • Can be used to support anonymous refereeing of microarray papers • Discussions with Nature

Data growth in ArrayExpress (chart: hybs by year, 2002 to 2004; y-axis 1000 to 4000, with "?" beside 4000)

Distributed data collection (figure: many small labs, several national microarray centres, EMBL, Stanford, Sanger and TIGR, connected to ArrayExpress)

Data analysis tools • Expression Profiler – complete redevelopment of the earlier tool – new interface, new functionality, XML-based modularity – beta version will be ready by month 24 • J-Express (developed in Bergen) – talk by Inge Jonassen

Research • Microarray-based gene network analysis – 2 publications out, 1 in press, 1 submitted • S. pombe gene expression data analysis (in collaboration with the Sanger Institute) – publication in preparation • New algorithms for clustering and cluster comparison – 2 publications in preparation

Transcription factor binding network • Chromatin IP experiments on a chip (ChIP on chip) • Using microarrays to find genomic (intergenic) sequences, a few hundred bp long, where a particular transcription factor is likely to bind • ChIP on chip by Lee et al. (Science 2002) – binding site location data in the yeast genome for 107 transcription factors (of about 250 yeast transcription factors in total) • Identified around 4500 binding locations

ChIP on chip network by Lee et al

Gene disruption network (figure: nodes A, B, C, D for gene A to gene D; labels DA, DB, DC)

Data for over 200 gene disruptions in yeast – Hughes et al., Cell 102 (2000)
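One way to turn disruption data of this kind into a network can be sketched as follows: draw an arc from a disrupted gene to every other gene whose expression changes strongly in that disruption strain. The slides do not give the exact rule used, so the threshold, the input layout and the function name below are assumptions made for illustration.

```python
def disruption_network(log_ratios, threshold=1.0):
    """Build arcs (disrupted gene, affected gene) from expression changes.

    log_ratios maps each disrupted gene to a dict of
    {gene: log expression ratio versus wild type} (hypothetical layout).
    The threshold on the absolute log ratio is an illustrative choice.
    """
    arcs = set()
    for disrupted, profile in log_ratios.items():
        for gene, value in profile.items():
            # Skip the disrupted gene itself; keep only strong changes.
            if gene != disrupted and abs(value) >= threshold:
                arcs.add((disrupted, gene))
    return arcs


# Made-up log ratios for two disruption strains.
toy_data = {
    "geneA": {"geneA": -5.0, "geneB": 1.4, "geneC": 0.1},
    "geneB": {"geneA": 0.2, "geneC": -2.0, "geneD": 0.9},
}
toy_arcs = disruption_network(toy_data, threshold=1.0)
```

With these toy values the arcs are `("geneA", "geneB")` and `("geneB", "geneC")`. Raising the threshold gives a sparser network, lowering it a denser one.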

Mutation network for S. cerevisiae

Three networks in yeast • ChIP network (Lee et al) • Mutation network (Hughes et al) • In silico network – matching 38 experimentally known transcription factor binding sites (Pilpel et al) against yeast genome sequence
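The in silico network can be illustrated with a very small motif scan: a factor gets an arc to a gene when its binding site occurs in that gene's upstream sequence on either strand. Exact string matching stands in here for whatever matching method the original analysis used, and the factor name, motif and sequences are invented placeholders.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def has_site(sequence, motif):
    """True if the motif occurs on the given strand or on the opposite one."""
    return motif in sequence or reverse_complement(motif) in sequence


def in_silico_arcs(motifs, upstream):
    """Build arcs (factor, gene) for every motif hit.

    motifs maps factor name to its binding site string;
    upstream maps gene name to its upstream sequence (both hypothetical).
    """
    return {
        (factor, gene)
        for factor, motif in motifs.items()
        for gene, sequence in upstream.items()
        if has_site(sequence, motif)
    }


# Invented motif and sequences, for illustration only.
toy_motifs = {"TF1": "GATTC"}
toy_upstream = {
    "geneA": "TTGATTCAA",  # motif on the forward strand
    "geneB": "CCGAATCGG",  # motif on the reverse strand
    "geneC": "AAAAAAAAA",  # no motif
}
toy_network = in_silico_arcs(toy_motifs, toy_upstream)
```

Here `toy_network` contains arcs from `TF1` to `geneA` and `geneB` but not to `geneC`.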

Intersection of the networks • Red – 39 arcs present in all networks • Green – arcs present in at least 2 networks and adjacent to one of SWI4, SWI6 or MBP1
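The two arc classes on this slide can be reproduced with plain set operations if each network is stored as a set of (regulator, target) arcs. The sketch below uses that representation as an assumption; the function names are hypothetical and the toy arcs use made-up targets, not the real 39 arcs.

```python
from collections import Counter


def arcs_in_at_least(networks, k):
    """Return the arcs that occur in at least k of the given networks."""
    counts = Counter()
    for arcs in networks:
        counts.update(set(arcs))  # count each arc once per network
    return {arc for arc, n in counts.items() if n >= k}


def adjacent_to(arcs, genes):
    """Keep arcs whose source or target is one of the given genes."""
    return {(src, dst) for src, dst in arcs if src in genes or dst in genes}


# Toy arcs (regulator, target); targets g1..g4 and TF9 are placeholders.
chip = {("SWI4", "g1"), ("MBP1", "g2"), ("TF9", "g3")}
mutation = {("SWI4", "g1"), ("TF9", "g3"), ("TF9", "g4")}
in_silico = {("SWI4", "g1"), ("MBP1", "g2"), ("TF9", "g4")}

networks = [chip, mutation, in_silico]
in_all = arcs_in_at_least(networks, 3)  # the "red" class
in_two = adjacent_to(                   # the "green" class
    arcs_in_at_least(networks, 2), {"SWI4", "SWI6", "MBP1"}
)
```

In this toy case `in_all` holds only `("SWI4", "g1")`, while `in_two` also includes `("MBP1", "g2")`, which is supported by two of the three networks.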

How do ChIP-chip and disruption networks relate? (Figure: within the set of all genes, the regulation set of t, where t is a transcription factor, is compared with the effectual set of h, where h is a disrupted gene)
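A common way to ask whether a regulation set and an effectual set overlap more than expected by chance is a hypergeometric tail probability. The slides do not say which statistic was used, so the sketch below is an assumption, as are the function name and the reading of the two sets (genes bound by factor t, and genes affected by disrupting h).

```python
from math import comb


def overlap_pvalue(n_genes, regulation_set, effectual_set):
    """Probability of an overlap at least this large under random sampling.

    n_genes is the size of the "all genes" universe; both sets are assumed
    to be subsets of that universe. Uses the hypergeometric upper tail.
    """
    size_reg = len(regulation_set)
    size_eff = len(effectual_set)
    overlap = len(regulation_set & effectual_set)
    total = comb(n_genes, size_eff)
    tail = sum(
        comb(size_reg, i) * comb(n_genes - size_reg, size_eff - i)
        for i in range(overlap, min(size_reg, size_eff) + 1)
    )
    return tail / total


# Toy universe of 10 genes with made-up sets.
regulation = {"g1", "g2", "g3"}
identical = {"g1", "g2", "g3"}
partial = {"g1", "g8", "g9"}

p_identical = overlap_pvalue(10, regulation, identical)  # 1/120
p_partial = overlap_pvalue(10, regulation, partial)      # 85/120
```

A small value, as for the identical sets, suggests the binding and disruption data agree for that pair; a value near 1 suggests the overlap is what random sets of the same sizes would give.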
