MIAME and ArrayExpress - a standard for microarray data annotation and a database to store it

Helen Parkinson, Microarray Informatics Team, European Bioinformatics Institute, Hinxton

Four parts of my talk • Microarray data standards • Ontologies for gene expression data • ArrayExpress - a public database for microarray data • Analysis tools at the EBI

The size of the datasets • Experiments: • ~100 000 different transcripts in human • ~320 cell types • 2000 compounds • 3 time points • 2 concentrations • 2 replicates • Data: • 8 × 10^11 data points • 1 × 10^15 bytes = 1 petabyte for Affymetrix (data from Jerry Lanfear)
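
The data-point figure on this slide is the product of the listed experimental factors. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the bytes-per-data-point figure is only what the slide's two totals imply, not a number stated in the talk.

```python
# Experimental factors as listed on the slide (key names are illustrative).
factors = {
    "transcripts": 100_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

# Total number of data points is the product of all factors.
data_points = 1
for count in factors.values():
    data_points *= count

print(f"{data_points:.2e} data points")  # 7.68e+11, i.e. roughly 8 x 10^11

# Storage per data point implied by a total of 1 petabyte (10^15 bytes).
bytes_per_point = 1e15 / data_points
print(f"about {bytes_per_point:.0f} bytes per data point")
```

The implied storage of roughly a kilobyte per data point suggests the petabyte estimate counts raw and intermediate data, not only the final expression values.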

Microarray data • Microarrays are widely used in experiments and are already producing massive amounts of data • These data have to be stored in a well organised and standard way if they are to be accessed and analysed by the wider research community • There is a general consensus that there is a need for a public repository for microarray data • It is much less clear what exactly should be stored in such a repository

A gene expression database from the data analyst's point of view [Figure: a gene expression matrix of gene expression levels, indexed by genes and samples, with gene annotations attached to the genes and sample annotations attached to the samples]

Three parts of a gene expression database • Gene annotation – can be given by links to gene sequence databases and GO (function, process, cell compartment) – not perfect, but let's not worry about it • Sample annotation – we do not have any external databases for sample description (except species taxonomy) – problem 1 • Gene expression matrix – what are the measurement units for gene expression levels? – problem 2
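
The three parts can be pictured as one small data structure. The sketch below is illustrative only: the class name, field names, identifiers and values are all hypothetical and are not taken from ArrayExpress.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionDatabase:
    """Toy container for the three parts named on the slide."""
    gene_annotations: dict = field(default_factory=dict)    # gene id -> links (sequence DB, GO)
    sample_annotations: dict = field(default_factory=dict)  # sample id -> description
    matrix: dict = field(default_factory=dict)              # (gene id, sample id) -> level

    def level(self, gene, sample):
        """Look up one expression level; its units are experiment-specific (problem 2)."""
        return self.matrix[(gene, sample)]

    def samples_where(self, key, value):
        """Find samples by annotation; this needs structured annotation (problem 1)."""
        return [s for s, ann in self.sample_annotations.items() if ann.get(key) == value]


# Hypothetical identifiers and values, for illustration only.
db = ExpressionDatabase()
db.gene_annotations["gene_A"] = {"sequence_db": "EMBL", "go_terms": ["process:example"]}
db.sample_annotations["sample_1"] = {"organism": "Mus musculus", "organism_part": "thymus"}
db.sample_annotations["sample_2"] = {"organism": "Mus musculus", "organism_part": "liver"}
db.matrix[("gene_A", "sample_1")] = 2.5
db.matrix[("gene_A", "sample_2")] = 0.8

print(db.samples_where("organism_part", "thymus"))  # ['sample_1']
```

The query by organism part only works because the annotation is stored as key-value pairs; with free-text descriptions the same search would need text matching.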

Problem/consideration 1 – sample annotation • Gene expression data only have meaning in the context of detailed sample descriptions • If the data are going to be interpreted by independent parties, sample information has to be searchable and in the database • Controlled vocabularies and ontologies (species, cell types, compound nomenclature, treatments, etc.) are needed for unambiguous sample description

Sample annotation - what can be done? • Few CVs and ontologies for sample description are available (species taxonomy, model organisms) • Some use of free text descriptions is unavoidable (curation workload) • Existing efforts to create such ontologies should be coordinated (MGED ontology working group) • Use existing ontologies and CVs wherever possible

Problem 2 – the lack of gene expression measurement units • What we would like to have • gene expression levels expressed in some standard units (e.g. molecules per cell) • reliability measure associated with each value (e.g. standard deviation) • What have we got • each experiment using different units • no reliability information

Comparing expression data [Figure sequence over three slides, images not preserved; surviving labels: "cm" and "inc" on the first slide, "?" and "?" on the second]

What to do in the absence of standard measurement units? • Record raw, intermediate and final analysis data together with a detailed annotation of how the analysis has been performed • This effectively passes the responsibility for interpreting the final analysis data on to the user

Three levels of microarray data processing [Figure: raw data (array scans) → quantitation matrices (spots × quantitations, holding spot quantitations) → gene expression data (genes × samples, holding gene expression levels)]
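
The three levels can be sketched as a small pipeline that keeps every level together with a note on how the last one was derived. This is a sketch under stated assumptions: it assumes a two-channel array and a background-subtracted log2 ratio averaged per gene, neither of which is specified in the talk, and the file name and numbers are made up.

```python
import math
from collections import defaultdict

# Level 1: raw data (array scans) would be image files; represented here by a name only.
raw_scan = "scan_001.tiff"  # hypothetical file name

# Level 2: spot quantitations derived from the scan (made-up numbers).
# Each spot: (gene id, ch1 foreground, ch1 background, ch2 foreground, ch2 background)
spots = [
    ("gene_A", 1200.0, 200.0, 700.0, 200.0),
    ("gene_A", 1000.0, 200.0, 600.0, 200.0),
    ("gene_B", 500.0, 100.0, 900.0, 100.0),
]


def spots_to_gene_levels(spots):
    """Level 3: one value per gene, here the mean log2 ratio over that gene's spots."""
    ratios = defaultdict(list)
    for gene, fg1, bg1, fg2, bg2 in spots:
        ratios[gene].append(math.log2((fg1 - bg1) / (fg2 - bg2)))
    return {gene: sum(vals) / len(vals) for gene, vals in ratios.items()}


gene_levels = spots_to_gene_levels(spots)

# Store all three levels plus the analysis method, so a user can re-interpret the result.
record = {
    "raw": raw_scan,
    "quantitations": spots,
    "gene_levels": gene_levels,
    "method": "mean of log2((fg1 - bg1) / (fg2 - bg2)) per gene",
}
print(record["gene_levels"])
```

Keeping the spot-level values lets a user who distrusts the chosen ratio recompute gene levels with a different method.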

Measurement units • In the longer term: • standard controls for experiments (on chips and in the samples) should be introduced • replicate measurements will become the norm • Temporary solution: • storing intermediate analysis results (including the images) and annotations of how they were obtained • standards within experiments themselves (standard controls and protocols)
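
Once replicate measurements are routine, each expression level can carry a reliability measure. A minimal sketch using the standard deviation follows; the replicate values are made up, and the choice of the sample standard deviation is an assumption, not something the talk prescribes.

```python
import statistics

# Made-up replicate measurements of one gene in one sample.
replicates = [2.0, 4.0, 4.0, 6.0]

mean_level = statistics.mean(replicates)
# Sample standard deviation (n - 1 in the denominator) as a simple reliability measure.
std_dev = statistics.stdev(replicates)

print(f"level = {mean_level:.2f} +/- {std_dev:.2f}")  # level = 4.00 +/- 1.63
```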

Standards for microarray data • Standards are needed to build a well organised microarray database • Standards for annotation • Standards for data exchange • Standards for controls in the experiment and data normalisation • www.dnachip.org/mged/normalization.html

How to create microarray data standards • Understand thoroughly what minimum information about a microarray experiment is needed to interpret it unambiguously, and what the structure of this information is (objects and relationships) • Create a technical data format able to capture this information • Find appropriate controlled vocabularies

Standardisation of microarray data and annotations - MGED group • The goal of the group is to facilitate the adoption of standards for DNA-array experiment annotation and data representation, as well as the introduction of standard experimental controls and data normalisation methods • Includes most of the world's largest microarray laboratories and companies (TIGR, Affymetrix, Stanford, Sanger, Agilent, etc.) • www.mged.org

MGED • MGED 2 meeting in Heidelberg in 2000, MGED 3 in Stanford in 2001, both ~300 participants • Minimum Information About a Microarray Experiment – MIAME version 1.0 posted • Collaboration with OMG on data formats: MAML + GEML = MAGE-ML and MAGE-OM • MGED 4 meeting in February 2002, Boston • MGED will become an ISCB Special Interest Group

MIAME – Minimum Information About a Microarray Experiment • 6 parts of a microarray experiment [Figure: Experiment, Array, Sample, Hybridisation, Data, Normalisation; external links: Source (e.g., Taxonomy), Gene (e.g., EMBL), Publication] • www.mged.org
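
One way to read the six parts is as a checklist that a submission either covers or does not. The sketch below is hypothetical: the lowercase keys, the function name and the example submission are illustrative choices, not part of MIAME or of any ArrayExpress tool.

```python
# The six parts named on the slide; the lowercase keys are an illustrative naming choice.
MIAME_PARTS = ("experiment", "array", "sample", "hybridisation", "data", "normalisation")


def missing_parts(submission):
    """Return the MIAME parts that are absent or empty in a submission dict."""
    return [part for part in MIAME_PARTS if not submission.get(part)]


# Hypothetical, incomplete submission.
submission = {
    "experiment": {"description": "dexamethasone treatment"},
    "sample": {"organism": "Mus musculus"},
    "data": {"file": "levels.txt"},
}

print(missing_parts(submission))  # ['array', 'hybridisation', 'normalisation']
```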

MIAME Section on Sample Source and Treatment • sample source and treatment ID as used in section 1 • organism (NCBI taxonomy) • additional "qualifier, value, source" list; the list includes: • cell source - provider • type (if derived from primary source(s)) • sex • age • growth conditions • development stage • organism part (tissue) • animal/plant strain or line • genetic variation (e.g., gene knockout, transgenic variation) • individual • individual genetic characteristics (e.g., disease alleles, polymorphisms) • disease state or normal • target cell type • cell line and source (if applicable) • in vivo treatments (organism or individual treatments) • in vitro treatments (cell culture conditions) • treatment type (e.g., small molecule, heat shock, cold shock, food deprivation) • compound • is additional clinical information available (link) • separation technique (e.g., none, trimming, microdissection, FACS) • laboratory protocol for sample treatment……
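
The "qualifier, value, source" list maps naturally onto a list of triples. A minimal sketch follows; the class and function names are hypothetical, the example values echo the sample description excerpt elsewhere in this talk, and the "free text" source label is an assumption used for illustration.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    qualifier: str  # what is being described, e.g. "sex"
    value: str      # the description itself
    source: str     # where the term comes from (vocabulary, ontology, or free text)


sample = {
    "organism": "Mus musculus",  # NCBI taxonomy
    "annotations": [
        Annotation("sex", "female", "MGED"),
        Annotation("organism part", "thymus", "GXD Mouse Anatomical Dictionary"),
        Annotation("growth conditions", "normal", "free text"),
    ],
}


def free_text_qualifiers(sample):
    """List qualifiers that are not yet backed by a controlled vocabulary."""
    return [a.qualifier for a in sample["annotations"] if a.source == "free text"]


print(free_text_qualifiers(sample))  # ['growth conditions']
```

Recording the source next to each value shows a curator which qualifiers still rely on free text and would benefit from a controlled vocabulary.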

What is an ontology? • An ontology is a specification of concepts that includes the relationships between those concepts. • Provides semantics and constraints • Allows for computational inferences and reliable comparisons

MGED Biomaterial Ontology • Under construction by Chris Stoeckert • Using OilEd (may use others) • Motivated by MIAME and coordinated with the database model • Extend classes, provide constraints, define terms, provide terms to use, develop CVs for submissions (EBI)

Use case scenario

Ontology Example • Concept=Age def=in standard units referenced to an identifiable time point from (class) developmental stage • Age=6 {units=days}, • {dev_stage}=dauer • Hierarchy=Dev_stage->larva->dauer
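
The hierarchy in this example is what makes computational inference possible: a sample annotated as dauer can be found by a query for larva. A minimal sketch follows, assuming the hierarchy is stored as child-to-parent links; the function name and the dictionary layout are illustrative, not the MGED ontology's actual representation.

```python
# Child -> parent ("is a kind of") links, taken from the hierarchy on the slide.
PARENT = {
    "dauer": "larva",
    "larva": "dev_stage",
}


def is_a(term, ancestor):
    """Return True if `term` equals `ancestor` or lies below it in the hierarchy."""
    while term is not None:
        if term == ancestor:
            return True
        term = PARENT.get(term)
    return False


# A sample annotated as on the slide: Age=6 days, developmental stage dauer.
sample = {"age": {"value": 6, "units": "days"}, "dev_stage": "dauer"}

# A query for larval samples matches, although the annotation never says "larva".
print(is_a(sample["dev_stage"], "larva"))  # True
print(is_a("larva", "dauer"))              # False
```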

Excerpts from a Sample Description, courtesy of M. Hoffman, S. Schmidtke, Lion BioSciences • Organism: Mus musculus [ NCBI taxonomy browser ] • Cell source: in-house bred mice (contact: person@somewhere.ac.uk) • Sex: female [ MGED ] • Age: 3 - 4 weeks after birth [ MGED ] • Growth conditions: normal • controlled environment • 20 - 22 °C average temperature • housed in cages according to EU legislation • specified pathogen free conditions (SPF) • 14 hours light cycle • 10 hours dark cycle • Developmental stage: stage 28 (juvenile (young) mice) [ GXD "Mouse Anatomical Dictionary" ] • Organism part: thymus [ GXD "Mouse Anatomical Dictionary" ] • Strain or line: C57BL/6 [ International Committee on Standardized Genetic Nomenclature for Mice ] • Genetic Variation: Inbr (J) 150. Origin: substrains 6 and 10 were separated prior to 1937. This substrain is now probably the most widely used of all inbred strains. Substrain 6 and 10 differ at the H9, Igh2 and Lv loci. Maint. by J,N, Ola. [ International Committee on Standardized Genetic Nomenclature for Mice ] • Treatment: in vivo [ MGED ] intraperitoneal injection of Dexamethasone into mice, 10 microgram per 25 g bodyweight of the mouse • Compound: drug [ MGED ] synthetic glucocorticoid Dexamethasone, dissolved in PBS

ArrayExpress conceptual model [Figure: Experiment, Array, Sample, Hybridisation, Data, Normalisation; external links: Source (e.g., Taxonomy), Gene (e.g., EMBL), Publication]

ArrayExpress object model

ArrayExpress – the state of the art • ArrayExpress Object model supporting MIAME requirements developed • Data model implemented in Oracle • Data loader from MAML file format • Expression Profiler – data analysis tool already available

ArrayExpress – plans and schedule • EU grant – new staff being recruited • A web based query interface - under development • A web based submission tool – under test • Participation in OMG – MAGE-OM & MAGE-ML • MAGE-ML will replace MAML in October • Full scale database operation expected to start at the beginning of 2002 • Expression Profiler to link to ArrayExpress

Microarray data analysis • Expression Profiler – a web based gene expression data analysis tool: www.ebi.ac.uk/microarray/

Expression Profiler - web based tool for microarray data analysis • http://www.ebi.ac.uk/microarray/ [Figure: components: EPCLUST (cluster expression profiles) working on expression data; GENOMES (sequence, function, annotation); URLMAP (provides links to external data, tools, pathways, function, etc.); SPEXS (Sequence Pattern Exhaustive Search) for novel patterns; PATMATCH for known patterns]

Conclusions • Microarray standardisation is a challenge and an imperative • Join MGED to contribute to this process: www.mged.org • Participate in the development of ontologies and controlled vocabularies • Send me your protocols • Make your data available • Give feedback on MIAME; it is up for discussion

Acknowledgments • Microarray Informatics Team, EBI: Alvis Brazma, Katja Kivinen, Helen Parkinson, Olga Perez, Johan Rung, Ugis Sarkans, Thomas Schlitt, Mohammad Shojatalab, Lev Soinov, Koichi Tazaki, Jaak Vilo • Industry Support team, EBI: Alan Robinson • MGED steering committee • MIAME working group • Chris Stoeckert, U. Penn. and MGED

Useful URLs • www.mged.org • www.tigr.org • www.ebi.ac.uk/array • www.geneontology.org • www.hgmp.mrc.ac.uk • www.dnachip.org/mged/normalization.html • Contact: parkinson@ebi.ac.uk
