Advancements in Gene Expression Data for Vector Biology at Imperial College London
The project aims to give vector biologists easy access to consistently processed gene expression data, and to give array specialists help with ArrayExpress submission, analysis tools and array annotation. Data are currently stored and analysed in BASE (BioArray Software Environment), with simpler summary views and a beta BioMart for casual users. Planned work covers community submission, manual annotation, controlled vocabularies and ontologies, programmatic access, and integration with genomic data. The platform also supports multi-site collaboration on pre-publication data and aims to support more advanced queries over gene expression datasets.
Presentation Transcript
Gene expression data in VectorBase Fotis Kafatos, George Christophides, Bob MacCallum & Seth Redmond Imperial College London (thanks also to EBI, Sanger and ND)
Outline • Project goals • What’s currently available • Current challenges and future plans
Project goals • For vector biologists: • Easy access to gene expression data • Consistent data processing • For array specialists: • ArrayExpress submission • Advanced analysis tools • Array annotation
[Diagram labels: Expression data, Bulk loader, Storage & analysis] • BASE: BioArray Software Environment • http://base.thep.lu.se/ • Open source, active development and user community • LIMS, data storage, export and analysis • Web-based, user/group access control • BASE 2.x adoption will bring Affy support
Data submission • Community submission guidelines available • First batch of experiments loaded by us • Bulk data loader • Sample/experiment annotation requires intervention from curators
[Diagram labels: ArrayExpress, Expression data, Bulk loader, ‘Public’ storage, Storage & analysis] • Data held in BASE is largely MIAME compliant • Script for semi-automated export in TAB2MAGE format • One experiment submitted so far
[Diagram labels: ArrayExpress, Expression data, Bulk loader, ‘Public’ storage, Storage & analysis, Data summaries] • BASE web interface offers a powerful and extendable analysis environment • Can be used for multi-site collaborations on pre-publication data • Steep learning curve, not 100% intuitive • Not easily linked to • We provide simpler views so the casual user can quickly draw biological inferences
Standardised data All displayed data is processed in the same way: • Poor quality spots removed • Currently using submitted spot flags • Normalisation • “lowess” for two-colour experiments
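The two processing steps on this slide can be sketched roughly as follows. This is a minimal pure-Python illustration, not the pipeline actually run in BASE: the function names, the boolean spot-flag convention, the `frac` smoothing span and the choice of a single-pass local linear fit with tricube weights (no robustness iterations) are all assumptions made for the example.

```python
import math


def ma_values(red, green):
    """Return (M, A): per-spot log2 ratio and mean log2 intensity."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at every x."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # neighbours used per local fit
    fitted = []
    for x0 in x:
        dists = [abs(xi - x0) for xi in x]
        h = sorted(dists)[k - 1] or 1e-12  # bandwidth: distance to k-th nearest point
        w = [(1.0 - min(d / h, 1.0) ** 3) ** 3 for d in dists]  # tricube weights
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        denom = sw * swxx - swx * swx
        if abs(denom) < 1e-12:
            fitted.append(swy / sw)  # degenerate window: fall back to weighted mean
        else:
            slope = (sw * swxy - swx * swy) / denom
            intercept = (swy - slope * swx) / sw
            fitted.append(intercept + slope * x0)
    return fitted


def normalise_two_colour(red, green, bad_flags, frac=0.5):
    """Drop flagged spots, then subtract the intensity-dependent trend from M."""
    good = [(r, g) for r, g, bad in zip(red, green, bad_flags) if not bad]
    reds = [r for r, _ in good]
    greens = [g for _, g in good]
    m, a = ma_values(reds, greens)
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

With a constant dye bias (for example, every red intensity exactly twice the green), the fitted trend is flat and the normalised log ratios come out at zero, which is the behaviour the normalisation step is meant to deliver.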
3 probe types, 6 array designs • Mapping handled via Ensembl pipeline: • Oligo: exonerate • PCR: e-PCR • cDNA: exonerate2genes [Diagram labels: ArrayExpress, Expression data, Bulk loader, Probe mapping, ‘Public’ storage, Storage & analysis, Data summaries]
[Diagram labels: VectorBase, ArrayExpress, Expression data, Genomic data, Bulk loader, Probe mapping, Automatic annotation, GFF3, ‘Public’ storage, Storage & analysis, Data summaries, Genome browser]
[Diagram labels: VectorBase, ArrayExpress, Expression data, Genomic data, Bulk loader, Probe mapping, Automatic annotation, ‘Public’ storage, Storage & analysis, Data summaries, Genome browser, Data mining; audiences: Array biologists, Genome biologists, Vector biologists]
BioMart • Beta version currently available • http://base.vectorbase.org:9999/biomart/martview • Improvements still needed: • Experiment annotations • Alignments (i.e. handling split alignments) • Federation with current marts • Integration with new data?
Current challenges and future plans • How do you want to query? • CVs & ontologies • APIs • Community submission • Manual annotation
Querying strategy • What do you want to query on? • Fetch all genes upregulated under condition X • Fetch all experiments with gene X and condition Y • Fetch all probes with expression similar to probe X • All essentially boil down to: • Define probes (genes etc.) • Define significant expression • ANOVA? • Up/down-regulation with respect to what? • Define experimental conditions • Sample annotation • Experimental design
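The first two example queries can be sketched over a flat list of measurements, which makes the three definitions on the slide explicit. Everything here is hypothetical: the `Measurement` record, the identifiers and condition names, and above all the fixed log2-ratio cut-off, which stands in for whatever definition of significant expression (ANOVA or otherwise) is eventually chosen.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    probe: str
    gene: Optional[str]  # None when the probe has no mapped gene
    experiment: str
    condition: str       # taken from the sample annotation
    log2_ratio: float    # relative to that experiment's reference sample


def genes_upregulated(measurements, condition, min_log2_ratio=1.0):
    """Fetch all genes upregulated under a condition (placeholder threshold)."""
    hits = {
        m.gene
        for m in measurements
        if m.gene is not None
        and m.condition == condition
        and m.log2_ratio >= min_log2_ratio
    }
    return sorted(hits)


def experiments_with(measurements, gene, condition):
    """Fetch all experiments that measured a gene under a condition."""
    return sorted(
        {m.experiment for m in measurements
         if m.gene == gene and m.condition == condition}
    )


DATA = [
    Measurement("probe_1", "GENE_A", "exp_1", "blood_fed", 2.3),
    Measurement("probe_2", "GENE_B", "exp_1", "blood_fed", 0.2),
    Measurement("probe_3", None, "exp_1", "blood_fed", 3.0),
    Measurement("probe_4", "GENE_C", "exp_2", "blood_fed", 1.5),
    Measurement("probe_1", "GENE_A", "exp_2", "infected", -1.8),
]

print(genes_upregulated(DATA, "blood_fed"))          # ['GENE_A', 'GENE_C']
print(experiments_with(DATA, "GENE_A", "infected"))  # ['exp_2']
```

Note that `probe_3` is dropped because it has no gene mapping, and that the queries only work if `condition` values are comparable across experiments, which is where sample annotation and experimental design come in.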
[Diagram labels: ArrayExpress, Expression data, Genomic data, Bulk loader, Probe mapping, Automatic annotation, CV / ontology, ‘Public’ storage, Storage & analysis, Data summaries, Genome browser, Data mining; audiences: Array biologists, Genome biologists, Vector biologists]
[Diagram labels: ArrayExpress, Expression data, Genomic data, Bulk loader, Probe mapping, Automatic annotation, CV / ontology, ‘Public’ storage, Storage & analysis, Data summaries, Genome browser, Data mining; interfaces: AE API?, e! API, MartJ / MQL, Array API?]
Array API • Perl / Java objects for retrieval / handling of array data • Dual purpose: • Consistency & efficiency of VB expression website • Computational access to VB data for all • Objects must be: • General, DB-independent • Compatible with pre-existing Bio APIs (BioPerl / BioJava) • NB: there may be a pre-existing solution: • ArrayExpress API? • BioPerl-Expression? • MAGE-OM-stk • http://neuron.cse.nd.edu/vectorbase/index.php/Array_API_proposal
Community data submission • Carrot? • Help with ArrayExpress submission • Analysis tools • Dissemination • Stick? • Outreach (courses, conferences) • Networking
GE data manual annotators • Gene-build designed arrays • Negative evidence less compelling • EST clone-based arrays • http://tinyurl.com/vlkwo
Longer term plans • Host-parasite GE data integration & analysis • GE clusters: “upstream” regions, regulatory elements, upstream TFs • RNAi phenotypes • Images
CVs & ontologies • Integrate MGED and specialist ontologies for • Body parts • Developmental stages • Disease processes • … • Allows comparison across experiments with similar experimental conditions
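As a rough illustration of why controlled terms make experiments comparable, the sketch below maps free-text sample annotations onto a tiny vocabulary and groups experiments by the resulting term. The synonym table, the term labels and the experiment names are invented for the example and are not taken from MGED or any specialist ontology.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: free-text synonym -> preferred term.
SYNONYMS = {
    "midgut": "midgut",
    "mid-gut": "midgut",
    "gut (mid)": "midgut",
    "salivary gland": "salivary gland",
    "salivary glands": "salivary gland",
}


def to_term(free_text):
    """Return the preferred term for a submitted annotation, or None if unknown."""
    return SYNONYMS.get(free_text.strip().lower())


def group_by_term(annotations):
    """Group (experiment, free_text) pairs by controlled term."""
    groups = defaultdict(list)
    for experiment, free_text in annotations:
        term = to_term(free_text)
        groups[term if term is not None else "UNMAPPED"].append(experiment)
    return dict(groups)


submitted = [
    ("exp_1", "Midgut"),
    ("exp_2", " mid-gut "),
    ("exp_3", "Salivary glands"),
    ("exp_4", "carcass"),
]
print(group_by_term(submitted))
```

Here `exp_1` and `exp_2` end up under the same term despite different spellings, while `exp_4` is left as `UNMAPPED`, the kind of case that would need curator attention.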
BioMart • Most biomarts: • Gene-based • Mostly ‘binary’ data, e.g. a gene either has a signal domain or doesn’t • Easily linked with other (gene-based) biomarts • VB BioMart: • Probe-based • Many probes not aligned • Expression data less clear, e.g. how to define ‘differential expression’ • Exports gene/transcript IDs for linking to other marts
Clustering • A priority? • Easy to do on reporter level within experiments • Harder to do at gene level across all experiments • Binary gene profile: “yes/no differentially expressed in experiment” ? • Amazon-style links to “genes which may have similar expression profiles”?
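One way the binary gene profile idea might work is sketched below: each gene gets a yes/no vector across experiments, and "similar" genes are ranked by Jaccard similarity of those vectors. The threshold, the similarity measure and the gene names are assumptions for illustration; the sketch also ignores the direction of regulation and assumes every gene was measured in every experiment, which the slide notes is the hard part in practice.

```python
def binary_profile(log2_ratios, threshold=1.0):
    """One True/False per experiment: differentially expressed or not."""
    return tuple(abs(v) >= threshold for v in log2_ratios)


def jaccard(p, q):
    """Shared 'yes' calls divided by experiments where either gene is 'yes'."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def similar_genes(profiles, query, top_n=3):
    """Rank other genes by profile similarity to the query gene."""
    target = profiles[query]
    scored = [
        (jaccard(target, profile), gene)
        for gene, profile in profiles.items()
        if gene != query
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, then by name
    return [(gene, score) for score, gene in scored[:top_n] if score > 0]


# Hypothetical per-gene log2 ratios across four experiments.
ratios = {
    "GENE_A": [2.0, 0.1, -1.5, 0.0],
    "GENE_B": [1.2, 0.3, -2.0, 0.2],
    "GENE_C": [0.0, 1.4, 0.1, 0.0],
    "GENE_D": [1.1, 0.0, 0.2, 1.3],
}
profiles = {gene: binary_profile(values) for gene, values in ratios.items()}
print(similar_genes(profiles, "GENE_A"))
```

For `GENE_A` this ranks `GENE_B` first (identical yes/no pattern), then `GENE_D` (one shared experiment out of three), and leaves out `GENE_C`, which shares none.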
BASE 2.x • Adoption delayed, now in progress • Brings Affymetrix support • Cleaner/modern interface • Better API (Java)