
ArrayExpress and Gene Expression Atlas:

ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data. Amy Tang, PhD, ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI. What’s covered this morning? http://www.ebi.ac.uk/training/course/bioinformatics-udine2013





Presentation Transcript


  1. ArrayExpress and Gene Expression Atlas: Mining Functional Genomics Data. Amy Tang, PhD, ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI

  2. What’s covered this morning? http://www.ebi.ac.uk/training/course/bioinformatics-udine2013
     • What do we mean by “functional genomics data”? Why do we need databases for them?
     • Two databases:
       • ArrayExpress
       • Expression Atlas
     • What’s in each database, and how to browse, search, interpret and download data
     • (Microarray/sequencing data analysis; how to submit data to ArrayExpress)

  3. Functional genomics (FG) data
     • The aim of FG is to understand the function of genes and other (non-genic) parts of the genome
     • Often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS])
     • Questions addressed:
       • Gene expression - when? where? how much? changes?
       • Gene function - roles of genes in cellular processes, pathways
       • Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation

  4. Example of FG data sets in ArrayExpress
     • Questions addressed:
       • Gene expression - when? where? how much? changes?
       • Gene function - roles of genes in cellular processes, pathways

  5. Example of FG data sets in ArrayExpress
     • Questions addressed:
       • Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation

  6. The two databases: how are they related? [Diagram labels: Direct submission; Import from external databases (mainly NCBI Gene Expression Omnibus); Curation; Statistical analysis; ArrayExpress; Expression Atlas; Links to other databases; Links to analysis software]

  7. The two databases: how do they compare?

  8. ArrayExpress www.ebi.ac.uk/arrayexpress
     • Public repository for functional genomics data (both microarray and sequencing)
     • Together with GEO at NCBI and CIBEX at DDBJ, serves the scientific community as a data archive supporting publications
     • Provides access to curated data in a structured and standardised format – essential for easy sharing of experimental information
     • Submissions are curated based on community standards:
       • MIAME guidelines & MAGE-TAB format for microarray data
       • MINSEQE guidelines & MAGE-TAB format for HTS data

  9. Community standards for data requirements
     • MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html)
     • MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe)
     • The checklist:

  10. What is an experimental factor?
     • The main variable(s) studied: the independent variable(s), often related to the hypothesis of the experiment.
     • Values of the factor (“factor values”) should vary.

  11. Reporting standards - MAGE-TAB format
     A simple spreadsheet format that uses a number of tab-delimited text files:
     • ADF, Array Design Format file (microarray only)
       • Describes probes on an array, e.g. sequence, genomic mapping location
     • IDF, Investigation Description Format file
       • Experiment title
       • Experiment description
       • Submitter’s contact details
       • Definition of all protocols
     • SDRF, Sample and Data Relationship Format file
       • Starting materials with annotation
       • Derived materials (e.g. RNA extracts)
       • All assays (hybs/seq. lanes)
       • Resulting data file(s) for each assay
     • Raw and processed data files (e.g. A1.CEL, 1.fq.gz, 2.fq.gz, Normalized.txt)
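Because MAGE-TAB files are plain tab-delimited text, an SDRF can be read with standard tools. The following is a minimal sketch, assuming a heavily trimmed SDRF; the sample names, factor name and rows are invented for illustration, and real SDRF files carry many more columns.

```python
import csv
import io

# Hypothetical, trimmed SDRF content (illustrative rows, not from a real experiment).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tFactor Value[disease]\tArray Data File\n"
    "sample 1\tHomo sapiens\tnormal\tA1.CEL\n"
    "sample 2\tHomo sapiens\tleukemia\tA2.CEL\n"
)

def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def files_by_factor_value(rows, factor_column, file_column="Array Data File"):
    """Group data file names by the value of one experimental factor."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[factor_column], []).append(row[file_column])
    return grouped

rows = read_sdrf(SDRF_TEXT)
groups = files_by_factor_value(rows, "Factor Value[disease]")
print(groups)
```

Grouping files by factor value mirrors what the SDRF is for: linking each sample and its annotation to the data file produced by its assay.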

  12. MAGE-TAB Example: IDF

  13. MAGE-TAB Example: SDRF

  14. How much data in ArrayExpress? (as of 29 Oct 2013)

  15. HTS data in ArrayExpress (as of 29 October 2013) [Charts: microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown]

  16. Browsing ArrayExpress www.ebi.ac.uk/arrayexpress

  17. Browsing ArrayExpress experiments www.ebi.ac.uk/arrayexpress/experiments/browse.html All columns can be sorted by clicking on the heading.

  18. File download on the Browse page
     • Direct download link (e.g. here it’s for a single raw data archive [i.e. *.zip] file)
     • A link to a page which lists all the archive files available for download (no direct link because there is more than one archive)
     • Specifically for HTS experiments: direct link to the European Nucleotide Archive (ENA) page which lists all the sequencing assays (called “runs” at the ENA)

  19. ArrayExpress single-experiment view
     • Sample characteristics, factors and factor values
     • The microarray design used
     • MIAME or MINSEQE scores (* = compliant)
     • All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object)
     • Send data to GenomeSpace and analyse it yourself

  20. Samples view – microarray experiment
     • All columns can be sorted by clicking on the heading
     • Direct link to data files for one sample
     • Sample characteristics
     • Factor values
     • Scroll left and right to see all sample characteristics and factor values

  21. Samples view – sequencing experiment
     • Direct link to the European Nucleotide Archive (ENA) record about this sequencing assay
     • Direct link to fastq files at the European Nucleotide Archive (ENA)

  22. Searching for experiments in ArrayExpress www.ebi.ac.uk/arrayexpress/experiments/browse.html

  23. Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo
     • Ontology: a way to systematically organise experimental factor terms: controlled vocabulary + hierarchy (relationships)
     • Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue)
     • Combines terms from a subset of well-maintained and compatible ontologies, e.g.
       • Gene Ontology (cellular component + biological process terms)
       • NCBI Taxonomy
     • Ontology in layman’s terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html

  24. Building EFO - an example: (1) take all experimental factors, (2) find the logical connections between them, (3) organize them in an ontology. [Diagram: the terms disease, neoplasm, cancer, sarcoma and Kaposi’s sarcoma are connected by “is the parent term”, “is a type of” and “is synonym of” relations and arranged into an expandable hierarchy.]
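A small data-structure sketch may make the "controlled vocabulary + hierarchy" idea concrete. The relations below are one plausible reading of the slide's flattened diagram and should be treated as an assumption, not as the actual content of EFO; the function names are hypothetical.

```python
# Assumed relations (illustrative only; not taken from the real EFO).
IS_A = {
    "neoplasm": "disease",
    "sarcoma": "neoplasm",
    "Kaposi's sarcoma": "sarcoma",
}
SYNONYMS = {"cancer": "neoplasm"}  # synonym -> preferred term

def canonical(term):
    """Map a synonym to its preferred term; other terms pass through unchanged."""
    return SYNONYMS.get(term, term)

def ancestors(term):
    """Follow 'is a type of' links from a term up to the root of the hierarchy."""
    chain = []
    current = canonical(term)
    while current in IS_A:
        current = IS_A[current]
        chain.append(current)
    return chain

print(ancestors("Kaposi's sarcoma"))
print(ancestors("cancer"))
```

Storing only the direct parent of each term is enough to recover the full lineage, which is what lets a search for a broad term find experiments annotated with a narrow one.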

  25. Exploring EFO - an example

  26. Experimental factor ontology (EFO) http://www.ebi.ac.uk/efo
     EFO was developed to:
     • increase the richness of annotations in databases
     • expand on search terms when querying ArrayExpress and Expression Atlas
       • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”)
       • using child terms (e.g. “bone” → “rib” and “vertebra”)
     • promote consistency (e.g. F/female/, 1day/24hours)
     • facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
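Query expansion with synonyms and child terms can be sketched as follows. The two lookup tables hold only the examples quoted on the slide; the function name `expand_query` and the table layout are assumptions for illustration, not how ArrayExpress implements its search.

```python
# Lookup tables seeded with the slide's own examples.
SYNONYMS = {"cerebral cortex": {"adult brain cortex"}}
CHILDREN = {"bone": {"rib", "vertebra"}}

def expand_query(term):
    """Return the term plus its synonyms and all descendant (child) terms."""
    expanded = {term}
    expanded |= SYNONYMS.get(term, set())
    stack = list(CHILDREN.get(term, ()))
    while stack:
        child = stack.pop()
        if child not in expanded:
            expanded.add(child)
            # Children of children are included too.
            stack.extend(CHILDREN.get(child, ()))
    return expanded

print(sorted(expand_query("bone")))
print(sorted(expand_query("cerebral cortex")))
```

A term with no entry in either table is returned on its own, so plain keyword searches keep working for terms the ontology does not cover.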

  27. Searching ArrayExpress using EFO terms and filters
     • Enter keyword, click search, then filter.
     • Filter your search results by:
       • Species of interest
       • Array design (platform)
       • Molecule (DNA, RNA, protein, etc.)
       • Technology (microarray or HTS)
     • “Auto-complete” with suggestions (like Google search)
     • Avoid acronyms as search terms

  28. What search terms can I use?
     • ArrayExpress accession number, e.g. “E-MEXP-568”
     • Secondary accession number, e.g. GEO series “GSE5389”
     • Experiment title, description
     • Submitter's email address
     • Publication title, authors and journal name, PubMed ID
     • Sample attributes and experimental factor / factor values*:
       • “genetic modification” “heart” “diabetes”
       • “neural stem cells” “penicillin” “ChIP-chip”
       • “methylation profiling” “Arabidopsis” “p53”
     * Powered by EFO expansion. Use EFO terms wherever possible.

  29. Example search: “leukemia”
     • Exact match to search term
     • Matched EFO synonyms of search term
     • Matched EFO child term of search term

  30. Advanced search • Allows you to restrict your search to a specific field • Format of search term: field_name:search_term • Some examples: • More examples: https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
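The `field_name:search_term` format can be illustrated by splitting a query into its two parts. This is only a sketch of the syntax; the field names used in the example (`organism`, `exptype`) are placeholders, and the help page linked on the slide is the reference for which fields the search actually accepts.

```python
def parse_field_query(query):
    """Split 'field_name:search_term' into (field, term).

    Returns (None, keyword) for a plain keyword without a field prefix.
    """
    field, sep, term = query.partition(":")
    if not sep:
        return None, query.strip()
    # Surrounding double quotes around a multi-word term are dropped.
    return field.strip(), term.strip().strip('"')

# Field names here are placeholders for illustration.
print(parse_field_query('organism:"homo sapiens"'))
print(parse_field_query("exptype:ChIP-seq"))
print(parse_field_query("leukemia"))
```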

  31. QUESTIONS?

  32. Hands-on exercises
     • Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma
     • Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin

  33. The two databases [Diagram labels: Direct submission; Import from external databases (mainly NCBI Gene Expression Omnibus); Curation; Statistical analysis; ArrayExpress; Expression Atlas; Links to other databases; Links to analysis software]

  34. The two databases: how do they compare?

  35. Atlas construction - experiment selection criteria
     • Array (platform) designs relating to the experiment must be provided. Probe annotation must be adequate to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID)
     • At least 3 replicates for each value of the experimental factor
     • Maximum 4 experimental factors
     • Adequate sample annotation using EFO terms
     • Presence of raw data files: CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments
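Two of these criteria are numeric and easy to check programmatically. The sketch below covers only the replicate and factor-count rules; the input layout (one dict of factor name to factor value per assay) and the function name are assumptions made for illustration, and the other criteria (array design, EFO annotation, raw data files) are not modelled.

```python
from collections import Counter

MIN_REPLICATES = 3  # "At least 3 replicates for each value of the experimental factor"
MAX_FACTORS = 4     # "Maximum 4 experimental factors"

def meets_replicate_and_factor_criteria(samples):
    """samples: list of dicts mapping factor name -> factor value, one per assay."""
    factors = {name for sample in samples for name in sample}
    if not factors or len(factors) > MAX_FACTORS:
        return False
    for name in factors:
        # Count how many assays share each value of this factor.
        counts = Counter(sample.get(name) for sample in samples)
        if min(counts.values()) < MIN_REPLICATES:
            return False
    return True

good = [{"disease": "normal"}] * 3 + [{"disease": "leukemia"}] * 3
bad = [{"disease": "normal"}] * 3 + [{"disease": "leukemia"}] * 2
print(meets_replicate_and_factor_criteria(good))
print(meets_replicate_and_factor_criteria(bad))
```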

  36. Atlas construction – analysis pipeline. A dummy example from one experiment: input data (Affy CEL, non-Affy processed) → linear model* (Bioconductor limma) → output: 2-D matrix of genes × conditions (Cond.1, Cond.2, Cond.3), where 1 = differentially expressed and 0 = not differentially expressed. * More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full

  37. Atlas construction – analysis pipeline. How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?” [Diagram: each point is a single expression value for gene X; the Cond.1, Cond.2 and Cond.3 means are shown alongside the mean of all samples.] Compare and calculate the statistic.
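The comparison of each condition mean with the mean of all samples can be sketched with a plain t-like statistic. This is a simplified stand-in and not the limma linear model the pipeline uses: it assumes equal variance across conditions, pools the within-condition variance, and uses invented expression values; it also assumes the pooled variance is non-zero.

```python
from math import sqrt
from statistics import mean

def condition_vs_overall(expression):
    """expression: dict condition -> list of expression values for one gene.

    Returns, per condition, (condition mean - mean of all samples) divided by
    the standard error of that difference under a pooled-variance assumption.
    """
    all_values = [v for values in expression.values() for v in values]
    n_total, n_conditions = len(all_values), len(expression)
    grand_mean = mean(all_values)
    # Pooled within-condition variance (residual variance of a one-way layout).
    ss_within = sum(
        (v - mean(values)) ** 2 for values in expression.values() for v in values
    )
    pooled_var = ss_within / (n_total - n_conditions)
    stats = {}
    for condition, values in expression.items():
        diff = mean(values) - grand_mean
        # Var(condition mean - grand mean) = sigma^2 * (1/n_c - 1/N).
        se = sqrt(pooled_var * (1 / len(values) - 1 / n_total))
        stats[condition] = diff / se
    return stats

# Invented values for one gene in three conditions.
gene_x = {"Cond.1": [1, 2, 3], "Cond.2": [1, 2, 3], "Cond.3": [7, 8, 9]}
stats = condition_vs_overall(gene_x)
print(stats)
```

A large positive or negative value suggests the gene is expressed above or below its overall level in that condition; turning the statistic into a differential-expression call would additionally need a significance threshold and multiple-testing correction, which this sketch leaves out.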

  38. Atlas construction - analysis pipeline. Apply linear modelling statistics to each of the n experiments: a statistical test is run per gene in Exp. 1 (Cond.1, Cond.2, Cond.3), Exp. 2 (Cond.4, Cond.5, Cond.6), …, Exp. n (Cond.X, Cond.Y, Cond.Z). Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition.

  39. Atlas construction - result: summary of the “verdicts” from different experiments
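Summarising per-experiment verdicts amounts to counting votes per condition. The sketch below assumes a simple tuple layout and three call labels (`up`, `down`, `none`); the experiment names are placeholders and the layout is not the Atlas's internal format.

```python
from collections import Counter

def summarise_verdicts(verdicts):
    """verdicts: iterable of (experiment, condition, call) for one gene,
    with call being 'up', 'down' or 'none' (not differentially expressed)."""
    summary = {}
    for _experiment, condition, call in verdicts:
        summary.setdefault(condition, Counter())[call] += 1
    return summary

# Placeholder experiments and calls for a single gene.
verdicts = [
    ("Exp.1", "Cond.1", "up"),
    ("Exp.2", "Cond.1", "up"),
    ("Exp.3", "Cond.1", "none"),
    ("Exp.1", "Cond.2", "down"),
]
summary = summarise_verdicts(verdicts)
print(summary["Cond.1"]["up"], summary["Cond.1"]["none"], summary["Cond.2"]["down"])
```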

  40. Expression Atlas home page http://www.ebi.ac.uk/gxa
     • Query for genes
     • Query for conditions
     • Restrict query by direction of differential expression (up, down, both, neither)
     • The ‘advanced query’ option allows building more complex queries

  41. Mapping microarray probes to genes
     • Every (~monthly) Atlas release takes the latest Ensembl gene – probe identifier mapping data.
     • From Ensembl genes, we also get:
       • Compara genes
       • External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms
     [Diagram: probe identifiers, each with expression data per probe, mapped to Ensembl genes]
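The probe-to-gene step is essentially a join between a mapping table and per-probe expression data. In this sketch the probe 210040_at and the gene label KCC2 are taken from the deck's later example, while the second probe, the expression values and the function name are invented for illustration.

```python
def gene_level_expression(probe_values, probe_to_gene):
    """Collect per-probe expression values under the gene each probe maps to."""
    per_gene = {}
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a current mapping are skipped
        per_gene.setdefault(gene, {})[probe] = value
    return per_gene

# Mapping as it might look for one release (illustrative only).
probe_to_gene = {"210040_at": "KCC2"}
# Invented expression values; "unmapped_probe_at" is a hypothetical identifier.
probe_values = {"210040_at": 8.4, "unmapped_probe_at": 5.1}

per_gene = gene_level_expression(probe_values, probe_to_gene)
print(per_gene)
```

Keeping the mapping separate from the expression data means a new release only needs a refreshed `probe_to_gene` table; the stored per-probe values stay untouched.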

  42. Example Atlas search: KCC2 gene and BPA
     Scenario: You study the health impact of Bisphenol A (BPA).
     • BPA: common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development.
     • PNAS paper (Yeo et al., 2013): Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter.
     • BPA → epigenetic downregulation → potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓
     Your questions:
     • What is the human KCC2 gene? What is its general expression pattern?
     • In which human organ/tissue is the KCC2 gene differentially expressed?
     • What is the expression pattern of KCC2/Kcc2 orthologues?

  43. Gene search: human KCC2 gene

  44. (1) Summarised expression data for one gene
     • Default: sort by levels of differential expression
     • Group by experimental factor / intent
     • Clicking on a factor/condition changes the profile display

  45. (2) The anatomogram

  46. (3) Detailed expression profile. Drill down to 1 probe (210040_at), mapped to 1 gene (KCC2), in 1 experiment (E-GEOD-3526). Samples mapped to the “brain” experimental factor by EFO (marked *).

  47. (4) Jump to orthologues from gene summary. Orthology comes from the Ensembl Compara database.

  48. (5) Compare orthologues with parallel heatmaps

  49. Atlas ‘condition-only’ query

  50. Atlas ‘condition-only’ query (cont’d): heatmap view
