
ArrayExpress and Gene Expression Atlas:

Mining Functional Genomics Data. Amy Tang, PhD (amytang@ebi.ac.uk), ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI.


Presentation Transcript


  1. ArrayExpress and Gene Expression Atlas:

    Mining Functional Genomics Data. Amy Tang, PhD (amytang@ebi.ac.uk), ArrayExpress Production Team, Functional Genomics Group, EMBL-EBI
  2. What’s covered this morning? (http://www.ebi.ac.uk/training/course/bioinformatics-transcriptomics-data-and-tools-cambridge-uk) What do we mean by “functional genomics data”? Why do we need databases for them? Two databases: ArrayExpress and Expression Atlas. What’s in each database, and how to browse, search, interpret and download data. (Microarray/sequencing data analysis; how to submit data to ArrayExpress.)
  3. Functional genomics (FG) data. The aim of FG is to understand the function of genes and other (non-genic) parts of the genome. It often involves high-throughput technologies (microarrays, high-throughput sequencing [HTS]). Questions addressed: Gene expression - when? where? how much? changes? Gene function - roles of genes in cellular processes and pathways. Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation.
  4. Example of FG data sets in ArrayExpress. Questions addressed: Gene expression - when? where? how much? changes? Gene function - roles of genes in cellular processes and pathways.
  5. Example of FG data sets in ArrayExpress. Questions addressed: Gene/genome regulation - e.g. histone modifications, CpG (DNA) methylation.
  6. The two databases: how are they related? Diagram labels: direct submission; import from external databases (mainly NCBI Gene Expression Omnibus); curation; statistical analysis; ArrayExpress; Expression Atlas; links to other databases; links to analysis software.
  7. The two databases: how do they compare?
  8. ArrayExpress: www.ebi.ac.uk/arrayexpress. Public repository for functional genomics data (both microarray and sequencing). Together with GEO at NCBI and CIBEX at DDBJ, it serves the scientific community as a data archive supporting publications. Provides access to curated data in a structured and standardised format, which is essential for easy sharing of experimental information. Submissions are curated based on community standards: MIAME guidelines and MAGE-TAB format for microarray data; MINSEQE guidelines and MAGE-TAB format for HTS data.
  9. Community standards for data requirements. MIAME = Minimum Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html). MINSEQE = Minimum Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe). The checklist:
  10. What is an experimental factor? The main variable(s) studied: the independent variable, usually tied to the hypothesis of the experiment, e.g. “genotype”. The “factor values” of the samples should vary (e.g. “p53-/-”, “wild type”).
  11. Reporting standards - MAGE-TAB format. A simple spreadsheet format that uses a number of tab-delimited text files. Investigation Description Format (IDF) file: experiment title, experiment description, submitter’s contact details, definition of all protocols. Sample and Data Relationship Format (SDRF) file: starting materials with annotation, derived materials (e.g. RNA extracts), all assays (hybridisations/sequencing lanes), resulting data file(s) for each assay. Array Design Format (ADF) file (microarray only): describes probes on an array, e.g. sequence, genomic mapping location. Raw and processed data files, e.g. A1.CEL, 1.fq.gz, 2.fq.gz, Normalized.txt.
  12. MAGE-TAB Example: IDF
  13. MAGE-TAB Example: SDRF
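Because MAGE-TAB files are plain tab-delimited text, an SDRF can be read with standard tools. The following is a minimal Python sketch, not an official parser: the three-column table is a made-up, heavily trimmed SDRF (real files carry many more columns), and the helper names `read_sdrf` and `factor_values` are hypothetical.

```python
import csv
import io

# Made-up, heavily trimmed SDRF content (assumption: real SDRF files have
# many more columns, e.g. protocol references and derived materials).
SDRF_TEXT = (
    "Source Name\tFactor Value[genotype]\tArray Data File\n"
    "sample 1\twild type\tA1.CEL\n"
    "sample 2\tp53-/-\tA2.CEL\n"
)


def read_sdrf(text):
    """Return one dict per SDRF row, keyed by column header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def factor_values(rows, factor):
    """Collect the distinct values of one experimental factor across samples."""
    column = f"Factor Value[{factor}]"
    return sorted({row[column] for row in rows})


rows = read_sdrf(SDRF_TEXT)
print(factor_values(rows, "genotype"))
```

Listing the distinct factor values is a quick way to confirm that they actually vary between samples, as an experimental factor requires.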
  14. How much data in ArrayExpress? (as of 29 Oct 2013)
  15. HTS data in ArrayExpress (as of 29 October 2013). Microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown.
  16. Browsing ArrayExpress: www.ebi.ac.uk/arrayexpress
  17. Browsing ArrayExpress experiments: www.ebi.ac.uk/arrayexpress/experiments/browse.html. All columns can be sorted by clicking on the heading.
  18. File download on the Browse page. Direct download link (here, for a single raw data archive [*.zip] file). A link to a page that lists all the archive files available for download (no direct link because there is more than one archive). Specifically for HTS experiments: a direct link to the European Nucleotide Archive (ENA) page listing all the sequencing assays (called “runs” at the ENA).
  19. ArrayExpress single-experiment view. Sample characteristics, factors and factor values. The microarray design used. MIAME or MINSEQE scores (* = compliant). All files related to this experiment (e.g. IDF, SDRF, array design, raw data, R object). Send data to GenomeSpace and analyse it yourself.
  20. Samples view - microarray experiment. All columns can be sorted by clicking on the heading. Direct link to data files for one sample. Sample characteristics. Factor values. Scroll left and right to see all sample characteristics and factor values.
  21. Samples view - sequencing experiment. Direct link to the European Nucleotide Archive (ENA) record about this sequencing assay. Direct link to fastq files at the ENA.
  22. Searching for experiments in ArrayExpress: www.ebi.ac.uk/arrayexpress/experiments/browse.html
  23. Experimental factor ontology (EFO): http://www.ebi.ac.uk/efo. Ontology: a way to systematically organise experimental factor terms; a controlled vocabulary plus a hierarchy (relationships). Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue). Combines terms from a subset of well-maintained and compatible ontologies, e.g. Gene Ontology (cellular component + biological process terms), NCBI Taxonomy. Ontology in layman’s terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html
  24. Building EFO - an example. Take all experimental factors, find the logical connections between them, and organise them in an ontology. Example: “disease” is the parent term; “neoplasm” is a type of disease; “cancer” is a synonym of neoplasm; “sarcoma” is a type of cancer; “Kaposi’s sarcoma” is a type of sarcoma.
  25. Exploring EFO - an example
  26. Experimental factor ontology (EFO): http://www.ebi.ac.uk/efo. EFO was developed to: increase the richness of annotations in databases; expand search terms when querying ArrayExpress and Expression Atlas, using synonyms (e.g. “cerebral cortex” = “adult brain cortex”) and child terms (e.g. “bone” → “rib” and “vertebra”); promote consistency (e.g. F/female, 1 day/24 hours); facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically).
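The query expansion described on this slide can be illustrated with a toy ontology. This is only a sketch in Python: the `CHILDREN` and `SYNONYMS` tables hold just the examples quoted on the slides, not the real EFO, and `expand` is a hypothetical helper rather than the ArrayExpress implementation.

```python
# Toy ontology fragment built from the slide examples (not the real EFO).
CHILDREN = {
    "bone": ["rib", "vertebra"],
    "sarcoma": ["Kaposi's sarcoma"],
}
SYNONYMS = {
    "cerebral cortex": ["adult brain cortex"],
}


def expand(term):
    """Return the term plus its synonyms and all descendant (child) terms."""
    terms = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in terms:
            continue
        terms.add(current)
        terms.update(SYNONYMS.get(current, []))
        stack.extend(CHILDREN.get(current, []))
    return terms


print(sorted(expand("bone")))
print(sorted(expand("cerebral cortex")))
```

A search for “bone” would then also match samples annotated only as “rib” or “vertebra”, which is the behaviour the slide describes.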
  27. Searching ArrayExpress using EFO terms and filters. Enter a keyword, click search, then filter. The search box auto-completes with suggestions (like Google search). Avoid acronyms as search terms. Filter your search results by: species of interest; array design (platform); molecule (DNA, RNA, protein, etc.); technology (microarray or HTS).
  28. What search terms can I use? ArrayExpress accession number, e.g. “E-MEXP-568”. Secondary accession number, e.g. GEO series “GSE5389”. Experiment title, description. Submitter's email address. Publication title, authors and journal name, PubMed ID. Sample attributes and experimental factors / factor values*: “genetic modification”, “heart”, “diabetes”, “neural stem cells”, “penicillin”, “ChIP-chip”, “methylation profiling”, “Arabidopsis”, “p53”. (* Powered by EFO expansion. Use EFO terms wherever possible.)
  29. Example search: “leukemia”. Results are highlighted by match type: exact match to the search term; match to an EFO synonym of the search term; match to an EFO child term of the search term.
  30. Advanced search. Allows you to restrict your search to a specific field. Format of search term: field_name:search_term. More examples: https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
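The `field_name:search_term` format lends itself to a small query builder. The Python sketch below makes several assumptions: the field names `organism` and `exptype` are illustrative and should be checked against the ArrayExpress help page, multi-word values are assumed to need double quotes, and terms are simply joined with spaces. `build_query` is a hypothetical helper.

```python
def build_query(keywords="", **fields):
    """Combine free-text keywords with field_name:search_term restrictions."""
    parts = [keywords] if keywords else []
    for name, value in fields.items():
        # Assumption: multi-word values are quoted to stay attached to the field.
        text = f'"{value}"' if " " in value else value
        parts.append(f"{name}:{text}")
    return " ".join(parts)


# Field names here are assumptions, not confirmed by the slides.
print(build_query("leukemia", organism="homo sapiens", exptype="RNA-seq"))
```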
  31. QUESTIONS?
  32. Hands-on exercise 1: Find RNA-seq assays studying human prostate adenocarcinoma. Hands-on exercise 2: Find experiments studying the effect of sodium dodecyl sulphate on human skin.
  33. The two databases. Diagram labels: direct submission; import from external databases (mainly NCBI Gene Expression Omnibus); curation; statistical analysis; ArrayExpress; Expression Atlas; links to other databases; links to analysis software.
  34. The two databases: how do they compare?
  35. Atlas experiment selection criteria. At least 3 replicates for each value of the experimental factor, and a maximum of 4 factors. Adequate sample annotation using EFO terms. Adequate array (platform) design to map probes to genes and allow re-annotation of external references (e.g. Ensembl gene ID, UniProt ID). RNA-seq experiments: good-quality reads and a reference genome build. Presence of good-quality raw data files, e.g. CEL raw data files for Affymetrix assays, fastq files for RNA-seq experiments.
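The replicate and factor-count criteria can be checked mechanically from sample annotation. This Python sketch is one possible reading of the rule, not the Atlas curation code: it assumes each factor is checked on its own (not combinations of factors), and `meets_design_criteria` is a hypothetical helper.

```python
from collections import Counter

MIN_REPLICATES = 3  # from the slide: at least 3 replicates per factor value
MAX_FACTORS = 4     # from the slide: maximum 4 factors


def meets_design_criteria(samples):
    """samples: list of dicts mapping factor name -> factor value, one per assay."""
    factors = {name for sample in samples for name in sample}
    if not samples or len(factors) > MAX_FACTORS:
        return False
    for factor in factors:
        # Assumption: replicates are counted per value of each single factor.
        counts = Counter(sample.get(factor) for sample in samples)
        if min(counts.values()) < MIN_REPLICATES:
            return False
    return True


balanced = [{"genotype": "wild type"}] * 3 + [{"genotype": "p53-/-"}] * 3
short = [{"genotype": "wild type"}] * 3 + [{"genotype": "p53-/-"}] * 2
print(meets_design_criteria(balanced), meets_design_criteria(short))
```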
  36. New Atlas is launching in 3 days’ time! Launch date: week of 1 Dec 2013. Where to find the Atlases (old and new) before and after launch?
  37. New Atlas: “Baseline” and “differential”
  38. Experiencing the old and new Atlases today. Old: example use case and exercise. New: taster and preview; example use case and exercise.
  39. “Old” Atlas construction - analysis pipeline. A dummy example from one experiment with three conditions (Cond.1, Cond.2, Cond.3). Input data (Affy CEL, Agilent feature extraction files, RNA-seq fastq files) go through a linear model* (Bioconductor limma) and a moderated t-test. Output: a 2-D matrix of genes by conditions, where 1 = differentially expressed and 0 = not differentially expressed. * More information about the statistical methodology: http://nar.oxfordjournals.org/content/38/suppl_1/D690.full
  40. “Old” Atlas construction - analysis pipeline. How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?” Each plotted point is a single expression value for gene X. The mean of each condition (Cond.1, Cond.2, Cond.3) is compared with the mean of all samples, and a statistic is calculated.
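To make the “condition mean versus mean of all samples” comparison concrete, here is a deliberately simplified Python sketch. It uses a plain one-sample t statistic, which is an assumption made for illustration; the Atlas pipeline itself uses limma's linear model with a moderated t-test. The expression values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def condition_vs_overall(expression, condition):
    """expression: dict mapping condition -> expression values for one gene.

    Returns (difference of means, plain t statistic). This is NOT limma's
    moderated t-test, only a simplified stand-in.
    """
    all_values = [v for values in expression.values() for v in values]
    overall = mean(all_values)
    values = expression[condition]
    difference = mean(values) - overall
    t = difference / (stdev(values) / sqrt(len(values)))
    return difference, t


# Invented log-scale expression values for gene X in three conditions.
gene_x = {
    "cond1": [8.0, 8.2, 7.8],
    "cond2": [5.0, 5.1, 4.9],
    "cond3": [5.0, 5.2, 4.8],
}
difference, t = condition_vs_overall(gene_x, "cond1")
print(round(difference, 2), round(t, 2))
```

Note that the sketch fails with a division by zero if all replicates of a condition are identical; a moderated statistic avoids that by borrowing variance information across genes.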
  41. “Old” Atlas construction - analysis pipeline. Apply linear modelling statistics to each of the n experiments: Exp. 1 (Cond.1, Cond.2, Cond.3), Exp. 2 (Cond.4, Cond.5, Cond.6), ..., Exp. n (Cond.X, Cond.Y, Cond.Z), each with its own statistical test per gene. Each experiment has its own “verdict” or “vote” on whether a gene is differentially expressed or not under a certain condition.
  42. “Old” Atlas construction - results. Summary of the “verdicts” from different experiments.
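Summarising the per-experiment “verdicts” amounts to counting votes per condition. A minimal Python sketch follows; the verdict tuples are invented, the call labels "up", "down" and "none" are an assumed encoding, and `summarise` is a hypothetical helper.

```python
from collections import Counter

# Invented verdicts for one gene: (experiment, condition, call).
verdicts = [
    ("E1", "brain", "up"),
    ("E2", "brain", "up"),
    ("E3", "brain", "down"),
    ("E1", "liver", "none"),
]


def summarise(verdicts):
    """Count, per condition, how many experiments voted for each call."""
    summary = {}
    for _experiment, condition, call in verdicts:
        summary.setdefault(condition, Counter())[call] += 1
    return summary


summary = summarise(verdicts)
print(dict(summary["brain"]))
```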
  43. Mapping microarray probes to genes. Every (~monthly) Atlas release takes the latest Ensembl gene - probe identifier mapping data. Expression data per probe are linked through probe identifiers to Ensembl genes. From Ensembl genes, we also get: Compara genes; external references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms.
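The probe-to-gene step can be pictured as a lookup that groups per-probe expression values under their gene. In this Python sketch only the probe 210040_at and its gene KCC2 come from the slides; the other identifiers, the expression values and the helper `expression_by_gene` are made up for illustration.

```python
from collections import defaultdict

# Only 210040_at -> KCC2 appears in the slides; the rest is invented.
PROBE_TO_GENE = {"210040_at": "KCC2", "probe_b": "GENE_Y"}


def expression_by_gene(probe_expression, probe_to_gene):
    """Group per-probe expression values under the gene each probe maps to."""
    by_gene = defaultdict(dict)
    for probe, value in probe_expression.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a current mapping are skipped
        by_gene[gene][probe] = value
    return dict(by_gene)


measured = {"210040_at": 9.1, "probe_b": 4.2, "unmapped_probe": 6.0}
print(expression_by_gene(measured, PROBE_TO_GENE))
```

Keeping the mapping table separate from the expression data mirrors the point on the slide: the mapping can be refreshed at each release without touching the measurements.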
  44. Example Atlas use case: KCC2 gene and BPA. Scenario: you study the health impact of bisphenol A (BPA), a common additive in household plastic items. Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development. BPA causes epigenetic downregulation of potassium chloride cotransporter 2 (Kcc2) mRNA levels (PNAS paper, Yeo et al., 2013: “Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter”). Your questions: In which human organ/tissue is the KCC2 gene differentially expressed? Under what condition(s) is the human KCC2 gene differentially expressed? What is the expression pattern of KCC2/Kcc2 orthologues?
  45. “Old” Atlas home page. Query for a single gene or a group of genes. Restrict the query by direction of differential expression (up, down, both, neither). Query for conditions. The ‘advanced query’ option allows building more complex queries.
  46. Gene search (old Atlas): human KCC2 gene
  47. (1) Summarised expression data for one gene. Default: sort by levels of differential expression. Group by experimental factor / intent. Clicking on a factor/condition changes the profile display.
  48. (2) The anatomogram
  49. (3) Detailed expression profile. Drill down to 1 probe (210040_at), mapped to 1 gene (KCC2), in 1 experiment (E-GEOD-3526). Asterisks (*) mark samples mapped to the “brain” experimental factor by EFO.
  50. (4) Jump to orthologues from gene summary. Orthology comes from the Ensembl Compara database.
  51. (5) Compare orthologues with parallel heatmaps
  52. Baseline Atlas construction. Only RNA-seq data sets are used. Pipeline: fastq files (paired reads, e.g. @read_name/1 and @read_name/2) → 1. Align with TopHat against the reference genome from Ensembl → bam (mapped reads) → 2. Cufflinks → FPKMs. Example fastq record (name, sequence, “+”, qualities): @read_name/1 GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
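Two pieces of this pipeline are easy to show in code: the four-line fastq record layout and the FPKM definition (fragments per kilobase of transcript per million mapped fragments). The Python sketch below is illustrative only and does not reproduce what TopHat or Cufflinks do internally; `parse_fastq` and `fpkm` are hypothetical helpers, and the counts in the usage line are invented.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from well-formed 4-line fastq records."""
    stripped = (line.strip() for line in lines)
    # zip over the same iterator four times groups the lines in fours.
    for header, sequence, _plus, quality in zip(*[stripped] * 4):
        yield header[1:], sequence, quality


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


record = [
    "@read_name/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
reads = list(parse_fastq(record))
print(reads[0][0], len(reads[0][1]))

# Invented numbers: 500 fragments on a 2,000 bp transcript, 25 million mapped.
print(fpkm(500, 2000, 25_000_000))
```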
  53. Baseline Atlas search for human KCC2
  54. Baseline Atlas search results
  55. Human KCC2 gene in Baseline Atlas. FPKM threshold slider.
  56. Old Atlas ‘condition-only’ query
  57. Old Atlas ‘condition-only’ query (cont’d): heatmap view
  58. Old Atlas gene + condition query
  59. Old Atlas query refining
  60. Old Atlas query refining: AND
  61. Old Atlas query refining: AND
  62. QUESTIONS?
  63. Hands-on exercise 3: Find information on Tbx5 expression in mouse in relation to Holt-Oram syndrome. Hands-on exercise 4: Find transcription factor genes belonging to the androgen signaling pathway in prostate cancer.
  64. Diff. atlas changes: (1) analysis pipeline. How differential expression is calculated in one experiment: “Is gene X differentially expressed in condition 1 in this experiment?” Each plotted point is a single expression value for gene X. Instead of comparing each condition mean (Cond.1, Cond.2, Cond.3) with the mean of all samples, create “contrasts” between conditions and calculate the statistic.
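A “contrast” pairs a test condition with a reference condition. The Python sketch below assumes one condition is chosen as the reference and that expression values are already on a log2 scale, so the difference of means is a log2 fold change. Both choices are assumptions for illustration, the values are invented, and the helper names are hypothetical.

```python
from statistics import mean


def make_contrasts(conditions, reference):
    """Pair every non-reference condition with the reference condition."""
    return [(test, reference) for test in conditions if test != reference]


def log2_fold_change(expression, test, reference):
    """Difference of mean log2 expression: test minus reference."""
    return mean(expression[test]) - mean(expression[reference])


# Invented log2 expression values for gene X.
gene_x = {
    "cond1": [8.0, 8.2, 7.8],
    "cond2": [5.0, 5.1, 4.9],
    "cond3": [5.0, 5.2, 4.8],
}
for test, reference in make_contrasts(list(gene_x), reference="cond3"):
    print(test, "vs", reference, round(log2_fold_change(gene_x, test, reference), 2))
```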
  65. Diff. atlas changes: (2) modern interface. Lots of mouse-over tips/help (?). FDR cut-off. Clearer indication of experimental factor and contrast. Colour gradient showing significance of differential expression. Experiment design, data analysis methods and full analytics data for download. MA plots.
  66. Diff. atlas changes: (2) modern interface. Clearer indication of experimental factor and contrast.
  67. Diff. atlas changes: (3) verdict “summary”? What if there are differences in sample attributes?
  68. Diff. atlas changes: (4) Histograms?
  69. QUESTIONS?
  70. ArrayExpress-Atlas Crossword
  71. Find out more about the two databases. Visit our eLearning portal, Train Online (http://www.ebi.ac.uk/training/online/), for tutorials on ArrayExpress and Expression Atlas. ArrayExpress Bioconductor R package: http://bioconductor.org/packages/release/bioc/html/ArrayExpress.html. ArrayExpress help: www.ebi.ac.uk/arrayexpress/help/index.html. Email us at: miamexpress@ebi.ac.uk. Atlas mailing list: arrayexpress-atlas@ebi.ac.uk
  72. Open-source tools for FG data analysis. GenePattern (Broad Institute): http://www.broadinstitute.org/cancer/software/genepattern/. GenomeSpace (incorporates GenePattern; ArrayExpress provides a link to send data directly to GenomeSpace): http://genomespace.org/. Galaxy (allows more modular customisation of workflows). Bioconductor for R (comprehensive help documentation on standard workflows): http://www.bioconductor.org/help/. Books: Bioconductor Case Studies (Hahne et al.); Microarray Technology in Practice (Russell et al.).
  73. Data submission to ArrayExpress Archive
  74. Data submission to ArrayExpress. Read the submission help page carefully before preparing any files. Use the MAGE-TAB submission tools to create a tailor-made template spreadsheet (IDF and SDRF) for your experiment.
  75. Submission of HTS data. ArrayExpress acts as a “broker” for the submitter. Meta-data and processed data go to ArrayExpress; raw sequence reads* (e.g. fastq, bam) go to the ENA. * See http://www.ebi.ac.uk/ena/about/sra_data_format for accepted read file formats.
  76. What happens after submission? You receive an email confirmation. The submission is ‘closed’, so no more editing on your end. Curation: we will email you with any questions and may ‘re-open’ the submission for you to make changes. You can keep data private until publication; we will provide login account details to you and your reviewers for private data access. Get your submission in the best possible shape to shorten curation and processing time!
  77. Submission checklist
  78. Need help with submitting your data? Visit our eLearning portal, Train Online, for the specific tutorial on how to submit data using MAGE-TAB: www.ebi.ac.uk/training/online/course/arrayexpress-submitting-data-using-mage-tab. ArrayExpress help page on submissions: www.ebi.ac.uk/arrayexpress/help/submissions_overview.html. Watch this short YouTube video on how to navigate the MAGE-TAB submission tool: http://youtu.be/KVpCVGpjw2Y. Email curators at: miamexpress@ebi.ac.uk