1 / 67

Building a Global Map of (Human) Gene Expression

Building a Global Map of (Human) Gene Expression. Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May, 2010. From one genome to many biological states.

paigeg
Télécharger la présentation

Building a Global Map of (Human) Gene Expression

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Building a Global Map of (Human) Gene Expression Misha Kapushesky European Bioinformatics Institute, EMBL St. Petersburg, Russia May, 2010

  2. From one genome to many biological states • While there is only one genome sequence, different genes are expressed in many different cell types and tissues, different developmental or disease states • The size and structure of this “expression space” is still largely unknown • Most individual experiments are looking at small regions • We would like to build a map of the global human gene expression space

  3. A microarray experiment Traditional research The map we want to build Everest Lhasa Kathmandu Mapping the human transcriptome

  4. How to build such a global map • This space is huge - There are thousands of potentially different states – cell types, tissue types, developmental stages, disease states, systems under various treatments (drugs, radiation, stress, …) – • It is not feasible to study them all in a single laboratory experiment (costs, rare samples, …) • However thousands of gene expression experiments are performed every year (microarrays, new generation sequencing) • Can we use the published data to build the global expression map?

  5. ArrayExpress • www.ebi.ac.uk/arrayexpress • Data from over 280,000 assays and over 10,000 independent studies (microarrays, sequencing, …) • Gene expression and other functional genomics assays • Over 200 species • Data collection and exchange from GEO

  6. Can we integrate these data to answer questions that go beyond what was done in the individual studies? • On a quantitative level - data on only the same microarray platform can be integrated

  7. Angela Gonzales (EBI) Misha Kapushesky (EBI) Janne Nikkila (Helsinki University of Technology) Helen Parkinson (EBI), Wolfgang Huber (EMBL) Esko Ukkonen (University of Helsinki) A global map of human gene expression MargusLukk et al, Nature Biotechnology, 28, p322-324 (April, 2010)

  8. The most popular gene expression microarray platform: Affymetrix U133A • We collected over 9000 raw data files from Affymetrix U133A from GEO and ArrayExpress • Applying strict quality controls, removing the duplicates • Data on 5372 samples remained • from 206 different studies generated • in 163 different laboratories • grouped in 369 different biological ‘conditions’ (tissue types, diseases, various cell lines, etc) • The 369 conditions grouped in different larger ‘metagroups’

  9. Different metagroupings (4 and 15):

  10. After RMA normalisation we obtain: 5372 samples (369 different conditions) ~18,000 genes

  11. 2nd 1st Principal Component Analysis – each dot is one of the 5372 samples

  12. 2nd 1st 16 21/12/2019 Human gene expression map

  13. 2nd Hematopoietic axis 17 21/12/2019 Human gene expression map

  14. 2nd Hematopoietic axis 18 21/12/2019 Human gene expression map

  15. Malignancy Hematopoietic axis 19 21/12/2019 Human gene expression map

  16. Hematopoietic and malignancy axes Lukk et al, Nature Biotechnology, 28: 322

  17. 2nd 1st 3rd

  18. 3rd PC Coloured by tissues of origin

  19. Neurological axis Tissues of origin

  20. First 3 (5) principal components • Hematopoietic axis – blood, ‘solid tissues’, ‘incompletely differentiated cells and connective tissues’ • Malignancy axis - Cell lines – cancer – normals and other diseases • Neurological axis – nervous system / the rest • RNA degradation • Samples seem to ‘cluster’ by the tissues of origin

  21. Hierarchical clustering of 97 groups with at least 10 replicates each 26 21/12/2019 Human gene expression map

  22. Comparison of the 97 larger sample groups to the rest Incompletely differentiated cell type and connective tissue group

  23. Conclusions so far • We have identified 6 major transcription profile classes in these data: • cell lines • incompletely differentiated cells and connective tissues • neoplasms • blood • brain • muscle • Cell lines cluster together!

  24. Gene expression across the 5372 samples • The expression of most genes is relatively constant • There are only 1034 probesets (mapping to less than 900) genes where normalised signal variability has standard deviation > 2

  25. Immune repsonse Nervous system development Lipid raft Mitosis Neurotransmitter uptake Cytoskeletal protein binding Extracellular matrix Extracellular regions Extracellular matirx Extracellular region Mitosis Defence response Nervous system development Actin cytoskeleton organisation and biogenesis Protein carrier activity No significant resout Antigen presentation, exogenous antigen Trans – 1,2-dyhydrobenzene, 1,2-dyhydrogenase activity S100 alpha binding 2 1 3 4 7 6 5 9 10 8 11 14 13 12 16 17 18 19 15 Clustering of 97 sample groups and 1000 most variable probesets (about 900 genes)

  26. Clustering based on subset of these genes produce similar results • Clustering based on 350 most variable probesets gives almost the same result • Even clustering based on 30 most variable probesets is very close

  27. 24 most variable genes

  28. www.ebi.ac.uk/gxa/human/U133a

  29. Can we go beyond the 6 major classes?

  30. Hierarchical clustering of all 369 sample groups • Some finer groups: • Cancer: • Sarcomas • Carcinomas • Neuroblastomas • Normal: • Liver and gut 39 21/12/2019 Human gene expression map

  31. Normal blood and blood non-neoplastic disease Leukemia Other blood neoplasm Blood cell lines

  32. Identifying condition specific genes by supervised analysis • Using linear models to find condition specific genes, multiple testing correction, differential expression cut-offs • Example - 174 leukemia specific genes • include most well known markers (e.g, BCR, ETV6, FLT3, HOXA9, MUST3, PRDM2, RUNX1, and TAL1) • Many confirmed as associated with leukemia

  33. Beyond the major 6 classes the ‘signal’ becomes weak • The problem may be lab effects • The large biological effects are stronger than the lab effects • However, when we zoom into particular subclasses, the lab effects may be taking precedence

  34. Our current view on global transcriptome A microarray experiment Traditional research The map we want to build Everest Lhasa Kathmandu Mapping the human transcriptome

  35. Mono- nuclear cells AML Nervous system tumors Frontal cortex Cerebellum Caudate nucleus Brain Brain and nervous system Hippocampal tissue Muscular dystrophy Skeletal muscle Heart and heart parts 97 groups – colours recycled

  36. Second approach • Integrating data on statistics level

  37. Ele Holloway Ibrahim Emam Pavel Kurnosov Helen Parkinson Anrey Zorin Tony Burdett Gabriella Rustici Eleanor William Andrew Tikhonov Gene Expression Atlas

  38. Global Differential Expression Analysis • Meta-Analysis Approaches • Vote counting: • number of independent studies supporting an observation for a particular gene • Effect size integration: • compute effect size statistics in each study, assess relevant statistical model and compute combined z-score, for each gene/condition/study combination (extension of Choi et al, 2003) • Selected ~10% of the data from ArrayExpress (including GEO imports), manually curated for quality and mapped to a custom-built ontology of experimental factors, EFO: http://www.ebi.ac.uk/efo • Data on differential expression of genes in 1000+ studies, comprising ~30000 assays, in over 5000 conditions • For each experiment, differentially expressed genes have been identified computationally via moderated t-tests and statistical meta-analysis

  39. AML CML normal genes Analysing each contributing dataset separately: one-way ANOVA

  40. Combining the datasets Experiments 1, 2, 3, …, m …

  41. Effect size-based meta-analysis • We have for each gene in each experiment/condition: • p-value for significance • simulaneous t-statistics & confidence intervals • d.e. label (“up” or “down”) • However, we would like to: • Measure of strength of d.e. effect size • Ability to combine d.e. findings statistically • Effect Size • Standardized mean difference or similar (e.g., correlation coef.)

  42. Meta-analysis Procedure • For each gene-experiment-condition combination • Compute effect size from simultaneous d.e. t-statistics • Combine effect sizes across multiple studies • Using fixed-effects or random-effects models • Obtain for each gene-condition combination: • Mean effect size estimate • Combined z-score • Overall p-value

More Related