
Annotation-based meta-analysis of microarray experiments

Presentation Transcript


  1. Annotation-based meta-analysis of microarray experiments Chris Stoeckert Yale Biostatistics Seminar Series Feb. 26, 2008

  2. Data Integration at CBIL (http://www.cbil.upenn.edu)

  3. Databases

  4. Knowledge Representation: Database schemas (GUS) and data standards (MGED, OBI)

  5. Data Modeling: Integrative tools for ortholog identification, expression analysis, chromosomal aberrations, TF regulatory networks, protein interaction networks

  6. Annotation-based meta-analysis of microarray experiments • Meta-analysis • Examples illustrating information gained and problems caused by incomplete annotations • Standards for annotating experiments • Standards from the MGED Society and multi-community standards (e.g., OBI). • Computing with Annotations • Dissimilarity measures to quantitatively compare experiments and assays based on annotations • Sample applications using dissimilarity measures

  7. The Problem • Analysis of microarray datasets has led to new challenges in statistics (many genes, few samples). • Focus of the analysis has been on the genes • Look for correlations, differences in expression • Look for greater than expected associations in types of genes • What can be learned from an analysis of sample characteristics and experimental parameters? • If experiments were better annotated, what would we be able to do? What are the benefits of better annotation? • What statistical measures and tests can be applied for this purpose?

  8. Meta-analysis of Microarray Datasets • Meta-analyses have been performed using microarray data from different experiments studying similar conditions to identify genes with significant signatures in those conditions. • Generally, these analyses look for robust signals that overcome experiment-specific biases in sample types, collection, and treatment and rely on the fact that with enough experiments these effects will wash out in the noise. • Detailed information about both the biological intent and context of a study is crucial for meaningful selection of experiments to be input into a meta-analysis. Meta-analysis is complicated by differences in experimental technologies, data post-processing, database formats, and inconsistent gene and sample annotations.

  9. Meta-analysis example: “Creation and implications of a phenome-genome network” (Butte and Kohane, Nat Biotech 2006)

  10. Meta-analysis example: “Creation and implications of a phenome-genome network” (Butte and Kohane, Nat Biotech 2006) • Clustered experiments based on mapping concepts found in sample annotations to the UMLS meta-thesaurus. • Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression. • “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”

  11. Another Example: CAMDA 2007 Dataset • ~6000 arrays of diseased and normal human samples and cell lines collected from ArrayExpress and GEO (dataset provided by ArrayExpress). • Meta-analysis on a large scale is open to many, but issues relating to annotation remain. On what basis do you compare: • “nasal_epithelium” • “nasal_epithelium, pulmonary_disease_cystic_fibrosis”

  12. Annotation-based meta-analysis of microarray experiments • Meta-analysis • Examples illustrating information gained and problems caused by incomplete annotations • Standards for annotating experiments • Standards from the MGED Society and multi-community standards (e.g., OBI). • Computing with Annotations • Dissimilarity measures to quantitatively compare experiments and assays based on annotations • Sample applications using dissimilarity measures

  13. What is MGED? • The Microarray and Gene Expression Data Society. • A grass-roots organization started in 1999 to develop standards for the sharing and storing of microarray data • A society with participants from academia, industry, government, and journals • A series of meetings to showcase cutting-edge work and promote standards.

  14. The MGED Society Mission The Microarray and Gene Expression Data (MGED) Society is an international organization of biologists, computer scientists, and data analysts that aims to facilitate the sharing of data generated using microarray and other functional genomics technologies for a variety of applications including expression profiling. The scope of MGED includes data generated using any technology when applied to genome-scale studies of gene expression, binding, modification and other related applications. The focus is on establishing standards for data quality, management, annotation and exchange; facilitating the creation of tools that leverage these standards; working with other standards organizations and promoting the sharing of high-quality, well-annotated data within the life sciences and biomedical communities. http://www.mged.org

  15. MGED Standards • What information is needed for a microarray experiment? • MIAME: Minimum Information About a Microarray Experiment • How do you “code up” microarray data? • MAGE-OM: MicroArray Gene Expression Object Model • What words do you use to describe a microarray experiment? • MO: MGED Ontology

  16. MIAME: Minimum Information About a Microarray Experiment • Brazma et al. Nature Genetics. 2001 • Version 2.0 proposal available • The raw data for each hybridisation • The final processed data for the set of hybridisations in the experiment (study) • The essential sample annotation, including experimental factors and their values • The experiment design including sample data relationships • Sufficient annotation of the array design • Essential experimental and data processing protocols

  17. MIAME in a nutshell. [Diagram: Samples pass through RNA extraction (RNA extract), labelling (labelled nucleic acid), and hybridisation to arrays built from a Microarray array design; the hybridisations are normalized into a gene expression data matrix (rows = genes); a Protocol is attached to each step, and everything is integrated at the Experiment level.] Stoeckert et al. Drug Discovery Today: TARGETS 2004

  18. MAGE-OM: MicroArray Gene Expression Object Model • MAGE-ML • XML version of MAGE-OM • Spellman et al. Genome Biology 2002 • Version 1.1 • V2.0 will be part of FuGE: Functional Genomics standard with participation from HUPO (Human Proteome Organization), the Metabolomics Society, and other communities. • Jones et al. Nature Biotech. 2007 • MAGE-TAB • Tab-delimited • Rayner et al. BMC Bioinformatics 2006 • Investigation Description Format (IDF) • Sample and Data Relationship Format (SDRF) • Array Design Format (ADF)

  19. MGED Ontology • Whetzel et al. Bioinformatics 2006 • Now in version 1.3.1 • Version 2 will be derived from OBI (Ontology for Biomedical Investigations). • Like FuGE, OBI is a standard resource being built by multiple communities.

  20. Ecosystem of Biomedical Standards • Integrative standards: FuGE, MIBBI, OBI • Many communities: MGED, PSI, MSI, OBO, BIRN, caBIG, … • Many community standards: MIAME, MIAPE, CIMR, GO, MAGE-ML, GelML, spML, ChEBI, MAGE-TAB, mzDataXML, PATO, MGED Ontology, PSI-MI, sepCV, NMR Ontology

  21. OBI – Ontology for Biomedical Investigations (http://obi.sf.net) • Diverse background • Omics standardization effort people (MGED, PSI, MSI) • People ‘running’ (public) repositories, primary + secondary databases • Software engineers, modellers, biologists, data-miners • People from semantic web technology • Vendors and manufacturers (new) • Different maturity stages • Some need to ‘rebuild’, e.g. MGED Ontology (microarray) • Some are starting now, e.g. MSI (metabolomics), EnvO (environment) • Plurality of (prospective) usage • Driving data entry and annotation • Indexing of experimental data, minimal information lists, cross-database queries • Applying it to text-mining • Benchmarking, enrichment, annotation • Encoding facts from literature • Building knowledge bases relying on RDF triple stores

  22. OBI – Communities and Structure (http://obi.sf.net → National Center for Biomedical Ontology) • 1. Coordination Committee (CC): representatives of the communities → monthly conferences • 2. Developers WG: CC and other communities’ members → weekly conference calls • 3. Advisors:

  23. Sending terms to other OBI branches or external resources, e.g.

  24. OBI – Main Activities and Timelines (http://obi.sf.net) • Continue branch activities (iterative process) • Branch editors working on submitted terms • Normalize terms, add metadata tags (e.g. definition and source) • Bin terms into the relevant top-level classes and identify relationships • Sort terms by relevance to one or another branch, or to other ontologies • First evaluation of OBI draft + release of OBI 0.1 (Feb 08) • Review branches and merge with the trunk into a core • Apply use cases and competency questions • Evaluate how the ontology performs, also what is missing, what is wrong • 5th and 6th face-to-face meetings for Coordinators and Developers • BBCCRC, Vancouver, Canada (Jan/Feb 08), self-funded + MGED sponsor • EBI, Cambridge, UK (Summer 08), BBSRC funds

  25. Annotation-based meta-analysis of microarray experiments • Meta-analysis • Examples illustrating information gained and problems caused by incomplete annotations • Standards for annotating experiments • Standards from the MGED Society and multi-community standards (e.g., OBI). • Computing with Annotations • Dissimilarity measures to quantitatively compare experiments and assays based on annotations • Sample applications using dissimilarity measures Elisabetta Manduchi, Junmin Liu

  26. Potential General Applications • Identification of experiments/assays of interest from large DBs • Organization of web resources • QC (annotation and ontology development) • Meta-analysis assessments • Guidance in data QC and pre-processing

  27. Computing with Annotations • Ideal situation: complete, accurate, consistent annotations • Example of application (from the CAMDA/AE use case): • Assess assay quality using NUSE (normalized unscaled standard errors) • Need to work with appropriate groups of assays • Flow: Annotated Assays → Compute Dissimilarities → Cluster → Applications

  28. Actual situation: • incomplete annotation (missing values) • heterogeneous granularity • variation in ontologies used for a given annotation field • Flow: Annotated Assays → Compute Dissimilarities → Cluster → Applications, with a QC loop feeding back (Cluster → QC → Refine Annotation → Annotated Assays)

  29. Test Case: EPConDB • 24 published EPConDB experiments manually classified into 5 classes • Preliminary study computing with experiment annotation (intent and context) • Explored dissimilarity measures between experiments • Identified 5 annotation components, which could be weighted as desired • For each component, defined a component-wise dissimilarity • Took the weighted average of the component-wise dissimilarities

  30. Pancreatic Growth after Partial Pancreatectomy and Exendin-4 Treatment (De Leon et al., 2006). • Manually classified under “Pancreas development and growth” • Experiment Design Types: MO terms providing a high-level description of the experiment. For this experiment, these are: • MethodologicalDesign.time_series_design, • PerturbationalDesign.compound_treatment_design, • PerturbationalDesign.stimulus_or_stress_design • Experimental Factor Types: MO terms describing the type of factors under test in the experiment. In our example these are: • ComplexAction.compound_based_treatment (samples were treated with either Exendin-4 or vehicle, or nothing), • ComplexAction.specified_biomaterial_action (some samples had a pancreatectomy, others did not, others had a sham operation), • ComplexAction.timepoint (samples treated with a compound were treated for different numbers of hours). • Organisms: the organisms to which the biosources used in the various assays belong. In our example this was just “Mus musculus”. • Other Biomaterial Characteristics of these biosources. In our example: • Age.birth (initial time point for computing the age), Age.weeks (unit of measure for age); • DevelopmentalStage.Theiler Stage 28, • OrganismPart.pancreas, • Sex.male, • StrainOrLine.BALB/c. • Treatment Types: the types of treatments applied to the original biosources which led to the final labeled extracts hybridized to the array. In our example, these were: • ComplexAction.specified_biomaterial_action, • ComplexAction.compound_based_treatment, • ComplexAction.split, ComplexAction.nucleic_acid_extraction, • ComplexAction.labeling.

  31. Component-wise experiment dissimilarity • For each of the 5 annotation components and for each pair of experiments, we have two sets of terms, one per experiment: say A and B • Define the component-wise dissimilarity between the two experiments using the Jaccard or the Kulczynski distance: Jaccard: $d_J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|}$; Kulczynski: $d_K(A,B) = 1 - \frac{1}{2}\left(\frac{|A \cap B|}{|A|} + \frac{|A \cap B|}{|B|}\right)$ • Choose one or the other according to how you want to weigh containments • With these distances, iteratively (leave one out) classified each experiment based on smallest distance to the other experiments
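A minimal sketch (in Python; not from the talk) of these two set dissimilarities, using their standard definitions; the example term sets are made up:

    def jaccard(a: set, b: set) -> float:
        """Jaccard distance: 1 - |A∩B| / |A∪B| (0 if both sets are empty)."""
        if not a and not b:
            return 0.0
        return 1.0 - len(a & b) / len(a | b)

    def kulczynski(a: set, b: set) -> float:
        """Kulczynski dissimilarity: 1 - mean(|A∩B|/|A|, |A∩B|/|B|)."""
        if not a or not b:
            return 1.0
        inter = len(a & b)
        return 1.0 - 0.5 * (inter / len(a) + inter / len(b))

    # Containment example: B strictly contains A, so Kulczynski is smaller.
    A = {"time_series_design"}
    B = {"time_series_design", "compound_treatment_design"}
    print(jaccard(A, B))     # 0.5
    print(kulczynski(A, B))  # 0.25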

  32. Automated vs. Manual Classification of EPConDB Studies • 5 Annotations used • Experiment design • Experimental factors • Organism • Other Biomaterial Characteristics • Treatment • Manual Classifications • Pancreas development and growth • Differentiation of insulin-producing cells • Islet/beta-cell stimulation/injury • Tissue surveys • Targets and roles of transcriptional regulators • Result: achieved ~75% correct classification using all components but organism. Note this reflects manual classification based on intent; other classifications might have different optimal weights. Need to retry with more experiments (~75 available now)
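A hedged sketch of the leave-one-out procedure behind this result, reusing jaccard from the sketch above; the experiment encoding (a dict from annotation component to a set of MO terms), the component names, and the weights are illustrative assumptions, not the authors' code:

    def experiment_diss(e1, e2, weights):
        """Weighted average of component-wise set dissimilarities."""
        return sum(w * jaccard(e1.get(c, set()), e2.get(c, set()))
                   for c, w in weights.items()) / sum(weights.values())

    def leave_one_out_accuracy(experiments, labels, weights):
        """Classify each experiment by the class of its nearest neighbour."""
        correct = 0
        for i, e in enumerate(experiments):
            _, predicted = min((experiment_diss(e, o, weights), labels[j])
                               for j, o in enumerate(experiments) if j != i)
            correct += (predicted == labels[i])
        return correct / len(experiments)

    # Weights mirroring "all components but organism":
    weights = {"experiment_design": 1, "experimental_factors": 1,
               "organism": 0, "biomaterial_characteristics": 1, "treatment": 1}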

  33. Test Case: RAD • Clustering of 62 public experiments from RAD • No predefined classifications • Tried PAM and k from 5 to 20 • Use silhouettes to determine optimal clusters • Manual assessment • QC value
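The talk used PAM; as a rough stand-in, here is a simplified k-medoids sketch (Voronoi iteration, without PAM's full swap phase) that works directly on a precomputed dissimilarity matrix D, which is the setting here:

    import random

    def k_medoids(D, k, n_iter=100, seed=0):
        """Simplified k-medoids on a precomputed dissimilarity matrix D."""
        n = len(D)
        medoids = random.Random(seed).sample(range(n), k)
        for _ in range(n_iter):
            # Assign every item to its nearest medoid.
            clusters = {m: [] for m in medoids}
            for i in range(n):
                clusters[min(medoids, key=lambda m: D[i][m])].append(i)
            # Move each medoid to the member minimizing within-cluster cost.
            new = [min(c, key=lambda x: sum(D[x][j] for j in c))
                   for c in clusters.values() if c]
            if set(new) == set(medoids):
                break  # converged
            medoids = new
        labels = [min(range(len(medoids)), key=lambda t: D[i][medoids[t]])
                  for i in range(n)]
        return medoids, labels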

  34. Silhouettes • Rousseeuw P.J. (1987), J. Comput. Appl. Math., 20, 53–65 • For each study i: • a(i) = average dissimilarity between i and all other studies in the cluster to which i belongs • d(i,C) = average dissimilarity of i to all studies in cluster C • b(i) = min_C d(i,C), over clusters C not containing i: dissimilarity between i and its “neighbor” cluster • s(i) = ( b(i) - a(i) ) / max( a(i), b(i) ) • If i is in a singleton cluster, then s(i) = 0 • large s(i) (almost 1): very well clustered • small s(i) (around 0): the study lies between two clusters • negative s(i): probably placed in the wrong cluster
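These definitions translate directly into code; a sketch computing s(i) from a dissimilarity matrix D and one cluster label per item (illustrative; assumes at least two clusters):

    def silhouettes(D, labels):
        """Rousseeuw silhouette values s(i) for all items."""
        n = len(D)

        def avg_diss(i, cluster):
            members = [j for j in range(n) if labels[j] == cluster and j != i]
            return sum(D[i][j] for j in members) / len(members) if members else 0.0

        values = []
        for i in range(n):
            if not any(labels[j] == labels[i] for j in range(n) if j != i):
                values.append(0.0)  # singleton cluster: s(i) = 0
                continue
            a = avg_diss(i, labels[i])  # average dissimilarity to own cluster
            b = min(avg_diss(i, c)      # ... to the nearest other cluster
                    for c in set(labels) if c != labels[i])
            values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
        return values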

  35. Unsupervised classification of microarray experiments in RAD • Best average silhouette value was 0.36, with Kulczynski with weights 1,1,0,0,1 and PAM with k=8, or weights 1,1,1,0,1 and PAM with k=14 • Singleton and odd clusters revealed misannotated studies (QC) • Not optimized, but gives us a sense of whether there is sufficient signal in the annotations (at least in our database) to usefully organize experiments

  36. CAMDA 2007 Dataset = E-TABM-185 • ~6000 arrays of diseased and normal human samples and cell lines, all on Affymetrix HG-U133A, collected from ArrayExpress and GEO (dataset provided by ArrayExpress) • Available at ArrayExpress as E-TABM-185 • http://www.ebi.ac.uk/microarray-as/aer/?#ae-browse/q=E-TABM-185[2] • Real use case for identifying quality issues (R. Irizarry) that required appropriate groups of assays to distinguish biological from technical factors

  37. Partial view of Annotations from E-TABM-185.sdrf • Snippet from MAGE-TAB (courtesy Helen Parkinson, EBI) • Ten distinct annotations to choose from • Drawn from multiple studies so not annotated by the same person • Many missing values

  38. What gains would organizing E-TABM-185 provide? • Compare studies at the assay level (individual samples). How should we define dissimilarity measures between assays? • Improve power. Group related assays based on all relevant annotations - not just on one or two. • Make relevant comparisons. The way a sample is processed can affect expression as much as what tissue it came from so grouping on one or two annotations can add variability if chosen poorly. • Interpret clusters. Just as overenrichment of GO terms can help interpret clusters of genes, overenrichment of specific annotations may help interpret biclusters

  39. Dissimilarity between Assays • First need to select which annotation fields are of interest • Typically these are all “context” fields, as “intent” refers to an experiment as a whole • Original approach applied to assays: • Choose annotation fields of interest: e.g., organism part, disease, etc. • Pull them together into one annotation set • Compute dissimilarities based on the overlap of the annotation sets (Kulczynski or Jaccard), as sketched below
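As a sketch, the original approach amounts to the following (jaccard from the slide-31 sketch; the field names and the per-assay dict encoding are hypothetical):

    FIELDS = ["OrganismPart", "DiseaseState"]  # annotation fields of interest

    def pooled_terms(assay):
        """Union of the terms in the selected fields; missing fields add nothing."""
        return {t for f in FIELDS for t in assay.get(f, [])}

    def assay_diss(a1, a2):
        return jaccard(pooled_terms(a1), pooled_terms(a2))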

  40. Issues with original approach Example: OrganismPart and DiseaseState: • Suppose A1 and A2 have the following annotations: A1: nasal_epithelium, -- A2: nasal_epithelium, -- and A3 and A4 have the following annotations: A3: nasal_epithelium, pulmonary_disease_cystic_fibrosis A4: nasal_epithelium, pulmonary_disease_cystic_fibrosis • These 2 pairs have the same Jaccard and Kulczynski distances, equal to 0. • Shouldn’t the 2nd pair be considered “closer” since we have more info indicating that?
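Running the slide-39 sketch on these four (hypothetically encoded) assays makes the problem concrete:

    A1 = {"OrganismPart": ["nasal_epithelium"]}
    A2 = {"OrganismPart": ["nasal_epithelium"]}
    A3 = {"OrganismPart": ["nasal_epithelium"],
          "DiseaseState": ["pulmonary_disease_cystic_fibrosis"]}
    A4 = {"OrganismPart": ["nasal_epithelium"],
          "DiseaseState": ["pulmonary_disease_cystic_fibrosis"]}

    print(assay_diss(A1, A2))  # 0.0
    print(assay_diss(A3, A4))  # 0.0 as well, though A3/A4 agree on strictly more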

  41. Try again • Penalize missing values • In principle we might want to penalize missing values differently depending on whether they are due to incomplete, as opposed to not applicable, annotation… • Group annotation fields when appropriate. For our use case of comparing Affy probes, we want to see where they differ. This will be due to cross-hybridization and degradation, which depend on: • CellType, OrganismPart, CellLine • DiseaseState, DiseaseStage • Weight groups and annotation fields within a group

  42. Dissimilarity revised • Base it on the Hamming distance idea • Number of annotation fields where the annotations differ • Layer it by groups and add weights w_i and subweights s_j • Provide a configuration file, e.g. (see the parser sketch below): 3 {3:CellType | 2:OrganismPart | 1:CellLine} 1 {3:DiseaseState | 1:BioSourceType | 2:DiseaseStage}
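The configuration syntax is only shown by example on the slide, so the exact file format is an assumption; a small parser consistent with it:

    import re

    def parse_config(text):
        """Parse lines of the form '<weight> {<subweight>:<Field> | ...}'."""
        groups = []
        for line in text.strip().splitlines():
            m = re.match(r"\s*(\d+)\s*\{(.*)\}\s*$", line)
            fields = [(f.strip(), int(s)) for s, f in
                      (part.split(":") for part in m.group(2).split("|"))]
            groups.append((int(m.group(1)), fields))
        return groups

    config = parse_config("""
    3 {3:CellType | 2:OrganismPart | 1:CellLine}
    1 {3:DiseaseState | 1:BioSourceType | 2:DiseaseStage}
    """)
    # [(3, [('CellType', 3), ('OrganismPart', 2), ('CellLine', 1)]),
    #  (1, [('DiseaseState', 3), ('BioSourceType', 1), ('DiseaseStage', 2)])]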

  43. Given assays: A = (a_{1,1}, a_{1,2}, …, a_{1,n_1}; a_{2,1}, a_{2,2}, …, a_{2,n_2}; …; a_{m,1}, a_{m,2}, …, a_{m,n_m}) B = (b_{1,1}, b_{1,2}, …, b_{1,n_1}; b_{2,1}, b_{2,2}, …, b_{2,n_2}; …; b_{m,1}, b_{m,2}, …, b_{m,n_m}), where i indexes the annotation groups and j the fields within a group, weights w_1, w_2, …, w_m and subweights (s_{1,1}, s_{1,2}, …, s_{1,n_1}; s_{2,1}, s_{2,2}, …, s_{2,n_2}; …; s_{m,1}, s_{m,2}, …, s_{m,n_m}). Define: $\mathrm{diss}(A,B) = \frac{1}{W} \sum_{i=1}^{m} w_i \left( \frac{1}{S_i} \sum_{j=1}^{n_i} s_{i,j} I_{i,j} \right)$ where I_{i,j} is 1 if either one of a_{i,j} or b_{i,j} is missing or a_{i,j} ≠ b_{i,j}, and I_{i,j} is 0 otherwise; W is the sum of the w_i’s; S_i is the sum of the s_{i,j}’s.
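A direct sketch of this definition: each assay is a list of annotation groups, each group a list of values with None marking a missing value; the final line reproduces the example on the next slide.

    def diss(A, B, weights, subweights):
        """Weighted, missing-value-penalizing Hamming-style dissimilarity."""
        total = 0.0
        for i, w in enumerate(weights):
            # I_{i,j} = 1 when either value is missing or the values differ.
            mismatch = sum(s for s, a, b in zip(subweights[i], A[i], B[i])
                           if a is None or b is None or a != b)
            total += w * mismatch / sum(subweights[i])
        return total / sum(weights)

    weights    = [3, 1]
    subweights = [[3, 2, 1],   # CellType, OrganismPart, CellLine
                  [3, 1, 2]]   # DiseaseState, BioSourceType, DiseaseStage
    A = [[None, "nasal_epithelium", None], [None, None, None]]
    B = [[None, "nasal_epithelium", None],
         ["pulmonary_disease_cystic_fibrosis", None, None]]
    print(diss(A, B, weights, subweights))  # (3*(4/6) + 1*(6/6)) / 4 = 0.75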

  44. Assay Dissimilarity • Example with: A = {-|nasal_epithelium|-}{-|-|-} B = {-|nasal_epithelium|-}{pulmonary_disease_cystic_fibrosis|-|-} • Then for: w_1=3 {s_{1,1}=3:CellType | s_{1,2}=2:OrganismPart | s_{1,3}=1:CellLine} w_2=1 {s_{2,1}=3:DiseaseState | s_{2,2}=1:BioSourceType | s_{2,3}=2:DiseaseStage} we have diss(A,B) = ( 3·(4/6) + 1·(6/6) ) / 4 = 3/4: in group 1 only OrganismPart agrees (CellType and CellLine are missing), and in group 2 every field is missing or differs.

  45. Self-dissimilarity • Note that, in the presence of missing values, the dissimilarity of an assay to itself will be non-zero with this definition • E.g. with w_1=3 {s_{1,1}=3:CellType | s_{1,2}=2:OrganismPart | s_{1,3}=1:CellLine} w_2=1 {s_{2,1}=3:DiseaseState | s_{2,2}=1:BioSourceType | s_{2,3}=2:DiseaseStage} and A = {-|nasal_epithelium|-}{-|-|-}, we have diss(A,A) = ( 3·(4/6) + 1·(6/6) ) / 4 = 3/4

  46. Hierarchical Clustering of Assays • Use clustering to evaluate the utility of the measures • Are we gaining anything? • Clustered with the PHYLIP neighbor software and the UPGMA method (agglomerative, average-linkage) • Note that our starting point is NOT a gene-expression dataset but rather a dissimilarity matrix, which limits the choice of tools

  47. Clustering the E-TABM-185 Assays Based on Annotations • We had 2 annotation files: • The original one • One with higher-level OrganismPart terms (manually curated by Helen Parkinson @ EBI) • For each, we built a dissimilarity matrix and then a tree • Cut clusters and ran silhouettes to partition and evaluate • For each, we generated clusterings with n varying from 100 to 600 in steps of 10
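The talk built trees with PHYLIP's neighbor program; this sketch substitutes SciPy's average-linkage (UPGMA) clustering and reuses silhouettes from the slide-34 sketch, with a small random matrix standing in for the real assay dissimilarities:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # Stand-in for the real matrix: symmetric, zero diagonal.
    rng = np.random.default_rng(0)
    X = rng.random((40, 40))
    D = (X + X.T) / 2
    np.fill_diagonal(D, 0.0)

    Z = linkage(squareform(D), method="average")  # UPGMA on condensed distances

    # Cut the tree into n clusters and score each partition by mean silhouette
    # (the talk scanned n = 100..600 in steps of 10 on ~6000 assays).
    scores = {n: float(np.mean(silhouettes(D, fcluster(Z, t=n, criterion="maxclust"))))
              for n in range(2, 11)}
    best_n = max(scores, key=scores.get)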

  48. Tree visualization with Forester ATV (http://www.phylosoft.org/atv/)

  49. How did we do? • Try to use a ‘gold standard’: {-|human_universal_reference|-}{-|frozen_sample|-} • Select the smallest n where these (24) assays constitute all the assays in a single cluster • n=220 and n=140, for the original and the curated annotation respectively • Use the silhouette measure to pick the best • Original annotation: n=260, s=0.22 • Manually curated annotation: n=150, s=0.21 • Conclusion: we were able to automatically partition the assays in a meaningful way, but need to improve
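A sketch of this gold-standard check: scan cut sizes n and return the smallest one at which the reference assays are exactly the members of one cluster (gold holds their indices, hypothetical here; Z and fcluster as in the previous sketch):

    def smallest_pure_cut(Z, gold, n_values):
        """Smallest n whose cut isolates the gold assays in one cluster."""
        gold = set(gold)
        for n in n_values:
            labels = fcluster(Z, t=n, criterion="maxclust")
            gold_labels = {labels[i] for i in gold}
            if len(gold_labels) == 1:
                members = {i for i, l in enumerate(labels) if l in gold_labels}
                if members == gold:
                    return n
        return None

    # e.g. smallest_pure_cut(Z, gold_indices, range(100, 601, 10))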

  50. Issues: synonyms • The current dissimilarity treats synonymous terms as different, e.g. • “frontal cortex” and “frontal lobe” • “malignant neoplasm” and “cancer” • Improvement ideas: • Map terms to a thesaurus, e.g. the NCI Metathesaurus (same spirit as Butte and Kohane with UMLS, but done in a directed and automated fashion); see the sketch below
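A sketch of that improvement: normalize terms through a synonym map before building the term sets, so synonyms no longer count as mismatches. The hand-rolled map below is a stand-in for a real thesaurus lookup (e.g., against the NCI Metathesaurus); jaccard is from the slide-31 sketch:

    # Hypothetical synonym map; a real version would query a thesaurus
    # for each term's preferred form.
    SYNONYMS = {
        "frontal lobe": "frontal cortex",
        "cancer": "malignant neoplasm",
    }

    def normalized_terms(terms):
        return {SYNONYMS.get(t, t) for t in terms}

    print(jaccard(normalized_terms({"frontal cortex"}),
                  normalized_terms({"frontal lobe"})))  # 0.0 after normalization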
