Mining Public Data for Insights into Human Disease

Mining Public Data for Insights into Human Disease. 11/16/2009 Baliga Lab Meeting Chris Plaisier. Utility of Gene Expression for Human Disease. Microarray Technology. Big Picture. Data Access. Gene Expression Microarray Repositories. Gene Expression Omnibus (GEO) Hosted by: NCBI


Presentation Transcript


  1. Mining Public Data for Insights into Human Disease (Baliga Lab Meeting, 11/16/2009, Chris Plaisier)

  2. Utility of Gene Expression for Human Disease

  3. Microarray Technology

  4. Big Picture

  5. Data Access

  6. Gene Expression Microarray Repositories • Gene Expression Omnibus (GEO): hosted by NCBI; platform: all accepted; normalization: experiment-by-experiment basis; access: R (GEOquery), EUtils; meta-information: GEOMetaDB • ArrayExpress: hosted by EMBL; platform: all accepted; normalization: experiment-by-experiment basis; access: web interface, EMBL API; meta-information: ? (API) • Many smaller repositories have more phenotypic information for specific diseases, but that phenotypic information may be hard to access
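As a rough illustration of the EUtils access route listed for GEO, the sketch below only builds an ESearch query URL against the GEO DataSets database (`db=gds`); it does not send the request. The helper name `geo_esearch_url` and the example search term are illustrative assumptions, not part of the original talk.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def geo_esearch_url(term, retmax=20):
    """Build an ESearch URL for the GEO DataSets database (db=gds).

    `geo_esearch_url` is a hypothetical helper written for this example.
    """
    params = {"db": "gds", "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Example search term (an assumption chosen for illustration).
url = geo_esearch_url("glioma AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return an XML list of matching record IDs, which could then be passed to further E-utilities calls.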

  7. Gene Expression Omnibus

  8. Samples Per Platform in GEO (figure; labels: Latest 3’ Affymetrix Array, HGU133 Plus 2.0, HGU133A) • Affymetrix arrays account for ~67% of human gene expression data in public repositories.

  9. Affymetrix Probesets • GeneChip U133 Plus 2.0 Array: >54,000 probesets (image stored as CEL file) • Probeset = 11 probe pairs • Probe pair = perfect match probe + mismatch probe • Probe length: 25 nucleotides

  10. Pre-Processing 101

  11. Pre-Processing Gene Expression Data

  12. Removing Mis-Targeted and Non-Specific Probes • Normally: CEL file + CDF file (comes from Affymetrix) → intensities • Alternative: CEL file + alternative CDF file (thoroughly cleaned) → intensities • Zhang, et al. 2005

  13. Pre-Processing Gene Expression Data

  14. What Makes Cells Different?

  15. PANP: Presence/Absence Filtering • Use Negative Strand Matching Probesets (NSMPs) to determine the true background distribution • NSMPs are designed to hybridize to the strand opposite the expressed strand • Use the background distribution from these NSMPs to threshold the entire dataset • Output is a call for each gene on each array • Calls are: P = present, M = marginal, A = absent
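The thresholding idea on this slide can be sketched as follows. This is a simplified stand-in, not the PANP package itself: it computes an empirical p-value for each intensity as the fraction of NSMP (background) intensities at or above it, then applies two cutoffs. The function name, the cutoff values (0.01 and 0.02), and the toy numbers are assumptions made for illustration.

```python
from bisect import bisect_left


def presence_calls(intensities, nsmp_intensities, tight=0.01, loose=0.02):
    """Call P/M/A for each intensity against an empirical NSMP background.

    The p-value is the fraction of background intensities that are greater
    than or equal to the observed intensity. Cutoffs are assumed values.
    """
    background = sorted(nsmp_intensities)
    n = len(background)
    calls = []
    for x in intensities:
        # Count background values >= x using the sorted list.
        n_ge = n - bisect_left(background, x)
        p = n_ge / n
        if p < tight:
            calls.append("P")   # present: clearly above background
        elif p < loose:
            calls.append("M")   # marginal
        else:
            calls.append("A")   # absent: indistinguishable from background
    return calls


# Toy background: 1000 made-up NSMP intensities (1, 2, ..., 1000).
toy_background = list(range(1, 1001))
toy_calls = presence_calls([2000, 985.5, 500], toy_background)
print(toy_calls)  # ['P', 'M', 'A']
```

In the toy run, 2000 exceeds every background value (p = 0), 985.5 is exceeded by 15 of 1000 background values (p = 0.015), and 500 sits in the middle of the background (p = 0.501).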

  16. Identifying Present Genes • Filter out genes that are absent in ≥ 50% of arrays • Can be applied to the whole dataset or to subsets • Only present genes are used in later analyses
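A minimal sketch of the ≥ 50% absent filter, assuming the per-array calls are available as a dictionary mapping each gene to a list of "P", "M", or "A" strings. The function name and the toy gene labels are hypothetical.

```python
def present_genes(calls_by_gene, max_absent_fraction=0.5):
    """Keep genes whose fraction of absent ('A') calls is below the cutoff.

    A gene that is absent in exactly 50% of arrays is filtered out,
    matching the '>= 50% absent' rule.
    """
    kept = []
    for gene, calls in calls_by_gene.items():
        absent_fraction = calls.count("A") / len(calls)
        if absent_fraction < max_absent_fraction:
            kept.append(gene)
    return kept


# Toy calls for three made-up genes across four arrays.
toy_calls_by_gene = {
    "GENE_A": ["P", "P", "A", "M"],  # 25% absent -> kept
    "GENE_B": ["A", "A", "P", "P"],  # 50% absent -> filtered out
    "GENE_C": ["A", "A", "A", "P"],  # 75% absent -> filtered out
}
kept_genes = present_genes(toy_calls_by_gene)
print(kept_genes)  # ['GENE_A']
```

To apply the filter to a subset rather than the whole dataset, the same function can be called on calls restricted to the arrays in that subset.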

  17. Pre-Processing Gene Expression Data

  18. Removing Redundancy

  19. Reason for Removing Redundancy Before Running

  20. Removing Redundancy • Collapse Affymetrix Probeset IDs to EntrezIDs • Test for correlation between probesets • If correlation is ≥ 0.8 then combine probesets • If not then leave them separate
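One way the collapsing rule could look in code is sketched below. The slide does not say how more than two probesets per gene are handled or how correlated probesets are combined, so this sketch assumes a greedy grouping against each group's mean profile and combines by averaging. All identifiers and expression values are made up for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_profile(expr, group):
    """Average the expression profiles of the probesets in a group."""
    return [sum(v) / len(v) for v in zip(*(expr[p] for p in group))]


def collapse_probesets(expr, probeset_to_gene, min_r=0.8):
    """Merge probesets of the same gene when correlation >= min_r.

    Assumption: probesets are merged greedily into the first group whose
    mean profile they correlate with; otherwise they start a new group.
    """
    by_gene = {}
    for probeset, gene in probeset_to_gene.items():
        by_gene.setdefault(gene, []).append(probeset)

    collapsed = {}
    for gene, probesets in by_gene.items():
        groups = []
        for ps in probesets:
            for group in groups:
                if pearson(expr[ps], mean_profile(expr, group)) >= min_r:
                    group.append(ps)
                    break
            else:
                groups.append([ps])  # not correlated: leave separate
        for i, group in enumerate(groups):
            label = gene if len(groups) == 1 else f"{gene}.{i + 1}"
            collapsed[label] = mean_profile(expr, group)
    return collapsed


# Toy data: hypothetical probeset IDs; "GENE1"/"GENE2" stand in for Entrez IDs.
toy_expr = {
    "ps1_at": [1, 2, 3, 4],
    "ps2_at": [2, 4, 6, 8.5],   # tracks ps1_at closely
    "ps3_at": [4, 3, 2, 1],     # anti-correlated with ps1_at
    "ps4_at": [3, 1, 4, 1],
}
toy_map = {"ps1_at": "GENE1", "ps2_at": "GENE1", "ps3_at": "GENE1", "ps4_at": "GENE2"}
toy_collapsed = collapse_probesets(toy_expr, toy_map)
print(sorted(toy_collapsed))  # ['GENE1.1', 'GENE1.2', 'GENE2']
```

Here the two correlated probesets are averaged into one profile, while the anti-correlated probeset for the same gene is kept as a separate entry.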

  21. Pre-Processing Gene Expression Data

  22. Pre-Processing Pipeline (figure legend: steps are marked as implemented in R or implemented in Python)

  23. Big Picture

  24. Glioma: A Deadly Brain Cancer (image: Wikimedia Commons)

  25. Brain Anatomy (image: Wikimedia Commons)

  26. What do they do?

  27. Neurophysiology

  28. Hierarchy of Nervous Tissue Tumors

  29. Glioma • Gliomas account for 40% of all brain tumors and 78% of malignant brain tumors (Buckner et al., 2007)

  30. Glioma Survival (figure: survival at 5 years and 10 years; source: http://www.neurooncology.ucla.edu/)

  31. Repository for Molecular Brain Neoplasia Data (REMBRANDT) • REMBRANDT (Madhavan et al., 2009) • Currently 257 individual specimens: glioblastoma multiforme (GBM) = 110; astrocytoma = 50; oligodendroglioma = 55; mixed = 21; non-tumor = 21 • Phenotypes: tumor type (GBM, astrocytoma, etc.); WHO grade (176 individuals); age (253 individuals); sex (250 individuals, partially inferred using Y chromosome genes); survival in days post diagnosis (169 individuals)

  32. REMBRANDT: Chromosome Y Expression (figure: sex-specific gene expression, female vs. male) • 8 males cluster with females • 4 females cluster with males • Apparent male-to-female switches should be more common than the reverse, because it is difficult for a female sample to show Y chromosome expression.

  33. REMBRANDT: Chr. Y Expression – Intelligent Reassignment (figure: sex-specific gene expression, female vs. male) • If the previous sex call disagrees with the expression-based group, the call is turned into an NA • All unknowns are given a call
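The reassignment rule on this slide is small enough to sketch directly. The function name and the encoding ("M", "F", and `None` for unknown or NA) are assumptions for illustration; how the expression-based call itself is derived from the Y chromosome clustering is not covered here.

```python
def reassign_sex(recorded, expression_call):
    """Apply the reassignment rule to one sample.

    recorded:        'M', 'F', or None when the sex is unknown.
    expression_call: 'M' or 'F', from Y chromosome gene expression.
    Returns 'M', 'F', or None (NA).
    """
    if recorded is None:
        # Unknowns are given the expression-based call.
        return expression_call
    if recorded != expression_call:
        # Conflict between annotation and expression: set to NA.
        return None
    return recorded


print(reassign_sex("M", "M"))   # M
print(reassign_sex("M", "F"))   # None
print(reassign_sex(None, "F"))  # F
```

Setting conflicting samples to NA, rather than overwriting the annotation, avoids committing to either source when they disagree.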

  34. Progression of Astrocytic Glioma Furnari, et al. (2007)

  35. Modeling Glioma • Increasing metastatic potential and severity of glioma could be modeled using this simple schema (figure: stages coded 0, 1, 2) • Correlation of the model with survival post diagnosis is -0.68

  36. Exploring Meta-Information • Age explains 31% of survival post diagnosis • Age explains 25% of the progression model • Sex does not have a significant effect on either survival or the progression model • Yet it is known that glioblastoma is slightly more common in men than in women
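Assuming the "explains N%" figures are variance explained by a single-predictor linear model, they equal the squared Pearson correlation; by the same relation a correlation of -0.68 corresponds to roughly 46% of variance (0.68² ≈ 0.46). The sketch below shows the calculation on invented toy numbers; the stage coding and survival values are not from REMBRANDT.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented toy data: progression stage coded 0, 1, 2 and survival in days.
toy_stage = [0, 0, 1, 1, 2, 2]
toy_survival_days = [2000, 1500, 1200, 900, 400, 300]

r = pearson(toy_stage, toy_survival_days)
variance_explained = r ** 2  # R^2 for a one-predictor linear model

print(f"r = {r:.2f}, variance explained = {variance_explained:.0%}")
```

A negative correlation here means that higher coded stage goes with shorter survival, which is the direction reported for the progression model.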

  37. Summary • Ample dataset with a good amount of meta-information • Ready for dimensionality reduction and network inference!

  38. Big Picture

  39. Clustering as Dimensionality Reduction

  40. Big Picture

  41. Likely Issues • Size of eukaryotic genomes • Added complexity of regulatory regions • Tissue and cell type heterogeneity • Patient genetic and environmental heterogeneity

  42. Relative Genome Sizes

  43. Solutions • Pre-process genomic sequences • Reduce data complexity by collapsing redundancies • Utilize filters that select for only the most variant genes
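The "most variant genes" filter could be sketched as below, assuming expression is held as a dictionary from gene to a list of values across samples. The slide does not specify the variance measure or the cutoff, so population variance and a top-N rule are assumptions, as are the function name and toy values.

```python
from statistics import pvariance


def most_variant_genes(expr, top_n):
    """Return the top_n gene names ranked by variance across samples."""
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:top_n]


# Toy expression values for three made-up genes across four samples.
toy_expr = {
    "g1": [5, 5, 5, 5],  # variance 0
    "g2": [1, 9, 1, 9],  # variance 16
    "g3": [4, 6, 4, 6],  # variance 1
}
top_genes = most_variant_genes(toy_expr, top_n=2)
print(top_genes)  # ['g2', 'g3']
```

Genes that barely vary across samples carry little information for clustering or network inference, which is why filtering on variance reduces data size at low cost.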

  44. Likely Issues • Size of eukaryotic genomes • Added complexity of regulatory regions • Tissue and cell type heterogeneity • Patient genetic and environmental heterogeneity

  45. Eukaryotic Gene Structure

  46. Eukaryotic Gene Structure

  47. Eukaryotic Gene Structure

  48. Eukaryotic Gene Structure

  49. Regulatory Regions • Promoter: transcription factor binding sites (6-12 bp motifs) • No set length for promoters in eukaryotes; grabbing 2 Kbp, so we can use 2 Kbp or smaller • 3’ UTR: miRNA binding sites (4-9 bp motifs) • Median 3’ UTR length is 831 bp
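Grabbing a 2 Kbp promoter region depends on the strand of the gene, which the sketch below makes explicit. It assumes 0-based coordinates with half-open intervals and that `tss` is the position of the first transcribed base; the function name and these conventions are assumptions, not taken from the talk.

```python
def promoter_window(tss, strand, length=2000):
    """Return the [start, end) window of `length` bp upstream of the TSS.

    Coordinates are 0-based and half-open (an assumed convention).
    On the '+' strand, upstream means lower coordinates; on the '-'
    strand, upstream means higher coordinates. The window is clipped at
    the chromosome start but not at the chromosome end, since chromosome
    lengths are not known here.
    """
    if strand == "+":
        return max(0, tss - length), tss
    if strand == "-":
        return tss + 1, tss + 1 + length
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


print(promoter_window(10000, "+"))  # (8000, 10000)
print(promoter_window(10000, "-"))  # (10001, 12001)
print(promoter_window(500, "+"))    # (0, 500)
```

For genes on the "-" strand, the extracted sequence would also need to be reverse-complemented before motif searching so that it reads in the direction of transcription.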
