
EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan, Fall 2006. Administrative: the book “Elements of Statistical Learning” is on reserve in the engineering library. The other two books (Data Mining, Bioinformatics) have been recalled.

Presentation Transcript


  1. EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan, Fall 2006

  2. Administrative • The book “Elements of Statistical Learning” is on reserve in the engineering library. • The other two books (Data Mining, Bioinformatics) have been recalled. • Presentation paper selection is due this Friday; three topics remain: • Data Mining in Systems Biology • Data Mining in Proteomics • Analyzing Bionetworks

  3. Administrative • Paper presenter • Always keep in mind the following four “w” questions in your presentation • What is the problem? • Why is the problem important (“who cares”)? • What is the related work (“why bother”)? • What is the impact of the presented work (“so what”)? • Define and explain your computational task • Give intuitions before discussing details • Present the pros and cons of the methods • Audience: • Ask at least one question

  4. Outline • What is a microarray? • Terms from molecular biology (this is NOT Bio101) • Goals • Raw data collection • Raw data analysis • Frequent pattern discovery in microarray data analysis

  5. Microarray • Microarrays are currently used to do many different things: • to detect and measure gene expression at the mRNA level • to find mutations and to genotype; • to sequence DNA; • to locate chromosomal changes and more. • There are many different types of microarrays • cDNA chips • Affymetrix chips

  6. Goals of a Microarray Experiment • Find the genes that change expression between experimental and control samples • Classify samples based on a gene expression profile • Find patterns: Groups of biologically related genes that change expression together across samples/treatments

  7. Microarray Procedure • In general there are two basic aspects of microarrays: • Data acquisition • Producing chips • preparing samples for detection; • hybridization; • scanning; • Data analysis • Low level analysis: normalization and significance test • High level analysis: clustering, classification, and pattern discovery • We are interested in the data analysis section. However, it is dangerous to go into data analysis without knowing how the data are collected

  8. Microarrays are Popular • PubMed search "microarray"= 13,948 papers • 2005 = 4406 • 2004 = 3509 • 2003 = 2421 • 2002 = 1557 • 2001 = 834 • 2000 = 294

  9. Necessary Background • Gene • Central Dogma • DNA • RNA • Nucleic acid hybridization

  10. Genes • The human genome contains 23 pairs of chromosomes. • In each pair, one chromosome is paternally inherited, the other maternally inherited. • Chromosomes are made of compressed and entwined DNA. • A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.

  11. Central Dogma • The expression of the genetic information stored in the DNA molecule occurs in two stages: (i) transcription, during which DNA is transcribed into mRNA; (ii) translation, during which mRNA is translated to produce a protein. DNA → mRNA → protein

  12. DNA • A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides. • There are four types of nucleotides: • adenine (A), • guanine (G), • cytosine (C), and • thymine (T). • Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
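The base-pairing rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are hypothetical helpers, not part of the lecture.

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))          # CTAATGT
print(reverse_complement("GATTACA"))  # TGTAATC
```

The reverse complement is the form usually needed in practice, because the two strands of a double helix run in opposite directions.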

  13. RNA • A ribonucleic acid or RNA molecule is a nucleic acid similar to DNA, but • It is single-stranded; • uracil (U) replaces thymine (T) as one of the bases. • RNA plays an important role in protein synthesis and other chemical activities of the cell. • Several classes of RNA molecules • messenger RNA (mRNA), • transfer RNA (tRNA), • ribosomal RNA (rRNA), • and other small RNAs.

  14. Nucleic acid hybridization: here DNA-RNA
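As a rough sketch of slides 13 and 14, the snippet below rewrites a DNA coding strand as RNA (U in place of T) and checks whether an RNA molecule is perfectly complementary to a DNA probe. The helper names `transcribe` and `can_hybridize` are assumptions made for illustration, and real hybridization tolerates mismatches and depends on temperature and salt, which this toy check ignores.

```python
# Pairing between a DNA base and the RNA base it binds: A-U, T-A, G-C, C-G.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(coding_strand: str) -> str:
    """Write a DNA coding strand as mRNA by replacing thymine with uracil."""
    return coding_strand.upper().replace("T", "U")


def can_hybridize(dna_probe: str, rna: str) -> bool:
    """True if the RNA is the exact antiparallel complement of the DNA probe."""
    expected = "".join(DNA_TO_RNA_PAIR[base] for base in reversed(dna_probe.upper()))
    return rna.upper() == expected


print(transcribe("GATTACA"))                # GAUUACA
print(can_hybridize("GATTACA", "UGUAAUC"))  # True
print(can_hybridize("GATTACA", "GAUUACA"))  # False
```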

  15. DNA Chip Microarrays • Put a large number (~100K) of DNA sequences or synthetic DNA oligomers onto a glass slide in known locations on a grid. • Measure amounts of RNA bound to each square in the grid • Make comparisons • Cancerous vs. normal tissue • Treated vs. untreated • Time course • Many applications in both basic and clinical research

  16. GeneChip

  17. Spot your own Chip Robot spotter Ordinary glass microscope slide

  18. Affymetrix “Gene chip” system • Commercial product • Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene) • RNA labeled and scanned in a single “color” • Currently mass produced arrays targeting 17 different organisms • More than 40 different array types/sets • Proprietary system: “black box” software, can only use their chips

  19. Data Acquisition in Microarray • Scan the arrays • Quantitate each spot • Subtract background • Normalize • Export a table of fluorescent intensities for each gene in the array
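The acquisition steps listed above (quantitate, subtract background, normalize, export) can be sketched as follows. The spot readings, the floor of 1.0 for spots dimmer than their background, and the choice of mean scaling are all hypothetical; real scanner software applies more elaborate models.

```python
import csv
import io
from statistics import fmean

# Hypothetical spot readings: (gene, foreground intensity, local background).
spots = [("geneA", 1200.0, 200.0), ("geneB", 700.0, 200.0), ("geneC", 1700.0, 200.0)]


def subtract_background(foreground, background, floor=1.0):
    """Remove local background; clamp so a dim spot never goes below the floor."""
    return max(foreground - background, floor)


def quantify(spots):
    """Background-correct every spot, then scale by the chip-wide mean."""
    corrected = [(gene, subtract_background(fg, bg)) for gene, fg, bg in spots]
    chip_mean = fmean(value for _, value in corrected)
    return [(gene, value / chip_mean) for gene, value in corrected]


def export_table(rows):
    """Serialize (gene, intensity) rows as a tab-separated table."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene", "normalized_intensity"])
    writer.writerows(rows)
    return buffer.getvalue()


table = quantify(spots)
print(export_table(table))
```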

  20. Hybridization to the Chip

  21. The Chip is Scanned

  22. Images

  23. Data Acquisition [Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing the microarray; hybridise the mRNA target to the microarray; scanning with laser 1 and laser 2 and recording the emission; overlay images and normalize; analysis.]

  24. Streamlined Array Analysis [Figure: analysis pipeline. Labels: Raw data (RMA); Normalize; Filter (present/absent, minimum value, fold change); Significance, Classification, Clustering (methods listed: t-test, SAM, Rank Product, hierarchical clustering, biclustering, PAM, machine learning); Gene lists; Function (Gene Ontology).]

  25. Lower Level Data Analysis • Normalization: • when you have variability in measurements, you need replication and statistics to find real differences • Significance test: • It’s not just the genes with 2 fold increase, but those with a significant p-value across replicates
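To illustrate why fold change alone is not enough, the sketch below compares two hypothetical genes that both show a 2-fold increase but differ greatly in replicate variability. It computes a Welch t statistic by hand; the replicate values are invented, and turning the statistic into a p-value (which needs the t distribution) is left out.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(control, treated):
    """Ratio of mean treated expression to mean control expression."""
    return mean(treated) / mean(control)


def welch_t(control, treated):
    """Welch's t statistic: mean difference divided by its standard error."""
    standard_error = sqrt(variance(control) / len(control)
                          + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical replicate intensities for two genes.
tight_control, tight_treated = [100.0, 110.0, 90.0], [210.0, 190.0, 200.0]
noisy_control, noisy_treated = [50.0, 150.0, 100.0], [400.0, 100.0, 100.0]

for name, control, treated in [("tight", tight_control, tight_treated),
                               ("noisy", noisy_control, noisy_treated)]:
    print(name, fold_change(control, treated), round(welch_t(control, treated), 2))
```

Both genes have a fold change of 2.0, but only the gene with consistent replicates yields a large t statistic.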

  26. Sources of Variability in Raw Data • Biological variability • Sample preparation • Probe labeling • RNA extraction • Experimental condition • temperature, time, mixing, etc. • Scanning • laser and detector, chemistry of the fluorescent label • Image analysis • identifying and quantifying each spot on the array

  27. Self-self hybridizations False colour overlay

  28. Variability • Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.

  29. Data Normalization • Can control for many of the experimental sources of variability (systematic, not random or gene specific) • Bring each image to the same average brightness • Can use simple math or fancy: • divide by the mean (whole chip or by sectors) • LOESS (locally weighted regression) • No sure biological standards
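The simplest option on this slide, dividing by the whole-chip mean, can be sketched as below. The two chips are hypothetical: chip B is chip A scanned twice as bright, so after scaling they should agree. LOESS normalization is not shown.

```python
from statistics import fmean


def normalize_by_mean(intensities):
    """Scale a chip so that its average brightness is 1.0."""
    chip_mean = fmean(intensities)
    return [value / chip_mean for value in intensities]


# Hypothetical chips: same sample, but chip B was scanned twice as bright.
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]

print(normalize_by_mean(chip_a))  # [0.5, 1.0, 1.5]
print(normalize_by_mean(chip_b))  # [0.5, 1.0, 1.5]
```

This only removes a chip-wide multiplicative effect; intensity-dependent or spatial bias is what the sector-wise and LOESS variants address.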

  30. Significance Test • In a microarray experiment, each gene (each probe or probe set) is really a separate experiment • Yet if you treat each gene as an independent comparison, you will always find some with significant differences • (the tails of a normal distribution)

  31. False Discovery • Statisticians call a false positive a "Type I error" or a "false discovery" • The expected number of false discoveries is roughly the p-value threshold of the t-test × the number of genes on the array • For a p-value threshold of 0.01 × 10,000 genes = 100 false “different” genes • You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001) • The number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values
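The arithmetic on this slide multiplies the p-value threshold by the number of genes tested, which gives the expected count of false positives when no gene truly changes. A minimal sketch follows; the function names and the figure of 250 called genes are hypothetical, and the final ratio is only a crude estimate of the fraction of false calls, not a formal FDR procedure.

```python
def expected_false_positives(p_threshold, n_genes):
    """Expected number of genes passing the threshold by chance alone."""
    return p_threshold * n_genes


def estimated_false_fraction(p_threshold, n_genes, n_called):
    """Rough share of the called genes that are expected to be false."""
    return min(1.0, expected_false_positives(p_threshold, n_genes) / n_called)


print(round(expected_false_positives(0.01, 10_000)))   # 100
print(round(expected_false_positives(0.001, 10_000)))  # 10

# Hypothetical experiment: 250 genes pass p < 0.01 on a 10,000-gene array.
print(round(estimated_false_fraction(0.01, 10_000, 250), 2))  # 0.4
```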

  32. Higher Level Data Analysis • Computational tasks: • Clustering • Classification • Statistical validation • Data visualization • Pattern detection • Biological problems: • Discovery of common sequences in co-regulated genes • Meta-studies using data from multiple experiments • Linkage between gene expression data and gene sequence/function/metabolic pathways databases

  33. Types of Clustering • Hierarchical • Link similar genes, build up to a tree of all • Self Organizing Maps (SOM) • Split all genes into similar sub-groups • Finds its own groups (machine learning) • Principal Component • every gene is a dimension (vector), find a single dimension that best represents the differences in the data
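The "link similar genes, build up to a tree" idea can be sketched as a small agglomerative procedure. This is a toy version under stated assumptions: the expression profiles are invented, Euclidean distance with average linkage is one choice among several, and the brute-force search would be far too slow for a real array.

```python
import math
from itertools import combinations

# Hypothetical expression profiles (one value per sample) for four genes.
profiles = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.1, 2.1, 2.9],
    "g3": [5.0, 1.0, 0.5],
    "g4": [5.2, 0.9, 0.4],
}


def cluster_distance(c1, c2, profiles):
    """Average linkage: mean pairwise distance between members of two clusters."""
    total = sum(math.dist(profiles[x], profiles[y]) for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(pair[0], pair[1], profiles))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges


for left, right in agglomerate(profiles):
    print(left, "+", right)
```

The sequence of merges is the tree: early merges join the most similar genes, and the last merge joins the two top-level groups.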

  34. Cluster by Color Difference

  35. GeneSpring

  36. Classification • How to sort samples into two classes based on gene expression data • Cancer vs. normal • Cancer sub-types: benign vs. malignant • Responds well to drug vs. poor response
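One simple way to sort samples into two classes is a nearest-centroid rule: average the training profiles of each class and assign a new sample to the closer average. The slide does not name a method, so this choice, the two-gene profiles, and the class labels below are all illustrative assumptions.

```python
import math
from statistics import fmean

# Hypothetical training data: expression of two genes per sample, by class.
training = {
    "cancer": [[8.0, 1.0], [9.0, 2.0]],
    "normal": [[2.0, 7.0], [1.0, 8.0]],
}


def centroid(samples):
    """Gene-wise mean profile of a list of samples."""
    return [fmean(values) for values in zip(*samples)]


def classify(profile, training):
    """Assign the class whose centroid is closest in Euclidean distance."""
    centroids = {label: centroid(samples) for label, samples in training.items()}
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


print(classify([7.5, 2.5], training))  # cancer
print(classify([2.0, 6.0], training))  # normal
```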

  37. Functional Genomics • Take a list of "interesting" genes and find their biological relationships • Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods • Requires a reference set of "biological knowledge"

  38. GO • Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO) • 3 hierarchical sets of terminology • Biological Process • Cellular Component (location within cell) • Molecular Function • about 1000 categories of functions

  39. Gene Ontology

  40. Biological Pathways

  41. Microarray Databases • Large experiments may have hundreds of individual array hybridizations • Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments • Data-mining - look for similar patterns of gene expression across different experiments

  42. Public Databases • Gene Expression data is an essential aspect of annotating the genome • Publication and data exchange for microarray experiments • Data mining/Meta-studies • Common data format - XML • MIAME (Minimal Information About a Microarray Experiment)

  43. GEO at the NCBI

  44. Array Express at EMBL

  45. Array Express at EMBL

  46. Are the Treatments Different? • Analysis of microarray data has tended to focus on making lists of genes that are up- or down-regulated between treatments • Before making these lists, ask the question: "Are the treatments different?" • Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test) • If there are differences, find the genes most responsible • If there are no significant overall differences, then lists of genes with large fold changes may only reflect random variability.

  47. Association Rules in Microarray Analysis • Row enumeration vs. column enumeration • Suppose we have m rows and I columns • The search space of column enumeration is 2^I • Row enumeration reduces the search space to 2^m when m << I • Supports rule set pruning using minimum support, minimum confidence, and chi-square
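To get a feel for the gap between the two search spaces, the sketch below counts the candidate subsets for a hypothetical data set with 100 samples (rows) and 10,000 genes (columns). The sizes are assumptions chosen to resemble a typical microarray study, not figures from the lecture.

```python
def search_space(n):
    """Number of subsets of n elements (including the empty set)."""
    return 2 ** n


m_rows, n_columns = 100, 10_000  # hypothetical: few samples, many genes

row_space = search_space(m_rows)
column_space = search_space(n_columns)

# Print the number of decimal digits rather than the raw values.
print("row enumeration:   ", len(str(row_space)), "digits")
print("column enumeration:", len(str(column_space)), "digits")
```

Both spaces are far too large to scan exhaustively, which is why the pruning by support, confidence, and chi-square matters even after switching to row enumeration.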

  48. Background – Rule Groups • A microarray data set contains a large number of rules • Rule sets contain a lot of redundancy • This makes interpretation difficult • Ex: 31 rules could be generated from the items {a, b, c, d, e} with class label Cancer, all with the same support • FARMER finds rule groups • The 31 rules above would be grouped together • It only finds interesting rule groups. Consider abcd->Cancer with a confidence of 90% and ab->Cancer with a confidence of 95%. All rows covered by abcd must be covered by ab. Therefore, abcd->Cancer is not interesting.
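The count of 31 comes from the non-empty subsets of five items: 2^5 - 1 = 31 possible left-hand sides. The sketch below enumerates them and encodes a simplified reading of the slide's interestingness test, in which a more specific rule is kept only if its confidence beats the more general rule it extends. The function names are hypothetical, and FARMER's actual criterion is defined over rule groups rather than individual rules.

```python
from itertools import combinations


def rules_for(items, label):
    """All rules 'subset -> label' with a non-empty subset of the items."""
    return [(frozenset(subset), label)
            for size in range(1, len(items) + 1)
            for subset in combinations(items, size)]


def is_interesting(specific_confidence, general_confidence):
    """Simplified test: the specific rule must beat the general one."""
    return specific_confidence > general_confidence


rules = rules_for(["a", "b", "c", "d", "e"], "Cancer")
print(len(rules))  # 31

# abcd -> Cancer (90%) versus ab -> Cancer (95%), as on the slide.
print(is_interesting(0.90, 0.95))  # False
```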

  49. Preliminaries and Definitions • D = dataset (rows), R = set of rows {r1,…,rn}, I = set of items {i1,…,im}, C = set of class labels {c1,…,ck} • Row support set: given a set of items I’, R(I’) is the set of rows that contain I’ • Item support set: given a set of rows R’, I(R’) is the set of items common to all rows in R’

  50. Example • I’ = {a,e,h} • R(I’) = {r2,r3,r4} • R’ = {r2,r3} • I(R’) = {a,e,h}
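The two support sets can be sketched with plain Python sets. The table the slide refers to did not survive in the transcript, so the five-row table below is a hypothetical one chosen only so that it reproduces the values quoted in the example; `row_support` and `item_support` are illustrative names for R(I’) and I(R’).

```python
# Hypothetical table (row id -> items), consistent with the slide's example.
table = {
    "r1": {"a", "b", "c", "l", "o", "s"},
    "r2": {"a", "d", "e", "h", "p", "l", "r"},
    "r3": {"a", "c", "e", "h", "o", "q", "t"},
    "r4": {"a", "e", "f", "h", "p", "r"},
    "r5": {"b", "d", "f", "g", "l", "q", "s", "t"},
}


def row_support(items, table):
    """R(I'): the rows that contain every item in the given item set."""
    return {row for row, row_items in table.items() if items <= row_items}


def item_support(rows, table):
    """I(R'): the items shared by all rows in the given (non-empty) row set."""
    return set.intersection(*(table[row] for row in rows))


print(sorted(row_support({"a", "e", "h"}, table)))  # ['r2', 'r3', 'r4']
print(sorted(item_support({"r2", "r3"}, table)))    # ['a', 'e', 'h']
```

Note that R(I(R’)) can be larger than R’: here the items common to r2 and r3 are also contained in r4.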
