
More on TF Motif Finding ChIP-chip / seq

More on TF Motif Finding ChIP-chip / seq. Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520. De novo Sequence Motif Finding. Goal: look for common sequence patterns enriched in the input data (compared to the genome background) Regular expression enumeration Pattern driven approach


Presentation Transcript


  1. More on TF Motif Finding ChIP-chip / seq Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

  2. De novo Sequence Motif Finding • Goal: look for common sequence patterns enriched in the input data (compared to the genome background) • Regular expression enumeration • Pattern driven approach • Enumerate patterns, check significance in dataset • Oligonucleotide analysis, MobyDick • Position weight matrix update • Data driven approach, use data to refine motifs • Consensus, EM & Gibbs sampling • Motif score and Markov background
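A minimal sketch of the "motif score and background" idea from this slide: a position weight matrix is built from aligned sites and each window of a sequence is scored by its log-odds against the background. It assumes a zero-order (independent-base) background rather than a higher-order Markov model, and the function names, the pseudocount, and the example sites are illustrative, not taken from the slides.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Column-wise base probabilities from aligned sites of equal width."""
    width = len(sites[0])
    pwm = []
    for j in range(width):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in "ACGT"})
    return pwm

def log_odds(segment, pwm, background):
    """Log2 likelihood ratio of the segment under the motif vs. the background."""
    return sum(math.log2(pwm[j][base] / background[base])
               for j, base in enumerate(segment))

def best_hit(sequence, pwm, background):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max((log_odds(sequence[i:i + width], pwm, background), i)
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned sites and a uniform background (assumptions).
sites = ["TGTAACGT", "TGTAACGT", "AGTAACGT"]
uniform = {base: 0.25 for base in "ACGT"}
score, start = best_hit("CCCCTGTAACGTCCCC", build_pwm(sites), uniform)
```

Replacing the uniform background with base frequencies estimated from the genome (or with a higher-order Markov chain) changes only the denominator of the ratio.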

  3. Position Weight Matrix Update • Advantage: • Can look for motifs of any width • Flexible with base substitutions • Disadvantage: • EM and Gibbs sampling: no guaranteed convergence time • No guaranteed global optimum

  4. Motif Finding in Bacteria • Promoter sequences are short (200-300 bp) • Motifs are usually long (10-20 bases) • Some have two blocks with a gap, some are palindromes • Long motifs are usually very degenerate • A single microarray experiment sometimes already provides enough information to search for TF motifs

  5. Motif Finding in Lower Eukaryotes • Upstream sequences longer (500-1000 bp), with some simple repeats • Motif width varies (5 – 17 bases) • Expression clusters provide input sequences of decent quality for TF motif finding • Motif combination and redundancy appear, although single motifs are usually significant enough for identification

  6. Yeast Promoter Architecture • Co-occurring regulators suggest physical interaction between the regulators

  7. Motif Finding in Higher Eukaryotes • Upstream sequences very long (3KB-20KB) with repeats, TF motif could appear downstream • Motifs can be short or long (6-20 bases), and appear in combination and clusters • Gene expression cluster not good enough input • Need: • Comparative Genomics: phastcons score • Motif modules: motif clusters • ChIP-chip/seq

  8. Yeast Regulatory Sequence Conservation

  9. UCSC PhastCons Conservation • Functional regulatory sequences are under stronger evolutionary constraint • Align orthologous sequences together • PhastCons conservation score (0 – 1) for each nucleotide in the genome can be downloaded from UCSC

  10. Conserved Motif Clusters • First find conserved regions in the genome • Then look for repeated transcription factor (TF) binding sites • They form transcription factor modules

  11. Outline • ChIP-chip on yeast • Technology and data analysis: MDscan motif finding, regulatory network • ChIP-X on human • Tiling microarrays and peak finding • High throughput sequencing and peak finding • Data analysis and examples • Analysis: peak finding, gene expression analysis, sequence motif finding, regulatory network • Holistic picture of gene regulation

  12. Motivation • Motif finding works well in bacteria, OK in yeast, marginally in worm/fly, and almost never in mammals • Cistrome: Genome-wide in vivo binding sites of DNA-binding proteins • ChIP-chip and ChIP-seq give cistrome results

  13. ChIP-chip Technology • Chromatin ImmunoPrecipitation + microarray • ChIP-on-chip or ChIP-chip • Also known as Genome Scale Location Analysis • Detect genome-wide in vivo location of TF and other DNA-binding proteins • Find all the DNA sequences bound by TF-X? • Cook all the dishes with cinnamon • Can learn the regulatory mechanism of a transcription factor or DNA-binding protein much better and faster

  14. Chromatin ImmunoPrecipitation (ChIP)

  15. TF/DNA Crosslinking in vivo

  16. Sonication (~500bp)

  17. TF-specific Antibody

  18. Immunoprecipitation

  19. Reverse Crosslink and DNA Purification

  20. Promoter Array Hybridization • Figure labels: genes, intergenic regions, ChIP

  21. ChIP-DNA chip Detection • Started in yeast, using promoter cDNA microarrays • ~ 6000 spots, each 800-1000 bp • Two color assay • Control: no antibody, or chromatin (a little bit of everything) • Need triplicates to cancel noise • Applied to all yeast TFs • TF modified to contain a tag • Tag can be precipitated with immunoglobulin

  22. ChIP-chip Motif Finding • ChIP-chip gives 10-5000 binding regions ~600-1000bp long. Precise binding motif? • Raw data is like perfect clustering, plus enrichment values • MDscan • High ChIP ranking => true targets, contain more sites • Search TF motif from highest ranking targets first (high signal / background ratio) • Refine candidate motifs with all targets • Used successfully in ChIP-chip motif finding

  23. Similarity Defined by m-match • For a given w-mer and any other random w-mer, count the matched positions • m-matches for the 8-mer TGTAACGT: TGTAACGT matched 8, AGTAACGT matched 7, TGCAACAT matched 6, TGACACGG matched 5, AATAACAG matched 4 • Pick a reasonable m to call two w-mers similar
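The m-match definition on this slide can be sketched in a few lines. The function names are illustrative; the w-mers are the ones listed on the slide.

```python
def match_count(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same width")
    return sum(x == y for x, y in zip(a, b))

def m_matches(seed, sequence, m):
    """All w-mers in the sequence that agree with the seed at >= m positions."""
    w = len(seed)
    return [sequence[i:i + w]
            for i in range(len(sequence) - w + 1)
            if match_count(seed, sequence[i:i + w]) >= m]

seed = "TGTAACGT"
for other in ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]:
    print(other, "matched", match_count(seed, other))
```

A larger m gives fewer, more similar w-mers per seed; a smaller m tolerates more degenerate sites at the cost of more background matches.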

  24. MDscan Seeds • Figure: ChIP-chip selected upstream sequences, ordered with higher enrichment first • A 9-mer (ATTGCAAAT) and its similar w-mers (TTTGCGAAT, TTGCAAATC) give a seed motif pattern • w-mers shown in the figure: ATTGCAAAT TTTGCGAAT TTTGCAAAT GCCACCGT ACCACCGT ACCACGGT GCCACGGC … GCAAATCCA GCAAATTCG GCAAATCCA GGAAATCCA GGAAATCCT TTGCAAATC TTGCGAATA TTGCAAATT TTGCCCATC TTTGCAAAT CAAATCCAA CAAATCCAA GAAATCCAC TGCAAATCC TGCAAATTC

  25. Update Motifs With Remaining Seqs • Figure labels: Seed1 m-matches; extreme high rank; all ChIP-selected targets

  26. Refine the Motifs • Figure labels: Seed1 m-matches; extreme high rank; all ChIP-selected targets
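A simplified sketch of the seed-then-update strategy described on the MDscan slides: enumerate w-mers in the highest-ranking sequences, group each with its m-matches to form seeds, then add m-matches found in the remaining targets. This is not the published MDscan implementation; in particular, ranking seeds by their number of m-matches is a stand-in (an assumption) for the actual motif scoring function, and all names and example sequences are illustrative.

```python
def wmers(sequence, w):
    """All overlapping w-mers of a sequence."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]

def n_match(a, b):
    """Number of agreeing positions between two equal-length w-mers."""
    return sum(x == y for x, y in zip(a, b))

def seed_motifs(top_sequences, w, m):
    """Map each distinct w-mer in the top sequences to its m-matches there."""
    pool = [k for seq in top_sequences for k in wmers(seq, w)]
    return {seed: [k for k in pool if n_match(seed, k) >= m]
            for seed in set(pool)}

def best_seed(seeds):
    """Pick the seed with the most m-matches (simplified stand-in score)."""
    return max(seeds, key=lambda s: (len(seeds[s]), s))

def update_with_remaining(seed, members, remaining_sequences, m):
    """Add m-matches of the seed found in the lower-ranking targets."""
    w = len(seed)
    extra = [k for seq in remaining_sequences for k in wmers(seq, w)
             if n_match(seed, k) >= m]
    return members + extra

# Hypothetical sequences: two top-ranked targets, one lower-ranked target.
top = ["GGATTGCAAATCC", "CCTTTGCAAATGG"]
rest = ["ACGTATTGCGAATACGT"]
seeds = seed_motifs(top, w=9, m=8)
members = update_with_remaining("ATTGCAAAT", seeds["ATTGCAAAT"], rest, m=8)
```

Starting from the top-ranked sequences keeps the enumeration small and the signal-to-background ratio high, which is the motivation given on the ChIP-chip Motif Finding slide.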

  27. Yeast TF Regulatory Network • Figure labels: protein, transcribe, regulate, gene

  28. ChIP-chip Better Explains Expression • Figure labels: Ndt80 & Sum1 regulated genes; Sum1 regulated genes; Ndt80 regulated genes

  29. Genome Tiling Microarrays • Promoter arrays don't work for human ChIP-chip • Binding could appear in much more distant intergenic sequences, introns, exons, or downstream sequences • Figure labels: tiling probes; genomic DNA on the chromosome

  30. DNA Purification

  31. ChIP-chip on Tiling Microarray • Figure labels: ChIP, Ctrl, chromosome, ChIP-DNA, noise

  32. ChIP-chip • Detect genome-wide location of transcription and epigenetic factors • Affymetrix genome tiling arrays are cheaper • $2000 7 arrays * 6 million probes * (3 ChIP + 3 Ctrl) • But data is noisier and less informative • Two peaks? How about ChIP alone? Over 42M probes? • Figure axes: log probe intensity (ChIP, Ctrl) vs. chromosome coordinates

  33. ChIP-chip Analysis: Mann-Whitney U-test • Affy TAS, Cawley et al (Cell 2004): • Assign 1 to all probe pairs with MM > PM • Each probe: rank probes within [-500bp, +500bp] window

  34. ChIP-chip Analysis: Mann-Whitney U-test • Affy TAS, Cawley et al (Cell 2004): • Assign 1 to all probe pairs with MM > PM • Each probe: rank probes within [-500bp, +500bp] window • Check whether sum of ChIP ranks is much smaller • Consider all probes equally • Half of the probes have MM > PM • Figure: histogram of (PM – MM)
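The rank-based test on this slide can be sketched as follows: floor each probe signal at 1 when MM > PM, collect the probes within ±500 bp of a position, pool ChIP and control values, and sum the ChIP ranks. This is a hedged illustration, not the Affymetrix TAS code; ranking from largest to smallest (so that enriched ChIP probes get small ranks) and averaging ranks over ties are assumptions, as are all function names.

```python
def floored_signal(pm, mm):
    """PM - MM, with every pair where MM > PM (or equal) assigned 1."""
    return max(pm - mm, 1)

def window_values(positions, values, center, half_width=500):
    """Values of the probes whose position lies within the window."""
    return [v for x, v in zip(positions, values)
            if abs(x - center) <= half_width]

def chip_rank_sum(chip, ctrl):
    """Sum of ChIP ranks in the pooled sample (rank 1 = largest value)."""
    pooled = sorted([(v, True) for v in chip] + [(v, False) for v in ctrl],
                    key=lambda pair: -pair[0])
    total, i = 0.0, 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j for ties
        total += average_rank * sum(1 for k in range(i, j) if pooled[k][1])
        i = j
    return total

# Smallest possible rank sum for 3 ChIP values: 1 + 2 + 3.
strongly_enriched = chip_rank_sum([10, 9, 8], [1, 2, 3])
```

Because half of the probes have MM > PM and are all floored to 1, a large share of the pooled values are tied, which is one reason the test "considers all probes equally" and loses information.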

  35. Affymetrix Tiling Array Peak Finding • Challenges: • Massive data, probe values noisy • Only 1/3 of researchers get it to work the first time • Previous algorithms only work by comparing 3 ChIP with 3 Ctrl • Model-based Analysis of Tiling arrays (MAT) • Work with single ChIP (no rep, no ctrl) • Find individual failed samples • More sensitive, specific, and quantitative with 3 ChIP & 3 Ctrl MAT: Johnson et al, PNAS 2006

  36. MAT • Most of the probes in ChIP-chip measure non-specific hybridization and background noise • Estimate probe behavior by checking other probes with similar sequence on the same array • Probe sequence plays a big role in signal value

  37. Model Sequence-Specific Probe Effect • First detailed model of probe sequence on probe signal • Example 25-mer: AATGC ACTGT GCACA GATCG GCCAT has 7 A, 7 C, 6 G, 5 T and maps to 2 places in the genome • Use all the probes on the array to estimate the parameters • Terms labeled in the model equation (figure): probe signal; # of T's; intercept; position-specific A, C, G effect; A, C, G, T count squared; 25-mer copy number
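To make the model terms concrete, the sketch below turns a 25-mer into the predictors named on this slide: the number of T's, position-specific indicators for A, C, and G, the squared base counts, and the copy number. Taking the logarithm of the copy number and the dictionary layout are assumptions; the slide lists the terms but the equation itself did not survive in the transcript. The example probe and its copy number of 2 are from the slide.

```python
import math

def probe_features(probe, copy_number):
    """Predictors for a linear probe-effect model (illustrative layout)."""
    counts = {base: probe.count(base) for base in "ACGT"}
    # One 0/1 indicator per (position, base) for A, C, G; T is the baseline.
    position_indicators = [1 if probe[j] == base else 0
                           for j in range(len(probe))
                           for base in "ACG"]
    return {
        "n_T": counts["T"],
        "position_ACG": position_indicators,
        "count_squared": {base: counts[base] ** 2 for base in "ACGT"},
        "log_copy_number": math.log(copy_number),  # log scale is an assumption
    }

probe = "AATGC" "ACTGT" "GCACA" "GATCG" "GCCAT"  # 25-mer from the slide
features = probe_features(probe, copy_number=2)
```

With one such feature row per probe and the observed log intensity as the response, the parameters can be fit by ordinary least squares over all probes on the array.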

  38. Probe Standardization • Fit the probe model array by array • Figure labels: 6M probes, 2K bins; observed probe intensity; model predicted probe intensity; observed probe variance within each bin
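A minimal sketch of the standardization suggested by the figure labels: probes are sorted by model-predicted intensity, split into bins, and each probe's residual (observed minus predicted) is divided by the standard deviation of the observed values in its bin. The equal-size binning and the use of the population standard deviation are assumptions; the slide only states that 6M probes are grouped into 2K bins.

```python
import math
import statistics

def standardize_probes(observed, predicted, n_bins):
    """Residual of each probe scaled by the spread of its intensity bin."""
    order = sorted(range(len(observed)), key=lambda i: predicted[i])
    bin_size = math.ceil(len(order) / n_bins)
    standardized = [0.0] * len(observed)
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        spread = statistics.pstdev([observed[i] for i in members])
        for i in members:
            residual = observed[i] - predicted[i]
            standardized[i] = residual / spread if spread > 0 else 0.0
    return standardized

# Toy values: two bins of two probes each.
t_values = standardize_probes([1.0, 3.0, 10.0, 14.0],
                              [2.0, 2.0, 12.0, 12.0], n_bins=2)
```

After this step, probes with very different sequence composition are on a comparable scale, so neighboring probes can be combined in a window.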

  39. Raw probe values at two spike-in regions with concentration 2X (ChIP, Ctrl) • Sequence-based probe behavior standardization (ChIP standardized, Ctrl standardized) • Window-based neighboring probe combination for ChIP-region detection: ChIP window; (ChIP – Ctrl); (3 ChIP – 3 Ctrl)

  40. MA2C: Model-based for 2-Color Arrays • Normalize probes by GC bins within each array • How much variance is observed in the GC bin • Give high confidence probes more weight • Running window average or median for peak finding MA2C: Song et al, Genome Biol 2007
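The two MA2C steps named on this slide, GC-bin normalization and a running window median, can be sketched as below. This is a simplified illustration and not the published MA2C method: it standardizes log ratios by the mean and standard deviation of probes with the same GC count and omits the confidence weighting of probes. Function names and toy values are assumptions.

```python
import statistics
from collections import defaultdict

def gc_count(probe):
    """Number of G and C bases in a probe sequence."""
    return probe.count("G") + probe.count("C")

def normalize_by_gc(probes, log_ratios):
    """Center and scale each log ratio using probes with the same GC count."""
    bins = defaultdict(list)
    for probe, ratio in zip(probes, log_ratios):
        bins[gc_count(probe)].append(ratio)
    stats = {gc: (statistics.mean(v), statistics.pstdev(v))
             for gc, v in bins.items()}
    normalized = []
    for probe, ratio in zip(probes, log_ratios):
        mean, spread = stats[gc_count(probe)]
        normalized.append((ratio - mean) / spread if spread > 0 else 0.0)
    return normalized

def window_median(positions, scores, center, half_width):
    """Median score of probes within half_width of center, or None if empty."""
    inside = [s for x, s in zip(positions, scores)
              if abs(x - center) <= half_width]
    return statistics.median(inside) if inside else None

scores = normalize_by_gc(["GGCC", "GCGC", "ATAT", "TATA"], [1.0, 3.0, 0.0, 4.0])
```

The median is less sensitive than the mean to a single outlying probe in the window, which matters when individual probe values are noisy.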

  41. Is a ChIP experiment working? • MAT window scores ~ normal with long tails • Estimate p-value of normal from left half of data • FDR = A / B (Ctrl/ChIP peaks are all FPs) • Spike-in shows MAT FDR estimate is accurate • Can find individual failed replicate • Figure labels: A, B
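A sketch of the two ideas on this slide, under stated assumptions: the null distribution is taken to be a normal centered at the median, with its standard deviation estimated from the scores at or below the median (the "left half"), and the FDR at a cutoff is the count of scores that far below the center (A, assumed to be the Ctrl-over-ChIP side) divided by the count that far above it (B). Counting window scores rather than merged peaks is a simplification, and all names are illustrative.

```python
import math
import statistics

def null_from_left_half(scores):
    """Center and standard deviation of the null, using only the left half."""
    center = statistics.median(scores)
    left = [s for s in scores if s <= center]
    spread = math.sqrt(sum((s - center) ** 2 for s in left) / len(left))
    return center, spread

def upper_tail_p(score, center, spread):
    """P(null score >= observed score) under the fitted normal."""
    z = (score - center) / spread
    return 0.5 * math.erfc(z / math.sqrt(2))

def empirical_fdr(scores, center, cutoff):
    """A / B: scores at least `cutoff` below center over those above it."""
    above = sum(1 for s in scores if s - center >= cutoff)
    below = sum(1 for s in scores if center - s >= cutoff)
    return below / above if above else None

scores = [-3, -1, -1, 0, 0, 0, 1, 1, 3, 4, 5]  # toy window scores
center, spread = null_from_left_half(scores)
fdr = empirical_fdr(scores, center, cutoff=3)
```

A replicate whose estimated FDR stays high at every cutoff has as many negative as positive peaks, which is how a failed sample would show up.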

  42. ChIP-Seq • Sequence millions of 30-mer ends of fragments • Map 30-mers back to the genome • Figure labels: ChIP-DNA, noise

  43. MACS: Model-based Analysis for ChIP-Seq • Use confident peaks to model shift size • Figure label: binding

  44. Peak Calls • Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-Seq shows local biases in the genome • Chromatin and sequencing bias

  45. Peak Calls • Tag distribution along the genome ~ Poisson distribution (λBG = total tag / genome size) • ChIP-Seq shows local biases in the genome • Chromatin and sequencing bias • 200-300bp control windows have too few tags • But can look further • Dynamic λlocal = max(λBG, [λctrl, λ1k,] λ5k, λ10k) • Figure labels: ChIP, Control; window sizes 300bp, 1kb, 5kb, 10kb • http://liulab.dfci.harvard.edu/MACS/ • Zhang et al, Genome Biol, 2008
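The dynamic background rate on this slide can be sketched as follows: control tag counts from windows of several sizes are rescaled to the candidate peak width, the largest of these and the genome-wide rate is used as λlocal, and the ChIP tag count is tested against a Poisson with that rate. This is an illustration and not the MACS source code; the rescaling of each window to the peak width, the function names, and the toy numbers are assumptions.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower tail."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def local_lambda(lambda_bg, control_counts, peak_width):
    """max(lambda_BG, lambda_1k, lambda_5k, lambda_10k), per peak window.

    control_counts maps a window size in bp to the control tag count in a
    window of that size centered on the candidate peak.
    """
    scaled = [count * peak_width / size
              for size, count in control_counts.items()]
    return max([lambda_bg] + scaled)

# Toy numbers: 200 bp candidate peak, genome-wide expectation of 2 tags.
lam = local_lambda(2.0, {1000: 20, 5000: 50, 10000: 60}, peak_width=200)
p_value = poisson_upper_tail(15, lam)
```

Taking the maximum makes the test conservative: a region where the control shows locally elevated coverage needs more ChIP tags to reach the same p-value.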

  46. CEAS: Cis-regulatory Element Annotation System • Data Analysis Button for Biologists • http://ceas.cbi.pku.edu.cn

  47. Estrogen Receptor • Carroll et al, Cell 2005 • Overactive in > 70% of breast cancers • Where does it go in the genome? • ChIP-chip on chr21/22, motif and expression analysis found its partner FoxA1 • Figure labels: ER, TF??

  48. Estrogen Receptor (ER) Cistrome in Breast Cancer • Carroll et al, Nat Genet 2006 • ER may function far away (100-200KB) from genes • Only 20% of ER sites have PhastCons > 0.2 • ER has different effects depending on its collaborators • Figure labels: ER, AP1, NRIP

  49. Estrogen Receptor (ER) Cistrome in Breast Cancer • Carroll et al, Nat Genet 2006 • ER may function far away (100-200KB) from genes • Only 20% of ER sites have PhastCons > 0.2 • ER has different effects depending on its collaborators • Figure labels: ER, NRIP, AP1
