Lecture 13 Cis-Regulation cont’d GREAT http://cs273a.stanford.edu [Bejerano Fall10/11]
Gene Regulation • gene (how to) • control region (when & where) [Figure labels: RNA gene, protein coding, DNA, DNA-binding proteins]
Pol II Transcription • Key components: • Proteins • DNA sequence • DNA epigenetics • Protein components: • General Transcription factors • Activators • Co-activators
Enhancers
Vertebrate Gene Regulation • gene (how to) • control region (when & where) • distal: within 10^6 letters • proximal: within 10^3 letters [Figure labels: DNA, DNA-binding proteins]
Gene Expression Domains: Independent
Distal Transcription Regulatory Elements
Repressors / Silencers
What are Enhancers? Repressors • What do enhancers encode? • Surely a cluster of TF binding sites. • [but TFBS prediction is hard and fraught with false positives] • What else? DNA structure-related properties? • So how do we recognize enhancers? • Sequence conservation across multiple species • [weak but generic] • Verifying repressors is trickier [loss vs. gain of function]. • How do you tell an enhancer from a repressor by prediction alone? Duh...
Insulators
Cis-Regulatory Components • Low level (“atoms”): • Promoter motifs (TATA box, etc) • Transcription factor binding sites (TFBS) • Mid Level: • Promoter • Enhancers • Repressors/silencers • Insulators/boundary elements • Cis-regulatory modules (CRM) • Locus control regions (LCR) • High Level: • Epigenetic domains / signatures • Gene expression domains • Gene regulatory networks (GRN)
Disease Implications: Genes • Over 300 genes already implicated in limb malformations. [Figure labels: gene, genome, protein, limb malformation]
Disease Implications: Cis-Reg • Growing number of cases (limb, deafness, etc). [Figure labels: gene, genome, NO protein made, limb malformation]
Transcription Regulation & Human Disease [Wang et al, 2000]
Critical regulatory sequences [Lettice et al., HMG 2003, 12: 1725-35] [Figure labels: single base changes, knock out]
Other Positional Effects [de Kok et al, 1996]
Genomewide Association Studies point to non-coding DNA
WGA Disease
9p21 Cis effects Follow up study:
Cis-Regulatory Evolution: E.g., Mobile Elements • What settings make these “co-option” events happen? [Yass is a small town in New South Wales, Australia.] [Figure label: Gene]
Britten & Davidson Hypothesis: Repeat to Rewire! [Davidson & Erwin, 2006] [Britten & Davidson, 1971]
Modular: Most Likely to Evolve? [Figure labels: Chimp, Human]
Human Accelerated Regions • Human-specific substitutions in conserved sequences [Figure labels: Human, Chimp] [Pollard, K. et al., Nature, 2006] [Prabhakar, S. et al., Science, 2008] [Beniaminov, A. et al., RNA, 2008]
http://GREAT.stanford.edu: Generating Functional Hypotheses from Genome-Wide Measurements of Mammalian Cis-Regulation • Gill Bejerano, Dept. of Developmental Biology & Dept. of Computer Science, Stanford University http://bejerano.stanford.edu
Human Gene Regulation • 10^13 different cells in an adult human. • All these cells have the same genome. • Hundreds of different cell types. • 20,000 genes encode how to make proteins. • 1,000,000 genomic “switches” determine which proteins to make and how much of each. [Figure label: Gene]
Most Non-Coding Elements likely work in cis… “IRX1 is a member of the Iroquois homeobox gene family. Members of this family appear to play multiple roles during pattern formation of vertebrate embryos.” [Figure labels: gene deserts, regulatory jungles, 9 Mb] Every orange tick mark is roughly 100-1,000 bp long, evolves under purifying selection, and does not code for protein.
Many non-coding elements tested are cis-regulatory
Combinatorial Regulatory Code • 2,000 different proteins can bind specific DNA sequences. • A regulatory region encodes 3-10 such protein binding sites. • When all are bound by proteins, the regulatory region turns “on”, and the nearby gene is activated to produce protein. [Figure labels: proteins, DNA, protein binding site, gene]
ChIP-Seq: first glimpses of the regulatory genome in action [Figure labels: peak calling, cis-regulatory peak]
What is the transcription factor I just assayed doing? • Collect known literature of the form • Function A: Gene1, Gene2, Gene3, ... • Function B: Gene1, Gene2, Gene3, ... • Function C: ... • Ask whether the binding sites you discovered preferentially lie near (and so presumably regulate) genes of any one or more of the functions listed above. • Form a hypothesis and perform further experiments. [Figure labels: gene transcription start site, cis-regulatory peak]
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile • ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1] • SRF is known as a “master regulator of the actin cytoskeleton” • In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation. [Figure labels: gene transcription start site, SRF binding ChIP-seq peak] [1] Valouev A. et al., Nat. Methods, 2008
Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile • Existing, gene-based method to analyze enrichment: • Ignore distal binding events. • Count affected genes. • Rank by enrichment hypergeometric p-value. [Figure labels: gene transcription start site, SRF binding ChIP-seq peak, ontology term π (e.g. ‘actin cytoskeleton’)]
Toy example: N = 8 genes in genome; K = 3 genes annotated with π; n = 2 genes selected by proximal peaks; k = 1 selected gene annotated with π.
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
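The toy hypergeometric p-value on this slide can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `hypergeom_tail` is made up for illustration and is not part of any tool mentioned in the lecture.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

# Numbers from the slide: N = 8, K = 3, n = 2, k = 1
p_value = hypergeom_tail(k=1, n=2, K=3, N=8)
print(round(p_value, 4))  # 0.6429
```

With only two selected genes, seeing at least one annotated gene is unsurprising (18 of the 28 possible gene pairs contain one), so the toy example is not significant.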
We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for? • Pro: A lot of tools out there for the analysis of gene lists. • Con: These tools are built for microarray analysis. Does it matter?? [Figure labels: microarray data, deep sequencing data, microarray tool]
SRF Gene-based enrichment results • Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1] • SRF acts on genes both in nucleus and cytoplasm that are involved in transcription and various types of binding • Where’s the signal? Top “actin” term is ranked #28 in the list. [1] Valouev A. et al., Nat. Methods, 2008
Associating only proximal peaks loses a lot of information • [Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets] • Restricting to proximal peaks often leads to complete loss of key enrichments
Bad Solution: Associating distal peaks brings in many false enrichments
Why bad? 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream. The SRF ChIP-seq set has 2,000+ binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term | Bonferroni corrected p-value
nervous system development | 5x10^-9
system development | 8x10^-9
anatomical structure development | 7x10^-8
multicellular organismal development | 1x10^-7
developmental process | 2x10^-6
Regulatory jungles are often next to key developmental genes.
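A rough way to see why random regions look "enriched": if 33% of base pairs sit next to a development gene, about 660 of 2,000 random regions will land there, yet a gene-centric null expects only 14%. The sketch below deliberately simplifies by treating each region as an independent draw that hits the term with the gene-level probability of 0.14; real gene-list tools use a hypergeometric test over genes, so the exact numbers differ. All names here are illustrative.

```python
from math import lgamma, log, exp

def log_binom_pmf(i, n, p):
    """Natural log of the binomial probability of exactly i successes."""
    return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))

def log10_binom_tail(k, n, p):
    """log10 of P(X >= k), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    m = max(logs)
    return (m + log(sum(exp(x - m) for x in logs))) / log(10)

n_regions = 2000
gene_fraction = 0.14   # fraction of genes carrying the term (from the slide)
bp_fraction = 0.33     # fraction of base pairs nearest such a gene (from the slide)

# Random regions follow base-pair coverage, not gene counts.
expected_hits = round(n_regions * bp_fraction)
log10_p = log10_binom_tail(expected_hits, n_regions, gene_fraction)
print(expected_hits, log10_p)
```

Under this simplified null, even a purely random region set produces an astronomically small p-value, which is the false-enrichment effect the slide describes.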
Real Solution: Do not convert to gene list. Analyze the set of genomic regions
GREAT = Genomic Regions Enrichment of Annotations Tool
[Figure labels: gene regulatory domain, genomic region (ChIP-seq peak), gene transcription start site, ontology term π (‘actin cytoskeleton’)]
Toy example: p = 0.33 of genome annotated with π; n = 6 genomic regions; k = 5 genomic regions hit annotation π.
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments.
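The region-based binomial p-value in the toy example can be evaluated directly. This is a minimal sketch with the standard library only; `binom_tail` is a hypothetical helper name, not GREAT's code.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): at least k of n regions hit the annotation."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Numbers from the slide: p = 0.33, n = 6, k = 5
p_value = binom_tail(k=5, n=6, p=0.33)
print(f"{p_value:.4f}")  # 0.0170
```

Because p is the fraction of the genome covered by the term's regulatory domains, a term with large domains needs proportionally more hits before it scores as enriched.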
How does GREAT know how to assign distal binding peaks to genes? • Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms • Currently: Flexible computational definitions allow assignment of peaks to nearest gene, nearest two genes, etc. • Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site, which is extended in both directions to the basal domains of the nearest genes, but by no more than 1 Mb • Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks
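The default "basal plus extension" rule can be sketched as follows. This is an illustrative reading of the slide, not GREAT's actual implementation: it assumes a single chromosome, half-open coordinates, and that overlapping basal domains need no special handling. The `Gene` type and all function names are hypothetical.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    tss: int      # transcription start site coordinate
    strand: str   # "+" or "-"

def basal_domain(g, up=5_000, down=1_000):
    """Basal domain: 5 kb upstream and 1 kb downstream of the TSS (strand-aware)."""
    if g.strand == "+":
        return max(0, g.tss - up), g.tss + down
    return max(0, g.tss - down), g.tss + up

def regulatory_domains(genes, max_ext=1_000_000):
    """Extend each basal domain to the neighbours' basal domains, at most max_ext from the TSS."""
    genes = sorted(genes, key=lambda g: g.tss)
    basal = [basal_domain(g) for g in genes]
    domains = {}
    for i, g in enumerate(genes):
        b_start, b_end = basal[i]
        left = max(0, g.tss - max_ext)
        if i > 0:
            left = max(left, basal[i - 1][1])      # stop at previous gene's basal end
        right = g.tss + max_ext
        if i + 1 < len(genes):
            right = min(right, basal[i + 1][0])    # stop at next gene's basal start
        # The basal domain itself is always kept.
        domains[g.name] = (min(left, b_start), max(right, b_end))
    return domains

def genes_for_peak(pos, domains):
    """All genes whose regulatory domain contains the peak position."""
    return sorted(name for name, (s, e) in domains.items() if s <= pos < e)

genes = [Gene("A", 100_000, "+"), Gene("B", 500_000, "-"), Gene("C", 3_000_000, "+")]
domains = regulatory_domains(genes)
print(domains)
print(genes_for_peak(300_000, domains))    # ['A', 'B']
print(genes_for_peak(1_800_000, domains))  # []
```

Note that a peak between two genes is assigned to both, and a peak more than 1 Mb from any TSS (such as the one at 1,800,000) is assigned to none.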
GREAT infers many specific functions of SRF from its binding profile [McLean et al., Nat Biotechnol., 2010]
Top GREAT enrichments of SRF:
Ontology | Term | # Genes | Binomial P-value | Experimental support*
Gene Ontology | actin cytoskeleton | 30 | 7x10^-9 | Miano et al. 2007
Gene Ontology | actin binding | 31 | 5x10^-5 | Miano et al. 2007
Pathway Commons | TRAIL signaling | 32 | 5x10^-7 | Bertolotto et al. 2000
Pathway Commons | Class I PI3K signaling | 26 | 2x10^-6 | Poser et al. 2000
TreeFam | FOS gene family | 5 | 1x10^-8 | Chai & Tarnawski 2002
TF Targets | Targets of SRF | 84 | 5x10^-76 | Positive control
TF Targets | Targets of GABP | 28 | 4x10^-9 | ChIP-Seq support
TF Targets | Targets of YY1 | 44 | 1x10^-6 | Natesan & Gilman 1995
TF Targets | Targets of EGR1 | 23 | 2x10^-4 |
Top gene-based enrichments of SRF: (top actin-related term 28th in list)
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq
GREAT data integrated • Twenty ontologies spanning broad categories of biology • 44,832 total ontology terms tested in each GREAT run • Terms per ontology (ontology names appeared only in the figure): 2,800; 6,700; 5,215; 3,079; 834; 911; 5,781; 615; 427; 19; 456; 222; 9; 150; 1,253; 6,857; 288; 8,272; 706; 238 • Michael Hiller
GREAT implementation • Can handle datasets of hundreds of thousands of genomic regions • Testing a single ontology term takes ~1 ms • Enables real-time calculation of enrichment results for all ontologies • Cory McLean
GREAT web app: input page http://great.stanford.edu • Pick a genome assembly • Input BED regions of interest • Dave Bristor
GREAT web app: output summary • Ontology-specific enrichments • Additional ontologies, term statistics, multiple hypothesis corrections, etc.
GREAT web app: term details page • Genes annotated as “actin binding” with associated genomic regions • Genomic regions annotated with “actin binding” • Drill down to explore how a particular peak regulates Plectin and its role in actin binding • Frame holding http://www.geneontology.org definition of “actin binding”
You can also submit any track straight from the UCSC Table Browser • A simple, well-documented programmatic interface allows any tool to submit directly to GREAT. See our Help. Inquiries welcome!
GREAT web app: export data • HTML output displays all user selected rows and columns • Tab-separated values also available for additional postprocessing
External Web Stats: Catching On [Figure: last 500 entries only]
Summary • Current technologies identify cis-regulatory sequences • GREAT accurately assesses functional enrichments of cis-regulatory sequences using a genomic region-based approach [McLean et al., Nat Biotechnol., 2010] • Online tool available (version 1.5 coming soon, in QA) http://great.stanford.edu • GREAT is immediately applicable to all sets with a significant cis-regulatory content: • Regulatory Chromatin Markers (e.g., H3K4me1) • Genome Wide Association Studies (GWAS) • Comparative Genomics sets (e.g., ultraconserved elements)
Acknowledgments • GREAT developers: Cory McLean, Dave Bristor, Michael Hiller, Shoa Clarke, Craig Lowe, Aaron Wenger, Gill Bejerano • Other help: FahSathira, Marina Sirota, Bruce Schaar, Terry Capellini, Christopher Meyer, Jennifer Hardee http://great.stanford.edu