ChIP-seq Osvaldo Graña CNIO Bioinformatics Unit ograna@cnio.es Madrid, October 2013

Nat Rev Genet. 2009 Oct;10(10):669-80. 2

ChIP-seq [Basic concepts] Nucleosome The basic structural subunit of chromatin. A nucleosome consists of approximately 147 base pairs of DNA and an octamer of histone proteins. Epigenome The chromatin states that are found along the genome, defined for a given time point and cell type. Thus, for a given genome there may be hundreds of thousands of epigenomes, depending on the stability of the chromatin states. DNase I hypersensitive site A chromosomal region that is highly accessible to cleavage by DNase I. Such sites are associated with open chromatin conformations and transcriptional activity. Nat Rev Genet. 2009 Oct;10(10):669-80. 3

Chromatin, chromosomes, DNA 4

ChIP-seq Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a technique for genome-wide profiling of DNA-binding proteins, histone modifications or nucleosomes. ChIP-seq offers higher resolution, less noise and greater coverage than its array-based predecessor ChIP-chip. With the decreasing cost of sequencing, ChIP-seq has become an indispensable tool for studying gene regulation and epigenetic mechanisms. Nat Rev Genet. 2009 Oct;10(10):669-80. 5

ChIP-seq Genome-wide mapping of protein-DNA interactions and epigenetic marks is essential for a full understanding of transcriptional regulation. A precise map of binding sites for transcription factors, core transcriptional machinery and other DNA-binding proteins is vital for deciphering the gene regulatory networks that underlie various biological processes. The combination of nucleosome positioning and dynamic modification of DNA and histones has a key role in gene regulation and guides development and differentiation. Chromatin states can influence transcription directly by altering the packaging of DNA to allow or prevent access to DNA-binding proteins, or they can modify the nucleosome surface to enhance or impede recruitment of effector protein complexes. Recent advances suggest that this interplay between chromatin and transcription is dynamic and more complex than previously appreciated. Nat Rev Genet. 2009 Oct;10(10):669-80. 6

ChIP The main tool for investigating these mechanisms is chromatin immunoprecipitation (ChIP), which is a technique for assaying protein-DNA binding in vivo. In ChIP, antibodies are used to select specific proteins or nucleosomes, which enrich for DNA fragments that are bound to these proteins or nucleosomes. The introduction of microarrays allowed the fragments obtained from ChIP to be identified by hybridization to a microarray (ChIP-chip), therefore enabling a genome-scale view of DNA-protein interactions. Using ChIP followed by microarray (ChIP-chip), nucleosome depletion at active promoters in yeast was described in 2004. Nat Rev Genet. 2009 Oct;10(10):669-80. 7

ChIP-seq In ChIP-seq, the DNA fragments of interest are sequenced directly instead of being hybridized on an array. ChIP-seq has higher resolution, fewer artefacts, greater coverage and a larger dynamic range than ChIP-chip, and therefore provides substantially improved data. Although the short reads (~35 bp) generated by NGS platforms pose serious difficulties for certain applications (for example, de novo genome assembly), they are acceptable for ChIP-seq. The more precise mapping of protein-binding sites provided by ChIP-seq allows for a more accurate list of targets for transcription factors and enhancers, in addition to better identification of sequence motifs. Nat Rev Genet. 2009 Oct;10(10):669-80. 8

ChIP-seq basics In a ChIP experiment for DNA-binding proteins, DNA fragments associated with a specific protein are enriched. (1) The DNA-binding protein is crosslinked to DNA in vivo by treating cells with formaldehyde and (2) the chromatin is sheared by sonication into small fragments. (3) An antibody specific to the protein of interest is used to immunoprecipitate the DNA-protein complex. Finally, (4) the crosslinks are reversed and the released DNA is assayed to determine the sequences bound by the protein. During the construction of a sequencing library, the immunoprecipitated DNA is subjected to size selection (typically in the ~150-300 bp range, although there seems to be a bias towards shorter fragments in sequencing). Nat Rev Genet. 2009 Oct;10(10):669-80. 9

ChIP-seq basics 10

ChIP-seq basics Nat Rev Genet. 2009 Oct;10(10):669-80. 11

Advantages of ChIP-seq over ChIP-chip First, its base pair resolution is perhaps the greatest improvement over ChIP-chip. Although arrays can be tiled at a high density, this requires a large number of probes and remains expensive for mammalian genomes. Arrays also have fundamental limitations in resolution due to the uncertainties in the hybridization process. Second, ChIP-seq does not suffer from the noise generated by the hybridization step in ChIP-chip. Nucleic acid hybridization is complex and dependent on many factors, including the GC content, length, concentration and secondary structure of the target and probe sequences. Therefore, cross-hybridization between imperfectly matched sequences frequently occurs and contributes to the noise. Nat Rev Genet. 2009 Oct;10(10):669-80. 12

Advantages of ChIP-seq over ChIP-chip Third, the intensity signal measured on arrays might not be linear over its entire range, and its dynamic range is limited below and above saturation points. There is a study in which distinct and biologically meaningful peaks seen in ChIP-seq were obscured when the same experiment was conducted with ChIP-chip. Finally, in ChIP-seq the genome coverage is not limited by the repertoire of probe sequences fixed on the array. This is particularly important for the analysis of repetitive regions of the genome, which are typically masked out on arrays. Studies involving heterochromatin or microsatellites, for instance, can be done much more effectively by ChIP-seq. Sequence variations within repeat elements can be captured by sequencing and used to map reads to the genome; unique sequences that flank repeats are also helpful in aligning reads to the genome. Nat Rev Genet. 2009 Oct;10(10):669-80. 13

Advantages of ChIP-seq over ChIP-chip Nat Rev Genet. 2009 Oct;10(10):669-80. 14

Drawbacks of ChIP-seq All profiling technologies produce unwanted artefacts, and ChIP-seq is no exception. Although sequencing errors have been reduced substantially as the technology has improved, they are still present, especially towards the end of each read. There is also bias towards GC-rich content in fragment selection, both in library preparation and in amplification before and during sequencing, although notable improvements have been made recently. In addition, when an insufficient number of reads is generated, there is a loss of sensitivity or specificity in detection of enriched regions. There are also technical issues in performing the experiment, such as loading the correct amount of sample: too little sample will result in too few tags. Nat Rev Genet. 2009 Oct;10(10):669-80. 15

Drawbacks of ChIP-seq The main disadvantage with ChIP-seq is its current cost and availability. For high-resolution profiling of an entire large genome, ChIP-seq is already less expensive than ChIP-chip, but its overall cost, which includes machine depreciation and reagent cost, will have to be lowered further for it to be comparable with the cost of ChIP-chip in every case. However, as the cost of sequencing continues to decline and institutional support for sequencing platforms continues to grow, ChIP-seq is likely to become the method of choice for nearly all ChIP experiments in the near future. Nat Rev Genet. 2009 Oct;10(10):669-80. 16

ChIP-Seq: technical considerations for obtaining high-quality data Nat Immunol. 2011 Sep 20;12(10):918-22 17

ChIP-Seq: technical considerations for obtaining high-quality data Nat Immunol. 2011 Sep 20;12(10):918-22 18

ChIP-seq: technical considerations for obtaining high-quality data There are several technical aspects of ChIP-seq that should be considered for the generation of high-quality genome-wide data, including antibodies, controls, library construction and statistical analysis. Antibody quality ChIP-seq data depends crucially on the quality of the antibody used. A sensitive and specific antibody will give a high level of enrichment compared with the background, which makes it easier to detect binding events. Many antibodies are commercially available, and some are noted as ChIP grade, but the quality of different antibodies is highly variable and can also vary among batches of a specific antibody. Rigorous validation is a laborious process: for histone modifications, for instance, the reactivity of the antibody with unmodified histones or non-histone proteins should be checked by western blotting. Nat Rev Genet. 2009 Oct;10(10):669-80. 19

ChIP-seq: technical considerations for obtaining high-quality data Furthermore, cross-reactivity with similar histone modifications (for example, dimethylation compared with trimethylation at the same residue) should be checked. Antibodies that offer high sensitivity and specificity are necessary for ChIP-seq because they allow the detection of enrichment peaks without substantial background noise. Many commercial antibodies that have been tested for their use in ChIP studies are available. However, results from various groups have shown that NOT all commercial antibodies that are designated as 'ChIP grade' or 'ChIP qualified' can be successfully used to investigate genome-wide protein-DNA interactions. Nat Immunol. 2011 Sep 20;12(10):918-22 20

ChIP-Seq: technical considerations for obtaining high-quality data Cell number The abundance of the protein or histone modification to be investigated and the quality of the antibody should be considered when determining the number of cells with which to begin ChIP-seq analysis. As the signal-to-noise ratio is directly correlated with the cell number, the use of more cells tends to produce a higher signal-to-noise ratio. Therefore, it is important to empirically determine the minimum number of cells that can be used, whenever possible. ChIP-seq experiments typically require 1 × 10^6 to 10 × 10^6 cells, which results in 10-100 ng of ChIP DNA. The former (1 × 10^6 cells) is usually sufficient for the analysis of abundant proteins such as RNA polymerase II and localized histone modifications such as trimethylation of histone H3 at Lys4 (H3K4me3), whereas the latter (10 × 10^6 cells) may be required for the analysis of less-abundant proteins or diffuse histone modifications. Nat Immunol. 2011 Sep 20;12(10):918-22 21

ChIP-Seq: technical considerations for obtaining high-quality data Chromatin fragmentation Before ChIP, chromatin must be fragmented into a manageable size (~150-300 base pairs) by sonication or enzymatic means (usually via treatment with micrococcal nuclease). For histone modifications, digestion of native chromatin into mononucleosome-sized particles with micrococcal nuclease may be the preferred method, because this generates high-resolution data for nucleosome modifications and eliminates signal artifacts caused by crosslinking with other genomic regions. For mapping of binding sites for transcription factors, sonication of formaldehyde-crosslinked chromatin is the more appropriate method. Nat Immunol. 2011 Sep 20;12(10):918-22 22

ChIP-Seq: technical considerations for obtaining high-quality data Chromatin fragmentation The conditions used to sonicate chromatin need to be optimized for each cell type because they are highly variable and depend on the cell type, the number of cells used, fixation conditions, type of sonicator and sonicator settings. It is important to avoid oversonication of chromatin when transcription factors are to be evaluated by ChIP, whereas oversonication may not be as problematic for analysis of histone modifications. Nat Immunol. 2011 Sep 20;12(10):918-22 23

ChIP-Seq: technical considerations for obtaining high-quality data Control experiment The experimental steps in ChIP involve several potential sources of artifacts. Shearing of DNA, for example, does not result in uniform fragmentation of the genome: open chromatin regions tend to be fragmented more easily than closed regions, which creates an uneven distribution of sequence tags across the genome. Also, repetitive sequences might seem to be enriched because of inaccuracies in the number of copies of the repeats in the assembled genome. Therefore, a peak in the ChIP-seq profile should be compared with the same region in a matched control sample to determine its significance. Nat Rev Genet. 2009 Oct;10(10):669-80. 24

Control experiment There are different types of control sample (negative controls used to subtract the peak signal): a) Input DNA (a portion of the DNA sample removed prior to immunoprecipitation (IP)); b) DNA from nonspecific IP (IP performed using a nonspecific antibody, such as immunoglobulin G). The IP for the factor of interest at a positive control site (where it is expected to bind at a high level) should be several fold above that of the control. Input DNA has been used as the control sample in most ChIP-seq studies. Most nonspecific IgG antibodies immunoprecipitate much less DNA than specific antibodies, so the resulting control reads will not cover the genome sufficiently to serve as a background model for peak identification. Nat Rev Genet. 2009 Oct;10(10):669-80. 25

Control experiment It is possible to avoid sequencing a control sample if one is only interested in differential binding patterns between conditions or time points and if the variation in chromatin preparations is small. Input chromatin serves as a better control for bias in chromatin fragmentation and variations in sequencing efficiency; additionally, it provides greater and more evenly distributed coverage of the genome. An additional control for antibody specificity includes knockdown of the factor of interest by RNA-mediated interference. Nat Rev Genet. 2009 Oct;10(10):669-80. 26

ChIP profiles Nat Rev Genet. 2009 Oct;10(10):669-80. 27

ChIP-Seq: technical considerations for obtaining high-quality data Replicates Many factors, including cell-culture conditions, ChIP and library construction, may contribute to variability between data sets. To ensure reliability of the data, biological replicate experiments are necessary. Although there is no consensus on the correct number of replicates needed, at least duplicate biological experiments should be done. Although only one ChIP-grade antibody is available for the analysis of most histone modifications and transcription factors, it is recommended that ChIP-seq data be confirmed through the use of a different antibody wherever possible, to control for a potential antibody cross-reactivity. Nat Immunol. 2011 Sep 20;12(10):918-22 28

ChIP-Seq: technical considerations for obtaining high-quality data Library construction Libraries may be constructed from ChIP DNA by standard protocols specific to the sequencing platform. Typically, library construction includes end repair, the addition of single adenosine residues, adaptor ligation, size selection and gel purification, followed by PCR with primers specific to the sequencing platform. (During the size-selection step, it is important that the agarose gel be melted at room temperature (~22 °C) rather than at 50 °C, as the latter temperature might result in a bias for guanosine and cytidine because of loss of sequences rich in adenosine and thymidine.) During the PCR amplification step, it is important that adaptor-ligated DNA products are not overamplified, which may result in a loss of specific signal, bias or redundancy in the number of sequence tags. Overamplification can typically be avoided by decreasing the number of PCR cycles or decreasing the amount of template DNA used for PCR. Nat Immunol. 2011 Sep 20;12(10):918-22 29

ChIP-Seq: technical considerations for obtaining high-quality data Library construction One way to determine whether overamplification has occurred is to compare the size of the adaptor-ligated product with that of the PCR product. Overamplified PCR products will generally have more of a shift in size than will adaptor-ligated products (for example, an adaptor-ligated product 200-400 bp in length may shift to a size of >300-500 bp). Nat Immunol. 2011 Sep 20;12(10):918-22 30
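
The redundancy in sequence tags mentioned for overamplified libraries can also be gauged after alignment. The following is a minimal sketch, not a method from the cited papers: it assumes each tag is represented as a (chromosome, 5' position, strand) tuple, and the function name `duplicate_fraction` and the example tags are made up for illustration.

```python
from collections import Counter

def duplicate_fraction(tags):
    """Return the fraction of tags that repeat an already-seen (chrom, position, strand)."""
    if not tags:
        return 0.0
    counts = Counter(tags)
    # Every occurrence beyond the first at a given position counts as redundant.
    redundant = sum(n - 1 for n in counts.values())
    return redundant / len(tags)

# Made-up tags: three copies at one position plus two unique tags.
example_tags = [
    ("chr1", 100, "+"), ("chr1", 100, "+"), ("chr1", 100, "+"),
    ("chr1", 250, "-"), ("chr2", 40, "+"),
]
print(duplicate_fraction(example_tags))
```

Identical tags can also arise legitimately at strongly enriched sites, so a value like this is only a rough indicator and not proof of overamplification.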

ChIP-Seq: technical considerations for obtaining high-quality data Nat Immunol. 2011 Sep 20;12(10):918-22 31

ChIP-Seq: technical considerations for obtaining high-quality data Sequencing Libraries can be sequenced at a distance of 25 bp from one end of the DNA templates, which typically provides relatively good coverage (66%) of uniquely mappable sequences in the human genome. Sequencing reads of 30 or 35 bp improve the mappability to 70.9% or 74.1% respectively, and may be preferred if cost is not an issue. Nat Immunol. 2011 Sep 20;12(10):918-22 32

ChIP-Seq: technical considerations for obtaining high-quality data Paired-end vs single-end Libraries can be sequenced by either a single-end sequencing strategy (which generates short sequence reads from one end of the DNA template) or a paired-end sequencing strategy (which generates short sequence reads from both ends of the DNA template). Paired-end sequencing has the following advantages: more sequencing coverage, improved efficiency of alignment to repetitive regions because more sequencing information is obtained from each DNA template, and greater ability to detect fragment sizes. In cases in which ChIP-enriched DNA fragments partially overlap or contain repetitive sequences, sequencing of both ends may allow more accurate mapping to the genome than does single-end sequencing, which may otherwise result in a loss of repetitive sequences during the analysis. Nat Immunol. 2011 Sep 20;12(10):918-22 33

ChIP-Seq: technical considerations for obtaining high-quality data Sequencing The number of sequencing reads required for reasonable genomic coverage is contingent on several factors, including antibody affinity and the number of target sites in the genome. For analysis of the distribution of histone H3K4me3, 5 × 10^6 sequence reads are sufficient to reach saturation of its target sites, whereas 20 × 10^6 reads may be required for reasonable coverage for H3K27me3 profiles. A more quantitative approach for determining the appropriate depth of sequencing involves evaluating the saturation point (the number of reads after which additional sequencing does not identify new binding or enrichment sites). Nat Immunol. 2011 Sep 20;12(10):918-22 34

Depth of sequencing Intuitively, one expects that when a large number of binding sites are present in the genome for a DNA-binding protein or when a histone modification covers a large fraction of the genome, a correspondingly large number of tags will be needed to cover each bound region at the same tag density. Nat Rev Genet. 2009 Oct;10(10):669-80. 20 × 10^6 uniquely mapped reads are usually sufficient for analysis of most modifications and transcription factors. To lower the cost of sequencing, it is also possible to pool several libraries, such as for analysis of H3K4me3 modifications, through the use of indexing adapters. Nat Immunol. 2011 Sep 20;12(10):918-22 35

Depth of sequencing One reasonable criterion for determining sufficient sequencing depth would be that the results of a given analysis do not change when more reads are obtained. In terms of the number of binding sites, this criterion translates to the presence of a 'saturation point' after which no further binding sites are discovered with additional reads. In many cases, new statistically significant peaks are discovered at a steady rate with an increasing number of tags (solid curve), that is, there is no saturation of binding sites. However, when a minimum threshold is imposed for the enrichment ratio between chromatin immunoprecipitation (ChIP) and input DNA peaks, the rate at which new peaks are discovered slows down (dashed curve). Nat Rev Genet. 2009 Oct;10(10):669-80. 36
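
The saturation criterion can be explored by subsampling the mapped tags and counting how many enriched sites are found at each depth. The sketch below is only an illustration under stated assumptions: `count_enriched_windows` is a toy stand-in for a real peak caller (it simply counts fixed windows holding at least `min_tags` tags), and all names, thresholds and positions are hypothetical.

```python
import random
from collections import Counter

def count_enriched_windows(positions, window=200, min_tags=5):
    """Toy stand-in for a peak caller: count windows with at least min_tags tags."""
    per_window = Counter(p // window for p in positions)
    return sum(1 for n in per_window.values() if n >= min_tags)

def saturation_curve(positions, fractions, seed=0, **caller_args):
    """Subsample tags at each fraction and report (tags used, sites found)."""
    rng = random.Random(seed)
    curve = []
    for fraction in fractions:
        k = int(len(positions) * fraction)
        subset = rng.sample(positions, k)  # sampling without replacement
        curve.append((k, count_enriched_windows(subset, **caller_args)))
    return curve

# Made-up data: two enriched windows plus sparse background tags.
positions = (
    [1000 + i for i in range(50)]
    + [5000 + i for i in range(20)]
    + list(range(100000, 200000, 1000))
)
for n_tags, n_sites in saturation_curve(positions, [0.0, 0.25, 0.5, 0.75, 1.0]):
    print(n_tags, n_sites)
```

If the number of sites keeps rising up to the full data set, the experiment has not reached saturation under the chosen threshold.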

Depth of sequencing Ba. A peak that is not statistically significant: the enrichment ratio between the ChIP and control experiments is low (1.5). Bb. Two ways in which a peak can be statistically significant. On the left, although the number of tag counts is low, the enrichment ratio between the ChIP and control experiments is high (4). On the right, the peaks have the same ratio as those in Ba, but have a large number of tag counts. Nat Rev Genet. 2009 Oct;10(10):669-80. 37
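
Why the same enrichment ratio can be non-significant at low tag counts yet significant at high tag counts can be illustrated with a simple Poisson model. This is a sketch under assumptions: the control count is taken as the Poisson mean for the ChIP count, and the tag counts (6 vs 4, 8 vs 2, 150 vs 100) are invented to match only the ratios 1.5 and 4 quoted on the slide.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

# (label, ChIP tag count, control tag count); counts are illustrative only.
cases = [
    ("Ba: ratio 1.5, few tags", 6, 4),
    ("Bb left: ratio 4, few tags", 8, 2),
    ("Bb right: ratio 1.5, many tags", 150, 100),
]
for label, chip, control in cases:
    p = poisson_upper_tail(chip, control)
    print(f"{label}: ratio={chip / control:.1f}, p={p:.2e}")
```

Real peak callers use more elaborate background models, so the p-values here only show the trend.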

Depth of sequencing This example shows that continued sequencing might lead to less prominent peaks becoming statistically significant and that there might not necessarily be a saturation point after which no further binding sites are discovered. Nat Rev Genet. 2009 Oct;10(10):669-80. 38

Multiplexing For small genomes, including those of Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster, the number of reads generated in a sequencing unit (for example, one of eight lanes on an Illumina Genome Analyzer) may be several times greater than the number of reads needed to provide sufficient coverage of the genome at a suitable depth for the ChIP-seq experiment. As the number of reads per run continues to increase, the ability to sequence multiple samples at the same time (referred to as 'multiplexing') becomes important for cost effectiveness. In theory, multiplexing of samples is not difficult and only requires different barcode adaptors to be ligated to different samples during sample preparation. Nat Rev Genet. 2009 Oct;10(10):669-80. 39
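
After a multiplexed run, reads have to be assigned back to their samples. The sketch below assumes one possible design in which the barcode is read as the first bases of each read and must match exactly; the barcodes, sample names and reads are hypothetical, and real pipelines usually tolerate mismatches.

```python
def demultiplex(reads, barcodes):
    """Assign each read to a sample by its leading barcode and trim the barcode off."""
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("this sketch assumes barcodes of equal length")
    n = lengths.pop()
    by_sample = {sample: [] for sample in barcodes.values()}
    by_sample["unassigned"] = []
    for read in reads:
        sample = barcodes.get(read[:n])  # exact match only
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(read[n:])
    return by_sample

# Hypothetical 4-base barcodes and reads.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = ["ACGTGGATCCAAGT", "TGCATTTGACCAGA", "ACGTCCCTAGGATC", "GGGGACGTACGTAC"]
result = demultiplex(reads, barcodes)
for sample, sample_reads in result.items():
    print(sample, sample_reads)
```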

Multiplexing BMC Genomics. 2009 Jan 21;10:37. 40

Nat Rev Genet. 2009 Oct;10(10):669-80. 41

ChIP-Seq: Data analysis Nat Immunol. 2011 Sep 20;12(10):918-22 42

ChIP-Seq: Data analysis Alignment for ChIP-seq should allow for a small number of mismatches due to sequencing errors, SNPs and indels or the difference between the genome of interest and the reference genome. Reads mapped to multiple sites ('multi-reads') are usually discarded during 'normal' analysis. Consequently, peaks in highly repetitive regions are overlooked. However, repetitive regions have been linked to important biological functions such as disease susceptibility, immunity and defense. A new method has been proposed to incorporate multi-reads into peak detection through the use of a weighted alignment scheme (ref. 24). Nat Immunol. 2011 Sep 20;12(10):918-22 43
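
The general idea of keeping multi-reads with a weight, instead of discarding them, can be sketched with the simplest possible scheme: each read contributes a total weight of 1 split equally among its candidate loci. This uniform 1/n split is a deliberate simplification and is not the weighted alignment scheme of ref. 24; the read names and loci are made up.

```python
from collections import defaultdict

def fractional_coverage(alignments):
    """Sum per-locus weights so that every read contributes a total weight of 1."""
    coverage = defaultdict(float)
    for hits in alignments.values():
        if not hits:
            continue  # unmapped read
        weight = 1.0 / len(hits)
        for locus in hits:
            coverage[locus] += weight
    return dict(coverage)

# Hypothetical alignments: read1 is unique, read2 and read3 map to two loci each.
alignments = {
    "read1": [("chr1", 1000)],
    "read2": [("chr1", 1000), ("chr7", 52000)],
    "read3": [("chr7", 52000), ("chr9", 300)],
}
coverage = fractional_coverage(alignments)
for locus, weight in sorted(coverage.items()):
    print(locus, weight)
```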

ChIP-Seq: Data analysis Another important issue in data analysis is comparison of the amount of histone modification or binding of transcription factors in two different cell types or under different conditions. Because of variations in ChIP conditions, the amount of noise may vary substantially between different samples even with the same antibody. Because scaling the data to sequencing depth does not eliminate systematic errors, normalization algorithms are needed for comparisons across samples. The tool DIME has been developed for the classification of regions with considerable enrichment in one ChIP-seq sample relative to their abundance in another ChIP-seq sample based on an estimation of multivariate mixture models. Nat Immunol. 2011 Sep 20;12(10):918-22 44
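
For reference, the simple scaling to sequencing depth that the slide describes as insufficient on its own looks like this. It is a sketch with invented numbers; the function names and the pseudocount are assumptions, and this is not the DIME approach.

```python
import math

def reads_per_million(count, total_mapped):
    """Scale a region's tag count to a library of one million mapped reads."""
    return count * 1_000_000 / total_mapped

def log2_fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """log2 ratio of depth-scaled signals; the pseudocount avoids division by zero."""
    a = reads_per_million(count_a, total_a) + pseudocount
    b = reads_per_million(count_b, total_b) + pseudocount
    return math.log2(a / b)

# Invented example: 120 tags out of 20 million reads vs 30 tags out of 10 million.
print(reads_per_million(120, 20_000_000))
print(reads_per_million(30, 10_000_000))
print(log2_fold_change(120, 20_000_000, 30, 10_000_000))
```

Depth scaling corrects only for library size; differences in background noise between samples are left untouched, which is why dedicated normalization methods are needed.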

Challenges in data analysis Genome alignment Image processing and base calling are platform specific and are mostly done using the software provided by the sequencing platform manufacturer, although some new base callers have been proposed recently for the Illumina platform. More important is the choice of strategy for genome alignment, as all subsequent results are based on the aligned reads. Every aligner is a balance between accuracy, speed, memory and flexibility, and no aligner can be best suited for all applications. In any case, this is simpler than in RNA-seq, for example, in which large gaps corresponding to introns must be considered. Nat Rev Genet. 2009 Oct;10(10):669-80. 45

Challenges in data analysis Identification of enriched regions After sequenced reads are aligned to the genome, the next step is to identify regions that are enriched in the ChIP sample relative to the control with statistical significance. Nat Rev Genet. 2009 Oct;10(10):669-80. Several 'peak callers' that scan along the genome to identify the enriched regions are currently available. Figure panels: control sample signal; ChIP-seq peaks. 46
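
The scanning step can be sketched as a fixed-window comparison of ChIP and control tag counts, followed by merging of adjacent enriched windows. This is a toy illustration and not the algorithm of any particular peak caller: the window size, thresholds and tag positions are assumptions, and the control is scaled only by library size.

```python
from collections import Counter

def call_enriched_regions(chip, control, window=200, min_tags=10, min_ratio=3.0):
    """Return merged (start, end) regions where ChIP exceeds the scaled control."""
    scale = len(chip) / len(control) if control else 1.0
    chip_bins = Counter(p // window for p in chip)
    ctrl_bins = Counter(p // window for p in control)
    hits = []
    for b in sorted(chip_bins):
        n = chip_bins[b]
        expected = max(ctrl_bins[b] * scale, 1.0)  # floor avoids division by zero
        if n >= min_tags and n / expected >= min_ratio:
            hits.append(b)
    regions = []
    for b in hits:
        start, end = b * window, (b + 1) * window
        if regions and regions[-1][1] == start:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions

# Made-up tags: a ChIP-only signal at 1000-1400 and a signal shared with the control.
chip = list(range(1000, 1400, 10)) + list(range(5000, 5200, 10))
control = list(range(5000, 5200, 10)) + list(range(9000, 13000, 100))
print(call_enriched_regions(chip, control))
```

The region shared with the control is not reported, which is the purpose of comparing against a matched control sample.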

Challenges in data analysis Identification of enriched regions The fragments are sequenced at the 5' end and the locations of mapped reads should form two distributions, one on the positive strand and the other on the negative strand, with a consistent distance between the peaks of the distributions. In these methods, a smoothed profile of each strand is constructed and the combined profile is calculated either by shifting each distribution towards the centre Nat Rev Genet. 2009 Oct;10(10):669-80. Strand-specific profiles at enriched sites 47
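
The consistent distance between the positive-strand and negative-strand distributions can be estimated from the data and then used to shift each strand towards the centre. The sketch below is a simplification under stated assumptions: tags are given as 5' positions, the score counts exact position matches after shifting (real tools smooth the profiles first), and all names and positions are hypothetical.

```python
from collections import Counter

def estimate_strand_shift(plus_starts, minus_starts, max_shift=300):
    """Find the offset d that best aligns plus-strand tags with minus-strand tags."""
    minus_counts = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for d in range(max_shift + 1):
        # Number of plus-strand tags that land on a minus-strand tag when moved by d.
        score = sum(minus_counts[p + d] for p in plus_starts)
        if score > best_score:
            best_shift, best_score = d, score
    return best_shift

def shift_towards_centre(plus_starts, minus_starts, shift):
    """Move each strand by half the estimated distance and pool the positions."""
    half = shift // 2
    return sorted([p + half for p in plus_starts] + [m - half for m in minus_starts])

# Made-up tags around two binding sites, with minus-strand tags 150 bp downstream.
plus_starts = [100, 105, 110, 500, 505]
minus_starts = [250, 255, 260, 650, 655]
shift = estimate_strand_shift(plus_starts, minus_starts)
print(shift)
print(shift_towards_centre(plus_starts, minus_starts, shift))
```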

Challenges in data analysis very important: Validation of a number of peaks is always recommended in a ChIP-seq analysis !!! Nat Rev Genet. 2009 Oct;10(10):669-80. 48

Challenges in data analysis Data management Next-generation sequencing produces an unprecedented amount of data. Raw data and images are on the order of terabytes per machine run, which makes data storage a challenge even for facilities with considerable expertise in the management of genomic data. Data can be stored at three levels: image data, sequence tags and alignment data. Ideally the raw image should be kept so that if a new base caller is developed the raw data can be reprocessed. Nat Rev Genet. 2009 Oct;10(10):669-80. 49

Challenges in data analysis Data management There is no consensus in the community with regard to which data types should be stored, but many argue that the image data are too expensive to maintain and that a reasonable approach is to discard the raw data after a short period of time and keep only the sequence-level data. In microarrays, investigators are encouraged, and often required, to submit their data upon publication to a public database, such as Gene Expression Omnibus (GEO). In the case of NGS data, the National Center for Biotechnology Information in the US, the European Bioinformatics Institute and the DNA Databank of Japan have developed the Sequence Read Archive (SRA) and the European Nucleotide Archive (ENA). Nat Rev Genet. 2009 Oct;10(10):669-80. 50
