1 / 28

Comparing Methods for Identifying Transcription Factor Target Genes

Comparing Methods for Identifying Transcription Factor Target Genes. Alena van Bömmel (R 3.3.73) Matthew Huska (R 3.3.18) Max Planck Institute for Molecular Genetics. Transcriptional Regulation. TF not bound = no gene expression. TF bound = gene expression.

loc
Télécharger la présentation

Comparing Methods for Identifying Transcription Factor Target Genes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Comparing Methods for Identifying Transcription Factor Target Genes • Alena van Bömmel(R 3.3.73) • Matthew Huska (R 3.3.18) • Max Planck Institute for Molecular Genetics

  2. Transcriptional Regulation TF notbound = no gene expression TF bound = gene expression

  3. Transcriptional Regulation TF notbound = no gene expression TF bound = gene expression Problem: There are many genes and many TF's, how do we identify the targets of a TF?

  4. Methods for Identifying TF Target Genes PWM Genome Scan Microarray ChIP-seq

  5. PWM Genome Scan • Purely computational method • Input: • position weight matrix for your TF • genomic region(s) of interest • Pros: • No need to do wet lab experiments • Cons: • Many false positives, not able to take biological conditions into account Score threshold

  6. PWM genome scan • Download the PWMs of your TF of interest from the database (they might include >1 motif) • Define the sequences to analyze (promoter sequences) • Run the PWM genome scan (hit-based method or affinity prediction method) • Rank the genomic sequences by the affinity signal • Suggested Reading: • Roider et al.: Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007). • Thomas-Chollier et al.Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011).

  7. PWM-PSCM

  8. TRAP • Convert the PSSM(position specific scoring matrix) to PSEM (position specific energy matrix) • Scan the sequences of interest with TRAP • Results in 1 score per sequence=binding affinity • Doesn’t separate the exact TF binding sites (easier for ranking) • Sequences must have the same length! ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release TRAP trap.molgen.mpg.de/cgi-bin/home.cgi

  9. Matrix-scan • Use directly the PSSM • Finds all TFBS which exceed a predefined threshold (e.g. p-value) • More complicated to create ranked lists of genomic sequences (more hits in the sequence) • Exact location of the binding site reported matrix-scan http://rsat.ulb.ac.be/

  10. Finding the target genes • target genes will be the top-ranked genes (promoters) • which are the top-ranked genes? (top-100,500,1000...?) • There’s no exact definition of promoters, usually 2000bp upstream, 500bp downstream of the TSS

  11. Microarrays → R/Bioconductor (details later)

  12. Folie 12 Microarrays (2) • Pros: • There is a lot of microarray data already available (might not have to generate the data yourself) • Inexpensive and not very difficult to perform • Computational workflow is well established • Cons: • Can not distinguish between indirect regulation and direct regulation

  13. ChIP-seq Map reads to the genome Call peaks to determine most likely TF binding locations

  14. Folie 14 ChIP-seq (2) • Pros: • Direct measure of genome-wide protein-DNA interaction(*) • Cons: • Don't know whether binding causes changes in gene expression • More complicated experimentally and in terms of computational analysis • Most expensive • Need an antibody against your protein of interest • Biases are not as well understood as with microarrays

  15. ChIP-seq analysis • Download the reads from given source (experiments and controls) • Quality control of the reads and statistics (fastqc) • Mapping the reads to the reference genome (bwa/Bowtie) • Peak calling (MACS) • Visualization of the peaks in a genome browser (genome browser, IGV) • Finding the closest genes to the peaks(Bioconductor/ChIPpeakAnno) Visualised peaks in a genome browser • Suggested Reading: • Bailey et alPracticalGuidelines for the Comprehensive Analysis of ChIP-seq Data.PLoSComputBiol(2013). • Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs.Nature Protocols (2012).

  16. Sequencing data • raw data=reads usually very large file (few GB) • formatfastq (ENCODE) or SRA (Sequence Read Archive of NCBI) Analysis • Quality control with fastqc • Filteringof reads with adapter sequences • Mapping of the reads to the reference genome (bwa or Bowtie) Example of fastq data file

  17. Quality control with fastqc • per base quality • sequence quality (avg. > 20) • sequence length • sequence duplication level (duplication by PCR) • overrepresented sequences/kmers (adapter sequences) • produces a html report • manual(read it!) • software at the MPI Example of per base seq quality scores FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/FastQC/fastqc

  18. Mapping with bwa • mapping the sequencing reads to a reference genome • manual(read it!) • map the experiments and the controls • reference genome in fasta format (hg19) • create an index of the reference file for faster mapping (only if not available) • align the reads (specify parameters e.g. for # of mismatches, read trimming, threads used...) • generate alignments in the SAM format (different commands for single-end and pair-end reads!) • software and data at the MPI: BWA = /scratch/ngsvin/bin/executables/bwa hg19: /scratch/ngsvin/MappingIndices/hg19.fa bwaindex: /scratch/ngsvin/MappingIndices/BWA/hg19

  19. File manipulation with samtools • utilities that manipulate SAM/BAM files • manual (read it!) • merge the replicates in one file (still separate experiment and control) • convert the SAM file into BAM file (binary version of SAM, smaller) • sort and index the BAM file • now the sequencing files are ready for further analysis • software at the MPI: SAMTOOLS = /scratch/ngsvin/bin/executables/samtools

  20. Peak finding with MACS • find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound • manual(read it!) • call the peaks using the experiment (treatment) data vs. control • set the parameters e.g. fragment length, treatment of duplication reads • analyse the MACS results (BED file with peaks/summits) • software at the MPI: MACS = /scratch/ngsvin/bin/executables/macs

  21. Finding the target genes • find the genes which are in the closest distance to the (significant) peaks • how to define the closest distance? (+- X kb) • use ChIPpeakAnno in Bioconductor or bedtools

  22. Methods for Identifying TF Target Genes PWM Genome Scan Microarray ChIP-seq Thresholds

  23. Bioinformatics • Read mapping (Bowtie/bwa) • Peak Calling (MACS/Bioconductor) • Peak-Target Analysis (Bioconductor) • Microarray data analysis (Bioconductor) • Differential Genes (R) • GSEA • PWM Genome Scan (TRAP/MatScan) • Statistics (R) • Data Integration (R/Python/Perl) • Statistical Analysis (R)

  24. Bioinformatics tools READ THE MANUALS! • Bowtiebowtie-bio.sourceforge.net/manual.shtml • bwabio-bwa.sourceforge.net/bwa.shtml • MACS github.com/taoliu/MACS/blob/macs_v1/README.rst • TRAPtrap.molgen.mpg.de/cgi-bin/home.cgi • matrix-scan http://rsat.ulb.ac.be/ • Bioconductorwww.bioconductor.org/(more info in R course) Databases • GEOwww.ncbi.nlm.nih.gov/geo/ • ENCODE genome.ucsc.edu/ENCODE/ • SRAwww.ncbi.nlm.nih.gov/sra • JASPAR http://jaspar.genereg.net/

  25. Schedule • 03.03. Introduction lecture, R course • 04.03. R & Bioconductor homework submission • 11.03. Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work ) 10:30am, 11:30am • every Tuesday 10:30am, 11:30am progress meetings • 17.04. Final report deadline • 24.04. (tentative) Presentations • 28.04. Final meeting, discussion of final reports

  26. GR Group • Expression and ChIP-seqdata: Luca F, Maranville JC, et al., PLoS ONE, 2013 • PWM database: jaspar.genereg.net

  27. c-Myc Group • Expression data: Cappellen, Schlange, Bauer et al., EMBO reports, 2007 • Musgrove et al., PLoS One, 2008 • ChIP-seq data: ENCODE Project • PWM database: jaspar.genereg.net

  28. Additional analysis • Binding motifs • are the overrepresented motifs in the ChIP-peak regions different? • do we find any co-factors? • Recommended tool: • RSAT rsat.ulb.ac.be binding motifs binding motifs binding motifs

More Related