" A short guide to breeding and taming highly intelligent ChIPMunks "

Using ChIPMunk for motif discovery: quick-start guide
• Preparing data
• Running ChIPMunk
• Do I need ChIPHorde?

Some basic questions
• Can I use ChIPMunk on the WHOLE PEAK SEGMENTS from a ChIP-Seq experiment?
  YES! But you will need to supply the “base coverage profile” (also called the “peak shape”).
• Should I cut short segments around ChIP-Seq peak summits for ChIPMunk?
  NO! Use the whole peaks with the base coverage data when possible.
• Want more details? Move on to the next slides!

Prerequisites
• To use the ChIPMunk motif discovery tool you need a Java runtime environment (JRE, also called the Java Virtual Machine), version 1.5 or later.
• You may already have Java; test it by typing java -version
• Linux users: check your distro-specific package manager. ChIPMunk runs under Oracle Java as well as under OpenJDK.
• Windows users: go directly to java.com
! [NOTE] You do not need the JDK (Java Development Kit), only the JRE/JVM.

Extracting ChIPMunk
• Let’s assume you have downloaded chipmunk_v?_binary.zip from the official ChIPMunk website (see the downloads section): http://autosome.ru/ChIPMunk
• Unpack it to any suitable folder. You should now see the autosome directory. This is the ChIPMunk Java package autosome.ru.
• Now you can run ChIPMunk from the folder that contains the autosome package. For simplicity you may wish to store your sequence files one level above ChIPMunk’s autosome folder.
! [NOTE] Do not try to move anything out of the autosome folder. Your ChIPMunk should live there.

Preparing your data: overview
• No prior information: simple multi-fasta, the Simple data set
• Arbitrary weights or quality values assigned to each sequence: multi-fasta with weights in headers, the Weighted data set
• Prior positional profile along each sequence: multi-fasta with profiles in headers, the Peak data set
• Peak and Weighted data sets are useful not only for ChIP-Seq data but for any data set where you have a quality rating or known positional preferences.

Preparing your data: Simple data set
• The simplest case: you already have a number of sequences to be used for motif discovery with ChIPMunk. No additional information is available.
• Arrange them in a plain multi-fasta file like:
  > header1
  ACTGTGTGAAA
  > header2
  AGTGTGTGTGTG
! [NOTE] You can omit the fasta headers, since ChIPMunk will simply skip them. Remember, this is the Simple data set.

Preparing your data: Weighted data set
• Let’s assume you have some prior information, such as a quality rating or a prior measure of the presence or strength of binding sites.
• Arrange a multi-fasta file that gives the quality of each sequence in its fasta header:
  > 10.0
  ACGGTGTAAAAA
  > 2.0
  GGTAGTGTCGTAGTG
! [NOTE] Your weights (quality values) should always be positive. Never use negative or zero weights. Remember, this is the Weighted data set.
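
The Weighted layout above can be produced with a few lines of Python. This is a minimal sketch, not part of ChIPMunk: the function name format_weighted_fasta and the (weight, sequence) input pairs are hypothetical, and it only enforces the rule from the note that weights must be positive.

```python
def format_weighted_fasta(records):
    """Return Weighted multi-fasta text for (weight, sequence) pairs.

    Hypothetical helper: each header holds a single positive weight,
    followed by the sequence on the next line.
    """
    lines = []
    for weight, sequence in records:
        if weight <= 0:
            # The slide's note: never use negative or zero weights.
            raise ValueError(f"weight must be positive, got {weight}")
        lines.append(f"> {weight}")
        lines.append(sequence)
    return "\n".join(lines) + "\n"


# Example with the two sequences shown on the slide.
text = format_weighted_fasta([(10.0, "ACGGTGTAAAAA"), (2.0, "GGTAGTGTCGTAGTG")])
print(text, end="")
```

Writing the returned text to a file such as your_weighted_set.mfa gives an input that can be passed with the w: prefix.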

Preparing your data: Peak data set
• If you have prior profile information, such as the shape of ChIP-Seq peaks, then you can provide a profile in the fasta header like:
  > 1.0 2.0 3.0 2.0 1.5 2.0
  AGTAAC
  > 1.0 2.0 3.0 2.0 1.5
  CAGTA
! [NOTE] The length of each profile should be equal to the length of the corresponding sequence. Remember, this is the Peak (or Profiled) data set.
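
Since a profile must have exactly one value per base, it can help to check a Peak file before running ChIPMunk. The sketch below is a hypothetical checker, not a ChIPMunk feature. It assumes each sequence sits on a single line directly after its header, as in the example above, and that every header contains only numbers.

```python
def check_peak_fasta(text):
    """Return (header_line_number, message) for each length mismatch.

    Assumes one-line sequences and purely numeric headers; a non-numeric
    header makes float() raise ValueError.
    """
    problems = []
    profile = None
    header_line = 0
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            profile = [float(value) for value in line[1:].split()]
            header_line = number
        elif profile is not None:
            if len(profile) != len(line):
                problems.append(
                    (header_line,
                     f"profile has {len(profile)} values, "
                     f"sequence has {len(line)} bases")
                )
            profile = None
    return problems


good = "> 1.0 2.0 3.0 2.0 1.5 2.0\nAGTAAC\n> 1.0 2.0 3.0 2.0 1.5\nCAGTA\n"
print(check_peak_fasta(good))
```

An empty list means every profile matches its sequence length.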

ChIP-Seq data: what to do
• The best case: ChIP-Seq data with base coverage (often provided in wiggle files, .wig). Extract the peak height at each position of each sequence and generate the Peak multi-fasta.
• Only the peak height h and the peak summit position are known: manually generate triangle-shaped profiles with 0.0 height at both ends of the sequence and height h at the peak summit position.
• Only the peak height h is known: use the Weighted data set, specifying h as the weight/quality.
! [NOTE] When available, always use base coverage information or generate triangle profiles. This is extremely important for ChIPMunk performance.
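
The triangle profile for the second case can be sketched as follows. The helper names are hypothetical, the summit is given as a 0-based index inside the sequence, and linear interpolation between the ends and the summit is an assumption (the slide only asks for a triangle shape). The two-decimal formatting is likewise just a choice for readability.

```python
def triangle_profile(length, summit, height):
    """Profile with 0.0 at both ends and `height` at index `summit`.

    `summit` is a 0-based position strictly inside the sequence;
    values in between are linearly interpolated (an assumption).
    """
    if not 0 < summit < length - 1:
        raise ValueError("summit must lie strictly inside the sequence")
    profile = []
    for i in range(length):
        if i <= summit:
            value = height * i / summit
        else:
            value = height * (length - 1 - i) / (length - 1 - summit)
        profile.append(value)
    return profile


def format_peak_record(profile, sequence):
    """Return one Peak multi-fasta record: profile header, then sequence."""
    if len(profile) != len(sequence):
        raise ValueError("profile and sequence lengths differ")
    header = "> " + " ".join(f"{value:.2f}" for value in profile)
    return header + "\n" + sequence + "\n"


profile = triangle_profile(5, 2, 3.0)
print(format_peak_record(profile, "ACGTA"), end="")
```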

Running ChIPMunk: specifying data set
• So, now you know the type of your sequence.mfa data set. It is either Simple (s:sequence.mfa), Weighted (w:sequence.mfa) or Peak (p:sequence.mfa).
• Remember to supply it to ChIPMunk like p:sequence.mfa if your file is in your current directory. If the file is located somewhere else on your drive, give its local path after p:.
! [NOTE] We strongly advise using the Peak data set if possible.

Running ChIPMunk: default mode
  java -Xms512M -Xmx1G autosome.ru.ChIPMunk p:your_sequences_with_profiles.mfa > output.log
This will write all informative output to output.log and allow Java to use from 512 MB to 1 GB of RAM.
! [NOTE] This is the best way to search for an unknown motif; it lets ChIPMunk use its default parameter settings automatically.

Running ChIPMunk: tweaking parameters
• The most obvious thing you can tweak is the motif length range (from 7 to 22 bp, for example):
  java -Xms512M -Xmx1G autosome.ru.ChIPMunk 7 22 yes 1.0 w:your_weighted_set.mfa
• The number of starting seeds; increasing it from the default 100 will improve precision:
  java -Xms512M -Xmx1G autosome.ru.ChIPMunk 7 22 yes 1.0 w:your_weighted_set.mfa 200
• Let ChIPMunk estimate the background model automatically instead of using the predefined GC content of 0.5:
  java autosome.ru.ChIPMunk 7 22 yes 1.0 p:peak_data.fasta 200 20 1 2 random local
! [NOTE] Don’t hesitate to consult the ChIPMunk manual or to contact ivan-dot-kulakovskiy-at-gmail-dot-com. There are many useful advanced options for ChIPMunk.
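
When many runs are scripted, the command line can be assembled programmatically. The sketch below is a hypothetical helper that only mirrors the argument order of the first two examples above. The "yes" argument is copied verbatim from the slides, and treating the fourth value as a ZOOPS factor is an assumption based on the ChIPHorde slide; check the manual for the exact meaning of each position.

```python
import shlex


def chipmunk_command(min_len, max_len, dataset, seeds=None,
                     zoops_factor="1.0", min_heap="512M", max_heap="1G"):
    """Build the argument list for a ChIPMunk run (hypothetical helper).

    Argument order follows the slide examples:
    min length, max length, "yes", fourth value, data set, [seeds].
    """
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    args = [
        "java", f"-Xms{min_heap}", f"-Xmx{max_heap}",
        "autosome.ru.ChIPMunk",
        str(min_len), str(max_len),
        "yes",          # copied verbatim from the slides
        zoops_factor,   # assumed to be the ZOOPS factor
        dataset,        # e.g. "w:your_weighted_set.mfa"
    ]
    if seeds is not None:
        args.append(str(seeds))
    return args


command = chipmunk_command(7, 22, "w:your_weighted_set.mfa", seeds=200)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run with stdout redirected to a log file, which corresponds to the > output.log redirection used in default mode.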

ChIPHorde extension: do I need it?
• You want to find the most significant motif in the set (for example, a common motif for a given transcription factor, TF).
  ChIPHorde? NO, ChIPMunk is enough.
• You want to check different motif lengths (like 10, 12 and 15 bp) and manually select the best motif.
  ChIPHorde? NO, run ChIPMunk several times with 10 to 10, 12 to 12 and 15 to 15 motif length ranges.
  OR YES, you can run ChIPHorde in its ‘dummy’ mode like:
  java autosome.ru.ChIPHorde 10:10,12:12,15:15 dummy yes 1.0 w:your_weighted_sequence_set.mfa
! [NOTE] So, if you want to find the MOST SIGNIFICANT motif for a data set, you DO NOT NEED the ChIPHorde extension. But you can use it in dummy mode to check different lengths and then manually select the required motifs.

You need ChIPHorde if
• You suspect several distinct motifs for your TF. Use ‘filter’ mode (dropping sequences with motif hits from the previous step):
  java autosome.ru.ChIPHorde 7:21,7:21,7:21 filter yes 0.0 w:your_weighted_sequence_set.mfa
• You want to find potential cofactor TFs. Use ‘mask’ mode (masking good motif hits from the previous step):
  java autosome.ru.ChIPHorde 7:21,7:21,7:21 mask yes 0.0 w:your_weighted_sequence_set.mfa
In both examples the length range from 7 to 21 bp is used to search for three different motifs.
! [NOTE] The ZOOPS factor (0.0 in this example) may heavily affect results. Please consult the manual! The length ranges are also important, especially in ‘mask’ mode.
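
The ChIPHorde invocations differ only in the list of length ranges, the mode and the ZOOPS factor, so they are easy to generate. This is a minimal sketch with a hypothetical helper name. It accepts only the three modes named on these slides and copies the "yes" argument verbatim from the examples.

```python
import shlex

MODES = ("dummy", "filter", "mask")  # modes named on the slides


def chiphorde_command(length_ranges, mode, zoops_factor, dataset):
    """Build the argument list for a ChIPHorde run (hypothetical helper).

    length_ranges: list of (min_len, max_len) pairs, one per motif.
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    ranges = ",".join(f"{low}:{high}" for low, high in length_ranges)
    return [
        "java", "autosome.ru.ChIPHorde",
        ranges, mode,
        "yes",               # copied verbatim from the slides
        str(zoops_factor),
        dataset,
    ]


command = chiphorde_command([(7, 21)] * 3, "filter", 0.0,
                            "w:your_weighted_sequence_set.mfa")
print(shlex.join(command))
```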
