MotifClick: cis-regulatory k-length motifs finding in cliques of 2(k-1)-mers

MotifClick: cis-regulatory k-length motifs finding in cliques of 2(k-1)-mers Shaoqiang Zhang http://bioinfo.uncc.edu/szhang April 3, 2013

Gene regulation in prokaryotes
[Figure: an operon (Gene1, Gene2, Gene3) with its promoter region (positions -300, -35, -10, and the TSS at +1), the transcription factor binding sites of TF1 and TF2 (cis-regulatory elements), the terminator, the 3' UTR, and the transcribed mRNA.]

Cis-regulatory motif / binding site motif
[Figure, phylogenetic footprinting technique: binding sites (BS1, BS2, BS3, e.g. TGTGAGATAGATCACA, CATGATTTAAATCGCA, ..., TGTGATCAACATCACA) upstream of orthologous genes in Genome1 to Genome5, summarized as a motif logo.]
[Figure, co-regulated genes (regulon) in a single genome: transcription factor binding sites (TFBS) BS1, BS2, BS3 of one TF upstream of Gene1 to Gene6.]

Motifs: motif frequency matrix and motif profile matrix (position weight matrix)
Aligned binding sites:
TTGTTACGTTATAACA
CGGTTATATTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
TGGTTATGTTATAACA
TGGTTATGTTATAACA
TGGTTATGTTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
TTGTTATGTTATAACG
ATGTTATATTATTACA
TTGTTATGTTATAACA
TTGTTATGTTATAACA
TTGTTATGTTATAACA
TTGTTATGTTATAACA
TTGTTATGTTATAACA
TTGTTATAGTATAACA
TTAAAATGTTATAACA
TTAAAATGTTATAACA
TTAATATGTTATAACA
TTGTTATAATATAACA
ATGTTACATTATAACA
ATGTTACATTATAACA
ATGTTACATTATAACA
ATGTTACATTATAACA
CGGTTATGTTATAACA
TGGTTATGTTATAACA
TGGTTATGCTATAACA
TTAAAATGTTATAACA
TTAATATGTTATAACA
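As a rough illustration of how a frequency matrix and a profile matrix can be derived from aligned binding sites, here is a minimal Python sketch. The function names, the uniform background, and the pseudocount of 1 are assumptions made for this example; they are not taken from MotifClick.

```python
import math

BASES = "ACGT"

def frequency_matrix(sites):
    """Count each base at each column of equal-length, gaplessly aligned sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    counts = {base: [0] * width for base in BASES}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts

def profile_matrix(counts, background=None, pseudocount=1.0):
    """Log2-odds weights against a background (uniform by default, an assumption)."""
    background = background or {base: 0.25 for base in BASES}
    width = len(counts["A"])
    profile = {base: [0.0] * width for base in BASES}
    for pos in range(width):
        total = sum(counts[base][pos] for base in BASES) + pseudocount * len(BASES)
        for base in BASES:
            prob = (counts[base][pos] + pseudocount) / total
            profile[base][pos] = math.log2(prob / background[base])
    return profile

# First four sites from the slide.
sites = ["TTGTTACGTTATAACA", "CGGTTATATTATAACA",
         "CGGTTATGTTATAACA", "TGGTTATGTTATAACA"]
counts = frequency_matrix(sites)
profile = profile_matrix(counts)
```

Column 2 of these four sites is G in every site, so `profile["G"][2]` is positive while the other three bases get a negative weight there.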

Motif finding from co-regulated/orthologous genes
Many motif-finding programs have been developed, such as MEME, BioProspector, MotifSampler, MotifCut, MDscan, Weeder, and CONSENSUS.
[Figure: coverage of known BSs versus the top number of output motifs for MEME, BioProspector, CUBIC, MotifSampler, MDscan, Weeder, CONSENSUS, and all tools combined.]
We have also developed a motif-finding program: MotifClick, http://motifclick.uncc.edu

MotifClick: sub-motifs The binding sites of a TF may be divided into distinct sub-motifs. Merge cliques

Previous works: MotifCut and BOBRO
BOBRO
• Graph construction: G=(V,E), an un-weighted graph, where V={candidate motif segments} and E={for each pair of input sequences, the top 10 pairs of segments with the largest numbers of conserved segments in the input seqs}
• Finding a clique from an edge
• Expand each clique to a closure by adding candidate segments
• Sort motif closures in p-value order
MotifCut
• Graph construction: G=(V,E,W), a weighted graph, where V={all k-mers}, E={each pair of k-mers}, and W={the probability that two k-mers belong to the same motif under the nucleotide background distribution}
• Maximum density subgraph finding (max-flow min-cut algorithm)
• Refine the density subgraph
• Sort motifs in the order in which the maximum density subgraphs were constructed

Main idea
• Weighted graph: reduce the scale of the constructed graph by using 2(k-1)-mers.
• Edge weight: use the match number and consider the background.
• Clique finding: use the program we designed in GLECLUBS (find a clique from each node).
• Expansion: expand cliques into quasi-cliques to include more segments.
• Rank: based on the size of the cliques.

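Graph construction: Vertex set
Input: a set of N sequences s1, ..., si, ..., sN.
Each sequence is cut into 2(k-1)-mers with step length = k-1, so consecutive 2(k-1)-mers overlap by k-1 bases. The size of the last one is in [k, 2(k-1)].
Each k-mer is located in exactly one 2(k-1)-mer.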
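The segmentation can be sketched as follows. This is a minimal Python illustration, not MotifClick's own code: the function name `split_into_segments` and the handling of sequences shorter than k are assumptions. The final loop checks the property stated on the slide, namely that every k-mer falls inside exactly one segment.

```python
def split_into_segments(seq, k):
    """Cut seq into windows of width 2(k-1) with step k-1.

    The final window takes whatever remains, so its size lies in [k, 2(k-1)].
    Returns a list of (start, segment) pairs.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if len(seq) < k:
        return []  # assumption: sequences shorter than k contribute nothing
    width, step = 2 * (k - 1), k - 1
    segments = []
    start = 0
    while len(seq) - start > width:
        segments.append((start, seq[start:start + width]))
        start += step
    segments.append((start, seq[start:]))
    return segments

k = 4
seq = "ACGTACGTTGCAACGTAGCTAGC"  # length 23
segments = split_into_segments(seq, k)

# Count, for every k-mer start position, how many segments contain that k-mer.
containing = []
for p in range(len(seq) - k + 1):
    hits = sum(1 for start, text in segments
               if start <= p and p + k <= start + len(text))
    containing.append(hits)
```

With k = 4 the windows have width 6 and step 3; the last window here starts at position 18 and has size 5.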

Graph construction: Edge set
For each pair of 2(k-1)-mers M' and M", calculate the maximum match number over the pairs of k-mers a in M' and b in M".
If the max match number >= cutoff, and the two k-mers a and b with the max matches have a sum of squared distance (SSD) that passes the SSD cutoff, then link M' and M" with an edge.
The SSD is computed from the probability of each base in a binding site.
[Figure: SSD of known E. coli binding sites, with the values 0.02 and 0.2 marked.]
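A minimal Python sketch of the two quantities used for the edge test. The slide's formulas did not survive extraction, so the details here are assumptions: the match number is taken as the count of identical positions in a gapless comparison, and the SSD as the sum over the four bases of the squared difference between the base frequencies of the pooled k-mers and the background. The comparison against the SSD cutoff is left out because its exact form is not recoverable from the slide.

```python
def match_number(a, b):
    """Number of positions at which two equal-length k-mers agree."""
    return sum(x == y for x, y in zip(a, b))

def max_match(seg1, seg2, k):
    """Best gapless match between any k-mer of seg1 and any k-mer of seg2.

    Returns (matches, a, b) for the first best-scoring pair found.
    """
    best = (-1, None, None)
    for i in range(len(seg1) - k + 1):
        a = seg1[i:i + k]
        for j in range(len(seg2) - k + 1):
            b = seg2[j:j + k]
            m = match_number(a, b)
            if m > best[0]:
                best = (m, a, b)
    return best

def ssd(kmers, background):
    """Sum of squared distance between pooled base frequencies and the background
    (assumed definition)."""
    text = "".join(kmers)
    return sum((text.count(base) / len(text) - background[base]) ** 2
               for base in "ACGT")

uniform = {base: 0.25 for base in "ACGT"}
matches, a, b = max_match("ACGTAC", "TTGTAC", 4)
low_complexity = ssd(["TTTT"], uniform)
balanced = ssd(["ACGT"], uniform)
```

Under this assumed definition a single-base k-mer such as TTTT is far from a uniform background, while a k-mer using all four bases equally has distance 0.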

How to select the cutoffs?
Randomly select a k-mer in the input seqs set, then find the k-mer having max matches with it in each seq (s1, ..., si, ..., sN).
Random sampling times = max{10, N/4}.
Keep 95% of the k-mers by deleting the 5% with the min match numbers, and calculate the average match number of the remaining 95%.
Match cutoff = average match number.
NOTE: the cutoff can be amended later.
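The sampling procedure might look like the sketch below. Several details are assumptions, not taken from the slide: the sequence that supplied the sampled k-mer is skipped, N/4 is taken as integer division, the 5% trimming is applied per sample, and the result is returned unrounded. All names are hypothetical.

```python
import random

def best_match_in_sequence(kmer, seq):
    """Highest number of matching positions between kmer and any k-mer of seq."""
    k = len(kmer)
    return max(sum(x == y for x, y in zip(kmer, seq[i:i + k]))
               for i in range(len(seq) - k + 1))

def estimate_average_match(seqs, k, rng=None, keep=0.95):
    """Estimate the average match number by random sampling.

    Assumes at least two sequences, each of length >= k.
    """
    rng = rng or random.Random()
    n = len(seqs)
    times = max(10, n // 4)  # sampling times = max{10, N/4}
    averages = []
    for _ in range(times):
        src = rng.randrange(n)
        start = rng.randrange(len(seqs[src]) - k + 1)
        kmer = seqs[src][start:start + k]
        scores = sorted(best_match_in_sequence(kmer, seq)
                        for i, seq in enumerate(seqs) if i != src)
        drop = int(len(scores) * (1 - keep))  # delete the lowest 5%
        kept = scores[drop:]
        averages.append(sum(kept) / len(kept))
    return sum(averages) / len(averages)

# Identical sequences: every sampled k-mer is found exactly in every other sequence.
average = estimate_average_match(["ACGTACGTAC"] * 5, k=4, rng=random.Random(1))
```

Passing a seeded `random.Random` keeps the estimate reproducible between runs.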

Graph construction: G=(V,E)
[Figure: the graph built over the 2(k-1)-mers of sequences s1, ..., si, sj, ..., sN.]
MotifCut: max density subgraphs
BOBRO: maximal clique starting from an edge
MotifClick: maximal cliques starting from each node

Graph construction: G=(V,E)
We can correct the cutoff by calculating the graph density. If the graph density > 100, set cutoff = cutoff + 1 and update the graph, repeating until density <= 100.
[Figure: the graph at cutoff=10 and at cutoff=11.]
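A minimal sketch of the density correction in Python. The slides do not define graph density, so this example assumes density = |E| / |V|; the edge representation (a dict from vertex pairs to match numbers) and the function names are also assumptions.

```python
def graph_density(num_vertices, edges):
    """Edges per vertex (assumed definition of graph density)."""
    return len(edges) / num_vertices if num_vertices else 0.0

def raise_cutoff_until_sparse(num_vertices, edges, cutoff, max_density=100.0):
    """Delete edges whose match number equals the cutoff, then raise the cutoff,
    until the density no longer exceeds max_density.

    edges: dict mapping (u, v) to a match number, all values >= cutoff.
    Returns the surviving edges and the final cutoff.
    """
    edges = dict(edges)
    while edges and graph_density(num_vertices, edges) > max_density:
        edges = {pair: m for pair, m in edges.items() if m > cutoff}
        cutoff += 1
    return edges, cutoff

# Toy graph with a small density bound so the loop runs once.
toy_edges = {(0, 1): 10, (0, 2): 10, (1, 2): 11}
kept_edges, final_cutoff = raise_cutoff_until_sparse(3, toy_edges, cutoff=10,
                                                     max_density=0.5)
```

In the toy graph the two edges with match number 10 are dropped and the cutoff moves from 10 to 11.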

Clique finding
For each vertex v, find a clique in the neighbor graph of v. Break ties by deleting the vertex with the minimum sum of weights in the induced subgraph.
CliquesGroup = the cliques ordered from max sum of matches to min sum of matches.
Top 1 motif: Clique1 (core) + other cliques (expansion).
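One way to find a clique in the neighbor graph of a vertex is greedy deletion, sketched below in Python. Only the tie-breaking rule (delete the vertex with the minimum sum of weights) comes from the slide; deleting the minimum-degree vertex first is an assumption of this sketch, as are the function name and the adjacency representation.

```python
def greedy_clique(v, adj):
    """Shrink the neighbor graph of v to a clique that contains v.

    adj: symmetric dict mapping vertex -> {neighbor: edge weight}.
    Repeatedly removes the vertex (other than v) with the lowest degree in the
    induced subgraph, breaking ties by the lowest sum of edge weights.
    """
    members = {v} | set(adj[v])

    def degree(u):
        return sum(1 for w in adj[u] if w in members)

    def weight(u):
        return sum(wt for w, wt in adj[u].items() if w in members)

    # v is adjacent to every member, so a non-clique always has another vertex to drop.
    while any(degree(u) < len(members) - 1 for u in members):
        candidates = [u for u in members if u != v]
        worst = min(candidates, key=lambda u: (degree(u), weight(u)))
        members.remove(worst)
    return members

adj = {
    "a": {"b": 5, "c": 5, "d": 3},
    "b": {"a": 5, "c": 6},
    "c": {"a": 5, "b": 6},
    "d": {"a": 3},
}
clique = greedy_clique("a", adj)

tie_adj = {"a": {"b": 5, "c": 3}, "b": {"a": 5}, "c": {"a": 3}}
tie_clique = greedy_clique("a", tie_adj)
```

In the second graph b and c have the same degree, so c, with the smaller weight sum, is the one removed.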

Merge other cliques into Clique1 (5-clique or 4-clique)
After merging some other cliques into Clique1, update the cliques group by removing Clique1 and the cliques merged into Clique1.

Gapless alignments
MUSCLE 4.0 is too strict to give ideal results.
For all k-mers in the quasi-clique of 2(k-1)-mers, find the k-mer with the max number of neighbors (cutoff = average match number).
[Figure: the k-mer with the max number of neighbors, the discarded k-mers, and the final alignment.]
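The slide gives only an outline of the alignment step, so the following Python sketch fills the gaps with assumptions: a k-mer's neighbors are counted as the other segments that contain a k-mer matching it at or above the cutoff, and each segment then contributes its best-matching k-mer or is discarded. Function names are hypothetical.

```python
def kmers_of(segment, k):
    """All k-mers of a segment, left to right."""
    return [segment[i:i + k] for i in range(len(segment) - k + 1)]

def matches(a, b):
    """Number of identical positions between two equal-length k-mers."""
    return sum(x == y for x, y in zip(a, b))

def gapless_alignment(segments, k, cutoff):
    """Pick the k-mer with the most neighbors, then align one k-mer per segment to it.

    Returns (center k-mer, list of kept k-mers). Segments whose best k-mer
    scores below the cutoff against the center are discarded.
    """
    center, best_count = None, -1
    for idx, segment in enumerate(segments):
        for kmer in kmers_of(segment, k):
            count = sum(
                1 for j, other in enumerate(segments)
                if j != idx and any(matches(kmer, c) >= cutoff
                                    for c in kmers_of(other, k)))
            if count > best_count:
                center, best_count = kmer, count
    aligned = []
    for segment in segments:
        best = max(kmers_of(segment, k), key=lambda c: matches(center, c))
        if matches(center, best) >= cutoff:
            aligned.append(best)
    return center, aligned

center, aligned = gapless_alignment(["ATGCAG", "CTGCAT", "GGGGGG"], k=4, cutoff=4)
```

Here the third segment has no k-mer close enough to the center, so it drops out of the alignment.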

Main steps
1. Read the input FASTA file into a matrix.
2. Calculate the background.
3. Select the match cutoff by estimating the average match number.
4. Build the graph of 2(k-1)-mers.
5. Calculate the graph density.
6. If graph density > density cutoff, update the graph by deleting edges with matches = cutoff.
7. Find all cliques associated with each vertex.
8. Select the clique with the max sum of matches and merge it with other cliques.
9. Do gapless alignments on the expanded quasi-clique.
10. Update the clique group, and go back to step 8.

Flowchart of MotifClick
Estimate average match number -> Set match cutoff = average match num + 1 -> Build graph of 2(k-1)-mers -> Graph density < 100?
No: set match cutoff = cutoff + 1, update graph, and check the density again.
Yes: find all cliques associated with each vertex -> Select the clique with max sum of matches and merge it with other cliques -> Gapless alignments using average match number as cutoff -> Update clique group, then return to clique selection.

Improvement
How many kinds of nucleotides appear in a binding site? (Data: SGD, the S. cerevisiae Genome Database, http://www.yeastgenome.org)
So we only search the k-mers containing at least 3 kinds of nucleotides.

Improvement
Percent of max length of single-nucleotide segments in BSs, e.g. TTTTTTCA: 0.75
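Both improvements reduce to simple per-k-mer statistics, sketched here in Python. The function names are hypothetical. The slide gives the minimum of 3 nucleotide kinds but no threshold for the single-nucleotide run fraction, so the sketch only computes that value.

```python
from itertools import groupby

def distinct_bases(kmer):
    """How many kinds of nucleotides appear in the k-mer."""
    return len(set(kmer))

def max_run_fraction(kmer):
    """Length of the longest single-nucleotide run divided by the k-mer length."""
    longest = max(len(list(group)) for _, group in groupby(kmer))
    return longest / len(kmer)

def has_enough_base_kinds(kmer, minimum=3):
    """Keep only k-mers containing at least `minimum` kinds of nucleotides."""
    return distinct_bases(kmer) >= minimum

example = "TTTTTTCA"
kinds = distinct_bases(example)       # T, C, A
fraction = max_run_fraction(example)  # six Ts out of eight positions
```

For the slide's example TTTTTTCA the longest run is six Ts out of eight positions, which reproduces the 0.75 shown.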

[Figure: percentage versus sum of squared distance (SSD); axis ticks from 0.02 to 0.22. SSD cutoff=0.2.]

Command-line options
Written in standard C++ and compiled with the GNU C++ compiler under Linux and Mac, and with MinGW (Minimalist GNU for Windows) under Windows (32-bit).
*********
USAGE:
*********
MotifClick <dataset> [OPTIONS] > OutputFile
    <dataset>   file containing DNA sequences in FASTA format
OPTIONS:
    -w   motif width (default=16)
    -n   maximum number of motifs to find (default=5)
    -b   2 to examine sites on both DNA strands (default=1, forward strand only)
    -d   upper bound of graph density (default=100)
    -s   0 for more degenerate sites (default=1, for fewer sites)
*********
-s 1: match cutoff = average match number + 1
-s 0: match cutoff = average match number
http://bioinfo.uncc.edu/szhang/computing.htm

Synthetic data test
We test programs for k-mer sizes 8, 12, and 16.
Hu et al. used the RegulonDB database to evaluate five algorithms (AlignACE, MEME, BioProspector, MDscan, and MotifSampler) for the prediction of prokaryotic binding sites, and found that MEME often achieved the best sensitivity and BioProspector often achieved the highest specificity.
Tompa et al. used the TRANSFAC database to assess 13 computational tools for the discovery of transcription factor binding sites in eukaryotes, and found that Weeder was the best and that MEME was also good.
Shaoqiang Zhang et al. found that MEME and BioProspector cover the most true BSs, followed by CUBIC, MDscan, MotifSampler, and CONSENSUS.
Weeder can only find motifs of length 6, 8, 10, or 12 (parameters: small (6, 8), medium (6, 8, 10), large (6, 8, 10, 12), extra (6-12, mainly 8, 10)).
We compare MotifClick with the motif-finding tools MEME, BioProspector, Weeder, and MotifCut.

Synthetic data test
Binding-site-level accuracy:
• Sensitivity: Sn = TP/(TP+FN) = (number of correctly predicted BSs)/(number of actual BSs)
• Specificity: Sp = TP/(TP+FP) = (number of correctly predicted BSs)/(number of predicted BSs)
• Performance coefficient: PC = TP/(TP+FP+FN) = (number of correctly predicted BSs)/(number of {actual U predicted BSs})
• F-measure/harmonic mean: F = 2*Sn*Sp/(Sn+Sp)
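The four measures can be computed directly from the counts of true positives, false positives, and false negatives. A small Python sketch follows; the function name and the choice to return 0.0 when a denominator is zero are assumptions of this example.

```python
def site_level_accuracy(tp, fp, fn):
    """Return (Sn, Sp, PC, F) as defined on the slide.

    Note that Sp here is TP/(TP+FP), following the slide's definition.
    """
    def ratio(num, den):
        return num / den if den else 0.0  # assumption: empty denominators give 0.0

    sn = ratio(tp, tp + fn)
    sp = ratio(tp, tp + fp)
    pc = ratio(tp, tp + fp + fn)
    f = ratio(2 * sn * sp, sn + sp)
    return sn, sp, pc, f

# 20 actual BSs, 20 predicted BSs, 15 of them correct.
sn, sp, pc, f = site_level_accuracy(tp=15, fp=5, fn=5)
```

With 15 correct predictions out of 20 predicted and 20 actual sites, Sn and Sp are both 0.75 and PC is 15/25.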

Synthetic data test
We generated synthetic sets of background sequences using a 3rd-order Markov model.
Each motif contains 20 binding sites. The motif instance of 20 BSs was randomly seeded into a synthetic FASTA file of 20 seqs, not necessarily one BS per seq.
[Figure: the motif seqs set seeded into the synthetic background seqs set.]
We test on sets of 20 seqs of length 400 (400 x 20), and likewise 600 x 20, 800 x 20, and 1000 x 20.
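A rough Python sketch of how such a test set could be built: train a 3rd-order Markov model on some training sequences, generate background sequences from it, and overwrite random positions with binding sites. This is an illustration under assumptions, not the procedure actually used for the slides: the function names are hypothetical, unseen contexts fall back to a uniform base choice, and planted sites are allowed to overlap.

```python
import random
from collections import defaultdict

def train_markov(seqs, order=3):
    """Count which base follows each context of `order` bases."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts

def generate_background(counts, length, rng, order=3):
    """Sample one sequence from the trained model."""
    seq = list(rng.choice(sorted(counts)))  # start from a random observed context
    while len(seq) < length:
        following = counts.get("".join(seq[-order:]))
        if following:
            bases = sorted(following)
            weights = [following[b] for b in bases]
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            seq.append(rng.choice("ACGT"))  # assumption: uniform fallback
    return "".join(seq[:length])

def seed_sites(background, sites, rng):
    """Plant each site at a random position of a randomly chosen sequence."""
    seqs = [list(s) for s in background]
    for site in sites:
        target = rng.choice(seqs)
        start = rng.randrange(len(target) - len(site) + 1)
        target[start:start + len(site)] = site
    return ["".join(s) for s in seqs]

rng = random.Random(0)
model = train_markov(["ACGTTGCAACGTAGCTAGCTAGGATCCA"])  # toy training data
background = [generate_background(model, 400, rng) for _ in range(20)]
seeded = seed_sites(background, ["TGTGATCA"] * 20, rng)
```

Because each site picks its sequence independently, some sequences may receive several sites and others none, which matches the "not necessarily one BS per seq" setup.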

Synthetic data test (8-mer/octamer)
Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of the yeast genome. Yeast background: AT: 0.65, GC: 0.35.
Motifs containing 20 BSs with information content of 12 bits (at most 6 positions are conserved) were chosen from the SGD database.
Commands:
    meme inputfile.fasta -dna -mod anr -w 8 -nmotifs 1 -text > file.meme.out
    weederlauncher.out inputfile SC medium M T1
    weederTFBS.out -f inputfile.fasta -W 8 -O SC -e 3 -R 50 -M -T 1
    adviser.out inputfile.fasta S
    BioProspector -i inputfile.fasta -W 8 -d 1 -r 1 -o file.biop.out
    Motif_cuts.exe inputfile.fasta 8 1
    MotifClicker inputfile.fasta -w 8 -n 1 -s 1 > file.motifclick.out
    MotifClicker inputfile.fasta -w 8 -n 1 -s 0 > file.motifclick.out
[Slide annotations on the command parameters: number of mutations allowed; binding sites count; binding site length; "unfair to other tools".]

[Figure: distribution of the sum of squared distance (SSD); axis ticks from 0.02 to 0.22.]
Background seq sets of sizes 400*20, 600*20, 800*20, and 1000*20. Seed motifs into 100 instances of each size.

Synthetic data test (8-mer), 400*20
[Figure: results on 100 instances of 400*20 seq sets for two motifs, one with average SSD=0.06 and one with average SSD=0.10.]
Note: Weeder did not output any results on the two motifs when the number of output motifs was set to "T1", so we used "T2" and considered only the top 1 motif of "T2".

K-mer size 8 (using two motifs with SSD=0.06 and SSD=0.10, respectively, on 100 datasets)
[Figure panels: Sensitivity, Specificity, PC, F-measure.]

Dodeca-mer (12-mer)
• Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12.
• Motifs containing 20 BSs with information content of 14 bits (at most 7 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database.
• Seed motifs into 100 background seq sets.
• Test on 400*20, 600*20, 800*20, and 1000*20.
• We abandoned Weeder because it can only set the motif length as "small" (length 6 with 1 mutation, length 8 with 2 mutations), "medium" (like small, plus length 10 with 3 mutations), "large" (like medium, plus length 12 with 4 mutations), or "extra" (length 6 with 1 mutation, length 8 with 3 mutations, length 10 with 4 mutations, length 12 with 4 mutations). That is, Weeder only accepts even motif lengths between 6 and 12, and for length 12 it accepts at most 4 mutations.

K-mer size 12, seeded into 100 background seq sets
[Figure panels: Sn, Sp, PC, F-measure.]

12-mer, with added noise
[Figure panels: Sn, Sp, PC, F-measure.]

16-mer
• Synthetic background seqs: the dependencies of the 3rd-order Markov model were estimated from all intergenic seqs of E. coli K12.
• Motifs containing 20 BSs with information content of 16 bits (at most 8 positions are conserved) and an average SSD=0.02 between each BS and the background were chosen from the RegulonDB database.
• Seed motifs into 100 background seq sets.
• Test on 400*20, 600*20, 800*20, and 1000*20.

16-mer
[Figure panels: Sn, Sp, PC, F-measure.]

16-mer, with added noise
[Figure panels: Sn, Sp, F-measure, PC.]

Motif finding in Yeast (8-mer)
Motif finding in 5137 intergenic sequence sets of orthologous genes*, which contain 2932 BSs belonging to 99 TFs in SGD (http://www.yeastgenome.org).
*At least 3 orthologous genes for each intergenic sequence set.

Motif finding in E. coli K12 (16-mer)
E. coli K12: 2313 operon groups. RegulonDB v6.0: 122 TFs, 1411 BSs.
Weeder and CONSENSUS perform the worst because they need high-quality input seq sets.

Conclusions
• Synthetic data: MotifCut has the highest specificity. MotifClick has the highest sensitivity. MotifClick is the most complementary to the other tools.
• Yeast data and E. coli data: MotifClick and MEME have similar numbers of true predictions, and more true predictions than the other tools. MotifClick is the most complementary to the other tools.
