Regulatory Motif Finding

Mohammed AlQuraishi

Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance

Cell = Factory, Proteins = Machines • Source: Biovisions, Harvard

DNA • “Coding” Regions: Instructions for making the machines • “Regulatory” Regions (Regulons): Instructions for when and where to make them

Transcriptional Regulation • Regulatory regions are composed of “binding sites” • “Binding sites” attract a special class of proteins, known as “transcription factors” • Bound transcription factors can promote or inhibit DNA transcription

DNA Regulation Source: Richardson, University College London

Cell Regulation • Transcriptional regulation is one of many regulatory mechanisms in the cell (the focus of this talk) • Source: Mallery, University of Miami

Structural Basis of Interaction

Structural Basis of Interaction • Key Feature: • Transcription factors are not 100% specific when binding DNA • Not one sequence, but a family of sequences with varying affinities • [Figure: example binding-site sequences with affinities ranging from 0.54 down to 0.08]

Motif Finding • Basic Objective: • Find regions in the genome that bind transcription factors • Many classes of algorithms, differ in: • Types of input data • Motif representation

Input Data • Single sequence • Evolutionarily related set of sequences • Sequence + other data • Microarray expression profile • ChIP-chip • Others…

Motif Representation • Probabilistic • Word-Based Focus of Talk

Motif Representation • Structural discussion immediately raises difficulties • Least Expressive: • Single sequence (e.g., GACCG) • Most Expressive: • 4^k-dimensional probability distribution • Independently assign a probability to each possible kmer

Motif Representation • Standard Solution: • Position-Specific Scoring Matrix (PSSM) • Assuming independence of positions, assign a probability to each base at each position • Fraught with problems… (will revisit this)
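
To make the independence assumption concrete, here is a minimal sketch of scoring a kmer under a PSSM. The matrix values and the names `PSSM` and `pssm_probability` are illustrative assumptions, not taken from the talk or from any real motif.

```python
from itertools import product

# Hypothetical 4-position PSSM: one probability per base per position.
# The numbers are made up for illustration only.
PSSM = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.6, "C": 0.1, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.1, "T": 0.2},
]


def pssm_probability(kmer, pssm):
    """Probability of a kmer under the PSSM, assuming independent positions."""
    if len(kmer) != len(pssm):
        raise ValueError("kmer length must match PSSM width")
    prob = 1.0
    for base, column in zip(kmer, pssm):
        prob *= column[base]  # independence: per-position probabilities multiply
    return prob


# The PSSM implicitly defines a distribution over all 4^k kmers,
# but with only 3 free parameters per position (12 here) instead of 4^4 - 1 = 255.
total = sum(pssm_probability("".join(p), PSSM) for p in product("ACGT", repeat=4))
```

Because every kmer gets a nonzero product of column probabilities, a PSSM assigns probability even to kmers that were never observed as binding sites.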

Reference • Authors: • Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou • Title: • MotifCut: regulatory motifs finding with maximum density subgraphs • Publication: • Bioinformatics Vol. 22 no. 14 2006, pages e150–e157

Overview • Motif Finding Algorithm (“MotifCut”) • Motivation • Oversimplicity of PSSMs • Intractability of more complex models

Oversimplicity of PSSMs • Assumes independence between positions • ~25% of TRANSFAC motifs have been shown to violate this assumption • Two Examples: ADR1 and YAP6

Oversimplicity of PSSMs • Assumes independence between positions • Generates potentially unseen motifs

Basic Features of MotifCut • Does not assume an underlying PSSM • Represents a motif with a graph structure • In principle maximally expressive • In practice not quite • Motif finding cast as maximum density subgraph • Subquadratic complexity

Motif Graph Representation • Nodes are kmers • Edge weights are distances between kmers • [Figure: example graph over the kmers AGTGCGAC, AGTGGGAC, AGTGGGAC, and AGTGCTAC, with edge labels 0, 1, and 2] • Generative model: Frequency of kmer node equal to frequency of generating kmer • Distance definition is complicated (will come back to this) • Same kmer node can appear multiple times

Motif Finding • Find highest density subgraph • Density is defined as sum of edge weights per node • Somewhat limits representational power
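
The density definition above can be written down directly. This is a small sketch; the edge representation (a dict keyed by node pairs) and the function name `density` are assumptions made for illustration.

```python
def density(nodes, weights):
    """Sum of edge weights inside `nodes`, divided by the number of nodes.

    `weights` maps an undirected edge (u, v) to its weight; each edge
    is stored once.
    """
    nodes = set(nodes)
    if not nodes:
        return 0.0
    inside = sum(w for (u, v), w in weights.items() if u in nodes and v in nodes)
    return inside / len(nodes)


# Hypothetical toy graph: a triangle of strong edges plus one weaker edge.
example_weights = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 0.5}
```

On this toy graph the triangle {0, 1, 2} has density 1.0, while adding node 3 lowers it to 0.875, so the triangle is the denser candidate.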

Motif Finding • Read new sequence • Generate graph as previously described • Kmers are generated by shifting one base pair at a time • Each kmer in the sequence gets a node, including identical kmers • Graph contains roughly as many nodes as there are base pairs (n − k + 1 for a sequence of length n) • Connect nodes with edges weighted by the distances between them • Find densest subgraph
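
The graph-construction steps above can be sketched as follows. The weight function `match_count` is only a stand-in (the talk derives edge weights from a simulated likelihood), and the naive all-pairs loop is quadratic, so this is not the subquadratic construction the method claims.

```python
def kmers(sequence, k):
    """All kmers obtained by sliding a window one base pair at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def match_count(a, b):
    """Placeholder edge weight (an assumption): number of matching positions."""
    return sum(x == y for x, y in zip(a, b))


def build_graph(sequence, k, weight_fn=match_count):
    """One node per kmer occurrence (identical kmers get separate nodes)."""
    nodes = kmers(sequence, k)
    edges = {}
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            w = weight_fn(nodes[i], nodes[j])
            if w > 0:
                edges[(i, j)] = w  # nodes are identified by position in the sequence
    return nodes, edges


nodes, edges = build_graph("GGGAGGG", 3)
```

Nodes are keyed by their start position rather than by the kmer string, which is what lets the same kmer appear more than once in the graph.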

Edge Weights • Heart of the algorithm, will focus on this • Semantics: • Edge weight is the likelihood of two kmers to be in the same motif • Use Hamming distance as a way to quantify distance between kmers • [Figure: example kmers with pairwise Hamming distances 0, 1, 2, and 3]

Edge Weights • Heart of the algorithm, will focus on this • Semantics: • Edge weight is the likelihood of two kmers to be in the same motif • Use Hamming distance as a way to quantify distance between kmers • “Interpret” Hamming distance as a measure of the likelihood of two kmers to be in the same motif: • F(Hamming distance) = likelihood of two kmers to be in the same motif
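
For reference, Hamming distance between two equal-length kmers is simply the number of mismatching positions. A minimal sketch, using kmers from the graph example earlier in the talk:

```python
def hamming(a, b):
    """Number of positions at which two equal-length kmers differ."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))


d_one = hamming("AGTGCGAC", "AGTGGGAC")   # differ at one position
d_two = hamming("AGTGGGAC", "AGTGCTAC")   # differ at two positions
d_zero = hamming("AGTGGGAC", "AGTGGGAC")  # identical kmers
```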

Edge Weights • Let’s make this a bit more precise: • But how to compute the likelihood that two kmers are in the same motif? • Simulate it! • Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.

“Genome Simulation” • Background + Motifs • No genes, promoters, signaling sequences, etc. • Background Model • 3rd order Markov model • Probability of next base depends on previous 3 bases • Modeled on the yeast genome • Incorporates GC bias • Motif Model • PSSM • Based on empirically observed information content of yeast motifs

“Genome Simulation” • Use the Markov model to generate background sequences of length 10k–20k • Seed with 20 motifs generated by the PSSM • Result is a simulated yeast genome • We know which parts are the real motifs and which are not
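
The simulation described above might look roughly like this. It is a sketch under loud assumptions: the 3rd-order transition probabilities are random rather than trained on the yeast genome, `EXAMPLE_PSSM` is invented, and motifs are planted in non-overlapping slots purely for simplicity.

```python
import random
from itertools import product

BASES = "ACGT"

# Hypothetical 6-position PSSM (columns in A, C, G, T order); values are made up.
EXAMPLE_PSSM = [[0.05, 0.05, 0.85, 0.05]] * 6


def random_markov3(rng):
    """Assumed stand-in for a trained model: random next-base distribution per 3-mer."""
    model = {}
    for ctx in product(BASES, repeat=3):
        raw = [rng.random() + 0.1 for _ in BASES]
        total = sum(raw)
        model["".join(ctx)] = [x / total for x in raw]
    return model


def generate_background(model, length, rng):
    """Each new base depends on the previous three bases."""
    seq = [rng.choice(BASES) for _ in range(3)]
    while len(seq) < length:
        ctx = "".join(seq[-3:])
        seq.append(rng.choices(BASES, weights=model[ctx])[0])
    return "".join(seq[:length])


def sample_motif(pssm, rng):
    """Draw one motif instance, one independent base per PSSM column."""
    return "".join(rng.choices(BASES, weights=col)[0] for col in pssm)


def plant_motifs(background, pssm, n_motifs, rng):
    """Overwrite n_motifs non-overlapping windows with sampled motif instances."""
    k = len(pssm)
    seq = list(background)
    slots = rng.sample(range(len(seq) // k), n_motifs)
    starts = sorted(s * k for s in slots)
    for start in starts:
        seq[start:start + k] = sample_motif(pssm, rng)
    return "".join(seq), starts  # starts record where the true motifs are


rng = random.Random(0)
background = generate_background(random_markov3(rng), 10_000, rng)
genome, motif_starts = plant_motifs(background, EXAMPLE_PSSM, 20, rng)
```

Returning `motif_starts` alongside the sequence is the point of the exercise: it provides the ground truth needed to label kmers as motif or background.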

Edge Weights • Back to the likelihood that two kmers are in the same motif: • α(k, l) is the number of true motifs of length k that are distance l away • β(k, l) is the number of non-motifs of length k that are distance l away

Edge Weights • [Figure: example 6-mers, some labeled as true motifs and others as false motifs (part of the background)]

Edge Weights • Let’s perform the calculation from the perspective of one motif • Consider all kmers ≤ 1 Hamming distance away • α(k = 6, l = 1) = 1 • β(k = 6, l = 1) = 1 • [Figure: example 6-mers, with the reference motif and its neighbors within Hamming distance 1]

Edge Weights • Computation provides an empirical estimate for the likelihood that two kmers are in the same motif • Parameterized by two quantities: • k, the kmer length • l, the Hamming distance between two kmers • Fit to a sigmoidal function
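
A sketch of the counting step on labeled simulation output. Two assumptions are worth flagging: counts include everything within distance at most l (following the "≤ 1" worked example), and the ratio α / (α + β) is used here as the empirical estimate, which the slides do not state explicitly. The labeled kmers are hypothetical, and the sigmoid fit is not shown.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))


def count_alpha_beta(reference, labeled_kmers, l):
    """Count true motifs (alpha) and non-motifs (beta) within distance l of reference.

    `labeled_kmers` is a list of (kmer, is_true_motif) pairs that does not
    include the reference occurrence itself.
    """
    alpha = beta = 0
    for kmer, is_motif in labeled_kmers:
        if hamming(reference, kmer) <= l:
            if is_motif:
                alpha += 1
            else:
                beta += 1
    return alpha, beta


def same_motif_estimate(alpha, beta):
    """Assumed estimator: fraction of nearby kmers that are true motifs."""
    total = alpha + beta
    return alpha / total if total else 0.0


# Hypothetical labeled 6-mers chosen to reproduce alpha = 1, beta = 1 at l = 1.
labeled = [("GGGGGG", True), ("GGGTGG", False), ("GGCCGG", False)]
alpha, beta = count_alpha_beta("GGGGGG", labeled, l=1)
estimate = same_motif_estimate(alpha, beta)
```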

Edge Weights • Normalization step • Won’t go into details • This covers problem formulation • How is motif finding actually done?

Maximum Density Subgraph • Standard graph theory method • Max-flow / min-cut • O(nm log(n^2/m)) • Need faster method • Developed heuristic approach that utilizes max-flow / min-cut method with modifications

Maximum Density Subgraph • Remove all edges below a certain threshold

Maximum Density Subgraph • Pick one vertex (do this for every vertex)

Maximum Density Subgraph • Put back all neighboring edges for that vertex

Maximum Density Subgraph • Use standard algorithm to calculate densest subgraph
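
The four steps above (threshold, pick a vertex, restore its edges, solve) can be sketched as below. Note the substitution: instead of the max-flow / min-cut solver used in the talk, this sketch plugs in a simple greedy peeling approximation to keep the example short. Taking the best result across all vertices is also an assumption about how the per-vertex results are combined.

```python
def greedy_densest_subgraph(nodes, weights):
    """Greedy peeling (a swapped-in approximation, not max-flow / min-cut):
    repeatedly drop the node of lowest weighted degree and remember the
    densest intermediate node set."""
    remaining = set(nodes)
    adj = {u: {} for u in remaining}
    for (u, v), w in weights.items():
        if u in adj and v in adj:
            adj[u][v] = w
            adj[v][u] = w
    degree = {u: sum(adj[u].values()) for u in remaining}
    total = sum(degree.values()) / 2  # each edge was counted from both ends
    best_nodes = set(remaining)
    best_density = total / len(remaining) if remaining else 0.0
    while len(remaining) > 1:
        u = min(remaining, key=lambda x: degree[x])
        remaining.remove(u)
        total -= degree[u]
        for v, w in adj[u].items():
            if v in remaining:
                degree[v] -= w
        current = total / len(remaining)
        if current > best_density:
            best_nodes, best_density = set(remaining), current
    return best_nodes, best_density


def heuristic_densest(nodes, weights, threshold):
    """Threshold edges, then for each vertex restore its edges and solve."""
    strong = {e: w for e, w in weights.items() if w >= threshold}
    best_nodes, best_density = set(), 0.0
    for v in nodes:
        local = dict(strong)
        for (a, b), w in weights.items():
            if a == v or b == v:
                local[(a, b)] = w  # put back all edges touching v
        cand_nodes, cand_density = greedy_densest_subgraph(nodes, local)
        if cand_density > best_density:
            best_nodes, best_density = cand_nodes, cand_density
    return best_nodes, best_density


# Hypothetical toy graph: a strong triangle with a weak tail.
toy_weights = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0, (2, 3): 0.2, (3, 4): 0.2}
found_nodes, found_density = heuristic_densest(range(5), toy_weights, threshold=0.5)
```

Thresholding keeps each per-vertex problem sparse, which is where the speedup over running the exact solver on the full graph would come from.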

Results • Synthetic Tests • Plenty of test cases • Measure performance as data set size grows • Avoid over-biasing toward empirical data • Know the real answer, so performance can be tested unambiguously • Yeast Test • Gold standard data (Harbison et al., 2004)

Synthetic Tests • Varied: • Motif length • Information content • Simulated genome (as before) • Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
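
One way the true-positive criterion could be computed is sketched below. The slides do not say which correlation measure or alignment was used, so Pearson correlation over the flattened, equal-width matrices is an assumption, as are the function names and the example matrices.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def pssm_correlation(predicted, real):
    """Correlate two PSSMs of the same width, entry by entry (assumed measure)."""
    if len(predicted) != len(real):
        raise ValueError("PSSMs must have the same width")
    flat_p = [x for col in predicted for x in col]
    flat_r = [x for col in real for x in col]
    return pearson(flat_p, flat_r)


def is_true_positive(predicted, real, cutoff=0.7):
    """Count a prediction as a true positive if correlation exceeds the cutoff."""
    return pssm_correlation(predicted, real) > cutoff


# Hypothetical 2-column PSSMs (columns in A, C, G, T order).
real_pssm = [[0.1, 0.1, 0.7, 0.1], [0.7, 0.1, 0.1, 0.1]]
swapped_pssm = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
```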

Synthetic Tests Results

Yeast Test Results

Performance
