Regulatory Motif Finding Mohammed AlQuraishi
Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance
Cell = Factory, Proteins = Machines • Source: BioVisions, Harvard
DNA • “Coding” Regions: instructions for making the machines • “Regulatory” Regions (Regulons): instructions for when and where to make them
Transcriptional Regulation • Regulatory regions are composed of “binding sites” • Binding sites attract a special class of proteins known as “transcription factors” • Bound transcription factors can promote or inhibit DNA transcription
DNA Regulation Source: Richardson, University College London
Cell Regulation • Transcriptional regulation (the focus of this talk) is one of many regulatory mechanisms in the cell • Source: Mallery, University of Miami
Structural Basis of Interaction • Key Feature: • Transcription factors are not 100% specific when binding DNA • Not one sequence, but a family of sequences with varying affinities • [Figure: six aligned 5-base binding sequences with affinities 0.54, 0.48, 0.32, 0.25, 0.11, and 0.08]
Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance
Motif Finding • Basic Objective: • Find regions in the genome that bind transcription factors • Many classes of algorithms, differ in: • Types of input data • Motif representation
Input Data • Single sequence • Evolutionarily related set of sequences • Sequence + other data • Microarray expression profile • ChIP-chip • Others…
Motif Representation • Probabilistic • Word-Based Focus of Talk
Motif Representation • Structural discussion immediately raises difficulties • Least Expressive: • Single sequence • Most Expressive: • 4^k-dimensional probability distribution • Independently assign a probability to each possible kmer (e.g., GACCG)
Motif Representation • Standard Solution: • Position-Specific Scoring Matrix (PSSM) • Assuming independence between positions, assign a probability to each base at each position • Fraught with problems… (Will revisit this)
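The PSSM idea above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the function names, the pseudocount, and the example sites are all assumptions made for the sketch. A PSSM stores 4 probabilities per position (4k numbers in total), and under the independence assumption the probability of a kmer is the product of its per-position probabilities.

```python
from collections import Counter

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Estimate per-position base probabilities from equal-length aligned sites."""
    k = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for pos in range(k):
        counts = Counter(site[pos] for site in sites)
        pssm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pssm

def kmer_probability(pssm, kmer):
    """Independence assumption: multiply the per-position probabilities."""
    p = 1.0
    for column, base in zip(pssm, kmer):
        p *= column[base]
    return p

# Illustrative binding sites only (not real data).
sites = ["GACCG", "GACCG", "GACTG", "GGCTA"]
pssm = build_pssm(sites)
print(kmer_probability(pssm, "GACCG"), kmer_probability(pssm, "TTTTT"))
```

The pseudocount keeps unseen bases from getting probability zero; whether and how to smooth is a modeling choice, not something the slides specify.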
Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance
Reference • Authors: • Eugene Fratkin, Brian T. Naughton, Douglas L. Brutlag, and Serafim Batzoglou • Title: • MotifCut: regulatory motifs finding with maximum density subgraphs • Publication: • Bioinformatics Vol. 22 no. 14 2006, pages e150–e157
Overview • Motif Finding Algorithm (“MotifCut”) • Motivation • Oversimplicity of PSSMs • Intractability of more complex models
Oversimplicity of PSSMs • Assumes independence between positions • ~25% of TRANSFAC motifs have been shown to violate this assumption • Two Examples: ADR1 and YAP6
Oversimplicity of PSSMs • Assumes independence between positions • Generates potentially unseen motifs
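A toy example makes the "unseen motifs" point concrete. It is a sketch with made-up two-base sites: if the only observed sites are AC and GT, a PSSM built from them (here without pseudocounts) gives equal probability to AT and GC, which were never observed, because it discards the dependency between the two positions.

```python
from itertools import product

# Hypothetical observed sites with perfectly correlated positions.
observed = ["AC", "GT"]

# Column-wise base frequencies, no pseudocounts.
columns = []
for pos in range(2):
    col = {}
    for site in observed:
        col[site[pos]] = col.get(site[pos], 0.0) + 1.0 / len(observed)
    columns.append(col)

# Every combination of per-column bases gets nonzero probability.
generated = {a + b: columns[0][a] * columns[1][b]
             for a, b in product(columns[0], columns[1])}

unseen = sorted(set(generated) - set(observed))
print(generated)
print("generated but never observed:", unseen)
```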
Basic Features of MotifCut • Does not assume an underlying PSSM • Represents a motif with a graph structure • In principle maximally expressive • In practice not quite • Motif finding cast as maximum density subgraph • Subquadratic complexity
Motif Graph Representation • Nodes are kmers • Edge weights are distances between kmers • [Figure: example graph over the kmers AGTGCGAC, AGTGGGAC, AGTGGGAC, and AGTGCTAC, with edges labeled by distances 0, 1, and 2] • Generative model: Frequency of kmer node equal to frequency of generating kmer • Distance definition is complicated (will come back to this) • Same kmer node can appear multiple times
Motif Finding • Find highest density subgraph • Density is defined as sum of edge weights per node • Somewhat limits representational power
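The density definition above (sum of edge weights per node) is easy to state in code. The sketch below uses a hypothetical edge-weight dictionary keyed by node pairs; the names and numbers are illustrative only.

```python
def density(nodes, edge_weights):
    """Sum of weights of edges inside `nodes`, divided by the number of nodes."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    inside = sum(w for (u, v), w in edge_weights.items()
                 if u in nodes and v in nodes)
    return inside / len(nodes)

# Hypothetical weights: nodes 0, 1, 2 are tightly connected, node 3 is not.
edge_weights = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.7, (2, 3): 0.1}
print(density({0, 1, 2}, edge_weights))     # dense cluster
print(density({0, 1, 2, 3}, edge_weights))  # adding node 3 lowers the density
```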
Motif Finding • Read new sequence • Generate graph as previously described • Kmers are generated by sliding a window one base pair at a time • Each kmer in the sequence gets a node, including identical kmers • Graph contains roughly as many nodes as there are base pairs • Connect edges with weights based on distances between nodes • Find densest subgraph
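The node-generation step can be sketched as a sliding window. The function name and the example sequence are assumptions for illustration. A sequence of length n yields n - k + 1 nodes, and identical kmers at different positions remain distinct nodes.

```python
def kmer_nodes(sequence, k):
    """One node per window position; each node is (start index, kmer)."""
    return [(i, sequence[i:i + k]) for i in range(len(sequence) - k + 1)]

# Illustrative sequence in which the same 8-mer occurs twice.
sequence = "AGTGCGACAGTGCGAC"
nodes = kmer_nodes(sequence, 8)
for start, kmer in nodes:
    print(start, kmer)
```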
Edge Weights • Heart of the algorithm, will focus on this • Semantics: • Edge weight is the likelihood that two kmers are in the same motif • Use Hamming distance to quantify the distance between kmers • [Figure: pairs of kmers with Hamming distances 0, 1, 2, and 3] • “Interpret” Hamming distance as a measure of the likelihood that two kmers are in the same motif: • F(Hamming distance) = likelihood that two kmers are in the same motif
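For reference, Hamming distance is the number of positions at which two equal-length strings differ. A minimal sketch, using the 8-mers that appear in the motif graph slide as examples:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length kmers."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("AGTGCGAC", "AGTGGGAC"))  # differ at one position
print(hamming("AGTGGGAC", "AGTGCTAC"))  # differ at two positions
print(hamming("AGTGGGAC", "AGTGGGAC"))  # identical kmers
```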
Edge Weights • Let’s make this a bit more precise: • But how to compute this likelihood? • Simulate it! • Way too many variables to account for analytically: background model, kmer length, Hamming distance, etc.
“Genome Simulation” • Background + Motifs • No genes, promoters, signaling sequences, etc. • Background Model • 3rd order Markov model • Probability of next base depends on previous 3 bases • Modeled on the yeast genome • Incorporates GC bias • Motif Model • PSSM • Based on empirically observed information content of yeast motifs
“Genome Simulation” • Use the Markov model to generate background sequences of length 10k – 20k • Seed with 20 motifs generated by the PSSM • Result is a simulated yeast genome • We know which parts are the real motifs and which are not
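The simulation recipe can be sketched as follows. Everything concrete here is an assumption for illustration: the toy training string stands in for the yeast genome, the PSSM is made up, and the function names are hypothetical. The sketch trains a 3rd-order Markov model, samples a background, and plants 20 PSSM-sampled sites at known, non-overlapping positions.

```python
import random
from collections import defaultdict

BASES = "ACGT"

def train_markov(training_seq, order=3, pseudocount=1.0):
    """Count next-base occurrences for every context of `order` bases."""
    counts = defaultdict(lambda: {b: pseudocount for b in BASES})
    for i in range(order, len(training_seq)):
        counts[training_seq[i - order:i]][training_seq[i]] += 1
    return counts

def generate_background(counts, length, order, rng):
    """Sample each base conditioned on the previous `order` bases."""
    seq = [rng.choice(BASES) for _ in range(order)]
    while len(seq) < length:
        weights = counts["".join(seq[-order:])]  # unseen context: uniform
        seq.append(rng.choices(BASES, weights=[weights[b] for b in BASES])[0])
    return "".join(seq)

def sample_site(pssm, rng):
    """Draw one motif instance, position by position, from the PSSM."""
    return "".join(rng.choices(BASES, weights=[col[b] for b in BASES])[0]
                   for col in pssm)

def plant_motifs(background, pssm, n_sites, rng):
    """Overwrite n_sites non-overlapping windows with sampled motif sites."""
    k = len(pssm)
    seq = list(background)
    starts = sorted(s * k for s in rng.sample(range(len(seq) // k), n_sites))
    for start in starts:
        seq[start:start + k] = sample_site(pssm, rng)
    return "".join(seq), starts

rng = random.Random(0)
training = "ACGT" * 50 + "GGGCCC" * 20          # toy stand-in, not yeast
counts = train_markov(training)
background = generate_background(counts, 10_000, 3, rng)
toy_pssm = [{"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05}] * 6  # made up
genome, starts = plant_motifs(background, toy_pssm, 20, rng)
print(len(genome), starts[:5])
```

Because `starts` records where the planted sites are, every kmer in the simulated genome can be labeled as true motif or background, which is what the edge-weight estimate needs.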
Edge Weights • Back to the likelihood estimate: • α(k, l) is the number of true motifs of length k that are at distance l • β(k, l) is the number of non-motifs of length k that are at distance l
Edge Weights • [Figure: simulated 6-mers such as GGGGGG and its single-base variants, labeled as true motifs or as false motifs (part of the background)]
Edge Weights • Let’s perform the calculation from the perspective of this motif • All ≤1 distance away (Hamming distance) • α(k = 6, l = 1) = 1 • β(k = 6, l = 1) = 1 • [Figure: the reference 6-mer GGGGGG among neighboring simulated 6-mers, with one true motif and one background kmer at Hamming distance 1]
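The counting step can be sketched as below. Two things are assumptions rather than statements from the slides: the counts are taken at exactly distance l, and the counts are combined as α / (α + β), which is one natural empirical estimate of the chance that a kmer at distance l belongs to the same motif. The labeled kmers are hypothetical toy data chosen to reproduce α = 1 and β = 1.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))

def count_alpha_beta(reference, labeled_kmers, l):
    """alpha: true-motif kmers at distance l; beta: background kmers at distance l."""
    alpha = sum(1 for kmer, is_motif in labeled_kmers
                if is_motif and hamming(reference, kmer) == l)
    beta = sum(1 for kmer, is_motif in labeled_kmers
               if not is_motif and hamming(reference, kmer) == l)
    return alpha, beta

def same_motif_estimate(alpha, beta):
    """Assumed estimator: fraction of distance-l neighbors that are true motifs."""
    return alpha / (alpha + beta) if alpha + beta else 0.0

# Hypothetical labeled 6-mers from a simulated genome: (kmer, is_true_motif).
labeled = [("GGGTGG", True), ("GGCGGG", False), ("ACGTAC", False)]
alpha, beta = count_alpha_beta("GGGGGG", labeled, l=1)
print(alpha, beta, same_motif_estimate(alpha, beta))
```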
Edge Weights • Computation provides an empirical estimate for the edge-weight likelihood • Parameterized by two quantities: • k, the kmer length • l, the Hamming distance between two kmers • Fit to a sigmoidal function
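The slides do not give the functional form or the fitting procedure, so the following is only a sketch under assumptions: a decreasing logistic curve in l with two hypothetical parameters (`midpoint`, `steepness`) for a fixed k, fitted by a brute-force least-squares grid search. The points being fitted are generated from a known curve purely to show that the fit recovers it; they are not empirical estimates.

```python
import math

def sigmoid_weight(l, midpoint, steepness):
    """Decreasing logistic curve: near 1 for small l, near 0 for large l."""
    return 1.0 / (1.0 + math.exp(steepness * (l - midpoint)))

def fit_sigmoid(points):
    """Grid search over (midpoint, steepness) minimizing squared error."""
    best = None
    for m10 in range(0, 81):          # midpoint 0.0 .. 8.0 in steps of 0.1
        for s10 in range(1, 51):      # steepness 0.1 .. 5.0 in steps of 0.1
            m, s = m10 / 10, s10 / 10
            err = sum((sigmoid_weight(l, m, s) - y) ** 2 for l, y in points)
            if best is None or err < best[0]:
                best = (err, m, s)
    return best[1], best[2]

# Synthetic points from a known curve (not measured data).
points = [(l, sigmoid_weight(l, 2.0, 1.5)) for l in range(7)]
fitted = fit_sigmoid(points)
print(fitted)
```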
Edge Weights • Normalization step • Won’t go into details • This covers problem formulation • How is motif finding actually done?
Maximum Density Subgraph • Standard graph theory method • Max-flow / min-cut • O(nm log(n^2/m)) • Need faster method • Developed heuristic approach that utilizes max-flow / min-cut method with modifications
Maximum Density Subgraph • Remove all edges below a certain threshold
Maximum Density Subgraph • Pick one vertex (do this for every vertex)
Maximum Density Subgraph • Put back all neighboring edges for that vertex
Maximum Density Subgraph • Use standard algorithm to calculate densest subgraph
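The four steps above can be sketched as a loop over seed vertices. This is one reading of the slides, not the published implementation. In particular, the densest-subgraph routine below is greedy peeling (repeatedly dropping the node with the lowest weighted degree), swapped in for the max-flow / min-cut method because it is short; it is an approximation. All names, weights, and the threshold are hypothetical.

```python
def greedy_densest(nodes, weights):
    """Greedy peeling: drop the lowest weighted-degree node, keep the best density."""
    adj = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        if u in adj and v in adj:
            adj[u][v] = w
            adj[v][u] = w
    total = sum(w for u in adj for w in adj[u].values()) / 2
    current = set(nodes)
    best_nodes, best_density = set(), 0.0
    while current:
        d = total / len(current)
        if d > best_density:
            best_nodes, best_density = set(current), d
        u = min(current, key=lambda x: sum(adj[x].values()))
        for v, w in list(adj[u].items()):
            total -= w
            del adj[v][u]
        adj[u] = {}
        current.remove(u)
    return best_nodes, best_density

def best_motif_subgraph(nodes, weights, threshold):
    # Step 1: remove all edges below the threshold.
    strong = {e: w for e, w in weights.items() if w >= threshold}
    best = (set(), 0.0)
    for seed in nodes:                       # Step 2: try every vertex as seed.
        local = dict(strong)
        # Step 3: put back every edge touching the seed, even weak ones.
        local.update({e: w for e, w in weights.items() if seed in e})
        # Step 4: densest subgraph of the resulting graph.
        result = greedy_densest(nodes, local)
        if result[1] > best[1]:
            best = result
    return best

weights = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.9, (2, 3): 0.1, (3, 4): 0.2}
motif_nodes, motif_density = best_motif_subgraph(range(5), weights, threshold=0.5)
print(motif_nodes, motif_density)
```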
Results • Synthetic Tests • Plenty of test cases • Measure performance as data set size grows • Avoid over-biasing toward empirical data • Know the real answer, so performance can be tested unambiguously • Yeast Test • Gold standard data (Harbison et al., 2004)
Synthetic Tests • Varied: • Motif length • Information content • Simulated genome (as before) • Correlated predicted PSSMs to real ones, counted as true positive if correlation > 0.7
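The true-positive criterion can be sketched as follows. The slides do not say which correlation measure was used or how PSSMs of different lengths or offsets were aligned, so this sketch assumes Pearson correlation over the flattened entries of two PSSMs of equal length. The PSSMs are made up.

```python
import math

BASES = "ACGT"

def flatten(pssm):
    """Lay out the matrix as one vector, position by position."""
    return [col[b] for col in pssm for b in BASES]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def is_true_positive(predicted, real, threshold=0.7):
    return pearson(flatten(predicted), flatten(real)) > threshold

# Hypothetical 2-position PSSMs.
real = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1}]
close = [{"A": 0.6, "C": 0.2, "G": 0.1, "T": 0.1},
         {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1}]
wrong = [{"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
         {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]
print(is_true_positive(close, real), is_true_positive(wrong, real))
```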
Talk Outline • Biology Background • Algorithmic Problem • Papers • New Motif Finding Algorithm (MotifCut) • Analysis of Motif Finders’ Performance