MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays

Thomas R. Ioerger¹, Ganesh Rajagopalan¹, Debby Siegele² • ¹Department of Computer Science • ²Department of Biology • Texas A&M University

Analyzing Gene Expression Patterns • DNA microarrays • ~4000 genes for E. coli, ~6000 genes for yeast • Compare expression levels between conditions • Example: starvation response in E. coli • starve cells for nutrient sources • reintroduce nutrients => recovery => exponential growth • which genes show changes in response?

Types of response: • up-regulation • down-regulation • transient response (spike) • (arbitrary temporal patterns) • Problem: genes can be clustered by response pattern, but then what? • not all genes in a cluster are regulated the same way

Couple with genomic analysis • search for common motifs in upstream regions • subsets of co-regulated genes within clusters • Assumptions: 1. regulation occurs by interaction of transcription factors with small motifs (~10-20 bp) within several hundred bp of the transcription start site; 2. among many motifs, the ones of interest will be common to some genes in a cluster, but not found in any genes outside it (with different responses); 3. the motif does not have to be shared by all genes in the cluster, only a subset

Related Work • Many algorithms exist for motif finding • assume cluster (gene set) is already defined • word/string analysis models • probabilistic models • Gibbs sampling (AlignACE, MotifSampler) • Expectation Maximization (MEME) • HMMs • graph algorithms (e.g. clique) • Pevzner and Sze • what if motif only appears in a subset of genes? • count as parameter in MotifSampler, MEME

Overview of Our Approach 1. Definition of regulation patterns 2. Extraction of upstream sequences (for up-regulated genes) 3. Define control set (genes with no change) 4. Make a list of all 12-mers in upstream regions 5. Find motifs that occur (more than once) in up-regulated set, but not at all in control set 6. Group the motifs using clustering, form consensus of patterns

Define Regulation Patterns • measured at 0, 5, and 15 min after recovery • discrete representation of changes in expression levels • relative to exponential growth phase conditions • +1: >2-fold increase • -1: >2-fold decrease • 0: otherwise (no significant change) • up-regulation patterns: (0,1,1), (0,1,0), (0,0,1), (-1,1,1), (-1,1,0), (-1,0,1) • define control set: (0,0,0), (1,1,1), (-1,-1,-1)
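A minimal sketch of the discretization described on this slide. It assumes each gene comes with three expression ratios (time point divided by exponential-growth level) at 0, 5, and 15 min; the ratio representation and the names `discretize` and `classify` are illustrative assumptions, not part of the original slides.

```python
# Pattern sets copied from the slide.
UP_PATTERNS = {(0, 1, 1), (0, 1, 0), (0, 0, 1),
               (-1, 1, 1), (-1, 1, 0), (-1, 0, 1)}
CONTROL_PATTERNS = {(0, 0, 0), (1, 1, 1), (-1, -1, -1)}


def discretize(ratio, fold=2.0):
    """Map an expression ratio to +1 (>2-fold up), -1 (>2-fold down), or 0."""
    if ratio > fold:
        return 1
    if ratio < 1.0 / fold:
        return -1
    return 0


def classify(ratios, fold=2.0):
    """Return (pattern, label) where label is 'up', 'control', or 'other'.

    `ratios` is assumed to hold the 0, 5, and 15 min ratios in that order.
    """
    pattern = tuple(discretize(r, fold) for r in ratios)
    if pattern in UP_PATTERNS:
        return pattern, "up"
    if pattern in CONTROL_PATTERNS:
        return pattern, "control"
    return pattern, "other"
```

For example, `classify([0.4, 2.5, 1.0])` yields the pattern `(-1, 1, 0)`, which falls in the up-regulated set, while a gene with no significant change at any time point lands in the control set.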

Extraction of Upstream Sequences • nominally, 600 bp upstream of the translation start site (i.e. start of the ORF; not the transcription start) • if gene is a member of an operon: • take 300 bp upstream of the gene • plus 300 bp upstream of the translation start of the first gene in the operon • databases: • K12 sequence: GOLD • operon relationships: E. coli Linkage Map (Berlyn et al.) • use reverse complement if transcribed in the reverse direction
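The extraction rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the genome is a plain forward-strand string, `start` is the 0-based forward-strand index of the first base of the start codon (for a reverse-strand gene, the gene's highest forward coordinate), and the circularity of the chromosome is ignored by clamping at the ends. The function names and the choice to return the two operon segments separately are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream(genome, start, strand, length):
    """Return up to `length` bases upstream of a translation start.

    For '+' genes the region precedes `start` on the forward strand.
    For '-' genes the region follows `start` and is reverse-complemented
    so that it reads 5'->3' toward the gene.
    """
    if strand == "+":
        return genome[max(0, start - length):start]
    return reverse_complement(genome[start + 1:start + 1 + length])


def upstream_segments(genome, gene_start, strand, operon_start=None,
                      full=600, half=300):
    """Apply the slide's rule: 600 bp for a stand-alone gene; otherwise
    300 bp upstream of the gene plus 300 bp upstream of the first gene
    in its operon. Segments are returned separately so that later
    12-mer extraction does not span the junction (an assumption).
    """
    if operon_start is None or operon_start == gene_start:
        return [upstream(genome, gene_start, strand, full)]
    return [upstream(genome, gene_start, strand, half),
            upstream(genome, operon_start, strand, half)]
```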

Pre-processing • extract all 12-mers (overlapping) from upstream regions of up-regulated genes • note: better than DFS • remove those that appear in the control set • remove those that are dissimilar to everything else ("de-noising") • score = mean distance to all motifs not in the same upstream region or operon • remove if score > ~9/12 mismatches
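A small sketch of the pre-processing filter, assuming Hamming distance (number of mismatched positions) as the distance between 12-mers and assuming each upstream region carries an identifier that is shared by genes in the same operon. The function names, the dictionary input, and the all-pairs loop are illustrative choices; the all-pairs scoring is quadratic and only meant to show the logic.

```python
def kmers(seq, k=12):
    """Return the set of all overlapping k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_motifs(up_regions, control_regions, k=12, max_mean_dist=9.0):
    """Filter k-mers from up-regulated regions.

    up_regions: dict mapping a region/operon id to its upstream sequence.
    control_regions: iterable of upstream sequences from the control set.
    """
    # Step 1: collect every k-mer seen in the control set.
    control = set()
    for seq in control_regions:
        control |= kmers(seq, k)

    # Step 2: keep k-mers from up-regulated regions absent from the control set.
    occurrences = [(rid, w)
                   for rid, seq in up_regions.items()
                   for w in sorted(kmers(seq, k))
                   if w not in control]

    # Step 3: de-noise. Score each k-mer by its mean distance to k-mers
    # from *other* regions and drop it if the mean exceeds the cutoff.
    kept = set()
    for rid, w in occurrences:
        dists = [hamming(w, v) for rid2, v in occurrences if rid2 != rid]
        if dists and sum(dists) / len(dists) <= max_mean_dist:
            kept.add(w)
    return kept
```

The surviving set is what the clustering step would take as input.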

Clustering • compute similarity matrix among motifs • repeatedly merge closest neighbors • minimum spanning tree • single-linkage clustering • stop merging when dist > 3/12 mismatches • Form consensus: relax constraints on nucleotides at each position by disjunction • ACCATGGTATC • ACGATGGTATT • ACTATAGTATC • AC(CTG)AT(AG)GTAT(TC)
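The clustering and consensus steps can be sketched as below, again assuming Hamming distance. Single-linkage clustering stopped at a distance cutoff yields the connected components of the graph that links every pair of motifs within the cutoff, so this sketch uses a union-find structure rather than an explicit merge loop; that substitution, and the function names, are choices made here for brevity, not the authors' implementation.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(motifs, max_dist=3):
    """Group motifs so that any pair within `max_dist` mismatches ends up
    in the same cluster (single linkage cut at `max_dist`)."""
    parent = {m: m for m in motifs}

    def find(m):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in combinations(motifs, 2):
        if hamming(a, b) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for m in motifs:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


def consensus(cluster):
    """Build a consensus by disjunction: a position where members disagree
    becomes a parenthesized set of the observed nucleotides."""
    out = []
    for column in zip(*cluster):
        letters = sorted(set(column))
        out.append(letters[0] if len(letters) == 1
                   else "(" + "".join(letters) + ")")
    return "".join(out)


print(consensus(["ACCATGGTATC", "ACGATGGTATT", "ACTATAGTATC"]))
# prints AC(CGT)AT(AG)GTAT(CT)
```

The letters inside each parenthesized group are sorted alphabetically here, so the output differs from the slide's consensus only in the order of alternatives, which carries no meaning.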

Experiments • Starvation of E. coli for glucose in medium • 3 time-points: starved (0min), 5min, 15min • Data collected in Siegele lab • up-regulated: 22 genes • control set: 1361 genes

Motifs Found

Sequence Logos

Distance to Transcription Start

Other Forms of Validation • Palindromicity: 11/13 motifs have index>0.5 • TRANSFAC database: • e.g. motif 2 matches pattern for MetJ-MetF site • a number of other hits for known transcription factors • biological verification awaits... • role in regulation pathway for starvation response?

Conclusions • Augment cluster-analysis of expression patterns with motif analysis • Efficient method for generating candidates • from 12-mers in upstream regions • Efficient method for screening them • empirically, against a control set, rather than probabilistic background model • Advantage: Pattern does not have to be in all the genes in a set • Challenges: defining appropriate upstream regions and the right control set (as filter)
