Sequence features of DNA binding sites reveal structural class of associated transcription factor

Sequence features of DNA binding sites reveal structural class of associated transcription factor
Narlikar L and Hartemink AJ. Bioinformatics. 2006 Jan 15;22(2):157-63.
Carol Sniegoski

The Central Dogma of Molecular Biology
• DNA: double-stranded chain of nucleotide bases (A-T, C-G)
• RNA: single-stranded chain of nucleotide bases (A, U, C, G)
• Protein: polypeptide chain

DNA Basics
• Two chains form a double helix
• Chains have orientation
  • 5’ end is “upstream”; 3’ end is “downstream”
• Sugar-phosphate backbone provides framework for bases (A, C, G, T)
• Hydrogen bonds between complementary base pairs hold chains together
  • A pairs with T, C pairs with G

Protein Basics
• Proteins are folded-up polypeptide strings
• Sequence determines form; form determines function
• Function is focused at key domains (active sites, binding sites)
• Predicting form from sequence is an unsolved problem
  • Experimental methods: NMR; X-ray crystallography
  • Computational methods: predicting de novo; predicting based on sequence similarity to other known proteins
[Figure labels: ball-and-stick model, space-filling model, cartoon model]

Protein Structure
• Primary protein structure: the order of amino acids
• Secondary protein structure: common repeating structures, often formed by hydrogen bonds
• Tertiary protein structure: the full 3-dimensional folded structure
• Quaternary protein structure: proteins composed of multiple polypeptide chains

Protein Domains
• Structural domains
  • Elements of tertiary structure
  • May be composed of one or more motifs (secondary structure)
• Many domains appear in a variety of protein families
• Domains are important to a protein’s biological function

Proteins Do (Almost) Everything

Gene Expression Control Points
• Activating the gene structure
• Initiating transcription of mRNA from DNA
• Processing the mRNA transcript
• Transporting the processed transcript from nucleus to cytoplasm
• Translating mRNA into protein
• Controlling mRNA degradation

Components Needed for Transcription
• RNA polymerase (RNAP)
  • Enzyme that transcribes DNA into RNA.
• DNA
  • Accessible DNA sequence to be transcribed (gene).
  • Various cis-acting DNA regulatory sequences located near the sequence to be transcribed.
  • (Cis-acting = part of the DNA sequence; affects one copy of a gene.)
  • The regulatory sequences serve as binding sites recognized by transcription factors.
• Transcription factors (TFs)
  • Set of trans-acting accessory proteins required to initiate transcription.
  • (Trans-acting = freely diffusible; affects both copies of a gene.)
  • TFs have binding domains that recognize and bind to specific DNA sequences.

RNA Polymerase
• The RNA polymerase protein transcribes DNA into RNA.
• It is not responsible for knowing when or where to start transcription.
[Figure labels: RNA polymerase, new RNA transcript, DNA double helix]

DNA Regulatory Sequences
• Characteristic regulatory sequences in DNA are bound by specific transcription factors.
• Complexes of bound factors both locate and promote gene transcription.
• Promoter regions are usually located within 200 bp upstream of the transcription startpoint.
  • Initiator (Inr): consensus sequence “YYAN(T/A)YY”, within 5 bp of the startpoint
  • TATA box: consensus sequence “TATAAAA”, 25 bp upstream of the startpoint
  • GC box: consensus sequence “GGGCGG”
  • CAAT box: consensus sequence “CCAAT”
• Enhancer regions (not shown) are located farther upstream or downstream.

DNA Regulatory Sequences
• Modular
• Specific to a gene or a set of genes
• Specific to a condition or range of conditions
• Support complex control of gene transcription
[Figure: example DNA sequences for several genes, drawn upstream to downstream around the transcription startpoint, marking the TATA box, CAAT box, octamer motif, and GC box]

Transcription Factors
• Any factor that is needed for the initiation of transcription but is not part of RNA polymerase
• Three operationally defined classes of transcription factors:
  • General factors
    • Form an initiation complex with RNA polymerase around the transcription startpoint
    • Always required for initiation of transcription
    • Unregulated
  • Upstream factors
    • Bind to specific DNA consensus sequences (promoters and enhancers) upstream of the startpoint
    • Required for adequately efficient initiation of transcription
    • Unregulated
  • Inducible factors
    • Operate like upstream factors
    • Highly regulated
    • Responsible for controlling transcription patterns in time and space

Activating Inducible TFs (1)

Activating Inducible TFs (2)

Transcription Factors
• Transcription factors bind to DNA and to each other to form complexes that initiate transcription.
• Example assembly sequence:
  1. TFIIIA binds to a site within the promoter region.
  2. TFIIIC binds to form a stable complex.
  3. TFIIIB (with 3 subunits) now binds to its binding site near the startpoint of transcription.
  4. Finally, RNA polymerase binds and begins transcribing the gene.

Transcription Factors
• Even factors bound to remote enhancers can contribute to the initiation complex.
[Figure labels: Enhancer, Gene, Basal transcription complex, Enhancer-bound complex]

Binding Site Specificity
• Many TFs’ DNA-binding domains use similar types of mechanisms.
• Binding domain structures can be grouped into classes.
• Each class binds particular sets of DNA sequences (binding sites).
• Binding sites are usually somewhat degenerate (variable).
• Two common models for characterizing binding sites:
  • Regular expressions
    • Construct a regular expression that matches only the sequences at known binding sites.
    • Can match variable-length sequences.
    • Does not provide information about probability or binding affinity.
  • PSSM (Position-Specific Scoring Matrix)
    • Next slide.

PSSM: Position-Specific Scoring Matrix
• Align known binding sites for the TF, all of length n.
• Create a 4 × n matrix showing the number of times each base appears at each position.
• To determine the TF’s binding affinity for sequence S, calculate $\log \frac{P(S \mid M)}{P(S \mid B)}$, where $P(S \mid M)$ is the probability of seeing S in the motif and $P(S \mid B)$ is the probability of seeing S outside the motif.

PSSM matrix built from an alignment of 12 binding sites of length 10 bp for yeast TF Pho4p (each column sums to 12):

Position   1   2   3   4   5   6   7   8   9  10
A          3   2   0  12   0   0   0   0   1   3
C          5   2  12   0  12   0   1   0   2   1
G          3   7   0   0   0  12   0   7   5   4
T          1   1   0   0   0   0  11   5   4   4
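The scoring step can be sketched in a few lines of Python using the Pho4p counts from the slide. This is only an illustration: the pseudocount of 1, the uniform background of 0.25 per base, and the function names `log_odds_score` and `consensus` are assumptions, not details given in the slides or the paper.

```python
import math

# Count matrix from the slide: 12 aligned Pho4p binding sites, 10 positions.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}
BASES = "ACGT"


def log_odds_score(seq, counts=COUNTS, pseudocount=1.0, background=0.25):
    """Return log(P(S|M) / P(S|B)) for a sequence as long as the motif."""
    width = len(counts["A"])
    if len(seq) != width:
        raise ValueError(f"sequence must have length {width}")
    score = 0.0
    for pos, base in enumerate(seq):
        column_total = sum(counts[b][pos] for b in BASES)
        # Pseudocounts (an assumption) keep unseen bases from giving log(0).
        p_motif = (counts[base][pos] + pseudocount) / (column_total + 4 * pseudocount)
        # Uniform background (an assumption): every base has probability 0.25.
        score += math.log(p_motif / background)
    return score


def consensus(counts=COUNTS):
    """Most frequent base at each position (ties broken in A, C, G, T order)."""
    width = len(counts["A"])
    return "".join(max(BASES, key=lambda b: counts[b][pos]) for pos in range(width))
```

With these counts, `consensus()` contains the core CACGTG at positions 3 to 8, and any sequence that matches the frequent bases scores higher than one that does not.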

The Experiment
• Goal: Predict the type of DNA-binding domain that a TF has based on features of the DNA sequences to which it binds.
• Data: Encoded data about TFs’ classes and the sequences to which they bind, as taken from the TRANSFAC database.

TRANSFAC Database
TRANSFAC® is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human. It started 1988 with a printed compilation and was transferred into computer-readable format in 1990.

The FACTOR table contains 6133 entries in 50 classes, but this figure does not reflect the number of independent transcription factors. Homologous factors from different species such as human and mouse SRF are given different entries since they may differ in some molecular aspects. Factors originally described by different research groups as binding to different genes may turn out identical when cloned. Also, more factors are recognized as representatives of whole TF families that are products of distinct but similar genes or alternative splice products. We have in general not entered proteins just because of the presence of a putative DNA-binding motif. Thus there are many more zinc finger or homeo domain proteins known than are included in FACTOR, but for many no data about DNA-binding specificity or other gene regulatory features are available.

The SITE table gives information on individual (putatively) regulatory protein binding sites. It contains 7915 entries. 6360 of them refer to sites within 1504 eukaryotic genes. 1295 are artificial sequences. 260 have consensus binding sequences given in the IUPAC code.

TRANSFAC Classes
1 Superclass: Basic Domains
  *1.1 Class: Leucine zipper factors (bZIP). (IV)
  *1.2 Class: Helix-loop-helix factors (bHLH). (III)
  1.3 Class: Helix-loop-helix / leucine zipper factors (bHLH-ZIP).
  1.4 Class: NF-1
  1.5 Class: RF-X
  1.6 Class: bHSH
2 Superclass: Zinc-coordinating DNA-binding domains
  *2.1 Class: Cys4 zinc finger of nuclear receptor type. (II)
  2.2 Class: diverse Cys4 zinc fingers.
  *2.3 Class: Cys2His2 zinc finger domain. (I)
  2.4 Class: Cys6 cysteine-zinc cluster.
  2.5 Class: Zinc fingers of alternating composition
3 Superclass: Helix-turn-helix
  *3.1 Class: Homeo domain. (VI)
  3.2 Class: Paired box.
  *3.3 Class: Fork head / winged helix. (V)
  3.4 Class: Heat shock factors
  3.5 Class: Tryptophan clusters.
  3.6 Class: TEA domain.
4 Superclass: beta-Scaffold Factors with Minor Groove Contacts
  4.1 Class: RHR (Rel homology region).
  4.2 Class: STAT
  4.3 Class: p53
  4.4 Class: MADS box.
  4.5 Class: beta-Barrel alpha-helix transcription factors
  4.6 Class: TATA-binding proteins
etc.

TRANSFAC Class Hierarchy
Transcription Factor Classification
Last modified 2002-10-01
1 Superclass: Basic Domains
  1.1 Class: Leucine zipper factors (bZIP).
    1.1.1 Family: AP-1(-like) components
      1.1.1.1 Subfamily: Jun
        1.1.1.1.1 XBP-1 (human).
        1.1.1.1.2 v-Jun (ASV).
        1.1.1.1.3 c-Jun (mouse); c-Jun (rat); c-Jun (human); c-Jun (chick).
        1.1.1.1.4 JunB (mouse).
        1.1.1.1.5 JunD (mouse).
        1.1.1.1.6 dJRA
      1.1.1.2 Subfamily: Fos
        1.1.1.2.1 v-Fos (FBR MuLV); v-Fos (FBJ MuLV); v-Fos (NK24).
        1.1.1.2.2 c-Fos (mouse); c-Fos (human); c-Fos (rat); c-Fos (chick).
        1.1.1.2.3 FosB (mouse).
          1.1.1.2.3.1 FosB1
          1.1.1.2.3.2 FosB2
        1.1.1.2.4 Fra-1 (mouse); Fra-1 (rat).
        1.1.1.2.5 Fra-2 (chick); Fra-2 (human).
etc.

TRANSFAC Factors
Drilldown on 1.1 Class: Leucine zipper factors (bZIP) lists factors in the class:

CL basic region + leucine zipper; 1.1.
CC A DNA-binding basic region is followed by a leucine zipper. The leucine zipper consists of repeated leucine residues at every seventh position and mediates protein dimerization as a prerequisite for DNA-binding. The leucines are directed towards one side of an alpha-helix. The leucine side chains of two polypeptides are thought to interdigitate upon dimerization (knobs-into-holes model). The leucine zipper dictates dimerization specificity. Upon DNA-binding of the dimer, the basic regions adopt alpha-helical conformation as well. Possibly, a sharp angulation point separates two alpha-helices of the subregions A and B leading to the scissors grip model for the bZIP-DNA complex. The DNA is contacted through the major groove over a whole turn.
BF T03820 ABF1; Species: thale cress, Arabidopsis thaliana.
BF T03823 ABF2; Species: thale cress, Arabidopsis thaliana.
BF T03824 ABF3; Species: thale cress, Arabidopsis thaliana.
BF T03825 ABF4; Species: thale cress, Arabidopsis thaliana.
BF T04543 ABI5; Species: thale cress, Arabidopsis thaliana.
BF T04565 ACA1; Species: yeast, Saccharomyces cerevisiae.
BF T00027 AP-1; Species: clawed frog, Xenopus.
BF T00029 AP-1; Species: human, Homo sapiens.
BF T00030 AP-1; Species: monkey, Cercopithecus aethiops.
BF T00031 AP-1; Species: rat, Rattus norvegicus.
BF T00032 AP-1; Species: mouse, Mus musculus.
BF T03199 ARR1; Species: yeast, Saccharomyces cerevisiae.
BF T02783 ATB-2; Species: thale cress, Arabidopsis thaliana.
etc.

TRANSFAC Sites
Drilldown on factor ABF1 lists the sequences to which it binds:

SQ GGACGCGTGGC.
SQ TGTCGTGGGGACACGTGGCATACGAGGC.
SQ TGTCGGGGACACGTGGCGCTAACGAGGC.
SQ TGTCGGGACACGTGGCGCAACACGAGGC.
SQ TGTCGGGACACGTGGCCCACCCGGAGGC.
SQ TGTCGGGACACGTGGCACAAATAGAGGC.
SQ TGTCGTCAATGGACACGTGGCTAGAGGC.
SQ TGTCGTCGGACACGTGGCACGAAGAGGC.
SQ GCCTCGACAGGACACGTGGCACGCGACA.
SQ TGTCGATCAATGGACACGTGGCAGAGGC.
SQ GCCTCGGTGACACGTGGCTTGACCGACA.
SQ TGTCGGAAGTGGTGACACGTGGCGAGGC.
etc.

Feature Encoding (1)
• Encode each TF as a 1390-length feature vector.
• Don’t worry about too many features; the classifier will identify the important ones.
• 1387 of the features are the element-wise arithmetic mean of the feature vectors of the sequences the TF binds.
• Add 3 extra binary features indicating whether the TF is plant, animal, or fungus.
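The averaging step can be illustrated with a small pure-Python sketch. The function name `encode_tf`, the placement of the 3 kingdom bits at the end of the vector, and their plant/animal/fungus order are assumptions made for the example.

```python
def encode_tf(site_vectors, kingdom):
    """Average per-site vectors and append plant/animal/fungus indicator bits."""
    if not site_vectors:
        raise ValueError("need at least one binding-site vector")
    n_sites = len(site_vectors)
    width = len(site_vectors[0])
    # Element-wise arithmetic mean over all binding sites of this TF.
    mean = [sum(vec[i] for vec in site_vectors) / n_sites for i in range(width)]
    kingdoms = ("plant", "animal", "fungus")  # order is an assumption
    if kingdom not in kingdoms:
        raise ValueError(f"unknown kingdom: {kingdom}")
    return mean + [1.0 if kingdom == k else 0.0 for k in kingdoms]
```

Given 1387-length site vectors, the result has 1387 + 3 = 1390 entries, matching the vector length on the slide.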

Feature Encoding (2)
• Encode each binding site as a 1387-length feature vector.
• 1364 integer features encoding subsequence frequency for subsequences up to length 5:
  • $4^1 = 4$ features for subsequences of length 1 (A, T, C, G)
  • $4^2 = 16$ for subsequences of length 2 (AA, AT, AC, AG, TA, TT, TC, TG, …)
  • $4^3 = 64$ for subsequences of length 3
  • $4^4 = 256$ for subsequences of length 4
  • $4^5 = 1024$ for subsequences of length 5
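A minimal sketch of the subsequence-count features, assuming that overlapping occurrences are counted and that sites contain only A, C, G, and T (any other character raises a `KeyError`). The function name `kmer_features` and the dictionary layout are illustrative and not taken from the paper.

```python
from itertools import product


def kmer_features(seq, max_k=5):
    """Count every DNA subsequence of length 1..max_k in seq (overlaps included)."""
    features = {}
    for k in range(1, max_k + 1):
        # Create one zero-initialised feature per possible k-mer: 4**k of them.
        for letters in product("ACGT", repeat=k):
            features["".join(letters)] = 0
        # Slide a window of width k across the site and count what is seen.
        for start in range(len(seq) - k + 1):
            features[seq[start:start + k]] += 1
    return features
```

For the ABF1 site GGACGCGTGGC this gives G = 6, C = 3, A = 1, T = 1, and seven distinct dinucleotides (GG, GA, AC, CG, GC, GT, TG), three of which occur twice.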

Feature Encoding (3)
• 8 binary features encoding the presence or absence of an ungapped palindrome of half-length 3, 4, 5, or 6, either spanning the whole sequence or not.
  • A palindromic sequence is equal to its complementary sequence read backwards.
  • A and T, C and G are complementary bases.
• 1 for a palindrome of half-length 3, spanning (e.g., ACG CGT)
• 1 for a palindrome of half-length 3, not spanning (e.g., … ACG CGT …)
• 1 for a palindrome of half-length 4, spanning (e.g., ACGC GCGT)
• 1 for a palindrome of half-length 4, not spanning (e.g., … ACGC GCGT …)
• etc.
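The ungapped palindrome test amounts to comparing a window with its reverse complement. The sketch below assumes that "spanning" means the palindrome is the entire site and "not spanning" means it occurs as a proper substring; that reading, and the names `reverse_complement` and `palindrome_features`, are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_features(seq, half_lengths=(3, 4, 5, 6)):
    """Return 8 binary features keyed by (half_length, 'spanning' | 'not spanning')."""
    features = {}
    for h in half_lengths:
        width = 2 * h
        windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
        # A window is palindromic if it equals its own reverse complement.
        hits = [w for w in windows if w == reverse_complement(w)]
        spanning = len(seq) == width and bool(hits)
        features[(h, "spanning")] = int(spanning)
        features[(h, "not spanning")] = int(bool(hits) and not spanning)
    return features
```

Under this reading, the site GGACGCGTGGC sets exactly one of the 8 features: it contains ACGCGT, a palindrome of half-length 3 that does not span the site.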

Feature Encoding (4)
• 8 binary features encoding the presence or absence of a gapped palindrome of half-length 3, 4, 5, or 6, either spanning the whole sequence or not.
  • A gapped palindrome is a palindrome with a non-palindromic insertion in the exact middle.
• 1 for a gapped palindrome of half-length 3, spanning (e.g., ACG … CGT)
• 1 for a gapped palindrome of half-length 3, not spanning (e.g., … ACG … CGT …)
• 1 for a gapped palindrome of half-length 4, spanning (e.g., ACGC … GCGT)
• 1 for a gapped palindrome of half-length 4, not spanning (e.g., … ACGC … GCGT …)
• etc.

Feature Encoding (5)
• 7 binary features encoding the presence or absence of a special sequence identified in the literature as over-represented in the binding sites of certain classes of TF.

Sequence              Class
G..G                  Cys2His2 (I)
G..G..G               Cys2His2 (I)
[GC]..[GC]..[GC]      Cys2His2 (I)
AGGTCA|TGACCT         Cys4 (II)
CA..TG                bHLH (III)
TGA.*TCA              bZip (IV)
TAAT|ATTA             Homeodomain (VI)

Regular expression representation:
.     Any single character.
[]    Any single character inside the brackets.
|     Either the expression preceding or the expression following.
*     Zero or more of the preceding expression.
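Because the special sequences are already written as regular expressions, they map directly onto Python's `re` module. The sketch below assumes a feature is 1 whenever the pattern matches anywhere in the site; the names `SPECIAL_PATTERNS` and `special_features` are illustrative.

```python
import re

# (pattern, class) pairs copied from the slide, in the same order.
SPECIAL_PATTERNS = [
    ("G..G", "Cys2His2 (I)"),
    ("G..G..G", "Cys2His2 (I)"),
    ("[GC]..[GC]..[GC]", "Cys2His2 (I)"),
    ("AGGTCA|TGACCT", "Cys4 (II)"),
    ("CA..TG", "bHLH (III)"),
    ("TGA.*TCA", "bZip (IV)"),
    ("TAAT|ATTA", "Homeodomain (VI)"),
]


def special_features(seq):
    """Return 7 binary features: 1 if the pattern occurs anywhere in seq."""
    return [int(re.search(pattern, seq) is not None) for pattern, _ in SPECIAL_PATTERNS]
```

For example, `special_features("CACGTG")` sets the bHLH feature, since CA..TG matches the whole site.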

Encoding Example
Encode sequence GGACGCGTGGC.
• Length 1 subsequence: A = 1, C = 3, G = 6, T = 1
• Length 2 subsequence: 7 features = 1 or 2; 9 features = 0
• Length 3 subsequence: 9 features = 1; 55 features = 0
• Length 4 subsequence: 8 features = 1; 248 features = 0
• Length 5 subsequence: 7 features = 1; 1017 features = 0
• Palindromes: 1 feature = 1; 7 features = 0
• Gapped palindromes: 8 features = 0
• Special sequences: ?
At least 1344 of the 1387 features for this binding sequence are zero-valued.

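Dataset
• Feature matrix X: d = 1390 rows, one for each feature; n = 587 columns, one for each TF.

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,587} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,587} \\ \vdots & \vdots & & \vdots \\ x_{1390,1} & x_{1390,2} & \cdots & x_{1390,587} \end{bmatrix}$$

• Class labels Y: 1-of-m class encoding with m = 6 rows, so column j holds $y_{1,j}, \dots, y_{6,j}$ for TF j.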

SMLR Algorithm: Sparse Multinomial Logistic Regression
• Learns a multi-class classifier
• Simultaneously performs feature selection
• Reports the probabilities of a sample belonging to each of the m classes, given m sets of feature weights, one for each class.

Linear Regression
• Model/predict a dependent variable as a linear function of independent variables:

$$y_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in} + \varepsilon_i$$

• Find the best-fit line (i.e., estimate the coefficients $b_1, \dots, b_n$) by minimizing the sum of the squares of the vertical deviations from each data point to the line:

$$R^2 = \sum_i \left[ y_i - f(x_i;\, b_1, b_2, \dots, b_n) \right]^2$$

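Logistic Regression
• Used when the dependent variable y is binary.
• The logit function of p is expressed as a linear combination of the $x_k$:

$$\mathrm{logit}(p) = \log \frac{p}{1-p} = w_0 + w_1 x_1 + \dots + w_d x_d = w^T x$$

$$p = P(y = 1 \mid x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}}$$

• p is the probability that x belongs to class y = 1, given x and w.
• $w = [\, w_0 \; w_1 \; \dots \; w_d \,]^T$ is a single weight vector: one weight per feature plus the intercept $w_0$.
• $x = [\, x_0 \; x_1 \; \dots \; x_d \,]^T$ holds the d feature values for one sample, with $x_0 = 1$ so that $w_0$ acts as the intercept.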
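For reference, the expression for p follows from the logit definition by solving for p. The steps below are a standard algebraic rearrangement, using the same symbols as the slide:

```latex
\begin{align*}
\log \frac{p}{1-p} &= w^T x \\
\frac{p}{1-p} &= e^{w^T x} \\
p &= (1 - p)\, e^{w^T x} \\
p \left( 1 + e^{w^T x} \right) &= e^{w^T x} \\
p &= \frac{e^{w^T x}}{1 + e^{w^T x}}
\end{align*}
```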

Multinomial Logistic Regression
• Generalization of logistic regression. Used when the dependent variable y is multiclass.

$$P(y^{(i)} = 1 \mid x, w) = \frac{e^{w^{(i)T} x}}{\sum_{j=1}^{m} e^{w^{(j)T} x}}$$

• This is the probability that x belongs to the class encoded by $y^{(i)} = 1$, given w.
• $w = [\, w^{(1)T} \; w^{(2)T} \; \dots \; w^{(m)T} \,]^T$: weight vectors of length d for each of m classes
• $x = [\, x_0 \; x_1 \; \dots \; x_d \,]^T$: d feature values for one sample
• $y = [\, y^{(1)} \; y^{(2)} \; \dots \; y^{(m)} \,]^T$: one-of-m class encoding
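The class-probability formula is the softmax function applied to the m scores $w^{(i)T} x$. A minimal pure-Python sketch follows; the function name `class_probabilities` is illustrative, and subtracting the largest score before exponentiating is a common numerical safeguard rather than something stated in the slides.

```python
import math


def class_probabilities(weights, x):
    """weights: list of m weight vectors; x: one feature vector. Returns m probabilities."""
    # One linear score per class: w^(i)T x.
    scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
    # Subtracting the max avoids overflow and leaves the ratios unchanged.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With m = 2 and the second weight vector fixed at zero, the first probability reduces to the binary logistic form $e^{w^T x} / (1 + e^{w^T x})$.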

Estimating w
• In logistic regression, w is usually estimated using maximum likelihood (ML).
• Want to find w that maximizes the probability of classifying samples correctly.
• $P(y_j \mid x_j, w)$ = probability of classifying sample $x_j$ correctly, given the values of w.

$$
\begin{aligned}
\text{log-likelihood } l(w) &= \sum_{j=1}^{n} \log P(y_j \mid x_j, w) \\
&= \sum_{j=1}^{n} \log \frac{e^{w_j^T x_j}}{\sum_{i=1}^{m} e^{w^{(i)T} x_j}} \\
&= \sum_{j=1}^{n} \left[ w_j^T x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right] \\
&= \sum_{j=1}^{n} \left[ \sum_{i=1}^{m} y_j^{(i)} w^{(i)T} x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right]
\end{aligned}
$$

• $w_j$ indicates the weight vector for the class to which $x_j$ belongs.
• $y_j^{(i)}$ is 1 only when $x_j$ is in class i, and 0 otherwise.

Estimating a Sparse w
• We want w to be sparse, with many zero values, deselecting many features.
• Use the maximum a posteriori (MAP) method: penalize the ML estimate by placing a prior p(w) on the parameters w.
• Choose a prior distribution that induces sparsity: the Laplace distribution.

$$\hat{w}_{MAP} = \arg\max_w L(w) = \arg\max_w \left( l(w) + \log p(w) \right)$$

• $l(w)$: sum of log-likelihoods of the $x_j$ being classified correctly, given $x_j$ and w
• $\log p(w)$: log of the probability that w comes from a Laplace distribution

Laplace Distribution

$$p(x) = \frac{1}{2b}\, e^{-|x - \mu| / b}$$

$$p(w) \propto e^{-\lambda \|w\|_1} = e^{-\lambda \sum_j |w_j|}$$

• Remember ln p(w) is the MAP penalty function.
  • Larger $|w_j|$ → smaller p(w) → more negative ln p(w)
  • Smaller $|w_j|$ → larger p(w) → less negative ln p(w)
  • Up to the normalizing constant, ln p(w) is at its maximum, ln p(w) = 0, when every $w_j = 0$: $p(w) = e^{-\lambda \sum_j |w_j|} = e^0 = 1$
• The λ parameter needs to be set appropriately.
  • Larger λ → greater sparsity, fewer features selected.
  • Authors chose λ = 1 using cross-validation.
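Putting the log-likelihood and the Laplace prior together, the quantity being maximized is $l(w) - \lambda \sum_j |w_j|$ up to an additive constant. The sketch below only evaluates that objective and does not optimize it. The function names, the use of integer class indices in place of the one-of-m encoding, and the choice to penalize every weight (including any intercept) are assumptions.

```python
import math


def log_likelihood(weights, samples, labels):
    """Sum over samples of log P(correct class | x, w) under the softmax model."""
    total = 0.0
    for x, label in zip(samples, labels):
        scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
        # log-sum-exp, shifted by the max score for numerical stability.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[label] - log_norm
    return total


def map_objective(weights, samples, labels, lam=1.0):
    """l(w) + log p(w) with a Laplace prior, dropping the constant term."""
    # The L1 penalty covers every weight here, which is an assumption.
    penalty = lam * sum(abs(w_k) for w in weights for w_k in w)
    return log_likelihood(weights, samples, labels) - penalty
```

With all weights at zero the penalty vanishes and each sample contributes log(1/m), which gives a convenient sanity check for an implementation.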

Results
• 77 TFs misclassified during leave-one-out cross-validation (LOOCV), for 87% accuracy.
• 20% accuracy during LOOCV after permuting class labels (28% accuracy expected).
• Per-class breakdown, (% error)(# TFs) = # TFs misclassified:

Error rate   # TFs   # TFs misclassified
0.23         97      22.31
0.09         97      8.73
0.11         61      6.71
0.08         165     13.20
0.17         52      8.84
0.15         115     17.25
--------------------------------------
0.13         587     77.04 (total)
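The totals in the per-class breakdown can be rechecked with a few lines of arithmetic. The rows are copied from the slide; the names `ROWS` and `summarize` are illustrative.

```python
# (error rate, number of TFs) for each of the six classes, as listed on the slide.
ROWS = [(0.23, 97), (0.09, 97), (0.11, 61), (0.08, 165), (0.17, 52), (0.15, 115)]


def summarize(rows=ROWS):
    """Return (total TFs, expected misclassifications, overall accuracy)."""
    total_tfs = sum(n for _, n in rows)
    misclassified = sum(rate * n for rate, n in rows)
    return total_tfs, misclassified, 1 - misclassified / total_tfs
```

The class sizes sum to 587 and the misclassified counts to about 77, which is roughly 13% error, or 87% accuracy. The 28% figure quoted for permuted labels coincides with the share of the largest class (165 of 587), which may be where it comes from.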

Results
• Analyzed feature selection consistency across LOOCV trials.
• Most features were selected either very infrequently (1047 features were selected in < 10% of trials) or very frequently (290 features were selected in > 90% of trials).
• This leaves 53 features selected inconsistently.

Results
• Used trained classifier to predict TF class based on experimentally determined binding site motifs.
  • Used 14 TFs in TRANSFAC but not in training set.
  • TF binding sites were experimentally determined.
  • Motifs were extracted from the binding sites using PSSM.
  • Other potential binding sites with the same motifs were located using PSSM methods.
  • These binding sites formed the input data.
• Class was predicted correctly for 12 of 14 TFs.

Conclusions
• The authors have developed a multiclass classifier that assigns TFs to DNA-binding domain classes based on features in their binding site sequences.
• They argue that this capability demonstrates that DNA binding sites contain significant predictive information about TFs’ binding mechanisms.
• They note that their classifier consistently selects certain features and argue for their biological plausibility.
  • Nearly 1/3 of features are predictors of Class I, zinc finger proteins with poor sequence specificity.
  • Palindromic features are predictors of Class II, zinc finger proteins that form dimers.
• They argue that their method has implications for how TF binding sites should be modeled.
  • Regular expression models are not probabilistic.
  • PSSM models are fixed-length (they cannot match variable-length sites).
• They note that their classifier might be useful to biologists.
  • Help to engineer proteins that bind to specific DNA sequences
  • Predict which class of TF binds to sites found using conventional motif-finding algorithms

Cell-Signaling Pathways
