Paper by: Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie

Journal report: High Resolution Models of Transcription Factor-DNA Affinities Improve In Vitro and In Vivo Binding Predictions Paper by: Phaedra Agius, Aaron Arvey, William Chang, William Stafford Noble, Christina Leslie Memorial Sloan-Kettering Cancer Center, NY Presented by Yaron Orenstein for ACGT group meeting, 19 January 2011

Introduction – Biological Background • Gene regulatory programs are orchestrated by transcription factors (TFs). • These proteins usually bind to binding sites (BSs) in the promoter region and enable or impede transcription of the gene. • Accurately modeling the DNA sequence preferences of TFs is a key piece in unraveling the regulatory code.

Modeling BSs: PSSM model • The most popular model to represent binding sites is the PSSM: position specific scoring matrix. • These motifs may match thousands of sites in intergenic regions, producing an unreliable list of potential TF target genes.
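
As a rough illustration of how a PSSM scores candidate sites, the sketch below builds log-odds scores from a toy count matrix and scans a sequence for the best window. The counts, the uniform background and the pseudocount are all illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical counts of A/C/G/T at each of 4 motif positions (toy data).
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumed uniform background frequency
PSEUDOCOUNT = 1.0   # assumed smoothing constant


def build_pssm(counts):
    """Convert per-position counts into log2-odds scores."""
    pssm = []
    for column in counts:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pssm.append({
            base: math.log2((column[base] + PSEUDOCOUNT) / total / BACKGROUND)
            for base in "ACGT"
        })
    return pssm


def best_site(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in `sequence`."""
    width = len(pssm)
    scored = [
        (sum(pssm[j][sequence[i + j]] for j in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scored)
```

Because every position contributes independently to the score, many genomic windows can score well by chance, which is the source of the unreliable target lists mentioned above.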

All possible 8-mers model • This model contains a list of all possible 8-mers ranked by the TF preference. • This information can be obtained, for example, from PBM data by calculating an enrichment score for each 8-mer. • The disadvantages are its large size and lack of interpretability. In addition, the sequence similarities between 8-mers are not considered.

Protein Binding Microarray data • A PBM array contains ~41,000 probe sequences of length 35 bp each, covering all possible DNA 10-mers. • For each probe the binding intensity is reported.

Support vector regression • Motivation: predict real values based on a feature set. • Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, find a function f which best predicts y. • For example, if f is linear, then $f(x) = \langle w, x \rangle + b$, where w is the vector of feature weights. • $\tfrac{1}{2}\lVert w \rVert^{2}$ is minimized under some error constraints.
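
For reference, the textbook ε-insensitive form of linear SVR can be written as below. The tolerance $\varepsilon$ and the hard-constraint form are the standard formulation and are assumptions here; the slide does not state which variant the paper used.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad\text{subject to}\quad
\lvert y_i - \langle w, x_i \rangle - b \rvert \le \varepsilon,
\qquad i = 1, \dots, n
```

In words: among all linear functions that fit every training value to within $\varepsilon$, prefer the one with the smallest weight norm.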

Example for SVR • A simple way to predict binding intensity from PBM data based on 8-mer features. • Use indicator features for each 8-mer: • 1 if sequence x contains the 8-mer. • 0 if it does not.
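
The indicator features described above can be sketched as follows. The sparse index representation and the `linear_predict` helper are illustrative choices, not part of the paper; the weights would come from a trained SVR.

```python
from itertools import product

K = 8
# Index every possible 8-mer over the DNA alphabet (4**8 = 65,536 features).
KMER_INDEX = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}


def indicator_features(sequence):
    """Return the sorted indices of the 8-mers that occur in `sequence`.

    This is a sparse encoding: listed indices have value 1, all others 0.
    """
    present = {
        KMER_INDEX[sequence[i:i + K]]
        for i in range(len(sequence) - K + 1)
    }
    return sorted(present)


def linear_predict(weights, bias, sequence):
    """Evaluate f(x) = <w, x> + b on the indicator vector of `sequence`.

    `weights` maps feature index -> weight; missing indices count as 0.
    """
    return bias + sum(weights.get(j, 0.0) for j in indicator_features(sequence))
```

For example, a 10-base probe such as `"ACGTACGTAC"` contains three distinct 8-mers, so exactly three features are switched on.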

An overview

Methods • They developed a training strategy for the SVR model that involves three key components: • The choice of kernel. • The sampling procedure for selecting the most informative training sequences. • The feature selection method.

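The di-mismatch kernel • Let $\Phi = \{\phi_1, \dots, \phi_p\}$ be the set of unique k-mers that occur in the set of training sequences. • Define the set of substrings of length k in s (of length N): $\{s_1, \dots, s_{N-k+1}\}$. • Then s is represented by the feature vector: $\left(\sum_{i} c(\phi_1, s_i), \dots, \sum_{i} c(\phi_p, s_i)\right)$. • And $c(\phi_j, s_i)$ counts the number of matching dinucleotides between $\phi_j$ and $s_i$.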
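
A minimal sketch of the feature map described above: each feature k-mer is compared with every length-k window of the sequence, and the matching-dinucleotide counts are summed. The optional `max_mismatches` cutoff, which zeroes out windows with too many mismatching dinucleotides, is an assumption about how a mismatch parameter might enter; the function names are hypothetical.

```python
def matching_dinucleotides(a, b):
    """Count positions i where the dinucleotides a[i:i+2] and b[i:i+2] agree."""
    assert len(a) == len(b)
    return sum(a[i:i + 2] == b[i:i + 2] for i in range(len(a) - 1))


def di_mismatch_features(sequence, kmers, max_mismatches=None):
    """Map `sequence` to one summed di-mismatch score per feature k-mer.

    If `max_mismatches` is given, windows with more mismatching
    dinucleotides than that contribute zero (assumed sparsity cutoff).
    """
    k = len(kmers[0])
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    min_matches = 0 if max_mismatches is None else (k - 1) - max_mismatches
    features = []
    for kmer in kmers:
        total = 0
        for window in windows:
            count = matching_dinucleotides(kmer, window)
            if count >= min_matches:
                total += count
        features.append(total)
    return features
```

Unlike a plain indicator feature, a k-mer here receives partial credit from windows that are similar but not identical to it.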

Example for the di-mismatch kernel • Two non-consecutive pairs of mismatches lead to a dinucleotide mismatch count of 6 (each pair disrupts 3 dinucleotides). • 4 consecutive mismatches lead to a count of 5, so consecutive mismatches are penalized less.
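
The two counts can be checked directly. The 13-mer and the mutated positions below are hypothetical, chosen only to reproduce the two situations; the `mutate` helper is an illustrative device.

```python
def mismatching_dinucleotides(a, b):
    """Count positions i where the dinucleotides a[i:i+2] and b[i:i+2] differ."""
    assert len(a) == len(b)
    return sum(a[i:i + 2] != b[i:i + 2] for i in range(len(a) - 1))


def mutate(kmer, positions):
    """Return `kmer` with the base at each listed position swapped (A<->C, G<->T)."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    return "".join(swap[b] if i in positions else b for i, b in enumerate(kmer))


reference = "ACGTACGTACGTA"                   # hypothetical 13-mer
two_pairs = mutate(reference, {2, 3, 8, 9})   # two separated pairs of mismatches
one_run = mutate(reference, {4, 5, 6, 7})     # four consecutive mismatches

print(mismatching_dinucleotides(reference, two_pairs))  # 6
print(mismatching_dinucleotides(reference, one_run))    # 5
```

A run of r consecutive mismatched bases in the interior of a k-mer disrupts r + 1 dinucleotides, so splitting the same number of mismatches into separate runs always costs more.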

Sampling PBM data to obtain an informative training set • They selected the set of “positive” training probes to be those sequences associated with normalized binding intensities Z ≥ 3.5. • If there were more than 500, they selected the top 500 ranked by their binding signals. • The same number of “negative” training probes was selected from the other end of the distribution.
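
The sampling rule above can be sketched as below, assuming each probe is given as a `(sequence, z_score)` pair; the function name and data layout are hypothetical.

```python
def select_training_probes(probes, z_threshold=3.5, max_per_class=500):
    """Split PBM probes into positive and negative training sets.

    `probes` is a list of (sequence, z_score) pairs, where z_score is the
    normalized binding intensity.
    """
    ranked = sorted(probes, key=lambda p: p[1], reverse=True)
    # Positives: probes at or above the threshold, capped at the top-ranked ones.
    positives = [p for p in ranked if p[1] >= z_threshold][:max_per_class]
    # Negatives: the same number of probes from the low-intensity end.
    n = len(positives)
    negatives = ranked[-n:] if n else []
    return positives, negatives
```

This sketch does not guard against the two sets overlapping, which could only happen if more than half of all probes passed the threshold.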

Feature Selection • They selected the feature set to be those k-mers that are over-represented in either the “positive” or the “negative” probe class. • They computed the mean di-mismatch score for each k-mer in each class and ranked features by the difference between these means. • They used at most 4000 k-mers.
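
A sketch of this ranking, assuming the di-mismatch scores are already available as one row per probe and one column per k-mer. Ranking by the absolute difference of class means is an assumption that keeps k-mers enriched in either class; the function name is hypothetical.

```python
def select_features(kmers, pos_scores, neg_scores, max_features=4000):
    """Rank k-mers by the gap between their class-mean di-mismatch scores.

    pos_scores[i][j] and neg_scores[i][j] hold the score of k-mer j on the
    i-th positive and negative probe, respectively.
    """
    def column_mean(rows, j):
        return sum(row[j] for row in rows) / len(rows)

    gaps = [
        abs(column_mean(pos_scores, j) - column_mean(neg_scores, j))
        for j in range(len(kmers))
    ]
    # Largest gap first; keep at most `max_features` k-mers.
    order = sorted(range(len(kmers)), key=lambda j: gaps[j], reverse=True)
    return [kmers[j] for j in order[:max_features]]
```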

Results • First, they tested how well a model trained on one PBM array predicts the ranking of probe sequences on another PBM array. • The metric was “Top 100”: how many of the 100 probes with the highest measured intensity were also ranked in the top 100 by the model. • They compared against PSSM and E-score (full list of 8-mers) models.
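
The Top 100 metric amounts to the size of the overlap between two top-n sets. A minimal sketch, assuming measured and predicted scores are given as dictionaries keyed by probe id (a hypothetical layout):

```python
def top_n_overlap(measured, predicted, n=100):
    """Count probes that are in the top n by both measured and predicted score.

    `measured` and `predicted` map probe id -> score.
    """
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:n])

    return len(top(measured) & top(predicted))
```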

The left scatter plot shows the detection of the top 100 probes using maximum E-scores (x-axis) and the SVR model (y-axis) in the prediction of in vitro TF binding preferences. Each point corresponds to one TF. • The right panel is similar to the left, but compares the SVR versus PBM-derived PSSMs for the 114 mouse TFs.

Testing on ChIP-chip data

Prediction of in vivo occupancy • They computed the binding occupancy using a sliding 36-mer window for scoring. • They compared to: • PSSM. Log-odds scores were used. • E-score over a fixed threshold. • E-score based occupancy (using the median probe intensity of PBM probes containing the highest-scoring 8-mer pattern).
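
The sliding-window scoring can be sketched as follows. `score_window` stands for any model that scores a single 36-mer (for example a trained SVR); the function name and the toy scorer in the usage note are hypothetical.

```python
def occupancy_profile(region, score_window, window=36):
    """Score every `window`-length substring of `region` with `score_window`.

    Returns one score per window start position, in order along the region.
    """
    return [
        score_window(region[i:i + window])
        for i in range(len(region) - window + 1)
    ]
```

As a toy usage, `occupancy_profile(region, lambda w: w.count("CGG"))` gives a profile that rises wherever a window covers a `CGG` occurrence; a real model would replace the lambda.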

Predicted binding profile for: • (left) yeast TF Ume6 along IGR iYFL022C • (right) yeast TF Gal4 along IGR iYFR026C

They computed the detection of the top 200 intergenic regions (IGRs) by the top 200 predictions, where the top 200 “bound” IGRs were determined by their p-value ranking. • Prediction of in vivo binding is weak to very poor (due to indirect and competitive binding as well as other factors). • Still, in 8 out of 9 examples the SVR method outperforms the occupancy score method of Zhu et al. (2009). • Against the PSSM model the result was 6 wins, 1 tie, 2 losses.

Testing on ChIP-seq data

Testing on ChIP-seq data • They selected 1000 confident peak regions (60bp each) and 1000 “negative” regions from flanking sequences (60bp regions 300bp away from the peaks). • Model performance was measured by area under the ROC curve (AUC), using the maximum SVR prediction score (over 36-mer windows) to rank ChIP-seq 60-mers. • ROC = true positive rate vs. false positive rate.
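
The AUC can be computed without drawing the ROC curve, because it equals the probability that a randomly chosen positive scores higher than a randomly chosen negative (ties counted as one half). The sketch below assumes the per-region scores, such as the maximum window score of each 60-mer, have already been computed.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve from scores of positive and negative regions.

    Computed as the fraction of (positive, negative) pairs in which the
    positive scores higher, counting ties as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

This pairwise form is quadratic in the number of regions, which is fine for 1000 peaks against 1000 negatives; a rank-based computation would scale better.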

SVRs trained on PBM arrays are able to capture ChIP-seq peaks better than PSSMs or the occupancy score.

Support Vector Machines • Here we want to classify the data into binary classes, i.e. the training set is $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i \in \{-1, +1\}$.

Training discriminative models on ChIP-seq data • SVMs were trained with the (k, m) = (13, 5) kernel parameters on 60-mer ChIP-seq peaks (positive sequences) and flanking negative sequences. • Models were evaluated by computing AUCs on the same test sets of 1000 ChIP-seq peaks and 1000 flanking negative sequences using 10-fold cross-validation. • They were compared against Weeder and MDscan, which determine overrepresented k-mer and PSSM motifs, respectively.
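
The 10-fold cross-validation protocol can be sketched as a generator of train/test index splits. The interleaved fold assignment is an illustrative choice; in practice the examples would usually be shuffled first, and the slide does not say how the paper assigned folds.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k interleaved folds.

    Item i is assigned to fold i % k; each fold is held out exactly once.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

With 2000 sequences (1000 peaks and 1000 negatives), each split trains on 1800 and tests on 200, and the reported AUC would be computed on the held-out predictions.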

SVMs trained on ChIP-seq data capture sequence information from the genomic context of ChIP-seq peaks and improve in vivo prediction performance. • There was no advantage to training regression models on ChIP-seq peaks labeled with real-valued occupancy.

PBM experiments may capture in vivo preference • To investigate how some PBMs contain 2 different binding sites, they: • Clustered k-mer features based on their co-occurrence in the training sequences. • Projected highly weighted k-mers into 2 dimensions using principal component analysis (PCA). • Two clusters were found, each representing a different motif. • The SVR was trained on the features of each motif separately and the AUCs were 0.75 and 0.54.

K-mers contributing to the (left) Oct4 PBM model and (right) Sox2 ChIP model, where each point represents a 13-mer and is colored according to its model weight. Star and circle point styles indicate different clusters. • For the PBM derived model, the clusters represent the primary and secondary binding motifs • For the ChIP-derived model, the clusters correspond to the motifs for Sox2 and its cofactor Oct4.

Summary • A flexible new discriminative framework for learning TF binding models from high resolution in vitro and in vivo data. • The SVR/SVM models better predict binding affinity and thus are more suitable for representing complex regulatory regions.

Possible directions to continue • Training jointly on PBM and ChIP-seq data for the same TF. • Developing multi-task training strategies for modeling the binding preferences of a class of structurally related TFs using features of the amino acid sequence. • Combining in vivo TF sequence preference models with data on chromatin state to predict TF target genes in new cell types.
