
MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers






Presentation Transcript


  1. MotifBooster – A Boosting Approach for Constructing TF-DNA Binding Classifiers Pengyu Hong 10/06/2005

  2. Motivation • Understand transcriptional regulation • Model transcriptional regulatory networks [Figure: a TF binds sites upstream of Gene X, producing an mRNA transcript; regulators acting on genes]

  3. Motivation Previous work on motif finding • AlignACE (Hughes et al 2000) • ANN-Spec (Workman et al 2000) • BioProspector (Liu et al 2001) • Consensus (Hertz et al 1999) • Gibbs Motif Sampler (Lawrence et al 1993) • LogicMotif (Keles et al 2004) • MDScan (Liu et al 2002) • MEME (Bailey and Elkan 1995) • Motif Regressor (Conlon et al 2003) • …

  4. Motivation A widely used model – the Motif Weight Matrix (Stormo et al 1982):

      Pos:   1     2     3     4     5     6     7     8
      A    0.19  1.11 -0.17  1.65 -2.65 -2.66 -1.98  0.92
      C   -0.14 -0.49  1.89 -1.81  1.70  2.32  2.14 -2.07
      G   -1.39  0.25 -1.22 -1.07 -2.07 -2.07 -2.07  1.13
      T    0.86 -1.39 -2.65 -2.65  0.41 -2.65 -1.16 -1.80

  Score of the site AACATCCG = 0.19 + 1.11 + 1.89 + 1.65 + 0.41 + 2.32 + 2.14 + 1.13 = 10.84, compared against a threshold. A sequence is a target if it contains a binding site (score > threshold). Computational screening is far cheaper than molecular experiments (Computational << Molecular).
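As a sketch, the weight-matrix scoring on this slide can be written in a few lines of Python. The matrix values are copied from the slide; the function names `site_score` and `is_target` are illustrative, not from the original work.

```python
# Weight matrix from the slide (Stormo et al 1982 style), one score
# per base per position for an 8-bp site.
WEIGHT_MATRIX = {
    "A": [0.19, 1.11, -0.17, 1.65, -2.65, -2.66, -1.98, 0.92],
    "C": [-0.14, -0.49, 1.89, -1.81, 1.70, 2.32, 2.14, -2.07],
    "G": [-1.39, 0.25, -1.22, -1.07, -2.07, -2.07, -2.07, 1.13],
    "T": [0.86, -1.39, -2.65, -2.65, 0.41, -2.65, -1.16, -1.80],
}

def site_score(site: str) -> float:
    """Sum the matrix entry for each base at each position."""
    return sum(WEIGHT_MATRIX[base][pos] for pos, base in enumerate(site))

def is_target(sequence: str, threshold: float, width: int = 8) -> bool:
    """A sequence is a target if any window scores above the threshold."""
    return any(site_score(sequence[i:i + width]) > threshold
               for i in range(len(sequence) - width + 1))

print(round(site_score("AACATCCG"), 2))  # 10.84, as on the slide
```
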

  5. Motivation Non-linear binding effects, e.g., different binding modes for the consensus • • • CA C/T CC A/G TACAT • • •
  Preferred binding: Mode 1 • • • CACCCATACAT • • •, Mode 2 • • • CATCCGTACAT • • •
  Non-preferred binding: Mode 3 • • • CACCCGTACAT • • •, Mode 4 • • • CATCCATACAT • • •
  A single (linear) weight matrix cannot capture the correlation between the two variable positions.

  6. Modeling Model a TF-DNA binding classifier as an ensemble model: H(S) = Σ_m α_m h_m(S), a weighted sum in which each h_m is a base classifier and α_m is its weight.

  7. Modeling The mth base classifier h_m(S_i) is built on a sequence scoring function q_m(S_i), where f_m(s_ik) is a site scoring function (weight matrix + threshold). The scoring function considers (a) the number of matching sites and (b) the degree of matching.
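A minimal sketch of such a sequence scoring function, assuming the aggregation rule is "sum how far each matching site exceeds the threshold" (the transcript does not give the exact formula, so this aggregation is an assumption; the names are illustrative):

```python
def sequence_score(sequence, site_score_fn, threshold, width=8):
    """Aggregate matching sites: both the number of sites that exceed
    the threshold and the degree by which they exceed it contribute."""
    total = 0.0
    for i in range(len(sequence) - width + 1):
        s = site_score_fn(sequence[i:i + width])
        if s > threshold:
            total += s - threshold   # (b) degree of matching
    return total                     # grows with (a) number of matches

def base_classifier(sequence, site_score_fn, threshold, width=8):
    """q_m style output: positive class if any matching site exists."""
    score = sequence_score(sequence, site_score_fn, threshold, width)
    return 1.0 if score > 0 else -1.0
```

With a toy site scorer, e.g. `lambda s: s.count("A")`, a poly-A window of width 8 scores 8, exceeding a threshold of 5 by 3.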

  8. Training – Boosting Modify the confidence-rated boosting (CRB) algorithm (Schapire et al. 1999) to train ensemble models: (a) decide the number of base classifiers; (b) learn the parameters of each base classifier and its weight.

  9. Why Boosting? Boosting is a Newton-like technique that iteratively adds base classifiers to minimize an upper bound on the training error (Schapire et al. 1998). [Figure: training error, generalization error, and the margin of training samples across boosting rounds]
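The "upper bound" works because exp(-y·f(x)) >= 1 whenever a sample is misclassified (margin y·f(x) <= 0), so the mean exponential loss bounds the 0/1 training error from above. A small numeric illustration with made-up margins:

```python
import math

# (label y, ensemble output f(x)) pairs; the third and fourth samples
# are misclassified because y * f(x) <= 0.
samples = [(+1, 2.3), (-1, -0.7), (+1, -0.4), (-1, 1.2)]

# 0/1 training error: fraction of samples with non-positive margin.
zero_one = sum(1 for y, f in samples if y * f <= 0) / len(samples)

# Mean exponential loss: the quantity boosting actually minimizes.
exp_bound = sum(math.exp(-y * f) for y, f in samples) / len(samples)

assert zero_one <= exp_bound   # the bound holds on this toy data
```
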

  10. Challenges • Positive sequences are targets of a TF, negative sequences are non-targets; the sequences are labeled, but the sites within them are not. • The classes cannot be well separated by the (linear) weight matrix model. • Number of negative sequences >> number of positive sequences.

  11. Boosting Initialization • Total weight of the positive samples == total weight of the negative samples. • Since the motif must be an enriched pattern in the positive sequences, use Motif Regressor to find a seed motif matrix W0.
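The balanced initialization can be sketched as follows. The slide only fixes the class totals; splitting the weight uniformly within each class is an assumption, and the function name is illustrative:

```python
def init_weights(n_pos: int, n_neg: int):
    """Give each class total weight 1/2, spread uniformly inside it,
    so the many negatives do not dominate the few positives."""
    pos = [0.5 / n_pos] * n_pos
    neg = [0.5 / n_neg] * n_neg
    return pos, neg

# e.g. 25 positive sequences vs. 500 negatives: per-sample negative
# weights are 20x smaller, but the class totals match.
pos, neg = init_weights(25, 500)
assert abs(sum(pos) - sum(neg)) < 1e-12
```
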

  12. Boosting Train a base classifier (BC 1) • Use the seed matrix W0 to initialize the mth base classifier q_m() and let m = 1. • Refine α_m and the parameters of q_m() to minimize Σ_i d_i^m exp(−y_i α_m q_m(S_i)), where y_i is the label of S_i and d_i^m is the weight of S_i in the mth round. • Negative information is explicitly used to train q_m() and α_m.

  13. Boosting Adjust sample weights, giving higher weights to previously misclassified samples: d_i^{m+1} ∝ d_i^m exp(−y_i α_m q_m(S_i)), where y_i is the label of S_i, d_i^m is the weight of S_i in the mth round, and d_i^{m+1} is the new weight of S_i.
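This weight update has the standard confidence-rated boosting form (Schapire et al. 1999): misclassified samples (y_i and q_m(S_i) disagree) get their weight multiplied by a factor greater than 1, correct ones by a factor less than 1, and the weights are renormalized. A minimal sketch with an illustrative function name:

```python
import math

def update_weights(d, y, q, alpha):
    """d_i^{m+1} = d_i^m * exp(-y_i * alpha * q_i) / Z_m."""
    new = [d_i * math.exp(-y_i * alpha * q_i)
           for d_i, y_i, q_i in zip(d, y, q)]
    z = sum(new)                 # normalizer Z_m
    return [w / z for w in new]

d = [0.25, 0.25, 0.25, 0.25]
y = [+1, +1, -1, -1]
q = [+1.0, -1.0, -1.0, +1.0]    # samples 2 and 4 are misclassified
d2 = update_weights(d, y, q, alpha=0.5)
assert d2[1] > d2[0] and d2[3] > d2[2]   # errors gained weight
```
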

  14. Boosting Add a new base classifier (BC 2). [Figure: positive and negative samples with BC 1 and BC 2]

  15. Boosting Add a new base classifier. [Figure: updated decision boundary]

  16. Boosting Adjust sample weights again. [Figure: reweighted samples and decision boundary]

  17. Boosting Add one more base classifier (BC 3). [Figure: BC 3 added]

  18. Boosting Add one more base classifier. [Figure: updated decision boundary]

  19. Boosting Stop if the result is perfect or the performance on the internal validation sequences drops. [Figure: final decision boundary on positive and negative samples]
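The stopping rule on this slide can be sketched as a training loop. The helpers `train_round` and `evaluate` are hypothetical placeholders for MotifBooster's actual per-round training and internal-validation steps, not functions from the original work:

```python
def boost(train_round, evaluate, max_rounds=50):
    """Add base classifiers until training is perfect or validation
    performance drops.

    train_round() fits one more base classifier and returns the current
    training error; evaluate() returns accuracy on held-out internal
    validation sequences. Returns the number of rounds run."""
    best_val = evaluate()
    for m in range(max_rounds):
        train_err = train_round()
        val = evaluate()
        if train_err == 0.0:     # the result is perfect
            break
        if val < best_val:       # validation performance drops
            break
        best_val = val
    return m + 1
```

This validation-based stop decides (a) from slide 8, the number of base classifiers, without a fixed hyperparameter.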

  20. Results Data: ChIP-chip data of Saccharomyces cerevisiae (Lee et al. 2002) • Positive sequences: p-value < 0.001; number of positive sequences ≥ 25. • Negative sequences: p-value ≥ 0.05 & ratio ≤ 1. This yields 40 TFs.

  21. Results Leave-one-out test results: boosted models vs. seed weight matrices. [Figure: vertical axis, improvement in specificity; horizontal axis, TFs]

  22. Results Capture position correlation: RAP1. [Figure: motif logos for the seed weight matrix, base classifiers 1–3, and the boosted model]

  23. Results Capture position correlation: REB1. [Figure: motif logos for the seed weight matrix, base classifiers 1–2, and the boosted model]
