
Algorithms for variable length Markov chain modeling


Presentation Transcript


  1. Algorithms for variable length Markov chain modeling Author: Gill Bejerano Presented by Xiangbin Qiu

  2. Review of Markov Chain Model • Often used in bioinformatics to capture relatively simple sequence patterns, such as genomic CpG islands.
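
A minimal sketch of the model being reviewed (function names and toy sequences are illustrative, not from the author's pst package): a first-order Markov chain over the DNA alphabet, trained from raw sequences and used to score a query by its log-likelihood.

    from collections import defaultdict
    from math import log

    def train_order1(seqs):
        """Count transitions a->b in the training sequences and normalize to probabilities."""
        counts = defaultdict(lambda: defaultdict(int))
        for s in seqs:
            for a, b in zip(s, s[1:]):
                counts[a][b] += 1
        return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
                for a, nxt in counts.items()}

    def log_likelihood(model, s, floor=1e-6):
        """Log-probability of s under the chain; unseen transitions get a small floor."""
        return sum(log(model.get(a, {}).get(b, floor)) for a, b in zip(s, s[1:]))

    model = train_order1(["ACGCGCGTACGCG", "CGCGCGCGAT"])   # toy training strings
    print(log_likelihood(model, "CGCGCG"))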

  3. Problem • Low-order Markov chains are poor classifiers. • Higher-order chains are often impractical to implement or train. • The memory and training-set size requirements of an order-k Markov chain grow exponentially with k: the model keeps one next-symbol distribution for each of the |Σ|^k possible contexts.
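
To make the growth concrete: an order-k chain needs roughly |Σ|^k · (|Σ| − 1) free parameters. A back-of-the-envelope calculation (illustrative only):

    # Free parameters of an order-k Markov chain over an alphabet of size A:
    # one (A - 1)-dimensional next-symbol distribution per length-k context.
    A = 20                        # amino-acid alphabet
    for k in range(1, 6):
        print(k, A**k * (A - 1))  # k = 5 already needs about 6.1e7 parameters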

  4. Variable length Markov Model (VMM) • The models are not restricted to a predefined uniform depth (e.g. order-k). • The model is constructed that fits higher order Markov dependencies where such contexts exist, while using lower order Markov dependencies elsewhere. • The order is determined by examining the training data.
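
A toy illustration of the idea (made up for this write-up, not from the slides): long contexts are kept only where the data shows a strong higher-order dependency, with short contexts everywhere else.

    # Hypothetical VMM over DNA: most symbols are predicted from one preceding
    # symbol, but the context "CG" is kept at depth 2 because the distribution
    # after "CG" differs markedly from the distribution after "G" alone.
    contexts = {
        "G":  {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "CG": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
    }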

  5. Description of Author’s Work • Four main modules are implemented: • Train • Predict • Emit • 2pfa

  6. Probabilistic Suffix Tree (PST) • A tree data structure over the alphabet Σ: each node is labeled by a context string, the deeper the node the longer the context, and each node stores the probability distribution of the next symbol conditioned on its context.
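
One way to sketch such a node in Python (hypothetical field names, not the author's implementation):

    class PSTNode:
        """One node of a probabilistic suffix tree.

        `context` is the string labeling the node (the root has the empty context),
        `next_prob` maps each symbol sigma to P(sigma | context), and `children`
        maps a symbol to the node whose context is one symbol longer (prepended),
        e.g. the child of context "A" reached via "C" has context "CA".
        """
        def __init__(self, context=""):
            self.context = context
            self.next_prob = {}
            self.children = {}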

  7. PST-Definitions • Σ: the alphabet; the training set consists of strings r^i over Σ, i = 1, 2, …, m • Empirical probability of a substring s: P̃(s) = N(s) / (total number of length-|s| windows in the training set), where N(s) counts the occurrences of s • Conditional empirical probability of observing symbol σ right after context s: P̃(σ|s) = N(sσ) / Σ_{σ'∈Σ} N(sσ')
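
Both quantities reduce to substring counting; a small sketch under the definitions as written above (helper names are my own):

    from collections import Counter

    def substring_counts(strings, length):
        """N(s): occurrences of every length-`length` substring in the training set."""
        c = Counter()
        for r in strings:
            for i in range(len(r) - length + 1):
                c[r[i:i + length]] += 1
        return c

    def cond_empirical(strings, context, alphabet="ACGT"):
        """P~(sigma | context) for every symbol sigma, per the definition above."""
        counts = substring_counts(strings, len(context) + 1)
        total = sum(counts[context + a] for a in alphabet)
        return {a: counts[context + a] / total for a in alphabet} if total else {}

    print(cond_empirical(["ACGCGT", "CGCGCG"], "CG"))   # toy training strings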

  8. Parameters • Minimum probability: P_min • Smoothing factors: γ_min and α • Memory length: L • Difference measure parameter: r

  9. Building the PST
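
The construction itself appeared as a figure on this slide. The sketch below is a deliberately simplified paraphrase of the standard PST growing rule (it omits, among other things, the insertion of intermediate nodes); it reuses substring_counts and cond_empirical from the previous sketch and the parameters of slide 8, and is not the author's Train module.

    def empirical(strings, s):
        """P~(s): fraction of length-|s| windows in the training set equal to s."""
        counts = substring_counts(strings, len(s))
        return counts[s] / sum(counts.values()) if counts else 0.0

    def build_pst(strings, alphabet="ACGT", L=5, p_min=0.001, r=1.05, gamma_min=0.001):
        """Grow the tree over candidate contexts, shortest first."""
        pst = {"": cond_empirical(strings, "", alphabet)}        # root: empty context
        frontier = [a for a in alphabet if empirical(strings, a) >= p_min]
        while frontier:
            s = frontier.pop(0)
            p_s = cond_empirical(strings, s, alphabet)
            p_suf = cond_empirical(strings, s[1:], alphabet)     # parent context suf(s)
            # Keep s if, for some symbol, its prediction differs from the parent's
            # by at least the ratio threshold r (in either direction).
            if any(p_suf.get(a, 0) > 0 and not (1 / r < p_s[a] / p_suf[a] < r)
                   for a in alphabet if p_s.get(a, 0) > 0):
                smooth = 1 - len(alphabet) * gamma_min           # gamma_min smoothing
                pst[s] = {a: smooth * p_s[a] + gamma_min for a in alphabet}
            if len(s) < L:                                       # grow context into the past
                frontier += [a + s for a in alphabet if empirical(strings, a + s) >= p_min]
        return pst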

  10. Biologically Extended PST- a Variant of PST Model

  11. Incremental Model Refinement • The fitted model can be refined gradually by relaxing its parameters: • L ↑ • r → 1

  12. Prediction using a PST
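
Prediction was likewise shown as a figure; a sketch of the usual scoring scheme (my own function names, building on build_pst above, not the author's Predict module): each symbol of the query is predicted from the longest stored context matching the preceding symbols, and the per-symbol log-probabilities are summed.

    from math import log

    def longest_context(pst, history):
        """Deepest stored context that is a suffix of the symbols seen so far."""
        for k in range(len(history), -1, -1):
            if history[len(history) - k:] in pst:
                return history[len(history) - k:]
        return ""

    def score(pst, seq):
        """Log-likelihood of seq under the PST."""
        total = 0.0
        for i, sym in enumerate(seq):
            ctx = longest_context(pst, seq[:i])
            total += log(max(pst[ctx].get(sym, 0.0), 1e-6))   # floor for unseen symbols
        return total

    print(score(build_pst(["ACGCGT", "CGCGCG"]), "CGCG"))     # toy query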

  13. Results and Discussion • When averaged over all 170 families, the PST detected 90.7% of the true positives. • Much better than a typical BLAST search, and comparable to an HMM trained from a multiple alignment of the input sequences in a global search mode.

  14. Results and Discussion (Cont.)

  15. Results and Discussion (Cont.)

  16. Limitations

  17. Why Significant? • Performance is comparable to that of HMM models, yet the PST is: • Built in a fully automated manner • Without multiple alignment • Without scoring matrices • Less demanding than HMMs in terms of data abundance and quality

  18. Future Work • An additional improvement is expected if a larger sample set is used to train the PST. Currently the PST is built from the training set alone. • Obviously, training the PST on all strings of a family should improve its prediction as well.

  19. Confused?
