Logic-statistic modeling and analysis of biological sequence data: A research agenda

Logic-statistic modeling and analysis of biological sequence data: A research agenda. Henning Christiansen, Roskilde University, Denmark. henning@ruc.dk, http://www.ruc.dk/~henning. International Workshop on Abduction and Induction in AI and Bioinformatics, Aix-en-Provence, 15 Sep 2007.


Presentation Transcript


  1. Logic-statistic modeling and analysis of biological sequence data: A research agenda
     Henning Christiansen, Roskilde University, Denmark
     henning@ruc.dk, http://www.ruc.dk/~henning
     International Workshop on Abduction and Induction in AI and Bioinformatics, Aix-en-Provence, 15 Sep 2007

  2. Motivation and overall goal
     Computational analysis of biological sequence data is traditionally based on
     • HMMs, SCFGs, and ad hoc techniques
     • Each system has its particular type of models
     This is a bottleneck which cannot be remedied (only) by faster and parallel computers.
     We want to promote the application of more expressive and flexible models: logic-statistic methods à la PRISM (Sato, Kameya),
     i.e., stepping from regular and context-free languages to a Turing-complete language.

  3. To reach this goal, we will...
     • Approach inherent computational problems
       • optimizations by program analysis & transformations
       • interface to existing and efficient software
     • Develop biologically relevant test cases
       • for biologists to learn how to use such models
       • to have relevant test cases for the first part

  4. Project setup
     Funded by the NABIIT program under the Danish Strategic Research Council, 2007–2011
     Main academic partners:
     • Roskilde U., Computer Science: H. Christiansen, J. Gallagher; 1 PhD student (still open!), postdocs
     • Roskilde U., Biology: O. Skovgaard; 1 PhD student
     • Aalborg U., Computer Science: M. Jaeger
     Academic associates:
     • Taisuke Sato, Tokyo Inst. of Techn.
     • A. Krogh, Copenhagen Univ.
     Industrial partners:
     • Chr. Hansen, Denmark
       • Worldwide supplier of probiotic products for the dietary supplement industry
     • CLC bio
       • Leading supplier of bioinformatics software

  5. PRISM (Sato, Kameya) for sequence analysis, introduced by a toy example
     PRISM extends Prolog with discrete random variables
     Includes machine learning and prediction methods:
     • learn the best probabilities to explain training data
     • with the learned probabilities, determine the best answer to a query
     Example: Loop structures - a non-context-free phenomenon
         gggctgg gggctgg
     Assume a collection of sequences where loop structures have been identified in the lab
     Task: Build and train a model so it can be used for prediction

  6. Example model in PRISM
     • Assume (arbitrarily):
       • ‘noise’ ≈ a first-order Markov model
       • ‘contact zone’ ≈ a second-order one

         sequence(...):- noise(...), contact(K,....), noise(....), contactCopy(K, ...), noise(...).

         values(moreNoise,[stop,continue]).
         values(moreContact,[stop,continue]).
         values(which(_),[a,c,g,t]).
         values(which(_,_),[a,c,g,t]).

         noise(F,S1,S2):- msw(moreNoise,YN), noise2(F,S1,S2,YN).
         noise2(_,S,S,stop).
         noise2(F,[B|S1],S2,continue):- msw(which(F),B), noise(B,S1,S2).

         contact(K,F1,F2,S1,S2):- msw(moreContact,YN), contact2(K,F1,F2,S1,S2,YN).
         contact2([],_,_,S,S,stop).
         contact2([B|K],F1,F2,[B|S1],S2,continue):- msw(which(F1,F2),B), contact(K,F2,B,S1,S2).

         contactCopy([],S,S).
         contactCopy([B|K],[B|S1],S2):- contactCopy(K,S1,S2).

         sequence(K,S):- noise(-,S,S1), contact(K,-,-,S1,S2), noise(-,S2,S3), contactCopy(K,S3,S4), noise(-,S4,[]).

     This is the entire model!
     Training data:

         sequence([c,c,g,g,g,t,c,g,c],[a,c,c,g,g,g,t,c,g,c,a,a,t,c,a,a,a,t,c,t,t,t,a,a,c,c,c,g,g,g,t,c,g,c,a,g,a,c,t,a,t,g,t,t,t,a,g,a,a,a,a,c,a,t]).
         sequence(......, ......).
         sequence(...., .....).
         .......
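
To make the generative reading of this model concrete, here is a minimal Python sketch (not PRISM) that samples a sequence the way the clauses above describe: noise, contact zone, noise, a copy of the contact zone, noise. The function names, the uniform base probabilities and the continue probabilities are assumptions for illustration only; in PRISM these parameters would be learned from the training data.

```python
import random

BASES = "acgt"
CONTEXTS = ["-"] + list(BASES)  # "-" marks "no previous base", as in the PRISM clauses

def uniform_switches():
    """Untrained placeholder probabilities (an assumption; PRISM would learn these)."""
    which1 = {f: [0.25] * 4 for f in CONTEXTS}
    which2 = {(f1, f2): [0.25] * 4 for f1 in CONTEXTS for f2 in CONTEXTS}
    return which1, which2

def sample_noise(rng, which1, p_continue):
    """First-order Markov noise: each base is drawn given the previous base."""
    out, prev = [], "-"
    while rng.random() < p_continue:                         # moreNoise = continue
        prev = rng.choices(BASES, weights=which1[prev])[0]   # which(F)
        out.append(prev)
    return out

def sample_contact(rng, which2, p_continue):
    """Second-order contact zone: each base is drawn given the two previous bases."""
    out, f1, f2 = [], "-", "-"
    while rng.random() < p_continue:                         # moreContact = continue
        b = rng.choices(BASES, weights=which2[(f1, f2)])[0]  # which(F1,F2)
        out.append(b)
        f1, f2 = f2, b
    return out

def sample_sequence(rng, p_noise=0.9, p_contact=0.8):
    """Return (K, S) with S = noise + K + noise + K + noise."""
    which1, which2 = uniform_switches()
    n1 = sample_noise(rng, which1, p_noise)
    k = sample_contact(rng, which2, p_contact)
    n2 = sample_noise(rng, which1, p_noise)
    n3 = sample_noise(rng, which1, p_noise)
    return k, n1 + k + n2 + k + n3
```

Calling `sample_sequence(random.Random(0))` yields one contact zone and one full sequence as lists of bases; the second occurrence of the contact zone is an exact copy, which is what makes the language non-context-free.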

  7. Using the trained model for prediction

         ?- viterbig(sequence(K,[t,a,t,a,g,c,g,c,t,a,t,a,g,c,g,c,t,a,t,a]))
         K = [g,c,g,c]

     This is the answer to the query with the highest probability.
     ... plus a lot of other facilities
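
What "the answer with highest probability" means can be illustrated by brute force. The Python sketch below (not PRISM, and not how PRISM computes it; PRISM works on explanation graphs) enumerates every way of splitting a sequence into noise, K, noise, copy of K, noise, scores each split, and returns the best K. All parameter values are hypothetical stand-ins for trained probabilities, and for brevity the sketch ignores the Markov contexts: noise bases are uniform and contact bases are assumed to favour g and c.

```python
import math

def best_contact_zone(seq, p_noise_continue=0.9, p_contact_continue=0.7,
                      contact_weights=None):
    """Exhaustively find the most probable contact zone K in seq (a string).

    Hypothetical parameters: noise bases are uniform over a, c, g, t;
    contact bases follow contact_weights. The stop probabilities are the
    same for every split, so they are left out of the score.
    """
    weights = contact_weights or {"a": 0.05, "c": 0.45, "g": 0.45, "t": 0.05}
    log_noise_base = math.log(p_noise_continue * 0.25)
    n = len(seq)
    best_score, best_k = -math.inf, None
    for i in range(n + 1):                            # start of K
        for length in range(0, (n - i) // 2 + 1):     # length of K
            k = seq[i:i + length]
            log_k = sum(math.log(p_contact_continue * weights[b]) for b in k)
            for j in range(i + length, n - length + 1):  # start of the copy
                if seq[j:j + length] == k:
                    # The copy is deterministic; everything outside the two
                    # occurrences of K is generated as noise.
                    score = (n - 2 * length) * log_noise_base + log_k
                    if score > best_score:
                        best_score, best_k = score, k
    return best_k
```

With these assumed parameters, `best_contact_zone("tatagcgctatagcgctata")` returns `"gcgc"`, the same K as in the query above. With different parameters a longer repeat such as `"tatagcgc"` could win, which is why training matters.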

  8. Our first serious application of PRISM: Testing gene finders (MLDM 2007; with C.M. Dahmcke)
     Problems: Test data expensive; available test data already used for training gene finders; disagreement about what is a gene, ...
     Approach:
     • Develop and train PRISM model with known, annotated data
     • Use this to create artificial test data,
       • i.e., sequences with annotations about where-are-the-genes
     • Check if gene finder programs find the same genes
     Results:
     • Three different gene finders found too many and different genes ;-(
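
The last step of the approach, checking whether a gene finder finds the same genes as the annotations, could be sketched as follows. This is an illustrative Python sketch, not the evaluation used in the experiment: annotations are assumed to be inclusive `(from, to)` position pairs, a prediction counts as correct if it overlaps an annotated gene, and the function names are hypothetical.

```python
def overlaps(a, b):
    """True if the inclusive intervals a = (from, to) and b = (from, to) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]

def compare_annotations(true_genes, predicted_genes):
    """Compare predicted gene intervals with annotated ones.

    Returns (sensitivity, precision, spurious), where spurious lists the
    predictions that overlap no annotated gene.
    """
    found = [t for t in true_genes
             if any(overlaps(t, p) for p in predicted_genes)]
    spurious = [p for p in predicted_genes
                if not any(overlaps(p, t) for t in true_genes)]
    sensitivity = len(found) / len(true_genes) if true_genes else 1.0
    precision = ((len(predicted_genes) - len(spurious)) / len(predicted_genes)
                 if predicted_genes else 1.0)
    return sensitivity, precision, spurious
```

A gene finder that "finds too many genes" shows up here as a long `spurious` list and a low precision.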

  9. Overview of the model (intergenic only)
     [Figure: a sequence divided into regions labelled GC-island and GC-sparse]
     Target predicate: sequence(sequence-of-ACGT, GC-islands, repeats)
     • GC-islands: list of from-no–to-no
     • repeats: list of from-no–to-no with indication of:
       • type: simple, low-complexity, named, ...
       • for named: selected from catalogue; which part; forward, backward, transposed, backward+transposed
       • plus one detailed description of »mutation«: [c,c,c,c,i,i,c,c,d,d,c,c,...] (to suppress complexity in the model; for training data generated by a best-match algorithm)
     ...
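
The »mutation« description can be read as an edit script applied to a catalogue repeat. The Python sketch below assumes, and this is only an assumption since the slide does not spell it out, that `c` means copy the next template base, `i` means insert a new base, and `d` means delete (skip) the next template base. The function name and the way inserted bases are supplied are hypothetical.

```python
def apply_edit_script(template, script, inserted):
    """Apply an edit script to a template sequence.

    Assumed meaning of the operations (not confirmed by the slides):
      'c' copies the next template base,
      'i' inserts the next base from `inserted`,
      'd' skips the next template base.
    """
    out, t, ins = [], 0, 0
    for op in script:
        if op == "c":
            out.append(template[t])
            t += 1
        elif op == "i":
            out.append(inserted[ins])
            ins += 1
        elif op == "d":
            t += 1
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return "".join(out)
```

For example, `apply_edit_script("acgtacgt", "ccccii" + "ccdd", "tt")` copies `acgt`, inserts `tt`, copies `ac` and drops `gt`, giving `"acgtttac"`. Fixing one such script per repeat in preprocessing means the model does not have to sum over all possible alignments.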

  10. Implemented as a two-layer model
      Top-level: GC-islands/GC-sparse, length 200 + exponential decay
      Underlying layer: Mix of repeats and coloured noise
      Two-level structure implemented by our own abstract datatype:
      • uses hidden msw’s to control GC-island/sparse
      • each RV maintained in two versions (hidden)
      • position, counter to produce annotated GC-islands

          msw(RV:random-var, value, GC-islands, position)
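
The top layer can be sketched in Python (not PRISM) as alternating GC-sparse and GC-island regions whose lengths are a fixed minimum plus a geometrically distributed tail, with the island positions recorded as annotations. Reading "length 200 + exponential decay" as "at least 200, then stop with a constant probability per base" is an assumption, as are the GC contents and the function name. The underlying layer of repeats and coloured noise is replaced here by independent bases with a region-specific GC content.

```python
import random

def sample_regions(rng, n_regions, min_len=200, p_continue=0.99,
                   gc_island=0.7, gc_sparse=0.4):
    """Sample a sequence of alternating GC-sparse / GC-island regions.

    Returns (sequence, islands), where islands is a list of inclusive,
    1-based (from, to) positions of the GC-islands. All numeric defaults
    are illustrative assumptions.
    """
    seq, islands, pos = [], [], 1
    for r in range(n_regions):
        is_island = (r % 2 == 1)          # sparse, island, sparse, ...
        length = min_len
        while rng.random() < p_continue:  # geometric tail beyond the minimum
            length += 1
        gc = gc_island if is_island else gc_sparse
        for _ in range(length):
            seq.append(rng.choice("gc") if rng.random() < gc else rng.choice("at"))
        if is_island:
            islands.append((pos, pos + length - 1))
        pos += length
    return "".join(seq), islands
```

The position counter `pos` plays the role of the extra position argument in the `msw` wrapper above: it is what lets the generator emit annotations alongside the sequence.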

  11. Lessons learned from the gene finder experiment
      • A nontrivial model can be organized in a reasonable way by an experienced logic programmer
      • Preprocessing to freeze one mutation set reduced the complexity of the learning phase - a general technique?
      • The model could be trained in minutes from marked-up sequences totalling 10^6 letters.
      • With Prolog’s list representation for sequences we needed a 64-bit architecture (sic!) and a lo-o-o-o-ot of RAM
      • PRISM is a very flexible tool for combining and varying different models, inventing a little data structure etc., while keeping a model with well-defined semantics
      • We lacked, and would suggest adding to PRISM:
        • Distributions over integers (normal d. or “generic smooth”)
        • Over-layered, and especially negative criteria
        • (ouf: random variables become dependent)

  12. Anticipated problems for prediction with PRISM and possible solutions
      • Storage consumption: the sequence as array + PRISM’s explanation graphs (???)
      • Execution time
      • Systematic approach to pruning: Generalize known methods for semantics-preserving program transformations to semantics-approximating transformations
      • Integrate with existing and efficient software
      • Automatically and hidden??? If, e.g., analysis of PRISM program says “this-looks-like-a-HMM”
      • Clean interfaces ????
      • Reduce complexity by splitting the sequence (-- how to integrate this with a nice semantics?)

  13. Biological problems considered
      • Gene finding in health-promoting bacteria
      • Phylogenetic gene prediction
      • Prediction of gene function and acquisition by orthology
      • (Gene finding for eukaryotic species)

  14. Project hypotheses summarized
      • Logic-statistic models, à la PRISM or similar, have much higher expressiveness and flexibility than traditional models used for sequence analysis (in a formal as well as a practical sense)
      • If we can solve some of the computational problems involved and learn how to use such powerful modeling tools, there is a potential for new discoveries in biology.

  15. Thanks for your attention!
      PS. We are seeking (desperately) a good PhD student for the computational issues. Good salary and conditions offered!
