Biological Data Mining

Biological Data Mining A comparison of Neural Network and Symbolic Techniques http://www.cmd.port.ac.uk/biomine/

1. Objectives • The project aims: • to develop and validate techniques for extracting explicit information from bioinformatic data • to express this information as logical rules and decision trees • to apply these new procedures to a range of scientific problems related to bioinformatics and cheminformatics

2. Extracting information • Artificial neural networks can be trained to reproduce the non-linear relationships underlying bioinformatic data with good predictive accuracy • but it is often hard to comprehend those relationships from the internal structure of the network • with the result that networks are often regarded as ‘black boxes’. • Decision trees using symbolic rules are easier to interpret • leading to a greater likelihood of understanding the relationships in the data • allowing the behaviour of individual cases to be explained.

3. Extracting Decision Trees • The Trepan procedure (Craven, 1996) extracts decision trees from a neural network and a set of training cases by recursively partitioning the input space. • The decision tree is built in a best-first manner, expanding the tree at the nodes where there is the greatest potential for increasing the fidelity of the tree to the network.
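The best-first expansion order can be illustrated with a priority queue. This is a minimal sketch, not the project's implementation: the scoring rule `reach * (1 - fidelity)` (the fraction of cases reaching a node times the node's estimated disagreement with the network) is an assumption used here to stand for "greatest potential for increasing fidelity", and the function names and node tuples are hypothetical.

```python
import heapq


def node_priority(reach, fidelity):
    """Assumed score: share of cases reaching the node times its infidelity."""
    return reach * (1.0 - fidelity)


def best_first_order(nodes):
    """Return node names in the order a best-first search would expand them.

    nodes: iterable of (name, reach, fidelity) tuples, all values hypothetical.
    """
    # heapq is a min-heap, so the priority is negated to pop the largest first.
    heap = [(-node_priority(reach, fid), name) for name, reach, fid in nodes]
    heapq.heapify(heap)
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
    return order
```

In a full extraction loop, expanding a node would push its children back onto the same queue; the sketch only shows the ordering step.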

4. Splitting Tests • The splitting tests at the nodes are m-of-n expressions, e.g. 2-of-{x1, ¬x2, x3}, where the xi are Boolean conditions. • Start with a set of candidate tests • binary tests on each value for nominal features • binary tests on thresholds for real-valued features • Use a beam search with a beam width of two. • Initialize the beam with the candidate test that maximizes the information gain.
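As an illustration of the two ingredients named above, the following sketch evaluates an m-of-n test on a single case and computes the information gain of a binary test. It assumes cases are tuples of 0/1 features and that gain is the usual entropy-based measure; the function names are hypothetical and not taken from Trepan.

```python
from math import log2


def m_of_n(m, conditions, x):
    """True when at least m of the Boolean conditions hold for case x."""
    return sum(1 for cond in conditions if cond(x)) >= m


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(cases, labels, test):
    """Entropy reduction obtained by splitting the cases with a binary test."""
    passed = [y for x, y in zip(cases, labels) if test(x)]
    failed = [y for x, y in zip(cases, labels) if not test(x)]
    n = len(labels)
    remainder = (len(passed) / n) * entropy(passed) \
        + (len(failed) / n) * entropy(failed)
    return entropy(labels) - remainder


# The slide's example test, 2-of-{x1, not x2, x3}, on cases (x1, x2, x3).
example_conditions = [
    lambda x: x[0] == 1,
    lambda x: x[1] == 0,
    lambda x: x[2] == 1,
]
```

For instance, `m_of_n(2, example_conditions, (1, 0, 0))` holds because x1 and ¬x2 are both satisfied, while the case `(1, 1, 0)` satisfies only x1 and fails the test.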

5. Splitting Tests (II) • To each m-of-n test in the beam and each candidate test, apply two operators: • m-of-n+1 e.g. 2-of-{x1, x2} => 2-of-{x1, x2, x3} • m+1-of-n+1 e.g. 2-of-{x1, x2} => 3-of-{x1, x2, x3} • Admit new tests to the beam if they increase the information gain and are significantly different (by a chi-squared test) from existing tests.
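The two operators can be sketched as follows. The representation of a test as a pair `(m, frozenset_of_condition_names)` is an assumption made for illustration, and the admission step (information gain plus the chi-squared check) is deliberately left out.

```python
def m_of_n_plus_1(test, candidate):
    """Add a condition and keep the threshold: m-of-n becomes m-of-(n+1)."""
    m, conditions = test
    return (m, conditions | {candidate})


def m_plus_1_of_n_plus_1(test, candidate):
    """Add a condition and raise the threshold: m-of-n becomes (m+1)-of-(n+1)."""
    m, conditions = test
    return (m + 1, conditions | {candidate})


def expand(test, candidates):
    """Apply both operators for every candidate not already in the test."""
    _, conditions = test
    new_tests = []
    for candidate in candidates:
        if candidate in conditions:
            continue  # adding an existing condition would not change the test
        new_tests.append(m_of_n_plus_1(test, candidate))
        new_tests.append(m_plus_1_of_n_plus_1(test, candidate))
    return new_tests
```

Applied to 2-of-{x1, x2} with the candidate x3, `expand` yields the two tests shown on the slide: 2-of-{x1, x2, x3} and 3-of-{x1, x2, x3}.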

6. Example: Substance P Binding to NK1 Receptors • Substance P is a neuropeptide with the sequence: H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • Wang et al. used the multipin technique to synthesize 512 = 2^9 stereoisomers generated by systematic replacement of L- by D-amino acids at 9 positions • The aim was to measure binding potencies to NK1 receptors & identify the positions at which stereochemistry affects binding strength.
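The size of the library follows from two choices (L or D) at each of 9 positions. A short sketch of the enumeration is given below; the 0/1 encoding (0 for L, 1 for D) is an assumption for illustration, and the slides do not state which 9 residues were varied, so positions are left as anonymous indices.

```python
from itertools import product

N_POSITIONS = 9  # number of positions at which L- is replaced by D-amino acids


def enumerate_stereoisomers(n_positions=N_POSITIONS):
    """Return every L/D assignment as a tuple of 0 (L) and 1 (D) flags."""
    return list(product((0, 1), repeat=n_positions))


isomers = enumerate_stereoisomers()
print(len(isomers))  # 2**9 assignments
```

Each tuple could serve directly as a 9-feature Boolean input vector for a classifier or a decision tree.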

7. Application of Trepan • A series of networks with 9:9:1 architectures were trained using 90% of the data as a training set. • For each network a decision tree was grown using Trepan. • The trees showed high fidelity with the networks on a 10% test set.
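Fidelity here means agreement between the tree's predictions and the network's predictions on the same cases. A minimal sketch of that measure, together with the kind of tally behind a confusion matrix, is shown below; the function names and the example labels are hypothetical and are not results from the project.

```python
from collections import Counter


def fidelity(tree_predictions, network_predictions):
    """Fraction of cases on which the tree and the network agree."""
    if len(tree_predictions) != len(network_predictions):
        raise ValueError("prediction lists must have the same length")
    if not tree_predictions:
        raise ValueError("at least one prediction is required")
    agreements = sum(
        t == n for t, n in zip(tree_predictions, network_predictions)
    )
    return agreements / len(tree_predictions)


def confusion_matrix(row_labels, column_labels):
    """Count (row, column) label pairs, e.g. (tree, network) or (tree, observed)."""
    return Counter(zip(row_labels, column_labels))
```

The same `confusion_matrix` helper covers both comparisons reported later in the slides: tree versus network (fidelity) and tree versus observed (accuracy).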

8. Results • Binding activity was determined by five positions, viz. • H-Arg-Pro-Lys-Pro-Gln-Gln-Phe-Phe-Gly-Leu-Met-NH2 • The positions identified agree with the FIRM (Formal Inference-based Recursive Modelling) analysis of Young and Hawkins • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

9. A Typical Trepan Tree

10. Test set confusion matrix: tree versus network

11. Test set confusion matrix: tree versus observed

12. Future Work • Complete the implementation of the Trepan algorithm. • model the distribution of the input data and generate a set of query instances to be classified by the network & used as additional training cases during tree extraction. • Extend the algorithm to enable the extraction of regression trees. • Provide a Bayesian formulation for the decision tree extraction algorithm.

13. Future Applications • Apply Trepan to ligand-receptor binding problems. • compare the performance of these algorithms with existing symbolic data mining techniques (ID3/C5).

14. References • Wang J-X et al. (1993) Study of stereo-requirements of substance P binding to NK1 receptors using analogues with systematic D-amino acid replacements. Bioorganic & Medicinal Chemistry Letters, 3, 451-456. • Young S & Hawkins D.M. (2000) Analysis of a large, high-throughput screening data using recursive partitioning. Molecular Modelling & Prediction of Bioactivity (ed. Gundertofte & Jørgensen).

Grantholder Professor Martyn Ford Centre for Molecular Design University of Portsmouth martyn.ford@port.ac.uk Research Fellows Dr Shuang Cang Mar - Sept 2000 Dr Abul Azad Jan 2001 -

Collaborators • Dr Antony Browne School of Computing, Information Systems and Mathematics, London Guildhall University. abrowne@lgu.ac.uk • Professor Philip Picton School of Technology and Design, University College Northampton. phil.picton@northampton.ac.uk • Dr David Whitley Centre for Molecular Design, University of Portsmouth. david.whitley@port.ac.uk
