Dr. Daniel NEAGU, UK Dr. Gongde GUO Dept. of Computer Science, Fujian Normal University, China

An Effective Combination based on Class-Wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining • ADMA 2006, Xi’an, China • Dr. Daniel NEAGU, UK • Dr. Gongde GUO, Dept. of Computer Science, Fujian Normal University, China • Ms. Shanshan WANG, Dept. of Computer Science, Nanjing University of Aeronautics and Astronautics, China

Bradford, UK • Bradford, West Yorkshire • National Museum of Film and Television • School of Informatics, University of Bradford

Overview (1) • Introduction to ML applications to KDD • Proposal of Combination Operators • Model Construction and Classification Algorithms • Model Library for Predictive Toxicology • Collection of datasets • Central store for models and results • Formal structure to speed access and improve organisation; reduce ‘misplaced’ files • Remote Access • Secure access to data from remote locations possible in the future

Overview (2) • Comparative Studies • Results from UoB Model Library • Study of different Machine Learning techniques • Variety of Feature Selection techniques • Many datasets and endpoints • Large variation in accuracy of created models • One aim is to automatically build ensembles based on best class-wise models • Results and Conclusions

Current Context • Nowadays more scientific data is generated and flows within systems: • Manpower/laboratories • Techniques and computational power (Moore’s Law) • Funds/legislation • More data is stored and available: • Storage technology faster and cheaper (Storage Law) • DBMS capable of handling bigger DB • Web/online access to distributed data • Consequences • Human expert is overloaded: very little data is checked • Knowledge Discovery is NEEDED for data understanding and use • [Figure labels: Hardware; Data collection/management; SW (Algorithms)]

General definitions • Data: facts regarding things (such as people, objects, events) which can be digitally transmitted or processed. • Information: data that have been processed and presented in a form suitable for human interpretation, with the purpose of revealing meanings (such as patterns or rules). • Models: representations of patterns. • Knowledge: the theoretical and practical comprehension of a certain domain, which supports making decisions. • Intelligence: the capability of learning, understanding and finding solutions for problems in a specific domain. • Example: • 1234567.89 is data. • "Your bank balance has jumped 80.87% to £1234567.89" is information. • "Nobody owes me that much money" is knowledge. • "I'd better talk to the bank before I spend it, because of what has happened to other people" is intelligence. • Source: http://foldoc.doc.ic.ac.uk

Knowledge Discovery in Databases (KDD) • The nontrivial process of identifying valid, novel, potentially useful and, ultimately, understandable patterns in data. • Involves the following steps: • understanding the application domain and defining the goals • selecting the target data set • data cleaning and pre-processing • data reduction and projection • choosing the function of data modelling and the algorithm • data mining • interpretation • evaluation and utilization of the discovered knowledge • [Figure: Data sources → Select/preprocess → Transform (Feature Selection) → Data mining (Models) → Interpret/Evaluate/Assimilate (Extracted information, Knowledge); the first stages are labelled Data preparation]

Predictive Data Mining • The processes of data classification/ regression having the goal to obtain predictive models for a specific target, based on predictive relationships among large number of input variables. • Classification identifies characteristics of data and identifies a data item as member of one of several predefined categorical classes. • Regression uses the existing numerical data values and maps them to a real valued prediction (target) variable.
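The difference between the two tasks can be made concrete with a minimal sketch. It assumes a k-nearest-neighbour learner (chosen here only for brevity) and entirely hypothetical two-descriptor compounds; none of the numbers come from the datasets in this talk. Classification returns one of the predefined class labels, while regression returns a real-valued prediction.

```python
from collections import Counter
import math


def k_nearest(train, x, k):
    """Return the k (features, target) pairs whose features are closest to x."""
    return sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]


def knn_classify(train, x, k=3):
    """Classification: the most common class label among the k nearest neighbours."""
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]


def knn_regress(train, x, k=3):
    """Regression: the mean numeric target of the k nearest neighbours."""
    values = [value for _, value in k_nearest(train, x, k)]
    return sum(values) / len(values)


# Hypothetical compounds described by two made-up descriptors.
features = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
            (0.90, 0.80), (0.80, 0.90), (0.85, 0.85)]
class_targets = ["low", "low", "low", "high", "high", "high"]   # categorical
numeric_targets = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]             # real-valued

class_data = list(zip(features, class_targets))
reg_data = list(zip(features, numeric_targets))

query = (0.12, 0.18)
predicted_class = knn_classify(class_data, query)   # a class label
predicted_value = knn_regress(reg_data, query)      # a real number
```

The same input descriptors feed both functions; only the type of the target variable, and therefore the way neighbour targets are aggregated, changes.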

Machine Learning Applications in Data Mining Dynamics (ISI Thomson Web of Knowledge) • References to Machine Learning techniques with applications in Predictive Data Mining • [Chart; surviving series labels: GAs, RI, k-NN]

Multi-Classifier Systems • Different classifiers potentially offer complementary or at least additional information about patterns to be classified • Various approaches to classifier combinations: • Majority voting [4] • Entropy-based combination [5] • Dempster-Shafer theory-based combination [6], [7] • Bayesian classifier combination [8] • Similarity-based classifier combination [9] • Fuzzy inference [10] • Gating networks [11] • Statistical models [2]
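As an illustration of the simplest approach in the list above, majority voting can be sketched in a few lines. This is a generic sketch, not the formulation of reference [4]; the tie-breaking rule (first label reaching the top count, in classifier order) is an assumption made here.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    `predictions` holds one label per classifier. Ties are broken in favour
    of the tied label that appears first in the list (an assumption).
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


votes = ["toxic", "non-toxic", "toxic"]   # hypothetical outputs of three classifiers
combined = majority_vote(votes)
```

Every classifier has the same weight here, which is exactly what class-wise combination schemes try to improve on.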

The Proposed Effective Combination Scheme • We propose a hybrid classifier combination scheme which makes use of class-wise expertise of diverse classifiers – a priori knowledge obtained from the training set - to achieve potentially better performance. • 2 Operators proposed:

Architecture of the Effective Multiple Classifier System • [Figure: training and testing data pass through Data Pre-processing to the classifiers A1, A2, A3, …, Am. A test instance x is first given to the Best Model for Class 1; if x is classified as C1, that is the output. If not, x is passed to the Best Model for Class 2 (output if x is classified as C2), and so on up to the Best Model for Class L. Otherwise, x is classified by the Best Model for All Classes.]
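The cascade in the architecture figure can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: class-wise expertise is measured here as per-class precision on the training set, the overall best model is chosen by training accuracy, and classes are tried in the order given. All function names and the toy classifiers are hypothetical.

```python
def class_precision(y_true, y_pred, cls):
    """Share of predictions of `cls` that are correct (0.0 if never predicted)."""
    hits = [t == cls for t, p in zip(y_true, y_pred) if p == cls]
    return sum(hits) / len(hits) if hits else 0.0


def accuracy(y_true, y_pred):
    """Share of all predictions that are correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def build_combination(classifiers, X_train, y_train, classes):
    """Pick the best model per class and the best model over all classes.

    `classifiers` maps a name to a callable x -> label.
    """
    preds = {name: [clf(x) for x in X_train] for name, clf in classifiers.items()}
    experts = {
        cls: max(preds, key=lambda n: class_precision(y_train, preds[n], cls))
        for cls in classes
    }
    fallback = max(preds, key=lambda n: accuracy(y_train, preds[n]))
    return experts, fallback


def classify(x, classifiers, experts, fallback, classes):
    """Ask each class expert in turn; otherwise use the overall best model."""
    for cls in classes:
        if classifiers[experts[cls]](x) == cls:
            return cls
    return classifiers[fallback](x)


# Toy one-dimensional data and three hypothetical base classifiers.
X_train = [0, 1, 2, 3, 4, 5]
y_train = ["a", "a", "a", "b", "b", "b"]
classifiers = {
    "strict_a": lambda x: "a" if x < 2 else "b",   # only says "a" when sure
    "strict_b": lambda x: "b" if x > 3 else "a",   # only says "b" when sure
    "always_a": lambda x: "a",
}
classes = ["a", "b"]
experts, fallback = build_combination(classifiers, X_train, y_train, classes)
label_low = classify(0, classifiers, experts, fallback, classes)
label_high = classify(5, classifiers, experts, fallback, classes)
```

In this toy setting each class gets a different expert, and an instance that no expert claims falls through to the overall best model, mirroring the "Otherwise" branch of the figure.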

Model construction algorithm

Classification Algorithm

ML applications for Predictive Toxicology • The EC proposal for the REACH regulation indicates that the information requirements under REACH can be (partially) fulfilled by using scientifically valid (Q)SAR models. • To guide the validation of computer-based methods, five OECD principles for the validation of (Quantitative) Structure-Activity Relationships were adopted: • a defined endpoint • an unambiguous algorithm • a defined domain of applicability • appropriate measures of goodness-of-fit, robustness and predictivity • a mechanistic interpretation, if possible

Datasets (1) • DEMETRA* • LC50 96h Rainbow Trout acute toxicity (ppm) • 282 compounds • EC50 48h Water Flea acute toxicity (ppm) • 264 compounds • LD50 14d Oral Bobwhite Quail (mg/kg) • 116 compounds • LC50 8d Dietary Bobwhite Quail (ppm) • 123 compounds • LD50 48h Contact Honey Bee (μg/bee) • 105 compounds • *http://www.demetra-tox.net

Datasets (2) • CSL APC* Datasets • 5 endpoints • A single endpoint/descriptor set used for our experiments • Mallard Duck • LD50 toxicity value • 60 organophosphates • 248 descriptors *http://www.csl.gov.uk

Datasets (3) • TETRATOX*/LJMU** Dataset • Tetrahymena pyriformis • inhibition of growth IGC50 • Phenols data • 250 phenolic compounds • 187 descriptors • *http://www.vet.utk.edu/tetratox/ • **http://www.ljmu.ac.uk

Descriptors • Multiple descriptor types • Various software packages to calculate 2D and 3D attributes* • *http://www.demetra-tox.net

Model Library • Algorithms chosen for their representativeness and diversity, and for being easy, simple and fast to access: • Instance-based Learning algorithm (IBL) • Decision Tree learning algorithm (DT) • Repeated Incremental Pruning to Produce Error Reduction (RIPPER) • Multi-Layer Perceptrons (MLPs) • Support Vector Machine (SVM)

[Figure: Dataset One, Dataset Two, Dataset Three and Dataset Four each pass through Feature Selection and then Algorithms; labelled outputs are Model, Parameter file and Results file; a further label reads Dimensionality]

Organisation • Source: CSL, DEMETRA, TETRATOX/LJMU • Endpoint/Descriptors: APC Mallard_Duck, Trout, Water Flea, Oral Quail, Dietary Quail, Bee, PHENOLS • Feature Selection: CFS, Chi, CS, GR, IG, ReliefF, SVM, KNNMFS, Raw • File Type: Feature Subsets, Models, Parameters, Results • Files: Model 1, Model 2, Model 3, …, Model n
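One way to picture this organisation is as a directory hierarchy with one level per heading. The sketch below is purely illustrative: the slides do not say how the library is stored, so the folder layout, the root name `model_library`, the function `library_path` and the example file name are all assumptions.

```python
from pathlib import PurePosixPath

# Level values taken from the Organisation slide.
SOURCES = {"CSL", "DEMETRA", "TETRATOX_LJMU"}
FEATURE_SELECTION = {"CFS", "Chi", "CS", "GR", "IG", "ReliefF", "SVM", "KNNMFS", "Raw"}
FILE_TYPES = {"Feature Subsets", "Models", "Parameters", "Results"}


def library_path(root, source, endpoint, feature_selection, file_type, filename):
    """Build Source/Endpoint/FeatureSelection/FileType/File under `root`.

    Raises ValueError when a level uses a value that is not on the slide.
    """
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source}")
    if feature_selection not in FEATURE_SELECTION:
        raise ValueError(f"unknown feature selection method: {feature_selection}")
    if file_type not in FILE_TYPES:
        raise ValueError(f"unknown file type: {file_type}")
    return PurePosixPath(root) / source / endpoint / feature_selection / file_type / filename


# Hypothetical file name; the real naming scheme is not given in the slides.
example = library_path("model_library", "DEMETRA", "Trout", "CFS", "Models", "model_1.bin")
```

A fixed hierarchy of this kind gives every model, parameter file and result one predictable location, which is the "formal structure" benefit named in the overview.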

Comparison of performance of combination schemes on seven data sets • MCS: • Majority Voting-based Combination (MVC) • Maximal Probability-based Combination (MPC) • Average Probability-based Combination (APC) • Classifier Combination based on Dempster Rule of Combination (DRC) • CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers)
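Two of the baseline schemes named above, MPC and APC, can be sketched under their usual textbook definitions. The exact formulations used in the experiments are not shown on the slides, so treat this as an assumption; the probability values are hypothetical.

```python
def average_probability(prob_dists):
    """APC: average each class probability over all classifiers, take the largest."""
    classes = list(prob_dists[0])
    avg = {c: sum(d[c] for d in prob_dists) / len(prob_dists) for c in classes}
    return max(avg, key=avg.get)


def maximal_probability(prob_dists):
    """MPC: take the class that receives the single highest probability."""
    classes = list(prob_dists[0])
    best = {c: max(d[c] for d in prob_dists) for c in classes}
    return max(best, key=best.get)


# Hypothetical class-probability outputs of three classifiers for one compound.
outputs = [
    {"toxic": 0.95, "non-toxic": 0.05},
    {"toxic": 0.30, "non-toxic": 0.70},
    {"toxic": 0.20, "non-toxic": 0.80},
]
apc_label = average_probability(outputs)
mpc_label = maximal_probability(outputs)
```

On this input the two rules disagree: one very confident classifier decides the MPC result, while APC follows the two moderately confident ones.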

Conclusions • The proposed combination scheme CSCEDC (Combination Scheme based on Class-wise Expertise of Diverse Classifiers): • not only makes use of the expertise of best individual classifiers • but removes their negative influences as well • therefore results presented previously show significant improvement of global performance

Acknowledgements • This work is part-funded by: • EPSRC GR/T02508/01: Predictive Toxicology Knowledge Representation and Processing Tool based on a Hybrid Intelligent Systems Approach • http://pythia.inf.brad.ac.uk/ • EU FP5 Quality of Life DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of pesticide Residues in Agriculture • http://www.demetra-tox.net • Special thanks also to: • Dr. Q. Chaudhry (CSL York) • Dr. Mark Cronin (LJMU) • and PhD students: • Ms. Ladan Malazizi, BSc, PhD student • Research Theme: Development of Artificial Intelligence-based in-silico toxicity models for use in pesticide risk assessment • Mr. Paul Trundle, BSc, PhD student • Research Theme: Hybrid Intelligent Systems applied to predict Pesticide Toxicity • Ms. Areej Shhab, BEng, MPhil • Research Theme: Applications of Machine Learning in Knowledge Discovery and Data Mining • Mr. M. Craciun (University of Galati), BSc, MSc
