Applications of Data Mining for Predicting Material Properties

Vipin Kumar, University of Minnesota, kumar@cs.umn.edu, www.cs.umn.edu/~kumar. Research supported by NSF and ARL.

Why Data Mining? Commercial Viewpoint • Lots of data is being collected and warehoused • Web data • Yahoo! collects 10GB/hour • Purchases at department/grocery stores • Walmart records 20 million transactions per day • Bank/Credit Card transactions • Computers have become cheaper and more powerful • Competitive pressure is strong • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Data Mining? Scientific Viewpoint • Data collected and stored at enormous speeds (GB/hour) • Remote sensors on a satellite • NASA EOSDIS archives over 1 petabyte of earth science data per year • Telescopes scanning the skies • Sky survey data • Gene expression data • Scientific simulations • Terabytes of data generated in a few hours • Data mining may help scientists • in automated analysis of massive data sets • in hypothesis formation

NASA ESE questions: • How is the global Earth system changing? • What are the primary forcings? • How does the Earth system respond to natural and human-induced changes? • What are the consequences of changes in the Earth system? • How well can we predict future changes?

Data Mining for Climate Data • Global snapshots of values for a number of variables on land surfaces or water

High Resolution EOS Data • EOS satellites provide high resolution measurements • Finer spatial grids • A 1 km × 1 km grid produces 694,315,008 data points • Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size • More frequent measurements • Multiple instruments • High resolution data allows us to answer more detailed questions: • Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties • Finding relationships between leaf area index (LAI) and topography of a river drainage basin • Finding relationships between fire frequency and elevation as well as topographic position • Leads to substantially higher computational and memory requirements

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS • NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years. http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html

Detection of Ecosystem Disturbances • The Disturbance Viewer interactive module displays the locations on the Earth's surface where significant disturbance events have been detected.

Data Mining for Cyber Security • Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks • Traditional Intrusion Detection Systems (IDS) have well-known limitations • Too many false alarms • Unable to detect sophisticated and novel attacks • Unable to detect insider abuse/policy abuse • Data Mining is well suited to address these challenges

MINDS – Minnesota Intrusion Detection System • Large Scale Data Analysis is needed for • Correlation of suspicious events across network sites • Helps detect sophisticated attacks not identifiable by single site analyses • Analysis of long term data (months/years) • Uncover suspicious stealth activities (e.g. insiders leaking/modifying information) • Incorporated into the Interrogator architecture at the ARL Center for Intrusion Monitoring and Protection (CIMP) • Helps analyze data from multiple sensors at DoD sites around the country • Routinely detects Insider Abuse / Policy Violations / Worms / Scans

Data Mining for Biomedical Informatics • Recent technological advances are helping to generate large amounts of both medical and genomic data • High-throughput experiments/techniques • Gene and protein sequences • Gene-expression data • Biological networks and phylogenetic profiles • Electronic Medical Records • IBM-Mayo Clinic partnership has created a DB of 5 million patients • Single Nucleotide Polymorphisms (SNPs) • Data mining offers a potential solution for analysis of large-scale data • Automated analysis of patient histories for customized treatment • Prediction of the functions of anonymous genes • Identification of putative binding sites in protein structures for drug/chemical discovery • Figure: Protein interaction network

Origins of Data Mining • Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems • Traditional techniques may be unsuitable due to • Enormity of data • High dimensionality of data • Heterogeneous, distributed nature of data • Figure: Data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems

Data Mining Tasks • Clustering • Predictive Modeling • Anomaly Detection • Association Rules

QSAR: Quantitative Structure-Activity Relationship • QSAR is the process by which chemical structure is quantitatively correlated with a well defined process, such as biological activity or chemical reactivity. (Wikipedia) http://en.wikipedia.org/wiki/Quantitative_structure-activity_relationship • Long history, but growing importance with the expansion of the pharmaceutical industry and advances in the life sciences http://media.wiley.com/product_data/excerpt/03/04712709/0471270903.pdf • Mostly focused on mathematical and statistical models, but machine learning and data mining techniques are increasingly being applied, particularly in the areas of prediction and molecule mining.

Some Applications of Data Mining to Chemical Informatics George Karypis, University of Minnesota • Computationally efficient algorithms to mine large databases of molecular graphs and identify key substructures present in active (inactive) compounds. • Sophisticated feature selection and generation algorithms that identify and synthesize substructure-based features that simultaneously simplify the representation of the original compounds while retaining and exposing their key features. • Kernel-based clustering and classification approaches that take into account the relationships between these substructures at different levels of granularity and complexity.

Some Applications of Data Mining to Chemical Informatics George Karypis, University of Minnesota • Example: development of efficient algorithms to find frequent substructures in molecular graphs (either topological or geometric). The topological version of this algorithm, called FSG, is currently available as part of the pattern discovery toolkit PAFI, which can be downloaded from http://glaros.dtc.umn.edu/gkhome/pafi/overview.
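The idea of a "frequent substructure" can be illustrated with a minimal sketch. This is not FSG: it only covers the simplest case, counting which single labeled bonds occur in at least a given number of molecules. The molecule encoding (a list of `(atom, atom, bond_order)` tuples), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def canonical_edge(atom_a, atom_b, bond_order):
    """Order the two atom labels so that C-N and N-C count as the same bond."""
    low, high = sorted((atom_a, atom_b))
    return (low, bond_order, high)

def frequent_edges(molecules, min_count):
    """Return the labeled bonds that appear in at least min_count molecules.

    Support is counted once per molecule, no matter how often the bond
    occurs inside that molecule.
    """
    support = Counter()
    for bonds in molecules:
        present = {canonical_edge(a, b, order) for a, b, order in bonds}
        support.update(present)
    return {edge: n for edge, n in support.items() if n >= min_count}

# Hypothetical toy molecules, each given as a list of labeled bonds.
toy_molecules = [
    [("C", "N", 1), ("N", "O", 2), ("N", "O", 1)],
    [("C", "C", 2), ("N", "C", 1), ("O", "N", 2)],
    [("C", "C", 1), ("C", "H", 1)],
]

frequent = frequent_edges(toy_molecules, min_count=2)
print(frequent)
```

A full frequent-subgraph miner would grow these single bonds into larger candidate substructures and recount their support; that step is omitted here.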

Case Study: Predicting Chemical Properties • Goal is to predict chemical properties of interest. • Sensitivity of energetic materials • Properties of stealth coatings • Conformational and reactive site variations in proteins • Information available for prediction includes • Results from computational chemistry programs, e.g., GAUSSIAN. • Electrostatic potential and electron density • Vibrational frequencies and bond energies • Chemical composition • 3D structure • Bond lengths, bond angles, etc. • Electron scattering cross section • Approach is to apply data mining techniques. • Figures: electron density iso-surface; electron density iso-surface colored by electrostatic potential. • Data on 34 compounds provided by Dr. Betsy Rice, ARL.

Approach Based on Computational Chemistry Data
Features computed on the electron density iso-surface:
• Surface Area: Surface area of the iso-surface. Computed from the physical positions of the points on the iso-surface.
• Average Positive Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points with positive potential.
• Average Negative Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points with negative potential.
• Average All Electrostatic Potential: Average value of the electrostatic potential for all iso-surface points.
• Most Positive Electrostatic Potential: Maximum value of the electrostatic potential for all iso-surface points with positive potential.
• Most Negative Electrostatic Potential: Minimum value of the electrostatic potential for all iso-surface points with negative potential.
• Sig Positive: Variance of all iso-surface points with positive potential.
• Sig Negative: Variance of all iso-surface points with negative potential.
• SIG(TOT): Sig Positive + Sig Negative.
• PI: The average "absolute" potential for all points on the iso-surface, after subtracting Average All Electrostatic Potential.
• BALANCE
Data on 34 compounds provided by Dr. Betsy Rice, ARL.
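The potential-based features above can be sketched in a few lines, assuming the electrostatic potential has already been sampled at a list of iso-surface points. The function name, the dictionary keys, the use of population variance, and the sample values are assumptions for illustration. Surface Area is left out because it needs point positions, and BALANCE is left out because its definition is not given here.

```python
from statistics import mean, pvariance

def surface_descriptors(potentials):
    """Compute potential-based descriptors from sampled iso-surface values.

    Assumes at least one positive and one negative sample; population
    variance is an assumption, the source does not say which variance is used.
    """
    positive = [v for v in potentials if v > 0]
    negative = [v for v in potentials if v < 0]
    avg_all = mean(potentials)
    sig_pos = pvariance(positive)
    sig_neg = pvariance(negative)
    return {
        "avg_positive": mean(positive),
        "avg_negative": mean(negative),
        "avg_all": avg_all,
        "most_positive": max(positive),
        "most_negative": min(negative),
        "sig_positive": sig_pos,
        "sig_negative": sig_neg,
        "sig_total": sig_pos + sig_neg,
        # PI: mean absolute deviation from the overall average potential.
        "pi": mean([abs(v - avg_all) for v in potentials]),
    }

# Hypothetical potential samples, not taken from the 34-compound data set.
descriptors = surface_descriptors([2.0, 4.0, -1.0, -3.0])
print(descriptors)
```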

Visualization: Sensitivity vs. Balance

Visualization: Avg. Neg. Pot. vs. Balance vs. Sensitivity (circle size proportional to sensitivity)

Approach Based on Frequent Substructures • Frequent substructures were found using FSG, a Frequent Subgraph Discovery program by M. Kuramochi and G. Karypis. • Figure labels: threshold t; atom legend H, C, N, O.

Example Decision Tree Based on Substructures • Each branch of the tree corresponds to a rule. • If substr_19 is present, the predicted class of this molecule is “High sensitivity”. • If substr_19 and substr_296 are not present, but substr_147 is present, the predicted class is “Low sensitivity”. • Precision = 89.3% • Figure: decision tree with tests on Substr_19, Substr_147, Substr_31, Substr_296, and Substr_298 (branches 1/0 for present/absent) and High/Low leaves.
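The two rules spelled out above can be written as a small function. This is only a sketch of those two branches, not the full tree: the remaining branches cannot be recovered from the slide, so the function returns `None` for them. The function name and the set-of-strings encoding of a molecule are assumptions.

```python
def predict_sensitivity(substructures):
    """Apply the two stated rules to the set of substructures in a molecule.

    Returns "High", "Low", or None when neither stated rule applies
    (the rest of the tree is not reproduced here).
    """
    if "substr_19" in substructures:
        return "High"
    if "substr_296" not in substructures and "substr_147" in substructures:
        return "Low"
    return None

# Hypothetical molecules described by the substructures they contain.
print(predict_sensitivity({"substr_19", "substr_31"}))    # rule 1 applies
print(predict_sensitivity({"substr_147"}))                # rule 2 applies
print(predict_sensitivity({"substr_147", "substr_296"}))  # neither rule applies
```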

Geometric Approach Based on Alpha-Shapes • An alpha-shape is a computational technique for representing 3D shape at different levels of detail (alpha > 0 is a parameter which controls the detail). • Used extensively in the analysis of molecular structure (molecular volume and surface area). • Reveals interesting features such as voids, tunnels, and pockets in a molecule’s structure. • Figure: alpha-shapes of picric acid (C6H3N3O7); panel labels: alpha = 0, 4.4172, alpha = 18.837.
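How alpha controls the level of detail can be seen in a simplified 2D sketch. It uses a brute-force test instead of a real Delaunay triangulation, and it assumes one common convention (keep a Delaunay triangle when its circumradius is at most alpha); the slides work in 3D and do not state which convention they use. All names and the four sample points are hypothetical.

```python
from itertools import combinations
from math import hypot

def circumcircle(a, b, c):
    """Return (center_x, center_y, radius), or None for collinear points."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, hypot(ax - ux, ay - uy)

def alpha_triangles(points, alpha):
    """Keep Delaunay triangles whose circumradius is at most alpha."""
    kept = []
    for tri in combinations(points, 3):
        circle = circumcircle(*tri)
        if circle is None:
            continue
        ux, uy, radius = circle
        # Delaunay test: no other input point lies strictly inside the circle.
        is_delaunay = all(
            hypot(px - ux, py - uy) >= radius - 1e-9
            for px, py in points
            if (px, py) not in tri
        )
        if is_delaunay and radius <= alpha:
            kept.append(tri)
    return kept

# Hypothetical 2D points: three corners and one interior point.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
for alpha in (0.0, 2.5, 4.0):
    print(alpha, len(alpha_triangles(points, alpha)))
```

With alpha = 0 only the bare points remain, and as alpha grows more triangles are admitted until the whole triangulation is filled in. Counting voids, as used later in the regression, would need a 3D implementation and is not attempted here.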

Regression Approach Using Multiple Sets of Features • Want to predict the impact sensitivity of each molecule, using a set of the most predictive attributes. • Different sets of attributes • Frequent substructures • Alpha-shape features, e.g., voids, etc. • Topological, chemical, and electronic descriptors • Data provided by Dr. J.M. Cense • Reference: “Prediction of the Impact Sensitivity by Neural Networks”, by H. Nefati and J.M. Cense, Advance ACS Abstracts, March 1996. • Regression techniques: Artificial Neural Networks (ANN), regression-SVM, linear least squares. • We used ANN to predict the sensitivity.
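Of the regression techniques listed, linear least squares is the simplest to show. The sketch below fits a line to a single descriptor using the closed-form solution; it is not the ANN that was actually used, and the function name and toy numbers are assumptions for illustration only.

```python
def fit_least_squares(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical descriptor values and sensitivities (not from the Cense data).
descriptor = [0.0, 1.0, 2.0, 3.0]
sensitivity = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_least_squares(descriptor, sensitivity)
predicted = [slope * x + intercept for x in descriptor]
print(slope, intercept, predicted)
```

With several descriptors the same idea applies, but the fit requires solving a linear system rather than the one-line formula above.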

Regression using topological, geometrical, electronic descriptors • Target: sensitivity • 11-attribute set (topological only): 1 4 5 6 7 12 20 23 24 25 25 • 13-attribute set: 1 2 4 6 12 13 14 17 18 20 29 30 32 • Data on 200 compounds provided by Dr. J.M. Cense

Replicated NN regression result • These are the best results that were reported in the paper by Nefati and Cense. • The predicted value is the average of the outputs of the 11-attribute and 13-attribute 2-hidden layer node NNs. • Average absolute error: 0.175 • Average squared error: 0.047
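The averaging of two model outputs and the two error measures reported above can be sketched as follows. The prediction and target values are made up for illustration and do not reproduce the reported numbers; the function names are likewise assumptions.

```python
def average_predictions(pred_a, pred_b):
    """Average the outputs of two models, element by element."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]

def average_absolute_error(predicted, actual):
    """Mean of |predicted - actual| over all compounds."""
    return sum(abs(p - t) for p, t in zip(predicted, actual)) / len(actual)

def average_squared_error(predicted, actual):
    """Mean of (predicted - actual)^2 over all compounds."""
    return sum((p - t) ** 2 for p, t in zip(predicted, actual)) / len(actual)

# Hypothetical outputs of two networks and hypothetical measured values.
net_11_attributes = [1.0, 2.0]
net_13_attributes = [2.0, 4.0]
measured = [1.0, 3.5]

combined = average_predictions(net_11_attributes, net_13_attributes)
mae = average_absolute_error(combined, measured)
mse = average_squared_error(combined, measured)
print(combined, mae, mse)
```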

An Alpha-shape feature: voids • Adding the number of voids for each molecule at alpha max improves the model (with exactly the same NN parameters as on the previous page). • Average absolute error: 0.133 • Average squared error: 0.029

Concluding Remarks • Data mining is making important contributions to data analysis in many areas of science and business. • A set of techniques and software for preprocessing, mining, and visualizing large data sets can be an important enabling cyber-infrastructure for science and engineering research.

Bibliography • Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison-Wesley, April 2005. • Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing (Second Edition), Addison-Wesley, 2003. • R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu (eds.), Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers, 2001. • J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, “Emerging Scientific Applications in Data Mining,” Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002. • Kevin W. DeRonne and George Karypis, “Effective Optimization Algorithms for Fragment-assembly based Protein Structure Prediction,” Computational Systems Bioinformatics Conference (CSB), 2006. • Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds,” IEEE Trans. Knowl. Data Eng. 17(8): 1036-1050, 2005. • Michihiro Kuramochi and George Karypis, “An Efficient Algorithm for Discovering Frequent Subgraphs,” IEEE Trans. Knowl. Data Eng. 16(9): 1038-1051, 2004. • Gayle Eherenman, “Mining What Others Miss,” Mechanical Engineering Magazine, http://www.memagazine.org/backissues/feb05/features/miningwh/miningwh.html, February 2005.
