Graphical Data Mining for Computational Estimation in Materials Science Applications

Aparna Varde
Ph.D. Dissertation, August 15, 2006
Committee Members: Prof. Elke Rundensteiner (Advisor), Prof. Carolina Ruiz, Prof. David Brown, Prof. Neil Heffernan, Prof. Richard Sisson Jr. (Head of Materials Science, WPI)
This work is supported by the Center for Heat Treating Excellence and by Department of Energy Award DE-FC-07-01ID14197.

Introduction • Scientific domains: Experiments conducted with given input conditions • Results plotted as graphs: Good visual depictions • Experimental results help in analysis: Assist decision-making • Performing experiment: Consumes time and resources

Motivating Example • Heat Treating of Materials • Controlled heating & cooling of materials to achieve mechanical & thermal properties • Performing experiments involves • One-time cost: $1000s • Recurrent costs: $100s • Time: 5 to 6 hours • Human labor • Desirable to estimate • Graphs given input conditions • Conditions to achieve given graph [Figure: CHTE Experimental Setup]

Problem Definition • To develop an estimation technique with the following goals: 1. Given input conditions in an experiment, estimate the resulting graph 2. Given a desired graph in an experiment, estimate the conditions to obtain it

Proposed Estimation Approach: AutoDomainMine

Knowledge Discovery in AutoDomainMine

Estimation of Graph in AutoDomainMine

Estimation of Conditions in AutoDomainMine

Main Tasks • Task 1: AutoDomainMine Learning Strategy of Integrating Clustering and Classification [AAAI-06 Poster, ACM SIGART’s ICICIS-05] • Task 2: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal] • Task 3: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06]

Task 2: Learning Domain-Specific Distance Metrics for Graphs

Motivation • Various distance metrics • Absolute position of points • Statistical observations • Critical features • Issues • Not known what metrics apply • Multiple metrics may be relevant • Need for distance metric learning in graphs [Figure: Example of domain-specific problem]

Proposed Distance Metric Learning Approach: LearnMet • Given • Training set with actual clusters of graphs • Additional Input • Components: distance metrics applicable to graphs • LearnMet Metric • D = ∑ w_i·D_i
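The LearnMet metric on this slide is a weighted sum of component distances. A minimal sketch of that sum follows, assuming each graph is stored as a list of y-values sampled at shared x-positions. The three component functions are hypothetical stand-ins for the position-based, statistical and critical-feature metrics named in the Motivation slide; they are not the components used in the dissertation.

```python
from math import sqrt
from typing import Callable, Sequence

Graph = Sequence[float]  # assumption: y-values at shared x-positions


def euclidean(a: Graph, b: Graph) -> float:
    """Position-based stand-in: Euclidean distance between y-values."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean_difference(a: Graph, b: Graph) -> float:
    """Statistical stand-in: absolute difference of mean y-values."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def max_point_difference(a: Graph, b: Graph) -> float:
    """Critical-feature stand-in: difference between maximum y-values."""
    return abs(max(a) - max(b))


def weighted_distance(
    a: Graph,
    b: Graph,
    components: Sequence[Callable[[Graph, Graph], float]],
    weights: Sequence[float],
) -> float:
    """Compute D = sum_i w_i * D_i(a, b) over the given components."""
    if len(components) != len(weights):
        raise ValueError("need exactly one weight per component")
    return sum(w * d(a, b) for w, d in zip(weights, components))
```

For example, `weighted_distance([0, 0, 0], [3, 4, 0], [euclidean, mean_difference, max_point_difference], [1.0, 0.0, 0.5])` combines only the first and third components, since the second has weight zero.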

Evaluate Accuracy • Use pairs of graphs • A pair (ga, gb) is • TP - same predicted, same actual cluster: (g1, g2) • TN - different predicted, different actual clusters: (g2, g3) • FP - same predicted cluster, different actual clusters: (g3, g4) • FN - different predicted, same actual clusters: (g4, g5)

Evaluate Accuracy (Contd.) • How do we compute error for whole set of graphs? • For all pairs • Error Measure • Failure Rate FR • FR = (FP+FN) / (TP+TN+FP+FN) • Error Threshold (t) • Extent of FR allowed • If (FR < t) then clustering is accurate
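The pair-based evaluation on the last two slides can be sketched as follows. The sketch assumes that predicted and actual cluster assignments are given as two parallel lists of labels, one entry per graph; the function names are illustrative only.

```python
from itertools import combinations
from typing import Dict, Sequence


def pair_counts(predicted: Sequence[int], actual: Sequence[int]) -> Dict[str, int]:
    """Classify every pair of graphs as TP, TN, FP or FN."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, b in combinations(range(len(predicted)), 2):
        same_predicted = predicted[a] == predicted[b]
        same_actual = actual[a] == actual[b]
        if same_predicted and same_actual:
            counts["TP"] += 1
        elif not same_predicted and not same_actual:
            counts["TN"] += 1
        elif same_predicted:
            counts["FP"] += 1  # same predicted, different actual
        else:
            counts["FN"] += 1  # different predicted, same actual
    return counts


def failure_rate(counts: Dict[str, int]) -> float:
    """FR = (FP + FN) / (TP + TN + FP + FN)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least two graphs to form a pair")
    return (counts["FP"] + counts["FN"]) / total


def is_accurate(fr: float, threshold: float) -> bool:
    """Clustering counts as accurate when FR is below the threshold t."""
    return fr < threshold
```

With `predicted = [0, 0, 1, 1, 2]` and `actual = [0, 0, 1, 2, 2]`, the ten pairs split into 1 TP, 7 TN, 1 FP and 1 FN, giving FR = 0.2.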

Adjust the Metric • Weight Adjustment Heuristic: for each Di • New wi = wi – sfi (DFNi/DFN + DFPi/DFP) [KDD’s MDM-05]

Evaluation of LearnMet • Details: MTAP-06 • Effect of pairs per epoch (ppe) • G = number of graphs, e.g., G = 25 • C(G, 2) = total number of pairs, e.g., C(25, 2) = 300 • Select a subset of the C(G, 2) pairs per epoch • Observations • Highest accuracy with middle range of ppe • Learning efficiency best with low ppe • Average accuracy with LearnMet 86% [Figures: Accuracy of Learned Metrics over Test Set; Learning Efficiency over Training Set]
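The pair count on this slide can be checked directly, and a per-epoch subset drawn, with a short sketch. The slide does not say how the subset is chosen; uniform random sampling without replacement is an assumption made here, and `sample_pairs` is a hypothetical helper name.

```python
import math
import random
from itertools import combinations
from typing import List, Optional, Tuple


def total_pairs(num_graphs: int) -> int:
    """Number of unordered pairs among num_graphs graphs."""
    return math.comb(num_graphs, 2)


def sample_pairs(
    num_graphs: int, ppe: int, seed: Optional[int] = None
) -> List[Tuple[int, int]]:
    """Draw ppe distinct pairs of graph indices for one epoch.

    Assumption: pairs are sampled uniformly without replacement.
    """
    pairs = list(combinations(range(num_graphs), 2))
    if not 0 < ppe <= len(pairs):
        raise ValueError("ppe must be between 1 and the total number of pairs")
    return random.Random(seed).sample(pairs, ppe)
```

For G = 25, `total_pairs(25)` gives the 300 pairs quoted on the slide.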

Task 3: Designing Semantics-Preserving Representatives for Clusters

Motivation • Different combinations of conditions could lead to a single cluster • Graphs in a cluster could have variations • Need for designing representatives that • Incorporate semantics • Avoid visual clutter • Cater to various users

Proposed Approach for Designing Representatives: DesRept

Candidates for Conditions 1. Nearest Representative • Return the set of conditions closest to all others in the cluster • Notion of distance: Domain-specific distance metric from decision tree paths [CIKM-06] [Figures: Set of conditions in Cluster A; Nearest Representative for Cluster A]

Candidates for Conditions (Contd.) 2. Summarized Representative • Build sub-clusters of conditions using domain knowledge • Return nearest sub-cluster representatives • Sort them [Figures: Cluster A; Sub-clusters within the Cluster A; Summarized Representative for Cluster]

Candidates for Conditions (Contd.) 3. Combined Representative • Return all sets of conditions • Sort them in ascending order [Figures: Cluster A; Combined Representative for Cluster A]

Candidates for Graphs 1. Nearest Representative • Select graph that is nearest neighbor for all others • Notion of distance: Domain-specific metric from LearnMet
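A minimal sketch of selecting a nearest representative follows. It reads "nearest neighbor for all others" as the graph with the smallest total distance to the rest of the cluster, which is an interpretation rather than something the slide states. Plain Euclidean distance (`math.dist`) stands in for the learned LearnMet metric.

```python
from math import dist
from typing import Callable, Sequence

Graph = Sequence[float]  # assumption: y-values at shared x-positions


def nearest_representative(
    graphs: Sequence[Graph],
    distance: Callable[[Graph, Graph], float] = dist,
) -> int:
    """Return the index of the graph with the smallest total distance
    to all graphs in the cluster (distance to itself is zero)."""
    if not graphs:
        raise ValueError("cluster is empty")
    totals = [sum(distance(g, other) for other in graphs) for g in graphs]
    return min(range(len(graphs)), key=totals.__getitem__)
```

A learned metric could be passed through the `distance` parameter in place of the Euclidean default.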

Candidates for Graphs (Contd.) 2. Medoid Representative • Select graph closest to average of all graphs • Average of y-coordinate values since x-coordinates are same
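The medoid selection described here can be sketched in two steps: average the y-values point by point, then pick the actual graph closest to that average. Euclidean distance is an assumed placeholder for the domain-specific metric.

```python
from math import dist
from typing import Callable, List, Sequence

Graph = Sequence[float]  # assumption: y-values at shared x-positions


def average_graph(graphs: Sequence[Graph]) -> List[float]:
    """Point-wise mean of the y-values (x-coordinates are shared)."""
    if not graphs:
        raise ValueError("cluster is empty")
    return [sum(ys) / len(graphs) for ys in zip(*graphs)]


def medoid_representative(
    graphs: Sequence[Graph],
    distance: Callable[[Graph, Graph], float] = dist,
) -> int:
    """Return the index of the graph closest to the average graph."""
    avg = average_graph(graphs)
    return min(range(len(graphs)), key=lambda i: distance(graphs[i], avg))
```

For the cluster `[[0, 0], [2, 2], [4, 4], [20, 20]]` the average graph is `[6.5, 6.5]`, so the graph at index 2 is selected.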

Candidates for Graphs (Contd.) 3. Summarized Representative • Construct average graph with prediction limits • Average: centroid, Prediction limits: domain-specific thresholds
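A summarized representative could be sketched as a centroid curve with a band around it. The slide says only that prediction limits come from domain-specific thresholds; a single symmetric threshold applied at every point is an assumption made here for illustration.

```python
from typing import List, Sequence, Tuple

Graph = Sequence[float]  # assumption: y-values at shared x-positions


def summarized_representative(
    graphs: Sequence[Graph], threshold: float
) -> Tuple[List[float], List[float], List[float]]:
    """Return (centroid, lower limit, upper limit) for a cluster.

    Assumption: limits are centroid -/+ one fixed threshold.
    """
    if not graphs:
        raise ValueError("cluster is empty")
    centroid = [sum(ys) / len(graphs) for ys in zip(*graphs)]
    lower = [y - threshold for y in centroid]
    upper = [y + threshold for y in centroid]
    return centroid, lower, upper
```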

Candidates for Graphs (Contd.) 4. Combined Representative • Construct superimposed graph of all graphs in cluster • Same x-values, so plot y-values on a common x-axis

Effectiveness Measure for Candidates • Minimum Description Length Principle • Theory: Representative, Examples: all items in cluster • Representative: Measure Complexity (ease of interpretation) • Complexity = log2(N) for graphs, log2(AV) for conditions • N = number of points to store representative graph • A = number of attributes for conditions • V = number of values in representative set of conditions • Examples: Measure distance of items from representative (information loss) • Distance for graphs = log2((1/G) ∑_{i=1}^{G} D(r, g_i)) • D: distance using domain-specific metric • G: total number of graphs in cluster • g_i: each graph • r: representative graph • Encoding [SIGMOD IQIS-06] • Effectiveness = UBC*Complexity + UBD*Distance • UBC, UBD: User bias % weights for complexity and distance
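The encoding on this slide can be worked through in a short sketch. It reads "log2 AV" as the base-2 logarithm of the product A·V and applies the logarithm to the whole average distance; both readings are assumptions. User bias weights are passed as fractions (0.5 for 50%), also an assumption, and under the MDL principle a lower score would presumably indicate a better candidate.

```python
from math import log2
from typing import Sequence


def graph_complexity(num_points: int) -> float:
    """Complexity = log2(N), N = points stored for the representative."""
    return log2(num_points)


def conditions_complexity(num_attributes: int, num_values: int) -> float:
    """Complexity = log2(A * V) for a representative set of conditions."""
    return log2(num_attributes * num_values)


def distance_term(distances: Sequence[float]) -> float:
    """Distance = log2((1/G) * sum_i D(r, g_i)).

    `distances` holds D(r, g_i) for each graph in the cluster.
    """
    average = sum(distances) / len(distances)
    if average <= 0:
        raise ValueError("log2 is undefined for a non-positive average")
    return log2(average)


def effectiveness(complexity: float, distance: float, ubc: float, ubd: float) -> float:
    """Effectiveness = UBC * Complexity + UBD * Distance."""
    return ubc * complexity + ubd * distance
```

For a representative stored with 8 points and distances `[2, 4, 6, 4]` to the cluster members, Complexity is 3 and Distance is 2, so equal weights of 0.5 give an Effectiveness of 2.5.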

Evaluation of DesRept: Conditions • Details • Data Set Size = 400, Number of Clusters = 20 • Observations • Overall winner is Summarized • As weight for complexity increases, Nearest wins • Designed better than Random

Evaluation of DesRept: Graphs • Details • Data Set Size = 400, Number of Clusters = 20 • Observations • Overall winner is Summarized • As weight for complexity increases, Nearest / Medoid wins • Designed better than Random

User Evaluation of AutoDomainMine System • Formal user surveys in different applications • Evaluation Process • Compare estimation with real data in test set • If they match, the estimation is accurate • Observations • Estimation accuracy around 90 to 95% [Figures: Accuracy: Estimating Conditions; Accuracy: Estimating Graphs]

Related Work • Similarity Search [HK-01, WF-00] • Non-matching conditions could be significant • Mathematical Modeling [M-95, S-60] • Existing models not applicable under certain situations • Case-based Reasoning [K-93, AP-03] • Adaptation of cases not feasible with graphs • Learning nearest neighbor in high-dimensional spaces: [HAK-00] • Focus is dimensionality reduction, do not deal with graphs • Distance metric learning given basic formula: [XNJR-03] • Deal with position-based distances for points, no graphs involved • Similarity search in multimedia databases [KB-04] • Use various metrics in different applications, do not learn a single metric • Image Rating: [HH-01] • User intervention involved in manual rating • Semantic Fish Eye Views: [JP-04] • Display multiple objects in small space, no representatives • PDA Displays in Levels of Detail: [BGMP-01] • Do not evaluate different types of representatives

Summary • Dissertation Contributions • AutoDomainMine: Integrating Clustering and Classification for Estimation [AAAI-06 Poster, ACM SIGART’s ICICIS-05] • LearnMet: Learning Domain-Specific Distance Metrics for Graphs [ACM KDD’s MDM-05, MTAP-06 Journal] • DesRept: Designing Semantics-Preserving Representatives for Clusters [ACM SIGMOD’s IQIS-06, ACM CIKM-06] • Trademarked Tool for Computational Estimation in Materials Science [ASM HTS-05, ASM HTS-03] • Future Work • Image Mining, e.g., Comparing Nanostructures • Data Stream Matching, e.g., Stock Market Analysis • Visual Displays, e.g., Summarizing Web Information

Publications
Dissertation-Related Papers
1. Designing Semantics-Preserving Representatives for Scientific Input Conditions. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In CIKM, Arlington, VA, Nov 2006.
2. Integrating Clustering and Classification for Estimating Process Variables in Materials Science. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In AAAI, Poster Track, Boston, MA, Jul 2006.
3. Effectiveness of Domain-Specific Cluster Representatives for Graphical Plots. A. Varde, E. Rundensteiner, C. Ruiz, D. Brown, M. Maniruzzaman and R. Sisson Jr. In ACM SIGMOD IQIS, Chicago, IL, Jun 2006.
4. LearnMet: Learning Domain-Specific Distance Metrics for Plots of Scientific Functions. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. Accepted in the International MTAP Journal, Springer Publications, Special Issue on Multimedia Data Mining, 2006.
5. Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM KDD MDM, Chicago, IL, Aug 2005, pp. 107-112.
6. Apriori Algorithm and Game-of-Life for Predictive Analysis in Materials Science. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In KES Journal, IOS Press, Netherlands, Vol. 8, No. 4, 2004, pp. 213-228.
7. Data Mining over Graphical Results of Experiments with Domain Semantics. A. Varde, E. Rundensteiner, C. Ruiz, M. Maniruzzaman and R. Sisson Jr. In ACM SIGART ICICIS, Cairo, Egypt, Mar 2005, pp. 603-611.

Publications (Contd.)
8. QuenchMiner: Decision Support for Optimization of Heat Treating Processes. A. Varde, M. Takahashi, E. Rundensteiner, M. Ward, M. Maniruzzaman and R. Sisson Jr. In IEEE IICAI, Hyderabad, India, Dec 2003, pp. 993-1003.
9. Estimating Heat Transfer Coefficients as a Function of Temperature by Data Mining. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Pittsburgh, PA, Sep 2005.
10. The QuenchMiner Expert System for Quenching and Distortion Control. A. Varde, E. Rundensteiner, M. Maniruzzaman and R. Sisson Jr. In ASM HTS, Indianapolis, IN, Sep 2003, pp. 174-183.
Other Papers
11. MEDWRAP: Consistent View Maintenance over Distributed Multi-Relation Sources. A. Varde and E. Rundensteiner. In DEXA, Aix-en-Provence, France, Sep 2002, pp. 341-350.
12. SWECCA for Data Warehouse Maintenance. A. Varde and E. Rundensteiner. In SCI, Orlando, FL, Jul 2002, Vol. 5, pp. 352-357.
13. MatML: XML for Information Exchange with Materials Property Data. A. Varde, E. Begley and S. Fahrenholz-Mann. In ACM KDD DM-SPP, Philadelphia, PA, Aug 2006.
14. Semantic Extensions to Domain-Specific Markup Languages. A. Varde, E. Rundensteiner, M. Mani, M. Maniruzzaman and R. Sisson Jr. In IEEE CCCT, Austin, TX, Aug 2004, Vol. 2, pp. 55-60.

Acknowledgments • First of all, my Advisor: Prof. Elke Rundensteiner • Committee: Prof. Carolina Ruiz, Prof. David Brown, Prof. Neil Heffernan • External Member: Prof. Richard D. Sisson Jr., Head of Materials Program • Director of Metal Processing Institute: Prof. Diran Apelian • Domain Expert: Dr. Mohammed Maniruzzaman • Members of Center for Heat Treating Excellence • CS Department Head: Prof. Michael Gennert • Former CS Department Head: Prof. Micha Hofri • WPI Administration (CS, Materials): In particular Mrs. Rita Shilansky • Reviewers of Conferences and Journals where my papers got accepted • Members of DSRG, AIRG, KDDRG and Quenching Research Group • Colleagues and Friends: Shuhui, Sujoy, Viren, Olly, Mariana, Rimma, Maged, Bin, Lydia, Shimin and others… • Great Thanks to my Family: Parents Dr. Sharad Varde and Dr. (Mrs.) Varsha Varde, Grandparents Mr. D.A. Varde and Mrs. Vimal Varde, Brother Ameya Varde and Sister-in-law Deepa Varde • All the attendees of my Ph.D. Defense • Finally, God for guiding me throughout my doctoral journey

Thank You
