
Data Mining


Presentation Transcript


  1. Data Mining. Pat Talbot, Ryan Sanders, Dennis Ellis. 11/14/01

  2. Background
• 1998: Scientific Data Mining at the JNIC (CRAD)
  • Tools: MineSet 2.5 was used. Trade study => Clementine scored highest, but $$$
  • Training: Knowledge Discovery in Databases conference (KDD ’98)
  • Database: Model Reuse Repository with 325 records, 12 fields of metadata
  • Results: “Rules Visualizer” algorithm provided “prevalence” and “predictability”; clustering algorithm showed correlation of fields
• 1999: Wargame 2000 Performance Patterns (CRAD)
  • Tools: MineSet 2.5 was again used
  • Training: KDD ’99 tutorials and workshops
  • Database: numerical data from Wargame 2000 performance benchmarks
  • Results: insight into the speedup obtainable from a parallel discrete event simulation
• 2000: Strategic Offense/Defense Integration (IR&D)
  • Tools: Weka freeware from the University of Waikato, New Zealand
  • Training: purchased the Weka text, Data Mining by Ian Witten and Eibe Frank
  • Databases: STRATCOM Force Readiness, heuristics for performance metrics
  • Results: rule induction tree provided tabular understanding of structure
• 2001: Offense/Defense Integration (IR&D)
  • Tools: Weka with GUI; also evaluated Oracle’s Darwin
  • Database: Master Integrated Data Base (MIDB), 300 records, 14 fields (uncl)
  • Results: visual display of easy-to-understand IF-THEN rules in a tree structure; performance benchmarking of file sizes and execution times

  3. Success Story
• Contractual Work
  • Joint National Integration Center (1998 – 1999)
  • Under the Technology Insertion Studies and Analysis (TISA) Delivery Order
  • Results: “Rules Visualizer” algorithm provided “prevalence” and “predictability”; clustering algorithm showed correlation of fields
• Importance:
  • Showed practical applications of the technology
  • Provided training and attendance at data mining conferences
  • Attracted the attention of the JNIC Chief Scientist, who used it for analysis

  4. Data Mining Techniques
Uses algorithmic* techniques for information extraction:
• Rule Induction
• Neural Networks
• Regression Modeling
• K Nearest Neighbor Clustering
• Radial Basis Functions
Data depth vs. discovery method:
• Shallow data: discover with SQL
• Multi-dimensional data: discover with OLAP
• Hidden data: discover with Weka (preferred when an explanation is required)
• Deep data: discover only with clues
* Non-parametric statistics, machine learning, connectionist methods
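As a minimal illustration of one technique listed above, a k-nearest-neighbor classifier can be sketched in a few lines of plain Python. This is an illustrative toy, not the MineSet or Weka implementation, and the feature vectors and labels are invented:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D feature vectors labeled with a site status
train = [((0.0, 0.0), "NOP"), ((0.1, 0.2), "NOP"), ((0.2, 0.1), "NOP"),
         ((1.0, 1.0), "OPR"), ((0.9, 1.1), "OPR"), ((1.1, 0.9), "OPR")]
print(knn_classify(train, (0.95, 1.0)))  # the 3 nearest neighbors are all "OPR"
```

Unlike rule induction, k-NN gives no explanation of its answer, which is why the slide singles out explanation-producing methods.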

  5. Data Mining Process
(Process diagram: Situation Assessment → Target Data → Preprocessing & Cleaning → Transformation & Reduction → Data Mining → Patterns & Models → Visualization / Evaluation → Knowledge)
• Data Sources: STRATCOM data stores, requirements database, simulation output, MIDB
• Storage/Retrieval Mechanisms: database, datamart, warehouse
• Indexing: hypercube, multicube

  6. Data Mining Flow Diagram
(Flow: External Inputs → Input Processing → Output)
• External inputs: data file; defaults for algorithm choice, output parameters, and control parameters
• Processing: Weka data mining software (Quinlan rule induction tree, clustering algorithm)
• Output: rule tree / clusters exposing underlying structure, patterns, and hidden problems (consistency, missing data, corrupt data, outliers, exceptions, old data)
• Applications: Force Readiness, link to effects
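The "hidden problems" in the output (missing data, outliers, and so on) are usually screened in a preprocessing pass before mining. A hedged sketch of such a filter, assuming records are dicts and using an invented analyst-supplied valid range as the outlier rule:

```python
def screen(records, field, lo, hi):
    """Partition records into clean rows and flagged rows.

    A row is flagged when `field` is missing or falls outside the
    [lo, hi] range supplied by the analyst (an illustrative outlier rule).
    """
    clean, flagged = [], []
    for rec in records:
        value = rec.get(field)
        (clean if value is not None and lo <= value <= hi else flagged).append(rec)
    return clean, flagged

# Invented latitude values; 33-43 N is a plausible range for the sites discussed later
rows = [{"lat": 40.0}, {"lat": 39.5}, {"lat": None}, {"lat": 400.0}]
good, bad = screen(rows, "lat", 33.0, 43.0)
print(len(good), len(bad))  # 2 2
```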

  7. Current Objectives
• Automated assessment and prediction of threat activity
  • Discover patterns in threat data
  • Predict future threat activity
• Use data mining to construct Dempster-Shafer Belief Networks
• Integrate with Conceptual Clustering, Data Fusion, and Terrain Reasoning
(Diagram: terrain reasoning contributes geographic data on routes; conceptual clustering derives new concepts and patterns from threat movements; data mining induces IF-THEN rules relating attributes to outcomes; data fusion accumulates evidence into belief in hypotheses; all feed hypothesis impact.)
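Dempster-Shafer belief networks rest on Dempster's rule of combination. A compact sketch of that rule, assuming mass functions are dicts keyed by frozenset focal elements over a frame such as {OPR, NOP} (the mass values below are invented):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule, normalizing out conflict."""
    combined, conflict = {}, 0.0
    for (a, p), (b, q) in product(m1.items(), m2.items()):
        meet = a & b
        if meet:
            combined[meet] = combined.get(meet, 0.0) + p * q
        else:
            conflict += p * q  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

OPR, THETA = frozenset({"OPR"}), frozenset({"OPR", "NOP"})
m1 = {OPR: 0.6, THETA: 0.4}   # evidence from one source
m2 = {OPR: 0.5, THETA: 0.5}   # evidence from another
print(dempster_combine(m1, m2))  # mass on OPR rises to 0.8
```

Agreeing sources reinforce each other: each alone supports OPR at 0.5-0.6, but combined they assign it 0.8.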

  8. Rule Induction Tree Format
• Example: rule induction tree for weather data
Database:
  #  Outlook  Humidity  Winds  Temp  Deploy
  1  Sunny    51%       True   78    Yes
  2  Rainy    70%       False  98    No
  .
  n
Decision supported: weather influence on asset deployment (“Deploy Asset?”).
Leaf annotation “no (3.0/2.0)”: 3 met the rule, 2 did not. Attributes the tree ignores had no effect!
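The split criterion behind a rule induction tree like this can be sketched briefly (C4.5 actually uses gain ratio; plain information gain is shown here for brevity, over an invented four-row weather table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attrs, target):
    """Return the attribute whose split yields the highest information gain."""
    base = entropy([r[target] for r in rows])
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder
    return max(attrs, key=gain)

rows = [{"Outlook": "Sunny", "Windy": True,  "Deploy": "Yes"},
        {"Outlook": "Sunny", "Windy": False, "Deploy": "Yes"},
        {"Outlook": "Rainy", "Windy": True,  "Deploy": "No"},
        {"Outlook": "Rainy", "Windy": False, "Deploy": "No"}]
print(best_attribute(rows, ["Outlook", "Windy"], "Deploy"))  # Outlook
```

Outlook splits the toy data perfectly (gain 1.0 bit) while Windy gains nothing, so the tree branches on Outlook first and Windy "had no effect".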

  9. Results: READI Rule Tree
Reference: Talbot, P., TRW Technical Review Journal, Spring/Summer 2001, pp. 92-93.
Quinlan C4.5 classifier:
• Links are IFs
• White nodes are THENs
• Yellow nodes are C-ratings
(Tree diagram: splits on Overall Training, Mission Capable, Authorized Platforms, Platforms Ready, Category, Comments, Edited, ICBM, Spares, and Training, with leaves rated C-0 through C-5 and occurrence counts such as 4, 16, 48, 60, and 564.)
Example rule: in the “Overall Training” database, if C-1 training was received, then all but 4 squadrons are Mission Capable and all but 2 are then Platform Ready. If Platform Ready, 564 are then rated C-1.

  10. Results: READI Clustering
• STRATCOM Force Readiness Database
• 1136 instances
(Cluster plot: readiness ratings C-1, C-2, and C-3 arranged along Training and Category axes.)

  11. Results: Heuristics Rule Tree
Six-variable damage expectancy tree over variable types SVbr, ARbr, ARrb, DMbr, DMrb, and SVrb, splitting on weapon type (conventional/nuclear), intent (destroy, defend, deny, preempt, retaliate), and arena (strategic/tactical); leaves carry percentages (10%–100%) with occurrence/miss counts.
Example rule: IF variable type is DMbr AND weapon type is conventional THEN DMbr = 80%, which occurs 8 times and does not occur 5 times.

  12. Results: MIDB – 1
MIDB data table: difficult to see patterns!
10001001012345,TI5NA,80000,DBAKN12345,000001,KN,A,OPR,40000000N,128000000E,19970101235959,0290,SA-2
10001001012346,TI5NA,80000,DBAKN12345,000002,KN,A,OPR,39500000N,127500000E,19970101225959,0290,SA-2
10001001012347,TI5CA,80000,DBAKN12345,000003,KN,A,OPR,39400000N,127400000E,19970101215959,0290,SA-3
. . .
MIDB rule tree: easy to see patterns! Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
• if SA-2 and lat <= 39.1, then 3 are NOP
• if SA-2 and lat > 39.1, then 9 are OPR
• if SA-3 and lat <= 38.5, then 3 are OPR
• if SA-3 and lat > 38.5, then 9 are NOP
• if SA-13, then 6 are NOP

  13. Results: MIDB 300 Records
• Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
• if lat > 39.4, then 59 are OPR, 2 aren’t
• if lat <= 39.4 and lon > 127.2, then 56 are NOP, 2 aren’t
• if lon <= 127.2 and lat > 39.1, then 31 are OPR, 2 aren’t
• if lon <= 127.2 and lat <= 39.1, then:
  • SA-2: 30 are NOP
  • SA-3 and lat <= 38.5: 30 are OPR, 1 isn’t
  • SA-3 and lat > 38.5: 30 are NOP, 1 isn’t
  • SA-13 and lat <= 36.5: 2 are OPR
  • SA-13 and lat > 36.5: 60 are OPR, 6 aren’t
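Because the induced rules are mutually exclusive, they can be re-expressed directly as executable code, which is what "quantifies rules in executable form" means in practice. A sketch returning the majority class at each leaf (the function name and encoding are illustrative, not from the original tool output):

```python
def sam_status(site_type, lat, lon):
    """Majority-class prediction from the induced MIDB rule tree (illustrative)."""
    if lat > 39.4:
        return "OPR"                            # 59 OPR, 2 exceptions
    if lon > 127.2:
        return "NOP"                            # 56 NOP, 2 exceptions
    if lat > 39.1:
        return "OPR"                            # 31 OPR, 2 exceptions
    # below here: lon <= 127.2 and lat <= 39.1
    if site_type == "SA-2":
        return "NOP"                            # 30 NOP
    if site_type == "SA-3":
        return "OPR" if lat <= 38.5 else "NOP"  # 30/1 either way
    if site_type == "SA-13":
        return "OPR"                            # OPR majority at both leaves
    return "UNKNOWN"

print(sam_status("SA-2", 40.0, 128.0))  # OPR
print(sam_status("SA-3", 38.0, 127.0))  # OPR
```

Running every record through such a function also flags the exceptions ("2 aren’t") for the outlier and corrupt-data checks mentioned later.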

  14. Current Work: SUBDUE
• Hierarchical conceptual clustering
• Structured and unstructured data
• Clusters attributes with graphs
• Hypothesis generation
• Example discovered subclass 7: SA-13s (29) at (34.3 N, 129.3 E) are not operational

  15. Applicability
• Benefits:
  • Automatically discovers patterns in data
  • Resulting rules are easy to understand in “plain English”
  • Quantifies rules in executable form
  • Explicitly picks out corrupt data, outliers, and exceptions
  • Graphical user interface allows easy understanding
• Example Uses:
  • Database validation
  • 3-D sortie deconfliction
  • Determine trends in activity
  • Find hidden structure and dependencies
  • Create or modify belief networks

  16. Lessons Learned
• Data sets: choose one that you understand; this makes cleaning, formatting, default parameter settings, and interpretation much easier
• Background: knowledge of non-parametric statistics helps determine which patterns are statistically significant
• Tools: many are just fancy GUIs with database query and plot functionality. Most are overpriced ($100K/seat for high-end tools for mining business data)
• Uses: a new one is discovered in every task, e.g., checking the consistency and completeness of rules. May be useful for organizing textual evidence
• Algorithms: must provide understandable patterns; some algorithms do not!
• Integration: challenging to interface these inductive and abductive methods with deductive methods like belief networks

  17. Summary
• TRW has many technical people in Colorado with data mining experience
  • Hands-on with commercial and academic tools
• Interesting and useful results have been produced
  • Patterns in READI and MIDB using rule induction algorithms
  • Outliers, corrupt data, and exceptions are flagged
  • Novel uses demonstrated, such as checking consistency and completeness of rule sets
• Lessons learned have been described
• Good starting point for future work
  • Challenge is interfacing data mining algorithms with others
