
Data Mining


Presentation Transcript


  1. Data Mining. Pat Talbot, Ryan Sanders, Dennis Ellis. 11/14/01

  2. Background
• 1998: Scientific Data Mining at the JNIC (CRAD)
  • Tools: MineSet 2.5 was used. Trade study => Clementine scored highest, but $$$
  • Training: Knowledge Discovery in Databases conference (KDD ’98)
  • Database: Model Reuse Repository with 325 records, 12 fields of metadata
  • Results: “Rules Visualizer” algorithm provided “prevalence” and “predictability”; clustering algorithm showed correlation of fields
• 1999: Wargame 2000 Performance Patterns (CRAD)
  • Tools: MineSet 2.5 was again used
  • Training: KDD ’99 tutorials and workshops
  • Database: numerical data from Wargame 2000 performance benchmarks
  • Results: insight into the speedup obtainable from a parallel discrete event simulation
• 2000: Strategic Offense/Defense Integration (IR&D)
  • Tools: Weka freeware from the University of Waikato, New Zealand
  • Training: purchased the Weka text, Data Mining by Ian Witten and Eibe Frank
  • Databases: STRATCOM Force Readiness, heuristics for performance metrics
  • Results: rule induction tree provided tabular understanding of structure
• 2001: Offense/Defense Integration (IR&D)
  • Tools: Weka with GUI; also evaluated Oracle’s Darwin
  • Database: Master Integrated Data Base (MIDB), 300 records, 14 fields (uncl)
  • Results: visual display of easy-to-understand IF-THEN rules in a tree structure; performance benchmarking of file sizes and execution times

  3. Success Story
• Contractual Work
  • Joint National Integration Center (1998 – 1999)
  • Under the Technology Insertion Studies and Analysis (TISA) Delivery Order
  • Results: “Rules Visualizer” algorithm provided “prevalence” and “predictability”; clustering algorithm showed correlation of fields
• Importance:
  • Showed practical applications of the technology
  • Provided training and attendance at data mining conferences
  • Attracted the attention of the JNIC Chief Scientist, who used it for analysis

  4. Data Mining Techniques
Uses algorithmic* techniques for information extraction:
• Rule Induction
• Neural Networks
• Regression Modeling
• K Nearest Neighbor Clustering
• Radial Basis Functions
Data depth vs. discovery method:
• Shallow data: discover with SQL
• Multi-dimensional data: discover with OLAP
• Hidden data: discover with Weka (preferred when an explanation is required)
• Deep data: discover only with clues
* Non-parametric statistics, machine learning, connectionist methods
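As a minimal illustration of one technique listed above, a k-nearest-neighbor classifier can be sketched in a few lines of plain Python. This is an illustrative toy, not the MineSet or Weka implementation, and the feature vectors and labels are invented:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.
    `train` is a list of (feature_vector, label) pairs."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Invented 2-D feature vectors labeled with a site status
train = [((0.0, 0.0), "NOP"), ((0.1, 0.2), "NOP"), ((0.2, 0.1), "NOP"),
         ((1.0, 1.0), "OPR"), ((0.9, 1.1), "OPR"), ((1.1, 0.9), "OPR")]
print(knn_classify(train, (0.95, 1.0)))  # the 3 nearest neighbors are all "OPR"
```

Unlike rule induction, k-NN gives no explanation of its answer, which is why the slide singles out explanation-producing methods.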

  5. Data Mining Process
(Process diagram: Situation Assessment → Target Data → Preprocessing & Cleaning → Transformation & Reduction → Data Mining → Patterns & Models → Visualization / Evaluation → Knowledge)
• Data Sources: STRATCOM data stores, requirements database, simulation output, MIDB
• Storage/Retrieval Mechanisms: database, datamart, warehouse
• Indexing: hypercube, multicube

  6. Data Mining Flow Diagram
(Flow: External Inputs → Input Processing → Output)
• External inputs: data file; defaults for algorithm choice, output parameters, and control parameters
• Processing: Weka data mining software (Quinlan rule induction tree, clustering algorithm)
• Output: rule tree / clusters exposing underlying structure, patterns, and hidden problems (consistency, missing data, corrupt data, outliers, exceptions, old data)
• Applications: Force Readiness, link to effects
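The "hidden problems" in the output (missing data, outliers, and so on) are usually screened in a preprocessing pass before mining. A hedged sketch of such a filter, assuming records are dicts and using an invented analyst-supplied valid range as the outlier rule:

```python
def screen(records, field, lo, hi):
    """Partition records into clean rows and flagged rows.

    A row is flagged when `field` is missing or falls outside the
    [lo, hi] range supplied by the analyst (an illustrative outlier rule).
    """
    clean, flagged = [], []
    for rec in records:
        value = rec.get(field)
        (clean if value is not None and lo <= value <= hi else flagged).append(rec)
    return clean, flagged

# Invented latitude values; 33-43 N is a plausible range for the sites discussed later
rows = [{"lat": 40.0}, {"lat": 39.5}, {"lat": None}, {"lat": 400.0}]
good, bad = screen(rows, "lat", 33.0, 43.0)
print(len(good), len(bad))  # 2 2
```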

  7. Current Objectives
• Automated assessment and prediction of threat activity
  • Discover patterns in threat data
  • Predict future threat activity
• Use data mining to construct Dempster-Shafer Belief Networks
• Integrate with Conceptual Clustering, Data Fusion, and Terrain Reasoning
(Diagram: terrain reasoning contributes geographic data on routes; conceptual clustering derives new concepts and patterns from threat movements; data mining induces IF-THEN rules relating attributes to outcomes; data fusion accumulates evidence into belief in hypotheses; all feed hypothesis impact.)
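Dempster-Shafer belief networks rest on Dempster's rule of combination. A compact sketch of that rule, assuming mass functions are dicts keyed by frozenset focal elements over a frame such as {OPR, NOP} (the mass values below are invented):

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule, normalizing out conflict."""
    combined, conflict = {}, 0.0
    for (a, p), (b, q) in product(m1.items(), m2.items()):
        meet = a & b
        if meet:
            combined[meet] = combined.get(meet, 0.0) + p * q
        else:
            conflict += p * q  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

OPR, THETA = frozenset({"OPR"}), frozenset({"OPR", "NOP"})
m1 = {OPR: 0.6, THETA: 0.4}   # evidence from one source
m2 = {OPR: 0.5, THETA: 0.5}   # evidence from another
print(dempster_combine(m1, m2))  # mass on OPR rises to 0.8
```

Agreeing sources reinforce each other: each alone supports OPR at 0.5-0.6, but combined they assign it 0.8.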

  8. Rule Induction Tree Format
• Example: rule induction tree for weather data
Database:
  #  Outlook  Humidity  Winds  Temp  Deploy
  1  Sunny    51%       True   78    Yes
  2  Rainy    70%       False  98    No
  .
  n
Decision supported: weather influence on asset deployment (“Deploy Asset?”).
Leaf annotation “no (3.0/2.0)”: 3 met the rule, 2 did not. Attributes the tree ignores had no effect!
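The split criterion behind a rule induction tree like this can be sketched briefly (C4.5 actually uses gain ratio; plain information gain is shown here for brevity, over an invented four-row weather table):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attrs, target):
    """Return the attribute whose split yields the highest information gain."""
    base = entropy([r[target] for r in rows])
    def gain(attr):
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[target] for r in rows if r[attr] == value]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder
    return max(attrs, key=gain)

rows = [{"Outlook": "Sunny", "Windy": True,  "Deploy": "Yes"},
        {"Outlook": "Sunny", "Windy": False, "Deploy": "Yes"},
        {"Outlook": "Rainy", "Windy": True,  "Deploy": "No"},
        {"Outlook": "Rainy", "Windy": False, "Deploy": "No"}]
print(best_attribute(rows, ["Outlook", "Windy"], "Deploy"))  # Outlook
```

Outlook splits the toy data perfectly (gain 1.0 bit) while Windy gains nothing, so the tree branches on Outlook first and Windy "had no effect".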

  9. Results: READI Rule Tree
Reference: Talbot, P., TRW Technical Review Journal, Spring/Summer 2001, pp. 92-93.
Quinlan C4.5 classifier:
• Links are IFs
• White nodes are THENs
• Yellow nodes are C-ratings
(Tree diagram: splits on Overall Training, Mission Capable, Authorized Platforms, Platforms Ready, Category, Comments, Edited, ICBM, Spares, and Training, with leaves rated C-0 through C-5 and occurrence counts such as 4, 16, 48, 60, and 564.)
Example rule: in the “Overall Training” database, if C-1 training was received, then all but 4 squadrons are Mission Capable and all but 2 are then Platform Ready. If Platform Ready, 564 are then rated C-1.

  10. Results: READI Clustering
• STRATCOM Force Readiness Database
• 1136 instances
(Cluster plot: readiness ratings C-1, C-2, and C-3 arranged along Training and Category axes.)

  11. Results: Heuristics Rule Tree
Six-variable damage expectancy tree over variable types SVbr, ARbr, ARrb, DMbr, DMrb, and SVrb, splitting on weapon type (conventional/nuclear), intent (destroy, defend, deny, preempt, retaliate), and arena (strategic/tactical); leaves carry percentages (10%–100%) with occurrence/miss counts.
Example rule: IF variable type is DMbr AND weapon type is conventional THEN DMbr = 80%, which occurs 8 times and does not occur 5 times.

  12. Results: MIDB – 1
MIDB data table: difficult to see patterns!
10001001012345,TI5NA,80000,DBAKN12345,000001,KN,A,OPR,40000000N,128000000E,19970101235959,0290,SA-2
10001001012346,TI5NA,80000,DBAKN12345,000002,KN,A,OPR,39500000N,127500000E,19970101225959,0290,SA-2
10001001012347,TI5CA,80000,DBAKN12345,000003,KN,A,OPR,39400000N,127400000E,19970101215959,0290,SA-3
. . .
MIDB rule tree: easy to see patterns! Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
• if SA-2 and lat <= 39.1, then 3 are NOP
• if SA-2 and lat > 39.1, then 9 are OPR
• if SA-3 and lat <= 38.5, then 3 are OPR
• if SA-3 and lat > 38.5, then 9 are NOP
• if SA-13, then 6 are NOP

  13. Results: MIDB 300 Records
• Rules that determine whether a threat surface-to-air site is operational (OPR) or not (NOP):
• if lat > 39.4, then 59 are OPR, 2 aren’t
• if lat <= 39.4 and lon > 127.2, then 56 are NOP, 2 aren’t
• if lon <= 127.2 and lat > 39.1, then 31 are OPR, 2 aren’t
• if lon <= 127.2 and lat <= 39.1, then:
  • SA-2: 30 are NOP
  • SA-3 and lat <= 38.5: 30 are OPR, 1 isn’t
  • SA-3 and lat > 38.5: 30 are NOP, 1 isn’t
  • SA-13 and lat <= 36.5: 2 are OPR
  • SA-13 and lat > 36.5: 60 are OPR, 6 aren’t
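Because the induced rules are mutually exclusive, they can be re-expressed directly as executable code, which is what "quantifies rules in executable form" means in practice. A sketch returning the majority class at each leaf (the function name and encoding are illustrative, not from the original tool output):

```python
def sam_status(site_type, lat, lon):
    """Majority-class prediction from the induced MIDB rule tree (illustrative)."""
    if lat > 39.4:
        return "OPR"                            # 59 OPR, 2 exceptions
    if lon > 127.2:
        return "NOP"                            # 56 NOP, 2 exceptions
    if lat > 39.1:
        return "OPR"                            # 31 OPR, 2 exceptions
    # below here: lon <= 127.2 and lat <= 39.1
    if site_type == "SA-2":
        return "NOP"                            # 30 NOP
    if site_type == "SA-3":
        return "OPR" if lat <= 38.5 else "NOP"  # 30/1 either way
    if site_type == "SA-13":
        return "OPR"                            # OPR majority at both leaves
    return "UNKNOWN"

print(sam_status("SA-2", 40.0, 128.0))  # OPR
print(sam_status("SA-3", 38.0, 127.0))  # OPR
```

Running every record through such a function also flags the exceptions ("2 aren’t") for the outlier and corrupt-data checks mentioned later.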

  14. Current Work: SUBDUE
• Hierarchical conceptual clustering
• Structured and unstructured data
• Clusters attributes with graphs
• Hypothesis generation
• Example discovered subclass 7: SA-13s (29) at (34.3 N, 129.3 E) are not operational

  15. Applicability
• Benefits:
  • Automatically discovers patterns in data
  • Resulting rules are easy to understand in “plain English”
  • Quantifies rules in executable form
  • Explicitly picks out corrupt data, outliers, and exceptions
  • Graphical user interface allows easy understanding
• Example Uses:
  • Database validation
  • 3-D sortie deconfliction
  • Determine trends in activity
  • Find hidden structure and dependencies
  • Create or modify belief networks

  16. Lessons Learned
• Data sets: choose one that you understand; this makes cleaning, formatting, default parameter settings, and interpretation much easier
• Background: knowledge of non-parametric statistics helps determine which patterns are statistically significant
• Tools: many are just fancy GUIs with database query and plot functionality. Most are overpriced ($100K/seat for high-end tools for mining business data)
• Uses: a new one is discovered in every task, e.g., checking the consistency and completeness of rules. May be useful for organizing textual evidence
• Algorithms: must provide understandable patterns; some algorithms do not!
• Integration: challenging to interface these inductive and abductive methods with deductive methods like belief networks

  17. Summary
• TRW has many technical people in Colorado with data mining experience
  • Hands-on with commercial and academic tools
• Interesting and useful results have been produced
  • Patterns in READI and MIDB using rule induction algorithms
  • Outliers, corrupt data, and exceptions are flagged
  • Novel uses demonstrated, such as checking consistency and completeness of rule sets
• Lessons learned have been described
• Good starting point for future work
  • Challenge is interfacing data mining algorithms with others
