
NPACI Summer Institute 2003 TUTORIAL Data Mining for Scientific Applications


Presentation Transcript


  1. NPACI Summer Institute 2003 TUTORIAL: Data Mining for Scientific Applications. Tony Fountain, Peter Shin, San Diego Supercomputer Center, UCSD

  2. NPACI Data Mining • Resources • TeraGrid

  3. NSF TeraGrid: Building Integrated National CyberInfrastructure • Prototype for CyberInfrastructure • Ubiquitous computational resources • Plug-in compatibility • National reach: SDSC, NCSA, CIT, ANL, PSC • High-performance network: 40 Gb/s backbone, 30 Gb/s to each site • Over 20 Teraflops compute power • Approx. 1 PB rotating storage • Extending by 3-4 sites in ‘03

  4. SDSC is a Data-Intensive Center

  5. SDSC Machine Room Data Architecture • .5 PB disk • 6 PB archive • 1 GB/s disk-to-tape • Optimized support for DB2/Oracle • Philosophy: enable the SDSC configuration to serve the grid as a Data Center. [Architecture diagram: Blue Horizon (1152-processor IBM SP, 1.7 Teraflops) and a 4 TF Linux cluster on a multi-GbE LAN and 30 Gb/s WAN; Power 4 database and HPSS nodes (HPSS stores over 600 TB); Sun F15K; 2 Gb/s SAN (SCSI/IP or FC/IP, 30 MB/s per drive, 200 MB/s per controller) with 100 TB FC GPFS disk, 400 TB FC disk cache, and 50 TB local disk; silos and tape (6 PB, 32 tape drives, 1 GB/s disk to tape); database engine, data miner, and vis engine.]

  6. SDSC IBM Regatta - DataStar • 100+ TB Disk • Numerous fast CPUs • 64 GB of RAM per node • DB2 v8.x ESE • IBM Intelligent Miner • SAS Enterprise Miner • Platform for high-performance database, data mining, comparative IT studies …

  7. Data Mining Definition The search for interesting patterns and models.

  8. Data Mining Definition The search for interesting patterns and models in large databases that were collected for other applications, using machine learning algorithms and high-performance computational infrastructure.

  9. Broad Definition:Analysis and Infrastructure • Informal methods – graphs, plots, visualizations, exploratory data analysis (yes – Excel is a data mining tool) • Advanced query processing and OLAP – e.g., National Virtual Observatory (NVO), Blast • Machine learning (compute-intensive statistical methods) • Supervised – classification, prediction • Unsupervised – clustering • Computational infrastructure – collections management, information integration, high-performance database systems, web services, grid services, the global IT grid

  10. The Case for Data Mining: Data Reality • Deluge from new sources • Remote sensing • Microarray processing • Wireless communication • Simulation models • Instrumentation • Digital publishing • Federation of collections • Legacy archives and independent collection activities • Many types of data, many uses, many types of queries • Growth of data collections vs. analysts • Paradigm shift: from hypothesis-driven data collection to data mining • Virtual laboratories and digital science

  11. KDD Process: Knowledge Discovery and Data Mining • Layers, top to bottom: Application/Decision Support • Knowledge Presentation/Visualization • Analysis/Modeling • Management/Federation/Warehousing • Processing/Cleansing/Corrections • Data Collection • The bulk of the difficult work is below the analysis layer • Integrated infrastructure increases efficiency

  12. Characteristics of Data Mining Applications • Lots of data, numerous sources • Noisy – missing values, outliers, interference • Heterogeneous – mixed types, mixed media • Complex – scale, resolution, temporal, spatial dimensions • Relatively little domain theory, few quantitative causal models • No rigorous experimental design, limited control on data collection • Lack of valid ground truth – all data is not equal! • Finding needles in haystacks… • Advice: don’t choose problems that have all these characteristics …

  13. Scientific vs. Commercial Data Mining • Goals: profits vs. theories • Need for insight and the depth of science • The role of black boxes and theory-based models • Produce interpretable model structures, generate domain rules or causal structures, support for theory development • Scientists-in-the-loop architectures • Data characteristics: • Transaction data vs. images, sensors, simulations • Spatial and temporal dimensions, heterogeneity • Trend – the IT differences are diminishing -- this is good! • Databases, integration tools, web services… • Industry is a big IT engine

  14. SDSC Applications • Alliance for Cell Signaling (AFCS) • Joint Center for Structural Genomics (JCSG) • Cooperative Association for Internet Data Analysis (CAIDA)

  15. Hyperspectral Example • Characteristics of the data • Over 200 bands • Small number of samples, collected through a labor-intensive process • Task: • Classify the vegetation (e.g., Kangaroo Mound, Juniper, Pinyon)

  16. Cancer Example Data Set: • 88 prostate tissue samples: • 37 labeled “no tumor”, • 51 labeled “tumor” • Each tissue with 10,600 gene expression measurements • Collected by the UCSD Cancer Center, analyzed at SDSC Tasks: • Build model to classify new, unseen tissues as either “no tumor” or “tumor” • Report back to the domain expert the key genes used by the model, to find out their biological significance in the process of cancer
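
The tutorial's hands-on work used IBM Intelligent Miner; as a rough modern stand-in, the sketch below shows one way to approach the two tasks above with scikit-learn. The expression matrix here is random noise with the same shape as the slide's data (88 samples x 10,600 genes), so the printed accuracy and gene indices are meaningless placeholders, not results.

```python
# Hypothetical sketch only: random stand-in data, not the UCSD Cancer Center set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 10600))        # expression matrix: samples x genes
y = np.array([0] * 37 + [1] * 51)       # 0 = "no tumor", 1 = "tumor"

# Task 1: build and evaluate a classifier for new, unseen tissues.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Task 2: report the genes the model leaned on most, so the domain
# expert can check their biological significance.
clf.fit(X, y)
top_genes = np.argsort(clf.feature_importances_)[::-1][:10]
print("most important gene columns:", top_genes)
```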

  17. Some genes are more useful than others for building classification models. Example: genes 36569_at and 36495_at are useful. [Scatter plot of expression values for these two genes, with "No Tumor" and "Tumor" samples distinguished.]

  18. Civil Infrastructure Health Monitoring Example • Goal: Provide a flexible and scalable infrastructure to process sensor network data streams • to monitor and analyze real-time sensor data • to integrate various types of sensor data • to support classification and decision support tasks • to build a real-time decision support system

  19. Detecting Damage Location in a Bridge • Task: • Identify which pier is damaged, based on the stream of acceleration data measured by the sensors at the span middles • Compare the prediction accuracy of the exact data vs. approximate data • Testbed: • Humboldt Bay Bridge with 8 piers • Assumptions: • Damage occurs only at the lower end of each pier (the location of the plastic hinge) • Only one pier is damaged at a time

  20. Introduction to Machine Learning • Concepts and inductive reasoning • Supervised and unsupervised learning • Model development -- training and testing methodology, cross validation • Measuring performance -- overfitting, confusion matrices • Survey of algorithms • Decision tree classification • k-means clustering • Hierarchical clustering • Bayesian networks and probabilistic inference • Support vector machines

  21. Basic Machine Learning Theory • Inductive learning hypothesis: • Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples. • No Free Lunch Theorem: • In the absence of prior information about the problem there are no reasons to prefer one learning algorithm over another. • Conclusion: • There is no problem-independent “best” learning system. Formal theory and algorithms are not enough. • Machine learning is an empirical subject.

  22. Concepts and Feature Vectors • Concepts can be identified by features • Example: vehicles • Has wheels • Runs on gasoline • Carries people • Flies • Weighs less than 500 pounds • Boolean feature vectors for vehicles • Car [1 1 1 0 0] • Motorcycle [1 1 1 0 1] • Airplane [1 1 1 1 0] • Motorcycle [1 1 1 0 0]

  23. Concepts and Feature Vectors 2 • Easy to generalize to complex data types: • [type, num_wheels, fuel_type, carrying_capacity, max_altitude, weight] • [Car, 4, gas, 600, 0.0, 2190] • Most machine learning algorithms expect this input format • Suggestions: • Identify the target concept • Organize your data to fit the feature vector representation • Design your database schemas to support generation of data in this format
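
A minimal sketch of this representation in Python. The Boolean vectors come from slide 22 and the mixed-type keys mirror the header on this slide; everything else is illustrative.

```python
# Boolean feature vectors from slide 22:
# [has_wheels, runs_on_gasoline, carries_people, flies, under_500_lbs]
vehicles = {
    "car":        [1, 1, 1, 0, 0],
    "motorcycle": [1, 1, 1, 0, 1],
    "airplane":   [1, 1, 1, 1, 0],
}

# The same idea with mixed types; most learning libraries expect one
# such row (feature vector) per example, in exactly this layout.
mixed_example = {
    "type": "Car", "num_wheels": 4, "fuel_type": "gas",
    "carrying_capacity": 600, "max_altitude": 0.0, "weight": 2190,
}
```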

  24. Dimensions of Data These help determine algorithm selection • Questions to consider (metadata): • Number of features (columns)? • Number of independent/dependent features? • Type of features? • Missing features? • Mixed types? • Labels? Types? • Ratio of rows to columns? • Goals of analysis? • Best guess at type of target function? (e.g., linear?)
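
For tabular data, most of these metadata questions can be answered in a few lines of pandas. A sketch; the file name is a placeholder, not a tutorial dataset.

```python
import pandas as pd

df = pd.read_csv("my_dataset.csv")            # placeholder path
print("rows x columns:", df.shape)            # ratio of rows to columns
print(df.dtypes)                              # mixed types?
print(df.isna().sum())                        # missing features?
print(df.select_dtypes("object").nunique())   # categorical columns / candidate labels
```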

  25. Challenges to Quality Ground Truth (labeled training data) • Labels enable supervised learning (classification, prediction) • Obtaining them ranges from difficult to expensive to impossible • Remote sensing imagery, biological field surveys • Clinical cancer data, limited biological samples • Failure signatures of civil infrastructure, catastrophes, terrorism • Approaches • Opportunistic labeling (e.g., security logs, multiuse field surveys) • Learning from process-based models (e.g., bridge simulations) • Shared community resources (amortize costs, e.g., museum federation) • High-throughput annotations

  26. Overview of Classification • Definition • A new observation can be assigned to one of several known classes using a rule. • The rule is learned using a set of labeled examples, through a process called supervised learning. • Survey of Applications • Ecosystem classification, hyperspectral image pixel classification, cancer diagnosis and prognosis, structural damage detection, crystallization success prediction, spam detection, etc. • Survey of Methods • Neural Network • Decision Trees • Naïve Bayesian Networks • Support Vector Machines

  27. Classification – Decision Tree. [Scatter plot of the training data: Ecosystem class vs. Precipitation.]

  28. Classification – Decision Tree. [Same plot with the first split marked: Precipitation > 63.]

  29. Classification – Decision Tree. [Same plot with both splits marked: Precipitation > 63, then Precipitation > 5.]

  30. Classification – Decision Tree: Learned Model. IF (Precip > 63) then Forest; else if (Precip > 5) then Prairie; else Desert. [Confusion matrix of true vs. predicted classes (D, F, P) on the training data.] Classification accuracy on the training data is 100%.
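
The learned model on this slide is just a nested rule; a direct transcription in Python, checked against a handful of hypothetical training points chosen to be consistent with the thresholds (the actual training data are not in the transcript).

```python
def classify_ecosystem(precip):
    """Decision rule from slide 30."""
    if precip > 63:
        return "Forest"
    elif precip > 5:
        return "Prairie"
    return "Desert"

# Hypothetical training points, invented for illustration.
train = [(80, "Forest"), (70, "Forest"), (30, "Prairie"), (10, "Prairie"), (2, "Desert")]
correct = sum(classify_ecosystem(p) == label for p, label in train)
print("training accuracy:", correct / len(train))   # 1.0, i.e., 100%
```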

  31. Classification – Decision Tree: Testing Set Results. The learned model (IF (Precip > 63) then Forest; else if (Precip > 5) then Prairie; else Desert) is applied to the test data. [Confusion matrix of true vs. predicted classes (D, F, P) on the test data.] Result: accuracy 67%. The model shows overfitting and generalizes poorly.
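
A sketch of the same check on held-out data, using scikit-learn's confusion_matrix. The test points are invented, but chosen so the rule misclassifies two of six and lands near the slide's 67%.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

def classify_ecosystem(precip):                       # rule from slide 30
    return "Forest" if precip > 63 else ("Prairie" if precip > 5 else "Desert")

# Hypothetical held-out points, invented for illustration.
test = [(65, "Prairie"), (40, "Prairie"), (90, "Forest"),
        (4, "Desert"), (8, "Desert"), (75, "Forest")]
y_true = [label for _, label in test]
y_pred = [classify_ecosystem(p) for p, _ in test]

print(confusion_matrix(y_true, y_pred, labels=["Desert", "Forest", "Prairie"]))
print("test accuracy:", accuracy_score(y_true, y_pred))   # about 0.67
```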

  32. Pruning to Improve Generalization: Pruned Decision Tree. IF (Precip < 60) then Desert; else P(Forest) = .75 and P(Prairie) = .25. [Pruned tree with a single split at Precipitation < 60.]
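
A sketch of the pruning idea with scikit-learn (a modern stand-in, not the IBM Intelligent Miner used in the tutorial): limiting tree depth, or raising ccp_alpha, trades a little training accuracy for better generalization. The precipitation data below are randomly generated around the slide's thresholds with added label noise, purely for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
precip = rng.uniform(0, 100, size=(300, 1))
labels = np.where(precip[:, 0] > 63, "Forest",
                  np.where(precip[:, 0] > 5, "Prairie", "Desert"))
noisy = rng.random(300) < 0.15                      # label noise so the full tree can overfit
labels[noisy] = rng.choice(["Forest", "Prairie", "Desert"], noisy.sum())

Xtr, Xte, ytr, yte = train_test_split(precip, labels, random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
pruned = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xtr, ytr)
for name, model in [("full tree", full), ("pruned tree", pruned)]:
    print(name, "train:", model.score(Xtr, ytr), "test:", model.score(Xte, yte))
```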

  33. Decision Trees Summary • Simple to understand • Works with mixed data types • Heuristic search, so sensitive to local minima • Models non-linear functions • Handles classification and regression • Many successful applications • Readily available tools

  34. Cross Validation: Getting the Most Mileage from Your Data • Creating training and testing data sets • Hold-out validation (2/3, 1/3 splits) • Cross validation, simple and n-fold (reuse) • Bootstrap validation (sample with replacement) • Jackknife validation (leave one out) • When possible, hide a subset of the data until training and testing are complete. [Diagram: Train → Test → Apply.]
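
A minimal n-fold cross-validation sketch with scikit-learn; the bundled iris data stand in for a real dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=5, shuffle=True, random_state=0)     # 5-fold reuse of the data
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=folds)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```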

  35. Learning Curves: Reality Checks and Optimization Decisions. [Plot of training and test performance vs. tree depth, marking the optimal depth and the onset of overfitting.]
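
One common way to reproduce such a curve is scikit-learn's validation_curve, sweeping tree depth and comparing mean training vs. cross-validated scores; synthetic data stand in for the tutorial's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
depths = np.arange(1, 15)
train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

# Training accuracy keeps rising with depth; test accuracy peaks at the
# optimal depth and then falls off as the tree overfits.
for d, tr, te in zip(depths, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"depth {d:2d}  train {tr:.2f}  test {te:.2f}")
```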

  36. Hands-on Analysis • Decision Tree with IBM Intelligent Miner

  37. Sales Problem • Goal: Maximize the profits on sales of quality shoes • Problem Characteristics: 1. We make $50 profit on a sale of $200 shoes. 2. People who make over $50k buy the shoes at a rate of 5% when they receive the brochure. 3. People who make less than $50k buy the shoes at a rate of 1% when they receive the brochure. 4. It costs $1 to send a brochure to a potential customer. 5. In general, we do not know whether a person will make more than $50k or not. However, we have indirect information about them.

  38. Data Description • Credit: • Census Bureau (1994) • Data processed and donated by Ron Kohavi and Barry Becker (Data Mining and Visualization, SGI) • Variable Description • Please refer to the hand-out.

  39. Two Datasets • Train Set • Total Number = 32561 (100%) • Less than or equal to $50k = 24720 (76 %) • Over $50k = 7841 (24 %) • Test Set (Validation Set) • Total Number = 16281 (100%) • Less than or equal to $50k = 12435 (76%) • Over $50k = 3846 (24%)

  40. Marketing Plan • We will send out 30,000 brochures. • Remember! • We do not know whether a person makes $50k or not, • We do know the other information (e.g. age, education level, etc.) • Plans • Plan A: randomly send them out (a.k.a ran-dumb plan) • Plan B: send them to a group of people who are likely to make over $50k (a.k.a InTelligent (IT) plan)

  41. Plan A (ran-dumb plan) • Cost of sending one brochure = $1 • Probability of Response • 1 % for 76% of the population who make <= $50k. • 5 % for 24% of the population who make > $50k. • Probability of response: (0.01 * .76 + 0.05 * 0.24) = 0.0196 (1.96%) • Estimated Earnings • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) (0.0196 * $50 - $1) = -$0.02 • Expected Earning = Expected profit from one brochure * number of brochures sent -$0.02 * 30000 = -$600

  42. Plan B (InTelligent (IT) plan) • Strategy: • Send brochures only to the people who are likely to make over $50k • Cost of sending one brochure = $1 • Probability of Response • Use the test result of the decision tree • Send the brochures only to the people who are predicted to make over $50k • Total number of cases where predicted >$50k: 3,061 (100%) • Number of cases where predicted >$50k and actually make >$50k: 2,121 (69%) • Number of cases where predicted >$50k and actually make <=$50k: 940 (31%) • Probability of response: (0.01 * 0.31 + 0.05 * 0.69) = 0.0376 (3.76%)

  43. Plan B (InTelligent (IT) plan) • Estimated Earnings • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) = (0.0376 * $50 - $1) = $0.88 • Expected Earning = Expected profit from one brochure * number of brochures sent = $0.88 * 30000 = $26,400

  44. Comparison of Two Plans • Expected earning from ran-dumb plan • -$600 • Expected earning from IT plan • $26,400 • Net Difference • $26,400 – (-$600) = $27,000
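
The two plans reduce to one expected-value formula; a quick check of the slides' arithmetic (all numbers are taken from the slides above).

```python
def expected_earnings(p_response, profit=50.0, cost=1.0, n_brochures=30_000):
    """Expected earnings = (response probability * profit - cost) * brochures sent."""
    return (p_response * profit - cost) * n_brochures

p_plan_a = 0.01 * 0.76 + 0.05 * 0.24   # random mailing: 0.0196
p_plan_b = 0.01 * 0.31 + 0.05 * 0.69   # mail only those predicted >$50k: 0.0376

print("Plan A:", expected_earnings(p_plan_a))   # about -$600
print("Plan B:", expected_earnings(p_plan_b))   # about $26,400
print("Difference:",
      expected_earnings(p_plan_b) - expected_earnings(p_plan_a))  # about $27,000
```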

  45. Overview of Clustering • Definition: • Clustering is the discovery of classes that belong together. • The classes are discovered using a set of unlabeled examples, through a process called unsupervised learning. • Survey of Applications • Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles using transaction data • Survey of Methods • k-means clustering • Hierarchical clustering • Expectation Maximization (EM) algorithm • Gaussian mixture modeling • Cluster analysis • Concept discovery • Bootstrapping knowledge

  46. Clustering – K-Means. [Scatter plot of unlabeled data: Precipitation vs. Temperature.]

  47-50. Clustering – K-Means. [Figure-only slides stepping through successive k-means iterations on the precipitation/temperature data.]
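
A minimal k-means sketch in the spirit of these slides, using scikit-learn rather than the tutorial's tools; the precipitation/temperature values are randomly generated stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([                                   # columns: precipitation, temperature
    rng.normal([80, 15], 5, size=(50, 2)),        # wet, cool group
    rng.normal([30, 20], 5, size=(50, 2)),        # moderate group
    rng.normal([3, 30], 2, size=(50, 2)),         # dry, hot group
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
print("cluster centers (precip, temp):\n", km.cluster_centers_)
```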
