
Computational and Statistical Issues in Data-Mining



Presentation Transcript


  1. Computational and Statistical Issues in Data-Mining Yoav Freund Banter Inc.

  2. Plan of talk • Two large-scale classification problems • Generative versus Predictive modeling • Boosting • Applications of boosting • Computational issues in data-mining

  3. AT&T customer classification (Freund, Mason, Rogers, Pregibon, Cortes 2000) • Distinguish business/residence customers • Classification unavailable for about 30% of known customers • Calculate a “Buizocity” score using statistics from call-detail records • Records contain: calling number, called number, time of day, length of call

  4. Massive datasets • 260 Million calls / day • 230 Million telephone numbers to be classified.

  5. Paul Viola’s face recognizer • Training data: 5,000 faces and 10^8 non-faces [Figure: example face and non-face image patches]

  6. Application of face detector Many Uses - User Interfaces - Interactive Agents - Security Systems - Video Compression - Image Database Analysis

  7. Generative vs. Predictive models

  8. Toy example • Computer receives a telephone call • Measures the pitch of the voice • Decides the gender of the caller [Figure: a human voice classified as male or female]

  9. Generative modeling [Figure: two Gaussian densities over voice pitch, one per class, with parameters mean1/var1 and mean2/var2; y-axis: probability]
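The generative picture on this slide is only a figure, so here is a minimal Python sketch of the idea: fit one Gaussian per class to the observed pitches and label a new caller by comparing class-conditional log-likelihoods. The variable names, the toy pitch values, and the equal-prior assumption are mine, not part of the talk.

```python
import numpy as np

def fit_gaussian(x):
    """Fit a 1-D Gaussian by maximum likelihood: return (mean, variance)."""
    return x.mean(), x.var()

def generative_classify(pitch, male_params, female_params):
    """Label a pitch by comparing class-conditional log-likelihoods (equal priors assumed)."""
    def log_lik(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return "male" if log_lik(pitch, *male_params) > log_lik(pitch, *female_params) else "female"

# Toy usage: fit one Gaussian per class, then classify a new caller's pitch.
male_pitch = np.array([110., 120., 125., 130.])
female_pitch = np.array([200., 210., 215., 220.])
male_params, female_params = fit_gaussian(male_pitch), fit_gaussian(female_pitch)
print(generative_classify(150.0, male_params, female_params))
```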

  10. Discriminative approach [Figure: number of mistakes as a function of the decision threshold on voice pitch]
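By contrast, the discriminative approach on this slide picks the decision boundary directly by counting mistakes. A sketch, again with invented names, that scans every candidate pitch threshold and keeps the one with the fewest training errors:

```python
import numpy as np

def best_threshold(pitch, label):
    """Return the pitch threshold that minimizes training mistakes.

    label is +1/-1; the rule predicts +1 when pitch > threshold.
    """
    best_t, best_mistakes = None, len(pitch) + 1
    for t in np.sort(pitch):
        pred = np.where(pitch > t, 1, -1)
        mistakes = int((pred != label).sum())
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes
```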

  11. Ill-behaved data [Figure: pitch data for which the fitted Gaussians (mean1, mean2) describe the classes poorly, while the mistake-minimizing threshold still separates them; axes: probability / no. of mistakes vs. voice pitch]

  12. Traditional statistics vs. machine learning [Diagram: in the traditional pipeline, statistics turns data into an estimated world state and decision theory turns that into actions; machine learning maps data directly to predictions and actions]

  13. Comparison of methodologies

  14. Boosting

  15. A weak learner. The input is a weighted training set (x1,y1,w1),(x2,y2,w2) … (xn,yn,wn): feature vectors (instances) x1,x2,…,xn, binary labels y1,y2,…,yn, and non-negative weights that sum to 1. The weak learner outputs a weak rule h. The weak requirement: h must beat random guessing with respect to the weights, i.e. its weighted correlation with the labels, Σ_i w_i y_i h(x_i), is at least some advantage g > 0.

  16. The boosting process. Start with uniform weights (x1,y1,1/n), … (xn,yn,1/n); at each round the weak learner is run on the current weighted set (x1,y1,w1), … (xn,yn,wn) to produce a rule h1, h2, …, hT, and the weights are updated before the next round. Final rule: Sign[a1 h1 + a2 h2 + … + aT hT].
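The diagram is the standard AdaBoost loop, so a compact Python sketch may help. This is a generic textbook implementation, not the code behind the slides; `weak_learner` stands for any routine that returns a callable rule chosen against the current weights.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost with labels y in {-1,+1}; returns the combined rule."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start with uniform weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)        # weak rule chosen w.r.t. current weights
        pred = h(X)
        eps = w[pred != y].sum()         # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * pred)   # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)
    def final_rule(X):
        return np.sign(sum(a * h(X) for a, h in zip(alphas, rules)))
    return final_rule
```

The `alpha` coefficients play the role of a1,…,aT in the final rule Sign[a1 h1 + … + aT hT].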

  17. Main properties of AdaBoost • If the advantages of the weak rules over random guessing are g1,g2,..,gT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most ∏_t √(1 − 4 g_t²) ≤ exp(−2 Σ_t g_t²). • Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 98]

  18. What is a good weak learner? • The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label. • Small enough to allow exhaustive search for the minimal weighted training error. • Small enough to avoid over-fitting. • Should be able to calculate predicted label very efficiently. • Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
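One weak learner that fits all of these criteria is a one-sided decision stump that acts as a “specialist”: it predicts ±1 on one side of a threshold and abstains (outputs 0) elsewhere. The sketch below is illustrative only; the exhaustive search over features, thresholds, and signs is exactly the kind of small search the bullet points call for.

```python
import numpy as np

def stump_learner(X, y, w):
    """Exhaustively search one-sided stumps: predict s where x_j > t, abstain elsewhere.

    Returns the rule with the largest weighted correlation sum_i w_i y_i h(x_i).
    """
    best_corr, best = -np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            active = X[:, j] > t
            for s in (+1, -1):
                corr = (w * y * np.where(active, s, 0)).sum()
                if corr > best_corr:
                    best_corr, best = corr, (j, t, s)
    j, t, s = best
    return lambda X: np.where(X[:, j] > t, s, 0)
```

A learner of this form plugs directly into the boosting loop sketched after slide 16.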

  19. Image features [Figure: unique binary features obtained by thresholding rectangle filters that compare sums of pixels in adjacent rectangular regions]
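Rectangle features like the ones in this figure are usually evaluated in constant time from an integral image. The following is my own sketch of that standard construction; Viola’s actual feature set and thresholds are not given in the slides.

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sums so any rectangle sum needs only 4 lookups."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] via the integral image ii."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """A horizontal two-rectangle filter: left half minus right half."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)

def binary_feature(ii, top, left, height, width, theta):
    """Threshold the filter response to get a binary feature (theta is illustrative)."""
    return 1 if two_rect_feature(ii, top, left, height, width) > theta else 0
```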

  20. Example classifier for face detection. A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set with 1 in 14,084 false positives. Not quite competitive... [Figure: ROC curve for the 200-feature classifier]

  21. Alternating Trees Joint work with Llew Mason

  22. Decision trees [Figure: a decision tree that tests X>3 and then Y>5 to assign labels −1/+1, shown next to the corresponding axis-parallel partition of the (X,Y) plane at X=3 and Y=5]

  23. Decision tree as a sum [Figure: the same tree rewritten as a sum of real-valued scores (+0.2, −0.1, +0.1, −0.3, …) attached to its nodes; the predicted label is the sign of the total]

  24. An alternating decision tree [Figure: an alternating decision tree with decision nodes X>3, Y>5 and Y<1, each contributing a real-valued score (+0.2, −0.1, +0.7, …); an instance adds up the scores along every path it satisfies, and the predicted label is the sign of the sum]
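Reading the figure as code: an alternating tree is evaluated by summing a root score plus the scores of every decision node whose precondition holds, then taking the sign. The node structure and the numbers below are illustrative, not the actual tree from the slide.

```python
def adtree_predict(x, root_score, nodes):
    """Evaluate an alternating decision tree as a signed sum of scores.

    Each node is (precondition, condition, score_if_true, score_if_false):
    it only fires when its precondition holds, then adds one of its two scores.
    """
    total = root_score
    for precondition, condition, score_true, score_false in nodes:
        if precondition(x):
            total += score_true if condition(x) else score_false
    return (1 if total >= 0 else -1), total

# Illustrative tree in the spirit of slide 24 (scores invented for the example):
nodes = [
    (lambda x: True,       lambda x: x["X"] > 3, -0.3, +0.1),
    (lambda x: True,       lambda x: x["Y"] > 5, +0.2, -0.1),
    (lambda x: x["X"] > 3, lambda x: x["Y"] < 1, +0.7,  0.0),  # nested decision node
]
print(adtree_predict({"X": 4, "Y": 0.5}, root_score=0.2, nodes=nodes))
```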

  25. Example: Medical Diagnostics • Cleve dataset from UC Irvine database. • Heart disease diagnostics (+1=healthy,-1=sick) • 13 features from tests (real valued and discrete). • 303 instances.

  26. Adtree for Cleveland heart-disease diagnostics problem

  27. Cross-validated accuracy

  28. Alternating tree for “buizocity”

  29. Alternating Tree (Detail)

  30. Precision/recall graphs [Figure: accuracy as a function of score]

  31. “Drinking out of a fire hose” Allan Wilks, 1997

  32. Data aggregation [Diagram: front-end systems (cashier’s system, telephone switch, web server, web-camera) send massive distributed data streams into a “data warehouse”, which feeds analytics]

  33. The database bottleneck • Physical limit: a disk “seek” takes 0.01 sec • Same time to read/write 10^5 bytes • Same time to perform 10^7 CPU operations • Commercial DBMS are optimized for varying queries and transactions. • Classification tasks require evaluation of fixed queries on massive data streams.

  34. Working with large flat files • Sort the file according to X (“called telephone number”). • Can be done very efficiently for very large files • Counting occurrences becomes efficient because all records for a given X appear in the same disk block. • Randomly permute records • Reading k consecutive records suffices to estimate a few statistics for a few decisions (splitting a node in a decision tree). • Done by sorting on a random number. • “Hancock” – a system for efficient computation of statistical signatures for data streams. http://www.research.att.com/~kfisher/hancock/
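A small sketch of why sorting pays off: once the file is sorted on the key, all records for one key sit in adjacent blocks, so a single sequential pass computes per-key statistics with no random seeks. The CSV layout (calling number, called number, time of day, length) mirrors slide 3, but the field order and units are my assumption.

```python
import csv

def count_calls_per_number(sorted_path):
    """One sequential pass over a call-detail file sorted by called number.

    Yields (called_number, n_calls, total_seconds) for each key.
    """
    current, n_calls, total_secs = None, 0, 0
    with open(sorted_path, newline="") as f:
        for calling, called, time_of_day, length in csv.reader(f):
            if called != current:
                if current is not None:
                    yield current, n_calls, total_secs
                current, n_calls, total_secs = called, 0, 0
            n_calls += 1
            total_secs += int(length)
        if current is not None:
            yield current, n_calls, total_secs
```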

  35. Working with data streams • “You get to see each record only once” • Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months. • To learn more: Stanford’s Stream Dream Team: http://www-db.stanford.edu/sdt/
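For the one-pass “top 10 items per customer” example, one standard streaming technique is to keep a small, bounded set of counters per customer (a Misra–Gries style sketch). This is my own illustration of the kind of method such stream systems use; the slides do not prescribe an algorithm.

```python
from collections import defaultdict

def misra_gries_update(counters, item, k):
    """Keep at most k counters; approximates the heavy hitters in one pass."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        for key in list(counters):   # decrement everyone, drop counters that hit zero
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def top_items_per_customer(stream, k=10):
    """stream yields (customer_id, item); returns approximate top items per customer."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        misra_gries_update(per_customer[customer], item, k)
    return {c: sorted(cnt, key=cnt.get, reverse=True) for c, cnt in per_customer.items()}
```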

  36. Analyzing at the source [Diagram: analytics drives Java code generation; the code is downloaded to the front-end systems, which upload statistics back for aggregation]

  37. Learn Slowly, Predict Fast! • Buizocity: • 10,000 instances are sufficient for learning. • 300,000,000 have to be labeled (weekly). • Generate ADTree classifier in C, compile it and run it using Hancock.

  38. Paul Viola’s face detector: scan 50,000 location/scale boxes in each image, 15 images per sec., to detect a few faces. The cascaded method minimizes average processing time. Training takes a day on a fast parallel machine. [Diagram: each image box passes through Classifier 1, Classifier 2, Classifier 3, …; a pass (T) sends it on to the next classifier and eventually to FACE, while a fail (F) at any stage immediately rejects it as NON-FACE]
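The cascade in the diagram can be summarized in a few lines: each box runs through progressively more expensive classifiers and is rejected as NON-FACE at the first failure, so the cheap early stages dismiss almost all of the 50,000 boxes. The classifier objects below are placeholders, not Viola’s trained stages.

```python
def cascade_is_face(box, classifiers):
    """Run a detection cascade: reject on the first 'F', accept only if all stages pass."""
    for clf in classifiers:
        if not clf(box):          # early exit keeps the average cost per box low
            return False          # NON-FACE
    return True                   # FACE

def detect_faces(boxes, classifiers):
    """Scan all location/scale boxes of an image and keep the few that survive."""
    return [box for box in boxes if cascade_is_face(box, classifiers)]
```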

  39. Summary • Generative vs. Predictive methodology • Boosting • Alternating trees • The database bottleneck • Learning slowly, predicting fast.

  40. Other work 1 • Specialized data compression: • When data is collected in small bins, most bins are empty. • Instead of storing the zeros, smart compression dramatically reduces data size. • Model averaging: • Boosting and Bagging make classifiers more stable. • We need theory that does not use Bayesian assumptions. • Closely relates to margin-based analysis of boosting and of SVM. • Zipf’s Law: • Distribution of words in free text is extremely skewed. • Methods should scale exponentially in entropy rather than linearly in number of words.

  41. Other work 2 • Online methods: • Data distribution changes with time. • Online refinement of feature set. • Long-term learning. • Effective label collection: • Selective sampling to label only hard cases. • Comparing labels from different people to estimate reliability. • Co-training: different channels train each other. (Blum, Mitchell, McCallum)

  42. Contact me! • Yoav@banter.com • http://www.cs.huji.ac.il/~yoavf
