
Computational and Statistical Issues in Data-Mining



Presentation Transcript


  1. Computational and Statistical Issues in Data-Mining Yoav Freund Banter Inc.

  2. Plan of talk • Two large-scale classification problems • Generative versus Predictive modeling • Boosting • Applications of boosting • Computational issues in data-mining

  3. AT&T customer classification (Freund, Mason, Rogers, Pregibon, Cortes 2000) • Distinguish business/residence customers • Classification unavailable for about 30% of known customers • Calculate a “Buizocity” score using statistics from call-detail records • Records contain: calling number, called number, time of day, length of call

  4. Massive datasets • 260 Million calls / day • 230 Million telephone numbers to be classified.

  5. Paul Viola’s face recognizer • Training data: 5,000 faces and 10^8 non-faces [Figure: example face and non-face image patches]

  6. Application of face detector Many Uses - User Interfaces - Interactive Agents - Security Systems - Video Compression - Image Database Analysis

  7. Generative vs. Predictive models

  8. Toy example • Computer receives a telephone call • Measures the pitch of the voice • Decides the gender of the caller [Figure: a human voice classified as male or female]

  9. Generative modeling [Figure: two Gaussian densities over voice pitch, one per class, with parameters mean1/var1 and mean2/var2; y-axis: probability]
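The generative picture on this slide is only a figure, so here is a minimal Python sketch of the idea: fit one Gaussian per class to the observed pitches and label a new caller by comparing class-conditional log-likelihoods. The variable names, the toy pitch values, and the equal-prior assumption are mine, not part of the talk.

```python
import numpy as np

def fit_gaussian(x):
    """Fit a 1-D Gaussian by maximum likelihood: return (mean, variance)."""
    return x.mean(), x.var()

def generative_classify(pitch, male_params, female_params):
    """Label a pitch by comparing class-conditional log-likelihoods (equal priors assumed)."""
    def log_lik(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return "male" if log_lik(pitch, *male_params) > log_lik(pitch, *female_params) else "female"

# Toy usage: fit one Gaussian per class, then classify a new caller's pitch.
male_pitch = np.array([110., 120., 125., 130.])
female_pitch = np.array([200., 210., 215., 220.])
male_params, female_params = fit_gaussian(male_pitch), fit_gaussian(female_pitch)
print(generative_classify(150.0, male_params, female_params))
```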

  10. Discriminative approach [Figure: number of mistakes as a function of the decision threshold on voice pitch]
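By contrast, the discriminative approach on this slide picks the decision boundary directly by counting mistakes. A sketch, again with invented names, that scans every candidate pitch threshold and keeps the one with the fewest training errors:

```python
import numpy as np

def best_threshold(pitch, label):
    """Return the pitch threshold that minimizes training mistakes.

    label is +1/-1; the rule predicts +1 when pitch > threshold.
    """
    best_t, best_mistakes = None, len(pitch) + 1
    for t in np.sort(pitch):
        pred = np.where(pitch > t, 1, -1)
        mistakes = int((pred != label).sum())
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes
```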

  11. Ill-behaved data [Figure: pitch data for which the fitted Gaussians (mean1, mean2) describe the classes poorly, while the mistake-minimizing threshold still separates them; axes: probability / no. of mistakes vs. voice pitch]

  12. Traditional statistics vs. machine learning [Diagram: in the traditional pipeline, statistics turns data into an estimated world state and decision theory turns that into actions; machine learning maps data directly to predictions and actions]

  13. Comparison of methodologies

  14. Boosting

  15. A weak learner. The input is a weighted training set (x1,y1,w1),(x2,y2,w2) … (xn,yn,wn): feature vectors (instances) x1,x2,…,xn, binary labels y1,y2,…,yn, and non-negative weights that sum to 1. The weak learner outputs a weak rule h. The weak requirement: h must beat random guessing with respect to the weights, i.e. its weighted correlation with the labels, Σ_i w_i y_i h(x_i), is at least some advantage g > 0.

  16. The boosting process. Start with uniform weights (x1,y1,1/n), … (xn,yn,1/n); at each round the weak learner is run on the current weighted set (x1,y1,w1), … (xn,yn,wn) to produce a rule h1, h2, …, hT, and the weights are updated before the next round. Final rule: Sign[a1 h1 + a2 h2 + … + aT hT].
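The diagram is the standard AdaBoost loop, so a compact Python sketch may help. This is a generic textbook implementation, not the code behind the slides; `weak_learner` stands for any routine that returns a callable rule chosen against the current weights.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost with labels y in {-1,+1}; returns the combined rule."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start with uniform weights
    rules, alphas = [], []
    for _ in range(T):
        h = weak_learner(X, y, w)        # weak rule chosen w.r.t. current weights
        pred = h(X)
        eps = w[pred != y].sum()         # weighted training error
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        w *= np.exp(-alpha * y * pred)   # up-weight mistakes, down-weight correct examples
        w /= w.sum()
        rules.append(h)
        alphas.append(alpha)
    def final_rule(X):
        return np.sign(sum(a * h(X) for a, h in zip(alphas, rules)))
    return final_rule
```

The `alpha` coefficients play the role of a1,…,aT in the final rule Sign[a1 h1 + … + aT hT].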

  17. Main properties of AdaBoost • If the advantages of the weak rules over random guessing are g1,g2,..,gT, then the in-sample error of the final rule (w.r.t. the initial weights) is at most ∏_t √(1 − 4 g_t²) ≤ exp(−2 Σ_t g_t²). • Even after the in-sample error reaches zero, additional boosting iterations usually improve the out-of-sample error. [Schapire, Freund, Bartlett, Lee, Ann. Stat. 98]

  18. What is a good weak learner? • The set of weak rules (features) should be flexible enough to be (weakly) correlated with most conceivable relations between feature vector and label. • Small enough to allow exhaustive search for the minimal weighted training error. • Small enough to avoid over-fitting. • Should be able to calculate predicted label very efficiently. • Rules can be “specialists” – predict only on a small subset of the input space and abstain from predicting on the rest (output 0).
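One weak learner that fits all of these criteria is a one-sided decision stump that acts as a “specialist”: it predicts ±1 on one side of a threshold and abstains (outputs 0) elsewhere. The sketch below is illustrative only; the exhaustive search over features, thresholds, and signs is exactly the kind of small search the bullet points call for.

```python
import numpy as np

def stump_learner(X, y, w):
    """Exhaustively search one-sided stumps: predict s where x_j > t, abstain elsewhere.

    Returns the rule with the largest weighted correlation sum_i w_i y_i h(x_i).
    """
    best_corr, best = -np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            active = X[:, j] > t
            for s in (+1, -1):
                corr = (w * y * np.where(active, s, 0)).sum()
                if corr > best_corr:
                    best_corr, best = corr, (j, t, s)
    j, t, s = best
    return lambda X: np.where(X[:, j] > t, s, 0)
```

A learner of this form plugs directly into the boosting loop sketched after slide 16.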

  19. Image features [Figure: unique binary features obtained by thresholding rectangle filters that compare sums of pixels in adjacent rectangular regions]
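Rectangle features like the ones in this figure are usually evaluated in constant time from an integral image. The following is my own sketch of that standard construction; Viola’s actual feature set and thresholds are not given in the slides.

```python
import numpy as np

def integral_image(img):
    """Zero-padded cumulative sums so any rectangle sum needs only 4 lookups."""
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] via the integral image ii."""
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """A horizontal two-rectangle filter: left half minus right half."""
    half = width // 2
    return rect_sum(ii, top, left, height, half) - rect_sum(ii, top, left + half, height, half)

def binary_feature(ii, top, left, height, width, theta):
    """Threshold the filter response to get a binary feature (theta is illustrative)."""
    return 1 if two_rect_feature(ii, top, left, height, width) > theta else 0
```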

  20. Example classifier for face detection. A classifier with 200 rectangle features was learned using AdaBoost: 95% correct detection on the test set with 1 in 14,084 false positives. Not quite competitive... [Figure: ROC curve for the 200-feature classifier]

  21. Alternating Trees Joint work with Llew Mason

  22. Decision trees [Figure: a decision tree that tests X>3 and then Y>5 to assign labels −1/+1, shown next to the corresponding axis-parallel partition of the (X,Y) plane at X=3 and Y=5]

  23. Decision tree as a sum [Figure: the same tree rewritten as a sum of real-valued scores (+0.2, −0.1, +0.1, −0.3, …) attached to its nodes; the predicted label is the sign of the total]

  24. An alternating decision tree [Figure: an alternating decision tree with decision nodes X>3, Y>5 and Y<1, each contributing a real-valued score (+0.2, −0.1, +0.7, …); an instance adds up the scores along every path it satisfies, and the predicted label is the sign of the sum]
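Reading the figure as code: an alternating tree is evaluated by summing a root score plus the scores of every decision node whose precondition holds, then taking the sign. The node structure and the numbers below are illustrative, not the actual tree from the slide.

```python
def adtree_predict(x, root_score, nodes):
    """Evaluate an alternating decision tree as a signed sum of scores.

    Each node is (precondition, condition, score_if_true, score_if_false):
    it only fires when its precondition holds, then adds one of its two scores.
    """
    total = root_score
    for precondition, condition, score_true, score_false in nodes:
        if precondition(x):
            total += score_true if condition(x) else score_false
    return (1 if total >= 0 else -1), total

# Illustrative tree in the spirit of slide 24 (scores invented for the example):
nodes = [
    (lambda x: True,       lambda x: x["X"] > 3, -0.3, +0.1),
    (lambda x: True,       lambda x: x["Y"] > 5, +0.2, -0.1),
    (lambda x: x["X"] > 3, lambda x: x["Y"] < 1, +0.7,  0.0),  # nested decision node
]
print(adtree_predict({"X": 4, "Y": 0.5}, root_score=0.2, nodes=nodes))
```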

  25. Example: Medical Diagnostics • Cleve dataset from UC Irvine database. • Heart disease diagnostics (+1=healthy,-1=sick) • 13 features from tests (real valued and discrete). • 303 instances.

  26. Adtree for Cleveland heart-disease diagnostics problem

  27. Cross-validated accuracy

  28. Alternating tree for “buizocity”

  29. Alternating Tree (Detail)

  30. Precision/recall graphs [Figure: accuracy as a function of score]

  31. “Drinking out of a fire hose” Allan Wilks, 1997

  32. Data aggregation [Diagram: front-end systems (cashier’s system, telephone switch, web server, web-camera) send massive distributed data streams into a “data warehouse”, which feeds analytics]

  33. The database bottleneck • Physical limit: a disk “seek” takes 0.01 sec • Same time to read/write 10^5 bytes • Same time to perform 10^7 CPU operations • Commercial DBMS are optimized for varying queries and transactions. • Classification tasks require evaluation of fixed queries on massive data streams.

  34. Working with large flat files • Sort the file according to X (“called telephone number”). • Can be done very efficiently for very large files • Counting occurrences becomes efficient because all records for a given X appear in the same disk block. • Randomly permute records • Reading k consecutive records suffices to estimate a few statistics for a few decisions (splitting a node in a decision tree). • Done by sorting on a random number. • “Hancock” – a system for efficient computation of statistical signatures for data streams. http://www.research.att.com/~kfisher/hancock/
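A small sketch of why sorting pays off: once the file is sorted on the key, all records for one key sit in adjacent blocks, so a single sequential pass computes per-key statistics with no random seeks. The CSV layout (calling number, called number, time of day, length) mirrors slide 3, but the field order and units are my assumption.

```python
import csv

def count_calls_per_number(sorted_path):
    """One sequential pass over a call-detail file sorted by called number.

    Yields (called_number, n_calls, total_seconds) for each key.
    """
    current, n_calls, total_secs = None, 0, 0
    with open(sorted_path, newline="") as f:
        for calling, called, time_of_day, length in csv.reader(f):
            if called != current:
                if current is not None:
                    yield current, n_calls, total_secs
                current, n_calls, total_secs = called, 0, 0
            n_calls += 1
            total_secs += int(length)
        if current is not None:
            yield current, n_calls, total_secs
```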

  35. Working with data streams • “You get to see each record only once” • Example problem: identify the 10 most popular items for each retail-chain customer over the last 12 months. • To learn more: Stanford’s Stream Dream Team: http://www-db.stanford.edu/sdt/
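For the one-pass “top 10 items per customer” example, one standard streaming technique is to keep a small, bounded set of counters per customer (a Misra–Gries style sketch). This is my own illustration of the kind of method such stream systems use; the slides do not prescribe an algorithm.

```python
from collections import defaultdict

def misra_gries_update(counters, item, k):
    """Keep at most k counters; approximates the heavy hitters in one pass."""
    if item in counters:
        counters[item] += 1
    elif len(counters) < k:
        counters[item] = 1
    else:
        for key in list(counters):   # decrement everyone, drop counters that hit zero
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def top_items_per_customer(stream, k=10):
    """stream yields (customer_id, item); returns approximate top items per customer."""
    per_customer = defaultdict(dict)
    for customer, item in stream:
        misra_gries_update(per_customer[customer], item, k)
    return {c: sorted(cnt, key=cnt.get, reverse=True) for c, cnt in per_customer.items()}
```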

  36. Analyzing at the source [Diagram: analytics drives Java code generation; the code is downloaded to the front-end systems, which upload statistics back for aggregation]

  37. Learn Slowly, Predict Fast! • Buizocity: • 10,000 instances are sufficient for learning. • 300,000,000 have to be labeled (weekly). • Generate ADTree classifier in C, compile it and run it using Hancock.

  38. Paul Viola’s face detector: scan 50,000 location/scale boxes in each image, 15 images per sec., to detect a few faces. The cascaded method minimizes average processing time. Training takes a day on a fast parallel machine. [Diagram: each image box passes through Classifier 1, Classifier 2, Classifier 3, …; a pass (T) sends it on to the next classifier and eventually to FACE, while a fail (F) at any stage immediately rejects it as NON-FACE]
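The cascade in the diagram can be summarized in a few lines: each box runs through progressively more expensive classifiers and is rejected as NON-FACE at the first failure, so the cheap early stages dismiss almost all of the 50,000 boxes. The classifier objects below are placeholders, not Viola’s trained stages.

```python
def cascade_is_face(box, classifiers):
    """Run a detection cascade: reject on the first 'F', accept only if all stages pass."""
    for clf in classifiers:
        if not clf(box):          # early exit keeps the average cost per box low
            return False          # NON-FACE
    return True                   # FACE

def detect_faces(boxes, classifiers):
    """Scan all location/scale boxes of an image and keep the few that survive."""
    return [box for box in boxes if cascade_is_face(box, classifiers)]
```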

  39. Summary • Generative vs. Predictive methodology • Boosting • Alternating trees • The database bottleneck • Learning slowly, predicting fast.

  40. Other work 1 • Specialized data compression: • When data is collected in small bins, most bins are empty. • Instead of storing the zeros, smart compression dramatically reduces data size. • Model averaging: • Boosting and Bagging make classifiers more stable. • We need theory that does not use Bayesian assumptions. • Closely relates to margin-based analysis of boosting and of SVM. • Zipf’s Law: • Distribution of words in free text is extremely skewed. • Methods should scale exponentially in entropy rather than linearly in number of words.

  41. Other work 2 • Online methods: • Data distribution changes with time. • Online refinement of feature set. • Long-term learning. • Effective label collection: • Selective sampling to label only hard cases. • Comparing labels from different people to estimate reliability. • Co-training: different channels train each other. (Blum, Mitchell, McCallum)

  42. Contact me! • Yoav@banter.com • http://www.cs.huji.ac.il/~yoavf
