Machine Learning meets the Real World: Successes and new research directions Andrea Pohoreckyj Danyluk Department of Computer Science Williams College, Williamstown, MA October 11, 2002
Data, data everywhere... • Scientific: data collection routinely produces gigabytes of data per day • Telecommunications: AT&T produces 275 million call records per day • Web: Google handles 70 million searches per day • Retail: WalMart records 20 million sales transactions per day
A wealth of information • Scientific data • Detection of oil spills from satellite images • Prediction of molecular bioactivity for drug design • Telecommunications • Fraud detection to distinguish between “bad” and normal usage of cell phones
A wealth of information • Web mining • Characterize killer pages • Retail • Determine better product placement • Direct mail • Predict who is most likely to donate to a charity
Machine learning success (Machine learning is ubiquitous) • Scientific discovery • Detection of oil spills from satellite images • Telecommunications • Diagnosis of problems in the local loop • Printing • Determine causes of banding (printing cylinder problems) • Control • Self-steering vehicles
Why research in machine learning is thriving today Research in machine learning benefits from • Abundant data • Interest in fielding new applications • Even more data • Push on the limits of our understanding, technology, etc.
Plan for this talk Original plan • Discuss success stories and failures • Failures help identify new areas of research New plan • One success story in detail • Lesson learned: new areas of research can be identified even when we succeed
Induction of decision trees • Not the only (or even the hottest) learning algorithm • Used in many contexts • Important for understanding our success story: local-loop network diagnosis
Inductive learning Given a collection of observations of the form (x, f(x)), find a function g(x) that approximates f(x)
Learning objectives • Learn a tree that is correct • Learn a tree that is compact • At every level in the tree, select a test that best differentiates examples of one class from another
TDIDT • If all examples are from the same class • The tree is a leaf with that class name • Else • Pick a test to make • Construct one edge for each possible test outcome • Partition the examples by test outcome • Build subtrees recursively
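This recursive procedure is the heart of top-down induction of decision trees. A minimal sketch in Python, assuming examples are (attribute-dict, class-label) pairs and deferring the choice of test to a choose_test function (defined after the Information Gain slide below); real C4.5 adds gain ratios, continuous attributes, and pruning:

```python
# A minimal TDIDT sketch; `choose_test` picks the attribute to split on
# (e.g., by information gain, as defined later in the deck).
from collections import Counter

def tdidt(examples, attributes, choose_test):
    """examples: list of (attribute->value dict, class label) pairs."""
    classes = [label for _, label in examples]
    # If all examples are from the same class, the tree is a leaf.
    if len(set(classes)) == 1:
        return classes[0]
    if not attributes:  # no tests left: label with the majority class
        return Counter(classes).most_common(1)[0][0]
    # Pick a test; construct one edge per outcome; recurse on partitions.
    attr = choose_test(examples, attributes)
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, c) for x, c in examples if x[attr] == value]
        tree[attr][value] = tdidt(subset, remaining, choose_test)
    return tree
```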
The Gain Criterion • Measure the information of the collection • Measure the information of each possible split • Choose the split with greatest information gain
Information (Entropy) • Let T be a set of examples • Let C1, C2, …, Cn be class labels • freq(Ci,T) = number of examples in T that belong to class Ci • |T| = number of examples in T • Selecting an example and announcing that it belongs to class Ci conveys info = −log2(freq(Ci,T)/|T|) bits
Information (Entropy) • Let T be a set of examples • Info(T) = −Σi=1..n (freq(Ci,T)/|T|) · log2(freq(Ci,T)/|T|)
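These definitions transcribe directly into code. A sketch, reusing the (attribute-dict, class-label) example format from the TDIDT sketch above:

```python
import math
from collections import Counter

def info(examples):
    """Info(T) = -sum over classes of (freq(Ci,T)/|T|) * log2(freq(Ci,T)/|T|)."""
    counts = Counter(label for _, label in examples)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```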
Entropy after a split • Let X be an attribute with n possible values • Let Tj be the examples that have value j for attribute X • Average entropy that results from splitting on X: infoX(T) = Σj=1..n (|Tj| / |T|) · Info(Tj)
Information Gain • Compute infoX(T) for every attribute • Select the attribute that maximizes Info(T) − infoX(T)
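Continuing the sketch, the split entropy and the gain criterion supply the choose_test function that the TDIDT sketch above expects. (C4.5 itself normalizes this into a gain ratio to avoid favoring many-valued attributes, a detail the slide omits.)

```python
def info_x(examples, attr):
    """info_X(T): weighted average entropy of the partition induced by attr.
    Uses info() from the entropy sketch above."""
    total = len(examples)
    avg = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, c) for x, c in examples if x[attr] == value]
        avg += (len(subset) / total) * info(subset)
    return avg

def choose_test(examples, attributes):
    # Select the attribute that maximizes gain: Info(T) - info_X(T).
    return max(attributes, key=lambda a: info(examples) - info_x(examples, a))
```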
Scrubber (the success story) • Diagnoses problems in the local loop • Problem may be due to trouble in: • Customer premise equipment • Facilities connecting customer to cable • Cable • Central office • Millions of “troubles” reported annually
MAX, 1990 • Acts as Maintenance Administrator (MA) • Sequence of actions: • Customer calls • Rep takes information; initiates tests • Trouble report sent to MA • MA puts trouble in dispatch queue for a specific type of technician
Scrubber 2 • Performed a task at a later point in the pipeline • Survey dispatch queues to determine whether dispatch appropriate • Dispatch not immediate • Many problems resolved exogenously
Scrubber 3 • Scrubber 2 for new application platform • Centralized knowledge server • Cover twice as large a network
Implementation difficulties • Original expert system shell no longer supported • Knowledge base had grown opaque • Many tweaks over a decade • Many knowledge engineers • Most not available to work on Scrubber 3
Requirements • Level of performance at least as good as prior system • Overall accuracy • False positive and false negative rates within acceptable ranges • Comprehensible • For understanding and acceptance by experts
Additional requirements (ours) • Improved performance • Improved extensibility
Phase I: Modeling Scrubber 2 • Applied a decision tree learning algorithm • Input data: • Trouble reports • Scrubber 2 diagnoses
Data 26,000 trouble reports • 40 attributes (half continuous; half symbolic) • Two classes • Dispatch • Don't dispatch (i.e., call the customer to verify all is OK)
Background knowledge • C4.5 selected • 17 of 40 attributes used
Phase I results • Decision trees with predictive accuracy of 0.99, with as few as 10,000 examples • Less than two days of work (easy!)
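The slides show no code, but a rough modern reconstruction of the Phase I setup might look like the following, with scikit-learn's CART implementation standing in for C4.5 and the file and column names invented for illustration:

```python
# Hypothetical reconstruction of Phase I: fit a decision tree to trouble
# reports labeled with Scrubber 2's dispatch / don't-dispatch decisions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

reports = pd.read_csv("trouble_reports.csv")            # invented file name
X = pd.get_dummies(reports.drop(columns=["decision"]))  # one-hot the symbolic attrs
y = reports["decision"]                                 # "dispatch" vs. "dont"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier().fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))          # slides report ~0.99
```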
Phase II: Acceptance • Comprehensibility = Readability + Simplicity + Fidelity • Need to observe rationality in learned knowledge • Original trees on the order of 1,000 nodes • The simpler the model, the better it can be understood
Trading off simplicity and correctness • Pruning nodes sacrifices correctness • Appropriate when comprehensibility an issue • Langley and Schwabacher, 2001 • Note: not pruning to avoid overfitting
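Continuing the hypothetical scikit-learn sketch from Phase I, pruning for simplicity (rather than to avoid overfitting) amounts to sweeping a size penalty and watching accuracy trade against node count:

```python
# Cost-complexity pruning as a comprehensibility knob: larger alpha gives
# smaller (more readable) trees at some cost in accuracy.
from sklearn.tree import DecisionTreeClassifier  # as in the Phase I sketch

for alpha in (0.0, 0.001, 0.01, 0.05):
    pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: {pruned.tree_.node_count} nodes, "
          f"accuracy={pruned.score(X_test, y_test):.3f}")
```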
Phase II results • Used only the two most prominent attributes • New decision trees created • Accuracy still fell within the acceptable zone
Phase III: Working toward extensibility • Hoped to gain flexibility for • Local modifiability • Additional attribute values • Moved toward probabilistic decision tree • Leaves labeled with probability estimates, not decisions • Stubby trees easy to represent in tabular form
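With only two attributes, such a stubby probabilistic tree is just a lookup table from attribute-value pairs to an estimated P(dispatch). A minimal sketch, with invented attribute names (test_result, trouble_code) and a Laplace correction as one plausible way to estimate leaf probabilities:

```python
from collections import defaultdict

# (attr1_value, attr2_value) -> [dispatch count, total count]
counts = defaultdict(lambda: [0, 0])
for x, label in examples:   # examples as in the earlier sketches
    key = (x["test_result"], x["trouble_code"])
    counts[key][0] += (label == "dispatch")
    counts[key][1] += 1

# Laplace-smoothed probability estimate at each "leaf" of the stubby tree
table = {key: (d + 1) / (n + 2) for key, (d, n) in counts.items()}
```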
Phase IIIb: More data • Focus on two attributes gave us access to an extensive data set • Many more trouble reports • Abridged (two-attribute) form had not been considered useful earlier
Phase III results • Simple diagnostic model • Greater empirical confidence -- important due to the small disjunct problem • “Big” general rules cover approximately 50% of the data • Remaining 50% covered by small disjuncts
Summarizing the success story • C4.5 applied to induce Scrubber 2 model • Pruned model for comprehensibility/simplicity • Converted new model into probabilistic one • Used newly gained data for additional tuning and confidence • Small(?), simple model in very short time
Lessons can be learned from success Lesson 1: the importance of comprehensibility • Rationality • Readability • Simplicity
Lessons can be learned from success Lesson 2: the need for algorithms to handle small data sets • Creative ways to engineer interesting features when data are few • Openness to alternative sources of data • Algorithms specifically tuned to handle small data sets Langley has noted this to be an issue for scientific data -- but it is true for industrial data as well
Lessons can be learned from success Lesson 3: the need to think about systematic error • Locally systematic errors look like noise unless there is enough data • Clearly related to the problem of small data sets • How do our algorithms hold up?
Lessons can be learned from success Lesson 4: the need to think about the future • Learning results put into practice will be modified and extended • Must new models be learned? • Can improvement be incremental?
Lessons can be learned from success Lesson 5: creative uses of the technology • Learning for the purposes of re-engineering isn’t “standard” • New applications will serve to fuel new research
Further reading and acknowledgements • Carla Brodley et al., American Scientist, Jan./Feb. 1999 • Pat Langley, various publications • Thanks to Foster Provost and many others at NYNEX/Bell Atlantic