
Data Mining (and machine learning)


Presentation Transcript


  1. Data Mining (and machine learning). A few important things in brief: top10dm, neural networks, overfitting, SVM. David Corne and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com. These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html

  2. http://is.gd/top10dm: 1. C4.5, 2. k-means, 3. SVMs, 4. Apriori, 5. EM algorithm, 6. PageRank, 7. AdaBoost, 8. k-NN, 9. Naive Bayes, 10. CART

  3. http://is.gd/top10dm: 1. C4.5, 2. k-means, 3. SVMs, 4. Apriori, 5. EM algorithm, 6. PageRank, 7. AdaBoost, 8. k-NN, 9. Naive Bayes, 10. CART. These are the 10 most mentioned in the Data Mining academic literature, not including the Machine Learning literature.

  4. http://is.gd/top10dm: 1. C4.5, 2. k-means, 3. SVMs, 4. Apriori, 5. EM algorithm, 6. PageRank, 7. AdaBoost, 8. k-NN, 9. Naive Bayes, 10. CART

  5. http://is.gd/top10dm: 1. C4.5, 2. k-means, 3. SVMs, 4. Apriori, 5. EM algorithm, 6. PageRank, 7. AdaBoost, 8. k-NN, 9. Naive Bayes, 10. CART. Annotations on the slide group these: decision tree methods (C4.5, CART), a machine-learning decision surface method (SVMs), a 'meta' / ensemble method (AdaBoost), and methods specific to certain kinds of data (e.g. Apriori, PageRank).

  6. So... today we will look at: neural networks (the no. 1 machine learning method, or the no. 11 data mining method?), overfitting (this is a good place for it...), and support vector machines.

  7. Decision surface methods: finding separating hyperplanes

  8. Decision surface methods: finding separating hyperplanes

  9. Decision surface methods: finding separating hyperplanes

  10. Decision surface methods: finding separating hyperplanes

  11. Decision surface methods: finding separating hyperplanes

  12. Decision surface methods: finding separating hyperplanes

  13. Artificial Neural Networks. [Figures: an artificial neuron (node); an ANN (neural network).] Nodes abstractly model neurons; they do very simple number crunching. Numbers flow from left to right: the numbers arriving at the input layer get "transformed" into a new set of numbers at the output layer. There are many kinds of nodes, and many ways of combining them into a network, but we need only be concerned with the types described here, which turn out to be sufficient for any (consistent) pattern classification task.

  14. A single node (artificial neuron) works like this. [Figure: a node with three input lines, weighted 3, 1 and 2, and two output lines, weighted 2 and -2.]

  15. A single node (artificial neuron) works like this. Field values come along (inputs from us, or from other nodes): here the three inputs are 4, -3 and 0.

  16. A single node (artificial neuron) works like this. The inputs get multiplied by the strengths (weights) on the input lines: 4 × 3 = 12, -3 × 1 = -3, 0 × 2 = 0.

  17. A single node (artificial neuron) works like this. The node adds up its weighted inputs and applies a simple function to the sum: f(12 - 3 + 0) = f(9).

  18. A single node (artificial neuron) works like this. It sends the result f(9) out along its output lines, where it will in turn get multiplied by the line weights (here 2 × f(9) and -2 × f(9)) before being delivered…
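As an aside not in the original slides: a minimal Python sketch of the single-node computation above, assuming a sigmoid for the unspecified "simple function" f.

```python
import math

def neuron_output(inputs, weights, f=lambda s: 1.0 / (1.0 + math.exp(-s))):
    """Weighted sum of the inputs, passed through an activation function f (assumed sigmoid here)."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return f(s)

# The worked example from slides 15-18: inputs 4, -3, 0 on lines weighted 3, 1, 2.
value = neuron_output([4, -3, 0], [3, 1, 2])      # f(4*3 + -3*1 + 0*2) = f(9)
# The result then travels along the node's output lines, scaled by their weights (2 and -2).
downstream = [2 * value, -2 * value]
print(value, downstream)
```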

  19. Computing AND with an NN. [Figure: inputs A and B each feed the output node with weight 0.5.] The blue node is the output node. It adds the weighted inputs, and outputs 1 if the result is >= 1, otherwise 0.

  20. Computing OR with an NN. [Figure: inputs A and B each feed the output node with weight 1.] The blue node is the output node. It adds the weighted inputs, and outputs 1 if the result is >= 1, otherwise 0. With these weights, only one of the inputs needs to be a 1 for the output to be 1; the output will be 0 only if both inputs are zero.

  21. Computing NOT with an NN. [Figure: input A feeds the output node with weight -1; a bias unit, which always sends a fixed signal of 1, feeds it with weight 1.] This NN computes the NOT of input A. The blue unit is a threshold unit with a threshold of 1, as before. So if A is 1, the weighted sum at the output unit is 0, hence the output is 0; if A is 0, the weighted sum is 1, so the output is 1.
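As an aside not in the original slides: the three gates above written directly as threshold units, using the weights given on slides 19–21.

```python
def threshold_unit(inputs, weights, threshold=1.0):
    """Output 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) >= threshold else 0

def AND(a, b):
    return threshold_unit([a, b], [0.5, 0.5])   # slide 19: both weights 0.5

def OR(a, b):
    return threshold_unit([a, b], [1, 1])       # slide 20: both weights 1

def NOT(a):
    return threshold_unit([a, 1], [-1, 1])      # slide 21: A weighted -1, bias unit (fixed 1) weighted 1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT a:", NOT(a))
```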

  22. So, an NN can compute AND, OR and NOT – so what? It is straightforward to combine ANNs together, with outputs from some becoming the inputs of others, etc. That is, we can combine them just like logic gates on a microchip. And this means that a neural network can compute ANY function of the inputs

  23. Imagine this. [Figure: an image of a handwritten character is converted into an array of grey levels (the inputs); there are 26 outputs, one for each character a, b, c, d, e, f, …] And you're telling me this because…? Weights on the links are chosen such that the output corresponding to the correct letter emits a 1, and all the others emit a 0. This sort of thing is not only possible, but routine: medical diagnosis, wine-tasting, lift-control, sales prediction, …

  24. Getting the Right Weights. Clearly, an application will only be accurate if the weights are right. An ANN starts with randomised weights, and with a database of known examples for training. [Figure: if the input pattern corresponds to a "c", we want the "c" output to emit 1 and every other output to emit 0.] If the ANN is wrong, the weights are adjusted in a simple way which makes it more likely that the ANN will be correct for this input next time.

  25. Training an NN. It works like this: send a training pattern in → crunch through to the outputs → if all outputs are correct, STOP; if some are wrong, adjust the weights and repeat. Present a pattern as a series of numbers at the first layer of nodes (e.g. the field values for instance 1). Each node in the next layer does its simple processing and sends its results to the next layer, and so on, until numbers come out at the output layer. Compare the NN's output pattern with the known correct pattern (the target class). If they differ, adjust the weights somehow to make it more likely to be correct on this pattern next time.
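The slides do not commit to a particular weight-adjustment rule; as a hedged sketch of the loop above, here is the classic perceptron update for a single threshold unit.

```python
import random

def train_threshold_unit(examples, n_inputs, lr=0.1, epochs=100):
    """Present each (inputs, target) pattern, compare the unit's output with the
    target, and nudge the weights when it is wrong; stop once every pattern is correct."""
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = random.uniform(-0.5, 0.5)
    for _ in range(epochs):
        all_correct = True
        for inputs, target in examples:
            output = 1 if sum(x * w for x, w in zip(inputs, weights)) + bias >= 0 else 0
            error = target - output
            if error != 0:                                   # some wrong: adjust weights
                all_correct = False
                weights = [w + lr * error * x for w, x in zip(weights, inputs)]
                bias += lr * error
        if all_correct:                                      # all correct: STOP
            break
    return weights, bias

# Example: learn AND from its truth table.
w, b = train_threshold_unit([([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)], n_inputs=2)
```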

  26. 'Classical' NN Training. An algorithm called backpropagation (BP) is the classic way of training a neural network. Based on partial differentiation, it prescribes a way to adjust the weights so that the error on the latest pattern would probably be reduced next time. We can instead use any optimisation algorithm (e.g. a genetic algorithm) to find the weights for an NN. E.g. the first ever significant application of particle swarm optimisation showed that it was faster than BP, with better results.
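The slide names backpropagation but gives no equations; as a hedged illustration, here is the gradient (delta-rule) update for a single sigmoid unit with squared error, which is the one-layer special case of what BP does layer by layer via the chain rule.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def delta_rule_step(weights, inputs, target, lr=0.5):
    """One gradient-descent step on the squared error 0.5 * (target - output)^2
    for a single sigmoid unit. Backpropagation repeats this idea for every layer."""
    output = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
    delta = (target - output) * output * (1 - output)   # -d(error)/d(net input)
    return [w + lr * delta * x for w, x in zip(weights, inputs)]
```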

  27. Generalisation. The ANN is learning during its training phase. When it is in use, providing decisions/classifications for live cases it hasn't seen before, we expect a reasonable decision from it, i.e. we want it to generalise well. [Figure: three panels, each showing a network trained on black As and Bs; the black line is a visualisation of its decision boundary, so it will think anything on one side is an A and anything on the other side is a B. The white A is an unseen test case. The panels illustrate good generalisation, fairly poor generalisation, and 'stereotyping?'; in the third, the network thinks the white A is a B.] Coverage and extent of the training data help to avoid poor generalisation. Main point: when an NN generalises well, its results seem sensible, intuitive, and generally more accurate than people.

  28. Poor generalisation – some insight into overfitting. Suppose we train a classifier to tell the difference between handwritten t and c, using only these examples: [Figure: a handful of training ts and cs.] The classifier will learn easily. It will probably give 100% correct predictions on these cases.

  29. Overfitting. BUT: this classifier will probably generalise very poorly; it will perform very badly on a test set. [Figure: likely performance on two unseen cases – it will probably predict that one is a c and that the other is a t, getting both wrong.] Why?

  30. Avoiding Overfitting • It can be avoided by using as much training data as possible, ensuring as much diversity as possible in the data. This cuts down on the potential existence of features that might be discriminative in the training data, but are otherwise spurious. • It can be avoided by jittering (adding noise). During training, every time an input pattern is presented, it is randomly perturbed. The idea is that spurious features will be 'washed out' by the noise, while valid discriminatory features will remain. The problem with this approach is how to correctly choose the level of noise.
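A minimal sketch of jittering, not from the slides: Gaussian noise is assumed, and `train_on_pattern` is a hypothetical stand-in for whatever weight-update rule is being used; picking the noise level `sigma` is the difficulty the slide points out.

```python
import random

def jitter(pattern, sigma=0.05):
    """Return a copy of an input pattern with Gaussian noise added to every field."""
    return [x + random.gauss(0.0, sigma) for x in pattern]

def train_with_jitter(network, training_data, epochs, sigma=0.05):
    # training_data: list of (input_pattern, target) pairs
    for _ in range(epochs):
        for pattern, target in training_data:
            noisy = jitter(pattern, sigma)             # perturb the pattern each time it is presented
            network.train_on_pattern(noisy, target)    # hypothetical method: any update rule
```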

  31. Avoiding Overfitting II. [Figure: a typical curve of error against training time (for methods like neural networks). The error on the training data keeps falling, but the error on unseen validation data – data not in the training set – falls and then starts to rise; the point where it starts rising is where the network is starting to overfit.]

  32. Avoiding Overfitting III. Another approach is early stopping. During training, keep track of the network's performance on a separate validation set of data. At the point where error continues to improve on the training set, but starts to get worse on the validation set, that is when training should be stopped, since it is starting to overfit on the training data. The problem here is that this point is far from always clear cut.
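A minimal sketch of early stopping, not from the slides: `train_one_epoch`, `error_on`, `get_weights` and `set_weights` are hypothetical helpers, and a simple "patience" counter stands in for the judgement call that, as the slide says, is far from always clear cut.

```python
def train_with_early_stopping(network, train_set, validation_set,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    then roll back to the best weights seen."""
    best_val_error = float("inf")
    best_weights = network.get_weights()              # hypothetical accessor
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        network.train_one_epoch(train_set)            # hypothetical: one pass of weight updates
        val_error = network.error_on(validation_set)  # hypothetical: error on held-out data
        if val_error < best_val_error:
            best_val_error = val_error
            best_weights = network.get_weights()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                  # validation error stopped improving: overfitting
    network.set_weights(best_weights)
    return network
```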

  33. Real-world applications of ANNs are all over the place

  34. Stocks, Commodities and Futures • Currency Price Predictions • James O'Sullivan: Controls trading of more than 10 different financial markets with consistent profits. • Corporate Bond Rating • George Pugh: Predicts corporate bond ratings with 100% accuracy for consulting and trading. • Standard and Poor's 500 Prediction • LBS Capital Management, Inc: Predicts the S&P 500 one day ahead and one week ahead with better accuracy than traditional methods. • Forecasting Stock Prices • Walkrich Investments: Neural networks rate underpriced stocks, beating the S&P.

  35. Business, Management, and Finance • Direct Marketing Mail Prediction • Microsoft: Improves response rates from 4.9% to 8.2%. • Credit Scoring • Herbert Jensen: Predicts loan application success with 75-80% accuracy. • Identifying Policemen with Potential for Misconduct • The Chicago Police Department predicts misconduct potential based on employee records. • Jury Summoning with Neural Networks • The Montgomery Court House in Norristown, PA saves $70 million annually using The Intelligent Summoner from MEA. • Forecasting Highway Maintenance with Neural Networks • Professor Awad Hanna at the University of Wisconsin in Madison has trained a neural network to predict which type of concrete is better than another for a particular highway problem.

  36. Medical Applications • Breast Cancer Cell Analysis • David Weinberg, MD: Image analysis ignores benign cells and classifies malignant cells. • Hospital Expenses Reduced • Anderson Memorial Hospital: Improves the quality of care, reduces death rate, and saved $500,000 in the first 15 months of use. • Diagnosing Heart Attacks • J. Furlong, MD: Recognizes Acute Myocardial Infarction from enzyme data • Emergency Room Lab Test Ordering • S. Berkov, MD: Saves time and money ordering tests using symptoms and demographics. • Classifying Patients for Psychiatric Care • G. Davis, MD: Predicts Length of Stay for Psychiatric Patients, saving money

  37. Sports Applications • Thoroughbred Horse Racing • Don Emmons: 22 races, 17 winning horses. • Thoroughbred Horse Racing • Rich Janeva: 39% of winners picked at odds better than 4.5 to 1. • Dog Racing • Derek Anderson: 94% accuracy picking first place.

  38. Science • Solar Flare Prediction • Dr. Henrik Lundstet: Predicts the next major solar flare; helps prevent problems for power plants. • Mosquito Identification • Aubrey Moore: 100% accuracy distinguishing between male and female, two species. • Spectroscopy • StellarNet Inc: Analyze spectral data to classify materials. • Weather Forecasting • Fort Worth National Weather Service: Predict rainfall to 85% accuracy. • Air Quality Testing • Researchers at the Defense Research Establishment Suffield, Chemical & Biological Defense Section, in Alberta, Canada have trained a neural network to recognize, classify and characterize aerosols of unknown origin with a high degree of accuracy.

  39. Manufacturing • Plastics Testing • Monsanto: Predicts plastics quality, saving research time, processing time, and manufacturing expense. • Computer Chip Manufacturing Quality • Intel: Analyzes chip failures to help improve yields. • Nondestructive Concrete Testing • Donald G. Pratt: Detects the presence and position of flaws in reinforced concrete. • Beer Testing • Anheuser-Busch: Identifies the organic content of competitors' beer vapors with 96% accuracy. • Steam Quality Testing • AECL Research in Manitoba, Canada has developed the INSIGHT steam quality monitor, an instrument used to measure steam quality and mass flowrate.

  40. Support Vector Machines: a different approach to finding the decision surface, particularly good at generalisation

  41. Suppose we can divide the classes with a simple hyperplane

  42. There will be infinitely many such lines

  43. One of them is 'optimal'

  44. Because it maximises the margin – the distance between the hyperplane and the 'support vectors', the instances that lie closest to instances of the other class

  45. A Support Vector Machine (SVM) finds this hyperplane

  46. But, usually there is no simple hyperplane that separates the classes!

  47. One dimension (x), two classes

  48. Two dimensions (x, x*sin(x)), the same two classes

  49. Now we can separate the classes
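As an aside not in the slides (the actual data points are not given): made-up 1-D data illustrating the idea of slides 47–49 – adding the derived field x*sin(x) can make two classes that interleave along x separable by a straight line in the new 2-D space.

```python
import math

# Hypothetical 1-D data: the classes interleave along x, so no single cut-point on x separates them.
class_0 = [1.0, 2.0, 7.0, 8.0]     # x*sin(x) > 0 for these values
class_1 = [4.0, 5.0, 5.5, 10.5]    # x*sin(x) < 0 for these values

def add_feature(x):
    """Map a 1-D instance x to the 2-D point (x, x*sin(x))."""
    return (x, x * math.sin(x))

points_0 = [add_feature(x) for x in class_0]
points_1 = [add_feature(x) for x in class_1]
# In the (x, x*sin(x)) plane these particular points are separated by the
# horizontal line y = 0, even though no threshold on x alone separates them.
```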

  50. SVMs do this: if we add enough extra dimensions/fields using arbitrary functions of the existing fields, then it becomes very likely that we can separate the data. SVMs apply such a transformation, and then find the optimal separating hyperplane. The 'optimality' of the separating hyperplane means good generalisation properties.
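The slides do not mention any particular software; as a hedged illustration of the whole recipe, here is how a kernel SVM is typically fitted with scikit-learn (assumed available), where the kernel plays the role of the implicit extra dimensions and the fitted model is the maximum-margin separating hyperplane in that transformed space.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A standard toy dataset stands in for the slides' examples.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# kernel='rbf' implicitly maps the fields into a higher-dimensional space;
# C controls how strictly the margin is enforced.
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
print("support vectors per class:", svm.n_support_)
print("accuracy on unseen test data:", svm.score(X_test, y_test))
```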
