
Machine learning methods – Introduction: The main properties of learning algorithms








  1. Machine learning methods – Introduction
The main properties of learning algorithms

  2. The goal of machine learning
• Goal: To construct programs that are able to improve their performance using the experience collected during their operation
• Learning algorithm: an algorithm that is able to deduce regularities and relationships from a set of training examples
• Note 1: The main aim is not to memorize the actual training examples, but to generalize correctly to other samples not seen during training (also known as inductive learning)
• Assumption: the examples faithfully represent the relationship that we try to learn
• Note 2: We can never be 100% sure that the relationship we found will generalize to unseen data
• Because of this, we will call the found relationship a "hypothesis"
• After receiving further examples, the algorithm may refine the hypothesis

  3. The main types of learning tasks
• Supervised learning: the correct answer is also given with the training examples
• The most common task: classification
  • Example: character recognition: 16x16 pixels → letter
  • 16x16 pixels: input features
  • Letter: class label
• Practically, we have to learn a function from examples
• This will be the dominant topic of this semester
• Unsupervised learning: no helping information is given
• The most common task: clustering
  • Mapping data points into automatically found classes based on some kind of similarity measure
(A short code sketch contrasting the two settings follows below.)
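As an illustration (not part of the original slides), here is a minimal sketch of the two settings using scikit-learn; the toy data and the choice of models (nearest neighbour, k-means) are assumptions made only for this example.

```python
# Minimal sketch (invented toy data): supervised classification vs. unsupervised clustering.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 0.7]])  # feature vectors
y = np.array([0, 0, 1, 1])                                      # class labels (supervised case only)

# Supervised: the correct answers y are given together with the training examples.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[0.05, 0.25]]))   # -> predicted class label for an unseen point

# Unsupervised: no labels; groups are found from a similarity (distance) measure.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                    # -> automatically found cluster indices
```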

  4. The main types of learning tasks 2
• Modelling processes along time
  • In the classic function learning task we assume that the samples following each other are independent, or at least come in a random order
  • On the contrary, when modelling time series we assume that the order carries crucial information that must be modelled
  • Examples: speech recognition, text analysis, modelling stock exchange data
• Reinforcement learning
  • Example: artificial living "creatures" – autonomous agents
  • Interaction with the environment, collection of experiences
  • The experiences have no labels in themselves, only a long-term goal is defined
  • A special sub-field within machine learning
• Other special learning tasks

  5. Supervised learning of functions
• The input of the function: a vector of some measurement data
  • feature vector, attribute vector
• The output of the function: a class label or a real number
• The input of the learning algorithm: a set of training examples
• Output: a hypothesis (model) about the function
  • It can return the (hypothesized) output value for any input vector
• Set of training examples: a set of pairs of a feature vector and the corresponding class label
• Example: does the patient have influenza?
[Slide figure: a table of training instances, one row per patient; the columns form the feature vector, and the last column is the class label (Y/N). A code sketch of this representation follows below.]
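The sketch below shows one possible way to store such a training set of (feature vector, class label) pairs; the feature names and values are invented for the influenza example and are not from the original slide.

```python
# Hypothetical illustration of the influenza example: each training instance is a
# (feature vector, class label) pair; the feature names are invented for this sketch.
import numpy as np

feature_names = ["fever_C", "headache", "muscle_pain"]   # assumed features
X_train = np.array([[39.1, 1, 1],    # feature vectors, one row per patient
                    [36.8, 0, 0],
                    [38.5, 1, 0]])
y_train = np.array(["Y", "N", "Y"])  # class label: does the patient have influenza?

# The learner's input is the set of (x, y) pairs; its output is a hypothesis h(x).
for x, y in zip(X_train, y_train):
    print(x, "->", y)
```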

  6. The main properties of a learning system
• We have to think about these features when designing a new learning method or when we look for a suitable method for a given task
• The type of input/output of the function to be learned
• The representation method of the learned function (hypothesis)
• Hypothesis space: what is the set of functions that the method selects from
• Which hypothesis it prefers when there are several hypotheses that fit the data
• What algorithm it uses to find a/the best hypothesis

  7. The output of the function to be learned
• Classification: the output value is from a finite, discrete set
  • Example: character recognition. We have to tell which letter is shown in images of 16x16 pixels. Range of output values = letters of the alphabet
  • The classification task is the typical machine learning task
• Concept learning: the function has a binary range
  • Example: we want to teach a robot the notion of "chair". Each object in its environment either belongs to this notion or not.
• Regression: the range of the function is continuous
  • Example: assessing the value of used cars based on features like brand, age, engine capacity, …
(The sketch below illustrates the three output types.)
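A small illustration (not from the slides): the three output types written as Python function signatures, with invented names and stub bodies.

```python
# Sketch: the three output types expressed as type signatures (invented examples).
from typing import List

def classify_letter(pixels: List[float]) -> str:
    """Classification: output from a finite, discrete set (letters of the alphabet)."""
    ...

def is_chair(object_features: List[float]) -> bool:
    """Concept learning: binary range (the object belongs to the concept or not)."""
    ...

def used_car_price(brand_age_engine: List[float]) -> float:
    """Regression: continuous range (estimated value of the car)."""
    ...
```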

  8. The input of the function to be learned
• Binary features
• Discrete features
  • Also called nominal, symbolic or categorical features
• Continuous features
• Binary → discrete → continuous: the conversion is trivial
• Discrete → binary:
  • Class labels: learning N class labels can always be solved as N concept learning tasks ("one against the rest")
  • Features: N different values can be represented by log2 N binary values
• Continuous → discrete:
  • Can be solved by quantization (with some error), e.g. (fever) 39.7 → high
  • Quantization is only for features, less usual for training targets
(A code sketch of these conversions follows below.)
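The following sketch illustrates the conversions mentioned above; the helper function names, thresholds and toy values are assumptions made for this example.

```python
# Sketch of the conversions above (invented toy values and thresholds).
import math

# Class labels: an N-class problem as N "one against the rest" concept-learning tasks.
def one_vs_rest_targets(y, positive_class):
    return [1 if label == positive_class else 0 for label in y]
print(one_vs_rest_targets(["a", "c", "b", "a"], "a"))      # [1, 0, 0, 1]

# Features: N discrete values encoded with ceil(log2 N) binary digits.
def to_binary_code(value_index, n_values):
    n_bits = max(1, math.ceil(math.log2(n_values)))
    return [(value_index >> i) & 1 for i in range(n_bits)]
print(to_binary_code(5, 8))                                 # 3 bits suffice for 8 values

# Continuous -> discrete: quantization with some error, e.g. fever 39.7 -> "high".
def quantize_fever(temp_c):
    return "normal" if temp_c < 37.0 else "mild" if temp_c < 38.5 else "high"
print(quantize_fever(39.7))                                 # "high"
```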

  9. Why does the type of input/output matter?
• Different types of input/output require different types of inner representation
• Some algorithms work only with a certain type of features/targets
  • Or they might work with other types of features, but not optimally
• Examples:
  • Concept learning with binary features: we have to learn a Boolean function
  • In the 60s and 70s, logic formulas were thought to be the best representation of human thinking
  • A lot of research effort was put into the learning of logic formulas; these algorithms do not work on other types of data
  • The classic SVM algorithm is defined for two classes
  • Several extensions exist for multi-class tasks

  10. Input/output examples 2
• The classic decision tree algorithms were defined for discrete features
  • There are several extensions for continuous features, but these are not really efficient
• The Gaussian mixture model of statistical pattern recognition
  • This assumes continuous features
  • There is not much sense in fitting Gaussian distributions on discrete features; in many cases the algorithm would crash in practice
• Classification in general, when we have continuous features
  • The characteristic function of each class is a discontinuous function that is hard to represent
  • There are two general approaches to represent it using continuous functions:
  • Geometric approach
  • Decision-theoretic approach

  11. The feature space and the decision boundary
• When we have a feature vector of N components, our training examples can be displayed as points in an N-dimensional space
• Example:
  • 2 features → 2 axes (x1, x2)
  • Class label: shown by colors
• Goal: to find the decision boundary between the classes
• Generally: give an estimate of the (x1, x2) → c function based on the training examples
• This is the same as specifying the (x1, x2) → {0, 1} characteristic function (or indicator function) of each class ci
(A small sketch of such an indicator function follows below.)
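A minimal sketch of an indicator function in a 2-D feature space; the linear boundary x1 + x2 = 1 is invented purely for illustration.

```python
# Sketch: training points live in the 2-D feature space (x1, x2); the goal is a
# decision boundary. The indicator function below uses an invented linear boundary.
def indicator_c1(x1, x2):
    """Characteristic (indicator) function of a hypothetical class c1:
    returns 1 on one side of the line x1 + x2 = 1 and 0 on the other."""
    return 1 if x1 + x2 > 1.0 else 0

print(indicator_c1(0.2, 0.3))   # 0 -> the point is not assigned to c1
print(indicator_c1(0.8, 0.9))   # 1 -> the point is assigned to c1
```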

  12. Representing the decision boundary
• Direct (geometric) approach: we directly represent the decision surface
  • Using some simple, continuous function like lines (planes)
• Indirect (decision-theoretic) approach:
  • 1. We assign a discriminant function to each class that can tell, for any point of the space, the probability that the point belongs to the given class
  • 2. A given point is assigned the class label for which the discriminant function takes the largest value
  • The boundary between the classes is defined indirectly, by the intersection of the discriminant functions
  • This way, the classification task is solved indirectly by learning the discriminant functions
(A code sketch of this argmax rule follows below.)
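The sketch below illustrates the decision-theoretic approach with two invented, Gaussian-shaped discriminant functions; the specific functions are assumptions, not part of the slides.

```python
# Sketch: one (invented) discriminant function per class; the predicted label is the
# class whose discriminant function takes the largest value at the given point.
import math

def g_class_a(x):            # hypothetical discriminant function of class "a"
    return math.exp(-(x - 1.0) ** 2)

def g_class_b(x):            # hypothetical discriminant function of class "b"
    return math.exp(-(x - 3.0) ** 2)

def classify(x):
    scores = {"a": g_class_a(x), "b": g_class_b(x)}
    return max(scores, key=scores.get)   # argmax over the discriminant functions

print(classify(1.2))   # "a"
print(classify(2.8))   # "b"
# The decision boundary is defined only indirectly: it is where the two
# discriminant functions intersect (here, at x = 2.0).
```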

  13. Further remarks (input/output)
• It is important whether the examples have missing feature values
  • There exist methods to estimate the missing values
  • But most algorithms cannot handle these by default
  • This might happen in several practical tasks (e.g. medical diagnostics)
• It is important whether the algorithm can handle contradicting examples (same feature vector with different class labels)
  • There are solutions to this
  • But some algorithms cannot handle it
  • It is very frequent in practice, due to labelling mistakes, e.g. ambiguous diagnosis
(A minimal imputation sketch for missing values follows below.)
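As one example of estimating missing values, here is a minimal mean-imputation sketch; the data and the choice of imputation strategy are assumptions made for illustration, not recommendations from the slides.

```python
# Minimal sketch: filling in a missing feature value by per-feature mean imputation
# (one common strategy; the toy data are invented for illustration).
import numpy as np

X = np.array([[39.1, 1.0],
              [np.nan, 0.0],    # missing fever value for the second patient
              [38.5, 1.0]])

col_means = np.nanmean(X, axis=0)              # per-feature mean, ignoring missing entries
X_filled = np.where(np.isnan(X), col_means, X)  # replace NaNs with the column mean
print(X_filled)
```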

  14. Representation of the function to be learned
• Symbolic representation vs. numeric representation
  • This is an ancient debate in AI
  • 60s and 70s: symbolic representation was preferred
  • E.g. logic formulas, if-then rules
  • Currently: numeric representation is preferred
  • E.g. in neural networks the representation consists of a bunch of real numbers
• For certain tasks symbolic representation seems to be more suitable
  • E.g. automatic proving of mathematical theorems
• For other tasks it makes no sense
  • E.g. image recognition
• The most important aspect: does the model have to be well-structured and interpretable for human inspection?
  • Sometimes it does not matter, e.g. speech recognition
  • Sometimes human understanding is the goal, e.g. medical data mining

  15. What is the hypothesis space used?
• Hypothesis space: the set of functions from which the algorithm selects the best fitting one
• Example: parametric methods
  • In the case of a continuous feature space, most methods use some parametric curve to represent the function to be learned
• Example: regression with 1 variable
  • We fit a polynomial on the training points
  • Restricting the hypothesis space: we specify the degree of the polynomial
  • This restricts the set of possible functions
  • The parameters that influence the size of the hypothesis space are called meta-parameters
  • Training = finding the optimal parameters of the polynomial
  • In the example these are the coefficients of the polynomial
  • These are called the parameters of the model
(A code sketch of this example follows below.)
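A sketch of the 1-variable regression example: the polynomial degree plays the role of the meta-parameter, the fitted coefficients are the model parameters. The training points below are invented toy data.

```python
# Sketch: polynomial regression with one input variable.
import numpy as np

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

degree = 2                                      # meta-parameter: fixes the hypothesis space
coeffs = np.polyfit(x_train, y_train, degree)   # training: find the optimal coefficients
print(coeffs)                                   # parameters of the model
print(np.polyval(coeffs, 2.5))                  # the hypothesis can answer unseen inputs
```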

  16. What is the hypothesis space used? 2
• Hypothesis space: the set of functions from which the algorithm selects its hypothesis
• Restricting the hypothesis space is technically necessary
  • Continuous feature space: it is impossible to represent all possible functions
  • Discrete space: the number of possible functions is finite, so theoretically we could represent all of them, but in practice there are usually too many combinations
• It is also necessary for efficient (meaning well-generalizing) learning
  • Generalization requires that the system can give a reply for previously unseen examples
  • During training, we fit a model (function) from the hypothesis space on the data
  • The shape of this function plays a critical role in how the system replies to previously unseen data ("inductive bias")
  • Usually we work with mathematically simple function families
• The optimal hypothesis space depends on the actual task!
  • Too restricted a hypothesis space → the model won't be able to learn even the training examples
  • Too wide a hypothesis space → it "mugs up" (memorizes) the training examples, but cannot generalize
  • Similar to human learning (though we adjust the task to the child, and not the other way round…)
(A code sketch contrasting the two failure modes follows below.)
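A small sketch of the two failure modes, again using the polynomial degree as the meta-parameter that controls the hypothesis space; the data, split and degrees are invented for illustration.

```python
# Sketch: too restricted vs. too wide a hypothesis space (invented noisy toy data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(12)   # noisy "true" relationship
x_tr, y_tr = x[::2], y[::2]                                  # training points
x_te, y_te = x[1::2], y[1::2]                                # held-out (unseen) points

for degree in (0, 3, 5):           # too restricted / reasonable / close to interpolation
    c = np.polyfit(x_tr, y_tr, degree)
    train_err = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    print(degree, round(train_err, 3), round(test_err, 3))
# A too-restricted space cannot fit even the training points (high training error);
# a too-wide space can fit the training points exactly but may generalize worse.
```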

  17. Which one does it select from among the possible hypotheses?
• Consistent hypothesis: gives a correct return value for the training examples
• If there is more than one consistent hypothesis, then we have to choose among them
  • The training examples cannot help in this!
  • We need some heuristic for this
• The principle of Occam's razor: "when there is more than one possible explanation, then usually the simplest one turns out to be right"
  • Of course, we have to mathematically define the notion of "simplest"
  • E.g.: minimum description length

  18. What algorithm is used to find the best hypothesis?
• In the previous step we defined the criterion of the optimal hypothesis
  • In practice we will frequently define it as a target function
• Defining it is not enough, we have to find it somehow
• In the case of numerical models, the task of optimizing the target function usually leads to a multivariate global optimization problem
• Theoretically, we may use general-purpose global optimization algorithms for this
• In most cases, however, we will have a training algorithm specially adjusted to the needs of the actual machine learning model
(A minimal optimization sketch follows below.)
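To make the "training = optimizing a target function" view concrete, here is a minimal sketch that minimizes a least-squares target function for a 1-D linear model with plain gradient descent, standing in for a general-purpose optimizer; the data, model and step size are all invented for this example.

```python
# Sketch: training as optimization. The target function is the mean squared error of a
# 1-D linear model y ≈ w*x + b, minimized by simple gradient descent (invented details).
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])           # roughly y = 2x + 1

w, b = 0.0, 0.0                               # model parameters to be optimized
lr = 0.05                                     # step size of the optimizer
for _ in range(2000):
    err = (w * x + b) - y                     # residuals of the current hypothesis
    w -= lr * 2 * np.mean(err * x)            # gradient of the MSE target w.r.t. w
    b -= lr * 2 * np.mean(err)                # gradient of the MSE target w.r.t. b
print(round(w, 2), round(b, 2))               # close to the optimal slope and intercept
```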
