Linear Models & Clustering

Linear Models & Clustering Presented by Kwak, Nam-ju

Coverage • Classification • Some tools for classification • Linear regression • Multiresponse linear regression • Logistic regression • Perceptron • Instance-based learning • Basic understanding • kD-tree • Ball tree • Clustering • Clustering and types of clustering • Iterative distance-based clustering • Faster distance calculation

Classification • Classification • Some tools for classification • Linear regression • Multiresponse linear regression • Logistic regression • Perceptron

Some tools for classification • Classification assigns an input to one of several classes based on its features, or attributes. • Classification matters because it lets us distinguish data that share common characteristics from the rest. [Figure: some classification operations, each mapping an input to a class]

Some tools for classification • Decision tree & classification rule (example: x XOR y) • Decision tree: [Figure: a tree that tests x=1? at the root and y=1? on each branch, with leaves labeled a and b] • Classification rule: • If x=1 and y=0 then class=a • If x=0 and y=1 then class=a • If x=0 and y=0 then class=b • If x=1 and y=1 then class=b
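
The XOR example on this slide can be sketched in code. This is only an illustration: the function names `classify_tree` and `classify_rules` are hypothetical, and the tree is assumed to test x first and then y.

```python
def classify_tree(x: int, y: int) -> str:
    """Decision tree: test x at the root, then y on each branch."""
    if x == 1:
        return "b" if y == 1 else "a"
    return "a" if y == 1 else "b"


# The same knowledge written as a list of classification rules.
RULES = [
    (lambda x, y: x == 1 and y == 0, "a"),
    (lambda x, y: x == 0 and y == 1, "a"),
    (lambda x, y: x == 0 and y == 0, "b"),
    (lambda x, y: x == 1 and y == 1, "b"),
]


def classify_rules(x: int, y: int) -> str:
    """Return the class of the first rule whose condition matches."""
    for condition, label in RULES:
        if condition(x, y):
            return label
    raise ValueError("no rule matches")
```

Both representations give the same class for every combination of x and y; the tree shares the test on x across rules, while the rule list states each case independently.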

Linear regression • If the attributes and the class are numeric, we can express the class of a given input as a linear combination of the input's attributes with a set of weights. • x: class, wi: weight, ai: attribute • The wi must be chosen well, so that the combination yields the desired class for the attributes of a given input.
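
The formula from this slide is not reproduced in the text. Assuming the usual form of a linear model with k attributes, plus a constant attribute a_0 = 1 so that w_0 acts as a bias (an assumption, although it matches the expression w0a0 + … + wkak used on the perceptron slides), the class would be written as:

```latex
x = w_0 a_0 + w_1 a_1 + \cdots + w_k a_k = \sum_{j=0}^{k} w_j a_j, \qquad a_0 = 1
```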

Linear regression • Here, we introduce a simple way to make a machine “learn”. • The machine takes several training instances, each of which associates a set of attributes with a class. • It extracts a rule from the training instances, then builds and tunes a mechanism that infers the class of an unknown test example. • It then gives us an inferred class using the “learnt” knowledge.

Linear regression • n training instances are given, that is, n sets of attributes and the n corresponding classes. • x(i): the ACTUAL class of the i-th training instance • aj(i): the j-th attribute of the i-th training instance • We should find the set of wj’s that minimizes the sum, over all n training instances, of the squared difference between the actual class x(i) and the value predicted by the linear model.
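
The quantity to minimize is not reproduced in the text. Assuming the standard least-squares criterion, with k attributes and a constant attribute a_0^{(i)} = 1 for every instance (both assumptions), it can be written as:

```latex
\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^{2}
```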

Multiresponse linear regression • Linear regression is performed separately for each class, using the training instances, so that the value of the linear function becomes 1 for training instances of that class and 0 for the others. • That is, in the regression for a given class, the target x(i) is 1 if the i-th training instance belongs to that class and 0 otherwise. • The result looks like a membership function for the class.

Multiresponse linear regression • Now, given a test example, we evaluate the linear function of each class using the wi of that class. • We select the class that gives the largest value as the class of the test example. [Figure: the test example is evaluated for Class 1, Class 2, …, Class m, Class n, and the class with the largest value (Class 2 in the figure) is selected]
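
The selection step can be sketched as follows. The weights below are made-up numbers for illustration, not the result of any regression, and the names `linear_value` and `predict_class` are hypothetical; `weights[0]` is assumed to be the bias w0.

```python
from typing import Dict, Sequence


def linear_value(weights: Sequence[float], attributes: Sequence[float]) -> float:
    """Compute w0 + w1*a1 + ... + wk*ak (weights[0] is the bias w0)."""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], attributes))


def predict_class(weights_per_class: Dict[str, Sequence[float]],
                  attributes: Sequence[float]) -> str:
    """Return the class whose linear function gives the largest value."""
    return max(weights_per_class,
               key=lambda c: linear_value(weights_per_class[c], attributes))


# Hypothetical weights for three classes and two attributes (illustration only).
weights_per_class = {
    "class1": [0.1, 0.5, -0.2],
    "class2": [0.3, -0.1, 0.4],
    "class3": [0.0, 0.2, 0.2],
}
test_example = [1.0, 2.0]
predicted = predict_class(weights_per_class, test_example)
```

With these numbers the three values are 0.2, 1.0 and 0.6, so `predicted` is `"class2"`.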

Logistic regression • Logit function • Inverse logit function [Figure: plots of the two functions, from Wikipedia]
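
Since the plots are not preserved, a minimal sketch of the two functions may help. It uses the standard definitions, logit(p) = log(p / (1 − p)) and its inverse 1 / (1 + e^(−z)); the function names are chosen here for illustration.

```python
import math


def logit(p: float) -> float:
    """Map a probability strictly between 0 and 1 to a real number: log(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(z: float) -> float:
    """Map a real number back to a probability: 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow in exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)
```

The inverse logit squeezes any linear value into the range of a probability, which is why it is useful for classification.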

Logistic regression • P(1|a1, … , ak): for a certain class, the probability that a test example with attributes a1, … , ak is of that class • We set the wi’s to maximize the log-likelihood for each class.
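
The formulas are missing from the text. Assuming the standard logistic regression model, with k attributes, a constant attribute a_0 = 1, and x^{(i)} ∈ {0, 1} marking whether the i-th of n training instances belongs to the class (all of these are assumptions), the probability and the log-likelihood of the training data can be written as:

```latex
P(1 \mid a_1, \ldots, a_k) = \frac{1}{1 + \exp\!\left(-\sum_{j=0}^{k} w_j a_j\right)}

\sum_{i=1}^{n} \Big[ \big(1 - x^{(i)}\big) \log\!\big(1 - P(1 \mid a_1^{(i)}, \ldots, a_k^{(i)})\big) + x^{(i)} \log P(1 \mid a_1^{(i)}, \ldots, a_k^{(i)}) \Big]
```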

Logistic regression • Plain multiresponse linear regression doesn’t guarantee that the value of each linear function lies between 0 and 1. • With logistic regression, the value lies between 0 and 1, which satisfies one of the important conditions for being regarded as a probability. • However, the values for all the classes may not sum to 1.

Logistic regression • Pairwise classification: a classifier is built for every pair of classes, called the first class and the second class, and the meaning of P(1|a1, … , ak) changes somewhat. • P(1|a1, … , ak): the probability that a test example with attributes a1, … , ak is of the first class • P(0|a1, … , ak)=1-P(1|a1, … , ak): the probability that a test example with attributes a1, … , ak is of the second class • The regression uses only the training instances that belong to either the first or the second class of the pair.

Logistic regression • For each pair of classes, namely the first one and the second one, if P(1|a1, … , ak) is above 0.5, the resulting class is the first one; otherwise, it is the second one. • We count how many times each class wins a pairwise classification. The class that wins most often is the final class for the given test example.
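
The voting scheme can be sketched as follows. The pairwise models here are stand-in functions returning fixed probabilities, purely for illustration; in practice each would be a logistic regression trained on the two classes of its pair. The name `pairwise_vote` and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

# A pairwise model returns P(1 | a1..ak): the probability of the FIRST class of the pair.
PairModel = Callable[[Sequence[float]], float]


def pairwise_vote(classes: Sequence[str],
                  models: Dict[Tuple[str, str], PairModel],
                  attributes: Sequence[float]) -> str:
    """Run every pairwise classifier and return the class with the most wins."""
    wins = Counter({c: 0 for c in classes})
    for first, second in combinations(classes, 2):
        p_first = models[(first, second)](attributes)
        winner = first if p_first > 0.5 else second
        wins[winner] += 1
    # Ties are broken by the order of `classes` (an arbitrary choice for this sketch).
    return max(classes, key=lambda c: wins[c])


# Stand-in models with fixed outputs (hypothetical, for illustration only).
models = {
    ("h", "i"): lambda a: 0.2,  # i beats h
    ("h", "k"): lambda a: 0.7,  # h beats k
    ("i", "k"): lambda a: 0.9,  # i beats k
}
result = pairwise_vote(["h", "i", "k"], models, [1.0, 2.0])
```

Here class i wins two of the three contests, so `result` is `"i"`.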

Logistic regression [Figure: each pair of classes (for example h vs. i, i vs. j, i vs. k) is decided by testing whether P(1|a1, … , ak) ≥ 0.5; the class that wins most often, i in the figure, is the final result]

Perceptron • Sometimes, we only need to know which class a test example belongs to, without any probability information. • Simplifying assumptions • Only two classes are of interest. • Linearly separable: the data space can be separated by a single hyperplane. [Figure: a linearly separable data set and one that is not linearly separable]

Perceptron • Recall that this is about a pair of classes, namely the first class and the second one. • If the value of w0a0 + w1a1 + … + wkak for a test example is above 0, the example is of the first class. If it is below 0, the example is of the second class. • We will find wj's that satisfy this for the training instances. • In other words, we’re looking for a hyperplane: w0a0 + w1a1 + … + wkak = 0

Perceptron • Algorithm: PERCEPTRON LEARNING RULE
    Initialize all wj's to 0
    Until all the training instances are properly classified
        For each training instance A
            If A is wrongly classified by the current perceptron
                If A is actually of the first class, add A to the wj's
                If A is actually of the second class, subtract A from the wj's
• When a misclassified instance is found, the parameters of the perceptron hyperplane are modified so that the instance is more likely to be classified correctly in the future. • If A is added to the wj's: • (w0, w1, … , wk) ☞ (w0+a0, w1+a1, … , wk+ak) • w0a0+w1a1+ … +wkak ☞ w0a0+w1a1+ … +wkak+∑aj², so the value for A increases.
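
A minimal sketch of the learning rule follows. It assumes a constant bias attribute a0 = 1 and that the first class corresponds to a positive sum (the convention under which adding a misclassified first-class instance, which raises the sum by ∑aj², moves it toward the correct side). The `max_epochs` cap and all function names are additions of this sketch, so that the loop also stops on data that is not linearly separable.

```python
from typing import List, Sequence, Tuple

# (attributes a1..ak, label), where label 1 = first class and 2 = second class.
Instance = Tuple[Sequence[float], int]


def weighted_sum(weights: Sequence[float], attributes: Sequence[float]) -> float:
    """Compute w0*a0 + w1*a1 + ... + wk*ak with the bias attribute a0 = 1."""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], attributes))


def predict(weights: Sequence[float], attributes: Sequence[float]) -> int:
    """Predict the first class when the weighted sum is above 0, else the second."""
    return 1 if weighted_sum(weights, attributes) > 0 else 2


def train_perceptron(instances: List[Instance], max_epochs: int = 1000) -> List[float]:
    """Apply the perceptron learning rule until every instance is classified correctly."""
    k = len(instances[0][0])
    weights = [0.0] * (k + 1)  # initialize all wj's to 0
    for _ in range(max_epochs):
        mistakes = 0
        for attributes, label in instances:
            if predict(weights, attributes) != label:
                mistakes += 1
                sign = 1.0 if label == 1 else -1.0  # add for first class, subtract for second
                extended = [1.0, *attributes]       # prepend a0 = 1
                weights = [w + sign * a for w, a in zip(weights, extended)]
        if mistakes == 0:
            return weights
    raise RuntimeError("no separating hyperplane found within max_epochs")


# Made-up, linearly separable training data (illustration only).
training = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), 2), ((1.0, 0.0), 2)]
weights = train_perceptron(training)
```

On linearly separable data the loop terminates with weights that classify every training instance correctly; otherwise the sketch raises an error instead of looping forever.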

Perceptron • Perceptron: the hyperplane found in this way • The perceptron is the ancestor of the neural network. • An instance is fed into the perceptron. • The attributes of the instance activate the input layer. • The attributes are linearly combined with the weights and sent to the output node. • The output node signals 1 if the received value is above 0, and -1 otherwise. [Figure: input nodes a0, a1, …, aj, …, ak connected to the output node through weights w0, w1, …, wj, …, wk]

Instance-based learning • Instance-based learning • kD-tree • Ball tree

Basic understanding • Find the training instance which is the closest to the test example and predict the class from it. • Distance • Euclidean distance • Alternatives • Normalizing attributes
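
A minimal sketch of nearest-neighbor classification with Euclidean distance and normalized attributes. Min-max scaling to [0, 1] is assumed as the normalization (the slide does not say which one is meant), and the data and function names are made up for illustration.

```python
import math
from typing import List, Sequence, Tuple


def normalize(instances: List[Sequence[float]]) -> Tuple[List[List[float]], List[float], List[float]]:
    """Min-max scale every attribute to [0, 1]; also return the mins and ranges used."""
    mins = [min(col) for col in zip(*instances)]
    maxs = [max(col) for col in zip(*instances)]
    ranges = [(hi - lo) or 1.0 for lo, hi in zip(mins, maxs)]  # avoid division by zero
    scaled = [[(v - lo) / r for v, lo, r in zip(row, mins, ranges)] for row in instances]
    return scaled, mins, ranges


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_neighbor_class(train: List[Sequence[float]], labels: List[str],
                           test: Sequence[float]) -> str:
    """Predict the class of the training instance closest to the test example."""
    scaled, mins, ranges = normalize(train)
    scaled_test = [(v - lo) / r for v, lo, r in zip(test, mins, ranges)]
    best = min(range(len(train)), key=lambda i: euclidean(scaled[i], scaled_test))
    return labels[best]


# Made-up data: the second attribute has a much larger scale than the first.
train = [[1.0, 1000.0], [2.0, 5000.0], [9.0, 1200.0]]
labels = ["a", "a", "b"]
predicted = nearest_neighbor_class(train, labels, [8.0, 4000.0])
```

Without normalization the large-scale second attribute would dominate the distance and the nearest instance would be the second one (class a); after scaling, the third instance is closest, so `predicted` is `"b"`.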

kD-tree • k: the number of attributes • Assume that k=2. • Training instances: (7, 4), (3, 8), (6, 7), (2, 2) [Figure: the kD-tree built from these four instances and the corresponding partition of the plane]
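
A minimal sketch of a kD-tree for the four instances on this slide. It assumes one common construction (split at the median, cycling through the attributes by depth), which may differ from the tree drawn in the figure; the names `KDNode`, `build_kdtree` and `nearest` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class KDNode:
    point: Point                      # the instance stored at this node
    axis: int                         # index of the attribute used for the split
    left: Optional["KDNode"] = None   # instances with a smaller or equal value on `axis`
    right: Optional["KDNode"] = None  # instances with a larger or equal value on `axis`


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Split on the median of one attribute, cycling through attributes by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def squared_distance(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(node: Optional[KDNode], target: Sequence[float],
            best: Optional[Point] = None) -> Optional[Point]:
    """Return the stored instance closest to `target`, pruning subtrees that cannot help."""
    if node is None:
        return best
    if best is None or squared_distance(node.point, target) < squared_distance(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Visit the far side only if the splitting plane is closer than the best match so far.
    if diff ** 2 < squared_distance(best, target):
        best = nearest(far, target, best)
    return best


points = [(7, 4), (3, 8), (6, 7), (2, 2)]
tree = build_kdtree(points)
closest = nearest(tree, (6, 5))
```

For the query (6, 5) the closest instance is (7, 4). The pruning test is what makes the tree faster than comparing the query against every training instance.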

kD-tree

Ball tree [Figure: a ball tree whose nodes carry the numeric labels 8, 5, 5, 2, 2, 3, 2]

Clustering • Clustering • Iterative distance-based clustering • Faster distance calculation

Clustering and types of clustering • No class to be predicted • Instances are to be divided into groups. • Types of clustering • Exclusive • Overlapping • Probabilistic • Hierarchical

Iterative distance-based clustering • Also called k-means • Step 1: Select k points randomly as the centers of k clusters. • Step 2: Associate each instance with the center closest to it. • Step 3: After all the instances are associated, compute the centroid of each cluster from the instances of that cluster. This centroid becomes the new center of the cluster. • Step 4: With the new centers, repeat Steps 2 and 3 until the assignments no longer change.
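
The four steps can be sketched as follows. Several details are assumptions of this sketch rather than part of the slide: the random centers are drawn from the instances themselves, an optional `initial_centers` argument makes the example reproducible, an empty cluster keeps its old center, and `max_iterations` caps the loop.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


def closest_center(point: Sequence[float], centers: Sequence[Point]) -> int:
    """Index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def k_means(points: List[Point], k: int,
            initial_centers: Optional[List[Point]] = None,
            max_iterations: int = 100, seed: int = 0) -> Tuple[List[Point], List[int]]:
    rng = random.Random(seed)
    # Step 1: choose k starting centers (randomly, unless given explicitly).
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    assignment: List[int] = []
    for _ in range(max_iterations):
        # Step 2: associate each instance with its closest center.
        new_assignment = [closest_center(p, centers) for p in points]
        if new_assignment == assignment:
            break  # Step 4: stop once the assignments no longer change
        assignment = new_assignment
        # Step 3: the centroid of each cluster becomes its new center.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Made-up data with two obvious groups (illustration only).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = k_means(points, k=2, initial_centers=[(1.0, 1.0), (8.0, 8.0)])
```

With these starting centers the first three points end up in cluster 0 and the last three in cluster 1. Different starting centers can lead to different final clusters.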

Iterative distance-based clustering • The best solution (k=2) [Figure not preserved] • If the randomly selected centers are as follows, [Figure not preserved]

Faster distance calculation • For each node, keep the sum of all instances and the number of instances belonging to the ball that the node represents. • Traversing the tree from top to bottom, find the closest cluster center for each instance. • If the entire ball of a node belongs to a single cluster center, we need not traverse its child nodes: the sum and the count stored in the node are enough to update that cluster's centroid.
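
A minimal sketch of the idea, under several assumptions not stated on the slide: each leaf holds a single instance (radius 0), there are at least two cluster centers, and a ball is considered to belong wholly to its nearest center when a triangle-inequality test holds (distance to the nearest center plus the radius is smaller than the distance to the second-nearest center minus the radius). All names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BallNode:
    center: Tuple[float, ...]      # center of the ball
    radius: float                  # radius enclosing every instance below this node
    vector_sum: Tuple[float, ...]  # sum of all instances in the ball
    count: int                     # number of instances in the ball
    children: Tuple["BallNode", ...] = ()


def leaf(point: Tuple[float, ...]) -> BallNode:
    return BallNode(center=point, radius=0.0, vector_sum=point, count=1)


def parent(center: Tuple[float, ...], radius: float, children: Sequence[BallNode]) -> BallNode:
    total = tuple(sum(col) for col in zip(*(c.vector_sum for c in children)))
    return BallNode(center, radius, total, sum(c.count for c in children), tuple(children))


def accumulate(node: BallNode, centers: Sequence[Tuple[float, ...]],
               sums: List[List[float]], counts: List[int]) -> None:
    """Add the instances under `node` to the running sum and count of their cluster."""
    dists = sorted((math.dist(node.center, c), i) for i, c in enumerate(centers))
    (d_best, best), (d_second, _) = dists[0], dists[1]
    # Sufficient test: every point of the ball is closer to `best` than to any other center.
    whole_ball = d_best + node.radius < d_second - node.radius
    if whole_ball or not node.children:
        # Use the stored sum and count instead of visiting the child nodes.
        sums[best] = [s + v for s, v in zip(sums[best], node.vector_sum)]
        counts[best] += node.count
    else:
        for child in node.children:
            accumulate(child, centers, sums, counts)


# A tiny hand-built tree over four made-up instances (illustration only).
ball_a = parent((1.5, 1.0), 0.5, [leaf((1.0, 1.0)), leaf((2.0, 1.0))])
ball_b = parent((8.5, 8.0), 0.5, [leaf((8.0, 8.0)), leaf((9.0, 8.0))])
root = parent((5.0, 4.5), 5.4, [ball_a, ball_b])

centers = [(0.0, 0.0), (10.0, 10.0)]
sums = [[0.0, 0.0] for _ in centers]
counts = [0] * len(centers)
accumulate(root, centers, sums, counts)
new_centers = [tuple(s / n for s in total) for total, n in zip(sums, counts)]
```

In this example the root ball straddles both centers, so its children are visited; each child ball lies wholly with one center, so its leaves are never touched and the new centroids come directly from the stored sums and counts.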

Faster distance calculation

Conclusion • Any questions?
