
Classification, Part 2


Presentation Transcript


  1. Classification, Part 2 BMTRY 726 4/11/14

  2. The 3 Approaches (1) Discriminant functions: -Find a function f(x) that maps each point x directly to a class label -NOTE: in this case, probabilities play no role (2) Linear (Quadratic) Discriminant Analysis: -Solve the inference problem using the class-conditional densities P(x | C = k) and the prior class probabilities P(C = k) -Use Bayes' theorem to find the posterior class probabilities -Use the posteriors to make the optimal decision (3) Logistic Regression: -Solve the inference problem by determining the posterior probabilities P(C = k | x) directly -Use the posteriors to make the optimal decision

  3. Logistic Regression Probably the most commonly used linear classifier (certainly one we all know). If the outcome is binary, we can describe the relationship between our features x and the probability of our outcome through a linear model for the log-odds: log[ P(C = 1 | x) / P(C = 0 | x) ] = β0 + β'x Using this, we define the posterior probabilities of the two classes as P(C = 1 | x) = exp(β0 + β'x) / [1 + exp(β0 + β'x)] and P(C = 0 | x) = 1 / [1 + exp(β0 + β'x)]
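
For concreteness, here is a minimal sketch of those two posterior probabilities in Python with NumPy; the coefficient values are invented for illustration, not taken from the slides:

```python
# Minimal sketch of the binary logistic posterior, assuming hypothetical
# fitted coefficients b0 (intercept) and b (slopes).
import numpy as np

def posterior_class1(x, b0, b):
    """P(C = 1 | x) under a logistic model: inverse-logit of b0 + b'x."""
    eta = b0 + x @ b                     # linear predictor
    return 1.0 / (1.0 + np.exp(-eta))

x = np.array([2.0, -1.0])                # one feature vector
b0, b = 0.5, np.array([1.2, -0.7])       # hypothetical coefficients
p1 = posterior_class1(x, b0, b)
print(p1, 1.0 - p1)                      # the two posteriors sum to 1
```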

  4. Logistic Regression for K > 2 But what happens if we have more than two classes? Examples: (1) An ordinal outcome (i.e. a Likert scale) describing attitudes towards the safety of epidurals during labor among mothers-to-be -Features may include things like age, ethnicity, level of education, socio-economic status, parity, etc. (2) The goal is to distinguish between several different types of lung tumors (both malignant and non-malignant): -small cell lung cancer -non-small cell lung cancer -granulomatosis -sarcoidosis In this case features may be pixels from a CT scan image

  5. Logistic Regression for K > 2 In the first example, the outcome is ordinal. One possible option is to fit a cumulative logit model: logit[ P(C ≤ j | X = x) ] = log[ P(C ≤ j | x) / P(C > j | x) ] = β0j + β'x The model for P(C ≤ j | X = x) is just a logit model for a binary response; in this case, the response takes value 1 if y ≤ j and takes value 0 if y ≥ j + 1

  6. Logistic Regression for K > 2 It is of greater interest, however, to model all K − 1 cumulative logits in a single model. This leads us to the proportional odds model: logit[ P(C ≤ j | X = x) ] = β0j + β'x, for j = 1, …, K − 1 Notice the intercept β0j is allowed to vary as j increases; however, the other model parameters remain constant. Does this make sense given the name of the model?
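
As an illustration, a proportional odds model can be fit in Python with statsmodels' OrderedModel (assuming a recent statsmodels, >= 0.12, where this class exists); the simulated data and variable names below are hypothetical:

```python
# Sketch: fit a proportional odds (cumulative logit) model on simulated data.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({"age": rng.normal(30, 5, n), "parity": rng.poisson(1, n)})
# A 4-level ordinal outcome loosely driven by the features
latent = 0.05 * X["age"] - 0.3 * X["parity"] + rng.logistic(size=n)
y = np.asarray(pd.cut(latent, bins=[-np.inf, 0.5, 1.5, 2.5, np.inf],
                      labels=False))

mod = OrderedModel(y, X, distr="logit")   # cumulative logits
res = mod.fit(method="bfgs", disp=False)
# One shared slope per feature, plus K - 1 threshold (intercept) terms
print(res.params)
```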

  7. Logistic Regression for K > 2 Assumptions for the proportional odds model: -Intercepts are increasing with increasing j -Models share the same rate of increase with increasing j -Log cumulative odds ratios are proportional to the distance between x1 and x2, and the proportionality constant is the same for each logit So for j < k, the curve for P(C ≤ k | X = x) is equivalent to the curve for P(C ≤ j | X = x) shifted (β0k − β0j)/β units in the x direction. The odds ratios used to interpret the model are cumulative odds ratios.

  8. Logistic Regression for K > 2 In the second example, our class categories are not ordinal. We can, however, fit a multinomial logit model, which includes K − 1 logit models, each comparing one class to a reference class (say, the Kth): log[ P(C = j | X = x) / P(C = K | X = x) ] = β0j + βj'x, for j = 1, …, K − 1

  9. Logistic Regression for K > 2 We can estimate the posterior probabilities of each of our K classes from the multinomial logit model: P(C = j | X = x) = exp(β0j + βj'x) / [1 + Σl exp(β0l + βl'x)] for j = 1, …, K − 1, and P(C = K | X = x) = 1 / [1 + Σl exp(β0l + βl'x)], where the sums run over l = 1, …, K − 1 When K = 2, this reduces to a single linear function (i.e. a single logistic regression)
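
A small sketch of these posterior formulas, with hypothetical coefficients and the Kth class as the reference (its coefficients fixed at zero):

```python
# Multinomial-logit posteriors from K - 1 linear predictors.
import numpy as np

def multinomial_posteriors(x, B0, B):
    """B0: (K-1,) intercepts; B: (K-1, p) slopes for classes 1..K-1."""
    eta = B0 + B @ x                      # K - 1 linear predictors
    num = np.exp(np.append(eta, 0.0))     # reference class contributes exp(0)
    return num / num.sum()                # posteriors sum to 1

x = np.array([1.0, 2.0])
B0 = np.array([0.2, -0.5])                # K = 3 classes -> 2 logits
B = np.array([[0.8, -0.3], [0.1, 0.4]])
print(multinomial_posteriors(x, B0, B))
```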

  10. Logistic Regression for K > 2 Though we've referenced the last category, since the classes have no natural ordering we could reference any category we choose Unlike the cumulative logit and proportional odds models, all parameters (intercepts and slopes) vary across the K − 1 logits in these models

  11. Logistic Regression for K > 2 As in the case of the ordinal models, it makes more sense to fit these models simultaneously. As a result, there are some assumptions and constraints we must impose: (1) The different classes in the data represent a multinomial distribution -Constraint: all posterior probabilities must sum to 1 -In order to achieve this, all models are fit simultaneously (2) Independence of Irrelevant Alternatives: -the relative odds between any two outcomes are independent of the number and nature of the other outcomes being simultaneously considered

  12. Logistic Regression vs. LDA Both LDA and logistic regression model the log-posterior odds between classes k and K as linear functions of x LDA: log[ P(C = k | X = x) / P(C = K | X = x) ] = log(πk/πK) − (1/2)(μk + μK)'Σ⁻¹(μk − μK) + x'Σ⁻¹(μk − μK) = α0k + αk'x Logistic regression: log[ P(C = k | X = x) / P(C = K | X = x) ] = β0k + βk'x

  13. Logistic Regression vs. LDA The posterior conditional density of class k for both LDA and logistic regression can be written in the linear logit form P(C = k | X = x) = exp(β0k + βk'x) / [1 + Σl exp(β0l + βl'x)] The joint density for both can be written in the same way: P(X, C = k) = P(X) P(C = k | X) Both methods give linear decision boundaries that classify observations. So what's the difference?

  14. Logistic Regression vs. LDA The difference lies in how the linear coefficients are estimated LDA: parameters are fit by maximizing the full log-likelihood based on the joint density P(X, C = k) = φ(X; μk, Σ) πk -recall here φ is the Gaussian density function Logistic regression: in this case the marginal density P(X) is left arbitrary, and the parameters are estimated by maximizing the conditional multinomial likelihood -although ignored, we can think of this marginal density as being estimated in a nonparametric, unrestricted fashion
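
A hedged illustration of this point: fit both classifiers to the same simulated Gaussian data (where the LDA assumptions hold) and compare the fitted linear coefficients; the means, covariance, and sample sizes are arbitrary choices:

```python
# When the Gaussian model is correct, LDA and logistic regression should
# recover similar linear coefficients, with LDA somewhat more efficient.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
X0 = rng.multivariate_normal([0, 0], np.eye(2), n)   # class 0
X1 = rng.multivariate_normal([1, 1], np.eye(2), n)   # class 1, shifted mean
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

lda = LinearDiscriminantAnalysis().fit(X, y)
lr = LogisticRegression().fit(X, y)
print("LDA coefficients:", lda.coef_)
print("LR  coefficients:", lr.coef_)     # should be close to the LDA values
```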

  15. Logistic Regression vs. LDA So this means… (1) LR makes fewer assumptions about the distribution of the data (a more general approach) (2) But LR "ignores" the marginal distribution P(X) -Including the additional distributional assumptions provides more information about the parameters, allowing for more efficient estimation (i.e. lower variance) -If the Gaussian assumption is correct, LR can lose up to about 30% efficiency -OR we need about 30% more data for the conditional likelihood to do as well as the full likelihood

  16. Logistic Regression vs. LDA (3) If observations are far from the decision boundary (i.e. probably NOT Gaussian), they still influence the estimation of the common covariance matrix -i.e. LDA is not robust to outliers (4) If the data in a two-class model can be perfectly separated by a hyperplane, the LR parameter estimates are undefined (they diverge to infinity), but the LDA coefficients are still well defined (the marginal likelihood avoids this degeneracy; see the sketch below) -e.g. one particular feature has all of its density in one class There are advantages/disadvantages to both methods: LR is thought of as more robust because it makes fewer assumptions, but in practice they tend to perform similarly
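
Point (4) is easy to demonstrate numerically: on perfectly separated data, the logistic regression coefficient grows without bound as regularization is relaxed, while the LDA coefficient stays fixed. A sketch using scikit-learn, with a tiny invented dataset:

```python
# Perfect separation: one feature has all of its density in one class.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

X = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])          # x < 0 -> class 0, x > 0 -> class 1

for C in (1.0, 1e4, 1e8):                 # weaker and weaker regularization
    lr = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C = {C:.0e}  LR coefficient: {lr.coef_[0, 0]:.2f}")  # keeps growing

lda = LinearDiscriminantAnalysis().fit(X, y)
print("LDA coefficient:", lda.coef_[0, 0])  # stays finite and stable
```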

  17. Problems with Both Methods There are a few problems with both methods… • As with regression, we have a hard time including a large number of covariates especially if n is small • A linear boundary may not really be an appropriate choice for separating our classes So what can we do?

  18. LDA and logistic regression work well if the classes are linearly separable… But what if they aren't? Linear boundaries may be almost useless

  19. Nonlinear test statistics The optimal decision boundary may not be a hyperplane → nonlinear test statistic [Figure: a curved boundary separating the H1 region from the H0 (accept) region] Multivariate statistical methods are a big industry: Neural Networks, Support Vector Machines, kernel density methods

  20. Artificial Neural Networks (ANNs) Central idea: extract linear combinations of the inputs as derived features, and then model the outcome (classes) as a nonlinear function of these features Huh!? Really they are just nonlinear statistical models, built from pieces that are already familiar to us

  21. Biologic Neurons The idea for neural networks came from biology, more specifically the brain… Input signals come from the axons of other neurons, which connect to dendrites (input terminals) at the synapses If a sufficient excitatory signal is received, the neuron fires and sends an output signal along its axon The neuron fires when a threshold excitation is reached

  22. Brains versus Computers: Some numbers -Approximately 10 billion neurons in the human cortex, compared with tens of thousands of processors in the most powerful parallel computers -Each biological neuron is connected to several thousand other neurons, similar to the connectivity in powerful parallel computers -A lack of processing units can be compensated for by speed: the typical operating speed of a biological neuron is measured in milliseconds, while a silicon chip can operate in nanoseconds -The human brain is extremely energy efficient, using approximately 10⁻¹⁶ joules per operation per second, whereas the best computers today use around 10⁻⁶ joules per operation per second -Brains have been evolving for tens of millions of years; computers have been evolving for tens of decades.

  23. ANNs Non-linear (mathematical) models of an artificial neuron [Diagram: input signals x1, …, xp enter with synaptic weights w1, …, wp; a summation node Σ combines them, and an activation/threshold function g produces the output signal O]

  24. ANNs A neural network is a 2-stage classification (or regression) model It can be represented as a network diagram -for classification the output units represent the K classes -the kth unit models the probability of being in the kth class [Diagram: inputs X1, …, Xp feed derived features Z1, …, ZM, which feed outputs Y1, …, YK]

  25. ANNs The Zm are derived features created from linear combinations of the X's: Zm = s(α0m + αm'X), m = 1, …, M The Y's are then modeled as a function of linear combinations of the Zm: Tk = β0k + βk'Z, k = 1, …, K Here s is called the activation function [Diagram: the same network as before, with inputs X, derived features Z, and outputs Y]
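
A minimal sketch of this two-stage computation (derived features Z from the inputs, then class probabilities from linear combinations of Z), with random placeholder weights rather than fitted ones:

```python
# Forward pass of a single-hidden-layer network with a sigmoid activation
# and a softmax output. All weights here are random placeholders; a real
# network would learn them from data.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

rng = np.random.default_rng(2)
p, M, K = 4, 3, 2                        # inputs, hidden units, classes
alpha0, alpha = rng.normal(size=M), rng.normal(size=(M, p))
beta0, beta = rng.normal(size=K), rng.normal(size=(K, M))

x = rng.normal(size=p)                   # one input vector
Z = sigmoid(alpha0 + alpha @ x)          # derived features Z_m = s(a0m + am'x)
T = beta0 + beta @ Z                     # linear combinations of the Z's
probs = np.exp(T) / np.exp(T).sum()      # softmax -> class probabilities
print(probs)
```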

  26. ANNs The activation function, s, could be any function we choose In practice, there are only a few that are frequently used
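
For example, a short sketch of activation functions relevant here: the sigmoid and the hard-threshold sign function are both discussed on the next slide, and tanh is another common smooth choice (an addition of mine, not from the slides):

```python
import numpy as np

def sigmoid(v):        # smooth, output in (0, 1); ties ANNs to logistic regression
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-3.0, 3.0, 7)
print(sigmoid(v))      # sigmoid
print(np.tanh(v))      # tanh: smooth, output in (-1, 1)
print(np.sign(v))      # hard threshold used by the original perceptron
```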

  27. ANNs ANNs are based on simpler classifiers called perceptrons The original single-layer perceptron used the hard-threshold sign function, but this lacks flexibility, making separation of classes difficult It was later adapted to use the sigmoid function -Note this should be familiar (think back to logistic regression) ANNs are an adaptation of the original single-layer perceptron that includes multiple layers (hence they have also been referred to as multi-layer perceptrons) Use of the sigmoid function also links them with multinomial logistic regression

  28. Next Class… How do you fit an ANN? What are the issues with ANNs? Software
