Review of Lecture Two. Linear Regression, Cost Function, Gradient Descent, Normal Equation, (X^T X)^{-1}, Probabilistic Interpretation, Maximum Likelihood Estimation vs. Linear Regression, Gaussian Distribution of the Data, Generative vs. Discriminative.
Review of Lecture Two • Linear Regression • Cost Function • Gradient Descent • Normal Equation • (X^T X)^{-1} • Probabilistic Interpretation • Maximum Likelihood Estimation vs. Linear Regression • Gaussian Distribution of the Data • Generative vs. Discriminative
General Linear Regression Methods: Important Implications • Recall that θ, a column vector (one entry for the intercept θ0 plus n parameters), can be obtained from the normal equation: $\theta = (X^T X)^{-1} X^T y$ • When the X variables are linearly independent ($X^T X$ has full rank), there is a unique solution to the normal equations; • Inverting $X^T X$ depends on the existence of a matrix $(X^T X)^{-1}$ such that $(X^T X)^{-1} (X^T X) = I$, that is, on finding the matrix equivalent of a numerical reciprocal; • Only models with a single output variable can be trained.
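The normal equation above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`transpose`, `matmul`, `solve`, `normal_equation`) and the toy data are hypothetical, and the system $X^T X \theta = X^T y$ is solved by elimination rather than by forming the inverse explicitly. When the columns of X are linearly dependent, the pivot vanishes and the sketch reports that no unique solution exists.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def normal_equation(X, y):
    """Return theta solving (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(xi * yi for xi, yi in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Hypothetical toy data generated from y = 1 + 2x; the first column is the intercept term.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = normal_equation(X, y)
print(theta)  # approximately [1.0, 2.0]
```

Solving the linear system instead of inverting $X^T X$ is a common numerical choice; it gives the same θ whenever the inverse exists.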
Maximum Likelihood Estimation • Assume the data are i.i.d. (independent and identically distributed) • Likelihood L(θ) = the probability of y given x, parameterized by θ: $L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)$ • What is Maximum Likelihood Estimation (MLE)? • Choose the parameters θ that maximize L(θ), so as to make the training data set as probable as possible.
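The Connection Between MLE and OLS • Choose parameters θ to maximize the data likelihood: $L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)$ • With Gaussian-distributed data, this is equivalent to minimizing $\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$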
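The Equivalence of MLE and OLS • The quantity that MLE minimizes under the Gaussian assumption, $\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2$, = J(θ), the least-squares cost function!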
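The step from likelihood to least squares can be written out as follows. This is a sketch under the usual assumption (not spelled out on the slide) that $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$ with i.i.d. noise $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$; the symbols $m$, $\sigma$, and $\ell$ are introduced here for illustration.

```latex
\begin{aligned}
p(y^{(i)} \mid x^{(i)}; \theta)
  &= \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\!\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right) \\
\ell(\theta) = \log L(\theta)
  &= \sum_{i=1}^{m} \log p(y^{(i)} \mid x^{(i)}; \theta) \\
  &= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
     - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \\
  &= \text{const} - \frac{1}{\sigma^2}\, J(\theta)
\end{aligned}
```

Since the first term and $\sigma^2$ do not depend on θ, maximizing $\ell(\theta)$ is the same as minimizing $J(\theta)$.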
Today’s Content • Logistic Regression • Discrete Output • Connection to MLE • The Exponential Family • Bernoulli • Gaussian • Generalized Linear Models (GLMs)
Sigmoid (Logistic) Function • $g(z) = \frac{1}{1 + e^{-z}}$ • Other functions that smoothly increase from 0 to 1 can also be found, but for a couple of good reasons (which we will see when we discuss Generalized Linear Models) the logistic function is a natural choice.
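One convenient property of the logistic function is the simple form of its derivative, which is used when deriving update rules for logistic regression. A short derivation, assuming the standard form $g(z) = 1/(1 + e^{-z})$:

```latex
\begin{aligned}
g'(z) &= \frac{d}{dz}\, \frac{1}{1 + e^{-z}}
       = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} \\
      &= \frac{1}{1 + e^{-z}} \cdot \left( 1 - \frac{1}{1 + e^{-z}} \right)
       = g(z)\,\bigl( 1 - g(z) \bigr)
\end{aligned}
```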
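Gradient Ascent for MLE of the Logistic Function • Recall the log-likelihood: $\ell(\theta) = \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left( 1 - h_\theta(x^{(i)}) \right)$, with $h_\theta(x) = g(\theta^T x)$ • Let's work with just one training example (x, y) and derive the gradient ascent rule: • Given $g'(z) = g(z)(1 - g(z))$, we obtain $\theta_j := \theta_j + \alpha \left( y - h_\theta(x) \right) x_j$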
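A minimal sketch of stochastic gradient ascent for logistic regression, assuming the standard single-example update $\theta_j := \theta_j + \alpha (y - h_\theta(x)) x_j$ with $h_\theta(x) = g(\theta^T x)$. The function names, the learning rate `alpha`, the number of `epochs`, and the toy data are hypothetical choices for illustration only.

```python
import math


def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(theta, x):
    """h_theta(x) = g(theta^T x), interpreted as p(y = 1 | x; theta)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def gradient_ascent(X, y, alpha=0.1, epochs=1000):
    """Update theta one training example at a time (stochastic gradient ascent)."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - predict(theta, x_i)
            theta = [t + alpha * error * xj for t, xj in zip(theta, x_i)]
    return theta


# Hypothetical toy data: the first column is the intercept term, the second is the feature.
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 3.0], [1.0, 3.5], [1.0, 4.0]]
y = [0, 0, 0, 1, 1, 1]
theta = gradient_ascent(X, y)
labels = [1 if predict(theta, x_i) > 0.5 else 0 for x_i in X]
print(labels)
```

Thresholding the predicted probability at 0.5 turns the model into a classifier; the threshold is a design choice rather than part of the update rule.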
Discriminative vs. Generative Algorithms • Discriminative Learning • Either learn p(y|x) directly, or learn a hypothesis h_θ(x) ∈ {0, 1} that, given x, outputs the label directly; • Logistic regression is an example of a discriminative learning algorithm; • In contrast, Generative Learning • Model the probability distribution of x conditioned on each of the classes, p(x|y=1) and p(x|y=0), respectively; • Also model the class priors p(y=1) and p(y=0) (the weights); • Use Bayes' rule to compare p(x|y), weighted by the prior, for y=1 and y=0, i.e., to see which class is more likely;
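For reference, the Bayes' rule comparison used by a generative classifier can be written as below. This is the standard identity for two classes; the symbol $\hat{y}$ for the predicted label is introduced here for illustration.

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)},
\qquad
p(x) = p(x \mid y=1)\, p(y=1) + p(x \mid y=0)\, p(y=0)

\hat{y} = \arg\max_{y \in \{0,1\}} p(y \mid x)
        = \arg\max_{y \in \{0,1\}} p(x \mid y)\, p(y)
```

The denominator p(x) is the same for both classes, so it can be dropped when only the more likely class is needed.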
Question for P(y | x; θ) • We learn θ in order to maximize P(y | x; θ) • When we do so: • If y ~ Gaussian, we use least squares regression • If y ∈ {0,1} ~ Bernoulli, we use logistic regression • Why? Are there any natural reasons?
Is There a Probabilistic, Linear, and General (PLG) Learning Framework? • Example: a web-site visiting problem, for a PLG solution
Generalized Linear Models: The Exponential Family • $p(y; \eta) = b(y) \exp\left( \eta^T T(y) - a(\eta) \right)$ • η: natural (distribution) parameter • T(y): sufficient statistic, often T(y) = y • a(η): normalization term • A fixed choice of T, a, and b defines a set of distributions parameterized by η; as we vary η we get different distributions within this family (affecting the mean). • Bernoulli, Gaussian, and other distributions are examples of exponential family distributions. • This is a way of unifying various statistical models, such as linear regression, logistic regression, and Poisson regression, into one framework.
Examples of distributions in the exponential family • Gaussian • Bernoulli • Binomial • Multinomial • Chi-square • Exponential • Poisson • Beta • …
Bernoulli • y | x; θ ~ ExpFamily(η), where we choose a, b, and T in the specific form that makes the distribution Bernoulli. • For any fixed x and θ, we want our algorithm to output $h_\theta(x) = E[y \mid x; \theta] = p(y = 1 \mid x; \theta) = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$ • Recall that the logistic function has the form $1 / (1 + e^{-z})$; this is why we choose the logistic form for learning when the data follow a Bernoulli distribution.
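The claim that the Bernoulli distribution belongs to the exponential family, and that the logistic function follows from it, can be checked with a short derivation. It assumes the exponential family form $p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta))$ and a Bernoulli distribution with mean φ.

```latex
\begin{aligned}
p(y; \phi) &= \phi^{y} (1 - \phi)^{1 - y} \\
           &= \exp\bigl( y \log \phi + (1 - y) \log(1 - \phi) \bigr) \\
           &= \exp\!\left( y \log \frac{\phi}{1 - \phi} + \log(1 - \phi) \right)
\end{aligned}

\eta = \log \frac{\phi}{1 - \phi}, \quad
T(y) = y, \quad
a(\eta) = -\log(1 - \phi) = \log\left( 1 + e^{\eta} \right), \quad
b(y) = 1

\eta = \log \frac{\phi}{1 - \phi}
\;\Longrightarrow\;
\phi = \frac{1}{1 + e^{-\eta}}
```

Inverting the expression for η gives exactly the logistic function of the natural parameter.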
To Build a GLM • p(y | x; θ), where y, given x and θ, follows an exponential family distribution with natural parameter η • Given x, our goal is to output E[T(y) | x] • i.e., we want h(x) = E[T(y) | x] (note that in most cases T(y) = y) • Think about the relationship between the input x and the natural parameter η, which we use to define the desired distribution: • η = θ^T x (linear, as a design choice); η is a number or a vector
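The recipe above can be sketched as one generic fitting routine in which only the response function changes between families. This is an illustrative sketch, not a reference implementation: the names `RESPONSE`, `glm_predict`, and `glm_fit`, the step size, and the visit counts are hypothetical, and it assumes the canonical response function for each family, for which the per-example update has the same form $\theta_j := \theta_j + \alpha (y - h_\theta(x)) x_j$.

```python
import math

# Canonical response functions mapping eta = theta^T x to E[y | x; theta].
RESPONSE = {
    "gaussian": lambda eta: eta,                             # linear regression
    "bernoulli": lambda eta: 1.0 / (1.0 + math.exp(-eta)),   # logistic regression
    "poisson": lambda eta: math.exp(eta),                    # Poisson regression
}


def glm_predict(theta, x, family):
    """h_theta(x) = E[y | x; theta] for the chosen family."""
    eta = sum(t * xj for t, xj in zip(theta, x))
    return RESPONSE[family](eta)


def glm_fit(X, y, family, alpha=0.01, epochs=2000):
    """Stochastic gradient ascent on the log-likelihood, one example at a time."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            error = y_i - glm_predict(theta, x_i, family)
            theta = [t + alpha * error * xj for t, xj in zip(theta, x_i)]
    return theta


# Hypothetical web-site visit counts (y) against one feature (second column of X).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
visits = [2, 2, 3, 4, 5]
theta = glm_fit(X, visits, "poisson")
expected = [glm_predict(theta, x_i, "poisson") for x_i in X]
print(expected)
```

Switching the `family` argument to `"gaussian"` or `"bernoulli"` reuses the same loop for least squares or logistic regression, which is the unification the GLM framework is meant to provide.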
More precisely… A GLM is a flexible generalization of ordinary least squares regression that relates the random component (the distribution of the response variable) to the systematic component (the linear predictor) through a function called the link function.
Extensions • The standard GLM assumes that the observations are uncorrelated (i.i.d.). Models that deal with correlated data are extensions of GLMs. • Generalized estimating equations: use population-averaged effects. • Generalized linear mixed models: a type of multilevel model (mixed model), an extension of logistic regression. • Hierarchical generalized linear models: similar to generalized linear mixed models, apart from two distinctions: • The random effects can have any distribution in the exponential family, whereas current generalized linear mixed models nearly always have normal random effects; • They are computationally less complex than generalized linear mixed models.
Summary • A GLM is a flexible generalization of ordinary least squares regression. • A GLM generalizes linear regression by allowing the linear model to be related to the output variable via a link function, and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. • GLMs unify various other statistical models, including linear, logistic, …, and Poisson regression, under one framework. • This allows us to develop a general algorithm for maximum likelihood estimation in all these models. • The framework extends naturally to encompass many other models as well. • In a GLM, the output is thus assumed to be generated from a particular distribution in the exponential family.