
Applying Finite Mixture Models


Presentation Transcript


  1. Applying Finite Mixture Models Presenter: Geoff McLachlan

  2. Topics • Introduction • Application of EM algorithm • Examples of normal mixtures • Robust mixture modeling • Number of components in a mixture model • Number of nonnormal components • Mixture models for failure-time data • Mixture software

  3. 1.1 Flexible Method of Modeling • Astronomy • Biology • Genetics • Medicine • Psychiatry • Economics • Engineering • Marketing

  4. 1.2 Initial Approach to Mixture Analysis • Classic paper of Pearson (1894)

  5. Figure 1: Plot of forehead to body length data on 1000 crabs and of the fitted one-component (dashed line) and two-component (solid line) normal mixture models.

  6. 1.3 Basic Definition We let Y1,…,Yn denote a random sample of size n, where Yj is a p-dimensional random vector with probability density function f(yj) = π1 f1(yj) + … + πg fg(yj), (1) where the fi(yj) are densities and the πi are nonnegative quantities that sum to one.

  7. 1.4 Interpretation of Mixture Models An obvious way of generating a random vector Yj with the g-component mixture density f(yj), given by (1), is as follows. Let Zj be a categorical random variable taking on the values 1,…,g with probabilities π1,…,πg, respectively, and suppose that the conditional density of Yj given Zj = i is fi(yj) (i = 1,…,g). Then the unconditional density of Yj (that is, its marginal density) is given by f(yj).
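This two-stage mechanism translates directly into a simulation recipe. The sketch below is a minimal illustration (not part of the original slides; the component parameters are arbitrary choices): it draws Zj from a categorical distribution with probabilities π1,…,πg and then draws Yj from the corresponding normal component.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pi, means, sds):
    """Draw n observations from a univariate normal mixture by first sampling
    the component label Z_j, then Y_j | Z_j = i ~ N(mean_i, sd_i^2)."""
    pi, means, sds = map(np.asarray, (pi, means, sds))
    z = rng.choice(len(pi), size=n, p=pi)        # latent component labels Z_j
    y = rng.normal(loc=means[z], scale=sds[z])   # component-specific mean and sd
    return y, z

# Illustrative values: two components in proportions 0.75 and 0.25
y, z = sample_mixture(500, pi=[0.75, 0.25], means=[0.0, 3.0], sds=[1.0, 1.0])
```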

  8. 1.5 Shapes of Some Univariate Normal Mixtures Consider the two-component mixture density f(yj) = π1 φ(yj; μ1, σ2) + π2 φ(yj; μ2, σ2), (5) where φ(yj; μ, σ2) denotes the univariate normal density with mean μ and variance σ2, and let Δ = |μ1 − μ2| / σ (6) denote the distance between the component means in units of the common standard deviation σ.

  9. Figure 2: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ2 = 1 (panels correspond to Δ = 1, 2, 3, 4).

  10. Figure 3: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance σ2 = 1 (panels correspond to Δ = 1, 2, 3, 4).
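The qualitative behaviour in Figures 2 and 3 (a single mode for small Δ, two modes once the component means are far enough apart) can be reproduced with a few lines of code. This is an illustrative sketch, not the original plotting code; the grid limits are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def mixture_density(y, pi1, delta, sigma=1.0):
    """Density (5) with mu1 = 0 and mu2 = delta * sigma, so Delta = |mu1 - mu2| / sigma."""
    return pi1 * norm.pdf(y, 0.0, sigma) + (1.0 - pi1) * norm.pdf(y, delta * sigma, sigma)

y = np.linspace(-4.0, 8.0, 601)
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for delta in (1, 2, 3, 4):
    axes[0].plot(y, mixture_density(y, 0.50, delta), label=f"Delta = {delta}")  # cf. Figure 2
    axes[1].plot(y, mixture_density(y, 0.75, delta), label=f"Delta = {delta}")  # cf. Figure 3
axes[0].set_title("equal proportions")
axes[1].set_title("proportions 0.75 / 0.25")
axes[0].legend()
plt.show()
```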

  11. 1.6 Parametric Formulation of Mixture Model In many applications, the component densities fi(yj) are specified to belong to some parametric family. In this case, the component densities are specified as fi(yj; θi), where θi is the vector of unknown parameters in the postulated form for the ith component density in the mixture. The mixture density f(yj) can then be written as

  12. 1.6 cont. f(yj; Ψ) = π1 f1(yj; θ1) + … + πg fg(yj; θg), (7) where the vector Ψ containing all the parameters in the mixture model can be written as Ψ = (π1,…,πg−1, ξT)T, (8) where ξ is the vector containing all the parameters in θ1,…,θg known a priori to be distinct.

  13. 1.7 Identifiability of Mixture Distributions In general, a parametric family of densities f(yj; Ψ) is identifiable if distinct values of the parameter Ψ determine distinct members of the family of densities {f(yj; Ψ): Ψ ∈ Ω}, where Ω is the specified parameter space; that is, f(yj; Ψ) = f(yj; Ψ*) (11)

  14. 1.7 cont. if and only if Ψ = Ψ*. (12) Identifiability for mixture distributions is defined slightly differently. To see why this is necessary, suppose that f(yj; Ψ) has two component densities, say, fi(y; θi) and fh(y; θh), that belong to the same parametric family. Then (11) will still hold when the component labels i and h are interchanged in Ψ.

  15. 1.8 Estimation of Mixture Distributions • In the 1960s, the fitting of finite mixture models by maximum likelihood had been studied in a number of papers, including the seminal papers by Day (1969) and Wolfe (1965, 1967, 1970). • However, it was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

  16. 1.8 Cont. This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM algorithm's conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete.

  17. 1.9 Mixture Likelihood Approach to Clustering Suppose that the purpose of fitting the finite mixture model (7) is to cluster an observed random sample y1,…,yn into g components. This problem can be viewed as wishing to infer the associated component labels z1,…,zn of these feature data vectors. That is, we wish to infer the zj on the basis of the feature data yj.

  18. 1.9 Cont. After we fit the g-component mixture model to obtain the estimate Ψ̂ of the vector of unknown parameters in the mixture model, we can give a probabilistic clustering of the n feature observations y1,…,yn in terms of their fitted posterior probabilities of component membership. For each yj, the g probabilities τ1(yj; Ψ̂),…,τg(yj; Ψ̂) give the estimated posterior probabilities that this observation belongs to the first, second,…, and gth component, respectively, of the mixture (j = 1,…,n).

  19. 1.9 Cont. We can give an outright or hard clustering of these data by assigning each yj to the component of the mixture to which it has the highest posterior probability of belonging. That is, we estimate the component-label vector zj by ẑj, where ẑij is defined by ẑij = 1 if τi(yj; Ψ̂) ≥ τh(yj; Ψ̂) for h = 1,…,g (h ≠ i), and ẑij = 0 otherwise, (14) for i = 1,…,g; j = 1,…,n.
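As a concrete illustration of the posterior probabilities τi(yj; Ψ̂) and the hard-assignment rule (14), the sketch below (illustrative only; it assumes a fitted univariate normal mixture whose parameter values are supplied by the user) computes the fitted posterior probabilities and the resulting outright clustering.

```python
import numpy as np
from scipy.stats import norm

def posterior_probs(y, pi, means, sds):
    """Fitted posterior probabilities tau_i(y_j) for a univariate normal mixture.

    Rows index observations j, columns index components i."""
    y = np.asarray(y, dtype=float)[:, None]          # shape (n, 1)
    dens = norm.pdf(y, loc=means, scale=sds)         # f_i(y_j), shape (n, g)
    weighted = np.asarray(pi) * dens                 # pi_i * f_i(y_j)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Hard (outright) clustering: assign each y_j to its highest-posterior component
y = [0.1, 2.3, -0.5, 3.1]
tau = posterior_probs(y, pi=[0.8, 0.2], means=[0.0, 2.0], sds=[1.0, 1.0])
labels = tau.argmax(axis=1)                          # rule (14)
```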

  20. 1.10 Testing for the Number of Components In some applications of mixture models, there is sufficient a priori information for the number of components g in the mixture model to be specified with no uncertainty. For example, this would be the case where the components correspond to externally existing groups in which the feature vector is known to be normally distributed.

  21. 1.10 Cont. However, on many occasions, the number of components has to be inferred from the data, along with the parameters in the component densities. If, say, a mixture model is being used to describe the distribution of some data, the number of components in the final version of the model may be of interest beyond matters of a technical or computational nature.

  22. 2. Application of EM algorithm 2.1 Estimation of Mixing Proportions Suppose that the density of the random vector Yj has the g-component mixture form f(yj; Ψ) = π1 f1(yj) + … + πg fg(yj), (15) where the component densities fi(yj) are completely specified and Ψ = (π1,…,πg−1)T is the vector containing the unknown parameters, namely the g − 1 mixing proportions π1,…,πg−1, since πg = 1 − (π1 + … + πg−1).

  23. 2.1 cont. In order to pose this problem as an incomplete-data one, we now introduce as the unobservable or missing data the vector z = (z1T,…,znT)T, (18) where zj is the g-dimensional vector of zero-one indicator variables as defined above. If these zij were observable, then the MLE of πi would simply be given by π̂i = Σj zij / n for i = 1,…,g, (19) where Σj denotes summation over j = 1,…,n.

  24. 2.1 Cont. The EM algorithm handles the addition of the unobservable data to the problem by working with Q(Ψ; Ψ(k)), which is the current conditional expectation of the complete-data log likelihood given the observed data. On defining the complete-data vector x as x = (yT, zT)T, (20) where y = (y1T,…,ynT)T denotes the observed data,

  25. 2.1 Cont. the complete-data log likelihood for Ψ has the multinomial form log Lc(Ψ) = Σi Σj zij log πi + C, (21) where C does not depend on Ψ.

  26. 2.1 Cont. As (21) is linear in the unobservable data zij, the E-step (on the (k+1)th iteration) simply requires the calculation of the current conditional expectation of Zij given the observed data y, where Zij is the random variable corresponding to zij. Now, taking expectations at the current fit Ψ(k), E{Zij | y} = pr{Zij = 1 | y} (22)

  27. 2.1 Cont. = τi(yj; Ψ(k)), (23) where, by Bayes' theorem, τi(yj; Ψ(k)) = πi(k) fi(yj) / f(yj; Ψ(k)) for i = 1,…,g; j = 1,…,n. The quantity τi(yj; Ψ(k)) is the posterior probability that the jth member of the sample with observed value yj belongs to the ith component of the mixture.

  28. 2.1 Cont. The M-step on the (k+1)th iteration simply requires replacing each zij by τij(k) = τi(yj; Ψ(k)) in (19) to give πi(k+1) = Σj τij(k) / n for i = 1,…,g. (24)

  29. 2.2 Example 2.1: Synthetic Data Set 1 We generated a random sample of n = 50 observations y1,…,yn from a mixture of two univariate normal densities with means μ1 = 0 and μ2 = 2 and common variance σ2 = 1, in proportions π1 = 0.8 and π2 = 0.2.
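To make the E- and M-steps of Section 2.1 concrete, here is a minimal sketch (not from the original slides) that simulates data as described in Example 2.1 and then iterates (23) and (24) to estimate the mixing proportions, treating the two normal component densities as completely specified. The exact figures obtained will vary with the random seed.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Simulate Example 2.1: n = 50, means 0 and 2, common variance 1, proportions 0.8 / 0.2
n, means, sd = 50, np.array([0.0, 2.0]), 1.0
z = rng.choice(2, size=n, p=[0.8, 0.2])
y = rng.normal(means[z], sd)

# Component densities f_i(y_j) are completely specified; only pi is unknown
dens = norm.pdf(y[:, None], loc=means, scale=sd)   # shape (n, 2)

pi = np.array([0.5, 0.5])                          # starting value pi^(0)
for _ in range(100):
    tau = pi * dens
    tau /= tau.sum(axis=1, keepdims=True)          # E-step: posterior probabilities (23)
    pi_new = tau.mean(axis=0)                      # M-step: update (24)
    if np.max(np.abs(pi_new - pi)) < 1e-8:         # simple convergence check
        break
    pi = pi_new

print("estimated mixing proportions:", pi)
```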

  30. Table 1: Results of EM Algorithm for Example on Estimation of Mixing Proportions

  31. 2.3 Univariate Normal Component Densities The normal mixture model to be fitted is thus f(yj; Ψ) = π1 φ(yj; μ1, σ2) + … + πg φ(yj; μg, σ2), (28) where Ψ = (π1,…,πg−1, μ1,…,μg, σ2)T.

  32. 2.3 Cont. The complete-data log likelihood function for Ψ is given by (21), but where now C = Σi Σj zij log φ(yj; μi, σ2), which depends on Ψ through the μi and σ2.

  33. 2.3 Cont. The E-step is the same as before, requiring the calculation of (23). The M-step now requires the computation of not only (24), but also the values μi(k+1) and σ(k+1)2 that, along with πi(k+1), maximize Q(Ψ; Ψ(k)). Now μ̂i = Σj zij yj / Σj zij (29) and σ̂2 = Σi Σj zij (yj − μ̂i)2 / n (30) are the MLEs of μi and σ2, respectively, if the zij were observable.

  34. 2.3 Cont. As log Lc(Ψ) is linear in the zij, it follows that the zij in (29) and (30) are replaced by their current conditional expectations τij(k), which here are the current estimates τi(yj; Ψ(k)) of the posterior probabilities of membership of the components of the mixture, given by τi(yj; Ψ(k)) = πi(k) φ(yj; μi(k), σ(k)2) / f(yj; Ψ(k)).

  35. 2.3 Cont. This yields μi(k+1) = Σj τij(k) yj / Σj τij(k) (31) and σ(k+1)2 = Σi Σj τij(k) (yj − μi(k+1))2 / n, (32) and πi(k+1) is given by (24).
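The updates (23), (24), (31), and (32) assemble into a complete EM iteration for a univariate normal mixture with common variance. The following is a minimal sketch under those assumptions (illustrative code, not from the original slides); the starting values are supplied by the user, as discussed in Section 2.5.

```python
import numpy as np
from scipy.stats import norm

def em_normal_mixture(y, pi, means, var, n_iter=200, tol=1e-8):
    """EM for a g-component univariate normal mixture with common variance,
    iterating the E-step (23) and the M-step updates (24), (31), (32)."""
    y = np.asarray(y, dtype=float)
    pi, means = np.asarray(pi, dtype=float), np.asarray(means, dtype=float)
    n = y.size
    for _ in range(n_iter):
        # E-step (23): posterior probabilities tau_ij at the current fit
        dens = norm.pdf(y[:, None], loc=means, scale=np.sqrt(var))
        tau = pi * dens
        tau /= tau.sum(axis=1, keepdims=True)

        # M-step: mixing proportions (24), means (31), common variance (32)
        pi_new = tau.mean(axis=0)
        means_new = (tau * y[:, None]).sum(axis=0) / tau.sum(axis=0)
        var_new = (tau * (y[:, None] - means_new) ** 2).sum() / n

        shift = np.max(np.abs(means_new - means))
        pi, means, var = pi_new, means_new, var_new
        if shift < tol:
            break
    return pi, means, var

# Example usage with crude starting values (see Section 2.5):
# pi_hat, mu_hat, var_hat = em_normal_mixture(y, [0.5, 0.5], [min(y), max(y)], 1.0)
```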

  36. 2.4 Multivariate Component Densities For multivariate normal components with means μ1,…,μg and common covariance matrix Σ, the corresponding M-step updates are μi(k+1) = Σj τij(k) yj / Σj τij(k) (34) and Σ(k+1) = Σi Σj τij(k) (yj − μi(k+1)) (yj − μi(k+1))T / n. (35)

  37. 2.4 Cont. In the case of normal components with arbitrary covariance matrices, equation (35) is replaced by Σi(k+1) = Σj τij(k) (yj − μi(k+1)) (yj − μi(k+1))T / Σj τij(k). (36)
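A minimal sketch of one EM iteration for a multivariate normal mixture with component-specific (arbitrary) covariance matrices, i.e. using update (36), might look as follows (illustrative code, not from the original slides).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_mvn(y, pi, means, covs):
    """One EM iteration for a g-component multivariate normal mixture with
    unrestricted component covariance matrices."""
    y, pi = np.asarray(y, dtype=float), np.asarray(pi, dtype=float)
    n, g = y.shape[0], len(pi)

    # E-step: posterior probabilities tau_ij
    dens = np.column_stack([multivariate_normal.pdf(y, mean=means[i], cov=covs[i])
                            for i in range(g)])
    tau = pi * dens
    tau /= tau.sum(axis=1, keepdims=True)

    # M-step
    t = tau.sum(axis=0)                      # effective component sizes
    pi_new = t / n                           # update (24)
    means_new = (tau.T @ y) / t[:, None]     # update (34)
    covs_new = []
    for i in range(g):
        d = y - means_new[i]
        covs_new.append((tau[:, i, None] * d).T @ d / t[i])   # update (36)
    return pi_new, means_new, covs_new
```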

  38. 2.5 Starting Values for EM Algorithm The EM algorithm is started from some initial value Ψ(0) of Ψ. Hence in practice we have to specify a value for Ψ(0). An alternative approach is to perform the first E-step by specifying a value τj(0) for τ(yj; Ψ) for each j (j = 1,…,n), where τ(yj; Ψ) is the vector containing the g posterior probabilities of component membership for yj.

  39. 2.5 Cont. The latter is usually undertaken by setting τj(0) = zj(0) for j = 1,…,n, where z(0) = (z1(0)T,…,zn(0)T)T defines an initial partition of the data into g groups. For example, an ad hoc way of initially partitioning the data in the case of, say, a mixture of g = 2 normal components with the same covariance matrices, would be to plot the data for selections of two of the p variables, and then draw a line that divides the bivariate data into two groups that have a scatter that appears normal.

  40. 2.5 Cont. For higher dimensional data, an initial value z(0) for z might be obtained through the use of some clustering algorithm, such as k-means or, say, a hierarchical procedure if n is not too large. Another way of specifying an initial partition z(0) of the data is to randomly divide the data into g groups corresponding to the g components of the mixture model.
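One way to realize the k-means-based initialization described above is sketched below (illustrative code, not from the original slides): the k-means partition supplies z(0), from which starting values for the mixing proportions, means, and covariance matrices can be derived.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def initial_values_from_kmeans(y, g, seed=0):
    """Use a k-means partition of the data as z^(0) and derive starting
    values for the mixture parameters from that hard partition."""
    y = np.asarray(y, dtype=float)
    _, labels = kmeans2(y, g, minit='++', seed=seed)
    pi0 = np.array([(labels == i).mean() for i in range(g)])
    means0 = np.array([y[labels == i].mean(axis=0) for i in range(g)])
    covs0 = [np.cov(y[labels == i], rowvar=False) for i in range(g)]
    return pi0, means0, covs0
```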

  41. 2.6 Example 2.2: Synthetic Data Set 2

  42. 2.7 Example 2.3: Synthetic Data Set 3

  43. Figure 7

  44. Figure 8

  45. 2.8 Provision of Standard Errors One way of obtaining standard errors of the estimates of the parameters in a mixture model is to approximate the covariance matrix of the MLE Ψ̂ by the inverse of the observed information matrix, which is given by the negative of the Hessian matrix of the log likelihood evaluated at the MLE. It is important to emphasize that estimates of the covariance matrix of the MLE based on the expected or observed information matrices are guaranteed to be valid inferentially only asymptotically.

  46. 2.8 Cont. • In particular for mixture models, it is well known that the sample size n has to be very large before the asymptotic theory of maximum likelihood applies. • Hence we shall now consider a resampling approach, the bootstrap, to this problem. • Standard error estimation of Ψ̂ may be implemented according to the bootstrap as follows.
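The subsequent steps are not included in this excerpt, but as a rough illustration of the general idea, a parametric-bootstrap estimate of the standard errors might be sketched as follows. This is a generic outline, not necessarily the specific scheme on the remaining slides; fit_fn and draw_fn are hypothetical user-supplied callables (for example, a wrapper around the em_normal_mixture sketch above and a simulator for the fitted model).

```python
import numpy as np

def bootstrap_se(y, fit_fn, draw_fn, B=100, seed=0):
    """Parametric bootstrap standard errors for mixture parameter estimates.

    fit_fn(y)              -> vector of parameter estimates from data y
    draw_fn(theta, n, rng) -> bootstrap sample of size n drawn from the
                              fitted model with parameter vector theta"""
    rng = np.random.default_rng(seed)
    theta_hat = fit_fn(y)
    reps = []
    for _ in range(B):
        y_star = draw_fn(theta_hat, len(y), rng)   # simulate from the fitted model
        reps.append(fit_fn(y_star))                # refit on the bootstrap sample
    return np.std(reps, axis=0, ddof=1)            # bootstrap standard errors
```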
