
Applying Finite Mixture Models



Presentation Transcript


  1. Applying Finite Mixture Models Presenter: Geoff McLachlan Department of Mathematics & Institute of Molecular Bioscience University of Queensland

  2. Institute for Molecular Bioscience Building, University of Queensland

  3. Topics • Introduction • Application of EM algorithm • Examples of normal mixtures • Robust mixture modeling • Number of components in a mixture model • Number of nonnormal components • Mixture models for failure-time data • Mixture software

  4. 1.1 Flexible Method of Modeling • Astronomy • Biology • Economics • Engineering • Genetics • Marketing • Medicine • Psychiatry

  5. 1.2 Initial Approach to Mixture Analysis • Classic paper of Pearson (1894)

  6. Figure 1: Plot of forehead to body length data on 1000 crabs and of the fitted one-component (dashed line) and two-component (solid line) normal mixture models.

  7. 1.3 Basic Definition  We let Y_1, …, Y_n denote a random sample of size n, where Y_j is a p-dimensional random vector with probability density function
      f(y_j) = Σ_{i=1}^{g} π_i f_i(y_j),   (1)
  where the f_i(y_j) are densities and the π_i are nonnegative quantities that sum to one.

  8. 1.4 Interpretation of Mixture Models  An obvious way of generating a random vector Y_j with the g-component mixture density f(y_j), given by (1), is as follows. Let Z_j be a categorical random variable taking on the values 1, …, g with probabilities π_1, …, π_g, respectively, and suppose that the conditional density of Y_j given Z_j = i is f_i(y_j) (i = 1, …, g). Then the unconditional density of Y_j (that is, its marginal density) is given by f(y_j).
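
A minimal sketch of this generative interpretation in Python; the component parameters, sample size, and seed below are illustrative assumptions, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component univariate normal mixture (assumed values).
pi = np.array([0.7, 0.3])      # mixing proportions pi_1, ..., pi_g
means = np.array([0.0, 3.0])   # component means
sds = np.array([1.0, 1.0])     # component standard deviations

n = 500
# Step 1: draw the categorical label Z_j taking values 0, ..., g-1 with probabilities pi.
z = rng.choice(len(pi), size=n, p=pi)
# Step 2: draw Y_j from the conditional (component) density given Z_j = i.
y = rng.normal(means[z], sds[z])
# Marginally, y is then a sample from the g-component mixture density f(y_j).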

  9. 1.5 Shapes of Some Univariate Normal Mixtures  Consider
      f(y) = π_1 φ(y; μ_1, σ²) + π_2 φ(y; μ_2, σ²),   (5)
      Δ = |μ_1 − μ_2| / σ,   (6)
  where φ(y; μ, σ²) denotes the univariate normal density with mean μ and variance σ².

  10. Figure 2: Plot of a mixture density of two univariate normal components in equal proportions with common variance σ² = 1, for Δ = 1, 2, 3, 4.

  11. Figure 3: Plot of a mixture density of two univariate normal components in proportions 0.75 and 0.25 with common variance, for Δ = 1, 2, 3, 4.
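
As a rough way to reproduce the behaviour shown in Figures 2 and 3, here is a small Python sketch that evaluates a two-component normal mixture with common variance for several values of Δ; it assumes the parameterization μ_1 = 0, μ_2 = Δσ, which is my reading of (5) and (6) rather than something stated explicitly on these slides:

import numpy as np
from scipy.stats import norm

def mixture_density(y, pi1, delta, sigma=1.0):
    # Two univariate normal components with common variance sigma**2 and
    # means a distance delta * sigma apart (mu1 = 0, mu2 = delta * sigma).
    return (pi1 * norm.pdf(y, loc=0.0, scale=sigma)
            + (1.0 - pi1) * norm.pdf(y, loc=delta * sigma, scale=sigma))

y = np.linspace(-4.0, 8.0, 400)
for delta in (1, 2, 3, 4):
    f_equal = mixture_density(y, 0.5, delta)      # equal proportions (Figure 2)
    f_unequal = mixture_density(y, 0.75, delta)   # proportions 0.75 / 0.25 (Figure 3)
    # For small delta the mixture is unimodal; bimodality appears as delta grows.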

  12. 1.6 Parametric Formulation of Mixture Model  In many applications, the component densities f_i(y_j) are specified to belong to some parametric family. In this case, the component densities f_i(y_j) are specified as f_i(y_j; θ_i), where θ_i is the vector of unknown parameters in the postulated form for the ith component density in the mixture. The mixture density f(y_j) can then be written as

  13. 1.6 Cont.
      f(y_j; Ψ) = Σ_{i=1}^{g} π_i f_i(y_j; θ_i),   (7)
  where the vector Ψ containing all the parameters in the mixture model can be written as
      Ψ = (π_1, …, π_{g−1}, ξ^T)^T,   (8)
  where ξ is the vector containing all the parameters in θ_1, …, θ_g known a priori to be distinct.

  14. In practice, the components are often taken to belong to the normal family, leading to normal mixtures. In the case of multivariate normal components, we have that
      f_i(y_j; θ_i) = φ(y_j; μ_i, Σ_i)   (i = 1, …, g),   (9)
  where φ(y_j; μ_i, Σ_i) denotes the multivariate normal density with mean (vector) μ_i and covariance matrix Σ_i.

  15. In this case, the vector Ψ of unknown parameters is given by
      Ψ = (π_1, …, π_{g−1}, ξ^T)^T,
  where ξ consists of the elements of the component means μ_1, …, μ_g and the distinct elements of the component-covariance matrices Σ_1, …, Σ_g.

  16. In the case of normal homoscedastic components, where the component-covariance matrices are restricted to being equal,
      Σ_i = Σ   (i = 1, …, g),   (10)
  ξ consists of the elements of the component means μ_1, …, μ_g and the distinct elements of the common component-covariance matrix Σ.
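
A short sketch of evaluating the normal mixture density (9) in Python with scipy; the parameter values are invented for illustration, and the homoscedastic case (10) is obtained by giving every component the same covariance matrix:

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-component bivariate normal mixture (assumed values).
pi = np.array([0.6, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 2.0])]
sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]   # unrestricted covariances
# Homoscedastic components: reuse one common covariance matrix for all i.

def mixture_pdf(y):
    # f(y_j; Psi) = sum_i pi_i * phi(y_j; mu_i, Sigma_i), as in (9).
    return sum(p * multivariate_normal.pdf(y, mean=m, cov=S)
               for p, m, S in zip(pi, mus, sigmas))

print(mixture_pdf(np.array([1.0, 1.0])))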

  17. 1.7 Identifiability of Mixture Distributions  In general, a parametric family of densities f(y_j; Ψ) is identifiable if distinct values of the parameter Ψ determine distinct members of the family of densities
      {f(y_j; Ψ): Ψ ∈ Ω},
  where Ω is the specified parameter space; that is,
      f(y_j; Ψ) = f(y_j; Ψ*)   (11)

  18. 1.7 Cont.  if and only if Ψ = Ψ*. Identifiability for mixture distributions is defined slightly differently. To see why this is necessary, suppose that f(y_j; Ψ) has two component densities, say, f_i(y; θ_i) and f_h(y; θ_h), that belong to the same parametric family. Then (11) will still hold when the component labels i and h are interchanged in Ψ.

  19. 1.8 Estimation of Mixture Distributions • In the 1960s, the fitting of finite mixture models by maximum likelihood had been studied in a number of papers, including the seminal papers by Day (1969) and Wolfe (1965, 1967, 1970). • However, it was the publication of the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm that greatly stimulated interest in the use of finite mixture distributions to model heterogeneous data.

  20. 1.8 Cont.  This is because the fitting of mixture models by maximum likelihood is a classic example of a problem that is simplified considerably by the EM algorithm's conceptual unification of maximum likelihood (ML) estimation from data that can be viewed as being incomplete.

  21. 1.9 Mixture Likelihood Approach to Clustering  Suppose that the purpose of fitting the finite mixture model (7) is to cluster an observed random sample y_1, …, y_n into g components. This problem can be viewed as wishing to infer the associated component labels z_1, …, z_n of these feature data vectors. That is, we wish to infer the z_j on the basis of the feature data y_j.

  22. 1.9 Cont.  After we fit the g-component mixture model to obtain the estimate Ψ̂ of the vector of unknown parameters in the mixture model, we can give a probabilistic clustering of the n feature observations y_1, …, y_n in terms of their fitted posterior probabilities of component membership. For each y_j, the g probabilities τ_1(y_j; Ψ̂), …, τ_g(y_j; Ψ̂) give the estimated posterior probabilities that this observation belongs to the first, second, …, and gth component, respectively, of the mixture (j = 1, …, n).

  23. 1.9 Cont.  We can give an outright or hard clustering of these data by assigning each y_j to the component of the mixture to which it has the highest posterior probability of belonging. That is, we estimate the component-label vector z_j by ẑ_j, where ẑ_ij is defined by
      ẑ_ij = 1 if i = arg max_h τ_h(y_j; Ψ̂), and ẑ_ij = 0 otherwise,   (14)
  for i = 1, …, g; j = 1, …, n.
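
A minimal sketch of this probabilistic and hard clustering in Python, assuming the mixture has already been fitted; the fitted values below are placeholders, not estimates from any data set in the slides:

import numpy as np
from scipy.stats import norm

# Placeholder "fitted" values for a two-component univariate normal mixture.
pi_hat = np.array([0.8, 0.2])
mu_hat = np.array([0.0, 2.0])
sigma_hat = 1.0

y = np.array([-0.5, 0.3, 1.9, 2.4])   # feature observations y_1, ..., y_n

# Fitted posterior probabilities tau_i(y_j; Psi_hat) of component membership.
dens = pi_hat * norm.pdf(y[:, None], loc=mu_hat, scale=sigma_hat)   # n x g
tau = dens / dens.sum(axis=1, keepdims=True)

# Hard clustering as in (14): assign y_j to its most probable component.
z_hat = tau.argmax(axis=1)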

  24. 1.10 Testing for the Number of Components  In some applications of mixture models, there is sufficient a priori information for the number of components g in the mixture model to be specified with no uncertainty. For example, this would be the case where the components correspond to externally existing groups in which the feature vector is known to be normally distributed.

  25. 1.10 Cont.  However, on many occasions, the number of components has to be inferred from the data, along with the parameters in the component densities. If, say, a mixture model is being used to describe the distribution of some data, the number of components in the final version of the model may be of interest beyond matters of a technical or computational nature.

  26. 2. Application of EM Algorithm  2.1 Estimation of Mixing Proportions  Suppose that the density of the random vector Y_j has the g-component mixture form
      f(y_j; Ψ) = Σ_{i=1}^{g} π_i f_i(y_j),   (15)
  where Ψ = (π_1, …, π_{g−1})^T is the vector containing the unknown parameters, namely the g−1 mixing proportions π_1, …, π_{g−1}, since
      π_g = 1 − Σ_{i=1}^{g−1} π_i.

  27. 2.1 Cont.  In order to pose this problem as an incomplete-data one, we now introduce as the unobservable or missing data the vector
      z = (z_1^T, …, z_n^T)^T,   (18)
  where z_j is the g-dimensional vector of zero-one indicator variables as defined above. If these z_ij were observable, then the MLE of π_i is simply given by
      π̂_i = Σ_{j=1}^{n} z_ij / n   (i = 1, …, g).   (19)

  28. 2.1 Cont.  The EM algorithm handles the addition of the unobservable data to the problem by working with Q(Ψ; Ψ^(k)), which is the current conditional expectation of the complete-data log likelihood given the observed data. On defining the complete-data vector x as
      x = (y^T, z^T)^T,   (20)

  29. 2.1 Cont.  the complete-data log likelihood for Ψ has the multinomial form
      log L_c(Ψ) = Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij log π_i + C,   (21)
  where C does not depend on Ψ.

  30. 2.1 Cont.  As (21) is linear in the unobservable data z_ij, the E-step (on the (k+1)th iteration) simply requires the calculation of the current conditional expectation of Z_ij given the observed data y, where Z_ij is the random variable corresponding to z_ij. Now
      E_{Ψ^(k)}(Z_ij | y) = pr_{Ψ^(k)}(Z_ij = 1 | y) = τ_i(y_j; Ψ^(k)),   (22)

  31. 2.1 Cont.  where by Bayes' Theorem,
      τ_i(y_j; Ψ^(k)) = π_i^(k) f_i(y_j) / f(y_j; Ψ^(k))   (23)
  for i = 1, …, g; j = 1, …, n. The quantity τ_i(y_j; Ψ^(k)) is the posterior probability that the jth member of the sample with observed value y_j belongs to the ith component of the mixture.

  32. 2.1 Cont.  The M-step on the (k+1)th iteration simply requires replacing each z_ij by τ_ij^(k) in (19) to give
      π_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) / n   (24)
  for i = 1, …, g.
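
Putting the E-step (23) and the M-step (24) together, here is a sketch in Python of this EM scheme for the mixing proportions when the component densities are completely specified; the two normal densities at the end are an illustrative choice of known components, not ones given on the slides:

import numpy as np
from scipy.stats import norm

def em_mixing_proportions(y, component_pdfs, pi0, max_iter=500, tol=1e-8):
    # EM for the mixing proportions only: the f_i are known functions,
    # so Psi = (pi_1, ..., pi_{g-1})^T.
    pi = np.asarray(pi0, dtype=float)
    for _ in range(max_iter):
        dens = np.column_stack([f(y) for f in component_pdfs])   # n x g matrix of f_i(y_j)
        # E-step (22)-(23): posterior probabilities tau_ij by Bayes' Theorem.
        num = pi * dens
        tau = num / num.sum(axis=1, keepdims=True)
        # M-step (24): replace z_ij by tau_ij in the complete-data MLE (19).
        pi_new = tau.mean(axis=0)
        if np.max(np.abs(pi_new - pi)) < tol:
            return pi_new
        pi = pi_new
    return pi

# Illustrative known component densities (means 0 and 2, unit variance).
f1 = lambda y: norm.pdf(y, loc=0.0, scale=1.0)
f2 = lambda y: norm.pdf(y, loc=2.0, scale=1.0)
# pi_hat = em_mixing_proportions(y, [f1, f2], pi0=[0.5, 0.5])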

  33. 2.2 Example 2.1: Synthetic Data Set 1  We generated a random sample of n = 50 observations y_1, …, y_n from a mixture of two univariate normal densities with means μ_1 = 0 and μ_2 = 2 and common variance σ² = 1 in proportions π_1 = 0.8 and π_2 = 0.2.
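
A sketch of how a sample like the one in Example 2.1 can be generated in Python (the random seed is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(123)   # arbitrary seed

n = 50
pi1 = 0.8                          # pi_2 = 0.2
mu1, mu2, sigma = 0.0, 2.0, 1.0    # common variance sigma**2 = 1

z = rng.random(n) < pi1            # True -> component 1, False -> component 2
y = np.where(z, rng.normal(mu1, sigma, n), rng.normal(mu2, sigma, n))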

  34. Table 1: Results of EM Algorithm for Example on Estimation of Mixing Proportions

  35. 2.3 Univariate Normal Component Densities  The normal mixture model to be fitted is thus
      f(y_j; Ψ) = Σ_{i=1}^{g} π_i φ(y_j; μ_i, σ²),   (28)
  where φ(y_j; μ_i, σ²) denotes the univariate normal density with mean μ_i and common variance σ².

  36. 2.3 Cont.  The complete-data log likelihood function for Ψ is given by (21), but where now the term
      Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij log φ(y_j; μ_i, σ²)
  is no longer a constant, since it depends on the unknown component means μ_1, …, μ_g and the common variance σ².

  37. 2.3 Cont.  The E-step is the same as before, requiring the calculation of (23). The M-step now requires the computation of not only (24), but also the values μ_i^(k+1) and (σ²)^(k+1) that, along with π_i^(k+1), maximize Q(Ψ; Ψ^(k)).

  38. 2.3 Cont.  Now
      μ̂_i = Σ_{j=1}^{n} z_ij y_j / Σ_{j=1}^{n} z_ij   (29)
  and
      σ̂² = Σ_{i=1}^{g} Σ_{j=1}^{n} z_ij (y_j − μ̂_i)² / n   (30)
  are the MLEs of μ_i and σ², respectively, if the z_ij were observable.

  39. 2.3 Cont.  As log L_c(Ψ) is linear in the z_ij, it follows that the z_ij in (29) and (30) are replaced by their current conditional expectations τ_ij^(k), which here are the current estimates τ_i(y_j; Ψ^(k)) of the posterior probabilities of membership of the components of the mixture, given by
      τ_i(y_j; Ψ^(k)) = π_i^(k) φ(y_j; μ_i^(k), (σ²)^(k)) / Σ_{h=1}^{g} π_h^(k) φ(y_j; μ_h^(k), (σ²)^(k)).

  40. 2.3 Cont.  This yields
      μ_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) y_j / Σ_{j=1}^{n} τ_ij^(k)   (31)
  and
      (σ²)^(k+1) = Σ_{i=1}^{g} Σ_{j=1}^{n} τ_ij^(k) (y_j − μ_i^(k+1))² / n,   (32)
  and π_i^(k+1) is given by (24).
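
A compact sketch of one EM iteration for this univariate normal mixture with common variance, implementing the updates (24), (31), and (32); initialization and the stopping rule are left to the caller:

import numpy as np
from scipy.stats import norm

def em_step_univariate(y, pi, mu, sigma2):
    # One EM iteration for a g-component univariate normal mixture
    # with common variance sigma2.
    # E-step: posterior probabilities tau_ij = tau_i(y_j; Psi^(k)).
    dens = pi * norm.pdf(y[:, None], loc=mu, scale=np.sqrt(sigma2))   # n x g
    tau = dens / dens.sum(axis=1, keepdims=True)
    # M-step.
    n_i = tau.sum(axis=0)                                            # sum_j tau_ij
    pi_new = n_i / len(y)                                            # (24)
    mu_new = (tau * y[:, None]).sum(axis=0) / n_i                    # (31)
    sigma2_new = (tau * (y[:, None] - mu_new) ** 2).sum() / len(y)   # (32)
    return pi_new, mu_new, sigma2_new

# Iterate em_step_univariate until the parameter estimates (or the log
# likelihood) stabilize.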

  41. 2.4 Multivariate Component Densities  In the multivariate normal case, the M-step updates for the component means and the common component-covariance matrix take the form
      μ_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) y_j / Σ_{j=1}^{n} τ_ij^(k)   (34)
  and
      Σ^(k+1) = Σ_{i=1}^{g} Σ_{j=1}^{n} τ_ij^(k) (y_j − μ_i^(k+1))(y_j − μ_i^(k+1))^T / n.   (35)

  42. 2.4 Cont.  In the case of normal components with arbitrary covariance matrices, equation (35) is replaced by
      Σ_i^(k+1) = Σ_{j=1}^{n} τ_ij^(k) (y_j − μ_i^(k+1))(y_j − μ_i^(k+1))^T / Σ_{j=1}^{n} τ_ij^(k).   (36)
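
A sketch of the multivariate M-step covariance computations in Python, written under the assumption that (35) is the pooled (common) covariance update and (36) the component-specific one; the function and variable names are mine:

import numpy as np

def covariance_updates(y, tau, mu_new):
    # y: (n, p) data matrix; tau: (n, g) posterior probabilities tau_ij^(k);
    # mu_new: (g, p) updated component means mu_i^(k+1).
    n = y.shape[0]
    resid = y[:, None, :] - mu_new[None, :, :]           # (n, g, p)
    outer = resid[:, :, :, None] * resid[:, :, None, :]  # (n, g, p, p) outer products
    weighted = tau[:, :, None, None] * outer
    # Component-specific covariances, as in (36): one matrix per component.
    sigma_i = weighted.sum(axis=0) / tau.sum(axis=0)[:, None, None]
    # Common (homoscedastic) covariance, as in (35): pooled over components.
    sigma_common = weighted.sum(axis=(0, 1)) / n
    return sigma_i, sigma_common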

  43. 2.5 Starting Values for EM Algorithm  The EM algorithm is started from some initial value of Ψ, Ψ^(0). Hence in practice we have to specify a value for Ψ^(0). An alternative approach is to perform the first E-step by specifying a value τ_j^(0) for τ(y_j; Ψ) for each j (j = 1, …, n), where τ(y_j; Ψ) is the vector containing the g posterior probabilities of component membership for y_j.

  44. 2.5 Cont.  The latter is usually undertaken by setting τ_j^(0) = z_j^(0) for j = 1, …, n, where z^(0) defines an initial partition of the data into g groups. For example, an ad hoc way of initially partitioning the data in the case of, say, a mixture of g = 2 normal components with the same covariance matrices, would be to plot the data for selections of two of the p variables, and then draw a line that divides the bivariate data into two groups that have a scatter that appears normal.

  45. 2.5 Cont.  For higher-dimensional data, an initial value z^(0) for z might be obtained through the use of some clustering algorithm, such as k-means or, say, a hierarchical procedure if n is not too large. Another way of specifying an initial partition z^(0) of the data is to randomly divide the data into g groups corresponding to the g components of the mixture model.
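
A sketch of these initialization options in Python; k-means is done here with scikit-learn, which is one possible choice rather than something the slides prescribe:

import numpy as np
from sklearn.cluster import KMeans

def initial_partition_kmeans(y, g, seed=0):
    # Initial hard partition z^(0) from k-means cluster labels (y is n x p).
    return KMeans(n_clusters=g, n_init=10, random_state=seed).fit_predict(y)

def initial_partition_random(y, g, seed=0):
    # Initial partition z^(0): randomly divide the data into g groups.
    rng = np.random.default_rng(seed)
    return rng.integers(0, g, size=len(y))

def partition_to_tau(labels, g):
    # Convert z^(0) into initial posterior probabilities tau_j^(0) (0/1 indicators).
    tau0 = np.zeros((len(labels), g))
    tau0[np.arange(len(labels)), labels] = 1.0
    return tau0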

  46. 2.6 Example 2.2: Synthetic Data Set 2

  47. 2.7 Example 2.3: Synthetic Data Set 3
