
Maschinelles Lernen



  1. Maschinelles Lernen: Expectation Maximization (EM) Algorithm, Clustering (Unsupervised Learning)

  2. Expectation Maximization Algorithm. Task: Given a set of observations X = (x_n)_{n=1,…,N}, learn a mixture distribution p(x|Θ) = Σ_j P(j) p_j(x|Θ_j), where Θ is the set of all parameters of the mixture and Θ_j are the parameters of the j-th component. [Figure: two component densities p_1(x|Θ_1) and p_2(x|Θ_2).] Note: Learning mixture distributions is useful, e.g., when one wants to classify: if the a priori class probabilities P(j) and the parameters Θ_j, i.e. the components p_j(x|Θ_j), are known, one can classify optimally according to the Bayes classifier. We already encountered this problem in the first lecture (learning the joint length distribution of sea bass versus salmon); in the present situation, however, the class labels are unknown.
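As a side note to the remark above, the following is a minimal sketch of Bayes classification once the mixture is known; the Gaussian component family, the parameter values, and all names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical known mixture: a priori probabilities P(j) and component parameters Theta_j.
priors = np.array([0.6, 0.4])                                  # P(1), P(2)
means, sigmas = np.array([40.0, 60.0]), np.array([5.0, 8.0])   # illustrative Theta_j

def bayes_classify(x):
    """Assign x to the class j maximizing P(j) * p_j(x | Theta_j)."""
    scores = priors * norm.pdf(x, loc=means, scale=sigmas)
    return int(np.argmax(scores))

print(bayes_classify(47.0))  # index of the more probable class for observation x = 47.0
```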

  3. Expectation Maximization Algorithm. The log-likelihood is log p(X|Θ) = Σ_{n=1}^{N} log Σ_j P(j) p_j(x_n|Θ_j). Introduce new random variables Z = (z_n)_{n=1,…,N} that indicate the class membership of observation x_n. If the z_n are known, the likelihood becomes p(X, Z|Θ) = Π_{n=1}^{N} P(z_n) p_{z_n}(x_n|Θ_{z_n}). To evaluate this expression we need the parameters Θ and the probabilities P(z_n).

  4. Expectation Maximization Algorithm

  5. Expectation Step. Start with the old parameters Θ_j^old and P^old(j) and compute the a posteriori probabilities of class membership: P(j|x_n) = P^old(j) p_j(x_n|Θ_j^old) / Σ_k P^old(k) p_k(x_n|Θ_k^old). The expectation of the complete-data log-likelihood with respect to Z is then the weighted sum E_Z[log p(X, Z|Θ)] = Σ_n Σ_j P(j|x_n) (log P(j) + log p_j(x_n|Θ_j)).
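A minimal sketch of this E-step, assuming one-dimensional Gaussian components (the slides leave the component family p_j generic); all function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def e_step(x, priors_old, means_old, sigmas_old):
    """Posterior class probabilities P(j | x_n) under the old parameters."""
    # joint[n, j] = P_old(j) * p_j(x_n | Theta_j_old)
    joint = priors_old[None, :] * norm.pdf(x[:, None], means_old[None, :], sigmas_old[None, :])
    return joint / joint.sum(axis=1, keepdims=True)  # normalise over the components j
```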

  6. Expectation Step. Substituting these posteriors yields Q(Θ^new) = Σ_n Σ_j P(j|x_n) log P^new(j) + Σ_n Σ_j P(j|x_n) log p_j(x_n|Θ_j^new). This expression is to be maximized with respect to Θ_j^new and P^new(j). Because of the form of the expression, the two maximizations can be carried out separately.

  7. Maximization Step

  8. Maximization Step

  9. Maximization Step
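The formulas on the three Maximization Step slides are images and are missing from the transcript. For reference, a sketch of the standard mixture M-step: maximizing the weighted sum from slide 6 over P^new(j), subject to Σ_j P^new(j) = 1 (Lagrange multiplier), gives the new mixing weights; if one additionally assumes Gaussian components (an assumption not fixed by the slides), the means and covariances follow as weighted averages:

```latex
P^{\text{new}}(j) = \frac{1}{N}\sum_{n=1}^{N} P(j \mid x_n), \qquad
\mu_j^{\text{new}} = \frac{\sum_{n} P(j \mid x_n)\, x_n}{\sum_{n} P(j \mid x_n)}, \qquad
\Sigma_j^{\text{new}} = \frac{\sum_{n} P(j \mid x_n)\,(x_n - \mu_j^{\text{new}})(x_n - \mu_j^{\text{new}})^{\top}}{\sum_{n} P(j \mid x_n)}
```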

  10. Clustering
  Partitioning methods
  • These usually require the specification of the number of clusters; then a mechanism for apportioning objects to clusters must be determined.
  • Advantage: provides clusters that (approximately) satisfy an optimality criterion.
  • Disadvantage: the number of clusters K must be chosen initially, and computation times can be long.
  Hierarchical methods
  • These methods provide a hierarchy of clusters, from the coarsest level, where all objects form a single cluster, to the finest, where each observation is its own cluster.
  • Advantage: fast computation (agglomerative clustering).
  • Disadvantage: rigid; erroneous decisions made early cannot be corrected later.

  11. K-means Clustering

  12. K-means Clustering

  13. Partitioning Methods. If we measure the quality of an estimator by the expected prediction error under the zero-one loss function (the standard choice for classification), the optimal classifier is the Bayes classifier (see chapter “classification”), which, for equal a priori class probabilities, assigns x to the class C_j with maximal P_j(x). Of course, we do not know the probability distributions P_j for each class C_j, but we can make some sensible assumptions about them: Let the predictor space ℝ^d be equipped with some distance measure d(a,b). Let P_j be a unimodal distribution that is symmetric around its mode μ_j, i.e. there is a non-increasing function p_j: [0,∞) → ℝ such that P_j(x) = p_j(d(x, μ_j)). If we assume that all P_j have the same shape, i.e. p_1 = … = p_k = p, then P_j(x) = p(d(x, μ_j)), and consequently the class with the largest P_j(x) is the class whose mode is closest to x. In other words, our considerations lead to a very simple classification rule: given x, search for the class C_j whose mode μ_j is nearest to x and classify x as C_j. Note that this rule remains unchanged for different choices of the function p (which determines the shape of the distributions P_j)!
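A minimal sketch of this nearest-mode rule; the Euclidean choice of d, the example modes, and the function name are illustrative assumptions.

```python
import numpy as np

def nearest_mode_classify(x, modes):
    """Classify x as the class C_j whose mode mu_j is nearest to x."""
    distances = np.linalg.norm(modes - x, axis=1)  # d(x, mu_j) for every class j
    return int(np.argmin(distances))

modes = np.array([[0.0, 0.0], [5.0, 5.0]])         # hypothetical modes mu_1, mu_2
print(nearest_mode_classify(np.array([1.0, 0.5]), modes))  # -> 0
```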

  14. Partitioning Methods. Still, we cannot classify x, since we do not know the modes μ_j. Under our model assumptions, the set of parameters μ = (μ_1,…, μ_k) completely determines the probability distribution P as well as the Bayes classifier C, and we can write down the likelihood of observing the data D = { (x_j, C(x_j)), j = 1,…,n } given μ. We would like to find the maximum likelihood estimator for μ, i.e. the parameter set for which the probability of observing the data (given by the x_j and their optimal classification C(x_j)) becomes maximal. In general, this is impossible to do analytically; we therefore use an iterative strategy, described on the next slide, to find a local maximum of P(D|μ).
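The likelihood itself appears only as an image on the original slide; under the model assumptions just stated (P_j(x) = p(d(x, μ_j)) with a common shape p and labels fixed by C), it presumably has the form

```latex
P(D \mid \mu) \;=\; \prod_{j=1}^{n} P_{C(x_j)}(x_j)
              \;=\; \prod_{j=1}^{n} p\bigl(d(x_j,\ \mu_{C(x_j)})\bigr),
\qquad \hat{\mu} = \operatorname*{argmax}_{\mu} P(D \mid \mu).
```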

  15. Partitioning Methods
  • Set T = 0 and start with some arbitrary parameters μ(0) = (μ_1(0),…, μ_k(0)).
  • Repeat
  • For each point x_n, calculate its label L_n(T) = C(x_n). (update labels)
  • For each cluster j = 1,…,k, calculate μ_j(T+1) as the centre, with respect to d, of all points currently labelled j. (update centres) (If there is a tie for the best μ_j(T+1), stay at μ_j(T) if possible; otherwise choose at random.)
  • T ← T+1
  • Until convergence of (μ(T)).
  It can be shown that the sequence μ(T), T = 0, 1, 2, … converges (exercise!) and the process stops. Since the corresponding sequence P(D|μ(T)), T = 0, 1, … is monotonically increasing and bounded by 1, it necessarily converges to a local maximum of P(D|μ). BUT: the above strategy is not guaranteed to find a global maximum!

  16. K-means Clustering, Example. What happens if we take p(x) = c·exp(−x²/(2σ²)) for some σ and an appropriate normalizing constant c?

  17. K-means Clustering, Example. Remember that in one of the early lectures, the term on the right-hand side of the centre update was used to define the “centre” of all the points x_n that are labelled C_j. For d the Euclidean distance, we proved that this centre equals the arithmetic mean of the points involved (we proved this for one-dimensional data points x_n, but it holds in higher dimensions as well). So, letting d be the Euclidean distance, the procedure for estimating μ becomes the so-called k-means algorithm:
  • Start with some arbitrary parameters μ = (μ_1,…, μ_k).
  • Repeat
  • Update labels: L_m = argmin_j d(x_m, μ_j), m = 1,…,n.
  • Update centres: set μ_j to the arithmetic mean of all points x_m with L_m = j.
  • Until convergence of the sequence (μ).
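A compact sketch of the k-means procedure just described (the random initialization, the iteration cap, and all names are implementation choices, not prescribed by the slide):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Alternate label updates and centre updates until the centres stop changing."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # arbitrary initial mu_1..mu_k
    for _ in range(max_iter):
        # Update labels: L_m = argmin_j d(x_m, mu_j), with d the Euclidean distance.
        labels = np.argmin(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
        # Update centres: arithmetic mean of the points currently labelled j.
        new_centres = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
                                for j in range(k)])
        if np.allclose(new_centres, centres):  # convergence of the sequence (mu)
            break
        centres = new_centres
    return centres, labels
```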

  18. K-means Clustering, Example: Gene Expression Data. Gene expression data on p genes (variables) for n mRNA samples (observations): a data matrix with rows corresponding to the samples i = 1,…,n, columns corresponding to the genes j = 1,…,p, and entry x_ij giving the expression level of gene j in mRNA sample i. Task: find “interesting” clusters of genes, i.e. genes with similar behaviour across samples.

  19. K-means Clustering, Example (taken from Silicon Genetics)

  20. K-means Clustering, Example

  21. K-means Clustering. [Figure: clustering of the raw data vs. clustering after the features were first standardized.] Giving all attributes equal influence (standardization) can obscure well-separated groups.
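For reference, a one-line version of the standardization mentioned above (z-scoring every feature before clustering); whether to apply it is a modelling decision, as the slide warns:

```python
import numpy as np

def standardize(X):
    """Give every feature (column) zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```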

  22. K-means Clustering
  • Advantages of using k-means
  • With a large number of variables, k-means may be computationally faster than hierarchical clustering (if k is small).
  • k-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
  • Disadvantages of using k-means
  • It is difficult to compare the quality of the clusters produced (e.g. different initial partitions or values of k affect the outcome).
  • The fixed number of clusters makes it difficult to predict what k should be.
  • It does not work well with non-globular clusters.
  • Different initial partitions can result in different final clusters. It is therefore helpful to rerun the algorithm with the same as well as with different values of k and to compare the results.

  23. Hierarchical Clustering, Dendrograms. Hierarchical methods produce a tree structure, a so-called dendrogram, as their result. No number of clusters is fixed in advance. There are two basic approaches to generating the clusters: divisive and agglomerative methods.

  24. Hierarchical Clustering, Dendrograms

  25. Hierarchical Clustering, Agglomerative Methods. Let d(G,H) be a function that maps any two subsets G, H of the set of all data points to a non-negative real value. Think of d as a distance measure for sets of data points. The algorithm for agglomerative clustering is then:
  • Start with the family S(0) = { {x_1},…,{x_n} }: each data point lies in a one-element set.
  • For j = 1,…,n−1
  • Choose the pair of sets G, H ∈ S(j−1) for which d(G,H) is minimal.
  • Define S(j) = S(j−1) \ {G, H} ∪ {G ∪ H} (merge the two sets G and H into the single set G ∪ H).
  It is easy to see that S(j) is a partition of the data points consisting of n−j sets. Hence, if we want to obtain a clustering with k classes, the partition S(n−k) provides a classification of the data points into k mutually disjoint classes.
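A minimal sketch of agglomerative clustering with SciPy; the random example data, the "average" linkage, and the cut at k = 3 clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # hypothetical data points x_1,...,x_n

# Build the full merge hierarchy S(0), S(1), ..., S(n-1) (the dendrogram).
Z = linkage(X, method="average")

# Cutting the hierarchy at k clusters corresponds to the partition S(n-k).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```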

  26. Hierarchical Clustering, Agglomerative Methods. Single linkage
  • The distance between two clusters is the minimal distance between two objects, one from each cluster.
  • Single linkage only requires that a single dissimilarity be small for two groups G and H to be considered close together, irrespective of the other observation dissimilarities between the groups.
  • It will therefore have a tendency to combine, at relatively low thresholds, observations linked by a series of close intermediate observations (“chaining”).
  • Disadvantage: the clusters produced by single linkage can violate the “compactness” property that all observations within a cluster tend to be similar to one another, based on the supplied observation dissimilarities.

  27. Hierarchical Clustering, Agglomerative Methods. Complete linkage
  • The distance between two clusters is the maximum of the distances between two objects, one from each cluster.
  • Two groups G and H are considered close only if all of the observations in their union are relatively similar.
  • It tends to produce compact clusters with small diameters, but can produce clusters that violate the “closeness” property.

  28. Hierarchical Clustering, Agglomerative Methods. Average linkage
  • The distance between two clusters is the average of the pairwise distances between members of the two clusters.
  • It represents a compromise between the two extremes of single and complete linkage.
  • It produces relatively compact clusters that are relatively far apart.
  • Disadvantage: its results depend on the numerical scale on which the observation dissimilarities are measured.
  Centroid linkage
  • The distance between two clusters is the distance between their centroids.
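In formulas, with d(g,h) denoting the dissimilarity between two individual observations and ḡ, h̄ the centroids of G and H:

```latex
d_{\text{single}}(G,H)   = \min_{g \in G,\, h \in H} d(g,h), \qquad
d_{\text{complete}}(G,H) = \max_{g \in G,\, h \in H} d(g,h), \qquad
d_{\text{average}}(G,H)  = \frac{1}{|G|\,|H|} \sum_{g \in G} \sum_{h \in H} d(g,h), \qquad
d_{\text{centroid}}(G,H) = d(\bar{g}, \bar{h}).
```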

  29. Hierarchical Clustering, Agglomerative Methods

  30. Example: Two-way hierarchical clustering. Clustering of samples across genes: find groups of similar samples. Clustering of genes across samples: find groups of similar genes. From: Eisen, Spellman, Botstein et al., yeast compendium data.

  31. Clustering Packages in R
