Semi-supervised Learning



  1. Semi-supervised Learning Rong Jin

  2. Semi-supervised learning • Label propagation • Transductive learning • Co-training • Active learning

  3. Label Propagation • A toy problem • Each node in the graph is an example • Two examples are labeled; most examples are unlabeled • Compute the similarity S_ij between examples • Connect each example to its most similar examples • How to predict labels for the unlabeled nodes using this graph? [Figure: a graph with two labeled nodes, many unlabeled nodes, and edge weights w_ij]
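A minimal sketch of the graph construction described on this slide, in NumPy: the Gaussian similarity, the bandwidth sigma, and the choice of k nearest neighbours are illustrative assumptions, not values from the slides.

```python
import numpy as np

def build_similarity_graph(X, k=5, sigma=1.0):
    """Sketch: symmetric k-nearest-neighbour graph with Gaussian similarities S_ij."""
    n = X.shape[0]
    # Pairwise squared Euclidean distances between examples
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian (RBF) similarity
    np.fill_diagonal(S, 0.0)                  # no self-loops
    # Connect each example only to its k most similar neighbours
    mask = np.zeros_like(S, dtype=bool)
    nearest = np.argsort(-S, axis=1)[:, :k]
    mask[np.arange(n)[:, None], nearest] = True
    return np.where(mask | mask.T, S, 0.0)    # keep an edge if either endpoint selected it
```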

  4. Label Propagation • Forward propagation

  5. Label Propagation • Forward propagation (second step)

  6. Label Propagation • Forward propagation (third step) • How to resolve conflicting cases: what label should be given to a node that receives propagated labels from both classes?

  7. Label Propagation • Let S be the similarity matrix, S = [S_ij]_{n×n} • Let D be the diagonal matrix with D_ii = Σ_{j≠i} S_ij • Compute the normalized similarity matrix S' = D^{-1/2} S D^{-1/2} • Let Y be the initial assignment of class labels • Y_i = 1 when the i-th node is assigned to the positive class • Y_i = -1 when the i-th node is assigned to the negative class • Y_i = 0 when the i-th node is not initially labeled • Let F be the predicted class labels • The i-th node is assigned to the positive class if F_i > 0 • The i-th node is assigned to the negative class if F_i < 0
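A small sketch of the quantities defined on this slide, assuming the similarity matrix S from the graph-construction sketch above; the `labels` mapping from node index to ±1 is a hypothetical input format.

```python
import numpy as np

def normalize_similarity(S):
    """S' = D^{-1/2} S D^{-1/2}, where D_ii = sum_{j != i} S_ij (diagonal of S is zero)."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))   # guard against isolated nodes
    return S * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def initial_labels(n, labels):
    """Y_i = +1 / -1 for the labeled nodes, 0 for the unlabeled nodes."""
    Y = np.zeros(n)
    for i, y in labels.items():   # labels: {node index: +1 or -1}
        Y[i] = y
    return Y
```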


  9. Label Propagation • One iteration: F = Y + αS'Y = (I + αS')Y • α weights the propagated values • Two iterations: F = Y + αS'Y + α²S'²Y = (I + αS' + α²S'²)Y • Infinitely many iterations: F = (Σ_{n=0}^∞ α^n S'^n) Y = (I - αS')^{-1} Y, which converges for 0 < α < 1 since the eigenvalues of S' lie in [-1, 1] • Any problems with such an approach?
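A minimal sketch of both the iterative and the closed-form versions of this propagation; α = 0.9 is an illustrative choice, and the names are mine.

```python
import numpy as np

def propagate_labels(S_norm, Y, alpha=0.9, n_iter=None):
    """Label propagation on the normalized similarity matrix S_norm = S'.
    Closed form: F = (I - alpha * S')^{-1} Y; iterative: F <- Y + alpha * S' @ F."""
    n = len(Y)
    if n_iter is None:
        return np.linalg.solve(np.eye(n) - alpha * S_norm, Y)
    F = Y.copy()
    for _ in range(n_iter):
        F = Y + alpha * (S_norm @ F)
    return F

# Decision rule from slide 7: positive class if F_i > 0, negative class if F_i < 0.
```

Note that nothing in this scheme forces F to agree with Y on the labeled nodes, which is exactly the consistency problem raised on the next slide.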

  10. Label Consistency Problem • Predicted vector F may not be consistent with the initially assigned class labels Y

  11. Energy Minimization • Using the same notation • S_ij: similarity between the i-th node and the j-th node • Y: initially assigned class labels • F: predicted class labels • Energy: E(F) = Σ_{i,j} S_ij (F_i - F_j)² • Goal: find a label assignment F that is consistent with the labeled examples Y and meanwhile minimizes the energy function E(F)

  12. Harmonic Function • E(F) = Σ_{i,j} S_ij (F_i - F_j)² = F^T (D - S) F • Thus the minimizer of E(F) should satisfy (D - S)F = 0, while F remains consistent with Y • Split into labeled and unlabeled parts: F^T = (F_l^T, F_u^T), Y^T = (Y_l^T, Y_u^T) • Constraint: F_l = Y_l
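With the nodes ordered so that the labeled examples come first, imposing F_l = Y_l and requiring (D - S)F = 0 on the unlabeled rows gives the closed form F_u = (D_uu - S_uu)^{-1} S_ul Y_l. A hedged sketch (function and variable names are mine, not from the slides):

```python
import numpy as np

def harmonic_function(S, Y_l, labeled_idx):
    """Harmonic solution: F_l = Y_l and (D - S) F = 0 on the unlabeled nodes,
    i.e. F_u = (D_uu - S_uu)^{-1} S_ul Y_l."""
    n = S.shape[0]
    labeled = np.zeros(n, dtype=bool)
    labeled[labeled_idx] = True
    unlabeled = ~labeled
    L = np.diag(S.sum(axis=1)) - S                 # graph Laplacian D - S
    L_uu = L[np.ix_(unlabeled, unlabeled)]
    S_ul = S[np.ix_(unlabeled, labeled)]
    F = np.zeros(n)
    F[labeled] = Y_l
    F[unlabeled] = np.linalg.solve(L_uu, S_ul @ Y_l)
    return F
```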

  13. Optical Character Recognition • Given an image of a digit, determine its value • Create a graph over the digit images [Figure: example images of the digits 2 and 1]

  14. Optical Character Recognition • #Labeled_Examples + #Unlabeled_Examples = 4000 • CMN: label propagation (with class mass normalization) • 1NN: each unlabeled example takes the label of its closest labeled example
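The 1NN baseline above is plain nearest-neighbour transfer of labels; a minimal sketch (CMN refers to the label-propagation approach and is not re-implemented here):

```python
import numpy as np

def one_nn_baseline(X_labeled, y_labeled, X_unlabeled):
    """Assign each unlabeled example the label of its closest labeled example."""
    d2 = ((X_unlabeled[:, None, :] - X_labeled[None, :, :]) ** 2).sum(axis=-1)
    return y_labeled[d2.argmin(axis=1)]
```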

  15. Spectral Graph Transducer • Problem with the harmonic function • Why could this happen? • The condition (D - S)F = 0 does not hold for constrained cases


  17. Spectral Graph Transducer min_F F^T L F + c (F - Y)^T C (F - Y) s.t. F^T F = n, F^T e = 0 • C is the diagonal cost matrix: C_ii = 1 if the i-th node is initially labeled, zero otherwise • The parameter c controls the balance between the consistency requirement and the energy-minimization requirement • Can be solved efficiently via an eigenvector computation
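One way to see what this objective trades off is to drop the two constraints and set the gradient to zero, which gives the linear system (L + cC)F = cCY. The sketch below solves only this simplified, unconstrained problem; the actual spectral graph transducer enforces F^T F = n and F^T e = 0 through an eigendecomposition of L, which is not reproduced here.

```python
import numpy as np

def soft_constrained_propagation(S, Y, labeled_idx, c=1.0):
    """Minimize F^T L F + c (F - Y)^T C (F - Y) with the norm/balance constraints
    dropped; the stationarity condition is (L + c C) F = c C Y."""
    n = S.shape[0]
    L = np.diag(S.sum(axis=1)) - S                  # graph Laplacian
    C = np.zeros((n, n))
    C[labeled_idx, labeled_idx] = 1.0               # unit cost only on initially labeled nodes
    return np.linalg.solve(L + c * C, c * (C @ Y))
```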

  18. Empirical Studies

  19. Green's Function • The problem of minimizing the energy while staying consistent with the initially assigned class labels can be formulated as a Green's function problem • Minimizing E(F) = F^T L F ⇒ LF = 0 • It turns out that L can be viewed as the Laplacian operator in the discrete case: LF = 0 ⇒ ∇²F = 0 • Thus our problem is to find a solution F with ∇²F = 0, s.t. F = Y for the labeled examples • We can treat the constraint F = Y on the labeled examples as a boundary condition (a Dirichlet boundary condition) • A standard Green's function problem

  20. Why Energy Minimization? [Figure: final classification results]

  21. Cluster Assumption • Cluster assumption • The decision boundary should pass through low-density regions • Unlabeled data provide a more accurate estimate of the local density

  22. Cluster Assumption vs. Maximum Margin • Maximum margin classifier (e.g. SVM): w·x + b • Maximum margin ⇒ low density around the decision boundary ⇒ cluster assumption • Any thoughts on utilizing the unlabeled data in a support vector machine? [Figure: labeled points of the +1 and -1 classes with the max-margin boundary]

  23. Transductive SVM • Decision boundary given a small number of labeled examples

  24. Transductive SVM • Decision boundary given a small number of labeled examples • How will the decision boundary change given both labeled and unlabeled examples?

  25. Transductive SVM • Decision boundary given a small number of labeled examples • Move the decision boundary to a region with low local density

  26. Transductive SVM • Decision boundary given a small number of labeled examples • Move the decision boundary to a region with low local density • Classification results • How to formulate this idea?

  27. Transductive SVM: Formulation • Labeled data L: {(x_1, y_1), …, (x_n, y_n)} • Unlabeled data D: {x_{n+1}, …, x_{n+m}} • Maximum margin principle for the mixture of labeled and unlabeled data • For each label assignment of the unlabeled data, compute its maximum margin • Find the label assignment whose maximum margin is maximized

  28. Transductive SVM Different label assignments for the unlabeled data ⇒ different maximum margins

  29. Transductive SVM: Formulation • A binary variable for the label of each unlabeled example • The objective and constraints of the original SVM • Additional constraints for the unlabeled data

  30. Computational Issue • No longer a convex optimization problem (why? the unknown labels of the unlabeled examples are discrete variables) • How to optimize the transductive SVM? • Alternating optimization

  31. Alternating Optimization • Step 1: fix y_{n+1}, …, y_{n+m}, learn the weights w • Step 2: fix the weights w, try to predict y_{n+1}, …, y_{n+m} (How?)
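A much-simplified sketch of this alternating scheme, using scikit-learn's LinearSVC as the inner solver. Step 2 here just relabels the unlabeled pool with the current decision function; the actual transductive SVM additionally controls the class balance of the unlabeled predictions and gradually increases the weight on the unlabeled part, which is omitted.

```python
import numpy as np
from sklearn.svm import LinearSVC

def alternating_tsvm(X_l, y_l, X_u, n_rounds=10, C=1.0):
    """Crude transductive SVM via alternating optimization."""
    clf = LinearSVC(C=C).fit(X_l, y_l)          # initialize y_u with a supervised SVM
    y_u = clf.predict(X_u)
    X_all = np.vstack([X_l, X_u])
    for _ in range(n_rounds):
        # Step 1: fix the labels y_u of the unlabeled data, learn the weights w
        clf = LinearSVC(C=C).fit(X_all, np.concatenate([y_l, y_u]))
        # Step 2: fix w, re-predict the labels of the unlabeled examples
        y_u_new = clf.predict(X_u)
        if np.array_equal(y_u_new, y_u):        # stop when the assignment is stable
            break
        y_u = y_u_new
    return clf, y_u
```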

  32. Empirical Study with Transductive SVM • 10 categories from the Reuters collection • 3299 test documents • 1000 informative words selected using the mutual information (MI) criterion

  33. Co-training for Semi-supervised Learning • Consider the task of classifying web pages into two categories: category for students and category for professors • Two aspects of web pages should be considered • Content of web pages • “I am currently the second year Ph.D. student …” • Hyperlinks • “My advisor is …” • “Students: …”

  34. Co-training for Semi-Supervised Learning

  35. Co-training for Semi-Supervised Learning • Some web pages are easier to classify using their hyperlinks • Others are easy to classify based on their content

  36. Co-training • Two representations for each web page • Content representation: (doctoral, student, computer, university, …) • Hyperlink representation: inlinks: Prof. Cheng; outlinks: Prof. Cheng

  37. Co-training: Classification Scheme • Train a content-based classifier using labeled web pages • Apply the content-based classifier to classify unlabeled web pages • Label the web pages that have been confidently classified • Train a hyperlink-based classifier using the web pages that were initially labeled plus those labeled by the content-based classifier • Apply the hyperlink-based classifier to classify the unlabeled web pages • Label the web pages that have been confidently classified
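A sketch of this loop with naive Bayes classifiers over the two views; the confidence threshold, the classifier choice, and all names here are illustrative assumptions rather than part of the original co-training description.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_training(Xc_l, Xh_l, y_l, Xc_u, Xh_u, threshold=0.95, n_rounds=10):
    """Co-training over a content view (Xc_*) and a hyperlink view (Xh_*)."""
    unlabeled = np.arange(Xc_u.shape[0])
    clf_c, clf_h = MultinomialNB(), MultinomialNB()
    for _ in range(n_rounds):
        for clf in (clf_c, clf_h):
            if unlabeled.size == 0:
                break
            is_content = clf is clf_c
            # Train this view's classifier on everything labeled so far
            clf.fit(Xc_l if is_content else Xh_l, y_l)
            # Classify the still-unlabeled pages using this view's features
            proba = clf.predict_proba((Xc_u if is_content else Xh_u)[unlabeled])
            confident = proba.max(axis=1) >= threshold
            if not confident.any():
                continue
            picked = unlabeled[confident]
            new_y = clf.classes_[proba[confident].argmax(axis=1)]
            # Move confidently classified pages (both views) into the labeled pool
            Xc_l = np.vstack([Xc_l, Xc_u[picked]])
            Xh_l = np.vstack([Xh_l, Xh_u[picked]])
            y_l = np.concatenate([y_l, new_y])
            unlabeled = unlabeled[~confident]
    return clf_c, clf_h
```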

  38. Co-training • Train a content-based classifier

  39. Co-training • Train a content-based classifier using labeled examples • Label the unlabeled examples that are confidently classified

  40. Co-training • Train a content-based classifier using labeled examples • Label the unlabeled examples that are confidently classified • Train a hyperlink-based classifier • Prof. : outlinks to students

  41. Co-training • Train a content-based classifier using labeled examples • Label the unlabeled examples that are confidently classified • Train a hyperlink-based classifier • Prof. : outlinks to students • Label the unlabeled examples that are confidently classified

  42. Co-training • Train a content-based classifier using labeled examples • Label the unlabeled examples that are confidently classified • Train a hyperlink-based classifier • Prof. : outlinks to students • Label the unlabeled examples that are confidently classified
