Learn about the motivations and challenges in social bookmarking and clustering analysis. Discover the principles, algorithms, and architectures of machine learning. Dive into topics such as the EM algorithm, Bayes' theorem, and dimension reduction.
Machine Learning and Statistical Analysis — Jong Youl Choi, Computer Science Department (jychoi@cs.indiana.edu)
Motivations • Social bookmarking → socialized bookmarks and tags
Collaborative Tagging System • Motivations • Social indexing or collaborative annotation • Collect knowledge from people • Extract information • Challenges • Vast amount of data → efficient indexing scheme • Very dynamic → temporal analysis • Unsupervised data → clustering, inference
Outline • Principles of Machine Learning • Bayes’ theorem and maximum likelihood • Machine Learning Algorithms • Clustering analysis • Dimension reduction • Classification • Parallel Computing • General parallel computing architecture • Parallel algorithms
Machine Learning • Definition: algorithms or techniques that enable a computer (machine) to “learn” from data. Related to many areas such as data mining, statistics, and information theory. • Algorithm Types • Unsupervised learning • Supervised learning • Reinforcement learning • Topics • Models • Artificial Neural Network (ANN) • Support Vector Machine (SVM) • Optimization • Expectation-Maximization (EM) • Deterministic Annealing (DA)
Bayes’ Theorem • Posterior probability of θᵢ, given X: P(θᵢ|X) = P(X|θᵢ) P(θᵢ) / P(X) • θᵢ ∈ Θ : parameter • X : observations • P(θᵢ) : prior (or marginal) probability • P(X|θᵢ) : likelihood • Maximum Likelihood (ML) • Used to find the most plausible θᵢ ∈ Θ, given X • Computing the maximum likelihood (ML) or log-likelihood → an optimization problem
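To make the theorem concrete, here is a tiny Python sketch that computes the posterior over two hypothetical parameter values from made-up priors and likelihoods (all numbers are illustrative, not from the talk):

```python
# Bayes' theorem on two hypothetical parameter values theta_1, theta_2
# for a single observation X; all numbers are made up for illustration.
priors = [0.7, 0.3]           # P(theta_i)
likelihoods = [0.2, 0.9]      # P(X | theta_i)

evidence = sum(p * l for p, l in zip(priors, likelihoods))            # P(X)
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # P(theta_i | X)

print(posteriors)
# The ML choice ignores the prior and picks the theta_i maximizing P(X | theta_i).
print(max(range(len(likelihoods)), key=lambda i: likelihoods[i]))
```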
Maximum Likelihood (ML) Estimation • Problem: estimate the hidden parameters θ = {μ, σ} from data drawn from k Gaussian distributions • Gaussian distribution: P(x|μ, σ²) = (1/√(2πσ²)) exp(−(x−μ)²/2σ²) • Maximum likelihood: θ_ML = argmax_θ P(X|θ) • With a Gaussian (P = N), solve either by brute force or by a numerical method (Mitchell, 1997)
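For a single Gaussian the ML estimates have a closed form (sample mean and variance). A minimal NumPy sketch on synthetic data, with names of my own choosing:

```python
import numpy as np

# Closed-form ML estimates for one Gaussian; 'data' is synthetic.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)

mu_ml = data.mean()                        # ML estimate of the mean
var_ml = ((data - mu_ml) ** 2).mean()      # ML estimate of the variance (divides by n)
print(mu_ml, var_ml)
```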
EM algorithm • Problems in ML estimation • The observation X is often not complete • A latent (hidden) variable Z exists • Hard to explore the whole parameter space • Expectation-Maximization algorithm • Objective: find the ML estimate over the latent distribution P(Z|X, θ) • Steps 0. Init – choose a random θ_old 1. E-step – compute the expectation over P(Z|X, θ_old) 2. M-step – find the θ_new which maximizes the expected likelihood 3. Go to step 1 after updating θ_old ← θ_new
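A minimal NumPy sketch of the E- and M-steps for a one-dimensional Gaussian mixture (function and parameter names are mine; this is illustrative, not the speaker's implementation):

```python
import numpy as np

def em_gmm_1d(x, k, iters=100, seed=0):
    """Toy EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k)                # random initial means (step 0)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)                  # mixing weights
    for _ in range(iters):
        # E-step: responsibilities P(Z = j | x_i, theta_old)
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: parameters maximizing the expected complete-data log-likelihood
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi
```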
Clustering Analysis • Definition: grouping unlabeled data into clusters, for the purpose of inferring hidden structures or information • Dissimilarity measurement • Distance : Euclidean (L2), Manhattan (L1), … • Angle : inner product, … • Non-metric : rank, intensity, … • Types of Clustering • Hierarchical • Agglomerative or divisive • Partitioning • K-means, VQ, MDS, … (MATLAB help page)
K-Means • Find K partitions with the total intra-cluster variance minimized • Iterative method • Initialization : randomized yᵢ • Assignment of x (yᵢ fixed) • Update of yᵢ (x fixed) • Problem: can get trapped in local minima (MacKay, 2003)
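A short NumPy sketch of the assignment/update loop (Lloyd's algorithm); names are illustrative and there is no empty-cluster handling:

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    """Plain K-means; can get trapped in local minima depending on initialization."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]   # random initialization
    for _ in range(iters):
        # Assignment step: nearest center for each point (centers fixed).
        labels = np.argmin(((x[:, None, :] - centers) ** 2).sum(-1), axis=1)
        # Update step: each center moves to the mean of its cluster (assignments fixed).
        centers = np.array([x[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels
```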
Deterministic Annealing (DA) • Deterministically avoid local minima • No stochastic process (random walk) • Trace the global solution by changing the level of randomness • Statistical mechanics • Gibbs distribution: P(Eₓ) ∝ exp(−Eₓ/T) • Helmholtz free energy F = D − TS • Average energy D = ⟨Eₓ⟩ • Entropy S = −Σ P(Eₓ) ln P(Eₓ) • F = −T ln Z • In DA, we minimize F (Maxima and Minima, Wikipedia)
Deterministic Annealing (DA) • Analogy to the physical annealing process • Control energy (randomness) by temperature (high → low) • Start with a high temperature (T = ∞) • Soft (or fuzzy) association probability • Smooth cost function with one global minimum • Lower the temperature (T → 0) • Hard association • Full complexity is revealed; clusters emerge • Minimize F iteratively, using E(x, yⱼ) = ||x − yⱼ||²
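One way to picture a single DA-style update at temperature T, sketched in NumPy: a Gibbs-style soft association followed by a weighted-mean center update. This is my simplification, not the talk's exact formulation:

```python
import numpy as np

def da_update(x, centers, T):
    """One soft-assignment update at temperature T: p(j | x) ∝ exp(-||x - y_j||^2 / T)."""
    d2 = ((x[:, None, :] - centers) ** 2).sum(-1)           # E(x, y_j) = ||x - y_j||^2
    d2 -= d2.min(axis=1, keepdims=True)                     # shift for numerical stability
    p = np.exp(-d2 / T)
    p /= p.sum(axis=1, keepdims=True)                       # soft (fuzzy) association
    new_centers = (p[:, :, None] * x[:, None, :]).sum(0) / p.sum(0)[:, None]
    return new_centers, p
```

Lowering T gradually makes the association probabilities harder, which is the annealing schedule described above.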
Dimension Reduction • Definition: the process of transforming high-dimensional data into a low-dimensional representation to improve accuracy, aid understanding, or remove noise • Curse of dimensionality • Complexity grows exponentially in volume as extra dimensions are added • Types • Feature selection : choose representatives (e.g., filters, …) • Feature extraction : map to a lower dimension (e.g., PCA, MDS, …) (Koppen, 2000)
Principal Component Analysis (PCA) • Find a map of the principal components (PCs) of the data into an orthogonal space, such that y = Wx where W ∈ ℝ^(d×h) (h ≪ d) • PCs – directions with the largest variances • Orthogonality • Linearity – optimal least mean-square error • Limitations? • Strict linearity • Assumes a specific distribution • Large-variance assumption [Figure: data axes x₁, x₂ with principal axes PC1, PC2]
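A minimal PCA sketch via eigendecomposition of the covariance matrix (names are mine; the projection keeps the h largest-variance directions):

```python
import numpy as np

def pca(x, h):
    """Project n x d data onto its top-h principal components."""
    xc = x - x.mean(axis=0)                      # center the data
    cov = np.cov(xc, rowvar=False)               # d x d covariance matrix
    vals, vecs = np.linalg.eigh(cov)             # eigh: covariance is symmetric
    order = np.argsort(vals)[::-1][:h]           # indices of the h largest eigenvalues
    W = vecs[:, order]                           # d x h matrix of principal directions
    return xc @ W                                # low-dimensional coordinates
```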
Random Projection • Like PCA, reduces dimension by y = Rx where R is a random matrix with i.i.d. columns and R ∈ ℝ^(d×p) (p ≪ d) • Johnson–Lindenstrauss lemma • When projecting onto a randomly selected subspace, distances are approximately preserved • Generating R • Hard to obtain an orthogonalized R • Gaussian R • Simple approach: choose rᵢⱼ ∈ {+√3, 0, −√3} with probability 1/6, 4/6, 1/6 respectively
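A sketch of the sparse random projection described above (entries ±√3 or 0), with a 1/√p scaling so that squared distances are preserved in expectation; all names are illustrative:

```python
import numpy as np

def sparse_random_projection(x, p, seed=0):
    """Project n x d data to p dimensions with a sparse random matrix."""
    d = x.shape[1]
    rng = np.random.default_rng(seed)
    vals = np.array([np.sqrt(3.0), 0.0, -np.sqrt(3.0)])
    entries = rng.choice(vals, size=(d, p), p=[1 / 6, 4 / 6, 1 / 6])
    R = entries / np.sqrt(p)        # scaling keeps expected squared distances unchanged
    return x @ R                    # n x p projected data
```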
Multi-Dimensional Scaling (MDS) • Dimension reduction preserving the distance proximities observed in the original data set • Loss functions • Inner product • Distance • Squared distance • Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ • From Δ, find the inner-product matrix B (double centering) • From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
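A compact sketch of classical MDS: double centering of the squared-distance matrix, then an eigendecomposition to recover coordinates (illustrative names, no handling of degenerate inputs):

```python
import numpy as np

def classical_mds(D2, h):
    """Recover h-dimensional coordinates from an n x n matrix of squared distances."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # inner-product matrix (double centering)
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:h]           # keep the h largest eigenvalues
    L = np.sqrt(np.clip(vals[order], 0.0, None)) # clip tiny negative eigenvalues
    return vecs[:, order] * L                    # coordinates X' with B ≈ X' X'^T
```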
Multi-Dimensional Scaling (MDS) • SMACOF : minimizing STRESS • Majorization – for a complex f(x), find an auxiliary simple g(x, y) s.t. g(x, y) ≥ f(x), with equality at x = y • Majorization for STRESS • Minimize tr(XᵀB(Y)Y), known as the Guttman transform (Cox, 2001)
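A one-step sketch of the Guttman transform for the unit-weight case (my own simplification; a full SMACOF loop would repeat this until STRESS stops decreasing):

```python
import numpy as np

def guttman_step(Y, delta):
    """One SMACOF majorization step: X_new = (1/n) B(Y) Y, unit weights."""
    n = Y.shape[0]
    dist = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(dist > 0, delta / dist, 0.0)   # delta_ij / d_ij(Y)
    B = -ratio
    np.fill_diagonal(B, 0.0)
    np.fill_diagonal(B, -B.sum(axis=1))                 # b_ii = sum_{j != i} delta_ij / d_ij
    return B @ Y / n
```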
Self-Organizing Map (SOM) • Competitive and unsupervised learning process for clustering and visualization • Result : similar data get closer in the model space • Learning • Choose the model vector mⱼ most similar to xᵢ • Update the winner and its neighbors by mₖ ← mₖ + α(t) h(t)(xᵢ − mₖ) • α(t) : learning rate • h(t) : neighborhood function (size) [Figure: input space mapped onto the model grid]
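A single SOM update step, sketched in NumPy; the schedule parameters (lr0, sigma0, tau) and the Gaussian neighborhood are illustrative choices, not the talk's:

```python
import numpy as np

def som_step(models, grid, x, t, lr0=0.5, sigma0=2.0, tau=100.0):
    """Pull the best-matching model vector and its grid neighbors toward input x."""
    lr = lr0 * np.exp(-t / tau)                            # learning rate alpha(t)
    sigma = sigma0 * np.exp(-t / tau)                      # shrinking neighborhood size
    winner = np.argmin(((models - x) ** 2).sum(axis=1))    # best-matching model vector
    grid_d2 = ((grid - grid[winner]) ** 2).sum(axis=1)     # distance on the map grid
    h = np.exp(-grid_d2 / (2 * sigma ** 2))                # neighborhood function
    models += lr * h[:, None] * (x - models)               # m_k <- m_k + alpha*h*(x - m_k)
    return models
```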
Classification • Definition: a procedure dividing data into a given set of categories based on a training set, in a supervised way • Generalization vs. specialization • Hard to achieve both • Avoid overfitting (overtraining) • Early stopping • Holdout validation • K-fold cross-validation • Leave-one-out cross-validation [Figure: training vs. validation error, showing underfitting and overfitting regions] (Overfitting, Wikipedia)
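A small sketch of how K-fold cross-validation splits the indices (a hypothetical helper, not tied to any particular model):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation over n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val
```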
Artificial Neural Network (ANN) • Perceptron : a computational unit with a binary threshold • Abilities • Linearly separable decision surface • Represents Boolean functions (AND, OR, NOT) • Network (multilayer) of perceptrons → various network architectures and capabilities [Figure: weighted sum followed by an activation function] (Jain, 1996)
Artificial Neural Network (ANN) • Learning weights – random initialization and iterative updating • Error-correction training rules • Difference between training data and output: E(t, o) • Gradient descent (batch learning) • With E = Σᵢ Eᵢ, update w ← w − η ∂E/∂w • Stochastic approach (on-line learning) • Update the gradient for each result • Various error functions • Add a weight-regularization term (λ Σ wᵢ²) to avoid overfitting • Add momentum (α Δwᵢ(n−1)) to expedite convergence
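A batch gradient-descent sketch for a single linear unit with squared error, including the regularization and momentum terms mentioned above (learning rate, λ, and momentum values are illustrative):

```python
import numpy as np

def train_linear_unit(X, t, epochs=100, lr=0.01, lam=1e-3, mom=0.9, seed=0):
    """Batch gradient descent with L2 weight decay (lam) and momentum (mom)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1])     # random initialization
    dw_prev = np.zeros_like(w)
    for _ in range(epochs):
        o = X @ w                                   # unit outputs
        grad = X.T @ (o - t) / len(t) + lam * w     # dE/dw plus regularization term
        dw = -lr * grad + mom * dw_prev             # momentum expedites convergence
        w += dw
        dw_prev = dw
    return w
```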
Support Vector Machine • Q: How do we draw the optimal linear separating hyperplane? A: By maximizing the margin • Margin maximization • The distance between H₊₁ and H₋₁ is 2/||w|| • Thus, ||w|| should be minimized [Figure: separating hyperplane with margin]
Support Vector Machine • Constrained optimization problem • Given a training set {xᵢ, yᵢ} (yᵢ ∈ {+1, −1}): • Minimize ½||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 • Lagrangian equation with saddle points • Minimized w.r.t. the primal variables w and b • Maximized w.r.t. the dual variables αᵢ (all αᵢ ≥ 0) • xᵢ with αᵢ > 0 (not αᵢ = 0) is called a support vector (SV)
Support Vector Machine • Soft margin (non-separable case) • Slack variables ξᵢ ≥ 0, with the dual variables bounded by αᵢ ≤ C • Optimization with this additional constraint • Non-linear SVM • Map non-linear input to a feature space: x ↦ φ(x) • Kernel function k(x, y) = ⟨φ(x), φ(y)⟩ • Kernel classifier with support vectors sᵢ [Figure: mapping from input space to feature space]
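A sketch of the kernel classifier built from already-found support vectors, with an RBF kernel as one common (assumed) choice; the dual variables, labels, and bias are taken as given from a solved optimization:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel k(a, b) = exp(-gamma * ||a - b||^2); one common kernel choice."""
    return np.exp(-gamma * ((a - b) ** 2).sum())

def svm_predict(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
    """Kernel SVM decision: sign( sum_i alpha_i * y_i * k(s_i, x) + b )."""
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return np.sign(s + bias)
```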
Parallel Computing • Memory architecture • Shared memory : symmetric multiprocessor (SMP); OpenMP, POSIX threads (pthreads), MPI; easy to manage but expensive • Distributed memory : commodity, off-the-shelf processors; MPI; cost-effective but hard to maintain (Barney, 2007) • Decomposition strategy • Task – e.g., Word, IE, … • Data – scientific problems • Pipelining – task + data
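A minimal data-decomposition sketch in the distributed-memory (MPI) style, using mpi4py (assumed to be installed; run with something like `mpiexec -n 4 python script.py`). Each rank works on its own block of the data and a reduction combines the partial results:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

data = np.arange(1000, dtype=float)
chunk = np.array_split(data, size)[rank]      # data decomposition: one block per rank

partial = chunk.sum()                         # local work on this rank's block
total = comm.allreduce(partial, op=MPI.SUM)   # combine partial results across ranks
if rank == 0:
    print(total)
```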
Parallel SVM • Shrinking • Recall : only support vectors (αᵢ > 0) are used in the SVM optimization • Predict whether data is SV or non-SV • Remove non-SVs from the problem space • Parallel SVM • Partition the problem • Merge data hierarchically • Each unit finds its support vectors • Loop until convergence (Graf, 2005)
Thank you!! Questions?