LING 696B: Graph-based methods and Supervised learning
Road map • Types of learning problems: • Unsupervised: clustering, dimension reduction -- Generative models • Supervised: classification (today) -- Discriminative models • Methodology: • Parametric: stronger assumptions about the distribution (blobs, mixture model) • Non-parametric: weaker assumptions (neural nets, spectral clustering, Isomap)
Puzzle from several weeks ago • How do people learn categories from distributions? Liberman et al. (1952)
Graph-based non-parametric methods • “Learn locally, think globally” • Local learning: find each point’s neighbors and build a graph that reveals the underlying structure • The graph is then used to recover global structure in the data • Isomap: geodesic distance through shortest paths in the graph • Spectral clustering: connected components from the graph spectrum (see demo; a sketch of the graph-building step follows)
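A minimal sketch of the “learn locally” step, assuming scikit-learn and SciPy are available (the toy data and neighborhood size k below are illustrative, not from the course demo): build a k-nearest-neighbor graph, then recover Isomap-style geodesic distances as shortest paths through it.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

# Toy data: points sampled along a curved 1-D manifold embedded in 2-D
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 3 * np.pi, 200))
X = np.column_stack([t * np.cos(t), t * np.sin(t)])

# Local step: connect each point to its k nearest neighbors (edge weight = distance)
k = 6
W = kneighbors_graph(X, n_neighbors=k, mode="distance")

# Global step (Isomap): geodesic distance = shortest path through the graph
D_geo = shortest_path(W, method="D", directed=False)
print(D_geo.shape)  # (200, 200) matrix of approximate geodesic distances
```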
Clustering as a graph partitioning problem • Normalized-cut problem: split the graph into two parts A and B, so that • Neither part is too small • The edges being cut don’t carry too much weight • The quantities being balanced: the weight on edges from A to B (what gets cut) vs. the weight on edges within A (and within B)
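One standard way to write this objective is Shi and Malik’s normalized cut (the slide’s “weights within A” corresponds to the normalizing association terms below):

```latex
\mathrm{cut}(A,B) = \sum_{i \in A,\; j \in B} W_{ij}, \qquad
\mathrm{assoc}(A,V) = \sum_{i \in A,\; j \in V} W_{ij}

\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                   + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}
```

Minimizing Ncut penalizes both cuts that sever a lot of weight and cuts that isolate a tiny part, which are exactly the two conditions on the slide.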
Normalized cut through spectral embedding • The exact normalized-cut problem is NP-hard (it explodes for large graphs) • A “soft” version is solvable: look for coordinates x1, … xN for the nodes that minimize Σij Wij (xi − xj)², where W is the neighborhood (adjacency) matrix • Strongly connected nodes end up nearby, weakly connected nodes end up far away • Such coordinates are provided by eigenvectors of the adjacency/Laplacian matrix (recall MDS) -- spectral embedding
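A minimal NumPy/SciPy sketch of the spectral-embedding idea (Gaussian affinities and the unnormalized Laplacian are my choices here; the course demo may use a different recipe):

```python
import numpy as np
from scipy.linalg import eigh

def spectral_bipartition(X, sigma=1.0):
    """Two-way spectral clustering: embed with graph-Laplacian eigenvectors,
    then split on the sign of the second one (the Fiedler vector)."""
    # Neighborhood/affinity matrix: Gaussian weights on pairwise distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Unnormalized graph Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W

    # Eigenvectors of L give embedding coordinates; the smallest nontrivial
    # one minimizes sum_ij W_ij (x_i - x_j)^2 under a scale constraint
    vals, vecs = eigh(L)
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Usage: two well-separated blobs should come out as two clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels = spectral_bipartition(X)
print(labels[:5], labels[-5:])
```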
Is this relevant to how people learn categories? • Maye & Gerken: learning a bi-modal distribution on a curve (living in an abstract manifold) from /d/ to /(s)t/ • Mixture model: transform the signal, and approximate with two “dynamic blobs” • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Simple case: start from a uniform distribution (see demo)
Local learning from graphs • Can people learn categories from arbitrary manifolds following a “local learning” strategy? • Most likely no • What constrains the kinds of manifolds that people can learn? • What are the reasonable metrics people use? • How does neighborhood size affect this type of learning? • What about learning from non-uniform distributions?
Switching gears • Supervised learning: learning a function from input-output pairs • Arguably, something that people also do • Example: the perceptron • Learning a function f(x) = sign(<w,x> + b) • Also called a “classifier”: a machine with a yes/no output (see the sketch below)
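A minimal sketch of perceptron learning for f(x) = sign(<w,x> + b), in NumPy; the data, learning rate, and epoch count are illustrative placeholders:

```python
import numpy as np

def perceptron_train(X, y, epochs=100, lr=1.0):
    """Classic perceptron: learn w, b so that sign(<w, x> + b) matches the labels.
    X: (n, d) inputs; y: labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w += lr * yi * xi               # nudge the hyperplane toward xi
                b += lr * yi
                errors += 1
        if errors == 0:                          # converged on separable data
            break
    return w, b

def perceptron_predict(X, w, b):
    return np.sign(X @ w + b)

# Usage on a trivially separable problem
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(perceptron_predict(X, w, b))  # should reproduce y
```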
Speech perception as a classification problem • Speech perception is viewed as a bottom-up procedure involving many decisions • E.g. sonorant/consonant, voice/voiceless • See Peter’s presentation • A long-standing effort of building machines that do the same • Stevens’ view of distinctive features
Knowledge-based speech recognition • Mainstream method: • Front end: uniform signal representation • Back end: hidden Markov models • Knowledge-based: • Front end: sound-specific features based on acoustic knowledge • Back end: a series of decisions on how lower-level knowledge is integrated
The conceptual framework from Liu (1996) and others • Each step is hard work • [Flow chart of the processing stages; some of them are bypassed in Stevens (2002)]
Implications of flow-chart architecture • Requires accurate low-level decisions • Mistakes can build up very quickly • Thought experiment: “linguistic” speech recognition through a sequence of distinctive feature classifiers • Hand-crafted decision rules often not robust/flexible • The need for good statistical classifiers
An unlikely marriage • Recent years have seen several sophisticated classification machines • Example: support vector machine by Vapnik (today) • Interest moving from neural nets to these new machines • Many have proposed to integrate the new classifiers as a back-end • Niyogi and Burges paper: building feature detectors with SVM
Generalization in classification • Experiment: you are learning a line that separates two classes
Generalization in classification • Question: Where does the yellow dot belong?
Margin and linear classifiers • We tend to draw a line that gives the most “room” between the two clouds -- this room is the margin
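In symbols (a standard definition, not spelled out on the slide): for a separating line <w,x> + b = 0, the geometric margin of a labeled point (xi, yi) and of the classifier are

```latex
\gamma_i = \frac{y_i\,(\langle w, x_i\rangle + b)}{\lVert w\rVert},
\qquad
\gamma = \min_i \gamma_i ,
```

and the maximum-margin line is the one that maximizes γ.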
Margin • Margin needs to be defined on “border” points
Justification for maximum margin • Hopefully, maximum-margin classifiers generalize well
Support vectors in the separable case • The data points that sit exactly at the maximal margin from the separating line
Formalizing maximum margin -- optimization for SVM • Need constrained optimization • f(x) = sign(<w,x> + b) is the same as sign(<Cw,x> + Cb) for any C > 0, so the scale of w must be pinned down • Two strategies for setting up the constrained optimization problem: • Limit the length of w, and maximize the margin • Fix the margin, and minimize the length of w (written out below)
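The second strategy, in the usual textbook form (my notation, not copied from the slides): fix the margin by requiring every point to satisfy yi(<w,xi> + b) ≥ 1, then minimize the length of w:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(\langle w, x_i\rangle + b) \ge 1, \qquad i = 1,\dots,N .
```

With this normalization the closest points sit at distance 1/‖w‖ from the line, so minimizing ‖w‖ and maximizing the margin are the same thing.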
SVM optimization (see demo) • Constrained quadratic programming problem • It can be shown (through the Lagrange multiplier method) that the solution looks like w = Σi αi yi xi, where yi is the label of xi and the margin is fixed by the constraints -- a linear combination of the training data!
SVM applied to non-separable data • What happens when data is not separable? • The optimization problem has no solution (recall the XOR problem) • See demo
Extension to non-separable data through new variables • Introduce slack variables ξi ≥ 0 that allow data points to “encroach” on the separating line (see demo) • New objective: the original objective plus a tolerance term for the total slack
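Spelled out in the standard soft-margin form (C is the usual trade-off constant; none of this notation is on the slide):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{N}\xi_i
\quad\text{subject to}\quad
y_i\,(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 .
```

The first term is the original objective; the second is the tolerance, charging a price for every point that encroaches on (or crosses) the margin.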
When things become wild: non-linear extensions • The majority of “real world” problems are not separable • This can be due to some deep underlying law, e.g. XOR-type data • Non-linearity in neural nets comes from: • Hidden layers • Non-linear activations • SVMs started a more fashionable way of building non-linear machines -- kernels
Kernel methods • Model-fitting problems are ill-posed without constraining the hypothesis space • Avoid committing to a fixed space: non-parametric methods using kernels • Idea: let the space grow with the data • How? Associate each data point with a little function, e.g. a blob, and take the space to be linear combinations of these • Connection to neural nets
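In symbols (my gloss, not on the slide): with a kernel K, the hypothesis space built from the data points x1, … xN is

```latex
\mathcal{H}_N \;=\; \Bigl\{\, f(x) = \sum_{i=1}^{N} c_i\, K(x, x_i) \;:\; c_i \in \mathbb{R} \,\Bigr\},
```

which literally grows as more data points arrive.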
Kernel extension of SVM • Recall the linear solution: w = Σi αi yi xi • Substituting this into f: f(x) = sign(Σi αi yi <x, xi> + b) -- what matters is only the dot product <x, xi> • So a general kernel function K(x, xi) can be used in place of <x, xi>
Kernel extension of SVM • This is very much like replacing linear nodes with non-linear nodes in a neural net • Radial basis function network: each K(x, xi) is a Gaussian centered at xi -- a small blob • “Seeing” the non-linearity: a theorem (Mercer’s) says K(x, x') = <Φ(x), Φ(x')> for some feature map Φ, i.e. the kernel is still a dot product, except that it works in an infinite-dimensional space of “features”
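For the concrete case on the slide, the Gaussian (RBF) kernel is

```latex
K(x, x') \;=\; \exp\!\Bigl(-\frac{\lVert x - x'\rVert^{2}}{2\sigma^{2}}\Bigr),
```

and its feature space is infinite-dimensional, which is exactly the situation the next slide worries about.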
This is not a fairy tale • Hopefully, by throwing the data into infinitely many dimensions, it becomes separable • How can things work in infinite dimensions? • The infinite-dimensional space is implicit • Only support vectors act as “anchors” for the separating plane in feature space • All the computation is done in finite dimensions, by searching over the support vectors and their weights • As a result, we can do lots of things with SVM by playing with kernels (see demo and the sketch below)
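A minimal sketch of “playing with kernels”, assuming scikit-learn is available (the XOR-style data and parameter values are illustrative, not the course demo):

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: the two classes are not linearly separable
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

# Same optimization machinery, different kernels
for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, degree=2, gamma="scale").fit(X, y)
    print(f"{kernel:6s}  training accuracy = {clf.score(X, y):.2f}  "
          f"#support vectors = {clf.support_vectors_.shape[0]}")
```

The linear kernel should stay near chance on this data, while the degree-2 polynomial and RBF kernels can separate it, illustrating that the non-linearity lives entirely in the choice of kernel.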
Reflections • How likely is this to be a model of human learning? • Are all learning problems reducible to classification? • What learning models are appropriate for speech?